Advanced NLP with spaCy v3

(course.spacy.io)

207 points · pvpv · 4y ago · 38 comments

artembugara · 4y ago · 13 in thread

We've been using spaCy a lot for the past few months.

Mostly for non-production use cases, but I can still say that it is the most robust NLP framework at the moment.

V3 added support for transformers: that's a killer feature as many models from https://huggingface.co/docs/transformers/index work great out of the box.

At the same time, I found the NER models provided by spaCy to have low accuracy on real data: we deal with news articles https://demo.newscatcherapi.com/

Also, while I see how much attention ML models get from the crowd, I think that many problems can be solved with a rule-based approach, and spaCy is just amazing for these.
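As a rough illustration of the rule-based side, here is a minimal sketch using spaCy's `Matcher` on a blank English pipeline (tokenizer only, no trained model). The pattern, the `ACQUISITION` rule name, the `find_acquisitions` helper, and the example sentence are made up for illustration and do not come from the comment.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, so no model download is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical rule: the word "acquired" followed by one or more title-cased tokens.
pattern = [{"LOWER": "acquired"}, {"IS_TITLE": True, "OP": "+"}]
# greedy="LONGEST" keeps only the longest span when "+" yields overlapping matches.
matcher.add("ACQUISITION", [pattern], greedy="LONGEST")


def find_acquisitions(text):
    """Return the text of every span matched by the ACQUISITION rule."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(find_acquisitions("Microsoft acquired Activision Blizzard last year."))
```

Because the rule relies on `IS_TITLE`, it finds nothing in all-lowercase text, which is the same casing sensitivity discussed elsewhere in this thread.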

Btw, we recently wrote a blog post comparing spaCy to NLTK for a text normalization task: https://newscatcherapi.com/blog/spacy-vs-nltk-text-normaliza...

artembugara · 4y ago

Also I have an article about spaCy NER: https://newscatcherapi.com/blog/named-entity-recognition-wit...

The conclusion I came up with:

"A few notes on my Spacy NER accuracy with "real world" data

1. Low accuracy with sentences without a proper casing

2. Low accuracy overall, even with a large model

3. You'd need to fine-tune your model if you want to use it in production

4. Overall, there's no open-source high accuracy NER model that you can use out-of-a-box"

Vetch · 4y ago

> Overall, there's no open-source high accuracy NER model that you can use out-of-a-box"

Part of it is that most people underestimate the complexity of NER, and the rest of it, in my opinion, is that NER is not well-defined as a classification problem.

At least in my experience, having a specific battery of questions to query documents, first by transformer-based semantic search and then narrowed by Q/A models, removed the need for explicit NER, entity linking, or relation extraction. For entities used as features for rule systems, shallow models that expose all label predictions, instead of just selecting the argmax, have been sufficiently robust. Using big transformers for classification doesn't pay off enough to be worth it there.
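To make the "all label predictions instead of argmax" point concrete, here is a small sketch. The label names, the probabilities, and the 0.3 threshold are invented for illustration; the comment does not say which model or thresholds were used.

```python
def argmax_label(scores):
    """Classic approach: keep only the single most probable label."""
    return max(scores, key=scores.get)


def org_rule_fires(scores, threshold=0.3):
    """Rule on the full distribution: fire if ORG is plausible, even if not top-ranked."""
    return scores.get("ORG", 0.0) >= threshold


# Hypothetical shallow-model output for the token "apple" in lowercased text.
scores = {"O": 0.45, "ORG": 0.40, "PRODUCT": 0.15}

print(argmax_label(scores))    # O
print(org_rule_fires(scores))  # True
```

With argmax alone the ORG reading is lost, while a rule that sees the whole distribution can still act on it.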

wyldfire · 4y ago

I assume your product does some kind of entity disambiguation and/or link to an ontology? Spacy doesn't provide this out of the box either, AFAICT. Can you share more info about how you do it?

pantsforbirds · 4y ago

We use spaCy at work for (mostly) news articles as well. We've been pretty impressed with it overall for detecting larger trends using the NER models. I've been contemplating whether it might be useful to make a spaCy module that uses a Count-Min Sketch to track the top N of each NER category, partitioned by day (or week, etc.).

Think it could be an interesting use case to get sort of similar results to Google's search trends.
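A minimal sketch of that idea, assuming entity mentions arrive as `(day, label, text)` tuples (for example built from spaCy's `ent.label_` and `ent.text`). The class names `CountMinSketch` and `DailyEntityTrends`, the sketch dimensions, and the eviction policy are all illustrative choices and not an existing spaCy component.

```python
import hashlib
from collections import defaultdict


class CountMinSketch:
    """Fixed-size frequency table; estimates can overcount but never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row gives `depth` independent-ish column indexes.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))


class DailyEntityTrends:
    """One sketch plus a small top-N candidate set per (day, entity label)."""

    def __init__(self, top_n=3):
        self.top_n = top_n
        self.sketches = defaultdict(CountMinSketch)
        self.candidates = defaultdict(dict)

    def observe(self, day, label, text):
        key = (day, label)
        self.sketches[key].add(text)
        top = self.candidates[key]
        top[text] = self.sketches[key].estimate(text)
        if len(top) > self.top_n:
            # Evict the candidate with the lowest estimated count.
            del top[min(top, key=top.get)]

    def top(self, day, label):
        items = self.candidates[(day, label)].items()
        return sorted(items, key=lambda kv: (-kv[1], kv[0]))


trends = DailyEntityTrends(top_n=2)
for mention in ["OpenAI", "Google", "OpenAI", "IBM", "Google", "OpenAI"]:
    trends.observe("2022-01-10", "ORG", mention)
print(trends.top("2022-01-10", "ORG"))
```

The sketch keeps memory constant per partition regardless of how many distinct entities appear, at the cost of occasionally overestimating counts when hashes collide.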

artembugara · 4y ago

I'd really love to chat about that. Any chance to connect? email in bio

brd · 4y ago

I really appreciate how accessible spaCy has made NLP work, but their NER is definitely low accuracy.

Where stem/lem felt critical to successful NLP processing a few years ago, we've found stem/lem work to be much less important for downstream tasks when transformer-based models are involved.

For topic extraction, stem/lem still seems to do a lot to improve accuracy, and for rule-based approaches I can still see how it would facilitate more efficient processing at scale. I'd be curious to hear your experience fine-tuning and/or training new models with transformers after stem/lem processing. We've admittedly done little testing to see how transformers actually perform if properly tuned to post-processed data.

artembugara · 4y ago

Did you try something like AutoNLP by Hugging Face?

robbedpeter · 4y ago

Rule-based processing can augment transformers by both filtering out bad input and by parsing good input into a form that plays to the strengths of a model.

You can do some fantastic things with BERT and spaCy, or gpt-neo/J/3, or combinations as needed. Expert systems and ontological tools and things like nltk, spaCy, and LinkGrammar are excellent complements to an AI workflow. Use the fast, "dumb" tools to do the fast, dumb tasks, and only use the huge smart models when you need them.

GPT-3 shouldn't be used if you're just doing tagging or NER, but you can get higher quality nuanced extrapolation or summarization if you run things through a mad libs style prompt generator that leans into prompts that work really well.
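One way to read the "mad libs style prompt generator" is a fixed template whose slots are filled by cheaper tools (a tagger, a rule system, metadata). The sketch below uses only the standard library; the template wording, the slot names, and the `build_prompt` helper are hypothetical and not taken from the comment.

```python
from string import Template

# Hypothetical prompt skeleton; each $slot is filled before calling a large model.
SUMMARY_PROMPT = Template(
    "The following is a $doc_type about $topic.\n\n"
    "$text\n\n"
    "A one-sentence summary for a $audience:"
)


def build_prompt(text, doc_type, topic, audience):
    """Fill the template; substitute() raises KeyError if a slot is missing."""
    return SUMMARY_PROMPT.substitute(
        text=text.strip(), doc_type=doc_type, topic=topic, audience=audience
    )


prompt = build_prompt(
    text="  spaCy v3 added transformer-based pipelines.  ",
    doc_type="news article",
    topic="spaCy",
    audience="general reader",
)
print(prompt)
```

Keeping the template fixed makes it easy to iterate on wordings that work well while the slot values come from fast preprocessing.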

kulikalov · 4y ago

Are you using the high accuracy eng model for NER? I’ve been very happy with orgs recognition, it actually did way better than any other open source model in my case.

artembugara · 4y ago

Try it on a sentence where all tokens are lower/upper case. It just doesn’t really work.

Xenoamorphous · 4y ago

I don’t know how it compares with paid alternatives like Google’s or Amazon’s, but spaCy’s NER was pretty close to the paid service we were using (IBM), to the point that we ditched IBM. Also for news articles.

But yeah disambiguation/entity linking would be nice.

artembugara · 4y ago

I'd be happy to chat more if you want.

Eridrus · 4y ago

I feel like NER is a poorly designed task in general. You're eventually trying to link the entities to some kind of KB, so you should be injecting that entity information into your system for detecting mentions.

minimaxir · 4y ago · 5 in thread

A relatively underdiscussed quirk of the rise of superlarge language models like GPT-3 for certain NLP tasks is that, since those models have incorporated so much real-world grammar, there's no need to do advanced preprocessing: you can just YOLO and work with generated embeddings instead, without going into spaCy's (excellent) parsing/NER features.

OpenAI recently released an Embeddings API for GPT-3 with good demos and explanations: https://beta.openai.com/docs/guides/embeddings

Hugging Face Transformers makes this easier (and free), as most models return a "last_hidden_state" with one vector per token, which you can pool (e.g. average) into an aggregated embedding. Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs) and you're probably good to go.
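For reference, `last_hidden_state` holds one vector per token, so some pooling step is usually needed to get a single embedding per text. The sketch below shows masked mean pooling on plain Python lists standing in for model output; with Hugging Face Transformers the inputs would typically come from `outputs.last_hidden_state` and the tokenizer's `attention_mask`. The helper names `mean_pool` and `cosine` and the toy vectors are illustrative.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy stand-in for one sequence: two real tokens and one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The pooled vectors can then be compared with `cosine` for search or clustering; other pooling choices (such as using the first token's vector) are also common.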

Vetch · 4y ago

While you make sensible points, in the case of GPT-3, not everyone will be willing to route their data through OpenAI's servers.

> Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs)

This can still be impractical, at least in my case of regularly needing to process hundreds of pages of text. Simpler systems can be much faster for an acceptable loss in accuracy, and you can get more robustness by working with label distributions instead of just picking the argmax.

Fast simpler classifiers can also help decide where the more resource intensive models should focus attention.
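A minimal sketch of that kind of cascade, assuming a cheap scorer decides which texts are worth sending to a heavier model. The keyword list, the 0.1 threshold, and `fake_expensive_model` (a stub standing in for a transformer call) are all hypothetical.

```python
KEYWORDS = {"acquired", "merger", "lawsuit"}  # hypothetical trigger words


def cheap_score(text):
    """Fast, 'dumb' relevance score: fraction of tokens that are trigger words."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    if not tokens:
        return 0.0
    return sum(t in KEYWORDS for t in tokens) / len(tokens)


def route(texts, expensive_model, threshold=0.1):
    """Run expensive_model only on texts the cheap scorer flags; others get None."""
    return [
        (text, expensive_model(text) if cheap_score(text) >= threshold else None)
        for text in texts
    ]


calls = []


def fake_expensive_model(text):
    """Stub for a large model; records which texts it was actually asked about."""
    calls.append(text)
    return {"length": len(text)}


results = route(
    ["Acme acquired Widgets Inc.", "The weather was nice today."],
    fake_expensive_model,
)
print(results)
print(len(calls))  # 1
```

Only the first text reaches the stub, so the expensive step runs once instead of twice.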

Another reason for preprocessing is rule systems. Even if not glamorous to talk about, they still see heavy use in practical settings. While dependency parses are hard to make use of, shallow parses (chunking) and part-of-speech data can be usefully fed into rule systems.
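As a rough example of feeding part-of-speech data into a rule, the sketch below chunks noun phrases with a regular expression over a tag sequence. It assumes `(token, tag)` pairs using Universal POS tags from any tagger (spaCy's `token.pos_`, for instance); the `noun_chunks` helper and the pattern itself are illustrative, not a standard grammar.

```python
import re

# Optional determiner, any adjectives, then one or more nouns / proper nouns.
# Each tag is followed by a single space so matches align with tag boundaries.
NP_PATTERN = re.compile(r"(?<!\S)(?:DET )?(?:ADJ )*(?:(?:NOUN|PROPN) )+")


def noun_chunks(tagged):
    """Return noun-phrase strings from a list of (token, universal_pos_tag) pairs."""
    tag_string = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Counting spaces before an offset converts it back to a token index.
        start = tag_string[: match.start()].count(" ")
        end = tag_string[: match.end()].count(" ")
        chunks.append(" ".join(token for token, _ in tagged[start:end]))
    return chunks


tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("chased", "VERB"),
    ("a", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT"),
]
print(noun_chunks(tagged))  # ['The quick fox', 'a lazy dog']
```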

new_stranger · 4y ago

I imagine it being very useful to understand what you just said

hooande · 4y ago

lol. a rough translation is that the new super language models are good enough that you don't have to keep track of specific parts of speech in your programming. if you look at the arrays of floating point weights that underlie gpt-3 etc, you can use them to match present participle phrases with other present participle phrases and so forth

this is of course a correct and prescient observation. minimaxir is kind of an NLP final boss, so I wouldn't expect most people to be able to follow everything he says

mtqwerty · 4y ago

Readjusting expectations for pre-processing was one of the biggest differences I noticed going from NLP courses to working on NLP in production. For the amount of pre-processing learning material there is, I expected it to be much more important in practice.

I feel lucky to have gotten into NLP when I did (learning in 2017/2018 and working in the beginning of 2020). Changing our system from GloVe to BERT was super exciting and a great way to learn about the drawbacks and benefits of each.

PeterisP · 4y ago

IMHO it's not a difference between courses and production, but rather about the difference between preprocessing needs of different NLP ML approaches.

For some NLP methods, all the extra preprocessing steps were absolutely crucial (and took most of the time in production), while for other NLP methods they are of limited benefit or even harmful. It's just that older courses (and many production environments still!) use the former methods, so the preprocessing needs to be discussed. But if you're using a BERT-like system, then BERT (or something similar) and its subword tokenization effectively become your preprocessing stage.
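To show what "subword tokenization as preprocessing" looks like, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece. The vocabulary is invented and tiny; a real BERT tokenizer ships a learned vocabulary of tens of thousands of pieces and also handles casing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary.
TOY_VOCAB = {"token", "##ization", "##ize", "##s", "run", "##ning"}

print(wordpiece("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(wordpiece("running", TOY_VOCAB))       # ['run', '##ning']
```

Related word forms end up sharing pieces such as `token` or `run`, which covers some of the ground that stemming and lemmatization used to cover in older pipelines.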

41209 · 4y ago

I really love spaCy; it's trivial to throw up a server which handles basic NLP. No complaints here, very happy to see it still being updated.
