Mostly for non-production use cases; however, I can say that it is the most robust framework for NLP at the moment.
V3 added support for transformers: that's a killer feature as many models from https://huggingface.co/docs/transformers/index work great out of the box.
At the same time, I found the NER models provided by spaCy to have low accuracy on real data (we deal with news articles: https://demo.newscatcherapi.com/).
Also, while I see how much attention ML models get from the crowd, I think that many problems can be solved with a rule-based approach, and spaCy is just amazing for these.
Btw, we recently wrote a blog post comparing spaCy to NLTK for a text normalization task: https://newscatcherapi.com/blog/spacy-vs-nltk-text-normaliza...
The conclusion I came up with:
"A few notes on my spaCy NER accuracy with "real world" data:
1. Low accuracy with sentences without proper casing
2. Low accuracy overall, even with a large model
3. You'd need to fine-tune your model if you want to use it in production
4. Overall, there's no open-source high-accuracy NER model that you can use out of the box"
Part of it is that most people underestimate the complexity of NER, and the rest of it, in my opinion, is that NER is not well-defined as a classification problem.
At least in my experience, having a specific battery of questions to query documents, first by transformer-based semantic search and then narrowed by Q/A models, has removed the need for explicit NER, entity linking or relation extraction. For the case of entities as features for rule systems, combining shallow models with all label predictions, rather than just the argmax, has been sufficiently robust. Using big transformers for classification doesn't pay enough to be worth it there.
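To make the "all label predictions instead of argmax" idea concrete, here is a minimal sketch. The scores, the labels, the threshold, and the preposition rule are all hypothetical and only stand in for whatever a shallow model and rule system would actually provide.

```python
import math

def softmax(scores):
    """Turn raw label scores into a probability distribution (numerically stable)."""
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def argmax_label(dist):
    """The usual approach: keep only the single most probable label."""
    return max(dist, key=dist.get)

def candidate_labels(dist, threshold=0.25):
    """Keep every label whose probability clears the (assumed) threshold."""
    kept = [label for label, p in dist.items() if p >= threshold]
    return sorted(kept, key=lambda label: -dist[label])

def resolve(dist, previous_token):
    """Hypothetical rule: a preceding locative preposition favors GPE
    whenever GPE is a plausible candidate, even if it is not the argmax."""
    labels = candidate_labels(dist)
    if "GPE" in labels and previous_token.lower() in {"in", "from", "to"}:
        return "GPE"
    return labels[0] if labels else argmax_label(dist)

# Made-up raw scores for the ambiguous token "Jordan".
scores = {"PERSON": 1.2, "GPE": 1.0, "ORG": -0.5}
dist = softmax(scores)

print(argmax_label(dist))        # the single best label
print(candidate_labels(dist))    # every plausible label, best first
print(resolve(dist, "in"))       # the rule can override a narrow argmax win
```

The point is only that a downstream rule sees the near-tie between labels, which an argmax would have thrown away.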
Think it could be an interesting use case to get sort of similar results to Google's search trends.
Where stem/lem felt critical to successful NLP processing a few years ago, we've found stem/lem work to be much less important for downstream tasks when transformer-based models are involved.
For topic extraction, stem/lem still seems to do a lot to improve accuracy, and for rule-based approaches I can still see how it would facilitate more efficient processing at scale. I'd be curious to hear your experience fine-tuning and/or training new models after stem/lem processing with transformers. We've admittedly done little testing to see how transformers actually perform if properly tuned to post-processed data.
You can do some fantastic things with BERT and spaCy, or gpt-neo/J/3, or combinations as needed. Expert systems and ontological tools and things like NLTK, spaCy, and LinkGrammar are excellent complements to an AI workflow. Use the fast, "dumb" tools to do the fast, dumb tasks, and only use the huge smart models when you need them.
GPT-3 shouldn't be used if you're just doing tagging or NER, but you can get higher quality nuanced extrapolation or summarization if you run things through a mad libs style prompt generator that leans into prompts that work really well.
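A mad libs style prompt generator can be as simple as a template with named slots. This is only a sketch: the template wording, the slot names, and the example article are all made up, and the entity list is assumed to come from a cheaper tool such as a tagger or NER step.

```python
from string import Template

# Hypothetical prompt template; the wording is an assumption and would
# need tuning against whichever model actually receives it.
PROMPT = Template(
    "Article title: $title\n"
    "Key entities: $entities\n"
    "Write a $length summary of the article for $audience:\n"
)

def build_prompt(title, entities, length="two-sentence", audience="a general reader"):
    """Fill the template slots; `entities` is a list produced by a cheap upstream step."""
    return PROMPT.substitute(
        title=title,
        entities=", ".join(entities),
        length=length,
        audience=audience,
    )

prompt = build_prompt(
    "Central bank raises rates",            # made-up example article
    ["European Central Bank", "eurozone"],  # made-up entity list
)
print(prompt)
```

Keeping the template separate from the slot values makes it easy to try several phrasings and keep the ones that work well.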
But yeah disambiguation/entity linking would be nice.
OpenAI recently released an Embeddings API for GPT-3 with good demos and explanations: https://beta.openai.com/docs/guides/embeddings
Hugging Face Transformers makes this easier (and for free) as most models return a "last_hidden_state" containing per-token embeddings, which you can pool (for example, by averaging) into a single aggregated embedding. Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs) and you're probably good to go.
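One detail worth spelling out: in Hugging Face Transformers, last_hidden_state holds one vector per token, so getting a single text embedding usually takes a pooling step. Below is a minimal sketch of mean pooling with an attention mask, written with plain Python lists standing in for the real tensors; the toy vectors are made up, and mean pooling is just one common choice.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-in for one sequence of last_hidden_state: 3 tokens, 2 dimensions.
# The last token is padding, so the mask excludes it from the average.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

embedding = mean_pool(token_vectors, attention_mask)
print(embedding)
print(cosine(embedding, [1.0, 1.0]))
```

With a real model, the same averaging would be applied to the tensors returned by the forward pass, using the attention mask produced by the tokenizer.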
> Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs)
This can still be impractical, at least in my case of regularly needing to process hundreds of pages of text. Simpler systems can be much faster for an acceptable loss in accuracy, and you can get more robustness by working with label distributions instead of just picking argmax.
Fast, simpler classifiers can also help decide where the more resource-intensive models should focus attention.
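A rough sketch of that kind of cascade, with a keyword filter standing in for the fast classifier and a stub standing in for the expensive model. The keywords, the threshold, and both functions are hypothetical.

```python
def cheap_score(text, keywords=("acquire", "merger", "lawsuit")):
    """Hypothetical fast filter: fraction of the keywords that appear in the text."""
    lowered = text.lower()
    return sum(k in lowered for k in keywords) / len(keywords)

def cascade(texts, expensive_model, threshold=0.3):
    """Run the expensive model only on texts the cheap filter flags."""
    results = []
    for text in texts:
        if cheap_score(text) >= threshold:
            results.append((text, expensive_model(text)))
        else:
            results.append((text, None))  # skipped: not worth the compute
    return results

# Stub for a slow transformer call; it records which texts reached it.
calls = []
def expensive_model(text):
    calls.append(text)
    return "needs-review"

texts = ["Acme to acquire Beta in merger", "Weather is sunny today"]
results = cascade(texts, expensive_model)
print(results)
print(len(calls), "of", len(texts), "texts reached the expensive model")
```

In practice the filter would be a small trained classifier rather than keywords, but the control flow is the same.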
Another reason for preprocessing is rule systems. Even if not glamorous to talk about, they still see heavy use in practical settings. While dependency parses are hard to make use of, shallow parses (chunking) and part-of-speech data can be usefully fed into rule systems.
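As an illustration of feeding part-of-speech data into a rule, here is a small sketch of a noun-phrase chunker over (word, tag) pairs. The tags are hand-written in the style of Universal POS tags, and the pattern is a deliberately crude assumption; a real pipeline would take the tags from a tagger.

```python
def noun_chunks(tagged):
    """Collect runs matching roughly (DET|ADJ)* (NOUN|PROPN)+ from (word, tag) pairs."""
    chunks, current, has_noun = [], [], False
    # A sentinel tag at the end flushes the final chunk.
    for word, pos in list(tagged) + [("", "END")]:
        if pos in {"NOUN", "PROPN"}:
            current.append(word)
            has_noun = True
        elif pos in {"DET", "ADJ"} and not has_noun:
            current.append(word)  # modifier before the head noun
        else:
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
            if pos in {"DET", "ADJ"}:
                current.append(word)  # this modifier starts the next chunk
    return chunks

# Hand-tagged example sentence (the tags are assumptions, not tagger output).
tagged = [
    ("The", "DET"), ("central", "ADJ"), ("bank", "NOUN"), ("raised", "VERB"),
    ("interest", "NOUN"), ("rates", "NOUN"), ("in", "ADP"),
    ("the", "DET"), ("euro", "PROPN"), ("zone", "NOUN"),
]
chunks = noun_chunks(tagged)
print(chunks)
```

Chunks like these can then be matched against gazetteers or pattern rules without needing a full dependency parse.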
this is of course a correct and prescient observation. minimaxir is kind of an NLP final boss, so I wouldn't expect most people to be able to follow everything he says
I feel lucky to have gotten into NLP when I did (learning in 2017/2018 and working in the beginning of 2020). Changing our system from GloVe to BERT was super exciting and a great way to learn about the drawbacks and benefits of each.
For some NLP methods, all the extra preprocessing steps were absolutely crucial (and took most of the time in production); for others, they are of limited benefit or even harmful. Older courses (and many production environments still!) use the former methods, so the preprocessing needs to be discussed. But if you're using a BERT-like system, then BERT (or something similar) and its subword tokenization effectively become your preprocessing stage.
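To show why subword tokenization can stand in for stemming-style preprocessing, here is a toy sketch of greedy longest-match-first splitting in the style of WordPiece, which BERT uses. The vocabulary is tiny and made up, and real tokenizers also handle punctuation, special tokens, and more.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab):
    """Lowercase, split on whitespace, then split each word into pieces."""
    return [p for word in text.lower().split() for p in wordpiece(word, vocab)]

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of pieces.
VOCAB = {"token", "##ization", "##ize", "##s", "run", "##ning", "##n"}

tokens = tokenize("Running tokenization", VOCAB)
print(tokens)
```

Related word forms end up sharing a leading piece ("run", "token"), which covers much of what a stemmer was used for without a separate preprocessing pass.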