What's hot in NLP in 2022 and over the next 2-3 years, in both academia and industry? (I'm particularly interested in the industry side.)
How do you guys stay up to date on this (links, blogs, etc.)?
Thank you!
For industry: Feels like “making BERT / some other language model do things” is a common job nowadays. On the more engineering side - I think we’ll see more tools to quickly and efficiently fine-tune language models, especially tools that allow a human in the loop.
Overall it feels like we’re getting to a point where there’s a pretty standardized approach to simple NLP problems like text classification - no more real feature engineering, just throw BERT at the problem. I expect this trend to continue - with more and more of a focus on dataset creation and validation and less of an emphasis on model architecture.
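To make that "standardized recipe" concrete, here's a minimal sketch. Everything is illustrative: the pretrained encoder is stubbed with random 768-dim vectors so the snippet runs offline (in practice you'd use BERT's pooled output), and "fine-tuning" is reduced to its simplest form - training a logistic-regression head on frozen embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stub for "encode each document with a pretrained model": in practice this
# would be BERT's [CLS] / pooled output; here it's random 768-dim vectors.
def encode(texts):
    return rng.normal(size=(len(texts), 768))

texts = ["great product", "terrible service", "loved it", "awful"]
labels = np.array([1, 0, 1, 0])

X = encode(texts)

# Train a logistic-regression head on top of the frozen embeddings
# with plain gradient descent on the cross-entropy loss.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the logits
    grad = p - labels                        # dL/dlogits for cross-entropy
    w -= 0.1 * (X.T @ grad) / len(labels)
    b -= 0.1 * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print((preds == labels).mean())  # training accuracy
```

The point of the sketch is how little is model-specific: the whole pipeline is "encode, then fit a small head", and the real work shifts to the dataset, just as described above.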
I also think there will be a rise in multi-modal language models - combinations of language and vision models, for example. But I think the more interesting application will be combining dense language model representations with sparser tabular data. Think of trying to predict a user's likelihood to buy a product given a review of another product (a dense embedding of text) but also their clicks over the last 2 hours (sparser tabular data) - this feels like a problem people commonly have.
To stay updated: read papers (arxiv-sanity.com is a lifesaver) and watch talks (usually on youtube, and a lot of uni reading groups are public on zoom nowadays).
The example I gave of multi-modal learning was really just highlighting a dichotomy in the techniques that we use in machine learning today. FWIW I am a couple of years removed from working heavily with tabular data, so do take this with a grain of salt. But there are essentially two different modeling approaches for two different types of datasets. On the one hand, you have deep learning (BERT, language models, CV models), which does well on raw data like text or images. These usually work by mapping the raw data to dense embeddings, which are the output of neural models. On the other hand, you have decision trees / forests (think XGBoost) that work great on tabular data - spreadsheets or other data of that nature.
But what do you do if you have a spreadsheet of data and one of the columns is raw text but the other columns are, say, sparse boolean features? How can you incorporate the extra information from the spreadsheet into your language model? I think this is a common problem in industry that there's not a clear solution for right now.
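One pragmatic baseline (not claiming it's *the* answer to the question above) is late fusion: encode the text column into a dense vector and concatenate it with the other columns into one feature matrix. Everything here is a stub for illustration - the embeddings are random stand-ins for a real encoder's output, and the boolean column names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for a sentence encoder: in practice each review would be mapped to a
# dense vector by BERT or similar; here it's random 768-dim vectors.
review_embeddings = rng.normal(size=(4, 768))

# Hypothetical sparse boolean columns from the same spreadsheet, e.g.
# [clicked_in_last_2h, is_returning_customer, used_mobile_app]
tabular = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
], dtype=float)

# Late fusion: concatenate both views column-wise into one feature matrix,
# then feed it to whatever downstream model you like (XGBoost, an MLP, ...).
features = np.concatenate([review_embeddings, tabular], axis=1)
print(features.shape)  # (4, 771)
```

The catch, and part of why this doesn't feel like a solved problem, is that tree models tend to struggle with 768 dense embedding dimensions while neural models tend to underuse a handful of sparse booleans - simple concatenation treats both views as if one model family handled them equally well.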