Transfer learning works great for vision problems: just reuse one of the big state-of-the-art networks trained on ImageNet (I like resnet50). This is possible because vision problems share a remarkable amount of structure. There was nothing similar for NLP besides pre-trained first layers like word2vec. If you want to learn more, check out the fast.ai DL course; it features transfer learning a lot.
But this model and ULMFiT (nlp.fast.ai) show that deeper nets can be pretrained for NLP and achieve good results when transferred to other datasets and problems.
This enables not just the obvious use case of "I don't have N GPUs to train a deep net from scratch, but I can now fine-tune a pre-trained model", but also more subtle and interesting cases, like fine-tuning on a very small dataset (compared to ImageNet or NLP datasets with 100,000 samples) and cheap training on demand. Training a new model for every user is way too expensive when training from scratch, but if fine-tuning a pre-trained net takes just a few minutes, why not?
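To make the per-user case concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: `pretrained_features` is a hand-written stand-in for a frozen pretrained encoder, and only a small logistic-regression head is trained on four examples. That is the "freeze the body, train a head" flavor of transfer learning, not full fine-tuning of every layer.

```python
import math

def pretrained_features(text):
    """Hypothetical stand-in for a frozen pretrained encoder: text -> fixed vector."""
    words = text.lower().split()
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    return [
        sum(w in positive for w in words),  # count of positive cue words
        sum(w in negative for w in words),  # count of negative cue words
        1.0,                                # bias feature
    ]

def predict(weights, text):
    """Probability of the positive class under the logistic head."""
    x = pretrained_features(text)
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Fit only the head with plain SGD; the feature extractor is never updated."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, label in examples:
            x = pretrained_features(text)
            p = predict(weights, text)
            # Gradient of the log loss w.r.t. the weights is (p - label) * x.
            weights = [w - lr * (p - label) * xi for w, xi in zip(weights, x)]
    return weights

# A tiny per-user dataset (made up for this sketch).
user_data = [
    ("great movie , love it", 1),
    ("awful plot , bad acting", 0),
    ("good fun", 1),
    ("i hate this", 0),
]
head = train_head(user_data)
```

Calling `predict(head, "good stuff")` then gives a probability above 0.5 and `predict(head, "bad stuff")` one below it. With a real pretrained network the head would sit on top of learned features instead of hand-written ones, but the cost argument is the same: only a small part needs training per user.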
Recent research is finally checking off a few important boxes that are required for widespread applicability:
- Minimal configuration required
Aside from tweaking the language modeling loss coefficient, language model finetuning seems to "just work". ULMFiT's approach also requires minimal configuration.
- Reasonable training times
You can finetune these transformer models on a few hundred examples in 10 minutes on a single GPU.
- Beneficial with very small amounts of labeled training data
This approach consistently beats out the use of pretrained word/document embeddings at ~200 training examples. Will be posting some benchmarks on two dozen classification tasks in the near future.
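For anyone wondering what the "language modeling loss coefficient" above refers to: as I understand it, the fine-tuning objective is the task loss plus a weighted language modeling loss on the same text. A minimal sketch of that assumed form follows; the function names and the default of 0.5 are mine for illustration, not any library's API.

```python
import math

def cross_entropy(prob_of_correct):
    """Negative log-likelihood of the correct class or token."""
    return -math.log(prob_of_correct)

def finetune_loss(class_prob, token_probs, lm_coef=0.5):
    """Task loss plus lm_coef times the mean per-token language modeling loss.

    class_prob  -- probability the model gave to the correct label
    token_probs -- probabilities the model gave to each correct next token
    lm_coef     -- the coefficient being tuned (0.5 is an assumed default)
    """
    task_loss = cross_entropy(class_prob)
    lm_loss = sum(cross_entropy(p) for p in token_probs) / len(token_probs)
    return task_loss + lm_coef * lm_loss
```

Setting `lm_coef=0.0` recovers plain supervised fine-tuning, so the coefficient only controls how strongly the auxiliary language modeling term regularizes the model.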
There are a few remaining conditions that I think need to be met before this kind of approach sees widespread use:
- Reasonable inference times
Inference is still rather slow because of model complexity.
- Reasonable memory consumption
Transfer learning is typically well suited to personalization tasks because of limited training data requirements, but large memory footprints mean that it's hard to swap out models for different users on the fly.
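One common way to live with the memory problem is to keep only the most recently used per-user models resident and reload the rest on demand. A rough sketch, assuming a hypothetical `load_fn` that fetches a user's fine-tuned model from disk (here faked with a small dict):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` per-user models in memory, evicting the least recently used."""

    def __init__(self, load_fn, capacity=2):
        self.load_fn = load_fn        # hypothetical loader: user_id -> model
        self.capacity = capacity
        self._models = OrderedDict()  # ordered from least to most recently used
        self.loads = 0                # number of (slow) loads performed

    def get(self, user_id):
        if user_id in self._models:
            # Cache hit: mark as most recently used.
            self._models.move_to_end(user_id)
            return self._models[user_id]
        # Cache miss: pay the loading cost, then evict if over capacity.
        model = self.load_fn(user_id)
        self.loads += 1
        self._models[user_id] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)  # drop the least recently used model
        return model

def fake_load(user_id):
    """Stand-in for reading a fine-tuned model from disk."""
    return {"user": user_id, "weights": [0.0] * 4}

cache = ModelCache(fake_load, capacity=2)
for uid in ["a", "b", "a", "c", "b"]:
    cache.get(uid)
```

This does not shrink the models, it only bounds how many sit in memory at once. Every miss still pays the full load time, which is exactly why large footprints hurt on-the-fly swapping.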
Not that this library isn't promising, but the name and presentation make it seem far more general than it really is.
https://pytoune.org/ (Keras-like interface for PyTorch) and https://github.com/dnouri/skorch (Scikit-learn interface for PyTorch).
As a side note, a project of mine: super-simple Jupyter Notebook training plots for Keras and PyToune: https://github.com/stared/livelossplot (with a bare API, so you can connect it to anything you wish).