In fact, the stuff mentioned in the blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.
Anyhoo, if anyone has any questions, feel free to ask!
Do you expect the ModernBERT STs to carry the same advantages over ModernBERT that BERT STs had over the original BERT? Or would you expect caveats based on ModernBERT's updated architecture and capabilities?
Tiny feedback maybe you can pass along to whoever maintains the HuggingFace blog — the GTE-en-MLM link is broken.
https://huggingface.co/thenlper/gte-en-mlm-large should be https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base
1) Going by the Runtime vs GLUE graph, ModernBERT-Base is roughly as fast as BERT-Base. Given its architecture (especially the alternating attention), I'm curious why the model isn't considerably faster than its predecessor. Any insight you could share on that?
2) Most modern LLMs are encoder+decoder models. Why not chop off the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a small head on top?
For (1), it's because BERT has noticeably fewer parameters, and we're comparing at short context length (in the interest of providing a broader comparison), so local attention is a lot less impactful than it is at longer context lengths.
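To make the context-length point concrete, here's a back-of-envelope sketch of how attention cost scales. The window size and local/global layer ratio below are illustrative assumptions, not the exact ModernBERT configuration:

```python
# Rough model: per layer, attention cost is proportional to the number of
# key positions each query position attends to, summed over all queries.

def attention_cost(seq_len, window=None):
    """Total attended positions for one layer (full or sliding-window)."""
    per_query = seq_len if window is None else min(window, seq_len)
    return seq_len * per_query

def alternating_cost(seq_len, window=128, global_every=3):
    """Average per-layer cost when 1 in `global_every` layers is global
    and the rest use a local sliding window (assumed ratio)."""
    g = attention_cost(seq_len)
    l = attention_cost(seq_len, window)
    return (g + (global_every - 1) * l) / global_every

for n in (512, 8192):
    speedup = attention_cost(n) / alternating_cost(n)
    print(f"context {n}: attention speedup ~{speedup:.1f}x")
```

At 512 tokens the saving from local layers is modest, while at 8192 tokens the quadratic global cost dominates and alternating attention pays off much more — which is why short-context benchmarks understate the gap.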
For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!
You could put it on the decoder instead, but then you have the problem that in the causal language-modeling setting the model was trained for, every token can only attend to preceding tokens and is blind to subsequent ones.
The script is nearly identical to the one below, just updated with the new imports:
https://huggingface.co/docs/transformers/en/tasks/token_clas...
There are a couple of reasons: 1) Handling multiple languages with good BLEU scores is too much to ask of a model that size (even the large). 2) Encoder and decoder models don't tend to get trained for translation as much as e.g. GPT models, which have large amounts of translation text in their datasets across multiple languages (with exceptions such as the T5 translation task).
Could you shed some light on what parts of bge-m3 ModernBERT would overlap with, or would this be comparing apples to oranges?
BGE-M3 is a fine-tuned embedding model. This means they've taken a base language model, which was trained just for language modeling, then applied further fine-tuning to make it useful for a given application, in this case retrieval.
ModernBERT is one step back earlier in the pipeline: it’s the language model that application-specific models such as M3 build on.
This is partially because people using decoders aren’t using huggingface at all (they would use an API call) but also because encoders are the unsung heroes of most serious ML applications.
If you want to do any ranking, recommendation, RAG, etc it will probably require an encoder. And typically that meant something in the BERT/RoBERTa/ALBERT family. So this is huge.
Excited about trying this out, less excited about recalculating a petabyte worth of embedding if it's as good as it looks like it will be. At least I can keep my house warm.
What do the encoders do vs the decoders, in this ecosystem? What are some good links to learn about these concepts at a high level? I find most of the writing about different layers and architectures a bit arcane and inscrutable, especially when it comes to attention and self-attention with multiple heads.
1. an encoder takes an input (e.g. text), and turns it into a numerical representation (e.g. an embedding).
2. a decoder takes an input (e.g. text), and then extends the text.
(There's also encoder-decoders, but I won't go into those)
These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox, you can shape them into anything, and encoders are the wonderful providers of these embeddings.
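As a concrete sketch of the retrieval use described above: embed texts with an encoder, then rank by cosine similarity. The vectors below are made-up toy stand-ins for real encoder outputs (a real pipeline would get them from an encoder model, e.g. via sentence-transformers):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; names and values are hypothetical
query = [0.9, 0.1, 0.2]
docs = {
    "doc_about_cats": [0.8, 0.2, 0.1],
    "doc_about_tax_law": [0.1, 0.9, 0.7],
}

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # doc_about_cats
```

The same embeddings could instead be fed through a linear layer for classification, or clustered — the encoder output is the shared starting point.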
I’d love to distill a “ModernTinyBERT”, but it seems a bit more complex with the interleaved layers.
That’s a question I’m interested in as well! DistilBERT and friends have been terribly useful at the edge. I wonder if/when we may see something similar for ModernBERT.
- Can I fine tune it with SentenceTransformers?
- I see ColBERT in the benchmarks, is there an answerai-colbert-small-v2 coming soon? (Also yes, it works great with ST and we provide a full example script.)
Check out their documentation page linked on the bottom of the article. https://huggingface.co/docs/transformers/main/en/model_doc/m...
I am going to wait until Ollama has this in their library, even though consuming HF is straightforward.
The speedup is impressive, but then so are the massive speed improvements for LLMs recently.
Apple has supported BERT models in their SDKs for Apple developers for years, it will be interesting to see how quickly they update to this newer tech.
I was given to understand that they are a better alternative to LLM type models for specific tasks like topic classification because they are trained to discriminate rather than to generate (plus they are bidirectional so they can “understand” context better through lookahead). But LLMs are pretty strong so I wonder if the difference is negligible?
In pre-processing you would have calculated the vector encodings of all the million keywords beforehand; now, given the keyword the user inputs, you calculate its vector and then find the most similar vectors.
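That precompute-then-query flow can be sketched like so. The trick is to normalize vectors at indexing time, so that at query time a plain dot product is the cosine similarity. Everything here (the words, the vectors) is a toy stand-in for real encoder output:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Offline pre-processing: encode and unit-normalize every keyword once.
index = {
    "laptop":   normalize([0.7, 0.1, 0.3]),
    "notebook": normalize([0.6, 0.2, 0.3]),
    "banana":   normalize([0.0, 0.9, 0.4]),
}

def search(query_vec, index, k=2):
    q = normalize(query_vec)
    # With unit-length vectors the dot product *is* the cosine similarity,
    # so query time is just k-nearest-neighbors by dot product.
    scored = sorted(index,
                    key=lambda w: sum(a * b for a, b in zip(q, index[w])),
                    reverse=True)
    return scored[:k]

print(search([0.65, 0.15, 0.3], index))  # ['laptop', 'notebook']
```

At a million keywords you'd swap the linear scan for an approximate nearest-neighbor index, but the pre-processing/query split stays the same.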
LLMs are used by end users; encoders are used by devs inside apps to search/retrieve text.
For other tasks, such as retrieval, we still need people to finetune them for it. The ModernBERT documentation has some scripts for finetuning with Sentence Transformers and PyLate for retrieval: https://huggingface.co/docs/transformers/main/en/model_doc/m... But people still need to make and release these models. I have high hopes for them.
Jina V3 is an embedding model: a base model further fine-tuned specifically for embedding-ish tasks (retrieval, similarity...). This is what we call "downstream" models/applications.
ModernBERT is a base model & architecture. It's not supposed to be used out of the box, but fine-tuned for other use-cases, serving as their backbone. In theory (and, given early signal, most likely in practice too), it'll make for really good downstream embeddings once people build on top of it!
Besides, PostModernBERT will be there for us for the next generational jump.
ERNIE is probably the most famous "computer" in the UK, which has been picking winners for the UK's premium bonds scheme since the 1950s. It was heavily marketed, to get the public used to the new-fangled idea of electronics, and is sometimes considered one of the first computers; though (a) it was more of a special-purpose random number generator rather than a computer, and (b) it descended from the earlier Colossus code-breaking machines of World War II (though the latter's existence was kept secret for decades). The latest ERNIE is version 5, which uses quantum effects to generate its random numbers (earlier versions used electrical and thermal noise).
Or to make an overly worded / researched reply to a petulant comment short, they are very much not specific to one culture.
So an embedding-focused finetune of ModernBERT should be compared to something like voyageai, but not ModernBERT itself.