In fact, the stuff mentioned in the blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.
Anyhoo, if anyone has any questions, feel free to ask!
Do you expect the ModernBERT STs to carry the same advantages over ModernBERT that BERT STs had over the original BERT? Or would you expect caveats based on ModernBERT's updated architecture and capabilities?
Tiny feedback maybe you can pass along to whoever maintains the HuggingFace blog — the GTE-en-MLM link is broken.
https://huggingface.co/thenlper/gte-en-mlm-large should be https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base
1) Going by the Runtime vs GLUE graph, ModernBERT-Base is roughly as fast as BERT-Base. Given its architecture (especially the alternating attention), I'm curious why the model isn't considerably faster than its predecessor. Any insight you could share on that?
2) Most modern LLMs are encoder+decoder models. Why not chop off the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a small head on top?
For (1), it's because BERT has noticeably fewer parameters, and we're comparing at short context length (in the interest of providing a broader comparison), so local attention is a lot less impactful than it is at longer context lengths.
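To make the context-length point concrete, here's a back-of-envelope sketch of how attention cost scales. The window size and local/global layer ratio below are illustrative assumptions, not the exact ModernBERT configuration:

```python
# Rough model: per layer, attention cost is proportional to the number of
# key positions each query position attends to, summed over all queries.

def attention_cost(seq_len, window=None):
    """Total attended positions for one layer (full or sliding-window)."""
    per_query = seq_len if window is None else min(window, seq_len)
    return seq_len * per_query

def alternating_cost(seq_len, window=128, global_every=3):
    """Average per-layer cost when 1 in `global_every` layers is global
    and the rest use a local sliding window (assumed ratio)."""
    g = attention_cost(seq_len)
    l = attention_cost(seq_len, window)
    return (g + (global_every - 1) * l) / global_every

for n in (512, 8192):
    speedup = attention_cost(n) / alternating_cost(n)
    print(f"context {n}: attention speedup ~{speedup:.1f}x")
```

At 512 tokens the saving from local layers is modest, while at 8192 tokens the quadratic global cost dominates and alternating attention pays off much more — which is why short-context benchmarks understate the gap.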
For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!
You could put it on the decoder instead, but then you have the problem that in the causal language-modeling setting the model was trained for, every token can only attend to preceding tokens and is blind to subsequent ones.
The script is nearly identical to the one below, just updated with the new imports:
https://huggingface.co/docs/transformers/en/tasks/token_clas...
There are a couple of reasons: 1) Handling multiple languages with good BLEU scores is too much to ask of a model that size (even the large). 2) Encoder and decoder models don't tend to get trained for translation as much as e.g. GPT models, which have large amounts of translation text in their datasets across multiple languages (with exceptions such as the T5 translation task).
Could you shed some light on what parts of bge-m3 ModernBERT would overlap with, or would this be comparing apples to oranges?
BGE-M3 is a fine-tuned embedding model. This means they've taken a base language model, which was trained just for language modeling, then applied further fine-tuning to make it useful for a given application, in this case retrieval.
ModernBERT is one step back earlier in the pipeline: it’s the language model that application-specific models such as M3 build on.
This is partially because people using decoders aren’t using huggingface at all (they would use an API call) but also because encoders are the unsung heroes of most serious ML applications.
If you want to do any ranking, recommendation, RAG, etc it will probably require an encoder. And typically that meant something in the BERT/RoBERTa/ALBERT family. So this is huge.
Excited about trying this out, less excited about recalculating a petabyte worth of embedding if it's as good as it looks like it will be. At least I can keep my house warm.
What do the encoders do vs the decoders, in this ecosystem? What are some good links to learn about these concepts at a high level? I find most of the writing about different layers and architectures a bit arcane and inscrutable, especially when it comes to attention and self-attention with multiple heads.
1. an encoder takes an input (e.g. text), and turns it into a numerical representation (e.g. an embedding).
2. a decoder takes an input (e.g. text), and then extends the text.
(There's also encoder-decoders, but I won't go into those)
These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations. For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox, you can shape them into anything, and encoders are the wonderful providers of these embeddings.
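As a concrete sketch of the retrieval use described above: embed texts with an encoder, then rank by cosine similarity. The vectors below are made-up toy stand-ins for real encoder outputs (a real pipeline would get them from an encoder model, e.g. via sentence-transformers):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; names and values are hypothetical
query = [0.9, 0.1, 0.2]
docs = {
    "doc_about_cats": [0.8, 0.2, 0.1],
    "doc_about_tax_law": [0.1, 0.9, 0.7],
}

ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # doc_about_cats
```

The same embeddings could instead be fed through a linear layer for classification, or clustered — the encoder output is the shared starting point.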
I’d love to distill a “ModernTinyBERT”, but it seems a bit more complex with the interleaved layers.
That’s a question I’m interested in as well! DistilBERT and friends have been terribly useful at the edge. I wonder if/when we may see something similar for ModernBERT.
- Can I fine tune it with SentenceTransformers?
- I see ColBERT in the benchmarks, is there an answerai-colbert-small-v2 coming soon? (Also yes, it works great with ST and we provide a full example script.)
Check out their documentation page linked on the bottom of the article. https://huggingface.co/docs/transformers/main/en/model_doc/m...
I am going to wait until Ollama has this in their library, even though consuming HF is straightforward.
The speedup is impressive, but then so are the massive speed improvements for LLMs recently.
Apple has supported BERT models in their SDKs for Apple developers for years, it will be interesting to see how quickly they update to this newer tech.
I was given to understand that they are a better alternative to LLM type models for specific tasks like topic classification because they are trained to discriminate rather than to generate (plus they are bidirectional so they can “understand” context better through lookahead). But LLMs are pretty strong so I wonder if the difference is negligible?
In pre-processing you would have calculated the vector encodings of all the million keywords beforehand; now, given the keyword the user inputs, you calculate its vector and then find the most similar vectors.
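That precompute-then-query flow can be sketched like so. The trick is to normalize vectors at indexing time, so that at query time a plain dot product is the cosine similarity. Everything here (the words, the vectors) is a toy stand-in for real encoder output:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Offline pre-processing: encode and unit-normalize every keyword once.
index = {
    "laptop":   normalize([0.7, 0.1, 0.3]),
    "notebook": normalize([0.6, 0.2, 0.3]),
    "banana":   normalize([0.0, 0.9, 0.4]),
}

def search(query_vec, index, k=2):
    q = normalize(query_vec)
    # With unit-length vectors the dot product *is* the cosine similarity,
    # so query time is just k-nearest-neighbors by dot product.
    scored = sorted(index,
                    key=lambda w: sum(a * b for a, b in zip(q, index[w])),
                    reverse=True)
    return scored[:k]

print(search([0.65, 0.15, 0.3], index))  # ['laptop', 'notebook']
```

At a million keywords you'd swap the linear scan for an approximate nearest-neighbor index, but the pre-processing/query split stays the same.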
LLMs are used by end users; encoders are used by devs inside apps to search/retrieve text.
For other tasks, such as retrieval, we still need people to finetune them for it. The ModernBERT documentation has some scripts for finetuning with Sentence Transformers and PyLate for retrieval: https://huggingface.co/docs/transformers/main/en/model_doc/m... But people still need to make and release these models. I have high hopes for them.
Jina V3 is an embedding model: a base model further fine-tuned specifically for embedding-ish tasks (retrieval, similarity...). This is what we call "downstream" models/applications.
ModernBERT is a base model & architecture. It's not supposed to be used out of the box, but fine-tuned for other use-cases, serving as their backbone. In theory (and, given early signal, most likely in practice too), it'll make for really good downstream embeddings once people build on top of it!
Besides, PostModernBERT will be there for us for the next generational jump.
ERNIE is probably the most famous "computer" in the UK, which has been picking winners for the UK's premium bonds scheme since the 1950s. It was heavily marketed, to get the public used to the new-fangled idea of electronics, and is sometimes considered one of the first computers; though (a) it was more of a special-purpose random number generator rather than a computer, and (b) it descended from the earlier Colossus code-breaking machines of World War II (though the latter's existence was kept secret for decades). The latest ERNIE is version 5, which uses quantum effects to generate its random numbers (earlier versions used electrical and thermal noise).
Or to make an overly worded / researched reply to a petulant comment short, they are very much not specific to one culture.
So an embedding-focused finetune of ModernBERT should be compared to something like voyageai, but not ModernBERT itself.