Why not either train the model exclusively on Semitic languages for further performance for those languages or on a wider set of languages for better multilingual performance overall? I don't understand the logic here.
So properly speaking, they should be advertising the target region as Europe, Middle East and Africa. [3]
[1] https://en.wikipedia.org/wiki/Languages_of_Germany [2] https://en.wikipedia.org/wiki/Languages_of_Afghanistan [3] https://en.wikipedia.org/wiki/List_of_countries_and_territor...
They don't (or rather CAN'T) care about anything else in the world.
They have a lot more problems than "this model doesn't convert Urdu to Arabic well".
I mean, you wouldn't want to split a model into three separate ones, where one contains Austrian German, another Slovak, and another Hungarian, since there's going to be lots of cultural overlap.
Geography
Shoutout to Alif, a finetune of Llama 3 8b on Urdu datasets: https://huggingface.co/large-traversaal/Alif-1.0-8B-Instruct
It'd be great to see a comparison.
Saba vs. Fanar. I like the names, too.
~2 years ago (Sep 27, 2023), Mistral AI said:
> we believe that an open approach to generative AI is necessary. Community-backed model development is the surest path to fight censorship and bias in a technology shaping our future. We strongly believe that by training our own models, releasing them openly, and fostering community contributions, we can build a credible alternative to the emerging AI oligopoly. Open-weight generative models will play a pivotal role in the upcoming AI revolution.
> Mistral AI’s mission is to spearhead the revolution of open models.
https://mistral.ai/en/news/about-mistral-ai
Did something change since then, or why the change of heart? Are they just pulling an "OpenAI", professing to believe in something in order to further their own cause, or is there some particular reason behind it?
> Mistral Saba is a result of working closely with strategic regional customers to address very specific challenges in addressing bespoke use cases.
It seems like a customer paid them to train this model, so presumably that customer gets to decide on licensing terms.
Isn’t this Mistral’s business model? Make general purpose models available as open-source and train more specific models for their customers?
Edit: Actually, it is outlined at the bottom of the post:
> we have also begun to train models for strategic customers with the power of their deep and proprietary enterprise context. These models stay exclusive and private to the respective customers. If you would like to explore custom training with Mistral AI, explore our applied AI offerings, or please contact us.
So this is not one of those models, since those stay exclusive and private to the customer. Saba, then, is one of the models I understood they would release as at least "open-weights", had they followed the goals laid out in their early blog posts.
Saba: input $0.20/M tokens, output $0.60/M tokens
GPT-4o: input $0.15/M tokens, cached input $0.075/M tokens, output $0.60/M tokens
Sources: https://openai.com/api/pricing/ and https://mistral.ai/en/products/la-plateforme#pricing
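To put those per-million-token prices in concrete terms, here is a quick sketch of per-request cost under the rates quoted above (an illustration only: the example workload of 10k input / 2k output tokens per request is assumed, and the cached-input discount is ignored):

```python
# Per-million-token prices in USD, as quoted in the thread (cached input omitted).
PRICES = {
    "Saba": {"input": 0.20, "output": 0.60},
    "GPT-4o": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 10k input tokens, 2k output tokens per request.
print(request_cost("Saba", 10_000, 2_000))    # 0.0032 USD
print(request_cost("GPT-4o", 10_000, 2_000))  # 0.0027 USD
```

So for prompt-heavy workloads the gap is driven almost entirely by the input rate, since the output prices quoted are identical.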