Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning

(techcommunity.microsoft.com)

439 points · lappa · 1y ago · 143 comments

simonw1y ago

The most interesting thing about this is the way it was trained using synthetic data, which is described in quite a bit of detail in the technical report: https://arxiv.org/abs/2412.08905

Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on Hugging Face already. I tried this one: https://huggingface.co/matteogeniaccio/phi-4/tree/main

I got it working with my LLM tool like this:

  llm install llm-gguf
  llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
  llm chat -m gguf/phi-4-Q4_K_M

Here are some initial transcripts: https://gist.github.com/simonw/0235fd9f8c7809d0ae078495dd630...

More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...

tkellogg1y ago

I added Phi-4 to my reasoning model collection because it seems to exhibit reasoning behavior: it stopped to consider alternatives before concluding. I assume this is related to their choice of training data:

> Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner.

https://github.com/tkellogg/lrm-reasoning/blob/main/phi4.md

mirekrusin1y ago

This "draw pelican riding on bicycle" is quite deep if you think about it.

Phi is all about synthetic training, and a prompt -> svg -> render -> evaluate image -> feedback loop feels like an ideal fit for synthetic learning.

You can push it quite far with stuff like basic 2D physics, plotting the scene after N seconds, or optics/rays, magnetic forces, etc.

SVG as LLM window to physical world.
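
A rough sketch of what such a loop could look like. Everything here is an assumption for illustration: `generate_svg` is a hypothetical stand-in for the model call, and the "evaluate image" step is replaced by a crude check (well-formed SVG plus a shape count). A real setup would render the SVG and score the resulting image.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
SHAPES = {"rect", "circle", "ellipse", "line", "path", "polygon", "polyline"}


def score_svg(svg_text: str) -> float:
    """Crude placeholder evaluator: 0.0 if the text is not a well-formed SVG,
    otherwise a score that grows with the number of drawn shapes (max 1.0)."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return 0.0
    if root.tag not in ("svg", SVG_NS + "svg"):
        return 0.0
    count = sum(1 for el in root.iter() if el.tag.replace(SVG_NS, "") in SHAPES)
    return min(count / 10.0, 1.0)


def feedback_loop(prompt, generate_svg, rounds=3):
    """Ask for an SVG, score it, and pass the score back as feedback.
    `generate_svg` is a hypothetical callable: (prompt, feedback) -> SVG text."""
    best_svg, best_score, feedback = None, -1.0, ""
    for _ in range(rounds):
        svg = generate_svg(prompt, feedback)
        score = score_svg(svg)
        if score > best_score:
            best_svg, best_score = svg, score
        feedback = f"Previous attempt scored {score:.2f}; add more detail."
    return best_svg, best_score
```

The scoring function is the weak link: counting shapes says nothing about whether the picture looks like a pelican, so a real evaluator would have to look at the rendered image.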

dartos1y ago

> SVG as LLM window to physical world.

What? Let’s try not to go full forehead into hype.

SVGs would be an awfully poor analogy for the physical world…

syntaxing1y ago

Wow, those responses are better than I expected. Part of me was expecting terrible responses since Phi-3 was amazing on paper too but terrible in practice.

refulgentis1y ago

One of the funniest tech subplots in recent memory.

TL;DR it was nigh-impossible to get it to emit the proper "end of message" token. (IMHO the chat training was too rushed). So all the local LLM apps tried silently hacking around it. The funny thing to me was no one would say it out loud. Field isn't very consumer friendly, yet.

algo_trader1y ago

> More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...

Nice. Thanks.

Do you think sampling the stack traces of millions of machines is a good dataset for improving code performance? Maybe sample android/jvm bytecode.

Maybe a sort of novelty sampling to avoid re-sampling hot-path?

lifeisgood991y ago

The SVG created for the first prompt is valid but is a garbage image.

bentcorner1y ago

In general I've had poor results with LLMs generating pictures using text instructions (in my case I've tried to get them to generate pictures using plots in KQL). They work but the pictures are very very basic.

I'd be interested to see any LLM emitting any kind of text-to-picture instructions get results that are beyond kindergartner-cardboard-cutout levels of art.

chen_dev1y ago

Amazon Nova models:

https://gist.github.com/uschen/38fc65fa7e43f5765a584c6cd24e1...

simonw1y ago

Yeah, it didn't do very well on that one. The best I've had from a local model there was from QwQ: https://simonwillison.net/2024/Nov/27/qwq/

refulgentis1y ago

For context, pelican riding a bicycle: https://imgur.com/a/2nhm0XM

Copied SVG from gist into figma, added dark gray #444444 background, exported as PNG 1x.

fisherjeff1y ago

Looks like someone’s finally caught up with The Hallmark Channel’s LLM performance

selcuka1y ago

I tried the same GGUF on LMStudio and it seems to sometimes leak its training data.

    Prompt:
    Are you based on OpenAI?

    Reply:
    No, I am not based on OpenAI's models; I am developed by Microsoft. While both systems are designed for natural language processing and understanding, we may have different architectures, training methodologies, or datasets. However, our shared goal is to provide helpful and informative responses while ensuring user safety and privacy.<|user|>
    Are you similar to ChatGPT?<|end|>
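
If the model keeps running past its own turn like this, one client-side workaround is to cut the reply at the first special marker. This is only a sketch: the marker list is an assumption (`<|user|>` and `<|end|>` come from the transcript above, the `<|im_*|>` ones are guesses), and the cleaner fix is to configure the right stop token in the runtime.

```python
# Marker strings are assumptions; adjust them to whatever the model emits.
STOP_MARKERS = ("<|im_end|>", "<|end|>", "<|user|>", "<|im_start|>")


def truncate_at_stop(text: str, markers=STOP_MARKERS) -> str:
    """Cut the reply at the earliest stop marker, if any marker appears."""
    cut = len(text)
    for marker in markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


reply = "I am developed by Microsoft.<|user|>\nAre you similar to ChatGPT?<|end|>"
print(truncate_at_stop(reply))
```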

vergessenmir1y ago

When working with GGUF, what chat templates do you use? Pretty much every GGUF I've imported into Ollama has given me garbage responses. Converting the tokenizer JSON has yielded mixed results.

For example, how do you handle the phi-4 model's GGUF chat template?

simonw1y ago

I use whatever chat template is baked into the GGUF file.

You can click on the little info icon on Hugging Face to see that directly.

For https://huggingface.co/matteogeniaccio/phi-4/tree/main?show_... that's this:

  {% for message in messages %}{% if
  (message['role'] == 'system')
  %}{{'<|im_start|>system<|im_sep|>' +
  message['content'] + '<|im_end|>'}}{%
  elif (message['role'] == 'user')
  %}{{'<|im_start|>user<|im_sep|>' +
  message['content'] +
  '<|im_end|><|im_start|>assistant<|im_sep|>'}}{%
  elif (message['role'] == 'assistant')
  %}{{message['content'] + '<|im_end|>'}}{%
  endif %}{% endfor %}
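
To make it easier to see what that template produces, here is a plain-Python sketch of the same logic. The function name `render_phi4_prompt` and the example messages are made up for illustration. In practice the runtime applies the Jinja template for you.

```python
def render_phi4_prompt(messages):
    """Plain-Python equivalent of the Jinja chat template above (a sketch)."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append("<|im_start|>system<|im_sep|>" + content + "<|im_end|>")
        elif role == "user":
            # The user turn also opens the assistant turn, so generation
            # continues from "<|im_start|>assistant<|im_sep|>".
            parts.append(
                "<|im_start|>user<|im_sep|>" + content
                + "<|im_end|><|im_start|>assistant<|im_sep|>"
            )
        elif role == "assistant":
            # No header here: the preceding user turn already emitted it.
            parts.append(content + "<|im_end|>")
    return "".join(parts)


prompt = render_phi4_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Are you based on OpenAI?"},
])
print(prompt)
```

Note that the end-of-turn marker here is `<|im_end|>`. A client that is waiting for a different end marker would keep generating past the end of the answer.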

belter1y ago

> it was trained using synthetic data

Is this not supposed to cause Model collapse?

rhdunn1y ago

It depends on how you construct the synthetic data and how the model is trained on that data.

For diffusion-based image generators, training only on synthetic data over repeated rounds of model training can cause model collapse, as errors in the output are amplified in each newly trained model. It usually takes until the 2nd or 3rd model created this way (with the output of the previous model used as training input for the next) for it to collapse.

It was found that using primary data alongside synthetic data avoided the model collapse. Likewise, some sort of human scoring/evaluation can help avoid artefacts.
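
As a minimal sketch of the "primary data alongside synthetic data" idea, the helper below builds a training batch with a fixed share of primary examples. The function name, the 25% fraction in the example, and the sampling scheme are all assumptions for illustration. Nothing in this thread says what ratio is actually needed.

```python
import random


def mix_datasets(primary, synthetic, primary_fraction, size, seed=0):
    """Sample `size` examples so that about `primary_fraction` of them come
    from primary (human-written) data and the rest from synthetic data."""
    if not 0.0 <= primary_fraction <= 1.0:
        raise ValueError("primary_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_primary = round(size * primary_fraction)
    # Sampling with replacement keeps the sketch simple for small datasets.
    batch = rng.choices(primary, k=n_primary)
    batch += rng.choices(synthetic, k=size - n_primary)
    rng.shuffle(batch)
    return batch


batch = mix_datasets(["p1", "p2"], ["s1", "s2", "s3"], primary_fraction=0.25, size=8)
print(batch)
```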

simonw1y ago

This is why I don't think model collapse actually matters: people have been deliberately training LLMs on synthetic data for over a year at this point.

As far as I can tell model collapse happens when you deliberately train LLMs on low quality LLM-generated data so that you can write a paper about it.

ziofill1y ago

I may have misunderstood, but I think that it depends a lot on the existence of a validation mechanism. Programming languages have interpreters and compilers that can provide a useful signal, while for images and natural language there isn’t such an automated mechanism, or at least it’s not that straightforward.
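
For code, the cheapest version of that signal is "does it even parse". A small sketch, using Python's own parser as the validator. The helper names are made up, and a syntax check says nothing about whether the code is correct, so a real pipeline would presumably also run tests.

```python
import ast


def compiles(source: str) -> bool:
    """Return True if `source` parses as Python. This checks syntax only,
    not whether the code is correct or even runs."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def filter_valid(samples):
    """Keep only the synthetic samples that pass the syntax check."""
    return [s for s in samples if compiles(s)]


samples = [
    "def f(x):\n    return x + 1\n",
    "def f(x) return x",  # missing colon, rejected
    "print('ok')",
]
kept = filter_valid(samples)
print(len(kept), "of", len(samples), "samples kept")
```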

nxobject1y ago

As someone who's a complete layman: I wonder if the results of model collapse are no worse than, say, sufficiently complex symbolic AI (modulo consistency and fidelity?)

fulafel1y ago

No.

mhh__1y ago

Along those lines (synthetic data) I would keep an eye on the Chinese labs given that they are probably quite data and compute constrained, in English at least.

Havoc1y ago

>Microsoft haven't officially released the weights

Thought it was official just not on huggingface but rather whatever azure competitor thing they're pushing?

simonw1y ago

I found their AI Foundry thing so hard to figure out I couldn't tell if they had released weights (as opposed to a way of running it via an API).

Since there are GGUFs now, someone must have released some weights somewhere.

patrick0d1y ago

this vibe check is more insightful to me than the popular evals. nice job!

thot_experiment1y ago

For prompt adherence it still fails on tasks that Gemma2 27b nails every time. I haven't been impressed with any of the Phi family of models. The large context is very nice, though Gemma2 plays very well with self-extend.

impossiblefork1y ago

It's a much smaller model though.

I think the point is more the demonstration that such a small model can have such good performance than any actual usefulness.

magicalhippo1y ago

Gemma2 9B has significantly better prompt adherence than Llama 3.1 8B in my experience.

I've just assumed it's down to how it was trained, but I'm no expert.

jacoblambda1y ago

Yeah they mention this in the weaknesses section.

> While phi-4 demonstrates relatively strong performance in answering questions and performing reasoning tasks, it is less proficient at rigorously following detailed instructions, particularly those involving specific formatting requirements.

thot_experiment1y ago

Ah good catch, I am forever cursed in my preference for snake over camel.

xeckr1y ago

Looks like it punches way above its weight(s).

How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro?

anon3738391y ago

We’re already past that point! MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 7B, or Gemma 2 9B. These models run at very comfortable speeds on Apple Silicon. And they are distinctly more capable and less prone to hallucination than GPT-3.5 was.

Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4, and they will run on MacBook Pros with at least 64GB of RAM. However, I have an M3 Max and I can’t say that models of this size run at comfortable speeds. They’re a bit sluggish.
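
The back-of-the-envelope arithmetic behind the RAM numbers, as a sketch: weights take roughly parameters × bits per weight / 8 bytes. The helper name and the round 4-bit / 16-bit figures are assumptions. Real quantization formats typically use somewhat more than 4 bits per weight, and the KV cache and runtime overhead come on top.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).
    Ignores the KV cache, activations and runtime overhead."""
    return params_billions * bits_per_weight / 8


for name, params in [("8B", 8), ("14B", 14), ("70B", 70)]:
    print(name,
          weight_memory_gb(params, 16), "GB at 16-bit,",
          weight_memory_gb(params, 4), "GB at 4-bit")
```

By this estimate a 70B model needs about 35 GB for 4-bit weights, which is why it fits in 64GB of unified memory but not in 16GB.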

noman-land1y ago

The coolness of local LLMs is THE only reason I am sadly eyeing upgrading from M1 64GB to M4/5 128+GB.

kgeist1y ago

>MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 7B, or Gemma 2 9B.

If only those models supported anything other than English

bloomingkales1y ago

M4 Mac mini 16gb for $500. It's literally an inferencing block (small too, fits in my palm). I feel like the whole world needs one.

alganet1y ago

> inferencing block

Did you mean _external gpu_?

Choose any 12GB or more video card with GDDR6 or superior and you'll have at least double the performance of a base m4 mini.

The base model is almost an older generation. Thunderbolt 4 instead of 5, slower bandwidths, slower SSDs.

lappaOP1y ago

It's easy to argue that Llama-3.3 8B performs better than GPT-3.5. Compare their benchmarks, and try the two side-by-side.

Phi-4 is yet another step towards a small, open, GPT-4 level model. I think we're getting quite close.

Check the benchmarks comparing to GPT-4o on the first page of their technical report if you haven't already https://arxiv.org/pdf/2412.08905

vulcanash9991y ago

Did you mean Llama-3.1 8B? Llama 3.3 currently only has a 70B model as far as I’m aware.

ActorNightly1y ago

Why would you want to though? You already can get free access to large LLMs and nobody is doing anything groundbreaking with them.

jckahn1y ago

I only use local, open source LLMs because I don’t trust cloud-based LLM hosts with my data. I also don’t want to build a dependence on proprietary technology.

refulgentis1y ago

We're there, Llama 3.1 8B beats Gemini Advanced for $20/month. Telosnex with llama 3.1 8b GGUF from bartowski. https://telosnex.com/compare/ (How!? tl;dr: I assume Google is sandbagging and hasn't updated the underlying Gemini)

simonw1y ago

We're there. Llama 3.3 70B is GPT-4 level and runs on my 64GB MacBook Pro: https://simonwillison.net/2024/Dec/9/llama-33-70b/

The Qwen2 models that run on my MacBook Pro are GPT-4 level too.

BoorishBears1y ago

Saying these models are at GPT-4 level is setting anyone who doesn't place special value on the local aspect up for disappointment.

Some people do place value on running locally, and I'm not against them for it, but realistically no 70B class model has the same amount of general knowledge or understanding of nuance as any recent GPT-4 checkpoint.

That being said, these models are still very strong compared to what we had a year ago and capable of useful work.

n144q1y ago

I wouldn't call 64GB MacBook Pro "regular consumer hardware".

excerionsforte1y ago

Looks like someone converted it for Ollama use already: https://ollama.com/vanilj/Phi-4

accrual1y ago

I've had great success with quantized Phi-4 14B and Ollama so far. It's as fast as Llama 3.1 8B but the results have been (subjectively) higher quality. I copy/pasted some past requests into Phi-4 and found the answers were generally better.

_ea1k1y ago

I really like the ~3B param version of phi-3. It wasn't very powerful and overused memory, but was surprisingly strong for such a small model.

I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than llama 3 or Mistral at this point. Maybe that will be wrong, but I don't have high hopes.

travisgriggs1y ago

Where have I been? What is a “small” language model? Wikipedia just talks about LLMs. Is this a sort of spectrum? Are there medium language models? Or is it a more nuanced classifier?

hagen_dogs1y ago

I think it came from this paper, TinyStories (https://arxiv.org/abs/2305.07759). IIRC this was also the inspiration for the Phi family of models. The essential point of the TinyStories paper is: "if we train a model on text meant for 3-4 year olds, since that's much simpler, shouldn't we need fewer parameters?" Which turns out to be correct. In the original they have a model that's 32 million parameters, and they compare it to GPT-2 (1.5 billion parameters), and the 32M model does much better. Microsoft has been interested in this because "smaller models == less resource usage", which means they can run on consumer devices. You can easily run TinyStories on your phone, which is presumably what Microsoft wants to do too.

dboreham1y ago

There are all sizes of models from a few GB to hundreds of GB. Small presumably means small enough to run on end-user hardware.

narag1y ago

7B vs 70B parameters... I think. The small ones fit in the memory of consumer grade cards. That's what I more or less know (waiting for my new computer to arrive this week)

agnishom1y ago

How many parameters did ChatGPT have in Dec 2022 when it first broke into mainstream news?

tbrownaw1y ago

It's a marketing term for the idea that quality over quantity in training data will lead to smaller models that work as well as larger models.

mupuff12341y ago

So we moved from "reasoning" to "complex reasoning".

I wonder what will be next month's buzzphrase.

TeMPOraL1y ago

> So we moved from "reasoning" to "complex reasoning".

Only from the perspective of those still complaining about the use of the term "reasoning", who now find themselves left behind as the world has moved on.

For everyone else, the phrasing change perfectly fits the technological change.

HarHarVeryFunny1y ago

Reasoning basically means multi-step prediction, but to be general the reasoner also needs to be able to:

1) Realize when it's reached an impasse, then backtrack and explore alternatives

2) Recognize when no further progress towards the goal appears possible, and switch from exploiting existing knowledge to exploring/acquiring new knowledge to attempt to proceed. An LLM has limited agency, but could for example ask a question or do a web search.

In either case, prediction failure needs to be treated as a learning signal so the same mistake isn't repeated, and when new knowledge is acquired that needs to be remembered. In both cases this learning would need to persist beyond the current context in order to be something that the LLM can build on in the future - e.g. to acquire a job skill that may take a lot of experience/experimentation to master.

It doesn't matter what you call it (basic or advanced), but it seems that current attempts at adding reasoning to LLMs (e.g. GPT-o1) are based around 1), a search-like strategy, and learning is in-context and ephemeral. General animal-like reasoning needs to also support 2) - resolving impasses by targeted new knowledge acquisition (and/or just curiosity-driven experimentation), as well as continual learning.
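
To make 1) concrete, here is what "hit an impasse, back up, try an alternative" looks like in a classical search setting. This is a toy depth-first maze solver chosen purely for illustration. It is not a claim about how o1 or any LLM implements reasoning, and it has nothing corresponding to 2) or to learning that persists between runs.

```python
def find_path(grid, pos, goal, visited=None):
    """Depth-first search with backtracking on a grid of 0 (open) / 1 (wall).
    Returns a list of (row, col) steps from pos to goal, or None."""
    if visited is None:
        visited = set()
    r, c = pos
    if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
        return None
    if grid[r][c] == 1 or pos in visited:
        return None
    if pos == goal:
        return [pos]
    visited.add(pos)
    for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
        rest = find_path(grid, (r + dr, c + dc), goal, visited)
        if rest is not None:
            return [pos] + rest
    # Impasse: every move from here failed, so back up and let the caller
    # try its next alternative.
    return None


maze = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
print(find_path(maze, (0, 0), (2, 2)))
```

The `visited` set only lasts for one call, which is the "in-context and ephemeral" part: the solver remembers dead ends while solving this maze and forgets them afterwards.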

criddell1y ago

If you graded humanity on their reasoning ability, I wonder where these models would score?

I think once they get to about the 85th percentile, we could upgrade the phrase to advanced reasoning. I'm roughly equating it with the percentage of the US population with at least a master's degree.

chairhairair1y ago

All current LLMs openly make simple mistakes that are completely incompatible with true "reasoning" (in the sense any human would have used that term years ago).

I feel like I'm taking crazy pills sometimes.

zurfer1y ago

Model releases without comprehensive coverage of benchmarks make me deeply skeptical.

The worst was the GPT-4o update in November: basically a two-liner on what it is better at, while in reality it regressed on multiple benchmarks.

Here we just get MMLU, which is widely known to be saturated, and knowing they trained on synthetic data, we have no idea how much "weight" was given to having MMLU-like training data.

Benchmarks are not perfect, but they give me context to build upon.

---

edit: the benchmarks are covered in the paper: https://arxiv.org/pdf/2412.08905

PoignardAzur1y ago

Saying that a 14B model is "small" feels a little silly at this point. I guess it doesn't require a high-end graphics card?

ai_biden1y ago

I'm not too excited by the Phi-4 benchmark results. It is #BenchmarkInflation.

Microsoft Research just dropped Phi-4 14B, an open-source model that’s turning heads. It claims to rival Llama 3.3 70B with a fraction of the parameters — 5x fewer, to be exact.

What’s the secret? Synthetic data. -> Higher quality, Less misinformation, More diversity

The Phi models always have great benchmark scores, but they always disappoint me in real-world use cases.

The Phi series is famous for being trained on benchmarks.

I tried again with #phi4 through Ollama, but it's not satisfactory.

To me, at the moment, IFEval is the most important LLM benchmark.

But look at the smart business strategy of Microsoft:

1) have unlimited access to gpt-4
2) prompt it to generate 30B tokens
3) train a 1B parameter model
4) call it phi-1
5) show benchmarks beating models 10x the size
6) never release the data
7) never detail how to generate the data (this time they described it at a very high level)
8) claim victory over small models

liminal1y ago

Is 14B parameters still considered small?

parmesean1y ago

13.8 epochs of the benchmarks?
