Benchmarks and comparison of LLM AI models and API hosting providers

(artificialanalysis.ai)

152 points · Gcam · 2y ago · 70 comments

Hi HN, ArtificialAnalysis.ai provides objective benchmarks and analysis of LLM AI models and API hosting providers so you can compare which to use in your next (or current) project.

The site consolidates different quality benchmarks, pricing information and our own technical benchmarking data. Technical benchmarking (throughput, latency) is conducted by sending API requests every 3 hours.

Check out the site at https://artificialanalysis.ai, and our twitter at https://twitter.com/ArtificialAnlys

Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/17472648324397343...

All feedback is welcome, and we're happy to discuss methodology, etc.

70 comments

60 comments · 23 top-level

bearjaws2y ago· 7 in thread

I've been using Mixtral and Bard ever since the end of the year. I am pleased with their performance overall for a mixture of content generation and coding.

It seems to me GPT-4 has become terse in its outputs; you have to do a lot more CoT-style prompting to get it to actually output a good result, which is excruciating given how slow it is to produce content.

Mixtral on Together AI is crazy fast at ~70-100 tokens/s, and the quality works for my use case as well.

avereveard2y ago

OpenAI is an unreliable provider. Even if their models don't change, as they say, there's a current issue where they blanket-added a guardian tool to enforce content policies that are obscure, and the tool is overeager, causing quite a stir across startups, where this manifests on the surface like an outage.

It will get better as they fix it and tune it, but their entire release pipeline is absolutely bonkers: no forewarning, no test environment, no opt-out. It's scarily amateurish for a billion dollar company.

outside4152y ago

$80bn company *

vunderba2y ago

If we're talking about the API, it seems like it's short because it is shorter. The latest version of GPT-4 (1106) might have a significantly larger input window, but its maximum output token size is limited to 4096 tokens.

It's likely that ChatGPT uses the 1106 model underneath the covers or some variant, so it probably suffers from the same restricted output window.

djsh2y ago

If you like that speed, you would love Mixtral running at >500 tokens/s @ Groq https://www.youtube.com/watch?v=5fJyOVtOk4Y

In full disclosure, I have worked on getting this up @ Groq.

PS: Experience the speed for yourself, LLama2-70B, at https://chat.groq.com/

thierrydamiba2y ago

Can you give an example of a query where you find GPT-4 is short with outputs? I've used custom instructions, so that may have shielded me from this change.

declaredapple2y ago

At least for me making tests has been very frustrating, full of many "test conditions here" and "continue with the rest of the tests".

It _hates_ making assumptions about things it doesn't know for sure, I suspect because of "anti-hallucination" nonsense. Instead it has to be shoved to even try making any assumptions, even reasonable ones.

I know it's capable of making reasonable assumptions for class structures/behaviour, etc., where I can just tweak it as needed to work. It just refuses to. I've even seen comments like "We'll put the rest of the code in later".

bearjaws2y ago

Given this JSON: <JSON examples> And this Table schema: <Table Schema in SQL>

Create JavaScript to insert the JSON into the SQL using knex('table_name')

Below is part of its output:

  // Insert into course_module table
  await knex('course_module').insert({
    id: moduleId,
    name: courseData.name,
    description: courseData.description,
    // include other required fields with appropriate values
  });

It's missing several columns it could populate with the data it knows from the prompt, primarily created_at, updated_at, account_id, user_id, lesson number... and instead I get a comment telling me to do it.

There are a lot of people complaining about this, primarily on Reddit, but usually the ChatGPT fanboys jump in to defend OAI.

GcamOP2y ago· 5 in thread

Hi HN, Thanks for checking this out! Goal with this project is to provide objective benchmarks and analysis of LLM AI models and API hosting providers to compare which to use in your next (or current) project. Benchmark comparisons include quality, price, technical performance (e.g. throughput, latency).

Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/17472648324397343...

All feedback is welcome

ttt3ts2y ago

Any chance of including some of the better fine-tunes, e.g. wizard or tulu? (Worse than Mixtral, but I assume other fine-tunes will be better, just like wizard and tulu are better than LLAMA2.)

I guess their cost is the same as the base model, although it would affect performance.

_micah_h2y ago

Hey, yeah the bar for adding finetunes will probably be that they're being hosted by ~3 supported hosting providers. Very much open to it!

YetAnotherNick2y ago

Can a quality score be added for each inference provider of the same model? Many of them use different quantization and approximations, so it's not just price and throughput that matter, especially for a model like Mixtral.

bravura2y ago

I'd love to see replicate.com (pay per sip) on there. And lambdalabs.com

[edit: And also MPS]

_micah_h2y ago

We've been waiting on Replicate to launch per-token pricing for LLMs because their previous pay-per-second model was uncompetitive - but it looks like they might have just turned it on with no big announcement! They'll go straight to the top of the priority list.

Do Lambda have a serverless inference API? Not aware of them playing in this space yet.

Presume you mean MPT not MPS - yep we'll look into MosaicML soon.

avereveard2y ago· 4 in thread

I wish Claude Instant was in there; it's a damn fine model, often overlooked.

coder5432y ago

What do you like about it? Compared to GPT-3.5, Claude Instant seems to be the same or worse in quality according to both human and automated benchmarks, but also more expensive. It seems undifferentiated. And I would rather use Mixtral than either of those in most cases, since Mixtral often outperforms GPT-3.5 and can be run on my own hardware.

avereveard2y ago

Data extraction mostly. It supports long documents, has cheaper input tokens than gpt3 turbo, and when I ask it to stick to the document's information it doesn't try to fill in gaps with its trained knowledge.

Sure, you can't have a chat with it or expect it to do high-level reasoning, but it has enough to do the basic deductions for grounded answers.

GcamOP2y ago

We have Claude Instant on the models page: https://artificialanalysis.ai/models Can add it via the select at the top right of each card where it says '9 Selected' (below the highlight charts)

avereveard2y ago

Ah cool, was on mobile and didn't see the select.

scribu2y ago· 3 in thread

I’m not sure about the Speed chart. I would expect gpt-4-turbo to be faster than plain gpt-4.

_micah_h2y ago

Check out the graphs over time on the model pages - https://artificialanalysis.ai/models/gpt-4-turbo-1106-previe....

OpenAI are doing a ton of load balancing, presumably constantly tweaking batch sizes to try to optimize across all their workloads.

You can test the GPT-4 vs GPT-4 Turbo on Playground to intuitively confirm that the speeds are similar.

pseudosavant2y ago

I thought so too. Could it be that GPT-4 Turbo is more efficient for them to run, so the price is lower, but they try to maintain the token throughput of GPT-4 over their API? There are a lot of ways they could allocate and configure their GPU resources so that GPT-4 Turbo provides the same per-user throughput while greatly increasing their system throughput.

bredren2y ago

The speed of GPT-4 via ChatGPT varies greatly depending on when you're using it.

Could the data have been collected when the system is under different loads?

badFEengineer2y ago· 2 in thread

nice, I've been looking for something like this! A few notes / wishlist items:

* Looks like for gpt-4 turbo (https://artificialanalysis.ai/models/gpt-4-turbo-1106-previe...), there was a huge latency spike on December 28, which is causing the avg. latency to be very high. Perhaps dropping top and bottom 10% of requests will help with avg (or switch over to median + include variance)

* Adding latency variance would be truly awesome, I've run into issues with some LLM API providers where they've had incredibly high variance, but I haven't seen concrete data across providers
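A minimal sketch of the trimmed average and median suggested above, assuming latencies are collected as a plain array of milliseconds. The function names and sample values are hypothetical and are not taken from the site's actual pipeline.

```javascript
// Hypothetical helpers for summarizing latency samples (milliseconds).

// Mean after dropping the lowest and highest `trim` fraction of samples.
// `trim` must be below 0.5, otherwise nothing is left to average.
function trimmedMean(samples, trim = 0.1) {
  const sorted = [...samples].sort((a, b) => a - b);
  const drop = Math.floor(sorted.length * trim);
  const kept = sorted.slice(drop, sorted.length - drop);
  return kept.reduce((sum, x) => sum + x, 0) / kept.length;
}

// Middle value; average of the two middle values for even-length input.
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Population variance around the plain mean.
function variance(samples) {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  return samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
}

// Made-up samples with a single large spike.
const latenciesMs = [500, 600, 600, 700, 700, 800, 800, 900, 1000, 30000];
const plainMean = latenciesMs.reduce((sum, x) => sum + x, 0) / latenciesMs.length;

console.log({
  plainMean,
  trimmedMean: trimmedMean(latenciesMs),
  median: median(latenciesMs),
  variance: variance(latenciesMs),
});
```

With these made-up samples, the single 30-second spike pulls the plain mean to 3660 ms, while the trimmed mean is 762.5 ms and the median is 750 ms.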

GcamOP2y ago

Thanks for the feedback and glad it is useful! Yes, agree that might be more representative of future use. I think a view of variance would be a good idea; it is currently only shown in the over-time views. Maybe a histogram of response times or a box-and-whisker plot. We have a newsletter subscribe form on the website, or follow us on Twitter (https://twitter.com/ArtificialAnlys) if you want future updates.

AaronFriel2y ago

Variance would be good, and I've also seen significant variance on "cold" request patterns, which may correspond to resources scaling up on the backend of providers.

Would be interesting to see request latency and throughput when API calls occur cold (first data point), and once per hour, minute, and per second with the first N samples dropped.

Also, at least with Azure OpenAI, the AI safety features (filtering & annotations) make a significant difference in time to first token.

com2kid2y ago· 2 in thread

I wish more places showed Time To First Token. For scenarios with real-time human interaction, the important part is how long until the first token is returned, and whether tokens are generated faster than people consume them.

Sadly very few benchmarks bother to track this.
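A rough sketch of the two numbers described above, assuming you have already recorded the request start time and the arrival time of each streamed token. The function names, the timestamps, and the default reading speed of 5 tokens per second are all assumptions for illustration, not measured values.

```javascript
// Hypothetical helper: derive time to first token (TTFT) and output
// throughput from arrival timestamps recorded while consuming a stream.
function streamMetrics(requestStartMs, tokenTimesMs) {
  if (tokenTimesMs.length === 0) return null;
  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const generationSeconds = (last - first) / 1000;
  return {
    // Delay between sending the request and seeing the first token.
    ttftMs: first - requestStartMs,
    // Tokens after the first one, per second of generation time.
    tokensPerSecond:
      generationSeconds > 0
        ? (tokenTimesMs.length - 1) / generationSeconds
        : null,
  };
}

// Assumed reading speed; replace with whatever fits your users.
function keepsUpWithReader(tokensPerSecond, readerTokensPerSecond = 5) {
  return tokensPerSecond !== null && tokensPerSecond >= readerTokensPerSecond;
}

// Made-up trace: first token after 400 ms, then one token every 100 ms.
const tokenTimesMs = Array.from({ length: 11 }, (_, i) => 400 + i * 100);
const metrics = streamMetrics(0, tokenTimesMs);

console.log(metrics, keepsUpWithReader(metrics.tokensPerSecond));
```

For this made-up trace, TTFT is 400 ms and throughput is 10 tokens/s, which is above the assumed reading speed.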

GcamOP2y ago

Hi, we have this if you take a look at the models page (https://artificialanalysis.ai/models) and scroll down to 'Latency', and also on the API host comparison pages for each model (e.g. https://artificialanalysis.ai/models/llama-2-chat-70b)

com2kid2y ago

Ah so you do!

Your latency numbers for OpenAI (and Azure's equivalents) seem really high, I run time to first token tests and I see much better numbers!

(Also are those numbers average, p50, p99, etc? I'd honestly expect a box plot to really see what is going on!)

sabareesh2y ago· 2 in thread

I want to see benchmarks for RAG. Most of the models are not very good with RAG

BeetleB2y ago

Curious to hear your experience. I built a simple RAG using GPT4-Turbo some weeks ago. Only used it for a few hours but was mostly satisfied. I did notice if I sent it too many documents, it would not find the (one) doc I was looking for.

sabareesh2y ago

GPT-4 Turbo is top of the class; it does RAG very well. It is important to provide good context with the help of a vector DB, but if you cannot provide the relevant document it cannot do much. All the open-source models are super bad at this, and mostly I want to blame fine-tuning to get onto the leaderboard for affecting the quality.

throwawaymaths2y ago· 2 in thread

Latency (ttft) would be a nice metric.

GcamOP2y ago

We have this (and other more detailed metrics) on the models page https://artificialanalysis.ai/models if you scroll down and for individual hosts if you click into a model (nav or click one of the model bars/bubbles) :)

There are some interesting views of throughput vs. latency whereby some models are slower to the first chunk but faster for subsequent chunks and vice versa, and so suit different use cases (e.g. if just want a true/false vs. more detailed model responses)

throwawaymaths2y ago

Thanks!

elicksaur2y ago· 2 in thread

> Application error: a client-side exception has occurred (see the browser console for more information).

iOS Safari

GcamOP2y ago

Thanks for letting me know. Odd, as it's not occurring with my iOS Safari. Can anyone else please let me know if they are encountering this issue (and their iOS version if possible)? There is a console error, but it should be just a React defaultProps deprecation notice from a library being used (it should not break the DOM).

elicksaur2y ago

Tried the link again, and it works now! Sorry for not having more info.

zurfer2y ago· 1 in thread

This is great. Thank you! I would be especially interested in more details around speed. Average is a good starting point, but I would love to also see standard deviation or the 90th and 99th percentiles.

In my experience speed varies a lot, and it makes a big difference whether a request takes 10 seconds or 50 seconds.
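A small sketch of the percentile view requested above, using the nearest-rank method. The choice of method, the function name, and the sample durations are assumptions for illustration; other percentile definitions interpolate between samples and give slightly different values.

```javascript
// Hypothetical helper: nearest-rank percentile of a list of samples.
// p is a percentage in the range (0, 100].
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up request durations in seconds.
const durationsSec = [10, 11, 12, 12, 13, 14, 15, 18, 25, 50];

console.log({
  p50: percentile(durationsSec, 50),
  p90: percentile(durationsSec, 90),
  p99: percentile(durationsSec, 99),
});
```

On these made-up durations the median is 13 seconds, while p90 is 25 seconds and p99 is 50 seconds, which is the kind of tail an average hides.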

GcamOP2y ago

Thanks for the feedback! Yes, agree this would be a good idea. We don't have this view but best place to get an idea of this with current site would be the /models page (https://artificialanalysis.ai/models) and scrolling to the over time graphs and looking at the variance. To see if being driven by individual hosts can also click into the by-model pages and see the over time graphs, e.g. https://artificialanalysis.ai/models/mixtral-8x7b-instruct

causal2y ago· 1 in thread

Thanks for putting this together! Amazon is far and away the priciest option here, but I wonder if a big part of that is the convenience tax for the Bedrock service. Would be interesting to compare that to the price of just renting AWS GPUs on EC2.

GcamOP2y ago

Yes! An interesting insight is that the smaller, emerging hosts also offer strong relative performance (throughput - tokens per second)

binsquare2y ago· 1 in thread

I'm surprised to see Perplexity's 70B online model score so low on model quality, and somehow far worse than Mixtral and GPT-3.5 (they use a fine-tuned GPT-3.5 as the foundational model, AFAIK).

I run https://www.labophase.com and my data suggests that it's one of the top 3 models in terms of users liking to interact with it. May I know how model quality is benchmarked to understand this discrepancy?

GcamOP2y ago

Model quality index methodology is as per this comment (can add perplexity using the dropdown): https://news.ycombinator.com/item?id=39014985#39017632

It's a combination of different quality metrics which have Perplexity, overall, not performing as well. That being said, I think we are in the very early stages of model quality scoring/ranking - and (for closed sourced models) we are seeing frequent changes. Will be interesting to see how measures evolve / model ranks change

idiliv2y ago· 1 in thread

I'm curious how they evaluated model quality. The only information I could find is "Quality: Index based on several quality benchmarks".

GcamOP2y ago

Quality index is equally-weighted normalized values of Chatbot Arena Elo Score, MMLU, and MT Bench.

We have a bit more information in the FAQ: https://artificialanalysis.ai/faq but thanks for the feedback, will look into expanding more on how the normalization works. We are thinking of ways to improve this generalized metric.

A sticking point is that quality can of course be thought of from different perspectives: reasoning, knowledge (retrieval), use-case specific (coding, math, readability), etc. This is why we show individual scores on the home page and models page: https://artificialanalysis.ai/models
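One way an equally-weighted index of normalized scores could be computed is sketched below. The thread does not say how the normalization works, so min-max scaling across the compared models is purely an assumption here, and the model names and scores are invented placeholders rather than real benchmark results.

```javascript
// Scale a list of scores to the 0..1 range (min-max normalization).
// This normalization choice is an assumption, not the site's stated method.
function minMaxNormalize(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map((v) => (max === min ? 0.5 : (v - min) / (max - min)));
}

// Average the normalized metrics with equal weights for each model.
function qualityIndex(models, metricNames) {
  const columns = metricNames.map((metric) =>
    minMaxNormalize(models.map((model) => model[metric]))
  );
  return models.map((model, i) => ({
    name: model.name,
    index: columns.reduce((sum, col) => sum + col[i], 0) / metricNames.length,
  }));
}

// Invented placeholder scores, not real results.
const placeholderModels = [
  { name: "model-a", arenaElo: 1200, mmlu: 80, mtBench: 9 },
  { name: "model-b", arenaElo: 1100, mmlu: 70, mtBench: 8 },
  { name: "model-c", arenaElo: 1000, mmlu: 60, mtBench: 7 },
];

const indexed = qualityIndex(placeholderModels, ["arenaElo", "mmlu", "mtBench"]);
console.log(indexed);
```

Note that with min-max scaling the index depends on which models are in the comparison set, so adding or removing a model can shift every other model's value.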

luke-stanley2y ago· 1 in thread

This is awesome. I was looking at benchmarking speed and quality myself but didn't go this far! I wonder about Claude Instant and Phi 2? Modal.com for inference felt crazy fast, but I didn't note the metrics. Good ones to add? Replicate.com too maybe?

GcamOP2y ago

Thanks! For Claude Instant, select the dropdown on the top right of the card where it says '8 Selected' and you can add it to the graphs. Thanks for the suggestions of adding Phi 2 and Modal.com as a host, can look into these!

jdthedisciple2y ago· 1 in thread

Really neat!

And I did not realize how much Gemini Pro lags behind GPT4 in terms of quality, wow!

_micah_h2y ago

Gemini Ultra is the model they claim will match GPT-4, not out yet!

rubymamis2y ago· 1 in thread

I wish there were more details about how you measure "quality".

pseudosavant2y ago

See this comment: https://news.ycombinator.com/item?id=39014985#39017792

jafitc2y ago· 1 in thread

Deepinfra Mixtral is $0.27 / M tokens as per their website

_micah_h2y ago

Hey, yep looks like they updated their pricing - we've now updated it on the site!

chadash2y ago

I love it. One minor change I'd make is changing the pricing chart to put lowest on the left. On the other highlights, left to right goes from best to worst, but this one is the opposite.

I'm excited to see where things land. What I find interesting is that pricing is either wildly expensive or wildly cheap, depending on your use case. For example, if you want to run GPT-4 to glean insights on every webpage your users visit, a freemium business model is likely completely unviable. On the other hand, if I'm using an LLM to spot issues in a legal contract, I'd happily pay 10x what GPT4 currently charges for something marginally better (It doesn't make much difference if this task costs $4 vs $0.40). I think that the ultimate "winners" in this space will have a range of models at various price points and let you seamlessly shift between them depending on the task (e.g., in a single workflow, I might have some sub-tasks that need a cheap model and some that require an expensive one).

m3kw92y ago

I feel sorry for all other models when gpt4.5 comes out. If you are not at gpt4 level it's pretty useless other than for having some fun.

djsh2y ago

Since we are talking about throughput of API hosting providers, I wanted to add in the work we have done at Groq. I understand that the team is getting in touch with the ArtificialAnalysis folks to get benchmarked.

Mixtral running at >500 tokens/s @ Groq https://www.youtube.com/watch?v=5fJyOVtOk4Y Experience the speed for yourself, LLama2-70B, at https://chat.groq.com/

vunderba2y ago

It's probably beyond the scope of this project, but it would be great to see comparisons across different quant levels (e.g. 4-bit, etc), since this can sometimes result in an extreme drop off in quality, but it's an important factor to consider when hosting your own LLM.

MacsHeadroom2y ago

Perhaps price should be tokens per dollar, to keep the charts all "higher is better."
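The conversion suggested above is a one-liner, sketched here under the assumption that prices are quoted in USD per million tokens, as in the Deepinfra figure mentioned earlier in the thread. The function name is hypothetical.

```javascript
// Hypothetical helper: convert a price in USD per million tokens
// into tokens per dollar, so that higher is better.
function tokensPerDollar(usdPerMillionTokens) {
  if (!(usdPerMillionTokens > 0)) {
    throw new Error("price must be a positive number");
  }
  return 1_000_000 / usdPerMillionTokens;
}

// $0.27 per million tokens is the Deepinfra Mixtral price quoted in the thread.
console.log(Math.round(tokensPerDollar(0.27)));
console.log(tokensPerDollar(0.5));
```

At $0.27 per million tokens this comes out to roughly 3.7 million tokens per dollar. One caveat of the inverted scale is that free tiers (a price of zero) have no finite value, which is why the sketch rejects non-positive prices.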

wonderfuly2y ago

If you want to compare LLMs on daily usage, check out: https://chathub.gg
