The site consolidates different quality benchmarks, pricing information and our own technical benchmarking data. Technical benchmarking (throughput, latency) is conducted by sending API requests every 3 hours.
Check out the site at https://artificialanalysis.ai, and our Twitter at https://twitter.com/ArtificialAnlys
Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/17472648324397343...
All feedback is welcome, and we're happy to discuss methodology, etc.
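For readers wondering what a single measurement might look like, here is a minimal sketch. It assumes latency means time to first chunk and throughput means tokens per second after the first chunk; that definition, the `summarizeStream` name, and the `{ tMs, tokens }` record shape are illustrative assumptions, not the site's actual code.

```javascript
// Hypothetical helper: summarize one streamed response from recorded timings.
// `requestStartMs` is when the request was sent; `chunks` is a list of
// { tMs, tokens } records, one per streamed chunk, in arrival order.
function summarizeStream(requestStartMs, chunks) {
  if (chunks.length === 0) {
    throw new Error("no chunks received");
  }
  const first = chunks[0];
  const last = chunks[chunks.length - 1];

  // Latency: time to first chunk, in seconds.
  const latencyS = (first.tMs - requestStartMs) / 1000;

  // Throughput: tokens received after the first chunk, divided by the
  // time spent generating them. Null for single-chunk replies.
  const tokensAfterFirst = chunks
    .slice(1)
    .reduce((sum, c) => sum + c.tokens, 0);
  const generationS = (last.tMs - first.tMs) / 1000;
  const tokensPerS = generationS > 0 ? tokensAfterFirst / generationS : null;

  return { latencyS, tokensPerS };
}
```

Running something like this on a schedule (every 3 hours, per the post) would give one latency and one throughput data point per model per run.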
It seems to me GPT-4's outputs have become shorter; you have to do a lot more CoT-type prompting to get it to actually output a good result, which is excruciating given how slow it is to produce content.
Mixtral on Together AI is crazy to see at ~70-100 tokens/s, and the quality works for my use case as well.
It will get better as they fix it and tune it, but their entire release pipeline is absolutely bonkers, like no forewarning, no test environment, no opt out. It's scary amateurish for a billion dollar company.
It's likely that ChatGPT uses the 1106 model underneath the covers or some variant, so it probably suffers from the same restricted output window.
In full disclosure, I have worked on getting this up @ Groq.
PS: Experience the speed for yourself, LLama2-70B, at https://chat.groq.com/
It _hates_ making assumptions about things it doesn't know for sure, I suspect because of "anti-hallucination" nonsense. Instead it has to be shoved to even try making any assumptions, even reasonable ones.
I know it's capable of making reasonable assumptions for class structures/behaviour, etc. where I can just tweak it as needed to work. It just refuses to. I've even seen comments like "We'll put the rest of the code in later".
Create JavaScript to insert the JSON into the SQL using knex('table_name')
Below is part of its output:
  // Insert into course_module table
  await knex('course_module').insert({
    id: moduleId,
    name: courseData.name,
    description: courseData.description,
    // include other required fields with appropriate values
  });

It's missing several columns it could populate with the data it knows from the prompt, primarily created_at, updated_at, account_id, user_id, lesson number... and instead I get a comment telling me to do it.
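For comparison, a fuller answer might have looked roughly like the sketch below. The real schema isn't shown in this thread, so the column name `lesson_number`, the `ctx` object (`accountId`, `userId`) and `courseData.lessonNumber` are guesses for illustration only.

```javascript
// Hypothetical sketch: build a complete row for course_module.
// Column names follow the ones listed in the comment; `ctx` and
// `courseData.lessonNumber` are assumed inputs, not a real schema.
function buildCourseModuleRow(moduleId, courseData, ctx, now = new Date()) {
  return {
    id: moduleId,
    name: courseData.name,
    description: courseData.description,
    lesson_number: courseData.lessonNumber,
    account_id: ctx.accountId,
    user_id: ctx.userId,
    created_at: now,
    updated_at: now,
  };
}

// Usage with a knex instance (not run here):
//   await knex('course_module').insert(
//     buildCourseModuleRow(moduleId, courseData, ctx)
//   );
```

Keeping the row construction in a plain function means it can be checked without a database connection.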
There are a lot of people complaining about this, primarily on Reddit, but usually the ChatGPT fanboys jump in to defend OAI.
I guess their cost is the same as the base model's, although it would affect performance.
[edit: And also MPS]
Do Lambda have a serverless inference API? Not aware of them playing in this space yet.
Presume you mean MPT not MPS - yep we'll look into MosaicML soon.
Sure, you can't have a chat with it or expect it to do high-level reasoning, but it has enough to do the basic deductions for grounded answers.
OpenAI are doing a ton of load balancing, presumably constantly tweaking batch sizes to try to optimize across all their workloads.
You can test GPT-4 vs. GPT-4 Turbo in the Playground to intuitively confirm that the speeds are similar.
Could the data have been collected when the system is under different loads?
* Looks like for gpt-4 turbo (https://artificialanalysis.ai/models/gpt-4-turbo-1106-previe...), there was a huge latency spike on December 28, which is causing the avg. latency to be very high. Perhaps dropping top and bottom 10% of requests will help with avg (or switch over to median + include variance)
* Adding latency variance would be truly awesome, I've run into issues with some LLM API providers where they've had incredibly high variance, but I haven't seen concrete data across providers
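To make the suggestion above concrete, here is a small sketch of a trimmed mean, a median and a sample variance over a list of latency samples. The function names are my own, and the 10% trim is just the figure from the comment; none of this is the site's actual aggregation.

```javascript
function sorted(xs) {
  return [...xs].sort((a, b) => a - b);
}

function mean(xs) {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

function median(xs) {
  const s = sorted(xs);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Drop the lowest and highest `fraction` of samples, then average the rest.
// Assumes fraction < 0.5 so that some samples remain.
function trimmedMean(xs, fraction = 0.1) {
  const s = sorted(xs);
  const k = Math.floor(s.length * fraction);
  return mean(s.slice(k, s.length - k));
}

// Sample variance (n - 1 denominator); needs at least two samples.
function variance(xs) {
  const m = mean(xs);
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
}
```

With nine 1-second samples and one 30-second spike, the plain mean is 3.9 s while both the median and the 10% trimmed mean stay at 1 s, which is the effect the comment is after.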
Would be interesting to see request latency and throughput when API calls occur cold (first data point), and once per hour, minute, and per second with the first N samples dropped.
Also, at least with Azure OpenAI, the AI safety features (filtering & annotations) make a significant difference in time to first token.
Sadly very few benchmarks bother to track this.
Your latency numbers for OpenAI (and Azure's equivalents) seem really high, I run time to first token tests and I see much better numbers!
(Also are those numbers average, p50, p99, etc? I'd honestly expect a box plot to really see what is going on!)
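As a rough illustration of the p50/p99/box-plot question, the sketch below computes percentiles by linear interpolation between closest ranks. That is only one of several common conventions, and the helper names are hypothetical.

```javascript
// Percentile by linear interpolation between closest ranks
// (one of several common conventions). `p` is in [0, 100].
function percentile(xs, p) {
  const s = [...xs].sort((a, b) => a - b);
  const rank = (p / 100) * (s.length - 1);
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  return s[lo] + (s[hi] - s[lo]) * (rank - lo);
}

// The numbers a box plot is drawn from, plus p99 for tail latency.
function boxPlotSummary(xs) {
  return {
    min: percentile(xs, 0),
    q1: percentile(xs, 25),
    median: percentile(xs, 50),
    q3: percentile(xs, 75),
    max: percentile(xs, 100),
    p99: percentile(xs, 99),
  };
}
```

Reporting these side by side would show whether a high average comes from a generally slow API or from a few extreme requests.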
There are some interesting views of throughput vs. latency whereby some models are slower to the first chunk but faster for subsequent chunks and vice versa, and so suit different use cases (e.g. if you just want a true/false answer vs. a more detailed model response)
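One way to see the trade-off is a simple model where total response time is time to first token plus output length divided by throughput. The sketch below uses that assumption; the two model profiles are made-up numbers, not measurements of any provider.

```javascript
// Estimated end-to-end time (seconds) for a response of `outputTokens`
// tokens, assuming time = TTFT + tokens / throughput.
function responseTimeS(model, outputTokens) {
  return model.ttftS + outputTokens / model.tokensPerS;
}

// Output length at which two models take the same time, or null if
// one of them is at least as fast at every length.
function breakEvenTokens(a, b) {
  const perTokenDiff = 1 / a.tokensPerS - 1 / b.tokensPerS;
  if (perTokenDiff === 0) return null;
  const n = (b.ttftS - a.ttftS) / perTokenDiff;
  return n > 0 ? n : null;
}

// Made-up profiles for illustration only.
const quickStart = { ttftS: 0.2, tokensPerS: 20 };
const fastStream = { ttftS: 1.0, tokensPerS: 100 };
```

Under these made-up numbers the quick-to-start model wins for a one-token true/false answer, while the higher-throughput model wins for a 500-token response; `breakEvenTokens(quickStart, fastStream)` gives the crossover length.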
iOS Safari
In my experience speed varies a lot, and it makes a big difference whether a request takes 10 seconds or 50 seconds.
I run https://www.labophase.com and my data suggests that it's one of the top 3 models in terms of users liking to interact with it. May I know how model quality is benchmarked to understand this discrepancy?
It's a combination of different quality metrics which have Perplexity, overall, not performing as well. That being said, I think we are in the very early stages of model quality scoring/ranking - and (for closed-source models) we are seeing frequent changes. Will be interesting to see how measures evolve / model ranks change.
We have a bit more information in the FAQ: https://artificialanalysis.ai/faq but thanks for the feedback, will look into expanding more on how the normalization works. We are thinking of ways to improve this generalized metric.
A sticking point is that quality can of course be thought of from different perspectives: reasoning, knowledge (retrieval), use-case specific (coding, math, readability), etc. This is why we show individual scores on the home page and models page: https://artificialanalysis.ai/models
And I did not realize how much Gemini Pro lags behind GPT4 in terms of quality, wow!
I'm excited to see where things land. What I find interesting is that pricing is either wildly expensive or wildly cheap, depending on your use case. For example, if you want to run GPT-4 to glean insights on every webpage your users visit, a freemium business model is likely completely unviable. On the other hand, if I'm using an LLM to spot issues in a legal contract, I'd happily pay 10x what GPT4 currently charges for something marginally better (It doesn't make much difference if this task costs $4 vs $0.40). I think that the ultimate "winners" in this space will have a range of models at various price points and let you seamlessly shift between them depending on the task (e.g., in a single workflow, I might have some sub-tasks that need a cheap model and some that require an expensive one).
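The "shift between models depending on the task" idea could be sketched as below. The tier names, model names and per-token prices are entirely hypothetical placeholders, and the `highStakes` flag is just one possible routing signal.

```javascript
// Hypothetical price tiers in USD per 1M tokens; not real provider pricing.
const TIERS = {
  cheap: { model: "small-model", usdPerMTok: 0.5 },
  premium: { model: "large-model", usdPerMTok: 30 },
};

// Pick a tier per sub-task: pay for the expensive model only when the
// task is flagged as high-stakes.
function pickTier(task) {
  return task.highStakes ? TIERS.premium : TIERS.cheap;
}

function estimateCostUsd(task) {
  return (task.tokens / 1e6) * pickTier(task).usdPerMTok;
}

// Route every sub-task in a workflow and total the estimated cost.
function planWorkflow(tasks) {
  const steps = tasks.map((t) => ({
    name: t.name,
    model: pickTier(t).model,
    costUsd: estimateCostUsd(t),
  }));
  const totalUsd = steps.reduce((s, x) => s + x.costUsd, 0);
  return { steps, totalUsd };
}
```

For example, a workflow with a bulk page-summarization step and a contract-review step would send only the second one to the premium tier.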
Mixtral running at >500 tokens/s @ Groq: https://www.youtube.com/watch?v=5fJyOVtOk4Y
Experience the speed for yourself, LLama2-70B, at https://chat.groq.com/