The success of ChatGPT and my current work have had me thinking a lot about the "product" applications of large language models. I work at Pulumi on www.pulumi.com/ai; it's a GPT-3.5 and GPT-4 interface that uses retrieval-augmented generation to generate Pulumi programs, and user experience is top of mind for me.
(Fingers crossed this doesn't hug our site to death here for the reasons I'm about to explain.)
To be blunt: I have found it surprisingly difficult to find the right tools to host models without dramatically worsening the UX. In theory we should be able to fine-tune a model against our own SDKs and synthetically generated code to improve the model's output and to guard against hallucination when retrieval fails. In practice, self-hosted model serving APIs have really poor time-to-first-token or even completely lack streaming behavior. It's a non-starter to build a product on something where a user has to sit and watch a spinner for a minute or more. I've been looking at the vLLM project with great interest, but haven't found much else.
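For what it's worth, the metric I keep coming back to is time-to-first-token. A minimal sketch of measuring it against any token-streaming generator (`fake_stream` here is a hypothetical stand-in, not any real serving API):

```python
import time

def fake_stream(tokens, delay=0.01):
    """Hypothetical stand-in for a model's streaming generator."""
    for tok in tokens:
        time.sleep(delay)
        yield tok

def time_to_first_token(stream):
    """Measure latency until the first token arrives, then drain the rest."""
    start = time.monotonic()
    it = iter(stream)
    first = next(it)  # blocks until the first token is produced
    ttft = time.monotonic() - start
    return ttft, [first] + list(it)

ttft, toks = time_to_first_token(fake_stream(["Hello", ",", " world"]))
print(f"TTFT: {ttft:.3f}s, tokens: {toks}")
```

The same harness works against anything that yields tokens, which makes it easy to compare serving stacks apples-to-apples.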
---
For folks in MLops, deploying models with streaming APIs:
1. Is it mostly accurate that none of the model serving tools created prior to ChatGPT are great for streaming, interactive use cases?
2. How are you currently serving these models as an API and what upcoming tools are you exploring?
For the authors: How does your inference optimization compare to vLLM, or other tools using techniques such as continuous batching and paged attention?
IME the streaming API in text-generation-inference works fine in production. (Though some of the other solutions may be better.) I've used it with Starcoder (15B) and the time-to-first-token and tokens per second both seem quite reasonable out of the box.
* scored 18.9 on HumanEval (coding), where Llama2 7B scored 12.2
* was trained from the beginning with a 16k context using a modified RoPE, whereas many models are simply fine-tuned with RoPE to gain longer context windows after the base model has been trained at 4k.
Can anyone share ideas on how important the 2nd one is? Do LLMs benefit from large context windows using RoPE during pretraining?
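For anyone unfamiliar with the mechanics, here's a toy sketch of how RoPE rotation frequencies are computed and of the base-rescaling trick that fine-tuned context extensions typically rely on. Illustrative only: the scaled base of 40000 is made up, not Persimmon's actual recipe.

```python
import math

def rope_freqs(dim, base=10000.0):
    """Per-pair rotation frequencies used by rotary position embeddings."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def rope_angle(pos, freq):
    """Rotation angle for a given position and frequency."""
    return pos * freq

# A common context-extension trick: raise the base so angles at long
# positions stay inside the range the model saw during pretraining.
short = rope_freqs(128, base=10000.0)
long_ = rope_freqs(128, base=40000.0)  # hypothetical scaled base

# With the scaled base, position 16384 rotates less than with the
# original base, i.e. it "looks like" a nearer position to the model.
assert rope_angle(16384, long_[1]) < rope_angle(16384, short[1])
```

Pretraining at 16k skips that remapping entirely, which is presumably the point of the second bullet.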
Weights haven't been released, though.
https://twitter.com/suchenzang/status/1699926157028897078?s=... notes some issues with directly comparing the 16k context number. The odd choice of tokenizer means it's effectively like a 10-12k model (? ballpark, not calculated).
The article claims 18.9 for the base model, but also claims 20.7 for the fine-tuned model.
I'm concerned about the current download's availability - it's two URLs to some object storage. I find that these go dark rather quickly for many different reasons (the files get moved accidentally, bandwidth limits kick in, someone deletes them later, etc.).
I'm curious if there's a reason it's not also hosted on Hugging Face? I'm not saying they're the best place, but redundancy is good, most models have entries there, they have a very good CDN, and it isn't as likely to go dark accidentally.
1) In the results table, Llama2 base is being compared to Persimmon base and finetuned, and only the latter performs better. Would a comparison to Llama2-chat be possible/fair?
2) The Llama-2 numbers for MMLU in that table seem different from those in the HF leaderboard and the Llama-2 webpage presentation. Is it the 1-shot variant that is different or are these measurements not 100% standard and reproducible?
The numbers are different because the measurement is different. The blog post explains that we sample from the models and expect answers rather than relying on perplexity measurements.
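To illustrate the distinction with made-up numbers: likelihood-style scoring picks the choice the model assigns the highest probability, while sampling-style scoring generates text and parses an answer out of it. In this toy case they agree, but a model whose generations don't follow the expected format can score very differently under the two methods.

```python
# Toy multiple-choice question with fabricated per-choice logprobs.
choices = ["A", "B", "C", "D"]
logprobs = {"A": -1.2, "B": -0.4, "C": -2.0, "D": -3.1}

# Likelihood-style: pick the choice with the highest model probability.
likelihood_pick = max(choices, key=lambda c: logprobs[c])

# Sampling-style: generate text, then check it for a recognizable answer.
generation = "The answer is B."
sampled_pick = next((c for c in choices if f"answer is {c}" in generation), None)

print(likelihood_pick, sampled_pick)  # both "B" here
```

That parsing step is exactly where sampling-based numbers drift from leaderboard numbers.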
The inference code is shared as a proof of concept, it is not meant to be a production ready deploy. Also worth noting that not all LLMs are used to produce text which is read by humans.
It’s funny you say production, because all of the errors I ran into suggest the container is expecting your production architecture.
My advice is stream first then make synchronous convenience wrappers on top of that. Also, lean on community standards for PoC. I’m guessing your investors are interested in making this scale as cheaply as possible, but that is probably the least important feature for people evaluating your model’s quality locally.
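Concretely, what I mean by "stream first": make the streaming generator the primitive and derive the blocking call from it, never the other way around. A hypothetical sketch (the token list is obviously canned):

```python
def stream_completion(prompt):
    """Hypothetical streaming primitive: yields tokens as generated."""
    for tok in ["Hello", ",", " ", "world"]:
        yield tok

def complete(prompt):
    """Synchronous convenience wrapper built on top of the stream."""
    return "".join(stream_completion(prompt))

print(complete("hi"))  # Hello, world
```

Going the other direction - bolting streaming onto a blocking API - is where the bad time-to-first-token behavior tends to come from.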
I am an AI novice, but why can't they automate this with AI? I thought the whole point of these tools was to automate tasks that are error-prone and require lots of attention to detail. Computers are great at that kind of stuff, so it's surprising they haven't applied AI techniques to automate parts of the AI pipeline, like converting code from Python to C++.
edit: not sure why op is getting downvotes, this is a very reasonable question imo; maybe the characterization of kernel compilation as "AI" vs. just "software"?
The whole thing seems obviously amenable to gradient based optimization and data augmentation with synthetic code generators. It is surprising that no one is pursuing such approaches to improving the optimization pipeline in kernel compilation/fusion/optimization because it is just another symbol game with much better defined metrics than natural language models.
If it was necessary for some reason... running a language model to keep something like this in sync over long-term training and iteration would likely be more expensive than a developer's time AND would trap the researcher in a verification loop, since the output still probably needs to be checked by the developer (they could be the same person, which would just deepen the frustration).
The use of a lot of garbage accounts in this thread and lack of model details also looks pretty shady...
Could someone briefly explain what this means? multimodal as in picture, but if unused then presumably that part is somehow untrained...so it wouldn't know what to do with the picture?
The embeddings form the vocabulary of the model. The vocabulary "namespace" has 70k empty slots so you could introduce your own tokens and train on top of that, where token = some patch of multimodal data.
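A toy illustration of what those reserved slots look like (sizes shrunk way down here; the real table reserves ~70k rows):

```python
import random

random.seed(0)
vocab_size, reserved, dim = 1000, 70, 8  # toy sizes

# Embedding table: trained rows, plus empty (zeroed) reserved slots.
emb = [[random.random() for _ in range(dim)] for _ in range(vocab_size)]
emb += [[0.0] * dim for _ in range(reserved)]

# Claim the first reserved slot for a new multimodal token, e.g. an
# image patch; in practice this row would be trained, not set by hand.
IMAGE_PATCH_0 = vocab_size
emb[IMAGE_PATCH_0] = [random.random() for _ in range(dim)]

print(len(emb))  # 1070 rows total
```

So "unused" just means those rows exist in the table but were never associated with any input during pretraining; you supply both the data and the training.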
The AI race to zero must be accelerated with $0 free models and less control from gatekeepers such as ClosedAI
From my understanding, you'd have to repeat the experiment isolating each variable to see what difference each one actually makes, no?
It's safe to assume they're worse at every task than larger models, so I wouldn't frame use cases in terms of which tasks they can match larger models on.
But their advantage is size: they can run on smaller, cheaper hardware. So an example would be to fine-tune and then run on some sort of local user device rather than in the cloud. This might become more practical in the future as hardware improves.
Perhaps for basic code completion and simple writing tasks?