Does that sound about right?
There's really no substantive difference between that and what they're doing here, other than they're purposefully using a crappier model than GPT 3.5/ChatGPT to increase the cost savings.
For example, the first set of graphics demonstrates turning a long question with 5 Q/A examples ("5-shot", in the literature) into ~4 sentences that paraphrase the question and include one or two very brief examples without reasoning.
That's all well and good if you're confident the model is so amazing that it answers as well with 1-shot as it does with 5-shot, but it is very, very likely that is not the case. Additionally, you're now adding this odd layer between the user's input and OpenAI that will easily be "felt".
It reads like a slightly garbled version of what someone writing down bullet point notes of a lecture might write.
It’s so rare that the human-optimized and machine-optimized versions of an input are so similar.
The examples folder contains Jupyter notebooks, plus some videos and papers, but I just want to see an example text compressed.
Linked for the culture: https://www.youtube.com/watch?v=bctjSvn-OC8&t=4s
Sleep big last night
I don’t think this model’s use of alignment implies any sort of censorship; it’s just being tuned to accomplish the task of outputting only important tokens for the target LLM.
I agree that "tone" alignment is silly and pointless for models in the public domain, but if I were a big company who wanted to keep customers I'd align my models this way. It isn't censorship, it's marketing.
For instance, you can have a smaller model generate ten tokens in sequence, and then ask the larger model "given these N tokens, what is token N+1" ten times in parallel.
If the large and small model agree on, say, the first 7 tokens, then you keep these and throw the next 3 away and start over. So you still have to run the large model for each token, but you can at least do batch calculations (which is a lot more efficient, because loading layer weights is the bottleneck, not matrix ops).
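The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration with deterministic stand-in "models" (plain next-token functions), not any real library's API; the function name and structure are made up for this sketch:

```python
def speculative_step(small_next, large_next, context, k=10):
    """One round of speculative decoding (toy sketch).

    small_next / large_next: functions mapping a token sequence to the
    next token. Returns the accepted tokens: the prefix both models agree
    on, plus one corrected token from the large model if they diverge.
    """
    # 1. The small model drafts k tokens sequentially (cheap).
    draft = []
    ctx = list(context)
    for _ in range(k):
        t = small_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The large model checks each drafted position. In a real system
    #    these k evaluations run as one batched forward pass, which is
    #    where the speedup comes from.
    accepted = []
    ctx = list(context)
    for t in draft:
        big = large_next(ctx)
        if big != t:
            accepted.append(big)  # keep the large model's token, discard the rest
            break
        accepted.append(t)
        ctx.append(t)
    return accepted
```

If the models agree on, say, the first 7 of 10 drafted tokens, one round accepts 8 tokens (7 agreed + 1 corrected) for roughly one batched large-model pass.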
```
{'compressed_prompt': '\t | submit\twout\nLLMLing byqtyTo\n\n. ". which\nq1 that only down human\nnextaccount many examples,pressed\nq4\n\n as semantic\n" having noreings of isating withoutre this and\n31] a\n\n the0 of, to workaroundqTo\ning after and in loss\n\n time say -. Word a\nb-leep big\n\nsr the\namshipIqToMy hear alignment to its andics this only target\n will tokensq be: The the usinging\n\nbeamIt" mying\na large expensive am\n\n generate larger"3 loading).\n\nThe expansionB run has this if\nhas] agents think it\n\npyinstall into game in " promptter\nos\n particular (. ( == transformations to given smaller) ownups\n\n this better [] thewithout\n\n. is -Error medium\n\n<\n decode\n\r\nbehnamoh 1 day ago | prev [3 more]\r\n\r\n\r\n\r\n\r\n\r\nGuidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\r\n\r\nSearch: \r\n', 'origin_tokens': 2863, 'compressed_tokens': 217, 'ratio': '13.2x', 'saving': ', Saving $0.2 in GPT-4.'}
```
ChatGPT (GPT-4) doesn't know what to do with it: https://chat.openai.com/share/73bc7b96-4453-4a6e-944d-d9d4c5...
Maybe try prefixing it with “summarize the following text” before compression.
Otherwise I’m not sure how it would judge what’s important. Honestly, I’m not sure what ChatGPT would do if you copied the text from this page uncompressed without asking it to do something.
Edit: pasting uncompressed it summarizes the discussion.
I think this solution isn’t well suited for this kind of task. It seems like you’d want to compress instructions, system prompts and memory. With a big block of text with no prior context you’re essentially relying on the smaller model to decide what’s important without enough information to judge.
Worth some more experimentation for sure
If you get enough data on "initial prompt attempt" -> "final successful prompt", the whole thing can be replaced by a fine-tuned model.
You would just select a "prompt rewriter LLM" that optimizes for accuracy, cost, alignment etc.
I am not actually convinced this is a good idea, though. This path eventually leads to a "prompt compiler" that compiles prompts into byte code for a future "more efficient" LLM to understand.
Oh and it definitely didn't require its own language model. All it required was finding how many letters one can remove from a word and which words can be completely omitted.
One way to increase the effective context window, I thought, would be to teach the LLM a compressed language based on abbreviations etc., and to have some compressing/uncompressing script do the translating to and from the LLM. That would allow longer prompts too.
Not as sophisticated as this LLMLingua but good enough for basic users.
<Do something> with the following text:
<some text>
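The abbreviation idea above can be sketched as a tiny shared codebook: the script shrinks the prompt before sending it, and expands the model's reply on the way back. The codebook entries and function names here are made up for illustration:

```python
# Hypothetical codebook mapping full phrases to abbreviations.
# A real one would be much larger and the LLM would be taught
# (via system prompt or fine-tuning) to read the abbreviated form.
CODEBOOK = {
    "with the following text": "w/ flw txt",
    "summarize": "smrz",
    "because": "bc",
    "information": "info",
}

def compress(prompt: str) -> str:
    # Naive phrase substitution; a real version would need word-boundary
    # matching so short codes like "bc" don't fire inside other words.
    for phrase, abbr in CODEBOOK.items():
        prompt = prompt.replace(phrase, abbr)
    return prompt

def decompress(text: str) -> str:
    for phrase, abbr in CODEBOOK.items():
        text = text.replace(abbr, phrase)
    return text
```

For example, `compress("summarize with the following text")` yields `"smrz w/ flw txt"`, and `decompress` restores the original, so the savings are deterministic and reversible, unlike a model-based compressor.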
Btw, humans are quite good at compressing as well. SMS used to be billed per 160 characters. Also, any slang or technical jargon is an attempt at compression. That's how people push the limits of expressiveness and contribute to language evolution.