Does that sound about right?
There's really no substantive difference between that and what they're doing here, other than they're purposefully using a crappier model than GPT 3.5/ChatGPT to increase the cost savings.
For example, the first set of graphics demonstrates turning a long question with 5 Q/A examples ("5-shot", in the literature) into ~4 sentences that paraphrase the question and include one or two very brief examples without reasoning.
That's all well and good if you're confident the model is so amazing that it answers as well with 1-shot as it does with 5-shot, but it is very, very likely that is not the case. Additionally, you're now adding this odd layer between the user's input and OpenAI that will easily be "felt".
It reads like a slightly garbled version of what someone writing down bullet point notes of a lecture might write.
It’s so rare that the human-optimized and machine-optimized versions of an input are so similar.
The examples folder contains Jupyter notebooks, plus some videos and papers, but I just want to see an example text compressed.
Linked for the culture: https://www.youtube.com/watch?v=bctjSvn-OC8&t=4s
Sleep big last night
I don’t think this model’s use of alignment implies any sort of censorship; it’s just being tuned to accomplish the task of outputting only important tokens for the target LLM.
I agree that "tone" alignment is silly and pointless for models in the public domain, but if I were a big company who wanted to keep customers I'd align my models this way. It isn't censorship, it's marketing.
For instance, you can have a smaller model generate ten tokens in sequence, and then ask the larger model "given these N tokens, what is token N+1" ten times in parallel.
If the large and small model agree on, say, the first 7 tokens, then you keep these and throw the next 3 away and start over. So you still have to run the large model for each token, but you can at least do batch calculations (which is a lot more efficient, because loading layer weights is the bottleneck, not matrix ops).
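The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration with deterministic stand-in "models" (plain next-token functions), not any real library's API; the function name and structure are made up for this sketch:

```python
def speculative_step(small_next, large_next, context, k=10):
    """One round of speculative decoding (toy sketch).

    small_next / large_next: functions mapping a token sequence to the
    next token. Returns the accepted tokens: the prefix both models agree
    on, plus one corrected token from the large model if they diverge.
    """
    # 1. The small model drafts k tokens sequentially (cheap).
    draft = []
    ctx = list(context)
    for _ in range(k):
        t = small_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The large model checks each drafted position. In a real system
    #    these k evaluations run as one batched forward pass, which is
    #    where the speedup comes from.
    accepted = []
    ctx = list(context)
    for t in draft:
        big = large_next(ctx)
        if big != t:
            accepted.append(big)  # keep the large model's token, discard the rest
            break
        accepted.append(t)
        ctx.append(t)
    return accepted
```

If the models agree on, say, the first 7 of 10 drafted tokens, one round accepts 8 tokens (7 agreed + 1 corrected) for roughly one batched large-model pass.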
```
{'compressed_prompt': '\t | submit\twout\nLLMLing byqtyTo\n\n. ". which\nq1 that only down human\nnextaccount many examples,pressed\nq4\n\n as semantic\n" having noreings of isating withoutre this and\n31] a\n\n the0 of, to workaroundqTo\ning after and in loss\n\n time say -. Word a\nb-leep big\n\nsr the\namshipIqToMy hear alignment to its andics this only target\n will tokensq be: The the usinging\n\nbeamIt" mying\na large expensive am\n\n generate larger"3 loading).\n\nThe expansionB run has this if\nhas] agents think it\n\npyinstall into game in " promptter\nos\n particular (. ( == transformations to given smaller) ownups\n\n this better [] thewithout\n\n. is -Error medium\n\n<\n decode\n\r\nbehnamoh 1 day ago | prev [3 more]\r\n\r\n\r\n\r\n\r\n\r\nGuidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact\r\n\r\nSearch: \r\n', 'origin_tokens': 2863, 'compressed_tokens': 217, 'ratio': '13.2x', 'saving': ', Saving $0.2 in GPT-4.'}
```
ChatGPT (GPT-4) doesn't know what to do with it: https://chat.openai.com/share/73bc7b96-4453-4a6e-944d-d9d4c5...
Maybe try prefixing it with “summarize the following text” before compression.
Otherwise I’m not sure how it would judge what’s important. Honestly, I’m not sure what ChatGPT would do if you copied the text from this page uncompressed without asking it to do something.
Edit: pasting uncompressed it summarizes the discussion.
I think this solution isn’t well suited for this kind of task. It seems like you’d want to compress instructions, system prompts and memory. With a big block of text with no prior context you’re essentially relying on the smaller model to decide what’s important without enough information to judge.
Worth some more experimentation for sure
If you get enough data on "initial prompt attempt" -> "final successful prompt", the whole thing can be replaced by a fine-tuned model.
You would just select a "prompt rewriter LLM" that optimizes for accuracy, cost, alignment etc.
I am not actually convinced this is a good idea, though. This path eventually leads to a "prompt compiler" that compiles prompts into byte code for a future "more efficient" LLM to understand.
Oh and it definitely didn't require its own language model. All it required was finding how many letters one can remove from a word and which words can be completely omitted.
One way to increase the effective context window, I thought, would be to teach the LLM a compressed language based on abbreviations etc., and to have some compressing/uncompressing script do the translating to and from the LLM. That would allow longer prompts too.
Not as sophisticated as this LLMLingua but good enough for basic users.
<Do something> with the following text:
<some text>
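The abbreviation idea above can be sketched as a tiny shared codebook: the script shrinks the prompt before sending it, and expands the model's reply on the way back. The codebook entries and function names here are made up for illustration:

```python
# Hypothetical codebook mapping full phrases to abbreviations.
# A real one would be much larger and the LLM would be taught
# (via system prompt or fine-tuning) to read the abbreviated form.
CODEBOOK = {
    "with the following text": "w/ flw txt",
    "summarize": "smrz",
    "because": "bc",
    "information": "info",
}

def compress(prompt: str) -> str:
    # Naive phrase substitution; a real version would need word-boundary
    # matching so short codes like "bc" don't fire inside other words.
    for phrase, abbr in CODEBOOK.items():
        prompt = prompt.replace(phrase, abbr)
    return prompt

def decompress(text: str) -> str:
    for phrase, abbr in CODEBOOK.items():
        text = text.replace(abbr, phrase)
    return text
```

For example, `compress("summarize with the following text")` yields `"smrz w/ flw txt"`, and `decompress` restores the original, so the savings are deterministic and reversible, unlike a model-based compressor.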
Btw, humans are quite good at compressing as well. SMS used to be billed per 160 characters. Also, any slang or technical jargon is an attempt at compression. That's how people push the limits of expressiveness and contribute to language evolution.