
0 points · PaulHoule · 3y ago · 0 comments

I've thought about that one for a long time. Years ago I was reading the proceedings of TREC, trying to understand why Google was so much better than the search engines I knew how to build. TREC is pretty depressing because you find that 95% of the things you might think would improve search rankings do not. Particularly before BM25 was developed, people tried indexing subdocuments and consolidating the results, and consistently struck out.

Since BERT came out there has been a considerable literature of people struggling mightily to combine transformer representations of document parts into a whole, which convinces me that one could spend a few lifetimes pushing a bubble around underneath that rug.

I think the best argument for your case is that people seem to get along just fine with a limited short-term memory. I'd temper that with the observation that a person writing a summary is actually doing a multi-stage process in which their short-term memory is attending to part of what they are writing and part of what they are reading, while they build long-term memory structures at the same time. So there is a lot going on.

In the sense that abstracts work well for information retrieval, and that many of them would fit in the GPT attention window or be only a little bigger, you could make the case that a fixed-size structure could be highly useful for IR.

On the other hand, many documents, such as scientific papers, are considerably bigger than the current attention window and direct summarization of a single document via transformer will still need a bigger window, more like 40,000 tokens.

A lot of things in the literature are complex, muddy, contradictory or all of the above. (Try a question like "What did Freud think about narcissism?" or "What is the clinical relevance of Bleuler's concept of ambivalence?" or "Tell me about cosmic inflation" or "What is the dark matter particle?")

Hard cases really do require matching up parts of document A with parts of document B and certainly having them in the same attention window would help an LLM do that in a natural way.

It might be completely impractical, not just because of computational scalability but possibly more fundamental scalability limits. (I'm not sure a person with a 10x bigger short term memory would really be able to solve problems better than the average person... There are transformers with a 500,000 token attention window today and they suck.)

There could be some procedure where you cut documents up into pieces in various ways, extract critical context from documents A and B and other literature and also put in the parts you want to critique against each other, or even match up different parts of the same document to do the same. Maybe a small attention window could still be used to decompose documents into knowledge graphs but it is by no means trivial to reason over a KG once you have it.
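A minimal sketch of the "cut documents up and match the pieces" idea, under loud assumptions: whitespace splitting stands in for a real tokenizer, bag-of-words cosine stands in for whatever representation you would actually use, and the names `chunk_tokens` and `best_matches` are hypothetical. It only shows the shape of the procedure, not something that would solve the hard cases.

```python
import math
import re
from collections import Counter


def chunk_tokens(text, size=8, overlap=2):
    """Split text into overlapping windows of whitespace tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def bag_of_words(text):
    """Lowercased word counts; an assumed stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_matches(doc_a, doc_b, size=8, overlap=2):
    """For each chunk of doc_a, return (chunk_a, most similar chunk_b, score)."""
    chunks_b = chunk_tokens(doc_b, size, overlap)
    if not chunks_b:
        return []
    vecs_b = [bag_of_words(c) for c in chunks_b]
    pairs = []
    for chunk_a in chunk_tokens(doc_a, size, overlap):
        vec_a = bag_of_words(chunk_a)
        scores = [cosine(vec_a, v) for v in vecs_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((chunk_a, chunks_b[best], scores[best]))
    return pairs
```

Each returned pair is small enough to put in one attention window together, which is the easy part; deciding what "critical context" to add alongside the pair is the part this sketch does not touch.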

What I do know today is that I have documents >4096 tokens that I want to retrieve, cluster and classify right now and transformers were not up to the task in Feb 2023, and I am hoping for some progress soon that will help.


2 comments · 2 top-level

adultSwim · 3y ago

One of their other recent papers was work towards expanding this limit in language models: "Structured Prompting: Scaling In-Context Learning to 1,000 Examples".

https://arxiv.org/abs/2212.06713

p1esk · 3y ago

> transformers were not up to the task in Feb 2023, and I am hoping for some progress soon that will help.

I think that's coming: OpenAI is talking about some new "DV" model with up to a 32k context window: https://twitter.com/transitive_bs/status/1628118163874516992

> Hard cases really do require matching up parts of document A with parts of document B

The hard part here is not processing documents, it's determining which documents need to be "matched". This requires having some sort of a "knowledge map", a semantic search space of "knowledge patterns", or maybe even a traditional search engine, so that given a document A, a model can find relevant documents: in its long-term memory, in a dataset, or even on the internet. Once the documents are found, you don't really need to load the whole thing into the attention window. When I read a long paper, I do it section by section; I just need to maintain a high-level map of the paper in my head. I process one "knowledge pattern" at a time, and every time I do a lookup or a search for relevant patterns. I shouldn't be limiting that search to only what's in my current attention window, even if the window is a million tokens. But yes, the window should be big enough to hold at least two such patterns (which map to chunks of text, or images/audio/etc): the one I'm currently processing, and the one that is most similar to it in the knowledge space.
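Roughly what I mean, as a toy sketch. Everything here is an assumption for illustration: `store` is a plain list of text chunks standing in for the "knowledge space", Jaccard word overlap stands in for real semantic search, whitespace words stand in for tokens, and `lookup` and `build_windows` are made-up names. The point is only that the search runs over the whole store while the window holds just two patterns at a time.

```python
import re


def tokens(text):
    """Set of lowercased words; an assumed stand-in for a semantic representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lookup(section, store):
    """Return the store entry most similar to the section, or None if the store is empty."""
    query = tokens(section)
    return max(store, key=lambda entry: jaccard(query, tokens(entry)), default=None)


def build_windows(sections, store, budget=4096):
    """Pair each section with its best match, dropping the match when the pair exceeds the budget."""
    for section in sections:
        match = lookup(section, store)  # search is over the whole store, not the window
        used = len(section.split()) + (len(match.split()) if match else 0)
        if used > budget:
            match = None  # hypothetical policy: drop the match rather than truncate it
        # Note: a section that is over budget on its own is still yielded unchanged.
        yield section, match
```

Each yielded pair is what would go into the attention window for one step; the high-level map of the paper would have to be carried separately between steps, which this sketch leaves out.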
