What I’m not sure about is whether 1.5 is truncating it.
https://vladbogo.substack.com/p/gemini-15-unlocking-multimod...
https://arxiv.org/abs/2403.05530
In the needle test, we generate "needle" queries at each point in the context. We then graph the model's recall for each needle.
In this case we might:
1. Generate needles: iterate through your conversations one by one and generate multiple-choice questions for each conversation. You might need to break long conversations into chunks.
2. Test the haystack: given the full context, run random batches of needle queries. Run multiple batches to get good coverage of the context.
3. Visualize recall: graph the conversation # against the needle recall score for that conversation.
Let's assume Gemini is truncating your context but has perfect recall over the non-truncated portion. Then your needle graph will sit at ~100% for every conversation still in context, and fall off like a cliff at the exact point of truncation.
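If that cliff exists, it can be read straight off the recall curve. A rough sketch (the 0.5 threshold is an arbitrary choice, not anything from the paper):

```python
def truncation_point(recall_by_conversation, threshold=0.5):
    """Given {conversation index: recall score}, return the first
    conversation whose recall drops below the threshold -- the suspected
    truncation boundary -- or None if recall stays high throughout."""
    for idx in sorted(recall_by_conversation):
        if recall_by_conversation[idx] < threshold:
            return idx
    return None
```

A real cliff should make the result insensitive to the threshold; if the answer moves around a lot as you vary it, the drop-off is gradual and truncation is probably not the explanation.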
My main concern with this approach is the cost, as you have to send the entire context with each batch of needles. Testing all the needles in a single call would likely skew the results or exceed the context limit.
I don't know how the authors deal with this issue, or whether they have published code for needle testing. But if you're interested in working on this, I'd like to collaborate. We can look at the existing solutions, and if necessary build a needle-testing fixture for working with GPT exports. I'd also be interested in supporting broader needle-testing use cases: books, API docs, academic papers, etc.