Re: our definitions of average/long/short prompts -- we weren't really rigorous with those definitions. In general, we considered anything under 100 tokens "short", 100-300 "average", and 300+ "long".
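To make those rough buckets concrete, here is a minimal sketch. The function name `length_bucket` is made up for illustration, and the boundary handling is an assumption (the ranges above overlap at exactly 300 tokens, so this treats 300 as "long"). The token count is whatever your tokenizer reports.

```python
def length_bucket(num_tokens):
    """Map a prompt's token count to the informal buckets described above.

    Assumption: 100 counts as "average" and 300 counts as "long";
    the original ranges don't say which side the boundaries fall on.
    """
    if num_tokens < 100:
        return "short"
    if num_tokens < 300:
        return "average"
    return "long"


for n in (42, 150, 900):
    print(n, length_bucket(n))
```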
Our intuition here is that the relationship between performance of the estimation and the prompt structure is less about length and more about "ambiguity". Again, we don't really have a rigorous definition of that yet, but it's something we are working on. If you take a look at the prompts in the analysis notebook you might get a sense of what I mean: prompts 1-3 are pretty straightforward and mechanical. Prompts 4 & 5 are a bit more open to interpretation. We see performance of the estimation degrade as prompts become more and more open to interpretation.
Another example: storytelling prompts that include “I dislike open-ended conclusions and other rhetorical hooks” often result in fewer (or no) closing statements like “as night fell, they wondered about their future.”
Edit: GPT-4 is surprisingly good at answering these things if asked to: https://chat.openai.com/share/b97ad65f-f005-49b4-a64e-eb537d...
Given that you're using cosine similarity of text embeddings to approximate the influence of individual tokens in a prompt, how does this approach fare in capturing higher-order interactions between tokens, something that Integrated Gradients (allegedly) is designed to account for? Are there specific scenarios where the cosine similarity method might fall short in capturing the nuances that Integrated Gradients can reveal?
1. The perturbation method could be improved to more directly capture long-range dependency information across tokens.
2. The scoring method could _definitely_ be improved to capture more nuance across perturbations.
I think what we've found is that there does seem to be a relationship between the embedding space and attributions of LLMs, so the next step would be to figure out how to capture more nuance out of that relationship. This sort of side-steps the question you asked, because honestly we'd need to test a lot more to figure out the specific cases where an approach like this falls short.
Anecdotally, we've seen the greatest deviation between the estimation & Integrated Gradients as prompt "ambiguity" increases. We're thinking about ways to quantify & measure that ambiguity, but that's its own can of worms.
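For anyone following along without the notebook, the general idea discussed in this thread (perturb the prompt, embed, score with cosine similarity) can be sketched roughly as below. This is not the authors' implementation. `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, `token_importance` is a made-up name, and the sketch assumes the simplest possible setup: drop one token at a time and compare the embedding of the perturbed text against the embedding of the full text. The actual project may well embed something else (for example, model outputs) and perturb differently.

```python
import hashlib
import math


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedding model (hashed bag of words)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def token_importance(tokens, embed=toy_embed):
    """Leave-one-out scoring: importance = 1 - cosine(full text, text minus token)."""
    baseline = embed(" ".join(tokens))
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]  # remove only token i
        sim = cosine_similarity(baseline, embed(" ".join(perturbed)))
        scores.append((tokens[i], 1.0 - sim))
    return scores


if __name__ == "__main__":
    prompt = "Write a short story without an open-ended conclusion".split()
    for token, score in token_importance(prompt):
        print(f"{token:>12}  {score:.3f}")
```

Note that removing a single token at a time scores each token independently, which is one reason a scheme like this would not directly capture interactions between tokens; perturbing pairs or spans would be one way to probe that.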