We've now partially replicated Reflection Llama 3.1 70B's eval claims

(twitter.com)

4 points · _micah_h · 1y ago · 1 comment

And the tweet is gone after the public outcry.

Now they claim that Reflection 70B performed worse than Llama 3.1 70B (and obviously worse than closed-source alternatives) [1].

Outstanding questions:

- What exactly did they "partially replicate"?

- Why were Redditors able to identify all the details (wrapped Claude, wrapped GPT-4o, the initial prompt, details of a fine-tuned Llama 3.0, not 3.1) while ArtificialAnlys was not?

- Why, after the truth was revealed, do they still write "We are not clear", "We are not clear"?

[1] https://x.com/ArtificialAnlys/status/1832965630472995220