Wait, people still use this benchmark? I hear there's a huge flaw in it.
For example, fine-tuning a model on 4chan makes it score better on TruthfulQA. It becomes very offensive afterwards, though, for obvious reasons. See GPT-4chan [1].
Could it be that people who can talk anonymously with no reputation to gain or lose and no repercussions to fear actually score high on truthfulness? Could it be that truthfulness is actually completely unrelated to the offensiveness of the language used to signal in-group status?
Not sure I understand your example? It's not an offensiveness benchmark; in fact, I can imagine a model trained to be inoffensive would do worse on a truthfulness benchmark. I wouldn't go so far as to say TruthfulQA actually tests how truthful a model is, or its reasoning. But it's one of the benchmarks least correlated with the others, which makes it one of the most interesting, much more so than running yet another test that's highly correlated with MMLU performance. https://twitter.com/gblazex/status/1746295870792847562