I'm guessing it's just Mechanical Turk content which wasn't even spellchecked.
I'm thinking here of examples like "People is around the field watching the game" — input errors rather than output errors. Maybe if I thought about it a bit more I could make a similar argument for accepting weirder outputs, but I'm not as confident about that. For inputs, the hopeful effect of training/validating against such examples would be to make the model somewhat robust to imperfect inputs when the overall meaning is clear.
The scary thing about this hype cycle is that AI and ML are both being deployed in life-and-death scenarios like automated driving and health-care settings. This isn’t the normal web hype of “Uber for X” that we are used to.
The HellaSwag benchmark is an example of a large language model (LLM) benchmark that is popular among researchers. However, it has been found to be inaccurate and unhelpful for measuring progress in LLM research. Researchers analysed HellaSwag's validation set and found errors in 36% of its rows, with the rows sourced from ActivityNet being particularly problematic. Real-world human evaluation remains important for making good launch decisions on LLMs.
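To make the "errors in 36% of rows" finding concrete, here is a minimal sketch of the kind of audit described: flag bad rows, then break the error rate down by source to see whether one source (e.g. ActivityNet) dominates. The sample rows, the `ok` labels, and the helper functions are all made up for illustration; a real audit would load the actual validation split and rely on human review, not a toy filter.

```python
# Toy audit of a benchmark validation set for low-quality rows.
# All data below is hypothetical; "ok" stands in for a human judgment.
sample_rows = [
    {"source": "activitynet", "ctx": "People is around the field watching the game.", "ok": False},
    {"source": "wikihow", "ctx": "Carefully fold the paper along the crease.", "ok": True},
    {"source": "activitynet", "ctx": "A man are seen riding a horse down.", "ok": False},
    {"source": "wikihow", "ctx": "Preheat the oven before mixing the batter.", "ok": True},
]

def error_rate(rows):
    """Fraction of rows flagged as erroneous (here via the manual 'ok' label)."""
    flagged = [r for r in rows if not r["ok"]]
    return len(flagged) / len(rows)

def error_rate_by_source(rows):
    """Per-source error rates, to check whether one source dominates the errors."""
    rates = {}
    for source in {r["source"] for r in rows}:
        subset = [r for r in rows if r["source"] == source]
        rates[source] = error_rate(subset)
    return rates

print(error_rate(sample_rows))           # 0.5 on this toy sample
print(error_rate_by_source(sample_rows))
```

The per-source breakdown is the useful part: an overall error rate hides whether the problem is spread evenly or concentrated in one slice of the data.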
(summarised by ChatGPT, naturally)
Let me just leave this here and not comment any further on this great progress within the research community.