Many students clearly try to answer exams by pattern matching, and I've seen plenty of exams where students "matched" on a pattern based on a single word in a question and did something totally wrong.
For example, customer service reps often loosely match your request to a templated response that may be only vaguely applicable.
Technically savvy customers who try to explain their problems in detail are probably more likely to get a canned response that doesn't actually apply: the rep gets frustrated with the amount of information and latches onto the first phrase that matches a templated response, without really considering context.
My reply’s getting a little tangential now, but I feel this is good life advice: I’ve found I’m more likely to get decent customer service if I keep my requests as short as possible.
The first sentence needs to state the issue I need help with. In some cases a bulleted list of things I’ve tried helps, and I make sure to include essential info like an account number, e.g.:
I’m getting error 13508 when I try to log into my account. I’ve already tried the following solutions with no success:
- Clearing my browser cache and cookies.
- Restarting my computer.
- Running all software updates.
My account number: xxx
What is the next step here?
The next step will be to walk you through clearing your browser cache and cookies.
Because the CS rep has no idea who you are, and your protestations of competence fall on deaf ears: they've dealt with 23325424 people in the last year who claimed to know what they were doing but actually didn't.
Their goal is to get through the script, because getting through the script is the only way to be sure that it's all been done the way it needs to be done. And if they skip the script, refer you to the next level of support, and it turns out that you hadn't actually cleared your browser cache and cookies, then that's their fault and they get dinged for it.
I always approach these situations with this understanding: the quickest way to get my problem solved is to help them work through their script. And every now and then, just occasionally, working through the script has turned up something simple and obvious that I'd totally missed despite my decades of experience.
Obviously I don't do business with that company anymore.
However, I still think irrelevant facts would throw off a number of exam takers, and claiming they "clearly" wouldn't is far too strong a claim to make without evidence.
It reminds me of Kahneman's "system 1" (fast) and "system 2" (slow) thinking. LLMs are system 1 - fast, intuitive, instinctual. Humans often think that way. But we can also break out system 2 when we choose to, and apply logic, reason, etc.
But in general I do not think these models are claimed to be good at replicating the performance of a distracted or otherwise low-performing pupil. I think they should be evaluated against humans who are capable of completing word problems containing context that is not strictly necessary to the math question. The reason the tests I mentioned use these word problems is that they evaluate someone's ability to think in abstract mathematical terms about everyday situations, which naturally involve lots of unimportant information the person must decide whether to consider.
tl;dr: I think a reasonably competent high school student could answer the apple and cat question, which is absolutely a reasonable bar for an LLM to clear. If university students are failing these questions, then they have not been taught test-taking skills, which should be considered a mathematical failure just as unacceptable as the LLM's, not a mitigating similarity for the latter.