For MMLU, it highlights the CoT@32 result, where Ultra beats GPT-4, even though Ultra loses to GPT-4 in the 5-shot setting.
For GSM8K, it reports Maj1@32 for Ultra but 5-shot CoT for GPT-4, and so on.
On top of that, for some reason, it uses different metrics for Ultra and Pro, which makes those two hard to compare as well.
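To make concrete why a Maj1@32 number can't be set next to a single-answer few-shot number, here is a minimal sketch of majority voting over sampled answers, as I understand the metric. The function names and the toy data are made up for illustration (4 samples per question instead of 32), and this is not the report's actual evaluation code.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]

def maj1_at_k_accuracy(samples_per_question, references):
    """Score one majority-voted answer per question (Maj1@k style)."""
    correct = sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_question, references)
    )
    return correct / len(references)

def single_sample_accuracy(samples_per_question, references):
    """Score only the first sampled answer per question (one attempt)."""
    correct = sum(
        samples[0] == ref
        for samples, ref in zip(samples_per_question, references)
    )
    return correct / len(references)

# Toy, invented data: 3 questions, 4 sampled final answers each.
samples = [
    ["18", "18", "20", "18"],
    ["7", "5", "5", "5"],
    ["4", "3", "3", "3"],
]
refs = ["18", "5", "3"]

print("majority-vote accuracy:", maj1_at_k_accuracy(samples, refs))
print("single-sample accuracy:", single_sample_accuracy(samples, refs))
```

Same model outputs, two very different scores, purely because of how the answers are aggregated. That is why putting a voted number for one model next to a single-shot number for another tells you very little.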
What a mess of a "paper".