'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it unlike the above 2 but it's just incredibly silly to me to think we should be directly comparing something like that with entities operating in wildly different substrates.
If I can't look at the score and immediately get a good idea of where things stand, then throw it way. 5% here could mean anything from 'solving only a tiny fraction of problems' to "solving everything correctly but with more 'reasoning steps' than the best human scores." Literally wildly different implications. What use is a score like that ?