> not elo rating because that depends on a human reference
1) It's an imprecise target: believers can always hype improvements and skeptics can always downplay them. Humans can do many different things reasonably well at the same time, so a machine reaching human-level performance in one field (like identifying digits) says little about its performance in another (like identifying code vulnerabilities).
2) Elo ratings and similar metrics measure skill, and skill can be brute-forced to some extent, the equivalent of grinding up levels in a video game. Brute-forcing a solution is "bad", but how do we know a new method is better, more elegant, or more efficient? For algorithms we have Big-O notation, so we know brute force < bubble sort < quicksort; perhaps there is an analogue for machine learning.
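To make the Big-O ordering concrete, here's a small sketch that counts comparisons for the three strategies above on the same input. The function names and the counting scheme are my own illustration, not a standard benchmark; the point is just that the three approaches sit on visibly different cost curves even at tiny sizes.

```python
# Count comparisons for three sorting strategies on the same input:
# brute force (permutation sort, O(n!)) vs bubble sort (O(n^2))
# vs quicksort (O(n log n) on average). Illustrative sketch only.
from itertools import permutations


def permutation_sort(xs):
    """Brute force: try every ordering until one is sorted."""
    comparisons = 0
    for perm in permutations(xs):
        is_sorted = True
        for a, b in zip(perm, perm[1:]):
            comparisons += 1
            if a > b:
                is_sorted = False
                break
        if is_sorted:
            return list(perm), comparisons


def bubble_sort(xs):
    """Classic bubble sort, counting every pairwise comparison."""
    xs = list(xs)
    comparisons = 0
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            comparisons += 1
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs, comparisons


def quicksort(xs):
    """Simple first-element-pivot quicksort, counting comparisons."""
    comparisons = 0

    def qs(lst):
        nonlocal comparisons
        if len(lst) <= 1:
            return lst
        pivot, rest = lst[0], lst[1:]
        comparisons += len(rest)
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return qs(left) + [pivot] + qs(right)

    return qs(list(xs)), comparisons


data = [7, 3, 9, 1, 8, 2, 6, 5]
for name, fn in [("brute force", permutation_sort),
                 ("bubble sort", bubble_sort),
                 ("quicksort", quicksort)]:
    _, n_cmp = fn(data)
    print(f"{name:12s} comparisons: {n_cmp}")
```

On this 8-element input the brute-force count is in the thousands while bubble sort needs 28 comparisons and quicksort 14, so the ordering in the text shows up directly in the numbers.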
I would like performance comparisons that focus on quantities unique to machines. I don't evaluate a processor's addition with reference to human arithmetic, so why not treat machine intelligence similarly?
There are many interesting quantities along which we can compare ML models. Energy usage is a popular metric, but we can also compare the structure of the network, the code used, the hardware, the amount of training data, the training time, and the similarity between training and test data. A combination of these would be worth examining every time a new model arrives.
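As a sketch of what such a machine-centric comparison might look like, here's a minimal report card that expresses a new model's costs as ratios against a baseline. Every model name and number below is a made-up placeholder, not a real measurement; the shape of the comparison is the point, not the values.

```python
# Hypothetical machine-centric model comparison: report each quantity
# for a new model as a ratio against a baseline model. All values here
# are invented placeholders for illustration.
from dataclasses import dataclass


@dataclass
class ModelReport:
    name: str
    parameters: int          # network structure, crudely summarized
    training_examples: int   # amount of training data
    training_hours: float    # wall-clock training time
    energy_kwh: float        # energy used during training
    accuracy: float          # task score on whatever the benchmark is


def compare(old: ModelReport, new: ModelReport) -> dict:
    """Relative cost/benefit of each quantity for new vs old."""
    return {
        "parameters": new.parameters / old.parameters,
        "training_examples": new.training_examples / old.training_examples,
        "training_hours": new.training_hours / old.training_hours,
        "energy_kwh": new.energy_kwh / old.energy_kwh,
        "accuracy": new.accuracy / old.accuracy,
    }


baseline = ModelReport("baseline", 10_000_000, 1_000_000, 24.0, 50.0, 0.90)
candidate = ModelReport("new-model", 40_000_000, 5_000_000, 96.0, 300.0, 0.92)

for metric, ratio in compare(baseline, candidate).items():
    print(f"{metric:18s} x{ratio:.2f}")
```

A table like this makes the trade-off explicit: in the invented example, a 2% accuracy gain costs 4x the parameters, 5x the data, and 6x the energy, which is exactly the kind of "is this actually a better method?" question an Elo number alone can't answer.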