This stuff is really, really hard. Social science is very difficult as there's a lot of variance in human ability/responses. Added to that is the variance surrounding setup and tool usage (Claude Code vs Aider vs Gemini vs Codex, etc.).
Like, there's a good reason why social scientists try to use larger samples from a population, and get very nerdy with stratification and the like. This stuff is difficult otherwise.
The gold standard (rather like the METR study) is multiple people with random assignment to tasks with a large enough sample of people/tasks that lots of the random variance gets averaged out.
With a sample of one person, it's almost impossible to get results as good as this. You can eliminate the person-level variance (because it's just one person), but I think you'd need maybe 100 trials/tasks to get a good estimate.
Personally, doing that many trials sounds really implausible, and even if you did manage it, I'd be sceptical of the results, as one would expect a learning effect (getting better at both using LLM tools and side projects in general).
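To get a feel for how the number of trials interacts with the noise, here's a rough simulation sketch. Everything in it is an assumption for illustration, not data: task times are drawn from a log-normal distribution, the mean task takes 4 hours, the spread is 75% of the mean, the "true" speedup is 20%, and trials are independent (so it ignores the learning effect entirely). `detection_rate` and `draw_times` are made-up helper names.

```python
import math
import random
import statistics

def draw_times(n, mean_hours, sd_hours, rng):
    # Log-normal keeps times positive and right-skewed; the parameters are
    # derived so the draws have the requested mean and standard deviation.
    sigma2 = math.log(1 + (sd_hours / mean_hours) ** 2)
    mu = math.log(mean_hours) - sigma2 / 2
    return [rng.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(n)]

def detection_rate(n_per_arm, speedup=0.2, mean_hours=4.0, cv=0.75,
                   n_sims=1000, seed=0):
    """Share of simulated one-person studies in which the difference in mean
    task time is more than 1.96 standard errors away from zero.

    All defaults are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    llm_mean = mean_hours * (1 - speedup)
    hits = 0
    for _ in range(n_sims):
        without = draw_times(n_per_arm, mean_hours, cv * mean_hours, rng)
        with_llm = draw_times(n_per_arm, llm_mean, cv * llm_mean, rng)
        diff = statistics.mean(without) - statistics.mean(with_llm)
        # Standard error of a difference between two independent means.
        se = math.sqrt(statistics.variance(without) / n_per_arm
                       + statistics.variance(with_llm) / n_per_arm)
        if abs(diff) > 1.96 * se:
            hits += 1
    return hits / n_sims

if __name__ == "__main__":
    # Trials per arm, so the total number of tasks is twice this.
    for n in (10, 25, 50, 100, 200):
        print(n, detection_rate(n))
```

Playing with `cv` and `speedup` shows how quickly the required number of tasks moves around, which is sort of the point: the answer depends heavily on how noisy your own task times are.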
The simple answer here (to your original question) is no, you probably can't measure this yourself as you won't have enough data or enough controls around the collection of this data to make accurate estimates.
To get anywhere near a good estimate you'd need multiple developers and multiple tasks (and a set of people to rate the tasks such that the average difficulty remains constant).
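As a sketch of what keeping average difficulty constant could look like in practice: randomise within each difficulty level, so both arms get the same mix of easy and hard tasks. This assumes the raters' scores have already been collapsed into one integer level per task; `assign_conditions` and the arm labels are hypothetical names.

```python
import random
from collections import defaultdict

def assign_conditions(task_difficulty, seed=0):
    """Randomly split tasks into two arms within each difficulty level.

    task_difficulty: dict mapping task id -> difficulty level (assumed to be
    an agreed rating from the raters, e.g. 1 = easy, 3 = hard).
    """
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for task, level in task_difficulty.items():
        by_level[level].append(task)

    assignment = {}
    for level in sorted(by_level):
        tasks = by_level[level]
        rng.shuffle(tasks)
        half = len(tasks) // 2
        for task in tasks[:half]:
            assignment[task] = "with_llm"
        for task in tasks[half:2 * half]:
            assignment[task] = "without_llm"
        # An odd task out at this level gets a coin flip.
        for task in tasks[2 * half:]:
            assignment[task] = rng.choice(["with_llm", "without_llm"])
    return assignment
```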
Actually, I take that back. If you work somewhere with lots and lots of non-leetcode interview questions (take-homes etc.) you could probably do the study I suggested internally. If you were really interested in how this works for professional development, then you could randomise at the level of interviewee, track those who made it through, and compare against their output/reviews roughly a year later.
But no, there's no quick and easy way to do this because the variance is way too high.
> Do you think trashb who made the initial question above would take the results of such evaluation and say "Yeah, that's good enough and answers my question"?
I actually think trashb would have been OK with my original study, but obviously that's just my opinion.