
disgruntledphd2 2mo ago

> I guess I was looking for something bit more concrete, that one could apply themselves, which would answer the "if they have measured their results? [...] Can you provide data that objects this view" part of parents comment.

This stuff is really, really hard. Social science is very difficult as there's a lot of variance in human ability/responses. Added to that is the variance surrounding setup and tool usage (claude code vs aider vs gemini vs codex etc).

Like, there's a good reason why social scientists try to use larger samples from a population, and get very nerdy with stratification et al. This stuff is difficult otherwise.

The gold standard (rather like the METR study) is multiple people with random assignment to tasks with a large enough sample of people/tasks that lots of the random variance gets averaged out.

On a 1 person sample level, it's almost impossible to get results as good as this. You can eliminate the person level variance (because it's just one person), but I think you'd need maybe 100 trials/tasks to get a good estimate.

Personally, that sounds really implausible, and even if you did accomplish this, I'd be sceptical of the results as one would expect a learning effect (getting better at both using LLM tools and side projects in general).
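
For a rough feel of where a number like 100 comes from, here is a back-of-the-envelope power calculation using the textbook normal-approximation formula for comparing two means. The function name and the inputs (a 20% speedup, per-task times with a standard deviation of about 50% of the mean) are assumptions picked for illustration, not measurements from this thread or from the METR study.

```python
import math
from statistics import NormalDist


def tasks_per_condition(effect, sd, alpha=0.05, power=0.80):
    """Approximate number of tasks needed in EACH condition (with AI, without AI)
    to detect a difference in mean completion time of `effect`, given a
    per-task standard deviation `sd` in the same units as `effect`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)


# Hypothetical inputs: times expressed as a fraction of the mean task time,
# so effect=0.20 is a 20% speedup and sd=0.50 is a 50% spread between tasks.
print(tasks_per_condition(effect=0.20, sd=0.50))
```

With these assumed inputs the answer lands on the order of 100 tasks per condition, and that is before accounting for any learning effect. Because the requirement scales with the inverse square of the effect, halving the assumed speedup roughly quadruples the number of tasks.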

The simple answer here (to your original question) is no, you probably can't measure this yourself as you won't have enough data or enough controls around the collection of this data to make accurate estimates.

To get anywhere near a good estimate you'd need multiple developers and multiple tasks (and a set of people to rate the tasks such that the average difficulty remains constant).

Actually, I take that back. If you work somewhere with lots and lots of non-leetcode interview questions (take homes etc) you could probably do the study I suggested internally. If you were really interested in how this works for professional development, then you could randomise at the level of interviewee and track those that made it through and compare to output/reviews approx 1 year later.

But no, there's no quick and easy way to do this because the variance is way too high.

> Do you think trashb who made the initial question above would take the results of such evaluation and say "Yeah, that's good enough and answers my question"?

I actually think trashb would have been OK with my original study, but obviously that's just my opinion.


trashb 2mo ago

To wrap this up, what I was trying to say is that the feeling of being faster may not align with reality. Even for people who have a good understanding of the matter, it may be difficult to estimate. So I would say be skeptical of claims like this and try to somehow quantify it in a way that matters for the tasks you do. This is something managers of software projects have been trying to tackle for a while now.

There is no exact measurement in this case, but you could get an idea by testing certain types of implementations. For example, you could check whether you finish similar tasks on average 25% faster with AI than without over a longer testing period. Just the act of timing yourself doing tasks with or without AI may already give a crude indication of the difference.
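
As a minimal sketch of what you could do with such self-timed data, a permutation test asks how often a random relabelling of the tasks produces a gap at least as large as the one you observed. The function name and the timings below are made up for illustration, and this does nothing about the learning effect or task-difficulty differences discussed above.

```python
import random
from statistics import mean


def permutation_test(with_ai, without_ai, n_resamples=10_000, seed=0):
    """Return (observed difference in mean time, two-sided p-value)."""
    rng = random.Random(seed)
    observed = mean(with_ai) - mean(without_ai)
    pooled = list(with_ai) + list(without_ai)
    n_with = len(with_ai)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: pretend AI use had no effect on task time.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_with]) - mean(pooled[n_with:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical minutes per task, NOT real measurements.
with_ai = [38, 52, 41, 60, 35, 47, 55, 44]
without_ai = [50, 49, 63, 58, 71, 46, 66, 54]
diff, p = permutation_test(with_ai, without_ai)
print(f"mean difference: {diff:.1f} min, p = {p:.3f}")
```

With only a handful of tasks per condition the p-value will swing a lot from one batch to the next, which is the variance problem in a nutshell.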

You could also run a trial implementing coding tasks like LeetCode problems; however, you will introduce some kind of bias from having done them previously. Additionally, the tasks may not align with your daily activities.

A trial with multiple developers working on the same task pool with or without AI could lead to more substantial results, but you won't be able to do that by yourself.
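
For what the assignment step of such a multi-developer trial might look like, here is a small sketch: each developer gets the same task pool, and for each developer half of the tasks are randomly assigned to the "ai" condition and half to "no_ai". The function name, condition labels, and the even split are assumptions for illustration, not a description of how any published study did it.

```python
import random


def assign_conditions(developers, tasks, seed=0):
    """Randomly split the shared task pool into 'ai' / 'no_ai' per developer."""
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    plan = {}
    for dev in developers:
        shuffled = list(tasks)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # First half of the shuffled pool is done with AI, the rest without.
        plan[dev] = {
            task: ("ai" if i < half else "no_ai")
            for i, task in enumerate(shuffled)
        }
    return plan


# Hypothetical developers and task IDs.
plan = assign_conditions(["dev_a", "dev_b", "dev_c"], ["t1", "t2", "t3", "t4"])
for dev, tasks in plan.items():
    print(dev, tasks)
```

Randomising per developer means the same task ends up in both conditions across the group, so differences in task difficulty tend to average out instead of being confounded with AI use.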

embedding-shape 2mo ago

So there seems to be a shared understanding of how difficult "measure your results" would be in this case, so could we also agree that asking someone:

> I wonder if they have measured their results? [...] Can you provide data that objects this view, based on these (celebrity) developers or otherwise?

isn't really fair? Because not even you or I really know how to do so in a fair and reasonable manner, unless we start to involve trials with multiple developers and so on.

trashb 1mo ago

> isn't this fair?

We are talking about hearsay and anecdotal evidence from some influential people in the industry. The people mentioned in the comment I responded to have the influence to organize such research. Some measurements (even if not ideal) can at least tell a 20x speedup apart from a 0.1x one.

I indicated that there is at least some research suggesting that developers (experienced or not) often overestimate the gains of using AI. There are a lot of other things that may prompt people to say things about emerging industries, for example investments in the AI industry.

I am interested in whether the claims are real or perhaps overstated. Therefore I asked what kind of information they are based on. This is how science works compared to marketing claims. Hypotheses lead to experiments that result in measurements that lead to a conclusion.

But as of now I still haven't even gotten a link to the statements supposedly made by these influential developers. This is typical of the rhetoric around a lot of claims, especially about AI. Therefore I remain skeptical about such claims until I see some concrete evidence.

So I would say yes it is fair to ask if they measured their results to back up their claims, especially if they are influential developers.

disgruntledphd2 OP 2mo ago

> isn't really fair? Because not even you or I really know how to do so in a fair and reasonable manner, unless we start to involve trials with multiple developers and so on.

I think in a small conversation like this, it's probably not entirely fair.

However, we're hearing similar things from much larger organisations who definitely have the resources to do studies like this, and yet there's very little decent work available.

In fact, lots of the time they are deliberately misleading people ("25% of our code is generated by AI", where the AI is Copilot/other autocomplete). Like, that 25% stat was probably true historically with JetBrains products and any form of code generation (for protobufs et al), so it's wildly deceptive.
