In our small and humble internal evals it regularly beats other frontier models on some tasks. The shape of capability is really not intuitive or one-dimensional.
Use a specific monad transformer regularly? It'll use that pattern, and often very well, handling all the wrapping and unwrapping needed to move data types about (at least well enough that the odd case where it misses some wrapping or unwrapping is easy to spot and manage).
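For anyone unfamiliar with what that wrapping/unwrapping looks like, here's a minimal sketch using StateT over IO from the transformers package; the particular stack and the tick function are my own illustration, not from any codebase discussed here. The lift calls are exactly the boilerplate the model has to get right.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State (StateT, evalStateT, get, put)

-- StateT layers an Int counter over IO, so plain IO actions
-- must be wrapped with lift to run inside the stack.
tick :: StateT Int IO ()
tick = do
  n <- get                             -- runs in the StateT layer
  lift (putStrLn ("tick " ++ show n))  -- IO action lifted into the stack
  put (n + 1)

main :: IO ()
main = evalStateT (tick >> tick >> tick) 0
```

Forget a lift (or add one too many) and the type error is immediate and obvious, which is why the occasional miss is easy to spot and manage.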
Give a custom GPT or GEM the same source files, and those models regularly fail to maintain style and context, often suggesting solutions that might be fine in isolation but make little sense in the context of the larger codebase. It's almost as if they never reliably refer to the code included in the project/GPT/GEM.
Claude, on the other hand, is so consistent about referring to existing artifacts that, as you approach the limit of project size (which is admittedly small), you can use up your entire 5-hour block of credits with just a few back-and-forths.
After an hour I had a half-assed working result; I put everything into Claude and it made it significantly better on the first try, and I didn't even have an active Claude subscription.
I find it concerning that there are no truly accurate benchmarks for this stuff that we can all agree on.