As a freelancer I do a bit of everything, and I’ve seen places where an LLM breezes through and gets me what I want quickly, and times where using an LLM was a complete waste of time.
Building a simple marketing website? Don’t waste your time doing it by hand - an LLM will probably be faster.
Designing a new SLAM algorithm? LLMs will probably spin around in circles helplessly. That being said, that was my experience several years ago… maybe the state of the art has changed in the computer vision space.
I've been impressed by how this isn't quite true. A lot of my coding life is spent in the popular languages, which the LLMs obviously excel at.
But a random robotics language that dates to the 80s (Karel)? I unfortunately have to use it sometimes, and Claude ingested a PDF manual for the language that is hundreds of pages long, and now it's better at it than I am. It doesn't even have a compiler to test against, and still it rarely makes mistakes.
I think the trick with a lot of these LLMs is just figuring out the best techniques for using them. Fortunately a lot of people are working all the time to figure this out.
Even if your architectural idea is completely unique... a never before seen magnum opus, the building blocks are still legos.
I was trying to remember/figure out some obscure hardware communication protocol in order to work out enumeration of a hardware bus on some servers. Feeding codex a few RFC URLs and other such information, plus telling it to search the internet, resulted in extremely rapid progress vs. having to wade through 500 pages of technical jargon and specification documents.
I'm sure if I was extending the spec to a 3.0 version in hardware or something it would not be useful, but for someone who just needs to understand the basics to get some quick tooling stood up it was close to magic.
The relevant question for LLMs would be "how many high quality results would I get if I googled something related to this", and for DICOM the answer is "many". As long as that is the case, LLMs will not have trouble answering questions about it either.
A very simple kind of query that in my experience causes problems for many current LLMs is:
"Write {something obscure} in the Wolfram programming language."
This is actually where I would be most reluctant to use an LLM. Your website represents your product, and you probably don’t want to give it the scent of homogenized AI slop. People can tell.
If you decide on your own brand colors and wording, there’s very little left about the code that can’t be done instantly by an LLM (at least on a marketing website).
Without playing around with it, you wouldn't know when to use an LLM and when not.
Why would I do that? Well, I wanted to understand more deeply how differences in my prompting might impact the outcomes of the model. I also wanted to get generally better at writing prompts. And of course, I wanted to improve at controlling context and at seeing how models can go off the rails. Just by being better at understanding these patterns, I feel more confident in general about when and how to use LLMs in my daily work.
I think, in general, understanding not only that earlier models are weaker, but also _how_ they are weaker, is useful in its own right. It gives you an extra tool to use.
I will say, the biggest "weaknesses" I've found are in training data. If you're keeping your libraries up-to-date, and you're using newer methods or functionality from those libraries, AI will constantly fail to recognize those new things. For example, Zod v4 came out recently and the older models absolutely fail to understand that it uses some different syntax and methods under the hood. Jest now supports `using` syntax for its spyOn method, and models just can't figure it out. Even with system prompts and telling them directly, the existing training data is just too overpowering.
For example: Gemini became a lot better at a lot more tasks. How do I know? Because I also have very basic benchmarks, or let's say, "things which haven't worked" are my benchmark.
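For anyone wanting to try the "things which haven't worked" approach, here is a minimal sketch of what such a personal benchmark could look like. Everything in it is hypothetical: the `Case` structure, the two example prompts, the string checks, and `fake_model`, which is just a stand-in for whatever real model call you would make. It only illustrates the idea of keeping past failures around and re-running them against each new model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One prompt that failed in the past, plus a crude pass/fail check."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the answer looks acceptable


def run_benchmark(model: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run every case through the model and record whether its check passes."""
    results: dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(model(case.prompt)))
        except Exception:
            # A crash (model call or check) counts as a failure for that case.
            results[case.name] = False
    return results


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; replace with your own client."""
    if "reverse" in prompt.lower():
        return "cba"
    return "no idea"


# Placeholder cases; in practice these would be prompts that actually failed for you.
CASES = [
    Case("reverse-string", "Reverse the string 'abc'.", lambda a: "cba" in a),
    Case("obscure-task", "Do the obscure thing.", lambda a: "expected-keyword" in a),
]

if __name__ == "__main__":
    for name, passed in run_benchmark(fake_model, CASES).items():
        print(f"{name}: {'pass' if passed else 'FAIL'}")
```

Substring checks like these are deliberately crude. The point is just to notice when something that used to fail starts passing, not to score models rigorously.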
This is an industry that requires continuous learning.