The "works for me" tells you more about the field of the LLM reviewer than about the LLM.
Can confirm, but all I ever read is that I am holding it wrong.
The issue is that once you hit niche physics simulations, there simply isn't any training data available, so the limitations of these tools become incredibly apparent. It's also problematic because a field itself will contain lots of wrong information (it's research!), and AI picks all this up uncritically.
I thought I'd give ChatGPT a quick spin on my favourite question, which is "is the ADM formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer:
>Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.
I don't know how anyone can stand these tools. It's just an obnoxious glazing machine that consistently tells me I'm a genius.
Gemini gives a somewhat more robust answer, but fails catastrophically on the question "is the BSSN formalism numerically stable", where just about the entire answer is wrong from top to bottom. It certainly looks convincing. It's got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem.
I struggle to see how these tools are of any use
Good enough for enterprise work tho. (Also the secret sauce to "holding LLMs right".)
I'm a month and a half deep into using it to make a traffic simulator with a bespoke physics engine that has complete drivetrain, suspension, and tire kernels. Think rally sim with an arcadey super off road presentation. It also has a full (also bespoke) WebTransport stack that has held up beyond my wildest dreams. The simulation itself is capable of >500k cars. That was all complete about 2 weeks ago. The remainder of the work is integrating and optimizing the (you guessed it, also bespoke) pure synthesis sound engines for drivetrain/engine/tire/collision noise, and making pixi performant enough to actually display it all.
My biggest regret is actually accepting its choice of pixi. If I had just trusted what I knew and done my own renderer too, it'd already be finished! In the meantime I'm having fun boiling down the nonlinear continuous-ish models into fitted surrogate polynomials and regime-specific closed forms. Currently using cloud credits I was given to test the library I need to accelerate this work on CDNA3/4 cards. It's so nice to make someone else's room hot for a change.
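For anyone wondering what "fitted surrogate polynomials" looks like in practice, here is a rough sketch of the general idea, not the actual code from the project. It assumes a Pacejka-style tire curve with made-up coefficients (`tire_force`, `B`, `C`, `D` are all hypothetical) and fits a degree-5 polynomial to it over a single slip regime by plain least squares. The input is rescaled to [-1, 1] first so the normal equations stay well conditioned.

```python
import math

def tire_force(slip, B=10.0, C=1.9, D=1.0):
    """Illustrative Pacejka-style curve; the coefficients are made up."""
    return D * math.sin(C * math.atan(B * slip))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_surrogate(f, lo, hi, degree=5, samples=200):
    """Least-squares polynomial fit of f on [lo, hi]; returns a callable."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    # Sample in the rescaled variable t in [-1, 1].
    ts = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    ys = [f(mid + half * t) for t in ts]
    n = degree + 1
    # Normal equations for the monomial basis 1, t, t^2, ...
    A = [[sum(t ** (i + j) for t in ts) for j in range(n)] for i in range(n)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(n)]
    coeffs = solve(A, b)

    def surrogate(x):
        t = (x - mid) / half
        acc = 0.0
        for c in reversed(coeffs):  # Horner evaluation
            acc = acc * t + c
        return acc

    return surrogate

# One regime only: small slip, before the curve saturates.
surrogate = fit_surrogate(tire_force, 0.0, 0.1)
max_err = max(abs(surrogate(0.001 * k) - tire_force(0.001 * k)) for k in range(101))
print(f"max abs error on [0, 0.1]: {max_err:.2e}")
```

The surrogate is only valid inside the regime it was fitted on. Outside that range you would switch to a different fit or a closed form, which is presumably where the "regime-specific" part comes in.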
I've really enjoyed the ~3 month speedrun from "he has psychosis" to "the model did everything", yet somehow the number of people having this kind of success continues to match up with where I'd rank a given dev. There just aren't that many talented people out there and an even smaller subset of them are aiming high enough with LLMs, if at all. It's a truly awesome time to not have/need a job
E: Most of my frustration is directed at OAI; they keep fucking up the cache and usage calculations. They got a grand out of me, and I'm excited to see what DeepSeek does for me with the same.