I just finished a small project where I used o3-mini and o3-mini-high to generate most of the code. I averaged around 200 lines of code an hour, including the business logic and unit tests. The total was around 2,200 lines. So, not a big project, but not a throwaway script either. The code was perfectly fine for what we needed. This is the third time I've done this, and each time I get faster and better at it.
1. I find a "pair programming" mentality is key. I focus on the high-level code, and let the model focus on the lower level code. I code review all the code, and provide feedback. Blindly accepting the code is a terrible approach.
2. Generating unit tests is critical. After I like the gist of some code, I ask for some smoke tests. Again, peer review the code and adjust as needed.
3. Be liberal with starting a new chat: the models can get easily confused with longer context windows. If you start to see things go sideways, start over.
4. Give it code examples. Don't prompt with English only.
FWIW, o3-mini was the best model I've seen so far; Sonnet 3.5 New is a close second.
The minute you have to discuss those things with someone else, your bandwidth decreases by orders of magnitude and now you have to put words to these things and describe them, and physically type them in or vocalize them. Then your counterpart has to input them through his eyes and ears, process that, and re-output his thoughts to you. Slow, slow, slow, and prone to error and specificity problems as you translate technical concepts to English and back.
Chat as a UX interface is similarly slow and imprecise. It has all the shortcomings of discussing your idea with a human, with really no upside besides the dictionary-like recall.
> I focus on the high-level code, and let the model focus on the lower level code.
Tbh, the reason I don't use LLM assistants is that they suck at the "low level". They are okay at the mid level and better at the high level. I find their actual coding very mediocre and fraught with errors. I've yet to see any model understand nuance or detail.
This is especially apparent in image models. Sure, they can do hands, but they still don't get 3D space or temporal movement. It's great for scrolling through Twitter, but the longer you look, the more surreal the images get. This even includes the new ByteDance model also on the front page. Coding models likewise ignore the context of the codebase, and the results feel like patchwork. They feel like what you'd be annoyed at a junior dev for writing: not only do you have to go through 10 PRs to make it pass the test cases, but the lack of context builds a lot of tech debt. They'll write unit tests that technically work but don't capture the actual issues, and that usually could be far more condensed while having greater coverage. It feels very gluey, like copy-pasting from Stack Overflow while hyper-focused on the immediate outcome instead of understanding the goal. It is too "solution" oriented, doesn't grasp the underlying heuristics, and is more frustrating than dealing with the human equivalent who says something "works" as evidenced by the output. That's like claiming a math proof is correct by looking at just the last line.
Ironically, I think this is part of why the chat interface sucks too. A lot of our job is inferring what our managers are even asking us to make. And you can't even know the answer until you're part way in.
1. I need a smart autocomplete that can work backwards and mimic my coding patterns
2. I need a pair programming buddy (of sorts, this metaphor doesn't completely work, but I don't have a better one)
Pair development, even a butchered version of the so-called "strong style" (give the driver the highest level of abstraction they can use/understand), works quite well for me. But the main reason this works is that it forces me to structure my thinking a little bit and allows me to iterate on the definition of the problem. Toss away the sketch with the bigger parts of the problem; start again.
It also helps me to avoid yak shaving, getting lost in the detail or distracted because the feedback loop between me seeing something working on the screen vs. the idea is so short (even if the code is crap).
I'd also add a 5: use prompts to generate (boring) prompts. For instance, I needed a simple #tag formatter for one of my markdown sites. I am aware that there's a not-so-small list of edge cases I'd need to cover. In this case I'd write a prompt with a list of basic requirements and ask the LLM to: a) extend it with good-practice, common edge cases, and b) format it as a spec with concrete input/output examples. This works a bit like the point you made about generating unit tests (I do that too, in tandem with this approach).
In a sense, 1) is autocomplete and 2) is a scaffolding tool.
Yesterday, I asked o3-mini to "optimize" a block of code. It produced very clean, functional TypeScript. However, because the code is reducing stock option chains, I then asked o3-mini to "optimize for speed." In the JavaScript world, this is usually done with for loops, and it even considered aspects like array memory allocation.
This shows that using the right qualifiers is important for getting the results you want. Today, I use both "optimize for developer experience" and "optimize for speed" when they are appropriate.
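To illustrate with a hypothetical sketch (not the actual option-chain code; the names are invented): asking for developer experience tends to yield the functional version, while asking for speed tends to yield the indexed loop.

    interface OptionContract {
      strike: number;
      delta: number;
      openInterest: number;
    }

    // "Optimize for developer experience": clean, functional, readable.
    function netDelta(chain: OptionContract[]): number {
      return chain.reduce((sum, c) => sum + c.delta * c.openInterest, 0);
    }

    // "Optimize for speed": a plain indexed loop that avoids per-element
    // callback overhead; preallocating output arrays is the same idea.
    function netDeltaFast(chain: OptionContract[]): number {
      let sum = 0;
      for (let i = 0; i < chain.length; i++) {
        sum += chain[i].delta * chain[i].openInterest;
      }
      return sum;
    }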
Although declarative code is just an abstraction, moving from imperative jQuery to declarative React was a major change in my coding experience. My work went from telling the system how to do something to simply telling it what to do. Of course, in React—especially at first—I had to explain how to do things, but only once to create a component. After that, I could just tell the system what to do. Now, I can simply declare the desired outcome, the what. It helps to understand how things work, but that level of detail is becoming less necessary.
Chat sucks for pulling in context, and the only worse thing I've tried is the IDE integrations that supposedly pull the relevant context for you (and I've tried quite a few recently).
I don't know if naive fine-tuning on a codebase would work. I suspect there are going to be tools that let you train the AI on your code, in the sense that it has some references in-model and knows how you want your project's code/structure to look (which is often quite different from what it looks like in most areas).
Even then, though, I asked o1-cursor to start a React app. It failed, mostly because its knowledge is out of date: its instructions were for the React of two versions ago.
This seems like an issue. If the statistically most likely answer is old, that's not helpful.
I think chat is a nice intermediary evolution between the CLI (that we use every day) and whatever comes next.
I work at Augment (https://augmentcode.com), which, surprise surprise, is an AI coding assistant. We think about the new modality required to interact with code and AI on a daily basis.
Besides increasing productivity (and happiness, as you don't have to do mundane tasks like tests, documentation, etc.), I personally believe that what AI can open up is actually more of a way for non-coders (think PMs) to interact with a codebase. AI is really good at converting specs, user stories, and so on into tasks—which today still need to be implemented by software engineers (with the help of AI for the more tedious work). Think of what Figma did between designers and developers, but applied to coding.
What’s the actual "new UI/UX paradigm"? I don’t know yet. But like with Figma, I believe there’s a happy ending waiting for everyone.
I'm now bracing for the "oh sht, we're all out of a job next year" narrative.
It is helpful to frame this in the historical arc described by Yuval Harari in his recent book "Nexus" on the evolution of information systems. We're at the dawn of history for how to work with AI, and actively visualizing the future has an immediate ROI.
"Chat" is cave man oral tradition. It is like attempting a complex Ruby project through the periscope of an `irb` session. One needs to use an IDE to manage a complex code base. We all know this, but we haven't connected the dots that we need to approach prompt management the same way.
Flip ahead in Harari's book, and he describes rabbis writing texts on how to interpret [texts on how to interpret]* holy scriptures. Like Christopher Nolan's movie "Inception" (his second most relevant work after "Memento"), I've found myself several dreams deep collaborating with AI to develop prompts for [collaborating with AI to develop prompts for]* writing code together. Test the whole setup on multiple fresh AI sessions, as if one is running a business school laboratory on managerial genius, till AI can write correct code in one shot.
Duh? Good managers already understand this, working with teams of people. Technical climbers work cliffs this way. And AI was a blithering idiot until we understood how to simulate recursion in multilayer neural nets.
AI is a Rorschach inkblot test. Talk to it like a kindergartner, and you see the intelligence of a kindergartner. Use your most talented programmer to collaborate with you in preparing precise and complete specifications for your team, and you see a talented team of mature professionals.
We all experience degradation of long AI sessions. This is not inevitable; "life extension" needs to be tackled as a research problem. Just as old people get senile, AI fumbles its own context management over time. Civilization has advanced by developing technologies for passing knowledge forward. We need to engineer similar technologies for providing persistent memory to make each successive AI session smarter than the last. Authoring this knowledge helps each session to survive longer. If we fail to see this, we're condemning ourselves to stay cave men.
Compare the history of computing. There was a lot of philosophy and abstract mathematics about the potential for mechanical computation, but our worldview exploded when we could actually plug the machines in. We're at the same inflection point for theories of mind, semantic compression, structured memory. Indeed, philosophy was an untestable intellectual exercise before; now we can plug it in.
How do I know this? I'm just an old mathematician, in my first month trying to learn AI for one final burst of productivity before my father's dementia arrives. I don't have time to wait for anyone's version of these visions, so I computed them.
In mathematics, the line in the sand between theory and computation keeps moving. Indeed, I helped move it by computerizing my field when I was young. Mathematicians still contribute theory, and the computations help.
A similar line in the sand is moving, between visionary creativity and computation. LLMs are association engines of staggering scope, and what some call "hallucinations" can be harnessed to generalize from all human endeavors to project future best practices. Like how to best work with AI.
I've tested everything I say here, and it works.
Perhaps there's gonna be post-AI programming movement where people actually stare at the same monitor and discuss while one of them is coding.
As a sidenote - we've done experiments with FOBsters, and when paired this way, they multiply their output. There's something about the psychology of groups and how one can only provide maximum output when teaming.
Even for solo activities, and non-IT activities such as skiing/snowboarding, it is better to have a partner to ride with you and discuss the terrain.
Works amazingly well for a lot of what I've been working on the past month or two.
I had a crucial area involving threads. The chat-generated code seemed to be OK, but had one flaw. My initial code, written manually, was bug-free; the chat-generated output was not. It was difficult to catch via inspection.
My experiments have been nowhere near that successful.
I would love, love, love to see a transcript of how that process worked over an hour, if that was something you were willing to share.
I do all this + rubber ducky the hell out of it.
Sometimes I just discuss concepts of the project with the thing and it helps me think.
I don't think chat is going to be right for everyone, but it absolutely works for me.
Came here to vote "good" too. I mean, why do we all love a nice REPL? That's chat, right? Chat with an interpreter.
GitHub Copilot is...not. It doesn't seem to understand how to help me as well as ChatGPT does.
This is what I've found to be key. If I start a new feature, I will work with the LLM to do the following:
- Create problem and solution statement
- Create requirements and user stories
- Create architecture
- Create skeleton code. This is critical since it lets me understand what it wants to do.
- Generate a summary of the skeleton code
Once I have done the above, I will have the LLM generate a reusable prompt that I can use to start LLM conversations with. Below is an example of how I turn everything into a reusable prompt.
https://beta.gitsense.com/?chat=b96ce9e0-da19-45e8-bfec-a3ec...
As I make changes, like adding new files, I will need to generate a new prompt, but it is worth the effort. And you can see it in action here.
https://beta.gitsense.com/?chat=b8c4b221-55e5-4ed6-860e-12f0...
The first message is the reusable prompt message. With the first message in place, I can describe the problem or requirements and ask the LLM what files it will need to better understand how to implement things.
What I am currently doing highlights why I think LLMs are a game changer. VCs are going for moonshots instead of home runs. The ability to gather requirements and talk through a solution before even coding is how I think LLMs will revolutionize things. It is great that they can produce usable code, but what I've found invaluable is that they help you organize your thoughts.
In the last link, I am having a conversation with both DeepSeek v3 and Sonnet 3.5, and the LLMs legitimately saved me hours of work without my writing a single line of code. In the past, I would have just implemented the feature and been done with it, and then I would have had to fix something if I hadn't thought of an edge case. With LLMs, it literally takes minutes to develop a plan that is extremely well documented and can be shared with others.
This ability to generate design documents is how I think LLMs will ultimately be used. The bonus is producing code, but the reality is that documentation (which can be tedious and frustrating) is a requirement for software development. In my opinion, this is where LLMs will forever change things.
Why are the top comments on HN always from people who have not read the article?
By and large, I assert this is because the best way to do something is to do that thing. There can be correspondence around the thing, but the artifacts that you are building are separate things.
You could probably take this further and say that narrative is a terrible way to build things. It can be a great way to communicate them, but being a separate entity, it is not necessarily good at making any artifacts.
Chat is a great UI pattern for ephemeral conversation. It's why we get on the phone or on DM to talk with people while collaborating on documents, and don't just sit there making isolated edits to some Google Doc.
It's great because it can go all over the place and the humans get to decide which part of that conversation is meaningful and which isn't, and then put that in the document.
It's also obviously not enough: you still need documents!
But this isn't an "either-or" case. It's a "both" case.
I really don't like the idea of chatting with an AI though. There are better ways to interface with AIs and the focus on chat is making people forget that.
The last counter argument I read got buried on Discord or Slack somewhere.
I wish I could block them within all these chat apps.
"Sorry, you can't bother to send voice messages to this person."
I think this applies to any “fuzzy generation” scenario. It certainly shouldn’t be the only tool, and (at least as it stands today) isn’t good enough to finalize and fine tune the final result, but a series of “a foo with a bar” “slightly less orange” “make the bar a bit more like a fizzbuzz” interactions with a good chat UI can really get a good 80% solution.
But like all new toys, AI and AI chat will be hammered into a few thousand places where it makes no sense until the hype dies down and we come up with rules and guidelines for where it does and doesn’t work
- "App builders" that use some combination of drag&drop UI builders, and design docs for architecture, workflows,... and let the UI guess what needs to be built "under the hood" (a little bit in the spirit of where UML class diagrams were meant to take us). This would still require actual programming knowledge to evaluate and fix what the bot has built
- Formal requirement specification that is sufficiently rigorous to be tested against automatically. This might go some way towards removing the requirement to know how to code, but the technical challenge would simply shift to knowing the specification language
I created the tetr app[1] which is basically “chat UI for everything”. I did that because I used to message myself notes and wanted to expand it to many more things. There’s not much back and forth, usually 1 input and instant output (no AI), still acting like a chat.
I think there’s a lot of intuitiveness with chat UI and it can be a flexible medium for sharing different information in a similar format, minimizing context switching. That’s my philosophy with tetr anyhow.
It's usually not. Narrative is a famously flawed way to communicate or record the real world.
It's great for generating engagement, though.
Incidentally I think that's also a good model for how much to trust the output - you might have a colleague who knows enough about X to think they can answer your question, but they're not necessarily right, you don't blindly trust it. You take it as a pointer, or try the suggestion (but not surprised if it turns out it doesn't work), etc.
I think it is brilliant. On the other hand, I've caught myself many times writing prompts to colleagues. Although, it made the requirements of what I need so much clearer for them.
I’ve been saying this since 2018
Agreed that copy pasting context in and out of ChatGPT isn't the fastest workflow. But Cursor has been a major speed up in the way I write code. And it's primarily through a chat interface, but with a few QOL hacks that make it way faster:
1. Output gets applied to your file in a git-diff style. So you can approve/deny changes.
2. It (kinda) has context of your codebase so you don't have to specify as much. Though it works best when you explicitly tag files ("Use the utils from @src/utils/currency.ts")
3. Directly inserting terminal logs or type errors into the chat interface is incredibly convenient. Just hover over the error and click "add to chat".
I’ve only been slowed down with AI tools. I tried for a few months to really use them and they made the easy tasks hard and the hard tasks opaque.
But obviously some people find them helpful.
Makes me wonder if programming approaches differ wildly from developer to developer.
For me, if I have an automated tool writing code, it’s bc I don’t want to think about that code at all.
But since LLMs don’t really act deterministically, I feel the need to double check their output.
That’s very painful for me. At that point I’d rather just write the code once, correctly.
The chat interface is... fine. Certainly better integrated into the editor than GitHub Copilot's, but I've never really seen the need to use it as chat—I ask for a change and then it makes the change. Then I fix what it did wrong and ask for another change. The chat history aspect is meaningless and usually counterproductive, because it's faster for me to fix its mistakes than to keep everything in the chat window while prodding it the last 20% of the way.
Zed makes it trivial to attach documentation and terminal output as context. To reduce risk of hallucination, I now prefer working in static, strongly-typed languages and use libraries with detailed documentation, so that I can send documentation of the library alongside the codebase and prompt. This sounds like a lot of work, but all I do is type "/f" or "/t" in Zed. When I know a task only modifies a single file, then I use the "inline assist" feature and review the diffs generated by the LLM.
Additionally, I have found it extremely useful to actually comment a codebase. LLMs are good at unstructured human language, it's what they were originally designed for. You can use them to maintain comments across a codebase, which in turn helps LLMs since they get to see code and design together.
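For example, the kind of design-note comment I mean (a made-up sketch; the names and the feed behavior are invented for illustration):

    interface Tick { ts: number; price: number }

    /**
     * Buckets ticks into one-minute groups.
     *
     * Design note (the part an LLM benefits from seeing alongside the code):
     * the upstream feed can deliver ticks slightly out of order, so callers
     * must not assume each bucket is time-sorted.
     */
    function bucketByMinute(ticks: Tick[]): Map<number, Tick[]> {
      const buckets = new Map<number, Tick[]>();
      for (const t of ticks) {
        const minute = Math.floor(t.ts / 60_000);
        const group = buckets.get(minute) ?? [];
        group.push(t);
        buckets.set(minute, group);
      }
      return buckets;
    }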
Last weekend, I was able to re-build a mobile app I made a year ago from scratch with a cleaner code base, better UI, and implement new features on top (making the rewrite worth my time). The app in question took me about a week to write by hand last year; the rewrite took exactly 2 days.
---
As a side note: a huge advantage of Zed with locally-hosted models is that one can correct the code emitted by the model and force the model to re-generate its prior response with those corrections. This is probably the "killer feature" of models like qwen2.5-coder:32b. Rather than sending extra prompts and bloating the context, one can just delete all output from where the first mistake was made, correct the mistake, then resume generation.
Imperative:
- write an HTTP server that serves jokes
- add a healthcheck endpoint
- add TLS and change the serving port to 443

Declarative:
- an HTTP server that serves jokes
- contains a healthcheck endpoint
- supports TLS on port 443
The differences here seem minimal because you can see all of it at once, but in the current chat paradigm you'd have to search through everything you've said to the bot to get the full context, including the side roads that never materialized.
In the document approach you're constantly refining the document. It's better than reviewing the code because (in theory) you're looking at "support TLS on port 443" instead of a lot of code, which means it can be used by a wider audience. And ideally I can give the same high level spec to multiple LLMs and see which makes the best application.
e.g. executives treat the org as a black-box LLM and chat with it to get real results
I found interacting with it via chat to be super-useful and a great way to get stuff done. Yeah, sometimes you just have to drop into the code, and tag a particular line and say "this isn't going to work, rewrite it to do x" (or rewrite it yourself), but the ability to do that doesn't vitiate the value of the chat.
So you either need lots of extra text to remove the ambiguity of natural language if you use AI or you need a special precise subset to communicate with AI and that’s just programming with extra steps.
Real projects don't require an infinitely detailed specification either, you usually stop where it no longer meaningfully moves you towards the goal.
The whole premise of AI developer automation, IMO, is that if a human can develop a thing, then AI should be able too, given the same input.
Joking aside, this is likely where we will end up, just with a slightly higher programming interface, making developers more productive.
All the same buzzwords, including "AI"! In 1981!
Having a feedback loop is the only viable way to do this. Sure, the client could give you a book on what they want, but often people do not know their edge cases or what issues may arise, etc.
If you know how to program, then I agree, and that's part of why I don't see the point. If you don't know how to program, then the prompt isn't much different from providing the specs/requirements to a programmer.
haha, I just imagined sending TypeScript to ChatGPT and having it spit my TypeScript back to me. "See guys, if you just use Turing-complete logically unambiguous input, you get perfect output!"
The struggle is to provide a context that disambiguates the way you want it to.
LLMs solve this problem by avoiding it entirely: they stay ambiguous, and just give you the most familiar context, letting you change direction with more prompts. It's a cool approach, but it's often not worth the extra steps, and sometimes your context window can't fit enough steps anyway.
My big idea (the Story Empathizer) is to restructure this interaction such that the only work left to the user is to decide which context suits their purpose best. Given enough context instances (I call them backstories), this approach to natural language processing could recursively eliminate much of its own ambiguity, leaving very little work for us to do in the end.
Right now my biggest struggle is figuring out what the foundational backstories will be, and writing them.
I think about this like SQL in the late 80s. At the time, SQL was the “next big thing” that was going to mean we didn’t need programmers, and that management could “write code”. It didn’t quite work out that way, of course, as we all know.
I see chat-based interfaces to LLMs going exactly the same way. The LLM will move down the stack (rather than up) and much more appropriate task-based UX/UI will be put on top of the LLM, coordinated through a UX/UI layer that is much more sympathetic to the way users actually want to interact with a machine.
In the same way that no end-users ever touch SQL these days (mostly), we won’t expose the chat-based UX of an LLM to users either.
There will be a place for an ad-hoc natural language interface to a machine, but I suspect it’ll be the exception rather than the rule.
I really don’t think there are too many end users who want to be forced to seduce a mercurial LLM using natural language to do their day-to-day tech tasks.
Only when someone discovers another paradigm that matches or exceeds the effectiveness of LLMs without being a language model.
If you asked me two or three years ago I would have strongly agreed with this theory. I used to point out that every line of code was a decision made by a programmer and that programming languages were just better ways to convey all those decisions than human language because they eliminated ambiguity and were much terser.
I changed my mind when I saw how LLMs work. They tend to fill in the ambiguity with good defaults that are somewhere between "how everybody does it" and "how a reasonably bright junior programmer would do it".
So you say "give me a logon screen" and you get something pretty normal with Username and Password and a decent UI and some decent color choices, and it works fine.
If you wanted to provide more details, you could tell it to use the background color #f9f9f9. But part of what surprised me, and caused me to change my mind on this matter, was that you could also leave that out and you wouldn't get an error; you wouldn't get white text on a white background; you would get a decent color that might be #f9f9f9 or might be #a1a1a1. You saved a lot of time by not thinking about that level of detail, and you still got a good result.
Right now we have a ton of AI/ML/LLM folks working on this first clear challenge: better models that generate better defaults, which is great—but also will never solve the problem 100%, which is the second, less-clear challenge: there will always be times you don't want the defaults, especially as your requests become more and more high-level. It's the MS Word challenge reconstituted in the age of LLMs: everyone wants 20% of what's in Word, but it's not the same 20%. The good defaults are good except for that 20% you want to be non-default.
So there need to be ways to say "I want <this non-default thing>". Sometimes chat is enough for that, like when you can ask for a different background color. But sometimes it's really not! This is especially true when the things you want are not always obvious from limited observations of the program's behavior—where even just finding out that the "good default" isn't what you want can be hard.
Too few people are working on this latter challenge, IMO. (Full disclosure: I am one of them.)
In your example, the issue is not with writing the logon screen (you can find several examples on GitHub, and a lot of CSS frameworks have form snippets). The issue is making sure that it works and integrates well with the rest of the project, as well as being easy to maintain.
Something like tldraw's "make real" [1] is a much better bet, imo (not that it's mutually exclusive). Draw a rough mockup of what you want, let AI fill in the details, then draw and write on it to communicate your changes.
We think multi-modally; why should we limit the creative process to just text?
[1] https://tldraw.substack.com/p/make-real-the-story-so-far
Here is an example of our approach:
https://blog.codesolvent.com/2024/11/building-youtube-video-...
We are also using the requirements to build a checklist, the AI generates the checklist from the requirements document, which then serves as context that can be used for further instructions.
Here's a demo:
The mode that I've found most fruitful when using Cursor is treating it almost exactly as I would a pair programming partner. When I start on a new piece of functionality I describe the problem and give it what my thoughts are on a potential solution and invite feedback. Sometimes my solution is the best. Sometimes the LLM had a better idea and frequently we take a modified version of what one of us suggested. Just as you would with a human partner. The result of the discussion is better than what either of us would have done on their own.
I also will do classical ping-pong-style TDD with it once we've agreed on an approach. I'll write a test; the LLM makes it pass and writes the next test, which I'll then make pass, and so on.
As with a real pair, it's important to notice when they are struggling and help them or take over. You can only do this if you stay fully engaged and understand every line, just like when pairing. I've found LLMs frequently get into a loop where something doesn't work and they keep applying the same changes they've tried before, which never work. Understand what they are trying to do and help them out. Don't be a shitty pair for your LLM!
It gets even funner when you try to get other models to fix whatever is broken and they too get caught in the same loop. I’ll be like “nope! Your buddy ChatGPT said the same thing and got stuck in such and such loop. Clearly whatever you are trying isn’t working so step back and focus on the bigger picture. Are we even doing this the right way in the first place?”
And of course it still walks down the loop. So yeah, better be ready to fix that problem yourself cause if they all do the same thing you are either way off course or they are missing something!
It will slowly grow in complexity, strictness, and features, until it becomes a brand-new programming language, just with a language model and a SaaS sitting in the middle of it.
A startup will come and disrupt the whole thing by simply writing code in a regular programming language.
> Looking for a low level engineer, who works close to the metal, will work on our prompts
https://aider.chat/docs/usage/watch.html
How jarring it is & how much it takes you out of your own flow state is very much dependent on the model output quality and latency still, but at times it works rather nicely.
I think it's more ideal to have the LLM map text to some declarative pseudocode that's easy to read which is then translated to code.
The example given by Daniel might map to something like this:
    define sign-in-screen:
        panel background "#f9f9f9":
            input email required: true, validate-on-blur: true
            input password required: true
            button "Sign in" gradient: ("#EEE" "#DDD")
        connect-to-database
Then you'd use chat to make updates. For example, "make the gradient red" or "add a name field." Come to think of it, I don't see why chat is a bad interface at all with this setup. It decided to output something JSON-like, and maybe YAML once.
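As a sketch of what that intermediate representation might look like as typed data (all names invented here), where a chat instruction becomes a targeted patch to one field rather than a regenerated blob of code:

    type Widget =
      | { kind: "input"; name: string; required?: boolean; validateOnBlur?: boolean }
      | { kind: "button"; label: string; gradient?: [string, string] }
      | { kind: "panel"; background: string; children: Widget[] };

    interface Screen {
      name: string;
      root: Widget;
      effects: string[]; // e.g. "connect-to-database"
    }

    const signInScreen: Screen = {
      name: "sign-in-screen",
      root: {
        kind: "panel",
        background: "#f9f9f9",
        children: [
          { kind: "input", name: "email", required: true, validateOnBlur: true },
          { kind: "input", name: "password", required: true },
          { kind: "button", label: "Sign in", gradient: ["#EEE", "#DDD"] },
        ],
      },
      effects: ["connect-to-database"],
    };

"Make the gradient red" then edits one tuple instead of rewriting the screen.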
The example shows "Sign-in screen" with 4 (possibly more) instructions. This could equivalently have been entered one at a time into 'chat'. If the response for each was graphic and instantaneous, chat would be no worse than non-chat.
What makes non-chat better is that the user puts more thought into what they write. I do agree that for producing code, Claude with up-front instructions beats ChatGPT handily.
If, OTOH, AIs actually got as good as or better than humans, chat would be fine. It would be like a discussion in Slack or PR review comments.
1) The first thing to improve chats as a genre of interface, is that they should all always be a tree/hierarchy (just like Hacker News is), so that you can go back to ANY precise prior point during a discussion/chat and branch off in a different direction, and the only context the AI sees during the conversation is the "Current Node" (your last post), and all "Parent Nodes" going back to the beginning. So that at any time, it's not even aware of all the prior "bad branches" you decided to abandon.
2) My second tip for designs of Coding Agents is do what mine does. I invented a 'block_begin/block_end' syntax which looks like this, and can be in any source file:
    // block_begin MyAddNumbers
    function addNumbers(a, b) {
        return a + b;
    }
    // block_end
With this syntax you can use English to explain and reason about extremely specific parts of your code without expecting the LLM to "just understand". You can also direct the LLM to only edit/update specific "Named Blocks", as I call them.
So a trivial example of a prompt expression related to the above might be "Always put number adding stuff in the MyAddNumbers Block".
To explain entire architectural aspects to the LLM, these code block names are extremely useful.
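The extraction step can be quite simple; something like this simplified TypeScript sketch could do it before prompting (a sketch, not the agent's actual code):

    // Lifts "named blocks" delimited by `// block_begin NAME` and
    // `// block_end` out of a source file, so prompts can reference
    // code regions by name.
    function extractNamedBlocks(source: string): Map<string, string> {
      const blocks = new Map<string, string>();
      const re = /\/\/\s*block_begin\s+(\w+)\n([\s\S]*?)\/\/\s*block_end/g;
      let match: RegExpExecArray | null;
      while ((match = re.exec(source)) !== null) {
        blocks.set(match[1], match[2].trimEnd());
      }
      return blocks;
    }

Then extractNamedBlocks(fileText).get("MyAddNumbers") yields just that region, ready to be quoted in an instruction like "only edit the MyAddNumbers block".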
Proper context is absolutely everything when it comes to LLM use.
I've tested a few integrated AI dev tools and it works like a charm. I don't type all my instructions at once. I do it the same way as I do it with code. Iteratively:
1) Create a layout
2) Fill left side
3) Fill right side
4) Connect components
5) Populate with dummy data
> The first company to get this will own the next phase of AI development tools.
There are more than 25 companies working on this problem; they are already in production, and some are really good.
Everything else is just putting layers that are not nearly as capable as the LLM between me and the raw power of the LLM.
The core realization I made to truly unlock LLM code assistance as a 10x+ productivity gain is that I am not writing code anymore; I am writing requirements. It means being less an engineer and more a manager, or perhaps an architect. It's not your job to write the tax code anymore; it's your job to describe what the tax code needs to accomplish and how its success can be defined and validated.
Also, it's never even close to true that nobody uses LLMs for production software, here's a write-up by Google talking about using LLMs to drastically accelerate the migration of complex enterprise production systems: https://arxiv.org/pdf/2501.06972
Last night I wrote an implementation of an AI paper and it was so much easier to just discard the automatic chat formatting and do it "by hand": https://github.com/Xe/structured-reasoning/blob/main/index.j...
I wonder if foundation models are an untapped goldmine in terms of the things they can do, but we can't surface them to developers because everyone's stuck in the chat pattern.
Would you be so kind as to ELI5 what you did in that index.js?
I've used ollama to run models locally, but I'm still stuck in chat-land.
Of course, if a blog post is in the works, I'll just wait for that :)
The company I work for integrated AI into some of our native content authoring front-end components and people loved it. Our system took a lot of annotating to be able to accurately translate the natural language to the patterns of our system but users so far have found it WAYYY more useful than chat bc it's deeply integrated into the tasks they do anyway.
Figma had a similar success at last year's CONFIG when they revealed AI was renaming default layers names (Layer 1, 2, etc)... something they didn't want to do anyway. I dare say nobody gave a flying f about their "template" AI generation whereas layer renaming got audible cheers. Workflow integration is how you show people AI isn't just replacing their job like some bad sci-fi script.
Workflow integration is going to be big. I think chat will have its place tho; just kind of as an aside in many cases.
Then having AI generate code for my project didn't feel good either. I didn't really understand what it was doing, so I would have to read the code to understand it; then what's the purpose? I might as well write it myself.
I then started playing, and out came a new type of programming language called plang (as in pseudo language). It allows you to write the details without all the boilerplate code.
I think I've stumbled onto something, and it's just starting to get noticed :) https://www.infoworld.com/article/3635189/11-cutting-edge-pr...
In a real-world scenario, we begin with detailed specifications and requirements, develop a product, and then iterate on it. Chat-based interactions might be better suited to this iterative phase. Although I'm not particularly fond of the approach, it does resemble receiving a coworker's feedback, making a small, targeted change, and then getting feedback again.
Even if the system were designed to focus solely on the differences in the requirements—thus making the build process more iterative—we still encounter an issue: it tends to devolve into a chat format. You might have a set of well-crafted requirements, only for the final instruction to be, "The header should be 2px smaller."
Nonetheless, using AI in an iterative process (focusing on requirement diffs, for example) is an intriguing concept that I believe warrants further exploration.
That's the thing about language: you CAN'T program in human language, for this exact reason. Programming languages are mechanical but precise; human languages flow better, but they leave wiggle room. Computers can't do jack shit with wiggle room; they're not humans. That'll always remain, until there's an AI people like enough to let it have its own flair on things.
So far as this article is concerned (not the many commenters who are talking past it), "chat" is like interacting with a shell or a REPL. How different is the discussion that Winograd has with SHRDLU
https://en.wikipedia.org/wiki/SHRDLU
from the conversation that you have with a database through the SQL monitor, really?
There's a lot to say for trying to turn that kind of conversation into a more durable artifact. I'd argue that writing unit tests in Java I'm doing exploratory work like I'd do in a Python REPL except my results aren't scrolling away but are built into something I can check into version control.
On the other hand, workspace-oriented programming environments are notorious for turning into a sloppy mess. For instance, people really can't make up their minds whether they want to store the results of their computations (God help you if you have more than one person working on it, never mind if you want to use version control; yet isn't that a nice way to publish a data analysis?) or whether they want it to be a program that multiple people can work on, that produces reproducible results, etc.
See also the struggles of "Literate Programming"
Not to say there isn't an answer to all this but boy is it a fraught area.
English behaviour descriptions -> generated tests
Use both behaviour descriptions and feedback from test results to iterate on app development
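For instance, a one-line behaviour description might come back as a test like this (hypothetical module and names; vitest assumed as the runner):

    import { describe, it, expect } from "vitest";
    import { applyDiscount } from "./cart"; // hypothetical module under test

    // Behaviour description fed to the model:
    // "Orders over $100 get a 10% discount; it never applies to shipping."
    describe("discounts", () => {
      it("applies 10% to orders over $100", () => {
        expect(applyDiscount({ subtotal: 150, shipping: 10 }))
          .toEqual({ subtotal: 135, shipping: 10 });
      });

      it("leaves orders of $100 or less untouched", () => {
        expect(applyDiscount({ subtotal: 100, shipping: 10 }))
          .toEqual({ subtotal: 100, shipping: 10 });
      });
    });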
Absolutely insane how many doors were unlocked by being able to interact with a computer graphically, and yet these people have visions of the future stuck in the 60s.
Chat is an awesome powerup for any serious tool you already have, so long as the entity on the other side of the chat has the agency to actually manipulate the tool alongside you as well.
I think this post shows there could be a couple levels of indirection, some kind of combination of the "overarching design doc" that is injected into every prompt, and a more tactical level syntax/code/process that we have with something like a chat window that is code aware. I've definitely done some crazy stuff by just asking something really stupid like "Is there any way to speed this up?" and Claude giving me some esoteric pandas optimization that gave me a 100x speedup.
I think overall the tools have crazy variance in quality of output, but I think with some "multifacet prompting", ie, code styling, design doc, architect docs, constraints, etc you might end up with something that is much more useful.
So I completely agree with this. Chat is not a good UI.
Example of a Structured Pseudo-Code Prompt:
Let’s say you want to generate code for a function that handles object detection:
    '''
    Function: object_detection
    Input: image
    Output: list of detected objects

    Steps:
    1. Initialize model (load pretrained object detection model)
    2. Preprocess the image (resize, normalize, etc.)
    3. Run the image through the model
    4. Extract bounding boxes and confidence scores from the model's output
    5. Return objects with confidence greater than 0.5 as a list of tuples (object_name, bounding_box)

    Language: Python
    '''
Been experimenting with the same approach but for "paged shells" (sorry for the term override) and this seems to be a best of both worlds kinda thing for shells. https://xenodium.com/an-experimental-e-shell-pager That is, the shell is editable when you need it to be (during submission), and automatically read-only after submission. This has the benefit of providing single-character shortcuts to navigate content. n/p (next/previous) or tab/backtab.
The navigation is particularly handy in LLM chats, so you can quickly jump to code snippets and either copy or direct output elsewhere.
Chat is also iterative. You can go back there and fix things that were misinterpreted. If the misinterpretation happens often, you can add on another instruction on top of that. I strongly disagree that they'd be fixed documents. Documents are a way to talk to yourself and get your rules right before you commit to them. But it costs almost nothing to do this with AI vs setting up brainstorming sessions with another human.
However, the reasoning models (o1, R1 and such) are good at iterating with themselves, and work better when you give them documents and have them figure out the best way to implement something.
It has features to add context from your current project pretty easily, but personally I prefer to constantly edit the chat buffer to put in just the relevant stuff. If I add too much, Claude seems to get confused and chases down irrelevant stuff.
Fully controlling the context like that seems pretty powerful compared to other approaches I've tried. I also fully control what goes into the project - for the most part I don't copy paste anything, but rather type a version of the suggestion out quickly.
If you're fast at typing and use an editor with powerful text wrangling capabilities, this is feasible. And to me, it seems relatively optimal.
Many developers don't realize this but as you go back and forth with models, you are actively polluting their context with junk and irrelevant old data that distracts and confuses it from what you're actually trying to do right now. When using sleeker products like Cursor, it's easy to forget just how much junk context the model is constantly getting fed (from implicit RAG/context gathering and hidden intermediate steps). In my experience LLM performance falls off a cliff somewhere around 4 decent-sized messages, even without including superfluous context.
We're further separating the concept of "workflow" from "conversation" and prompts, basically actively and aggressively pruning context and conversation history as our agents do their thing (and only including context that is defined explicitly and transparently), and it's allowing us to tackle much more complex tasks than most other AI developer tools. And we are a lot happier working with models - when things don't work we're not forced to grovel for a followup fix, we simply launch a new action to make the targeted change we want with a couple clicks.
It is in a weird way kind of degrading to have to politely ask a model to change a color after it messed up, and it's also just not an efficient way to work with LLMs - people default to that style because it's how you'd interact with a human you are delegating tasks to. Developers still need to truly internalize the fact that LLMs are purely completion machines, that your conversation history lives entirely client-side outside of active inference, and that you can literally set your conversation input to be whatever you want (even if the model never said that). After that, you're on the path towards treating LLMs as "what words do I need to put in to get it to do what I want" rather than working "with" them.
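Concretely, against an OpenAI-compatible endpoint, the "history" is just an array you assemble client-side; you can prune it or even write the assistant's prior turn by hand. A bare-bones sketch (model name and prompt contents invented):

    // The "conversation" is client-side data; nothing stops you from
    // pruning it or fabricating the assistant's prior turn yourself.
    async function targetedEdit(fileSnippet: string): Promise<string> {
      const messages = [
        { role: "system", content: "You are a terse senior engineer." },
        // Only the context this one action needs; no accumulated junk.
        { role: "user", content: `Here is the function:\n${fileSnippet}` },
        // A fabricated assistant turn that anchors the reply style.
        { role: "assistant", content: "Understood. Send the change you want." },
        { role: "user", content: "Rename `tmp` to `pendingOrders`; change nothing else." },
      ];
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({ model: "gpt-4o-mini", messages }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }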
However, even with a "docs as spec" pattern, how can you control the actual quality of the code written? How maintainable will it be? If the spec changes (read: it _will_ change constantly), is it easy enough to refactor? What about tests? I also shrink in fear at the complexity of docs that could be _exactly_ captured as code... "well we almost always do it this way, but this one time we do it this way..."
- intellisense in the inputbox based on words in this or all previous chats and a user customizable word list
- user customizable buttons and keyboard shortcuts for common quick replies, like "explain more".
- when claude replies with a numbered list of alternatives let me ctrl+click a number to fork the chat with continued focus on that alternative in a new tab.
- a custom right click menu with action for selection (or if no selection claude can guess the context e.g. the clicked paragraph) such as "new chat with selection", "explain" and some user customizable quick replies
- make the default download filenames follow a predictable pattern; Claude currently varies it too much, e.g. "cloud-script.py" jumps to "cloud-script-errorcheck.py". I've tried prompting a format, but Claude seems to forget it.
- the stop button should always instantly stop claude in its tracks. Currently it sometimes takes time to get claude to stop thinking.
- when a claude reply first generates code in the right sidebar followed by detailed explanation text in the chat, let some keyboard shortcut instantly stop the explanation in its tracks. Let the same shortcut preempt that explanation while the sidebar code is still generating.
- chat history search is very basic. Add advanced search features, like filtering by date of first/last message and an OR search operator
- batch jobs and tagging for chat history. E.g. batch apply a prompt to generate a summary in each selected chat and then add the tag "summary" to them. Let us then browse by tag(s).
- tools to delete parts of a chat history thread, that in hindsight were detours
- more generally, maybe a "chat history chat" to have Claude apply changes to the chat histories
1. Ask AI to generate a spec of what we're planning to do.
2. Refine it until it kind of resembles what I want to do.
3. Ask AI to implement some aspects from the spec.
I used this single line to generate a 5 line Java unit test a while back.
test: grip o -> assert state.grip o
LLMs have wide "understanding" of various syntaxes and associated semantics. Most LLMs have instruct tuning that helps. Simplifications that are close to code work.
Re precision, yes, we need precision but if you work in small steps, the precision comes in the review.
Make your own private pidgin language in conversation.
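Roughly the kind of expansion that pidgin line produces; the original was Java, but a TypeScript analogue would look like this (module and names invented):

    import { it, expect } from "vitest";
    import { grip, state } from "./hand"; // hypothetical module under test

    // Expanded from the pidgin line: "test: grip o -> assert state.grip o"
    it("gripping o sets state.grip to o", () => {
      grip("o");
      expect(state.grip).toBe("o");
    });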
Emails are so similar to Chat, except we're used to writing in long-form, and we're not expecting sub-minute replies.
Maybe emails are going to be the new chat?
I've been experimenting with "email-like" interfaces (that encourage you to write more / specify more), take longer to get back to you, and go out to LLMs. I think this works well for tools like Deep Research where you expect them to take minutes to hours.
Chat is single threaded and ephemeral. Documents are versioned, multi-threaded, and a source of truth. Although chat is not appropriate as the source of truth, it's very effective for single-threaded discussions about documents. This is how people use requirements documents today. Each comment on a doc is a localized chat. It's an excellent interface when targeted.
Like with any coworker - when ideas get real, get out of chat and start using our tools and process to get stuff done.
I would also like to add a pair-programming feature, with it making comments over my shoulder while I code: a kind of smarter linter that doesn't lint one line, but has the entire project context.
Perhaps I should comment all todos and then write "finish todos" as the always-same text prompt.
And that's not even to say that I don't write code comments. When working on large legacy codebases, where you often need to do 'weird' things in service of business goals and timelines, a comment that explains WHY something was done the way it was is valuable. And I leave those comments all the time. But they're still a code smell.
Comments are part of your code. So they need to be maintained with the rest of your code. Yet they are also "psychologically invisible" most of the time to most programmers. Our IDEs even tend to grey them out by default for us, so that they get out of the way so we can focus on the actual implementation code.
This means that comments are a maintenance obligation that often get ignored and so they get out of sync with the actual code really fast.
They also clutter the code unnecessarily. Code, at its best, should be self-explanatory with extremely little effort needed to understand the intent of the code. So even a comment that explains why the code is weird is doing little more than shining a flashlight on smelly code without actually cleaning it up.
And don't get me started on "todo" comments. Most professional shops use some kind of project management tool for organizing and prioritizing future work. Don't make your personal project management the problem of other people that share and contribute to your codebase. There is zero rationale for turning shared code into your personal todo list. (and it should be obvious that I'm talking about checked in code .. if it's your working branch then you do you :) )
So if programming using LLMs is similar to writing comments (an interesting analogy I hadn't considered before), then maybe this is part of the reason I haven't found a problem that LLMs solve for me yet (when programming specifically). I just don't think like that when I'm writing code.
For writing, the canvas interface is much more effective because you rely less on copy and paste. For code, even with the ctrl+i method, it works but it's a pain to have to load all other files as reference every single time.
It's not really a conscious choice, but rather a side effect. And we already see the trend is away from that, with tools like chatGPT Canvas, editors like Windsurf, etc.
Once the models become fast enough to feel instantaneous, we'll probably begin to see more seamless interfaces. Who wants a pair programmer who goes "umm... ahh..." every time you type something? A proper coding assistant should integrate with your muscle memory just like autocomplete. Tab, tab, tab and it's done.
AI is in many ways more capable than a human programmer; in some ways it is not. It is not supersmart. It cannot hold the entire program in its head; you have to feed it a small relevant section of the program.
> That's why we use documents—they let us organize complexity, reference specific points, and track changes systematically.
Extra steps. Something like waterfall...
di(
Yet, millions of programmers use their mouse to SELECT something visually first and THEN delete whatever was selected. Shrug. I won't be surprised if chat-based programming is the next way of doing stuff.
- Speed up literature research
- Replace reading library documentation
- Generate copy-pasta code that has been written often before
The problem with this is that you need a gazillion menus, dialogs and options to find the modal that does _exactly_ what you want. Menus and the like are a means to an end; we don't really want them, but up until recently we couldn't live without them. With instruct-based computing, this is all changing.
But that is true? Devs spend more time in meetings than writing code. Having conversations about the code they are going to write.
When we're trying to wrangle a piece of code to do something we want but aren't quite sure of how to interact with the api, it's a different matter.
What I found is that by the time Copilot/GPT/DeepSeek has enough knowledge about the problem and my codebase, I've run out of tokens, because my head can contain a much larger problem area than these models allow me to feed them in a budget-friendly manner.
Back to... programming languages? :)
Theoretically maybe, but chat windows are getting the job done right now.
It could be quite fun!
Chat is so drastically far away from my workflow that it doesn't feel like my workflow is wrong.
So, something like Gherkin?
For higher-level AI assist, I do agree chat is not what makes sense. What I think would be cool is to work in markdown files, refining in precise plain english each feature. The AI then generates code from the .md files plus existing context. Then you have well-written documentation and consistent code. You can do this to a degree today by referencing a md file in chat, or by using some of the newer tools, but I haven't seen exactly what I want yet. (I guess I should build it?)
It's a problem of programming languages and definitions.
I suspect there's an 100 year old book describing what I'm saying but much more eloquently.
The level of precision required for highly complex tasks was never necessary before. My four year old has a pretty solid understanding of how the different AI tools she has access to will behave differently based on how she phrases what she says, and I've noticed she is also increasingly precise when making requests of other people.
The challenge is that I haven't seen anything better really.
Lately the innovation comes mainly from deeper integration with tools. Standalone AI editors are mainly popular with people who use relatively simple editors (like VS Code). VS Code has a few party tricks, but for me, swapping out IntelliJ for something else on a typical Kotlin project is a complete non-starter. Not going to happen. I'd gain AI, but I'd lose everything else that I use all the time. That would be a real productivity killer. I want to keep all the smart tooling I already have and have used for years.
There are a few extensions for IntelliJ, but they are pretty much all variations of a sidebar with a chat and autocomplete. That autocomplete competes with the normal autocomplete, which I use all the time. And the clippy-style "it looks like you are writing a letter" completions just aren't that useful to me at all. They are just noise and break my flow, and they drown out the completions I use and need all the time. And sidebars just take up space, and copying code from there back to your editor is a bit awkward as UX.
Lately I've been using ChatGPT. It started out pretty dumb, but these days I can option+shift+1 in a chat and have it look over my shoulder at my current editor. "How do I do that?" translates into a full context with my current editing window, cursor & selected text, etc. Before, I was copy-pasting everything and the kitchen sink to ChatGPT; now it just tells me what I need to do. The next step up from this is that it starts driving the tools itself. They already have a beta for this. This deeper integration is what is needed.
A big challenge is that most of these tools are driven to minimize cost and context size. Tokens cost money. So ChatGPT only looks at my active editor and not at the 15 other files I have open. It could. But it doesn't. It's also unaware of my project structure, or the fact that most of my projects are Kotlin multiplatform and can't use JVM dependencies. So, in that sense, every chat is still a bit Groundhog Day. Its promise to "remember" stuff when you ask it to is super flaky. It forgets most things it's supposed to remember pretty quickly.
These are solvable problems of course. But it's useful to me for debugging, analyzing, completing functions, etc.
We tried a pair-programming exercise, and he got visibly angry, flustered and frustrated when he tried to verbalize what he was doing.
One of the reasons Business Analysts and the like exist is that not everyone can bridge the gap between the messy, verbal real world, and the precision demanded by programming languages.
In their current form, LLMs are pretty much at their limit, barring optimizations and chaining them together for more productivity once we have better hardware. Still, they will just be useful for repetitive low-level tasks and mediocre art. We need more breakthroughs beyond transformers to approach something that creates like humans instead of relying on statistical inference.
How do you know that?
First of all, most people can't write extremely complex applications, period. Most programmers included. If your baseline for real programming is something of complexity equivalent to the U.S. tax code, you're clearly such a great programmer that you're an outlier, and you should recognize that.
Second of all, I think it's a straw man argument to say that you can either write prototype-level code with a chat UI, or complex code with documents. You can use both. I think the proposition being put forward is that more people can write complex code by supplementing their document-based thinking with chat-based thinking. Or, that people can write slightly better-than-prototype level code with the help of a chat assistant. In other words, that it's better to have access to AI to help you code small sections of a larger application that you are still responsible for.
I'd be more interested in reading a good argument against the value of using chat-based AI as another tool in your belt, rather than a straight-up replacement for traditional coding. If you could make that argument, then you could say chat is a bad UI pattern for dev tools.
Or the other way around: give the AI a design doc and have it generate what you want. This is still chatting, just more formal and lengthy.
I don't buy that a document could capture what is needed here. Imagine describing navigating through multiple levels of menus in document form. That sounds straight-up painful even for trivial apps. And for a full-blown app… nope.
There is a whole new paradigm missing there imo
Vague and prone to endless argument?
Eh, that's just copium, because we all have a vested monetary interest in them not being useful for "anything real", whatever that means. If it turns out they're useful for "real things", then the entire industry would get turned on its head. (Hint: they're useful for "real" things.) Putting the entire codebase into the context window doesn't currently work, though. Aider works past this by passing the directory tree and filenames as context, so the LLM can guess that /cloud/scope/cluster.go is where the cluster-scope code lives and ask for that specific file to be added to the context; then you can ask it to add, say, logging code to that file.
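The repo-map trick is roughly this, sketched in Python (the paths and prompt are invented, and Aider's real implementation is more sophisticated, e.g. it ranks symbols, not just paths):

```python
# Round one sends only the file tree; the model names the files it needs;
# round two adds those files' contents to the context.
import os


def repo_tree(root: str) -> str:
    """List relative file paths so the model can guess where code lives."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return "\n".join(sorted(paths))


prompt = (
    "Here is the repository layout:\n\n"
    + repo_tree(".")
    + "\n\nWhich files do you need to see to add logging to the cluster scope?"
)
# The model replies with e.g. "add cloud/scope/cluster.go"; only then does
# that file's content enter the context, and only then do you ask for the
# actual change.
```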
We played around with LLMs to build a chat experience. My first attempt made Claude spew out five questions at a time, which didn't solve the "guiding" problem. So I started asking it to limit the number of unanswered questions. It worked, but felt really clunky and "cheap."
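The constraint was just prompt-level, something like this minimal sketch using the `anthropic` Python SDK (the system-prompt wording and model name are guesses for illustration, not the actual production prompt):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Assumed wording: steer the model to keep at most one question in flight.
SYSTEM = (
    "You are guiding a customer through a solar panel quote. "
    "Ask at most ONE question per reply, and wait for the answer "
    "before asking the next one."
)

reply = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumption: any Claude model works
    max_tokens=300,
    system=SYSTEM,
    messages=[{"role": "user", "content": "I want solar panels on my roof."}],
)
print(reply.content[0].text)
```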
I drew two conclusions: We need UI builders for this to feel nice, and professionals will want to use forms.
First, LLMs would be great at driving step-by-step guides, but they must be given building blocks to generate a UI. When asking about location, show a map. When deciding whether to ask about TIN or roof size, if the user is technically inclined, perhaps start with the roof. When asking about the roof size, let the user draw the shape and assign lengths, or display aerial photos. The result on screen shouldn't be a log of me-you text messages, but a live-updated summary of where we are and what's remaining.
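One way to wire that up today is tool use: expose each widget as a tool and let the model pick which to show. A rough sketch with the `anthropic` SDK (the tool names and schemas are invented for illustration):

```python
import anthropic

# Each UI building block is declared as a tool the model may call.
UI_TOOLS = [
    {
        "name": "show_map",
        "description": "Render a map so the user can pick their location.",
        "input_schema": {
            "type": "object",
            "properties": {"center": {"type": "string"}},
        },
    },
    {
        "name": "draw_roof",
        "description": "Open a canvas where the user draws the roof shape "
                       "and assigns edge lengths.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumption
    max_tokens=500,
    tools=UI_TOOLS,
    messages=[{"role": "user", "content": "I think my roof is about 40 m2."}],
)
for block in msg.content:
    if block.type == "tool_use":
        # A real frontend would mount the widget and update the live
        # summary panel here, rather than appending to a transcript.
        print("render widget:", block.name, block.input)
```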
Second, professionals have an incentive to build mental models for navigating complex data structures. People who have no reason to invest time in the data model (e.g. a consumer buying a single solar panel installation in their lifetime) will benefit from rich LLM-driven UIs. Chat UIs might create room for a new type of computer user who doesn't use visual cues to build this mental model, but everyone else will want to stay on graphics. If you're an executive wondering how many sick days there were last month, that's a situation where a BI LLM RAG would be great. But if you're not sure what your question is, because you're hired to make up your own questions, then pointing, clicking, and massaging might make more sense.
Doc = programming in a DSL? (What was that language that was functional and represented as circles on a canvas?)
Writing a CRUD web API? Great! Writing business logic for a niche edge case in a highly specialized domain? Good luck.
What's a good interface?
There are a few things we try to balance to make a good UI/UX:
- Latency: How long it takes to do a single task
- Decision-tree pathing: How many tasks to meet a goal
- Flexibility/Configurability: How much of a task can be encapsulated by the user's predefined knowledge of the system
- Discoverability: What tasks are available, and where
The perfect NLP chat could accomplish some of these (a toy sketch of the shortcut idea follows this list):
- Flexibility/Configurability: Define/infer words and phrases that the user can use as shortcuts
- Decision-tree pathing: Define concepts that shortcut an otherwise verbose interaction
- Latency: Context-aware text-completions so the user doesn't need to type as much
- Discoverability: Well-formed introductions and clarifying questions to introduce useful interaction
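For the shortcut item above, the mechanics can be as simple as a phrase table expanded before the model ever sees the input; everything in this sketch is invented for illustration:

```python
# Toy "shortcuts" layer: known phrases expand to full instructions
# client-side, so the user types less and the model gets more.
SHORTCUTS = {
    "usual report": "Generate the monthly sales report grouped by region "
                    "and email it to the distribution list as a PDF.",
}


def expand(user_input: str) -> str:
    """Replace a known shortcut phrase with its full instruction."""
    return SHORTCUTS.get(user_input.strip().lower(), user_input)


assert expand("usual report").startswith("Generate the monthly")
```

The harder half is "infer": noticing that a user keeps typing the same long request and offering to turn it into a shortcut for them.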
This can only get us so far. What better latency can be accomplished than a button or a keyboard shortcut? What better discoverability than a menu?
The most exciting prospect left is flexibility. Traditional software is inflexible. It can only perform the interaction it was already designed with. Every design decision becomes a wall of assumption. These walls are the fundamental architecture of software. Without them, we would have nothing. With them, we have a structure that guides us along whatever assumptions were already made.
If we want to change something about our software's UI, then we must change the software itself, and that means writing. If NLP was a truly solved problem, then software compatibility and flexibility would be trivialized. We could redesign the entire UI by simply describing the changes we want.
LLMs are not even close. Sure, you can get one to generate some code, but only if the code you want generated is close enough to the text it was already trained on. LLMs construct continuations of tokens: no more, no less. There is no logic. There is no consideration about what is right or wrong: only what is likely to come next.
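To make "only what is likely to come next" concrete: strip everything else away and the core step is turning scores into a probability distribution and sampling from it, as in this toy sketch (made-up numbers, no real model involved):

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["return", "raise", "pass", "banana"]
logits = [2.1, 0.3, 0.9, -3.0]  # scores a model might assign in context
probs = softmax(logits)

# "return" is likely and "banana" is not, but nothing checks whether the
# likely token is *correct*; likelihood is the whole story.
print(dict(zip(vocab, (round(p, 3) for p in probs))))
print("sampled:", random.choices(vocab, weights=probs)[0])
```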
Like you said,
> You can’t build real software without being precise about what you want.
This is the ultimate limitation of UI. If only we could be ambiguous instead! LLMs let us do that, but the ambiguity stays permanent. There is no real way to tie an LLM back down to reality. No logic. No axioms. No rules. So we must either be precise or ambiguous. The latter option is an exciting development, and it certainly offers its own unique advantages, but it isn't a complete solution.
---
I've been thinking through another approach to the ambiguity problem that I think could really give us the expressive power of natural language, while preserving the logical structure we use to write software (and more). It wouldn't solve the problem entirely, but it could potentially move it out of the way.