The difficulty in fixing a bug is figuring out what's causing it. If you already know it's caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?
Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.
Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs", don't have nicely written bug reports.
A cool thing about LLMs is that they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.
I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.
To a non-programmer, putting in tests for myfunc(x) { return x + 2; } sounds useful, but in reality computers do not tend to have any issues performing basic algebra.
So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.
I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long expensive sessions that go down a wrong rabbit hole and ultimately fail.
It's great that you succeed on 12% of swe-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?
Also, I think swe-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?
I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to "solve", mainly because the tasks were under-specified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo's PR that weren't actually stated requirements of the task.
Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.
(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it; and for largely the same reasons)
https://github.com/princeton-nlp/SWE-agent/blob/main/config/...
Once I’m back on desktop I want to look at the git history of this file.
Think of it as a meta-prompt optimizer: it uses an LLM to optimize your prompts, which in turn optimize your LLM.
I'm working on a more general SWE task eval framework in JS for arbitrary languages and frameworks now (for starters, JS/TS, SQL, and Python), for my own prompt engineering product.
Hit me up if you are interested.
(not because bugs will be gone - because the cost of reviewing the PR vs the benefit gained to the project will be a substantial net loss)
At least from a maintainability perspective.
I would like to see if this implementation is less destructive or at least more suitable for a red-green-refactor workflow.
Hello world is 10GB, but even grandma can make hello worlds now.
While the overall goal is to build arbitrarily large, complex features and projects that are too much for ChatGPT or IDE-based tools, another aspect that I've put a lot of focus on is how to handle mistakes and corrections when the model starts going off the rails. Changes are accumulated in a protected sandbox separate from your project files, a diff review TUI is included that allows for bad changes to be rejected, all actions are version-controlled so you can easily go backwards and try a different approach, and branches are also included for trying out multiple approaches.
I think nailing this developer-AI feedback loop is the key to getting authentic productivity gains. We shouldn't just ask how well a coding tool can pass benchmarks, but what the failure case looks like when things go wrong.
How is your market testing going?
Do you have contracts with clients amenable to letting you write case studies? Do you need help selling, designing, or fulfilling these kinds of pilot contracts?
What are your plans for docs and PRs?
As a researcher, it's currently hard to situate plandex against existing research, or anticipate where a technical contribution is needed.
As a business owner, it's currently hard to visualize plandex's impact on a business workflow.
Are you open to producing a technical report? It could detail Plandex's methodology, benchmark results, ablation tests for key contributions, customer case studies, relevant research papers, and next steps / help needed.
What do you think?
If plandex is interested in being a fully open org, then I'd be interested in seeing it find its market footing and grow its technical capabilities. We need open source orgs like this!
The general workflow is to load some relevant context (could be a few files, an entire directory, a glob pattern, a URL, or piped in data), then send a prompt. Quick example:
plandex new
plandex load components/some-component.ts lib/api.ts package.json https://react.dev/reference/react/hooks
plandex tell "Update the component in components/some-component.ts to load data from the 'fetchFooBars' function in 'lib/api.ts' and then display it in a datagrid. Use a suitable datagrid library."
From there the plan will start streaming. Existing files will be updated and new files created as needed.

One thing I like about it for large codebases compared to IDE-based tools I've tried is that it gives me precise control over context. A lot of tools try to index the whole codebase, and it's pretty opaque--you never really know what the model is working with.
For instance, I recently learned how to replace setup.py with pyproject.toml across a large number of projects. I also learned how to publish packages to PyPI. These changes make projects noticeably easier to install and contribute to, and they're very easy to make.

The main thing that holds people back is that Python packaging documentation is notoriously cryptic. Well, I've already paid that cost, and now it's easy!
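For anyone curious what the modern setup looks like, here's a minimal sketch of a pyproject.toml that replaces a setup.py (project name, version, and dependencies are placeholders, not from any particular project):

```toml
# Minimal PEP 621 metadata; setuptools >= 61 reads the [project] table directly.
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "example-project"
version = "0.1.0"
description = "One-line description of the project."
requires-python = ">=3.8"
dependencies = [
    "requests",
]
```

With something like that in place, `python -m build` produces the sdist and wheel, and `twine upload dist/*` publishes to PyPI.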
So I'm thinking of finding projects that are healthy, but haven't focused on modernizing their packaging or distributing their project through pypi.
I'd build human + agent based tooling to help me find candidates, propose the improvement to existing maintainers, then implement and deliver.
I could maybe upgrade 100 projects, then write up the adventure.
Anyone have inspiration/similar ideas, and wanna brainstorm?
I'm kind of sad that future generations will not have that experience...
What's funny is that people on here & tech people in general seem to be the most averse to improving equity between all humans/stopping the obscenely rich from abusing and twisting the system. Do many HN peeps believe they're all somehow gonna become billionaires one day?
Unclogging sewers on the other hand…
This and Devin generate garbage code that will make any codebase worse.
It's a joke that 12.5% is even associated with the word "success".
Copilot, so far, is only good for predicting the next bit of code that follows familiar patterns.
0% -> 12% improvement is not bad for two years either (I'm somewhat arbitrarily picking the release date of ChatGPT). If this can be kept up for a few years we will have some extremely useful tooling. The cost can be relatively high as well, since engineering time is currently orders of magnitude more expensive than these tools.
Interesting scalability questions will arise with respect to security when scaling the already unmanageably large code bases by another order of magnitude (or two), though.
Edit: Yes, it does write some code that is "close" enough, but in some cases it is wrong, and in others it doesn't do exactly what was asked. I.e., it needs supervision from someone who understands the requirements, the code, and the problems that may arise from the naive line that the LLM is taking. Mind you, the more popular the issue, the better the line the LLM takes. So in other words, IMHO it is a glorified Stack Overflow. Just as there are engineers who copy-paste from SO without having any idea what the code does, there will be engineers who will just copy-paste from the LLM. Their work will be much better than if they used SO, but I think it's still nowhere near the mark of a senior SWE and above.
use std::fs::File;
use std::io::{self, BufReader, Read};

fn read_file_character_by_character(path: &str) -> io::Result<()> {
    // Open the file in read-only mode.
    let file = File::open(path)?;

    // Create a buffered reader to read the file more efficiently.
    let reader = BufReader::new(file);

    // `chars` method returns an iterator over the characters of the input.
    // Note that it returns a Result<(char, usize), io::Error>, where usize is the byte length of the char.
    for char_result in reader.chars() {
        match char_result {
            Ok(c) => print!("{}", c),
            Err(e) => return Err(e),
        }
    }

    Ok(())
}

fn main() {
    let path = "path/to/your/file.txt";
    if let Err(e) = read_file_character_by_character(path) {
        eprintln!("Error reading file: {}", e);
    }
}

(Also, the comment about the iterator element type is inconsistent with the code following it. Based on the comment, `c` would be of type `(char, usize)`, but then trying to print it with {} would fail because tuples don't implement Display.)
The typing of code is the easy part, even though it's a part a lot of folks are somewhat addicted to.

The things with far more value are applying value judgements to requirements, correlating and incorporating sparse and inaccurate diagnostic information into a coherent debugging strategy, and so on. There will come a time when LLMs can assist with these too, probably first on requirements distillation, but for more complex debugging tasks that's a novel problem-solving area where we've yet to see substantial movement.

So if you want to stave off the robots coming for you, get good at debugging hard problems, and learn to make really great use of tools that accelerate the typing out of solutions to baseline product requirements.
Also, I just asked Claude and Gemini and they both provided an implementation that matches the “bytes to UTF-8” rust docs. Assuming those are right, LLMs can do this (but I haven’t tested the code).
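For the curious, a working version along those "bytes to UTF-8" lines might look like the sketch below. It buffers the whole file rather than streaming, which is fine for small files; the function name mirrors the snippet upthread but this is my own sketch, not the LLMs' output:

```rust
use std::fs::File;
use std::io::{self, Read};

// Minimal sketch: read all the bytes, validate them as UTF-8,
// then iterate over the chars of the resulting String.
fn read_file_character_by_character(path: &str) -> io::Result<String> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;

    // String::from_utf8 fails on invalid UTF-8; map that to an io::Error.
    let text = String::from_utf8(bytes)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

    for c in text.chars() {
        print!("{}", c);
    }
    Ok(text)
}
```

A streaming version would need to handle multi-byte characters split across buffer boundaries, which is exactly the part the broken `reader.chars()` snippet hand-waves away.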
But, for most of the debates I've seen, I don't think the answer matters all too much.
Once we have models that can act as full senior SWEs, the models can engineer the models. And then we've hit the recursive case.
Once models can engineer models better and faster than humans, all bets are off. It's the foggy future. It's the singularity.
This is such an extremely bullish case that I'm not sure why you'd think it's even remotely possible. A Google search is usually more valuable than ChatGPT. For example, the rust utf-8 example is already verbatim solved on reddit: https://www.reddit.com/r/rust/comments/l5m1rw/how_can_i_effi...
I'm not saying that the whole "robots building better robots" thing is a pipedream, but given where things are today, this is not something that's going to happen soon.