The difficulty in fixing a bug is figuring out what's causing it. If you already know it's caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?
Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.
Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs", don't have nicely written bug reports.
A cool thing about LLMs is that they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.
I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.
To a non-programmer, putting in tests for myfunc(x) { return x + 2; } sounds useful, but in reality computers do not tend to have any issues performing basic algebra.
So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.
I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long expensive sessions that go down a wrong rabbit hole and ultimately fail.
It's great that you succeed on 12% of swe-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?
Also, I think swe-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?
I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to "solve", mainly because the tasks were under-specified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo's PR that weren't actually stated requirements of the task.
Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.
(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it; and for largely the same reasons)
https://github.com/princeton-nlp/SWE-agent/blob/main/config/...
Once I’m back on desktop I want to look at the git history of this file.
Think of it as a meta-prompt optimizer: it uses an LLM to optimize your prompts, which in turn optimize your LLM.
I'm working on a more general SWE task eval framework in JS for arbitrary languages and frameworks now (for starters, JS/TS, SQL, and Python), for my own prompt engineering product.
Hit me up if you are interested.
(not because bugs will be gone - because the cost of reviewing the PR vs the benefit gained to the project will be a substantial net loss)
At least from a maintainability perspective.
I would like to see if this implementation is less destructive or at least more suitable for a red-green-refactor workflow.
Hello world is 10GB, but even grandma can make hello worlds now.
While the overall goal is to build arbitrarily large, complex features and projects that are too much for ChatGPT or IDE-based tools, another aspect that I've put a lot of focus on is how to handle mistakes and corrections when the model starts going off the rails. Changes are accumulated in a protected sandbox separate from your project files, a diff review TUI is included that allows for bad changes to be rejected, all actions are version-controlled so you can easily go backwards and try a different approach, and branches are also included for trying out multiple approaches.
I think nailing this developer-AI feedback loop is the key to getting authentic productivity gains. We shouldn't just ask how well a coding tool can pass benchmarks, but what the failure case looks like when things go wrong.
How is your market testing going?
Do you have contracts with clients amenable to letting you write case studies? Do you need help selling, designing, or fulfilling these kinds of pilot contracts?
What are your plans for docs and PRs?
As a researcher, it's currently hard to situate plandex against existing research, or anticipate where a technical contribution is needed.
As a business owner, it's currently hard to visualize plandex's impact on a business workflow.
Are you open to producing a technical report? It could detail Plandex's methodology, benchmark results, ablation tests for key contributions, customer case studies, relevant research papers, and next steps / help needed.
What do you think?
If plandex is interested in being a fully open org, then I'd be interested in seeing it find its market footing and grow its technical capabilities. We need open source orgs like this!
The general workflow is to load some relevant context (could be a few files, an entire directory, a glob pattern, a URL, or piped in data), then send a prompt. Quick example:
plandex new
plandex load components/some-component.ts lib/api.ts package.json https://react.dev/reference/react/hooks
plandex tell "Update the component in components/some-component.ts to load data from the 'fetchFooBars' function in 'lib/api.ts' and then display it in a datagrid. Use a suitable datagrid library."
From there the plan will start streaming. Existing files will be updated and new files created as needed.

One thing I like about it for large codebases compared to IDE-based tools I've tried is that it gives me precise control over context. A lot of tools try to index the whole codebase, and it's pretty opaque--you never really know what the model is working with.
For instance, I recently learned how to replace setup.py with pyproject.toml across a large number of projects. I also learned how to publish packages to PyPI. These changes make projects noticeably easier to install and contribute to, and they're very easy to make.

The main thing that holds people back is that Python packaging documentation is notoriously cryptic. Well, I've already paid that cost, and now it's easy!
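For anyone curious what the modern setup looks like, here's a minimal sketch of a pyproject.toml that replaces a setup.py (project name, version, and dependencies are placeholders, not from any particular project):

```toml
# Minimal PEP 621 metadata; setuptools >= 61 reads the [project] table directly.
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "example-project"
version = "0.1.0"
description = "One-line description of the project."
requires-python = ">=3.8"
dependencies = [
    "requests",
]
```

With something like that in place, `python -m build` produces the sdist and wheel, and `twine upload dist/*` publishes to PyPI.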
So I'm thinking of finding projects that are healthy, but haven't focused on modernizing their packaging or distributing their project through pypi.
I'd build human + agent based tooling to help me find candidates, propose the improvement to existing maintainers, then implement and deliver.
I could maybe upgrade 100 projects, then write up the adventure.
Anyone have inspiration/similar ideas, and wanna brainstorm?
I'm kind of sad that future generations will not have that experience...
What's funny is that people on here & tech people in general seem to be the most averse to improving equity between all humans/stopping the obscenely rich from abusing and twisting the system. Do many HN peeps believe they're all somehow gonna become billionaires one day?
Unclogging sewers on the other hand…
This and Devin generate garbage code that will make any codebase worse.
It's a joke that 12.5% is even associated with the word "success".
Copilot, so far, is only good for predicting the next bit of code that follows familiar patterns.
0% -> 12% improvement is not bad for two years either (I'm somewhat arbitrarily picking the release date of ChatGPT). If this can be kept up for a few years we will have some extremely useful tooling. The cost can be relatively high as well, since engineering time is currently orders of magnitude more expensive than these tools.
Interesting scalability questions will arise with respect to security when scaling the already unmanageably large code bases by another order of magnitude (or two), though.
Edit: Yes, it does write some code that is "close" enough, but in some cases it is wrong, and in others it doesn't do exactly what was asked. I.e., it needs supervision from someone who understands the requirements, the code, and the problems that may arise from the naive line that the LLM is taking. Mind you, the more popular the issue, the better the line the LLM takes. So in other words, IMHO it is a glorified Stack Overflow. Just as there are engineers who copy-paste from SO without having any idea what the code does, there will be engineers who will just copy-paste from the LLM. Their work will be much better than if they used SO, but I think it's still nowhere near the mark of a senior SWE and above.
use std::fs::File;
use std::io::{self, BufReader, Read};

fn read_file_character_by_character(path: &str) -> io::Result<()> {
    // Open the file in read-only mode.
    let file = File::open(path)?;

    // Create a buffered reader to read the file more efficiently.
    let reader = BufReader::new(file);

    // `chars` method returns an iterator over the characters of the input.
    // Note that it returns a Result<(char, usize), io::Error>, where usize is the byte length of the char.
    for char_result in reader.chars() {
        match char_result {
            Ok(c) => print!("{}", c),
            Err(e) => return Err(e),
        }
    }

    Ok(())
}

fn main() {
    let path = "path/to/your/file.txt";
    if let Err(e) = read_file_character_by_character(path) {
        eprintln!("Error reading file: {}", e);
    }
}

(Also, the comment about the iterator element type is inconsistent with the code following it. Based on the comment, `c` would be of type `(char, usize)`, but then trying to print it with {} would fail because tuples don't implement Display.)
The typing of code is the easy part, even though it's a part a lot of folks are somewhat addicted to.

The things with far more value are applying value judgements to requirements, correlating and incorporating sparse and inaccurate diagnostic information into a coherent debugging strategy, and so on. There will come a time when LLMs can assist with these too, probably first on requirements distillation, but for more complex debugging tasks that's a novel problem-solving area where we've yet to see substantial movement.

So if you want to stave off the robots coming for you, get good at debugging hard problems, and learn to make really great use of tools that accelerate the typing out of solutions to baseline product requirements.
Also, I just asked Claude and Gemini and they both provided an implementation that matches the “bytes to UTF-8” rust docs. Assuming those are right, LLMs can do this (but I haven’t tested the code).
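For the curious, a working version along those "bytes to UTF-8" lines might look like the sketch below. It buffers the whole file rather than streaming, which is fine for small files; the function name mirrors the snippet upthread but this is my own sketch, not the LLMs' output:

```rust
use std::fs::File;
use std::io::{self, Read};

// Minimal sketch: read all the bytes, validate them as UTF-8,
// then iterate over the chars of the resulting String.
fn read_file_character_by_character(path: &str) -> io::Result<String> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;

    // String::from_utf8 fails on invalid UTF-8; map that to an io::Error.
    let text = String::from_utf8(bytes)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

    for c in text.chars() {
        print!("{}", c);
    }
    Ok(text)
}
```

A streaming version would need to handle multi-byte characters split across buffer boundaries, which is exactly the part the broken `reader.chars()` snippet hand-waves away.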
But, for most of the debates I've seen, I don't think the answer matters all too much.
Once we have models that can act as full senior SWEs, the models can engineer the models. And then we've hit the recursive case.
Once models can engineer models better and faster than humans, all bets are off. It's the foggy future. It's the singularity.
This is such an extremely bullish case that I'm not sure why you'd think it's even remotely possible. A Google search is usually more valuable than ChatGPT. For example, the rust utf-8 example is already verbatim solved on reddit: https://www.reddit.com/r/rust/comments/l5m1rw/how_can_i_effi...
I'm not saying that the whole "robots building better robots" thing is a pipedream, but given where things are today, this is not something that's going to happen soon.