Demo video: https://www.youtube.com/watch?v=X61NGZpaNQA
So far, we have dozens of open source projects and companies using Ellipsis. We seem to have landed in a kind of sweet spot where there’s a good match between the current capabilities of AI tools and the actual needs of software engineers - this doesn’t replace human review, but it saves you time by catching/fixing lots of small silly stuff.
Here’s an example in the wild: https://github.com/relari-ai/continuous-eval/pull/38. Ellipsis (1) adds a PR summary; (2) finds a bug and adds a review comment; (3) after a (human) user comments, generates a side PR with the fix; and (4) after a (human) user merges the side PR and adds another commit, re-reviews the PR and approves it.
Here’s another example: https://github.com/SciPhi-AI/R2R/pull/350#pullrequestreview-..., where Ellipsis adds several comments with inline suggestions that were directly merged by the developer.
You can configure Ellipsis in natural language to enforce custom rules, style guides, or conventions. For example, here’s how the `jxnl/instructor` repo uses natural language rules to make sure that docs are kept in sync: https://github.com/jxnl/instructor/blob/main/ellipsis.yaml#L..., and here’s an example PR that Ellipsis came up with based on those rules: https://github.com/jxnl/instructor/pull/346.
Installing into your repo takes 2 clicks at https://www.ellipsis.dev. You do have to sign up to try it out because we need you to authorize our GitHub app to read your code. Don’t worry, your code is never stored or used to train models (https://docs.ellipsis.dev/security).
We’d really appreciate your feedback, thoughts, and ideas!
Take a look at this PR for example: https://github.com/julep-ai/julep/pull/311
Ellipsis caught a bunch of things that would have come up only in code review later. It also got a few things wrong but they are easy to ignore. I like it overall, helpful once you get the hang of it although far from a “junior dev”.
I am still confused if vector size should be 1024 or 728 lol.
Indeed. Amazon originally advertised CodeGuru as being "like having a distinguished engineer on call, 24x7".[^1] That became a punchline at work for a good while.
I can definitely see the value of a tool that helps identify issues and suggest fixes for stuff beyond your typical linter, though. In theory, getting that stuff out of the way could make for more meaningful human reviews. (Just don't overpromise what it can reasonably do.)
[^1]: https://web.archive.org/web/20191203185853/https://aws.amazo...
One of our biggest learnings is that state-of-the-art LLMs aren't good enough to write code autonomously, but they are good enough to be helpful during code review.
Would love to get your thoughts on sweep. Does it meet your expectations? If not, where does it fall short?
It’s not perfect even after that but gives me a good starting point and often just needs a minor change.
The other end of the spectrum is linting and tests, which catch errors before review.
Does Ellipsis have a role between these two? If so, what is the role?
Ellipsis will use your existing lint/test commands when making code changes. For example, you can leave a comment on a PR that says "@ellipsis-dev fix the assertion in the failing unit test" and Ellipsis will run the tests, observe the failure, make the code change, confirm the tests pass, lint-fix the code, and push a commit or open a new PR.
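For anyone who wants the shape of that loop in code, here is a minimal sketch. It is not Ellipsis's implementation: `fix_until_green`, `run_tests`, `propose_fix`, and `max_attempts` are hypothetical names chosen for illustration, and the real workflow (building the project, lint-fixing, pushing a commit) is reduced to two callables.

```python
from typing import Callable


def fix_until_green(
    run_tests: Callable[[], bool],   # hypothetical: returns True when the suite passes
    propose_fix: Callable[[], None], # hypothetical: applies one candidate code change
    max_attempts: int = 3,
) -> int:
    """Return how many fixes were applied before the tests passed.

    Raises RuntimeError if the tests still fail after max_attempts fixes,
    so a failing change is never reported as done.
    """
    for attempt in range(max_attempts + 1):
        if run_tests():
            return attempt
        if attempt < max_attempts:
            propose_fix()
    raise RuntimeError(f"tests still failing after {max_attempts} fix attempts")


# Toy usage: one "bug" that a single fix removes.
state = {"bug": True}
fixes = fix_until_green(
    run_tests=lambda: not state["bug"],
    propose_fix=lambda: state.update(bug=False),
)
print(fixes)
```

The point of the structure is that the tests are always re-run after the last change, so nothing is handed back on the strength of an untested edit.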
Why run it as part of a PR then? I'd prefer to run a tool like this before a PR is even open, and ideally on my local machine.
Only if they're already below average.
Interesting. Got any numbers how it affects team velocity?
Sure, Copilot speeds up human dev productivity, but our take is that humans should only be spending their time on the highest value code changes and use products like Ellipsis to handle the rest.
The downside of async code gen is that Ellipsis workflows take a few minutes to run because Ellipsis is actually building the project, running the tests, fixing its mistakes, etc. The upside is that a developer can have multiple workflows running at once, and each workflow delivers higher quality code because it's guaranteed to be working + tested.
I'm super bullish on async code gen. I think there's a whole category of tedious development tasks with unambiguous solutions that can be automated to the point where a human just needs to give it an LGTM.
As someone building (and frequently using) an AI coding tool that is very much focused on development time[1], I still also use GH Copilot and ChatGPT plus heavily as well. In a team setting, I could definitely see using my tool in conjunction with Ellipsis too. A feature built partly (or entirely) with AI still needs PR review. And an agent focused specifically on that job is likely going to be better at it than an agent designed for building new features from scratch.
One analogy I see today is the typical testing pipeline. Unit tests get run locally, then maybe a CI job runs unit tests + integration tests, then maybe there's a deployment job which does a blue/green release. At every stage the system is being "tested", but because the tests validate different capabilities, it's like concentric circles that grow confidence for the change.
A software dev lifecycle that uses AI dev tools is similar. Agents will review/contribute at the various stages of the SDLC, sometimes with overlap, but mostly additive and building on the output of one another.
Summary: The point of a summary is to explain "why" the change was made and highlight the unusual/non-trivial parts, and the examples absolutely fail there. Looking at the first one:
- Why was "generate" result type updated? Was it customer request or general cleanup or prep for some ongoing work?
- The other 3 points - are they logical fallout of the output type update, or are they separate changes? In the latter case, you really want to list the changes ("Updated examples to use more recent gpt-4 model", for example).
- What's the point of just saying "updating X in Y" if you don't say how? This is just visual noise duplicating the "file changes" tab in the PR.
Suggested changes: those are even worse - like https://github.com/relari-ai/continuous-eval/pull/38#discuss...
- This is an example file and you know where the "dataset" comes from. Why would you have non-serializable records to begin with?
- This changes semantics from "let programmer know in case of error" to "produce corrupted/truncated data file in case of error" - which generally makes debugging harder and gives people nasty surprises when their file is somehow missing records. Sure, sometimes this is needed, but in that particular file it's all downsides. This should not have been proposed at all.
- Even if you do want the check somehow, it's pretty terrible as written - the message does not include original error text nor bad object nor even line number. What is one supposed to do if they see it?
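To make the last two points concrete, here is a rough sketch of what a fail-loudly version could look like. It is not the code from that PR; `dumps_jsonl_strict` is a hypothetical helper name, and it only assumes Python's standard `json` module. A bad record stops the run, and the error names the record index, the offending object, and the original error text.

```python
import json
from typing import Any, Iterable


def dumps_jsonl_strict(records: Iterable[Any]) -> str:
    """Serialize records to JSONL text, raising with context on a bad record.

    Nothing is skipped silently: either every record is serialized,
    or the caller gets an error that says which one failed and why.
    """
    lines = []
    for index, record in enumerate(records):
        try:
            lines.append(json.dumps(record))
        except TypeError as err:  # json.dumps raises TypeError for unsupported types
            raise ValueError(
                f"record {index} is not JSON-serializable: {record!r} ({err})"
            ) from err
    return "".join(line + "\n" for line in lines)


# A set is not JSON-serializable, so this fails on record 1 instead of dropping it.
try:
    dumps_jsonl_strict([{"a": 1}, {"b": {1, 2}}])
except ValueError as err:
    print(err)
```

Building the whole string before writing also means a failure cannot leave a truncated file behind.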
----
And I know some people say "you are supposed to review AI-generated content before submitted", but I am also sure many new users will think the kind of crap advice that AI generates is OK.
Ellipsis authors: please stop making open source worse. Buggy patches are worse than no patches, and 1 line hand-written summary is better than AI-generated useless one.
Maintainers: don't install this in your repo if you don't want crap PRs.
> And I know some people say "you are supposed to review AI-generated content before submitted", but I am also sure many new users will think the kind of crap advice that AI generates is OK.
Those who don't know better, think AI is awesome and will solve all their problems. Those who do know enough to see the flaws don't need AI either.
> Ellipsis authors: please stop making open source worse. Buggy patches are worse than no patches, and 1 line hand-written summary is better than AI-generated useless one.
I'll extend that to say "please stop making software worse." We were already drowning in mediocrity before AI accelerated it.
We're watching the development of idiocracy in real time. It'll start with "yeah, we're reviewing everything it generates" until a new generation stops thinking for itself completely and PRs switch from machine-sceptic to "who are you to argue against the computer?" (and not because the machine became so good that it's better than a human).
First, I'm pretty unimpressed with the PR descriptions. I'd be frustrated if my company adopted this and I started seeing PRs like the one from continuous-eval that you linked to first. It's classic LLM output: lots of words, not much actual substance. "This PR updates simple.py", "updates certain values". It's the kind of information that can be gleaned in the first 5 seconds of glancing through a PR, and if that creates the illusion that no more description is needed then we'll have lost something.
Second, in the same PR: when writing a collection to a jsonl file, I would expect an empty collection to give me an empty file, not no file. Further, I haven't looked at the rest of the context, but it seems extremely unlikely to me that dataset_generator.generate would somehow produce non-serializable objects, and a human would easily see that. These two suggestions feel at best like a waste of time and at worst wrong, and it's concerning to me that the habits this tool encourages led to the suggestions being uncritically adopted and incorporated and that this seemed to you to be a good example of the tool in use.
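For what it's worth, the behaviour I'd expect looks roughly like this. It's a sketch, not the code from the PR: `write_jsonl` is a hypothetical name and only the standard library is assumed. The file is opened before the loop, so zero records still produce a zero-byte file.

```python
import json
import os
import tempfile
from typing import Any, Iterable


def write_jsonl(path: str, records: Iterable[Any]) -> int:
    """Write records as JSONL and return how many were written.

    The file is always created, even when there are no records,
    so downstream code can tell "empty result" apart from "never ran".
    """
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.jsonl")
    written = write_jsonl(path, [])
    print(written, os.path.exists(path), os.path.getsize(path))
```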
The second PR you linked to is, I think, a better example, but I'm still not sold on it. Similar to the PR descriptions, I'm concerned that a tool like this would create an illusion of security against the "simple" problems (leaving reviewers to focus on the high level), whereas I'd hope that the human reviewer would still read every line carefully. And if they're reading every line carefully, then have we really saved them that much time by paying for an LLM reviewer to look over it first? Maybe the time it takes to type out a note about a misnamed environment variable.
PR descriptions should be about intention, not what changed (what changed is in the PR itself), so I agree there.
With the other stuff though, if you treat it as a linter, that might be fine? The problem is when you justify getting it as it'll save you engineering time, but it won't actually do that significantly unless you have a lot of people using it.
We handle this: if you set up a Dockerfile, we promise to return working, tested code: https://docs.ellipsis.dev/code#from-a-pr
JS/TS, Python, Java, C++, Ruby are highly supported.
[Edit] You can see a bunch of them here: https://github.com/search?q=%22ellipsis.dev%22&type=issues Nothing breathtaking unfortunately.
Example of PR review: https://github.com/getzep/zep-js/pull/67#discussion_r1594781...
Example of issue-to-PR: https://github.com/getzep/zep/issues/316
Example of bug fix on a PR: https://github.com/jxnl/instructor/pull/546#discussion_r1544...
If human participation is reduced, people could approve PRs without proper review.
Eventually this stochastic parrot could throw an ``rm -rv ${TEMP}/`` in there and you are roasted.
Ellipsis can't commit code without your permission and approval, so this particular parrot can't feather your filesystem.
For a solo engineer like me who's working in multiple codebases across multiple languages, it's excellent as another set of eyes to catch big and small things in a pull request workflow I'm used to (and it has caught more than a few). I'd argue even as a backstop for catching edge cases/screwups that may lead to wasting my time that it's already more than paid for itself.
What do you get? A bot that makes commits. I don’t want to keep those commits. I would rather just squash (rebase) those typo fixes into my original commit. Just noise.
What else do you get? It seems that you get to pretend that you’re in a Google Doc For Code.[1] If people want that then they should work towards real-time online code editors.
Take a step back. The point of code review is knowledge transfer, mentorship and guaranteeing code quality. You can certainly fulfill the last point with a program. And this is AI so of course you can in principle solve the first two as well. But if the underlying point is to transfer the knowledge from the senior humans (groan) to the junior humans, then how does your AI replace that? Because the point isn’t to transfer some general-purpose (AI) knowledge but the specific knowledge of those more experienced developers.
[1] Why do people want to turn regular programs (including AI) into some high-tech dance in the cloud where you tag and ping bots on GitHub? I’m at a loss.
First two code reviews show no feedback.