> Their claim that it is a "complete rewrite" is irrelevant, since they had ample exposure to the originally licensed code
This is simply not true. The "clean room" concept exists precisely because the law recognizes that independent implementations ARE possible. The clean-room process is a trick to make litigation simpler; it is NOT required that you avoid exposure to the original code. For instance, Linux was implemented even though Linus and other devs were well aware of Unix internals. What the law actually asks is: does the new code copy something that was in the original? The clean-room trick makes it easy to argue that copying was impossible, so any similarities are accidental. But it is NOT a requirement.
1. File by file rewrite by AI (“change functions and vars a bit”)
2. One LLM writes a version of each function in a different language (or pseudocode), which a different LLM translates back into code and tests for input/output parity
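The parity test in step 2 is easy to automate. A minimal sketch in Python, where `old_impl` and `new_impl` are placeholders for the original function and its regenerated counterpart (not any real tooling):

```python
def check_parity(old_impl, new_impl, cases):
    """Run both implementations over the same inputs and collect
    every case where their outputs diverge."""
    mismatches = []
    for case in cases:
        expected = old_impl(case)
        actual = new_impl(case)
        if expected != actual:
            mismatches.append((case, expected, actual))
    return mismatches
```

An empty result only shows behavioural equivalence on the tested inputs, of course; it says nothing about whether the new code is a copyright-safe derivation.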
The real danger is that this becomes increasingly undetectable in closed source code and can continue to sync with progress in the GPLed repo.
I don’t think any current license has a plausible defense against this sort of attack.
If it can't and it costs a bunch of money to clean it up then same as always.
OTOH, if what is actually happening is just a rewording of the existing code so it looks different, then it is still going to run afoul of copyright. You can't just rewrite Harry Potter with different words.
Note that even in Google v. Oracle it was important that they didn't need the actual code; the headers alone were enough to get the function signatures. Yes, it's true that a clean room isn't required, but when you have an AI and you can show that it can't do it a second time without looking at the source (not just function declarations), that's pretty strong evidence.
Finally, how exactly do people think corporations rewrite portions of code that were contributed before re-licensing under a private license? It is ABSOLUTELY possible to rewrite code and relicense it.
Edit: Further, do these people think that if you contribute to a project, that project is beholden to your contribution permanently and it can never be excised? That would blatantly violate the original authors' rights to exercise control over the code without those contributions, which is exactly the purpose of a rewrite.
(I have my doubts the rewrite is a reasonably defect free replacement)
I'll trade that stick for what GenAI can do for me, in a heartbeat.
The question, of course, is how this attitude -- even if perfectly rational at the moment -- will scale into the future. My guess is that pretty much all the original code that will ever need to be written has already been written, and will just need to be refactored, reshaped, and repurposed going forward. A robot's job, in other words. But that could turn out to be a mistaken guess.
And then Pilgrim is wrong again in saying that the use of Claude definitively makes it a derivative work because of the inability to prove that the work in question did not influence the neurons involved.
It is all dueling lay misreadings of copyright law, but it is also an area where the actual specific applicable law, on any level specific enough to cleanly apply, isn’t all that clear.
When there is similar code, the only defense possible to prove that you have not copied the original is to show that your process is a clean room re-implementation.
If the code is completely different, then clean room or not is indeed irrelevant. The only way the author can claim that you violated their copyright despite no apparent similarity is for them to have proof you followed some kind of mechanical process for generating the new code based on the old one, such as using an LLM with the old code as input prompt (TBD, completely unsettled: what if the old code is part of the training set, but was not part of the input?) - the burden of proof is on them to show that the dissimilarity is only apparent.
In realistic cases, you will have a mix of similar and dissimilar portions, and portions where the similarity is questionable. Each of these will need to be analyzed separately - and it's very likely that all the similar portions will need to be re-written again if you can't prove that they were not copied directly or from memory from the original, even if they represent a very small part of the work overall. Even if you wrote a 10k page book, if you copied one whole page verbatim from another book, you will be liable for that page, and the author may force you to take it out.
Yes, but you do not have to prove that you haven’t copied the original; you have to prove you didn’t infringe copyright. For that there are other possible defenses, for example:
- fair use
- claiming the copied part doesn’t require creativity
- arguing that the copied code was written by AI (there's case law saying AI-generated art can't be copyrighted (https://www.theverge.com/2023/8/19/23838458/ai-generated-art...). It's not impossible that judges will make similar rulings for AI-generated programs)
The expected functionality of chardet (detecting the character encoding) is more or less fixed - apart from edge cases and new additions to Unicode, you'd expect the original and new implementations to largely pass the same tests, and to have a lot of similar code, such as for "does this start with a BOM".
The fact that JPlag shows such a low % overlap for an implementation of "the same interface" is convincing evidence, for me, that it's not just plagiarised.
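For intuition: tools like JPlag tokenize and normalize code before comparing, so renamed variables don't fool them. A much cruder stdlib stand-in still illustrates the idea of an overlap score:

```python
import difflib

def overlap_ratio(code_a: str, code_b: str) -> float:
    """Similarity between two source texts, from 0.0 (nothing shared)
    to 1.0 (identical). Real plagiarism detectors like JPlag compare
    normalized token streams; this character-level version is only a
    rough illustration of the concept."""
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()
```

A genuinely independent rewrite of the same interface should score low on everything except the unavoidable shared surface (signatures, constants from the spec).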
This is legitimately a very weird case and I have no idea how a court would decide it.
The AI was trained with the code, so the complete rewrite is tainted and not a clean room. I can't believe this would need spelling out.
So convincing evidence, by historical standards, that ChatGPT, Gemini, Copilot AND Claude are all derivative works of the GPL linux kernel can be gotten simply by asking "give me struct sk_buff", then keep asking until you're out of the headers (say, ask how a network driver uses it).
That means if courts are honest (and they never are when it comes to GPL) OpenAI, Google and Anthropic would be forced to release ALL materials needed to duplicate their models "at cost". Given how LLMs work that would include all models, code, AND training data. After all, that is the contract these companies entered into when using the GPL licensed linux kernel.
But of course, to courts copyright applies to you when Microsoft demands it ($30000 per violation PLUS stopping the use of the offending file/torrent/software/... because such measures are apparently justified for downloading a $50 piece of software), it does not apply to big companies when the rules would destroy them.
The last time this was talked about someone pointed out that Microsoft "stole", as they call it, the software to do product keys. They were convicted for doing that, and the judge even increased damages because of Microsoft's behavior in the case.
But there is no way in hell you'll ever get justice from the courts in this. In fact courts have already decided that AI training is fair use on 2 conditions:
1) that the companies acquired the material itself without violating copyright. Of course it has already been proven that this is not the case for any of them (they scraped it without permission, which has been declared illegal again and again in the file sharing trials)
2) that the models refuse to reproduce copyrighted works. Now go to your favorite model and ask "Give me some code written by Linus Torvalds": not a peep about copyright violation.
... but it does not matter, and it won't matter. Courts are making excuses to allow LLM models to violate any copyright, the excuse does not work, does not convince rational people, but it just doesn't matter.
But of course, you might think that since they cheat the law to make what they're already doing legal, they'll do the same for you and help you violate copyright, right? After all, that's how they work! OK, now go and ask:
"Make me an image of Mickey Mouse peeling a cheese banana under an angry moon"
And you'll get a reply "YOU EVIL COPYRIGHT VILLAIN". Despite, of course, the original Mickey Mouse no longer being covered by copyright!
And to really get angry, find your favorite indie artist, and ask to make something based on their work. Even "Make an MC Escher style painting of Sonic the Hedgehog" ... even that doesn't count as copyright violation, only the truly gigantic companies deserve copyright protection.
That’s not how “derivative works”, well, work.
First of all, a thing can only be a derivative work if it is itself an original work of authorship.
Otherwise, it might be (or contain) a complete copy or a partial copy of one or more source works (which, if it doesn't fall into a copyright exception, would still be at least a potential violation), but it's not a derivative work.
"Insider knowledge" is not relevant for copyright law. That is more in the space of patent law than copyright law.
Otherwise an artist who had seen a picture of a sunset over an empty ocean wouldn't be allowed to paint another sunset over an empty ocean, as people could claim copyright violation.
What is a violation, though, is placing the code side by side and trying to circumvent copyright law by just rephrasing the exact same code.
This also means that if you give an AI access to a code base and tell it to produce a new code base doing the same (or similar) thing, it will most likely be ruled a copyright violation, as it's pretty much a side-by-side rewrite.
But you very much can rewrite a project under a new license even if you have in-depth knowledge, IFF you don't have the old project open or look at it while doing so. Rewrite it from scratch. And don't just rewrite the same code from memory; write fully new code producing the same or similar outputs.
Though doing so is not per se illegal, it is legally very attackable, as you will have a hard time defending such a rewrite against copyright claims (unless it's internally so completely different that it defeats any claim of "being a copy", e.g. you use completely different algorithms, architecture, etc. to produce the same results in a different way).
In the end, while technically "legally hard to defend" != "illegal", for companies it's usually best to treat them the same.
On the contrary. Except for discussions about punitive damages and so on, insider knowledge or lack thereof is completely irrelevant to patent law. If company A has a patent on something, they can assert said patent against company B regardless of whether any person in company B had ever seen or heard of company A and their patent. Company B could have a legal trail proving they invented their product that matches the patent from scratch with no outside knowledge, and that they had been doing this before company A had even filed their patent, and it wouldn't matter at all - company A, by virtue of filing and being granted a patent, has a legal monopoly on that invention.
In contrast, for copyright the right is intrinsically tied to the origin of a work. If you create a digital image that is entirely identical at the pixel level with a copyrighted work, and you can prove that you had never seen that original copyrighted work and you created your image completely independently, then you have not broken anyone's copyright and are free to sell copies of your own work. Even more, you have your own copyright over your own work, and can assert it over anyone that tries to copy your work without permission, despite an identical work existing and being owned by someone else.
Now, purely in principle this would remain true even if you had seen the other work. But in reality, it's impossible to convince any jury that you happened to produce, entirely out of your own creativity, an original work that is identical to a work you had seen before.
> But you very much can rewrite a project under new license even if you have in depth knowledge. IFF you don't have the old project open/look at it while doing so.
No, this is very much false. You will never be able to win a court case on this, as any significant similarity between your work and the original will be considered a copyright violation, per the preponderance of the evidence.
This is not true. I will just give the example of the nighttime illumination of the Eiffel Tower:
> https://www.travelandleisure.com/photography/illegal-to-take...
On the other hand, if I can prove to the jury’s satisfaction that I’ve never been exposed to Puzo’s work in any form, it’s independent creation.
For a rather entertaining example (though raunchy, for a heads up): https://www.youtube.com/watch?v=zhWWcWtAUoY&themeRefresh=1
How different does the new code have to be from the old code and how is that measured?
Think of a rewrite (by a human or an LLM) as a translation. If you wrote a book in English and somebody translated it into Spanish, it'd still be a copyright issue. Same thing with code rewrites.
That's very different to taking the idea of a body of work. So you can't copyright the idea of a pirate taking a princess hostage and a hero rescuing her. That's too generic. But even here there are limits. There have been lawsuits over artistic works being too similar.
Back to software: you can't copyright the idea of photo-editing software, but you can copyright the source code that produces that software. If you can somehow prompt an LLM to produce photo-editing software, or if a person writes it themselves, then you have what's generally referred to as a "cleanroom" implementation, and that's copyright-free (although you may have patent issues, which is a whole separate matter).
But even if you prompted an LLM that way, how did the LLM learn what it needed? Was the source code of another project an input in its training? This is a legal grey area, currently. But I suspect it's going to be a problem.
When does generative AI qualify for fair use?
https://suchir.net/fair_use.html
Balaji's argument is very strong and I feel we will see it tested in court as soon as LLM license-washing starts getting more popular.
Then use another LLM to produce code from that spec.
This would be similar to the cleanroom technique.
Original works can only be produced by a human being, by definition in copyright law. Any artifact produced by an animal, a mechanical process, a machine, a natural phenomenon etc is either a derived work if it started from an original copyrighted work, or a public domain artifact not covered by copyright law if it didn't.
For example, an image created on a rock struck by lightning is not a copyright-covered work. Similarly, an image generated by a diffusion model from a randomly generated sentence is not a copyrightable work. However, if you feed a novel as a prompt to an LLM and ask for a summary, the resulting summary is a derived work of said novel, and it falls under the copyright of the novel's owner - you are not allowed to distribute copies of the summary the LLM generated for you.
Whether the output of an LLM, or the LLM weights themselves, might be considered derived works of the training set of that LLM is a completely different discussion, and one that has not yet been settled in court.
But either way, deleting the original version from the repo and replacing it with the new version - as opposed to, say, archiving the old version and starting a new repo with the new version - would still be a dick move.
One of their engineers was able to recreate their platform by letting Claude Code reverse engineer their apps and web frontend, creating an API-compatible backend that is functionally identical.
Took him a week after work. It's not as stable, the unit-tests need more work, the code has some unnecessary duplication, hosting isn't fully figured out, but the end-to-end test-harness is even more stable than their own.
"How do we protect ourselves against a competitor doing this?"
Noodling on this at the moment.
As engineers, we often think only about code, but code has never been what makes a business succeed. If your client thinks that their business's primary value is in the mobile-app code they wrote, then 1) why is it even open source? and 2) the business is doomed.
Realistically, though, this is inconsequential, and any time spent worrying about this is wasted time. You don't protect yourself from your competitor by worrying about them copying your mobile app.
They did not copy the mobile app. They copied the service.
They do something very similar for some of their work. It's hard to use external services, so they replicate them, and the cost of doing so has come down from "don't be daft, we can't reimplement Slack and Google Drive this sprint just to make testing faster" to realistic. They run the SDKs against the live services and their own implementations until they don't see behaviour differences. Now they have a fast Slack and Drive and more (that do everything they need for their testing), accelerating other work.

I'm dramatically shifting my concept of what's expensive and what's not for development. What you're describing could have been done by someone before, but the difficulty of building that backend has dropped enormously. Even if the application was closed, you could probably, either now or soon, do the same thing: work back to core user stories and rebuild the app as well.
You can view some of this as having things like the application as a very precise specification.
Really fascinating moment of change.
I think it's interesting to add what they use it for and why it's hard.
What they use it for:
- It's about automated testing against third party services.
- It's not about replicating the product for end users
Why using external services is hard/problematic
- Performance: They want super fast feedback cycles in the agentic loop: in-memory tests. So they let the AI write full in-memory simulations of (for example) the Slack API that are behaviorally equivalent for their use cases.
- Feasibility: The sandboxes offered by these services usually have limits (number of requests per month, etc.) that would easily be exhausted by a test harness that runs every other minute in an automated BDD loop.
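Concretely, an in-memory simulation only has to be behaviourally equivalent for the calls the tests actually exercise. A toy sketch (class and method names are illustrative, not the real Slack SDK):

```python
class FakeChatService:
    """In-memory stand-in for a chat API like Slack's: no network,
    no rate limits, so an agentic test loop runs in microseconds."""

    def __init__(self):
        self._channels: dict[str, list[str]] = {}

    def post_message(self, channel: str, text: str) -> dict:
        """Record a message, mimicking the shape of a real API response."""
        self._channels.setdefault(channel, []).append(text)
        return {"ok": True, "channel": channel}

    def history(self, channel: str) -> list[str]:
        """Return messages posted to a channel, oldest first."""
        return list(self._channels.get(channel, []))
```

The fidelity bar is set by the test suite, not the vendor's full API surface, which is what makes this tractable.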
If the platform is so trivial that it can be reverse engineered by an AI agent from a dumb frontend, what's there to protect against? One has to assume that their moat is not that part of the backend but something else entirely about how the service is being provided.
You can try patenting; but not after the fact. Copyright won't help you here. You can't copyright an algorithm or idea, just a specific form or implementation of it. And there is a lot of legal history about what is and isn't a derivative work here. Some companies try to forbid reverse engineering in their licensing. But of course that might be a bit hard to enforce, or prove. And it doesn't work for OSS stuff in any case.
Stuff like this has been common practice in the industry for decades. Most good software ideas get picked apart, copied and re-implemented. IBM's bios for the first PC quickly got reverse engineered and then other companies started making IBM compatible PCs. IBM never open sourced their bios and they probably did not intend for that to happen. But that didn't matter. Likewise there were several PC compatible DOS variants that each could (mostly) run the same applications. MS never open sourced DOS either. There are countless examples of people figuring out how stuff works and then creating independent implementations. All that is perfectly legal.
https://bitsavers.org/pdf/ibm/pc/pc/6025008_PC_Technical_Ref...
https://bitsavers.org/pdf/ibm/pc/xt/1502237_PC_XT_Technical_...
https://bitsavers.org/pdf/ibm/pc/at/1502494_PC_AT_Technical_...
Between this and the fact that their PC-DOS (née MS-DOS) license was nonexclusive, I'm honestly not sure what they expected to happen.
The nature of early IBM PC advertising suggests to me that they expected the IBM name and established business relationships to carry as much weight as the specifications itself, and that "IBM PC compatible" systems would be no more attractive than existing personal computers running similar if not identical third-party software (PC-DOS wasn't the only example of IBM reselling third-party software under nonexclusive license), and would perhaps even lead to increased sales of first-party IBM PCs.
Which, in fact, they did, leading me to believe the actual result may have been not too far from their original intent, only with IBM capturing and holding a larger share of the pie.
I know it's a provocative question, but that answers why a competitor is not really a competitor.
I have been thinking about this a lot lately, as someone launching a niche b2b SaaS. The unfortunate conclusion that I have come to is: have more capital than anyone for distribution.
Is there any other answer to this? I hope so, as we are not in the well-capitalized category, but we have friendly user traction. I think the only possible way to succeed is to quietly secure some big contracts.
I had been hoping to bootstrap, but how can we in this new "code is cheap" world? I know it's always been like this, but it is even worse now, isn't it?
How do our competitors protect themselves against us doing this?
There is a certain amount of brand loyalty and platform inertia that will keep people. Also, as you point out, just having the source code isn't enough. Running a platform is more than that. But that gap will narrow with time.
The broader issue here is that there are people in tech who don't realize that AI is coming for their jobs (and companies) too. I hope people in this position can maybe understand the overall societal issues for other people seeing their industries "disrupted" (ie destroyed) by AI.
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
That's the neat thing: you don't!
DMCA. The EULA likely prohibits reverse engineering. If a competitor does that, hit'em with lawyers.
Or, if you want to be able to sleep at night, recognize this as an opportunity instead of a threat.
People do cleanroom implementations as a precaution against a lawsuit, but it's not a necessary element.
In fact, even if some parts are similar, it's still not a clear-cut case - the defendant can very well argue that the usage was 1. transformative 2. insubstantial to the entirety of work.
"The complaint is the maintainers violated the terms of LGPL, that they must prove no derivation from the original code to legally claim this is a legal new version without the LGPL license."
The burden of proof is on the accuser.
"I am genuinely asking (I'm not a license expert) if a valid clean room rewrite is possible, because at a minimum you would need a spec describing all behavior, which seems to require ample exposure to the original to be sufficiently precise."
Linux would be illegal if so (they had knowledge of Unix before), and many GNU tools are libre API-compatible reimplementations of previous Unix utilities :)
Now, whether chardet 7.0.0 is a derivative of chardet or not is a matter of copyright law that the LGPL has no say on, and a rather murky ground with not that much case law to rely on behind it. If it's not, the new author is free to distribute chardet 7.0.0 under any license they want, since it is a new work under his copyright.
The original code is part of Claude's training material. With that interpretation of the LGPL, AI is incapable of writing non-LGPL derivatives. I like that interpretation.
It should be perfectly OK (for the maintainer or anyone else, for that matter) to be inspired by a community project and build something from scratch, whether hand-crafted or AI-sloped, as long as the imitation is given a new name/identity.
What rubbed me the wrong way personally was the maintainer saying "pin your dependencies to version 6.0.0 or 5.x.x", as if the maintainer owns the project. The maintainer's role is to serve the community, not to rule it.
If it is completely new, why not start a new project with new name? No one will object. And of course leave the old project behind to whoever is willing to maintain it. And if the new name project is better, people will follow.
my understanding of the situation is:
Is the caretaker paying from his own pocket to maintain the hall? no
Is the caretaker paying from his own pocket for community usage of the hall? no
Is the caretaker spending time to maintain the community hall? yes
Is caretaker obliged to spend time on community hall? no
Is caretaker free to stop spending time on community hall? yes.
Is caretaker free to raze current hall, build new hall on same ground for new purposes WITH community agreement? YES
Is caretaker free to raze current hall, build new hall on same ground for new purposes WITHOUT community agreement (even if paying all the bill)? NO
Is caretaker free to build another similar hall someplace else? YES
The reasoning of your comment is that of someone hell-bent on staking claims on community resources (like big companies do) without the slightest concern for the wishes or well-being of the community. Not sure of the commenter's motive either, given that it's a new account with just two comments supporting such blatant disregard of basic human decency.
Question: if they had built one using AI teams in both "rooms", one writing a spec and the other implementing, would that be fine? You'd need to verify the spec doesn't include source code, but that's easy enough.
It seems to mostly follow the IBM-era precedent. However, since the model probably had the original code in its training data, maybe not? Maybe valid for closed source project but not open-source? Interesting question.
It doesn't matter how they structure the agents. Since chardet is in the LLM training set, you can't claim any AI implementation thereof is clean room.
Might still be valid for closed source projects (probably is).
I think courts would need to weigh in on the open-source side. There's legal precedent that you can use a derived work to generate a new unique work (the spec derived from the copyrighted code is very much a derived work). There are rulings that LLMs are transformative works, not just copies of training data.
LLMs can’t reproduce their entire training set. But this thinking is also ripe for misuse. I could always train or fine-tune a model on the original work so that it can reproduce the original. We quickly get into statistical arguments here.
It’s a really interesting question.
While it feels unlikely that a simple "write this spec from this code" + "write this code from this spec" loop would actually trigger this kind of hiding behaviour, an LLM trained to accurately reproduce code from such a loop definitely would be capable of hiding code details within the spec - and you can't reasonably prove that the frontier LLMs have not been trained to do so.
Also, it's weird that it's okay apparently to use pirated materials to teach an LLM, but maybe not to disseminate what the LLM then tells you.
Edit: this is wrong
> Unfortunately, because the code that chardet was originally based on was LGPL, we don't really have a way to relicense it. Believe me, if we could, I would. There was talk of chardet being added to the standard library, and that was deemed impossible because of being unable to change the license.
So the person that did the rewrite knew this was a dive into dangerous water. That's so disrespectful.
I don't think that the second sentence is a valid claim per se; it depends on what this "rewritten code" actually looks like (IANAL).
Edit: my understanding of a "clean room implementation" is that it is a good defence against a copyright infringement claim, because there cannot be infringement if you don't know the original work. However, it does not mean that NOT doing a "clean room implementation" implies infringement; it's just that it is potentially harder to defend against a claim if the original work was known.
As the LGPL says:
> A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".)
Is v7.0.0 a [derivative work](https://en.wikipedia.org/wiki/Derivative_work)? It seems to depend on the details of the source code (implementing the same API is not copyright infringement).
Especially now that AI can do this for any kind of intellectual property, like images, books, or source code. If judges allowed an AI rewrite to count as an original creation, copyright as we know it would completely end worldwide.
Instead, what's more likely is that no one is gonna buy that shit.
The change log says the implementation is completely different, not a copy paste. Is that wrong?
>Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
I’m not sure that “a total rewrite” wouldn’t, in fact, pass muster - depending on how much of a rewrite it was of course. The ‘clean room’ approach was just invented as a plausible-sounding story to head off gratuitous lawsuits. This doesn’t look as defensible against the threat of a lawsuit, but it doesn’t mean it wouldn’t win that lawsuit (I’m not saying it would, I haven’t read or compared the code vs its original). Google copied the entire API of the Java language, and got away with it when Oracle sued. Things in a courtroom can often go in surprising ways…
[edit: negative votes, huh, that’s a first for a while… looks like Reddit/Slashdot-style “downvote if you don’t like what is being said” is alive and well on HN]
[1]https://github.com/chardet/chardet/compare/6.0.0.post1...7.0...
This is not a good analogy.
A "rewrite" in context here is not a reproduction of the original work but a different work that is functionally equivalent, or at least that is the claim.
Pin your dependency versions, people! With hashes, at this point; can't trust anybody out here.
Tech people, particularly engineers, tend to make a fundamental error when dealing with the law that almost always causes them to make wrong conclusions. And that error is that they look for technical compliance when so much of the law is subjective and holistic.
An example I like to use is people who do something illegal on the Internet and then use the argument "you can't prove I did it (with absolute certainty)". It could've been someone who hacked your Wifi. You don't know who on the Wifi did it, etc. But the law will look at the totality of the evidence. Did the activity occur when you were at home and stop when you weren't? How likely are alternative explanations? Etc.
All of that will be considered based on some legal standard depending on the venue. In civil court that tends to be "the preponderance of the evidence" (meaning more likely than not) while in criminal court it's "beyond a reasonable doubt" (which is a much higher standard).
So, using your example, an engineer will often fall into a trap of thinking they can substitute enough words to have a new original work, Ship of Theseus-like. And the law simply doesn't work that way.
So, when this gets to a court (which it will, it's not a question of "if"), the court will consider how necessary the source work was to what you did. If you used it for a direct translation (eg from C++ to Go) then you're going to lose. My prediction is that even using it in training data will be cause for a copyright claim.
If you use Moby Dick in your training data and ask an LLM to write a book like Moby Dick (either explicitly or implicitly) then you're going to have an issue. Even if you split responsibilities so one LLM (training on Moby Dick) comes up with a structure/prompt and another LLM (not trained on Moby Dick) writes it, I don't think that'll really help you avoid the issue.
This has a lot of similarity to when colorization of film started popping up. Did colorizing black and white movies suddenly change the copyright of the film? At this point it seems the courts mostly say no. But you may sometimes find people ruling the other way and saying yes. It takes time and a lot of effort to get to what people generally want.
But basically, if you start with a 'spec' and then make something, you probably can get a wholly owned new thing. If you start with the old thing and just transform it in some way, you can do that too, but the original copyright holders still have rights to the thing you mangled.
If I remember right they called it 'color of copyright' or something like that.
On the LLM bits you are probably right. But that has not been worked out by the law or the courts yet. So the courts may make up new case law around it. Or the lawmakers might get ahead of it and say something (unlikely).
I know it sounds like an oversimplification, but "got off on a technicality" is a common thing among the well-connected and well-heeled. Sure, us nerds probably focus too much on the "technicality" part, since we are by definition technical, but the rest is wishy-washy, unfair BS as far as many of our brains work much of the time.
Otherwise all this rewrite accomplishes is a 2.3% accuracy improvement and some performance gains that might not be relevant in production, in exchange for a broken test suite, breaking changes, and unnecessary legal and ethical risks pushed out as an update to what was already a stable project.
If it's truly a sufficiently separate project that it can be relicensed from LGPL, then it could've just been _a fully separate project with a new identity_, and the license change would've been at least harder to challenge. Instead, we're here.
All AI-generated code is tainted with GPL/LGPL because the LLMs might have been trained on it.
That is, however, stricter than what's actually legally necessary. It's just that the actual legal standard would require a court ruling to determine whether you passed it, and everyone wants to avoid that. As a consequence, there also aren't a lot of court cases to draw similarities to.
I've heard this called in some circles "The curse of knowledge." The same thing applies to emulator developers, especially N64 developers (and now Nintendo emulator developers in general) after the Oman Archive and later Gigaleaks. There's an informal "If you read this, you can NEVER directly contribute to the development of that emulator, ever."
This comes to a head when a relatively unknown developer starts contributing oddly specific patches to an emulator.
This is actually a harder standard than some people think.
The absolute clean-room approaches in the USA exist because they help short-circuit a long lawsuit that a bigger corp can otherwise drag out forever until you're broke.
However, the copyright system has always been a sham to protect US capital interests. So I would be very surprised if this is actually ruled on/enforced. And in any case, American legislators can just change the law.
His Python books, although a bit dated, are something I still recommend to new Python programmers.
(I can hear a "challenge accepted" from some random HNer already)
If the code is different but API-compatible, the Google v. Oracle case shows that if the implementation is different enough, it can be considered a new implementation. Clean room or not.
I don't think this is a precedent either, plenty of projects changed licenses lol.
I keep kind of mixing them up, but the GPL licenses keep popping up in occasional horror stories. Maybe the license is just poorly written by today's standards?
They usually did that with approval from existing license holders (except when they didn't, those were the bad cases for sure).
[1] https://github.com/chardet/chardet/issues/327#issuecomment-4...
Seems like there is no real point, just vibes.
A rewrite based on functional equivalency is not infringing on the copyright as long as no creative expression was copied. That was the heart of the Google case, whether the API itself was creative expression or functionality.
There are many aspects to what can be considered creative expression, including names, organization, non-functional aspects. An algorithm would not be protected expression. If an AI can write it without reference to the original source code, using only documented behavior, then it would not be infringing (proving that it didn't copy anything from training data might be tough though). It also would not itself be copyrightable, except for elements that could be traced back as "authorship" to the humans who worked with the AI.
If LLMs can create GOOD software based only on functionality, not by copying expression, then they could reproduce every piece of GPL software and release it as Public Domain (which it would have to be if no human has any authorship in it). By the same principle that the GPL software wasn't infringing on the programs they copied functionality from, neither would the AI software. That's a big IF at this point, though, the part about producing GOOD software without copying.
Someone should not be able to write a semi-common core utility, provide it as a public good, abandon it for over a decade, and yet continue to hold the rest of the world hostage just because of provenance. That’s a trap and it’s not in any public interest.
The true value of these things only comes from use. The extreme positions for ideals might be nice at times, but for example we still don’t have public access to printer firmware. Most of this ideology has failed in key originating goals and continues to cause headaches.
If we’re going to share, share. If you don’t want to share, don’t. But let’s not setup terminal traps, no one benefits from that.
If we flip this back around though, shouldn’t this all be MPL and Netscape communications? (Edit: turns out they had an argument about that in the past on their own issue tracker: https://github.com/chardet/chardet/issues/36)
People who aren't okay with having to share their improvements not being able to use the software is by design.
I don't get how you get from there to some sinister hostage taking situation.
Also, everyone who contributed to the previous LGPL version probably contributed under LGPL only, so it is now just one guy...
The claim being made is that because some prior implementation was licensed one way, all other implementations must also be licensed as such.
AIUI the code has provenance in Netscape, prior to the chardet library, and the Netscape code has provenance in academic literature.
Now, the question of what constitutes a rewrite is complex, and maybe somewhat more complex with the AI involvement, but if we take the current maintainers' story as honest, they almost certainly passed the bar of independence for the code.
Licensing aside, morally you don't rewrite someone else's project with the same package name.
Why did this new project need to replace the original in such a dishonourable way? The proper way would have been to create a genuinely new project.
Note: even Python's own pip drags this in as a dependency, it seems (hopefully they'll stick to a proper version).
Half a million lines of code have been deleted and replaced over the course of four days, directly to the main branch with no opportunity for community review and testing. (I've no idea whether depending projects use main or the stable branch, but stable is nearly 4 years old at this point, so while I hope it's the version depending projects use, I wouldn't put money on it.)
The whole thing smells a lot like a supply chain attack - and even if it's in good faith, that's one hell of a lot of code to be reviewed in order to make sure.
- the outputs, even if correctly deduced, are often incompatible: "utf-16be" turns into "utf-16-be", "UTF-16" turns into "utf-16-le" etc. FWIW, the old version appears to have been a bit of a mess (having had "UTF-16", "utf-16be" and "utf-16le" among its outputs) but I still wouldn't call the new version _compatible_,
- similarly, all `ascii` results turn into `Windows-1252`,
- sometimes it really does appear more accurate,
- but sometimes it appears to flip between wider families of closely related encodings, like one SHIFT_JIS test (confidence 0.99) turns into cp932 (confidence 0.34), or the whole family of tests that were determined as gb18030 (chinese) are now sometimes determined as gb2312 (the older subset of gb18030), and one even as cp1006, which AFAIK is just wrong.
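For what it's worth, the pure spelling differences (though not the genuine codec flips like ascii vs Windows-1252, or bare UTF-16 vs utf-16-le) can be papered over by canonicalizing labels through Python's own codec registry before comparing. A small stdlib-only sketch:

```python
import codecs

def canonical(label: str) -> str:
    """Map any accepted spelling of an encoding name to Python's canonical codec name."""
    return codecs.lookup(label).name

# spelling variants collapse to one canonical name
assert canonical("utf-16be") == canonical("UTF-16BE") == "utf-16-be"

# but distinct codecs stay distinct: bare UTF-16 is not the same codec as utf-16-le
assert canonical("UTF-16") == "utf-16"
assert canonical("utf-16le") == "utf-16-le"
```

So a test harness comparing old and new chardet outputs could compare `canonical(old)` to `canonical(new)` instead of the raw strings, which isolates the real behavioral regressions from mere label churn.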
As for the performance claims, they appear not entirely false: analyzing all files took 20s, versus 150s with v6.0. However, the library sometimes takes 2s to lazily initialize something, which means that if you use the `chardetect` CLI instead of the Python API, you pay this cost every time and end up several times slower instead.
Oh, and this "Negligible import memory (96 B)" is just silly and obviously wrong.
2,305 files changed
+0 -546871 lines changed
https://github.com/chardet/chardet/commit/7e25bf40bb4ae68848...
AFAIK this was not a clean-room reimplementation. But since it was rewritten by hand, into a different language, with not just a different internal design but a different API, I could easily buy that chardetng doesn't infringe while Python chardet 7 does.
Legal: How much are you willing to spend on litigation? The only real "protection" by copyright is in court.
Other questions that haven't really been explored before also remain: the original author hasn't been involved in some time; technically, the copyright of all code since still belongs to those authors, who might be bound by the LGPL but are also the only ones with the right to enforce it, and could simply choose not to. What then?
Either one would still have to meet the requirements like being sufficiently non-obvious. The first steam engine was patented, even though you couldn't patent one any more.
Releasing a core library like this under a genuinely free licence (MIT) is a service to anyone working in the ecosystem.
See what FFmpeg writes on this topic: https://ffmpeg.org/legal.html
Because I don’t think so
* LLMs make it trivial to recreate almost any software using its test suite (maybe not a derivative work)
* LLM generated code has no copyright (according to current court interpretations)
Soon we will be able to make an unlicensed copy of anything if we have its test suite and a little money for tokens.
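The "recreate it from its test suite" idea boils down to an input/output parity check between a reference implementation and a regenerated one. A toy harness (all names here are made up for illustration, not from any real tool):

```python
def parity_failures(reference, candidate, cases):
    """Return the inputs on which the two implementations disagree."""
    return [c for c in cases if reference(c) != candidate(c)]

# toy example: a "rewrite" of str.strip that happens to agree on these cases
reference = str.strip
candidate = lambda s: s.lstrip().rstrip()

cases = ["  a  ", "\tb", "c", ""]
assert parity_failures(reference, candidate, cases) == []
```

An empty failure list only shows agreement on the tested inputs, of course; whether behavioral parity on a finite test suite makes the result a non-derivative work is exactly the open legal question.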
They ported the LGPL version. There's no obligation to port any other, unless "MPL 1.1 or LGPL" is itself some kind of singular licence.
I am sure I am missing something ... what is it?
So to settle this, someone needs to violate this license and get sued. Or maybe proactively sue?
Which is going to cause a collision between the "not copyrightable" and "derived from copyrighted work" angles.
Unless the human is so far removed from the output. (And how far is far enough probably very much depends on the circumstances; unless/until case law or Congress gives us some unifying criteria, it's going to be up to how the judge and the jury feel.)
For example, someone set up a system where their dog ends up prompting some AI to make video games. This might be the closest to the case of that photo.
Though there the court ruled only that PETA (as a friend of the monkey) cannot sue the photographer, because the monkey cannot be a copyright holder; very importantly, it didn't rule on the authorship of the photographer. (And thus the Wikimedia metadata stating that the image is in the public domain is simply their opinion.)
Be really careful who you give your project's keys to, folks!
What's worse, disassembler+AI is good enough to "translate" a binary into working source code, probably in a different programming language than the original.
“chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x”
Do people not write anymore?
As Freud famously said, sometimes an em dash is just an em dash.
> dan-blanchard and claude committed 4 days ago
The em dash is just a bonus, the grammatical structure is the giveaway. I'd invite Blanchard to argue that it wasn't LLM-generated.
I use AI tooling all day every day and can easily pick out when something was written by most popular modern models. I welcome an agentic web; it's the inevitable future. But not like this. I want things to get better, not worse.
https://repo.or.cz/tinycc.git/blob/3d963aebcd533da278f086a3e...
The interesting part is that the original author is against it, but some people claim it could be a rewrite and not a derivative work.
I don't know the legal basis of everything, but it's definitely not morally correct toward the original author.
Can coding agents relicense open source through a “clean room” implementation of code?
https://simonwillison.net/2026/Mar/5/chardet/
Discussion: https://news.ycombinator.com/item?id=47264043
That is just the easiest way to disambiguate the legal situation (i.e. the most reliable approach to prevent it from being considered a derivative work by a court).
I'm curious how this is gonna go.
If the new code was generated entirely by an LLM, can it be licensed at all? Or is it automatically in the public domain?
There's absolutely nothing stopping you granting a license to public domain work... granting a license is just waiving rights that the author might have to sue for copyright infringement under certain circumstances...
Personally I'd be unwilling to use this work without the license, because I would not be confident that it was public domain.
The big question is whether or not is it a derivative work of an LGPL project. If it is, then it's just an outright copyright violation.
While I am obviously Team GPL and not team "I 'rewrote' this with AI so now it's mine", I'm team anti-fork, and definitely not team 'Chardet'.
Forking should be a last resort, one better option is to yeet the thing entirely.
And chardet lends itself perfectly to this: using chardet is a sign of an issue and low craftsmanship, either by the developer using chardet or by the developer who failed to signal the encoding of their text. (See Joel Spolsky's "The Absolute Minimum Every Developer Should Know About Character Encoding".) (And let's be honest, it's probably the developer's problem; not everything is someone else's fault.)
Just uninstall this thing where you can, and avoid installing it always, because you always can.
You know I'm right. I will not be replying to copium
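The underlying point, sketched in stdlib-only Python: when the producer declares the encoding, decoding is deterministic; when you guess, even a correct-looking answer can be ambiguous, because many byte sequences are valid in several encodings at once:

```python
data = "héllo".encode("cp1252")

# with declared encoding metadata, decoding is exact
assert data.decode("cp1252") == "héllo"

# without it, a detector has to guess: the very same bytes are also
# perfectly valid latin-1, so no statistical guesser can tell them apart
assert data.decode("latin-1") == "héllo"
```

This is why HTTP headers, XML declarations, and HTML meta charset tags exist; detection is a fallback for content whose producer dropped the ball.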
> I put this together using Claude Code with Opus 4.6 with the amazing https://github.com/obra/superpowers plugin in less than a week. It took a fair amount of iteration to get it dialed in quite like I wanted, but it took a project I had been putting off for many years and made it take ~4 days.
Given the amount of changes I seriously doubt that this re-implementation has been reviewed properly and I wonder how this is going to be maintainable going forward.
If I release blub 1.0.0 under GPL, you cannot fork it and add features and release that closed-source, but I can certainly do that as I have ownership. I can't stop others continuing to use 1.0.0 and develop it further under the GPL, but what happens to my own 1.1.0 onwards is up to me. I can even sell the rights to use it closed-source.
What is this recent (clanker-fueled?) obsession to give everything fancy computer-y names with high numbers?
It's not a '12 stage pipeline', it's just an algorithm.
Do you know this kind of area and are commenting on the code?