1. You can't just rely on documentation ("we never said we would guarantee this or that") to push back on your users' claims that you introduced a breaking change. If you care more about your documentation than your users, they will turn their backs on you.
2. However, if you start guaranteeing too much stability, innovation and change will become too costly or even impossible. In this instance, if the git team has to guarantee their hashes (which seems impossible anyway because it depends on the external gzip program), then they can never improve on compression.
Tough situation to be in.
I can only imagine someone going to great lengths to avoid such "a stable order of operations was never guaranteed" discussion by just randomizing the order of execution or something similar (I bet someone will then use that as a seed for prng).
edit: skipping the first paragraph led to repeating Hyrum's law.
It was an internal user interface, intended for employees of our company. Once upon a time, adding a new record meant adding it manually to multiple internal systems, so the internal UI had its own copy of the data. Then we built a single source of truth for this data, with an API that our application would query. The database table updates were abandoned, since the table had only ever been for internal use, but nobody ever bothered to remove the old table with a few hundred rows of stale data.
Two years later, we got the bug report. The users' manager was complaining that the dataset was incomplete, that it was impeding his work, and that it needed to be fixed asap.
It turned out that at some stage he had requested and been granted read-only access to that DB, and had been querying the records of user actions in that DB to track the volume and quality of work his subordinates did. And then at some other point he realised that he could join against this table to get readable labels rather than opaque identifiers for the types of data said reports were working on. Except, of course, the data was two years stale, so he was noticing an increasing number of "missing" labels in his report.
Said user escalated all the way to a VP of engineering before accepting that no, a private database is not a supported interface of our product.
......... Hyrum’s law?
Bingo. https://news.ycombinator.com/item?id=34631275#34636529
Why github doesn't cache archives instead of regenerating them on the fly is unclear, and maybe something the developers should address. Or maybe there was a cache and it got blown away by the change that caused the archive checksums to change.
Github could just generate the tarball once and store it in the same way as other release artifacts. But for some reason they chose not to.
In my experience this sort of simplistic proposal by an outsider is almost always born of ignorance about the complexities of the actual system.
A story about a Golang program that had assumed map iteration was uniformly random but it’s not, which caused a load balancer to assign work unevenly:
https://dev.to/wallyqs/gos-map-iteration-order-is-not-that-r...
[Figure: graph of the map iteration order’s distribution, showing that it’s not uniformly random]
You pull that and get a tarball that is presented to the world as an "official release". Looks like a file. Acts like a file. It's a file.
So now your package manager or reproducible build engine or whatever needs a reference to the "official source code release", and what do you point it to? That file, obviously. It's right there on the "release" page for the download. And of course you checksum it for security, because duh.
Then last week all of a sudden that file changed! Sure, it has the same contents. But the checksum that you computed in good faith based on the official release tarball doesn't match!
If there's a misunderstanding here, it's on github and not the users. They can't be providing official release tarballs if they won't guarantee consistency. "As documented", this feature was a huge footgun[1]. That's bad.
[1] Actually it's worse: as documented it's basically useless. If you can't externally validate the results of that archive file, then the only way to use it is to tell your users that they have to trust Microsoft not to do anything bad, because you can't make any promise about the file that they can verify!
The fact it looked like an immutable file is much more relevant though.
Pigs would fly first, but it's possible!
Uh huh.
> After a fruitful exchange with GitHub support staff, I was able to confirm the following (quoting with their permission):
>> I checked with our team and they confirmed that we can expect the checksums for repository release archives, found at /archive/refs/tags/$tag, to be stable going forward. That cannot be said, however, for repository code download archives found at archive/v6.0.4.
>> It's totally understandable that users have come to expect a stable and consistent checksum value for these archives, which would be the case most of the time. However, it is not meant to be reliable or a way to distribute software releases and nothing in the software stack is made to try to produce consistent archives. This is no different from creating a tarball locally and trying verify it with the hash of the tarball someone created on their own machine.
>> If you had only a tag with no associated release, you should still expect to have a consistent checksum for the archives at /archive/refs/tags/$tag.
> In summary: It is safe to reference archives of any kind via the /refs/tags endpoint, everything else enjoys no guarantees.
(posted 4 Feb 2022)
https://github.com/bazel-contrib/SIG-rules-authors/issues/11...
There's even a million linked PRs and issues where people went around and specifically updated their code to point to the URLs that were, nominally, stable.
I suspect that the GH employee who made these comments just misunderstood how these archives were being generated, or the behavior was depending on some internal implementation detail that got wiped away at some point. But if an employee at a big-ass company publicly says "yeah that's supported" to employees at another big-ass company, people are gonna take it as somewhat official.
That provides multiple advantages. Unlike GitHub’s unreliable automatically generated files, a fixed file can be hashed or cryptographically signed by the project (with SSH signatures, Signify, PGP, etc.), and later verified without having to extract the files first or check out the underlying repo.
Another thing many projects aren’t aware of: if your project uses Git submodules, anyone using GitHub’s autogenerated tarballs will be unable to build your software, because those don’t contain submodules.
Or how about this: Microsoft could provide that as a feature in their "official release" page for projects, which is exactly what we all thought that page was for in the first place.
Seriously: if archive links are unreliable they're basically useless anyway. Who wants tarballs in the modern world except for package management or build automation?
And this is easy enough to do automatically with GitHub actions, I have a workflow [1] which runs on each release to create a stable archive of the source and attaches it to the release.
[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...
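For anyone who wants to see what "create a stable archive once and attach it" might look like outside of a workflow file, here is a minimal Python sketch. It is not what the linked workflow does; the function name `build_stable_tarball` and the sample file names are made up for illustration. The idea is just to pin everything that normally varies (file order, timestamps, ownership, gzip header time) and then publish the resulting bytes instead of regenerating them.

```python
import gzip
import hashlib
import io
import tarfile


def build_stable_tarball(files: dict) -> bytes:
    """Pack {path: bytes} into a .tar.gz whose bytes do not depend on
    insertion order, wall-clock time, or the current user.

    Hypothetical helper for illustration only.
    """
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        for path in sorted(files):           # fixed member order
            info = tarfile.TarInfo(name=path)
            info.size = len(files[path])
            info.mtime = 0                   # no wall-clock timestamps
            info.mode = 0o644
            info.uid = info.gid = 0          # no local user/group ids
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[path]))
    # mtime=0 keeps the gzip header free of the current time as well.
    return gzip.compress(tar_buffer.getvalue(), compresslevel=9, mtime=0)


# Made-up release contents; publish `archive` and its digest together.
release = {"README.md": b"hello\n", "src/main.py": b"print('hi')\n"}
archive = build_stable_tarball(release)
print(hashlib.sha256(archive).hexdigest())
```

Note that this only makes the output repeatable for a given Python and zlib build. A different compressor version can still emit different bytes, which is exactly why the archive should be generated once and stored as a release asset rather than rebuilt on demand.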
Why not?
I cannot help but wonder if this change was forced upon github by Microsoft because gzip is GPL 3, maybe this other version is a clean room clone. We all know corporations hate GPLv3, including the large corporation I work for.
Furthermore, gzip isn't even necessarily the best tool to produce gzip data. If you want multi-core parallelism there's pigz, and if you're willing to trade higher CPU usage to get a better compression ratio you can use zopfli. I don't know the details of the implementation in git and whether it tries to leverage multi-threading or zopfli-like techniques, but the point stands that gzip isn't the final word on producing gzip data.
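To make that concrete, here is a small Python sketch using only the standard library. It does not reproduce what git or GitHub actually changed; it just shows that one payload can have more than one valid gzip encoding, here by varying the compression level. The payload is made-up sample data.

```python
import gzip
import hashlib

# Made-up sample payload standing in for a source tarball.
payload = b"".join(b"line %d: value %d\n" % (i, i * i % 97) for i in range(5000))

# Same input, two compressor settings. mtime=0 removes the timestamp
# from the gzip header so that it is not the source of any difference.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

same_content = gzip.decompress(fast) == gzip.decompress(best) == payload
same_bytes = fast == best

print("same content after decompression:", same_content)
print("same compressed bytes:", same_bytes)
print("sha256 (level 1):", hashlib.sha256(fast).hexdigest())
print("sha256 (level 9):", hashlib.sha256(best).hexdigest())
```

Both outputs decompress to the identical payload, but a checksum taken over the compressed file is tied to whichever encoder and settings produced it.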
As much as I distrust Microsoft, I don't think there were any ulterior motives here.
Didn't Google beat Hyrum's law by using their weight to force middleboxes to accept some variation in some datum of an http header or something?
Edit: hint: something about rotating a value for some number of decades. Either forcing the hand of middleboxes or CAs, I can't remember. In either case, it seemed like a real pain in the ass to keep the API observability concrete from hardening. :)
The other example of evading Hyrum's Law that comes to mind was when early JavaScript users of JSON observed how they could intersperse comments.
Crockford said he noticed people were using the comments to stuff preprocessing directives into JSON.
He then devised the most ingenious hack: He told people they weren't allowed to put comments in JSON. Then people stopped putting comments in JSON.
I'm starting to wonder whether Hyrum's Law is really more of a suggestion. :)
I'm certain there's some exploit waiting to subvert the decompress algorithm and substitute malicious content in place of the actual archive files.
- your HTTPS stack - gzip encoded HTTP - your sha256 program
Checksums still work and protect against malicious tarballs, which are generally riskier to unpack than plain stream compression / decompression. The server and client get the smaller file transfers, and compression improvements can evolve transparently by negotiating the transfer encoding. The server can still cache the encoded form to avoid needing to compress the same file repeatedly.
Seems like a win-win solution without requesting a drastic redesign of package managers everywhere, and everyone walks away having won the properties of the system they value.
Would probably want to store expected file size together with checksum to avoid the "compressed stream of endless zeroes" attack vector
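A rough Python sketch of that idea, assuming the package metadata stores both the size and the SHA-256 of the uncompressed tarball (the function name and parameters are hypothetical): decompress incrementally, hash the output, and bail out as soon as more bytes come out than were promised, so a tiny compressed stream of zeroes can't make the client inflate gigabytes before the hash check fails.

```python
import gzip
import hashlib
import zlib


def verify_gzip_stream(compressed: bytes, expected_size: int,
                       expected_sha256: str, chunk: int = 64 * 1024) -> bool:
    """Return True only if the gzip stream inflates to exactly
    expected_size bytes whose SHA-256 matches expected_sha256."""
    inflater = zlib.decompressobj(wbits=31)   # wbits=31 selects gzip framing
    digest = hashlib.sha256()
    total = 0
    data = compressed
    while not inflater.eof:
        try:
            # The second argument caps how much output one call may produce.
            out = inflater.decompress(data, chunk)
        except zlib.error:
            return False                      # corrupt stream
        total += len(out)
        if total > expected_size:
            return False                      # more data than promised: stop early
        digest.update(out)
        data = inflater.unconsumed_tail
        if not out and not data and not inflater.eof:
            return False                      # input ran out before the stream ended
    return total == expected_size and digest.hexdigest() == expected_sha256


# Made-up example data.
payload = b"hello world\n" * 1000
blob = gzip.compress(payload)
print(verify_gzip_stream(blob, len(payload), hashlib.sha256(payload).hexdigest()))
```

This sketch ignores any trailing bytes after the end of the gzip stream; a stricter client would probably want to reject those too.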
The main simplification is that there’s less work on the client side at scale - the file you download is the file you checksum. That’s different if package maintainers have to do it manually for each package.
Can you say more about the endless zeroes attack? Are you thinking about finding a sha256 collision? You have to keep computing the sha256 for every new byte which is expensive. And if that ever becomes practical, the ecosystem will switch to 384, 512 or 512/256. But sure, storing file size + hash is generally a good idea to make it that much harder (in practice no one bothers and this advice would apply regardless of compression or not because the expensive bit is the digest computation to find a collision)
I think this just made me realise the cause of an issue I was having with Swift Package Manager a few months back. We have a bunch of ObjC frameworks in our app that we don't want people to update anymore so we can rewrite them, and we just threw them all into a big umbrella project. For some reason we couldn't get the binary target URL to work on our self-hosted GitHub Enterprise instance, because the checksum would be different every time, but it worked perfectly for GitHub Cloud.
Is there anyone from GitHub here? Can you confirm that this is the cause of the issue for GH Enterprise?
But this is an example of a much weaker proposition: if you don't document your contract, then people will guess what the contract is and some of them will guess wrong.
(In fact in this case it seems it's more like "if you don't document your contract and your support staff sometimes say the behaviour is A, people will rely on the behaviour being A".)