Anyway, so... the xz project has been compromised for a long time, at least since 5.4.5. I see that this JiaT75 guy has been the primary guy in charge of at least the GitHub releases for years. Should we view all releases after he got involved as probably compromised?
My TLDR is that I would regard all commits by JiaT75 as potentially compromised.
Given the ability to manipulate git history, I am not sure a simple time-based revert is enough.
It would be great to compare old copies of the repo with the current state. There is no guarantee that the history wasn't tampered with.
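If anyone still has an old clone or mirror around, one concrete way to do that comparison is to diff the sets of reachable commits between the old copy and a fresh clone. A rough sketch in Python (the paths are placeholders, and matching hashes only rule out history rewriting, not malicious content that was there all along):

    import subprocess

    def rev_set(repo_path, ref="HEAD"):
        # Return the set of commit hashes reachable from ref in the given repo.
        out = subprocess.run(
            ["git", "-C", repo_path, "rev-list", ref],
            check=True, capture_output=True, text=True,
        )
        return set(out.stdout.split())

    old = rev_set("/backups/xz-old-clone")   # an old, assumed-good copy (placeholder path)
    new = rev_set("/tmp/xz-current-clone")   # a fresh clone of today's repo (placeholder path)

    print("commits only in the old copy (possibly rewritten away):")
    print("\n".join(sorted(old - new)) or "(none)")
    print("commits only in the current copy (added since the backup):")
    print("\n".join(sorted(new - old)) or "(none)")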
Overall, the only safe action would IMHO be to establish a new upstream from an assumed-good state, then fully audit it. At that point we should probably just abandon it and use zstd instead.
Xz is an implant of 7zip's LZMA(2) compression into a traditional Unix archiver skeleton. It trades long compression times and giant dictionaries (that need lots of memory) for better (“much-better-than-deflate”) compression ratios. Therefore, zstd, no matter how fashionable that name might be in some circles, is not a replacement for xz.
It should also be noted that those LZMA-based archive formats might not be considered state-of-the-art today. If you worry about data density, there are options for both faster compression at the same size, and better compression in the same amount of time (provided that the data is generally compressible). 7zip and xz are widespread and well tested, though, and allow decompression to be fast, which might be important in some cases. Alternatives often decompress much more slowly. This is also a trade-off between total time spent on X nodes compressing data and Y nodes decompressing data. When X is 1 and Y is in the millions (say, software distribution), you can spend A LOT of time compressing even for relatively minuscule gains without tipping the scales.
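To put rough numbers on that last point (every value below is invented purely for illustration):

    # Toy arithmetic for the X-compressors / Y-decompressors trade-off above.
    X = 1               # machines compressing the artifact once
    Y = 1_000_000       # machines that will download and decompress it

    # Hypothetical scheme A: fast compression, slower decode
    compress_a, decompress_a = 60.0, 2.0        # seconds
    # Hypothetical scheme B: much slower compression, slightly faster decode
    compress_b, decompress_b = 3600.0, 1.5      # seconds

    total_a = X * compress_a + Y * decompress_a
    total_b = X * compress_b + Y * decompress_b
    print(f"scheme A total CPU time: {total_a:,.0f} s")   # 2,000,060 s
    print(f"scheme B total CPU time: {total_b:,.0f} s")   # 1,503,600 s
    # The extra hour spent compressing is dwarfed by the half second saved
    # on each of a million decompressions (~500,000 s across the fleet).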
It should also be noted that many (or most) decoders of top-compressing archivers are implemented as virtual machines executing chains of transform and unpack operations, defined in the archive file, over pieces of data also saved there. Or, looking from a different angle, complex state machines initializing their state from complex data in the archive. The compressor tries to find the most suitable combination of basic steps based on the input data and stores the result in the archive. (This is taken to its logical conclusion in neural-network compression tools, which learn what to do with the data from the data itself.) As some people may know, implementing all that byte juggling safely and effectively is a herculean task, and compression tools have had exploits in the past because of that. Switching to a better solution might introduce a lot more potentially exploitable bugs.
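As a toy illustration of that "decoder driven by the archive" idea (this is not any real container format; the filter names and header layout are made up for this sketch), note that the decoder below does whatever the chain stored in the blob tells it to do, which is exactly why getting the real thing right is so hard:

    import json, zlib

    # name -> (forward transform, inverse transform); both invented for this toy
    FILTERS = {
        "deflate": (zlib.compress, zlib.decompress),
        "xor42":   (lambda b: bytes(x ^ 42 for x in b),
                    lambda b: bytes(x ^ 42 for x in b)),
    }

    def pack(data, chain):
        # Apply the filter chain, then store the chain itself in a tiny header.
        for name in chain:
            data = FILTERS[name][0](data)
        header = json.dumps(chain).encode()
        return len(header).to_bytes(4, "big") + header + data

    def unpack(blob):
        # The decoder's behaviour is defined entirely by what it reads from the blob.
        hlen = int.from_bytes(blob[:4], "big")
        chain = json.loads(blob[4:4 + hlen])
        data = blob[4 + hlen:]
        for name in reversed(chain):      # undo the filters in reverse order
            data = FILTERS[name][1](data)
        return data

    payload = b"hello hello hello" * 10
    assert unpack(pack(payload, ["xor42", "deflate"])) == payload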
You should use the ultra settings and >=19 as the compression level. E.g. Arch used 20; higher compression levels do exist, but at that point the gains were already under 1%.
It does beat xz for these tasks. It's just not at the default settings, as those are indeed optimized for the lzo-to-gzip/bzip2 range.
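If you want to check that on your own data rather than take anyone's word for it, here is a minimal sketch (it assumes the third-party python-zstandard package is installed; lzma, i.e. the xz format, ships with the standard library, and the input filename is a placeholder):

    import lzma
    import zstandard

    def ratios(data):
        xz_out = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)   # roughly xz -9e
        zstd_out = zstandard.ZstdCompressor(level=19).compress(data)   # zstd -19
        print(f"original: {len(data):>10} bytes")
        print(f"xz -9e:   {len(xz_out):>10} bytes")
        print(f"zstd -19: {len(zstd_out):>10} bytes")

    with open("some-tarball.tar", "rb") as f:   # substitute your own input
        ratios(f.read())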
Found it mentioned in https://github.com/facebook/proxygen/blob/main/build/fbcode_..., looks like it's going to be a cousin of zstd, but maybe for the stronger-compression use cases.
Rewritten history is not a real concern because it would have been immediately noticed by anyone updating an existing checkout.
> Overall, the only safe action would IMHO be to establish a new upstream from an assumed-good state, then fully audit it. At that point we should probably just abandon it and use zstd instead.
This is absurd and also impossible without breaking backwards compatibility all over the place.