From the email...
"I haven't seen the attack yet, but git doesn't actually just hash the data, it does prepend a type/length field to it. That usually tends to make collision attacks much harder, because you either have to make the resulting size the same too, or you have to be able to also edit the size field in the header."
[...]
"I haven't seen the attack details, but I bet
(a) the fact that we have a separate size encoding makes it much harder to do on git objects in the first place
(b) we can probably easily add some extra sanity checks to the opaque data we do have, to make it much harder to do the hiding of random data that these attacks pretty much always depend on."
$ curl https://shattered.io/static/shattered-1.pdf | wc -c
422435
$ curl -s https://shattered.io/static/shattered-2.pdf | wc -c
422435
Second, the length is already being hashed into the content during computation of a SHA-1 hash. Look up Merkle-Damgard construction: https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_co...
There is a benefit to storing the length in the prefix as well, since it lets you avoid length extension attacks, but that doesn't make attacks "much harder".
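To make the point about the length already being hashed concrete, here is a rough sketch of the padding SHA-1 appends before processing the final block: a 0x80 byte, zero bytes up to 56 mod 64, then the message length in bits as a 64-bit big-endian integer. The helper names sha1_padding and padded are invented for this example; it only builds the padded message and does not run the compression function.

```python
import struct

def sha1_padding(message_len: int) -> bytes:
    """Padding SHA-1 appends to a message that is message_len bytes long."""
    # One 0x80 byte, then zeros until the total length is 56 mod 64,
    # then the message length in bits as a 64-bit big-endian integer.
    zeros = (55 - message_len) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack(">Q", message_len * 8)

def padded(message: bytes) -> bytes:
    """Message plus padding; the result is a whole number of 64-byte blocks."""
    return message + sha1_padding(len(message))

if __name__ == "__main__":
    block = padded(b"abc")
    print(len(block), block[-8:].hex())
```

Two colliding messages of equal length get identical padding, which is consistent with both shattered PDFs above being the same size.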
The more restrictive the serialization format of the hashed data, the harder it is to find a collision that’s valid in the given application context.
$ curl -s https://shattered.io/static/shattered-1.pdf | hexdump -n 512 -C
00000000 25 50 44 46 2d 31 2e 33 0a 25 e2 e3 cf d3 0a 0a |%PDF-1.3.%......|
00000010 0a 31 20 30 20 6f 62 6a 0a 3c 3c 2f 57 69 64 74 |.1 0 obj.<</Widt|
00000020 68 20 32 20 30 20 52 2f 48 65 69 67 68 74 20 33 |h 2 0 R/Height 3|
00000030 20 30 20 52 2f 54 79 70 65 20 34 20 30 20 52 2f | 0 R/Type 4 0 R/|
00000040 53 75 62 74 79 70 65 20 35 20 30 20 52 2f 46 69 |Subtype 5 0 R/Fi|
00000050 6c 74 65 72 20 36 20 30 20 52 2f 43 6f 6c 6f 72 |lter 6 0 R/Color|
00000060 53 70 61 63 65 20 37 20 30 20 52 2f 4c 65 6e 67 |Space 7 0 R/Leng|
00000070 74 68 20 38 20 30 20 52 2f 42 69 74 73 50 65 72 |th 8 0 R/BitsPer|
00000080 43 6f 6d 70 6f 6e 65 6e 74 20 38 3e 3e 0a 73 74 |Component 8>>.st|
00000090 72 65 61 6d 0a ff d8 ff fe 00 24 53 48 41 2d 31 |ream......$SHA-1|
000000a0 20 69 73 20 64 65 61 64 21 21 21 21 21 85 2f ec | is dead!!!!!./.|
The shattered attack was a so-called "identical prefix" collision, while the shambles paper's collision was a "chosen prefix" one. You can choose the prefix in both cases, but in the "chosen prefix" case the two colliding prefixes can be entirely different (and can be as long as you want, by the way: the attack doesn't cost more if the prefix is 4 KB vs 4 GB), while in the "identical prefix" case they have to be identical.
At a cost in the double-digit thousands of dollars, an attack that gets 10x or 100x harder is still cheap for state actors.
Assuming the NSA is at least a year or two ahead of the field, git should now accelerate its migration process.
The only thing that prefixing the length makes difficult is using the same prefix multiple times: you basically have to make up your mind about the type and length before mounting the shattered attack. Also, the prefix means you have to do your own shattered attack and can't use the PDFs that Google provided as proof of their project's success. The price tag for that seems to be 11k.
[1]: https://github.com/cr-marcstevens/sha1collisiondetection
But it sounds as if the cost of changing the hash algorithm is high. What are the impacts of this change? How many things would break if git just changed the algorithm with each new release? Does git assume that the hash algorithm is statically given to be SHA-1 or are there qualifiers on which algorithm is enabled/permitted/configured?
Git is moving to a flexible hash though. [1]
[1] https://stackoverflow.com/questions/28159071/why-doesnt-git-...
The Python community would freak out, lol.
Unless Linus really believes that git will be fine using SHA-1 for decades to come, I don't think it's very responsible to keep kicking the can down the road, waiting for the inevitable day when a viable proof-of-concept attack on git is published and people have to emergency-patch everything.
As I read the OP [1] a chosen-prefix collision attack such as this allows you to “edit the size field in the header”. Or am I missing something?
1. “A chosen-prefix collision is a more constrained (and much more difficult to obtain) type of collision, where two message prefixes P and P’ are first given as challenge to the adversary, and his goal is then to compute two messages M and M’ such that H(P || M) = H(P’ || M’), where || denotes concatenation.”
EDIT: On second thought I was missing something: the adversary is further constrained in the git case because it must find M and M’ of correct length (specified in P and P’). Linus is right (as usual), this probably makes it much harder.
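The definition quoted in the footnote can be played with on a toy scale. The sketch below is NOT the attack from the paper (which relies on cryptanalysis of full SHA-1); it is a plain brute-force birthday search against SHA-1 truncated to 24 bits, an assumption made purely so that it finishes quickly. The prefixes, the 8-character suffix format, and all function names are made up for illustration. The suffixes are fixed-width, so the total lengths are known before the search starts, which is the situation a length-carrying header would force.

```python
import hashlib
from itertools import count

TOY_BYTES = 3  # assumption: keep only 24 bits of SHA-1 so brute force is feasible

# Hypothetical prefixes, loosely modelled on the example elsewhere in the thread.
P_GOOD = b"tree GOOD\nauthor foo\n\nmerge me\0"
P_BAD = b"tree BAD\nauthor foo\n\nmerge me\0"

def toy_hash(data: bytes) -> bytes:
    """Deliberately weakened hash: the first TOY_BYTES bytes of SHA-1."""
    return hashlib.sha1(data).digest()[:TOY_BYTES]

def toy_chosen_prefix_collision(p1: bytes, p2: bytes):
    """Find m1, m2 such that toy_hash(p1 + m1) == toy_hash(p2 + m2)."""
    seen1, seen2 = {}, {}  # toy hash -> a suffix that produced it, per prefix
    for i in count():
        m = b"%08x" % i  # fixed-width suffix: lengths are decided up front
        h1, h2 = toy_hash(p1 + m), toy_hash(p2 + m)
        seen1[h1] = m
        seen2[h2] = m
        if h1 in seen2:
            return m, seen2[h1]
        if h2 in seen1:
            return seen1[h2], m

if __name__ == "__main__":
    m1, m2 = toy_chosen_prefix_collision(P_GOOD, P_BAD)
    print(m1, m2, toy_hash(P_GOOD + m1).hex())
```

In this toy setting, fixing the suffix length in advance costs nothing. Whether the same holds for the real attack depends on how its near-collision blocks are built, so this says nothing definitive about git.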
This argument seems sound to me.
People store things in git that aren't text. Therefore it's not safe.
I still feel like they really should've taken this problem more seriously, and earlier. The longer we wait, the more painful the migration will be when the day comes to move to a different hash function, because everybody knows that'll happen sooner or later. Two years ago we had a collision, now we have a chosen-prefix collision; how much longer until somebody actually manages to make a git object collision?
And keep in mind that public research is probably several years behind top secret state agency capabilities. Let's stop looking for excuses every time SHA-1 takes a hit and rip the bandaid off already. It's going to be messy and painful but it has to be done.
With this chosen-prefix attack, they chose two prefixes and generated collisions by appending some data. So your two prefixes just need to be "tree {GOOD,BAD}\nauthor foo\n\nmerge me\0"
The only thing preventing injecting a backdoor into a pull request now seems to be git's use of hardened sha1.