None of this is to say that the legal profession shouldn't move to SHA2; it should.
Firstly ... the NIST recommendation TFA links to doesn't just recommend SHA-3, it actually says "Federal agencies should use SHA-2 or SHA-3 as an alternative to SHA-1." SHA-2 and SHA-3 are both valid hash functions recommended by NIST. And while 3 is higher than 2, and SHA-3 is newer, in this case newer doesn't mean "better". Being based on Keccak and a sponge construction, SHA-3 provides "diversity" more than "improvement".
Secondly ... SHA2 is widely implemented in existing hardware, and it's currently just more efficient (and likely to remain so). So why waste power, especially on something you'll be doing in bulk?
O.k., so that's why SHA-2 and not SHA-3. But BLAKE is worth avoiding IMO ... because FISMA says that for Federal work, you have to use one of the algorithms NIST recommends in FIPS. Obviously the legal profession needs to be able to practice in Federal courts (Article III and administrative) ... so if you're going to pick a new standard, pick one of those (but not SHA-3!).
Lastly, and this is really an aside ... it's not uncommon for folks to think SHA3 and SHA384 are the same thing, but they are not. SHA384 is just a variant of SHA-2 with a 384-bit digest length and a correspondingly improved security margin. Other Federal standards, like CNSA, separately recommend SHA384 as a good minimum ... so it can be confusing, and I think it's understandable that some people assume SHA3 is just short for SHA384.
For SHA-2, the security (i.e. the difficulty of finding collisions) decreases as the length of the document grows, while for SHA-3 it stays constant.
Nevertheless, it is unlikely that typical legal documents are big enough for this to matter, except when the hashes would be e.g. for entire seized HDDs or SSDs, so SHA-2 is an acceptable choice for replacing MD5.
The only reason why SHA-3 has not become widespread is that, like AES, it requires hardware support for good performance, but for some reason Intel has not added SHA-3 instructions to the x86 ISA. Arrow Lake S, expected to launch at the end of this year, will add support for SHA-512 and for the Chinese standard hashes, but there is still no intention to add SHA-3 (as Arm already has).
Neither AES nor SHA-3 is recommendable on CPUs without dedicated instructions, due to low performance, but with hardware support they become faster than any alternatives. The difference between them is that nowadays only the cheaper microcontrollers lack AES support, while SHA-3 support is still seldom encountered.
I'd only let NIST and FIPS drive my crypto choices if I was being actively forced to do so by a government use case, and even so I'd be praying for the day that FIPS joins us in the modern era and stops tethering us to acronyms they could enumerate a decade ago.
The records being produced almost always exist, and the hash is being used to differentiate which is which and whether the copies you got are correct.
If an evil litigant produces fake documents instead of real ones, and you can't get access to the system they came from, no hash will save you - they can produce any hash they want and you can't verify against an original.
If you can verify against the original, forensics isn't going to look at a hash and say "hash matches my job is done".
Most of the time what matters is whether the evidence exists or not.
The tobacco companies lied and claimed they never had any evidence in the first place.
Assume instead they wanted to falsify the data to say that, as far as they could tell, smoking did not cause cancer.
That is not a hashing problem. It is a problem of making up fake studies that look real.
Essentially, the Game Boy expected a bitmap of the Nintendo logo to be present in the cartridge ROM, which was shown on screen at boot. It had to match a version stored in the Game Boy itself or else the game wouldn’t start.
The thinking (that I’m not sure was ever tested) was that someone producing a game that tried to trick consumers into thinking it was an official Nintendo product, would be liable for damages in a trademark lawsuit. Since the game would never start without an official Nintendo logo, the hope was to make the legal system enforce Nintendo’s licensing scheme.
If you do have access to the original system, md5 is usually not the thing in the way of falsifying evidence.
Most of the time they will claim the evidence does not exist at all, rather than try to falsify it.
Falsification requires making up lots of things that make sense historically, and humans to swear to them.
You also often have to falsify more than one system in a consistent way.
And you have to do all of this in a way that doesn't make the forensic specialist think everything looks really weird.
The examples given in the article rely on an attacker manipulating an image prior to it being hashed, or compromising their opponent's database. If you can (and are willing to) do these things, no hashing algorithm will help. They also assume that the hash is literally the only thing anyone will rely on in identifying a document, which is not how it will work in practice. (No witness has ever said "yes there was a letter, I don't remember what it looked like or what it said but I remember its MD5 hash was [...]".)
I don't think that hash colliding artifacts would necessarily be obvious. They could be in part of a file that ends up being ignored by parsers of the file format. Or it could be some low level noise in pixel values in scanned documents.
And do you think we trust md5, or do you think we have another tool that can compare documents? We do have such tools.
What prevents chicanery in discovery and disclosure is how annoying it was to go to law school and pass the bar compared to how fun it would be to work at Wendy’s for the rest of our careers.
Should we use a better hash function: probably. Is this a problem of not enough technology literacy in the profession? Well, not really; there are plenty of tech literate lawyers (hi!) we just have questionable tools just like you do in IT and we make do.
Same reason why blockchain is (almost) never the answer: that's not how human systems work.
Trust is centralized. Always had been.
The reason why people like us keep changing everything for security is specifically because we have no access to justice. Computer crimes are international and difficult to prosecute, so you might as well drop an algorithm like a hot potato if anyone - even just nation state actors - could break it. We build our rules out of code because we do not have access to the material they make laws out of.
That being said, continuing to use MD5 is utterly inexcusable.
Let’s ignore that no second preimage attack is currently known for MD5. The software the author links to has a FAQ that links to a paper that lays out the second preimage complexity for MD4:
https://who.paris.inria.fr/Gaetan.Leurent/files/MD4_FSE08.pd...
It takes 2^102 hashes to brute force this for MD4, which is weaker than MD5. A bitcoin Antminer K7 will set you back $2,000, and it gets 58 TH/s for sha256, which is slower than MD5 or MD4. Let’s ignore that MD5 is more complex than MD4, and let’s say conservatively that similar hardware might be twice as fast for MD5 (SHA256 is really only 20-30% slower on a cpu). It’ll take 2^102/58e12/2/60/60/24/365, or about 1.4 billion years to do a second preimage attack with current hardware. So you could do that 3 times before the sun dies.
If you want to reduce that to 1.4 years, you could maybe buy a billion K7’s for $2 trillion. And each requires 2.8kW so you’ll need to find 2.8 terawatts somewhere. That’s 34 trillion kWh for 1.4 years. US yearly energy consumption is 4 trillion kWh.
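Those figures are easy to sanity-check in a few lines of Python (all constants taken from the estimate above, including the assumed 2x MD5 speedup over SHA256 hardware):

```python
# Back-of-envelope check of the second-preimage cost estimate above.
HASHES = 2 ** 102          # estimated second-preimage work factor (MD4 paper)
RATE = 58e12 * 2           # one K7 at 58 TH/s, doubled for MD5 (assumption)
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years_one_unit = HASHES / RATE / SECONDS_PER_YEAR
print(f"one miner: {years_one_unit:.2e} years")   # on the order of 1.4 billion

units = 1_000_000_000                             # a billion K7s, ~$2 trillion
years_fleet = years_one_unit / units
kwh = units * 2.8 * years_fleet * 24 * 365        # 2.8 kW per unit
print(f"fleet of {units}: {years_fleet:.2f} years, {kwh:.2e} kWh")
```
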
It will be a while, probably decades or more, before there’s a tractable second preimage attack here.
Yes, there are stronger hashes out there than MD5, but for file verification (which is what it’s being used for) it’s fine. Safe, even. The legal folks should probably switch someday, and it’ll probably be convenient to do so since many crypto libraries won’t even let you use MD5 unless you pass a “not for security” argument.
But there’s no crisis. They can take their time.
The problem with this argument is that people often don't properly understand the security requirements of systems. I can't count the number of times I've seen people say "md5 is fine for use case xyz" where, in some counterintuitive way, it wasn't fine.
And tbh, I don't understand the urge people have to defend broken hash functions. Just use a safe one, even if you think you "don't need it". Choosing a secure hash function has no downsides, and it's far easier to do that than to actually show that you "don't need it" (instead of just having a feeling that you don't).
And in the unlikely event that you think performance matters (unlikely, because cryptographic hash functions are so fast that it's really hard to build anything where the difference between md5 and sha256 matters), even that's covered: blake3 is faster than md5.
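For a rough sense of how fast these functions already are, here is a stdlib-only micro-benchmark sketch (blake3 needs a third-party package, so this compares md5 and sha256 from hashlib; absolute numbers will vary by machine and OpenSSL build):

```python
import hashlib
import time

# Hash 100 MB of data with each algorithm and report throughput.
# Both typically run at hundreds of MB/s or more on modern CPUs,
# which is why the md5-vs-sha256 difference rarely matters.
data = b"\x00" * (100 * 1024 * 1024)

for name in ("md5", "sha256"):
    start = time.perf_counter()
    digest = hashlib.new(name, data).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data) / elapsed / 1e6:.0f} MB/s")
```
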
I can count many more times that people told me that md5 was "broken" for file verification when, in fact, it never has been.
My main gripe with the article is that it portrays the entire legal profession as "backwards" and "deeply negligent" when they're not actually doing anything unsafe -- or even likely to be unsafe. And "tech" apparently knows better. Much of tech, it would seem, has no idea about the use cases and why one might be safe or not. They just know something's "broken" -- so, clearly, we should update immediately or risk... something.
> Just use a safe one, even if you think you "don't need it".
Here's me switching 5,700 or so hashes from md5 to sha256 in 2019: https://github.com/spack/spack/pull/13185
Did I need it? No. Am I "compliant"? Yes.
Really, though, the main tangible benefit was that it saved me having to respond to questions and uninformed criticism from people unnecessarily worried about md5 checksums.
Help us out by describing a time when this happened. MD5's weaknesses are easily described, and importantly, it is still (second) preimage resistant.
I agree that upgrade is likely your best bet. But I've found the other direction of bad reasoning is a more pernicious trap to fall into. "My system uses bcrypt somewhere so therefore it is secure" and the like is often used as a full substitute for thinking about the entirety of the system.
The ideal discourse would not imply a binary sense of "safety" at all, much less for a function evaluated outside the context and needs of its usage....
Also, hand-wavy extrapolations from Bitcoin miners aren't a reliable estimate of how fast & energy-efficient dedicated MD5 hardware could become.
Pirate libraries are particularly important to preserve our cultural heritage in a transparent and trustworthy way. A role that traditional libraries sadly cannot fulfill due to draconian copyright laws, especially around digital books. With archive.org as notable exception.
Still, they should switch. SHA-1 is not good either.
It’s frankly broken that evidence-handling doesn’t have to follow the government's advice about hash function selection!
I informed her that since FIPS 140-2 is about physical properties of key creation and management, all the relevant layers in a cloud-only solution are simply in the wrong scope. And I added that I am allergic to the string "FIPS" in general. Even having it present in official contract language makes people leap into weird assumptions about supported and allowed algorithms.
Her response? "Oh, that makes sense."
Different checksum algorithms can provide better error detection for specific channel error models (potentially even with fewer bits). Non-cryptographic checksums are typically designed for various failure models like a burst of corrupt bits, trading off what they do/don't detect to better match detection of corruption in the data they will protect.
For example, if you know that there will be at most one bit flip in your message, a single bit checksum (parity check) is sufficient to identify that an error occurred, regardless of your message size. (Note that this is an illustrative example only, since, typically, messages have a certain number of errors for a certain number of message bits -- the expected number of errors depends on the size of the message.)
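That parity example can be sketched in a few lines (illustrative only; real non-cryptographic checksums like CRCs are designed for richer error models such as burst errors):

```python
# Illustrative single-bit parity checksum: it detects any odd number of
# bit flips (in particular, any single flip), regardless of message size.
def parity(data: bytes) -> int:
    bit = 0
    for byte in data:
        bit ^= bin(byte).count("1") & 1  # XOR in the parity of each byte
    return bit

message = bytearray(b"attack at dawn")
stored = parity(message)

message[3] ^= 0b00000100          # flip a single bit "in transit"
assert parity(message) != stored  # the one-bit checksum catches it

message[3] ^= 0b00000100          # flip it back
assert parity(message) == stored  # the intact message passes
```
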
Important real-life-facts.
There was no "give-me-an-appropriate-hash" function.
There was:
md5sum yourfile.txt
Nobody wants to think about "channel's bit error distribution" in a non-security-critical context. In fact, it's irrelevant, and possibly a usability issue.
SHA256 is also near-universally supported and doesn’t have this drawback. The only cases where MD5 would be available and SHA256 wouldn’t are systems that are out of security support anyway, where there are bigger problems to contend with.
I'd suggest using Adler-32 (what zlib uses) for a simple and fast checksum. Then it should, one hopes, be painfully obvious that it's a bad fit for anything security related.
Luckily, my personal rule is to default to a cryptographic hash unless I can convince myself that cryptographic robustness will never matter and performance definitely will matter, rather than the other way around.
In this specific case all users were internal to the company, so it wouldn't have really mattered if it was vulnerable. But it could just have easily been an external user-facing thing.
They are not useful as file identifiers. I have found multiple CRC32 collisions even in a single directory (a big one, with around ten thousand files).
For error detection in a big file, you need at least some 64-bit CRC, though SipHash is likely to be a better choice than a CRC.
For identifying uniquely a file in a multi-TB file system, which may have many millions of files, even a 64-bit hash is not good enough, a hash of 128 bits or more is needed to make negligible the probability of collisions. I have verified this experimentally, finding several 64-bit file hash collisions in my file systems.
An "arbitrary collision" here means you can find two inputs (pre-images) which hash to the same thing. Like you ran some code and discovered that "SDFKLHKLJxchjasdfgklhjaskdhjlf9" hashed to the same thing as "klhkasdfhjkl899078790". Finding a second pre-image means you start with one message, like "ALL QUIET. REMAIN CALM." and figured out that "ATTACK AT DAWN 051928" hashes to the same MIC.
I can't believe I'm defending using MD5. But... finding second pre-images is still hard. Sasaki & Aoki say it's got a complexity of around 2^116.9 and requires 11 * 2^45 words of memory (though 1400TB isn't THAT outlandish these days).
Still... statements like "finding a second pre-image is hard" don't age well and will guarantee a tractable second pre-image attack will be published tomorrow.
But... if you have a bunch of docs and you're not signing them or asking people to trust the hash of each doc, you can (reasonably) quickly de-dup by sorting by MD5 hash and then looking for dups. Which is how many people use MD5. And they continue using MD5 because multiple organizations have similar lists and if you wanted to change it, you would need to get everyone to move to a different algorithm.
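That de-dup workflow is only a few lines of code; here is a minimal sketch (throwaway file names, any contents would do):

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def dedup_by_md5(paths):
    # Group files by MD5 digest; any group with more than one
    # entry is a set of (almost certainly) duplicate files.
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            groups[hashlib.md5(f.read()).hexdigest()].append(path)
    return [group for group in groups.values() if len(group) > 1]

# Demo with throwaway files: two identical documents and one distinct.
tmp = tempfile.mkdtemp()
contents = {"a.txt": b"exhibit 1", "b.txt": b"exhibit 1", "c.txt": b"exhibit 2"}
for name, body in contents.items():
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(body)

dups = dedup_by_md5(os.path.join(tmp, n) for n in contents)
print(dups)  # one group: the two files containing b"exhibit 1"
```
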
But yeah... at this point we should assume someone will publish a tractable second pre-image attack "any day now" and get to work migrating from MD5 to MD5 : Next Generation. But good luck getting more than 2 people to agree to what the next preferred hash algorithm should be.
I'm personally of the opinion that it doesn't matter. MD5 is fine for genomics. The chances of valid genome files colliding is still extremely low, and there's not really any relevant attack space. Replacing one assembly file with another will just break someone's analysis pipeline, and most likely in a very clear obvious way.
Then why use a cryptographic hash at all? There are much better hashes out there that only strive for distribution/avalanche.
https://en.wikipedia.org/wiki/Non-cryptographic_hash_functio...
Sure, there are better non-cryptographic hashes, but, again, the concern of lawyers and genomics folk is neither security nor efficiency - simplicity and "works most of the time" are the two metrics at stake.
If either lawyers or genomics folks cared about document forgery of this nature (spoiler: they don't), they would move to something like SHA3. If they had a need for high-scalability hash algorithms (spoiler: they don't), they would switch to another, faster algorithm.
This is a concept that security folks seem to struggle with - sometimes we _just don't care_. And we never should.
Maybe, something a struggling security enthusiast could understand - a video game.
If you implement e.g. a Caesar cipher, you can have a fun, accessible puzzle. Implementing AES in your game as a puzzle, while much harder, fails desperately at the "accessibility" metric. In your single-player game, if you want to show some "identifying hash", an md5 one is enough. No, you should not worry about people forging documents for your ad-hoc identification system if you don't have people attempting to forge in-game items. Maybe it's even a feature that you need to forge such a hash, as a way to solve a puzzle.
And given the history of cryptographic hashes, I'm even more convinced that anyone depending on sha3/whatever being better than md5/etc over the next 10-20 years is fooling themselves.
Now would I use it in a secure boot chain/etc as a stamp of uniqueness? Probably not.
MD5 brings the feature that you'll forever be explaining why you chose a function that had already been broken for 30 years when other options were readily available.
I have built data warehouses with md5 as the hashing algorithm to generate keys from natural keys. I did some back-of-the-envelope calculations back then and found that the chance of a hash collision was minute. I don't remember the exact numbers, but it was somewhere in the hundreds of years even if I was generating keys every second.
This could, btw, very well be a concern with large volumes of data, but in many systems it is absolutely not a worry.
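That back-of-the-envelope estimate is easy to redo with the birthday approximation: among k random n-bit hashes, the probability of at least one collision is roughly k^2 / 2^(n+1) (valid while the result is small):

```python
# Birthday-bound approximation for hash collisions.
def collision_probability(k: int, bits: int) -> float:
    # P(at least one collision among k random `bits`-bit values),
    # approximated as k^2 / 2^(bits+1); accurate for small probabilities.
    return k * k / 2.0 ** (bits + 1)

# One surrogate key generated per second for 100 years, 128-bit (md5-sized) keys:
keys = 60 * 60 * 24 * 365 * 100
p = collision_probability(keys, 128)
print(f"{keys:.2e} keys -> collision probability {p:.2e}")  # vanishingly small
```
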
Ya know... It's 2024 and Azure's blob storage ONLY supports MD5 for integrity checks when writing blobs. There are no other hash functions supported there. The default cloud storage solution implemented by one of the largest cloud providers out there ONLY uses MD5.
I really want to use something else, but whenever I have to interact with them I must fall back to MD5. It's not up to me as a dev to use something better if I need to interact with Azure. Yes, I can use other hashes alongside MD5, but if I want integrity checks with the storage provider I can't completely abandon MD5.
That said, they apparently use eight passes of MD5 hashing along with salting, which they claim is a sufficiently secure combo.
WordPress's core and default themes are known to be fairly secure, so I'd like to believe they know what they're talking about, but if nothing else it feels icky.
There ends up being a usability issue here. An MD5 hash is only 128 bits long. So 32 hex digits. A SHA-2 hash is going to be 256 bits. Or 64 hex digits. Manually comparing 64 hex digits is in practice much harder than twice as hard as comparing 32 hex digits. People get lost in the middle. If you chop down your 256 bit hash to 128 bits then due to birthday collisions you can probably brute force a collision anyway (you end up only having to do something like 2^64 operations). So there ends up being a usability argument for specifying that your system has to be able to be secure in the face of collisions. At that point you could then further argue that you will just stick with MD5.
If a visual comparison is believed necessary, it should be made easier, e.g. by overlaying the two hash values using text of different colors.
Otherwise, even a bash script, or even just one bash command line can easily compare the output of two sha256sum executions and print an appropriate message.
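For instance, a sketch of such a comparison (hypothetical file paths):

```shell
# Create two identical throwaway files for the demo.
printf 'exhibit A\n' > /tmp/doc1.txt
cp /tmp/doc1.txt /tmp/doc2.txt

# sha256sum prints "<digest>  <name>"; keep only the digest field.
h1=$(sha256sum /tmp/doc1.txt | cut -d' ' -f1)
h2=$(sha256sum /tmp/doc2.txt | cut -d' ' -f1)

# Let the machine do the 64-hex-digit comparison instead of a human.
if [ "$h1" = "$h2" ]; then echo "hashes match"; else echo "HASHES DIFFER"; fi
```
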
Comparing MD5 and SHA-2 for visual human diffing is like comparing a stick of dynamite to a landmine when trying to pop a pimple; any potential safety differences are trivial once you start using something in a fundamentally unsafe way.
https://crypto.stackexchange.com/questions/84520/how-long-wo...
"We have successfully run the computation during two months last summer, using 900 GPUs (Nvidia GTX 1060)."
Such resources can be easily rented from a cloud.
So for anyone willing to spend up to 100k USD, it is trivial to find SHA-1 collisions.
SHA2-256 is "only" 21975.5 MH/s so you'd have to double the number of GPUs or amount of time.
[1] https://gist.github.com/Chick3nman/32e662a5bb63bc4f51b847bb4...
For instance: law school training to strengthen ethics, well known punishments to act as a deterrent, hiring processes to filter out unscrupulous actors, and whistleblower protections to encourage and reward vigilance.
It’s not an opinion that adheres to the cynical zeitgeist, but in my experience most members of this profession are extremely trustworthy. I’m sure they dislike the stereotype of lawyer=rotter just as much as hackers are tired of being typecast as Newman… sorry, Dennis from Jurassic Park!
This is not true. It’s not possible to guarantee this. One can be certain that two files DON’T match if they have different hashes, but one cannot be certain that two files DO match based ONLY on the fact that they have the same hash.
The fact that any document can contain its own MD5 hash embedded in there should be hugely concerning enough.
The hash also happens to start with 5EAF00D.
And a PNG version too: https://news.ycombinator.com/item?id=32956964
But no one has made an exclusively plaintext (ASCII) MD5-quine yet, and I suspect doing so may be impossible given the characteristics of collision blocks.
1. a document containing "1", whose hash begins with "1"
2. a document containing "12", whose hash begins with "12"
3. a document containing "123", whose hash begins with "123"
#1 is certain to exist. #2 exists, but would take 16x as long to brute force. #3 would take 16x longer again. If this pattern doesn't continue until 2^128, where would it stop, and why?
All hashes can be brute-forced this way, even secure ones like SHA-2. Their security relies on the fact that the earth doesn't contain enough computing power to execute a brute-force attack within the universe's lifetime.
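The prefix brute-force is easy to demonstrate for short prefixes; each extra hex digit multiplies the expected work by 16, which is exactly why the full 2^128 is out of reach:

```python
import hashlib
from itertools import count

def find_document_with_prefix(prefix: str):
    # Try candidate "documents" 0, 1, 2, ... until one's MD5 hex digest
    # starts with the desired prefix; expected tries ~ 16^len(prefix).
    for tries, i in enumerate(count(), start=1):
        doc = str(i).encode()
        if hashlib.md5(doc).hexdigest().startswith(prefix):
            return doc, tries

for prefix in ("1", "12", "123"):
    doc, tries = find_document_with_prefix(prefix)
    print(f"prefix {prefix!r}: document {doc!r} found after {tries} tries")
```
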
By that logic, SHA 256 is also broken:
$ cat >sha256.py
from hashlib import sha256
s = 'from hashlib import sha256\ns = %r\nprint sha256(s%%s).hexdigest()\n'
print sha256(s%s).hexdigest()
$ sha256sum sha256.py
14cc85c420ced317fdb73e9403ac3f6e1d96d19c70ae0dce8da9b8d96fa0b4d3 sha256.py
$ python sha256.py
14cc85c420ced317fdb73e9403ac3f6e1d96d19c70ae0dce8da9b8d96fa0b4d3
(Yes, PDF is Turing complete. Yes, that's terrible. No, it doesn't have anything to do with hash function deficiencies; it's Turing complete on (malicious) purpose, just like webpages with JavaScript.) (I've never tried to build one of these, so I could be totally wrong here.)
I think that's pretty amazing, to be honest.
Unless I missed it, this article seems to not refute the most fundamental point: MD5 was never broken for encryption. Hashing is not encryption.
Moreover, MD5, SHA-1 and SHA-2 contain a block cipher function used in the Davies-Meyer mode of operation.
The internal block cipher function can be extracted and used in any other mode of operation possible for block cipher functions.
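As a toy illustration of the Davies-Meyer feed-forward (h' = E(key=m, h) XOR h), here is a sketch that uses TEA as a stand-in block cipher. This is purely illustrative: MD5 and the SHAs use their own dedicated internal ciphers, not TEA, and a 64-bit digest like this one is far too short to be secure.

```python
def tea_encrypt(block, key):
    # TEA: 64-bit block as two 32-bit halves, 128-bit key as four 32-bit words.
    v0, v1 = block
    k0, k1, k2, k3 = key
    delta, s, mask = 0x9E3779B9, 0, 0xFFFFFFFF
    for _ in range(32):
        s = (s + delta) & mask
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + s) ^ ((v1 >> 5) + k1))) & mask
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + s) ^ ((v0 >> 5) + k3))) & mask
    return v0, v1

def davies_meyer_hash(data: bytes) -> bytes:
    # Pad to whole 16-byte blocks; each message block becomes the cipher *key*,
    # and the chaining value is fed through as the *plaintext*:
    #     h_{i+1} = E(key=block_i, h_i) XOR h_i   (the feed-forward step)
    data = data + b"\x80" + b"\x00" * (-(len(data) + 1) % 16)
    h = (0x67452301, 0xEFCDAB89)  # arbitrary fixed IV for this toy
    for i in range(0, len(data), 16):
        key = tuple(int.from_bytes(data[i + j:i + j + 4], "big")
                    for j in range(0, 16, 4))
        e0, e1 = tea_encrypt(h, key)
        h = (e0 ^ h[0], e1 ^ h[1])
    return h[0].to_bytes(4, "big") + h[1].to_bytes(4, "big")

print(davies_meyer_hash(b"hello world").hex())
```

The feed-forward XOR is what makes the compression function one-way: without it, anyone holding the "key" (the message block) could simply decrypt and invert each step.
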
Because of these possibilities, many older laws in various places that prohibited the inclusion of encryption in software products while allowing secure hashing functions were completely misguided.
The point is that MD5 is no good if there's any way an adversary might want to subvert it. It's fine if you just want to use it for hashing your own documents, but as soon as there's an incentive for someone to substitute one document for another, MD5 is problematic.
That's certainly the case for encryption, but it's also the case for these legal document records.
It's broken in an adversarial situation: given the hash of evidence-file A, it's possible to construct a file B that gives the same hash.
But it would be a different matter entirely to construct a file B that actually looked like a file of evidence relevant to the case. I don't know how lawyers use these hashes, but unless they're being used to detect malicious tampering, I don't see what's wrong with MD5. And since the files to be hashed are evidence, they're in the custody of a court; things have got quite bad if court officials might be tampering with evidence.
No, that's a second preimage attack. MD5 is safe against preimage & second preimage attacks.
What MD5 is not safe against, is a collision attack: you can create two messages/files with different content, that end up having the same hash.
So to exploit the vulnerability, you have to be able to manipulate file A, the original piece of evidence, to construct a file B that has a matching hash. I still fail to see how this impacts files submitted to a court in evidence.
Given a message M, length function L(), and MD5 hash function H(); is there an attack which can generate message M', such that H(M)==H(M') _and_ L(M)==L(M')?
In other words: Two different messages, both of the same length, with the same hash?
It's almost like a chosen prefix collision attack, but with no prefix (so P is empty) and a given message (M is known, M' is up to the attacker).
I ask because I frequently use GridFTP for data transfer, and it uses both the file length and the MD5 hash to verify that files were transferred correctly.
MD5 is fine for the first task, and totally unacceptable for the second.
And it is unclear if that is in any way unusual.
Yes, I must say I smiled when I saw the author's assertion that moving the entire legal industry to a new hashing algorithm is "trivial".
Like I get that md5 is essentially a unique identifier and not meant to protect against malicious interference, but if all the exhibits had the same identifier, surely that would confuse people.
The point is that evidence is an agreement between the two sides in a case, and it's not an absolute thing.
If you have the original document that was signed by both parties, great. If you have a scan (using a lossless compression format) of the document and proof that the original was destroyed, great. If you have a scan but no proof of destruction, still great. If you have a photograph of the document and no proof, still great. If you have a vague recollection of what was in the document, still great. All of these are "great" if the other side accepts that they are accurate depictions of the original. If they don't accept that, then there's an argument about what the original document contained and the provenance of the evidence, and only then does the actual quality matter. Original document with wet signature is hard to argue with (but not impossible - wet signatures can be forged). The further away from that, the easier it is to argue that the document presented is not accurate and should not be accepted as evidence.
Knowing that it's possible to use collisions to create false evidence doesn't matter if no-one contests the evidence. It only becomes significant if one side says that the document has been tampered with, and that's not that common. The side claiming it was tampered with would have to present their version of the document, and their version of events that allowed the document to be tampered with, and so on. The judge would make a ruling about which version of the document was considered the "real" one and the case would continue. Obviously there are edge cases where the whole trial verdict hinges on which version of the document is the correct one, but they're edge cases. And in those cases you could re-hash the documents involved and double-check which one was right, etc.
In the OP's example, where a letter of recommendation has the same hash as an authorisation letter, this is only going to matter if one side says the accused was authorised and the other says they weren't. The authorisation letter will be produced by one side, and the recommendation letter produced by the other, and there'll be an argument about which was the original document. The fact that they have the same hash isn't really relevant. It's a minor point of interest given that these are two clearly different documents saying different things.
In the specific cases for the SFO that I worked on, the SFO descended on the accused's offices like locusts, sweeping every single document into carefully numbered bags. We scanned the documents in secure facilities, stored the originals in secure facilities, stored the resulting images in secure storage, and deleted any cache or copies. My professional opinion is that it would be impossible for anyone to create two documents prior to the SFO's investigation that would create an intentional MD5 collision in the evidence used in court. And, even if they somehow did, it wouldn't matter because both documents would be in evidence bags in storage and could be recovered to be examined by the court.
Obviously, from a black/white technical point of view, using a better hash algorithm would be better. But I can see why the legal profession is reluctant to adopt the new thing; it's a hassle and it will only affect a tiny amount of cases, if any.
To the article, what tptacek said.
It will provide a speedup, but it's not like Shor's algorithm - you need a really powerful quantum computer before md5 comes under threat.
But to be clear: MD5 is broken; do not use it.
It is much more computationally feasible to create two inputs from scratch that hash to the same value than to forge an existing document’s hash (the threat model I’m assuming they’re discussing in relation to the law).
As far as I know, there has been no demonstrated second preimage attack on md5. Not saying to keep using it, just trying not to spread FUD.
Edit: I do see second preimage is mentioned about 3/4 of the way through the article. I confess that I did stop reading and started skimming before then.
A successful second preimage attack is needed if you want to make a second variant with the same hash like an already existing legal document.
However, when the original document is not yet in the possession of others (or there might be a way to destroy or replace their older copies), you can make more or less invisible modifications to it, so that a second, different document will have the same hash as it. Then the altered original document can be handed to other parties, who will not notice changes from whatever had been agreed, while you keep an alternative document that can be shown later as having the same hash.
While opportunities for such a forgery should happen less often, it is much better to use a collision-resistant hash to completely remove this possibility.
After the first high-profile case where authenticity of evidence gets called into question because a seized electronic document was deliberately doctored to allow for a hash collision (if that ever happens), there will be a will to change to something new.
[1] unless I'm missing something, this boils down to: "given f(x: string) => y, how can I minimize the odds that you can generate an X for a desired Y"