None of this is to say that the legal profession shouldn't move to SHA2; it should.
Firstly ... the NIST recommendation TFA links to doesn't just recommend SHA-3, it actually says "Federal agencies should use SHA-2 or SHA-3 as an alternative to SHA-1." SHA-2 and SHA-3 are both valid hash functions recommended by NIST. And while 3 is higher than 2, and SHA-3 is newer, in this case newer doesn't mean "better". Being based on Keccak and a sponge construction, SHA-3 provides "diversity" more than "improvement".
Secondly ... SHA2 is widely implemented in existing hardware, and it's currently just more efficient (and likely to remain so). So why waste power, especially on something you'll be doing in bulk?
O.k., so that's why SHA-2 and not SHA-3. But BLAKE is worth avoiding IMO ... because FISMA says that for Federal work, you have to use one of the algorithms NIST recommends in FIPS. Obviously the legal profession needs to be able to practice in Federal courts (Article III and administrative) ... so if you're going to pick a new standard, pick one of those (but not SHA-3!).
Lastly, and this is really an aside ... it's not uncommon for folks to think SHA3 and SHA384 are the same thing, but they are not. SHA384 is just a variant of SHA-2 with a 384-bit digest length and a correspondingly improved security margin. Other Federal standards, like CNSA, separately recommend SHA384 as a good minimum ... so it can be confusing, and I think it's understandable that some people assume SHA3 is just short for SHA384.
For SHA-2, the security (i.e. the difficulty of finding collisions) decreases as the length of the document grows, while for SHA-3 it stays constant.
Nevertheless, it is unlikely that typical legal documents are big enough for this to matter, except when the hashes would be e.g. for entire seized HDDs or SSDs, so SHA-2 is an acceptable choice for replacing MD5.
The only reason why SHA-3 has not become widespread is that, like AES, it requires hardware support for good performance, but for some reason Intel has not added SHA-3 instructions to the x86 ISA. Arrow Lake S, expected to launch at the end of this year, will add support for SHA-512 and for the Chinese standard hashes, but there is still no intention to add SHA-3 (as Arm already has).
Neither AES nor SHA-3 is recommendable on CPUs without dedicated instructions, due to low performance, but with hardware support they become faster than any alternatives. The difference between them is that nowadays only the cheaper microcontrollers lack AES support, while SHA-3 support is still seldom encountered.
I'd only let NIST and FIPS drive my crypto choices if I was being actively forced to do so by a government use case, and even so I'd be praying for the day that FIPS joins us in the modern era and stops tethering us to acronyms they could enumerate a decade ago.
The records being produced almost always exist, and the hash is being used to differentiate which is which and whether the copies you got are correct.
If an evil litigant produces fake documents instead of real ones, and you can't get access to the system they came from, no hash will save you - they can produce any hash they want and you can't verify against an original.
If you can verify against the original, forensics isn't going to look at a hash and say "hash matches my job is done".
Most of the time what matters is whether the evidence exists or not.
The tobacco companies lied and claimed they never had any evidence in the first place.
Assume instead they wanted to falsify the data to say that, as far as they could tell, smoking did not cause cancer.
That is not a hashing problem. It is a problem of making up fake studies that look real.
Essentially, the Game Boy expected a bitmap of the Nintendo logo to be present in the cartridge ROM, which was shown on screen at boot. It had to match a version stored in the Game Boy itself or else the game wouldn’t start.
The thinking (that I’m not sure was ever tested) was that someone producing a game that tried to trick consumers into thinking it was an official Nintendo product, would be liable for damages in a trademark lawsuit. Since the game would never start without an official Nintendo logo, the hope was to make the legal system enforce Nintendo’s licensing scheme.
If you do have access to the original system, md5 is usually not the thing in the way of falsifying evidence.
Most of the time they will claim the evidence does not exist at all, rather than try to falsify it.
Falsification requires making up lots of things that make sense historically, and humans to swear to them.
You also often have to falsify more than one system in a consistent way.
And you have to do all of this in a way that doesn't make the forensic specialist think everything looks really weird.
The examples given in the article rely on an attacker manipulating an image prior to it being hashed, or compromising their opponent's database. If you can (and are willing to) do these things, no hashing algorithm will help. They also assume that the hash is literally the only thing anyone will rely on in identifying a document, which is not how it will work in practice. (No witness has ever said "yes there was a letter, I don't remember what it looked like or what it said but I remember its MD5 hash was [...]".)
I don't think that hash colliding artifacts would necessarily be obvious. They could be in part of a file that ends up being ignored by parsers of the file format. Or it could be some low level noise in pixel values in scanned documents.
And do you think we trust md5, or do you think we have another tool that can compare documents? We do have such tools.
What prevents chicanery in discovery and disclosure is how annoying it was to go to law school and pass the bar compared to how fun it would be to work at Wendy’s for the rest of our careers.
Should we use a better hash function: probably. Is this a problem of not enough technology literacy in the profession? Well, not really; there are plenty of tech literate lawyers (hi!) we just have questionable tools just like you do in IT and we make do.
Same reason why blockchain is (almost) never the answer: that's not how human systems work.
Trust is centralized. Always had been.
The reason why people like us keep changing everything for security is specifically because we have no access to justice. Computer crimes are international and difficult to prosecute, so you might as well drop an algorithm like a hot potato if anyone - even just nation state actors - could break it. We build our rules out of code because we do not have access to the material they make laws out of.
That being said, continuing to use MD5 is utterly inexcusable.
Let’s ignore that no second preimage attack is currently known for MD5. The software the author links to has a FAQ that links to a paper that lays out the second preimage complexity for MD4:
https://who.paris.inria.fr/Gaetan.Leurent/files/MD4_FSE08.pd...
It takes 2^102 hashes to brute force this for MD4, which is weaker than MD5. A bitcoin Antminer K7 will set you back $2,000, and it gets 58 TH/s for sha256, which is slower than MD5 or MD4. Let’s ignore that MD5 is more complex than MD4, and let’s say conservatively that similar hardware might be twice as fast for MD5 (SHA256 is really only 20-30% slower on a cpu). It’ll take 2^102/58e12/2/60/60/24/365, or about 1.4 billion years to do a second preimage attack with current hardware. So you could do that 3 times before the sun dies.
If you want to reduce that to 1.4 years, you could maybe buy a billion K7’s for $2 trillion. And each requires 2.8kW so you’ll need to find 2.8 terawatts somewhere. That’s 34 trillion kWh for 1.4 years. US yearly energy consumption is 4 trillion kWh.
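Those figures are easy to sanity-check in a few lines of Python (all constants taken from the estimate above, including the assumed 2x MD5 speedup over SHA256 hardware):

```python
# Back-of-envelope check of the second-preimage cost estimate above.
HASHES = 2 ** 102          # estimated second-preimage work factor (MD4 paper)
RATE = 58e12 * 2           # one K7 at 58 TH/s, doubled for MD5 (assumption)
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years_one_unit = HASHES / RATE / SECONDS_PER_YEAR
print(f"one miner: {years_one_unit:.2e} years")   # on the order of 1.4 billion

units = 1_000_000_000                             # a billion K7s, ~$2 trillion
years_fleet = years_one_unit / units
kwh = units * 2.8 * years_fleet * 24 * 365        # 2.8 kW per unit
print(f"fleet of {units}: {years_fleet:.2f} years, {kwh:.2e} kWh")
```
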
It will be a while, probably decades or more, before there’s a tractable second preimage attack here.
Yes, there are stronger hashes out there than MD5, but for file verification (which is what it’s being used for) it’s fine. Safe, even. The legal folks should probably switch someday, and it’ll probably be convenient to do so since many crypto libraries won’t even let you use MD5 unless you pass a “not for security” argument.
But there’s no crisis. They can take their time.
The problem with this argument is that people often don't properly understand the security requirements of systems. I can't count the number of times I've seen people say "md5 is fine for use case xyz" where, in some counterintuitive way, it wasn't fine.
And tbh, I don't understand the urge people have to defend broken hash functions. Just use a safe one, even if you think you "don't need it". Choosing a secure hash function has no downsides, and it's far easier to do that than to actually show that you "don't need it" (instead of just having a feeling that you don't).
And in the unlikely event that you think performance matters (unlikely, because cryptographic hash functions are so fast that it's really hard to build anything where the difference between md5 and sha256 matters), even that's covered: blake3 is faster than md5.
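For a rough sense of how fast these functions already are, here is a stdlib-only micro-benchmark sketch (blake3 needs a third-party package, so this compares md5 and sha256 from hashlib; absolute numbers will vary by machine and OpenSSL build):

```python
import hashlib
import time

# Hash 100 MB of data with each algorithm and report throughput.
# Both typically run at hundreds of MB/s or more on modern CPUs,
# which is why the md5-vs-sha256 difference rarely matters.
data = b"\x00" * (100 * 1024 * 1024)

for name in ("md5", "sha256"):
    start = time.perf_counter()
    digest = hashlib.new(name, data).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data) / elapsed / 1e6:.0f} MB/s")
```
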
I can count many more times that people told me that md5 was "broken" for file verification when, in fact, it never has been.
My main gripe with the article is that it portrays the entire legal profession as "backwards" and "deeply negligent" when they're not actually doing anything unsafe -- or even likely to be unsafe. And "tech" apparently knows better. Much of tech, it would seem, has no idea about the use cases and why one might be safe or not. They just know something's "broken" -- so, clearly, we should update immediately or risk... something.
> Just use a safe one, even if you think you "don't need it".
Here's me switching 5,700 or so hashes from md5 to sha256 in 2019: https://github.com/spack/spack/pull/13185
Did I need it? No. Am I "compliant"? Yes.
Really, though, the main tangible benefit was that it saved me having to respond to questions and uninformed criticism from people unnecessarily worried about md5 checksums.
Help us out by describing a time when this happened. MD5's weaknesses are easily described, and importantly, it is still (second) preimage resistant.
I agree that upgrade is likely your best bet. But I've found the other direction of bad reasoning is a more pernicious trap to fall into. "My system uses bcrypt somewhere so therefore it is secure" and the like is often used as a full substitute for thinking about the entirety of the system.
The ideal discourse would not imply a binary sense of "safety" at all, much less for a function evaluated outside the context and needs of its usage....
Also, hand-wavy extrapolations from Bitcoin miners aren't a reliable estimate of how fast & energy-efficient dedicated MD5 hardware could become.
Pirate libraries are particularly important to preserve our cultural heritage in a transparent and trustworthy way. A role that traditional libraries sadly cannot fulfill due to draconian copyright laws, especially around digital books. With archive.org as notable exception.
Still, they should switch. SHA-1 is not good either.
It’s frankly broken that evidence-handling doesn’t have to follow the government's advice about hash function selection!
I informed her that since FIPS 140-2 is about physical properties of key creation and management, all the relevant layers in a cloud-only solution are simply in the wrong scope. And I added that I am allergic to the string "FIPS" in general. Even having it present in official contract language makes people leap into weird assumptions about supported and allowed algorithms.
Her response? "Oh, that makes sense."
Different checksum algorithms can provide better error detection for specific channel error models (potentially even with fewer bits). Non-cryptographic checksums are typically designed for various failure models like a burst of corrupt bits, trading off what they do/don't detect to better match detection of corruption in the data they will protect.
For example, if you know that there will be at most one bit flip in your message, a single bit checksum (parity check) is sufficient to identify that an error occurred, regardless of your message size. (Note that this is an illustrative example only, since, typically, messages have a certain number of errors for a certain number of message bits -- the expected number of errors depends on the size of the message.)
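That parity example can be sketched in a few lines (illustrative only; real non-cryptographic checksums like CRCs are designed for richer error models such as burst errors):

```python
# Illustrative single-bit parity checksum: it detects any odd number of
# bit flips (in particular, any single flip), regardless of message size.
def parity(data: bytes) -> int:
    bit = 0
    for byte in data:
        bit ^= bin(byte).count("1") & 1  # XOR in the parity of each byte
    return bit

message = bytearray(b"attack at dawn")
stored = parity(message)

message[3] ^= 0b00000100          # flip a single bit "in transit"
assert parity(message) != stored  # the one-bit checksum catches it

message[3] ^= 0b00000100          # flip it back
assert parity(message) == stored  # the intact message passes
```
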
Important real-life-facts.
There was no "give-me-an-appropriate-hash" function.
There was:
md5sum yourfile.txt
Nobody wants to think about "channel's bit error distribution" in a non-security-critical context. In fact, it's irrelevant, and possibly a usability issue.
SHA256 is also near-universally supported and doesn’t have this drawback. The only cases where MD5 would be available and SHA256 wouldn’t are systems that are out of security support anyway, where there are bigger problems to contend with.
I'd suggest using Adler-32 (what zlib uses) for a simple and fast checksum. Then it should, one hopes, be painfully obvious that it's a bad fit for anything security related.
Luckily, my personal rule is to default to a cryptographic hash unless I can convince myself that cryptographic robustness will never matter and performance definitely will matter, rather than the other way around.
In this specific case all users were internal to the company, so it wouldn't have really mattered if it was vulnerable. But it could just have easily been an external user-facing thing.
They are not useful as file identifiers. I have found multiple CRC32 collisions even in a single directory (a big one, with around ten thousand files).
For error detection in a big file, you need at least some 64-bit CRC, though SipHash is likely to be a better choice than a CRC.
For identifying uniquely a file in a multi-TB file system, which may have many millions of files, even a 64-bit hash is not good enough, a hash of 128 bits or more is needed to make negligible the probability of collisions. I have verified this experimentally, finding several 64-bit file hash collisions in my file systems.
An "arbitrary collision" here means you can find two inputs (pre-images) which hash to the same thing. Like you ran some code and discovered that "SDFKLHKLJxchjasdfgklhjaskdhjlf9" hashed to the same thing as "klhkasdfhjkl899078790". Finding a second pre-image means you start with one message, like "ALL QUIET. REMAIN CALM." and figured out that "ATTACK AT DAWN 051928" hashes to the same MIC.
I can't believe I'm defending using MD5. But... finding second pre-images is still hard. Sasaki & Aoki say it's got a complexity of around 2^116.9 and requires 11 * 2^45 words of memory (though 1400TB isn't THAT outlandish these days).
Still... statements like "finding a second pre-image is hard" don't age well and will guarantee a tractable second pre-image attack will be published tomorrow.
But... if you have a bunch of docs and you're not signing them or asking people to trust the hash of each doc, you can (reasonably) quickly de-dup by sorting by MD5 hash and then looking for dups. Which is how many people use MD5. And they continue using MD5 because multiple organizations have similar lists and if you wanted to change it, you would need to get everyone to move to a different algorithm.
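That de-dup workflow is only a few lines of code; here is a minimal sketch (throwaway file names, any contents would do):

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def dedup_by_md5(paths):
    # Group files by MD5 digest; any group with more than one
    # entry is a set of (almost certainly) duplicate files.
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            groups[hashlib.md5(f.read()).hexdigest()].append(path)
    return [group for group in groups.values() if len(group) > 1]

# Demo with throwaway files: two identical documents and one distinct.
tmp = tempfile.mkdtemp()
contents = {"a.txt": b"exhibit 1", "b.txt": b"exhibit 1", "c.txt": b"exhibit 2"}
for name, body in contents.items():
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(body)

dups = dedup_by_md5(os.path.join(tmp, n) for n in contents)
print(dups)  # one group: the two files containing b"exhibit 1"
```
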
But yeah... at this point we should assume someone will publish a tractable second pre-image attack "any day now" and get to work migrating from MD5 to MD5 : Next Generation. But good luck getting more than 2 people to agree to what the next preferred hash algorithm should be.
I'm personally of the opinion that it doesn't matter. MD5 is fine for genomics. The chances of valid genome files colliding is still extremely low, and there's not really any relevant attack space. Replacing one assembly file with another will just break someone's analysis pipeline, and most likely in a very clear obvious way.
Then why use a cryptographic hash at all? There are much better hashes out there that only strive for distribution/avalanche.
https://en.wikipedia.org/wiki/Non-cryptographic_hash_functio...
Sure, there are better non-cryptographic hashes, but, again, the concern of lawyers and genomics folk is neither security nor efficiency - simplicity and "works most of the time" are the two metrics at stake.
If either lawyers or genomics folks cared about document forgery of this nature (spoiler: they don't), they would move to something like SHA3. If they had a need for high-scalability hash algorithms (spoiler: they don't), they would switch to another, faster algorithm.
This is a concept that security folks seem to struggle with - sometimes we _just don't care_. And we never should.
Maybe, something a struggling security enthusiast could understand - a video game.
If you implement e.g. a Caesar cipher, you can have a fun, accessible puzzle. Implementing AES in your game as a puzzle, while much harder, fails desperately at the "accessibility" metric. In your single-player game, if you want to show some "identifying hash", an md5 one is enough. No, you should not worry about people forging documents for your ad-hoc identification system if you don't have people attempting to forge in-game items. Maybe it's even a feature that you need to forge such a hash, as a way to solve a puzzle.
And given the history of cryptographic hashes, I'm even more convinced that anyone depending on sha3/whatever being better than md5/etc over the next 10-20 years is fooling themselves.
Now would I use it in a secure boot chain/etc as a stamp of uniqueness? Probably not.
MD5 brings the feature that you'll forever be explaining why you chose a function that had already been broken for 30 years when other options were readily available.
I have built data warehouses with md5 as the hashing algorithm to generate keys from natural keys. I did some back-of-the-envelope calculations back then and found that the chance of a hash collision was minute. I don't remember the exact numbers, but it was somewhere in the hundreds of years even if I was generating keys every second.
This could, btw, very well be a concern with large volumes of data, but in many systems it is absolutely not a worry.
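That back-of-the-envelope estimate is easy to redo with the birthday approximation: among k random n-bit hashes, the probability of at least one collision is roughly k^2 / 2^(n+1) (valid while the result is small):

```python
# Birthday-bound approximation for hash collisions.
def collision_probability(k: int, bits: int) -> float:
    # P(at least one collision among k random `bits`-bit values),
    # approximated as k^2 / 2^(bits+1); accurate for small probabilities.
    return k * k / 2.0 ** (bits + 1)

# One surrogate key generated per second for 100 years, 128-bit (md5-sized) keys:
keys = 60 * 60 * 24 * 365 * 100
p = collision_probability(keys, 128)
print(f"{keys:.2e} keys -> collision probability {p:.2e}")  # vanishingly small
```
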
Ya know... It's 2024 and Azure's blob storage ONLY supports MD5 for integrity checks when writing blobs. There are no other hash functions supported there. The default cloud storage solution implemented by one of the largest cloud providers out there ONLY uses MD5.
I really want to use something else, but whenever I have to interact with them I must fall back to MD5. It's not up to me as a dev to use something better if I need to interact with Azure. Yes, I can use other hashes alongside MD5, but if I want integrity checks with the storage provider I can't completely abandon MD5.
That said, they apparently use eight passes of MD5 hashing along with salting, which they claim is a sufficiently secure combo.
WordPress's core and default themes are known to be fairly secure, so I'd like to believe they know what they're talking about, but if nothing else it feels icky.
There ends up being a usability issue here. An MD5 hash is only 128 bits long. So 32 hex digits. A SHA-2 hash is going to be 256 bits. Or 64 hex digits. Manually comparing 64 hex digits is in practice much harder than twice as hard as comparing 32 hex digits. People get lost in the middle. If you chop down your 256 bit hash to 128 bits then due to birthday collisions you can probably brute force a collision anyway (you end up only having to do something like 2^64 operations). So there ends up being a usability argument for specifying that your system has to be able to be secure in the face of collisions. At that point you could then further argue that you will just stick with MD5.
If a visual comparison is believed necessary, it should be made easier, e.g. by overlaying the two hash values using text of different colors.
Otherwise, even a bash script, or even just one bash command line can easily compare the output of two sha256sum executions and print an appropriate message.
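For instance, a sketch of such a comparison (hypothetical file paths):

```shell
# Create two identical throwaway files for the demo.
printf 'exhibit A\n' > /tmp/doc1.txt
cp /tmp/doc1.txt /tmp/doc2.txt

# sha256sum prints "<digest>  <name>"; keep only the digest field.
h1=$(sha256sum /tmp/doc1.txt | cut -d' ' -f1)
h2=$(sha256sum /tmp/doc2.txt | cut -d' ' -f1)

# Let the machine do the 64-hex-digit comparison instead of a human.
if [ "$h1" = "$h2" ]; then echo "hashes match"; else echo "HASHES DIFFER"; fi
```
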
Comparing MD5 and SHA-2 for visual human diffing is like comparing a stick of dynamite to a landmine when trying to pop a pimple; any potential safety differences are trivial once you start using something in a fundamentally unsafe way.
https://crypto.stackexchange.com/questions/84520/how-long-wo...
"We have successfully run the computation during two months last summer, using 900 GPUs (Nvidia GTX 1060)."
Such resources can be easily rented from a cloud.
So for anyone willing to spend up to 100k USD, it is trivial to find SHA-1 collisions.
SHA2-256 is "only" 21975.5 MH/s so you'd have to double the number of GPUs or amount of time.
[1] https://gist.github.com/Chick3nman/32e662a5bb63bc4f51b847bb4...
For instance: law school training to strengthen ethics, well known punishments to act as a deterrent, hiring processes to filter out unscrupulous actors, and whistleblower protections to encourage and reward vigilance.
It’s not an opinion that adheres to the cynical zeitgeist, but in my experience most members of this profession are extremely trustworthy. I’m sure they dislike the stereotype of lawyer=rotter just as much as hackers are tired of being typecast as Newman… sorry, Dennis from Jurassic Park!
This is not true. It’s not possible to guarantee this. One can be certain that two files DON’T match if they have different hashes, but one cannot be certain that two files DO match based ONLY on the fact that they have the same hash.
The fact that any document can contain its own MD5 hash embedded in there should be hugely concerning enough.
The hash also happens to start with 5EAF00D.
And a PNG version too: https://news.ycombinator.com/item?id=32956964
But no one has made an exclusively plaintext (ASCII) MD5-quine yet, and I suspect doing so may be impossible given the characteristics of collision blocks.
1. a document containing "1", whose hash begins with "1"
2. a document containing "12", whose hash begins with "12"
3. a document containing "123", whose hash begins with "123"
#1 is certain to exist. #2 exists, but would take 16x as long to brute force. #3 would take 16x longer again. If this pattern doesn't continue until 2^128, where would it stop, and why?
All hashes can be brute-forced this way, even secure ones like SHA-2. Their security relies on the fact that the earth doesn't contain enough computing power to execute a brute-force attack within the universe's lifetime.
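The prefix brute-force is easy to demonstrate for short prefixes; each extra hex digit multiplies the expected work by 16, which is exactly why the full 2^128 is out of reach:

```python
import hashlib
from itertools import count

def find_document_with_prefix(prefix: str):
    # Try candidate "documents" 0, 1, 2, ... until one's MD5 hex digest
    # starts with the desired prefix; expected tries ~ 16^len(prefix).
    for tries, i in enumerate(count(), start=1):
        doc = str(i).encode()
        if hashlib.md5(doc).hexdigest().startswith(prefix):
            return doc, tries

for prefix in ("1", "12", "123"):
    doc, tries = find_document_with_prefix(prefix)
    print(f"prefix {prefix!r}: document {doc!r} found after {tries} tries")
```
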
By that logic, SHA 256 is also broken:
$ cat >sha256.py
from hashlib import sha256
s = 'from hashlib import sha256\ns = %r\nprint sha256(s%%s).hexdigest()\n'
print sha256(s%s).hexdigest()
$ sha256sum sha256.py
14cc85c420ced317fdb73e9403ac3f6e1d96d19c70ae0dce8da9b8d96fa0b4d3 sha256.py
$ python sha256.py
14cc85c420ced317fdb73e9403ac3f6e1d96d19c70ae0dce8da9b8d96fa0b4d3
(Yes, PDF is Turing complete. Yes, that's terrible. No, it doesn't have anything to do with hash function deficiencies; it's Turing complete on (malicious) purpose, just like webpages with JavaScript.) (I've never tried to build one of these, so I could be totally wrong here.)
I think that's pretty amazing, to be honest.
Unless I missed it, this article seems to not refute the most fundamental point: MD5 was never broken for encryption. Hashing is not encryption.
Moreover, MD5, SHA-1 and SHA-2 contain a block cipher function used in the Davies-Meyer mode of operation.
The internal block cipher function can be extracted and used in any other mode of operation possible for block cipher functions.
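As a toy illustration of the Davies-Meyer feed-forward (h' = E(key=m, h) XOR h), here is a sketch that uses TEA as a stand-in block cipher. This is purely illustrative: MD5 and the SHAs use their own dedicated internal ciphers, not TEA, and a 64-bit digest like this one is far too short to be secure.

```python
def tea_encrypt(block, key):
    # TEA: 64-bit block as two 32-bit halves, 128-bit key as four 32-bit words.
    v0, v1 = block
    k0, k1, k2, k3 = key
    delta, s, mask = 0x9E3779B9, 0, 0xFFFFFFFF
    for _ in range(32):
        s = (s + delta) & mask
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + s) ^ ((v1 >> 5) + k1))) & mask
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + s) ^ ((v0 >> 5) + k3))) & mask
    return v0, v1

def davies_meyer_hash(data: bytes) -> bytes:
    # Pad to whole 16-byte blocks; each message block becomes the cipher *key*,
    # and the chaining value is fed through as the *plaintext*:
    #     h_{i+1} = E(key=block_i, h_i) XOR h_i   (the feed-forward step)
    data = data + b"\x80" + b"\x00" * (-(len(data) + 1) % 16)
    h = (0x67452301, 0xEFCDAB89)  # arbitrary fixed IV for this toy
    for i in range(0, len(data), 16):
        key = tuple(int.from_bytes(data[i + j:i + j + 4], "big")
                    for j in range(0, 16, 4))
        e0, e1 = tea_encrypt(h, key)
        h = (e0 ^ h[0], e1 ^ h[1])
    return h[0].to_bytes(4, "big") + h[1].to_bytes(4, "big")

print(davies_meyer_hash(b"hello world").hex())
```

The feed-forward XOR is what makes the compression function one-way: without it, anyone holding the "key" (the message block) could simply decrypt and invert each step.
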
Because of these possibilities, many older laws in various places that prohibited the inclusion of encryption in software products while allowing secure hashing functions were completely misguided.
The point is that MD5 is no good if there's any way an adversary might want to subvert it. It's fine if you just want to use it for hashing your own documents, but as soon as there's an incentive for someone to substitute one document for another, MD5 is problematic.
That's certainly the case for encryption, but it's also the case for these legal document records.
It's broken in an adversarial situation: given the hash of evidence-file A, it's possible to construct a file B that gives the same hash.
But it would be a different matter entirely to construct a file B that actually looked like a file of evidence relevant to the case. I don't know how lawyers use these hashes, but unless they're being used to detect malicious tampering, I don't see what's wrong with MD5. And since the files to be hashed are evidence, they're in the custody of a court; things have got quite bad if court officials might be tampering with evidence.
No, that's a second preimage attack. MD5 is safe against preimage & second preimage attacks.
What MD5 is not safe against, is a collision attack: you can create two messages/files with different content, that end up having the same hash.
So to exploit the vulnerability, you have to be able to manipulate file A, the original piece of evidence, to construct a file B that has a matching hash. I still fail to see how this impacts files submitted to a court in evidence.
Given a message M, length function L(), and MD5 hash function H(); is there an attack which can generate message M', such that H(M)==H(M') _and_ L(M)==L(M')?
In other words: Two different messages, both of the same length, with the same hash?
It's almost like a chosen prefix collision attack, but with no prefix (so P is empty) and a given message (M is known, M' is up to the attacker).
I ask because I frequently use GridFTP for data transfer, and it uses both the file length and the MD5 hash to verify that files were transferred correctly.
MD5 is fine for the first task, and totally unacceptable for the second.
And it is unclear if that is in any way unusual.
Yes, I must say I smiled when I saw the author's assertion that moving the entire legal industry to a new hashing algorithm is "trivial".
Like I get that md5 is essentially a unique identifier and not meant to protect against malicious interference, but if all the exhibits had the same identifier, surely that would confuse people.
The point is that evidence is an agreement between the two sides in a case, and it's not an absolute thing.
If you have the original document that was signed by both parties, great. If you have a scan (using a lossless compression format) of the document and proof that the original was destroyed, great. If you have a scan but no proof of destruction, still great. If you have a photograph of the document and no proof, still great. If you have a vague recollection of what was in the document, still great. All of these are "great" if the other side accepts that they are accurate depictions of the original. If they don't accept that, then there's an argument about what the original document contained and the provenance of the evidence, and only then does the actual quality matter. Original document with wet signature is hard to argue with (but not impossible - wet signatures can be forged). The further away from that, the easier it is to argue that the document presented is not accurate and should not be accepted as evidence.
Knowing that it's possible to use collisions to create false evidence doesn't matter if no-one contests the evidence. It only becomes significant if one side says that the document has been tampered with, and that's not that common. The side claiming it was tampered with would have to present their version of the document, and their version of events that allowed the document to be tampered with, and so on. The judge would make a ruling about which version of the document was considered the "real" one and the case would continue. Obviously there are edge cases where the whole trial verdict hinges on which version of the document is the correct one, but they're edge cases. And in those cases you could re-hash the documents involved and double-check which one was right, etc.
In the OP's example, where a letter of recommendation has the same hash as an authorisation letter, this is only going to matter if one side says the accused was authorised and the other says they weren't. The authorisation letter will be produced by one side, and the recommendation letter produced by the other, and there'll be an argument about which was the original document. The fact that they have the same hash isn't really relevant. It's a minor point of interest given that these are two clearly different documents saying different things.
In the specific cases for the SFO that I worked on, the SFO descended on the accused's offices like locusts, sweeping every single document into carefully numbered bags. We scanned the documents in secure facilities, stored the originals in secure facilities, stored the resulting images in secure storage, and deleted any cache or copies. My professional opinion is that it would be impossible for anyone to create two documents prior to the SFO's investigation that would create an intentional MD5 collision in the evidence used in court. And, even if they somehow did, it wouldn't matter because both documents would be in evidence bags in storage and could be recovered to be examined by the court.
Obviously, from a black/white technical point of view, using a better hash algorithm would be better. But I can see why the legal profession is reluctant to adopt the new thing; it's a hassle and it will only affect a tiny amount of cases, if any.
To the article, what tptacek said.
It will provide a speedup, but it's not like Shor's algorithm - you need a really powerful quantum computer before md5 comes under threat.
But to be clear: MD5 is broken; do not use it.
It is much more computationally feasible to create two inputs from scratch that hash to the same value than to forge an existing document’s hash (the threat model I’m assuming they’re discussing in relation to the law).
As far as I know, there has been no demonstrated second preimage attack on md5. Not saying to keep using it, just trying not to spread FUD.
Edit: I do see second preimage is mentioned about 3/4 of the way through the article. I confess that I did stop reading and started skimming before then.
A successful second preimage attack is needed if you want to make a second variant with the same hash like an already existing legal document.
However, when the original document is not yet in the possession of others (or there might be a way to destroy or replace their older copies), you can make more or less invisible modifications to it, so that a second, different document will have the same hash as it. Then the altered original document can be handed to other parties, who will not notice changes from whatever had been agreed, while you keep an alternative document that can be shown later as having the same hash.
While opportunities for such a forgery should happen less often, it is much better to use a collision-resistant hash to completely remove this possibility.
After the first high-profile case where authenticity of evidence gets called into question because a seized electronic document was deliberately doctored to allow for a hash collision (if that ever happens), there will be a will to change to something new.
[1] unless I'm missing something, this boils down to: "given f(x: string) => y, how can I minimize the odds that you can generate an X for a desired Y"