Stack Overflow used those MD5 hashes to display Gravatars. Downloading old Stack Exchange data dumps is simply an easy way of collecting Stack Overflow's Gravatar hashes.
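For anyone wondering why an email hash ends up in the data at all: as far as I know, the Gravatar scheme in use at the time was the MD5 hex digest of the trimmed, lowercased email address, used as the path of the avatar URL. A minimal sketch of that, where the function names are my own and not from any library:

```python
import hashlib


def gravatar_hash(email: str) -> str:
    """MD5 hex digest of the trimmed, lowercased address (no salt involved)."""
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def gravatar_url(email: str) -> str:
    """Avatar URL built from the hash; anyone who sees the URL sees the hash."""
    return "https://www.gravatar.com/avatar/" + gravatar_hash(email)
```

Because there is no salt, the same address always yields the same hash, so anyone can recompute it from a guessed address and compare.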
> Some time in 2013 the email hash issue was brought to light and Stack Exchange promptly removed the email hashes from their dumps, but the damage was already done.
The issue has been known since at least 2009: http://www.developer.it/post/gravatars-why-publishing-your-e...
And I warned of this exact issue on meta.stackexchange in 2011: https://meta.stackexchange.com/a/84734/152255
In 2011 the dump contained 95k unique email hashes plus 10k IPv4 addresses, while the dump mentioned in the article had grown to 1.8M.
Stack Exchange ignored the issue for several years instead of fixing it promptly. That delay increased the number of affected email addresses roughly 20x.
You can of course hash public email lists but that won't work for unique emails.
The second to last paragraph in the article says: “With my Nvidia RTX 3080 it only took 3 minutes and 17 seconds to process. Hashcat was able to recover 51.81% (972,933) of the hashes.”
Unlike hashing algorithms designed to be hard to reverse, MD5 is very fast to compute, so rainbow tables are quick and easy to create.
A public list of emails only speeds up the process and is not required, as you could just build a completely random rainbow table. Prioritizing common email patterns in the table is also just another speed boost.
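To make the "hash a list of candidates and compare" idea concrete, here is a rough sketch. Strictly speaking it is a plain dictionary lookup rather than a true rainbow table (which uses reduction chains to trade time for space), and it is not what hashcat does internally. The helper names, the name list, and the domains are all hypothetical:

```python
import hashlib


def md5_hex(email: str) -> str:
    """Unsalted MD5 of the trimmed, lowercased address."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def expand_patterns(names, domains):
    """Yield a few common address shapes for each (first, last) pair."""
    for first, last in names:
        for domain in domains:
            yield f"{first}.{last}@{domain}"
            yield f"{first}{last}@{domain}"
            yield f"{first[0]}{last}@{domain}"


def recover_emails(leaked_hashes, candidate_emails):
    """Return {hash: email} for every candidate whose MD5 is in the dump."""
    targets = set(leaked_hashes)  # set gives O(1) membership checks
    recovered = {}
    for email in candidate_emails:
        digest = md5_hex(email)
        if digest in targets:
            recovered[digest] = email.strip().lower()
    return recovered
```

Something like `recover_emails(dump_hashes, expand_patterns(names, domains))` would return only the addresses that follow a guessed pattern, which is why genuinely unusual addresses tend to survive this kind of attack.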
So yes, literally every security expert out there knows md5 hashes are weak, hence the advent of salted hashes, and nowadays sha and variants, as well as well-known hash types for passwords like bcrypt/argon2.
https://meta.stackexchange.com/a/84734/152255
I think a bit later somebody else did a similar attack on a different dataset, and recovered about 50% using a GPU-based hasher (a GPU can burn through billions of MD5 hashes per second).
https://arstechnica.com/information-technology/2013/12/crypt...
ETA: the author isn't claiming to have found a preimage attack, but brute force is a legitimate tactic when the search space is constrained, and I'd argue it counts for claiming that those particular hashes were reversed.
The search space is mostly lowercase letters and maybe . or +
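A back-of-the-envelope sketch of what that constraint buys an attacker, assuming a 28-symbol alphabet (a-z plus `.` and `+`) for the local part and a known domain. The function names and the single-domain assumption are mine; a real tool would use masks and a GPU rather than a Python loop:

```python
import hashlib
from itertools import product
from string import ascii_lowercase

ALPHABET = ascii_lowercase + ".+"  # 28 symbols: a-z plus '.' and '+'


def keyspace(max_len: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of local parts of length 1..max_len over the alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))


def brute_force(target_hash: str, domain: str, max_len: int):
    """Try every local part up to max_len; return the matching address or None."""
    for length in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=length):
            candidate = "".join(chars) + "@" + domain
            digest = hashlib.md5(candidate.encode("utf-8")).hexdigest()
            if digest == target_hash:
                return candidate
    return None
```

`keyspace(8)` is about 3.9e11 candidates per domain, so at an assumed 10^9 MD5 hashes per second (the "billions per second" figure mentioned in this thread) exhausting it would take on the order of minutes.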
So it literally isn’t a reverse lookup but it’s still pretty broken.