But there is no call for a cryptographic hash here. This isn't being used as any sort of ID or to verify integrity outside of corruption.
The API works on top of TLS, which already includes cryptographic authentication of all data (usually via SHA-1/2 HMAC or AES-GCM).
The hash would be computed on the client right after reading from disk and right before TLS encryption, and since they seem to terminate TLS at the storage server, it would be verified right after TLS decryption and right before the data hits storage, so it doesn't seem to provide any gain.
I think they should just remove it, or at least make it optional.
This is all rare, but it does happen. This is why the GCS team wants to know if you are seeing corruption on file upload as it might be some bad hardware failing in a non-obvious way.
There's the write path from when B2 receives your bits to when they're stored on disk, for one. You could have unforeseen bugs in the code sitting on the other end of their upload URL (it's probably not all their own code, and even if it were, it was written by human developers).
Or B2's internal network path (if they have any) between that and the disk. Ideally that would provide integrity too, but maybe not. They offer a low price point and call out other compromises they make to achieve it (e.g. limited load balancing) - so while I really doubt it, it's remotely plausible they deem the internal overhead of SSL too high.
But then there's the potential for mismatch between "what the customer thinks they uploaded" and "what the customer actually uploaded" too! Less of an issue for now because their API only appears to support uploading files all at once, but eventually I'm sure they'll support a multipart upload scheme like the other platforms do. At which point uploads become more complicated since clients need to retain state and potentially resume. What if a client screws it up and there's some off-by-one error (or whatever)? If you can provide instant feedback, at upload time, that your clients provided bogus data, that's a good thing.
You can argue it's a painful requirement to force on users since it means they have to track/compute it themselves (which might be nontrivial for streaming applications), and that's fair. But there are enough points of failure, and the numbers involved are so large, that errors happening is a fact and you really need to insure against it. Especially here: your entire reason for existing is to reliably store bits, so it's kind of important to get it provably right.
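For the streaming case mentioned above, the tracking isn't actually that painful: you can hash incrementally as chunks pass through, without ever buffering the whole object. A minimal sketch using Python's stdlib (the chunk source and SHA-1 choice here are illustrative, not any provider's actual API):

```python
import hashlib

def sha1_of_stream(chunks):
    """Incrementally hash an iterable of byte chunks, so a streaming
    uploader can compute the checksum without holding the whole file
    in memory."""
    h = hashlib.sha1()
    for chunk in chunks:
        h.update(chunk)  # feed each chunk as it goes out on the wire
    return h.hexdigest()

# A client can feed each chunk to both the socket and the hasher,
# then send the finished digest at the end of the upload.
digest = sha1_of_stream([b"hello ", b"world"])
```

The same pattern works for resumable/multipart uploads: keep one hasher per part, and the server can reject a part the moment its checksum doesn't match.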
It seems completely sensible to err on the side of caution, especially as a new and relatively unproven platform (as an object storage platform provider I mean, obviously they have tons of experience storing things).
If you're handling data on behalf of others, it's paramount that you checksum data end-to-end. Amazon S3 allows you to do this by sending the MD5 or SHA along with the data. Google Cloud Storage allows you to do this with CRCs (which, despite what others in this thread say, are more appropriate for the task than crypto hashes, as long as you use enough bits).
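Concretely, S3's Content-MD5 header carries the base64 encoding of the binary MD5 digest, and GCS's CRC check uses a base64-encoded big-endian CRC32C. A rough sketch with the stdlib (the payload is made up, and zlib's plain CRC32 stands in for CRC32C, which needs a third-party library):

```python
import base64
import hashlib
import zlib

data = b"example object payload"  # hypothetical object body

# S3-style: Content-MD5 is the base64 of the raw 16-byte MD5 digest,
# not the hex string.
content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

# GCS-style: a 4-byte big-endian CRC, base64-encoded. Illustrated here
# with zlib.crc32; GCS actually uses CRC32C (Castagnoli polynomial).
crc = zlib.crc32(data) & 0xFFFFFFFF
crc_b64 = base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")
```

Either way the server recomputes the value over the bytes it received and fails the request on mismatch, which is exactly the end-to-end feedback being argued for.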
https://github.com/Cyan4973/xxHash
You would think that if it's just being used as a checksum, anything that passes https://code.google.com/p/smhasher/wiki/SMHasher with high marks would be sufficient.
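For corruption detection the bar really is that low: any decent checksum flags a flipped bit immediately. A toy demonstration with stdlib CRC32 (xxHash would behave the same way, but it lives in a third-party package):

```python
import zlib

original = b"a" * 1024
corrupted = bytearray(original)
corrupted[512] ^= 0x01  # flip a single bit, simulating silent corruption

# CRCs detect all single-bit errors by construction, so the two
# checksums are guaranteed to differ here.
assert zlib.crc32(original) != zlib.crc32(bytes(corrupted))
```

What a non-cryptographic checksum does not give you is resistance to a deliberate attacker crafting a collision, but as the thread notes, that isn't the threat model for upload integrity.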