S3 Strong Consistency (opens in new tab)

(aws.amazon.com)

699 pointsjuliansimioni5y ago227 comments

227 comments

156 comments · 48 top-level

WrtCdEvrydy5y ago· 18 in thread

Has anyone ever seen S3 behave eventually consistent? I have not seen a lot of eventual consistency in the real world but I wonder if I'm just working on the wrong problems?

macksd5y ago

Yes. In fact I spent months working on a system to workaround all the problems it caused for Hadoop: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hado.... We had a lot of automated tests that would fail because of missing (or more recently, out of date) files a significant portion (maybe 20-30%) of the time. We believed that S3 inconsistency was to blame but I always felt like that was the typical "blame flaky infra" cop-out. As soon as S3Guard was in place all those tests were constantly green. It really was S3 inconsistency, routinely.

To be fair to S3, Hadoop pretending S3 is a hierarchical filesystem is a bit of a hack. But I had cases where new objects wouldn't be listed for hours. There's only so much you can do to design an application around that, especially when Azure and Google Storage don't have that problem.

banana_giraffe5y ago

Yes, a ton. The big issue for me was Does key exist? -> No -> Upload object with key. Followed by Get key -> Not found. Less rarely, I'd see Upload object followed by an overwrite object, followed by a download download some combination of the two if the download was segmented (like AWS's cli will do).

The first one happened often enough to cause problems, the second one was a fairly rare event, but still had to be handled.

It might have been more of an issue with "large" buckets. The bucket in particular where I had to dance around the missing object issue had ~100 million fairly small objects. I ended up having to create a database to track if objects exist so no consumer would ever try to see if a non-existent object existed.

Time to revisit all that mess, I suppose.

allengeorge5y ago

Yes, we have. Specifically:

1. LIST 2. PUT 3. LIST

would trigger situations where (3) wouldn't include the object inserted in (2). This is well-known however.

juancampa5y ago

Also:

  1. HEAD key -> 404
  2. PUT key  -> 200
  3. GET key  -> 404 (what? But I just put it!)

This is commonly used for "upload file if it doesn't exist"

1 more reply

roseway45y ago

Yes. Writing objects from EMR Spark to S3 buckets in a different account. Immediately setting ACLs on the new objects fail as the objects are not yet “there”. Using EMRFS Consistent View was previously the obvious solution here, but it adds significant complexity.

enw5y ago

EMRFS consistent view is a mess. Changes to S3 files are likely to break EMRFS, the metadata needs to be kept in sync (otherwise the table will keep increasing forever) with tools that are a pain to install, and when I used it to run Hive queries from a script, it didn't work.

1 more reply

brian_herman5y ago

Yeah, I never had that problem either. But I wonder how they are addressing the https://en.wikipedia.org/wiki/CAP_theorem

content_sesh5y ago

Sacrificing availability I reckon.

1 more reply

SpicyLemonZest5y ago

I've never seen the minutes long delays that you were theoretically supposed to accommodate, but second-scale delays were pretty common in my team's experience. (We're writing a lot of files though, probably hundreds or thousands per second aggregated across all our customers.)

WrtCdEvrydy5y ago

Yeah, I got seconds but people were saying there were minutes of data... max I ever saw was 3 seconds on a 2gb file...

1 more reply

enw5y ago

All the time when using EMR with S3. Really frequent, and really annoying to mitigate.

rantwasp5y ago

yes. i believe newer region had read after write in a while but you needed to use the right endpoint

i remember a time when if you were using the us-standard “region” and were unlucky it could take 12 hours for your objects to become visible

TheDong5y ago

Yes.

I've observed eventual consistency with S3 prior to this change where it took on the order of several hundred milliseconds to observe an update of an existing key.

I've observed this as recently as 2 years ago.

johncolanduoni5y ago

There used to be a behavior with 404 caching that was pretty easy to trigger: if you race requesting an object and creating it, subsequent reads could continue to 404.

OJFord5y ago

Yes, it caused us some pain with multipart uploads that we sort of proxied, them being multipart from the client to us as well.

richwater5y ago

Yes. When you're writing terabytes of data files it was very obvious.

That's why manifest files became so popular.

dilatedmind5y ago

I think it was already strongly consistent within regions?

like if you had tried to read a non-existent key, then wrote to it, it might continue to appear to not exist for a minute?

TheDong5y ago

Their previous consistency was RAW, but not RAU (for most regions. I think us-east-1 was missing RAW consistency for quite a while).

After a write, you would _always_ be able to read the key you just wrote.

After an update, you could get a stale copy of the key if your subsequent read hit a different server.

1 more reply

k__5y ago· 15 in thread

It's interesting to read all these comments here that talk about the eventual consistency like it was some kind of bug.

deathanatos5y ago

The previous behavior (new keys weren't strongly consistent) were weird, and made a lot of patterns that AWS itself sort of guides you towards, broken.

For example, you can set up an SQS queue to be informed of writes into a bucket. The idea being that some component writes into the bucket, and a consumer then fetches the data & processes it. Except — gotcha! — there wasn't a guarantee of consistency. It would usually work, though, and so you'd get these weird failures that you couldn't explain unless you already knew the answer, really.

Same thing with how S3 object names sort of feel like file names, and that you might use them for something useful, and you design around that, and then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this, unlike the consistency guarantees.)

jsjohnst5y ago

> then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this

You likely won’t, unless you hit a bucket size that has at least a billion small files inside. If you get into the trillion and above range and are heavily unbalanced due to a recent deploy of yours, ooops, there goes S3. Or it did in 2013 anyway.

0xbadcafebee5y ago

So the problem was that users don't know how S3 works?

howlgarnish5y ago

Are there any scenarios where you would want eventual consistency over strong consistency? (Assuming pricing, performance, replication etc are the same.)

acdha5y ago

Your last sentence is the answer I’ve always heard given: this in the context of a performance trade off — do you wait for every node in the system to acknowledge an update and purge caches, or return as soon as a sufficient durability threshold has been hit?

nhumrich5y ago

At the benefit of added availability? absolutely. This is the entire premise of mongodb.

2 more replies

robotresearcher5y ago

No. A magical do-everything datastore would have strong consistency. Eventual consistency is something you might settle for given the performance cost trade offs.

To see this, consider that strong consistency has all the guarantees of eventual consistency, plus some more. And the additional guarantees might make your application much easier to write.

TheCoelacanth5y ago

Job security?

dboreham5y ago

Train tunnels

crazyjncsu5y ago

No.

CobrastanJorji5y ago

If someone uses your API wrong, that's their bug. If a huge number of people use your API wrong, that's your bug.

zwily5y ago

It wasn’t a bug, it was well-documented behavior. However, it was the root of many bugs by people building on top of S3 that did not take the eventual consistency into account.

I’ll bet this release is going to fix a lot of weird bugs in systems out there using S3.

k__5y ago

Guess the name simple storage service gave many people the idea that you could use it without a basic idea of distributed systems.

SilasX5y ago

I'm pulling my hair out because people are using eventual consistency to mean "eventual consistency with a noticeable lag before the consistency is achieved", which bothers the pedant in me.

I mean, technically, strong consistency implies eventual consistency (with lag = 0). But everyone's equating eventual consistency with the noticeable lag itself, implying that EC per se is a bad thing.

For an analogy, it would be like if people were talking about how rectangular cakes suck, because <problems of unequal width vs length>, and thus they use square cakes, but ... square cakes are rectangular too.

(What's going on is that people use <larger set> to mean <larger set, excluding members that qualify for smaller set>. <larger set> = eventual consistency, rectangle; <smaller set> = strong consistency, square.)

Obviously, it's not a problem for communication because everyone implicitly understands what's going on and uses the words the same way, but I wish people spoke in terms of "has anyone actually seen a consistency lag?" rather than "has anyone actually seen eventual consistency?", since the latter is not the right way to frame it, and is actually a good thing to have, which happened both before and after this development.

benlivengood5y ago

Eventual consistency with zero lag is not strong consistency. Two simultaneous conflicting writes with eventual consistency will appear successful. Strong consistency guarantees that one of the writes will fail.

EDIT: I think the reason people care about the lag is specifically for this case; if you want to know if your eventually consistent write actually succeeded then you need to know how long to wait to check.

1 more reply

damon_c5y ago· 10 in thread

The more I learn about S3 the more I feel the same awe I feel when looking at the Pyramids or Ankor Wat.

> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an S3 bucket.

dheera5y ago

Side question -- why is s3fs so slow? I get only about 1 put per 10 seconds with s3fs. One would hope that you could use something like s3fs as a makeshift S3 API, but its performance is really horrible.

rovr1385y ago

I can't answer this directly, but I thought I'd chime in with a possible alternative, or at least something else to test.

I'm using rclone with S3[1] (and others). There are commands[2] where you can use to just sync/copy/ files but it also has a mount option[3]. It also has caching[4] and other things that might help.

[1] https://rclone.org/s3/

[2] https://rclone.org/commands/

[3] https://rclone.org/commands/rclone_mount/

[4] https://rclone.org/cache/

gaul5y ago

This is not expected behavior. An older version had pessimized locking which preventing concurrently adding a file to a directory and listing the directory simultaneously which is the most similar symptom. I recommend upgrading to the latest version, running with debug flags `-f -d -o curldbg`, and observing what's happening. Please open an issue at https://github.com/s3fs-fuse/s3fs-fuse/issues if your symptoms persist.

banana_giraffe5y ago

One thing I noticed when I went digging through the s3fs source trying to answer that myself is that it's uploader is single threaded. Part of what makes AWS's cli reasonably fast is that it uses several threads to make sure to use as much bandwidth as possible. If I limit AWS's tool to one thread on a box with plenty of bandwidth, it turns my upload down from around 135 MiB/s to around 20 MiB/s

Not sure if that's all of it, or even the majority of the slowness.

1 more reply

web0075y ago

One fun fact that I learned recently: a prefix is not strictly path-delimited. I would think of /foo/bar and /foo/baz and /bar/baz as having two prefixes, but it could be anywhere from one to three, depending on how S3 has partitioned your data.

rossmohax5y ago

I asked around and it seems, that `prefix` here is exact value for `prefix` or `key` query in the API. So you can make 5500 ListObject rps for prefix=/a/b and simultaneously make 5500 GetObject rps for key=/a/b/file1 and not be ratelimited.

In other word ratelimiting key is `${HTTP_METHOD}(${QURERY_PARAM_KEY}|${QUERY_PARAM_PREFIX})`

randallsquared5y ago

It shouldn't be one if you're sending 3500+ requests a second. It may take some time for partitioning to happen, though.

1 more reply

swyx5y ago

11 nines of reliability. I don't know anything else promising that (unclear if they actually reach it)

ignoramous5y ago

Route53 promises a 100% availability.

https://www.slideshare.net/AmazonWebServices/under-the-hood-...

https://aws.amazon.com/route53/sla/

darkr5y ago

11 nines of _durability_. Uptime SLA for s3 is 99.9% AFAICR

2 more replies

yodsanklai5y ago· 9 in thread

Could someone describe a few real-life scenarios where this is useful and noticeable?

Macha5y ago

You have a processing job which dumps a bunch of output files in a directory. A downstream job uses these files as input, sees a new directory, and pulls all the files in the new directory.

Because s3 was not strongly consistent, you would have the downstream job see a arbitrary subset of the files for a short while after creation, and not just the oldest files. This could cause your job to skip processing input files unless you provided some sort of manifest of all the files it would expect in that batch. So then you'd have to load the manifest, then keep retrying until all the input files showed up in whatever s3 node you were hitting.

mullen5y ago

If this is an issue, you wait X amount of seconds after the file has been created and then start processing it. This would allow the file to be consistent before processing it.

A better idea would be triggering Lambda jobs which either directly processes the files as they are added to S3 or trigger Lambda jobs which add the files to SQS and each job in the SQS is processed by another Lambda job.

1 more reply

PetahNZ5y ago

Our image processing worker queues/servers write an S3 object and dispatch a follow up job. Currently we have to delay the next job (we use 5 seconds) otherwise then next job may start processing before the S3 object is available (it 404's if the next job is run straight away).

random56345y ago

Same issues updating a file - you’d update on s3, then trigger processing and even though it was 50ms later - you’d get OLD version. I didn’t know gcp didn’t have this issue - would have been a big selling point

ramraj075y ago

That sounds _very_ wrong, what type of throughput are we talking about here?

2 more replies

mcqueenjordan5y ago

Out of curiosity, did you have retries on the GETs?

1 more reply

HiJon895y ago

We used to use S3 for Maven artifact storage. This is mostly an append-only workload, however Maven updates maven-metadata.xml files in place. These files contain info about what versions exist, when they were updated, what the latest snapshot version is, etc. We would see issues where a Maven build publishes to S3, and then a downstream build would read an out-of-date maven-metadata.xml and blow up. Or worse, it could silently use an older build and cause a nasty surprise when you deploy. It only happened a small percentage of the time, but when you’re doing tens of thousands of builds per day it ends up happening every day.

We switched to GCS for our Maven artifacts and the problem went away.

riking5y ago

To be clear: the "blowing up" would occur when a client observed the new maven-metadata.xml file, but old ("does-not-exist") records for the newly uploaded artifact, correct?

With this update, ordering the metadata update after the artifact upload means this failure is now impossible.

1 more reply

juliansimioniOP5y ago

One very common one is for situations where you might have a multi-step pipeline to process data

- step 1 generates/processes data, stores it in S3, overwriting the previous copy. triggers step 2 to run

- step 2 runs, fetches the data from s3 for its own processing. However, because only a few seconds have elapsed, step 2 fetches the old version of data from the S3 bucket

You can work around this by, for example, always using unique S3 object keys, but then you have to coordinate across the data processing steps, and it becomes harder to manage things like storing only the 10 latest versions.

The Argo workflow tool (https://argoproj.github.io/) is one example of a tool that can suffer from this problem.

boulos5y ago· 6 in thread

Disclosure: I work on Google Cloud.

This is super awesome for customers. I am also beyond excited for all the open-source connectors to finally be simplified so that they don't have to deal with the "oh right, gotta be careful because of list consistency". It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.

I'm now much more inclined to make my GCS FUSE rewrite in Rust also support S3 directly :).

gopalv5y ago

> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop

It's been a few years, but I lost customer data on gs:// due to listing consistency - create a bunch of files, list dir, rename them one by one to commit the dir into Hive - listing missed a file & the rename skipped the missing one.

Put a breakpoint and the bug disappears, hit the rename and listing from the same JVM, the bug disappears.

Consistency issues are hell.

> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.

Yes!

S3guard was a complete pain to secure, since it was built on top of dynamodb and since any client could be a writer for some file (any file), you had to give all clients write access to the entire dynamodb table, which was quite a security headache.

boulos5y ago

If it's been a few years, it was likely before the migration to Spanner:

https://cloud.google.com/blog/products/gcp/how-google-cloud-...

We posted that blog in early 2018, but stated:

> Last year we migrated all of Cloud Storage metadata to Spanner, Google’s globally distributed and strongly consistent relational database.

So maybe pre-2017?

1 more reply

basicneo5y ago

> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop

Is this public?

ceocoder5y ago

Yes, https://github.com/GoogleCloudDataproc/hadoop-connectors

disclosure: I work at Google as well.

2 more replies

ofek5y ago

Is your rewrite available?

boulos5y ago

Not yet, it’s too crappy :). I only get to it every once in a while and it shows.

As an example, until last weekend, I would just hardcode an access token rather than doing an oauth dance. Luckily, tame-oauth [1] from the folks at Embark was reasonably easy to integrate with.

I also got a little depressed that zargony/rust-fuse was stuck on an ancient version until I learned that Chris Berner from OpenAI had forked it earlier this year to modernize / update it [2].

[1] https://github.com/EmbarkStudios/tame-oauth

[2] https://github.com/cberner/fuser

1 more reply

mholt5y ago· 5 in thread

Would this enable atomic operations with S3 (e.g. "put object only if it does not exist")?

leakybucket5y ago

The copy object operation has options to control if the copy occurs based on examining source metadata, see "x-amz-copy-source-if-match" at https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObje... .

With the consistency change, those might be useful as the basis for atomic operations.

OJFord5y ago

Could you not already do that by specifying ETag?

rcaught5y ago

No, because the object may not be there (but already uploaded) with eventual consistency.

1 more reply

scarface745y ago

The Etag is not a reliable hash of the file contents. It will be different if the file was uploaded in multiple parts like using the CLI than if it was moved as one operation like copying from one bucket to another. You can tell the difference because the Etag will have a “-“ in it if it was uploaded as a multipart upload.

ryanworl5y ago

I also want the answer to this, but to the best of my knowledge it does not.

cash225y ago· 5 in thread

How do they accomplish strong consistency without any hit to availability?

YZF5y ago

While I don't know the specifics I can say that availability is down to engineering practices.

Let's say that their consistency model is achieved via quorum, that is writes write to a quorum of nodes while reads read from a quorum of nodes (of their metadata database) then this guarantees read after write consistency.

The availability aspect of this is just engineering, making sure you're never down to less than a quorum of nodes. E.g. by having redundant power supplies, generators, networks, or whatever engineering it takes to reduce the probability of failure.

There's other aspects, e.g. latency, that may suffer, but again this is solved via engineering. Just throw more hardware at it to bring the latency down. The only time where you absolutely can't solve it is if you provide strong consistency across geographical regions that are far apart, there's no way then not to pay that latency.

This is just another example of why the CAP theorem isn't really as useful to determining limitations of practical systems as it may seem at first site.

beoberha5y ago

I hope they write a paper or at least do a talk about how they went about it. Usually these things are more involved than an outsider would think. Agreed that you can beat most of the CAP tradeoffs with significant engineering effort, but it’s usually pretty interesting how they accomplish it. Especially at S3’s scale where the COGs impact of “throw more hardware at it” can cost on the order of tens or hundreds of millions.

throw935y ago

>> Let's say that their consistency model is achieved via quorum, that is writes write to a quorum of nodes while reads read from a quorum of nodes (of their metadata database) then this guarantees read after write consistency.

If reads are reading only from a quorum of nodes how do you guarantee they have latest data? In theory, while a node is servicing a read request wouldn’t you need to query “all” other nodes to verify that the data in current node is latest? What if the quorum of nodes don’t have the latest data yet?

1 more reply

valenterry5y ago

> (of their metadata database)

With that you have just moved the question to: how do they ensure that the metadata database is available _and_ strongly consistent at the same time for all the requests?

Because the metadata database is certainly also a distributed one, hence you need to query _all_ the nodes or can end up in a split-brain situation and lose consistency (or availability if you choose to down the system or a part of the system).

3 more replies

valenterry5y ago

Good question. There will be some drawback from having strong consistency indeed.

tutfbhuf5y ago· 4 in thread

If someone is wondering why this is huge. It's because of the problems you face, when you don't have strong consistency. See this picture to get an understanding of what kind of problems you might run into with eventual consistency:

https://i.ibb.co/DtxrRH3/eventual-consistency.png

pachico5y ago

You just made my day :)

rafaelturk5y ago

I shall print this..

freemint5y ago

Please support the original artist and make sure you have a license to print it before you do.I

4 more replies

pantulis5y ago

Unbelievably good!

bob10295y ago· 4 in thread

This is really cool. We actually got burned on weak S3 consistency a few weeks ago when generating public download links for a customer. Took us a few hours of troubleshooting to realize they had downloaded a cached/older version of the software we had uploaded to the same URL just a few minutes prior. Resolution was to use unique paths per version to guarantee we were talking to the right files each time.

One potentially related item I was thinking about - How does HN feel about the idea of a system that has eventual durability guarantees which can be inspected by the caller? I.e. Call CloudService.WriteObject(). It writes the object to the local datacenter and asynchronously sends it to the remote(s). It returns some transaction id. You can then pass this id to CloudService.TransactionStatus() to determine if a Durable flag is set true. Or, have a convenience method like CloudService.AwaitDurabilityAsync(txnid). In the most disastrous of circumstances (asteroid hits datacenter 1ms after write returns to caller), you would get an exception on this await call after a locally-configured timeout in which case you can assume you should probably retry the write operation.

I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication. You could not await the durability for 4 nines or wait the additional 0-150 ms to see more nines. I wonder how this risk profile sits with people on the business side. I feel like having a window of reduced durability per object that is only ~150ms wide and up-front can be safely ignored. Especially considering the hypothetical ways in which you could almost instantaneously interrupt the subsequent activities/processing with the feedback mechanism proposed above.

0xbadcafebee5y ago

> they had downloaded a cached/older version of the software we had uploaded to the same URL

This is a great example (to me) of how weak consistency is useful. It exposed a problem you otherwise wouldn't notice.

You were deploying offline artifacts to customers without giving the customer the version of the new artifact. Regardless of backend consistency, the customer will not be able to tell (from the URL, anyway) what version they are downloading. Nor will they be able to, say, download the old version if the new version introduces bugs. By changing your deployment method to use unique URLs per version, your customer gains useful functionality and you avoid having to depend on strong consistency (which actually removes expensive requirements).

If you design for weak consistency, your system ends up more resilient to multiple problems.

benlivengood5y ago

> I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication.

What if your application crashes in the middle of waiting?

Ultimately if you want full consistency you need it everywhere. Your workflow writing to consistent storage needs to be an atomic output of its own inputs with a scheduler that will re-run the processing if it hasn't committed its output durably.

bob10295y ago

The application crashing while awaiting durability is equivalent to not awaiting the assurance mechanism in the first place.

Ways to recover from this could include simply scanning through entities modified previously to determine durability status. Other application nodes could participate in this effort if the original is not recoverable. An explicit transaction should be used in cases where partial completion is not permissible, as with any complex database operation.

This is a tricky way to think about the problem because you are still getting transactional semantics regardless. The only thing that would be in question is if your data can survive an asteroid impact. With this model, even with an application crashing, there is still an exceedingly high probability that any given object will survive into additional asynchronous destinations (assuming a transaction scope is implicit or externally completed).

I think the biggest problem in selling this concept is with the fatalists in the developer/business community who are unwilling to make compromises in their application domain logic in order to conform to our physical reality. From a risk modeling perspective, I feel like we have way more dangerous things to worry about on a daily basis than an RPO in which the latest ~150ms of business data at one site might be lost in an extremely catastrophic scenario. Trying to synchronously replicate data to 2+ sites for literally everything a business does is probably going to cause more harm than good in most shops. If you operate nuclear reactors or manage civilian airspace, then perhaps you need something that can bridge that gap. But, most aren't even remotely close to this level of required assurance.

scrollaway5y ago

It would be a cool interface, I would love to see that.

jchw5y ago· 3 in thread

I believe this makes Amazon S3 behave more similar to Azure blob storage[1] and Google Cloud Storage[2], which is pretty convenient for folks who are abstracting blob stores across different offerings.

For what it’s worth, consistency in S3 was usually pretty good anyways, but I ran into issues where it could vary a bit in the past. If you designed your application with this in mind, of course, it shouldn’t be an issue. In my case I believe we just had to add some retry logic. (And of course, that is no longer necessary.)

[1]: https://docs.microsoft.com/en-us/azure/storage/common/storag...

[2]: https://cloud.google.com/storage/docs/consistency

joana0355y ago

And Minio too: https://docs.min.io/docs/distributed-minio-quickstart-guide....

eloff5y ago

And Wasabi as well: https://wasabi-support.zendesk.com/hc/en-us/articles/1150016...

y4m4b45y ago

Wasabi is eventually consistent for overwrites ``eventual consistency for overwrite HTTP PUTs and HTTP DELETEs``

This is not the same.

1 more reply

tutfbhuf5y ago· 3 in thread

Ask HN: I have a question regarding strong consistency. The question is related to all geo distributed strong consistent storages, but let me pick Google Spanner as example.

Spanner has a maximum upper bound of 7 milliseconds clock offset between nodes, thx to TrueTime. To achieve external consistency, Spanner simply waits 7 ms before committing a transaction. After that amount of time the window of uncertainty is over and we can be sure, that every transaction happens to be in the right order. So far so good, but how does Spanner deal with cross data center network latency? Let's say I commit a transaction to Spanner in Canada and after 7 ms, I get my confirmation, but now in Australia someone also does a conflicting transaction and also get his confirmation after 7 ms. Spanner however, bound to the laws of physics, can only see this conflict after 100+ ms delay, due to network latency between the datacenters. How is that problem solved?

foota5y ago

This documentation talks about it in brief: https://cloud.google.com/spanner/docs/whitepapers/life-of-re.... You can read the spanner paper for more detail: https://static.googleusercontent.com/media/research.google.c...

The simple answer is that there are round trips required between datacenters when you have a quorum that spans data centers. Additionally, one replica of a split is the leader, and the leader holds write locks. So already you have to talk to the leader replica, even if it's out of the DC you're in. Getting the write lock overlaps with the transaction commit though. So for your example if we say the leader is in Canada and the client is in Australia, and we're doing a write to row 'Foo' without first reading (a so called blind write):

Client (Australia) -> Leader (Canada): Take a write write lock on 'Foo' and try to commit transaction

Leader -> other replicas: Start commit, timestamp 123

Other replicas -> Leader: Ready for commit

Leader waits for majority ready for commit and for timestamp 123 to be before the current truetime interval

Leader -> Other replicas: Commit transaction, and in parallel replies to client.

Of course there are things you can do to mitigate this depending on your needs, but there's no free lunch. If you have a client in Australia and a client in Canada writing to the same data you're going to pay for round trips for transactions.

rockinghigh5y ago

The Spanner paper goes through this: http://static.googleusercontent.com/media/research.google.co...

Writes are implemented with pessimistic locking. A write in a Canada datacenter may have to acquire a lock in Australia as well. See page 12 of the paper for the F1 numbers (mean of 103.0ms with standard deviation of 50ms for multi-site commits).

throwaway_dcnt5y ago

Partitioning.

jz-amz5y ago· 3 in thread

Some related reading -

* Jeff Barr's blog post on this topic, has an interesting case study: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...

* Collaboration with Hadoop/S3A maintainers leading to this release: https://aws.amazon.com/blogs/opensource/community-collaborat...

yftsui5y ago

I am curious about how much Dropbox pays for data ingress/egress, they migrated storage to their own on premise data center, then now moving data back to S3 for the data lake.

Scaevolus5y ago

That's analytics data, not file storage. If they have 34PB of analytics, they certainly have far more than that of blob storage.

3 more replies

bboreham5y ago

Amazon has hard-disk products for large transfers. Snowcone, Snowball and Snowmobile.

mikaeluman5y ago· 3 in thread

I thought we had worked out in 2010-2012 that the CAP theorem, consistency, availability and partition tolerance (https://en.wikipedia.org/wiki/CAP_theorem#History) made this impossible?

Where is the mistake.

johncolanduoni5y ago

S3 is almost certainly not fully partition tolerant at the node level and requires some sort of quorum. Other “magical” data stores like Spanner also retain this limitation, they just have very reliable replication strategies.

pricechild5y ago

"almost certainly not fully partition tolerant at the node level"

Whoa hang on! You can't just say you're not tolerant of partitions...

I'm struggling to find the best post on aphyr.com about this but https://aphyr.com/posts/325-comments-on-you-do-it-to is a good one, specifically the line:

"CP and AP are upper bounds: systems can provide C or A during a partition, but might provide neither."

In short, Partitions happen, no matter what. No really, when you have multiple nodes they happen, you must tolerate them. You can't have CA.

1 more reply

chillydawg5y ago

I would guess write times are slightly up to cover the cost of the new replication completion guarantees.

QuinnyPig5y ago· 3 in thread

This was a problem with data lakes / analytics. Small inconsistencies would trash runs; that problem now goes away and removes a lot of janky half-fixes to work around the issue.

For most common use cases it's not really an issue.

FridgeSeal5y ago

Of course the real issue there is those tool treating S3 as something it's not (a filesystem) and building things on it with the expectation of certain guarantees that it explicitly didn't provide before now.

I dig the feature (strong consistency) but in some ways it just enables tools that were abusing it to just do so more easily.

Decker875y ago

Isn't that the point? To provide some useful functionality for others to build on (and pay for)?

staticassertion5y ago

Like RedShift? AWS uses S3 as a database, seems reasonable that their customers would.

1 more reply

yellow_lead5y ago· 3 in thread

What's up with all the new AWS features in the past few days? Coincidence or not?

kommissar5y ago

No coincidence. It’s reinvent season. https://reinvent.awsevents.com/

yellow_lead5y ago

Thanks, I was totally out of the loop there.

1 more reply

banana_giraffe5y ago

AWS Re:Invent is going on. It's virtual this year, but they're still using it to unveil a bunch of new stuff.

stubish5y ago· 2 in thread

Once libraries and tools start relying on this, it is going to make life interesting to the S3-compatible players. The API remains the same, but the behavior is quite different.

sleepydog5y ago

Google Cloud storage has behaved this way for years. According to other comments, so does Azure.

stubish5y ago

It is the smaller players I'm thinking of, which needed S3 compatibility for uptake. Openstack Swift, Backblaze, Ceph... I am not sure if any of these suddenly stopped being S3 compatible due to this announcement.

1 more reply

kleebeesh5y ago· 2 in thread

Fantastic. Any write-ups or descriptions for how they made it happen?

hoytschermerhrn5y ago

Seconded, I'd love to read a whitepaper if anyone is able to provide one. I'm assuming this operates on some variant of Google's TrueTime[0].

[0] https://cloud.google.com/spanner/docs/true-time-external-con...

foota5y ago

Not necessarily, I think Spanner's magic is most useful for bounded staleness reads and scaling reads through non-leader replicas. Otherwise I think you're looking at normal "NewSQL" guarantees.

hendry5y ago· 2 in thread

I wonder if this would make S3 a host for sqlite!

foxhop5y ago

Yup, I think it would.

sudeepj5y ago

I have started working on this actually (the vfs being developed in Rust). The design was complex but now this simplifies a lot.

The vfs will also have snapshotting capabilities i.e. you can have multiple versions of sqlite db.

1 more reply

marwan-nwh5y ago· 2 in thread

What about CAP theorem?

danenania5y ago

The CAP theorem doesn't preclude strong consistency. It just means that at a certain scale, you'll pay for it with extra latency.

marwan-nwh5y ago

And that latency means that some reads won't see the latest write, right? Do I understand this correctly?

1 more reply

scrollaway5y ago· 1 in thread

That's pretty incredible. With no price increase either. Using s3 as a makeshift database is possible in a lot more scenarios now.

I especially like using s3 as a db when dealing with ETLs, where the state of some data is stored in its key prefix. This means that the etl can stop / resume at any point with a single source of truth as the database.

The potential drawback is cost of course; moving (renaming) is free but copying is not. S3's biggest price pain is always its PUTs. In many ETLs this is usually a non-issue because you have to do those PUTs regardless, as you probably want the data to be saved at the various stages for future debugging and recovery.

foxhop5y ago

Yup me too! Checkout this state_doc Interface I wrote in Python for building a state machine where the state is stored in S3. I never bothered to solve for the edge case of inconsistent reads because my little state machine only ever measured in hours when it slept waiting for work and then minutes in-between waking up and doing work to verify restores.

With the news of S3's strong consistency my program is immediately safer to use for some set of defects, since the underlying datastore properties, S3 has made my codes naive/invalid assumptions, true.

https://github.com/remind101/dbsnap/blob/master/dbsnap_verif...

I should likely pull out the StateDoc class into it's own module.

kartayyar5y ago· 1 in thread

This is one reason I have been a big fan of Google Cloud Storage over AWS S3: at a past company AWS consistency was a huge pain, and GCS has had this for years.

https://cloud.google.com/blog/products/gcp/how-google-cloud-...

aparsons5y ago

Azure as well

ultimoo5y ago· 1 in thread

> S3 consistency is available at no additional cost and removes the need for additional third-party, services, and complex architecture.

I wonder whether any companies/businesses solely depended on offering eventual consistency workarounds that probably now need to pivot.

pas5y ago

The hadoop companies spent quite some bucks on developing S3Guard, and selling it.

https://hadoop.apache.org/docs/r3.0.3/hadoop-aws/tools/hadoo...

whateveracct5y ago· 1 in thread

This is really something, especially for small apps. S3 is now the easiest KV store I can imagine.

sudeepj5y ago

> small apps

Even the big ones. A lot of applications (e.g backup vendors) had to resort to use EBS volumes because of lack of read-after-update consistency.

The barrier to entry for lot of developers has been reduced.

My prediction: A lot of storage infra (FS, DB) with native support for S3 will get commoditised now.

miohtama5y ago· 1 in thread

Is Dropbox still using AWS? They had a testimonial on the page.

fcantournet5y ago

I think they use cloud providers in regions where they don't have physical footprints, to offer data locality guarantees.

But the testimonial only talks about Dropbox's data lake. So this is not about dropbox's main product storage, it's "just" their data lake for analytics. (which is apparently 32PB !)

The_rationalist5y ago· 1 in thread

Noob question: when to use S3 vs HDFS?

jamesblonde5y ago

For cost on AWS, use S3. AWS price local storage on VMs to make HDFS not competitive. If you want the HDFS API but S3 storage costs on AWS, use HopsFS-S3 - https://www.logicalclocks.com/blog/hopsfs-100x-times-faster-... (disclosure: work on HopsFS).

HDFS provides POSIX-like API, and now has atomic metadata operations that S3 doesn't (mv/rename, chown, chmod) and append for files.

iblaine5y ago

AWS Re:invent talked about improvements to S3 earlier today. You can see that talk here https://virtual.awsevents.com/media/1_usujusng

For any AWS EMR/hadoop users out there, this means the end of emrfs.

For the uninitiated, emrfs recognizes consistency problems but does not fix them. It'll throw an exception if there's a consistency problem, even give you some tools to fix the problem, but the tools may give false positives. The result is you've got to fix some consistency problems yourself, parse items out of the emrfs DynamoDB catalog, match them up to s3, then make adjustments where needed. It's an unnecessary chore.

It surprises me that this issue has not got more attention over the years and thankfully it'll be solved soon.

psanford5y ago

Neat! Would this make it possible in the future for S3 to support CAS like operations for PutObject? Something like supporting `if-match` and `if-none-match` headers on the etag of an object would be so useful.

jamesblonde5y ago

Listing and update consistency is great, but the ACID SQL-on-S3 frameworks (Apache Hudi, Dela Lake) also require atomic rename to support transactions.

HopsFS-S3 solves the ACID SQL-on-S3 problem discussed here on HN last week ( https://news.ycombinator.com/item?id=25149154 ) by adding a metadata layer over S3 and providing a POSIX-like HDFS API to clients. Operations like rename, mv, chown, chmod are metadata operations - rename is not copy and delete.

Disclosure: I work on HopsFS

gaul5y ago

This is great news for application developers! And probably hard work for the AWS implementors. I will try to update this over the weekend: https://github.com/gaul/are-we-consistent-yet

juliansimioniOP5y ago

Another, perhaps better link for the announcement: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...

ed25519FUUU5y ago

The most devious s3 consistency issue I encountered was that S3 bucket configs are actually stored in S3, and were not read-after-write consistent.

It was bad. In order to "update" a bucket config programmatically you need to read the entire existing config, make an update, and then PUT it back (overwriting what's there). The problem is when you went to read the config it's possibly 15 minutes old or more, and when you put it back you overwrite any changes that may not be consistent.

I had to work around the issue by storing a high-water mark in the config ID, and then poll an external system to find the expected high water mark before I knew I could safely read and update.

ChuckMcM5y ago

This is awesome, I could have used it for a project we did a while back.

I'm sure that somewhere there is a Product Manager at Amazon who talks to customers they find out are using Google Cloud or Azure and asks them "Why not use AWS?" and they mumble some feature of the other service that they need.

Sometime later, said service shows up as part of AWS. :-)

For me this is the best part of tech rivalries, whether they are in CPU performance/architecture, programming language features/tools, or network services. Pressure to improve.

throwaway_qcjal5y ago

Congratulations to the S3 team from a friendly cloud storage neighbor in SLU :-)

marius_k5y ago

I was naive to think that these type of consistencies were already in place.

intricatedetail5y ago

Is "strong" redundant? Something can be either consistent or not. Seems like developers asked marketing for a catchy name, but that looks comical.

gfody5y ago

we use s3 in a repo pattern pretty heavily and I had no idea they didn't have read-after-write consistency. I've noticed s3 will hold connections open by sending back blank lines and always assumed it was synchronizing for consistency.

amuraruhn5y ago

One capability left to implement now: make `move` operations in S3 O(1)

brodouevencode5y ago

In the early days of S3 this was a real problem. Over the years they've beefed up the backend so objects become became consistent faster, but this is a huge leap.

ris5y ago

If only there were a way of atomically setting a object tag instead of doing a read-modify-write of the whole tagset...

Aeolun5y ago

Oh, so this is why the images I uploaded from my chat app sometimes just didn’t show up from certain locations.

Glad it’s solved now :)

leowoo915y ago

Interesting, I was always versioning file names as they upload. That seems to be mostly for file updates rather.

totaldude875y ago

i already had enough "catching-up-with-latest-aws-tech" problem with amazon, though this is a welcome addition, with daily announcements like this, makes life tougher for a AWS architect who likes to be up to date!

just today proton was announced and now this! this is going to be a hectic month :|

finikytou5y ago

does it mean s3 could be used efficiently as a file system? Would that compete with FSX?

leakybucket5y ago

With this change, is there a way to implement atomic operations like put-if-absent?

2color5y ago

How does this compare to Cloudflare's durable objects?

david-cako5y ago

big news. this was a pretty big issue for a customer at AWS that was using S3 for their analytics platform.

eggie55y ago

does this mean the TF S3 plugin will stop cluttering my logs w/ s3 retries?

toolslive5y ago

atomic multipart uploads, here we come ;)

j / k navigate · click thread line to collapse

227 comments

156 comments · 48 top-level

WrtCdEvrydy5y ago· 18 in thread

Has anyone ever seen S3 behave eventually consistent? I have not seen a lot of eventual consistency in the real world but I wonder if I'm just working on the wrong problems?

macksd5y ago

banana_giraffe5y ago

The first one happened often enough to cause problems, the second one was a fairly rare event, but still had to be handled.

Time to revisit all that mess, I suppose.

allengeorge5y ago

Yes, we have. Specifically:

1. LIST 2. PUT 3. LIST

would trigger situations where (3) wouldn't include the object inserted in (2). This is well-known however.

juancampa5y ago

Also:

  1. HEAD key -> 404
  2. PUT key  -> 200
  3. GET key  -> 404 (what? But I just put it!)

This is commonly used for "upload file if it doesn't exist"

1 more reply

roseway45y ago

enw5y ago

1 more reply

brian_herman5y ago

Yeah, I never had that problem either. But I wonder how they are addressing the https://en.wikipedia.org/wiki/CAP_theorem

content_sesh5y ago

Sacrificing availability I reckon.

1 more reply

SpicyLemonZest5y ago

WrtCdEvrydy5y ago

Yeah, I got seconds but people were saying there were minutes of data... max I ever saw was 3 seconds on a 2gb file...

1 more reply

enw5y ago

All the time when using EMR with S3. Really frequent, and really annoying to mitigate.

rantwasp5y ago

yes. i believe newer region had read after write in a while but you needed to use the right endpoint

i remember a time when if you were using the us-standard “region” and were unlucky it could take 12 hours for your objects to become visible

TheDong5y ago

Yes.

I've observed eventual consistency with S3 prior to this change where it took on the order of several hundred milliseconds to observe an update of an existing key.

I've observed this as recently as 2 years ago.

johncolanduoni5y ago

There used to be a behavior with 404 caching that was pretty easy to trigger: if you race requesting an object and creating it, subsequent reads could continue to 404.

OJFord5y ago

Yes, it caused us some pain with multipart uploads that we sort of proxied, them being multipart from the client to us as well.

richwater5y ago

Yes. When you're writing terabytes of data files it was very obvious.

That's why manifest files became so popular.

dilatedmind5y ago

I think it was already strongly consistent within regions?

like if you had tried to read a non-existent key, then wrote to it, it might continue to appear to not exist for a minute?

TheDong5y ago

Their previous consistency was RAW, but not RAU (for most regions. I think us-east-1 was missing RAW consistency for quite a while).

After a write, you would _always_ be able to read the key you just wrote.

After an update, you could get a stale copy of the key if your subsequent read hit a different server.

1 more reply

k__5y ago· 15 in thread

It's interesting to read all these comments here that talk about the eventual consistency like it was some kind of bug.

deathanatos5y ago

The previous behavior (new keys weren't strongly consistent) were weird, and made a lot of patterns that AWS itself sort of guides you towards, broken.

jsjohnst5y ago

> then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this

0xbadcafebee5y ago

So the problem was that users don't know how S3 works?

howlgarnish5y ago

Are there any scenarios where you would want eventual consistency over strong consistency? (Assuming pricing, performance, replication etc are the same.)

acdha5y ago

nhumrich5y ago

At the benefit of added availability? absolutely. This is the entire premise of mongodb.

2 more replies

robotresearcher5y ago

No. A magical do-everything datastore would have strong consistency. Eventual consistency is something you might settle for given the performance cost trade offs.

To see this, consider that strong consistency has all the guarantees of eventual consistency, plus some more. And the additional guarantees might make your application much easier to write.

TheCoelacanth5y ago

Job security?

dboreham5y ago

Train tunnels

crazyjncsu5y ago

No.

CobrastanJorji5y ago

If someone uses your API wrong, that's their bug. If a huge number of people use your API wrong, that's your bug.

zwily5y ago

It wasn’t a bug, it was well-documented behavior. However, it was the root of many bugs by people building on top of S3 that did not take the eventual consistency into account.

I’ll bet this release is going to fix a lot of weird bugs in systems out there using S3.

k__5y ago

Guess the name simple storage service gave many people the idea that you could use it without a basic idea of distributed systems.

SilasX5y ago

I'm pulling my hair out because people are using eventual consistency to mean "eventual consistency with a noticeable lag before the consistency is achieved", which bothers the pedant in me.

benlivengood5y ago

1 more reply

damon_c5y ago· 10 in thread

The more I learn about S3 the more I feel the same awe I feel when looking at the Pyramids or Ankor Wat.

> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an S3 bucket.

dheera5y ago

rovr1385y ago

I can't answer this directly, but I thought I'd chime in with a possible alternative, or at least something else to test.

I'm using rclone with S3[1] (and others). There are commands[2] where you can use to just sync/copy/ files but it also has a mount option[3]. It also has caching[4] and other things that might help.

[1] https://rclone.org/s3/

[2] https://rclone.org/commands/

[3] https://rclone.org/commands/rclone_mount/

[4] https://rclone.org/cache/

gaul5y ago

banana_giraffe5y ago

Not sure if that's all of it, or even the majority of the slowness.

1 more reply

web0075y ago

rossmohax5y ago

In other word ratelimiting key is `${HTTP_METHOD}(${QURERY_PARAM_KEY}|${QUERY_PARAM_PREFIX})`

randallsquared5y ago

It shouldn't be one if you're sending 3500+ requests a second. It may take some time for partitioning to happen, though.

1 more reply

swyx5y ago

11 nines of reliability. I don't know anything else promising that (unclear if they actually reach it)

ignoramous5y ago

Route53 promises a 100% availability.

https://www.slideshare.net/AmazonWebServices/under-the-hood-...

https://aws.amazon.com/route53/sla/

darkr5y ago

11 nines of _durability_. Uptime SLA for s3 is 99.9% AFAICR

2 more replies

yodsanklai5y ago· 9 in thread

Could someone describe a few real-life scenarios where this is useful and noticeable?

Macha5y ago

You have a processing job which dumps a bunch of output files in a directory. A downstream job uses these files as input, sees a new directory, and pulls all the files in the new directory.

mullen5y ago

If this is an issue, you wait X amount of seconds after the file has been created and then start processing it. This would allow the file to be consistent before processing it.

1 more reply

PetahNZ5y ago

random56345y ago

ramraj075y ago

That sounds _very_ wrong, what type of throughput are we talking about here?

2 more replies

mcqueenjordan5y ago

Out of curiosity, did you have retries on the GETs?

1 more reply

HiJon895y ago

We switched to GCS for our Maven artifacts and the problem went away.

riking5y ago

To be clear: the "blowing up" would occur when a client observed the new maven-metadata.xml file, but old ("does-not-exist") records for the newly uploaded artifact, correct?

With this update, ordering the metadata update after the artifact upload means this failure is now impossible.

1 more reply

juliansimioniOP5y ago

One very common one is for situations where you might have a multi-step pipeline to process data

- step 1 generates/processes data, stores it in S3, overwriting the previous copy. triggers step 2 to run

- step 2 runs, fetches the data from s3 for its own processing. However, because only a few seconds have elapsed, step 2 fetches the old version of data from the S3 bucket

The Argo workflow tool (https://argoproj.github.io/) is one example of a tool that can suffer from this problem.

boulos5y ago· 6 in thread

Disclosure: I work on Google Cloud.

I'm now much more inclined to make my GCS FUSE rewrite in Rust also support S3 directly :).

gopalv5y ago

> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop

Put a breakpoint and the bug disappears, hit the rename and listing from the same JVM, the bug disappears.

Consistency issues are hell.

> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.

Yes!

boulos5y ago

If it's been a few years, it was likely before the migration to Spanner:

https://cloud.google.com/blog/products/gcp/how-google-cloud-...

We posted that blog in early 2018, but stated:

> Last year we migrated all of Cloud Storage metadata to Spanner, Google’s globally distributed and strongly consistent relational database.

So maybe pre-2017?

1 more reply

basicneo5y ago

> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop

Is this public?

ceocoder5y ago

Yes, https://github.com/GoogleCloudDataproc/hadoop-connectors

disclosure: I work at Google as well.

2 more replies

ofek5y ago

Is your rewrite available?

boulos5y ago

Not yet, it’s too crappy :). I only get to it every once in a while and it shows.

As an example, until last weekend, I would just hardcode an access token rather than doing an oauth dance. Luckily, tame-oauth [1] from the folks at Embark was reasonably easy to integrate with.

I also got a little depressed that zargony/rust-fuse was stuck on an ancient version until I learned that Chris Berner from OpenAI had forked it earlier this year to modernize / update it [2].

[1] https://github.com/EmbarkStudios/tame-oauth

[2] https://github.com/cberner/fuser

1 more reply

mholt5y ago· 5 in thread

Would this enable atomic operations with S3 (e.g. "put object only if it does not exist")?

leakybucket5y ago

With the consistency change, those might be useful as the basis for atomic operations.

OJFord5y ago

Could you not already do that by specifying ETag?

rcaught5y ago

No, because the object may not be there (but already uploaded) with eventual consistency.

1 more reply

scarface745y ago

ryanworl5y ago

I also want the answer to this, but to the best of my knowledge it does not.

cash225y ago· 5 in thread

How do they accomplish strong consistency without any hit to availability?

YZF5y ago

While I don't know the specifics I can say that availability is down to engineering practices.

This is just another example of why the CAP theorem isn't really as useful to determining limitations of practical systems as it may seem at first site.

beoberha5y ago

throw935y ago

1 more reply

valenterry5y ago

> (of their metadata database)

With that you have just moved the question to: how do they ensure that the metadata database is available _and_ strongly consistent at the same time for all the requests?

3 more replies

valenterry5y ago

Good question. There will be some drawback from having strong consistency indeed.

tutfbhuf5y ago· 4 in thread

https://i.ibb.co/DtxrRH3/eventual-consistency.png

pachico5y ago

You just made my day :)

rafaelturk5y ago

I shall print this..

freemint5y ago

Please support the original artist and make sure you have a license to print it before you do.I

4 more replies

pantulis5y ago

Unbelievably good!

bob10295y ago· 4 in thread

0xbadcafebee5y ago

> they had downloaded a cached/older version of the software we had uploaded to the same URL

This is a great example (to me) of how weak consistency is useful. It exposed a problem you otherwise wouldn't notice.

If you design for weak consistency, your system ends up more resilient to multiple problems.

benlivengood5y ago

> I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication.

What if your application crashes in the middle of waiting?

bob10295y ago

The application crashing while awaiting durability is equivalent to not awaiting the assurance mechanism in the first place.

scrollaway5y ago

It would be a cool interface, I would love to see that.

jchw5y ago· 3 in thread

[1]: https://docs.microsoft.com/en-us/azure/storage/common/storag...

[2]: https://cloud.google.com/storage/docs/consistency

joana0355y ago

And Minio too: https://docs.min.io/docs/distributed-minio-quickstart-guide....

eloff5y ago

And Wasabi as well: https://wasabi-support.zendesk.com/hc/en-us/articles/1150016...

y4m4b45y ago

Wasabi is eventually consistent for overwrites ``eventual consistency for overwrite HTTP PUTs and HTTP DELETEs``

This is not the same.

1 more reply

tutfbhuf5y ago· 3 in thread

Ask HN: I have a question regarding strong consistency. The question is related to all geo distributed strong consistent storages, but let me pick Google Spanner as example.

foota5y ago

Client (Australia) -> Leader (Canada): Take a write write lock on 'Foo' and try to commit transaction

Leader -> other replicas: Start commit, timestamp 123

Other replicas -> Leader: Ready for commit

Leader waits for majority ready for commit and for timestamp 123 to be before the current truetime interval

Leader -> Other replicas: Commit transaction, and in parallel replies to client.

rockinghigh5y ago

The Spanner paper goes through this: http://static.googleusercontent.com/media/research.google.co...

throwaway_dcnt5y ago

Partitioning.

jz-amz5y ago· 3 in thread

Some related reading -

* Jeff Barr's blog post on this topic, has an interesting case study: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...

* Collaboration with Hadoop/S3A maintainers leading to this release: https://aws.amazon.com/blogs/opensource/community-collaborat...

yftsui5y ago

I am curious about how much Dropbox pays for data ingress/egress, they migrated storage to their own on premise data center, then now moving data back to S3 for the data lake.

Scaevolus5y ago

That's analytics data, not file storage. If they have 34PB of analytics, they certainly have far more than that of blob storage.

3 more replies

bboreham5y ago

Amazon has hard-disk products for large transfers. Snowcone, Snowball and Snowmobile.

mikaeluman5y ago· 3 in thread

I thought we had worked out in 2010-2012 that the CAP theorem, consistency, availability and partition tolerance (https://en.wikipedia.org/wiki/CAP_theorem#History) made this impossible?

Where is the mistake.

johncolanduoni5y ago

pricechild5y ago

"almost certainly not fully partition tolerant at the node level"

Whoa hang on! You can't just say you're not tolerant of partitions...

I'm struggling to find the best post on aphyr.com about this but https://aphyr.com/posts/325-comments-on-you-do-it-to is a good one, specifically the line:

"CP and AP are upper bounds: systems can provide C or A during a partition, but might provide neither."

In short, Partitions happen, no matter what. No really, when you have multiple nodes they happen, you must tolerate them. You can't have CA.

1 more reply

chillydawg5y ago

I would guess write times are slightly up to cover the cost of the new replication completion guarantees.

QuinnyPig5y ago· 3 in thread

This was a problem with data lakes / analytics. Small inconsistencies would trash runs; that problem now goes away and removes a lot of janky half-fixes to work around the issue.

For most common use cases it's not really an issue.

FridgeSeal5y ago

I dig the feature (strong consistency) but in some ways it just enables tools that were abusing it to just do so more easily.

Decker875y ago

Isn't that the point? To provide some useful functionality for others to build on (and pay for)?

staticassertion5y ago

Like RedShift? AWS uses S3 as a database, seems reasonable that their customers would.

1 more reply

yellow_lead5y ago· 3 in thread

What's up with all the new AWS features in the past few days? Coincidence or not?

kommissar5y ago

No coincidence. It’s reinvent season. https://reinvent.awsevents.com/

yellow_lead5y ago

Thanks, I was totally out of the loop there.

1 more reply

banana_giraffe5y ago

AWS Re:Invent is going on. It's virtual this year, but they're still using it to unveil a bunch of new stuff.

stubish5y ago· 2 in thread

Once libraries and tools start relying on this, it is going to make life interesting to the S3-compatible players. The API remains the same, but the behavior is quite different.

sleepydog5y ago

Google Cloud storage has behaved this way for years. According to other comments, so does Azure.

stubish5y ago

1 more reply

kleebeesh5y ago· 2 in thread

Fantastic. Any write-ups or descriptions for how they made it happen?

hoytschermerhrn5y ago

Seconded, I'd love to read a whitepaper if anyone is able to provide one. I'm assuming this operates on some variant of Google's TrueTime[0].

[0] https://cloud.google.com/spanner/docs/true-time-external-con...

foota5y ago

Not necessarily, I think Spanner's magic is most useful for bounded staleness reads and scaling reads through non-leader replicas. Otherwise I think you're looking at normal "NewSQL" guarantees.

hendry5y ago· 2 in thread

I wonder if this would make S3 a host for sqlite!

foxhop5y ago

Yup, I think it would.

sudeepj5y ago

I have started working on this actually (the vfs being developed in Rust). The design was complex but now this simplifies a lot.

The vfs will also have snapshotting capabilities i.e. you can have multiple versions of sqlite db.

1 more reply

marwan-nwh5y ago· 2 in thread

What about CAP theorem?

danenania5y ago

The CAP theorem doesn't preclude strong consistency. It just means that at a certain scale, you'll pay for it with extra latency.

marwan-nwh5y ago

And that latency means that some reads won't see the latest write, right? Do I understand this correctly?

1 more reply

scrollaway5y ago· 1 in thread

That's pretty incredible. With no price increase either. Using s3 as a makeshift database is possible in a lot more scenarios now.

foxhop5y ago

https://github.com/remind101/dbsnap/blob/master/dbsnap_verif...

I should likely pull out the StateDoc class into it's own module.

kartayyar5y ago· 1 in thread

This is one reason I have been a big fan of Google Cloud Storage over AWS S3: at a past company AWS consistency was a huge pain, and GCS has had this for years.

https://cloud.google.com/blog/products/gcp/how-google-cloud-...

aparsons5y ago

Azure as well

ultimoo5y ago· 1 in thread

> S3 consistency is available at no additional cost and removes the need for additional third-party, services, and complex architecture.

I wonder whether any companies/businesses solely depended on offering eventual consistency workarounds that probably now need to pivot.

pas5y ago

The hadoop companies spent quite some bucks on developing S3Guard, and selling it.

https://hadoop.apache.org/docs/r3.0.3/hadoop-aws/tools/hadoo...

whateveracct5y ago· 1 in thread

This is really something, especially for small apps. S3 is now the easiest KV store I can imagine.

sudeepj5y ago

> small apps

Even the big ones. A lot of applications (e.g backup vendors) had to resort to use EBS volumes because of lack of read-after-update consistency.

The barrier to entry for lot of developers has been reduced.

My prediction: A lot of storage infra (FS, DB) with native support for S3 will get commoditised now.

miohtama5y ago· 1 in thread

Is Dropbox still using AWS? They had a testimonial on the page.

fcantournet5y ago

I think they use cloud providers in regions where they don't have physical footprints, to offer data locality guarantees.

But the testimonial only talks about Dropbox's data lake. So this is not about dropbox's main product storage, it's "just" their data lake for analytics. (which is apparently 32PB !)

The_rationalist5y ago· 1 in thread

Noob question: when to use S3 vs HDFS?

jamesblonde5y ago

HDFS provides POSIX-like API, and now has atomic metadata operations that S3 doesn't (mv/rename, chown, chmod) and append for files.

iblaine5y ago

AWS Re:invent talked about improvements to S3 earlier today. You can see that talk here https://virtual.awsevents.com/media/1_usujusng

For any AWS EMR/hadoop users out there, this means the end of emrfs.

It surprises me that this issue has not got more attention over the years and thankfully it'll be solved soon.

psanford5y ago

jamesblonde5y ago

Listing and update consistency is great, but the ACID SQL-on-S3 frameworks (Apache Hudi, Dela Lake) also require atomic rename to support transactions.

Disclosure: I work on HopsFS

gaul5y ago

This is great news for application developers! And probably hard work for the AWS implementors. I will try to update this over the weekend: https://github.com/gaul/are-we-consistent-yet

juliansimioniOP5y ago

Another, perhaps better link for the announcement: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...

ed25519FUUU5y ago

The most devious s3 consistency issue I encountered was that S3 bucket configs are actually stored in S3, and were not read-after-write consistent.

I had to work around the issue by storing a high-water mark in the config ID, and then poll an external system to find the expected high water mark before I knew I could safely read and update.

ChuckMcM5y ago

This is awesome, I could have used it for a project we did a while back.

Sometime later, said service shows up as part of AWS. :-)

For me this is the best part of tech rivalries, whether they are in CPU performance/architecture, programming language features/tools, or network services. Pressure to improve.

throwaway_qcjal5y ago

Congratulations to the S3 team from a friendly cloud storage neighbor in SLU :-)

marius_k5y ago

I was naive to think that these type of consistencies were already in place.

intricatedetail5y ago

Is "strong" redundant? Something can be either consistent or not. Seems like developers asked marketing for a catchy name, but that looks comical.

gfody5y ago

amuraruhn5y ago

One capability left to implement now: make `move` operations in S3 O(1)

brodouevencode5y ago

In the early days of S3 this was a real problem. Over the years they've beefed up the backend so objects become became consistent faster, but this is a huge leap.

ris5y ago

If only there were a way of atomically setting a object tag instead of doing a read-modify-write of the whole tagset...

Aeolun5y ago

Oh, so this is why the images I uploaded from my chat app sometimes just didn’t show up from certain locations.

Glad it’s solved now :)

leowoo915y ago

Interesting, I was always versioning file names as they upload. That seems to be mostly for file updates rather.

totaldude875y ago

just today proton was announced and now this! this is going to be a hectic month :|

finikytou5y ago

does it mean s3 could be used efficiently as a file system? Would that compete with FSX?

leakybucket5y ago

With this change, is there a way to implement atomic operations like put-if-absent?

2color5y ago

How does this compare to Cloudflare's durable objects?

david-cako5y ago

big news. this was a pretty big issue for a customer at AWS that was using S3 for their analytics platform.

eggie55y ago

does this mean the TF S3 plugin will stop cluttering my logs w/ s3 retries?

toolslive5y ago

atomic multipart uploads, here we come ;)

j / k navigate · click thread line to collapse