To be fair to S3, Hadoop pretending S3 is a hierarchical filesystem is a bit of a hack. But I had cases where new objects wouldn't be listed for hours. There's only so much you can do to design an application around that, especially when Azure and Google Storage don't have that problem.
The first one happened often enough to cause problems, the second one was a fairly rare event, but still had to be handled.
It might have been more of an issue with "large" buckets. The bucket in particular where I had to dance around the missing object issue had ~100 million fairly small objects. I ended up having to create a database to track if objects exist so no consumer would ever try to see if a non-existent object existed.
Time to revisit all that mess, I suppose.
1. LIST 2. PUT 3. LIST
would trigger situations where (3) wouldn't include the object inserted in (2). This is well-known however.
1. HEAD key -> 404
2. PUT key -> 200
3. GET key -> 404 (what? But I just put it!)
This is commonly used for "upload file if it doesn't exist"i remember a time when if you were using the us-standard “region” and were unlucky it could take 12 hours for your objects to become visible
I've observed eventual consistency with S3 prior to this change where it took on the order of several hundred milliseconds to observe an update of an existing key.
I've observed this as recently as 2 years ago.
That's why manifest files became so popular.
like if you had tried to read a non-existent key, then wrote to it, it might continue to appear to not exist for a minute?
After a write, you would _always_ be able to read the key you just wrote.
After an update, you could get a stale copy of the key if your subsequent read hit a different server.
For example, you can set up an SQS queue to be informed of writes into a bucket. The idea being that some component writes into the bucket, and a consumer then fetches the data & processes it. Except — gotcha! — there wasn't a guarantee of consistency. It would usually work, though, and so you'd get these weird failures that you couldn't explain unless you already knew the answer, really.
Same thing with how S3 object names sort of feel like file names, and that you might use them for something useful, and you design around that, and then you hit the section of the docs that says "no, you shouldn't, they should be uniformly balanced". (Though honestly, I've never felt any negative side-effects from ignoring this, unlike the consistency guarantees.)
You likely won’t, unless you hit a bucket size that has at least a billion small files inside. If you get into the trillion and above range and are heavily unbalanced due to a recent deploy of yours, ooops, there goes S3. Or it did in 2013 anyway.
To see this, consider that strong consistency has all the guarantees of eventual consistency, plus some more. And the additional guarantees might make your application much easier to write.
I’ll bet this release is going to fix a lot of weird bugs in systems out there using S3.
I mean, technically, strong consistency implies eventual consistency (with lag = 0). But everyone's equating eventual consistency with the noticeable lag itself, implying that EC per se is a bad thing.
For an analogy, it would be like if people were talking about how rectangular cakes suck, because <problems of unequal width vs length>, and thus they use square cakes, but ... square cakes are rectangular too.
(What's going on is that people use <larger set> to mean <larger set, excluding members that qualify for smaller set>. <larger set> = eventual consistency, rectangle; <smaller set> = strong consistency, square.)
Obviously, it's not a problem for communication because everyone implicitly understands what's going on and uses the words the same way, but I wish people spoke in terms of "has anyone actually seen a consistency lag?" rather than "has anyone actually seen eventual consistency?", since the latter is not the right way to frame it, and is actually a good thing to have, which happened both before and after this development.
EDIT: I think the reason people care about the lag is specifically for this case; if you want to know if your eventually consistent write actually succeeded then you need to know how long to wait to check.
> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in an S3 bucket.
I'm using rclone with S3[1] (and others). There are commands[2] where you can use to just sync/copy/ files but it also has a mount option[3]. It also has caching[4] and other things that might help.
[2] https://rclone.org/commands/
Not sure if that's all of it, or even the majority of the slowness.
In other word ratelimiting key is `${HTTP_METHOD}(${QURERY_PARAM_KEY}|${QUERY_PARAM_PREFIX})`
https://www.slideshare.net/AmazonWebServices/under-the-hood-...
Because s3 was not strongly consistent, you would have the downstream job see a arbitrary subset of the files for a short while after creation, and not just the oldest files. This could cause your job to skip processing input files unless you provided some sort of manifest of all the files it would expect in that batch. So then you'd have to load the manifest, then keep retrying until all the input files showed up in whatever s3 node you were hitting.
A better idea would be triggering Lambda jobs which either directly processes the files as they are added to S3 or trigger Lambda jobs which add the files to SQS and each job in the SQS is processed by another Lambda job.
We switched to GCS for our Maven artifacts and the problem went away.
With this update, ordering the metadata update after the artifact upload means this failure is now impossible.
- step 1 generates/processes data, stores it in S3, overwriting the previous copy. triggers step 2 to run
- step 2 runs, fetches the data from s3 for its own processing. However, because only a few seconds have elapsed, step 2 fetches the old version of data from the S3 bucket
You can work around this by, for example, always using unique S3 object keys, but then you have to coordinate across the data processing steps, and it becomes harder to manage things like storing only the 10 latest versions.
The Argo workflow tool (https://argoproj.github.io/) is one example of a tool that can suffer from this problem.
This is super awesome for customers. I am also beyond excited for all the open-source connectors to finally be simplified so that they don't have to deal with the "oh right, gotta be careful because of list consistency". It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
I'm now much more inclined to make my GCS FUSE rewrite in Rust also support S3 directly :).
It's been a few years, but I lost customer data on gs:// due to listing consistency - create a bunch of files, list dir, rename them one by one to commit the dir into Hive - listing missed a file & the rename skipped the missing one.
Put a breakpoint and the bug disappears, hit the rename and listing from the same JVM, the bug disappears.
Consistency issues are hell.
> It was super awesome when we were able to delete a huge chunk of the GCS Connector for Hadoop, and I hope to see the same across the S3-focused connectors.
Yes!
S3guard was a complete pain to secure, since it was built on top of dynamodb and since any client could be a writer for some file (any file), you had to give all clients write access to the entire dynamodb table, which was quite a security headache.
https://cloud.google.com/blog/products/gcp/how-google-cloud-...
We posted that blog in early 2018, but stated:
> Last year we migrated all of Cloud Storage metadata to Spanner, Google’s globally distributed and strongly consistent relational database.
So maybe pre-2017?
Is this public?
disclosure: I work at Google as well.
As an example, until last weekend, I would just hardcode an access token rather than doing an oauth dance. Luckily, tame-oauth [1] from the folks at Embark was reasonably easy to integrate with.
I also got a little depressed that zargony/rust-fuse was stuck on an ancient version until I learned that Chris Berner from OpenAI had forked it earlier this year to modernize / update it [2].
With the consistency change, those might be useful as the basis for atomic operations.
Let's say that their consistency model is achieved via quorum, that is writes write to a quorum of nodes while reads read from a quorum of nodes (of their metadata database) then this guarantees read after write consistency.
The availability aspect of this is just engineering, making sure you're never down to less than a quorum of nodes. E.g. by having redundant power supplies, generators, networks, or whatever engineering it takes to reduce the probability of failure.
There's other aspects, e.g. latency, that may suffer, but again this is solved via engineering. Just throw more hardware at it to bring the latency down. The only time where you absolutely can't solve it is if you provide strong consistency across geographical regions that are far apart, there's no way then not to pay that latency.
This is just another example of why the CAP theorem isn't really as useful to determining limitations of practical systems as it may seem at first site.
If reads are reading only from a quorum of nodes how do you guarantee they have latest data? In theory, while a node is servicing a read request wouldn’t you need to query “all” other nodes to verify that the data in current node is latest? What if the quorum of nodes don’t have the latest data yet?
With that you have just moved the question to: how do they ensure that the metadata database is available _and_ strongly consistent at the same time for all the requests?
Because the metadata database is certainly also a distributed one, hence you need to query _all_ the nodes or can end up in a split-brain situation and lose consistency (or availability if you choose to down the system or a part of the system).
One potentially related item I was thinking about - How does HN feel about the idea of a system that has eventual durability guarantees which can be inspected by the caller? I.e. Call CloudService.WriteObject(). It writes the object to the local datacenter and asynchronously sends it to the remote(s). It returns some transaction id. You can then pass this id to CloudService.TransactionStatus() to determine if a Durable flag is set true. Or, have a convenience method like CloudService.AwaitDurabilityAsync(txnid). In the most disastrous of circumstances (asteroid hits datacenter 1ms after write returns to caller), you would get an exception on this await call after a locally-configured timeout in which case you can assume you should probably retry the write operation.
I was thinking this might be a way to give the application a way to decide how to deal with the concept of latency WRT cross-site replication. You could not await the durability for 4 nines or wait the additional 0-150 ms to see more nines. I wonder how this risk profile sits with people on the business side. I feel like having a window of reduced durability per object that is only ~150ms wide and up-front can be safely ignored. Especially considering the hypothetical ways in which you could almost instantaneously interrupt the subsequent activities/processing with the feedback mechanism proposed above.
This is a great example (to me) of how weak consistency is useful. It exposed a problem you otherwise wouldn't notice.
You were deploying offline artifacts to customers without giving the customer the version of the new artifact. Regardless of backend consistency, the customer will not be able to tell (from the URL, anyway) what version they are downloading. Nor will they be able to, say, download the old version if the new version introduces bugs. By changing your deployment method to use unique URLs per version, your customer gains useful functionality and you avoid having to depend on strong consistency (which actually removes expensive requirements).
If you design for weak consistency, your system ends up more resilient to multiple problems.
What if your application crashes in the middle of waiting?
Ultimately if you want full consistency you need it everywhere. Your workflow writing to consistent storage needs to be an atomic output of its own inputs with a scheduler that will re-run the processing if it hasn't committed its output durably.
Ways to recover from this could include simply scanning through entities modified previously to determine durability status. Other application nodes could participate in this effort if the original is not recoverable. An explicit transaction should be used in cases where partial completion is not permissible, as with any complex database operation.
This is a tricky way to think about the problem because you are still getting transactional semantics regardless. The only thing that would be in question is if your data can survive an asteroid impact. With this model, even with an application crashing, there is still an exceedingly high probability that any given object will survive into additional asynchronous destinations (assuming a transaction scope is implicit or externally completed).
I think the biggest problem in selling this concept is with the fatalists in the developer/business community who are unwilling to make compromises in their application domain logic in order to conform to our physical reality. From a risk modeling perspective, I feel like we have way more dangerous things to worry about on a daily basis than an RPO in which the latest ~150ms of business data at one site might be lost in an extremely catastrophic scenario. Trying to synchronously replicate data to 2+ sites for literally everything a business does is probably going to cause more harm than good in most shops. If you operate nuclear reactors or manage civilian airspace, then perhaps you need something that can bridge that gap. But, most aren't even remotely close to this level of required assurance.
For what it’s worth, consistency in S3 was usually pretty good anyways, but I ran into issues where it could vary a bit in the past. If you designed your application with this in mind, of course, it shouldn’t be an issue. In my case I believe we just had to add some retry logic. (And of course, that is no longer necessary.)
[1]: https://docs.microsoft.com/en-us/azure/storage/common/storag...
This is not the same.
Spanner has a maximum upper bound of 7 milliseconds clock offset between nodes, thx to TrueTime. To achieve external consistency, Spanner simply waits 7 ms before committing a transaction. After that amount of time the window of uncertainty is over and we can be sure, that every transaction happens to be in the right order. So far so good, but how does Spanner deal with cross data center network latency? Let's say I commit a transaction to Spanner in Canada and after 7 ms, I get my confirmation, but now in Australia someone also does a conflicting transaction and also get his confirmation after 7 ms. Spanner however, bound to the laws of physics, can only see this conflict after 100+ ms delay, due to network latency between the datacenters. How is that problem solved?
The simple answer is that there are round trips required between datacenters when you have a quorum that spans data centers. Additionally, one replica of a split is the leader, and the leader holds write locks. So already you have to talk to the leader replica, even if it's out of the DC you're in. Getting the write lock overlaps with the transaction commit though. So for your example if we say the leader is in Canada and the client is in Australia, and we're doing a write to row 'Foo' without first reading (a so called blind write):
Client (Australia) -> Leader (Canada): Take a write write lock on 'Foo' and try to commit transaction
Leader -> other replicas: Start commit, timestamp 123
Other replicas -> Leader: Ready for commit
Leader waits for majority ready for commit and for timestamp 123 to be before the current truetime interval
Leader -> Other replicas: Commit transaction, and in parallel replies to client.
Of course there are things you can do to mitigate this depending on your needs, but there's no free lunch. If you have a client in Australia and a client in Canada writing to the same data you're going to pay for round trips for transactions.
Writes are implemented with pessimistic locking. A write in a Canada datacenter may have to acquire a lock in Australia as well. See page 12 of the paper for the F1 numbers (mean of 103.0ms with standard deviation of 50ms for multi-site commits).
* Jeff Barr's blog post on this topic, has an interesting case study: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...
* Collaboration with Hadoop/S3A maintainers leading to this release: https://aws.amazon.com/blogs/opensource/community-collaborat...
Where is the mistake.
Whoa hang on! You can't just say you're not tolerant of partitions...
I'm struggling to find the best post on aphyr.com about this but https://aphyr.com/posts/325-comments-on-you-do-it-to is a good one, specifically the line:
"CP and AP are upper bounds: systems can provide C or A during a partition, but might provide neither."
In short, Partitions happen, no matter what. No really, when you have multiple nodes they happen, you must tolerate them. You can't have CA.
For most common use cases it's not really an issue.
I dig the feature (strong consistency) but in some ways it just enables tools that were abusing it to just do so more easily.
[0] https://cloud.google.com/spanner/docs/true-time-external-con...
I especially like using s3 as a db when dealing with ETLs, where the state of some data is stored in its key prefix. This means that the etl can stop / resume at any point with a single source of truth as the database.
The potential drawback is cost of course; moving (renaming) is free but copying is not. S3's biggest price pain is always its PUTs. In many ETLs this is usually a non-issue because you have to do those PUTs regardless, as you probably want the data to be saved at the various stages for future debugging and recovery.
With the news of S3's strong consistency my program is immediately safer to use for some set of defects, since the underlying datastore properties, S3 has made my codes naive/invalid assumptions, true.
https://github.com/remind101/dbsnap/blob/master/dbsnap_verif...
I should likely pull out the StateDoc class into it's own module.
https://cloud.google.com/blog/products/gcp/how-google-cloud-...
I wonder whether any companies/businesses solely depended on offering eventual consistency workarounds that probably now need to pivot.
https://hadoop.apache.org/docs/r3.0.3/hadoop-aws/tools/hadoo...
Even the big ones. A lot of applications (e.g backup vendors) had to resort to use EBS volumes because of lack of read-after-update consistency.
The barrier to entry for lot of developers has been reduced.
My prediction: A lot of storage infra (FS, DB) with native support for S3 will get commoditised now.
But the testimonial only talks about Dropbox's data lake. So this is not about dropbox's main product storage, it's "just" their data lake for analytics. (which is apparently 32PB !)
HDFS provides POSIX-like API, and now has atomic metadata operations that S3 doesn't (mv/rename, chown, chmod) and append for files.
For any AWS EMR/hadoop users out there, this means the end of emrfs.
For the uninitiated, emrfs recognizes consistency problems but does not fix them. It'll throw an exception if there's a consistency problem, even give you some tools to fix the problem, but the tools may give false positives. The result is you've got to fix some consistency problems yourself, parse items out of the emrfs DynamoDB catalog, match them up to s3, then make adjustments where needed. It's an unnecessary chore.
It surprises me that this issue has not got more attention over the years and thankfully it'll be solved soon.
HopsFS-S3 solves the ACID SQL-on-S3 problem discussed here on HN last week ( https://news.ycombinator.com/item?id=25149154 ) by adding a metadata layer over S3 and providing a POSIX-like HDFS API to clients. Operations like rename, mv, chown, chmod are metadata operations - rename is not copy and delete.
Disclosure: I work on HopsFS
It was bad. In order to "update" a bucket config programmatically you need to read the entire existing config, make an update, and then PUT it back (overwriting what's there). The problem is when you went to read the config it's possibly 15 minutes old or more, and when you put it back you overwrite any changes that may not be consistent.
I had to work around the issue by storing a high-water mark in the config ID, and then poll an external system to find the expected high water mark before I knew I could safely read and update.
I'm sure that somewhere there is a Product Manager at Amazon who talks to customers they find out are using Google Cloud or Azure and asks them "Why not use AWS?" and they mumble some feature of the other service that they need.
Sometime later, said service shows up as part of AWS. :-)
For me this is the best part of tech rivalries, whether they are in CPU performance/architecture, programming language features/tools, or network services. Pressure to improve.
Glad it’s solved now :)
just today proton was announced and now this! this is going to be a hectic month :|