We're in the AWS ecosystem, and the database offerings are really subpar. DynamoDB, which I originally expected to be somewhat comparable to MongoDB, is an incredibly frustrating (and expensive) product to use. AWS Data Pipeline is extremely confusing and very expensive as well.
AWS's offerings really lag behind Google's offerings (like BigQuery) in this space. Hopefully AWS can catch up because I'd rather not have requests bouncing between data centers.
Databases are not the area I would choose Google for.
But Datastore is actually the only generally available product I know of that solves all the hard problems (availability, partition tolerance, some, but well-defined, consistency with cross-entity transactions) with zero hassle for you.
If you read through http://aphyr.com/tags/Jepsen, you get some appreciation for how hard this is to pull off without running into operational nightmares (massive data loss, split brains, etc).
Disclaimer: I work for Google, though not on Datastore.
We're a SaaS company with lots of tiny customers and a few very large customers. We need to keep an index to show a specific customer only their data. That means the index for our largest customers gets hit a lot. The problem with this structure is we have to pay as if all of our customers were as popular as our biggest customers, or we get throttled. And even though the DynamoDB interface shows that you have provisioned 10x above your current usage, you still get throttled, because you're being throttled only in a single cluster.
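To make the throttling arithmetic concrete, here is a rough sketch. It assumes a simplified model in which provisioned throughput is split evenly across partitions (what the comment above calls a cluster); the partition count and traffic numbers are made up for illustration and are not measurements from the poster's tables.

```python
def throttled_per_partition(traffic, provisioned_units):
    """Return the units of traffic rejected on each partition.

    Assumption: every partition gets an equal share of the table's
    provisioned throughput, regardless of where the traffic lands.
    """
    cap = provisioned_units / len(traffic)
    return [max(0.0, used - cap) for used in traffic]


# Hypothetical table: 20 partitions, one of them holding the big customer.
traffic = [1.0] * 19 + [81.0]      # units/second actually requested
provisioned_units = 1000.0         # units/second paid for

total_usage = sum(traffic)                     # 100.0
headroom = provisioned_units / total_usage     # 10.0, i.e. "10x above usage"
throttled = throttled_per_partition(traffic, provisioned_units)

print(f"provisioned {headroom:.0f}x above usage")
print(f"hot partition throttled by {throttled[-1]:.0f} units/second")
```

Under these assumptions each partition is capped at 50 units/second, so the hot partition loses 31 units/second even though the table as a whole is using a tenth of what is provisioned.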
So, let's say you solve that problem, but now you need to drop the troublesome index on a billion+ row table. With DynamoDB you can't change a table's indexes, so you have to migrate your table to a new table. Doing that without downtime is an incredible challenge.
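For anyone wondering what a no-downtime table migration tends to involve, one common pattern is dual writes plus a background backfill. The sketch below is not what the poster did and uses plain dicts as stand-ins for the two tables; `MigratingStore` and its methods are hypothetical names. It also ignores deletes, concurrency, and retries, which is where most of the real difficulty lives.

```python
class MigratingStore:
    """Toy dual-write migration between an old and a new table (dicts here)."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.cutover = False  # flip once the backfill has finished

    def put(self, key, value):
        # During the migration every write goes to both tables.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        # Reads stay on the old table until cutover.
        source = self.new if self.cutover else self.old
        return source.get(key)

    def backfill(self, batch_size=2):
        # Copy rows in batches. setdefault() avoids clobbering a fresher
        # value that a dual write already placed in the new table.
        keys = list(self.old)
        for start in range(0, len(keys), batch_size):
            for key in keys[start:start + batch_size]:
                self.new.setdefault(key, self.old[key])


old_table = {"a": 1, "b": 2, "c": 3}
store = MigratingStore(old_table, {})
store.put("b", 20)     # a live write that arrives mid-migration
store.backfill()
store.cutover = True
print(store.get("b"))
```

With a billion-plus rows the backfill alone consumes a lot of read and write capacity on both tables, which is part of why this is painful in practice.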
Which reminds me of when they announced indexes. We were so excited, only to find out we couldn't add indexes to our existing tables, but instead had to recreate them all.
The whole point of SaaS is to make our lives easier, but with DynamoDB our lives were much more difficult than just using Mongo.
Anyway, I need to do a blog post on this -- it's a bit too complicated for a HN comment. :-)
http://aws.amazon.com/datapipeline/
Simple Workflow is new to me, so thanks for putting it on my radar!
We leverage Datadog's API primarily for eventing. For example, deployment notifications are posted to Datadog, where they overlay our metrics. This has proven very useful in tracking changes due to deploys and/or configuration changes.
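As a rough illustration of posting a deployment notification as an event, here is a minimal sketch against Datadog's v1 events endpoint using only the standard library. The service name, version, tag names, and the `build_deploy_event` helper are assumptions for the example, not the poster's actual setup; the request is built but not sent.

```python
import json
import urllib.request

DATADOG_EVENTS_URL = "https://api.datadoghq.com/api/v1/events"


def build_deploy_event(api_key, service, version):
    """Build (but do not send) a POST request describing a deployment."""
    payload = {
        "title": f"Deployed {service} {version}",
        "text": f"{service} was deployed at version {version}.",
        # Tags are illustrative; pick whatever your dashboards filter on.
        "tags": [f"service:{service}", f"version:{version}", "event_type:deploy"],
        "alert_type": "info",
    }
    return urllib.request.Request(
        DATADOG_EVENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


request = build_deploy_event("<your-api-key>", "billing-api", "1.4.2")
# To actually send it from a deploy script:
# urllib.request.urlopen(request)
```

Calling this at the end of a deploy script would be one way to get the events to overlay on metric graphs.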
While we do leverage the Datadog Agent for standard and custom metrics, Datadog's ability to put together dashboards (and alerting) for AWS without any modifications to the host is what really closed the deal for us.
It was basically a monitoring/metrics system to merge how I handle the monitoring of crons, work queues, system metrics, analytics, etc. into a single service. Right now, I'm stuck using three.
Sure, I could just build something to merge it together ... but at that point, I'm halfway to building my own.
k.z
I'm in a rather awkward phase where my data is small enough that I don't need "Scale to 1000 machines!" I just want one or a few machines occasionally, but managed for me (turn on, run code, shut off). Tutum works very well for this, but I'd like to use more of the ecosystem available at Google or AWS (pay-per-usage data storage, for example). GCE is pretty decent, but a bit awkward, although the new Docker support helps (but I've had problems even getting it working).
Maybe this is my magic bullet :)
http://star.mit.edu/cluster/ http://www.youtube.com/watch?v=2Ym7epCYnSk