I don't know much about GAE, but a datastore-as-a-service that takes 2 weeks to delete your data and charges $300 a day to do so just seems... absurd.
http://code.google.com/intl/it-IT/appengine/casestudies.html
Contrast with AWS:
I've worked with two clients that have extremely successful websites running on top of AppEngine and are happy with it, but the costs mentioned in the article are absolutely true and something that they chose to bite the bullet on for a one time transition. I can't imagine what it would cost to get all of the data out to do a migration.
Without seeing the code, it's hard to tell what's going on. I would suspect cascading ungrouped datastore puts and their index updates, but, from what's on the list, I can't even make an informed guess.
http://googleappengine.blogspot.com/2010/06/how-app-engine-s...
The point is, calling an infrastructure stack as capable as App Engine unfit for serious applications is a gross misunderstanding of what App Engine can do. You should first give it a try and develop an app, and then make a statement.
I am about to shut down one application that declined in popularity, because it costs me $20 / week to run it and revenue just dropped under $20 / week. The cost is not from instance hours, but purely from the stored data. Deleting the data from the data store would cost more than I could recoup, so that is not an option either.
Also I really feel frustrated giving hours of thought to something that should be a really simple operation. Perhaps .delete() should be free? After all, when I shut down the app, Google does delete everything for free.
In my experience, App Engine is mainly useful for in-house infrastructure apps for companies using the Google Apps platform. That or cheapskate developers throwing together a proof of concept / toy app in their spare time.
But this bit stood out: $0.10 per 100k writes. That price seems to be far too high. The poster is doing (something like) a reindex of 10M entries (that kind of data is pretty small really: it's the kind of database you might use as a test set on your laptop interactively). Figure each modification is atomic, and that the b-tree height of the storage is ~4. So that's 40M writes to create an index, or $400!
Seriously? Again, this is the kind of task you'd expect to do quickly and interactively on your development box, and it costs a price of the same order as your day's salary (!) to execute in the cloud?
Looking at this from the perspective of the underlying I/O device: this index consumes just a tiny, tiny fraction of a hard disk drive's capacity. Yet creating it costs enough to buy the device several times over?
Something is wrong. Is that a misquote or have I misunderstood?
App Engine pricing might seem expensive if you try to do a simple table comparison with alternatives, but when you get more deeply into it you'll find that a lot of stuff that is included in the service with GAE will cost you extra when you use the alternatives.
The only problem here is that "delete" is considered a write and when you want to delete data you just cannot accept the fact that you need to pay for something you do not want to keep. I think GAE should definitely look into this aspect and try to get some cheaper alternatives for data deletion.
Disclaimer: I work at Microsoft and am required by the terms of my employment to believe in "the cloud".
Correct. Of course, those are object writes -- Elastic Block Store disk I/O is 10x cheaper.
At $0.10 per 100k writes, 40M would be $40.
Also, reads, writes and small operations (which are the ones billed) are low level operations. An API operation actually translates into several low-level operations. And the way it is described [2] I think the poster is doing more writes to reindex 10M entries.
Considering that reindexing takes 1 write for the entity itself (existing put) + 4 writes for each element in the list property and considering that the poster has on average 18 elements in that list for each entity, then he's probably doing on average 73 writes per entity (I'm taking the "Existing Entity Put" scenario into account, otherwise for new entities it would be 2 + 2 per list element == 38 writes).
So by these numbers, that's 730,000,000 writes, or a cost of $730 -- if you go over them sequentially, only one time. But considering that he's doing manual full-text indexing, maybe he had to go over those items several times for the reindexing being done.
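The arithmetic above can be checked with a short sketch. It assumes only the figures quoted in this thread ($0.10 per 100k writes, 1 write for the entity plus 4 per list element on an existing-entity put); the function names are made up for illustration and the real billing rules may differ.

```python
# Price quoted in the thread: $0.10 per 100,000 low-level writes.
PRICE_PER_WRITE = 0.10 / 100000

def writes_per_existing_put(list_elements, per_element=4, base=1):
    """Low-level writes for re-putting one entity, using the estimate above."""
    return base + per_element * list_elements

def write_cost(total_writes):
    """Dollar cost of a number of low-level writes at the quoted price."""
    return total_writes * PRICE_PER_WRITE

per_entity = writes_per_existing_put(18)   # 1 + 4 * 18 = 73
total = per_entity * 10 ** 7               # 10M entities, a single pass
print(per_entity, total, round(write_cost(total), 2))

# The earlier b-tree estimate of 40M writes, at the same price.
print(round(write_cost(40 * 10 ** 6), 2))
```

Under these assumptions a single pass is on the order of hundreds of dollars, and every extra pass over the data multiplies it.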
Maybe I'm missing something here. I don't know.
[1] http://code.google.com/appengine/docs/billing.html
[2] http://code.google.com/appengine/docs/python/datastore/entit...
I don't have the time or resources to move the site, so I'm forced to shut it down. It really, really sucks.
Personally I'm more disappointed by the lack of notice (1 month is nowhere near enough time) than the actual increase. I totally understand the need to charge.
I spent some time tuning SharedCount's API, which would have cost me $30-$50/day, and it's now at about $1-$2/day.
- Move to Python 2.7 and enable multithreading
- Set up Cloudflare (this swallows about half of all my requests)
- Increase minimum latency and reduce the maximum number of idle instances. (I have 5-8s and 1-2 set, respectively)
- Set up the semi-undocumented Google edge cache (basically, just a Cache-Control: public, max-age=[seconds] header).
- Take advantage of memcache.
With this setup, I'm doing 3 million API calls per day at $2.
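For the Cache-Control item in the list above, here is a minimal sketch of what setting that header looks like in a plain WSGI handler. It uses only the standard library; `make_cached_app`, the JSON body and the 600-second lifetime are placeholders for illustration, not SharedCount's actual code, and whether a given edge cache honors the header is up to that cache.

```python
from wsgiref.util import setup_testing_defaults

def make_cached_app(body, max_age_seconds=300):
    """Return a WSGI app whose responses are marked publicly cacheable."""
    def app(environ, start_response):
        headers = [
            ("Content-Type", "application/json"),
            # "public" allows shared caches (a CDN or an edge cache in
            # front of the app) to reuse the response until max-age expires.
            ("Cache-Control", "public, max-age=%d" % max_age_seconds),
        ]
        start_response("200 OK", headers)
        return [body]
    return app

# Exercise the app once without starting a server.
captured = {}

def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

environ = {}
setup_testing_defaults(environ)
app = make_cached_app(b'{"count": 42}', max_age_seconds=600)
response = b"".join(app(environ, start_response))
print(captured["headers"]["Cache-Control"])
```

Every request answered from a cache in front of the app is one that never reaches an instance, which is where the savings come from.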
Also, high replication queries can return stale results unless you use ancestor queries. Ancestor queries require putting entities in groups by giving them all the same parent (which can never be changed). Basically it's a very inflexible semaphore and kind of sucks IMO.
Your suggestions in general are very good though. Thanks, I'm switching my DNS to CloudFlare now.
Honestly it has more to do with how much time I want to spend on the site, how much it returns, and whether or not I should spend my nights and weekends transferring it to another host.
The last time I backed up all the data from GAE it took 4 days to download all of it to a VPS. 4 days to download all of it. Migrating means 4 days of downtime, or alternatively some complex solution involving posting all new data to BOTH places while the migration takes place.
That takes time and energy, and quite frankly I'd rather see someone with the resources do it right instead of trying to hack it.
Unfortunately, AppEngine isn't forgiving of that and there is a real monetary value associated with questionable engineering design. Or, design that wasn't thought through enough in the context of a service like AppEngine.
This leads to a few people getting upset and making a lot of noise when the reality is that AppEngine is actually an amazing service.
So, to boil down the operator error from a quote in the thread:
"We're running a mapreduce to change the geobox sizes/precision for a large number of entities."
That is the real source of the problem. Instead of using geoboxes, they should be using geohashes, which allow arbitrary precision.
http://code.google.com/apis/maps/articles/geospatial.html http://en.wikipedia.org/wiki/Geohash
Instead of an indexed property that looks like this (what they currently have):
[u'37.3411|-121.8940|37.3395|-121.8926', u'37.3411|-121.8929|37.3395|-121.8916', ...]
They would have an indexed List<String> property that looks like this:
[8, 8f, 8f1, 8f12, 8f12a, 8f12ac, 8f12ac6, 8f12ac60, 8f12ac605, 8f12ac605f, 8f12ac605fb, 8f12ac605fb3, 8f12ac605fb34]
Finding if the location is in a box would be computing the hash from the lat/lng (there is free code out there to do that) and then doing an indexed 'in' query. The indexes would only need to be updated if the location of the entity changes, not when they want varying levels of precision.
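A minimal sketch of the prefix idea, assuming the standard base-32 geohash encoding (so the characters will differ from the illustrative list above). The function names are hypothetical, and the in-memory membership test only stands in for the indexed datastore query.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash_encode(lat, lng, precision=12):
    """Encode a lat/lng pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits, lng_lo = (bits << 1) | 1, mid
            else:
                bits, lng_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

def geohash_prefixes(lat, lng, precision=12):
    """All prefixes of the geohash: the list that would be stored and indexed."""
    h = geohash_encode(lat, lng, precision)
    return [h[:i] for i in range(1, len(h) + 1)]

def in_cell(stored_prefixes, cell):
    """Stand-in for the indexed lookup: is the entity inside this cell?"""
    return cell in stored_prefixes

stored = geohash_prefixes(57.64911, 10.40744, precision=8)
query_cell = geohash_encode(57.65, 10.41, precision=3)
print(stored)
print(in_cell(stored, query_cell))
```

Changing the query precision only changes the length of `query_cell`; the stored list is rewritten only when the entity moves.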
First off, they mention that when the initial design decision was made, a similar operation cost ~$160, which is tenable for an operation that only happens once in a while. This is in fact a case of them getting bitten by the pricing structure changing after a reasonable design decision (at the time) was implemented.
Secondly, they mention that this is part of a larger issue: "In our most common case we might have to add and delete a couple items to the list property every once in a while. That would still cost us well over $1,000 each time. Most of the reasons for this type of data in our product is to compensate for the fact that there isn't full text search yet. I know they are beta testing full text, but I'm still worried that that also might be too expensive per write."
This is a real problem that GAE needs to solve.
Finally, their problem doesn't seem to be that they need arbitrary precision, it's that they seem to need fast location-centric queries of a large database.
Geoboxes allow you to solve this problem correctly (and quickly), returning the results in the database that are closest to you. Matching on a geohash can end up serving the incorrect data unless you resort to hacks involving a number of queries.
2) They seem to have an extreme use case. No one is going to argue that maybe AppEngine doesn't fit the bill for them. Or, one could argue that doing 6.5 billion writes times a large number of customers, across multiple datacenters is something that a lot of databases would choke on.
3) Running more queries, while admittedly hacky, is less expensive than doing more writes.
Before anyone gets the idea that the links in the parent are worth trying, they're not: the performance is absolutely atrocious – on our data set, 13 seconds per lookup.
We wrote our own approach on App Engine and now get stable performance on our datasets at ~300ms per lookup.
(We're doing Foursquare-type lookups.)
'We wrote our own approach on App Engine'
Hmm... details?
I got burnt by AppEng, too. Picking AppEng as a platform is one of my worst technical decisions.
"Google App Engine is free to use during the preview release, but the amount of computing resources any app can use is limited. In the future, developers will be able to purchase additional computing resources as needed, but Google App Engine will always be free to get started."
You got 2+ years of use for 'free' and now they decided to turn it into a supported business model and are asking you to pay for what you use. Seems reasonable to me.
I don't disagree that they fubar'd their original pricing release announcement and should have had multithreading Python 2.7 for those folks.
But they did listen to the (loud) feedback, made adjustments and even apologized (were you there at the ThirstyBear meetup where they bought us all beers?).
Not having to hire an IT staff or be woken up in the middle of the night when AWS decides to reboot the host and your servers go down is worth its weight in gold.
Sometimes it's still cheaper to have your own managed / self-managed gear... and from the looks of this pricing, even hire someone fulltime/freelancing to manage it all for you.
It wouldn't cost you any money (unless you have metered electricity), but rather just opportunity cost of being able to do other work with your resources.
Unless you only need a short-term lease on the equipment, cloud servers will be more expensive than dedicated/colocated servers.
Hardening the server is something that can be outsourced for a lot less than thousands. Services like Linode seem to be a nice middle ground. While I don't see myself going back to my own hardware in a rack I run in a datacenter, I do still see the benefit of knowing your stack a bit beyond coding. Knowing how the stack works helps when building software quite often.
Anyhow, those are just my experiences. VPSes with a very strong toolkit to take the edge off self-administering, like Linode, seem to be a very nice option. Heroku has caught my eye too but they have completely different measurements.
I honestly do not get why people are so fascinated with the cloud. It's a very expensive way to avoid having to know what you're doing.
It's all about time, and lack of it. If you can spend 1/10th of the time and still make a good profit, you could spend the other 9/10ths doing other profitable things.
Contrast this with running a stand-alone application server for each site, which is what GAE does. Here, even if your code is not serving any requests it's still waiting to get them. Now, GAE has powerful magic in it to retire request handlers which aren't frequently used. This way if site foo.com is getting 1 request/minute, it only really needs one process/thread/handler abstraction at a time. However, it is expensive to start/stop these "processes", so instead GAE is forced to keep this "process" around for a while after a request has been served, hoping that the cost of keeping it alive would be justified by a second request. Thus these stateful, slow-to-start processes are always taking up resources that could be used to serve other requests.
Disclaimer: all my knowledge of GAE has been from reading their docs/blog, not from deploying projects to it.
Disclaimer 2: I am not saying that PHP is better/worse than GAE in any way. However, I am saying that the model that GAE uses is more costly for a typical application. This can be easily seen by comparing the cost of running a basic site on GAE vs $2/month shared hosting.
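The keep-alive trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions: one instance, a fixed service time per request, and a fixed idle timeout after each request. The 0.1s and 120s values are invented, and it does not reflect how GAE's scheduler actually works.

```python
def instance_seconds(request_times, service_time=0.1, idle_timeout=0.0):
    """Total seconds one instance stays resident for the given arrivals.

    The instance is up while serving a request and for `idle_timeout`
    seconds afterwards; overlapping windows are merged into one.
    """
    total = 0.0
    window_start = window_end = None
    for t in sorted(request_times):
        end = t + service_time + idle_timeout
        if window_end is None or t > window_end:
            # Previous window closed before this request: count it, start anew.
            if window_end is not None:
                total += window_end - window_start
            window_start, window_end = t, end
        else:
            # Request arrived while the instance was still alive: extend.
            window_end = max(window_end, end)
    if window_end is not None:
        total += window_end - window_start
    return total

hour = [60.0 * i for i in range(60)]  # one request per minute for an hour
busy_only = instance_seconds(hour, idle_timeout=0.0)
kept_alive = instance_seconds(hour, idle_timeout=120.0)
print(busy_only, kept_alive)
```

In this model, a site with one request per minute does a few seconds of actual work per hour, yet the instance is resident for the whole hour once the idle timeout exceeds the gap between requests.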
GAE has problems, but I think the root is just how unique everything is. That manifests itself in people using a datastore that they don't understand, with Google expecting them to know how many writes an action will take and whether that feels like the right number of writes or two orders of magnitude more than if they made a different decision about how to store their data and solve their problems.
It also manifests itself in the lockin that Heroku mostly avoids (which is a huge problem if some subset of users get to a point where they realize "whoops, this would be much easier if I could do things Google won't let me do, time to leave").
I think a good counterexample is Engine Yard and GitHub. Engine Yard had a somewhat limited offering (especially for what GitHub was willing to pay) that didn't really fit with GitHub's heavy direct disk I/O. (Most Rails apps almost entirely read and write from the db, but GitHub does a lot of direct operations on the git repositories.) But GitHub was still just a Rails app, not an app for some specially-designed Engine Yard framework. So it was fairly painless for them to decide to solve the problem in a way that didn't fit with what Engine Yard would offer them and migrate to their own hardware. It wasn't easy, especially since they weren't solving an easy problem, but at least they didn't have to replace their database.
I am not familiar with the internals of Heroku and don't know how they solve the problems I outlined. Maybe someone else can elaborate.
The value proposition of App Engine is that with no systems administration expertise you can rent an extremely reliable, massively scalable web platform that is managed around the clock by a world-class devops team. Unsurprisingly this costs money. If you don't need the reliability or scalability of App Engine, no one is forcing you to pay for it. But it's absurd to suggest that you can get anything remotely comparable in PHP for $2/month.
For those unfamiliar with GAE, a ListProperty is really a collection of properties. The author is using the property as a geohash with a significant number of values, plus he has additional multiproperty indexes defined, plus he's doing a rewrite (delete + write). All combined it appears to be ~460 writes per entity.
So what we're talking about is $6500 for 6.5 billion writes... exactly what is printed on the sales brochure. Is that a lot? Most datastores don't charge by the operation so I don't have a lot to compare it to. It seems expensive but not crazy, especially considering that the data is replicated via Paxos to 3+ datacenters with automatic load balancing and failover.
So their implementation is a compromise on account of GAE's limitations, and they have to pay through the nose to use it. This is when I'd be looking at hosting some features outside of GAE, which is what we do with Full-text Search.
Geohashing is a reasonable solution for some spatial problem domains; it's one solution along the spectrum of "precalculate a lot up front and make queries cheap" vs "write in a cheap & easy format but make queries more expensive". Pre-calculation strategies are usually more scalable when you have large query loads, but they suck bigtime if you need to fully recalculate a large body of data (as the original blog author is doing).
Maybe the blogger would be better off using PostGIS; but then, scaling and synchronizing a large cluster of PostGIS systems is nontrivial. The issues here are too application-specific to draw any positive or negative conclusions about appengine.
See:
http://code.google.com/p/googleappengine/issues/detail?id=79...
People requested custom SSL support in 2008, and it is now 2012. If you still believe in App Engine, good luck!
>> the "trusted tester program" is a joke. They never respond so it's just a waste of time.
Even if they launched this feature TODAY, that would be 4 years for a basic requirement. What can you expect from them?
Sadly, this is kind of "typical Google" -- great product, decent execution, but a bad identity problem -- it really feels like they're not sure yet what they want to do with this.
So, yes you can build serious applications on GAE but like everything else it boils down to, it depends on what you really need.
It also casts the other users as the opponent, instead of google.
And with Heroku you can have it taken care of for you, following a few simple rules.
So, why exactly would one use the crippled GAE platform, that constantly breaks its promises (re: reliability), forces you to code with very little flexibility (and, no, not every app that needs to automatically and massively scale "has to be coded exactly like a GAE app anyway"), costs a fortune (and sometimes an unexpected fortune), and breaks for you as soon as you need a technology not on offer?