undefined | Better HN

0 pointsjustinsb14y ago0 comments

I can't say I've ever put something in my cart, not logged in to Amazon for months, gone back to see it still there, and thought "wow - they must use some sort of webscale storage solution - I'm now going to buy the thing I didn't want to buy six months ago."

From the point of view of solving the actual business problem, therefore, I think something based around Memcache would work just fine!

Dynamo is very useful for persuading you that working for Amazon would be interesting, though.

0 comments

3 comments · 1 top-level

haberman14y ago· 2 in thread

Have you ever worked in an environment at the scale of Amazon? Machines go down all the time. I used the "six months" example to illustrate that the shopping cart is persistent, but the machine crash could just as easily happen the moment before you push "checkout." Losing shopping cart data just because one machine crashed is totally unacceptable.

Programming distributed systems that must survive machine failure and network partitions is a completely different ball of wax compared to simple web programming. If you haven't done it before, you would not believe how much more complicated it is.

Here is an extremely simple example. Suppose you're a radio station that's taking phone calls from people and you want to give an award to the 5th caller. Implementing a program that does this for a single machine is easy, and could be accomplished with a program something like this:

  import SocketServer

  class MyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
      self.server.caller += 1
      if self.server.caller == 5:
        self.request.sendall("Congratulations, you are the 5th caller!\n")
      else:
        self.request.sendall("Sorry, you're caller #%d\n" % (self.server.caller))

  server = SocketServer.TCPServer(("localhost", 9999), MyHandler)
  server.caller = 0
  server.serve_forever()

Using this program, you can "call" the program by doing "telnet localhost 9999" and the program will tell you what caller you are. This took me about 5 minutes to write, and I'd never used this Python API before.

Now imagine that you want to implement this same logic, but using a cluster of machines that could go down at any time. You want the group of machines to form "consensus" about which number each caller is; consensus in this context just means that the group of machines arrives at a single answer, and any machine you ask will give you the same answer.

Finding an algorithm that can do this robustly is so difficult that it was a major breakthrough when one was discovered in 1988. It's called Paxos and you can read about it here: http://en.wikipedia.org/wiki/Paxos_(computer_science) Even though it has been known for over 20 years, it is still a complex topic that very few people understand the details of.

The point of all of this is just to say; you can't compare a single-process in-memory cache to a distributed and fault-tolerant system. They are completely different beasts, and many business problems do indeed need the latter.

justinsbOP14y ago

I know about Paxos. I know when to use it, and when not to use it.

Seeing as you came up with the example, using Paxos for an "Nth caller wins" is a really bad idea - the Nth caller to your switchboard likely won't be the one selected. You probably _need_ a single server system, like the example you wrote (but without the threading errors.)

haberman14y ago

Then why would you compare dynamo and memcached? Or suggest that Amazon could use an in-memory cache as the primary store for shopping cart data? It makes no sense.

If you want to be particular about ordering, then you can always use paxos to elect a master that handles everything serially and failsover.

1 more reply

j / k navigate · click thread line to collapse