It isn't true. There are plenty of stores with the same strong consistency model (tolerating F failures out of 2F+1 nodes), among them Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper.
> However, GoshawkDB has a radically different architecture to most databases and its object-store design means that contention on objects can be dramatically reduced, allowing the performance impact of strong-serializability to be minimised.
Can you please describe the algorithm that you use for implementing distributed transactions?
> just three network delays occur (and one fsync) before the outcome of the transaction is known to that node
3 round trips per transaction is a result very similar to Yabandeh-style transactions; see CockroachDB's blog for details - http://www.cockroachlabs.com/blog/how-cockroachdb-distribute...
Do you use the same approach?
> It isn't true. There are plenty of stores with the same strong consistency model (tolerating F failures out of 2F+1 nodes), among them Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper.
The point I was trying to make is the decoupling between cluster size and F. ZK, for example, requires that the cluster size is exactly 2F+1, not a minimum of 2F+1. That means that as the cluster size increases, performance will decrease, as more nodes have to be contacted.
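To illustrate the coupling being described (this is just a sketch of majority-quorum arithmetic, not code from either system): when N is tied to F by N = 2F+1, every operation must reach a majority of floor(N/2)+1 nodes, so growing the cluster grows the per-operation quorum too.

```go
package main

import "fmt"

// In a majority-quorum system where the cluster size N is fixed at
// 2F+1, every operation must contact a majority: floor(N/2)+1 nodes.
func majorityQuorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7, 9} {
		fmt.Printf("cluster=%d tolerates F=%d failures, quorum=%d nodes\n",
			n, (n-1)/2, majorityQuorum(n))
	}
}
```

So a 9-node cluster tolerates more failures than a 3-node one, but each operation now has to contact 5 nodes instead of 2, which is the performance cost mentioned above.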
> Can you please describe the algorithm that you use for implementing distributed transactions?
Yes. In detail, there will be a blog post forthcoming. However, the two main points are that 2PC is replaced by Gray and Lamport's paper "Consensus on Transaction Commit", and that dependencies between transactions are modelled through vector clocks. As I'm visiting lots of family over xmas, I'm afraid I don't have the 12+ hours necessary to document the entire algorithm just at the moment. However, I will certainly get to this asap.
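Since the blog post isn't out yet, here is a minimal, generic vector clock sketch to show what "dependencies modelled through vector clocks" can mean. This is a textbook vector clock, not GoshawkDB's actual data structure; the node names and the Merge/HappensBefore API are my own invention for illustration.

```go
package main

import "fmt"

// VClock maps node IDs to logical event counters.
type VClock map[string]uint64

// Tick records a new local event on the given node.
func (v VClock) Tick(node string) { v[node]++ }

// Merge takes the element-wise max of two clocks, as when one
// transaction observes the effects of another.
func (v VClock) Merge(other VClock) {
	for node, t := range other {
		if t > v[node] {
			v[node] = t
		}
	}
}

// HappensBefore reports whether v causally precedes other: every
// component of v is <= other's, and at least one is strictly behind.
func (v VClock) HappensBefore(other VClock) bool {
	strictlyLess := false
	for node, t := range v {
		if t > other[node] {
			return false
		}
		if t < other[node] {
			strictlyLess = true
		}
	}
	// other may also have events on nodes v has never seen
	for node := range other {
		if _, seen := v[node]; !seen && other[node] > 0 {
			strictlyLess = true
		}
	}
	return strictlyLess
}

func main() {
	a := VClock{}
	a.Tick("n1") // txn A commits on n1
	b := VClock{}
	b.Merge(a)   // txn B depends on A's effects
	b.Tick("n2") // ...and commits on n2
	fmt.Println(a.HappensBefore(b)) // true
	fmt.Println(b.HappensBefore(a)) // false
}
```

The useful property is that two clocks where neither happens-before the other identify concurrent, independent transactions, which is exactly the dependency information a commit protocol needs.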
> Do you use the same approach?
I don't know. I don't use the term round-trip because that doesn't make it clear whether we're talking about individual messages or message-hops, and these sorts of conversations can often go wrong if we use different terminology. The 3 network-delays that I mention come directly from the Gray and Lamport paper I mention above.
How does this work? Suppose we set F=2 and have 7 nodes, A-G. Client 1 tries to write, contacting 5 nodes, A-E. A and B are down at this point, but C, D and E accept the write so it calls it good. Then A and B come up and C and D go down. Client 2 tries to read, contacting nodes A, B, C, F, G. 4 out of the 5 respond successfully, so it calls it good. But it can't possibly read the value that client 1 wrote.
Whatever way you do it, the number of nodes contacted for a write + number of nodes contacted for a read has to add up to more than the cluster size. If your cluster is somehow sharded then you can work around this with e.g. consistent hashing (so that client 2 knows which 5 of the 7 nodes a particular key must be stored on, and contacts the correct subset), but that only works if client 2 can know what the key was - and I think Cassandra already supports that.
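The A-G scenario above can be checked mechanically. This sketch (my own illustration, using the node names from the example) shows that the set of nodes that actually acknowledged the write and the set that answered the read share no member, which is why the read cannot see the write:

```go
package main

import "fmt"

// overlap reports whether a write-acknowledging set and a
// read-responding set of node IDs share at least one node,
// i.e. whether the read is guaranteed a chance to see the write.
func overlap(write, read []string) bool {
	seen := map[string]bool{}
	for _, n := range write {
		seen[n] = true
	}
	for _, n := range read {
		if seen[n] {
			return true
		}
	}
	return false
}

func main() {
	// 7 nodes A-G, F=2. Client 1 contacted A-E but only C, D, E
	// acked (A, B down). Client 2 contacted A, B, C, F, G but only
	// A, B, F, G responded (C, D down by then).
	write := []string{"C", "D", "E"}
	read := []string{"A", "B", "F", "G"}
	fmt.Println(overlap(write, read)) // false: the read misses the write
}
```

With W successful writes and R successful reads out of N nodes, W + R > N is what forces at least one common node; 3 + 4 = 7 is not greater than 7, so the anomaly is possible.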
2PC, at least in its usual forms, is not foolproof: if failures happen in the "right" order then the whole txn gets blocked; it's actually proven to be equivalent to Paxos with F=0 (i.e. no ability to tolerate failure). On this point alone, GoshawkDB stands apart from anything that uses 2PC.
I fail to see the point of making your cluster unavailable before you've lost so many nodes that you no longer have a quorum. It seems odd to have a cluster that can handle e.g. 4 node failures, and take it offline after only 2. Why would anyone want a feature like that?
0: Reliable Autonomic Distributed Object Store
1: http://docs.ceph.com/docs/master/rados/api/librados-intro/
Curious how this works at scale. For example, if a node is down and N requests start blocking, is there an upper bound on N? What happens when this is reached? Does N map to a goroutine? A file descriptor? Are there timeouts?
It seems like a node going down could essentially mean the entire show going down if requests keep filling up and not being resolved: you'd think there would be a tipping point at which other requests can no longer be served because all resources are blocked.
Possibly there is just a sane timeout and these requests also just fail, leaving the client to know something is amiss?
> Curious how this works at scale. For example, if a node is down and N requests start blocking, is there an upper bound on N? What happens when this is reached? Does N map to a goroutine? A file descriptor? Are there timeouts?
The blocking actually occurs on a per-client basis: when a client submits a txn to the server to which it's connected, that server will calculate which other server nodes need to be contacted in order for the txn to be voted on. If it cannot find enough servers due to failures then it will outright reject the txn and tell the client that there are currently too many failures going on.
However, the server could find the right set of server nodes and send off the txn to be voted on, verified and validated. Right at that point there could be further failures, and so it's not known how far those messages got. It is in this case that the txn could block, as until some nodes recover it's impossible to safely abort the txn. But it's important to note that:
1. If a txn starts blocking, that'll block just that txn and that client (as the client is blocked waiting on the outcome of the txn). No other part of the server or any other server is affected.
2. Other clients can be submitting other txns at the same time that need different un-failed nodes to proceed, and they progress just fine.
3. There is no global determination (yet) of whether a node is failed or not. Thus this approach can cope with really weird network issues - e.g. random connections between different nodes failing. There is no need for the whole remaining cluster to agree that nodes X, Y and Z are "out" - none of the algorithms depend on that sort of thing.
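The fail-fast part of the description above can be sketched roughly like this. This is my own illustration of the stated behaviour, not GoshawkDB's code; the function name, the liveness map and the voter count are all invented for the example:

```go
package main

import (
	"errors"
	"fmt"
)

var errTooManyFailures = errors.New("too many node failures to vote on txn")

// pickVoters chooses `needed` nodes currently believed reachable for a
// txn. If not enough can be found, the txn is rejected up front and the
// client is told, rather than being left to block.
func pickVoters(nodes map[string]bool, needed int) ([]string, error) {
	var voters []string
	for n, alive := range nodes {
		if alive {
			voters = append(voters, n)
			if len(voters) == needed {
				return voters, nil
			}
		}
	}
	return nil, errTooManyFailures
}

func main() {
	cluster := map[string]bool{"A": false, "B": false, "C": true, "D": true, "E": true}

	// Needs 3 voters and finds them: the txn is sent off. If a voter
	// then dies mid-vote, only this txn's client would block.
	if voters, err := pickVoters(cluster, 3); err == nil {
		fmt.Println("txn submitted to", len(voters), "voters")
	}

	// Needs 4 voters: rejected immediately.
	if _, err := pickVoters(cluster, 4); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The key property from points 1-3 is that the blocking case is scoped to a single txn and its client; nothing here takes a cluster-wide decision about which nodes are dead.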
I hope this answers your questions.
With your situation #1, what if this is a very common transaction and you therefore have 100 of these all waiting? What about 1000, 5000, etc.? What system resources are used to let these transactions wait indefinitely (if I understand your semantics with specific regard to blocking)?
Some systems handle this as a failure that is communicated to the client rapidly. Other systems let N clients actually wait indefinitely, but at the cost of a thread / file descriptor, etc. per waiter; in systems with a finite number of threads, for example, that would translate in this paradigm into an upper bound on how many requests can be left waiting.
So I'm just trying to get a feeling for how this could hold an unbounded number of waiting transactions due to partial failure and still keep taking requests.
Thanks for the reply, this stuff is always interesting
I'm not expecting to ever need to model a global property of "these are the set of nodes that are up and running". I always worry about the effect of weird and wacky network issues on such systems.
IMHO perfect hashing solves the problem of distributing the data, but the load may follow completely different patterns. See Instagram's Justin Bieber problem.
The best you'd be able to do seems like read-committed isolation.
That type of model is certainly the goal for GoshawkDB but I certainly accept your point that currently the Go client API offers something that is more similar to a document store. With GoshawkDB it really is up to the client as to what API you offer. For languages which allow you full proxies of all object field getters and setters, you should be able to match ZODB, and that would certainly be the goal. But without that sort of support in the programming language, the only other way to get there is to define your object types externally, and then compile special classes that do the magic for you. That too is something I'll look at for GoshawkDB: it's likely to ultimately be faster than relying on reflection and other techniques.
An awful lot of problems with these sorts of products are caused by the same terminology being used for different meanings. I'm sorry if you feel I've made this worse. Hopefully it is the case that in future releases there will be both more clients and they will offer a model which is closer to what you're expecting with the term "object store".
So GoshawkDB is not an object database and apparently never will be. That's OK, but don't call it an object store. It's misleading, as it uses an already-established term for something different, which also already has a name.
Pretty cool looking stuff. Where'd you get the design for object/graph structure? It is what we are using for our database, http://gunDB.io/ . Although we take the opposite approach to CAP than you, and have no support for serializability (I don't think this belongs in the protocol but instead in the data, like git).
What's your take on concurrency? How do you handle conflicts (I assume this is easy for you with a master)?
Would love to jump on a chat too, being DB guys. Shoot me an email at mark@gunDB.io, always welcome!