It isn't true. There are plenty of stores with the same strong consistency model (tolerating F failures out of 2F+1 nodes), among them Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper.
> However, GoshawkDB has a radically different architecture to most databases and its object-store design means that contention on objects can be dramatically reduced, allowing the performance impact of strong-serializability to be minimised.
Can you please describe the algorithm that you use for implementing distributed transactions?
> just three network delays occur (and one fsync) before the outcome of the transaction is known to that node
3 round trips per transaction is a result very similar to Yabandeh-style transactions; see CockroachDB's blog for details - http://www.cockroachlabs.com/blog/how-cockroachdb-distribute...
Do you use the same approach?
> It isn't true. There are plenty of stores with the same strong consistency model (tolerating F failures out of 2F+1 nodes), among them Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper.
The point I was trying to make is the decoupling between cluster size and F. ZK, for example, requires that the cluster size is exactly 2F+1, not a minimum of 2F+1. That means that as the cluster size increases, performance will decrease, as more nodes have to be contacted.
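To illustrate the coupling being described (this is just a sketch of majority-quorum arithmetic, not code from either system): when N is tied to F by N = 2F+1, every operation must reach a majority of floor(N/2)+1 nodes, so growing the cluster grows the per-operation quorum too.

```go
package main

import "fmt"

// In a majority-quorum system where the cluster size N is fixed at
// 2F+1, every operation must contact a majority: floor(N/2)+1 nodes.
func majorityQuorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7, 9} {
		fmt.Printf("cluster=%d tolerates F=%d failures, quorum=%d nodes\n",
			n, (n-1)/2, majorityQuorum(n))
	}
}
```

So a 9-node cluster tolerates more failures than a 3-node one, but each operation now has to contact 5 nodes instead of 2, which is the performance cost mentioned above.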
> Can you please describe the algorithm that you use for implementing distributed transactions?
Yes. In detail, there will be a blog post forthcoming. However, the two main points are that 2PC is replaced by Gray and Lamport's paper "Consensus on Transaction Commit", and that dependencies between transactions are modelled through vector clocks. As I'm visiting lots of family over xmas, I'm afraid I don't have the 12+ hours necessary to document the entire algorithm just at the moment. However, I will certainly get to this asap.
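Since the blog post isn't out yet, here is a minimal, generic vector clock sketch to show what "dependencies modelled through vector clocks" can mean. This is a textbook vector clock, not GoshawkDB's actual data structure; the node names and the Merge/HappensBefore API are my own invention for illustration.

```go
package main

import "fmt"

// VClock maps node IDs to logical event counters.
type VClock map[string]uint64

// Tick records a new local event on the given node.
func (v VClock) Tick(node string) { v[node]++ }

// Merge takes the element-wise max of two clocks, as when one
// transaction observes the effects of another.
func (v VClock) Merge(other VClock) {
	for node, t := range other {
		if t > v[node] {
			v[node] = t
		}
	}
}

// HappensBefore reports whether v causally precedes other: every
// component of v is <= other's, and at least one is strictly behind.
func (v VClock) HappensBefore(other VClock) bool {
	strictlyLess := false
	for node, t := range v {
		if t > other[node] {
			return false
		}
		if t < other[node] {
			strictlyLess = true
		}
	}
	// other may also have events on nodes v has never seen
	for node := range other {
		if _, seen := v[node]; !seen && other[node] > 0 {
			strictlyLess = true
		}
	}
	return strictlyLess
}

func main() {
	a := VClock{}
	a.Tick("n1") // txn A commits on n1
	b := VClock{}
	b.Merge(a)   // txn B depends on A's effects
	b.Tick("n2") // ...and commits on n2
	fmt.Println(a.HappensBefore(b)) // true
	fmt.Println(b.HappensBefore(a)) // false
}
```

The useful property is that two clocks where neither happens-before the other identify concurrent, independent transactions, which is exactly the dependency information a commit protocol needs.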
> Do you use the same approach?
I don't know. I don't use the term round-trip because that doesn't make it clear whether we're talking about individual messages or message-hops, and these sorts of conversations can often go wrong if we use different terminology. The 3 network-delays that I mention come directly from the Gray and Lamport paper I mention above.
How does this work? Suppose we set F=2 and have 7 nodes, A-G. Client 1 tries to write, contacting 5 nodes, A-E. A and B are down at this point, but C, D and E accept the write so it calls it good. Then A and B come up and C and D go down. Client 2 tries to read, contacting nodes A, B, C, F, G. 4 out of the 5 respond successfully, so it calls it good. But it can't possibly read the value that client 1 wrote.
Whatever way you do it, the number of nodes contacted for a write + number of nodes contacted for a read has to add up to more than the cluster size. If your cluster is somehow sharded then you can work around this with e.g. consistent hashing (so that client 2 knows which 5 of the 7 nodes a particular key must be stored on, and contacts the correct subset), but that only works if client 2 can know what the key was - and I think Cassandra already supports that.
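The A-G scenario above can be checked mechanically. This sketch (my own illustration, using the node names from the example) shows that the set of nodes that actually acknowledged the write and the set that answered the read share no member, which is why the read cannot see the write:

```go
package main

import "fmt"

// overlap reports whether a write-acknowledging set and a
// read-responding set of node IDs share at least one node,
// i.e. whether the read is guaranteed a chance to see the write.
func overlap(write, read []string) bool {
	seen := map[string]bool{}
	for _, n := range write {
		seen[n] = true
	}
	for _, n := range read {
		if seen[n] {
			return true
		}
	}
	return false
}

func main() {
	// 7 nodes A-G, F=2. Client 1 contacted A-E but only C, D, E
	// acked (A, B down). Client 2 contacted A, B, C, F, G but only
	// A, B, F, G responded (C, D down by then).
	write := []string{"C", "D", "E"}
	read := []string{"A", "B", "F", "G"}
	fmt.Println(overlap(write, read)) // false: the read misses the write
}
```

With W successful writes and R successful reads out of N nodes, W + R > N is what forces at least one common node; 3 + 4 = 7 is not greater than 7, so the anomaly is possible.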
2PC, at least in its usual forms, is not foolproof: if failures happen in the "right" order then the whole txn gets blocked; it's actually proven to be equivalent to Paxos with F=0 (i.e. no ability to tolerate failure). On this point alone, GoshawkDB stands apart from anything that uses 2PC.
I fail to see the point of making your cluster unavailable before you've lost so many nodes that you no longer have a quorum. It seems odd to have a cluster that can handle e.g. 4 node failures, and take it offline after only 2. Why would anyone want a feature like that?
0: Reliable Autonomic Distributed Object Store
1: http://docs.ceph.com/docs/master/rados/api/librados-intro/
Curious how this works at scale. For example, if a node is down and N requests start blocking, is there an upper bound on N? What happens when this is reached? Does N map to a goroutine? A file descriptor? Are there timeouts?
It seems like a node going down could essentially mean the entire show going down if requests keep filling up and not being resolved: you'd think there would be a tipping point at which other requests can no longer be served because all resources are blocked.
Possibly there is just a sane timeout and these requests also just fail, leaving the client to know something is amiss?
> Curious how this works at scale. For example, if a node is down and N requests start blocking, is there an upper bound on N? What happens when this is reached? Does N map to a goroutine? A file descriptor? Are there timeouts?
The blocking actually occurs on a per-client basis: when a client submits a txn to the server to which it's connected, that server will calculate which other server nodes need to be contacted in order for the txn to be voted on. If it cannot find enough servers due to failures then it will outright reject the txn and tell the client that there are currently too many failures going on.
However, the server could find the right set of server nodes and send off the txn to be voted on, verified and validated. Right at that point there could be further failures, and so it's not known how far those messages got. It is in this case that the txn could block, as until some nodes recover it's impossible to safely abort the txn. But it's important to note that:
1. If a txn starts blocking, that'll block just that txn and that client (as the client is blocked waiting on the outcome of the txn). No other part of the server or any other server is affected.
2. Other clients can be submitting other txns at the same time that need different un-failed nodes to proceed, and they progress just fine.
3. There is no global determination (yet) of whether a node is failed or not. Thus this approach can cope with really weird network issues - e.g. random connections between different nodes failing. There is no need for the whole remaining cluster to agree that nodes X, Y and Z are "out" - none of the algorithms depend on that sort of thing.
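The fail-fast part of the description above can be sketched roughly like this. This is my own illustration of the stated behaviour, not GoshawkDB's code; the function name, the liveness map and the voter count are all invented for the example:

```go
package main

import (
	"errors"
	"fmt"
)

var errTooManyFailures = errors.New("too many node failures to vote on txn")

// pickVoters chooses `needed` nodes currently believed reachable for a
// txn. If not enough can be found, the txn is rejected up front and the
// client is told, rather than being left to block.
func pickVoters(nodes map[string]bool, needed int) ([]string, error) {
	var voters []string
	for n, alive := range nodes {
		if alive {
			voters = append(voters, n)
			if len(voters) == needed {
				return voters, nil
			}
		}
	}
	return nil, errTooManyFailures
}

func main() {
	cluster := map[string]bool{"A": false, "B": false, "C": true, "D": true, "E": true}

	// Needs 3 voters and finds them: the txn is sent off. If a voter
	// then dies mid-vote, only this txn's client would block.
	if voters, err := pickVoters(cluster, 3); err == nil {
		fmt.Println("txn submitted to", len(voters), "voters")
	}

	// Needs 4 voters: rejected immediately.
	if _, err := pickVoters(cluster, 4); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The key property from points 1-3 is that the blocking case is scoped to a single txn and its client; nothing here takes a cluster-wide decision about which nodes are dead.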
I hope this answers your questions.
With your situation #1, what if this is a very common transaction and you therefore have 100 of these all waiting? What about 1000, 5000, etc.? What system resources are used to let these transactions wait indefinitely (if I understand your semantics with specific regard to blocking)?
Some systems handle this as a failure that is communicated to the client rapidly. Other systems let N clients actually wait indefinitely, but at the cost of a thread / file descriptor, etc. per waiter; in systems with a finite number of threads, for example, that would translate in this paradigm into an upper bound on how many requests can be left waiting.
So I'm just trying to get a feeling for how this could hold an unbounded number of waiting transactions due to partial failure and still keep taking requests.
Thanks for the reply, this stuff is always interesting
I'm not expecting to ever need to model a global property of "these are the set of nodes that are up and running". I always worry about the effect of weird and wacky network issues on such systems.
IMHO perfect hashing solves the problem of distributing the data, but the load may follow completely different patterns. See Instagram's Justin Bieber problem.
The best you'd be able to do seems like read-committed isolation.
That type of model is certainly the goal for GoshawkDB but I certainly accept your point that currently the Go client API offers something that is more similar to a document store. With GoshawkDB it really is up to the client as to what API you offer. For languages which allow you full proxies of all object field getters and setters, you should be able to match ZODB, and that would certainly be the goal. But without that sort of support in the programming language, the only other way to get there is to define your object types externally, and then compile special classes that do the magic for you. That too is something I'll look at for GoshawkDB: it's likely to ultimately be faster than relying on reflection and other techniques.
An awful lot of problems with these sorts of products are caused by the same terminology being used for different meanings. I'm sorry if you feel I've made this worse. Hopefully it is the case that in future releases there will be both more clients and they will offer a model which is closer to what you're expecting with the term "object store".
So GoshawkDB is not an object database and apparently never will be. That's OK, but don't call it an object store. It's misleading, as it uses an already-established term for something different, which also already has a name.
Pretty cool looking stuff. Where'd you get the design for object/graph structure? It is what we are using for our database, http://gunDB.io/ . Although we take the opposite approach to CAP than you, and have no support for serializability (I don't think this belongs in the protocol but instead in the data, like git).
What's your take on concurrency? How do you handle conflicts (I assume this is easy for you with a master)?
Would love to jump on a chat too, being DB guys. Shoot me an email at mark@gunDB.io, always welcome!