With your situation #1, what if this is a very common transaction and you therefore have 100 of these all waiting? What about 1000, 5000, etc.? What system resources are used to let these transactions wait indefinitely (if I understand your semantics correctly with specific regard to blocking)?
Some systems handle this as a failure that is communicated to the client rapidly. Other systems let N clients actually wait indefinitely, but at the cost of taking up a thread, file descriptor, etc. In systems that have a finite number of threads, for example, this would then show up in this paradigm as an upper bound on how many requests could be left waiting.
So I'm just trying to get a feeling for how this could have an unbounded number of waiting transactions due to partial failure and still keep taking requests.
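To make the "upper bound on waiting requests" idea concrete, here is a minimal Go sketch of bounding blocked transactions with a buffered-channel semaphore - the `waiterPool` type and its limit are hypothetical, not anything from GoshawkDB itself:

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManyWaiters is returned when the waiting-transaction limit is hit.
var errTooManyWaiters = errors.New("too many transactions waiting")

// waiterPool caps how many transactions may block at once, so a partial
// failure cannot consume unbounded resources. This is a hypothetical
// sketch of the pattern, not code from the actual system.
type waiterPool struct {
	slots chan struct{}
}

func newWaiterPool(limit int) *waiterPool {
	return &waiterPool{slots: make(chan struct{}, limit)}
}

// tryAcquire reserves a waiting slot, failing fast when the pool is full.
func (p *waiterPool) tryAcquire() error {
	select {
	case p.slots <- struct{}{}:
		return nil
	default:
		return errTooManyWaiters
	}
}

// release frees a slot once the transaction's outcome is known.
func (p *waiterPool) release() { <-p.slots }

func main() {
	pool := newWaiterPool(2)
	fmt.Println(pool.tryAcquire()) // <nil>
	fmt.Println(pool.tryAcquire()) // <nil>
	fmt.Println(pool.tryAcquire()) // too many transactions waiting
	pool.release()
	fmt.Println(pool.tryAcquire()) // <nil>
}
```

With a scheme like this, the cost of partial failure is a known, fixed number of parked waiters rather than RAM growing without bound.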
Thanks for the reply - this stuff is always interesting.
So, as each client can only have one outstanding txn at a time, this requires that there are 100, 1000, 5000, etc. clients all submitting the same txn at the same time, and presumably they're all connected to the same server node. For each of those txns, the set of server nodes that need to be contacted has been calculated, and the necessary messages sent. At that point failures occur, so it's not known who has received which messages, and all these txns block.
The only system resource held at this point is RAM. Yes, it could get so bad that you eat up a lot of RAM - this is release 0.1 after all. There are a few areas where I have tickets open to make sure that GoshawkDB can spill this sort of state to disk as necessary, though it may not be too horrible just relying on swap to start with.
> Some systems handle this as a failure that is communicated to the client rapidly. Other systems let N clients actually wait indefinitely, but at the cost of taking up a thread, file descriptor, etc. In systems that have a finite number of threads, for example, this would then show up in this paradigm as an upper bound on how many requests could be left waiting.
The problem is that in this particular case, the txn could have committed; it's just that the server which initially received the txn from the client, and then forwarded it to the necessary server nodes, can't learn that it committed, due to failures. Now, I could certainly inform the client that the outcome is delayed, but the client may not be able to do anything with that information: it can't know for sure whether the txn has been committed or aborted.
The entire codebase is written using actors and finite state machines. Because goroutines are lightweight green threads that the Go runtime multiplexes onto OS threads, there is no problem with eating OS threads. In general, the design is such that the only things that should block are the client and, on the server, the actor/state-machine that is talking to that client.
Considering that more than one identical txn will, I imagine, often hit a single specific node, at large scale with a single node down - even if that means only 5% of transactions block - you are basically growing a queue of waiting work indefinitely, with the only upper bound being how much RAM you have. Meanwhile 5% of clients will be waiting, and this node may take a while to come back if it needs something.
Once you're out of RAM or hit other constraints, the 5% of the system that is not functioning turns into 100% real fast, because your capacity to handle the other 95% needs RAM or other resources that you now have dedicated to waiting.
If what you're saying is that each client can only have one blocked transaction, that is relevant, but it doesn't prevent a consumer from spinning up as many clients in succession as it takes to get through.
I would suggest that you have, at minimum, strict timeouts for the transactions, in conjunction with an immediate failure if the node that must answer is not available right now. So a client would never wait more than X seconds before the transaction is aborted and, if necessary, rolled back.
What this would create is a system with predictable failure cases when things go pear-shaped. You could calculate in advance how much overhead failures add, because clients wait for a bounded, deterministic time instead of indefinitely.
Furthermore, what if a node never comes back? There seems to be a need for a transaction failure that is handed back to the client, whether the node is down, gone forever, or simply timed out.
At the end of the day, even if your system is able to handle N thousands of transactions pleasantly waiting and can still answer other requests indefinitely, that is a great accomplishment, but in practice it may not be ideal for many workloads. People and computers both tend to retry interactions with data that are slow or failed, and the combination of just taking on more and more work and hoping everything flushes out when things become healthy is a recipe for a thundering herd - such workloads are better served by something like an async work queue pattern.
Btw, I say this without having looked at your code - just the home page - so it's possible these things exist and the bounds and failure cases just aren't clear yet.
Keep on hacking on it!
Ahh, no! For each obj, there are 2F+1 replicas, each replica on a different node. For each txn that hits an obj, you only need F+1 of those replicas to vote on the txn. So, provided F > 0, a single failure will never cause anything to block.
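The quorum arithmetic behind that claim is simple enough to show directly - a minimal illustration, not code from the system:

```go
package main

import "fmt"

// For an F-fault-tolerant object: 2F+1 replicas exist and only F+1
// need to vote, so up to F replica failures never block a txn.
func replicas(f int) int    { return 2*f + 1 }
func votesNeeded(f int) int { return f + 1 }

func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("F=%d: %d replicas, %d votes needed, tolerates %d failures\n",
			f, replicas(f), votesNeeded(f), replicas(f)-votesNeeded(f))
	}
}
```

So with F=1 there are 3 replicas of each object, any 2 of which can vote, and losing one node leaves every txn able to proceed.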
> I would suggest that you have, at minimum, strict timeouts for the transactions, in conjunction with an immediate failure if the node that must answer is not available right now. So a client would never wait more than X seconds before the transaction is aborted and, if necessary, rolled back.
I agree. I think it would be very difficult for a client to do anything sensible with such information, but even if all I'm doing is getting the client to resubmit the txn verbatim, at least it clears up the resource usage on the server, which is the most important thing.
> Furthermore, what if a node never comes back? There seems to be a need for a transaction failure that is handed back to the client, whether the node is down, gone forever, or simply timed out.
Well, this only becomes a problem if > F nodes fail and never come back - the whole design of consensus systems is to cope with failures up to a certain threshold. Provided <= F nodes fail, the failures are detected and any txns that are in flight are safely aborted (or, if a txn actually committed, that information is propagated) - this is all just the usual Paxos machinery. But yes, again, I completely agree: if you have a massive failure and you lose data, then you are going to have to recover from that. For GoshawkDB, that's going to require changing the topology, which is not supported in 0.1 but is the main goal for 0.2.
> At the end of the day, even if your system is able to handle N thousands of transactions pleasantly waiting and can still answer other requests indefinitely, that is a great accomplishment, but in practice it may not be ideal for many workloads. People and computers both tend to retry interactions with data that are slow or failed, and the combination of just taking on more and more work and hoping everything flushes out when things become healthy is a recipe for a thundering herd - such workloads are better served by something like an async work queue pattern.
Oh absolutely. In a previous life I did much of the core engineering on RabbitMQ. It was there that I slowly learnt that the chief problem tends to be that under heavy load you end up spending more CPU per event than under light load, so as soon as you go past a certain tipping point it's very difficult to come back. I certainly appreciate that human interaction with a data store is going to require consistent and predictable behaviour.
Thanks for your input.