The target computing environments are both GPU clusters (InfiniBand interconnect) and other distributed systems (high latency and error-prone).
As for the optimization algorithms, the framework will support both derivative-free methods (e.g. VXQR1 or GA) and some variation of gradient descent.
Since we would like the framework to be fault-tolerant, I assume MPI is not an option. Is that right?
Which messaging system would be appropriate? I was thinking of ZeroMQ (0mq), but I am getting mixed reactions from the experts.
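
To make the fault-tolerance requirement concrete, here is a minimal sketch of the behaviour I would want from the messaging layer, whichever library ends up carrying the messages. It is only an illustration under assumptions: threads and in-process queues stand in for nodes and the network, `sphere` is a toy objective, and the names `flaky_worker`, `evaluate_population`, and `fail_on` are hypothetical. The coordinator hands out candidate points (as a derivative-free method would), and reissues any evaluation that gets no reply within a timeout.

```python
import queue
import threading


def sphere(x):
    """Toy objective, a hypothetical stand-in for the real cost function."""
    return sum(v * v for v in x)


def flaky_worker(tasks, results, fail_on):
    """Evaluate tasks; silently drop those listed in fail_on to mimic a crash."""
    while True:
        item = tasks.get()
        if item is None:  # shutdown sentinel
            return
        task_id, attempt, x = item
        if (task_id, attempt) in fail_on:
            continue  # simulated lost message or dead node: no reply is sent
        results.put((task_id, sphere(x)))


def evaluate_population(population, n_workers=3, timeout=0.2, fail_on=frozenset()):
    """Return objective values in population order, reissuing lost evaluations."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=flaky_worker, args=(tasks, results, fail_on), daemon=True)
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()

    attempts = {i: 0 for i in range(len(population))}
    for i, x in enumerate(population):
        tasks.put((i, 0, x))

    scores = {}
    while len(scores) < len(population):
        try:
            task_id, value = results.get(timeout=timeout)
            scores[task_id] = value  # duplicate replies just overwrite the same value
        except queue.Empty:
            # Nothing arrived in time: reissue every evaluation still outstanding.
            for i in attempts:
                if i not in scores:
                    attempts[i] += 1
                    tasks.put((i, attempts[i], population[i]))

    for _ in workers:
        tasks.put(None)
    return [scores[i] for i in range(len(population))]
```

For example, `evaluate_population([[1.0, 2.0], [0.0, 0.0]], fail_on={(0, 0)})` loses the first attempt at point 0 and still returns a value for every point. The design choice here is that evaluations are idempotent, so blind reissue after a timeout is safe; a gradient-descent variant with shared state would need more care.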