If memcached papers have one thing in common, it's an uncanny ability to get the comparison software to run as slowly as possible. 100k ops/sec/core is what you get with a single client connection doing blocking I/O. Using more clients (as in a normal prod setup) or pipelining queries gets you more like 1M+ ops/sec/core, with writes scaling worse than reads. In production it's easy to get some level of pipelining (multigets, clustered keys, etc.), since you're rarely just fetching a key and then blocking.
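The pipelining described above can be sketched at the protocol level. This is an illustrative snippet, not anyone's production client; the key names and helper functions are made up, and it uses memcached's public ASCII protocol:

```python
# Sketch: the difference between one-blocking-request-at-a-time and
# pipelined/multiget fetches in memcached's ASCII protocol.

def build_pipelined_gets(keys):
    """One 'get' command per key, concatenated into a single buffer.

    Sending this buffer in one write lets the server answer all the
    requests back to back: one network round trip instead of len(keys).
    """
    return b"".join(b"get %s\r\n" % k.encode() for k in keys)

def build_multiget(keys):
    """memcached's native multiget: several keys on one 'get' line."""
    return b"get " + b" ".join(k.encode() for k in keys) + b"\r\n"

keys = ["user:1", "user:2", "user:3"]
pipelined = build_pipelined_gets(keys)  # 3 commands sent in 1 write
multiget = build_multiget(keys)         # 1 command covering 3 keys
```

Either form amortizes the per-syscall and per-round-trip cost over many keys, which is where the 100k-vs-1M gap comes from.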
A much better FPGA paper would show at what scale syscall overhead becomes the bulk of the CPU usage, as well as any measured latency improvements. I think some of the other papers address latency at least.
In reality it hardly matters. If you're hitting memcached hard enough with tiny values for it to matter, ensuring keys are clustered and pipelined is a lot less maintenance overhead than deploying FPGAs.
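Key clustering here means routing related keys to the same server so one multiget can fetch them together. A minimal sketch of the idea, assuming a made-up server list and a "cluster by key prefix" convention (both are illustrative, not from the thread):

```python
import hashlib

# Hypothetical pool of memcached servers.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

def server_for(key):
    """Pick a server by hashing only the prefix before the first ':'.

    Because routing ignores the suffix, all keys sharing a prefix
    (e.g. 'user42:name', 'user42:email') land on the same server,
    so a single pipelined multiget can fetch the whole group.
    """
    prefix = key.split(":", 1)[0]
    digest = hashlib.md5(prefix.encode()).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]
```

This is the same trick as Redis Cluster's hash tags: sacrifice a little key-distribution uniformity to keep related keys co-located.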
Every corner case that could be found in s/w was always the topic of an excited benchmark. There's also the old trick of "hey, let's drop all the matches on the floor in our h/w or FPGA, while getting a huge number of matches in s/w and making the s/w guys look ridiculous."
Every time I read a paper touting a great new speedup on FPGA (over some crap s/w implementation) I'm reminded of that old joke about the Texan visiting Israel and telling the owner of some small farm that "he can get on a tractor and ride for days without reaching the boundary of his property." The Israeli nods sympathetically and says, "Yes, I too used to have a tractor like that."
Excuse my ignorance, but apart from AWS- or Azure-scale operations, why would anyone use Memcached on an FPGA?
* I wouldn't mind if the system were simple plug-and-play and had all the benefits and cost savings without the headache. But technology deployments are very rarely headache- or hassle-free.
Using a single AWS F1 (FPGA) instance, our Memcached accelerator achieves over 11 million ops/sec at less than 300 microsecond latency. Compared to ElastiCache, the AWS-managed CPU Memcached server, our Memcached accelerator offers 9X better throughput, 9X lower latency, and 10X better throughput/$.
We need to batch multiple requests per Ethernet packet to get around packet-per-second rate limiting on AWS. See more details here: https://www.legupcomputing.com/blog/index.php/2018/05/01/dee...
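The batching idea can be sketched as greedy packing: if the bottleneck is packets per second, fitting as many requests as possible under the MTU multiplies throughput by the batch size. The MTU figure and packing scheme below are illustrative assumptions, not the accelerator's actual implementation:

```python
MTU = 1500  # approximate Ethernet payload budget per packet, in bytes

def pack_requests(requests, mtu=MTU):
    """Greedily group serialized requests into per-packet batches.

    With a packet-per-second cap of P and an average of B requests
    per packet, effective throughput becomes P * B instead of P.
    """
    packets, current, size = [], [], 0
    for req in requests:
        if current and size + len(req) > mtu:
            packets.append(b"".join(current))
            current, size = [], 0
        current.append(req)
        size += len(req)
    if current:
        packets.append(b"".join(current))
    return packets

reqs = [b"get key%d\r\n" % i for i in range(500)]
packets = pack_requests(reqs)  # far fewer packets than requests
```

With ~10-byte requests, each packet carries over a hundred of them, which is how a pps-limited link can still sustain millions of ops/sec.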
If anyone is interested, we would love to hear from you; we will be showing off an online demo later this week.
FPGAs are great for processing data at 10Gbps line rate with low latency. They are also good for compute tasks like compression and encryption.
Happy to talk more at kirvy@plunify.com if you are interested. Congrats on getting a seed round from Intel Capital!
The answer is: it depends. Unfortunately, we have not found a "golden" combination of settings yet. If you have a highly congested design, synthesis does help a lot, but not in all cases. There are correlations between the settings, so if A is good and B is good, A+B could still be bad. Seeds belong to a category of techniques that we classify as random. For example, although Xilinx removed the seeds feature from Vivado, we created a technique to trigger randomness in the placement using a Vivado property.
What we do is not new in the sense that settings exploration has always been around. But with cloud compute resources and ML approaches, it really enables timing closure and optimization methods in a cheaper and more disciplined fashion.
We are also very interested in users of OpenCL/HLS/C. The translated RTL is often not as optimized or readable as what an RTL designer would produce working directly in RTL. Our tool (InTime) can give a good boost to the performance of such RTL.
edit: r4.4xlarge as per the link. 16 vCPUs? You should be able to beat it on latency, but beating it on throughput likely means ElastiCache is misconfigured. Or you're putting on way too much set traffic (I think I saw you set the bench to a 1:1 ratio of gets to sets?)
The interesting part is the FPGA could still do much more computation (for example, compression or encryption) while maintaining the same throughput due to hardware pipelining. We described this concept further in the blog post I linked to.
The only reason you can claim 9x latency is that you've saturated the worker threads. You should still win on latency even if it were properly bottlenecked on the network, but 9x throughput and 9x latency is completely false as a capacity limit in this test.
The other issue is that 100 bytes isn't typical. It's common, but almost every user has a varied workload. Deploying FPGAs for the larger cache values ends up being a waste. I even designed a new storage system based on offloading larger cold keys to flash.
We have been told by cloud providers that the FPGA cannot be directly connected due to network security concerns, since there is no easy way to control how the arbitrary hardware programmed by users onto the FPGA will interact with their network. Microsoft has been using FPGAs directly connected to its network (a "bump-in-the-wire" architecture) for the FPGAs in its datacenters (see Project Catapult for details), but these FPGAs are not programmable by Azure users yet.
On our side, maxing the IOPS was the easy part. The hard part was marrying the protocol with converged/deterministic Ethernet with RDMA. We were split between "one request, one frame/packet burst" and "all requests are somehow smartly aligned with frames by stateful logic." The first was surprisingly susceptible to performance artifacts from round-trip latency varying by a few microseconds: packets in transit could get dropped because the receiving NIC (top-tier hardware) was momentarily overloaded.
You have the advantage of being DC-provider independent and can jump the AWS ship whenever you want. Alibaba's solution will be tied to its infrastructure, with its very expensive RDMA-capable network.
Google "Alibaba China F1 or F2."
As for the offering, the idea is that end clients will only have to deal with the SDK and libs on instances, not raw RDMA or anything related to the internal infrastructure. This is as much as I am allowed to say besides the fact of its existence.
From their current experience, and that of other hosting providers, not many people who go with F1, F2, or other FPGA instances actually reap any benefit, and some drop out mid-way. That's why they want to get more people past the "toying with it" stage. The "Herokuification" (god, Heroku sounds beyond hilarious in Russian) is there to let people use common APIs while getting the benefit of FPGA performance, without dealing with things outside the average webdev's area of expertise.