If memcached papers have one thing in common, it's an uncanny ability to get the comparison software to run as slowly as possible. 100k ops/sec/core is what you get with a single client connection doing blocking I/O. Using more clients (as in a normal prod setup) or pipelining queries gets you more like 1M+ ops/sec/core, with writes scaling worse than reads. In production it's easy to get some level of pipelining (multigets, clustered keys, etc.), since you're rarely just fetching a key and then blocking.
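The pipelining described above can be sketched at the protocol level. This is an illustrative snippet, not anyone's production client; the key names and helper functions are made up, and it uses memcached's public ASCII protocol:

```python
# Sketch: the difference between one-blocking-request-at-a-time and
# pipelined/multiget fetches in memcached's ASCII protocol.

def build_pipelined_gets(keys):
    """One 'get' command per key, concatenated into a single buffer.

    Sending this buffer in one write lets the server answer all the
    requests back to back: one network round trip instead of len(keys).
    """
    return b"".join(b"get %s\r\n" % k.encode() for k in keys)

def build_multiget(keys):
    """memcached's native multiget: several keys on one 'get' line."""
    return b"get " + b" ".join(k.encode() for k in keys) + b"\r\n"

keys = ["user:1", "user:2", "user:3"]
pipelined = build_pipelined_gets(keys)  # 3 commands sent in 1 write
multiget = build_multiget(keys)         # 1 command covering 3 keys
```

Either form amortizes the per-syscall and per-round-trip cost over many keys, which is where the 100k-vs-1M gap comes from.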
A much better FPGA paper would show at what scale syscall overhead becomes the bulk of the CPU usage, as well as any measured latency improvements. I think some of the other papers address latency at least.
In reality it hardly matters. If you're hitting memcached hard enough with tiny values for it to matter, ensuring keys are clustered and pipelined is a lot less maintenance overhead than deploying FPGAs.
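Key clustering here means routing related keys to the same server so one multiget can fetch them together. A minimal sketch of the idea, assuming a made-up server list and a "cluster by key prefix" convention (both are illustrative, not from the thread):

```python
import hashlib

# Hypothetical pool of memcached servers.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

def server_for(key):
    """Pick a server by hashing only the prefix before the first ':'.

    Because routing ignores the suffix, all keys sharing a prefix
    (e.g. 'user42:name', 'user42:email') land on the same server,
    so a single pipelined multiget can fetch the whole group.
    """
    prefix = key.split(":", 1)[0]
    digest = hashlib.md5(prefix.encode()).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]
```

This is the same trick as Redis Cluster's hash tags: sacrifice a little key-distribution uniformity to keep related keys co-located.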
Every corner case that could be found in s/w was always the topic of an excited benchmark. There's also the old trick of "hey, let's drop all the matches on the floor in our h/w or FPGA, while getting a huge number of matches in s/w and making the s/w guys look ridiculous."
Every time I read a paper touting a great new speedup on FPGA (over some crap s/w implementation) I'm reminded of that old joke about the Texan visiting Israel and telling the owner of some small farm that "he can get on a tractor and ride for days without reaching the boundary of his property." The Israeli nods sympathetically and says, "Yes, I too used to have a tractor like that."
Excuse my ignorance, but apart from AWS- or Azure-scale operations, why would anyone use Memcached on an FPGA?
* I wouldn't mind if the system were simple plug-and-play and had all the benefits and cost savings without the headache. But technology deployments are very rarely headache- or hassle-free.
Using a single AWS F1 (FPGA) instance, our Memcached accelerator achieves over 11 million ops/sec at less than 300 microsecond latency. Compared to ElastiCache, the AWS-managed CPU Memcached server, our Memcached accelerator offers 9X better throughput, 9X lower latency, and 10X better throughput/$.
We need to batch multiple requests per Ethernet packet to get around packet-per-second rate limiting on AWS. See more details here: https://www.legupcomputing.com/blog/index.php/2018/05/01/dee...
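The batching idea can be sketched as greedy packing: if the bottleneck is packets per second, fitting as many requests as possible under the MTU multiplies throughput by the batch size. The MTU figure and packing scheme below are illustrative assumptions, not the accelerator's actual implementation:

```python
MTU = 1500  # approximate Ethernet payload budget per packet, in bytes

def pack_requests(requests, mtu=MTU):
    """Greedily group serialized requests into per-packet batches.

    With a packet-per-second cap of P and an average of B requests
    per packet, effective throughput becomes P * B instead of P.
    """
    packets, current, size = [], [], 0
    for req in requests:
        if current and size + len(req) > mtu:
            packets.append(b"".join(current))
            current, size = [], 0
        current.append(req)
        size += len(req)
    if current:
        packets.append(b"".join(current))
    return packets

reqs = [b"get key%d\r\n" % i for i in range(500)]
packets = pack_requests(reqs)  # far fewer packets than requests
```

With ~10-byte requests, each packet carries over a hundred of them, which is how a pps-limited link can still sustain millions of ops/sec.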
If anyone is interested, we would love to hear from you; we will be showing off an online demo later this week.
FPGAs are great for processing data at 10Gbps line rate with low latency. They are also good for compute tasks like compression and encryption.
Happy to talk more at kirvy@plunify.com if you are interested. Congrats on getting a seed round from Intel Capital!
The answer is: it depends. Unfortunately, we have not found a "golden" combination of settings yet. If you have a highly congested design, synthesis does help a lot, but not in all cases. There are correlations between the settings, so if A is good and B is good, A+B could still be bad. Seeds belong to a category of techniques that we classify as random. For example, although Xilinx removed the seeds feature from Vivado, we created a technique to trigger randomness in the placement using a Vivado property.
What we do is not new in the sense that settings exploration has always been around. But with cloud compute resources and ML approaches, it really enables timing closure and optimization methods in a cheaper and more disciplined fashion.
We are also very interested in users of OpenCL/HLS/C. The translated RTL is often not as optimized or readable as what an RTL designer would produce working directly in RTL. Our tool (InTime) can give a good boost to the performance of such RTL.
edit: r4.4xlarge as per the link. 16 vCPUs? You should be able to beat it on latency, but beating it on throughput likely means ElastiCache is misconfigured. Or you're putting on way too much set traffic (I think I saw you set the bench to a 1:1 ratio of gets to sets?)
The interesting part is the FPGA could still do much more computation (for example, compression or encryption) while maintaining the same throughput due to hardware pipelining. We described this concept further in the blog post I linked to.
The only reason you can claim 9x latency is that you've saturated the worker threads. You should still win on latency even if it were properly bottlenecked on the network, but 9x throughput and 9x latency is completely false as a capacity limit in this test.
The other issue is that 100 bytes isn't typical. It's common, but almost every user has a varied workload. Deploying FPGAs for the larger cache values ends up being a waste. I even designed a new storage system based on offloading larger cold keys to flash.
We have been told by cloud providers that the FPGA cannot be directly connected due to network security concerns, since there is no easy way to control how the arbitrary hardware programmed by users onto the FPGA will interact with their network. Microsoft has been using FPGAs directly connected to its network (a "bump-in-the-wire" architecture) for the FPGAs in its datacenters (see Project Catapult for details), but these FPGAs are not programmable by Azure users yet.
On our side, maxing the IOPS was the easy part. The hard part was marrying the protocol with converged/deterministic Ethernet with RDMA. We were split between "one request, one frame/packet burst" and "all requests are somehow smartly aligned with frames by stateful logic." The first was surprisingly susceptible to performance artifacts from round-trip latency varying by a few microseconds: packets in transit could get dropped because the receiving NIC (top-tier hardware) was momentarily overloaded.
You have the advantage of being DC-provider independent and can jump the AWS ship whenever you want. Alibaba's solution will be tied to its infrastructure, with its very expensive RDMA-capable network.
Google "Alibaba China F1 or F2."
As for the offering, the idea is that end clients will only have to deal with the SDK and libs on instances, not raw RDMA or anything related to the internal infrastructure. This is as much as I am allowed to say besides the fact of its existence.
From their current experience, and that of other hosting providers, not many people who go with F1, F2, or other FPGA instances actually reap any benefit, and some drop out mid-way. That's why they want to get more people past the "toying with it" stage. The "Herokuification" (god, Heroku sounds beyond hilarious in Russian) is there to let people use common APIs while getting the benefit of FPGA performance, without dealing with things outside the average webdev's area of expertise.