I took a short look at the benchmark setup (https://github.com/talawahtech/seastar/blob/http-performance...), and wonder whether some simplifications there inflate the performance numbers. The server here executes a single read() on the connection, and as soon as it receives any data it sends back the response headers. A real-world HTTP server needs to keep reading until all header and body data is consumed before responding.
Now, given that the benchmark probably sends tiny requests, the server might get everything in a single buffer. But every time it does not, the server will send back two responses to the client - and at that point the client will already have a response for its follow-up request before actually sending it, which inflates the numbers. It might be interesting to re-test with a proper HTTP implementation (at least read until the last 4 bytes received are \r\n\r\n, and assume the benchmark client will never send a body).
Such a bug might also lead to far more write() calls than are actually necessary to serve the workload, or to stalls due to full send or receive buffers - all of which might also have an impact on performance.
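For illustration, a minimal sketch of that "read until \r\n\r\n" loop (a hypothetical helper, not the benchmark's actual code; it assumes the client never sends a body, and only checks the buffer tail, so pipelined requests aren't handled):

```c
#include <string.h>
#include <unistd.h>

/* Read from fd until the buffer ends with the HTTP header terminator.
 * Returns the request length, or -1 on error, EOF, or overflow. */
static ssize_t read_request(int fd, char *buf, size_t cap)
{
    size_t n = 0;
    while (n < cap) {
        ssize_t r = read(fd, buf + n, cap - n);
        if (r <= 0)
            return -1; /* error or connection closed mid-request */
        n += (size_t)r;
        /* done once the data received so far ends with \r\n\r\n */
        if (n >= 4 && memcmp(buf + n - 4, "\r\n\r\n", 4) == 0)
            return (ssize_t)n;
    }
    return -1; /* request larger than buffer */
}
```

This is exactly the "at least" fix suggested above: it never responds early, but it still avoids a full HTTP parser.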
Skipping the parsing of the HTTP requests definitely gives a performance boost, but for this comparison both sides got the same boost, so I didn't mind being less strict. Seastar's HTTP parser was being finicky, so I chose the easy route and just removed it from the equation.
For reference though, in my previous post[2] libreactor was able to hit 1.2M req/s while fully parsing the HTTP requests using picohttpparser[3]. But that is still a very simple and highly optimized implementation. FYI, from what I recall, when I played with disabling HTTP parsing in libreactor, I got a performance boost of about 5%.
1. https://talawah.io/blog/linux-kernel-vs-dpdk-http-performanc...
2. https://talawah.io/blog/extreme-http-performance-tuning-one-...
It's not actually an HTTP server, though... For these purposes, it's essentially no more useful than netcat dumping out a preconfigured text file. Titling it "HTTP Performance showdown" is doubly bad here, since there are no real-world (or even moderately synthetic) HTTP requests happening; you just always get the same static set of data for every request, regardless of what that request is. Call it whatever you like, but that isn't HTTP. A key part of the performance equation on the web is the difference in response time involved in returning different kinds (sizes and types) of responses.
A more compelling argument could be made for the improved performance you can get by bypassing the kernel's networking stack, but this article isn't it. What this article demonstrates is that in this one very narrow case, where you always want to return the same static data, there are vast speed improvements to be had. It doesn't tell you anything useful about performance in basically 100% of real-world uses of the web, and its premise falls down when you consider that kernel interrupt speed is unlikely to be the bottleneck in most servers, even caches.
I'd really love to see this adapted to do actual web-server work and see what the difference is. A good candidate might be an in-memory static cache server of some kind. It would require URL parsing to serve up resources, but it would better emulate an environment that might benefit from this kind of change, and it would certainly be a real-world situation that many companies are familiar with. Like it or not, URL parsing is part of the performance equation when you're talking HTTP.
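Even the minimal version of that URL parsing is cheap, for what it's worth. A hypothetical helper for the cache-server idea (my own sketch, not from the article) that pulls the path out of the request line for use as a lookup key:

```c
#include <stddef.h>
#include <string.h>

/* Extract the request-target from an HTTP/1.1 request line like
 * "GET /foo/bar HTTP/1.1\r\n...". Writes a pointer to the path into
 * *path_out and returns its length, or -1 if the line is malformed. */
static int parse_path(const char *req, size_t len, const char **path_out)
{
    const char *sp1 = memchr(req, ' ', len);   /* space after the method */
    if (!sp1)
        return -1;
    const char *path = sp1 + 1;
    size_t rest = len - (size_t)(path - req);
    const char *sp2 = memchr(path, ' ', rest); /* space before "HTTP/1.1" */
    if (!sp2)
        return -1;
    *path_out = path;
    return (int)(sp2 - path);
}
```

The returned pointer/length pair can feed a hash-table lookup in an in-memory cache without any copying.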
The Data Plane Development Kit (DPDK) is an open source software project managed
by the Linux Foundation. It provides a set of data plane libraries and network
interface controller polling-mode drivers for offloading TCP packet processing
from the operating system kernel to processes running in user space. This offloading
achieves higher computing efficiency and higher packet throughput than is possible
using the interrupt-driven processing provided in the kernel.
https://en.wikipedia.org/wiki/Data_Plane_Development_Kit
https://www.packetcoders.io/what-is-dpdk/

Yeah, receiving packets is fast when you aren't doing anything with them.
Forwarders usually aren't doing much with packets - just reading a few fields and choosing an output port. These are very common functions in network cores (telcos, datacenters, ...).
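For context, a forwarder's inner loop under DPDK is pretty much just this (pseudocode sketch using DPDK's rte_eth_* burst API; initialization, error handling, and the actual header rewrite are elided):

```
/* pseudocode: classic l2fwd-style DPDK poll loop */
struct rte_mbuf *bufs[BURST_SIZE];

for (;;) {
    /* poll the NIC's RX ring directly - no interrupts, no syscalls */
    uint16_t nb_rx = rte_eth_rx_burst(in_port, queue, bufs, BURST_SIZE);

    for (uint16_t i = 0; i < nb_rx; i++) {
        /* "read a few fields, choose an output port" */
        rewrite_headers(bufs[i]);
    }

    uint16_t nb_tx = rte_eth_tx_burst(out_port, queue, bufs, nb_rx);

    /* free any mbufs the TX ring couldn't take */
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(bufs[i]);
}
```

Per-packet work is a handful of memory accesses, which is why the kernel's per-packet overhead dominates in this workload and bypassing it pays off so dramatically.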
DPDK is not well suited for endpoint application programming, even though you can still squeeze out some additional performance there.
But don't dismiss a framework that is currently so widely deployed just because you are not familiar with its most common use-case.
For these types of applications there are proprietary implementations that you can buy from vendors that are more suited to latency sensitive applications.
The next level of optimization after kernel bypass is to build or buy FPGAs which implement the wire protocol + transport as an integrated circuit.
Supercomputing people have done this repeatedly and successfully.
A more general-purpose approach is UDP-based applications.
I wouldn't do this with TCP, which I agree is complicated and difficult to get right.
You can use XDP as well and get zero copy, but only for a small subset of NICs, and you still need to implement your own network stack - I don't see much of a way to avoid it. There are existing TCP stacks for DPDK that you can use as well.
You can also mix both a bit: select where packets go with an eBPF program, and run higher-level stacks in user space, where user space is still dealing with raw packets rather than sockets.
It supports IOPOLL (polling of the socket) and SQPOLL (kernel-side polling of the submission queue), so hopefully the fact that the application driving it is in another thread wouldn't slow it down too much... With multi-shot accept/recv you'd only need to tell it to keep accepting connections on the listener fd, but I'm not sure if you can chain recvs onto the child fd automatically from the kernel yet... We live in interesting times!
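To make the multi-shot idea concrete, here's a pseudocode sketch in the style of liburing's helpers (io_uring_prep_multishot_accept and io_uring_prep_recv_multishot exist in recent liburing/kernels; buffer-ring setup and error handling are elided). As noted, there's no automatic chaining, so the recv has to be armed manually per accepted connection:

```
/* pseudocode: multishot accept + per-connection multishot recv */
struct io_uring ring;
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

/* one SQE keeps producing a CQE per accepted connection until cancelled */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
io_uring_submit(&ring);

for (;;) {
    io_uring_wait_cqe(&ring, &cqe);
    if (is_accept_cqe(cqe)) {
        int conn_fd = cqe->res;
        /* no automatic chaining: arm a multishot recv ourselves.
         * multishot recv needs provided-buffer rings in practice;
         * that setup is elided here */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_recv_multishot(sqe, conn_fd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        io_uring_submit(&ring);
    } else {
        handle_data(cqe);   /* cqe->res = bytes received */
    }
    io_uring_cqe_seen(&ring, cqe);
}
```

With SQPOLL enabled on top of this, the steady state needs no syscalls at all: the kernel thread picks up submissions and the application just reaps completions.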
DPDK is huge and inflexible, it does a lot of things which I'd rather be in control of myself and I think it's easier to just do my own userspace vfio.
If the io_uring instance was configured for polling, by specifying IORING_SETUP_IOPOLL in the call to io_uring_setup(2), then min_complete has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If min_complete is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice.
... Well, TIL -- thanks! and the NAPI patch you pointed at looks interesting too.
> I am genuinely interested in hearing the opinions of more security experts on this (turning off speculative execution mitigations). If this is your area of expertise, feel free to leave a comment
Are these generally safe if you have a machine that does not have multi-user access and is in a security boundary?
If someone exploited a process with a dedicated unprivileged user, had legitimate limited access, or got into a container on a physical host, they might be able to leverage it for the forces of evil.
There’s really no such practical thing as single user Linux. If you’re running a network exposed app without dropping privileges, that’s a much bigger security risk than speculative execution.
Now, if you were skipping an OS and going full bare metal, then that could be different. But an audit for that would be a nightmare :).
Why would a regulatory framework care if a Linux box running one process was vulnerable to attacks that involve switching UIDs?
Conversely, why would that same regulatory framework not care if users of that network service were able to impersonate each other / access each others' data?
I've run production systems with mitigations off. All of the intended (login) users had authorized privilege escalation, so I wasn't worried about them. There was a single primary service daemon per machine, and if you broke into that, you wouldn't get anything really useful by escalating to root from there. And the systems where I made a particularly intentional decision were more or less syscall-limited; enabling mitigations significantly reduced capacity, so mitigations were disabled. (This was in line with guidance from the dedicated security team where I was working.)
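For anyone wanting to check or reproduce this, the usual knobs look something like the following (assuming a GRUB-booted Linux box; adapt to your bootloader):

```
# Check current status first - one file per vulnerability class:
#   grep . /sys/devices/system/cpu/vulnerabilities/*

# /etc/default/grub - add to the kernel command line,
# then run update-grub (or grub2-mkconfig) and reboot:
GRUB_CMDLINE_LINUX="... mitigations=off"
```

`mitigations=off` is the documented umbrella parameter; individual mitigations can also be toggled one at a time if you want to measure which ones cost you the most.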
Generally... it depends on what you mean by generally. In the casual sense of the word, speculative-execution attacks are not very common, so it can be said that most people are mostly safe from them independent of mitigations. Someone might also use "generally safe" to mean proven security against a whole class of attacks, in which case the answer would be no.
Anytime the discussion around turning off mitigations comes up, it's trumped by the "why don't we just leave them on and buy more computers" trump card.
An easier solution than mitigating is to just upgrade your AWS instance size.
Many intentionally run without mitigations.
If you just have a dumb "request/response" service you may not have to worry.
A database is an interesting case. Scylla uses CQL, which is a very limited language compared to something like JavaScript or SQL - there's no way to loop, as far as I know, for example. I would probably recommend not exposing your database directly to an attacker anyway; that seems like a niche scenario.
If you're just providing, say, a gRPC API that takes some arguments, places them (safely) into a Scylla query, and gives you some result, I don't think any of those mitigations are necessary and you'll probably see a really nice win if you disable them.
This is my understanding as a security professional who is not an expert on those attacks, because I frankly don't have the time. I'll defer to someone who has done the work.
Separately, (quoting the article)
> Let's suppose, on the one hand, that you have a multi-user system that relies solely on Linux user permissions and namespaces to establish security boundaries. You should probably leave the mitigations enabled for that system.
Please don't ever rely on Linux user permissions/namespaces for a multi-tenant system. It is not even close to sufficient, with or without those mitigations. It might be OK in situations where you also have strong auth (like ssh with FIDO2 MFA), but if your scenario is "run untrusted code" you can't trust the kernel to do isolation.
> On the other hand, suppose you are running an API server all by itself on a single purpose EC2 instance. Let's also assume that it doesn't run untrusted code, and that the instance uses Nitro Enclaves to protect extra sensitive information. If the instance is the security boundary and the Nitro Enclave provides defense in depth, then does that put mitigations=off back on the table?
Yeah that seems fine.
> Most people don't disable Spectre mitigations, so solutions that work with them enabled are important. I am not 100% sure that all of the mitigation overhead comes from syscalls,
This has been my experience and, based on how the mitigation works, I think that's going to be the case. The mitigations have been pretty brutal for syscalls - though I think the blame should fall on Intel, not on the mitigation that had to be placed on top of their mistake.
Presumably io_uring is the solution, although that has its own security issues... like an entirely new syscall interface with its own bugs, lack of auditing, no ability to seccomp io_uring calls, no meaningful LSM hooks, etc. It'll be a while before I'm comfortable exposing io_uring to untrusted code.
DPDK performs fairly well - even better, for the most part. For some years I maintained a modified ixgbe kernel driver for Intel NICs that let us do high-performance traffic capture. We finally moved to DPDK once it was stable enough and we needed to support more NICs, and in our comparisons we didn't see a performance hit.
Maybe manufacturer-made drivers can be better than DPDK, but if I had to guess, that would not be because of the abstractions but because of knowledge of the NIC architecture and parameters. I remember when we did a PoC of a Mellanox driver modification: a lot of the work to get high performance was understanding the NIC's options and tweaking them to get the most out of our use case.
In any case, the standard software API for RDMA is ibverbs. All adapters supporting RDMA (be it IB, RoCE or iWARP) will expose it. You can get cloud instances with RDMA on AWS and Azure.
Edit: Looks like they do but not for TCP from what I can find: https://static.googleusercontent.com/media/research.google.c...
> I am not 100% sure that all of the mitigation overhead comes from syscalls, but it stands to reason that a lot of it arises from security hardening in user-to-kernel and kernel-to-user transitions.
Will io_uring also be affected by Spectre mitigations, given that it has eliminated most kernel/user switches? And did anyone do a head-to-head comparison between io_uring and DPDK?
Optimizing raw HTTP seems to me like a huge waste of time by now. I say that as someone who has spent years optimizing raw HTTP performance. None of that matters these days.
https://www.adacore.com/papers/layered-formal-verification-o...
Might also give people here some ideas on how to combine symbolic execution, proof, C and SPARK code and how to gain confidence in each part of a network stack.
I think there's even some ongoing work climbing up the stack to HTTP, but I'm not sure of the plan (not involved).