Low Latency Optimization: Understanding Pages (Part 1)

(hudsonrivertrading.com)

118 points · Jumptadel · 3y ago · 57 comments

36 comments · 9 top-level

zozbot234 · 3y ago · 7 in thread

If you're doing truly low latency stuff you shouldn't be swapping at all, everything should be 100% resident in memory at all times. So "pages" are totally irrelevant to you. (You should also probably be using something like the PREEMPT_RT patchset, adjusting scheduling priorities and trying your best to ensure that the CPU core(s) your app is running on aren't burdened by serving interrupts. Plus likely a lot of other stuff that I haven't touched on in this brief comment.)

vgatherps · 3y ago

Stock / near stock Linux is pretty close to fine for HFT.

You basically only interact with the kernel on init/shutdown or outside of the fast path, and do something like isolcpus to delegate the kernel and interrupt handling to some garbage cores and give you the rest to do what you want with.
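To illustrate the core-isolation idea: once cores have been set aside with the `isolcpus=` boot parameter, the process still has to move itself onto one of them. A minimal Python sketch, assuming Linux (`os.sched_setaffinity` is Linux-only); `pin_to_core` is a hypothetical helper name, and real trading code would do the equivalent from C or C++:

```python
import os


def pin_to_core(core: int) -> set[int]:
    """Restrict the caller to a single CPU core (Linux only).

    `pin_to_core` is a hypothetical helper; the core number would normally
    be one of the cores reserved via the isolcpus= boot parameter.
    """
    os.sched_setaffinity(0, {core})   # 0 means "the caller"
    return os.sched_getaffinity(0)    # read back what the kernel accepted


if __name__ == "__main__":
    # Assumption for the demo: reuse a core we are already allowed to run on.
    core = min(os.sched_getaffinity(0))
    print(pin_to_core(core))
```

Pinning only controls where the process runs; keeping interrupts off that core is a separate tuning step that this sketch does not cover.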

anonymoushn · 3y ago

Your comment is correct but might cause readers to underestimate how annoying this tuning work is and how difficult it is to get everything into hugepages (executable memory and stack memory and shared libraries if applicable, not just specific heap allocations). We are trading a joke asset class on joke venues that have millisecond-scale jitter, so we can get away with using io_uring instead of kernel bypass networking.

aldanor · 3y ago

Not really. Most HFTs would choose some sort of kernel bypass for critical path networking needs. (unless that's what you mean by near stock)

bitcharmer · 3y ago

It looks like you don't have a good understanding of how virtual memory works and how in that space the hardware (TLB), the OS (page tables) and higher level software are intertwined.

Also, PREEMPT_RT is the worst option for low latency because it's about execution time guarantees and not speed specifically. If you're on PREEMPT_RT and give your critical thread highest prio, be prepared for some serious OS-level lock-ups.

zozbot234 · 3y ago

> If you're on PREEMPT_RT and give your critical thread highest prio, be prepared for some serious OS-level lock-ups.

PREEMPT_RT includes priority inheritance, specifically to avoid this scenario. So your app should indeed be favored if you tune accordingly. What you also seem to be saying is that using PREEMPT_RT may lead to lower throughput, but that's not the same thing as latency.

anonymoushn · 3y ago

Pages still exist even if you disable swap. Maybe you could benefit from reading TFA?
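To make that concrete: every virtual address is split into a page number, which the MMU translates through the TLB or the page tables, and an offset within that page, whether or not swap exists. A small Python sketch of that split; `split_address` is a made-up helper, and the 2 MiB huge-page size is the common x86-64 value, assumed here rather than queried from the system:

```python
import mmap

BASE_PAGE = mmap.PAGESIZE        # base page size reported by the OS (often 4096)
HUGE_PAGE = 2 * 1024 * 1024      # assumed 2 MiB huge page (typical on x86-64)


def split_address(addr: int, page_size: int) -> tuple[int, int]:
    """Return (virtual page number, offset within the page) for an address."""
    return addr // page_size, addr % page_size


if __name__ == "__main__":
    addr = 0x7F00_1234_5678      # arbitrary example address
    print(split_address(addr, BASE_PAGE))
    print(split_address(addr, HUGE_PAGE))
```

With the larger page size, many more addresses share one page number, which is why a single TLB entry covers more memory.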

WhiteBlueSkies · 3y ago

I'm not entirely sure you understand the memory hierarchy, or that RAM is volatile, so paging has to happen. Keep in mind that reading memory from disk is a LOT slower than reading from main memory.

pixelpoet · 3y ago · 4 in thread

So many puzzling things here, from the brand new user account created to post this (a portmanteau of Jump Trading and Citadel), to the very minimal information presented (even my own article on it for my software covers about as much), to the people in the comments here conflating virtual memory with hard disk paging in spite of TFA, to red herring comments about RT scheduling, ...

brooksbp · 3y ago

Performance is a shiny toy that attracts and distracts many from understanding the fundamentals. Interestingly, understanding fundamentals is a pre-requisite to understanding performance!

deadcanard · 3y ago

I'd argue that understanding what happens on every single memory access qualifies as fundamental.

deadcanard · 3y ago

URL for your article?

pixelpoet · 3y ago

https://www.chaoticafractals.com/manual/getting-started/enab...

Admittedly it's a bit terse, but at least it gives some steps you can use to enable it on Windows. It also benefits other software, such as 7zip. I need to update the page because these days the performance benefits are larger, due to the ever-widening divide between compute and memory speeds, CPUs having bigger large page TLBs, and additional optimisations...

SkipperCat · 3y ago · 4 in thread

I've been working for HFT firms since I moved to NYC over a decade ago. The article looks like a good summation of HugePage benefits (I'm a sysadmin, not a programmer, so I understand it on a topical level only).

What I do find fascinating is that HRT is actively blogging about this stuff. Ten years ago, everyone in the biz was super secretive and never made any public announcement about what we did - even stuff that I would take from HPE and RHEL low latency manuals (which were public knowledge). You never said anything publicly because protecting the "secret ingredients" of the trading system was paramount and any disclosure was one step towards breaking that barrier.

Now, I'm seeing HFT companies post articles like this and I'm thinking it has to be for recruiting. Why else would they do it?

Anyway, as a side note, if you liked this article, you'd also probably like this:

http://hackingnasdaq.blogspot.com/

It was one of my favorite reads because it was written by someone going thru the journey of low latency exploration - before everything was taken over by FPGAs.

Bluecobra · 3y ago

> Now, I'm seeing HFT companies post articles like this and I'm thinking it has to be for recruiting. Why else would they do it?

I think you’re right on the money here. All the secret sauce is in FPGA trading now so there’s nothing secret about sharing this info.

brooksbp · 3y ago

> everyone in the biz was super secretive [..] even stuff that I would take from [..] (which were public knowledge)

It's not just finance; other industries have this culture as well. I suspect it manifests in an environment that is perceived to be hyper competitive: any perceived advantage, regardless of where it came from or how differentiated it is, is held closely and overweighted if proper metrics aren't in place to continue pushing for improvement.

mhh__ · 3y ago

This is ezpz optimization. Everyone and their dog knows about huge pages (or at least anyone I deem worthy).

It's for recruiting, clout, and also generally expressing the culture of the firm.

boshalfoshal · 3y ago

This but unironically - I was at a top HFT firm and admittedly a lot of the SWE talent we try to get ends up going to HRT (comp being equal) because they present themselves as more tech forward, due in part to articles like this. I'm guessing the author of the article wrote this with good intentions of providing some insight into their process, but tech blogs at the end of the day are recruiting tools.

Also agreed that this is a pretty surface level optimization, there's a reason why they are talking about it. If you are doing true HFT with purely software traders, you will probably lose to more serious players using FPGAs, which as OP mentioned isn't exactly new.

WiSaGaN · 3y ago · 3 in thread

Low latency trading (sometimes referred to as HFT as well) focuses a lot on data locality. The goal is to make sure the critical path, from market data coming in to a new order or cancel going out (tick-to-trade), operates in cache as much as possible and avoids main memory access as much as possible. Put all the needed data together in as few cache lines as possible. To make sure that data is in the cache when tick-to-trade runs, cache warmup between orders is sometimes employed too. This keeps those cache lines from being evicted: it involves sending fake orders that won't actually go out but that pull the needed cache lines in before the real orders, so that the caches are hot when they are needed.

trh0awayman · 3y ago

are there any hardware/software systems/setups that can put guarantees around L1/L2/main memory usage? it seems like that would be a major boon rather than just put everything in tiny arrays and hope for the best

bitcharmer · 3y ago

Intel's RDT stack has that in the form of CAT (Cache Allocation Technology) and MBA (Memory Bandwidth Allocation). Some more advanced HFT shops use that extensively

https://github.com/intel/intel-cmt-cat

intelVISA · 3y ago

Custom OS that runs in L3 only (cache as RAM).

bob1029 · 3y ago · 3 in thread

If any of this is interesting to you, but you would like some deeper content to bite into, perhaps start here:

https://lmax-exchange.github.io/disruptor/disruptor.html

This technical paper sent me on a multi-year journey regarding one simple question: "If this stuff is fast enough for fintech, why can't we make everything work this way?" Handling millions of requests per second on 1 thread is well beyond the required performance envelope for most public "webscale" products.

boshalfoshal · 3y ago

The thing is, some tech companies do optimize to this degree (fb, google) - except it only really makes a lot of sense for companies that have

A. Capital (human and money)

B. Bespoke internal tools from the server up

C. Insane scale

Google and Meta put in lots of effort to have fast C++ code for their core infra, and they have several teams contributing to the LLVM project. Even places like Figma optimize to some extent because part of their business alpha is being performant and smooth. When you are at the scale of FB or G, optimizing the small things can lead to massive aggregate gains, and they have the eng talent/time to justify it.

At smaller companies though, as others have mentioned, iteration time and efficient dev spend are paramount. Optimizing for microsecond latency on your B2B SaaS product written in Python/React is most likely not part of your business case, and it is a waste of engineering time and effort to do this when you could be putting that time and money into new and better features. Most of the time, these very niche performance considerations are taken care of to a decent degree with off the shelf tools, maybe with a bit of tuning.

intelVISA · 3y ago

That's like asking why a motorbike isn't a car - they have the same goal: get from A to B but quite different considerations.

Most webscale products aren't written in performant languages and are, instead, optimized around fast feature generation and being easy (cheap) to hire for.

There's a reason laggy Electron apps are the norm $$.

smabie · 3y ago

HFT architecture tends to be grossly inefficient from a throughput perspective: i.e., you'll usually be busy-waiting for new data.

gavinray · 3y ago · 2 in thread

An OS page size is such a prevalent notion in software it's shocking

I was oblivious to this a year ago before I got interested in database internals

Something that I found interesting, there's a recent presentation by Neumann about the Umbra DBMS where he fields a question about hugepages at the end. I recall him saying they don't use it, which I found interesting.

I know Oracle and MySQL recommended Transparent Hugepages IIRC

menaerus · 3y ago

Transparent huge pages (THP) are actually rarely recommended, and I think that advice is quite prevalent in the database systems world, because they can cause unexpected stalls and latency spikes during application runtime. This is possible because THP is an in-kernel implementation of system-wide huge-page support, and it's hard to get much control over the process.

OTOH to make use of "normal" huge-pages, you have to allocate them up front so it's not possible to run into THP type of issues.

That said, I doubt that enabling huge-pages for complex database workloads, that cannot run solely in-memory, will show any noticeable performance improvement. There's a lot of IO and memory R/W involved and I think this is what overshadows the TLB miss cost. What would be interesting, and what I haven't done so far, is to estimate the number of CPU cycles needed for a TLB miss.
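A back-of-envelope way to approach that estimate: on x86-64 with 4-level paging, a TLB miss for a 4 KiB page needs up to four dependent page-table loads, a 2 MiB page needs three, and a 1 GiB page needs two. The Python sketch below just multiplies that count by a per-load latency. The latencies in the demo are placeholders picked for illustration, not measurements, and the model ignores the paging-structure caches that real CPUs use to skip upper levels:

```python
# Page-table levels read on an x86-64 4-level walk, by page size.
WALK_LOADS = {"4KiB": 4, "2MiB": 3, "1GiB": 2}


def walk_cycles(page_size: str, cycles_per_load: int) -> int:
    """Rough cycles for one page walk: dependent loads times load latency.

    cycles_per_load is an assumption supplied by the caller (for example a
    cache hit versus a DRAM access).
    """
    return WALK_LOADS[page_size] * cycles_per_load


if __name__ == "__main__":
    # Placeholder latencies, for illustration only.
    for label, latency in (("page tables in cache", 15), ("page tables in DRAM", 200)):
        for size in WALK_LOADS:
            print(f"{label}, {size}: ~{walk_cycles(size, latency)} cycles")
```

For a real number, hardware counters (for example via perf, as mentioned elsewhere in the thread) would be the thing to measure.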

pca006132 · 3y ago

You need to understand the bottleneck to determine whether or not huge pages are useful. THP requires the kernel to do additional work to give you the huge pages, which is usually more expensive than allocating huge pages through mmap directly.
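Before comparing the two approaches it helps to know which THP mode a machine is actually in. On Linux this is exposed in `/sys/kernel/mm/transparent_hugepage/enabled`, where the active choice is shown in brackets (for example `always [madvise] never`). A minimal Python sketch that reads it; `active_thp_mode` is a hypothetical helper name:

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Extract the bracketed choice from e.g. 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print(active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP is not available on this system")
```

This only reports the system-wide setting; it says nothing about whether a given mapping ended up backed by huge pages.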

up2isomorphism · 3y ago · 2 in thread

So useless.

pixelpoet · 3y ago

I'm with you: zero lines of code presented for the most verbose description on the topic I've seen. No mention of other software that benefits from it, no actual latency graphs (it's somewhat implied by the throughput graph), only one CPU measured, ...

WhiteBlueSkies · 3y ago

Lol, this is the very definition of interesting. They included source code here: https://github.com/hudson-trading/hrtbeat/blob/master/huge_m....

angry_octet · 3y ago · 1 in thread

This article is pretty thin but it's not wrong.

If you're interested in consistent low latency you do need to avoid TLB misses, and also page faults, cache contention, cache coherency delay (making sure no other cores are accessing your memory) from the CC protocol (MOESI/MESI(F)) and mis-prediction, and that's after you have put all your core's threads into SCHED_FIFO. Using https://lttng.org/ can be really helpful in checking what's happening.

deadcanard · 3y ago

Again, I am biased. But the article explains mem translation in fairly simple terms and hammers the main advantages of HPs (better use of the TLB, simpler and smaller PT). It explains clearly how much mem the TLB can cover, what a page walk is and how much time it takes (before even loading actual data), and the importance of the cache wrt PT. It shows some perf numbers of random vs iterative mem accesses.

I don't think you'll find many articles that detail these points. Now, they might be trivial to you and that's totally fair. But the goal is to address a wide audience. Additionally, the article is not addressing how to use HPs but that's for part 2.

Wrt other points, I certainly agree they are important topics to explore. I would add that using perf is super important to easily access the perf counters.
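The "how much mem the TLB can cover" point is simple arithmetic: reach equals number of entries times page size. A short Python sketch, where the entry count of 1536 is an assumed figure for illustration only (actual TLB sizes vary by CPU and by page size):

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


if __name__ == "__main__":
    ENTRIES = 1536   # assumed entry count, not taken from the article
    print(tlb_reach(ENTRIES, 4 * KIB) // MIB, "MiB with 4 KiB pages")
    print(tlb_reach(ENTRIES, 2 * MIB) // MIB, "MiB with 2 MiB pages")
```

The ratio between the two results is always 512 for 4 KiB versus 2 MiB pages, regardless of the entry count chosen.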

mhh__ · 3y ago · 1 in thread

Feels a bit blogspammy. Drepper's article is linked to for good reason

vgatherps · 3y ago

They don’t want to release actually interesting optimization content but need something to fill the tech blog maybe?

Although huge pages are pretty basic table stakes for hft software nowadays, not much alpha left to hide by really going into detail on them?
