For any realizable n (anything that could actually fit in memory), ln(n) is less than 40.
So when amortizing over a large window (say, a block size of 512), for any realizable n the logarithmic amortized costs can be completely dominated by the constant non-amortized costs (here, adding an element to a block)... making the operation effectively constant-time.
where log* is the iterated log function; i.e. the number of times you have to apply log repeatedly until you reach a value of 1 or smaller. For reference, log_2*(10^10000) = 5.
Instead, the author mentioned that:
> The tests that I did run, initially the libstdc++ and the libc++ versions have roughly the same performance, but the performance diverges after a number of insertions somewhere between 6,400 and 12,800. After that the runtime of the libc++ version is roughly 2-3 times that of the libstdc++ version. (...) the outcomes of which depend on the compiler version anyway.
This does not seem right.
It didn't look quite right to me so I didn't post it.
Here's another benchmark that I did where you can clearly see that something has gone wrong: https://imgur.com/a/s2rA8qE
Anyone who programs low-latency C++ knows that the libstdc++ implementation (which is what 99.9% of people use) is great, while the others tend to be less stellar.
It's just a segmented vector. The libstdc++ implementation always allocates one segment even if empty, and while I've seen low-latency guidelines arguing that empty containers shouldn't allocate, my personal guideline is to always allocate a reasonable capacity on construction rather than on first insertion.
A ring buffer is a completely different data structure; it only works for fixed-capacity queues.
Maybe we have different ideas about what constitutes "low-latency", but in HFT std::deque is rarely used. Much like std::unordered_map, which allocates on every insert, potentially costing up to a microsecond each time.
>It's just a segmented vector.
https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USER... we can look at the source. This "segmented vector" is `_Tp** _M_map`, essentially a vector of vectors. That means potentially twice the pointer-chasing, since every access needs two lookups. It also means an allocation/deallocation each time a segment is added/destroyed, which adds up to more allocations than a plain vector (you first need to allocate the segment, then potentially allocate the entries within it).
>A ring buffer is a completely different data structure, it only works for fixed-capacity queues.
Where possible it's better to use a fixed-capacity queue, especially one where the capacity is a power of 2 so the wrap-around calculation can be done with a bitmask instead of a modulo. But the same kind of thing can be done for a dynamic-capacity queue, by resizing the whole backing array when capacity is reached.
Anyway I have to echo his point that I've never found a situation where deque performs better than vector, even when you really might expect it to. I guess the extra allocations are too slow. Maybe it would be more useful with arenas or something.
That works until, probabilistically, there is a decent chance the needed capacity will always be zero.
So does this mean:

- We're talking about different complexities (amortized vs. something else)
- The libc++ implementation isn't standards-conformant
- The analysis in the C++ standard is incorrect
- Something else
There is never a way to guarantee that the physical amount of time is bounded by O(1) in the worst case. You could always have pathological scenarios or data structures written so that performing a move or copy of a single T could require O(n^n) time for all anyone knows.
Citation? Are you sure it isn't 64 size-independent elements?
https://github.com/microsoft/STL/blob/main/stl/inc/deque#L56...
They have made clear that this won't be changed, for ABI stability reasons.
That makes std::deque basically unusable in a portable context. In virtually any situation, "allocate each element separately" and "allocate elements in 4k chunks" are on opposite ends of the performance spectrum.
Small comment: Ideally, big-O notation is for upper bounds. If you are doing lower bounds, you should ideally use big-Omega notation. But Omegas are harder to format in plain text, so it may be better to abuse notation and use big-O...
The libc++ implementation is bad. The libstdc++ implementation is fine. The issue with the former is that it doesn't have enough spare capacity so it has to shift elements around too often.
Actually I think push_front is even worse than stated: O(n). Consider an array with capacity N+1 that contains N elements. If you now alternate push_front and pop_back, every push_front will cause a memmove of N elements.
Oh and to make like a 4th addition to this comment: It's kind of funny that the code is so unreadable that the author found it more helpful to look at the generated assembler code.
Maybe for a good reason I dunno. But it would be nice if the code was clearer so you could make sense of it when gdb or callgrind jumps into an inlined chunk ...
They choose names like _Capitalized and __lowercase because those identifiers are reserved for the implementation. It's a consequence of the preprocessor's lack of hygiene.
So where you might see a convention of naming members like m_thingish_foo, in the implementation's library headers they would be named _M_thingish_foo or __m_thingish_foo.
It reminds me of the hash maps that are said to be amortized O(1) but can be O(n) for some sequences of operations in various languages like Python and JS: https://blog.toit.io/hash-maps-that-dont-hate-you-1a96150b49...