You can beat the binary search (opens in new tab)

(lemire.me)

344 pointsvok13d ago152 comments

152 comments

Daniel Lemire's points about low-level hardware optimization notwithstanding, it's worth pointing out that binary search (or low-level implementation variants) is the best only if you know nothing about the data beyond the fact that it is sorted / monotonic.

If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better. eg: a human searching a physical paper dictionary can zoom into the right bunch of pages faster than pure idealized binary search; it's a separate matter it's hard for humans to continue binary search till the very end and we might default to scanning linearly for the last few iterations (cognitive convenience / affordances of human wetware / etc).

In mathematical language, searching a sorted list is basically inverting a monotonic function, by using a closed-loop control algorithm. Often, we could very well construct a suitable cost function and use gradient descent or its accelerated cousins.

More generally, the best bet to solving a problem more efficiently is always to use more information about the specific problem you want to solve, instead of pulling up the solution for an overly abstract representations. That can offer scalable orders of magnitude speedup compared to constant factor speedups from just using hardware better.

13 more replies

zug_zug10d ago

I'd like to point out that all the graphs only go to 4,000 elements, which is basically non-data. Basically it'd be like measuring which car wins a 1cm race.

For small workloads binary search is slower than just checking every element.

To add to this, I think people can forget how small log(n) is... it can practically be seen as constant (as the log base 2 of ATOMS IN THE UNIVERSE is ~300).

listeria10d ago

This reminds me of two excellent articles[1][2] by Paul Khuong, in which he talks about using size-specialized binary search for power-of-two sized arrays (special-casing the first iteration for other sizes).

He uses conditional moves and defines the number of iterations in advance to ellide the often mispredicted branch, and in the second article goes on to fix cache aliasing issues for large vectors using ternary search.

[1]: https://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates...

[2]: https://pvk.ca/Blog/2012/07/30/binary-search-is-a-pathologic...

drob51811d ago

Isn't "quaternary" just sort of unrolling the binary search loop by one level? I mean, to find the partition in which the item is located, you still do roughly the same rough number of comparisons. You're just taking them 4 at a time, not 2 at a time. Seems like loop unrolling would give you the same.

nkurz11d ago

It's trickier than that. Modern processors are speculative, which means that they guess at the result for a comparison and keep going along one side of a branch as far as they can until they are told they guessed wrong or hit some internal limit. If they guessed wrong, they throw away the speculative work, take a penalty of a handful of cycles, and do the same thing again from a different starting point.

Essentially, this means that all loops are already unrolled from the processors point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is grabbing data from memory (or from cache in the "warm cache" examples) this means that the real game is how to get the processor to issue the requests for the data you will eventually need as far in advance as possible so you don't have to wait as long for it to arrive.

The difference in algorithm for quad search (or anything higher than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction) is that you prefetch all the possible cases but with less depth. This way you are guaranteed to have successfully issued the prefetch you will eventually need, and are spending slightly less of your bandwidth budget on data that will never be used in the actual execution path.

As others are pointing out, "number of comparisons" is almost useless metric when comparing search algorithms if your goal is predicting real world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.

drob51811d ago

Yea, I get that the actual comparison instruction itself is insignificant. It's everything that goes along with it. Seems like quaternary is fetching more data, however.

For instance, if you have 8 elements, 01234567, and you're looking for 1, with binary, you'd fetch 4, 2, and then 1. With quaternary, you'd fetch 2, 4, 6, then 1. Obviously, if you only have 8 elements, you'd just delegate to the SIMD instruction, but if this was a much larger array, you'd be doing more work.

I guess on a modern processor, eliminating the data dependency is worth it because the processor's branch prediction and speculation only follows effectively a single path.

Would be interesting to see this at a machine cycle level on a real processor to understand exactly what is happening.

LoganDark11d ago

It's not about doing more or less work; it's about doing the work faster. For instance, it's relatively common to discover that some recomputation can be faster than caching or lookup tables. Similarly, fetching more from memory also can be faster if it means you make less roundtrips.

1 more reply

wtallis11d ago

Yes, this can be seen as unrolling the loop a bit. It improves performance not by significantly reducing the number of instructions or memory reads, but by relaxing the dependencies between operations so that it doesn't have to be executed purely serially. You could also look at it as akin to speculatively executing both sides of the branch.

mayoff11d ago

Quaternary search effectively performs both of the next loop iteration’s possible comparisons simultaneously with the current iteration’s comparison. This is a little more complex than simple loop unrolling.

Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.

loeg11d ago

Sort of, yes, but you're also removing a data dependency between the unrolled stages.

pfortuny11d ago

It is because processors do not do what one might naively think they do.

lalitmaganti10d ago

I also wrote recently [1] about Exponential Search [2] which is another algorithm if you need to repeatedly binary search in an array where the elements you're searching are themselves are sorted. It allowed for an 8x speedup in our workload!

[1] https://lalitm.com/post/exponential-search/ [2] https://en.wikipedia.org/wiki/Exponential_search

10000truths10d ago

Exponential search is useful when you're querying a REST API that addresses resources with sequential IDs, and need the last ID, but there's no dedicated endpoint for it:

  HEAD /users/1 -> 200 OK
  HEAD /users/2 -> 200 OK
  HEAD /users/4 -> 200 OK
  ...
  HEAD /users/2048 -> 200 OK
  HEAD /users/4096 -> 404 Not Found

And then a binary search between 2048 and 4096 to find the most recent user (and incidentally, the number of users). Great info to have if you're researching competing SaaS companies.

YourDadVPN10d ago

I'm guessing you don't really do this for users since the response for all of them should be 401 on any user that you aren't logged in as? I would argue even for IDs that don't exist, you should get the same error whether they don't exist or you just aren't authorised to see them. It's been a few years since I worked in web but I think that's what I would have done, GitHub does similar for private repos.

drewg12310d ago

One thing that makes me nervous about the increased use of SIMD in new ways is that more processes will be using SIMD. This, in turn, make context switches that much more expensive as the SIMD registers must be saved and restored when switching to a new process that also uses SIMD.

Cthulhu_10d ago

Is this an actual, measurable, major issue or just a gut feeling? Context switches in general are suboptimal but pretty normal.

drkrab10d ago

Since the cpu always accesses a full cache line (64 bytes) at a time, you might as well search the entire cache line (it’s practically free once the data is on-cpu). So I’d like to try a ‘binary’ search that tests all the values in the ‘middle cache line’ and then chooses to go left or right if none match. You can do the cache line search as a single 512bit simd instruction. A cache line is 64 bytes (or 32 16-bit integers); such a search might well be almost 32 times faster than simple binary search; at least it’ll do 32x less memory accesses, which will dominate in most realistic programs.

nly10d ago

Searching the upper cache lines in your binary search tree (sorted vector) for your target is unlikely to yield results. Instead you want to use the extra data in the line to shorten the search, which leads you to a B-Tree or B+tree.

For 4 byte keys and 4 byte child pointers (or indexes in to an array) your inner nodes would have 7 keys, 8 child pointers and 1 next pointer, completely filling a 64 byte cache-line and your tree depth for 1 million entries would go down from ~20 to ~7, the top few levels of which are likely to remain cache resident.

With some thought, it's possible to use SIMD on B-tree nodes to speed up the search within the node, but it's all very data dependent.

aidenn010d ago

Binary searching a sorted array is isomorphic to a sorted binary tree with implicit child pointers.

It seems to me like there should be a sort order that stores the items as a fully-dense left-shifted binary tree from top-to-bottom (e.g. like the implicit heap in an in-place heap sort, but a binary search tree instead of a hea). Is there a name for this? Does it show any performance wins in practice?

Rusky10d ago

There's Eytzinger order: https://algorithmica.org/en/eytzinger

1 more reply

pantsforbirds10d ago

If your data follows a relatively uniform distribution, you can try out interpolation search: https://en.wikipedia.org/wiki/Interpolation_search

taeric11d ago

If you are talking smaller arrays, linear search with a sentinel value at the end is already tough to beat. The thing that sucks about that claim, is that "smaller" is such a nebulous qualifier that it is really hard to internalize.

rao-v11d ago

This is simply not true - if you look at this article’s excellent benchmarking, linear search falls behind somewhere around 200-400 elements.

In general I love this article, it took what I’ve often wondered about and did a perfect job exploring with useful ablation studies.

BeetleB11d ago

For that machine and compiler version, yes.

eggprices11d ago

Except on Apple, where binary search always wins. Does anyone know why?

SuperV123411d ago

That's not what the article is about.

srcreigh11d ago

The algorithm description was a bit confusing for me.

The SIMD part is just in the last step, where it uses SIMD to search the last 16 elements.

The Quad part is that it checks 3 points to create 4 paths, but also it's searching for the right block, not just the right key.

The details are a bit interesting. The author chooses to use the last element in each block for the quad search. I'm curious how the algorithm would change if you used the first element in each block instead, or even an arbitrary element.

codedokode10d ago

I did little experiments with search in small arrays (16-32 items) and binary search is one of the worst methods because it requires lot of branches. The fastest method for small arrays was linear branchless search (you walk over all elements without breaking out of the loop. For example, if you want to know whether the array contains a number, you logically OR the checks for all items). I didn't use SIMD though, but the branches are very expensive for small arrays and simply checking all elements without branching is faster.

YourDadVPN10d ago

How did you do it? Hopefully you generated thousands of such arrays and measured the cost of searching all of them as a single iteration? Because depending on what you use to measure time, the overhead of reading the timer would likely dwarf the cost of searching such a small array. The best case is a dedicated cycle counter instruction/register (e.g. rdtsc) and even that may cost hundreds of cycles. Cache hits cost less than a cycle so if your timing code gated a small number of searches you essentially didn't measure the search at all.

That aside your findings point to the prefetcher having identified your linear searches as sequential access so practically every single access was a cache hit (you effectively measured the latency between a CPU and its L1 cache). If you wanted to test this you could do something like make each element some number of cache lines wide. Stream prefetchers have a maximum stride, so variables more than that many bytes apart won't be prefetched. Google's multichase benchmark uses 256 bytes IIRC.

codedokode9d ago

I often benchmark my DIY data structures, as well as check the assembly on godbolt. I think I wrote the code using ML model (they save lot of time), and then fixed the issues and extended it myself. I have the code here: https://pastebin.com/GzY4HMsZ

I do the search multiple (thousands) times and calculate total time spent, so the timer is called only 2 times. I search across the same array, but for the different number each time. So the array should be completely within L1 cache. It's also important to use the result of the search otherwise the compiler simply omits the code.

I use clock_gettime() which takes on order of 1us. Timestamp counter (which is faster) is not available for me because every CPU core has its own counter and they show different values, so Linux refuses to use it. First core is 600 ms ahead of other cores. How in 2025 we cannot make a boring counter amazes me. Or, maybe this is intentional to prevent using cheaper consumer hardware for professional purposes.

Here are some numbers:

Array size 8, 200 000 iterations, time ns/seach: linear 8 ns, linear no-branch <1 ns (I wonder if it is an error), binary 11 ns, binary no-branch 7 ns.

Array size 128, 100 000 iterations, time: linear 34 ns, linear no-branch 22 ns, binary 35 ns, binary no-branch 16 ns.

As you can see, the branches is what ruins the performance here. Every time the CPU mis-predicts a branch, it wastes several cycles.

Looking through disassembly on godbolt, the compiler inlined and vectorized the measurement loop for no-branch code, which might explain the numbers. The SIMD assembly is hard to read and I don't quite understand what it does: https://godbolt.org/z/dofKnT3W3

I'll be happy to read if you point me at any mistakes in my benchmark. If you want to compile the code, I used gcc with "-O3" and maybe with "-march=native".

YourDadVPN9d ago

Interesting, thanks for sharing. I actually assumed ?: desugared to an if-else and never checked. I was a bit confused by the if-else chains you have e.g. on line 111 where it looks like you could just use the else branch for every case?

You could circumvent the counter issue by pinning to a specific core, otherwise you'd have to use scheduler events (ftrace/perf on Linux) to know which CPU you were running on at specific times and when so you could subtract the right counters (I've had to do this before and it isn't pretty). It would also prevent issues where your task gets preempted and moved to another core and you have to warm the caches again. If you use Linux, options like isolcpus and nohz_full ensure you're not measuring the scheduler/other tasks but have to be set at boot. But if your loops between clock_gettime take substantially more than clock_gettime itself then at least the timer overhead is probably not that important.

As for the branch prediction, I would think most of the branches would be predicted correctly because usually you'll have runs of "not x" before each "is x". Switching between many loops of course hurts the prediction.

My SIMD knowledge only extends as far as which compiler options make my code faster.

No shame using LLMs, I use them extensively, but I find I have to write some code yourself because I make noticeably more mistakes coding if I have let the LLM do everything for a couple of weeks.

1 more reply

LoganDark10d ago

I wonder if this is faster because it makes the prefetcher happy.

codedokode9d ago

I think the problem is with branches misprediction. Binary and linear searches use a lot of unpredictable branches that ruin performance. No-branch versions of search work faster. I wrote the details here: https://news.ycombinator.com/item?id=47983102

alexfoo11d ago

The classical canonical Comp Sci algorithms are effectively "designed" for CPUs with no parallelism (either across multiple cores, via Hyper-threading technology, or "just" SIMD style instructions), and also where all memory accesses take the same amount of time (so no concept of L1/L2/L3/etc caches of varying latencies). And all working on general/random data.

As soon as you move away from either (or both) of these assumptions then there are likely to be many tweaks you can make to get better performance.

What the classical algorithms do offer is a very good starting point for developing a more optimal/efficient solution once you know more about the specific shape of data or quirks/features of a specific CPU.

When you start to get at the pointy end of optimising things then you generally end up looking at how the data is stored and accessed in memory, and whether any changes you can make to improve this don't hurt things further down the line. In a job many many years ago I remember someone who spent way too long optimising a specific part of some code only to find that the overall application ran slower as the optimisations meant that a lot more information needed later on had been evicted from the cache.

(This is probably just another way of stating Rob Pike's 5th rule of programming which was itself a restatement of something by Fred Brooks in _The Mythical Man Month_. Ref: https://www.cs.unc.edu/~stotts/COMP590-059-f24/robsrules.htm...)

jstanley11d ago

As a teenager I spent a weekend thinking that if binary search was good, because it cuts the search space in half at every step, then wouldn't a ternary search be better? Because we'd cut it into thirds at every step.

So instead of just comparing the middle value, we'd compare the one at the 1/3 point, and if that turns out to be too low then we compare the value at the 2/3 point.

Unfortunately although we cut the search space to 2/3 of what it was for binary search at each step (1/3 vs 1/2), we do 3/2 as many comparisons at each step (one comparison 50% of the time, two comparisons the other 50%), so it averages out to equivalence.

EDIT: See zamadatix reply, it's actually 5/3 as many comparisons because 2/3 of the time you have to do 2.

zamadatix11d ago

This ternary approach doesn't actually average 3/2 comparisons per level:

- First third: 1 comparisons

- Second third: 2 comparisons

- Third third: 2 comparisons

(1+2+2)/3 = 5/3 average comparisons. I think the gap starts here at assuming it's 50% of the time because it feels like "either you do 1 comparison or 2" but it's really 33% of the time because "there is 1/3 chance it's in the 1st comparison and 2/3 chance it'll be 2 comparisons".

This lets us show ternary is worse in total average comparisons, just barely: 5/3*Log_3[n] = 1.052... * Log_2[n].

In other words, you end up with fewer levels but doing more comparisons (on average) to get to the end. This is true for all searches of this type (w/ a few general assumptions like the values being searched for are evenly distributed and the cost of the operations is idealized - which is where the main article comes in) where the number of splits is > 2.

jstanley11d ago

Oh yeah!

eggprices11d ago

When you can't seek quickly, e.g. on a disk, you can use a B-tree with say 128-way search. Fetching 128 keys doesn't cost much more than fetching 1 but it saves an additional 7 fetches.

bryanlarsen11d ago

Did you continue by fantasizing about CPU's that contain ternary comparators?

nkurz11d ago

> Unfortunately although we cut the search space to 2/3 of what it was for binary search at each step (1/3 vs 1/2), we do 3/2 as many comparisons at each step (one comparison 50% of the time, two comparisons the other 50%), so it averages out to equivalence.

True, but is there some particular reason that you want to minimize the number of comparisons rather than have a faster run time? Daniel doesn't overly emphasize it, but as he mentions in the article: "The net result might generate a few more instructions but the number of instructions is likely not the limiting factor."

The main thing this article shows is that (at least sometimes on some processors) a quad search is faster than a binary search _despite_ the fact that that it performs theoretically unnecessary comparisons. While some computer scientists might scoff, I'd bet heavily that an optimized ternary search could also frequently outperform.

jstanley11d ago

You normally measure runtime of a sorting algorithm in terms of the number of comparisons it has to do.

Obviously real-world performance depends on other things as well.

Someone11d ago

Not “normally”, but “in computer science” and even then, mostly “in the past” and even then, only “typically” (there are sorting algorithms that make zero comparisons. See for example https://pages.cs.wisc.edu/~paton/readings/Old/fall01/LINEAR-...)

All other people live in the real world, and care about real-world performance, and modern computer scientists know that.

madcaptenor11d ago

Isn't it a bit better on average, although not as much as you'd hoped? For example 19 steps of binary search get you down to 1/524288 of the original search space with 19 comparisons. 12 steps of ternary search get you down to 1/3^12 = 1/531441 of the original search space with, on average, 12 * 3/2 = 18 comparisons.

jstanley11d ago

Maybe! But you can see the other comment that points out I was wrong and it is actually 5/3 comparisons so it still works out worse.

bena11d ago

Imagine if you split the search space N times, no middles. Then you could just compare the value.

ted_dunning10d ago

Seems like you can use the core intuition here of SIMD comparison of multiple elements on more than just the terminal scale.

The outline would be:

a) use a gather to grab multiple elements from 16 evenly spaced locations

b) compare these in parallel using a SIMD instruction

c) focus in on the correct block

d) if the block is small revert to linear search, else repeat the gather/compare cycle

Even though the gather instruction is reading from non-contiguous memory and reading more than you normally would need to with a binary sort, enabling a multiway compare and collapsing 4 levels of binary search should be a win on large tables.

You also may not be able to do a full 16-way comparison for all data types. Searching for float64 will limit you to 8 way comparisons (with AVX-512) but int32, float32 will allow 16 way comparisons.

1 more reply

XCSme10d ago

Is it possible to do some sort of Binary* Search (Binary Star, as in A* star search algorithm, where we use heuristics).

    a: [1,3,5,7,8,9,10,15]  
    x: 8 (query value)

For this array, we would compare a[0], a[3], a[7] (left/mid/right) by subtracting 9.

And we would get d=[-7, -1, 7]

Now, normally, with binary search, because 8 > mid, we would go to (mid+right)/2, BUT we already have some extra information: we see that x is closer to a[3] (diff of 1) than a[7] (diff of 7), so instead of going to the middle between mid and right, we could choose a new "mid" point that's closer to the desired value (maybe as a ratio of (d[right]-d[mid]).

so left=mid+1, right stays the same, but new mid is NOT half of left and right, it is (left+right)/2 + ratioOffset

Where ratioOffset makes the mid go closer to left or right, depending on d.

The idea is quite obvious, so I am pretty sure it already exists.

But what if we use SIMD, with it? So we know not only which block the number is in in, but also, which part of the block the number is likely in. Or is this what the article actually says?

mbowring10d ago

Yeah this is basically interpolation search

XCSme10d ago

Oh, that's what the article was referring to with "interpolation".

Weird that I didn't hear about it before, it's not that used in practice?

One reason I could see is that binary search is fast enough and easy to implement. Even on largest datasets it's still just a few tens of loop iterations.

gobdovan11d ago

I thought this would be about how you can beat binary search in the 'Guess Who?' game. There's a cool math paper about it [0] and an approachable video by the author. [1]

[0] https://arxiv.org/abs/1509.03327

[1] https://www.youtube.com/watch?v=_3RNB8eOSx0

1 more reply

bediger400013d ago

The (AI generated?) image on this article is absolutely not helpful, and I think it's wrong based on how I read the article. Better not to have an image at all.

iosovi11d ago

Agreed, it threw me off at first but the rest of the article was quite nice.

aidenn011d ago

If you are storing 16-bit integers, wouldn't an 8kB bitmap be even faster?

Findecanor11d ago

The range is 1..4096, so 4096 bits = 512 byte bitmap would suffice.

That is, if you're only ever going to test for membership in the set. If you need metadata then ... You could store that in a packed array and use a population count of the bit-vector before the lookup bit as index into it. For each word of bits, store the accumulated population count of the words before it to speed up lookup. Modern CPU's are memory-bound so I don't think SIMD would help much over using 64-bit words. For 4096 bits / 64, that would be 64 additional bytes.

BeetleB11d ago

Some of the plots would have been much more helpful if instead of absolute value in seconds, the y-axis were the multiplier w.r.t binary search (and eyeballing suggests a relatively constant multiplier).

Obviously, this isn't changing the big-Oh complexity, but in the "real world", still nice to see a 2-4x speedup.

quirino11d ago

On optimizing binary search: https://en.algorithmica.org/hpc/data-structures/binary-searc...

garaetjjte11d ago

I once did have a need for binary search in memory mapped files and I experimented with Eytzinger layout (which I learned from https://bannalia.blogspot.com/2015/06/cache-friendly-binary-...). It turned out that it was slower than plain binary search, I think because keys I was looking up were often clumped together thus it played quite well with cache anyway.

senfiaj11d ago

The title is slightly misleading, I mean yes, naive binary search might have larger constant but the algorithm is still O(log(n)). This is still some "divide and conquer" style algorithm just with bunch of CPU specific optimizations. Also this works well with simple data structures, like integers, with more complex objects (custom comparators) it matters less.

pfortuny10d ago

The complexity of binary search in terms of "search" (comparison) operations is exactly log_2(n)+1, not just O(n). This algorithm just uses modern and current processor architecture artifacts to "improve" it on arrays of up to 4096 elements.

So not exactly "n" as in O(n).

Also: only for 16-bit integers.

senfiaj10d ago

> The complexity of binary search in terms of "search" (comparison) operations is exactly log_2(n)+1, not just O(n)

> So not exactly "n" as in O(n).

For large enough inputs the algorithm with better Big O complexity will eventually win (at least in the worst cases). Yes, sometimes it never happens in practice when the constants are too large. But say 100 * n * log(n) will eventually beat 5 * n for large enough n. Some advanced algorithms can use algorithms with worse Big O complexity but smaller constants for small enough sub-problems to improve performance. But it's more like to optimization detail rather than a completely different algorithm.

> This algorithm just uses modern and current processor architecture artifacts to "improve" it on arrays of up to 4096

Yes, that's my point. It's basically "I made binary search for integers X times faster on some specific CPUs". "Beating binary search" is somewhat misleading, it's more like "microptimizing binary search".

cubefox11d ago

> The title is slightly misleading, I mean yes, naive binary search might have larger constant but the algorithm is still O(log(n)).

I think the title is not misleading since the Big O notation is only supposed to give a rough estimate of the performance of an algorithm.

(I agree though that binary search is already extremely fast, so making something twice as fast won't move the needle for the vast majority of applications where the speed bottleneck is elsewhere. Even infinite speed, i.e. instant sorted search, would likely not be noticeable for most software.)

senfiaj10d ago

For me it's slightly misleading because it's almost like saying "I wrote a faster quicksort implementation, so it beats quicksort!". In this case the binary search fundamental idea of "divide and conquer" is still there, the article just does microptimizations (which seem to be not very portable and are less relevant/applicible for more complex data structures) in order to reduce the constant part.

Yes, algorithmic complexity is theoretical, it often ignores the real world constants, but they are usually useful when comparing algorithms for larger inputs, unless we are talking about "galactic algorithms" with insanely large constants.

attractivechaos10d ago

See also: Static search trees: 40x faster than binary search

- https://curiouscoding.nl/slides/p99-text/

- https://curiouscoding.nl/posts/static-search-tree/

- https://news.ycombinator.com/item?id=42562847 (656 points; 232 comments)

layer810d ago

…for 16-bit integers, and it’s still a binary search with the same asymptotic complexity, just a constant-factor speedup.

wood_spirit11d ago

A beautiful algorithm.

Would there be any value in using simd to check the whole cache line that you fetch for exact matches on the narrowing phase for an early out?

gobdovan11d ago

I remember I had a pedagogy class in Uni taught by psychology faculty, and was messing with them by proposing a mock syllabus where we'd teach students binary search, then the advanced advanced ones ternary search, and the very advanced, Quaternary, with a big Q, as in the geological period. Jokes on me now, I suppose.

dicroce10d ago

If you know about the distribution of keys you can do even better by factoring that knowledge into where you split.

kardos11d ago

So is the SIMD the magic piece here, or is it the interpolation search? If the data is evenly distributed, that is pretty optimal for the interpolation search..

1 more reply

vjay1510d ago

Surprisingly simple concept!

vasco10d ago

This was the entry level project we did in a hardware optimization course I took maybe 15 years ago, using SIMD instructions. Lots of things can be naively optimized by unrolling any loops like this. Compilers do some of this themselves.

teo_zero10d ago

TL;DR the author developed an algorithm to solve this specific problem:

> The popular Roaring Bitmap format uses arrays of 16-bit integers of size ranging from 1 to 4096. We sometimes have to check whether a value is present.

There's no claim that this algorithm is universal and performs equally well for other problems.

In fact, note how the compare operation on the data types involved (16-bit integers) is quite cheap for modern CPUs. A similar problem with strings instead of integers would get no benefits from the author's ideas and would actually fare worse, due to useless comparisons per cycle.

amelius10d ago

I always wondered if we could get any faster than O(log n). Glad we're making progress!

owlcompliance11d ago

What about non-binary search?

benmmurphy10d ago

you know things are bad when lemire.me feels he needs to post an AI slop image :(

jonfe-darontos11d ago

And here I thought this was going to be related to quaternions

nullc10d ago

You can improve interpolated search by monitoring progress and if it's not converging fast enough, alternate with bisection steps. (and, as clear from the article, switch to linear/vector scanning when the range is small emough).

Often when an interpolated search is wrong the interpolation will tend to nail you against one side or the other of the range-- so the worst case is linear. By allowing only a finite number of failed probes (meaning they only move the same boundary as last time, an optimally working search will on average alternate hi/lo) you can maintain the log guarantee of bisection.

peter_d_sherman10d ago

>"Virtually all processors today have data parallel instructions (sometimes called SIMD) that can check several values at once.

[...]

The binary search checks one value at a time. However, recent processors can load and check more than one value at once. They have excellent memory-level parllelism. This suggest that instead of a binary search, we might want to try a quaternary search..."

First of all, brilliant observations! (Overall, a great article too!)

Yes, today's processors indeed have a parallelism which was unconceived of at the time the original Mathematicians, then-to-be Computer Scientists, conceived of Binary Search...

Now I myself wonder if these ideas might be extended to GPU's, that is, if the massively parallel execution capability of GPU's could be extended to search for data like Binary Search does, and what such an appropriately parallelized algorithm/data structure would look like... keep in mind, if we consider an updateable data structure, then that means that parts of it may need to be appropriately locked at the same time that multiple searches and updates are occurring simultaneously... so what data structure/algorithm would be the most efficient for a massively parallel scenario like that?

Anyway, great article and brilliant observations!

gowld10d ago

40x Faster Binary Search - This talk will first expose the lie that binary search takes O(lg n) time — it very much does not! Instead, we will see that binary search has only constant overhead compared to an oracle. Then, we will exploit everything that modern CPUs have to offer (SIMD, ILP, prefetching, efficient caching) in order to gain 40x increased throughput over the Rust standard library implementation.

aamargulies10d ago

Here's my version with a key spline improvement. I should really write this up...

#include <stdbool.h> #include <stdint.h> #include <arm_neon.h>

/* Author: aam@fastmail.fm * * Apple M4 Max (P-core) variant of simd_quad which uses a key spline * to great effect (blog post summary incoming!) / bool simd_quad_m4(const uint16_t carr, int32_t cardinality, uint16_t pos) { enum { gap = 64 };

    if (cardinality < gap) {
        if (cardinality >= 32) {
            // 32 <= n < 64: NEON-compare the first 32 as a single x4 load,
            // sweep the remainder.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x4_t v = vld1q_u16_x4(carr);
            uint16x8_t hit = vorrq_u16(
                vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
                vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 32; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 16) {
            // 16 <= n < 32: paired x2 load + sweep tail.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x2_t v = vld1q_u16_x2(carr);
            uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
                                       vceqq_u16(v.val[1], needle));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 16; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 8) {
            // 8 <= n < 16: single 128-bit compare + sweep tail.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8_t v = vld1q_u16(carr);
            if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
            for (int32_t j = 8; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        for (int32_t j = 0; j < cardinality; j++) {
            uint16_t v = carr[j];
            if (v >= pos) return v == pos;
        }
        return false;
    }

    int32_t num_blocks = cardinality / gap;
    int32_t base = 0;
    int32_t n = num_blocks;

    while (n > 3) {
        int32_t quarter = n >> 2;
        int32_t k1 = carr[(base + quarter + 1) * gap - 1];
        int32_t k2 = carr[(base + 2 * quarter + 1) * gap - 1];
        int32_t k3 = carr[(base + 3 * quarter + 1) * gap - 1];
        int32_t c1 = (k1 < pos);
        int32_t c2 = (k2 < pos);
        int32_t c3 = (k3 < pos);
        base += (c1 + c2 + c3) * quarter;
        n -= 3 * quarter;
    }
    while (n > 1) {
        int32_t half = n >> 1;
        base = (carr[(base + half + 1) * gap - 1] < pos) ? base + half : base;
        n -= half;
    }
    int32_t lo = (carr[(base + 1) * gap - 1] < pos) ? base + 1 : base;

    if (lo < num_blocks) {
        const uint16_t *blk = carr + lo * gap;
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8x4_t a = vld1q_u16_x4(blk);
        uint16x8x4_t b = vld1q_u16_x4(blk + 32);
        uint16x8_t h0 = vorrq_u16(
            vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
            vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
        uint16x8_t h1 = vorrq_u16(
            vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
            vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
        return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
    }

    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return v == pos;
    }
    return false;

}

/* * Spine variant, M4 edition. * * pack the interpolation probe keys into a dense contiguous region so the * cold-cache pointer chase streams through consecutive cache lines: * * n=4096 -> 64 spine keys -> 128 B = 1 M4 cache line * n=2048 -> 32 spine keys -> 64 B = half a line * n=1024 -> 16 spine keys -> 32 B * * The entire interpolation phase for a max-sized Roaring container now * lives in one cache line. The final SIMD block check still loads from * carr. * * The num_blocks <= 3 fallback: * with very few blocks the carr-based probes accidentally prime the final * block's lines, which the spine path disrupts. / bool simd_quad_m4_spine(const uint16_t carr, const uint16_t spine, int32_t cardinality, uint16_t pos) { enum { gap = 64 };

    if (cardinality < gap) {
        // Same fast paths as simd_quad_m4 -- spine is irrelevant here.
        if (cardinality >= 32) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x4_t v = vld1q_u16_x4(carr);
            uint16x8_t hit = vorrq_u16(
                vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
                vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 32; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 16) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x2_t v = vld1q_u16_x2(carr);
            uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
                                       vceqq_u16(v.val[1], needle));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 16; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 8) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8_t v = vld1q_u16(carr);
            if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
            for (int32_t j = 8; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        for (int32_t j = 0; j < cardinality; j++) {
            uint16_t v = carr[j];
            if (v >= pos) return v == pos;
        }
        return false;
    }

    int32_t num_blocks = cardinality / gap;

    if (num_blocks <= 3) {
        return simd_quad_m4(carr, cardinality, pos);
    }

    int32_t base = 0;
    int32_t n = num_blocks;

    // Pull the whole spine into L1 up front. For n in [256, 4096] this is
    // 1 line (128 B); for smaller n it is a partial line. Cheap on cold.
    __builtin_prefetch(spine);

    while (n > 3) {
        int32_t quarter = n >> 2;
        int32_t k1 = spine[base + quarter];
        int32_t k2 = spine[base + 2 * quarter];
        int32_t k3 = spine[base + 3 * quarter];
        int32_t c1 = (k1 < pos);
        int32_t c2 = (k2 < pos);
        int32_t c3 = (k3 < pos);
        base += (c1 + c2 + c3) * quarter;
        n -= 3 * quarter;
    }
    while (n > 1) {
        int32_t half = n >> 1;
        base = (spine[base + half] < pos) ? base + half : base;
        n -= half;
    }
    int32_t lo = (spine[base] < pos) ? base + 1 : base;

    if (lo < num_blocks) {
        const uint16_t *blk = carr + lo * gap;
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8x4_t a = vld1q_u16_x4(blk);
        uint16x8x4_t b = vld1q_u16_x4(blk + 32);
        uint16x8_t h0 = vorrq_u16(
            vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
            vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
        uint16x8_t h1 = vorrq_u16(
            vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
            vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
        return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
    }

    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return v == pos;
    }
    return false;

}
// Build the spine for a given carr. Caller allocates cardinality/64 u16s. void simd_quad_m4_build_spine(const uint16_t
carr, int32_t cardinality, uint16_t spine) { enum { gap = 64 }; int32_t num_blocks = cardinality / gap; for (int32_t i = 0; i < num_blocks; i++) { spine[i] = carr[(i + 1) gap - 1]; } }

m3kw911d ago

Will I get a job if i say i can beat binary search?

1 more reply

cubefox11d ago

Since binary search is already very fast with its O(log n) time complexity: are there any real world applications which could practically benefit from this improvement?

2 more replies

j / k navigate · click thread line to collapse

152 comments

ssivark10d ago

13 more replies

zug_zug10d ago

I'd like to point out that all the graphs only go to 4,000 elements, which is basically non-data. Basically it'd be like measuring which car wins a 1cm race.

For small workloads binary search is slower than just checking every element.

To add to this, I think people can forget how small log(n) is... it can practically be seen as constant (as the log base 2 of ATOMS IN THE UNIVERSE is ~300).

listeria10d ago

[1]: https://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates...

[2]: https://pvk.ca/Blog/2012/07/30/binary-search-is-a-pathologic...

drob51811d ago

nkurz11d ago

drob51811d ago

Yea, I get that the actual comparison instruction itself is insignificant. It's everything that goes along with it. Seems like quaternary is fetching more data, however.

I guess on a modern processor, eliminating the data dependency is worth it because the processor's branch prediction and speculation only follows effectively a single path.

Would be interesting to see this at a machine cycle level on a real processor to understand exactly what is happening.

LoganDark11d ago

1 more reply

wtallis11d ago

mayoff11d ago

Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.

loeg11d ago

Sort of, yes, but you're also removing a data dependency between the unrolled stages.

pfortuny11d ago

It is because processors do not do what one might naively think they do.

lalitmaganti10d ago

[1] https://lalitm.com/post/exponential-search/ [2] https://en.wikipedia.org/wiki/Exponential_search

10000truths10d ago

Exponential search is useful when you're querying a REST API that addresses resources with sequential IDs, and need the last ID, but there's no dedicated endpoint for it:

  HEAD /users/1 -> 200 OK
  HEAD /users/2 -> 200 OK
  HEAD /users/4 -> 200 OK
  ...
  HEAD /users/2048 -> 200 OK
  HEAD /users/4096 -> 404 Not Found

And then a binary search between 2048 and 4096 to find the most recent user (and incidentally, the number of users). Great info to have if you're researching competing SaaS companies.

YourDadVPN10d ago

drewg12310d ago

Cthulhu_10d ago

Is this an actual, measurable, major issue or just a gut feeling? Context switches in general are suboptimal but pretty normal.

drkrab10d ago

nly10d ago

With some thought, it's possible to use SIMD on B-tree nodes to speed up the search within the node, but it's all very data dependent.

aidenn010d ago

Binary searching a sorted array is isomorphic to a sorted binary tree with implicit child pointers.

Rusky10d ago

There's Eytzinger order: https://algorithmica.org/en/eytzinger

1 more reply

pantsforbirds10d ago

If your data follows a relatively uniform distribution, you can try out interpolation search: https://en.wikipedia.org/wiki/Interpolation_search

taeric11d ago

rao-v11d ago

This is simply not true - if you look at this article’s excellent benchmarking, linear search falls behind somewhere around 200-400 elements.

In general I love this article, it took what I’ve often wondered about and did a perfect job exploring with useful ablation studies.

BeetleB11d ago

For that machine and compiler version, yes.

eggprices11d ago

Except on Apple, where binary search always wins. Does anyone know why?

SuperV123411d ago

That's not what the article is about.

srcreigh11d ago

The algorithm description was a bit confusing for me.

The SIMD part is just in the last step, where it uses SIMD to search the last 16 elements.

The Quad part is that it checks 3 points to create 4 paths, but also it's searching for the right block, not just the right key.

codedokode10d ago

YourDadVPN10d ago

codedokode9d ago

Here are some numbers:

Array size 8, 200 000 iterations, time ns/seach: linear 8 ns, linear no-branch <1 ns (I wonder if it is an error), binary 11 ns, binary no-branch 7 ns.

Array size 128, 100 000 iterations, time: linear 34 ns, linear no-branch 22 ns, binary 35 ns, binary no-branch 16 ns.

As you can see, the branches is what ruins the performance here. Every time the CPU mis-predicts a branch, it wastes several cycles.

I'll be happy to read if you point me at any mistakes in my benchmark. If you want to compile the code, I used gcc with "-O3" and maybe with "-march=native".

YourDadVPN9d ago

My SIMD knowledge only extends as far as which compiler options make my code faster.

No shame using LLMs, I use them extensively, but I find I have to write some code yourself because I make noticeably more mistakes coding if I have let the LLM do everything for a couple of weeks.

1 more reply

LoganDark10d ago

I wonder if this is faster because it makes the prefetcher happy.

codedokode9d ago

alexfoo11d ago

As soon as you move away from either (or both) of these assumptions then there are likely to be many tweaks you can make to get better performance.

jstanley11d ago

So instead of just comparing the middle value, we'd compare the one at the 1/3 point, and if that turns out to be too low then we compare the value at the 2/3 point.

EDIT: See zamadatix reply, it's actually 5/3 as many comparisons because 2/3 of the time you have to do 2.

zamadatix11d ago

This ternary approach doesn't actually average 3/2 comparisons per level:

- First third: 1 comparisons

- Second third: 2 comparisons

- Third third: 2 comparisons

This lets us show ternary is worse in total average comparisons, just barely: 5/3*Log_3[n] = 1.052... * Log_2[n].

jstanley11d ago

Oh yeah!

eggprices11d ago

When you can't seek quickly, e.g. on a disk, you can use a B-tree with say 128-way search. Fetching 128 keys doesn't cost much more than fetching 1 but it saves an additional 7 fetches.

bryanlarsen11d ago

Did you continue by fantasizing about CPU's that contain ternary comparators?

nkurz11d ago

jstanley11d ago

You normally measure runtime of a sorting algorithm in terms of the number of comparisons it has to do.

Obviously real-world performance depends on other things as well.

Someone11d ago

All other people live in the real world, and care about real-world performance, and modern computer scientists know that.

madcaptenor11d ago

jstanley11d ago

Maybe! But you can see the other comment that points out I was wrong and it is actually 5/3 comparisons so it still works out worse.

bena11d ago

Imagine if you split the search space N times, no middles. Then you could just compare the value.

ted_dunning10d ago

Seems like you can use the core intuition here of SIMD comparison of multiple elements on more than just the terminal scale.

The outline would be:

a) use a gather to grab multiple elements from 16 evenly spaced locations

b) compare these in parallel using a SIMD instruction

c) focus in on the correct block

d) if the block is small revert to linear search, else repeat the gather/compare cycle

You also may not be able to do a full 16-way comparison for all data types. Searching for float64 will limit you to 8 way comparisons (with AVX-512) but int32, float32 will allow 16 way comparisons.

1 more reply

XCSme10d ago

Is it possible to do some sort of Binary* Search (Binary Star, as in A* star search algorithm, where we use heuristics).

    a: [1,3,5,7,8,9,10,15]  
    x: 8 (query value)

For this array, we would compare a[0], a[3], a[7] (left/mid/right) by subtracting 9.

And we would get d=[-7, -1, 7]

so left=mid+1, right stays the same, but new mid is NOT half of left and right, it is (left+right)/2 + ratioOffset

Where ratioOffset makes the mid go closer to left or right, depending on d.

The idea is quite obvious, so I am pretty sure it already exists.

But what if we use SIMD, with it? So we know not only which block the number is in in, but also, which part of the block the number is likely in. Or is this what the article actually says?

mbowring10d ago

Yeah this is basically interpolation search

XCSme10d ago

Oh, that's what the article was referring to with "interpolation".

Weird that I didn't hear about it before, it's not that used in practice?

One reason I could see is that binary search is fast enough and easy to implement. Even on largest datasets it's still just a few tens of loop iterations.

gobdovan11d ago

I thought this would be about how you can beat binary search in the 'Guess Who?' game. There's a cool math paper about it [0] and an approachable video by the author. [1]

[0] https://arxiv.org/abs/1509.03327

[1] https://www.youtube.com/watch?v=_3RNB8eOSx0

1 more reply

bediger400013d ago

The (AI generated?) image on this article is absolutely not helpful, and I think it's wrong based on how I read the article. Better not to have an image at all.

iosovi11d ago

Agreed, it threw me off at first but the rest of the article was quite nice.

aidenn011d ago

If you are storing 16-bit integers, wouldn't an 8kB bitmap be even faster?

Findecanor11d ago

The range is 1..4096, so 4096 bits = 512 byte bitmap would suffice.

BeetleB11d ago

Obviously, this isn't changing the big-Oh complexity, but in the "real world", still nice to see a 2-4x speedup.

quirino11d ago

On optimizing binary search: https://en.algorithmica.org/hpc/data-structures/binary-searc...

garaetjjte11d ago

senfiaj11d ago

pfortuny10d ago

So not exactly "n" as in O(n).

Also: only for 16-bit integers.

senfiaj10d ago

> The complexity of binary search in terms of "search" (comparison) operations is exactly log_2(n)+1, not just O(n)

> So not exactly "n" as in O(n).

> This algorithm just uses modern and current processor architecture artifacts to "improve" it on arrays of up to 4096

cubefox11d ago

> The title is slightly misleading, I mean yes, naive binary search might have larger constant but the algorithm is still O(log(n)).

I think the title is not misleading since the Big O notation is only supposed to give a rough estimate of the performance of an algorithm.

senfiaj10d ago

attractivechaos10d ago

See also: Static search trees: 40x faster than binary search

- https://curiouscoding.nl/slides/p99-text/

- https://curiouscoding.nl/posts/static-search-tree/

- https://news.ycombinator.com/item?id=42562847 (656 points; 232 comments)

layer810d ago

…for 16-bit integers, and it’s still a binary search with the same asymptotic complexity, just a constant-factor speedup.

wood_spirit11d ago

A beautiful algorithm.

Would there be any value in using simd to check the whole cache line that you fetch for exact matches on the narrowing phase for an early out?

gobdovan11d ago

dicroce10d ago

If you know about the distribution of keys you can do even better by factoring that knowledge into where you split.

kardos11d ago

So is the SIMD the magic piece here, or is it the interpolation search? If the data is evenly distributed, that is pretty optimal for the interpolation search..

1 more reply

vjay1510d ago

Surprisingly simple concept!

vasco10d ago

teo_zero10d ago

TL;DR the author developed an algorithm to solve this specific problem:

> The popular Roaring Bitmap format uses arrays of 16-bit integers of size ranging from 1 to 4096. We sometimes have to check whether a value is present.

There's no claim that this algorithm is universal and performs equally well for other problems.

amelius10d ago

I always wondered if we could get any faster than O(log n). Glad we're making progress!

owlcompliance11d ago

What about non-binary search?

benmmurphy10d ago

you know things are bad when lemire.me feels he needs to post an AI slop image :(

jonfe-darontos11d ago

And here I thought this was going to be related to quaternions

nullc10d ago

peter_d_sherman10d ago

>"Virtually all processors today have data parallel instructions (sometimes called SIMD) that can check several values at once.

[...]

First of all, brilliant observations! (Overall, a great article too!)

Yes, today's processors indeed have a parallelism which was unconceived of at the time the original Mathematicians, then-to-be Computer Scientists, conceived of Binary Search...

Anyway, great article and brilliant observations!

gowld10d ago

aamargulies10d ago

Here's my version with a key spline improvement. I should really write this up...

#include <stdbool.h> #include <stdint.h> #include <arm_neon.h>

    if (cardinality < gap) {
        if (cardinality >= 32) {
            // 32 <= n < 64: NEON-compare the first 32 as a single x4 load,
            // sweep the remainder.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x4_t v = vld1q_u16_x4(carr);
            uint16x8_t hit = vorrq_u16(
                vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
                vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 32; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 16) {
            // 16 <= n < 32: paired x2 load + sweep tail.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x2_t v = vld1q_u16_x2(carr);
            uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
                                       vceqq_u16(v.val[1], needle));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 16; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 8) {
            // 8 <= n < 16: single 128-bit compare + sweep tail.
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8_t v = vld1q_u16(carr);
            if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
            for (int32_t j = 8; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        for (int32_t j = 0; j < cardinality; j++) {
            uint16_t v = carr[j];
            if (v >= pos) return v == pos;
        }
        return false;
    }

    int32_t num_blocks = cardinality / gap;
    int32_t base = 0;
    int32_t n = num_blocks;

    while (n > 3) {
        int32_t quarter = n >> 2;
        int32_t k1 = carr[(base + quarter + 1) * gap - 1];
        int32_t k2 = carr[(base + 2 * quarter + 1) * gap - 1];
        int32_t k3 = carr[(base + 3 * quarter + 1) * gap - 1];
        int32_t c1 = (k1 < pos);
        int32_t c2 = (k2 < pos);
        int32_t c3 = (k3 < pos);
        base += (c1 + c2 + c3) * quarter;
        n -= 3 * quarter;
    }
    while (n > 1) {
        int32_t half = n >> 1;
        base = (carr[(base + half + 1) * gap - 1] < pos) ? base + half : base;
        n -= half;
    }
    int32_t lo = (carr[(base + 1) * gap - 1] < pos) ? base + 1 : base;

    if (lo < num_blocks) {
        const uint16_t *blk = carr + lo * gap;
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8x4_t a = vld1q_u16_x4(blk);
        uint16x8x4_t b = vld1q_u16_x4(blk + 32);
        uint16x8_t h0 = vorrq_u16(
            vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
            vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
        uint16x8_t h1 = vorrq_u16(
            vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
            vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
        return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
    }

    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return v == pos;
    }
    return false;

}

    if (cardinality < gap) {
        // Same fast paths as simd_quad_m4 -- spine is irrelevant here.
        if (cardinality >= 32) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x4_t v = vld1q_u16_x4(carr);
            uint16x8_t hit = vorrq_u16(
                vorrq_u16(vceqq_u16(v.val[0], needle), vceqq_u16(v.val[1], needle)),
                vorrq_u16(vceqq_u16(v.val[2], needle), vceqq_u16(v.val[3], needle)));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 32; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 16) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8x2_t v = vld1q_u16_x2(carr);
            uint16x8_t hit = vorrq_u16(vceqq_u16(v.val[0], needle),
                                       vceqq_u16(v.val[1], needle));
            if (vmaxvq_u16(hit) != 0) return true;
            for (int32_t j = 16; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        if (cardinality >= 8) {
            uint16x8_t needle = vdupq_n_u16(pos);
            uint16x8_t v = vld1q_u16(carr);
            if (vmaxvq_u16(vceqq_u16(v, needle)) != 0) return true;
            for (int32_t j = 8; j < cardinality; j++) {
                uint16_t x = carr[j];
                if (x >= pos) return x == pos;
            }
            return false;
        }
        for (int32_t j = 0; j < cardinality; j++) {
            uint16_t v = carr[j];
            if (v >= pos) return v == pos;
        }
        return false;
    }

    int32_t num_blocks = cardinality / gap;

    if (num_blocks <= 3) {
        return simd_quad_m4(carr, cardinality, pos);
    }

    int32_t base = 0;
    int32_t n = num_blocks;

    // Pull the whole spine into L1 up front. For n in [256, 4096] this is
    // 1 line (128 B); for smaller n it is a partial line. Cheap on cold.
    __builtin_prefetch(spine);

    while (n > 3) {
        int32_t quarter = n >> 2;
        int32_t k1 = spine[base + quarter];
        int32_t k2 = spine[base + 2 * quarter];
        int32_t k3 = spine[base + 3 * quarter];
        int32_t c1 = (k1 < pos);
        int32_t c2 = (k2 < pos);
        int32_t c3 = (k3 < pos);
        base += (c1 + c2 + c3) * quarter;
        n -= 3 * quarter;
    }
    while (n > 1) {
        int32_t half = n >> 1;
        base = (spine[base + half] < pos) ? base + half : base;
        n -= half;
    }
    int32_t lo = (spine[base] < pos) ? base + 1 : base;

    if (lo < num_blocks) {
        const uint16_t *blk = carr + lo * gap;
        uint16x8_t needle = vdupq_n_u16(pos);
        uint16x8x4_t a = vld1q_u16_x4(blk);
        uint16x8x4_t b = vld1q_u16_x4(blk + 32);
        uint16x8_t h0 = vorrq_u16(
            vorrq_u16(vceqq_u16(a.val[0], needle), vceqq_u16(a.val[1], needle)),
            vorrq_u16(vceqq_u16(a.val[2], needle), vceqq_u16(a.val[3], needle)));
        uint16x8_t h1 = vorrq_u16(
            vorrq_u16(vceqq_u16(b.val[0], needle), vceqq_u16(b.val[1], needle)),
            vorrq_u16(vceqq_u16(b.val[2], needle), vceqq_u16(b.val[3], needle)));
        return vmaxvq_u16(vorrq_u16(h0, h1)) != 0;
    }

    for (int32_t j = num_blocks * gap; j < cardinality; j++) {
        uint16_t v = carr[j];
        if (v >= pos) return v == pos;
    }
    return false;

m3kw911d ago

Will I get a job if i say i can beat binary search?

1 more reply

cubefox11d ago

Since binary search is already very fast with its O(log n) time complexity: are there any real world applications which could practically benefit from this improvement?

2 more replies

j / k navigate · click thread line to collapse