In particular, (x + y) / 2 is the wrong implementation of midpoint in general, because it would fail to even compile on objects you can't add together. But midpoint is well-defined on anything you can subtract (i.e. anything you can define a consistent distance function for)—and it doesn't require addition to be well-defined between those objects!
One counterexample here (obvious in C/C++, not so obvious in Java) is pointers/iterators. You can subtract them, but not add them. And, in fact, if you implement midpoint in a manner that generalizes to those and respects the intrinsic constraints of the problem, you end up with the same x + (y - x) / 2 implementation, which doesn't have this bug.
I guess in maths this is called a generating Lie algebra (maybe someone can comment on this?)
Basically,
1. You have a zero time delta, and you can add and subtract time deltas satisfying some natural equations (time deltas form a group).
2. You can add time deltas to a datetime to get a new datetime, and this satisfies some natural equations relating to adding time deltas to each other (time deltas act on datetimes).
3. You can subtract two datetimes to get a time delta satisfying some more natural equations (the action is free and transitive).
This is often called an affine structure.
But note that that sentence was just about calculating midpoints, not about the larger binary search algorithm. And in any case, I was just trying to convey layman intuition, not write a mathematically precise theorem.
To vary your point here: the axioms for two's complement and IEEE floating-point arithmetic aren't well known or widely observed.
avg = (x + y) / 2
which fails both for signed ints (when adding positive x and y overflows maxint) and for unsigned ints (when x + y wraps around 0). Note that this can only be considered a bug for array indices x, y when these are 32-bit variables and the array can conceivably grow to more than 2 billion elements.

I wonder what the simplest fix is if the ordering between x and y is not known (e.g. in applications where x and y are not range bounds) and the language has no right-shift operation...
looks fairly simple to me. note this works for any ordering of R and L if the data type is signed.
x / 2 + y / 2 + ((x & 1) + (y & 1)) / 2
x / 2 + y / 2 + (x & y & 1)
Edit: This is the same as what you wrote, but it gets the wrong number for negative values: for negative ints the rounding will go up, not down.

The difference of two elements of the torsor of some group G is an honest-to-God group element of G, though, and so you have an honest-to-God identity element. You may or may not have an honest-to-God division or halving operator (which computes e given (e + e)), but in cases where G is the additive group of some field you do.
However, in this case our array indices are drawn from something like ℤ/2³²ℤ, and we might be trying to halve odd numbers, so none of this is justifiable! We want something different from our halving operator.
https://math.ucr.edu/home/baez/torsors.html
I see dataflow and maxiepoo were already talking about this: https://news.ycombinator.com/item?id=33493149
… some field of characteristic ≠ 2, of course.
(x^y)/2 + (x&y)

Google Research Blog: Nearly All Binary Searches and Mergesorts Are Broken - https://news.ycombinator.com/item?id=16890739 - April 2018 (1 comment)
Nearly All Binary Searches and Mergesorts Are Broken (2006) - https://news.ycombinator.com/item?id=14906429 - Aug 2017 (86 comments)
Nearly All Binary Searches and Mergesorts Are Broken (2006) - https://news.ycombinator.com/item?id=12147703 - July 2016 (35 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=9857392 - July 2015 (43 comments)
Read All About It: Nearly All Binary Searches and Mergesorts Are Broken - https://news.ycombinator.com/item?id=9113001 - Feb 2015 (2 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=7594625 - April 2014 (2 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=6799336 - Nov 2013 (46 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=1130463 - Feb 2010 (49 comments)
Google Research Blog: Nearly All Binary Searches and Mergesorts are Broken [2006] - https://news.ycombinator.com/item?id=621557 - May 2009 (9 comments)
This type of issue is pretty common to encounter, and I make at least a few fixes a year specifically addressing integer overflow, across many companies.
you can also remove that unpredictable branch in the loop if you want.
whatever_t *bisect(whatever_t *offset, size_t length, whatever_t x) {
    while (size_t midpoint = length / 2) {
        bool side = x < offset[midpoint];
        midpoint &= side - 1; // x in the low half: zero the step...
        length >>= side;      //   ...and halve the remaining length
        offset += midpoint;   // x in the high half: step past the midpoint...
        length -= midpoint;   //   ...and keep the upper part
    }
    return offset;
}

Do the sum using signed int.
Then cast to unsigned int before the division (i.e., use a logical rather than arithmetic right shift).
Then cast back to signed int.
func Search(n int, f func(int) bool) int {
	// Define f(-1) == false and f(n) == true.
	// Invariant: f(i-1) == false, f(j) == true.
	i, j := 0, n
	for i < j {
		h := int(uint(i+j) >> 1) // avoid overflow when computing h
		// i ≤ h < j
		if !f(h) {
			i = h + 1 // preserves f(i-1) == false
		} else {
			j = h // preserves f(j) == true
		}
	}
	// i == j, f(i-1) == false, and f(j) (= f(i)) == true  =>  answer is i.
	return i
}
If you care about stuff like this you may enjoy the puzzle "Upside-Down Arithmetic Shift": https://bugfix-66.com/76b563beb6f4e61801fce4e835be862fb3dbbe...
It doesn't help that almost nobody knows this, though.
Go was designed by (among others) Ken Thompson, the father of Unix, with an understanding of the mistakes of C and C++.
Another example is that Go requires explicit integer casts (disallowing implicit integer casts) to avoid what is now understood to be an enormous source of confusion and bugs in C.
You can understand Go as an improved C, designed for a world where parallel computing (e.g., dozens of CPU cores) is commonplace.
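A small illustration of that design choice: in Go, mixed-width integer arithmetic won't compile until the conversion is spelled out.

```go
package main

import "fmt"

func main() {
	var i int32 = 1000
	var j int64 = 1 << 40

	// j + i            // does not compile: mismatched types int64 and int32
	sum := j + int64(i) // the widening must be written explicitly
	fmt.Println(sum)    // 1 << 40, plus 1000
}
```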
The proper method is to type-promote first: not just to unsigned but to a wider variable type, 32 to 64 bits or 64 to 128 bits. Unsigned simply gives a single extra bit, while erasing negative semantics. Promoting to twice the size works for either addition or multiplication. The benefits are correctness and the ability to be understood at a glance.
Are you sure? What's an example of an array.length that would trigger a remaining edge case here? (Keep in mind array.length is 32-bit in Java.)
The implementation shown works perfectly for arrays on the order of 2^30 elements. Calling them broken is like saying strlen is broken for strings that aren't null-terminated.
- Which inputs are valid.
- For a valid input, what the constraints on the return value are.
You'd have a point if the implementations had as an input constraint: "array must be less than 2^30". But they didn't.
Otherwise, nothing is broken unless it never returns the right answer. Take:
fn add(x: u32, y: u32) -> u32 { return 1; }
This implementation works perfectly for numbers that add to 1. It just loses generality outside of that.
fn add(x: u32, y: u32) -> u32 { return (x + y) - (x >> 10) - (y >> 10); }
This implementation works for x < 2^10 and y < 2^10. Arguably this implementation is much worse than the previous one because it fails unexpectedly. At least the previous implementation would be much more obviously broken.
But these are both broken because they don't fulfill the (implicit) contract for add. You can't just say "well, it's implied that my add function only takes inputs that add to 1" unless you actually write that somewhere and make it clear.
More generally, I feel like the thought process of "it's not broken if it works fine for inputs that occur 99% of the time" is an artifact of how little attention we pay to correctness, not something that is intrinsically true. If your function breaks for inputs that are clearly within its domain without any kind of warning... it's broken, as much as we might not want to admit it. We're just so used to this happening near edge cases that we don't think about it that way, but it's true.
They do, implicitly. It's just common sense. When you read a recipe in a cookbook, it usually doesn't mention that you're expected to be standing on your legs, not on your arms. The reader is expected to derive these things themselves.
A lot of generic algorithm implementations will start acting weird if your input size is on the order of INT_MAX. Instances this big will take days or weeks to process on commodity CPUs, so if you're doing something like that you would normally use a specialized library that takes these specifics into account.
Very much on brand for the FAANG r/iamverysmart crowd though.
Well, in fact it is exactly the contrary.
https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf [PDF] page 7 of the PDF, 5 of the classroom note.
If the system has a well-written formal specification, then your model can be built from that without error if done diligently. One real-world example is the first Algol 60 compiler, which was built to a formal specification. On the other hand, if there is no useful spec, or no spec at all, then you end up needing to experiment, i.e. test, and get your model as close as you can.
It is not sufficient merely to test a program; you have to prove it correct too.
In addition, it is not sufficient merely to prove a program correct; you have to test it too.
In summary, you have to both prove a program correct, and test it; skipping either will result in buggy garbage.
1. If you only have single byte elements, you'd better use counting sort.
2. There always tend to be parts of the virtual address space that are reserved. On x86-64, most userspace processes can only access 2^47 bytes of space.
Not only that, but in practice most general purpose operating systems are designed with higher-half kernels[0].
Though even then I'm not sure you could reliably allocate two gigs of contiguous virtual space without running into some immovable OS-provided thing.
Such a function, even if it seems trivial, has some educative value as it opens an opportunity to explain the problem in the documentation.
If you look at GLSL, it has many functions that do obvious things, like exp2(x), which does the same thing as pow(2, x), and I don't think anyone has any issue with that. It even has a specific "fma" operation (fma(a, b, c) = a*b + c, precisely) that solves a similar kind of problem as the overflowing average.
I briefly tried using binary search as a weeder problem and quickly abandoned it when no one got it right.
Suppose your high, low and mid indexes are as wide as a pointer on your machine: 32 or 64 bits. Unsigned.
Suppose you're binary searching or merge sorting a structure that fits entirely into memory.
The only way (low + high)/2 will overflow is if the object being subdivided fills the entire address space, and is an array of individual bytes. Or else is a sparsely populated, virtual structure.
If the space contains distinct objects from [0] to [high-1], and they are more than a byte wide, this is a non-issue. If the objects are more than two bytes wide, you can use signed integers.
Also, you're never going to manipulate objects that fill the whole address space. On 32 bits, some applications came close. On 64 bits, people are using the top 16 bits of a pointer for a tag.
Yeah, if you suppose that, you can correctly conclude that you only run into overflow if the object is a byte array that fills more than half the address space (though not the entire address space as you say). And that's why this problem remained unnoticed from 01958 or whenever someone first published a correct-on-my-machine binary search until 02006.
But suppose they aren't. Suppose, for example, that you're in Java, where there's no such thing as an unsigned type, and where ints are 32 bits even on a 64-bit machine. Suddenly the move to 64-bit machines around 02006 demonstrates that you have this problem on any array with more than 2³⁰ elements. It's easy to have 2³⁰ elements on a 64-bit machine! Even if they aren't bytes.
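That failure mode is easy to reproduce by forcing 32-bit indices; a sketch in Go (which, unlike Java, at least defines signed overflow as wrapping):

```go
package main

import "fmt"

func main() {
	// Two valid indices into an array of just over 2^30 elements:
	low, high := int32(1<<30), int32(1<<30+2)

	fmt.Println((low + high) / 2)   // negative: the 32-bit sum wrapped around
	fmt.Println(low + (high-low)/2) // 1073741825, the correct midpoint
}
```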
Is it that low and high are both floating point, so you're not constrained by int precision and you don't get an overflow error? The article makes it sound like sign switching is the issue, but this is just a general overflow problem, right?
Example with 8-bit integers (from wikipedia):
Bits, Unsigned value, Signed value
0000 0000, 0, 0
0000 0001, 1, 1
0000 0010, 2, 2
0111 1110, 126, 126
0111 1111, 127, 127
1000 0000, 128, −128
When the logical bit shift is conducted on -128, -128 is treated as an unsigned integer. Its sign bit gets shifted such that the integer becomes 0100 0000, aka 64.
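The arithmetic-versus-logical distinction can be checked directly in Go, where >> on a signed type is arithmetic and a logical shift requires a round trip through an unsigned type:

```go
package main

import "fmt"

func main() {
	x := int8(-128) // bit pattern 1000 0000

	// Arithmetic shift (>> on a signed type) copies the sign bit in:
	fmt.Println(x >> 1) // -64 (bit pattern 1100 0000)

	// Logical shift: reinterpret the bits as unsigned first, so the
	// sign bit moves down into the value: 1000 0000 -> 0100 0000.
	fmt.Println(int8(uint8(x) >> 1)) // 64
}
```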
https://en.wikipedia.org/wiki/Binary_search_algorithm#Proced...
I’m not sure what this could mean. Could you please share some examples?
M = L + (R - L)/2

I'm taking a pragmatic perspective: like it or not, people are going to skim the article and copy & paste the pseudocode.
Given that the pseudocode is buggy in the vast majority of programming languages and the user isn't informed about this in the pseudocode, it's going to lead to unnecessary bugs.
https://en.wikipedia.org/wiki/Binary_search_algorithm#Implem...
And as others mentioned, this is pseudocode, not an implementation. But if you think it's incorrect, feel free to correct it.
Yes, we are a long way from flipping switches to input machine code, but there are still hardware considerations for correctness and performance, e.g. the entire industry of deep learning running somewhat weird implementations of linear algebra to be fast on GPUs.
mid = low/2 + high/2

https://www.khanacademy.org/math/arithmetic-home/multiply-di...
What is relevant here is that integer division does not distribute over addition.
for unsigned numbers. not sure if it works for signed numbers.
(x>>1) + (y>>1) + (0x01 & x & y)
I've coded binary searches and sorts tons of times in C++, and yet none was susceptible to this bug. Why? Because, whenever you're talking indices, you should ALWAYS use unsigned int. Since an array can't have negative indices, if you use unsigned ints the problem is solved by design. And, if the element is not found, you throw an exception.
Instead, in C you don't have exceptions, and you have to figure out creative ways for returning errors. errno-like statics work badly with concurrency. And doing something like int search(..., int* err), and setting err inside of your functions, feels cumbersome.
So what does everyone do? Return a non-negative int if the index is found, or -1 otherwise.
In other words, we artificially extend the domain of the solution just to include the error. We force into the signed integer domain something that was always supposed to be unsigned.
This is the most common cause for most of the integer overflows problems out there.
C has size_t. Use it.
(And because C doesn't mandate correct handling of benign undefined behavior, you still have a problem if you `return ptr - orig_ptr` as a size_t offset (rather than returning the final ptr directly), because pointer subtraction is specified as producing ptrdiff_t (rather than size_t), which can 'overflow' for large arrays, despite that it's immediately converted back to a correct value of size_t.)
['a','b','c'].indexof('b') == 1 // found - return index
['a','b','c'].indexof('w') == 3 // not found - return size of array

Casts from int to pointer should be explicit.
We are enamoured with programmer convenience at the expense of the safety of our systems. It’s unprofessional and we should all aim to fix it.
The problems here are about using integers that are too narrow and not properly doing arithmetic to prevent overflow from impacting the result.
Isn't this article a counterexample to that? Where using signed instead of unsigned actually does result in a loss of functionality?