In particular, (x + y) / 2 is the wrong implementation of midpoint in general, because it would fail to even compile on objects you can't add together. But midpoint is well-defined on anything you can subtract (i.e. anything you can define a consistent distance function for)—and it doesn't require addition to be well-defined between those objects!
One counterexample here (obvious in C/C++, not so obvious in Java) is pointers/iterators. You can subtract them, but not add them. And, in fact, if you implement midpoint in a manner that generalizes to those and respects the intrinsic constraints of the problem, you end up with the same x + (y - x) / 2 implementation, which doesn't have this bug.
I guess in maths this is called a generating Lie algebra (maybe someone can comment on this?)
Basically,
1. You have a zero time delta, and you can add and subtract time deltas satisfying some natural equations (time deltas form a group).
2. You can add time deltas to a datetime to get a new datetime, and this satisfies some natural equations relating to adding time deltas to each other (time deltas act on datetimes).
3. You can subtract two datetimes to get a time delta satisfying some more natural equations (the action is free and transitive).
This is often called an affine structure.
But note that that sentence was just about calculating midpoints, not about the larger binary search algorithm. And in any case, I was just trying to convey layman intuition, not write a mathematically precise theorem.
To vary your point here: the axioms for two's complement and IEEE floating-point arithmetic aren't well known or widely observed.
avg = (x + y) / 2
which fails both for signed ints (when adding positive x and y overflows maxint) and for unsigned ints (when x + y wraps around 0). Note that this can only be considered a bug for array indices x, y when these are 32-bit variables and the array can conceivably grow to more than 2 billion elements.

I wonder what the simplest fix is if the ordering between x and y is not known (e.g. in applications where x and y are not range bounds) and the language has no right-shift operation...
looks fairly simple to me. note this works for any ordering of R and L if the data type is signed.
x / 2 + y / 2 + ((x & 1) + (y & 1)) / 2
x / 2 + y / 2 + (x & y & 1)
Edit: This is the same as what you wrote, but it gets the wrong number for negative values: for negative ints the rounding will go up, not down.

The difference of two elements of the torsor of some group G is an honest-to-God group element of G, though, and so you have an honest-to-God identity element. You may or may not have an honest-to-God division or halving operator (which computes e given (e + e)), but in cases where G is the additive group of some field you do.
However, in this case our array indices are drawn from something like ℤ/2³²ℤ, and we might be trying to halve odd numbers, so none of this is justifiable! We want something different from our halving operator.
https://math.ucr.edu/home/baez/torsors.html
I see dataflow and maxiepoo were already talking about this: https://news.ycombinator.com/item?id=33493149
… some field of characteristic ≠ 2, of course.
(x^y)/2 + (x&y)

Google Research Blog: Nearly All Binary Searches and Mergesorts Are Broken - https://news.ycombinator.com/item?id=16890739 - April 2018 (1 comment)
Nearly All Binary Searches and Mergesorts Are Broken (2006) - https://news.ycombinator.com/item?id=14906429 - Aug 2017 (86 comments)
Nearly All Binary Searches and Mergesorts Are Broken (2006) - https://news.ycombinator.com/item?id=12147703 - July 2016 (35 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=9857392 - July 2015 (43 comments)
Read All About It: Nearly All Binary Searches and Mergesorts Are Broken - https://news.ycombinator.com/item?id=9113001 - Feb 2015 (2 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=7594625 - April 2014 (2 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=6799336 - Nov 2013 (46 comments)
Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=1130463 - Feb 2010 (49 comments)
Google Research Blog: Nearly All Binary Searches and Mergesorts are Broken [2006] - https://news.ycombinator.com/item?id=621557 - May 2009 (9 comments)
This type of issue is pretty common to encounter, and I make at least a few fixes a year specifically addressing integer overflow, across many companies.
you can also remove that unpredictable branch in the loop if you want.
whatever_t *bisect(whatever_t *offset, size_t length, whatever_t x) {
    while (size_t midpoint = length / 2) {
        bool side = x < offset[midpoint];
        midpoint &= side - 1; // x in the low half: zero the step...
        length >>= side;      //   ...and halve the remaining length
        offset += midpoint;   // x in the high half: step past the midpoint...
        length -= midpoint;   //   ...and keep the upper part
    }
    return offset;
}

Do the sum using signed int.
Then cast to unsigned int before the division (i.e., use a logical rather than arithmetic right shift).
Then cast back to signed int.
func Search(n int, f func(int) bool) int {
	// Define f(-1) == false and f(n) == true.
	// Invariant: f(i-1) == false, f(j) == true.
	i, j := 0, n
	for i < j {
		h := int(uint(i+j) >> 1) // avoid overflow when computing h
		// i ≤ h < j
		if !f(h) {
			i = h + 1 // preserves f(i-1) == false
		} else {
			j = h // preserves f(j) == true
		}
	}
	// i == j, f(i-1) == false, and f(j) (= f(i)) == true  =>  answer is i.
	return i
}
If you care about stuff like this you may enjoy the puzzle "Upside-Down Arithmetic Shift": https://bugfix-66.com/76b563beb6f4e61801fce4e835be862fb3dbbe...
It doesn't help that almost nobody knows this, though.
Go was designed by (among others) Ken Thompson, the father of Unix, with an understanding of the mistakes of C and C++.
Another example is that Go requires explicit integer casts (disallowing implicit integer casts) to avoid what is now understood to be an enormous source of confusion and bugs in C.
You can understand Go as an improved C, designed for a world where parallel computing (e.g., dozens of CPU cores) is commonplace.
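A small illustration of that design choice: in Go, mixed-width integer arithmetic won't compile until the conversion is spelled out.

```go
package main

import "fmt"

func main() {
	var i int32 = 1000
	var j int64 = 1 << 40

	// j + i            // does not compile: mismatched types int64 and int32
	sum := j + int64(i) // the widening must be written explicitly
	fmt.Println(sum)    // 1 << 40, plus 1000
}
```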
The proper method is to type-promote first: not just to unsigned but to a wider variable type, 32 to 64 bits or 64 to 128 bits. Unsigned simply gives a single extra bit, while erasing negative semantics. Promoting to twice the size works for either addition or multiplication. The benefits are correctness and the ability to be understood at a glance.
Are you sure? What's an example of an array.length that would trigger a remaining edge case here? (Keep in mind array.length is 32-bit in Java.)
The implementation shown works perfectly for arrays on the order of 2^30 elements. Calling them broken is like saying strlen is broken for strings that aren't null-terminated.
- Which inputs are valid.
- For a valid input, what the constraints on the return value are.
You'd have a point if the implementations had as an input constraint: "array must be less than 2^30". But they didn't.
Otherwise, nothing is broken unless it never returns the right answer. Take:
fn add(x: u32, y: u32) -> u32 { return 1; }
This implementation works perfectly for numbers that add to 1. It just loses generality outside of that.
fn add(x: u32, y: u32) -> u32 { return (x + y) - (x >> 10) - (y >> 10); }
This implementation works for x < 2^10 and y < 2^10. Arguably this implementation is much worse than the previous one because it fails unexpectedly. At least the previous implementation would be much more obviously broken.
But these are both broken because they don't fulfill the (implicit) contract for add. You can't just say "well, it's implied that my add function only takes inputs that add to 1" unless you actually write that somewhere and make it clear.
More generally, I feel like the thought process of "it's not broken if it works fine for inputs that occur 99% of the time" is an artifact of how little attention we pay to correctness, not something that is intrinsically true. If your function breaks for inputs that are clearly within its domain without any kind of warning... it's broken, as much as we might not want to admit it. We're just so used to this happening near edge cases that we don't think about it that way, but it's true.
They do, implicitly. It's just common sense. When you read a recipe in a cookbook, it usually doesn't mention that you're expected to be standing on your legs, not on your arms. The reader is expected to derive these things themselves.
A lot of generic algorithm implementations will start acting weird if your input size is on the order of INT_MAX. Instances this big will take days or weeks to process on commodity CPUs, so if you're doing something like that you would normally use a specialized library that takes these specifics into account.
Very much on brand for the FAANG r/iamverysmart crowd though.
Well, in fact it is exactly the contrary.
https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf [PDF] page 7 of the PDF, 5 of the classroom note.
If the system has a well-written formal specification, then your model can be built from that without error if done diligently. One real-world example is the first Algol 60 compiler, which was built to a formal specification. On the other hand, if there is no useful spec, or no spec at all, then you end up needing to experiment, i.e. test, and get your model as close as you can.
It is not sufficient merely to test a program; you have to prove it correct too.
In addition, it is not sufficient merely to prove a program correct; you have to test it too.
In summary, you have to both prove a program correct, and test it; skipping either will result in buggy garbage.
1. If you only have single byte elements, you'd better use counting sort.
2. There always tend to be parts of the virtual address space that are reserved. On x86-64, most userspace processes can only access 2^47 bytes of space.
Not only that, but in practice most general purpose operating systems are designed with higher-half kernels[0].
Though even then I'm not sure you could reliably allocate two gigs of contiguous virtual space without running into some immovable OS-provided thing.
Such a function, even if it seems trivial, has some educative value as it opens an opportunity to explain the problem in the documentation.
If you look at GLSL, it has many functions that do obvious things, like exp2(x), which does the same thing as pow(2, x), and I don't think anyone has any issue with that. It even has a specific "fma" operation (fma(a, b, c) = a*b + c, precisely) that solves a similar kind of problem as the overflowing average.
I briefly tried using binary search as a weeder problem and quickly abandoned it when no one got it right.
Suppose your high, low and mid indexes are as wide as a pointer on your machine: 32 or 64 bits. Unsigned.
Suppose you're binary searching or merge sorting a structure that fits entirely into memory.
The only way (low + high)/2 will overflow is if the object being subdivided fills the entire address space, and is an array of individual bytes. Or else is a sparsely populated, virtual structure.
If the space contains distinct objects from [0] to [high-1], and they are more than a byte wide, this is a non-issue. If the objects are more than two bytes wide, you can use signed integers.
Also, you're never going to manipulate objects that fill the whole address space. On 32 bits, some applications came close. On 64 bits, people are using the top 16 bits of a pointer for a tag.
Yeah, if you suppose that, you can correctly conclude that you only run into overflow if the object is a byte array that fills more than half the address space (though not the entire address space as you say). And that's why this problem remained unnoticed from 01958 or whenever someone first published a correct-on-my-machine binary search until 02006.
But suppose they aren't. Suppose, for example, that you're in Java, where there's no such thing as an unsigned type, and where ints are 32 bits even on a 64-bit machine. Suddenly the move to 64-bit machines around 02006 demonstrates that you have this problem on any array with more than 2³⁰ elements. It's easy to have 2³⁰ elements on a 64-bit machine! Even if they aren't bytes.
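That failure mode is easy to reproduce by forcing 32-bit indices; a sketch in Go (which, unlike Java, at least defines signed overflow as wrapping):

```go
package main

import "fmt"

func main() {
	// Two valid indices into an array of just over 2^30 elements:
	low, high := int32(1<<30), int32(1<<30+2)

	fmt.Println((low + high) / 2)   // negative: the 32-bit sum wrapped around
	fmt.Println(low + (high-low)/2) // 1073741825, the correct midpoint
}
```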
Is it that low and high are both floating point, so you're not constrained by int precision and you don't get an overflow error? The article makes it sound like sign switching is the issue, but this is just a general overflow problem, right?
Example with 8-bit integers (from wikipedia):
Bits, Unsigned value, Signed value
0000 0000, 0, 0
0000 0001, 1, 1
0000 0010, 2, 2
0111 1110, 126, 126
0111 1111, 127, 127
1000 0000, 128, −128
When the logical bit shift is conducted on -128, -128 is treated as an unsigned integer. Its sign bit gets shifted such that the integer becomes 0100 0000, aka 64.
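The arithmetic-versus-logical distinction can be checked directly in Go, where >> on a signed type is arithmetic and a logical shift requires a round trip through an unsigned type:

```go
package main

import "fmt"

func main() {
	x := int8(-128) // bit pattern 1000 0000

	// Arithmetic shift (>> on a signed type) copies the sign bit in:
	fmt.Println(x >> 1) // -64 (bit pattern 1100 0000)

	// Logical shift: reinterpret the bits as unsigned first, so the
	// sign bit moves down into the value: 1000 0000 -> 0100 0000.
	fmt.Println(int8(uint8(x) >> 1)) // 64
}
```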
https://en.wikipedia.org/wiki/Binary_search_algorithm#Proced...
I’m not sure what this could mean. Could you please share some examples?
M = L + (R - L)/2

I'm taking a pragmatic perspective: like it or not, people are going to skim the article and copy & paste the pseudocode.
Given that the pseudocode is buggy in the vast majority of programming languages and the user isn't informed about this in the pseudocode, it's going to lead to unnecessary bugs.
https://en.wikipedia.org/wiki/Binary_search_algorithm#Implem...
And as others mentioned, this is pseudocode, not an implementation. But if you think it's incorrect, feel free to correct it.
Yes, we are a long way from flipping switches to input machine code, but there are still hardware considerations for correctness and performance, e.g. the entire industry of deep learning running somewhat weird implementations of linear algebra to be fast on GPUs.
mid = low/2 + high/2

https://www.khanacademy.org/math/arithmetic-home/multiply-di...
What is relevant here is that integer division does not distribute over addition.
for unsigned numbers. not sure if it works for signed numbers.
(x>>1) + (y>>1) + (0x01 & x & y)
I've coded binary searches and sorts tons of times in C++, and yet none was susceptible to this bug. Why? Because, whenever you're talking indices, you should ALWAYS use unsigned int. Since an array can't have negative indices, if you use unsigned ints the problem is solved by design. And, if the element is not found, you throw an exception.
Instead, in C you don't have exceptions, and you have to figure out creative ways for returning errors. errno-like statics work badly with concurrency. And doing something like int search(..., int* err), and setting err inside of your functions, feels cumbersome.
So what does everyone do? Return a non-negative int if the index is found, or -1 otherwise.
In other words, we artificially extend the domain of the solution just to include the error. We force into the signed integer domain something that was always supposed to be unsigned.
This is the most common cause for most of the integer overflows problems out there.
C has size_t. Use it.
(And because C doesn't mandate correct handling of benign undefined behavior, you still have a problem if you `return ptr - orig_ptr` as a size_t offset (rather than returning the final ptr directly), because pointer subtraction is specified as producing ptrdiff_t (rather than size_t), which can 'overflow' for large arrays, despite that it's immediately converted back to a correct value of size_t.)
['a','b','c'].indexof('b') == 1 // found - return index
['a','b','c'].indexof('w') == 3 // not found - return size of array

Casts from int to pointer should be explicit.
We are enamoured with programmer convenience at the expense of the safety of our systems. It’s unprofessional and we should all aim to fix it.
The problems here are about using integers that are too narrow and not properly doing arithmetic to prevent overflow from impacting the result.
Isn't this article a counterexample to that? Where using signed instead of unsigned actually does result in a loss of functionality?