I find the terminology of open and closed intervals contradictory to their meaning. Does anyone know why they are described like this?
`Closed` makes me think shut or not-including - yet a closed interval includes its endpoints. `Open` makes me think inclusive - yet an open interval does not include its endpoints.
Under this definition, 'closed' makes sense in the larger context, since in mathematics, if an operation is 'closed' on a set, that means that applying the operation to elements of the set always yields another element of the same set.
Open is a bit more awkward. A set O is 'open' if and only if the complement of O (i.e. the set of all points not in O) is closed. So open is kind of the opposite of closed, hence the convention. Of course, it isn't really the opposite of closed, since there are sets which are neither closed nor open.
tt;dr (too technical; didn't read) in maths, 'closed' usually means 'I can do stuff inside this set without falling out of it'. In the case of intervals, the 'stuff' in question is taking a limit of a convergent sequence.
This set is not closed: there are non-rational numbers in that interval which are limit points of sequences consisting only of rationals - for example, any irrational algebraic number such as sqrt(2)/2. I think all real numbers are limits of such sequences (the rationals are dense in the reals), but I might be mis-remembering some subtlety of Dedekind cuts (one method for constructing the reals).
This set is not open, because any rational is the limit of a sequence of non-rational reals. This probably makes intuitive sense, but just for the sake of formality: to construct such a sequence for any rational r, start with x_1 = 1/pi and close a factor of 1/pi of the remaining distance at each step, i.e. x_{n+1} = x_n + (r - x_n)/pi. Each x_n is irrational because pi is transcendental.
Any simple interval in R will be either closed or open at each end (it could be closed at one end and open at the other). It's more illustrative to create a set with a non-compact interior. In higher dimensions it's possible to have more exotic boundaries on an interval, but I think such a boundary will just end up being homeomorphic to a non-compact set in a lower dimension.
Proof that it's not closed: the sequence (1, 1/2, 1/4, 1/8, ...) is entirely inside the interval, but converges on 0, which is outside the interval, therefore etc.
Proof that it's not open: the sequence (2, 3/2, 5/4, 9/8, ...) is entirely outside the interval (i.e. inside the complement), but it converges on 1, which is inside the interval (i.e. outside the complement). Thus the complement of the interval is not closed, therefore etc.
With a closed interval, such a sequence can converge to the bound itself, which is in the interval, instead of getting smaller/larger forever and escaping it.
This no doubt sounds strange, but it surely comes down to the slightly odd behavior of real numbers. In an open interval like (0, 1), there is no single real number which is less than one but is greater than all other real numbers less than one. (Proof by Cantor's diagonal: write out a list of all such numbers and you can always construct another which wasn't on the list, even when the list is infinite.) In contrast, the closed interval [0, 1] has a definite maximum element, the number 1.
This seems like a good example of poor UX in mathematics; the terms are too similar and can only be learned by memorization, the notations likewise. Good luck changing it though; Wikipedia informs me that there are no less than two ISO standards codifying it.
There is a perfectly clear notation in the Wikipedia page, however: (a,b) = {x∈ℝ|a<x<b}; [a,b] = {x∈ℝ|a≤x≤b}
The rational numbers are like this too.
All the rational numbers between 0 and 1 also get arbitrarily close to 1 with no largest element. You can prove this by simple contradiction: for every rational number x < 1, there is another rational number y = (x+1)/2 with x < y < 1. So the set of rational numbers less than 1 also has no largest element.
Actually real numbers can be defined (axiomatized) in terms of the least upper bound property -- every set of real numbers that's bounded above has a least upper bound. So you could actually have a set S of rational numbers that gets arbitrarily close to something like sqrt(2) from below. S then has no rational least upper bound -- for every rational number x greater than or equal to every element of S, there is another rational number y greater than or equal to every element of S but smaller than x.
Note that a set of real numbers need not contain its upper bound -- as you noted, a bounded open interval doesn't contain its upper bound.
Also, any notation has to be learned; concise ones perhaps a bit more than others, but there is nothing one can do about that.
And that perfectly clear notation requires you to learn quite a bit by memorization, such as the meaning of those {} brackets, the |, and the symbols ∈ and ℝ.
Finally, you don't need Cantor's diagonalization for a proof. It is easy(1) to show that, for any real x<1, x+(1-x)/2 is real, less than one, and strictly greater than x.
Alternatively, assuming 0<x<1, write x in decimal, and increase the first digit less than nine by one to get a larger real less than one. There is such a digit because 0.9999999… is not less than one.
(1) depending on how deep you want to descend into the foundations of mathematics.
To take your thinking a bit further the interval is "shut" precisely because it's closed exactly at that point. And the interval is thrown "open" by not including the point.
BTW, Rudin's classic text [1] covers the fascinating topic of neighborhoods, which builds on the concept of intervals. It took me a while to completely understand the concept, but after that, understanding limits was relatively easy; even otherwise, "neighborhood" is an interesting concept in itself.
[1] http://www.math.boun.edu.tr/instructors/ozturk/eskiders/guz1...
Thanks for everyone's input. I think I have a clear understanding of the open/closed paradigm now.
for ( ; ; )
or: while ( true )
etc. You just know you're setting yourself up for undocumented bugs later.
I've been known to build "escape" vars into these like:
int attempts = 1000;
for (; attempts > 0; attempts--) {
    /* DO SOMETHING (break out on success) */
}
if (attempts <= 0) {
    /* BAIL: the loop never finished on its own */
}
Where 1000 or whatever number is a reasonable estimate of the function's need to loop, times 10 or so.

The way I think of it, a loop has four parts: 1. Some initialization (optional). 2. Some stuff you do each iteration (optional). 3. Check the exit condition and exit the loop (optional). 4. Some stuff you do each iteration (optional).
"while" loops nicely handle the case where 2 is empty. "do while" loops handle 4 being empty. "for" loops handle both 2 and 4 being non-empty but only really accommodate 4 being a single expression.
For any other case where 2 is non-empty and 4 is more than a single expression, I just use a "while (true)" with an "if (...) break;" in the body.
[0] Or utterly catastrophic. Think embedded system with limited debugging facilities. This kind of bug could manifest as the system completely locking up, and take much, much longer to track down than an "infinite loop detected!" message in an error log.
I don't prefer to exit from the middle of a loop, but if that's the most succinct way to implement the correct behavior, I'll do it. I think short code that exits from the middle is less error-prone than code that has to be more convoluted to put the exit at the top or bottom.
> Putting an escape hatch in the loop setup makes it less likely that an operationally-problematic[0] infinite loop scenario will be triggered.
An escape hatch is yet more code that has to be tested and debugged. What if the escape hatch triggers too early?
> [0] Or utterly catastrophic. Think embedded system with limited debugging facilities.
Sure, but unusual platforms require unusual coding styles. If I was targeting that I might adjust my practices.
1) Algorithm FT says:
1. Generate u. Store the first bit of u
as a sign s (s=0 if u<1/2, s=1 if u>=1/2).
and yet the C code implements:

    if ( u <= 0.5 ) s = 0.0;
    else s = 1.0;
2) I can't be sure of the following w/o access to doc. But i4_uni() says a uniform distribution over (1, 2147483562)
which, offhand, is suspicious. A distribution over positive integers would probably want to use all available values in a 32-bit signed int, so it would most likely end at 2^31 - 1, which is 2147483647, and not the value given.

You could use that argument about any of the random numbers that your rng returns. "Hey, 0.023 is infinitesimally unlikely to occur, so let's exclude that as well."
Generally when I use a real-valued rng, the numbers generated are meant to be 'representative' of what I should get from the distribution. And when I get e.g. 0.023, it kind-of means "0.023 and/or numbers near to 0.023". If I excluded 0.023 from the possible results of the rng, then I would have a 'hole' in my distribution 'in the region of' 0.023.
Maybe what you want is to go from the possible rnds:
0.000, 0.001, 0.002, 0.003, ... , 0.999
to:
0.0005, 0.0015, 0.0025, ... , 0.9995
whereas with your suggestion, you are under-representing the boundary numbers near to 0 and/or near to 1. Generally not a problem I guess, but what you are doing is, imo, 'wrong in theory', even though your high-precision floats will probably cover it up ok in almost all cases.
I agree with you about 0 and 1 causing problems when they are fed into other functions - e.g. generating Gaussian rngs by using InverseCumulativeGaussian(0.0) - a bug I wasted some time hunting down in my company's rng library. My view was that the developer who wrote the library did not understand the maths of what he was doing, rather than that the 0-to-1 random number generator was at fault.
If you'll pardon me pulling out my soapbox for a moment, this kind of bug would be less common if the "programmers don't need math" attitude was not a successful meme in our culture. The notion of "between zero and one" may seem straightforward and obvious, but is actually ambiguous. Perhaps, given the number of distinct floating point values that can be represented in the interval (0,1), it would hardly seem to matter what interpretation of "between zero and one" is used. But in mathematics you will never see such terminology used without a qualifying statement that clarifies* the meaning.
*E.g. "between 0 and 1, inclusive", "on the closed interval bounded by 0 and 1", "[0,1]", etc.
More stories like that?
I've never used ranlib; I hope the documentation can be cleaned up at least. On the other hand, anyone who has read A Deepness in the Sky will surely feel that it is unlikely in the long term.
I wonder about the resolution: was snorm() reimplemented correctly, or was an RNG with the 'wrong' interval supplied?