No way to parse integers in C (2022)

(blog.habets.se)

85 points · konmok · 1mo ago · 111 comments

69 comments · 18 top-level

orthoxerox · 1mo ago · 15 in thread

I wasn't in this class myself, but one prof at my alma mater started his "Programming 201" class with the simplest assignment: write a C program that accepts two integers from the user and prints their sum. It actually was the only assignment for the rest of the semester, since he had a test suite that would humiliate the students gently at first, but would ultimately pipe a billion nines into stdin as the first argument.

dlcarrier · 1mo ago

It's a little awkward, because you'd need to parse the strings in reverse, but if all you need to do is sum, you can do it one digit at a time, while at any given moment handling only one character from each input string, a carry byte, and one output character.
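A rough sketch of that digit-at-a-time addition, for illustration. It assumes both inputs are already validated strings of decimal digits, and it keeps both strings in memory for simplicity rather than streaming them; `add_decimal_strings` is a made-up name.

```c
#include <stdlib.h>
#include <string.h>

/* Adds two non-negative decimal strings (digits only, already validated).
 * Returns a newly allocated string the caller must free, or NULL on
 * allocation failure. add_decimal_strings is a hypothetical name. */
char *add_decimal_strings(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = (la > lb ? la : lb) + 1;   /* room for a final carry digit */
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    out[n] = '\0';

    int carry = 0;
    for (size_t k = 0; k < n; k++) {
        /* k-th digit from the right of each input, or 0 past its start */
        int da = k < la ? a[la - 1 - k] - '0' : 0;
        int db = k < lb ? b[lb - 1 - k] - '0' : 0;
        int sum = da + db + carry;
        out[n - 1 - k] = (char)('0' + sum % 10);
        carry = sum / 10;
    }

    /* Drop the leading zero if the top carry was not needed. */
    if (n > 1 && out[0] == '0')
        memmove(out, out + 1, n);   /* n bytes: n-1 digits plus the NUL */
    return out;
}
```

Only one digit from each input, the carry, and one output digit are live per iteration, so nothing ever has to fit in a machine word.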

ronsor · 1mo ago

You don't need to parse the strings in reverse. That's for printing integers, not parsing. Roughly:

    int stdin_atoi() {
      int i = 0;
      while (1) {
        int c = getchar();
        if (c >= '0' && c <= '9') {
          i = i * 10 + (c - '0');
        } else { break; }
      }
      return i;
    }

2 more replies

pbalau · 1mo ago

How do you know where the first string ends and the second starts? Did you miss the "stdin" part?

This is not

    ./program first_number second_number

1 more reply

camkego · 1mo ago

This would be kind of a fun challenge. If you are handling random numbers, well, you are limited by disk or memory size. But if the numbers are compressible à la LZ77 or Gzip, then there are ways to use the value’s compression trees to sum the numbers from the least significant digits using the LZ77-style compressed value tree representation. If you go that route, and the numbers are compressible (not random), then the question is whether the compressed input and output trees fit in memory or on disk.

jeffrallen · 1mo ago

Would be fun to write a program that arranges to send the input into dc(1) and just outsource the whole problem to Ken or Rob or whoever wrote it. :)

Henchman21 · 1mo ago

It would be fun, but were I the teacher I'd commend you for your ingenuity, and then ask you to return to your desk to complete the assignment.

BobbyTables2 · 1mo ago

That’s golden!

Would make an excellent “interview question from Hell”!

msie · 1mo ago

Perfect is the enemy of good.

lanstin · 1mo ago

That's true for product development, but it's not true for mathy libraries. Perfect is achievable. For a released software that humans will decide to use or not, rapid iteration is great. But also: https://randomascii.wordpress.com/2014/01/27/theres-only-fou...

Precision and exactitude and formally proven correct software can exist in some problem domains, and it's kind of silly to not achieve that when it's achievable.

chowells · 1mo ago

But in this case, C is not "good". It is more like "abysmal". "Good" is just producing a correct result or error, with no ambiguity which case applied and no UB. "Perfect" is arguing over the most usable and elegant API for it.

clark_dent · 1mo ago

Could you humor a coding noob--how do you deal with utterly insane inputs like that?

matthewkayin · 1mo ago

You first ask if you really need to.

1 more reply

wwalexander · 1mo ago

Arbitrary precision arithmetic (GMP, BigInteger, etc). Numbers can take arbitrary amounts of memory, instead of just a single machine word.

Ekaros · 1mo ago

At some point you can just refuse. Too many digits. Well time to quit with error.

doubled112 · 1mo ago

Crash and report an error.

1 more reply

voidUpdate · 1mo ago · 11 in thread

Can't you just:

  for(int i = 0; i < len(characters); i++)
  {
    if(characters[i]-48 <= 9 && characters[i]-48 >= 0)
    {
      ret = ret * 10 + characters[i] - 48;
    }
    else
    {
      return ERROR;
    }
  }
  return ret;

Adjust until it actually works, but you get the picture.

knome · 1mo ago

this wouldn't catch overflow or underflow errors, nor does it allow non-base-10 numbers, nor does it handle negative numbers. and writing your own parser is a failure case by op's logic. they are complaining about the builtin parsing functions.

the author admits you can parse signed integers in their second example, but for unsigned, they don't seem to like that unsigned parsing will accept negative numbers and then automatically wrap them to their unsigned equivalents, nor do they like that C number parsing often bails with best effort on non-numeric trailing data rather than flagging it an error, nor do they like that ULONG_MAX is used as a sentinel value by sscanf.

I'm not sure what they mean by "output raw" vs "output"

    $ cat t.c
    
    #include <stdlib.h>
    #include <math.h>
    #include <stdio.h>
    
    int main(int argc, char **argv){
    
      char * enda = NULL;
      unsigned long long a = strtoull("-18446744073709551614", &enda, 10);
      printf("in = -18446744073709551614, out = %llu\n", a);
      
      char * endb = NULL;
      unsigned long long b = strtoull("-18446744073709551615", &endb, 10);
      printf("in = -18446744073709551615, out = %llu\n", b);
      
      return 0;
    }
    $ gcc t.c
    $ ./a.out 
    in = -18446744073709551614, out = 2
    in = -18446744073709551615, out = 1
    $

I get their "output raw" value. I don't know what their "output" value is coming from.

I don't see anywhere they describe what they are representing in the raw vs not columns.

thomashabets2 · 1mo ago

> they don't like seem to like that unsigned parsing will accept negative numbers and then automatically wrap them to their unsigned equivalents, nor do they like that C number parsing often bails with best effort on non-numeric trailing data rather than flagging it an error, nor do they like that ULONG_MAX is used as a sentinel value by sscanf.

That's right. I don't like asking it to parse the number contained inside a string, and getting a different number as a result.

That's just simply not the right answer.

> I'm not sure what they mean by "output raw" vs "output"

I can see how that's very unclear. Changed now to "Readable".

card_zero · 1mo ago

I think "output" is just supposed to be a human-readable version of "output raw". So the line in the table where "output raw" is 2 but "output" is 1 looks like a mistake. It's repeated in the table for sscanf().

1 more reply

dlcarrier · 1mo ago

Here's a readability tip for working with ASCII numbers: Treat adding and subtracting the ASCIIness as you would multiplying and dividing by a unit in physics. You can add '0' to convert a numeral to ASCII and subtract '0' to convert it back, and you can do direct comparisons between ASCII numerals.

    if(characters[i] <= '9' && characters[i] >= '0')
    {
      ret = ret * 10 + characters[i] - '0';
    }

voidUpdate · 1mo ago

I was trying to remember how to do that, I forgot you can subtract '0', and was thinking that - 0 obviously wouldn't work

Sharlin · 1mo ago

And how does this avoid returning nonsense if the number is too large? (Wrapping if the accumulator is unsigned, straight to UB land if signed.) Not reporting overflows as errors is one of the major problems demonstrated by TFA.

voidUpdate · 1mo ago

you could check if ret > ret * 10 + characters[i]-48, if so it has wrapped around and you return an error
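One caveat with comparing after the fact: a multiply by 10 can wrap and still land on a larger value. With a 32-bit unsigned accumulator, 1000000000 * 10 wraps to 1410065408, which would pass that test (and with a signed accumulator the overflow is already UB by then). A minimal sketch of checking before the multiply instead, assuming a `uint32_t` accumulator; `push_digit_u32` is a made-up helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Appends one decimal digit (0..9) to *acc, refusing if the result would
 * not fit in uint32_t. push_digit_u32 is a hypothetical helper name. */
bool push_digit_u32(uint32_t *acc, unsigned digit)
{
    /* Test before multiplying: acc * 10 + digit <= UINT32_MAX
     * is equivalent to acc <= (UINT32_MAX - digit) / 10. */
    if (*acc > (UINT32_MAX - digit) / 10)
        return false;
    *acc = *acc * 10 + digit;
    return true;
}
```

The accumulator is left untouched on failure, so the caller can report the error without having produced a wrapped value.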

2 more replies

fhdkweig · 1mo ago

What if the number you want to return just happens to be the value of ERROR? You need an error flag that can't be represented as an int, but then C wouldn't let you return it from a function that only returns "int". It is why some languages throw exceptions and why databases have the special "null" value.

voidUpdate · 1mo ago

I don't use C enough to know what the convention is for throwing an error when the function can return a number anyway. You'd have to ask someone else

1 more reply

jerf · 1mo ago

And why some very, very special languages have an effectively-global variable called "errno" that you have to check after the call manually, and worry about whether maybe it was populated from some previous error. Nothing says "production-quality language that an entire civilization's code base should be based on" like "sometimes (but only sometimes!) functions return additional information through global values".
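For reference, the manual errno dance for the signed case looks roughly like this. It is a sketch of one possible wrapper, not anything from the article; `parse_long` is a made-up name, and the choice to reject leading whitespace is an assumption on my part. The value comes back through an out-parameter so that every `long` stays a valid result.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Strict base-10 wrapper around strtol. Returns true and stores the value
 * in *out only if the whole string is a number that fits in a long.
 * parse_long is a hypothetical name. */
bool parse_long(const char *s, long *out)
{
    /* strtol skips leading whitespace; reject it up front instead. */
    if (*s == '\0' || isspace((unsigned char)*s))
        return false;

    char *end;
    errno = 0;                       /* clear any stale error first */
    long v = strtol(s, &end, 10);
    if (end == s)
        return false;                /* no digits consumed */
    if (*end != '\0')
        return false;                /* trailing garbage */
    if (errno == ERANGE)
        return false;                /* value out of range for long */
    *out = v;
    return true;
}
```

Note that strtol still accepts a leading '+' or '-', so this does not help with the unsigned case discussed elsewhere in the thread.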

1 more reply

bitwize · 1mo ago

You cannot "just" anything in C without hitting a minefield of UB. It is, probably, more economical to convert your entire project to Rust than it is to do the pufferfish spine removal procedure of auditing the code base for UB and replacing the problem areas. With generative AI, the size of project for which this remains true may be as large as "the entire Linux kernel".

stephc_int13 · 1mo ago · 11 in thread

As a C programmer, I find this kind of bad faith article very irritating.

Yes, the standard library is bad. This is by far the worst part of the C legacy. But it is not that hard to write your own.

String functions like this are not difficult at all, and you can use better naming and semantics, write faster code etc.

C is not the C standard library, ffs.

konmok (OP) · 1mo ago

I don't think it's in bad faith.

The distinction between a language and its standard library gets blurry even in theory, and in practice they're nearly inseparable. If a language's standard library has four ways of doing almost the same thing, and they're all fundamentally broken, that's a problem.

stephc_int13 · 1mo ago

If you read the other articles by the same author on his blog, you'll see that he has some strong and weird opinions about C and UB.

Complete BS in my opinion.

alexfoo · 1mo ago

Exactly. A wrapper that handles all of the edge cases properly and gives proper reporting just gets added to your own library of functions and the devs get used to using it. Much like the code for abstract data types like lists/hashmaps/etc which neither C nor the standard libraries provide.

Bonus points for having bespoke linting rules to point out the use of known “bad” functions.

In one old project we went through and replaced all instances of sprintf() with snprintf() or equivalent. Once we were happy that we’d got every occurrence we could then add lint rules to flag up any new use of sprintf() so that devs didn’t introduce new possible problems into the code.

(Obviously you can still introduce plenty of problems with snprintf() but we learned to give that more scrutiny.)

1718627440 · 1mo ago

> like lists/hashmaps/etc which neither C nor the standard libraries provide

There is a hashmap implementation though: https://man7.org/linux/man-pages/man3/hsearch.3.html

2 more replies

thomashabets2 · 1mo ago

While snprintf() is better than sprintf(), I find that it's easy for people to not check if the return value is bigger than the provided size. Sure, it prevents a buffer overflow, but there could still be a string truncation problem.

Similar to how strlcpy() is not a slam dunk fix to the strcpy() problem.
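A minimal sketch of what checking that return value looks like. `format_pair` and its "name=value" format are made up for illustration:

```c
#include <stdbool.h>
#include <stdio.h>

/* Formats "name=value" into buf. Returns false if formatting failed or
 * the output did not fit (buf then holds a truncated, NUL-terminated
 * string when size > 0). format_pair is a hypothetical helper. */
bool format_pair(char *buf, size_t size, const char *name, int value)
{
    int n = snprintf(buf, size, "%s=%d", name, value);
    /* n is the length that would have been written, excluding the NUL */
    return n >= 0 && (size_t)n < size;
}
```

The buffer overflow is gone either way; the check is what turns silent truncation into an error the caller can see.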

1 more reply

wang_li · 1mo ago

The thing I find irritating is all the folks who say C is broken because it’s not a write once run anywhere language like JavaScript or python. Part of the deal has always been that the programmer needs to understand the target platform and the target compiler’s behavior.

DowsingSpoon · 1mo ago

Write once run anywhere? But C already is a "write once run anywhere" language! Though, you usually have to recompile first :)

The criticisms related to UB are not about understanding the target platform and the target compiler's behavior. Undefined Behavior is not the same thing as Implementation-defined Behavior, and lots of folks (including me) would be satisfied with reclassifying chunks of UB as the latter.

The behavior of the target platform isn't really the issue. C23 mandates two's complement for signed integers. Most hardware wraps on overflow, but that literally doesn't matter. The standard says a program exhibiting signed overflow is undefined, period.

In practice, UB rules mean the compiler is free to remove checks for signed overflow/underflow, checks for null pointers, etc. This can and does happen. Man, just a few weeks ago, I just had to deal with a crash in a C program that turned out to be due to the compiler removing a null check. That was a painful one.

1 more reply

thomashabets2 · 1mo ago

The point of this post, though, is even something as simple as "give me this string as an integer" doesn't have an answer that doesn't come with "are you OK with this best effort parse under these edge cases? Oh and we use this number as error, so you can't parse that".

Like… edge cases? It's parsing a number! We're not talking about I/O on hard vs soft intr NFS mounts, here. There's a right answer.

strlen(), on valid null terminated strings, doesn't come with caveats like "oh we can't measure strings of length 99".

But sure, C is turing complete. It is possible to solve any problem a turing machine can solve.

> understand the target platform and the target compiler’s behavior.

This is neither. This is purely the language.

1 more reply

mswphd · 1mo ago

isn't the whole point of C that it's portable assembly though? needing to understand the target platform/compiler's behavior to write correct code seems to cut against that claim quite a bit.

1 more reply

msie · 1mo ago

The people downvoting you are probably not C programmers and love to hate C.

card_zero · 1mo ago

I guess trying to write in Rust makes them irritable.

bsenftner · 1mo ago · 5 in thread

One of the first homework assignments when I learned C back in '83 was after a long lecture on how the string functions are fundamentally broken, and the class introduction to writing C was fixing all of them.

psvv · 1mo ago

My memory growing up is that making your own C library was basically an inevitable rite of passage for any aspiring programmer.

lanstin · 1mo ago

And then your own custom allocator that would be fitted for your algorithms and vastly faster than malloc.

prerok · 1mo ago

Yeah, it's a shame we never got something like boost for C. Every company I ever worked for had its own common C library solving these problems.

ndesaulniers · 1mo ago

It's a shame we never got a package manager for C (or C++).

EDIT: perhaps I should have been clearer; by not having one early on, we now have multiple competing package managers, with no clear winner. Responses prove that point.

2 more replies

bsenftner · 1mo ago

I worked at a shop where we used Boost in a C++ code base where the only use of C++ was the harness needed to use Boost. Beyond that, it was all C, object-styled C, as that code base started before C++ compilers were anything more than a template overlay on C.

contubernio · 1mo ago · 2 in thread

One of the great virtues of C is that this sort of thing is not part of the language ...

thomashabets2 · 1mo ago

Only literally. 7.24.1 in the C programming language spec has these poor parsers.

rbanffy · 1mo ago

Is their misbehavior part of the spec as well? If not, we can always add the correct behavior to the spec and let anyone who implemented a broken version deal with fixing every program compiled using it.

1 more reply

CodesInChaos · 1mo ago · 2 in thread

Another case many integer parsing functions get wrong is that they interpret a leading 0 as an octal indicator.

That should be opt-in via a flag, if it needs to be supported at all. Unix file permissions are the only deliberate use of octal I've ever seen.

kevin_thibedeau · 1mo ago

It used to be much more common. In the 70s there was a lot of collective hesitance to use hex with its strange letter digits. Octal was the compact representation of choice.

adrianmonk · 1mo ago

Also, some very old computers had 36-bit words. Word sizes on modern computers are virtually always powers of 2, but it hasn't always been that way.

And octal is more convenient for output via 7-segment LEDs and for input via numeric keypads.

ramon156 · 1mo ago · 1 in thread

Why not look at how other languages attack this? e.g. how does "42".parse() work in rust?

Edit: https://doc.rust-lang.org/src/core/num/mod.rs.html#1537

interesting! It boils down to this

    pub const fn from_ascii_radix(src: &[u8], radix: u32) -> Result<u32, ParseIntError> {
        use self::IntErrorKind::*;
        use self::ParseIntError as PIE;

        // guard: radix must be 2..=36
        if 2 > radix || radix > 36 {
            from_ascii_radix_panic(radix);
        }

        if src.is_empty() {
            return Err(PIE { kind: Empty });
        }

        // Strip leading '+' or '-', detect sign
        // (a bare '+' or '-' with nothing after it is an error)
        // accumulate digits, checking for overflow
        Ok(result)
    }

marcosdumay · 1mo ago

It's not an overwhelmingly hard problem. There are some issues with radix signaling, exponent notation, decimal points being allowed or not, and group separators that make parsing numbers incredibly irritating. So you usually don't want to do it yourself.

But it's not hard at all. It's not even as full of small issues that you can't handle the load, like dates. It's just annoying as hell.

The problem is exclusive to C and C++. It's created by the several rounds of standardization of broken behavior.

zokier · 1mo ago · 1 in thread

I thought it was pretty well known that everything related to strings in C stdlib (including all str... functions) is bad. You just need to bring in your own string library.

bhk · 1mo ago

Not just the string-related functions. If you want robust error checking, re-entrant code, and bounds checking performed in library functions (instead of performing bespoke validations all across your code base), you have some work to do. Yes, some improvements have been tacked on over the years, but many problems ("current locale", for one) remain endemic.

In my experience, the worst part of the C standard library is not its existence, but the fact that so many developers insist on slavishly using it directly, instead of safer wrappers.

norir · 1mo ago · 1 in thread

This is not a hard thing to do without using a library. The code below is easily adapted to the unsigned case and/or arbitrary base rather than 10.

    #include <stdio.h>
    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: require one numeric argument");
        }
        char *nump = argv[1];
        unsigned neg = 0;
        unsigned long long ures = 0;
        if (*nump == '-') {
            neg = 1;
            nump = nump + 1;
        }
        if (!*nump) {
            fprintf(stderr, "require non empty string\n");
            return 1;
        }
        char b;
        while (b = *nump++) {
            if (b >= '0' && b <= '9') {
                unsigned long long nres = (ures * 10) + (b - '0'); 
                if (nres < ures) {
                    fprintf(stderr, "overflow in '%s'\n", argv[1]);
                    return 1;
                }   
                ures = nres;
            } else {
                if (b >= ' ') {
                    fprintf(stderr, "invalid char '%c' in '%s'\n", b, argv[1]); 
                } else {
                    fprintf(stderr, "invalid byte '%d' in '%s'\n", b, argv[1]);
                }
                return 1;  
            }
        }
        long long res = (long long) ures;
        if (neg) {
            if (ures <= 0x8000000000000000ULL) {
                res = -res;
            } else {
                fprintf(stderr, "underflow in '%s'\n", argv[1]);
                return 1;
            }
        } else if (ures > 0x7FFFFFFFFFFFFFFFULL) {
            fprintf(stderr, "overflow in '%s'\n", argv[1]);
            return 1;
        }
        fprintf(stdout, "result: %lld\n", res);
        return 0;
    }

wCxV8HzziQBb · 1mo ago

The bound on ures <= 0x80[...] should be either ures < 0x80[...] or ures <= 0x7F[...]. Otherwise, parsing negative `0x8000000000000000` will run code to negate the signed integer INT64_MIN (-0x80[...]) to 0x80[...], which doesn't fit in a signed 64-bit integer (INT64_MAX is 0x7F[...]).

    $ clang parseint.c -fsanitize=undefined -O0 -g -o parseint
    $ ./parseint -9223372036854775808
    parseint.c:38:23: runtime error: negation of -9223372036854775808 cannot be represented in type 'long long'; cast to an unsigned type to negate this value to itself
    result: -9223372036854775808

edit: this is just to show that getting undefined behavior right is hard!
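One way to sidestep the negation entirely is to accumulate toward the negative side, so that LLONG_MIN is reachable without ever forming its positive counterpart. This is only a sketch under my own assumptions (base 10, optional leading '-', no '+', no whitespace), and `parse_ll` is a made-up name:

```c
#include <limits.h>
#include <stdbool.h>

/* Parses an optionally '-'-prefixed decimal string into *out.
 * Returns false on empty input, stray characters, or overflow.
 * parse_ll is a hypothetical name. */
bool parse_ll(const char *s, long long *out)
{
    bool neg = (*s == '-');
    if (neg)
        s++;
    if (*s == '\0')
        return false;                    /* "" and "-" are not numbers */

    /* Accumulate as a negative value: the negative range is at least as
     * large as the positive one, so no intermediate can overflow. */
    long long acc = 0;
    for (; *s != '\0'; s++) {
        if (*s < '0' || *s > '9')
            return false;
        int digit = *s - '0';
        if (acc < (LLONG_MIN + digit) / 10)
            return false;                /* acc * 10 - digit would overflow */
        acc = acc * 10 - digit;
    }
    if (!neg) {
        if (acc < -LLONG_MAX)
            return false;                /* magnitude only fits as negative */
        acc = -acc;
    }
    *out = acc;
    return true;
}
```

The only negation left is guarded by the `-LLONG_MAX` comparison, so it cannot hit the case the sanitizer flagged above.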

jervant · 1mo ago · 1 in thread

https://man.openbsd.org/strtonum

bmandale · 1mo ago

Interestingly fails as well, in two ways. First:

> The string may begin with an arbitrary amount of whitespace (as determined by isspace(3))

Second is that it only applies to signed long long, not unsigned.

lacewing · 1mo ago · 1 in thread

There's no one correct way to parse integers. Do you want to support 0x prefixes? Is a leading zero an indicator of octal, a zero-padded decimal, or a syntax error? Are you willing to accept a leading "+"? Are leading whitespaces OK? Trailing ones? Is 0x0c a whitespace? What about all the weird Unicode ones? Do you allow exponential notation (1e1)? Etc, etc.

In every language, the standard library makes some assumptions about this. In JavaScript, an empty string parses to zero.

The standard C library, which dates back to the stone age, does the simplest thing you can do without range checking, because, well, that's kinda the C paradigm. If you want parsing that handles edge cases in a specific way, you do it yourself. It's just digits.

mike_hock · 1mo ago

> There's no one correct way to parse integers.

No, but there are a myriad of incorrect ways and the C library's way is one of them.

It's perfectly fine to make reasonable choices for all those options and then implement them correctly.

alkonaut · 1mo ago

How could an api for number parsing ever be designed to return 0 for invalid input, for a function where 0 is also a common (perhaps the most common) return value for a valid input?

This wouldn't even pass a cursory sanity check of the api from a beginner developer, how did it end up in a standard library at all? Was it a mistake and then it was just too late to remove it?

Any function that can either succeed or fail, which is basically every parsing function, must typically indicate success or failure. You can terminate the program or you can return an object that itself indicates failure (such as -1 when finding a positive index) but if ALL values of the return type CAN be valid then the success state must be a separate return value.

What's the purpose of the function atol() if it doesn't have that? Is it "It's still useful for trusted input we know is a string representation of a long" (e.g. for bounded number roundtrip)? That seems awfully limited. But perhaps such a scenario was more common in 1960?

alexfoo · 1mo ago

I remember an old project that ran into something like this. I think we just used atoi() or similar and the error check was a string comparison between the original input and a sprintf() of the converted value.

Ugly (and not performant if in a hot path) but it works.
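From memory, the round-trip idea looks something like the sketch below. It uses strtol and snprintf rather than atoi and sprintf, since atoi has undefined behavior when the value is out of range; `parse_long_roundtrip` is a made-up name and not the original project's code.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Accepts s only if printing the converted value reproduces s exactly.
 * parse_long_roundtrip is a hypothetical name. */
bool parse_long_roundtrip(const char *s, long *out)
{
    long v = strtol(s, NULL, 10);
    char buf[32];                         /* assumes long is at most 64 bits */
    int n = snprintf(buf, sizeof buf, "%ld", v);
    if (n < 0 || (size_t)n >= sizeof buf)
        return false;
    if (strcmp(buf, s) != 0)
        return false;                     /* not the canonical spelling of v */
    *out = v;
    return true;
}
```

The comparison rejects trailing garbage, empty input, and out-of-range values (strtol clamps, so the printed text differs), but it also rejects non-canonical spellings such as "042", "+5", or "-0".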

eithed · 1mo ago

Can't you use a regex to check that the given string contains just digits and then use any of the provided methods? Then check if the returned value is a number to cater for edge cases.

Ok, having a method to do that for you would be nice, but the post reads like it's an issue that std library doesn't provide you with a method behaving as you exactly want

mike_hock · 1mo ago

The problem is that float parsing is highly non-trivial if you want it to be correct for all edge cases.

For integers, you're faster (in both development time and runtime) to write your own parser than to try and assemble the pieces in this pile of shit into a half-working one.

C++17 from_chars excluded. Incidentally, 2022 seems about right for the year that ONE open source implementation finally actually implemented the float part of that. Or was it more like 2024?

fastaguy88 · 1mo ago

And yet, thousands and thousands of 'C' programs parse integers every hour successfully.

Perhaps the right title should be "No way to parse pathological edge cases in 'C'"

And then see how other languages do.

derefr · 1mo ago

> It is not OK to stop at the first sign of trouble, and return whatever maybe is right. “123timmy” is not a number, nor is the empty string.

None of the C functions referenced (atol, strtol, sscanf) are number-parsing functions per se. Rather, they're numeric-lexeme scanning+extraction functions.

These functions are all designed to avoid making any assumptions about the syntax of the larger document the numeric lexeme might be embedded in. You might, after all, be using a syntax where numbers can come with units on the end. Or you might be reading numbers as comma-separated values.

And, as a key point the author might be missing: C, in being co-designed with UNIX, offers primitives tuned for the context of:

- writing UNIX CLI tools that work with unbounded streams of input (i.e. piped output from other UNIX CLI tools),

- where, crucially, the stream is just text, and so carries no TLV-esque framing protocol to tell you the definitive length of a thing;

- and nor (especially in early memory-constrained systems) are you able to perform allocations of heap memory in order to employ an unbounded growable buffer for retaining the current lexeme until you do reach the end of it (which, if you could, would let you use a scanner state-machine that doubles as a parser/validator, returning either a parsed value or an error)

- but instead, to deal with the 1. unbounded input, 2. of textual encoding, 3. in constant memory, you must eagerly scan the input stream (i.e. synchronously reduce over each received byte, or at most each fixed-length N-byte chunk using a static or stack-allocated fixed-length buffer, discarding the original string bytes once reduced-over) to produce lexically-decoded (but not parsed/validated) lexemes; and then do this again, on a higher level, feeding your stream of lexemes into a fixed-sized sum-typed ring-buffer (i.e. an array-of-union-typed-lexeme-struct-type-entries), where you can then invoke a function that attempts to scan over + consume them (but unlike the original stream-parsing function, doesn't consume the buffer unless successful, and so isn't functioning as a scanner per se, but rather as an LR parser.)

If you're not writing UNIX CLI tools, direct use of the C-stdlib numeric-lexeme scan functions is operating on the wrong abstraction layer. What you want, if you have pre-framed strings that are "either valid numbers or parse errors", is to implement an actual parsing function... that can then invoke these numeric-lexer functions to do the majority of its work.

And if you're writing C, and yet you're not in UNIX-pipeline unbounded-text-stream land, but rather are parsing well-defined bounded-length "documents" (like, say, C source files)... then you probably want to use a real lexer-generator (like flex) to feed a parser-generator (like yacc/bison). Where:

- you'd validate the token in context, in the parsing phase;

- and your lexing rules would make certain classes of input invalid at lexing time. (E.g. you can write your lexeme matching rules such that multi-digit numbers with leading zeroes, or floating-point values with no digits before/after the decimal place, simply aren't "numbers" from your lexer's perspective.)

...which means that, once again, you can "get away with" invoking the regular C numeric-lexeme scanner functions; i.e. `yylval = atoi(yytext);` in bison terms. (And you'd want to, since doing so saves memory vs. keeping the numbers around as strings.)
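To make the "scanner, not parser" reading concrete, here is a minimal sketch of using strtol's end pointer to peel a numeric lexeme off the front of a larger string and hand the rest back to the caller. `scan_number` is a made-up name, and range checking is deliberately left out of the sketch.

```c
#include <stdlib.h>

/* Scans a leading decimal number and returns a pointer to whatever
 * follows it (e.g. a unit suffix or a comma). Returns NULL if no digits
 * were consumed. scan_number is a hypothetical name; range errors are
 * not checked here. */
const char *scan_number(const char *s, long *value)
{
    char *end;
    long v = strtol(s, &end, 10);
    if (end == s)
        return NULL;
    *value = v;
    return end;
}
```

Called on "10kg" this stores 10 and returns a pointer to "kg"; deciding whether "kg" is acceptable is then the surrounding grammar's job.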

chadgpt3 · 1mo ago

... say users of only language with no way to parse integers.

j / k navigate · click thread line to collapse

111 comments

69 comments · 18 top-level

orthoxerox1mo ago· 15 in thread

dlcarrier1mo ago

ronsor1mo ago

You don't need to parse the strings in reverse. That's for printing integers, not parsing. Roughly:

    int stdin_atoi() {
      int i = 0;
      while (1) {
        int c = getchar();
        if (c >= '0' && c <= '9') {
          i = i * 10 + (c - '0');
        } else { break; }
      }
      return i;
    }

2 more replies

pbalau1mo ago

How do you know where the first string ends and the second starts? Did you miss the "stdin" part?

This is not

    ./program first_number second_number

1 more reply

camkego1mo ago

jeffrallen1mo ago

Would be fun to write a program that arranges to send the input into dc(1) and just outsource the whole problem to Ken or Rob or whoever wrote it. :)

Henchman211mo ago

It would be fun, but were I the teacher I'd commend you for your ingenuity, and then ask you to return to your desk to complete the assignment.

BobbyTables21mo ago

That’s golden!

Would make an excellent “interview question from Hell”!

msie1mo ago

Perfect is the enemy of good.

lanstin1mo ago

Precision and exactitude and formally proven correct software can exist in some problem domains, and it's kind of silly to not achieve that when it's achievable.

chowells1mo ago

clark_dent1mo ago

Could you humor a coding noob--how do you deal with utterly insane inputs like that?

matthewkayin1mo ago

You first ask if you really need to.

1 more reply

wwalexander1mo ago

Arbitrary precision arithmetic (GMP, BigInteger, etc). Numbers can take arbitrary amounts of memory, instead of just a single machine word.

Ekaros1mo ago

At some point you can just refuse. Too many digits. Well time to quit with error.

doubled1121mo ago

Crash and report an error.

1 more reply

voidUpdate1mo ago· 11 in thread

Cant you just:

  for(int i = 0; i < len(characters); i++)
  {
    if(characters[i]-48 <= 9 && characters[i]-48 >= 0)
    {
      ret = ret * 10 + characters[i] - 48;
    }
    else
    {
      return ERROR;
    }
  }
  return ret;

Adjust until it actually works, but you get the picture.

knome1mo ago

I'm not sure what they mean by "output raw" vs "output"

    $ cat t.c
    
    #include <stdlib.h>
    #include <math.h>
    #include <stdio.h>
    
    int main(int argc, char \* argv){
    
      char * enda = NULL;
      unsigned long long a = strtoull("-18446744073709551614", &enda, 10);
      printf("in = -18446744073709551614, out = %llu\n", a);
      
      char * endb = NULL;
      unsigned long long b = strtoull("-18446744073709551615", &endb, 10);
      printf("in = -18446744073709551615, out = %llu\n", b);
      
      return 0;
    }
    $ gcc t.c
    $ ./a.out 
    in = -18446744073709551614, out = 2
    in = -18446744073709551615, out = 1
    $

I get their "output raw" value. I don't know what their "output" value is coming from.

I don't see anywhere they describe what they are representing in the raw vs not columns.

thomashabets21mo ago

That's right. I don't like asking it to parse the number contained inside a string, and getting a different number as a result.

That's just simply not the right answer.

> I'm not sure what they mean by "output raw" vs "output"

I can see how that's very unclear. Changed now to "Readable".

card_zero1mo ago

1 more reply

dlcarrier1mo ago

    if(characters[i] <= '9' && characters[i] >= '0')
    {
      ret = ret * 10 + characters[i] - '0';
    }

voidUpdate1mo ago

I was trying to remember how to do that, I forgot you can subtract '0', and was thinking that - 0 obviously wouldn't work

Sharlin1mo ago

voidUpdate1mo ago

you could check if ret > ret * 10 + characters[i]-48, if so it has wrapped around and you return an error

2 more replies

fhdkweig1mo ago

voidUpdate1mo ago

I don't use C enough to know what the convention is for throwing an error when the function can return a number anyway. You'd have to ask someone else

1 more reply

jerf1mo ago

1 more reply

bitwize1mo ago

stephc_int131mo ago· 11 in thread

As a C programmer, I find this kind of bad faith article very irritating.

Yes, the standard library is bad. This is by far the worst part of the C legacy. But it is not that hard to write your own.

String functions like this are not difficult at all, and you can use better naming and semantics, write faster code etc.

C is not the C standard library, ffs.

konmokOP1mo ago

I don't think it's in bad faith.

stephc_int131mo ago

If you read the other articles by the same author on his blog, you'll see that he has some strong and weird opinions about C and UB.

Complete BS in my opinion.

alexfoo1mo ago

Bonus points for having bespoke linting rules to point out the use of known “bad” functions.

(Obviously you can still introduce plenty of problems with snprintf() but we learned to give that more scrutiny.)

17186274401mo ago

> like lists/hashmaps/etc which neither C nor the standard libraries provide

There is a hashmap implementation though: https://man7.org/linux/man-pages/man3/hsearch.3.html
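For anyone who has not used it, a rough sketch of the POSIX `hcreate`/`hsearch`/`hdestroy` interface from `<search.h>`. The function name `hsearch_demo` and the stored key are made up for this example. Note that this interface manages a single process-wide table whose capacity is fixed at creation:

```c
#include <search.h>
#include <stddef.h>

/* Hypothetical demo: store one key in the process-wide table and read it back.
   Returns the stored value, or -1 on any failure. */
int hsearch_demo(void)
{
    static int answer = 42;

    if (hcreate(16) == 0)            /* capacity is chosen up front */
        return -1;

    ENTRY item = { .key = "answer", .data = &answer };
    if (hsearch(item, ENTER) == NULL) {
        hdestroy();
        return -1;
    }

    ENTRY query = { .key = "answer", .data = NULL };
    ENTRY *found = hsearch(query, FIND);
    int value = found ? *(int *)found->data : -1;

    hdestroy();                      /* releases the table, not the keys or data */
    return value;
}
```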

thomashabets21mo ago

Similar to how strlcpy() is not a slam dunk fix to the strcpy() problem.

wang_li1mo ago

DowsingSpoon1mo ago

Write once run anywhere? But C already is a "write once run anywhere" language! Though, you usually have to recompile first :)

thomashabets21mo ago

Like… edge cases? It's parsing a number! We're not talking about I/O on hard vs soft intr NFS mounts, here. There's a right answer.

strlen(), on valid null terminated strings, doesn't come with caveats like "oh we can't measure strings of length 99".

But sure, C is Turing complete. It is possible to solve any problem a Turing machine can solve.

> understand the target platform and the target compiler’s behavior.

This is neither. This is purely the language.

mswphd1mo ago

isn't the whole point of C that it's portable assembly though? needing to understand the target platform/compiler's behavior to write correct code seems to cut against that claim quite a bit.

msie1mo ago

The people downvoting you are probably not C programmers and love to hate C.

card_zero1mo ago

I guess trying to write in Rust makes them irritable.

bsenftner1mo ago· 5 in thread

psvv1mo ago

My memory growing up is that making your own C library was basically an inevitable rite of passage for any aspiring programmer.

lanstin1mo ago

And then your own custom allocator that would be fitted for your algorithms and vastly faster than malloc.

prerok1mo ago

Yeah, it's a shame we never got something like boost for C. Every company I ever worked for had its own common C library solving these problems.

ndesaulniers1mo ago

It's a shame we never got a package manager for C (or C++).

EDIT: perhaps I should have been clearer; by not having one early on, we now have multiple competing package managers, with no clear winner. Responses prove that point.

bsenftner1mo ago

contubernio1mo ago· 2 in thread

One of the great virtues of C is that this sort of thing is not part of the language ...

thomashabets21mo ago

Only literally. 7.24.1 in the C programming language spec has these poor parsers.

rbanffy1mo ago

CodesInChaos1mo ago· 2 in thread

Another case many integer parsing functions get wrong is that they interpret a leading 0 as an octal indicator.

That should be opt-in via a flag, if it needs to be supported at all. Unix file permissions are the only deliberate use of octal I've ever seen.
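In C's `strtol` the octal interpretation is in effect opt-in through the base argument: base 0 infers the base from the prefix, while an explicit base 10 does not. A small sketch, with wrapper names invented for the illustration:

```c
#include <stdlib.h>

/* Hypothetical wrappers, named only for this illustration. */

/* Base 0 asks strtol to infer the base from the prefix:
   "0x" means hexadecimal, a leading "0" means octal, anything else decimal. */
long parse_auto_base(const char *s)
{
    return strtol(s, NULL, 0);
}

/* An explicit base of 10 treats leading zeros as plain padding. */
long parse_decimal(const char *s)
{
    return strtol(s, NULL, 10);
}
```

Here `parse_auto_base("010")` gives 8, while `parse_decimal("010")` gives 10.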

kevin_thibedeau1mo ago

It used to be much more common. In the 70s there was a lot of collective hesitance to use hex with its strange letter digits. Octal was the compact representation of choice.

adrianmonk1mo ago

Also, some very old computers had 36-bit words. Word sizes on modern computers are virtually always powers of 2, but it hasn't always been that way.

And octal is more convenient for output via 7-segment LEDs and for input via numeric keypads.

ramon1561mo ago· 1 in thread

Why not look at how other languages attack this? e.g. how does "42".parse() work in rust?

Edit: https://doc.rust-lang.org/src/core/num/mod.rs.html#1537

interesting! It boils down to this

    pub const fn from_ascii_radix(src: &[u8], radix: u32) -> Result<u32, ParseIntError> {
        use self::IntErrorKind::*;
        use self::ParseIntError as PIE;

        // guard: radix must be 2..=36
        if 2 > radix || radix > 36 {
            from_ascii_radix_panic(radix);
        }

        if src.is_empty() {
            return Err(PIE { kind: Empty });
        }

        // Strip leading '+' or '-', detect sign
        // (a bare '+' or '-' with nothing after it is an error)
        // accumulate digits, checking for overflow
        Ok(result)
    }
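For comparison, a loose C analogue of the same shape for an unsigned 32-bit result. This is a sketch and not a translation of the Rust source: the enum, the function names, and the choice to report a bad radix as an error instead of panicking are all assumptions made here, and the digit mapping assumes an ASCII-compatible character set.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error kinds, loosely modelled on the Rust snippet above. */
typedef enum { PARSE_OK, PARSE_EMPTY, PARSE_INVALID_DIGIT, PARSE_OVERFLOW } parse_status;

/* Map one byte to its digit value, or 36 if it is not a digit or letter.
   Assumes ASCII, where 'a'..'z' and 'A'..'Z' are contiguous. */
static uint32_t digit_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return (uint32_t)(c - '0');
    if (c >= 'a' && c <= 'z') return (uint32_t)(c - 'a') + 10;
    if (c >= 'A' && c <= 'Z') return (uint32_t)(c - 'A') + 10;
    return 36;
}

/* Parse len bytes of src as an unsigned 32-bit number in radix 2..36. */
parse_status parse_u32_radix(const char *src, size_t len, uint32_t radix, uint32_t *out)
{
    if (radix < 2 || radix > 36)
        return PARSE_INVALID_DIGIT;      /* this sketch reports an error here */
    if (len == 0)
        return PARSE_EMPTY;

    size_t i = 0;
    if (src[0] == '+') {                 /* a '-' falls through as an invalid digit */
        if (len == 1)
            return PARSE_INVALID_DIGIT;  /* a bare sign is not a number */
        i = 1;
    }

    uint32_t result = 0;
    for (; i < len; i++) {
        uint32_t d = digit_value((unsigned char)src[i]);
        if (d >= radix)
            return PARSE_INVALID_DIGIT;
        if (result > (UINT32_MAX - d) / radix)   /* result * radix + d would overflow */
            return PARSE_OVERFLOW;
        result = result * radix + d;
    }
    *out = result;
    return PARSE_OK;
}
```

The caller passes an explicit length, so the input does not need to be NUL-terminated, and every failure mode has its own return value.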

marcosdumay1mo ago

But it's not hard at all. It's not even so full of small issues that you can't handle the load, the way dates are. It's just annoying as hell.

The problem is exclusive to C and C++. It's created by the several rounds of standardization of broken behavior.

zokier1mo ago· 1 in thread

I thought it was pretty well known that everything related to strings in C stdlib (including all str... functions) is bad. You just need to bring in your own string library.

bhk1mo ago

In my experience, the worst part of the C standard library is not its existence, but the fact that so many developers insist on slavishly using it directly, instead of safer wrappers.

norir1mo ago· 1 in thread

This is not a hard thing to do without using a library. The code below is easily adapted to the unsigned case and/or arbitrary base rather than 10.

    #include <stdio.h>
    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: require one numeric argument");
        }
        char *nump = argv[1];
        unsigned neg = 0;
        unsigned long long ures = 0;
        if (*nump == '-') {
            neg = 1;
            nump = nump + 1;
        }
        if (!*nump) {
            fprintf(stderr, "require non empty string\n");
            return 1;
        }
        char b;
        while (b = *nump++) {
            if (b >= '0' && b <= '9') {
                unsigned long long nres = (ures * 10) + (b - '0'); 
                if (nres < ures) {
                    fprintf(stderr, "overflow in '%s'\n", argv[1]);
                    return 1;
                }   
                ures = nres;
            } else {
                if (b >= ' ') {
                    fprintf(stderr, "invalid char '%c' in '%s'\n", b, argv[1]); 
                } else {
                    fprintf(stderr, "invalid byte '%d' in '%s'\n", b, argv[1]);
                }
                return 1;  
            }
        }
        long long res = (long long) ures;
        if (neg) {
            if (ures <= 0x8000000000000000ULL) {
                res = -res;
            } else {
                fprintf(stderr, "underflow in '%s'\n", argv[1]);
                return 1;
            }
        } else if (ures > 0x7FFFFFFFFFFFFFFFULL) {
            fprintf(stderr, "overflow in '%s'\n", argv[1]);
            return 1;
        }
        fprintf(stdout, "result: %lld\n", res);
        return 0;
    }

wCxV8HzziQBb1mo ago

    $ clang parseint.c -fsanitize=undefined -O0 -g -o parseint
    $ ./parseint -9223372036854775808
    parseint.c:38:23: runtime error: negation of -9223372036854775808 cannot be represented in type 'long long'; cast to an unsigned type to negate this value to itself
    result: -9223372036854775808

edit: this is just to show that getting undefined behavior right is hard!
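What the sanitizer is flagging: for the input `-9223372036854775808` the magnitude is 2^63, the cast to `long long` already yields the minimum value on typical platforms, and negating that minimum overflows. One way to sidestep the negation is to special-case that magnitude. A sketch, assuming two's complement (`LLONG_MIN == -LLONG_MAX - 1`); `apply_sign` is a hypothetical helper, not code from the comment above:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: combine a magnitude and a sign flag into a long long
   without ever negating LLONG_MIN. Returns false if the value does not fit. */
bool apply_sign(unsigned long long magnitude, bool negative, long long *out)
{
    const unsigned long long max_pos = (unsigned long long)LLONG_MAX;

    if (!negative) {
        if (magnitude > max_pos)
            return false;
        *out = (long long)magnitude;
        return true;
    }

    if (magnitude > max_pos + 1)     /* below LLONG_MIN */
        return false;
    if (magnitude == max_pos + 1) {  /* exactly LLONG_MIN: no negation needed */
        *out = LLONG_MIN;
        return true;
    }
    *out = -(long long)magnitude;    /* magnitude <= LLONG_MAX, so this is safe */
    return true;
}
```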

jervant1mo ago· 1 in thread

https://man.openbsd.org/strtonum

bmandale1mo ago

Interestingly fails as well, in two ways. First:

> The string may begin with an arbitrary amount of whitespace (as determined by isspace(3))

Second is that it only applies to signed long long, not unsigned.

lacewing1mo ago· 1 in thread

In every language, the standard library makes some assumptions about this. In JavaScript, converting an empty string with Number("") yields zero.

mike_hock1mo ago

> There's no one correct way to parse integers.

No, but there are a myriad of incorrect ways and the C library's way is one of them.

It's perfectly fine to make reasonable choices for all those options and then implement them correctly.

alkonaut1mo ago

How could an api for number parsing ever be designed to return 0 for invalid input, for a function where 0 is also a common (perhaps the most common) return value for a valid input?

This wouldn't even pass a cursory sanity check of the api from a beginner developer, how did it end up in a standard library at all? Was it a mistake and then it was just too late to remove it?
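The ambiguity is easy to reproduce: `atoi("0")` and `atoi("abc")` both return 0. With `strtol`, the end pointer at least tells the two apart. A small sketch (`strtol_converted` is a name invented here); it deliberately does nothing about trailing junk or overflow, which need their own checks:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: true only if strtol consumed at least one character. */
bool strtol_converted(const char *s, long *out)
{
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (end == s)      /* nothing was converted: "abc", "", "-" */
        return false;
    *out = v;
    return true;
}
```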

alexfoo1mo ago

Ugly (and not performant if in a hot path) but it works.

eithed1mo ago

Can't you use a regex to check that the given string contains only digits, and then use any of the provided methods? Then check whether the returned value is a number, to cater for edge cases.

Ok, having a method to do that for you would be nice, but the post reads like it's an issue that the standard library doesn't provide a method that behaves exactly as you want.
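A rough sketch of the regex pre-check using POSIX `<regex.h>`. The helper name `looks_like_digits` and the pattern are choices made for this example, and compiling the pattern on every call is only for brevity:

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s consists of one or more ASCII digits. */
bool looks_like_digits(const char *s)
{
    regex_t re;
    if (regcomp(&re, "^[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return false;                       /* pattern failed to compile */
    bool ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

The pre-check only covers syntax. A string such as `"99999999999999999999999"` passes it and still does not fit in any built-in integer type, so the range check on the conversion itself (for example `errno == ERANGE` after `strtol`) is still needed.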

mike_hock1mo ago

The problem is that float parsing is highly non-trivial if you want it to be correct for all edge cases.

For integers, you're faster (in both development time and runtime) to write your own parser than to try and assemble the pieces in this pile of shit into a half-working one.

C++17 from_chars excluded. Incidentally, 2022 seems about right for the year that ONE open source implementation finally actually implemented the float part of that. Or was it more like 2024?

fastaguy881mo ago

And yet, thousands and thousands of 'C' programs parse integers every hour successfully.

Perhaps the right title should be "No way to parse pathological edge cases in 'C'"

And then see how other languages do.

derefr1mo ago

> It is not OK to stop at the first sign of trouble, and return whatever maybe is right. “123timmy” is not a number, nor is the empty string.

None of the C functions referenced (atol, strtol, sscanf) are number-parsing functions per se. Rather, they're numeric-lexeme scanning+extraction functions.

And, as a key point the author might be missing: C, in being co-designed with UNIX, offers primitives tuned for the context of:

- writing UNIX CLI tools that work with unbounded streams of input (i.e. piped output from other UNIX CLI tools),

- where, crucially, the stream is just text, and so carries no TLV-esque framing protocol to tell you the definitive length of a thing;

- you'd validate the token in context, in the parsing phase;
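To illustrate the scanning view described above: `strtol` hands back a pointer to the first unconsumed character, so a caller can pull one numeric lexeme after another out of a longer buffer and decide what to do with the rest. A sketch under that reading (`sum_leading_integers` is a hypothetical function, and overflow handling is left out to keep it short):

```c
#include <stdlib.h>

/* Hypothetical example: sum the integers at the start of a whitespace-separated
   string, stopping at the first thing strtol cannot convert. *rest is set to the
   unconsumed remainder. Overflow of the values or of the total is not checked. */
long sum_leading_integers(const char *s, const char **rest)
{
    long total = 0;
    for (;;) {
        char *end = NULL;
        long v = strtol(s, &end, 10);
        if (end == s)        /* no numeric lexeme at this position */
            break;
        total += v;
        s = end;             /* continue right after the consumed lexeme */
    }
    *rest = s;
    return total;
}
```

For `"12 34 56 timmy"` this returns 102 and leaves `rest` pointing at `" timmy"`, which the caller can then accept or reject in context.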

chadgpt31mo ago

... say users of the only language with no way to parse integers.
