https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2001]
https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2011 repost]
strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.
The bag is like a one-character regex class; so that is to say strspn(s, "abcd") is like calculating the length of the token at the front of input s matching the regex [abcd]* , and in the case of strcspn, that becomes [^abcd]* .
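To make the regex analogy concrete, here is a minimal sketch. The helper names digit_prefix_len and key_prefix_len are made up for illustration; only strspn() and strcspn() come from the standard library.

```c
#include <string.h>

/* Length of the leading run of decimal digits, like matching [0-9]* . */
size_t digit_prefix_len(const char *s)
{
    return strspn(s, "0123456789");
}

/* Length of the leading run of characters that are neither '=' nor
   blank, like matching [^= \t]* . */
size_t key_prefix_len(const char *s)
{
    return strcspn(s, "= \t");
}
```

For example, digit_prefix_len("123abc") is 3 and key_prefix_len("keyword=value") is 7. Neither call modifies its input.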
If you program in C please just write those four obvious lines yourself.
Those lines are not necessarily obvious. There are several pitfalls to avoid, which is why strtok() is much longer than four lines. As standard library functions go, strtok() has well-defined behaviour that is easy to reason about, and it comes near-magically close to the string-splitting convenience of scripting languages.
In contrast, a truly sickening part of stdlib is converting strings to numbers. The atoi()/atol() family doesn't check for errors at all, so you want to use strtol(). But the way error checking works in strtol() is so complex that the man page has a specific example of how to do it correctly. All sane programmers quickly write a clean wrapper around strtol() to encode the complexity once. Now, strtok() is nothing like that.
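For anyone who hasn't written that wrapper yet, a rough sketch of what it tends to look like follows. The name parse_long and its return convention are my own assumptions, not anything from the man page. It rejects empty input, trailing junk and out-of-range values, but still accepts the leading whitespace that strtol() itself skips.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a whole string as a base-10 long.
   Returns 0 on success and stores the value in *out; returns -1 if the
   string has no digits, has trailing characters, or is out of range. */
int parse_long(const char *s, long *out)
{
    char *end;
    long v;

    errno = 0;                  /* strtol() only sets errno on failure */
    v = strtol(s, &end, 10);
    if (end == s)               /* no digits consumed */
        return -1;
    if (*end != '\0')           /* trailing characters after the number */
        return -1;
    if (errno == ERANGE)        /* overflow or underflow */
        return -1;
    *out = v;
    return 0;
}
```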
In its simplicity, strtok() is quite versatile. A few strtok() calls can easily parse lines like:
keyword=value1, value2, value3
that you might find in configuration files. And I mean truly in just a few lines, which you might expect in Python, but with C string handling? No.
> https://github.com/esmil/musl/blob/master/src/string/strtok....
It's a bit longer than 4 lines because strtok does things you should not want. If you insist on parsing that configuration line with strtok, go ahead and write that brittle code. It breaks as soon as you want empty strings (try "keyword=value1, , value3" with strtok) or escape sequences or other transformations, or as soon as you want to do something as basic as parsing from a stream instead of a string that is completely in memory.
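To see the empty-string problem in practice, here is a small sketch of the "few strtok() calls" approach. The helper name split_config_line and its signature are hypothetical; the only library function used is strtok().

```c
#include <string.h>

/* Hypothetical helper: split "key=v1, v2, v3" in place with strtok().
   Stores the key in *key and up to max_values value pointers in values[].
   Returns the number of values found, or -1 if there is no key.
   The line is modified: delimiters are overwritten with '\0'. */
int split_config_line(char *line, char **key, char *values[], int max_values)
{
    int n = 0;
    char *tok;

    *key = strtok(line, "=");
    if (*key == NULL)
        return -1;
    /* Every ',' and ' ' is a delimiter, and runs of delimiters collapse. */
    while (n < max_values && (tok = strtok(NULL, ", ")) != NULL)
        values[n++] = tok;
    return n;
}
```

Given a writable buffer holding "keyword=value1, , value3", this returns 2 with the values "value1" and "value3". The empty field in the middle is silently dropped, because strtok() treats the whole ", , " run as a single separator.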
So to clarify, of course you are never done with parsing in 4 lines. But even if it wasn't as braindead to overwrite the input string, the functionality strtok provides would not be worth more than 4 lines.
I worked on a project a few years ago that read its custom-format config file in line by line, chopped everything off each line following the first '#' character (to support comments), and then trimmed the whitespace. This sounds like a reasonable and elegant approach until you consider that now none of your user-controlled fields (set via a GUI in our case) can contain the '#' character. This affected customers, but nobody ever fixed it.
With the tools and languages out there now, there's just no excuse for this crap.
> keyword=value1, value2, value3
The challenge with parsing isn’t parsing correct inputs; it’s generating useful error messages and recovering on incorrect inputs such as keyword=,,value1, value2, value3,,,,
or even =keyword=,,value1, value2, value3,,,,
strtok isn’t the best tool for doing that. (Yes, those could be valid inputs, but if they are, chances are they should be parsed differently.)
The plain truth is that string handling in C is a huge pain in the ass no matter how you look at it. Splitting, concatenating, regex-ing... All of that is a huge pain in C. If you need to write a high-performance parser then it might be worth it, but if you're just parsing a fancy command line format and performance doesn't matter, it's just incredibly frustrating and error-prone.
Rust fares better here because its str type is not NUL-terminated but actually keeps track of the length separately which makes it significantly more flexible and versatile. Of course you could do that in C but you'll be incompatible with any code dealing with native C-strings.
And of course you make one mistake and you have a buffer overflow vulnerability...
So yeah, if you program in C please use strtok_r if applicable; otherwise consider offloading the parsing to another part of your application written in a language better suited for that, and hand over binary data to the C library. If everything else fails then consider handwriting your parser, and may god be with you. Oh, and if your grammar is complex enough to warrant it, there's always lex/yacc.
It is. And it’s not even only C's fault. 80% of it is bad API design. Strings could be accepted as a struct consisting of a pointer and a length, aka string_view. And there could be some manipulation functions around it. That would make those APIs a lot more flexible (one no longer needs to care whether things are null terminated, and there would be fewer pointless copies).
For these reasons my estimate in the meantime is that the average C program which uses stdlib functions is less efficient than an implementation in another language, even though the authors would claim otherwise (it's C, it must be fast).
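A rough sketch of what such a pointer-plus-length type and one manipulation function could look like in C. Everything here (struct str_view, sv_from_cstr, sv_split_once, sv_equals) is invented for illustration and is not part of any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer-plus-length string type. */
struct str_view {
    const char *ptr;
    size_t len;
};

/* Wrap a NUL-terminated string without copying it. */
struct str_view sv_from_cstr(const char *s)
{
    struct str_view v = { s, strlen(s) };
    return v;
}

/* Split off the part of *rest before the first occurrence of delim.
   *rest is advanced past the delimiter; if no delimiter is found, the
   whole remainder is returned and *rest becomes empty.
   Nothing is copied and the underlying bytes are never modified. */
struct str_view sv_split_once(struct str_view *rest, char delim)
{
    struct str_view head = *rest;
    const char *hit = rest->len ? memchr(rest->ptr, delim, rest->len) : NULL;

    if (hit == NULL) {
        rest->ptr += rest->len;
        rest->len = 0;
        return head;
    }
    head.len = (size_t)(hit - rest->ptr);
    rest->len -= head.len + 1;
    rest->ptr = hit + 1;
    return head;
}

/* Compare a view against a NUL-terminated string. */
int sv_equals(struct str_view v, const char *s)
{
    size_t n = strlen(s);
    return v.len == n && memcmp(v.ptr, s, n) == 0;
}
```

Splitting "a,,b" on ',' with this yields "a", an empty view, and "b", so empty fields survive, and the input can be a string literal because nothing is written to it.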
Not entirely, see https://github.com/antirez/sds
Basically, you have a header storing length, etc, but still null terminate, so library functions like strlen are none the wiser.
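A stripped-down illustration of the header trick, for the curious. This is not the sds API; the names lstr_new, lstr_len and lstr_free and the single-field header are assumptions made to keep the sketch short.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the character data. */
struct lstr_hdr {
    size_t len;
};

/* Allocate a length-prefixed copy of s. The returned pointer addresses the
   characters, not the header, and the data is still NUL-terminated.
   Returns NULL on allocation failure. */
char *lstr_new(const char *s)
{
    size_t len = strlen(s);
    struct lstr_hdr *h = malloc(sizeof *h + len + 1);

    if (h == NULL)
        return NULL;
    h->len = len;
    memcpy(h + 1, s, len + 1);   /* copies the terminating '\0' too */
    return (char *)(h + 1);
}

/* O(1) length lookup: step back from the data to the header. */
size_t lstr_len(const char *s)
{
    return ((const struct lstr_hdr *)s - 1)->len;
}

/* Free using the address malloc() originally returned. */
void lstr_free(char *s)
{
    if (s != NULL)
        free((struct lstr_hdr *)s - 1);
}
```

A pointer returned by lstr_new() can be handed to strlen(), strcmp() or printf("%s") unchanged, but it must be released with lstr_free(), not free().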
libc has somehow managed to hit the sweet spot and have APIs that are both inconvenient to use properly, and perform poorly.
If you pass strtok_r a const string it can and will bus fault on some systems. This happens when it tries to write a '\0' to the input string. Being an old crusty firmware guy I'm not sailing on the good cargo cult ship HMS Immutability, but generating side effects in your input data stream is terrible.
There is no way to back up/undo when using strtok_r. When your parsing involves a decision tree that kinda sucks.
Other issues with strtok() aside, this seems like a silly reason to discount a standard library function. If you don't want your input munged you can strdup() it. It's rare to find a C program that's so specialized that the performance hit of a strdup() would be unacceptable in a case where strtok() could otherwise have been used.
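A minimal sketch of the copy-then-tokenize approach, under the assumption that all you need is a token count. The helper name count_tokens_const is made up. The copy is done with malloc() + memcpy() so the sketch stays within ISO C; on POSIX systems strdup() does the same thing in one call.

```c
#include <stdlib.h>
#include <string.h>

/* Count the tokens in a read-only string by tokenizing a private copy.
   Returns -1 if the copy cannot be allocated. The input is never
   written to, so it may be a string literal. */
int count_tokens_const(const char *input, const char *delims)
{
    size_t size = strlen(input) + 1;
    char *copy = malloc(size);
    int n = 0;

    if (copy == NULL)
        return -1;
    memcpy(copy, input, size);
    /* strtok() overwrites delimiters in copy, not in input. */
    for (char *tok = strtok(copy, delims); tok != NULL; tok = strtok(NULL, delims))
        n++;
    free(copy);
    return n;
}
```

Note that any token pointers point into the copy, so a real caller that keeps the tokens has to keep the copy alive and free it later.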
    Token tok;
    start_token(&tok);
    for (;;) {
        int c = look_next_char();
        if (('A' <= c && c <= 'Z') ||
            ('a' <= c && c <= 'z')) { /* or whatever test */
            consume_char();
            add_to_token(&tok, c);
        } else {
            break;
        }
    }
    end_token(&tok);
Done. There's no point in going through a weird API.
    strcpy(str, "abc,def,ghi");
    token = strtok(str, ",");
    printf("%s \n", token);
Even if the author knows how many tokens are returned, I would prefer a check for NULL here, since a good fraction of readers might not read further than this bad example.
It is perfectly OK for example code to be unsafe. You do not wear a parachute when you learn to fly using a simulator. You realize that things will become more serious and complicated in the future, but you have to start with something simple and unsafe, no big deal. Otherwise you will never see the consequences of unsafe code in simple cases.
Regular expressions don't show up outside the spec, sure; but if you're writing the code (for an implicit state machine), you need to know exactly where you are in the regular language that defines the tokens to write good code. Writing a regex matcher in code like this is like writing code in assembly: mentally, you're mapping to a different set of concepts all the time.
Speak for yourself. There’s a POSIX standard for regex that is more than 30 years old and a GNU implementation that ships with glibc. C++ has regex in the standard library.
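For reference, a small sketch of the POSIX interface (regcomp, regexec, regfree from <regex.h>). The helper name ident_prefix_len and the pattern are just an illustration, and compiling the pattern on every call is wasteful; real code would compile it once.

```c
#include <regex.h>
#include <stddef.h>

/* Return the length of an identifier ([a-z][a-z0-9]*) at the start of s,
   0 if s does not start with one, or -1 if the pattern fails to compile. */
long ident_prefix_len(const char *s)
{
    regex_t re;
    regmatch_t m[1];
    long len = 0;

    if (regcomp(&re, "^[a-z][a-z0-9]*", REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, s, 1, m, 0) == 0)      /* 0 means a match was found */
        len = (long)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```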
Typical pattern:
    while (p < eof && isspace((unsigned char)*p)) // [ ]*
        ++p;
    if (p == eof) return EOF;
    start = p;                           // token text begins after the skipped whitespace
    if (is_ident_start(*p)) {            // [a-z]
        ++p;
        while (p < eof && is_ident(*p))  // [a-z0-9]*
            ++p;
        set_token(start, p - start);
        return IDENT;
    } else if (is_number(*p)) {          // [0-9]
        ++p;
        while (p < eof && is_number(*p)) // [0-9]*
            ++p;
        set_token(start, p - start);
        return NUMBER;
    } // etc.
Corresponds to:
    IDENT ::= [a-z][a-z0-9]* ;
    NUMBER ::= [0-9][0-9]* ;
    SPACE ::= [ ]* ;
    TOKEN ::= SPACE (IDENT | NUMBER) ;
Inline those nonterminals, and guess what: a regular expression!
A little example of some Rexx code with some string parsing is in
Well, there is already a thread-safe variant [0]:
> The strtok() function uses a static buffer while parsing, so it's not thread safe. Use strtok_r() if this matters to you.
A function can be not thread-safe and still safe to use in single-threaded programs.
The point is that strtok is not a good choice even for single-threaded code.
[0] https://www.gnu.org/software/libc/manual/html_node/Finding-T...
I wonder why strtok() does not use an output parameter similar to scanf() — and return the number of tokens. Something like:
int strtok(char *str, char *delim, char **tokens);
Granted, it would involve dynamic memory allocation, and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it’s worth eliminating the kind of bugs the current strtok() can introduce?
Does anyone here have the historical perspective?
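To give the idea some shape, here is one way such a function might look. This is purely a sketch: the names split_tokens and free_tokens are hypothetical, and the output parameter is a char *** rather than the char ** in the prototype above, since the function has to hand back a freshly allocated array. It keeps strtok()'s habit of skipping empty fields but never touches the input.

```c
#include <stdlib.h>
#include <string.h>

/* Release an array produced by split_tokens(). */
void free_tokens(char **tokens, int count)
{
    for (int i = 0; i < count; i++)
        free(tokens[i]);
    free(tokens);
}

/* Hypothetical alternative to strtok(): split str on any character in
   delim and store a malloc'd array of malloc'd tokens in *tokens_out.
   Empty fields are skipped, as strtok() does. The input is not modified.
   Returns the token count, or -1 on allocation failure. */
int split_tokens(const char *str, const char *delim, char ***tokens_out)
{
    char **tokens = NULL;
    int count = 0;

    for (;;) {
        str += strspn(str, delim);          /* skip leading delimiters */
        if (*str == '\0')
            break;
        size_t len = strcspn(str, delim);   /* length of this token */
        char **grown = realloc(tokens, (size_t)(count + 1) * sizeof *tokens);
        char *tok = malloc(len + 1);
        if (grown != NULL)
            tokens = grown;
        if (grown == NULL || tok == NULL) {
            free(tok);
            free_tokens(tokens, count);
            return -1;
        }
        memcpy(tok, str, len);
        tok[len] = '\0';
        tokens[count++] = tok;
        str += len;
    }
    *tokens_out = tokens;   /* NULL when there are no tokens */
    return count;
}
```

The cost is one allocation per token plus the array, which is exactly the inefficiency mentioned above, and the caller now owns memory that has to be released with free_tokens().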
str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
strcpy(str,TESTSTRING);
str = strdup(TESTSTRING)?