String tokenization in C

(onebyezero.blogspot.com)

155 points | throwaway2419 | 7y ago | 114 comments

The actions of strtok can easily be coded using strspn and strcspn.

https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2001]

https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2011 repost]

strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.

The bag is like a one-character regex class: strspn(s, "abcd") calculates the length of the token at the front of input s matching the regex [abcd]*, and strcspn(s, "abcd") does the same for [^abcd]*.
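As a rough sketch of how the two calls combine into a tokenizer: strspn skips the leading delimiters and strcspn measures the token that follows. The helper name next_token and its cursor-style interface are assumptions made for illustration, not something from the linked posts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name and interface are illustrative only).
 * Finds the next token at or after *cursor without modifying the string.
 * Returns a pointer to the token's first character and stores its length
 * in *len, or returns NULL when only delimiters (or nothing) remain. */
static const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    assert(cursor != NULL && *cursor != NULL && delims != NULL && len != NULL);

    /* Skip the prefix made only of delimiter characters. */
    const char *start = *cursor + strspn(*cursor, delims);
    *cursor = start;
    if (*start == '\0')
        return NULL;

    /* The token is the prefix made only of non-delimiter characters. */
    *len = strcspn(start, delims);
    *cursor = start + *len;
    return start;
}
```

Because the token is reported as a pointer plus a length, the caller can print it with printf("%.*s", (int)len, tok) or copy it out, and the input can stay a const char *.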

saagarjha 7y ago

And it’s nicer, since you can pass in a const char * and use it in concurrent code.

jstimpfle 7y ago

strtok is one of the silliest parts of the standard library. (And there are many bad ones). It's broken. It's not thread safe (yes there is strtok_r). It's needlessly hard to use. And it writes zeros to the input array. The latter means it's unfit for most use cases, including non-trivial tokenization where you want e.g. to split "a+1" into three tokens.

If you program in C please just write those four obvious lines yourself.
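For the "a+1" case, one possible shape of those few lines is sketched here. It assumes operators are single characters from a made-up set "+-*/" and that anything else is an operand; the name token_length is hypothetical. The delimiter comes back as a token of its own, which strtok cannot do because it overwrites the delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operator set, chosen only for this illustration. */
static const char OPS[] = "+-*/";

/* Returns the length of the token at the start of s: exactly 1 for an
 * operator character, otherwise the length of the run of non-operator
 * characters. Returns 0 at the end of the string. The input is not modified. */
static size_t token_length(const char *s)
{
    assert(s != NULL);

    if (*s == '\0')
        return 0;
    if (strchr(OPS, *s) != NULL)
        return 1;
    return strcspn(s, OPS);
}
```

A caller would walk the input with something like for (size_t n; (n = token_length(s)) != 0; s += n), which visits "a", "+" and "1" in turn.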

yason 7y ago

> If you program in C please just write those four obvious lines yourself.

Those lines are not necessarily obvious: there are several pitfalls to avoid, which is why strtok() is much longer than four lines. As standard library functions go, strtok() has well-defined behaviour that is easy to reason about, and it comes near-magically close to the string-splitting convenience of scripting languages.

In contrast, a truly sickening part of stdlib is converting strings to numbers. The atoi()/atol() family doesn't check for errors at all, so you want to use strtol(). But error checking in strtol() is so complex that the man page has a specific example of how to do it correctly. All sane programmers quickly write a clean wrapper around strtol() to encode the complexity once. Now, strtok() is nothing like that.
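A wrapper of that kind might look roughly like the sketch below. The name parse_long and its policy (base 10 only, whole string must be consumed) are assumptions for illustration; a real project may want different rules.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical wrapper: the name and the strictness rules are assumptions.
 * Parses all of s as a base-10 long. Returns true and stores the value in
 * *out on success; returns false if no digits were found, if characters
 * follow the number, or if the value does not fit in a long. */
static bool parse_long(const char *s, long *out)
{
    assert(s != NULL && out != NULL);

    char *end;
    errno = 0;                       /* strtol only sets errno on failure */
    long value = strtol(s, &end, 10);

    if (end == s)                    /* nothing was converted */
        return false;
    if (*end != '\0')                /* trailing characters after the number */
        return false;
    if (errno == ERANGE)             /* overflow or underflow */
        return false;

    *out = value;
    return true;
}
```

Note that strtol itself skips leading whitespace, so this sketch accepts " 42" but rejects "42 ".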

In its simplicity, strtok() is quite versatile. A few strtok() calls can easily parse lines like:

    keyword=value1, value2, value3

that you might find in configuration files. And I mean truly in just a few lines, which you might expect in Python, but with C string handling? No.
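A minimal sketch of such a parse, assuming the line sits in a writable buffer and that '=' ends the keyword while commas and spaces separate values. The function name parse_config_line and its signature are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits a writable line of the form
 * "keyword=value1, value2, ..." in place using strtok.
 * Stores the keyword in *key and up to max_values pointers in values[].
 * Returns the number of values stored, or -1 if the line is empty or
 * consists only of '=' characters. The line is modified (NUL bytes are
 * written over delimiters) and strtok's hidden state is used, so this
 * is not reentrant. */
static int parse_config_line(char *line, char **key, char *values[], int max_values)
{
    assert(line != NULL && key != NULL && values != NULL);

    *key = strtok(line, "=");
    if (*key == NULL)
        return -1;

    int count = 0;
    char *value;
    /* Passing NULL continues from where the previous call stopped. */
    while (count < max_values && (value = strtok(NULL, ", ")) != NULL)
        values[count++] = value;
    return count;
}
```

For the line above this yields the key "keyword" and the three values "value1", "value2" and "value3".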

jstimpfle 7y ago

Here is the musl implementation.

> https://github.com/esmil/musl/blob/master/src/string/strtok....

It's a bit longer than 4 lines because strtok does things you should not want. If you insist on parsing that configuration line with strtok, go ahead and write that brittle code. It breaks as soon as you want empty strings (try "keyword=value1, , value3" with strtok) or escape sequences or other transformations, or as soon as you want to do something as basic as parsing from a stream instead of a string that is completely in memory.
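The empty-string problem can be seen by counting fields two ways. This is only a sketch; both function names are hypothetical. The first counts what strtok reports, where a run of delimiters collapses into one separator. The second treats every single delimiter as a separator, so empty fields survive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts tokens as strtok sees them. Adjacent delimiters collapse, so
 * empty fields disappear. Needs a writable buffer because strtok
 * overwrites delimiters with NUL bytes. */
static int count_fields_strtok(char *s, const char *delims)
{
    assert(s != NULL && delims != NULL);

    int n = 0;
    for (char *t = strtok(s, delims); t != NULL; t = strtok(NULL, delims))
        n++;
    return n;
}

/* Counts fields when each delimiter character separates two fields, so
 * "a,,c" has three fields, the middle one empty. Does not modify s. */
static int count_fields_strict(const char *s, const char *delims)
{
    assert(s != NULL && delims != NULL);

    int n = 1;
    for (s += strcspn(s, delims); *s != '\0'; s += strcspn(s, delims)) {
        s++;    /* step over exactly one delimiter */
        n++;
    }
    return n;
}
```

On "a,,c" with "," as the delimiter, the strtok version counts two fields while the strict version counts three.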

So to clarify, of course you are never done with parsing in 4 lines. But even if overwriting the input string weren't so braindead, the functionality strtok provides would not be worth more than 4 lines.