So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.
Quite the opposite.
So I look forward to a Haskell version that supports everything wc has, so we can get a fair comparison.
That's precisely what the second part would be about.
And, if I succeed, IMO that's where Haskell would really shine (because of composability and local reasoning), and where I could claim to have achieved something. The stuff in the post we're discussing is indeed trivial (I didn't want to say that in the post itself, as I think it would look like I'm belittling the guy who wrote the original post), while modularizing this is _fun_.
I look forward to your results.
It's easy to see that it's not the result of 10+ years of low-level optimizations to eke out the most performance.
Your test code probably hits the MB_CUR_MAX>1 path at line 361. (Check your locale setting!)
The main loop is:
402   if (!in_shift && is_basic (*p))
403     {
404       /* Handle most ASCII characters quickly, without calling
405          mbrtowc(). */
406       n = 1;
407       wide_char = *p;
408       wide = false;
409     }
...
443   switch (wide_char)
444     {
445     case '\n':
446       lines++;
447       FALLTHROUGH;
448     case '\r':
449     case '\f':
450       if (linepos > linelength)
451         linelength = linepos;
452       linepos = 0;
453       goto mb_word_separator;
454     case '\t':
455       linepos += 8 - (linepos % 8);
456       goto mb_word_separator;
457     case ' ':
458       linepos++;
459       FALLTHROUGH;
460     case '\v':
461     mb_word_separator:
462       words += in_word;
463       in_word = false;
464       break;
465     default:
466       if (wide && iswprint (wide_char))
....
480       else if (!wide && isprint (to_uchar (*p)))
481         {
482           linepos++;
483           if (isspace (to_uchar (*p)))
484             goto mb_word_separator;
485           in_word = true;
486         }
487       break;
488     }
This is a much more complicated implementation than your code. Among other things, note how it uses isprint/iswprint on each character, and how these are locale dependent.
Even when character = byte, the main loop uses the same logic:
555     default:
556       if (isprint (to_uchar (p[-1])))
557         {
558           linepos++;
559           if (isspace (to_uchar (p[-1]))
560               || isnbspace (to_uchar (p[-1])))
561             goto word_separator;
562           in_word = true;
563         }
564       break;
565     }
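To make the byte-path logic above concrete, here is a rough Python sketch of just the word counting. It is a simplification under stated assumptions, not a port of wc: it assumes the C locale, where isprint() is true only for bytes 0x20 through 0x7E; it leaves out isnbspace() and the line-length bookkeeping; and it assumes a word still open at end of input is counted. The function name count_words_c_locale is made up for this example.

```python
# Bytes that the quoted switch statement treats as word separators.
SEPARATORS = b"\n\r\f\t \v"


def count_words_c_locale(data: bytes) -> int:
    """Count words roughly the way the quoted byte-path loop does.

    Assumption: C locale, so only 0x20..0x7E counts as printable.
    """
    words = 0
    in_word = False
    for byte in data:
        if byte in SEPARATORS:
            # A separator closes the current word, if there is one.
            words += in_word
            in_word = False
        elif 0x20 <= byte <= 0x7E:
            # A printable, non-space byte starts or continues a word.
            in_word = True
        # Any other byte (control or high-bit) neither starts
        # nor ends a word.
    # Assumption: a word still open at end of input is counted.
    return words + in_word
```

Under these assumptions a control byte inside a word does not split it, so b"a\x01b" counts as one word, and input made only of high-bit bytes counts as zero words. A naive "split on whitespace" implementation gets both of those cases wrong.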
Your benchmark only uses the characters: "\n\r ',-./0123456789ABCDEFGHIJKLMNOPQRSTUVYZabcdefghijklmnopqrstuvwxyz"
which means it comes nowhere near being a good test of whether the two programs do the same thing.
The following should be a more difficult test set for reproducing wc's output. I create the test set with Python:
with open("testset.dat", "wb") as f:
    for b1 in range(256):
        for b2 in range(256):
            for b3 in range(256):
                _ = f.write(bytes((b1, b2, b3)))
then print the output using two different locales:
% env LANG=en_US.UTF-8 wc testset.dat
196608 1523713 50331648 testset.dat
% env LANG=C wc testset.dat
196608 1152001 50331648 testset.dat