undefined | Better HN

0 pointsColinWright6y ago0 comments

Reproducing what your code does in the most simple, naive C program possible, I can beat the existing wc utility, taking only around 40% of the time that wc takes.

So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.

Quite the opposite.

So I look forward to a Haskell version that supports everything the wc has so we can get a fair comparison.

0 comments

0xd34df00d6y ago

> So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.

That's precisely what the second part would be about.

And, if I succeed, IMO, that's where Haskell would really shine (because composability and local reasoning), and where I would be able to claim to achieve something — the stuff in the post we're discussing is indeed trivial (I didn't want to say that in the post itself though as I think it'll look like I'm belittling the guy who did the original post), while modularizing this is _fun_.

ColinWrightOP6y ago

FWIW, my version, now compiled with -O3, is 8 times faster than wc. And I haven't even tried to optimise it.

I look forward to your results.

eesmith6y ago

Here's the source code for GNU wc - https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;... .

It's easy to see that it's not the result of 10+ years of low-level optimizations to eek out the most performance.

Your test code probably hits the MB_CUR_MAX>1 path at line 361. (Check your locale setting!)

The main loop is:

   402   if (!in_shift && is_basic (*p))
   403     {
   404       /* Handle most ASCII characters quickly, without calling
   405          mbrtowc().  */
   406       n = 1;
   407       wide_char = *p;
   408       wide = false;
   409     }
    ...
   443   switch (wide_char)
   444     {
   445     case '\n':
   446       lines++;
   447       FALLTHROUGH;
   448     case '\r':
   449     case '\f':
   450       if (linepos > linelength)
   451         linelength = linepos;
   452       linepos = 0;
   453       goto mb_word_separator;
   454     case '\t':
   455       linepos += 8 - (linepos % 8);
   456       goto mb_word_separator;
   457     case ' ':
   458       linepos++;
   459       FALLTHROUGH;
   460     case '\v':
   461     mb_word_separator:
   462       words += in_word;
   463       in_word = false;
   464       break;
   465     default:
   466       if (wide && iswprint (wide_char))
                  ....
   480       else if (!wide && isprint (to_uchar (*p)))
   481         {
   482           linepos++;
   483           if (isspace (to_uchar (*p)))
   484             goto mb_word_separator;
   485           in_word = true;
   486         }
   487       break;
   488     }

This is a much more complicated implementation than your code. Among other things, note how it uses isprint/iswprint on each character, and how these are locale dependent.

Even in when character = byte, the main loop uses the same logic:

   555     default:
   556       if (isprint (to_uchar (p[-1])))
   557         {
   558           linepos++;
   559           if (isspace (to_uchar (p[-1]))
   560               || isnbspace (to_uchar (p[-1])))
   561             goto word_separator;
   562           in_word = true;
   563         }
   564       break;
   565     }

Your benchmark only uses the characters:

    "\n\r ',-./0123456789ABCDEFGHIJKLMNOPQRSTUVYZabcdefghijklmnopqrstuvwxyz"

which means it comes nowhere near being a good test which verifies the two programs do the same thing.

The following should be a more difficult test set to reproduce wc's output. I create the test set with Python:

    with open("testset.dat", "wb") as f:
      for b1 in range(256):
        for b2 in range(256):
          for b3 in range(256):
            _ = f.write(bytes((b1, b2, b3)))

then print the output using two different locales:

    % env LANG=en_US.UTF-8 wc testset.dat
      196608 1523713 50331648 testset.dat
    % env LANG=C wc testset.dat
      196608 1152001 50331648 testset.dat

0xd34df00d6y ago

This is a great point about handling printable vs non-printable characters that I originally missed when I read wc code. Thank you for pointing this out!

1 more reply

j / k navigate · click thread line to collapse

0 comments

0xd34df00d6y ago

> So until you put in locale handling, alternate line endings, option handling, and error handling, I don't see that your post is at all convincing.

That's precisely what the second part would be about.

ColinWrightOP6y ago

FWIW, my version, now compiled with -O3, is 8 times faster than wc. And I haven't even tried to optimise it.

I look forward to your results.

eesmith6y ago

Here's the source code for GNU wc - https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;... .

It's easy to see that it's not the result of 10+ years of low-level optimizations to eek out the most performance.

Your test code probably hits the MB_CUR_MAX>1 path at line 361. (Check your locale setting!)

The main loop is:

   402   if (!in_shift && is_basic (*p))
   403     {
   404       /* Handle most ASCII characters quickly, without calling
   405          mbrtowc().  */
   406       n = 1;
   407       wide_char = *p;
   408       wide = false;
   409     }
    ...
   443   switch (wide_char)
   444     {
   445     case '\n':
   446       lines++;
   447       FALLTHROUGH;
   448     case '\r':
   449     case '\f':
   450       if (linepos > linelength)
   451         linelength = linepos;
   452       linepos = 0;
   453       goto mb_word_separator;
   454     case '\t':
   455       linepos += 8 - (linepos % 8);
   456       goto mb_word_separator;
   457     case ' ':
   458       linepos++;
   459       FALLTHROUGH;
   460     case '\v':
   461     mb_word_separator:
   462       words += in_word;
   463       in_word = false;
   464       break;
   465     default:
   466       if (wide && iswprint (wide_char))
                  ....
   480       else if (!wide && isprint (to_uchar (*p)))
   481         {
   482           linepos++;
   483           if (isspace (to_uchar (*p)))
   484             goto mb_word_separator;
   485           in_word = true;
   486         }
   487       break;
   488     }

This is a much more complicated implementation than your code. Among other things, note how it uses isprint/iswprint on each character, and how these are locale dependent.

Even in when character = byte, the main loop uses the same logic:

   555     default:
   556       if (isprint (to_uchar (p[-1])))
   557         {
   558           linepos++;
   559           if (isspace (to_uchar (p[-1]))
   560               || isnbspace (to_uchar (p[-1])))
   561             goto word_separator;
   562           in_word = true;
   563         }
   564       break;
   565     }

Your benchmark only uses the characters:

    "\n\r ',-./0123456789ABCDEFGHIJKLMNOPQRSTUVYZabcdefghijklmnopqrstuvwxyz"

which means it comes nowhere near being a good test which verifies the two programs do the same thing.

The following should be a more difficult test set to reproduce wc's output. I create the test set with Python:

    with open("testset.dat", "wb") as f:
      for b1 in range(256):
        for b2 in range(256):
          for b3 in range(256):
            _ = f.write(bytes((b1, b2, b3)))

then print the output using two different locales:

    % env LANG=en_US.UTF-8 wc testset.dat
      196608 1523713 50331648 testset.dat
    % env LANG=C wc testset.dat
      196608 1152001 50331648 testset.dat

0xd34df00d6y ago

This is a great point about handling printable vs non-printable characters that I originally missed when I read wc code. Thank you for pointing this out!

1 more reply

j / k navigate · click thread line to collapse