\d less efficient than [0-9]

(stackoverflow.com)

252 points · mserdarsanli · 13y ago · 75 comments

Aurel1us · 13y ago · 9 in thread

Short answer: \d includes all the Unicode characters from http://www.fileformat.info/info/unicode/category/Nd/list.htm

ars · 13y ago

Is that actually a good thing? If I'm using \d to validate numbers (for example, to check before a string-to-int conversion, or for an IP address, phone number, or any other use), other Unicode digits are not helpful to me.

It's great to support Unicode, but I don't think \d should have been extended this way. Add a \ud or something.

Tuna-Fish · 13y ago

Given that the category is specifically "decimal digit", I think it's good, so long as the number parsing code accepts them all too.

rmc · 13y ago

Yes, it's a good thing. There are other places in the world that don't just use ASCII. If you want European-style numbers, just use [0-9].

bellbind · 13y ago

If you use a Perl regex engine (5.14 or later), you can add the /a modifier, which restricts \d, \s, and \w to ASCII matches.
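To make the distinction concrete, here is a minimal Python 3 sketch (Python is an assumption here; the linked question is about C#). In Python's `re` module, \d on a str pattern matches any Unicode decimal digit, while [0-9] or the re.ASCII flag limits matching to ASCII. The sample characters and the `classify` helper are illustrative only.

```python
import re

# Three ways to write "a digit" in Python's re module.
UNICODE_DIGIT = re.compile(r"\d")            # any Unicode decimal digit (category Nd)
ASCII_RANGE = re.compile(r"[0-9]")           # explicit ASCII range
ASCII_DIGIT = re.compile(r"\d", re.ASCII)    # \d restricted to ASCII by a flag


def classify(ch):
    """Return which of the three patterns match the single character ch."""
    return (
        UNICODE_DIGIT.fullmatch(ch) is not None,
        ASCII_RANGE.fullmatch(ch) is not None,
        ASCII_DIGIT.fullmatch(ch) is not None,
    )


# ASCII seven, ARABIC-INDIC DIGIT THREE, FULLWIDTH DIGIT SEVEN
samples = ["7", "\u0663", "\uff17"]
results = {ch: classify(ch) for ch in samples}

for ch, flags in results.items():
    print(repr(ch), flags)
```

Only the plain \d pattern accepts the non-ASCII digits; the other two behave identically to each other.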

chebucto · 13y ago

Maybe specify the subset of unicode you're expecting in the headers, and have the compiler do the nitty gritty?

wging · 13y ago

...at least in C# regexes.

ars · 13y ago

Anyone know if this happens in other languages?

hkmurakami · 13y ago

Oh wow, I had no idea that "full width digits" (U+FF10 to U+FF19) can actually be handled properly.

coldtea · 13y ago

Or improperly. If you expect \d to be a shorthand for 0-9, your string can also contain junk.

rwmj · 13y ago · 7 in thread

I was a bit surprised that Perl does not seem to be matching Unicode digits. Anyone know why?

    $ echo '0' | perl -pe 'print "yes: " if m/\d/'
    yes: 0
    $ echo '੧' | perl -pe 'print "yes: " if m/\d/'
    ੧

xonea · 13y ago

You have to tell perl to expect utf8 from stdin (switch -C).

  $ echo '੧' | perl -C -pe 'print "yes: " if m/\d/'
  and
  $ perl -e 'use utf8; print "yes\n" if "੧" =~ m/\d/;'

both work :)

pooriaazimi · 13y ago

`man perlunicode` is chock-full of UTF-8-related stuff (and it's looong): http://perldoc.perl.org/5.14.0/perlunicode.html

damncabbage · 13y ago

Ditto PHP:

  php > var_export(preg_match("/\d/", "1"));
  1
  php > var_export(preg_match("/\d/", "۳"));
  0

jpiasetz · 13y ago

Add /u

    php > var_export(preg_match("/\d/u", "۳"));

bellbind · 13y ago

It does. See http://ideone.com/Q1lf1M

ars · 13y ago

Try:

    utf8::upgrade($string)

And/or:

    use feature 'unicode_strings'

netfeed · 13y ago

The documentation says \d should match if you use /u on the regex

0x0 · 13y ago · 4 in thread

I wonder what kind of security vulnerabilities could be looming in validators not expecting non-ascii 0-9 digits and using this regex?

duaneb · 13y ago

I'm betting quite a few. People should use library number parsers, even if they reject all non-[0-9].
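As a rough illustration of that advice, here is a Python 3 sketch of a strict wrapper around the built-in parser. The helper name `parse_ascii_int` is hypothetical. Python's own `int()` accepts any Unicode decimal digits, so the wrapper checks for ASCII-only input first and only then delegates.

```python
import re

# Optional sign followed by ASCII digits only; [0-9] avoids Unicode-aware \d.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")


def parse_ascii_int(text):
    """Parse text as an integer, rejecting anything but ASCII digits.

    fullmatch() is used instead of '^...$' so that a trailing newline
    is not silently accepted.
    """
    if _ASCII_INT.fullmatch(text) is None:
        raise ValueError("not an ASCII integer: %r" % (text,))
    return int(text)


arabic_indic = "\u0661\u0663\u0666\u0668"  # the '١٣٦٨' string quoted in this thread

print(parse_ascii_int("1368"))
print(int(arabic_indic))  # the library parser alone accepts these digits
try:
    parse_ascii_int(arabic_indic)
except ValueError as exc:
    print("rejected:", exc)
```

Whether rejecting or accepting non-ASCII digits is the right policy depends on the application; the point is to make the choice explicit rather than inherit it from \d.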

hdragomir · 13y ago

That's exactly what I was thinking!

trebor · 13y ago

As of PHP 5.3, PHP-powered software is safe, since these all reject non-ASCII digits:

    is_numeric('١٣٦٨') // -> false
    preg_match('/\d/', '١٣٦٨') // -> no match / false
    filter_var('١٣٦٨', FILTER_VALIDATE_INT) // -> false

I'm thankful for that. I should hope that most people understand base-10 and ASCII numbers. I don't want to have to worry about properly validating or handling Unicode characters when parsing numbers.

0x0 · 13y ago

In PHP 5.4.15, I get:

  var_dump(preg_match('/\d/u', '١٣٦٨')) -> 1
  var_dump(preg_match('/\d/', '١٣٦٨'))  -> 0

laumars · 13y ago · 3 in thread

Regex is a really powerful tool, but sometimes I wonder just how well people actually understand it. The vast majority of people (myself included) seem to be self-taught in the syntax, only learning the bits they need as and when they need them.

The problem is, regular expressions are packed full of counterintuitive idiosyncrasies which make perfect sense once they're explained, but are far from obvious. Take this for example:

    s/(^\s+|\s+$)//g

is slower than running two separate regexes, like so:

    s/^\s+//;
    s/\s+$//;

So it does make me wonder about the number of bugs that have been introduced to software by bad regexes.

xonea · 13y ago

The speed difference is bigger than I would have expected: about one order of magnitude in Perl with a simple test script: http://ideone.com/Yso23W

rhizome · 13y ago

Use the s/, Luke

    s/^\s+?(.*)\s+?$/$1/g

conroe64 · 13y ago

That wouldn't work. First, it will grab only one whitespace character at the beginning and one at the end. Second, if there is whitespace at the beginning or the end but not both, it won't match at all. "s/^\s*(.*?)\s*$/$1/g" would work.
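For anyone who wants to try the trimming comparison outside Perl, here is a rough Python 3 sketch of the combined and two-pass approaches. The function names are made up for illustration. Timings depend heavily on the engine and the input, so the script prints whatever it measures and no numbers are claimed here.

```python
import re
import timeit

# One pattern with alternation versus two anchored patterns.
COMBINED = re.compile(r"^\s+|\s+$")
LEADING = re.compile(r"^\s+")
TRAILING = re.compile(r"\s+$")


def trim_combined(text):
    """Strip leading and trailing whitespace with a single alternation."""
    return COMBINED.sub("", text)


def trim_two_pass(text):
    """Strip leading whitespace, then trailing whitespace, in two passes."""
    return TRAILING.sub("", LEADING.sub("", text))


# Padded sample with many interior spaces.
sample = "   " + "word " * 200 + "   "

# Both approaches should agree with each other and with str.strip().
assert trim_combined(sample) == trim_two_pass(sample) == sample.strip()

if __name__ == "__main__":
    for fn in (trim_combined, trim_two_pass):
        seconds = timeit.timeit(lambda: fn(sample), number=1000)
        print(fn.__name__, round(seconds, 4))
```

In practice `str.strip()` is the idiomatic choice in Python; the regex versions are here only to mirror the patterns under discussion.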

joosters · 13y ago · 2 in thread

The quoted benchmarks all complete in fractions of a second. Not a good sign. They may be reliable results, performed accurately, but why risk it?

IMO you should be running something for much longer, to protect against random short spurious events: a task reschedule, interrupts, etc. could add significant variance. It wouldn't hurt to add a few more zeros to the loop and wait a minute for the results.

stephencanon · 13y ago

Those events are all on the order of microseconds, far shorter than the benchmark duration. A tenth of a second is an eternity on a modern CPU.

scott_s · 13y ago

One of those events, yes. But it's possible for the system to be experiencing a bursty workload unrelated to your benchmark, and many of those events may happen. There are also startup effects at the high level (the VM, which in this case is .NET), the medium level (major and minor page faults), and the low level (caches).

My rule of thumb is that benchmarks which are supposed to be bound by the processor and memory should last at least 60 seconds.
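One way to act on these concerns is to warm up first, repeat the measurement several times, and look at the spread instead of a single run. A small Python 3 sketch follows. The `bench` helper and its default parameters are assumptions chosen so the script finishes quickly; raise `number` and `repeat` to approach the longer runs suggested above.

```python
import re
import statistics
import timeit


def bench(pattern, text, repeat=5, number=200):
    """Time compiled.search(text) and return (best, median) in seconds.

    The pattern is compiled once, one warm-up batch is discarded, and the
    timed batch is repeated so a single noisy run does not decide the result.
    """
    compiled = re.compile(pattern)

    def run():
        return compiled.search(text)

    timeit.timeit(run, number=number)  # warm-up, result discarded
    timings = timeit.repeat(run, repeat=repeat, number=number)
    return min(timings), statistics.median(timings)


# Worst case for a digit search: the only digit is at the very end.
haystack = "x" * 1000 + "5"

if __name__ == "__main__":
    for pattern in (r"\d", r"[0-9]"):
        best, median = bench(pattern, haystack)
        print(pattern, "best", best, "median", median)
```

Reporting both the minimum and the median makes it easier to see when background noise, not the regex, is driving a difference.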

hdragomir · 13y ago · 2 in thread

in C#

hdragomir · 13y ago

Here are some results in Javascript: http://jsperf.com/digit-regex

jeltz · 13y ago

It does not match it in Ruby or PostgreSQL (which uses a modified version of the Tcl regexp implementation).

fleitz · 13y ago · 1 in thread

The test code creates a new regex every time; it would be interesting to see how it works with a compiled and reused regex.

joosters · 13y ago

It compiles it once and then matches it against 10000 strings.

belper · 13y ago · 1 in thread

Interesting to see these missing, which are 1 and 1, respectively: [一, 壹]

anonymous · 13y ago

I think that's because they are treated as words, rather than digits. The same way that \d won't match "eins".
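The Unicode database makes this distinction visible. A short Python 3 sketch, where the `digit_report` helper is hypothetical: the ideograph for "one" is in general category Lo (Letter, other) and not Nd (Number, decimal digit), so \d skips it even though the database records a numeric value for it.

```python
import re
import unicodedata


def digit_report(ch):
    """Summarise how the Unicode database and \\d see one character."""
    return {
        "category": unicodedata.category(ch),
        "matches_backslash_d": re.search(r"\d", ch) is not None,
        # numeric() returns the default (None) when no value is recorded.
        "numeric_value": unicodedata.numeric(ch, None),
    }


report_ideograph = digit_report("\u4e00")  # CJK ideograph for "one"
report_gurmukhi = digit_report("\u0a67")   # GURMUKHI DIGIT ONE

print(report_ideograph)
print(report_gurmukhi)
```

So the ideograph carries the value 1 without being a positional decimal digit, which is the property \d keys on.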

dbbolton · 13y ago · 1 in thread

What is the need for those "mathematical monospace/bold/sans" characters? Shouldn't that be a font issue?

claudius · 13y ago

Fonts are about different representations of the same symbols. Monospace/bold/sans are different symbols in mathematics.

Erwin · 13y ago

Python's methods on unicode strings also apply this logic. E.g.:

    >>> u'١٣٦٨'.isdigit()
    True

    >>> int(u'١٣٦٨')
    1368

I suppose this could potentially be abused if you store and display as Unicode text something that is supposed to be used as a number, but later convert it to a number. E.g. an online shop where you are asked whether you want to pay '5꯸' for some item, which looks like 5 plus some weird square but is really int(u'5꯸') => 58 -- http://www.fileformat.info/info/unicode/char/abf8/index.htm
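A possible guard against that scenario, sketched in Python 3: flag any decimal digit outside ASCII before the text is shown or converted. The `non_ascii_digits` helper is hypothetical and only illustrates the check; it is not taken from any library.

```python
import unicodedata

ASCII_DIGITS = "0123456789"


def non_ascii_digits(text):
    """Return (char, Unicode name) for each decimal digit outside ASCII 0-9."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ch.isdecimal() and ch not in ASCII_DIGITS
    ]


price = "5\uabf8"  # ASCII 5 followed by U+ABF8, as in the example above

suspicious = non_ascii_digits(price)
value = int(price)  # int() accepts the mixed-script string

print(suspicious)
print(value)
```

A caller could refuse input whenever the returned list is non-empty, or convert it to ASCII and display the converted form so that the user sees the number that will actually be used.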

foobar__ · 13y ago

The fact that character ranges like [a-z] can depend on the value of LC_COLLATE is also something not many people are aware of.

  $ echo "ä" | LC_COLLATE=C grep '[a-z]'
  $ echo "ä" | LC_COLLATE=en_US.UTF-8 grep '[a-z]'
  ä

For common values of LC_COLLATE, the range [a-z] does not exclude accented characters and umlauts.

jnotarstefano · 13y ago

There seems to be a tiny bit of difference in Ruby too. This code:

    require 'benchmark'

    def random_string(length)
      result = (1..length).map { (65+rand(26)).chr }.join
      result[rand(length)] = rand(10).to_s if rand > 0.5
      result
    end

    Benchmark.bmbm do |b|
      b.report("\\d") do 
        (1..1000).count { random_string(1000).match(/\d/) }     
      end

      b.report("[0-9]") do 
        (1..1000).count { random_string(1000).match(/[0-9]/) }
      end

      b.report("[0123456789]") do 
        (1..1000).count { random_string(1000).match(/[0123456789]/) }
      end
    end

gives:

    ~/Code/ruby% ruby regex.rb
    Rehearsal ------------------------------------------------
    \d             0.690000   0.000000   0.690000 (  0.712500)
    [0-9]          0.690000   0.000000   0.690000 (  0.703990)
    [0123456789]   0.680000   0.010000   0.690000 (  0.705759)
    --------------------------------------- total: 2.070000sec
    
                       user     system      total        real
    \d             0.710000   0.000000   0.710000 (  0.791722)
    [0-9]          0.700000   0.000000   0.700000 (  0.708210)
    [0123456789]   0.690000   0.010000   0.700000 (  0.713355)

justanotherbody · 13y ago

Noted here that restricting \d to match only [0-9] makes \d the more efficient of the two: http://stackoverflow.com/a/16622773/1943429

ams6110 · 13y ago

I tend to use ranges (e.g. [0-9]) as they seem to me to be more standard than the token for "any digit" (often \d, but in elisp (Emacs) it's [:digit:])

conchulio · 13y ago

Maybe the order in which the Regexes are evaluated is also important due to caching etc. Has anyone tested if results are different when changing the order?
