It's great to support Unicode, but I don't think \d should have been extended this way. Add a \ud or something instead.
$ echo '0' | perl -pe 'print "yes: " if m/\d/'
yes: 0
$ echo '੧' | perl -pe 'print "yes: " if m/\d/'
੧

$ echo '੧' | perl -C -pe 'print "yes: " if m/\d/'
and
$ perl -e 'use utf8; print "yes\n" if "੧" =~ m/\d/;'
both work :)

php > var_export(preg_match("/\d/", "1"));
1
php > var_export(preg_match("/\d/", "۳"));
0
php > var_export(preg_match("/\d/u", "۳"));

utf8::upgrade($string)
And/or: use feature 'unicode_strings'

is_numeric('١٣٦٨') // -> false
preg_match('/\d/', '١٣٦٨') // -> no match / false
filter_var('١٣٦٨', FILTER_VALIDATE_INT) // -> false
Which I'm thankful for. I should hope that most people understand base-10 and ASCII numbers. I don't want to have to worry about properly validating/handling Unicode characters when parsing numbers.

var_dump(preg_match('/\d/u', '١٣٦٨')) -> 1
var_dump(preg_match('/\d/', '١٣٦٨')) -> 0

The problem is that regular expressions are packed full of counterintuitive idiosyncrasies which make perfect sense once they're explained, but are far from obvious. Take this for example:
s/(^\s+|\s+$)//g
is slower than running two separate regexes, like so:
s/^\s+//;
s/\s+$//;
So it does make me wonder how many bugs have been introduced into software by bad regexes.

s/^\s+?(.*)\s+?$/$1/g

IMO you should be running something for much longer, to protect against random short spurious events, e.g. a task reschedule or interrupts, which could add significant variance. It wouldn't hurt to add a few more zeros to the loop and wait a minute for the results.
My rule of thumb is that benchmarks which are supposed to be bound by the processor and memory should last at least 60 seconds.
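To make that advice concrete, here is a rough Python sketch of a repeat-and-take-the-fastest benchmark for the two trimming approaches. The function names, the sample string, and the loop counts are illustrative assumptions, and Python's regex engine is not Perl's, so the numbers say nothing about the Perl claim above. It only shows the measuring technique.

```python
import re
import timeit

# Pre-compiled patterns mirroring the two Perl variants discussed above:
# one alternation versus two separate substitutions.
BOTH_ENDS = re.compile(r"^\s+|\s+$")
LEADING = re.compile(r"^\s+")
TRAILING = re.compile(r"\s+$")


def trim_single(text):
    """Trim both ends with a single alternation pattern."""
    return BOTH_ENDS.sub("", text)


def trim_double(text):
    """Trim leading whitespace, then trailing whitespace, in two passes."""
    return TRAILING.sub("", LEADING.sub("", text))


def best_time(func, sample, number, repeat):
    """Return the fastest of `repeat` runs, each calling func `number` times."""
    timer = timeit.Timer(lambda: func(sample))
    return min(timer.repeat(repeat=repeat, number=number))


if __name__ == "__main__":
    sample = "   " + "x" * 200 + "   "
    # Deliberately small counts so the sketch finishes quickly. For a real
    # measurement, raise `number` until each run lasts long enough to drown
    # out scheduler noise.
    for name, func in (("single", trim_single), ("double", trim_double)):
        print(name, best_time(func, sample, number=2000, repeat=3))
```

Taking the minimum over several repeats is a complement to running longer, not a replacement for it: a short run can still be dominated by one-off events.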
>>> u'١٣٦٨'.isdigit()
True
>>> int(u'١٣٦٨')
1368
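For anyone who wants the stricter behaviour in Python 3, here is a minimal sketch. It assumes that only ASCII 0-9 should be accepted. parse_ascii_int is a made-up helper name, not a standard library function; re.ASCII is the standard flag that limits \d to [0-9].

```python
import re

# With re.ASCII, \d matches only 0-9. Without it, \d in a str pattern
# matches any Unicode decimal digit.
ASCII_DIGITS = re.compile(r"\d+", re.ASCII)


def parse_ascii_int(text):
    """Hypothetical helper: convert text to int only if it is all ASCII digits."""
    if ASCII_DIGITS.fullmatch(text) is None:
        # !a escapes non-ASCII characters so lookalike digits are visible in logs.
        raise ValueError(f"not an ASCII integer: {text!a}")
    return int(text)


if __name__ == "__main__":
    print(parse_ascii_int("1368"))
    # Arabic-Indic digits for 1368, and '5' followed by U+ABF8.
    for candidate in ("\u0661\u0663\u0666\u0668", "5\uabf8"):
        try:
            parse_ascii_int(candidate)
        except ValueError as exc:
            print("rejected:", exc)
```

Checking the string before calling int() matters because int() on its own accepts any Unicode decimal digit.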
I suppose this could potentially be abused if you store and display something that is supposed to be used as a number as Unicode text, but later convert it to a number. E.g. an online shop where you are asked whether you want to pay '5꯸' for some item, which looks like 5 plus some weird square, but is really int(u'5꯸') => 58 -- http://www.fileformat.info/info/unicode/char/abf8/index.htm

$ echo "ä" | LC_COLLATE=C grep '[a-z]'
$ echo "ä" | LC_COLLATE=en_US.UTF-8 grep '[a-z]'
ä
For common values of LC_COLLATE, the range [a-z] does not exclude accented characters and umlauts.

require 'benchmark'
def random_string(length)
  # Build a string of random uppercase ASCII letters...
  result = (1..length).map { (65 + rand(26)).chr }.join
  # ...and, about half the time, overwrite one random position with a digit.
  result[rand(length)] = rand(10).to_s if rand > 0.5
  result
end

Benchmark.bmbm do |b|
  b.report("\\d") do
    (1..1000).count { random_string(1000).match(/\d/) }
  end
  b.report("[0-9]") do
    (1..1000).count { random_string(1000).match(/[0-9]/) }
  end
  b.report("[0123456789]") do
    (1..1000).count { random_string(1000).match(/[0123456789]/) }
  end
end
gives:
~/Code/ruby% ruby regex.rb
Rehearsal ------------------------------------------------
\d             0.690000   0.000000   0.690000 (  0.712500)
[0-9]          0.690000   0.000000   0.690000 (  0.703990)
[0123456789]   0.680000   0.010000   0.690000 (  0.705759)
--------------------------------------- total: 2.070000sec

                  user     system      total        real
\d             0.710000   0.000000   0.710000 (  0.791722)
[0-9]          0.700000   0.000000   0.700000 (  0.708210)
[0123456789]   0.690000   0.010000   0.700000 (  0.713355)