But as someone who actually knows [some flavours of] regex fairly well, what I would really like is a reference that covers all the subtle differences between the various regex engines, along with community-managed documentation (perhaps wiki pages) of which applications & API versions use which flavour of regex.
For example, the other day I wanted to run a find on my NAS. I needed to use a regex, but the Busybox version of find doesn't support the iregex option, so all expressions are case-sensitive. With some googling, I was able to find out that the default regex type is Emacs, but I wasn't able to find either a good reference for exactly what Emacs regex does and doesn't support or any information about how to set the "i" flag. In the end I had to manually convert every character into a class (like [aA] for "a"), which was tedious, but quicker than trying to find a better solution or resorting to grep.
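The manual workaround above (one bracket class per letter) is easy to script. A minimal sketch in Python, assuming the target engine treats only `. * + ? [ ^ $ \` as special outside brackets, as Emacs-style syntax does; the helper name `caseless_classes` is made up for illustration:

```python
# Characters assumed to be special outside bracket expressions
# in Emacs-style regex syntax (an assumption of this sketch).
EMACS_SPECIALS = set(".*+?[^$\\")


def caseless_classes(literal: str) -> str:
    """Turn a literal string into a pattern that matches it in any letter case."""
    parts = []
    for ch in literal:
        if ch.isascii() and ch.isalpha():
            # "a" becomes "[aA]", "B" becomes "[bB]".
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        elif ch in EMACS_SPECIALS:
            # Backslash-escape characters that would otherwise be operators.
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return "".join(parts)


print(caseless_classes("report.pdf"))
# prints [rR][eE][pP][oO][rR][tT]\.[pP][dD][fF]
```

Since GNU find's `-regex` matches against the whole path, a pattern built this way would usually need a `.*/` prefix before being passed to find.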
A related, annoyingly common pattern is that the documentation for `find` states that `-regex` specifies a regex, but it does not state which flavour of regex. The documentation for certain versions of `find`, which support alternative engines, notes that the default is Emacs. From this I was able to infer (perhaps wrongly) that the Busybox `find` uses Emacs-flavoured regex, but ultimately I still had to resort to some trial and error. This problem is all too common in API documentation.
Python flavor would probably be different than PCRE, which is probably different than JS flavor.
Even worse is that it might be too late to standardize all the regex flavors because there is already so much written in different regex flavors that it just costs too much for them to become obsolete in the future.
This is really demotivating.
1) Learn PCRE regex. 2) Try regex golf or crosswords to learn PCRE regex. 3) Take the quiz on regex101.
Once you're done with all 3:
Learn the minor/major differences in the other languages. There aren't many. For example this named capture group:
(?<somename>someregex)
Would look like this in a different language:
(?P<somename>someregex)
There are some differences in what each language can and cannot do, like recursion, because someone thought it was a great idea to make JavaScript awful at regex, but that's beside the point. Regex is totally worth learning.
Clear your afternoon, and just learn it. Seriously, it takes a couple of hours at best and then - BOOM - you're done for the rest of your life.
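The named-group difference mentioned above is mechanical enough to convert automatically. A rough sketch in Python, whose `re` module uses the `(?P<name>...)` spelling; `to_python_named_groups` is a hypothetical helper, and it does not try to handle an escaped `\(` that happens to be followed by `?<`:

```python
import re

# "(?<name>" with an identifier-like name. Lookbehinds "(?<=" and "(?<!"
# do not match because "=" and "!" cannot start a name.
_NAMED_GROUP = re.compile(r"\(\?<([A-Za-z_]\w*)>")


def to_python_named_groups(pattern: str) -> str:
    """Rewrite (?<name>...) groups into the (?P<name>...) spelling."""
    return _NAMED_GROUP.sub(r"(?P<\1>", pattern)


converted = to_python_named_groups(r"(?<year>\d{4})-(?<month>\d{2})")
print(converted)
# prints (?P<year>\d{4})-(?P<month>\d{2})
print(re.match(converted, "2020-03").group("year"))
# prints 2020
```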
Emacs regexps are unfortunately their own weird beast - they handle parentheses differently than other regexp engines, because Emacs assumes that you'll be running regexps on Lisp code a lot and want to easily match parentheses. The best documentation on that syntax is (confusingly) in the Elisp reference manual: https://www.gnu.org/software/emacs/manual/html_node/elisp/Sy....
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
Can easily be converted like:
(->> "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\\b"
pcre-to-elisp xr)
To:
(seq word-boundary
(one-or-more
(any "0-9A-Z" "%+._-"))
"@"
(one-or-more
(any "0-9A-Z" ".-"))
not-newline
(repeat 2 4
(any "A-Z"))
word-boundary)
1. Stick to using the lowest common denominator like you did for case insensitivity.
2. If that becomes too cumbersome, then consider whether regex is the right tool for the job. Maybe you can use e.g Python/your favorite language with a known regex standard.
3. If there are no other tools and you're stuck with whatever flavor of regex one particular thing supports, only then invest time in learning the details. There is probably a book out there with the details even if there's no webpage.
Then pray you never get to step 3 :)
It's not so bad going between JS, Ruby and Elixir regex (possibly due to my use of a smaller set of features), but Vim regex disappoints me time after time.
It's how I learned regex years ago, and I still use it today to test/build more complex patterns.
2 points:
1. it fiddled with my back button which is a bit annoying
2. a better email sample is
^[^@]+@[^@]+\.[^@]+$
which removes the problem of two @ signs.
So sundar.pichai@google is technically a valid address (whether .google has any MX records is another matter)
Regex shouldn't really be used for email addresses anyway because the only reliable way to authenticate an email address is to literally send an email to that address.
i.e. johndoe@com will never exist
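To see how the two email patterns from this thread differ in practice, here is a small Python comparison. It is only a sketch: the names `STRICT`, `LOOSE` and `check` are made up, the anchors (`\b`, `^`, `$`) are dropped because `fullmatch` already requires the whole string to match, and the first pattern is assumed to be meant for case-insensitive use:

```python
import re

# Pattern quoted earlier in the thread (letters-only TLD of 2 to 4 characters).
STRICT = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.IGNORECASE)
# Looser suggestion: exactly one "@", and a dot somewhere after it.
LOOSE = re.compile(r"[^@]+@[^@]+\.[^@]+")


def check(address: str) -> tuple[bool, bool]:
    """Return (accepted by STRICT, accepted by LOOSE)."""
    return bool(STRICT.fullmatch(address)), bool(LOOSE.fullmatch(address))


for address in [
    "john@example.com",
    "john@example.museum",   # TLD longer than 4 letters
    "a@@example.com",        # two @ signs
    "sundar.pichai@google",  # no dot after the @
]:
    print(address, check(address))
```

Neither pattern says anything about whether the mailbox exists, which is why sending a confirmation email remains the only real check.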
I think I know what's wrong with your back button. I will fix it.
As for the regex, I will try it out and see if I can add it.
grabs popcorn
Relies on ?(DEFINE): http://p3rl.org/perlre#(DEFINE)
> To sum up: RegEx's are misnamed. I think it's a shame, but it won't change. Compatible 'RegEx' engines are not allowed to reject non-regular languages. They therefore cannot be implemented correctly with only Finite State Machines. The powerful concepts around computational classes do not apply. Use of RegEx's does not ensure O(n) execution time. The advantages of RegEx's are terse syntax and the implied domain of character recognition. To me, this is a slow moving train wreck, impossible to look away, but with horrible consequences unfolding.
Your sanity won't be left intact tho.
[1] http://www.drregex.com/2018/11/how-to-match-b-c-where-abc-be...
One thing that confounded me often was positive and negative look-arounds. I always got the expressions mixed up, until I just put the expressions into a table like this...
           look-behind | look-ahead
------------------------------------
positive   (?<=a)b     | a(?=b)
------------------------------------
negative   (?<!a)b     | a(?!b)
It's not hard, but for whatever reason my brain had trouble remembering the usage because every time I looked it up, each of those expressions was nested in a paragraph of explanation, and I could not see the simple intuitive pattern. Putting it into a simple visualization helps a lot.
Now, if I can find a similar mnemonic for backreferences !?
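The four look-around forms in the table can be checked quickly in Python, which uses the same syntax for them. A small sketch (the helper `starts` is just for illustration) that reports where each pattern matches:

```python
import re


def starts(pattern: str, text: str) -> list[int]:
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


# Look-behind: the "b" is matched, the "a" before it is only inspected.
print(starts(r"(?<=a)b", "ab cb"))  # prints [1]  (b preceded by a)
print(starts(r"(?<!a)b", "ab cb"))  # prints [4]  (b not preceded by a)

# Look-ahead: the "a" is matched, the "b" after it is only inspected.
print(starts(r"a(?=b)", "ab ac"))   # prints [0]  (a followed by b)
print(starts(r"a(?!b)", "ab ac"))   # prints [3]  (a not followed by b)
```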
To handle a lookbehind, you really only need to occasionally 'AND' together some states (not an operation you would normally do in a standard NFA whether Glushkov or Thompson). To handle lookaheads... well, it gets ugly.
So depending on the language or flavor you're working in, running away isn't really necessary.
"(this is inside a bracket (and this is nested or (double nested)))
P.S. I know token parsing is better for these things but still I just want to learn the other thing too.
In practice, most regexp implementations you see are more powerful than regular expressions. For instance, .NET has a balancing groups feature [0] for exactly this use case.
$str = "(this is inside a bracket (and this is nested or (double nested)))";
do {
preg_match_all('~\(((?:[^\(\)]++|(?R))*)\)~', $str, $matches);
echo $str = $matches[1][0] ?? '', "\n";
} while($str);
Outputs this [1]:
> this is inside a bracket (and this is nested or (double nested))
> and this is nested or (double nested)
> double nested
You're right that there is more processing involved (e.g. while loop) but I still don't understand this part '~\(((?:[^\(\)]++|(?R))*)\)~'
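Reading that pattern piece by piece: the `~` characters are just PHP's delimiters; `\(` and `\)` match a literal opening and closing parenthesis; the outer `( ... )` is capture group 1; and inside it `(?:[^\(\)]++|(?R))*` repeats either a run of non-parenthesis characters (the `++` is a possessive quantifier, so it never backtracks) or `(?R)`, which recurses into the whole pattern to consume one nested, balanced group.

Python's standard `re` module has no `(?R)`, so the sketch below is not a regex at all but a plain depth counter doing the same job, assuming the input is balanced; `outer_group` and `peel` are made-up names:

```python
def outer_group(text: str) -> str:
    """Return the content of the first balanced (...) group, or "" if none."""
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1      # content begins after the opening paren
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:         # matching close of the outermost group
                return text[start:i]
    return ""


def peel(text: str) -> list[str]:
    """Strip one level of parentheses at a time, like the PHP do/while loop."""
    layers = []
    inner = outer_group(text)
    while inner:
        layers.append(inner)
        inner = outer_group(inner)
    return layers


for layer in peel("(this is inside a bracket (and this is nested or (double nested)))"):
    print(layer)
```

Incrementing and decrementing `depth` plays the role of entering and leaving `(?R)`.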
[1] https://rextester.com/MEH86820
When green devs are having trouble with regular expressions (and don't have a formal computer science background), I like to give them a crash course in DFAs.
The username reference doesn't match 16 characters as claimed
looks good to me
You could do:
my $var='foo foo bar and more bar foo!!!';
if($var=~/(foo|bar)/g){ # does the variable contain foo or bar?
print "foo! $1 removing foo..\n";
# remove our value..
$var=~s/$1//g;
}
Because if I specify x{0,3}, I have 2 paths - around x and thru x + at most 2 more times
¯\_(ツ)_/¯
Unlike most regex helpers, in this one you would start with the text you want to filter/parse and then it would suggest you possible extractions.
Do you know any alternatives?
Something subtle, but I quite loved it: the email regex is, IMHO, close to perfect: \S+@\S+\.\S+
Because the "perfect" one is just absurd, and no one realizes it's going to be so fucking absurd until they start getting support cases and then go read something like this: https://stackoverflow.com/a/201378/931209
> If you want to get fancy and pedantic, implement a complete state engine. A regular expression can only act as a rudimentary filter. The problem with regular expressions is that telling someone that their perfectly valid e-mail address is invalid (a false positive) because your regular expression can't handle it is just rude and impolite from the user's perspective.
The `ai` ccTLD ran their own mail server at the root, so an address like `a@ai` was a valid email address.
They serve a website at the tld root: http://ai./
in the cheatsheet is false. (https://regexr.com/4tc48)
`.` can match any character except linebreaks (including whitespace)
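A quick way to confirm this is to try it. The sketch below uses Python rather than the JavaScript the cheatsheet appears to target, so one difference is an assumption to keep in mind: Python's `.` excludes only `\n` by default, while JavaScript's also excludes `\r`, `\u2028` and `\u2029`:

```python
import re

sample = "a b\tc\nd"  # contains a space, a tab and a newline

# By default "." matches spaces and tabs but skips the newline.
default_hits = re.findall(r".", sample)
print(default_hits)   # prints ['a', ' ', 'b', '\t', 'c', 'd']

# With DOTALL (the "s" flag in many flavours) the newline matches too.
dotall_hits = re.findall(r".", sample, flags=re.DOTALL)
print(len(default_hits), len(dotall_hits))  # prints 6 7
```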
One suggestion would be to mention clearly which tool/language is being used, since regex has no unified standard. Based on the "Cheatsheet adapted" message at the bottom, I think it is for JavaScript. I wrote a book on JS regexp last year, and I have a cheatsheet post too [3]
[3] https://learnbyexample.github.io/cheatsheet/javascript/javas...
This tool is a cheat sheet that also explains the commonly used expressions so that you understand it.
- There is a visual representation of the regular expression (thanks to regexpr)
- The application shows matching strings which you can play around with
- Expressions can be edited and these are instantly validated
If the only thing that is embedded in that frame was taken entirely from a different project, that project should at least be mentioned in the frame.
I found that you can see your own regex with railroad diagram by going to one of the prepopulated examples and editing it. However, it wasn't clear to me that's the intended use of the tool. It's either a little side-effect, or not super-discoverable.