Python code to solve xkcd 1313 by Peter Norvig (opens in new tab)

(nbviewer.ipython.org)

447 pointsweslly12y ago72 comments

72 comments

55 comments · 20 top-level

z-e-r-o12y ago· 4 in thread

Can someone explain what does this line mean and why does he use it as heuristic?

  key=lambda c: 3*len(matches(c, uncovered)) - len(c)

bzbarsky12y ago

It means to use as the key for max 3(number of things this pattern matches) - (number of characters in pattern). The basic idea is that patterns that match more things are good, while patterns that are long are bad; if you select the pattern with the maximal score for the above expression it's matching more things than others or is shorter than others, or both.
Why the "3
" bit, good question. Would be interesting to see what happens with other relative weights of number of matches and length.

kabouseng12y ago

Norvig did say why he chose 3: " I may have chosen a bad tradeoff. (I arbitrarily decided that matching a winner is 3 times more important than spending a character (because a disjunction seems to take about 3 characters on average).)"

natch12y ago

IMHO, sadly, that part is serving as a shibboleth.

The clue is how he explains all the simple parts of the script with comments, but leaves this part unexplained, making some of us feel left out of the cool club. Implying: if you don't grasp it immediately you must not be up to it.

Overall it's a great post, but as to the cryptic parts of it, it's disappointing to see this kind of thing on the part of someone we all look up to. It would be more generous of spirit if he were to have written this in a readable, self explanatory way, like the rest of the code... as python should be written.

Sigh. Norvig has a posse, but I thought it had to be said.

    >>> import this
   ...
   "Readability matters"
   ...
   "Sparse is better than dense"
   ...
   "If the implementation is hard to explain, it's a bad idea"

magicalist12y ago

First, it is explained in the text above it.

Second, seriously?

A line of code that isn't as clear as it could be in an ipython notebook he probably dashed off in an hour is somehow a reflection on his generosity of spirit? I think you need to recheck your expectations.

1 more reply

fwenzel12y ago· 4 in thread

I am not sure why Norvig omits president Obama. That said, "[mtg]a" does match him, so at least Munroe tries.

lazerwalker12y ago

From the article: "I started by finding a page that lists winners and losers of US presidential elections through 2004."

sgustard12y ago

Given that, an interesting variation of the problem is: what's the easiest way to transform an expression to incorporate new data? After the next election, one can toss away the result and re-generate a new one from scratch. But is there an easier way to absorb additional terms, both inclusions and exclusions?

RyJones12y ago

His source page doesn't list Obama, for whatever reason.

http://www.anesi.com/presname.htm

stbullard12y ago

It's just out of date. It says there have been 53 presidential elections since 1789; there have been 56.

sushirain12y ago· 4 in thread

What would be a use for finding a minimal discriminating regex? Perhaps understanding the difference between boys' and girls' names?

bjourne12y ago

Performance. For example if you are syntax highlighting a programming language with hundreds of keywords, then using a regexp like "(kwd1|kwd2|...|kwdn)" is not very efficient. An optimized regexp can do the same matching much faster.

jules12y ago

You would be much better off just generating the DFA of the straightforward regex and minimizing that DFA. This is simpler and faster to generate and faster to execute. Furthermore, in a parser for a programming language you do not want to match some things in one list but not in this other list, you want to match some things in this list and nothing else.

Xephyrous12y ago

I know this isn't what you're asking, but I imagine in this case it's because that's one of the challenges of regex golf. Matching regex, as short as possible.

yen22312y ago

It's a subset of the set cover problem, which is a well-known NP-hard problem.

blt12y ago· 3 in thread

I love Norvig's Python posts. He really gets the spirit of the language and has fun with it.

mkesper12y ago

Yes, I love expressions as this one:

  winners, losers = (winners - losers), (losers - winners)

echion12y ago

Was that in an earlier version? I can't find that in the article now.

1 more reply

BWStearns12y ago

Most times when I read Norvig's Python I get the same feeling as when a tough riddle answer is given, except with Norvig you don't wonder whether anyone actually arrives at the answer the first time they hear it, you know he did.

tlarkworthy12y ago· 3 in thread

Exercise for the reader, write a regex to distinguish random noise from English

EDIT: possibly down-voted because someone though it was sarcastic???

I was actually thinking of this problem before the XKCD comic, for detecting hashes on hardrives efficiently...

jxf12y ago

Almost any regex containing a common English phrase of a few words, like /of course/, would give you a very high accuracy rate on a large enough dataset and a very low false positive rate. English has low entropy per bit.

tlarkworthy12y ago

That strategy would return results for any phrase not containing "of course". Which would be rather a lot of results in any document. I am envisioning a regex that plucks hashes from emails, for example. Some parts English, some parts cryptographic "noise". Pick the noise from the English

(note the inspiration was a nefarious regex scanner for finding bitcoin hashes. I have no intention of building such a thing, but the idea of a random detector regex intrigued me. Is it possible?)

My idea was more like calculate the statistics of character n-grams in english (all of which exist in random noise too), but count the most unlikely occurrences until you hit some probabilistic threshold that indicates some well thought out decision boundary.

2 more replies

Groxx12y ago

For detecting hashes? As in, SHA, or {key: value}?

If the former, you're possibly better just looking for high-entropy chunks of data of the right size. Of course, that'll match all kinds of encrypted data, but there may not be much you can do there.

gwern12y ago· 3 in thread

It's too bad he didn't try to tackle the optimal regexp problem and settled for approximations - it may be a NP-hard problem, but all the example solutions are short enough that the instances might be still tractable. Would've been nice to know for sure.

abecedarius12y ago

I coded a brute-force one before: https://github.com/darius/sketchbook/blob/master/regex/find_...

Though it cares about the size of the AST rather than the concrete syntax. I can't try running it now, I'm on an iPad.

jules12y ago

Another xkcd applies here: http://xkcd.com/356/

chao_xu12y ago

I asked the problem for regular expression(the theoretical kind) for a finite set on cstheory a while ago.

http://cstheory.stackexchange.com/questions/16860/minimizing...

Apparently this problem is still open.

ddebernardy12y ago· 2 in thread

This was posted a few days ago on Code Golf:

http://codegolf.stackexchange.com/questions/17718/meta-regex...

That link includes a perl 10-liner to do the same.

jamesaguilar12y ago

I dunno if calling a pre-implemented solution is quite the same as writing one yourself.

ben33612y ago

Since that XKCD was from today, pretty sure the question is from today too :)

haberman12y ago· 2 in thread

I thought it was going to be meta-meta-regex golf, and couldn't imagine how that would be possible. But meta-regex golf is an interesting exercise, and is far more tractable. :)

siddboots12y ago

"Write the shortest regex that only matches programs that successfully generate the shortest regex given lists of inclusions and exclusions."

Laremere12y ago

Such a regex cannot exist:

1. Take an existing solution to finding the shortest regex given a list of inclusions an exclusions.

2. Take another unrelated arbitrary program.

3. Combine the two, when the arbitrary program terminates, use the existing solution to solve the problem.

4. Such a regex would have to solve the halting problem to know if the arbitrary program terminates and the existing solution solves the problem.

5. Since the halting problem cannot be solved, no such regex can exist.

2 more replies

firegrind12y ago· 2 in thread

When I read 'subtitles', i wondered about the .srt files of the movies.

StavrosK12y ago

Same here, and was blown away.

Groxx12y ago

"subtitles" ? I'm not seeing that anywhere here nor on norvig.com. link?

1 more reply

joyofpi12y ago· 2 in thread

I think it fails for: findregex(set(['abc']), set(['abcd']))

xentronium12y ago

Pretty sure that's impossible within given constraints (only disjunction positive regexes allowed).

jameshart12y ago

trivially fixable by including start-of-string and end-of-string as tokens in the initial string breakdown, so instead of analyzing {abc, ab, bc, a, b, c} as candidate regexes, you start out analyzing {^abc, abc$, ^ab, abc, bc$, ^a, ab, bc, c$, ^, a, b, c, $}; would rapidly home in on c$ as an optimal solution.

Thus would fail against a 'must fail' target of abcabc, but then you fix that by extending the maximum allowable regex fragment length from 4 to 5 and it'll find ^abc$. More generally, you extend the maximum allowable regex count to the longest 'must match string' plus 2, and it'll always succeed, even if it has to create a regex consisting of ^word1$|^word2$|^word3$...

donniezazen12y ago· 2 in thread

Is Python Peter Norvig's preferred language (along with Lisp, I suppose)?

dekhn12y ago

http://norvig.com/python-lisp.html

practical tradeoffs, but yes, his preferred language is Python.

pekk12y ago

It is just a language well-suited for teaching.

j2kun12y ago· 2 in thread

What tool does Norvig use to create this json file? Does iPython have this as a feature (somehow allowing formatted text)?

jofer12y ago

By "this json file", do you mean the ipython notebook itself? (It's saved as JSON behind-the-scenes. That's what you'll get if you click "download notebook".)

If so, yes, that's what a saved ipython notebook is. See: http://ipython.org/notebook.html for an overview of ipython notebooks.

You can export it to html, latex, etc using "ipython nbconvert --to <format_you_want>". See: http://ipython.org/ipython-doc/dev/interactive/notebook.html...

Basically, the website you're seeing (nbviewer) hosts ipython notebooks (json files) and converts them to static html for viewing.

dded12y ago

What's the best way to install ipython on a Mac? Mine's from Macports, and I have ipython notebook, but when I try to use nbconvert I'm told I'm missing pandoc. A 'port install pandoc' wants to install yet another copy of gcc (I have two--including one from Macports for another ipython dependency), ghc, and about three dozen packages prefixed with 'hs'.

Is there any nice bundle of everything you need for ipython, for the Mac, that's both easy to install and uninstall?

1 more reply

throwaway_yy2Di12y ago· 1 in thread

I don't know why Randall's regex incorrectly (?) matches "Fremont", but it's worth noting Wikipedia's primary spelling has an accent aigu "Frémont":

https://en.wikipedia.org/wiki/John_C._Frémont

fwenzel12y ago

Frémont still matches "[rn]t", so the problem persists regardless of this spelling.

thewarrior12y ago· 1 in thread

Could this be used as an alternative to a bloom filter ?

gwern12y ago

If you know everything ahead of time, you could probably just use something like a https://en.wikipedia.org/wiki/Perfect_hash_function

temuze12y ago

This is a great article. It's pretty fun to play around with this heuristic:

  lambda c: 3*len(matches(c, uncovered)) - len(c)

Here's a trivial way to explore it: say we generalize the heuristic to H(a, b).

  H(a,b) = lambda c: a*len(matches(c, uncovered)) - b*len(c)

The original heuristic is considered H(3,1) by this definition. Then we can play around with a and b to see if we'd get smaller results.

  def findregex_lambda(winners, losers, a, b):
      "Find a regex that matches all winners but no losers (sets of strings)."
      # Make a pool of candidate components, then pick from them to cover winners.
      # On each iteration, add the best component to 'cover'; finally disjoin them together.
      pool = candidate_components(winners, losers)
      cover = []
      while winners:
          best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c))
          cover.append(best)
          pool.remove(best)
          winners = winners - matches(best, winners)
      return '|'.join(cover)

  >>> findregex_lambda(starwars, startrek, 3, 1)
  ' T|E.P| N'
  >>> findregex_lambda(starwars, startrek, 3, 2)
  ' T|B| N| M'

Or, to automate this:

  def best_H_heuristic(winners, losers):
      d = {(a,b) : len(findregex_lambda(winners, losers, a,b)) for a in range(0,4) for b in range(0,4)}
      return min(d, key=d.get)

  >>> best_H_heuristic(starwars, startrek):
  (3,1)

Looks like H(3,1) is pretty good for this case. What about the nfl teams?

  >>> best_H_heuristic(nfl_in, nfl_out)
  (3, 2)
  >>> findregex_lambda(nfl_in, nfl_out, 3, 1)
  'pa|g..s|4|fs|sa|se|lt|os'
  >>> findregex_lambda(nfl_in, nfl_out, 3, 2)
  'pa|ch|4|e.g|sa|se|lt|os'

Not the best heuristic there. H(3,1) wins or ties for the boys/girls set, left/right set and drugs/cities set, which just goes to show you that picking a heuristic off a gut guess isn't such a bad approach.

You could also explore heuristics of different forms:

  M(a,b,d,e) = lambda c: a*len(matches(c, uncovered))^b - d*len(c)^e

Or trying completely different formats:

  L(a,b) = lambda c: a*log(len(matches(c, uncovered))) - b*len(c)

j2kun12y ago

One thing not mentioned in this article:

1. The greedy algorithm has an O(log(n)) approximation ratio, meaning it produces a regex guaranteed to use a number of terms within a multiplicative O(log(n)) factor of the optimal regex.

2. Unless P != NP, set cover cannot be approximated better than the greedy algorithm. In other words, the only general solutions you'll find (unless you're using some special insight about how regular expressions cover sets of strings) will be no better than a constant factor improvement in produced regex size than the greedy algorithm.

That being said, regexes (esp disjunctions of small regexes) are not arbitrary sets. So this problem is a subset of set cover, and certainly may have efficient exact solutions.

a3_nm12y ago

Interestingly, finding a minimal-size regexp satisfying a set of positive and negative examples (words that should match, and should not match) is NP-hard. Here is a nice discussion: http://cstheory.blogoverflow.com/2011/08/on-learning-regular...

josephlord12y ago

If you just want to play regex golf this site appeared before Christmas and there was quite a discussion [1] although there are a few more levels now: http://regex.alf.nu/

I'm still not happy with my 214 on Alphabetical including one false match (I was 202 or something with everything correctly matched).

[1] http://news.ycombinator.com/item?id=6941231

shdon12y ago

With the given set,

  /M | [TN]|B/

is suboptimal, but could be

  / [TMN]|B/

But that (and the article) leaves out the subtitle for Star Trek 1: "The Motion Picture". For that, Randall's original expression works.

LambdaAlmighty12y ago

Judging by the amount of fawning here, this guy must be a HN celebrity. Interesting post nevertheless!

I can only hope, one day, I'd be writing and publishing joyful little hacks like this, to such general applause, instead of eking out a living. I have to say I'm a bit envious here!

Well done to the dude. An inspirational post, in many ways.

j / k navigate · click thread line to collapse

72 comments

55 comments · 20 top-level

z-e-r-o12y ago· 4 in thread

Can someone explain what does this line mean and why does he use it as heuristic?

  key=lambda c: 3*len(matches(c, uncovered)) - len(c)

bzbarsky12y ago

kabouseng12y ago

natch12y ago

IMHO, sadly, that part is serving as a shibboleth.

Sigh. Norvig has a posse, but I thought it had to be said.

    >>> import this
   ...
   "Readability matters"
   ...
   "Sparse is better than dense"
   ...
   "If the implementation is hard to explain, it's a bad idea"

magicalist12y ago

First, it is explained in the text above it.

Second, seriously?

1 more reply

fwenzel12y ago· 4 in thread

I am not sure why Norvig omits president Obama. That said, "[mtg]a" does match him, so at least Munroe tries.

lazerwalker12y ago

From the article: "I started by finding a page that lists winners and losers of US presidential elections through 2004."

sgustard12y ago

RyJones12y ago

His source page doesn't list Obama, for whatever reason.

http://www.anesi.com/presname.htm

stbullard12y ago

It's just out of date. It says there have been 53 presidential elections since 1789; there have been 56.

sushirain12y ago· 4 in thread

What would be a use for finding a minimal discriminating regex? Perhaps understanding the difference between boys' and girls' names?

bjourne12y ago

jules12y ago

Xephyrous12y ago

I know this isn't what you're asking, but I imagine in this case it's because that's one of the challenges of regex golf. Matching regex, as short as possible.

yen22312y ago

It's a subset of the set cover problem, which is a well-known NP-hard problem.

blt12y ago· 3 in thread

I love Norvig's Python posts. He really gets the spirit of the language and has fun with it.

mkesper12y ago

Yes, I love expressions as this one:

  winners, losers = (winners - losers), (losers - winners)

echion12y ago

Was that in an earlier version? I can't find that in the article now.

1 more reply

BWStearns12y ago

tlarkworthy12y ago· 3 in thread

Exercise for the reader, write a regex to distinguish random noise from English

EDIT: possibly down-voted because someone though it was sarcastic???

I was actually thinking of this problem before the XKCD comic, for detecting hashes on hardrives efficiently...

jxf12y ago

tlarkworthy12y ago

(note the inspiration was a nefarious regex scanner for finding bitcoin hashes. I have no intention of building such a thing, but the idea of a random detector regex intrigued me. Is it possible?)

2 more replies

Groxx12y ago

For detecting hashes? As in, SHA, or {key: value}?

If the former, you're possibly better just looking for high-entropy chunks of data of the right size. Of course, that'll match all kinds of encrypted data, but there may not be much you can do there.

gwern12y ago· 3 in thread

abecedarius12y ago

I coded a brute-force one before: https://github.com/darius/sketchbook/blob/master/regex/find_...

Though it cares about the size of the AST rather than the concrete syntax. I can't try running it now, I'm on an iPad.

jules12y ago

Another xkcd applies here: http://xkcd.com/356/

chao_xu12y ago

I asked the problem for regular expression(the theoretical kind) for a finite set on cstheory a while ago.

http://cstheory.stackexchange.com/questions/16860/minimizing...

Apparently this problem is still open.

ddebernardy12y ago· 2 in thread

This was posted a few days ago on Code Golf:

http://codegolf.stackexchange.com/questions/17718/meta-regex...

That link includes a perl 10-liner to do the same.

jamesaguilar12y ago

I dunno if calling a pre-implemented solution is quite the same as writing one yourself.

ben33612y ago

Since that XKCD was from today, pretty sure the question is from today too :)

haberman12y ago· 2 in thread

I thought it was going to be meta-meta-regex golf, and couldn't imagine how that would be possible. But meta-regex golf is an interesting exercise, and is far more tractable. :)

siddboots12y ago

"Write the shortest regex that only matches programs that successfully generate the shortest regex given lists of inclusions and exclusions."

Laremere12y ago

Such a regex cannot exist:

1. Take an existing solution to finding the shortest regex given a list of inclusions an exclusions.

2. Take another unrelated arbitrary program.

3. Combine the two, when the arbitrary program terminates, use the existing solution to solve the problem.

4. Such a regex would have to solve the halting problem to know if the arbitrary program terminates and the existing solution solves the problem.

5. Since the halting problem cannot be solved, no such regex can exist.

2 more replies

firegrind12y ago· 2 in thread

When I read 'subtitles', i wondered about the .srt files of the movies.

StavrosK12y ago

Same here, and was blown away.

Groxx12y ago

"subtitles" ? I'm not seeing that anywhere here nor on norvig.com. link?

1 more reply

joyofpi12y ago· 2 in thread

I think it fails for: findregex(set(['abc']), set(['abcd']))

xentronium12y ago

Pretty sure that's impossible within given constraints (only disjunction positive regexes allowed).

jameshart12y ago

donniezazen12y ago· 2 in thread

Is Python Peter Norvig's preferred language (along with Lisp, I suppose)?

dekhn12y ago

http://norvig.com/python-lisp.html

practical tradeoffs, but yes, his preferred language is Python.

pekk12y ago

It is just a language well-suited for teaching.

j2kun12y ago· 2 in thread

What tool does Norvig use to create this json file? Does iPython have this as a feature (somehow allowing formatted text)?

jofer12y ago

By "this json file", do you mean the ipython notebook itself? (It's saved as JSON behind-the-scenes. That's what you'll get if you click "download notebook".)

If so, yes, that's what a saved ipython notebook is. See: http://ipython.org/notebook.html for an overview of ipython notebooks.

You can export it to html, latex, etc using "ipython nbconvert --to <format_you_want>". See: http://ipython.org/ipython-doc/dev/interactive/notebook.html...

Basically, the website you're seeing (nbviewer) hosts ipython notebooks (json files) and converts them to static html for viewing.

dded12y ago

Is there any nice bundle of everything you need for ipython, for the Mac, that's both easy to install and uninstall?

1 more reply

throwaway_yy2Di12y ago· 1 in thread

I don't know why Randall's regex incorrectly (?) matches "Fremont", but it's worth noting Wikipedia's primary spelling has an accent aigu "Frémont":

https://en.wikipedia.org/wiki/John_C._Frémont

fwenzel12y ago

Frémont still matches "[rn]t", so the problem persists regardless of this spelling.

thewarrior12y ago· 1 in thread

Could this be used as an alternative to a bloom filter ?

gwern12y ago

If you know everything ahead of time, you could probably just use something like a https://en.wikipedia.org/wiki/Perfect_hash_function

temuze12y ago

This is a great article. It's pretty fun to play around with this heuristic:

  lambda c: 3*len(matches(c, uncovered)) - len(c)

Here's a trivial way to explore it: say we generalize the heuristic to H(a, b).

  H(a,b) = lambda c: a*len(matches(c, uncovered)) - b*len(c)

The original heuristic is considered H(3,1) by this definition. Then we can play around with a and b to see if we'd get smaller results.

  def findregex_lambda(winners, losers, a, b):
      "Find a regex that matches all winners but no losers (sets of strings)."
      # Make a pool of candidate components, then pick from them to cover winners.
      # On each iteration, add the best component to 'cover'; finally disjoin them together.
      pool = candidate_components(winners, losers)
      cover = []
      while winners:
          best = max(pool, key=lambda c: a*len(matches(c, winners)) - b*len(c))
          cover.append(best)
          pool.remove(best)
          winners = winners - matches(best, winners)
      return '|'.join(cover)

  >>> findregex_lambda(starwars, startrek, 3, 1)
  ' T|E.P| N'
  >>> findregex_lambda(starwars, startrek, 3, 2)
  ' T|B| N| M'

Or, to automate this:

  def best_H_heuristic(winners, losers):
      d = {(a,b) : len(findregex_lambda(winners, losers, a,b)) for a in range(0,4) for b in range(0,4)}
      return min(d, key=d.get)

  >>> best_H_heuristic(starwars, startrek):
  (3,1)

Looks like H(3,1) is pretty good for this case. What about the nfl teams?

  >>> best_H_heuristic(nfl_in, nfl_out)
  (3, 2)
  >>> findregex_lambda(nfl_in, nfl_out, 3, 1)
  'pa|g..s|4|fs|sa|se|lt|os'
  >>> findregex_lambda(nfl_in, nfl_out, 3, 2)
  'pa|ch|4|e.g|sa|se|lt|os'

You could also explore heuristics of different forms:

  M(a,b,d,e) = lambda c: a*len(matches(c, uncovered))^b - d*len(c)^e

Or trying completely different formats:

  L(a,b) = lambda c: a*log(len(matches(c, uncovered))) - b*len(c)

j2kun12y ago

One thing not mentioned in this article:

1. The greedy algorithm has an O(log(n)) approximation ratio, meaning it produces a regex guaranteed to use a number of terms within a multiplicative O(log(n)) factor of the optimal regex.

That being said, regexes (esp disjunctions of small regexes) are not arbitrary sets. So this problem is a subset of set cover, and certainly may have efficient exact solutions.

a3_nm12y ago

josephlord12y ago

If you just want to play regex golf this site appeared before Christmas and there was quite a discussion [1] although there are a few more levels now: http://regex.alf.nu/

I'm still not happy with my 214 on Alphabetical including one false match (I was 202 or something with everything correctly matched).

[1] http://news.ycombinator.com/item?id=6941231

shdon12y ago

With the given set,

  /M | [TN]|B/

is suboptimal, but could be

  / [TMN]|B/

But that (and the article) leaves out the subtitle for Star Trek 1: "The Motion Picture". For that, Randall's original expression works.

LambdaAlmighty12y ago

Judging by the amount of fawning here, this guy must be a HN celebrity. Interesting post nevertheless!

I can only hope, one day, I'd be writing and publishing joyful little hacks like this, to such general applause, instead of eking out a living. I have to say I'm a bit envious here!

Well done to the dude. An inspirational post, in many ways.

j / k navigate · click thread line to collapse