Regex Golf, XKCD And Peter Norvig

Written by Mike James

Sunday, 12 January 2014

A recent xkcd cartoon has started some deep academic thinking. When AI expert Peter Norvig gets involved you know the algorithms are going to fly.

Xkcd is our favourite comic and regular expression are definitely our favourite way to spend the time waiting for the next installment - so what could be better than an xkcd cartoon on Regex Golf.

Code Golf is a reasonably well known sport of trying to code an algorithm in the shortest possible code. Regex Golf is similar, but in general the aim is to create a regular expression that accepts the strings in one list and rejects the strings in a second list.

This should have been enough, but clearly we are programmers and we like recursion and a regex is a string after all and a regex can process a string:

Regex Golf

Credit: xkcd

Yes my friend, it's regexes all the way down!

Mind boggling enough, but the hover over text, a key feature of many xkcd funnies, is where the problem really starts to take off and gain a life of its own.

The text gives a regular expression that matches the last names of the elected US presidents, but not the losers.

pnorvig

This started Peter Norvig, the well-known computer scientist, director of research at Google and wearer of brightly colored shirts, thinking about the problem. Is it possible to write a program that would create a regular expression to solve the xkcd problem?

The problem is one level of meta, i.e. it is a program that creates a regular expression, but it is a difficult one that needs AI-like search techniques.

The basic algorithm is to first derive a pool of regexs for short character sequences in the data that match a least one winner but no losers. Then create a bigger regex by or-ing a selection of components together. Vary the selection until you get a regex that matches all of the winners but none of the losers.

This is equivalent to the set cover problem and hence NP hard.

To try to find the smallest regex that does the job the simplest approximate way is to use a greedy selection that picks component regexs that maximally separate the two lists first.

From here the algorithm can be improved by adding a set of rules for generating the initial pool of regexs.

To find out more read the complete description, including Python code, at Peter Norvig's blog post which ends with the challenge:

"I hope you found this interesting, and perhaps you can find ways to improve my algorithm, or more interesting lists to apply it to. I found it was fun to play with, and I hope this page gives you an idea of how to address problems like this."

Or, of course, you could take the problem to one stage more of meta e.g meta-meta-regexgolf - a program that writes programs that writes regexs....

If you would like to play more Regex Golf, with or without the help of a program, visit: http://regex.alf.nu/

regexgolf1

More Information

xkcd 1313: Regex Golf

Regex Golf

Verbal Expressions Are Easier Than Regular Ones

Regular Expression Crossword Site

Can You Do The Regular Expression Crossword?

No 16-Clue Sudoku!

.NET Regular Expressions in depth

Master JavaScript Regular Expressions

The Chaos Within Sudoku - A Richter Scale

Picture-Hanging Puzzles

Rubik's cube - the order of God's Number

The maximum overhang algorithm

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Comments

or email your comment to: comments@i-programmer.info

Knuth's Xmas Lecture 2024 - Strong And Weak
22/12/2024

Could the festive season be complete without Donald Knuth putting on his flamboyant xmas top and talking to us about something that most of us know nothing about? Of course not.

+ Full Story

Clocks For The New Year 2024
31/12/2024

What could be a better time to indulge my interest in clocks? Digital clocks, that is, but with a twist on how they present the time. So in the countdown to the rollover lets clock...

+ Full Story

More News

Last Updated ( Sunday, 19 May 2019 )

More Information

Related Articles

Comments