Can Regular Expressions Be Safely Reused Across Languages?
Written by Nikos Vaggalis   
Monday, 02 September 2019

That is, can I reuse a regular expression crafted in JavaScript verbatim in Python? In doing so, will I get the same results and performance? Substitute your own languages of interest for JavaScript and Python; the question remains the same.

It is a poorly kept secret that programmers are huge fans of copying and pasting code snippets, regular expressions included, that are freely available across the web. But unlike copying and pasting code within the boundaries of a single programming language, does copying a regular expression that was crafted in one language into another work as assumed, or does it introduce errors, both in semantics and in performance?

"Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions", a paper presented at the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19), attempts to shed light on the question: are regular expressions truly portable?

But first things first. Do programmers engage in copying and re-using regular expressions to begin with? Really, DO they?

To find out, the researchers from Virginia Tech (James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant and Dongyoon Lee) surveyed 159 professional developers in order to understand their perceptions and practices around regular expressions. The findings leave no room for misinterpretation: 94% of these developers copy and reuse regexes taken from Stack Overflow and other forums, and 47% think that regexes are indeed portable across language boundaries.

These findings confirmed the researchers' belief that this is a real issue that deserves further investigation. So the next stage was to measure the extent of that reuse:

  • How commonly are regexes re-used from other software?

and

  • How commonly are regexes re-used from Internet sources?

To answer them, they built a regex corpus consisting of 537,806 regexes extracted from 193,524 libraries/modules written in JavaScript, Java, PHP, Python, Ruby, Go, Perl and Rust, residing in the languages' primary module registries. When a regex appeared in multiple modules in the same registry, they marked those modules as containing an intra-language duplicate; when a regex appeared in at least one module in each of several different registries, they marked those modules as containing an inter-language duplicate.
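
The duplicate-marking rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_duplicates`, the toy corpus, and the use of exact string equality as the test for "the same regex" are all hypothetical, not the researchers' actual tooling.

```python
from collections import defaultdict


def find_duplicates(corpus):
    """Mark modules containing intra- and inter-language duplicates.

    corpus maps registry name -> module name -> set of regex strings.
    Two regexes count as duplicates only if their strings are identical
    (an assumption made for this sketch).
    """
    # regex -> registry -> modules in that registry that contain the regex
    seen = defaultdict(lambda: defaultdict(set))
    for registry, modules in corpus.items():
        for module, regexes in modules.items():
            for regex in regexes:
                seen[regex][registry].add(module)

    intra, inter = set(), set()
    for by_registry in seen.values():
        for registry, modules in by_registry.items():
            # Same regex in several modules of one registry.
            if len(modules) > 1:
                intra.update((registry, m) for m in modules)
            # Same regex present in more than one registry.
            if len(by_registry) > 1:
                inter.update((registry, m) for m in modules)
    return intra, inter


# Hypothetical toy corpus with two registries.
toy_corpus = {
    "npm": {"a": {r"\d+", r"^x$"}, "b": {r"\d+"}},
    "pypi": {"c": {r"^x$"}, "d": {r"[a-z]"}},
}
intra_modules, inter_modules = find_duplicates(toy_corpus)
```

Here modules `a` and `b` share `\d+` inside one registry, so both are intra-language duplicates, while `a` and `c` share `^x$` across registries and are marked as inter-language duplicates.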

The findings of this experiment showed that thousands of modules (20%) shared the same regexes, both within and across languages, and that 5% of all corpus modules (about 10,000), primarily written in JavaScript, use regexes from Stack Overflow and RegExLib.

Based on their polyglot regex corpus, they then explored the issues of portability, starting with semantic portability. A semantic portability problem is the case in which two languages exhibit different features (or behaviors) for the same regex syntax.

To do that, they ran a large set of randomly generated inputs against a large set of complex regular expressions in each language that supported them. This resulted in the so-called "witness points", which were used as the basis of comparison between every possible pair of languages. The outcome of the comparison was plotted into a chart categorized by witness type in order to highlight the differences. These categories were:

(1) Match witness: Languages disagree on whether there is a match 

(2) Substring witness: Languages agree that there is a match but disagree about the matching substring 

(3) Capture witness: Languages agree on the match and the matching substring, but disagree about the division of the substring into any capture groups of the regex
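
The three categories form a simple decision cascade, sketched below in Python. The names `Outcome`, `run` and `classify_witness` are hypothetical, and because everything runs in a single language, Python's own regex flags stand in for two different "dialects"; the study itself compared real engines across languages.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Outcome(NamedTuple):
    matched: bool
    substring: Optional[str]
    captures: Tuple[Optional[str], ...]


def run(pattern: str, text: str, flags: int = 0) -> Outcome:
    """Evaluate a regex in one 'dialect' (here: Python's re plus flags)."""
    m = re.search(pattern, text, flags)
    if m is None:
        return Outcome(False, None, ())
    return Outcome(True, m.group(0), m.groups())


def classify_witness(a: Outcome, b: Outcome) -> Optional[str]:
    """Return the witness type for two outcomes, or None if they agree."""
    if a.matched != b.matched:
        return "match"        # (1) disagreement on whether there is a match
    if not a.matched:
        return None           # both found nothing: no disagreement
    if a.substring != b.substring:
        return "substring"    # (2) same verdict, different matched text
    if a.captures != b.captures:
        return "capture"      # (3) same text, different capture groups
    return None


# Flags emulate dialect differences on the same pattern and input.
match_case = classify_witness(run(r"^b", "a\nb"),
                              run(r"^b", "a\nb", re.MULTILINE))
substring_case = classify_witness(run(r"a.*", "a\nb"),
                                  run(r"a.*", "a\nb", re.DOTALL))
# Hand-built outcomes, since a capture disagreement needs two real engines.
capture_case = classify_witness(Outcome(True, "ab", ("a",)),
                                Outcome(True, "ab", ("ab",)))
```

The order of the checks matters: a capture witness is only meaningful once both sides agree on the match and on the matched substring.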

 

[Image: chart of witness types for each pair of languages]

From this chart you can find out the incompatibilities across language pairs. Going through it reveals, for example, that JavaScript and Java have a 4% deviation on the Match witness scale and 1% on the Substring witness scale. Another takeaway is that PHP and Python seem to disagree on pretty much everything: 5% deviation on Match, 4% on Substring and 7% on Capture.

The research now turns to the causes of this behavior. Apparently: 

  • Some languages support a feature that others do not

  • Languages use the same syntax for different features

  • Languages use the same syntax for the same features but exhibit different behavior

The new findings were summarized in the following table:

[Image: table summarizing the causes of semantic differences]

From this table we can derive, for example, that in JavaScript the anchor notation /\Ab\Z/ is interpreted literally as AbZ, even though developers who use this notation probably intend anchors, or that /^a/ in most languages means match at the beginning of the input but in Ruby means match at the beginning of the line. In summary, 15% of regexes exhibit documented and undocumented semantic differences.
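
The Python side of these two examples can be checked directly. The snippet below is a minimal sketch: the helper `python_matches` is a hypothetical name, and since it runs only Python, the Ruby behavior of `^` is merely emulated with `re.MULTILINE` and the JavaScript reading is described in a comment rather than executed.

```python
import re


def python_matches(pattern: str, text: str, flags: int = 0) -> bool:
    """True if Python's re engine finds the pattern anywhere in the text."""
    return re.search(pattern, text, flags) is not None


# In Python, \A and \Z are anchors for the start and end of the input,
# so the pattern matches the string "b" and not the literal text "AbZ".
# According to the study, JavaScript reads the same syntax as the literal "AbZ".
anchors_match_b = python_matches(r"\Ab\Z", "b")
anchors_match_literal = python_matches(r"\Ab\Z", "AbZ")

# By default, ^ in Python matches only at the beginning of the input.
caret_default = python_matches(r"^a", "b\na")

# With re.MULTILINE, ^ also matches at the beginning of each line,
# which emulates the Ruby behavior described in the text.
caret_line_based = python_matches(r"^a", "b\na", re.MULTILINE)
```

The same two patterns therefore give opposite answers depending on the dialect, which is exactly the kind of silent difference a copied regex can carry with it.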

But the findings don't stop here; the researchers even identified bugs in the regex engines of V8 (JavaScript), Python and Rust, for which official bug reports were filed.

Yet another conclusion is that these unusual behaviors could not be explained by consulting each language’s regex documentation, so "testing, not reading the manual, is the only way for developers to learn these behaviors".



Last Updated ( Tuesday, 03 September 2019 )