Checking satisfiability of regular expressions

Problem statement

Most modern programming languages support regular expressions (regexes) extended with additional features. Some of these features make it possible to write regular expressions that can never match any input string.

Regular expressions extensions

There is no widely adopted standard when it comes to regexes. However, the most commonly supported features that can render a regex unsatisfiable are the following (a short Java illustration follows the list):

  • Lookarounds are a construct that lets the programmer express assertions. The subexpression of the lookaround has to match for the assertion to hold, but it does not consume characters. A lookaround can apply either to the characters that follow it, in which case it is called a lookahead, or to the characters that precede it, in which case it is called a lookbehind. Lookarounds can be positive or negative; if negative, their pattern must not match for the assertion to hold. The usual syntax is (?=...)/(?<=...) for positive lookahead/lookbehind and (?!...)/(?<!...) for negative lookahead/lookbehind.
  • Possessive quantifiers provide a way to repeat a pattern. In contrast to lazy/greedy quantifiers, the regex engine will not backtrack into a possessively quantified pattern once it has matched. This feature addresses practical concerns: it can speed up matching and prevent catastrophic backtracking. The most common syntax is to append a + to regular quantifiers.
  • Anchors are simply a shorthand for some common assertions. Usually, ^ and $ assert that the match is at the start (resp. end) of the string. Word boundaries \b only match if the current position is at the beginning or end of a word.
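
For instance, in Java (which supports all three constructs), each of the following patterns compiles but can never match any input; the $ example assumes the default, non-MULTILINE mode:

import java.util.regex.Pattern;

public class ImpossibleRegexExamples {
    public static void main(String[] args) {
        // Lookahead contradiction: the assertion requires the next character to be 'b',
        // but the pattern then consumes a 'c' at the same position.
        Pattern lookahead = Pattern.compile("a(?=b)c");

        // Possessive quantifier: "a*+" consumes every 'a' and never gives one back,
        // so the trailing 'a' can never be matched.
        Pattern possessive = Pattern.compile("a*+a");

        // Anchor contradiction: '$' asserts the end of the input,
        // so no character can be consumed after it.
        Pattern anchor = Pattern.compile("a$b");

        System.out.println(lookahead.matcher("abc").find());   // false
        System.out.println(possessive.matcher("aaaa").find()); // false
        System.out.println(anchor.matcher("a\nb").find());     // false
    }
}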

Although other extensions might not cause impossible matching by themselves, we also want to handle them since they interact with the above features. One particularly problematic feature is backreferences.

A bit of theory

Classical regular expressions describe the class of languages called regular. This is nice because many properties of such languages are decidable, emptiness included.

Thankfully, most extensions do not add expressive power to regular expressions; they only provide conciseness. For example, a lookaround can concisely describe a language that would require a much more involved regular expression if lookarounds were not available.
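
For instance, the lookahead pattern (?=.*a)(?=.*b).* and the plain pattern .*a.*b.*|.*b.*a.* accept exactly the same strings, but with more required letters the lookaround version stays short while the plain one grows rapidly:

import java.util.regex.Pattern;

public class LookaroundConciseness {
    public static void main(String[] args) {
        // Both patterns accept exactly the strings containing both 'a' and 'b'.
        Pattern withLookahead = Pattern.compile("(?=.*a)(?=.*b).*");
        Pattern without = Pattern.compile(".*a.*b.*|.*b.*a.*");

        for (String input : new String[] {"xbya", "aa", "ab"}) {
            System.out.println(withLookahead.matcher(input).matches()
                + " " + without.matcher(input).matches());
        }
    }
}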

The notable exception is backreferences. Regular expressions extended with backreferences describe a broader set of languages, and checking emptiness of languages in this set is not decidable.
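
For example, the pattern (a+)b\1 matches exactly the strings of the form a^n b a^n, a language that no regular expression without backreferences can describe:

import java.util.regex.Pattern;

public class BackreferenceExample {
    public static void main(String[] args) {
        // The backreference \1 must repeat exactly what group 1 captured,
        // so the pattern matches a^n b a^n and nothing else.
        Pattern p = Pattern.compile("(a+)b\\1");

        System.out.println(p.matcher("aaabaaa").matches()); // true  (n = 3)
        System.out.println(p.matcher("aaabaa").matches());  // false (unbalanced)
    }
}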

Possible solutions

(#1) Running NFAs in parallel

This is what is currently used in sonar-analyzer-commons. We construct the equivalent NFA for the lookaround/possessive quantifier subexpression and for the expression that follows it. We then check whether the intersection (or difference) of those two languages is empty by running the two automata together symbolically, backtracking when they cannot match the same character. This is a brute-force approach, and for negative lookarounds and possessive quantifiers it requires testing all possible paths of one of the automata.
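
Below is a minimal sketch of the intersection-emptiness part of this idea (not the actual sonar-analyzer-commons implementation): states are plain integers, transitions carry single characters instead of character classes, and epsilon transitions are omitted. In the real check, the lookaround language also has to be relaxed so that it only constrains a prefix of the continuation.

import java.util.*;

class Nfa {
    final Map<Integer, Map<Character, Set<Integer>>> transitions = new HashMap<>();
    final Set<Integer> finalStates = new HashSet<>();
    final int initialState = 0;

    void addTransition(int from, char symbol, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>())
                   .computeIfAbsent(symbol, k -> new HashSet<>())
                   .add(to);
    }
}

public class ParallelNfaCheck {

    // Returns true if some word is accepted by both automata,
    // i.e. the intersection of their languages is not empty.
    static boolean intersects(Nfa a, Nfa b) {
        Deque<int[]> worklist = new ArrayDeque<>();
        Set<List<Integer>> visited = new HashSet<>();
        worklist.push(new int[] {a.initialState, b.initialState});

        while (!worklist.isEmpty()) {
            int[] pair = worklist.pop();
            if (!visited.add(List.of(pair[0], pair[1]))) {
                continue; // this pair of states was already explored
            }
            if (a.finalStates.contains(pair[0]) && b.finalStates.contains(pair[1])) {
                return true; // both automata accept the same word
            }
            // Advance both automata on the same character; if they cannot agree
            // on a character, this branch is simply abandoned (backtracking).
            Map<Character, Set<Integer>> fromA = a.transitions.getOrDefault(pair[0], Map.of());
            Map<Character, Set<Integer>> fromB = b.transitions.getOrDefault(pair[1], Map.of());
            for (Map.Entry<Character, Set<Integer>> entry : fromA.entrySet()) {
                for (int nextA : entry.getValue()) {
                    for (int nextB : fromB.getOrDefault(entry.getKey(), Set.of())) {
                        worklist.push(new int[] {nextA, nextB});
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // For a(?=b)c: the lookaround subexpression "b" and the continuation "c"
        // have an empty intersection, so the regex can never match.
        Nfa lookaround = new Nfa();
        lookaround.addTransition(0, 'b', 1);
        lookaround.finalStates.add(1);

        Nfa continuation = new Nfa();
        continuation.addTransition(0, 'c', 1);
        continuation.finalStates.add(1);

        System.out.println(intersects(lookaround, continuation)); // false
    }
}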

(#2) Solving string constraints

Here, a regex is modeled as a set of constraints over an arbitrary string, and an SMT solver is asked whether such a string exists. The theory used in this case is the theory of Unicode strings, expressed in the SMT-LIB format, which most solvers understand. For example, we model the regex a(?=.b)[^c]* as follows, where s3 matches the leading a, s1 is the string the lookahead has to match and must be a prefix of the remaining input s2, and s4 matches the [^c]* part:

(declare-fun s1 () String)
(declare-fun s2 () String)
(declare-fun s3 () String)
(declare-fun s4 () String)
(declare-fun s5 () String)
(declare-fun s6 () String)
(assert 
  (and (str.in_re s5 re.all)
       (str.in_re s3 (str.to_re "a"))
       (str.in_re s1 (re.++ re.allchar (str.to_re "b")))
       (str.prefixof s1 s2)
       (str.in_re s4 (re.* (re.inter re.allchar (re.comp (str.to_re "c")))))
       (str.in_re s6 re.all)
       (= (str.++ s4 s6) s2)))
(check-sat)

In practice, we can use java-smt, a library that provides a unified interface to many SMT solvers. However, not all solvers support the theory of strings: only Z3 and CVC4/5 do, and they do not run on the JVM. This means that if this solution were adopted, it would be necessary to package the correct native binaries for one of these solvers.
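
As a rough sketch of what this looks like (assuming the Z3 backend is available, and using java-smt's SMT-LIB parsing entry point, FormulaManager.parse, instead of building the formulas programmatically), checking the constraints from the example above boils down to a few calls:

import org.sosy_lab.common.ShutdownNotifier;
import org.sosy_lab.common.configuration.Configuration;
import org.sosy_lab.common.log.LogManager;
import org.sosy_lab.java_smt.SolverContextFactory;
import org.sosy_lab.java_smt.SolverContextFactory.Solvers;
import org.sosy_lab.java_smt.api.BooleanFormula;
import org.sosy_lab.java_smt.api.ProverEnvironment;
import org.sosy_lab.java_smt.api.SolverContext;

public class RegexSatCheck {
    public static void main(String[] args) throws Exception {
        // Requires the Z3 native libraries to be available, as noted above.
        try (SolverContext context = SolverContextFactory.createSolverContext(
                Configuration.defaultConfiguration(),
                LogManager.createNullLogManager(),
                ShutdownNotifier.createDummy(),
                Solvers.Z3)) {

            // The constraints for a(?=.b)[^c]* from the example above, in SMT-LIB form.
            String smtlib =
                  "(declare-fun s1 () String)"
                + "(declare-fun s2 () String)"
                + "(declare-fun s3 () String)"
                + "(declare-fun s4 () String)"
                + "(declare-fun s5 () String)"
                + "(declare-fun s6 () String)"
                + "(assert (and (str.in_re s5 re.all)"
                + "             (str.in_re s3 (str.to_re \"a\"))"
                + "             (str.in_re s1 (re.++ re.allchar (str.to_re \"b\")))"
                + "             (str.prefixof s1 s2)"
                + "             (str.in_re s4 (re.* (re.inter re.allchar (re.comp (str.to_re \"c\")))))"
                + "             (str.in_re s6 re.all)"
                + "             (= (str.++ s4 s6) s2)))";

            BooleanFormula constraints = context.getFormulaManager().parse(smtlib);

            try (ProverEnvironment prover = context.newProverEnvironment()) {
                prover.addConstraint(constraints);
                // An UNSAT answer would mean the regex can never match any input.
                System.out.println(prover.isUnsat()
                    ? "regex is unsatisfiable"
                    : "regex can match some input");
            }
        }
    }
}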

(#3) Checking AFA emptiness

With this approach, we first convert the regex into an alternating finite automaton (AFA). AFAs have two types of transitions: universal and existential, whereas NFAs only have existential transitions. Universal transitions enable efficient modeling of lookaheads, and possibly lookbehinds. Possessive quantifiers and anchors can also be modeled.
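
As a toy illustration (not an emptiness check, and not taken from an existing implementation), the sketch below encodes a(?=b)c with a universal transition: after reading the a, both the lookahead branch and the continuation branch have to accept, which is exactly why the pattern is impossible. To keep it tiny, the lookahead branch requires exactly "b" up to the end of the input:

import java.util.*;

public class AfaSketch {

    enum Kind { EXISTENTIAL, UNIVERSAL }

    // A transition either requires ONE of its targets to accept the rest of the
    // word (existential, as in an NFA) or ALL of them (universal).
    record Transition(char symbol, Kind kind, Set<Integer> targets) {}

    static final Map<Integer, List<Transition>> TRANSITIONS = new HashMap<>();
    static final Set<Integer> ACCEPTING = new HashSet<>();

    static boolean accepts(int state, String word) {
        if (word.isEmpty()) {
            return ACCEPTING.contains(state);
        }
        char head = word.charAt(0);
        String rest = word.substring(1);
        for (Transition t : TRANSITIONS.getOrDefault(state, List.of())) {
            if (t.symbol() != head) {
                continue;
            }
            boolean branchAccepts = (t.kind() == Kind.UNIVERSAL)
                ? t.targets().stream().allMatch(q -> accepts(q, rest))
                : t.targets().stream().anyMatch(q -> accepts(q, rest));
            if (branchAccepts) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // a(?=b)c: after reading 'a', BOTH state 1 (lookahead, expects "b")
        // and state 2 (continuation, expects "c") must accept the rest.
        TRANSITIONS.put(0, List.of(new Transition('a', Kind.UNIVERSAL, Set.of(1, 2))));
        TRANSITIONS.put(1, List.of(new Transition('b', Kind.EXISTENTIAL, Set.of(3))));
        TRANSITIONS.put(2, List.of(new Transition('c', Kind.EXISTENTIAL, Set.of(4))));
        ACCEPTING.addAll(Set.of(3, 4));

        System.out.println(accepts(0, "ab")); // false: the continuation wants 'c'
        System.out.println(accepts(0, "ac")); // false: the lookahead wants 'b'
    }
}

A real emptiness check would explore configurations (sets of states that all have to accept) symbolically instead of testing individual words.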

This approach remains to be implemented and tested.

Evaluation

The evaluation was done using a set of ~7700 regexes extracted from GitHub repositories. I believe that this dataset is representative of the kind of inputs Sonar analyzers have to work with.

Some notable facts about this data: ~96% of the regexes do not contain any construct that can cause emptiness, and can therefore be trivially verified as always satisfiable. Among the remaining ~300 regexes, only one was actually found to be impossible to match.

Summary

Feature support

| Feature | Running NFAs in parallel (#1) | Solving string constraints (#2) |
| --- | --- | --- |
| Positive lookahead | Yes | Yes |
| Negative lookahead | Yes | Yes |
| Positive lookbehind | No | Yes |
| Negative lookbehind | No | Yes |
| Nested lookarounds | No | Possible |
| Anchors | Yes | Yes |
| Possessive quantifiers | Yes | Yes |
| Combinations of the above | No | Yes |
| Bounded loops | Approx. | Yes |

Performance

| Metric | Running NFAs in parallel (#1) | Solving string constraints (#2) |
| --- | --- | --- |
| Average analysis time | 22.6 µs | 226 ms |
| Median analysis time | 5 µs | 23.5 ms |
| Worst analysis time | 345 µs | 3.63 s |
| False positives (estimated) | 1 | 0 |
| False negatives | 3 | 3 |

Completeness

Solving string constraints (#2) is the most complete solution. It can easily model all types of lookarounds, all anchors, and possessive quantifiers. It also seems possible, and relatively easy, to support backreferences, or at least an approximation of them. The most problematic situation is an assertion defined inside a repetition that applies to characters outside of the repeated subpattern: handling it requires unrolling the loop, which is not a fully accurate solution.

Running NFAs in parallel (#1) is, in contrast, the least complete approach. The main issue is that it can only take one construct into account at a time. It also lacks support for lookbehinds.

Checking AFA emptiness (#3) could support all of these features, although I suspect it would be difficult to handle backreferences.

In practice, (#1) approximates a lot. However, since the overwhelming majority of the dataset is satisfiable, these approximations rarely change the verdict.

Performance

Although the SMT solver should in theory handle complex cases better, most regexes in the benchmark are not very complex (at least in terms of the constructs they use; some of them can be quite challenging to read). Therefore, the more naive approach has the best performance, and by several orders of magnitude.

Benchmark results

Even if there were more complex cases, it would not really be possible to see the SMT solver win in terms of performance, since the fixed overhead of each solver invocation already dominates its analysis time.

Conclusion: since most inputs are simple, the brute-force approach performs far better.

Lessons learned

About the problem

It seems that the majority of regexes in the wild do not use advanced features and are usually pretty simple. Therefore, simple approaches can be expected to perform better.

About using an SMT solver in Java

Using string constraints to model the problem feels very natural. The java-smt API is nice to work with and well documented.

Tiny changes in the input can result in huge run time differences. Take for example the input a(?!b*c). In partial matching mode, it corresponds to the following constraints:

(declare-fun s2 () String)
(declare-fun s5 () String)
(declare-fun s0 () String)
(declare-fun s3 () String)
(declare-fun s4 () String)
(assert (let ((a!1 (forall ((s1 String))
             (! (let ((a!1 (str.in_re s1
                                      (re.++ (re.* (str.to_re "b"))
                                             (str.to_re "c")))))
                  (=> a!1 (not (str.prefixof s1 s2))))
                :weight 0))))
  (and (str.in_re s4 re.all)
       (str.in_re s3 (str.to_re "a"))
       (str.in_re s0 (str.to_re ""))
       a!1
       (str.in_re s5 re.all)
       (= s5 s2))))

It is found to be satisfiable by Z3 in less than 60 ms. However, the slight variation a+(?!b*c), modeled as

(declare-fun s2 () String)
(declare-fun s5 () String)
(declare-fun s0 () String)
(declare-fun s3 () String)
(declare-fun s4 () String)
(assert (let ((a!1 (str.in_re s3 (re.++ (str.to_re "a") (re.* (str.to_re "a")))))
      (a!2 (forall ((s1 String))
             (! (let ((a!3 (str.in_re s1
                                      (re.++ (re.* (str.to_re "b"))
                                             (str.to_re "c")))))
                  (=> a!3 (not (str.prefixof s1 s2))))
                :weight 0))))
  (and (str.in_re s4 re.all)
       a!1
       (str.in_re s0 (str.to_re ""))
       a!2
       (str.in_re s5 re.all)
       (= s5 s2))))

has to be timed out, since no solution is found within 20 s. I have no good explanation for this, since the change only affects the single constraint on variable s3. I am guessing it has something to do with the universal quantification (needed to model negative lookarounds). Maybe the theory of strings has some problems, or I am just missing something. It would be nice to see whether this problem also appears when using CVC5 as a backend (once support lands in java-smt). The fact that the solver has to be treated as a black box and that its performance is so uncertain makes it difficult to justify using it for other purposes.

References and useful links