Three detailed solutions to Leetcode #10: regular expression matching

Posted on September 27, 2024

As I prepare for technical interviews, I’ve been grinding Leetcode to practice my data structures and algorithms. (And if you’re looking to hire in 2025, hit me up.) I want to share detailed explanations of three different, fairly short solutions to Problem 10, “regular expression matching.” I find it especially interesting when a problem admits multiple workable solutions. Part of the beauty of computer science and software engineering is exactly that there are so many different, interesting ways to solve problems.

The problem: not really regex!

We’re asked to define a function match_pattern(s: str, p: str) -> bool that determines whether the given string s matches the given regular expression p, subject to some constraints.

The . pattern matches any single character, and * is a modifier meaning that the preceding character may be matched zero or more times. There is an additional restriction that the input p be “well-formed”: it never begins with a *, and it never contains two *s in a row.

Moreover, for our match_pattern to output True, the pattern must cover the entire input string.
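
For instance, here are a few concrete cases (these are the problem’s own introductory examples):

match_pattern("aa", "a")   # False: 'a' covers only a prefix of "aa"
match_pattern("aa", "a*")  # True: the star matches 'a' twice
match_pattern("ab", ".*")  # True: '.' repeated covers any string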

This might sound different from the regex you know and love: there’s no alternation, no character classes, no other quantifiers – . and * are the only special characters. This vastly simplifies the problem compared to full regex matching!

In the absence of any *, this is just a simple string-matching problem. We could just traverse s and p simultaneously, checking along the way that each character of p matches the corresponding character of s, letting any . in p match any character in s.

In other words, when e.g. s = "abc" and p = "a.c", we can walk both strings together and compare character by character. When comparing 'b' with '.', we say yes. We would arrive at the end of both inputs at the same time, and along the way said yes everywhere, so we decide that this string matches this pattern.
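
As a sketch, the star-free special case could look like this (match_simple is a hypothetical helper, not part of the final solution):

def match_simple(s, p):
    # star-free matching: the strings must have equal length, and each
    # pattern character must match its counterpart, with '.' as wildcard
    return len(s) == len(p) and all(
        p_c == '.' or s_c == p_c
        for s_c, p_c in zip(s, p)
    )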

To illustrate the challenge involved in handling the *, let’s walk through an example in detail. Say s = "aab" and p = "a*b". Set up two pointers: let i be an index into s and j be an index into p, both starting at zero. Right away we see that p[j+1] == '*', and we face a decision: we can use the star to consume the 'a' at s[i] (advancing i while leaving j in place, since the star may be used again), or we can skip the a* entirely (advancing j by two while leaving i in place).

That decision is the crux of the challenge: we can’t know for sure which choice to make.

The idea is this: we don’t know which choice to make, so try both ways!

A backtracking search will, in the worst case, explore the entire decision tree of the problem, which we can visualize like this. Rather than show the indices, I’ll show s and p shrinking as we move through them. States in red are those where we return False, and the green state is where we return True, having successfully matched the string against the pattern.

We can code this backtracking search as a recursive algorithm.

def match_pattern(s, p):
    if s == '' and p == '': return True
    if p == '': return False # pattern empty, but more string left
    if len(p) >= 2 and p[1] == '*':
        # right branch: skip star
        if match_pattern(s, p[2:]): return True
        # but if that fails, then we have to use the star
        return (
            len(s) > 0 # for that, we need at least one char in s
            and match_char(s[0], p[0]) # it must match the head of p
            and match_pattern(s[1:], p) # and p has to match the rest of s
        )
    else: # we're not handling a star, so we need:
        return (
            len(s) > 0 # there to be at least one character in s
            and match_char(s[0], p[0]) # that it match the head of p
            and match_pattern(s[1:], p[1:])
            # that the rest of the string match the rest of the pattern
        )

The match_char function is simply an equality check on characters that accounts for the pattern character '.' being equal to any string character.
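
Concretely, it can be as small as this (I’ll pass the string character first and the pattern character second, matching the calls above):

def match_char(s_c, p_c):
    # the pattern character '.' matches any string character;
    # otherwise the two characters must be equal
    return p_c == '.' or s_c == p_c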

To assess the time complexity of this solution, let’s consider an input with lots of stars: match_pattern("aaaaab", "a*a*a*a*c")

Of course, the output should be False due to the last characters being mismatched, but to discover this, the backtracking search has to explore the entire tree. So how big does the tree become for this input?

When there’s a star, we make two recursive calls, each on an input that’s smaller by only one step: the ‘use star’ branch drops a single character of s, and the ‘skip star’ branch drops a single starred token of p. With a branching factor of two and a depth proportional to the size of the input, we get an exponential time complexity overall.

However, you might notice that a sequence of a*a*... is equivalent to just a single a*. Therefore, we can ‘optimize’ the regular expression before interpreting it. By doing this and prioritizing the ‘use star’ branch of the backtracking search we get a greedy formulation of the search that avoids exponential blowup in most cases. This formulation is performant enough to pass the time requirements on Leetcode.
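
Here is a minimal sketch of that preprocessing pass (collapse_stars is my name for it; it drops a starred token whenever it repeats the previous token, which is exactly the a*a*... case that causes the blowup):

def collapse_stars(p):
    # rewrite e.g. 'a*a*a*a*c' into the equivalent 'a*c'
    out = []
    i = 0
    while i < len(p):
        if i + 1 < len(p) and p[i+1] == '*':
            token = p[i:i+2]
            # a starred token that repeats the previous one is redundant
            if not out or out[-1] != token:
                out.append(token)
            i += 2
        else:
            out.append(p[i])
            i += 1
    return ''.join(out)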

Avoid repeating work: dynamic programming

The subproblems to solve in the recursive algorithm above are overlapping.

To see why, let’s draw out some of the recursive calls for match_pattern("aaab", "a*a*a*c")

See? There are two different paths that lead us to solving the subproblem match_pattern("aab", "a*a*c"):

  1. Left (‘use star’) then right (‘skip star’)
  2. Right (‘skip star’) then left (‘use star’)

The essence of dynamic programming is to solve each unique subproblem exactly once, and there are two broad approaches to accomplishing that.

Strategy 1: top-down with memoization

This technique is quite simple: take the recursive algorithm and attach a cache of solutions to it. This cache will be a hashtable, initially empty. Each key will be a tuple of inputs to match_pattern. At the very beginning of match_pattern, we consult the cache to see if we’ve already solved the problem, returning the cached result in that case. Else, we actually do the computation, and store its result in the cache before returning.

def match_pattern_dp(s, p, cache):
    if s == '' and p == '': return True
    if p == '': return False

    # answer already known:
    if (s, p) in cache: return cache[(s, p)]

    if len(p) >= 2 and p[1] == '*':
        outcome = (
            # skip star:
            match_pattern_dp(s, p[2:], cache)
            # or use star:
            or len(s) > 0
            and match_char(s[0], p[0])
            and match_pattern_dp(s[1:], p, cache)
        )

    else:
        outcome = (
            len(s) > 0
            and match_char(s[0], p[0])
            and match_pattern_dp(s[1:], p[1:], cache)
        )

    cache[(s, p)] = outcome
    return outcome

With the cache in place, whenever we run into a subproblem we’ve seen before, we can just look up its answer.
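
To keep the original one-argument interface, we can add a small wrapper that starts each top-level call with a fresh cache (match_pattern_memo is a name I’ve made up for this sketch; Python’s functools.lru_cache could serve the same purpose with less ceremony):

def match_pattern_memo(s, p):
    # start each top-level call with a fresh, empty cache
    return match_pattern_dp(s, p, {})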

Just by looking at this program, however, it’s not so obvious what the time complexity is. Instead, it’s more illuminating to look at the tree again:

With the addition of the cache, it doesn’t matter anymore by what path we arrive at a particular subproblem. We can visualize this by fusing the duplicated middle nodes.

Imagining that we continue fusing duplicated middle nodes all the way, the structure that this top-down memoizing algorithm ends up exploring takes the shape of a rectangle.

To determine the time complexity of our algorithm, it suffices to count the nodes in the rectangle. One side’s length is bounded by the length of s – let’s call that \(n\) – and the other side’s length is bounded by the length of p – let’s call that \(m\). We get an \(O(n\times m)\) complexity overall then.

Equipped with this better understanding of the problem’s underlying structure, we can exploit it to design an even better implementation.

Strategy 2: bottom-up table construction

Function calls are slow. Rather than express the exploration of the problem’s rectangular state space as a recursive function + a hashtable, we can instead directly build the rectangle as a matrix, using nested loops. CPUs hate function calls, but love loops. Of course, this won’t improve the asymptotic complexity of the algorithm – it will still be \(O(n \times m)\) – but the constant factors, which matter in Real Life, will be better.

The game plan is to write a function that will construct a 2D array T, such that T[i][j] holds the answer to the question “does the pattern formed by taking the last i elements of p match the string obtained by taking the last j characters of s?”

Sheesh. That’s a complicated definition. I picked it because it really is a bottom-up version of what we did in the previous section.

To simplify the implementation slightly, we’ll preprocess the input pattern p to ‘tokenize’ it. This process will turn a pattern like ba*a*bb*c into ['b', 'a*', 'a*', 'b', 'b*', 'c']. This avoids an uncomfortable situation where we take e.g. the last two characters of that pattern, giving the illegal pattern *c. After tokenizing, the index i in T[i][j] counts whole tokens rather than raw characters.

def preprocess(p):
    out = []
    i = 0
    while i < len(p):
        if i+1 < len(p) and p[i+1] == '*':
            out.append(p[i:i+2])
            i += 2
        else:
            out.append(p[i])
            i += 1
    return out

def match_pattern_tabular(s, p):
    p = preprocess(p)
    # row 0 handles the empty pattern: it matches only the empty string
    T = [[True] + [False] * len(s)]
    for i in range(1, len(p) + 1):
        p_c = p[-i]
        T.append([False] * (len(s) + 1)) # allocate row i before filling it
        T[i][0] = p_c.endswith('*') and T[i-1][0]
        for j in range(1, len(s) + 1):
            c = s[-j]
            T[i][j] = (
                match_char(c, p_c) and T[i-1][j-1]
                if not p_c.endswith('*') else (
                    # use star
                    match_char(c, p_c[0]) and T[i][j-1]
                    # skip star
                    or T[i-1][j]
                )
            )
    return T[len(p)][len(s)]
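
As a quick sanity check, this version agrees with the earlier ones, and the star-heavy adversarial input from before is no longer a problem:

print(match_pattern_tabular("aab", "a*b"))           # True
print(match_pattern_tabular("aaaaab", "a*a*a*a*c"))  # False, and quickly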

The nested loops in this approach make obvious the \(O(n \times m)\) time complexity.

We could further improve the memory usage of this approach by observing that only the previous row is needed to compute the next row. I leave that as an exercise to the reader.

The general solution: interpret an NFA

I was first intrigued by this Leetcode problem because it reminded me of my theory of computation class. We spent quite some time in that course talking about different kinds of automata, which are (idealized) models of computation. I learned in that course that regular expressions correspond to nondeterministic finite automata (NFAs), and that a regex can be converted into an NFA according to an algorithm called Thompson’s construction. Yes, that’s Ken Thompson, co-creator of Unix!

But what’s an NFA?

Let’s start with an example. Here’s an NFA that corresponds to the regex a*b*c:

This diagram represents an abstract machine: we feed it an input string, and it either accepts or rejects that string.

Fundamentally, an NFA is a state machine. The circles in the diagram are the different states the machine can be in. The letters inside the states are just arbitrary names for those states. The arrows are the transitions of the machine, and the label on an arrow indicates which letter we need to see at the head of the string to take that transition, advancing to the corresponding state and continuing with the rest of the string. However, the label \(\epsilon\) is special: an \(\epsilon\)-transition can be taken “for free,” without consuming the head letter of the string.

The machine accepts a given string if, after processing the whole string, it is in an accepting state, indicated by the double circle. The machine rejects the string in two situations:

  1. The machine’s current state cannot accept the head letter of the string.
  2. The string becomes empty while the machine is not in an accepting state.

An NFA is nondeterministic. What that means is that the transition to take to move to the next state isn’t (always) uniquely determined. For example, in the first state on the left (or the second one), we could follow the \(a\) (or \(b\)) transition or the \(\epsilon\) transition whenever the head of the string is an a (or b, respectively). More generally, NFAs might have multiple transitions with the same label.

This somewhat complicates the idea of “accepting” a string. How does the NFA “know” which choice to make, when more than one transition is possible? It doesn’t know! This is a mathematical object, after all. In reality, the acceptance criterion for an NFA is more precisely that there exist a sequence of transitions that consumes the entire input string and ends on an accepting state.

But let’s not get ahead of ourselves. Before we see how to discover whether such a sequence of transitions exists, we need to decide how to represent an NFA in code, and how to translate a regex into an NFA.

Constructing an NFA from a regex

Fortunately, the regex in this problem are not really regex! Therefore, we don’t need to implement the entire Thompson construction. Just a part of it will do.

First, let’s choose a way of representing an NFA. We’ll identify the states of the NFA with numbers, and use an array to represent the NFA itself. Indexing into that array at \(i\) will give us a description of the transitions that are possible in state \(i\). We’ll describe a set of transitions as a list of key-value pairs, with the key being the letter we need to see and the value being the state index to move to in that case. To represent \(\epsilon\)-moves, we’ll use the empty string.

We can represent the NFA from before, for the regex a*b*c, as the following.

abc = [
    [('a', 0), ('', 1)],
    [('b', 1), ('', 2)],
    [('c', 3)],
    [],
]

Due to the simplified nature of the regex in this problem, we’ll always have a unique accepting state, and it will always be the last state in the list. The construction below processes the pattern one token at a time, appending a state per token:

def regex_to_nfa(p):
    nfa = []
    i = 0
    while i < len(p):
        new_state_index = len(nfa)
        if i + 1 < len(p) and p[i+1] == '*':
            new_state = [ ('', new_state_index + 1), (p[i], new_state_index) ]
            i += 2
        else:
            new_state = [ (p[i], new_state_index + 1) ]
            i += 1
        nfa.append(new_state)
    nfa.append([]) # the accepting state
    return nfa
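
As a sanity check, running this construction on our example regex reproduces the abc structure from above, up to the order of transitions within each state:

print(regex_to_nfa("a*b*c"))
# [[('', 1), ('a', 0)], [('', 2), ('b', 1)], [('c', 3)], []]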

Equipped with a Python model of NFAs and a procedure for constructing an NFA from a regex, we’re ready to tackle the nondeterminism of the NFA to discover how to actually run one of these machines.

Running an NFA

The computers that we use are deterministic, so to run an NFA, we have to simulate its nondeterminism. One way to do this is to keep track of a set of states that the NFA is in simultaneously. To see how this works, let’s walk through running the NFA of the regex a*b*c on the string aac, using the state numbering from the representation above.

  1. Before consuming any input, the machine is in state 0 – but \(\epsilon\)-moves are free, so it is simultaneously in state 0 and all the states reachable from there by using only \(\epsilon\)-moves: {0, 1, 2}.
  2. We read the first a. Of the active states, only state 0 has an a-transition, which loops back to 0. We again collect all the states reachable from there by using only \(\epsilon\)-moves, leaving us in {0, 1, 2}.
  3. The second a plays out identically: we remain in {0, 1, 2}.
  4. We read c. Only state 2 has a c-transition, taking us to state 3, which has no outgoing moves at all. The active set is now {3}.
  5. The string is exhausted, and state 3 – the accepting state – is among the active states, so the machine accepts aac.

You might have noticed that “all the states reachable from there by using only \(\epsilon\)-moves” showed up a few times in that walkthrough. This concept has a fancy math name: it’s called the \(\epsilon\)-closure of a state. It’s easily expressed as a depth-first search through the NFA, following only \(\epsilon\)-moves. The NFAs in our setting never have \(\epsilon\)-moves that jump “backwards”, so we don’t need to worry about marking nodes as visited.

def epsilon_closure(nfa, i):
    closure = set([i]) # the state itself is part of the closure
    for (letter, j) in nfa[i]:
        if letter == '':
            closure.update(epsilon_closure(nfa, j))
    return closure
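
For example, on the abc machine from earlier, the closure of the start state is exactly the starting set from the walkthrough:

print(epsilon_closure(abc, 0)) # {0, 1, 2}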

Finally, we’re ready to simulate an NFA.

def run_nfa(nfa, s):
    active_states = epsilon_closure(nfa, 0)

    for c in s:
        # each iteration of the outermost loop
        # calculates a new set of active states.
        next_active_states = set()
        for i in active_states:
            for (letter, j) in nfa[i]:
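                # an epsilon label ('') never equals a real character, so
                # match_char rejects it: epsilon moves never consume input here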
                if match_char(c, letter):
                    next_active_states.update(epsilon_closure(nfa, j))
        active_states = next_active_states

    # accept or reject the string according to whether
    # the accepting state is among the active states, now that
    # the entire string has been traversed.
    return len(nfa)-1 in active_states
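
Gluing the pieces together gives the third complete solution (match_pattern_nfa is my name for this wrapper):

def match_pattern_nfa(s, p):
    # translate the regex into an NFA once, then simulate it on s
    return run_nfa(regex_to_nfa(p), s)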

Now let’s evaluate the time complexity of run_nfa. The two outermost loops require that it be at least \(O(n \times m)\), but it might be worse than that. We need to account for the call to epsilon_closure that happens in the innermost loop!

Recall that calculating an \(\epsilon\)-closure requires making a depth-first search through the NFA. In a general graph, a depth-first search has a time complexity of \(O(V + E)\); in our setting the vertices are the states of the NFA and the edges are the transitions. The number of transitions per state is bounded by a constant, namely \(2\) – just look at regex_to_nfa. Therefore, to calculate the \(\epsilon\)-closure of one state takes \(O(m)\) time.

It suffices at this point to determine how many \(\epsilon\)-closures we need to calculate. Notice that the call to epsilon_closure is guarded by the condition if match_char(c, letter). Within a single state’s transition list, this condition can be true at most once! That’s because there are at most \(2\) transitions per state in our NFAs, and at most \(1\) that’s not an \(\epsilon\)-move. So we compute at most one \(\epsilon\)-closure per active state per character of s – at most \(O(n \times m)\) closures in total.

Each of those closures takes \(O(m)\) time, so, thanks to the restricted nature of the regex in this problem, the simulation of the NFAs that result from these regex takes \(O(n \times m^2)\) time – not quite as tight as the dynamic programming solutions, but still polynomial, and a far cry from the exponential backtracking search.

In a general NFA, where the number of transitions per state labelled with a given letter is bounded by the total number of states in the NFA (instead of a constant), the time complexity would be worse.

Conclusion

Leetcode problem 10 is a fascinating one because it admits three different, workable solutions. The backtracking, exponential-time algorithm is performant enough to pass the automated checker so long as we make a simple optimization of collapsing a*a*... down to a*. The two dynamic programming solutions admit a much better time complexity of \(O(n\times m)\).

  1. The top-down implementation uses a hashtable to store previously seen solutions of subproblems.
  2. The bottom-up implementation directly exploits the resulting rectangular shape of the problem’s state space.

Finally, the most theoretically rigorous and (in principle) general solution to the problem comes from the study of the theory of computation: we translate the regular expression into a nondeterministic finite automaton and interpret it by tracking a set of active states to simulate the nondeterminism. Of course, thanks to the simplified nature of the regex in the problem, we were able to cut some corners to simplify the implementation as well.