James Gregson's Website

Jul 22, 2023

Regular Expression Matching

This came up years ago for me as an interview question that I flubbed badly. The task was to write a simplified regular expression matching algorithm that would test if a given string matched the provided pattern. The set of valid patterns was considerably simplified from full regular expressions to be only matching single characters or the Kleene Star * which matched zero or more instances of the previous character.

I quickly put together an implementation that worked well greedily but stumbled on backtracking, which is needed to match patterns like "aa*a". This should match strings "aa", "aaa", aaaa", and so on. However, to match the final character it is necessary to not consume the final "a" with the Kleene star since this would exhaust the input before reaching the final term. I was not able to figure out how to do this and did not get the job.

Now I do know how to do this. This post documents how not just for the simple case above but also a much more complete set of patterns.

Preamble

In the end, the bit I was missing was backtracking. As the pattern is matched, the matching algorithm needs to track a state indicating both where it is in the input and where it is in the pattern. In greedy matching, only a single state is maintained and the match fails whenever the next term fails to match or the input is exhausted before reachind the end of the pattern. The latter is what happens in the example at the start, the "a*"" term gobbles up all the input "a"s leaving the final a in the pattern unmatched.

For a backtracking implmentation, the algorithm creates & tracks a set of states, each corresponding to one of several discrete choices that can be made during the algorithm. These states are advanced as input is parsed and the match fails when all of the states fail to match.

In the simple expression grammar above, the discrete choices are to either match or skip matching an additional "a" when (recursively) handling the Kleene Star.

So a backtracking algorithm needs to be able to keep track of multiple states, each indicating the input position and pattern position, advance each state through the input and cull those states whenever they fail to match. If any state reaches the end of the pattern, the string matches.

Pattern Representation

The matching algorithm I'll describe here is richer than the introductory example. It supports matching any specified character, a binary or '|' operator that matches either the left or right operand, a binary and '&' operator that matches both the left & right operand in order and the Kleene Star '*'. The first is a terminal node of the expression tree, while the '&', '|' and '*' operators are internal nodes.

In addition, a nop node '~' is another terminal that automatically matches but consumes no input. It is helpful for defining other operations, e.g. "a?" matches zero or one instances of character "a" and can be defined as "~|a". It also ensures that every non-terminal node has both left and right children.

Beyond this, there are number of other node types that are pretty easy to figure out once the above are handled.

To represent this, an array-based tree is used with each element containing the following structure:

struct node_t {
    constexpr static int INVALID = -1;
    int    parent = INVALID;
    int    left   = INVALID;
    int    right  = INVALID;
    type_t type   = type_t::NOP;
    int    data   = -1;
};

Parent, left and right are indices of the corresponding nodes in the array and are used to walk the tree when matching. node_t::INVALID is used to represent the parent of the root node. The data field is used by the character node to define what character to match in the input while the type indicates the type of node as defined below:

enum type_t { NOP, CHAR, ANY, END, AND, OR, STAR, PLUS };

State Representation

This is enough to define tree but it is still necessary to track the matching algorithm state. To perform the matching, the algorithm walks the tree down to the left most terminal. If the terminals matches, it walks back up to the terminal's parent. Depending on the node type, it then either evaluates the right child or walks up to it's parent.

The algorithm state representation used is the currently visited node and the node that it arrived from as well as the current position in the input. This is enough to know both where in the tree we are and where we came from, which is enough to know what needs to be evaluated next.

In this algorithm, these states are tracked via a matching stack:

// stack entries are {input_pos, current_node, previous_node}
using match_stack_t = std::stack<std::tuple<const char*,int,int>>;

Matching Algorithm via Tree Traversal

By knowing where we are in the input and where we are in the tree and how we got there, we can implement the matching recursively. However, it is not desireable to actually use recursion because we could end up exhausting the call stack on long inputs. Instead we use a stack as above (and potentially exhaust memory instead).

At each iteration, we will pull a state from the stack. First we will check if we've exhausted the input string. If so the current state is culled. Then we will check if we've successfully processed the root node after recursing back up the tree. If so and input was consumed from the string, we've successfully matched the expression. Otherwise the state is culled.

result_t match( const char *input ) const {
    match_stack_t stack;
    const int n = strlen(input)+1;
    // tree is built in post-fix order so root is last node
    stack.push({input,int(_tree.size())-1,node_t::INVALID});
    while( !stack.empty() ){
        auto [str,node,last] = stack.top();
        stack.pop();

        if( str-input >= n ){
            // exhausted input, consider remaining states
            continue;
        } else if( node == node_t::INVALID ){
            // finished recursing back to root
            if( str != input ){
                // consumed input, successfully matched
                return {true,str};
            }
            // did not consume input, match failed, try
            // remaining states
            continue;
        }

        switch( _tree[node].type ){
        case type_t::NOP:
            walk_nop( stack, str, node, last );
            break;
        case type_t::CHAR:
            walk_char( stack, str, node, last );
            break;

        ... other node types ...

        case type_t::AND:
            walk_and( stack, str, node, last );
            break;
        case type_t::OR:
            walk_or( stack, str, node, last );
            break;
        case type_t::STAR:
            walk_star( stack, str, node, last );
            break;

        ... other node types ...
        }
    }
    // exhausted regular expression or input
    return {false,input};
}

Following this, we traverse the tree according to a set of rules for each node. The rules depend on the node type. The one common theme is that on a successful match at a subtree a new state is pushed with the input position and starting at the subtree's parent arriving from the subtree. However, when the subtree does not match, it does not push a new state and the state is culled.

For example, if we are at an '&' node and came from the '&' node's parent, we need to evaluate the left subtree. If we arrived from the left subtree, we need to evaluate the right subtree. Finally if we arrived from the right subtree, we need to go back up to the '&' node's parent.

void walk_and( match_stack_t& stack, const char* str, int node, int last ) const {
    if( last == node_t::INVALID || last == _tree[node].parent ){
        stack.push({str,_tree[node].left,node});
    } else if( last == _tree[node].left ){
        stack.push({str,_tree[node].right,node});
    } else if( last == _tree[node].right ){
        stack.push({str,_tree[node].parent,node});
    } else {
        assert( false && "Should never happen. Tree is borked!");
    }
}

If, however, we are at an '|' node we first check the left subtree. If it matches, we recurse back up to the '|' node's parent and skip checking the right operand. This is done by pushing both left and right nodes onto the stack at the current position, if the left traveral fails, the algorithm will pop the right state and proceed from there. If we sucessfully arrive back at the '|' node from either of the subtrees, the match succeeded and we head back up to the parent node:

void walk_or( match_stack_t& stack, const char* str, int node, int last ) const {
    if( last == node_t::INVALID || last == _tree[node].parent ){
        stack.push({str,_tree[node].right,node});
        stack.push({str,_tree[node].left, node});
    } else if( last == _tree[node].left || last == _tree[node].right ){
        stack.push({str,_tree[node].parent,node});
    } else {
        assert( false && "Should never happen. Tree is borked!");
    }
}

The Kleene Star is similar to the '|' case except whenever we arrive at the '*' node we push both its parent and its left subtree. The order these are pushed determines whethe the '*' is greedy (tries to consume as much input as possible) or non-greedy (tries to consume as little as possible). This is the greedy version since it will retry the left subtree before recursing back to the parent.

void walk_star( match_stack_t& stack, const char* str, int node, int last ) const {
    if( last == node_t::INVALID || last == _tree[node].parent ){
        stack.push({str,_tree[node].parent, node});
        stack.push({str,_tree[node].left,   node});
    } else if( last == _tree[node].left ){
        stack.push({str,_tree[node].parent,node});
        stack.push({str,_tree[node].left,  node});
    } else {
        assert( false && "Should never happen. Tree is borked!");
    }
}

Finally to pull it all together, there need to be the correponding implmentations for the '~' and character terminals. In the case of the nop, the string position remains the same and we move back to the parent node. The character node is almost the same, except we only recurse back if the next input character matches. In that case, we advance the string and recurse back to the parent node. Otherwise, the match for this state does not continue. These are:

void walk_nop( match_stack_t& stack, const char* str, int node, int last ) const {
    stack.push({str,last,node});
}

void walk_char( match_stack_t& stack, const char* str, int node, int last ) const {
    if( *str == _tree[node].data ){
        stack.push({str+1,last,node});
    }
}

Aside from some boilerplate to actually build the tree, the definition of the result type and a class to wrap it all together, that's really all there is to it.

The full example, including additional node types, is below

Full Example Source Code

#include <string>
#include <cassert>
#include <iostream>
#include <stack>
#include <tuple>
#include <vector>

class regex_t {
public:
    struct result_t {
        bool        result;
        const char* remainder;
        operator bool() const { return result; }
        operator const char*() const { return remainder; }
    };

    result_t match( const char *input ) const {
        match_stack_t stack;
        const int n = strlen(input)+1;
        // tree is built in post-fix order so root is last node
        stack.push({input,int(_tree.size())-1,node_t::INVALID});
        while( !stack.empty() ){
            auto [str,node,last] = stack.top();
            stack.pop();

            if( str-input >= n ){
                // exhausted input, consider remaining states
                continue;
            } else if( node == node_t::INVALID ){
                // finished recursing back to root
                if( str != input ){
                    // consumed input, successfully matched
                    return {true,str};
                }
                // did not consume input, match failed, try
                // remaining states
                continue;
            }

            switch( _tree[node].type ){
            case type_t::NOP:
                walk_nop( stack, str, node, last );
                break;
            case type_t::CHAR:
                walk_char( stack, str, node, last );
                break;
            case type_t::ANY:
                walk_any( stack, str, node, last );
                break;
            case type_t::END:
                walk_end( stack, str, node, last );
                break;
            case type_t::AND:
                walk_and( stack, str, node, last );
                break;
            case type_t::OR:
                walk_or( stack, str, node, last );
                break;
            case type_t::STAR:
                walk_star( stack, str, node, last );
                break;
            case type_t::PLUS:
                walk_plus( stack, str, node, last );
                break;
            }
        }
        // exhausted regular expression or input
        return {false,input};
    }

    int nop(){
        return add_terminal( {.type=type_t::NOP} );
    }

    int character( const char match ){
        assert( match > 0 );
        return add_terminal( {.type = type_t::CHAR, .data = match} );
    }

    int any(){
        return add_terminal( {.type = type_t::ANY } );
    }

    int end(){
        return add_terminal( {.type = type_t::END} );
    }

    int both( const int left, const int right ){
        return add_tree( {.type = type_t::AND}, left, right );
    }

    int either( const int left, const int right ){
        return add_tree( {.type = type_t::OR}, left, right );
    }

    int star( const int left ){
        return add_tree( {.type = type_t::STAR}, left, nop() );
    }

    int plus( const int left ){
        return add_tree( {.type = type_t::PLUS}, left, nop() );
    }

    int maybe( const int left ){
        return either( left, nop() );
    }

private:
    using match_stack_t = std::stack<std::tuple<const char*,int,int>>;
    enum type_t { NOP, CHAR, ANY, END, AND, OR, STAR, PLUS };

    struct node_t {
        constexpr static int INVALID = -1;
        int    parent = INVALID;
        int    left   = INVALID;
        int    right  = INVALID;
        type_t type   = type_t::NOP;
        int    data   = -1;
    };

    void walk_nop( match_stack_t& stack, const char* str, int node, int last ) const {
        stack.push({str,last,node});
    }

    void walk_char( match_stack_t& stack, const char* str, int node, int last ) const {
        if( *str == _tree[node].data ){
            stack.push({str+1,last,node});
        }
    }

    void walk_any( match_stack_t& stack, const char* str, int node, int last ) const {
        if( *str != '\0' ){
            stack.push({str+1,last,node});
        }
    }

    void walk_end( match_stack_t& stack, const char* str, int node, int last ) const {
        if( *str == '\0' ){
            stack.push({str,last,node});
        }
    }

    void walk_and( match_stack_t& stack, const char* str, int node, int last ) const {
        if( last == node_t::INVALID || last == _tree[node].parent ){
            stack.push({str,_tree[node].left,node});
        } else if( last == _tree[node].left ){
            stack.push({str,_tree[node].right,node});
        } else if( last == _tree[node].right ){
            stack.push({str,_tree[node].parent,node});
        } else {
            assert( false && "Should never happen. Tree is borked!");
        }
    }

    void walk_or( match_stack_t& stack, const char* str, int node, int last ) const {
        if( last == node_t::INVALID || last == _tree[node].parent ){
            stack.push({str,_tree[node].right,node});
            stack.push({str,_tree[node].left, node});
        } else if( last == _tree[node].left || last == _tree[node].right ){
            stack.push({str,_tree[node].parent,node});
        } else {
            assert( false && "Should never happen. Tree is borked!");
        }
    }

    void walk_star( match_stack_t& stack, const char* str, int node, int last ) const {
        if( last == node_t::INVALID || last == _tree[node].parent ){
            stack.push({str,_tree[node].parent, node});
            stack.push({str,_tree[node].left,   node});
        } else if( last == _tree[node].left ){
            stack.push({str,_tree[node].parent,node});
            stack.push({str,_tree[node].left,  node});
        } else {
            assert( false && "Should never happen. Tree is borked!");
        }
    }

    void walk_plus( match_stack_t& stack, const char* str, int node, int last ) const {
        if( last == node_t::INVALID || last == _tree[node].parent ){
            stack.push({str,_tree[node].left, node});
        } else if( last == _tree[node].left ){
            stack.push({str,_tree[node].parent,node});
            stack.push({str,_tree[node].left,  node});
        } else {
            assert( false && "Should never happen. Tree is borked!");
        }
    }

    // builder stuff
    int add_terminal( const node_t& n ){
        _tree.emplace_back( n );
        return _tree.size()-1;
    }

    int add_tree( const node_t& n, int left, int right ){
        assert( left >= 0 && left < _tree.size() && right >= 0 && right < _tree.size() );
        int parent = add_terminal( n );
        _tree[parent].left  = left;
        _tree[parent].right = right;
        _tree[ left].parent = parent;
        _tree[right].parent = parent;
        return parent;
    }

    std::vector<node_t> _tree;
};

std::ostream& operator<<( std::ostream& os, const regex_t::result_t& res ){
    os << (res.result ? "match succeeded" : "match_failed") << ", remainder: \"" << res.remainder << "\"" << std::endl;
    return os;
}

int main( int argc, char **argv ){
    regex_t re;
    re.both(
        re.character('a'),
        re.both(
            re.plus(
                re.character('a')
            ),
            re.both(
                re.character('a'),
                re.end()
            )
        )
    );

    std::cout << re.match("aaa") << std::endl;

    return 0;
}