7 minute read

“Regular Expression Matching is where string manipulation meets automata theory. It requires translating a sequence of patterns into a resilient state machine.”

TL;DR

Regular Expression Matching supports ‘.’ (any character) and ‘*’ (zero or more of the preceding element). The star creates branching: skip the char-star pair entirely, or consume a matching character and keep the pattern active. Bottom-up DP computes dp[i][j] for whether s[:i] matches p[:j], collapsing exponential backtracking into O(NM) time. This is the algorithmic foundation of tokenizers in NLP pipelines and builds on the wildcard matching DP pattern.

1. Introduction: The Power of Pattern

From terminal commands (grep "error.*") to tokenizers in Large Language Models, Regular Expressions (RegEx) are the backbone of text processing. While most engineers use them every day, few understand what’s happening under the hood.

At its core, RegEx matching is a Search Problem. Unlike simple string search (finding a substring), RegEx introduces Wildcards (.) and Quantifiers (*). This makes the search space branch: when you see a*, you could choose to match zero ‘a’s, one ‘a’, or many.

  • Should I consume this character?
  • Should I skip this pattern?
  • Should I backtrack?

We solve the RegEx matching problem using Dynamic Programming (DP), exploring how to manage overlapping subproblems and complex state transitions without falling into the “redos” (Regular Expression Denial of Service) trap of exponential time complexity.

1.1 Why This Problem Matters

  • Compiler Design: Understanding lexical analysis and token generation.
  • Security: Analyzing how catastrophic backtracking can crash a server.
  • Automata Theory: Bridging the gap between NFA (Non-deterministic Finite Automaton) and practical code.

2. The Problem Statement

Implement regular expression matching with support for . and *.

  • . Matches any single character.
  • * Matches zero or more of the preceding element.

The matching should cover the entire input string.


We explore State-Driven Intelligence:

  • DSA: We use DP to track the valid states of a matching engine.
  • ML System Design: Advanced NLP pipelines use finite state machines for tokenization and entity recognition.
  • Speech Tech: Conversational AI systems transition through dialog states based on user intent.
  • Agents: Ethical AI agents use rule-based state machines to enforce safety guardrails.

4. Approach 1: Top-Down Recursion with Memoization

The most intuitive way to think about * is as a branching decision.

4.1 The Recurrence Relation

Let dp(i, j) be true if s[i:] matches p[j:].

  1. Check the first character: first_match = (i < len(s) and p[j] in {s[i], '.'}).
  2. Handle the * lookahead: If p[j+1] == '*':
    • Option A: Ignore the * construct (match zero): dp(i, j + 2).
    • Option B: Use the * construct (if first match is true): first_match and dp(i + 1, j).
  3. No * quantifier: Just move both pointers forward: first_match and dp(i + 1, j + 1).

4.2 Implementation

class Solution:
    def isMatch(self, s: str, p: str) -> bool:
        memo = {}

    def dp(i, j):
        if (i, j) in memo:
            return memo[(i, j)]

            if j == len(p):
                return i == len(s)

                first_match = i < len(s) and p[j] in {s[i], "."}

                if j + 1 < len(p) and p[j+1] == "*":
                    res = dp(i, j + 2) or (first_match and dp(i + 1, j))
                else:
                    res = first_match and dp(i + 1, j + 1)

                    memo[(i, j)] = res
                    return res

                    return dp(0, 0)

5. Approach 2: Bottom-Up Dynamic Programming

A 2D table approach is often preferred for its iterative efficiency and O(1) memory access speed.

5.1 The Table Design

T[i][j] represents if s[:i] matches p[:j].

5.2 Implementation

class Solution:
    def isMatch(self, s: str, p: str) -> bool:
        n, m = len(s), len(p)
        dp = [[False] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = True

        for j in range(2, m + 1):
            if p[j-1] == "*":
                dp[0][j] = dp[0][j-2]

                for i in range(1, n + 1):
                    for j in range(1, m + 1):
                        if p[j-1] == s[i-1] or p[j-1] == ".":
                            dp[i][j] = dp[i-1][j-1]
                        elif p[j-1] == "*":
                            match_zero = dp[i][j-2]
                            preceding_char = p[j-2]
                            match_one_plus = (s[i-1] == preceding_char or preceding_char == ".") and dp[i-1][j]
                            dp[i][j] = match_zero or match_one_plus

                            return dp[n][m]

6. Implementation Deep Dive: Handling the Quantifier

The logic dp[i][j] = dp[i][j-2] or (first_match and dp[i-1][j]) is the heart of the algorithm.

  • dp[i][j-2] means skipping the * and its preceding character entirely.
  • dp[i-1][j] means we used the * to match a character, and we’re looking to match the rest of the string with the same pattern.

This is equivalent to a Non-deterministic Finite Automaton (NFA). When the machine sees a *, it effectively splits. If any path reaches the final state, the string matches.


7. Comparative Performance Analysis

Metric Recursion (No Memo) DP (Memo/Table) NFA (Automata)
Time O(2^{N+M}) O(NM) O(NM)
Space O(N+M) O(NM) O(M)

8. Real-world Applications: NLP and Parsing

RegEx is used extensively in NLP Pipelines:

  • Tokenization: Handling punctuations, URLs, and numeric formats.
  • Rule-based Entity Extraction: In high-precision domains like legal or medical tech, RegEx state machines extract specific dates and IDs more reliably than raw neural nets.

9. Interview Strategy: The “Empty Input” Case

  1. Test the Extremes: Ask “What if s is empty and p is a*?” (The result should be true).
  2. Explain the Table: Define what dp[0][j] means, it handles “optional” patterns at the start.
  3. Trace a wildcard: Show how .* can swallow any character in the DP table.

10. Common Pitfalls

  1. Off-by-one errors: The DP table is (n+1) x (m+1). Be careful with 1-based indexing for the table vs 0-based for the string.
  2. The j-2 risk: When checking p[j-1] == "*", ensure j >= 2.
  3. Forgetting ‘match zero’: Always remember that a* can match an empty string.

11. Key Takeaways

  1. DP is about Decision Trees: Every * is a branch; DP collapses the branches.
  2. State Machines are patterns: A RegEx is a set of compressed instructions for a matching machine.
  3. Correctness > Speed: Edge cases like empty strings and wildcards are where most bugs live in string matching.

FAQ

What makes regular expression matching harder than wildcard matching?

In regex, the star modifies the preceding character rather than matching any sequence independently. So “a*” matches zero or more ‘a’ characters specifically. The DP transition must handle both skipping the char-star pair (j-2) and consuming a matching character while keeping the pattern active (i-1, j).

How does the DP table handle the star operator?

When p[j-1] is ‘*’, dp[i][j] equals dp[i][j-2] (skip star and its preceding char) OR the result of checking whether the preceding char matches s[i-1] AND dp[i-1][j] holds. The first case handles zero occurrences, the second handles one or more.

What is the base case for empty string matching a pattern with stars?

An empty string can match patterns like “abc” because each star can match zero occurrences. Initialize dp[0][j] = dp[0][j-2] whenever p[j-1] is ‘’, which propagates True through consecutive star patterns from the start.

What is the connection between regex matching and automata theory?

A regex pattern corresponds to a Non-deterministic Finite Automaton where the star creates epsilon transitions and self-loops. The DP approach effectively simulates the NFA by tracking all possible states simultaneously, achieving polynomial O(NM) time instead of exponential backtracking.


Originally published at: arunbaby.com/dsa/0058-regular-expression-matching