11 min read

Optimizing Levenshstein for Fuzzy Name Matching

Myers bit-parallel gave a 10x speedup over naive Levenshtein. Here's how a handful of cheap pruning tricks pushed it another 5x.

algorithmsperformancecsharpdotnet

The Problem

I needed to detect misspellings of names against a known list of candidates. Given a query string (the value with a potential typo) and a set of candidates, I needed to pick the candidate that was most similar to the query — when there wasn't an exact match.

For example, given the query "Jonathon" and the candidates:

Connor, Joe, Mary, Martin, Johnathon, Xavier

The right answer is "Johnathon". It's off by a single inserted h. The metric that captures this kind of similarity is Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions required to turn one string into the other. JonathonJohnathon is a distance of 1.

My real use case wasn't human names — it was place names, which come with their own set of problems (numbers, abbreviations, dropped tokens, and so on). That's a different blog post. Here I'll focus on Levenshtein, which is the most important metric for typos and the most computationally expensive part of the algorithm. I'll use names for simplicity.

Levenshtein was sitting on a hot path. When I profiled my application, over 90% of the CPU time was spent inside the distance function. So if I wanted the application to be fast, I had to make Levenshtein fast.

A Detour: Why Levenshtein is Expensive

To understand why making this fast is non-trivial, you need a feel for what the standard algorithm actually does.

The classical formulation builds a 2D grid. For two strings of length m and n, the grid is (m+1) × (n+1), and each cell asks the question:

"What's the minimum number of edits to turn the first i characters of A into the first j characters of B?"

Each cell looks at three of its neighbors (the cell above for "delete", the cell to the left for "insert", and the diagonal cell for "match or substitute") and takes the minimum. The answer ends up in the bottom-right corner.

That's m * n cells, each doing a constant amount of work. For two 20-character strings that's 400 cells. Still cheap. But push it:

  • 1 query against 1000 candidates → 400,000 cells
  • 1000 queries → 400 million cells

The Standard Speedups

A few well-known algorithms attack this from different angles:

  • Wagner-Fischer is the classical O(m*n) dynamic programming formulation I just described. It's the baseline.
  • Ukkonen's algorithm observes that if you only care about distances up to some threshold k, you can skip the cells outside a diagonal band of width 2k+1 — anything outside that band already has a distance greater than k, so it doesn't matter. This drops complexity to O(k * min(m, n)).
  • Myers bit-parallel is the fastest of the three on modern hardware, and it gets there with a completely different trick.

Wagner-Fischer vs Ukkonen (k = 2)

Full DP — every cell

εJOHNATHONεJONATHON012345678912345678
Cells filled:0 / 72

Ukkonen — diagonal band only

εJOHNATHONεJONATHON01212
Cells filled:0 / 36
Idle / unvisited
Computing
Filled

Why Myers Bit-Parallel is So Fast

Here's the key observation Myers exploited: in any column of the DP table, adjacent cells differ by exactly -1, 0, or +1. That's a strong constraint. It means a whole column of m cells can be encoded with just two bit vectors:

  • VP — bits that are set where the cell differs from its neighbor by +1
  • VN — bits that are set where the cell differs by -1

Now an entire DP column fits in a single 64-bit register (for patterns up to 64 characters). Advancing the column from one candidate character to the next becomes about ten bitwise operations on those registers.

That's where the "parallel" in the name comes from. There are no SIMD intrinsics here. The CPU is operating on 64 logical cells at once because they were packed into one ulong. People sometimes call this SWAR — SIMD within a register.

Now, to be frank with you. I don't understand the math behind every operation here. I understand the concept of SWAR and why it's faster, but the real algorithm is very complicated to understand. There are many lectures and papers on Myers Bit-Parallel, they are all very long and dense. The main kernel is pretty short so I'll just show it here. Don't beat yourself up if you don't understand it (I don't), just appreciate that there are minds bright enough to come with an idea like this (Gene Myers, 1999):

private int BoundedKernel(in MyersPattern64 pattern, string candidate, int maxDist)
{
    int m = pattern.Length;
    int n = candidate.Length;

    ulong maskAll = pattern.MaskAll;
    ulong lastBitMask = pattern.LastBitMask;
    ulong VP = maskAll;
    ulong VN = 0UL;
    int score = m;

    ref ulong patternRef = ref ArrayHelpers.GetArrayDataReference(pattern.Masks!);
    ref byte mapRef = ref ArrayHelpers.GetArrayDataReference(_map);
    ref char candRef = ref MemoryMarshal.GetReference(candidate.AsSpan());

    for (int i = 0; i < n; i++)
    {
        char c = Unsafe.Add(ref candRef, i);
        byte idx = Unsafe.Add(ref mapRef, (byte)c);
        ulong PM = Unsafe.Add(ref patternRef, idx);

        ulong X = PM | VN;
        ulong D0 = (((X & VP) + VP) ^ VP) | X;
        ulong HN = VP & D0;
        ulong HP = VN | (~(VP | D0) & maskAll);

        X = (HP << 1) | 1UL;
        VN = X & D0;
        VP = (HN << 1) | (~(X | D0) & maskAll);

        if ((HP & lastBitMask) != 0)
        {
            score++;
        }
        else if ((HN & lastBitMask) != 0)
        {
            score--;
        }

        // Even if every remaining char decreases the score by 1, can
        // we still finish at <= maxDist? If not, the answer is "too far".
        int remaining = n - i - 1;
        if (score - remaining > maxDist)
        {
            return int.MaxValue;
        }
    }

    return score <= maxDist ? score : int.MaxValue;
}

Standard DP vs Myers Bit-Parallel

Standard — one cell per step

εJOHNATHONεJONATHON012345678912345678
Cells filled:0 / 72

Myers — one column per step

εJOHNATHONεJONATHON012345678912345678
Columns:0 / 9
VP
VN
Idle / 0-bit
Just computed / 1-bit
Filled

The First 10x: Port to C#

I ported a known-good Myers bit-parallel implementation to C# and immediately saw a 6x speedup over the Levenshtein routine I was using before. Great, but not enough. Levenshtein was still a large portion of CPU time. I needed it cheaper still.

Where the Speed Came From

Past the initial port, four observations took me from 10x to roughly 50x. None of them are clever in isolation. The win is that they compose, and they're all bitwise-cheap relative to running the kernel itself.

1. Cache Everything That Only Depends on the Pattern

I was always comparing one query against many candidates. That asymmetry is the first thing to exploit: any work that depends only on the query string can be done once and reused across every candidate.

In Myers bit-parallel, the query string produces a lookup table — for each possible character, a bitmask of the positions in the query string where that character appears. That table is the most expensive setup work in the algorithm, and it's also the part that's purely a function of the query.

So I split the API into two pieces. Prepare(pattern) returns a reusable handle:

public MyersPattern64 Prepare(string pattern)
{
    // ...
    ulong[] masks = ArrayPool<ulong>.Shared.Rent(256);
    Array.Clear(masks, 0, 256);

    ulong charMask = 0UL;
    for (int i = 0; i < m; i++)
    {
        byte idx = _map[pattern[i]];
        masks[idx] |= 1UL << i;
        charMask |= 1UL << (idx & 63);
    }

    ulong maskAll = m == 64 ? ulong.MaxValue : (1UL << m) - 1UL;
    ulong lastBitMask = 1UL << (m - 1);

    return new MyersPattern64(masks, m, lastBitMask, maskAll, charMask, /* ... */);
}

The handle holds everything Myers bit-parallel needs to compare a query string to a candidate string. Pass that handle into Distance(...) repeatedly:

using MyersPattern64 pat = engine.Prepare(query);
foreach (string candidate in candidates)
{
    int dist = engine.Distance(in pat, candidate, maxDist);
    // ...
}

2. Bail Early When the Answer Can't Be Below maxDist

In a fuzzy-matching pipeline, you almost always have a maximum distance you're willing to tolerate. A typo correction system doesn't care whether "Jonathon" and "Mary" are at distance 5 or 7; it just needs to know they're too far apart and move on.

There are two natural places to plug this in.

Outside the kernel. The length difference between two strings is a hard lower bound on their edit distance — you can't turn an 8-character string into a 3-character string in fewer than 5 deletions. So:

int lenDiff = Math.Abs(m - n);
if (lenDiff > maxDist) return int.MaxValue;

Inside the kernel. This one's more subtle, and it's where Myers bit-parallel really shines.

After the algorithm processes each candidate character, it updates a running score (the edit distance for the prefix processed so far). A nice property of the algorithm is that this score changes by at most ±1 per iteration: every iteration the last bit either pushes the score up by 1, down by 1, or leaves it alone.

That means the best case from any point in the loop is "every remaining character drops the score by 1." If even that best case lands above maxDist, the final answer can only be worse, so we can bail:

int remaining = n - i - 1;
if (score - remaining > maxDist)
{
    return int.MaxValue;
}

3. Prune With a Single 64-bit Character Mask

As fast as Myers Bit-Parallel is, it is still expensive and any optimization that can cheaply prune very bad candidates without entering the kernel is a big win.

Here's the observation. Consider:

query     = "Orlando"   // distinct chars: A, D, L, N, O, R
candidate = "Miami"     // distinct chars: A, I, M
maxDist   = 2

The length difference is 1, so the length-based prune doesn't help. But look at the alphabets. Five letters in Orlando (D, L, N, O, R) don't appear in Miami at all. To turn Miami into Orlando, you'd need at least one edit per missing letter — five edits minimum, way over the threshold. We don't need the DP table to know this is hopeless.

I encoded each string's distinct characters as a single 64-bit mask, with one bit per character class. Building this mask is a linear scan over the candidate (the pattern's mask is already cached from step 1), and the prune itself is two CPU instructions:

// ulong with 2 bits set to 1 which indicate the presence of a and b
ulong candMask = BuildCharMaskCore("aaaabbb");
int overlap = BitOperations.PopCount(candMask & pattern.CharMask);
if (pattern.UniqueCharCount - overlap > maxDist)
{
    return int.MaxValue;
}

BitOperations.PopCount is the number of bits set to 1. It's a single CPU instruction and very fast

pattern.UniqueCharCount is popcount(queryMask) — the number of distinct characters in the query. overlap is the number of distinct characters that appear in both strings. The difference is the number of query characters that are missing from the candidate, and each one costs at least one edit.

ASCII has more than 64 characters, so this trick only works directly when your character map has 64 or fewer distinct values. In my use case it did. In the repo, the lowest 6 bits of the mapped value are used to pick the bit position, so distinct mapped values that share their low 6 bits will collide. That makes the filter conservative — it can be less selective, but it never wrongly rejects a valid match.

4. Required Character Masks (Practically Free)

Sometimes the structure of the query tells you that specific characters must appear in any valid match. The classic case is upstream normalization: imagine your matching pipeline expands name suffixes — Jr becomes Junior, so a query of "James Smith Jr" is rewritten to "James Smith Junior" before fuzzy comparison. Once that rewrite happens, every legitimate candidate has to contain the letters of Junior: J, U, N, I, O, R. You don't want to accidentally allow James Smith Senior to match.

The naive way to enforce that is a substring search per candidate. That's not free, especially when you're checking multiple required tokens.

But I'd already built candMask for the previous prune. So "does the candidate contain all required characters?" reduces to one bitwise AND and one compare:

ulong requiredCharMask = engine.BuildCharMask("Junior");
// ...

if ((candMask & requiredCharMask) != requiredCharMask)
    return int.MaxValue;

You still need to check the actual substring is in the match after it passes levenshtein. It's cheaper though to use cheap pruning on all candidates and expensive pruning on ones that pass than to use expensive pruning on all candidates

The Results

I benchmarked four implementations against my Myers bit-parallel build:

  • NaiveLevenshtein — the textbook full DP table, with and without a maxDist early exit
  • WagnerFischer — the same DP, written more carefully (single rolling row, no allocations in steady state)
  • Ukkonen — the diagonal-band optimization
  • Myers_PreparedOnce — my implementation, with the pattern prepared once and reused across all 1000 candidates

Each benchmark runs 1 query against 1000 candidates with MaxDist = 3. The bounded Myers kernel is the baseline (Ratio = 1.00); every other row is that many times slower:

MethodMeanRatio
Myers_PreparedOnce_WithMaxDist5.415 us1.00
Myers_PreparedOnce_NoMaxDist21.652 us4.00
WagnerFischerReference_WithMaxDist42.496 us7.85
UkkonenReference_WithMaxDist51.068 us9.43
NaiveLevenshteinReference_WithMaxDist63.057 us11.65
NaiveLevenshteinReference_NoMaxDist202.914 us37.48

When This Helps

This stack of optimizations is well-suited to a specific shape of workload:

  • One query, many candidates. The whole "prepare once, reuse" structure assumes you're amortizing setup work across multiple comparisons. If you're doing one-off comparisons, just call the convenience overload and move on — you won't see most of these gains.
  • A meaningful maxDist. The biggest wins come from pruning, and pruning needs a threshold. If you genuinely need the exact distance for every pair regardless of how far they are, the in-kernel and char-mask bounds don't help.
  • A constrained alphabet (256 characters or less). The 64-bit character mask trick assumes your character map has 64 or fewer distinct values. The algorithm itself only works with up to 256 distinct values, unicode is not supported.
  • You actually care about microseconds. If Levenshtein is 1% of your CPU, none of this matters. If it's 90% — like it was for me — every layer of pruning is worth it.

The full source — including the prepared-pattern API, the bounded kernel, and all the pruning steps — lives at biegehydra/MyersBitParallelDotnet. PRs welcome.

Was this useful?