
Optimizing Levenshtein for Fuzzy Name Matching

Myers bit-parallel gave a 10x speedup over naive Levenshtein. Here's how a handful of cheap pruning tricks pushed it another 5x.

Tags: algorithms, performance, csharp, dotnet

The Problem

I needed to detect misspellings of names against a known list of candidates. Given a query string (the value with a potential typo) and a set of candidates, I needed to pick the candidate that was most similar to the query — when there wasn't an exact match.

For example, given the query "Jonathon" and the candidates:

Connor, Joe, Mary, Martin, Johnathon, Xavier

The right answer is "Johnathon". It's off by a single inserted h. The metric that captures this kind of similarity is Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions required to turn one string into the other. "Jonathon" → "Johnathon" is a distance of 1.

My real use case wasn't human names — it was place names, which come with their own set of problems (numbers, abbreviations, dropped tokens, and so on). That's a different blog post. Here I'll focus on Levenshtein, which is the most important metric for typos and the most computationally expensive part of the algorithm. I'll use names for simplicity.

Levenshtein was sitting on a hot path. When I profiled my application, over 90% of the CPU time was spent inside the distance function. So if I wanted the application to be fast, I had to make Levenshtein fast.

A Detour: Why Levenshtein is Expensive

To understand why making this fast is non-trivial, you need a feel for what the standard algorithm actually does.

The classical formulation builds a 2D grid. For two strings of length m and n, the grid is (m+1) × (n+1), and each cell asks the question:

"What's the minimum number of edits to turn the first i characters of A into the first j characters of B?"

Each cell looks at three of its neighbors (the cell above for "delete", the cell to the left for "insert", and the diagonal cell for "match or substitute") and takes the minimum. The answer ends up in the bottom-right corner.

That's m * n cells, each doing a constant amount of work. For two 20-character strings that's 400 cells. Still cheap. But push it:

  • 1 query against 1000 candidates → 400,000 cells
  • 1000 queries → 400 million cells
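The grid fill described above is only a few lines of code. Here's a minimal C# sketch of the classical full-table formulation (nothing clever yet, just the recurrence):

```csharp
using System;

static int Levenshtein(string a, string b)
{
    int m = a.Length, n = b.Length;
    int[,] d = new int[m + 1, n + 1];            // the (m+1) x (n+1) grid
    for (int i = 0; i <= m; i++) d[i, 0] = i;    // delete everything
    for (int j = 0; j <= n; j++) d[0, j] = j;    // insert everything
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,       // delete
                                        d[i, j - 1] + 1),      // insert
                               d[i - 1, j - 1] + cost);        // match or substitute
        }
    return d[m, n];                              // bottom-right corner
}
```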

The Standard Speedups

A few well-known algorithms attack this from different angles:

  • Wagner-Fischer is the classical O(m*n) dynamic programming formulation I just described. It's the baseline.
  • Ukkonen's algorithm observes that if you only care about distances up to some threshold k, you can skip the cells outside a diagonal band of width 2k+1 — anything outside that band already has a distance greater than k, so it doesn't matter. This drops complexity to O(k * min(m, n)).
  • Myers bit-parallel is the fastest of the three on modern hardware, and it gets there with a completely different trick.
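The banded idea is a small change to the loops. A minimal sketch (not the repo's code): clamp each row to the 2k+1 diagonal band, treat cells outside it as infinity, and report any distance greater than k as k + 1:

```csharp
using System;

static int BandedLevenshtein(string a, string b, int k)
{
    int m = a.Length, n = b.Length;
    if (Math.Abs(m - n) > k) return k + 1;       // length gap already exceeds k
    const int Inf = int.MaxValue / 2;            // an "infinity" that survives +1
    int[] prev = new int[n + 1], curr = new int[n + 1];
    for (int j = 0; j <= n; j++) prev[j] = j <= k ? j : Inf;
    for (int i = 1; i <= m; i++)
    {
        Array.Fill(curr, Inf);                   // everything outside the band
        if (i <= k) curr[0] = i;                 // column 0 is inside the band
        int lo = Math.Max(1, i - k), hi = Math.Min(n, i + k);
        for (int j = lo; j <= hi; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(prev[j] + 1,        // delete
                                        curr[j - 1] + 1),   // insert
                               prev[j - 1] + cost);         // match or substitute
        }
        (prev, curr) = (curr, prev);
    }
    return Math.Min(prev[n], k + 1);             // collapse "too far" to k + 1
}
```

Since the banded DP only restricts paths, it never underestimates the true distance, so anything it reports above k is genuinely above k.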

In the demo below, Wagner-Fischer fills its full DP table while Ukkonen, with k = 2, only fills a narrow band around the diagonal. Both arrive at the same answer; Ukkonen just does a fraction of the work.

[Interactive demo: Wagner-Fischer vs Ukkonen (k = 2). The full DP fills all 72 cells of the JONATHON × JOHNATHON grid; Ukkonen's diagonal band fills at most 36.]

Why Myers Bit-Parallel is So Fast

Here's the key observation Myers exploited: in any column of the DP table, adjacent cells differ by exactly -1, 0, or +1. That's a strong constraint. It means a whole column of m cells can be encoded with just two bit vectors:

  • VP — bits that are set where the cell differs from its neighbor by +1
  • VN — bits that are set where the cell differs by -1

Now an entire DP column fits in a single 64-bit register (for patterns up to 64 characters). Advancing the column from one candidate character to the next becomes about ten bitwise operations on those registers — the kind of work the CPU does in a single cycle.

That's where the "parallel" in the name comes from. There are no SIMD intrinsics here. The CPU is operating on 64 logical cells at once because they were packed into the same word. People sometimes call this SWAR — SIMD within a register.
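To make "about ten bitwise operations" concrete, here's a minimal C# sketch of the standard Myers/Hyyrö column update for patterns up to 64 characters. This is the textbook recurrence with no pruning; the repo's kernel is a tuned version of the same idea:

```csharp
using System;
using System.Collections.Generic;

static int MyersDistance(string pattern, string text)
{
    int m = pattern.Length;
    if (m == 0 || m > 64) throw new ArgumentException("pattern length must be 1..64");

    // Per-character bitmask of the positions where each character occurs in the pattern.
    var peq = new Dictionary<char, ulong>();
    for (int i = 0; i < m; i++)
    {
        peq.TryGetValue(pattern[i], out ulong bits);
        peq[pattern[i]] = bits | (1UL << i);
    }

    ulong vp = ~0UL;                 // vertical +1 deltas: column 0 is 0,1,2,...,m
    ulong vn = 0UL;                  // vertical -1 deltas
    ulong lastBit = 1UL << (m - 1);
    int score = m;                   // distance after consuming 0 text characters

    foreach (char c in text)         // one 64-cell column advance per character
    {
        peq.TryGetValue(c, out ulong eq);
        ulong xv = eq | vn;
        ulong xh = (((eq & vp) + vp) ^ vp) | eq;
        ulong hp = vn | ~(xh | vp);  // horizontal +1 deltas
        ulong hn = vp & xh;          // horizontal -1 deltas
        if ((hp & lastBit) != 0) score++;
        if ((hn & lastBit) != 0) score--;
        hp = (hp << 1) | 1UL;
        hn <<= 1;
        vp = hn | ~(xv | hp);
        vn = hp & xv;
    }
    return score;
}
```

The loop body is exactly the "handful of bitwise ops" the name promises: one addition, the rest ANDs, ORs, XORs, and shifts on two registers.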

[Interactive demo: Standard DP vs Myers bit-parallel. The standard DP advances one cell per step (72 cells total); Myers advances one whole column per step via the VP/VN bit vectors (9 column updates).]

The First 10x: Port to C#

I ported a known-good Myers bit-parallel implementation to C# and immediately saw a 10x speedup over the Levenshtein routine I was using before. Great, but not enough. Levenshtein was still 75% of CPU. I needed it cheaper still.

The full implementation is up at biegehydra/MyersBitParallelDotnet — I'll only show the relevant snippets here.

Where the Speed Came From

Past the initial port, five observations took me from 10x to roughly 50x. None of them are clever in isolation. The win is that they compose, and they're all bitwise-cheap relative to running the kernel itself.

1. Cache Everything That Only Depends on the Pattern

I was always comparing one query against many candidates. That asymmetry is the first thing to exploit: any work that depends only on the pattern can be done once and reused across every candidate.

In Myers bit-parallel, the pattern produces a lookup table — for each possible character, a bitmask of the positions in the pattern where that character appears. That table is the most expensive setup work in the algorithm, and it's also the part that's purely a function of the pattern.

So I split the API into two pieces. Prepare(pattern) returns a reusable handle:

public MyersPattern64 Prepare(string pattern)
{
    // ...
    ulong[] masks = ArrayPool<ulong>.Shared.Rent(256);
    Array.Clear(masks, 0, 256);

    ulong charMask = 0UL;
    for (int i = 0; i < m; i++)
    {
        byte idx = _map[pattern[i]];
        masks[idx] |= 1UL << i;
        charMask |= 1UL << (idx & 63);
    }

    ulong maskAll = m == 64 ? ulong.MaxValue : (1UL << m) - 1UL;
    ulong lastBitMask = 1UL << (m - 1);

    return new MyersPattern64(masks, m, lastBitMask, maskAll, charMask, /* ... */);
}

The handle holds the per-character bitmask table, the maskAll and lastBitMask constants, and a couple of fields I'll explain in a minute. Pass that handle into Distance(...) repeatedly:

using MyersPattern64 pat = engine.Prepare(query);
foreach (string candidate in candidates)
{
    int dist = engine.Distance(in pat, candidate, maxDist);
    // ...
}

For 1000 candidates, that's the 256-entry mask table built once instead of 1000 times. By itself this is a modest improvement. The real value is that it lets the next optimizations be free.

2. Bail Early When the Answer Can't Be Within maxDist

In a fuzzy-matching pipeline, you almost always have a maximum distance you're willing to tolerate. A typo correction system doesn't care whether "Jonathon" and "Mary" are at distance 5 or 7; it just needs to know they're too far apart and move on.

There are two natural places to plug this in.

Outside the kernel. The length difference between two strings is a hard lower bound on their edit distance — you can't turn an 8-character string into a 3-character string in fewer than 5 deletions. So:

int lenDiff = Math.Abs(m - n);
if (lenDiff > maxDist) return int.MaxValue;

Inside the kernel. This one's more subtle, and it's where Myers bit-parallel really shines.

After the algorithm processes each candidate character, it updates a running score (the edit distance for the prefix processed so far). A nice property of the algorithm is that this score changes by at most ±1 per iteration: every iteration the last bit either pushes the score up by 1, down by 1, or leaves it alone.

That means the best case from any point in the loop is "every remaining character drops the score by 1." If even that best case lands above maxDist, the final answer can only be worse, so we can bail:

int remaining = n - i - 1;
if (score - remaining > maxDist)
{
    return int.MaxValue;
}

3. Prune With a Single 64-bit Character Mask

Even with the in-kernel bound, Myers still has to start running before it can decide a candidate is hopeless. On a hot path that adds up. So I wanted a step that could reject obviously bad candidates without entering the kernel at all.

Here's the observation. Consider:

query     = "Orlando"   // distinct chars: A, D, L, N, O, R
candidate = "Miami"     // distinct chars: A, I, M
maxDist   = 2

The length difference is 1, so the length-based prune doesn't help. But look at the alphabets. Five letters in Orlando (D, L, N, O, R) don't appear in Miami at all. To turn Miami into Orlando, you'd need at least one edit per missing letter — five edits minimum, way over the threshold. We don't need the DP table to know this is hopeless.

I encoded each string's distinct characters as a single 64-bit mask, with one bit per character class. Building this mask is a linear scan over the candidate (the pattern's mask is already cached from step 1), and the prune itself is two CPU instructions:

ulong candMask = BuildCharMaskCore(candidate);
int overlap = BitOperations.PopCount(candMask & pattern.CharMask);
if (pattern.UniqueCharCount - overlap > maxDist)
{
    return int.MaxValue;
}

pattern.UniqueCharCount is popcount(queryMask) — the number of distinct characters in the query. overlap is the number of distinct characters that appear in both strings. The difference is the number of query characters that are missing from the candidate, and each one costs at least one edit.

ASCII has more than 64 characters, so this trick only works directly when your character map has 64 or fewer distinct values. In my use case it did. In the repo, the lowest 6 bits of the mapped value are used to pick the bit position, so distinct mapped values that share their low 6 bits will collide. That makes the filter conservative — it can be less selective, but it never wrongly rejects a valid match.
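The mask construction itself isn't shown above. Here's a minimal stand-in for BuildCharMaskCore, assuming a simple case-folding map in place of the repo's _map table, plus the prune wired up end to end:

```csharp
using System;
using System.Numerics;

// Hypothetical stand-in for the repo's BuildCharMaskCore: the real code runs
// characters through its _map table first; here ToUpperInvariant(c) & 63 plays
// that role, so mapped values sharing their low 6 bits collide (which only
// makes the filter less selective, never incorrect).
static ulong BuildCharMask(string s)
{
    ulong mask = 0UL;
    foreach (char c in s)
        mask |= 1UL << (char.ToUpperInvariant(c) & 63);
    return mask;
}

static bool CanReject(string query, string candidate, int maxDist)
{
    ulong qMask = BuildCharMask(query);      // cached on the pattern handle in the repo
    ulong cMask = BuildCharMask(candidate);
    int overlap = BitOperations.PopCount(cMask & qMask);
    return BitOperations.PopCount(qMask) - overlap > maxDist;
}
```

With these definitions, "Orlando" vs "Miami" at maxDist = 2 is rejected without touching the kernel, while "Jonathon" vs "Johnathon" passes through.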

This was the single biggest win after the in-kernel bound. It's cheap enough to apply unconditionally before the kernel, and on real workloads it eliminates the majority of candidates without running a single iteration of Myers.

4. Required Character Masks (Practically Free)

Sometimes the structure of the query tells you that specific characters must appear in any valid match. The classic case is upstream normalization: imagine your matching pipeline expands name suffixes — Jr becomes Junior, so a query of "James Smith Jr" is rewritten to "James Smith Junior" before fuzzy comparison. Once that rewrite happens, every legitimate candidate has to contain the letters of Junior: J, U, N, I, O, R. You don't want to accidentally allow James Smith Senior to match.

The naive way to enforce that is a substring search per candidate. That's not free, especially when you're checking multiple required tokens.

But I'd already built candMask for the previous prune. So "does the candidate contain all required characters?" reduces to one bitwise AND and one compare:

ulong requiredCharMask = engine.BuildCharMask("Junior");
// ...

if ((candMask & requiredCharMask) != requiredCharMask)
    return int.MaxValue;

If every bit set in requiredCharMask is also set in candMask, the AND equals requiredCharMask. Otherwise, the candidate is missing at least one required character and we bail. One extra instruction on top of work we were already doing.

You still need to verify that the required substring actually appears in the candidate once it passes the Levenshtein check; the character mask only proves the letters are present, not the token itself. But running the cheap prune on every candidate and the expensive substring check only on the survivors is far cheaper than running the expensive check on everything.

5. Kill the Allocations

The 256-entry per-character mask table inside Prepare is ulong[] — 2 KB. Allocating that on every call would put real pressure on the GC, especially since Prepare is called once per query in the hot path.

Instead I rent it from ArrayPool<ulong>.Shared and return it when the pattern handle is disposed:

ulong[] masks = ArrayPool<ulong>.Shared.Rent(256);
// ...

public void Dispose()
{
    if (Masks != null)
    {
        ArrayPool<ulong>.Shared.Return(Masks, clearArray: false);
    }
}

Combined with [StructLayout] tricks and aggressive use of ref and Unsafe.Add in the kernel itself, the steady-state allocation profile of the whole thing is zero bytes per comparison.

The Results

I benchmarked four implementations against my Myers bit-parallel build:

  • NaiveLevenshtein — the textbook full DP table, with and without a maxDist early exit
  • WagnerFischer — the same DP, written more carefully (single rolling row, no allocations in steady state)
  • Ukkonen — the diagonal-band optimization
  • Myers_PreparedOnce — my implementation, with the pattern prepared once and reused across all 1000 candidates

Each benchmark runs 1 query against 1000 candidates with MaxDist = 3. The bounded Myers kernel is the baseline (Ratio = 1.00); every other row is that many times slower:

Method                                  MaxDist  CandidateCount  Mean        Ratio
Myers_PreparedOnce_WithMaxDist          3        1000            5.415 us    1.00
Myers_PreparedOnce_NoMaxDist            3        1000            21.652 us   4.00
WagnerFischerReference_WithMaxDist      3        1000            42.496 us   7.85
UkkonenReference_WithMaxDist            3        1000            51.068 us   9.43
NaiveLevenshteinReference_WithMaxDist   3        1000            63.057 us   11.65
NaiveLevenshteinReference_NoMaxDist     3        1000            202.914 us  37.48

A few things jump out.

The bounded Myers kernel demolishes everything else. At maxDist = 3 it runs 1000 comparisons in 5.4 microseconds — about 5 nanoseconds per comparison. Versus naive Levenshtein at 203 us, that's a 37x speedup. At tighter thresholds (maxDist = 1 or 2) the gap widens further, because the in-kernel bound and the character-mask prune kill candidates much earlier.

The pruning pays off proportionally to how strict maxDist is. Comparing the same Myers kernel to itself, the bounded version is 4x faster than the unbounded one at maxDist = 3. That gap is entirely from the in-kernel score - remaining > maxDist check and the character-mask pre-filter — no algorithmic difference, just earlier exits.

Even the bounded traditional algorithms can't keep up. Wagner-Fischer with maxDist, the fastest of the textbook bounded variants here, comes in at 42 us — almost 8x slower than the bounded Myers.

The headline number depends on which baseline you compare against. Versus naive Levenshtein the speedup is 37x at maxDist = 3 and grows past 50x at tighter thresholds. Versus the best traditional bounded algorithm (Wagner-Fischer here), it's a flat 8x — and that's after spending exactly zero CPU instructions on SIMD intrinsics or unsafe pointer math beyond what's already in the kernel.

When This Helps

This stack of optimizations is well-suited to a specific shape of workload:

  • One query, many candidates. The whole "prepare once, reuse" structure assumes you're amortizing setup work across multiple comparisons. If you're doing one-off comparisons, just call the convenience overload and move on — you won't see most of these gains.
  • A meaningful maxDist. The biggest wins come from pruning, and pruning needs a threshold. If you genuinely need the exact distance for every pair regardless of how far they are, the in-kernel and char-mask bounds don't help.
  • A constrained alphabet. The 64-bit character mask trick assumes your character map has 64 or fewer distinct values. If you're working with full Unicode, you'd need a wider representation (or accept more collisions in the filter).
  • You actually care about microseconds. If Levenshtein is 1% of your CPU, none of this matters. If it's 90% — like it was for me — every layer of pruning is worth it.
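To show the one-query-many-candidates shape end to end, here's a self-contained sketch composing the length prune and the character-mask prune in front of a distance call. A plain DP stands in for the Myers kernel, and all the names (FindBest, CharMask, Distance) are illustrative, not the repo's API:

```csharp
#nullable enable
using System;
using System.Numerics;

// Simplified case-folding mask; the repo maps characters through its own table.
static ulong CharMask(string s)
{
    ulong mask = 0UL;
    foreach (char c in s)
        mask |= 1UL << (char.ToUpperInvariant(c) & 63);  // low 6 bits pick the bit
    return mask;
}

// Rolling-row DP standing in for the Myers kernel.
static int Distance(string a, string b)
{
    int[] prev = new int[b.Length + 1], curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
            curr[j] = Math.Min(Math.Min(prev[j], curr[j - 1]) + 1,
                               prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        (prev, curr) = (curr, prev);
    }
    return prev[b.Length];
}

static string? FindBest(string query, string[] candidates, int maxDist)
{
    ulong qMask = CharMask(query);                   // "prepared once" per query
    int qUnique = BitOperations.PopCount(qMask);
    string? best = null;
    int bestDist = maxDist + 1;
    foreach (string cand in candidates)
    {
        if (Math.Abs(query.Length - cand.Length) >= bestDist) continue;   // length prune
        int overlap = BitOperations.PopCount(CharMask(cand) & qMask);
        if (qUnique - overlap >= bestDist) continue;                      // char-mask prune
        int d = Distance(query, cand);
        if (d < bestDist) { bestDist = d; best = cand; }                  // tightens both prunes
    }
    return best;
}
```

Note that bestDist shrinks as better matches are found, so the prunes get stricter as the scan progresses. Running the opening example through it, FindBest("Jonathon", ...) picks "Johnathon" while pruning most of the other candidates before any distance is computed.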

The full source — including the prepared-pattern API, the bounded kernel, and all the pruning steps — lives at biegehydra/MyersBitParallelDotnet. PRs welcome.
