How Postgres CTIDs Gave Me a 30x Speedup on Processing 200 Million Rows
Indexes don't solve everything at scale. How I went from a 3-day process to a few hours by exploiting PostgreSQL's physical row addresses for cache-friendly sequential reads.
The Problem
Processing data at scale efficiently is hard. Algorithms and strategies that fly at a few thousand or even a few million rows crumble when tables grow to hundreds of millions or billions of rows. The bottleneck almost always shifts from CPU to I/O, and solutions that feel correct on paper fall apart when the disk can't keep up.
I recently had this painful realization when trying to process a table with 200 million records. The table was roughly 40 GB — about 200 bytes per row. My goal was straightforward: process each row, store the result, and be able to pause and resume at any time. I was iterating on the business logic that processed the rows, so I needed the full pipeline to run fast enough that I could review results and make adjustments without waiting days between runs.
Let's call this table work_item. An important detail: the work_item table was completely static — no external inserts, updates, or deletes were happening on it during processing. The work I needed to perform on each row was not computationally heavy either. This was fundamentally an I/O-bound problem.
I'm writing this in C# with Npgsql (the PostgreSQL driver for .NET), but the core concepts — random vs. sequential I/O, index structure, CTID range scans — apply regardless of language.
Environment: PostgreSQL 16.3 on Ubuntu 24.04, C# / .NET 8, Npgsql
Failed Approach 1: Partial Index + In-Place Updates
My first approach was the simplest and most naive. Add a process_result column to the work_item table, fetch rows in batches where process_result IS NULL, process them, then bulk-update the results back via a temp table join.
I also created a partial index so the "find unprocessed rows" query would be fast:
CREATE INDEX idx_work_item_unprocessed
ON work_item ((1))
WHERE process_result IS NULL;

The C# side looked like this:
const int batchSize = 150_000;
WorkItem[] rowsToProcess = await context.WorkItems
.Where(w => w.ProcessResult == null)
.Take(batchSize)
.AsNoTracking()
.ToArrayAsync();
WorkItemResult[] results = await ProcessWorkItems(rowsToProcess);
// Bulk copy results to a temp table, then UPDATE via join
await using var transaction = await conn.BeginTransactionAsync();
await using (var cmd = new NpgsqlCommand(@"
CREATE TEMP TABLE _result_batch (
work_item_id UUID NOT NULL,
process_result TEXT NOT NULL
) ON COMMIT DROP", conn, transaction))
{
await cmd.ExecuteNonQueryAsync();
}
using (var writer = conn.BeginBinaryImport(
"COPY _result_batch (work_item_id, process_result) FROM STDIN (FORMAT BINARY)"))
{
foreach (var r in results)
{
writer.StartRow();
writer.Write(r.WorkItemId, NpgsqlDbType.Uuid);
writer.Write(r.Result, NpgsqlDbType.Text);
}
writer.Complete();
}
await using (var updateCmd = new NpgsqlCommand(@"
UPDATE work_item wi
SET process_result = b.process_result
FROM _result_batch b
WHERE wi.id = b.work_item_id", conn, transaction))
{
await updateCmd.ExecuteNonQueryAsync();
}
await transaction.CommitAsync();

"We have an index, it should be fast" — right? Wrong. This approach fell apart for two reasons, and they both had the same root cause.
The fetch was slow. Each batch of 150k rows took 20–30 seconds to load. The index told Postgres which rows to grab, but retrieving them still required jumping to scattered pages all over the disk.
The update was worse. Each batch update took 1–2 minutes. Postgres had to write new tuple versions, update the partial index (which now had to remove entries as process_result became non-null), and update the table's primary key index.
At this rate, processing all 200 million rows would take 2–3 days. That might be acceptable for a one-time ETL job, but I was iterating on business logic and needed to review results quickly. I needed something drastically faster.
A Detour: How Indexes Actually Work in Postgres
To understand why this was slow, you need to understand what a Postgres index actually stores and how it translates to disk reads.
What an Index Holds
At the simplest level, a B-tree index entry contains:
(index key values, TID)

The index key values are whatever columns you indexed on, and the TID (Tuple Identifier) is a pointer to the physical location of that row on disk.
You can optionally include additional columns in the index with the INCLUDE clause:
CREATE INDEX idx_example ON my_table (col1) INCLUDE (col2, col3);

These included columns don't affect the index's search structure or sort order. They exist so that queries needing only those columns can be served by an index-only scan — Postgres can return results directly from the index without reading any heap pages. This is a significant optimization, but it only helps when all the columns your query needs are in the index.
So more precisely, an index entry is: (key columns, included columns, TID).
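You can verify that an index-only scan is actually happening with EXPLAIN. A minimal sketch, reusing the hypothetical my_table/idx_example from above and assuming col1 is an integer:

```sql
-- With col2 and col3 in INCLUDE, this query needs no heap pages at all
-- (assuming the table has been vacuumed recently, so the visibility map
-- marks most pages as all-visible).
EXPLAIN SELECT col2, col3 FROM my_table WHERE col1 = 42;
-- The plan should show an "Index Only Scan using idx_example" node;
-- in EXPLAIN ANALYZE output, "Heap Fetches: 0" confirms no heap access.
```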
What is a TID?
TID stands for Tuple Identifier. It tells Postgres where a row is physically located on disk. A TID is composed of two parts:
(block_number, tuple_offset)

The block number is the page; Postgres stores data in 8 KB pages by default. The tuple offset tells Postgres where within that page the row lives.
A TID is 6 bytes: 4 bytes for the block number (supporting up to 32 TB per table) and 2 bytes for the offset.
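To get a feel for the numbers involved, you can ask the catalog how many pages a table occupies. A sketch, assuming the work_item table from this post; pg_class.relpages is an estimate refreshed by VACUUM and ANALYZE, while pg_relation_size is exact:

```sql
-- A ~40 GB table works out to roughly 5 million 8 KB pages.
SELECT relpages,
       pg_size_pretty(pg_relation_size('work_item')) AS table_size
FROM pg_class
WHERE relname = 'work_item';
```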
One critical property: TIDs are not stable. As you update rows, delete rows, or run VACUUM FULL, TIDs can change. This is why Postgres documentation warns against using CTIDs as row identifiers. Remember this — it matters later.
Why the Fetch/Update Were Slow: Random I/O
Even though the partial index quickly identified which rows had process_result IS NULL, it still had to perform a heap fetch for each matching row. And here's the problem: those rows were scattered across the entire table.
Think about what happens when you process the first batch of 150k rows and update them. The next batch pulls from whatever rows still have NULL. These rows aren't contiguous — they're spread across potentially millions of 8 KB pages. Every fetch becomes a random page read, and random I/O on spinning disks is roughly 100x slower than sequential I/O. Even on SSDs, the difference is significant — random reads involve more system calls, more page cache misses, and far less opportunity for the OS to prefetch.
The update side had the same problem. Postgres's MVCC model means an UPDATE doesn't modify a row in place — it writes a new tuple version, marks the old one as dead, and updates every index pointing to that row. For 150k rows scattered across the heap, that's 150k random writes, plus index maintenance.
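You can see this cost directly rather than inferring it. A sketch against the work_item table from this post: the Buffers line in the plan output reports how many 8 KB pages the query touched, and a large shared "read" count on a modest batch fetch is the signature of random heap I/O.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, data
FROM work_item
WHERE process_result IS NULL
LIMIT 150000;
-- Look for the line "Buffers: shared hit=... read=...". Tens of thousands
-- of pages touched for 150k scattered rows means nearly one page per row.
```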
Index Scan vs CTID Scan Visualization
Picture both access patterns side by side on a grid of heap pages: the index scan's random reads jump all over the grid, while the CTID range scan's sequential reads sweep through contiguous pages and finish far sooner.
For a real batch of 150k rows, the index scan would jump across tens of thousands of pages scattered throughout the 40 GB table, while the CTID range scan would read just a few thousand contiguous pages (at roughly 200 bytes per row, about 40 rows fit on each 8 KB page). The difference in pages touched by the two strategies is significant.
Failed Approach 2: Memory-Mapped IDs + Separate Results Table
At this point I hadn't yet fully grasped that random I/O was the core problem. I thought the fetch was slow because of the index, and the update was slow because of tuple rewrites and index maintenance. So I made two changes:
- Load all work item IDs into a local memory-mapped file and track which ones had been processed, eliminating the need for the partial index entirely.
- Store results in a separate table — insert-only, no column on work_item — so I could avoid UPDATE overhead entirely.
const int batchSize = 150_000;
Guid[] allIds = LoadIdsFromBinaryFile();
HashSet<Guid> processedIds = LoadProcessedIdsFromBinaryFile();
foreach (Guid[] batch in allIds
.Where(id => !processedIds.Contains(id))
.Chunk(batchSize))
{
WorkItem[] rowsToProcess = await context.WorkItems
.Where(w => batch.Contains(w.Id))
.AsNoTracking()
.ToArrayAsync();
WorkItemResult[] results = await ProcessWorkItems(rowsToProcess);
// Bulk COPY into permanent results table (insert-only)
using var writer = conn.BeginBinaryImport(
"COPY work_item_result (work_item_id, result) FROM STDIN (FORMAT BINARY)");
foreach (var r in results)
{
writer.StartRow();
writer.Write(r.WorkItemId, NpgsqlDbType.Uuid);
writer.Write(r.Result, NpgsqlDbType.Text);
}
writer.Complete();
foreach (var r in results)
processedIds.Add(r.WorkItemId);
PersistProcessedIds(processedIds);
}

The good news: writes were much faster. Inserting into a results table took a couple seconds instead of a couple minutes. No tuple rewrites, no index churn on work_item, just appending rows to a fresh table. The binary COPY protocol made this very efficient.
The bad news: the fetch was even slower. Loading a batch by a WHERE id IN (...) clause with 150k UUIDs took over a minute. This was worse than the partial index approach.
Why? The IDs I was using were generated with XxHash128 — a non-cryptographic hash. These IDs have zero locality. When Postgres resolves WHERE id IN (guid1, guid2, ..., guid150000), it looks each one up in the primary key index and performs a random heap fetch for each. With 150k random UUIDs, those fetches are scattered across the entire 40 GB table. Every single row is a cache miss.
Hindsight: If I had used UUID v7 (which embeds a timestamp and sorts chronologically) when creating the work items, the IDs would have had natural locality and this approach might have been fine. But I was working with what I had.
The Breakthrough: CTID
So writes were solved. But I still needed to figure out how to read rows efficiently. The fundamental problem was clear now: random I/O is the bottleneck, and no indexing strategy will help when the rows you need are scattered across millions of pages.
I needed a way to guarantee that each batch reads rows from the same or adjacent pages. I needed physical locality.
That's when I found CTID in the Postgres documentation:
"ctid — The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid should not be used as a long-term row identifier."
Did you know that you can SELECT ctid from any table? I didn't. This was the key to everything.
SELECT ctid, id, data
FROM work_item
WHERE ctid >= '(0,1)'::tid AND ctid <= '(1000,0)'::tid
ORDER BY ctid;

CTID gives you direct access to the physical layout. A range query on CTID translates directly to a sequential scan of a contiguous run of pages on disk. No index lookups, no random I/O — just "read pages 0 through 1000 in order." This is exactly how the OS prefetcher and disk controller want to read data.
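Since PostgreSQL 14, the planner has a dedicated plan node for exactly this access pattern, which you can confirm with EXPLAIN (a sketch; cost numbers will vary with your table):

```sql
EXPLAIN
SELECT id, data
FROM work_item
WHERE ctid >= '(0,1)'::tid AND ctid <= '(1000,0)'::tid;
-- The plan should show a "Tid Range Scan on work_item" node, meaning
-- Postgres reads just that contiguous span of heap pages, in order.
```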
And remember the critical caveat about CTIDs not being stable? It doesn't matter here. My work_item table was static — no updates, no deletes, no VACUUM FULL. The CTIDs would never change. And since I was storing results in a separate table (thanks to approach 2), I wasn't modifying work_item at all.
The Final Solution
The approach: load all CTIDs up front, sort them, then process sequential ranges. Each batch reads a contiguous slice of the physical table.
First, a struct to represent CTIDs in C#:
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public readonly record struct Ctid(uint Page, ushort Offset) : IComparable<Ctid>
{
public int CompareTo(Ctid other)
{
int cmp = Page.CompareTo(other.Page);
return cmp != 0 ? cmp : Offset.CompareTo(other.Offset);
}
public override string ToString() => $"({Page},{Offset})";
}

Loading all CTIDs from the database and persisting them to a binary file:
const int CtidBinarySize = 6; // 4 bytes (uint Page) + 2 bytes (ushort Offset)
static async Task<Ctid[]> FetchAllCtidsFromDb(NpgsqlConnection conn)
{
await using var cmd = new NpgsqlCommand(
"SELECT ctid FROM work_item ORDER BY ctid", conn);
var ctids = new List<Ctid>();
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
var tid = reader.GetFieldValue<NpgsqlTid>(0);
ctids.Add(new Ctid(tid.BlockNumber, tid.OffsetNumber));
}
return ctids.ToArray();
}
static void SaveAllCtids(Ctid[] sortedCtids)
{
using var stream = new FileStream(AllCtidsFilePath, FileMode.Create,
FileAccess.Write, FileShare.None, bufferSize: 4 * 1024 * 1024);
stream.Write(MemoryMarshal.AsBytes(sortedCtids.AsSpan()));
}
static Ctid[] LoadAllCtids()
{
if (!File.Exists(AllCtidsFilePath))
return [];
int count = (int)(new FileInfo(AllCtidsFilePath).Length / CtidBinarySize);
var ctids = new Ctid[count];
using var stream = new FileStream(AllCtidsFilePath, FileMode.Open,
FileAccess.Read, FileShare.Read, bufferSize: 4 * 1024 * 1024,
FileOptions.SequentialScan);
stream.ReadExactly(MemoryMarshal.AsBytes(ctids.AsSpan()));
return ctids;
}

Calculating the next batch range using binary search:
static (Ctid Start, Ctid End)? GetNextCtidRange(Ctid currentCtid, Ctid[] sortedCtids, int batchSize)
{
int startIndex = Array.BinarySearch(sortedCtids, currentCtid);
if (startIndex >= 0)
startIndex++;
else
startIndex = ~startIndex;
if (startIndex >= sortedCtids.Length)
return null;
int count = Math.Min(batchSize, sortedCtids.Length - startIndex);
return (sortedCtids[startIndex], sortedCtids[startIndex + count - 1]);
}

The range query itself:
static async Task<WorkItem[]> FetchBatchByCtidRange(
NpgsqlConnection conn, Ctid startCtid, Ctid endCtid)
{
await using var cmd = new NpgsqlCommand($@"
SELECT id, data
FROM work_item
WHERE ctid >= '({startCtid.Page},{startCtid.Offset})'::tid
AND ctid <= '({endCtid.Page},{endCtid.Offset})'::tid
ORDER BY ctid", conn);
var results = new List<WorkItem>();
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
results.Add(new WorkItem
{
Id = reader.GetGuid(0),
Data = reader.GetString(1),
});
}
return results.ToArray();
}And tying it all together — the main processing loop:
const int batchSize = 150_000;
Ctid[] allCtids = LoadAllCtids();
if (allCtids.Length == 0)
{
allCtids = await FetchAllCtidsFromDb(conn);
SaveAllCtids(allCtids);
}
Ctid currentCtid = LoadCurrentCtid(); // resumes from last checkpoint
while (GetNextCtidRange(currentCtid, allCtids, batchSize) is (Ctid startCtid, Ctid endCtid))
{
WorkItem[] rowsToProcess = await FetchBatchByCtidRange(conn, startCtid, endCtid);
WorkItemResult[] results = await ProcessWorkItems(rowsToProcess);
// Bulk COPY results into permanent results table
using var writer = conn.BeginBinaryImport(
"COPY work_item_result (work_item_id, result) FROM STDIN (FORMAT BINARY)");
foreach (var r in results)
{
writer.StartRow();
writer.Write(r.WorkItemId, NpgsqlDbType.Uuid);
writer.Write(r.Result, NpgsqlDbType.Text);
}
writer.Complete();
currentCtid = endCtid;
SaveCurrentCtid(currentCtid);
}

And just like that, all my problems were solved.
The Results
The fetch query dropped from 20–30 seconds (approach 1) and 60+ seconds (approach 2) down to 1–3 seconds. The write was already fast from approach 2 at a couple seconds per batch. The full pipeline went from a projected 2–3 day runtime to a few hours.
That's roughly a 30x speedup on the read side, purely from better I/O access patterns.
Memory Savings
Beyond raw speed, CTIDs also saved significant memory both on disk and in RAM:
| Metric | GUID-based (Approach 2) | CTID-based (Final) |
|---|---|---|
| ID size | 16 bytes | 6 bytes |
| 200M IDs in memory | ~3.2 GB | ~1.2 GB |
| Binary file on disk | ~3.2 GB | ~1.2 GB |
| Progress tracking | HashSet of processed GUIDs | Single 6-byte CTID checkpoint |
With GUIDs, I needed a HashSet<Guid> holding millions of entries to track which IDs had been processed. With CTIDs, I just save a single 6-byte value — the current position — to a file. Resume means loading that one value and continuing from where I left off.
Bonus: Trivial Parallelization
Another benefit of CTID-based processing is that it scales trivially to multiple workers. Since the CTIDs are pre-sorted and you're just carving out non-overlapping ranges, you can partition the work with zero coordination:
// Split the sorted CTIDs into N chunks for N workers
int chunkSize = allCtids.Length / workerCount;
var tasks = Enumerable.Range(0, workerCount).Select(i =>
{
int start = i * chunkSize;
int end = (i == workerCount - 1) ? allCtids.Length : start + chunkSize;
Ctid[] workerCtids = allCtids[start..end];
return Task.Run(() => ProcessCtidRange(workerCtids));
});
await Task.WhenAll(tasks);

Each worker gets its own contiguous slice of the table. No locks, no contention, no overlapping reads (just give each worker its own NpgsqlConnection, since a connection can't be shared across threads). Each worker writes to the same results table (or a partitioned one if you prefer), and Postgres handles concurrent inserts fine. The work is embarrassingly parallel.
When to Use This Technique
This approach works well under specific conditions:
- The table is static (or at least stable during processing). If rows are actively being deleted, updated, or vacuumed, CTIDs will shift and your ranges will become unreliable.
- The table is large enough that random I/O is the bottleneck. If your table fits comfortably in shared_buffers or the OS page cache, you won't see much difference because all reads are effectively in memory anyway.
- You need to process all or most rows. If you're only touching a small fraction of the table, an index scan might be faster since it avoids reading irrelevant pages. CTID range scans read everything in the range, including dead tuples that haven't been vacuumed.
- You can tolerate a one-time upfront cost of loading all CTIDs. For 200M rows, the initial SELECT ctid FROM work_item ORDER BY ctid takes a few minutes — but you only do it once and persist the results.
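Before trusting CTIDs to stay put, it can be worth a quick check that the table really is quiet. A sketch using the cumulative statistics views (plain VACUUM never relocates live tuples, but UPDATEs, DELETEs, and VACUUM FULL do change CTIDs):

```sql
-- Confirm the table sees no writes and hasn't accumulated dead tuples.
SELECT n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'work_item';
```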
Key Takeaways
- Indexes solve search, not I/O. An index tells Postgres which rows match. It says nothing about where those rows physically are. When matching rows are scattered, every fetch is a random page read, and that's where the real cost is.
- Physical locality matters more than you think. The performance difference between sequential and random I/O is not 2x or 5x — it can easily be 10x to 100x, depending on storage hardware, page cache hit rates, and OS prefetching behavior.
- CTIDs are an underused tool. They're well-documented but rarely discussed in application-level code. For batch processing static tables, they give you something no index can: direct control over your I/O access pattern.
- Separate your read and write paths. Moving results into a separate insert-only table eliminated UPDATE overhead entirely. Combined with binary COPY, writes became negligible compared to reads.
- Always measure where your time is going. I spent time optimizing the wrong thing (index strategy, update patterns) before realizing the real bottleneck was random disk reads. A quick EXPLAIN (ANALYZE, BUFFERS) on the fetch query would have shown the heap fetch cost immediately.
- Design your IDs with access patterns in mind. If I had used UUID v7 (which sorts chronologically) instead of a random hash for the work item IDs, the original approach might have been good enough. The physical row order would have roughly matched the ID order, giving me natural locality. Something to keep in mind when designing your next schema.