How Postgres CTIDs Gave Me a 30x Speedup on Processing 200 Million Rows
Indexes don't solve everything at scale. A story of how I cut 3 days' worth of execution time down to a few hours by exploiting PostgreSQL's physical row addresses for cache-friendly sequential reads.
The Problem
Processing data at scale efficiently is hard. Algorithms and strategies that run very quickly at a few thousand or even a few million rows quickly crumble when tables grow to hundreds of millions or billions of rows. The bottleneck almost always shifts from CPU to I/O, and solutions that feel correct on paper fall apart when the disk can't keep up.
I recently had this painful realization when trying to process a table with 200 million records. The table was roughly 40 GB — about 200 bytes per row. The requirements were deceptively simple:
- Process every row in the table and store the result
- Process in batches to avoid memory issues and see results faster
- Pause and resume at any point without losing progress
- Iterate rapidly — the full pipeline had to be fast enough to review results and update the business logic without waiting days between runs
Let's call this table work_item. An important detail: the work_item table was completely static — no external inserts, updates, or deletes were happening on it during processing. The work I needed to perform on each row was not computationally heavy either.
I'm writing this in C# with Npgsql (the PostgreSQL driver for .NET), but the core concepts — random vs. sequential I/O, index structure, CTID range scans — apply regardless of language.
Environment: PostgreSQL 16.3 on Ubuntu 24.04, C# / .NET 8, Npgsql
Failed Approach 1: Partial Index + In-Place Updates
My first approach was the simplest and most naive:
- Add a process_result column to the work_item table
- Fetch rows in batches where process_result IS NULL
- Process the batch
- Bulk-update the results back via a temp-table join
I also created a partial index so the "find unprocessed rows" query would be fast:
CREATE INDEX idx_work_item_unprocessed
ON work_item ((1))
WHERE process_result IS NULL;

The C# side to process a batch looked like this:
const int batchSize = 150_000;
// Translates to:
// SELECT id, data FROM work_item WHERE process_result IS NULL LIMIT 150000;
WorkItem[] rowsToProcess = await context.WorkItems
.Where(w => w.ProcessResult == null)
.Take(batchSize)
.AsNoTracking()
.ToArrayAsync();
WorkItemResult[] results = await ProcessWorkItems(rowsToProcess);
// Bulk copy results to a temp table, then UPDATE via join
await using var transaction = await conn.BeginTransactionAsync();
await using (var cmd = new NpgsqlCommand(@"
CREATE TEMP TABLE _result_batch (
work_item_id UUID NOT NULL,
process_result TEXT NOT NULL
) ON COMMIT DROP", conn, transaction))
{
await cmd.ExecuteNonQueryAsync();
}
using (var writer = conn.BeginBinaryImport(
"COPY _result_batch (work_item_id, process_result) FROM STDIN (FORMAT BINARY)"))
{
foreach (WorkItemResult r in results)
{
writer.StartRow();
writer.Write(r.WorkItemId, NpgsqlDbType.Uuid);
writer.Write(r.Result, NpgsqlDbType.Text);
}
writer.Complete();
}
await using (var updateCmd = new NpgsqlCommand(@"
UPDATE work_item wi
SET process_result = b.process_result
FROM _result_batch b
WHERE wi.id = b.work_item_id", conn, transaction))
{
await updateCmd.ExecuteNonQueryAsync();
}
await transaction.CommitAsync();

"We have an index and the update uses a primary key join, so it should be fast" — right? Wrong. This approach fell apart for two reasons, and both had the same root cause.
The fetch was slow. Each batch of 150k rows took 20–30 seconds to load. More on why later.
The update was worse. Each batch update took 1–2 minutes. Postgres had to write new tuple versions, update the partial index (which now had to remove entries as process_result became non-null), and update the table's primary key index.
At this rate, processing all 200 million rows would take 2–3 days. That might be acceptable for a one-time ETL job, but I was iterating on business logic and needed to review results quickly. This was not acceptable.
A Detour: How Indexes Actually Work in Postgres
To understand why this was slow, you need to understand what a Postgres index actually stores and how it translates to disk reads.
What an Index Holds
At the simplest level, a B-tree index entry contains:
(index key values, TID)

The index key values are whatever columns you indexed on, and the TID (Tuple Identifier) is a pointer to the physical location of that row on disk.
You can optionally include additional columns in the index with the INCLUDE clause:
CREATE INDEX idx_example ON table (col1) INCLUDE (col2, col3);
-- Useful for a query like this:
SELECT col2, col3 FROM table WHERE col1 = 'value';

These included columns don't affect the index's search structure or sort order. They exist so that queries selecting those columns can be served by an index-only scan, meaning Postgres can return results directly from the index without reading any heap pages. That's a significant optimization, but it's irrelevant here since the data is too large to put in the index.
So more precisely, an index entry is: (key columns, included columns, TID).
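You can actually peek at real index entries if you're curious. A quick sketch using the pageinspect extension (the index name reuses the hypothetical idx_example from above; exact output varies by version and data):

```sql
-- pageinspect ships with Postgres as a contrib extension
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Inspect the entries on one leaf page of a B-tree index:
SELECT itemoffset, ctid, data
FROM bt_page_items(get_raw_page('idx_example', 1))
LIMIT 3;
-- Each row pairs the raw key bytes (data) with the heap TID (ctid)
-- that the entry points to.
```

This makes the (key columns, included columns, TID) structure tangible: the index is literally a sorted list of keys, each carrying a physical heap address.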
What is a TID?
TID stands for Tuple Identifier. It tells Postgres where a row is physically located on disk. A TID is composed of two parts:
(block_number, tuple_offset)

The block number identifies the page; Postgres stores data in 8 KB pages by default. The tuple offset tells Postgres where within that page the row lives.
A TID is 6 bytes: 4 bytes for the block number (supporting up to 32 TB per table) and 2 bytes for the offset.
One critical property: TIDs are not stable. As you update rows, delete rows, or run VACUUM FULL, TIDs can change. This is why the Postgres documentation warns against using CTIDs as row identifiers. Remember this; it matters later.
Why the Fetch/Update Were Slow: Random I/O
Even though the partial index quickly identified which 150k rows had process_result IS NULL, it still had to perform a heap fetch for each matching row. And here's the problem: those rows were not contiguous, they were scattered across the tens of thousands of 8 KB pages on the disk.
An index scan works like this: traverse the B-tree, hit a match, fetch from the heap, repeat. Each match resolves to a TID, which triggers a random jump to a different heap page. With 150k matching rows, that's 150k random page reads.
Random I/O is very slow: 10x to 30x slower than sequential reads on SSDs and 100x to 300x slower on spinning disks. Random reads involve more system calls, more page cache misses, and far less opportunity for the OS to prefetch.
The update side had the same problem. Postgres's MVCC model means an UPDATE doesn't modify a row in place — it writes a new tuple version, marks the old one as dead, and updates every index pointing to that row. For 150k rows scattered across the heap, that's 150k random writes, plus index maintenance.
Failed Approach 2: Memory-Mapped IDs + Separate Results Table
At this point I hadn't yet fully grasped that random I/O was the core problem. I thought the fetch was slow because of the index, and the update was slow because of tuple rewrites and index maintenance. So I made two changes:
- Load all work item IDs into a local memory-mapped file and track which ones had been processed, eliminating the need for the partial index entirely.
- Store results in a separate table — insert-only, no column on
work_item— so I could avoid UPDATE overhead entirely.
const int batchSize = 150_000;
Guid[] allIds = LoadIdsFromBinaryFile();
HashSet<Guid> processedIds = LoadProcessedIdsFromBinaryFile();
foreach (Guid[] batch in allIds
.Where(id => !processedIds.Contains(id))
.Chunk(batchSize))
{
// Translate to:
// SELECT id, data FROM work_item WHERE id IN (guid1, guid2, ...)
WorkItem[] rowsToProcess = await context.WorkItems
.Where(w => batch.Contains(w.Id))
.AsNoTracking()
.ToArrayAsync();
WorkItemResult[] results = await ProcessWorkItems(rowsToProcess);
// Bulk COPY into permanent results table (insert-only)
using var writer = conn.BeginBinaryImport(
"COPY work_item_result (work_item_id, result) FROM STDIN (FORMAT BINARY)");
foreach (WorkItemResult r in results)
{
writer.StartRow();
writer.Write(r.WorkItemId, NpgsqlDbType.Uuid);
writer.Write(r.Result, NpgsqlDbType.Text);
}
writer.Complete();
foreach (WorkItemResult r in results)
processedIds.Add(r.WorkItemId);
PersistProcessedIds(processedIds);
}

The good news: writes were much faster. Inserting into a results table took a couple of seconds instead of a couple of minutes. No tuple rewrites, no index churn on work_item, just appending rows to a fresh table. The binary COPY protocol made this very efficient.
The bad news: the fetch was even slower. Loading a batch by a WHERE id IN (...) clause with 150k UUIDs took over a minute. This was worse than the partial index approach.
Why? Same problem as before — random page reads. The IDs I was using were generated with XxHash128, so they had zero locality. With 150k random UUIDs, the lookups scattered across the entire 40 GB table and every row was a cache miss.
Hindsight: If I had used UUID v7 (which embeds a timestamp and sorts chronologically) when creating the work items, the IDs would have had natural locality and this approach might have been fine. But I was working with what I had.
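For what it's worth, generating time-ordered IDs is easy on newer platforms. A sketch, assuming .NET 9 or later (the project in this post targeted .NET 8, which does not have this API):

```csharp
// .NET 9+: Guid.CreateVersion7() prefixes the UUID with a Unix
// millisecond timestamp, so IDs generated close together in time
// also sit close together when sorted by an index.
Guid first = Guid.CreateVersion7();
Guid later = Guid.CreateVersion7(DateTimeOffset.UtcNow.AddMinutes(5));
```

On the database side, UUIDv7 generation is available through community extensions if your Postgres version doesn't ship one natively.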
The Breakthrough: CTID
So writes were solved. But I still needed to figure out how to read rows efficiently. The fundamental problem was clear now: random I/O is the bottleneck, and no indexing strategy will help when the rows you need are scattered across millions of pages.
I needed a way to guarantee that each batch reads rows from the same or adjacent pages (physical locality).
That's when I found CTID in the Postgres documentation:
"ctid — The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid should not be used as a long-term row identifier."
Did you know that you can SELECT ctid from any table? I didn't. This was the unlock.
SELECT ctid, id, data FROM work_item
WHERE ctid >= '(0,1)'::tid AND ctid <= '(1000,0)'::tid ORDER BY ctid;

CTID gives you direct access to the physical location of a row. A range query on CTID translates directly to a sequential scan of a contiguous block of pages on disk. No index lookups, no random I/O; just "read pages 0 through 1000 in order." This is exactly how the OS prefetcher and disk controller want to read data.
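Since PostgreSQL 14, the planner even has a dedicated plan node for this pattern. A quick EXPLAIN sketch (plan shape only; cost numbers omitted and will vary):

```sql
EXPLAIN SELECT id, data FROM work_item
WHERE ctid >= '(0,1)'::tid AND ctid <= '(1000,0)'::tid;
--  Tid Range Scan on work_item
--    TID Cond: ((ctid >= '(0,1)'::tid) AND (ctid <= '(1000,0)'::tid))
```

No B-tree traversal appears anywhere in the plan: the TID condition is translated straight into a bounded walk over heap pages.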
And remember the critical caveat about CTIDs not being stable? It doesn't matter here. My work_item table was static — no updates, no deletes, no VACUUM FULL. The CTIDs would never change. And since I was storing results in a separate table (thanks to approach 2), I wasn't modifying work_item at all.
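If you want to see the instability the docs warn about first-hand, a throwaway psql session shows it (table name and ctid values here are illustrative):

```sql
-- MVCC in action: an UPDATE writes a new tuple version at a new
-- physical location, so the row's ctid changes.
CREATE TABLE scratch (id int, val text);
INSERT INTO scratch VALUES (1, 'a');
SELECT ctid FROM scratch WHERE id = 1;  -- e.g. (0,1)
UPDATE scratch SET val = 'b' WHERE id = 1;
SELECT ctid FROM scratch WHERE id = 1;  -- e.g. (0,2): same row, new address
```

On a static table none of this happens, which is the property the rest of the approach relies on.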
The Final Solution
The approach: load all CTIDs up front, sort them, then process sequential ranges. Each batch reads a contiguous slice of the physical table. This is significantly faster than the index scan approach.
Index Scan vs CTID Range Scan
Compare the two access patterns: the index scan's reads jump all over the table, while the CTID range scan sweeps through contiguous pages and finishes far quicker. For a real batch of 150k rows, the index scan would jump across tens of thousands of pages scattered throughout the 40 GB table, while the CTID range scan would read a few thousand contiguous pages (at roughly 40 rows per 8 KB page) in order. The difference in pages touched, and in how they are touched, is enormous.
The Code
A few small types and helpers feed into the main loop. Expand to see the full implementation details.
A struct to represent CTIDs in C#
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public readonly record struct Ctid(uint Page, ushort Offset) : IComparable<Ctid>
{
public int CompareTo(Ctid other)
{
int cmp = Page.CompareTo(other.Page);
return cmp != 0 ? cmp : Offset.CompareTo(other.Offset);
}
public override string ToString() => $"({Page},{Offset})";
}

Loading all CTIDs from the database and persisting them to a binary file
ORDER BY ctid is cheaper than ordering by any other column: the sequential scan returns rows in near-physical order, and sorting the 6-byte TIDs that come back is trivial compared to sorting on a wider key.
const int CtidBinarySize = 6; // 4 bytes (uint Page) + 2 bytes (ushort Offset)
static async Task<Ctid[]> FetchAllCtidsFromDb(NpgsqlConnection conn)
{
await using var cmd = new NpgsqlCommand(
"SELECT ctid FROM work_item ORDER BY ctid", conn);
var ctids = new List<Ctid>();
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
var tid = reader.GetFieldValue<NpgsqlTid>(0);
ctids.Add(new Ctid(tid.BlockNumber, tid.OffsetNumber));
}
return ctids.ToArray();
}
static void SaveAllCtids(Ctid[] sortedCtids)
{
using var stream = new FileStream(AllCtidsFilePath, FileMode.Create,
FileAccess.Write, FileShare.None, bufferSize: 4 * 1024 * 1024);
stream.Write(MemoryMarshal.AsBytes(sortedCtids.AsSpan()));
}
static Ctid[] LoadAllCtids()
{
if (!File.Exists(AllCtidsFilePath))
return [];
int count = (int)(new FileInfo(AllCtidsFilePath).Length / CtidBinarySize);
Ctid[] ctids = new Ctid[count];
using var stream = new FileStream(AllCtidsFilePath, FileMode.Open,
FileAccess.Read, FileShare.Read, bufferSize: 4 * 1024 * 1024,
FileOptions.SequentialScan);
stream.ReadExactly(MemoryMarshal.AsBytes(ctids.AsSpan()));
return ctids;
}

Calculating the next batch range using binary search
static (Ctid Start, Ctid End)? GetNextCtidRange(Ctid currentCtid, Ctid[] sortedCtids, int batchSize)
{
int startIndex = Array.BinarySearch(sortedCtids, currentCtid);
if (startIndex >= 0)
startIndex++;
else
startIndex = ~startIndex;
if (startIndex >= sortedCtids.Length)
return null;
int count = Math.Min(batchSize, sortedCtids.Length - startIndex);
return (sortedCtids[startIndex], sortedCtids[startIndex + count - 1]);
}

The range query itself
static async Task<WorkItem[]> FetchBatchByCtidRange(
NpgsqlConnection conn, Ctid startCtid, Ctid endCtid)
{
await using var cmd = new NpgsqlCommand($@"
SELECT id, data
FROM work_item
WHERE ctid >= '({startCtid.Page},{startCtid.Offset})'::tid
AND ctid <= '({endCtid.Page},{endCtid.Offset})'::tid
ORDER BY ctid", conn);
var results = new List<WorkItem>();
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
results.Add(new WorkItem
{
Id = reader.GetGuid(0),
Data = reader.GetString(1),
});
}
return results.ToArray();
}

And tying it all together — the main processing loop:
const int batchSize = 150_000;
Ctid[] allCtids = LoadAllCtids();
if (allCtids.Length == 0)
{
allCtids = await FetchAllCtidsFromDb(conn);
SaveAllCtids(allCtids);
}
// resumes from last checkpoint
Ctid currentCtid = LoadCurrentCtid();
while (GetNextCtidRange(currentCtid, allCtids, batchSize) is (Ctid startCtid, Ctid endCtid))
{
WorkItem[] rowsToProcess = await FetchBatchByCtidRange(conn, startCtid, endCtid);
WorkItemResult[] results = await ProcessWorkItems(rowsToProcess);
// Bulk COPY results into permanent results table
using var writer = conn.BeginBinaryImport(
"COPY work_item_result (work_item_id, result) FROM STDIN (FORMAT BINARY)");
foreach (WorkItemResult r in results)
{
writer.StartRow();
writer.Write(r.WorkItemId, NpgsqlDbType.Uuid);
writer.Write(r.Result, NpgsqlDbType.Text);
}
writer.Complete();
currentCtid = endCtid;
SaveCurrentCtid(currentCtid);
}

And just like that, all my problems were solved.
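The loop leans on two checkpoint helpers, LoadCurrentCtid and SaveCurrentCtid, which are not shown above. They can be as simple as reading and writing the 6-byte struct to a small file. A minimal sketch (the file path is an assumption):

```csharp
using System.IO;
using System.Runtime.InteropServices;

const string CurrentCtidFilePath = "current_ctid.bin"; // hypothetical path

static Ctid LoadCurrentCtid()
{
    // No checkpoint yet: default (0,0) sorts before the first possible
    // ctid (0,1), so GetNextCtidRange starts at the top of the table.
    if (!File.Exists(CurrentCtidFilePath))
        return default;

    Span<byte> buffer = stackalloc byte[6];
    using FileStream stream = File.OpenRead(CurrentCtidFilePath);
    stream.ReadExactly(buffer);
    return MemoryMarshal.Read<Ctid>(buffer);
}

static void SaveCurrentCtid(Ctid ctid)
{
    // Writes exactly 6 bytes: 4 for Page, 2 for Offset (Pack = 1 layout).
    Span<byte> buffer = stackalloc byte[6];
    MemoryMarshal.Write(buffer, in ctid);
    File.WriteAllBytes(CurrentCtidFilePath, buffer.ToArray());
}
```

Because the checkpoint is a single value written after each committed batch, a crash at any point simply replays at most one batch on restart.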
The Results
The fetch query dropped from 20–30 seconds (approach 1) and 60+ seconds (approach 2) down to 1–3 seconds. The write was already fast from approach 2 at a couple seconds per batch. The full pipeline went from a projected 2–3 day runtime to a few hours.
That's roughly a 30x speedup on the read side, purely from better I/O access patterns.
Memory Savings
Beyond raw speed, CTIDs also saved significant memory both on disk and in RAM:
| Metric | GUID-based (Approach 2) | CTID-based (Final) |
|---|---|---|
| ID size | 16 bytes | 6 bytes |
| 200M IDs in memory | ~3.2 GB | ~1.2 GB |
| Binary file on disk | ~3.2 GB | ~1.2 GB |
| Progress tracking | HashSet of processed GUIDs | Single 6-byte CTID checkpoint |
With GUIDs, I needed a HashSet<Guid> holding millions of entries to track which IDs had been processed. With CTIDs, I just save a single 6-byte value — the current position — to a file. Resume means loading that one value and continuing from where I left off.
Bonus: Trivial Parallelization
Another benefit of CTID-based processing is that it scales trivially to multiple workers. Since the CTIDs are pre-sorted and you're just carving out non-overlapping ranges, you can partition the work with zero coordination:
// Split the sorted CTIDs into N chunks for N workers
int chunkSize = allCtids.Length / workerCount;
Task[] tasks = Enumerable.Range(0, workerCount).Select(i =>
{
int start = i * chunkSize;
int end = (i == workerCount - 1) ? allCtids.Length : start + chunkSize;
Ctid[] workerCtids = allCtids[start..end];
return Task.Run(() => ProcessCtidRange(workerCtids));
}).ToArray();
await Task.WhenAll(tasks);

Each worker gets its own contiguous slice of the table (and needs its own NpgsqlConnection, since a connection can't be shared across concurrent operations). No locks, no contention, no overlapping reads. Each worker writes to the same results table, and Postgres handles concurrent inserts fine. The work is trivial to parallelize.
Other SQL Engines
If you're coming from MySQL, SQL Server, or another engine, this whole trick might sound a little strange. "Why not just scan the primary key in order?" In many systems, that's exactly what you would do.
The reason this was necessary in Postgres is that Postgres treats all indexes as secondary structures, including primary key indexes. The table data itself lives separately in the heap. The primary key index is just a B-tree that maps each key to the heap using TIDs. So a query like:
SELECT id, data
FROM work_item
WHERE id > @lastSeenId
ORDER BY id
LIMIT 150000;

looks sequential logically, but it is not sequential physically. Postgres walks the primary key index in sorted order, then follows each index entry back to a heap page somewhere else on disk. If the heap is not physically ordered by id, those reads are still random. That's the entire problem CTID sidesteps.
Many other engines make a different tradeoff: they store the table in the order of a clustered index. In that world, the leaf pages of the primary key B-tree are the table. Walking the primary key in order means you're already walking the physical rows in order too.
For example:
- MySQL InnoDB clusters the table by the primary key. A range scan on the primary key is already a physically ordered read
- SQL Server can define a clustered index, and when a table has one, the row data is stored in that index order
Clustered Index: leaf pages ARE the data
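To make the contrast concrete, here's a sketch of the clustered-index world (the index name is made up):

```sql
-- SQL Server: clustering is explicit. After this, the leaf pages of
-- ix_work_item_id ARE the row data, so a range scan on id reads
-- physically contiguous pages.
CREATE CLUSTERED INDEX ix_work_item_id ON work_item (id);

-- MySQL InnoDB needs no equivalent statement: the PRIMARY KEY is
-- always the clustered index, so a keyset query like
-- WHERE id > @lastSeenId ORDER BY id is already a sequential read.
```

In either engine, the CTID trick would buy you little; it exists to compensate for Postgres's heap-plus-secondary-index layout.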
When to Use This Technique
This approach works well under specific conditions:
- The table is static (or at least stable during processing). If rows are actively being deleted, updated, or vacuumed, CTIDs will shift and your ranges will become unreliable.
- The table is large enough that random I/O is the bottleneck. If your table fits comfortably in shared_buffers or the OS page cache, you won't see much difference because all reads are effectively in memory anyway.
- You need to process all or most rows. If you're only touching a small fraction of the table, an index scan might be faster since it avoids reading irrelevant pages.
- You can tolerate a one-time upfront cost of loading all CTIDs.
Key Takeaways
- Indexes solve search, not I/O. An index tells Postgres which rows match. It doesn't guarantee fast retrieval of the data in the matched rows.
- Physical locality matters more than you think. The performance difference between sequential and random I/O is not 2x or 5x — it can easily be 10x to 100x.
- CTIDs are an underused tool. They give you something no index can: direct control over your I/O access pattern.
- Separate your read and write paths. Moving results into a separate insert-only table eliminated UPDATE overhead entirely. Combined with binary COPY, writes were negligible compared to reads.
- Always measure where your time is going. I spent time optimizing the wrong thing (index strategy, update patterns) before realizing the real bottleneck was random disk reads. A quick EXPLAIN (ANALYZE, BUFFERS) on the fetch query would have shown the heap fetch cost immediately.
- Design your IDs with access patterns in mind. If I had used UUID v7 (which sorts chronologically) instead of a random hash for the work item IDs, the original approach might have been good enough.
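The measurement point is cheap to act on. A sketch of the diagnostic against the original fetch query (output abbreviated; the numbers are illustrative):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, data FROM work_item
WHERE process_result IS NULL
LIMIT 150000;
-- Watch the Buffers line on the heap-fetching node, e.g.:
--   Buffers: shared hit=2143 read=148221
-- A 'read' count close to the number of rows returned means nearly
-- every row came from a different page on disk: the signature of
-- random I/O.
```

The same query after switching to a CTID range predicate would show a page count two orders of magnitude smaller for the same number of rows.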