How Lazy Container Loading Works
Understanding Modal's image loader: content addressing, FUSE, and 250 lines of Python
A few weeks ago I read Modal’s post on truly serverless GPUs. They describe getting GPU container starts from many minutes down to a few seconds, even on multi-gigabyte images. A few tricks make that possible. I wanted to build the one that handles the image itself, and watch it work.
The core trick is surprisingly small, and it’s the same idea behind stargz, SOCI, Nydus, and Modal’s image loader: don’t download the image. Mount a tiny index, and fetch file contents lazily, the first time something actually reads them.
So I did what I usually do and built a toy version.
The idea: a process reads almost nothing
A normal container runtime downloads and unpacks every byte of an image before the container starts. But think about what a process actually touches at startup: a Python interpreter, a handful of libraries, your code. A few hundred megabytes of a multi-gigabyte image, often less. The rest of those bytes are downloaded, decompressed, written to disk, and never read.
So the plan is three steps:
- Package the image so the file contents live somewhere fetchable, and the image itself becomes just an index: the directory tree plus, per file, where to find its bytes. No bytes in the index.
- Mount the index through a filesystem. Every file appears, with the right size and metadata, but nothing is downloaded.
- Fetch on read. When a process actually read()s a file, fetch only the parts it touches, and cache them. Files nobody opens are never fetched.
The filesystem part is FUSE, “filesystem in userspace.” You mount a directory, and from then on every filesystem syscall the kernel sees on it (stat, readdir, open, read) gets forwarded to your code. That’s the trick the whole idea is based on.
Packaging: content addressing
Here’s the first design decision, and it’s the one that does the most work. When I split files into chunks to store them, I don’t name the chunks chunk_0, chunk_1. I name each chunk by the SHA-256 hash of its own contents.
while chunk := f.read(chunk_size):  # 64 KiB chunks
    h = hashlib.sha256(chunk).hexdigest()
    chunk_hashes.append(h)  # the file is now an ordered list of hashes
    if h not in seen:  # store each unique chunk exactly once
        seen.add(h)
        (store / h).write_bytes(chunk)
This is content addressing: the address is the hash of the content. Two consequences come for free, and both matter at scale:
- Deduplication. Identical chunks hash to the same name, so they’re stored once, even across completely different images. The base layer shared by a thousand images is stored once. Let’s forget about collisions for a moment.
- Immutability. A hash can’t change. So any cache of a chunk, anywhere, is valid forever. There is no cache invalidation logic, because there is no such thing as a stale chunk.
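To make the deduplication concrete, here is a minimal in-memory sketch. It assumes a plain dict stands in for the store/ directory; the `chunk_bytes` helper and the test data are illustrative, not the toy's actual code.

```python
import hashlib

CHUNK = 64 * 1024


def chunk_bytes(data: bytes, store: dict, chunk_size: int = CHUNK) -> list:
    """Split data into chunks, keep each unique chunk once, return the hash list."""
    hashes = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        hashes.append(h)
        store.setdefault(h, chunk)  # a chunk that is already present is not stored again
    return hashes


store = {}  # hypothetical stand-in for the store/ directory
file_a = b"".join(bytes([i]) * CHUNK for i in range(4))  # 4 distinct 64 KiB chunks
file_b = bytes(file_a)                                   # a byte-identical copy

refs_a = chunk_bytes(file_a, store)
refs_b = chunk_bytes(file_b, store)
print(len(refs_a) + len(refs_b), "chunk refs,", len(store), "unique")
# 8 chunk refs, 4 unique
```

The second file adds four references to its own hash list but no new blobs, and either file can be rebuilt by concatenating the stored chunks in the order its hash list gives.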
The output is two things: a store/ full of chunk blobs named by hash, and an image.json index that holds the “filesystem” metadata. On my little test tree the split looks like this:
files / dirs : 5 / 2
logical size : 4.50 MiB
chunk refs / unique : 74 / 70 # 4 chunks were duplicates
store size on disk : 4.25 MiB (dedup 1.06x)
index size : 6.7 KiB <- this is all that mounts
A 4.5 MiB image, and only the 6.7 KiB index gets mounted. At real scale that’s the difference between a multi-gigabyte image and a few-megabyte index. Two of my test files were byte-identical copies, and you can see them share blobs: the dedup ratio is above 1, and the two files point at the exact same hashes in the index.
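As a rough picture of what such an index might hold, here is a sketch that assumes a flat mapping from absolute path to node. The field names (`type`, `size`, `chunks`, `children`) and the shortened hashes are made up for illustration; they are not the actual image.json schema.

```python
import json

CHUNK = 64 * 1024

# Hypothetical index: metadata and chunk hashes only, no file bytes.
index = {
    "chunk_size": CHUNK,
    "nodes": {
        "/": {"type": "dir", "children": ["file_a.bin", "file_b.bin"]},
        # Two byte-identical files reference the same (shortened, made-up) hashes.
        "/file_a.bin": {"type": "file", "size": 4 * CHUNK,
                        "chunks": ["aa11", "bb22", "cc33", "dd44"]},
        "/file_b.bin": {"type": "file", "size": 4 * CHUNK,
                        "chunks": ["aa11", "bb22", "cc33", "dd44"]},
    },
}

encoded = json.dumps(index).encode()
logical = sum(n["size"] for n in index["nodes"].values() if n["type"] == "file")
print(f"logical size {logical} B, index size {len(encoded)} B")
```

With real SHA-256 names, each 64 KiB chunk costs a 64-character hex string in the index, roughly a thousandth of the data it points at, which is consistent with a multi-gigabyte image shrinking to a few-megabyte index.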
Mounting: every file appears, nothing downloads
Now the FUSE side. To make the filesystem browsable, I only need to answer metadata questions, and metadata is all the index holds. Two operations take care of it:
def getattr(self, path, fh=None):
    node = self._node(path)  # look the path up in the index
    # ... return st_mode, st_size, st_mtime, ... straight from the node

def readdir(self, path, fh):
    node = self._node(path)
    return [".", "..", *node.children]  # directory listing, from the index
Neither one touches the chunk store. They answer entirely from the in-memory index. That’s the whole reason a giant image can mount in milliseconds: the kernel is only asking about files, not for them.
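Stripped of the FUSE plumbing, answering from the index is just dictionary lookups. The sketch below assumes a dict keyed by absolute path; the `INDEX` layout and the two function names are hypothetical stand-ins for the methods above.

```python
import stat

# Hypothetical in-memory index keyed by absolute path (field names are illustrative).
INDEX = {
    "/": {"mode": stat.S_IFDIR | 0o755, "size": 0,
          "children": ["big_file.bin", "subfolder"]},
    "/big_file.bin": {"mode": stat.S_IFREG | 0o644, "size": 4 * 1024 * 1024},
    "/subfolder": {"mode": stat.S_IFDIR | 0o755, "size": 0, "children": []},
}


def getattr_from_index(path: str) -> dict:
    """Answer a stat() purely from the index; no chunk is read."""
    node = INDEX[path]  # a real server would report ENOENT on a miss
    return {"st_mode": node["mode"], "st_size": node["size"], "st_nlink": 1}


def readdir_from_index(path: str) -> list:
    """List a directory purely from the index."""
    return [".", "..", *INDEX[path]["children"]]


attrs = getattr_from_index("/big_file.bin")
print(attrs["st_size"], stat.S_ISREG(attrs["st_mode"]))  # 4194304 True
print(readdir_from_index("/"))  # ['.', '..', 'big_file.bin', 'subfolder']
```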
Mount it and browse:
[mount] index loaded: 7 nodes, 6.7 KiB, 0 chunks fetched
$ ls -lR mnt
mnt:
total 0
-rw-r--r-- 1 jordi jordi 4194304 May 31 16:16 big_file.bin
-rw-r--r-- 1 jordi jordi 262144 May 31 16:16 file_a.bin
-rw-r--r-- 1 jordi jordi 262144 May 31 16:16 file_b.bin
-rw-r--r-- 1 jordi jordi 10 May 31 16:14 note.md
drwxr-xr-x 2 jordi jordi 0 May 31 16:15 subfolder
mnt/subfolder:
total 0
-rw-r--r-- 1 jordi jordi 20 May 31 16:15 note.md
Every file is there with its real size. big_file.bin reports 4 MiB. Most importantly, zero chunks have been fetched. We’re browsing a 4 MiB file that doesn’t exist on this machine yet.
The first attempt didn’t quite work, and the reason is a nice little lesson about FUSE. Files showed up empty, every lookup missed. The bug was in how I built the index keys: I’d let the root path collapse and stored nodes under keys like big_file.bin instead of /big_file.bin. FUSE always hands you absolute paths starting with /, so the kernel asked for /big_file.bin and the index only knew big_file.bin. One leading slash, every file invisible. The fix was one character.
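One way to guard against this class of bug is to build every key through a single helper that always produces the absolute form FUSE will ask for. This is a sketch under that assumption, not the actual one-character fix; `index_key` is a hypothetical name.

```python
import os
import tempfile


def index_key(root: str, path: str) -> str:
    """Map an on-disk path under root to the absolute key FUSE will ask for."""
    rel = os.path.relpath(path, root)
    if rel == ".":  # the root directory itself
        return "/"
    return "/" + rel.replace(os.sep, "/")


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "subfolder"))
    for name in ("big_file.bin", os.path.join("subfolder", "note.md")):
        with open(os.path.join(root, name), "w") as f:
            f.write("x")

    keys = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        keys.add(index_key(root, dirpath))
        for name in filenames:
            keys.add(index_key(root, os.path.join(dirpath, name)))

print(sorted(keys))
# ['/', '/big_file.bin', '/subfolder', '/subfolder/note.md']
```

Every key starts with a slash, including the root, so a lookup for /big_file.bin finds its node.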
Lazy reads: fetch only what you touch
This is the part the whole thing exists for. A read asks for a byte range, an offset and a size. That range maps to a range of chunks, and I fetch only those.
def read(self, path, size, offset, fh):
    node = self._node(path)
    end = min(offset + size, node.size)  # never read past EOF
    out = bytearray()
    pos = offset
    while pos < end:
        idx = pos // self.chunk_size  # which chunk this byte is in
        chunk = self._get_chunk(node.chunks[idx])
        cstart = pos - idx * self.chunk_size  # where in the chunk we start
        cend = min(len(chunk), cstart + (end - pos))
        out.extend(chunk[cstart:cend])
        pos += cend - cstart
    return bytes(out)
And _get_chunk is where the fetch and the cache live. It’s the only code that touches the store:
def _get_chunk(self, h):
    if h in self.chunk_cache:  # already fetched once -> free
        return self.chunk_cache[h]
    data = (self.store / h).read_bytes()  # the "remote" fetch
    self.chunk_cache[h] = data
    return data
Because chunks are immutable, the cache never needs invalidating. A chunk is fetched at most once, ever.
I wrote this and ran the obvious tests: cat a small file, read an aligned block out of the middle of the big one. They passed. Then I tried a 4 KiB read:
AssertionError: actual amount read 65536 greater than expected 4096
The first version copied each chunk whole and advanced by the whole chunk. It passed the first tests only because those reads happened to be chunk-aligned. The moment a read ended in the middle of a chunk, I returned 64 KiB when the kernel asked for 4 KiB, and FUSE rejected it. Worse, an unaligned start would have silently returned shifted bytes, the kind of bug that doesn’t crash but quietly corrupts. The fix is the two lines that slice the chunk down to exactly the requested sub-range (cstart/cend above). A range copy tested only on aligned inputs passes every test and ships, then breaks the first time reality hands it an odd offset.
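A cheap way to catch this kind of bug is to shrink the chunk size until every alignment case can be enumerated, and compare against plain slicing. The sketch below reuses the slicing logic of read() above over an in-memory chunk list; the tiny 8-byte chunk size and the `read_range` name are choices made for the test, not part of the toy.

```python
CHUNK = 8  # tiny chunk size so every alignment case is cheap to enumerate

data = bytes(range(50))  # 50 bytes -> 7 chunks, the last one short
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


def read_range(offset: int, size: int) -> bytes:
    """Same slicing logic as read(), over an in-memory chunk list."""
    end = min(offset + size, len(data))  # never read past EOF
    out = bytearray()
    pos = offset
    while pos < end:
        idx = pos // CHUNK
        chunk = chunks[idx]
        cstart = pos - idx * CHUNK
        cend = min(len(chunk), cstart + (end - pos))
        out.extend(chunk[cstart:cend])
        pos += cend - cstart
    return bytes(out)


# Exhaustively compare against plain slicing: aligned, unaligned,
# zero-length, and past-EOF reads are all covered.
checked = 0
for offset in range(len(data) + 2):
    for size in range(len(data) + 2):
        assert read_range(offset, size) == data[offset:offset + size]
        checked += 1
print("checked", checked, "ranges")
```

A version that copies whole chunks fails this loop on the first read whose size is not a multiple of the chunk size.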
With that fixed, here’s the payoff. big_file.bin is 4 MiB, 64 chunks. I read 64 KiB out of the middle of it:
$ dd if=mnt/big_file.bin of=/dev/null bs=64K skip=20 count=1
[fetch] 8c8ab169ff39 (65536 B) [cumulative: 64 KiB]
--> chunks fetched: 1 of 64
One chunk. Not the file. The other 63 chunks, nearly all of the 4 MiB, still don’t exist locally.[1] And reading the same range again fetches nothing: it’s a cache hit. That’s the entire thesis in one command: a process reads a little, so we fetch a little.
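The arithmetic behind that single fetch can be sketched as a small helper that maps a byte range to chunk indices. `chunks_touched` is a hypothetical name, and the 64 KiB chunk size is the one used throughout.

```python
CHUNK = 64 * 1024
FILE_SIZE = 4 * 1024 * 1024


def chunks_touched(offset: int, size: int, file_size: int) -> range:
    """Indices of the chunks a read of `size` bytes at `offset` needs."""
    end = min(offset + size, file_size)  # clamp at EOF
    if end <= offset:
        return range(0)
    return range(offset // CHUNK, (end - 1) // CHUNK + 1)


# dd bs=64K skip=20 count=1 reads 64 KiB starting at byte 20 * 64 KiB.
print(list(chunks_touched(20 * CHUNK, CHUNK, FILE_SIZE)))         # [20]
# The same 64 KiB shifted by 4 KiB straddles a chunk boundary.
print(list(chunks_touched(20 * CHUNK + 4096, CHUNK, FILE_SIZE)))  # [20, 21]
```

An aligned 64 KiB read lands in exactly one chunk, while the same read shifted by 4 KiB needs two.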
Let’s make it fast
The lazy-fetch core above is genuinely a few hours’ work. For simplicity, it reads blobs from local storage. But what happens when the fetch isn’t a local file read but a network round-trip to a blob store, and you still want a sequential scan of a big file to go fast? That’s where the real engineering is. Here are the three levers, in the order I learned they mattered.
To measure them I made the fetch latency explicit: a cold-cache sequential scan of an 8 MiB file (128 chunks of 64 KiB), with a simulated 3 ms “remote” latency on every chunk fetch (a simple sleep in the fetch code). If you fetch one chunk at a time and wait, the floor is unavoidable: 128 × 3 ms = 0.384 s. The whole game is getting under that by overlapping fetches.
1. Concurrency comes first, or nothing else matters. The kernel reads ahead: when it sees a sequential scan, it asks for chunks before the app needs them. But if your FUSE server is single-threaded, those prefetch reads just queue up behind the current one, and you pay the fetch latency serially anyway. At the default window, flipping the server from single-threaded to concurrent is the difference between:
| window | threads | scan time |
|---|---|---|
| 128 K | no | 0.417 s |
| 128 K | yes | 0.228 s |
About 1.8× for one flag. And it’s the prerequisite for everything else: tuning the window does nothing if a single thread serializes the fetches anyway.
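The effect of overlapping fetches can be reproduced without FUSE at all. This sketch models only the fetch layer, with a thread pool standing in for a concurrent server; the 32-chunk scan, the 8 workers, and the `fetch` stub are illustrative assumptions, not the measured setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.003  # simulated 3 ms "remote" fetch, as in the measurement above
N_CHUNKS = 32    # smaller than the 128-chunk scan, to keep the demo quick


def fetch(i: int) -> bytes:
    time.sleep(LATENCY)          # stand-in for a network round-trip
    return i.to_bytes(2, "big")  # stand-in for chunk contents


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# One fetch at a time: every round-trip is paid in full.
serial, t_serial = timed(lambda: [fetch(i) for i in range(N_CHUNKS)])

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() yields results in submission order, so chunk order is preserved.
    overlapped, t_overlapped = timed(
        lambda: list(pool.map(fetch, range(N_CHUNKS)))
    )

print(f"serial {t_serial:.3f} s, 8 in flight {t_overlapped:.3f} s")
```

Both runs return the same bytes in the same order; only the waiting overlaps.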
2. The read-ahead setting is not the setting you think it is. FUSE has a max_readahead option, so I set it high and measured. Nothing changed. It turns out that option is only a ceiling. The window the kernel actually uses is per-mount in sysfs, /sys/class/bdi/<dev>/read_ahead_kb, and it defaults to 128 KiB no matter what you passed to the mount:
cat /sys/class/bdi/$dev/read_ahead_kb # 128 <- the default that bites you
echo 1024 | sudo tee /sys/class/bdi/$dev/read_ahead_kb # now the window is actually 1 MiB
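The same read-back can be done from Python. The sketch assumes the bdi entry is named after the mount's major:minor device number, which is what `$dev` stands for in the commands above; both helper names are hypothetical, and the second one only works on Linux with the mount present.

```python
import os


def bdi_read_ahead_path(st_dev: int) -> str:
    """Build the sysfs path for a device number such as os.stat(mountpoint).st_dev."""
    return f"/sys/class/bdi/{os.major(st_dev)}:{os.minor(st_dev)}/read_ahead_kb"


def effective_read_ahead_kb(mountpoint: str) -> int:
    """Read back the window the kernel is actually using for this mount (Linux only)."""
    path = bdi_read_ahead_path(os.stat(mountpoint).st_dev)
    with open(path) as f:
        return int(f.read().strip())


# Example with a made-up device number 0:52; a real call would pass the mountpoint.
print(bdi_read_ahead_path(os.makedev(0, 52)))  # /sys/class/bdi/0:52/read_ahead_kb
```

Calling `effective_read_ahead_kb("mnt")` after mounting reports the live value rather than the one requested at mount time.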
I only caught it by reading the effective value back from sysfs instead of trusting the flag I’d set. Once it’s actually set, the window matters a lot (threaded, cold cache, same 8 MiB scan):
| effective read_ahead_kb | scan time |
|---|---|
| 128 | 0.228 s |
| 1024 | 0.064 s |
| 4096 | 0.058 s |
A 1 MiB window keeps about 16 chunks in flight, which hides almost all the latency. The scan is about 3.6× faster than at the 128 KiB default, and we’re well under the 0.384 s serial floor now. Pushing it to 4 MiB barely helps, but note 4 MiB is already half this 8 MiB file: the window saturates quickly simply because the file is small. The interesting regime is a real multi-gigabyte file, where this same curve has room to run, and where bigger stops being better: too large and you prefetch megabytes nobody wanted and evict useful cache.[2] Verify the setting that’s live, not the one you asked for.
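A back-of-the-envelope model helps read those numbers. It assumes the window is always full and that fetches within it overlap perfectly, which is an idealization rather than a description of what the kernel does; `scan_estimate` is a hypothetical helper.

```python
import math

CHUNK_KIB = 64
N_CHUNKS = 128     # the 8 MiB scan
LATENCY_S = 0.003  # simulated per-chunk fetch latency


def scan_estimate(read_ahead_kb: int) -> float:
    """Idealized scan time: full window, perfectly overlapped fetches."""
    in_flight = max(1, read_ahead_kb // CHUNK_KIB)
    return math.ceil(N_CHUNKS / in_flight) * LATENCY_S


for kb in (128, 1024, 4096):
    print(f"{kb:>5} KiB window: {kb // CHUNK_KIB:>2} chunks in flight, "
          f"ideal {scan_estimate(kb):.3f} s")
```

The ideal times come out to 0.192 s, 0.024 s, and 0.006 s. The measured 0.228 s, 0.064 s, and 0.058 s all sit above them, and the flat tail between 1 MiB and 4 MiB suggests that once latency is hidden, something other than the fetch (presumably per-request overhead) sets the floor.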
3. The third lever I couldn’t reach, and that’s the interesting part. There’s a separate setting from the window: the request size, how much data the kernel asks for per round-trip to my server. I tried to raise it and the largest single read I ever saw stayed pinned at 128 KiB, no matter what I passed:
| requested max read | largest single FUSE read |
|---|---|
| default | 131072 B |
| 256 K | 131072 B |
| 1 M | 131072 B |
The reason isn’t the code. The Python FUSE binding I’m using (fusepy) links libfuse version 2, and libfuse 2.x hard-caps a single read at 32 pages = 128 KiB. The capability that lifts that cap (max_pages) only exists in libfuse 3 with a recent kernel; the option I was setting (max_read) can only ever lower the cap, never raise it. From this binding, the lever simply doesn’t exist.
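Measuring the largest request needs nothing more than recording the `size` argument each read callback receives. This is a sketch of that bookkeeping with a hypothetical `ReadSizeTracker` and hand-fed sizes; the 4 KiB page size is an assumption that holds on typical x86-64 Linux.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class ReadSizeTracker:
    """Records the largest `size` argument seen by a read callback."""

    def __init__(self):
        self.largest = 0
        self.count = 0

    def record(self, size: int) -> None:
        self.count += 1
        self.largest = max(self.largest, size)


tracker = ReadSizeTracker()
# Stand-in for read(path, size, offset, fh) calls arriving from the kernel.
for size in (4096, 65536, 131072, 131072):
    tracker.record(size)

print(tracker.largest, tracker.largest // PAGE_SIZE)  # 131072 32
```

In a real server, `record(size)` would be the first line of read(), and the 131072 B ceiling in the table corresponds to exactly 32 pages.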
That last one is the punchline of the whole exercise. Two of the three throughput levers I could demonstrate. The third I could only prove was out of reach, and why it was out of reach. It’s a concrete, specific reason production systems like Modal don’t write their image loaders in Python with fusepy. They write them in Rust on modern libfuse, because that’s the only way to reach both the window and the request-size knobs. The toy doesn’t just show you how lazy loading works; it shows you exactly where the toy ends and the real engineering begins.
Where the toy ends
Like any toy, the interesting part is what it leaves out, and that’s where the actual product is:
- A real cache fabric. The cache is a Python dict. The real thing is tiered: local page cache, then SSD, then a zonal cache, then regional, then a CDN, then the blob store, with all the ops to run it. This tier is the other half of the cold-start promise: lazy loading makes the mount instant, but it’s the fabric that makes those first reads land in milliseconds instead of a cross-continent round-trip to the blob store.
- Runtime integration. Mine mounts a directory. The real thing plugs into a container runtime (a containerd snapshotter, a gVisor sandbox).
- Compression and encryption of chunks. Mine stores them raw, for clarity.
- Writes. This image is read-only. Mutable files on top of immutable, content-addressed storage, done consistently and globally, is a genuinely deep distributed-systems problem. That’s the Modal Volumes story, and it’s much harder than the image-loading one.
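To give the cache fabric some shape, here is a toy read-through lookup across tiers. It is only a sketch: the `Tier` class, the tier names, and the fill-on-miss policy are illustrative assumptions, not how Modal's fabric works.

```python
class Tier:
    """One cache level: a dict plus a hit counter. Names are illustrative."""

    def __init__(self, name):
        self.name, self.data, self.hits = name, {}, 0


def get_chunk(h, tiers, origin):
    """Look a chunk up tier by tier; on a full miss, fetch from origin and fill every tier."""
    for i, tier in enumerate(tiers):
        if h in tier.data:
            tier.hits += 1
            chunk = tier.data[h]
            for upper in tiers[:i]:  # promote into the faster tiers
                upper.data[h] = chunk
            return chunk
    chunk = origin[h]  # slowest path: the blob store
    for tier in tiers:
        tier.data[h] = chunk
    return chunk


origin = {"abc123": b"chunk bytes"}  # stand-in for the blob store
tiers = [Tier("memory"), Tier("ssd"), Tier("zonal")]

get_chunk("abc123", tiers, origin)   # cold: falls through to origin
del tiers[0].data["abc123"]          # simulate eviction from memory
get_chunk("abc123", tiers, origin)   # served by the ssd tier, re-promoted
print([(t.name, t.hits) for t in tiers])
# [('memory', 0), ('ssd', 1), ('zonal', 0)]
```

Because chunks are immutable, filling or promoting a tier never has to check whether the copy is stale.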
One decision does the work
What surprised me most while building this is how much of the cleverness was paid for in a single decision. Naming chunks by hash isn’t an optimization on top of the design, it is the design. It’s the same thing I felt looking at how Shazam works: each layer falls out of the one before. Content addressing buys dedup and immutability in a single decision. Immutability buys a cache that never needs invalidating. The split between index and store buys an instant mount. Lazy reads buy “download a little because you read a little.” FUSE is the small, learnable nucleus that makes all of it observable on a laptop.
And then the moat is everything bolted around that nucleus: the caching fabric, the runtime integration, the tuning, and the writes. The filesystem is the part you can build in an afternoon.[3] The rest is the company.
Footnotes
1. A lone, cold read like this doesn’t trip the kernel’s sequential read-ahead, so it fetches exactly the one chunk asked for; that prefetch machinery only wakes up for a streaming scan, which is the whole next section.
2. Modal once shipped a 1 GB read-ahead and saw “disastrous latency issues in production”; they settled on 32 MB. Fast, lazy container loading in Modal.
3. Code for the toy: a build_image.py that packages a directory, and a lazyfs.py FUSE server, ~250 lines total. Code here.