64-bit Misalignment
Why misaligned data structures can be faster
I’ve been following CMU’s 15-445/645 Database Systems course lately, trying to fill some gaps in my low-level systems knowledge. When the lectures covered memory alignment, it all made perfect sense. Cache lines, multiple memory operations, hardware penalties. Textbook stuff.
But here’s the thing: I don’t trust theory until I’ve seen it break something myself.
So I decided to write a benchmark. Simple plan: create two structs (one with natural alignment, one tightly packed) and demonstrate the performance penalty everyone talks about. I’d watch the misaligned version choke, nod knowingly, and move on to the next lecture.
Spoiler alert: The misaligned version won.
What is 64-bit Memory Alignment?
Before we get to the weird results, let me explain what I thought I understood. Modern processors prefer (or, on some architectures, require) that data be aligned to its natural boundary: for an 8-byte uint64_t, that means a 64-bit (8-byte) boundary.
Why does alignment matter? When data is properly aligned, the CPU can read or write it in a single memory operation. When misaligned, several things can happen depending on the architecture¹:
- Multiple memory operations: Reading a misaligned uint64_t might require two separate memory reads instead of one. For example, if an 8-byte value starts at address 1, it spans two memory access boundaries, requiring the CPU to:
  - Read bytes 0-7 (getting the first seven bytes of our value)
  - Read bytes 8-15 (getting the final byte of our value)
  - Combine the pieces with bit-shifting and masking
- Non-atomic operations: Even more critically, misaligned writes are not atomic. A properly aligned 8-byte write happens as a single atomic operation, but a misaligned write might be implemented as:
  - Read-modify-write of the first memory chunk
  - Read-modify-write of the second memory chunk
This means another thread could observe a partially-written value, making misaligned access dangerous in concurrent code without additional synchronization.
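To make the split concrete, here’s a small software sketch of what the hardware conceptually does for a load that straddles an 8-byte boundary. This is my own illustration, not code from the benchmark; it assumes a little-endian machine and an offset that is genuinely misaligned (offset % 8 != 0):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Emulate a misaligned 8-byte load as two aligned loads plus
// shifting, the way hardware conceptually splits the access.
uint64_t read_straddling_u64(const uint8_t* base, size_t offset) {
    size_t lo_chunk = offset & ~size_t{7};     // aligned chunk holding the first byte
    size_t shift    = (offset - lo_chunk) * 8; // bit offset into that chunk (nonzero here)

    uint64_t lo, hi;
    std::memcpy(&lo, base + lo_chunk, 8);      // first aligned read  (e.g. bytes 0-7)
    std::memcpy(&hi, base + lo_chunk + 8, 8);  // second aligned read (e.g. bytes 8-15)

    // Combine the pieces with bit-shifting and masking.
    return (lo >> shift) | (hi << (64 - shift));
}

int main() {
    uint8_t buf[16] = {};
    uint64_t v = 0x1122334455667788ULL;
    std::memcpy(buf + 1, &v, sizeof v);        // place the value at offset 1
    std::printf("%016llx\n", (unsigned long long)read_straddling_u64(buf, 1));
}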
Compilers automatically add padding to structs to maintain these alignment requirements and avoid these issues. Here’s a simple example:
#include <cstdint>

// NORMAL struct - compiler adds padding for alignment
struct NormalStruct {
    uint8_t  a;  // 1 byte
    uint64_t b;  // 8 bytes (aligned at offset 8)
    uint16_t c;  // 2 bytes
    // Total: 24 bytes (with padding)
};

// PACKED struct - no padding, fields stored consecutively
struct __attribute__((packed)) PackedStruct {
    uint8_t  a;  // 1 byte
    uint64_t b;  // 8 bytes (misaligned at offset 1)
    uint16_t c;  // 2 bytes
    // Total: 11 bytes (no padding)
};
The memory layout looks like this:
NormalStruct (24 bytes):
Byte:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Field: [a][ padding (7 bytes) ][b  b  b  b  b  b  b  b][c  c][ pad (6 bytes)  ]

PackedStruct (11 bytes):
Byte:   0  1  2  3  4  5  6  7  8  9 10
Field: [a][b  b  b  b  b  b  b  b][c  c]
The packed version saves 13 bytes per struct (54% reduction), but the uint64_t b field is now misaligned, starting at offset 1 instead of 8.
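If you want to verify these layouts instead of trusting the comments, a handful of compile-time checks over the structs above will do it (GCC/Clang; offsetof on a packed member is conditionally supported, but both compilers accept it):

#include <cstddef>  // offsetof
#include <cstdint>

// Compile-time verification of the layouts described above.
static_assert(sizeof(NormalStruct) == 24, "a + 7 pad + b + c + 6 pad");
static_assert(offsetof(NormalStruct, b) == 8, "b naturally aligned");
static_assert(alignof(NormalStruct) == 8, "alignment set by uint64_t b");

static_assert(sizeof(PackedStruct) == 11, "no padding anywhere");
static_assert(offsetof(PackedStruct, b) == 1, "b misaligned at offset 1");
static_assert(alignof(PackedStruct) == 1, "packed drops alignment to 1");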
The Plan
The plan was straightforward. I’d benchmark these two structs performing various operations and watch the misaligned version suffer. I tested on the two different architectures I have at home:
- Apple M1 Pro (ARM64)
- AMD Ryzen 7 8745HS (x86-64)
I created four benchmark tests:
- Sequential read of the uint64_t b field
- Random read of the uint64_t b field
- Sequential write to the uint64_t b field
- Read-modify-write operations
Each test ran on arrays of 100,000 elements with 1,000 iterations, repeated 10 times for statistical reliability.
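For context, the sequential read test has roughly this shape; this is a simplified sketch rather than the exact harness code, with a volatile sink so the compiler can’t optimize the loop away:

#include <chrono>
#include <cstdint>
#include <vector>

// Sum every b field so the loads can't be eliminated, and return
// the elapsed time in milliseconds.
template <typename Struct>
double benchmark_sequential_read(const std::vector<Struct>& data,
                                 size_t iterations) {
    volatile uint64_t sink = 0;  // defeats dead-code elimination
    auto start = std::chrono::steady_clock::now();
    for (size_t it = 0; it < iterations; ++it) {
        for (const auto& s : data) {
            sink = sink + s.b;   // the aligned (or misaligned) 8-byte read
        }
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}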
First Results: Wait, What?
Here are the initial results on x86-64 (AMD Ryzen 7 8745HS):
Test                 Aligned (ms)   Misaligned (ms)   Difference
-----------------------------------------------------------------
Sequential Read             28.38             26.10        -8.0%
Random Read                 56.90             69.96       +22.9%
Sequential Write            20.27             14.45       -28.8%
Read-Modify-Write           21.53             20.71        -3.8%
And on ARM64 (Apple M1 Pro):
Test                 Aligned (ms)   Misaligned (ms)   Difference
-----------------------------------------------------------------
Sequential Read             97.22             97.21        -0.0%
Random Read                211.00            210.11        -0.4%
Sequential Write            30.47             22.22       -27.0%
Read-Modify-Write           39.42             40.80        +3.5%
I stared at these numbers for a long time. The misaligned packed struct was performing almost the same (if not better) than the aligned version. This made no sense. Where were my alignment penalties?
Nothing Made Sense
I double-checked everything:
- Checked whether any compiler flags were optimizing away the alignment penalties
- Warmup iterations
- Statistical analysis
- Memory initialization
- Reviewed the theory and had a long conversation with Claude
The code was correct. The results were real. But they didn’t match my expectations at all.
Then Cache Hit Me
I calculated the actual memory footprint of my test arrays:
100,000 elements:
- NormalStruct: 100,000 × 24 bytes = 2,343 KB
- PackedStruct: 100,000 × 11 bytes = 1,074 KB
And then I checked the CPU cache sizes:
AMD Ryzen 7 8745HS (per core):
- L1 Data Cache: 32 KB
- L2 Cache: 1 MB (1,024 KB)
Apple M1 Pro (per core):
- L1 Data Cache: 64 KB
- L2 Cache: 4 MB
The packed struct’s smaller memory footprint meant better cache utilization. This advantage was outweighing any misalignment penalties!
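To put numbers on it: a 64-byte cache line holds 64 ÷ 24 ≈ 2.7 NormalStructs but 64 ÷ 11 ≈ 5.8 PackedStructs, so a sequential scan over the packed array touches less than half as many cache lines for the same element count.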
Proving It
I needed to test this properly. I created a new benchmark that tested different array sizes to see where the performance crossover happens:
void test_array_size(size_t count, size_t iterations) {
    std::vector<NormalStruct> normal_data(count);
    std::vector<PackedStruct> packed_data(count);

    // ... initialization and warmup ...

    double normal_time = benchmark_sequential_read(normal_data, iterations);
    double packed_time = benchmark_sequential_read(packed_data, iterations);

    // Calculate and print results
}
I tested array sizes from 100 elements (fits comfortably in L1) up to 200,000 elements (way beyond L2), adjusting iterations to keep total runtime reasonable.
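The driver is just a sweep over those sizes, something like the following; the iteration scaling shown here is illustrative, not the exact numbers I used:

#include <cstddef>

int main() {
    // From L1-resident (100 elements) to well beyond L2 (200,000),
    // scaling iterations down so each size does comparable total work.
    const size_t sizes[] = {100, 1000, 5000, 10000,
                            30000, 50000, 100000, 200000};
    for (size_t n : sizes) {
        test_array_size(n, 100'000'000 / n);
    }
}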
The Pattern
Here are the sequential read results on x86-64 (AMD Ryzen 7 8745HS); the difference column is the packed struct’s time relative to the normal one:
Array Size   Normal (KB)   Packed (KB)   Difference
--------------------------------------------------------
       100             2             1        +2.8%
     1,000            23            10        +1.5%
     5,000           117            53        -5.4%
    10,000           234           107        -5.9%
    30,000           703           322        -6.6%
    50,000         1,171           537        -8.8%
   100,000         2,343         1,074        -8.7%
   200,000         4,687         2,148        -8.2%
The pattern is crystal clear:
- Tiny arrays (100-1,000 elements, both in L1): Aligned wins
  - This is the “true” alignment penalty, minimal in modern CPUs
- Growing arrays (5,000-30,000): Packed advantage grows
  - Better cache utilization starts to dominate
- The critical point (~50,000 elements):
  - NormalStruct (1,171 KB) exceeds L2
  - PackedStruct (537 KB) still fits in L2
  - Packed is 8.8% faster despite misalignment!
- Beyond L2: Packed maintains ~8% advantage
  - Both exceed cache, but packed uses less memory bandwidth
What About the 64-bit Alignment Penalty?
After all this, I still needed to see the actual alignment penalty without cache effects getting in the way. So I re-ran the benchmarks with just 100 elements, small enough to fit entirely within L1 cache.²
x86-64 (AMD Ryzen 7 8745HS)
Test                 Aligned (ms)   Misaligned (ms)   Difference
-----------------------------------------------------------------
Sequential Read              2.37              2.45        +3.3%
Random Read                  3.33              3.70       +11.3%
Sequential Write             1.52              1.93       +26.7%
Read-Modify-Write            2.07              2.11        +2.0%
ARM64 (Apple M1 Pro)
Test                 Aligned (ms)   Misaligned (ms)   Difference
-----------------------------------------------------------------
Sequential Read              9.80              9.76        -0.4%
Random Read                 20.82             20.81        -0.1%
Sequential Write             1.59              1.66        +5.0%
Read-Modify-Write            1.94              3.54       +82.7%
These results are much closer to what theory predicts. With a small array of 100 elements that fits entirely within L1 cache, the alignment penalties finally appear: on x86-64 the misaligned packed struct is slower in every test, from 2% to 27%, and on ARM64 writes pay the price, with read-modify-write a dramatic 83% slower (reads, interestingly, show no penalty at all).
The small dataset isolates the true alignment penalty from cache effects. This is the benchmark that confirms what everyone says about alignment: when cache isn’t a factor, misaligned access is indeed slower.
What I Learned
I started this to prove I understood something “everyone knows”. Instead, I got to see how different effects interact in practice.
The alignment penalties are real: I saw them clearly with small datasets that fit in L1 cache. But in my benchmarks with larger arrays, something else happened. The packed struct’s 54% smaller memory footprint meant better cache utilization, and that advantage outweighed the misalignment cost. The “slower” approach became 8-9% faster.
This wasn’t about the textbooks being wrong. It was about understanding that performance is about trade-offs, not rules. Alignment matters. Cache efficiency matters. Which one dominates depends on your specific workload, data size, and access patterns.
Here’s what watching these effects interact taught me:
- Initial hypotheses can be wrong: I expected one thing, measured another
- Small benchmarks and large benchmarks can show opposite results
- “Common knowledge” is often incomplete without context
- Counterintuitive results are invitations to dig deeper
The real lesson isn’t about memory alignment. It’s about staying curious when your data surprises you. It’s about measuring instead of assuming. It’s about being willing to investigate when something doesn’t match what you thought you knew.
I wanted to prove I understood how memory worked. Instead, I got to see multiple effects competing with each other. And that taught me more than being right ever could have.
Update [2025-11-18]: After sharing the blog post in the Software Internals Discord, I received good feedback suggesting I review performance counters to confirm my findings.
I then added a wrapper around perf_event_open to measure cache misses, and observed that misses dropped significantly with the packed data thanks to its smaller footprint. This confirmed the hypothesis: the packed struct compensates for its misalignment penalty with better cache usage.
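For reference, here is a minimal sketch of that kind of wrapper on Linux; it is simplified (no error handling) and not the exact code from the repo, but the event encoding is the standard one from perf_event.h:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>

// Counts L1 data-cache read misses for the calling thread.
class L1MissCounter {
    int fd_;
public:
    L1MissCounter() {
        perf_event_attr attr{};
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.config = PERF_COUNT_HW_CACHE_L1D |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;        // start stopped; enable explicitly
        attr.exclude_kernel = 1;  // user-space misses only
        fd_ = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }
    void start() {
        ioctl(fd_, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_, PERF_EVENT_IOC_ENABLE, 0);
    }
    uint64_t stop() {
        ioctl(fd_, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        read(fd_, &count, sizeof(count));
        return count;
    }
    ~L1MissCounter() { if (fd_ >= 0) close(fd_); }
};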
Comparing cache misses between normal and packed structs for different array sizes, we can observe the following results:
Array Size   Normal (KB)   Packed (KB)   Difference   L1 Miss (N)   L1 Miss (P)
--------------------------------------------------------------------------------------
       100             2             1        +2.8%         0.00%         0.00%
     1,000            23            10        +1.5%         0.10%         0.00%
     5,000           117            53        -5.4%        18.77%         8.16%
    10,000           234           107        -5.9%        18.77%         8.16%
    30,000           703           322        -6.6%        18.79%         8.16%
    50,000         1,171           537        -8.8%        18.74%         8.16%
   100,000         2,343         1,074        -8.7%        18.75%         8.15%
   200,000         4,687         2,148        -8.2%        18.73%         8.15%
Re-running the original 64-bit performance test at 100,000 elements with these counters makes it clear that cache misses on the aligned data are significantly higher, which is what makes the misaligned structure more efficient in almost all cases.
Test                 Aligned (ms)   Misaligned (ms)   Difference   Misses (A)   Misses (M)
----------------------------------------------------------------------------------------------
Sequential Read             28.38             26.10        -8.0%       18.77%        8.17%
Random Read                 56.90             69.96       +22.9%       36.99%       40.87%
Sequential Write            20.27             14.45       -28.8%       37.69%       15.53%
Read-Modify-Write           21.53             20.71        -3.8%       15.80%        7.77%
In the Discord channel, Phil Eaton also shared an interesting discussion of the very same topic from four months earlier: https://lobste.rs/s/plrsmw/data_alignment_for_speed_myth_reality
Footnotes
1. There is also another kind of misalignment to consider, probably more important than the one discussed here: cache line misalignment, a performance issue that occurs when a data structure crosses a cache line boundary (usually 64 bytes). Its implications are similar to the ones discussed here, but at another level. ↩
2. Code available here: https://github.com/jrdi/playground/tree/main/memory-alignment ↩