64-bit Misalignment

Why can misaligned data structures be faster?

I’ve been following CMU’s 15-445/645 Database Systems course lately, trying to fill some gaps in my low-level systems knowledge. When the lectures covered memory alignment, it all made perfect sense. Cache lines, multiple memory operations, hardware penalties. Textbook stuff.

But here’s the thing: I don’t trust theory until I’ve seen it break something myself.

So I decided to write a benchmark. Simple plan: create two structs (one with natural alignment, one tightly packed) and demonstrate the performance penalty everyone talks about. I’d watch the misaligned version choke, nod knowingly, and move on to the next lecture.

Spoiler alert: The misaligned version won.

What is 64-bit Memory Alignment?

Before we get to the weird results, let me explain what I thought I understood. Modern processors prefer (or require) that data be aligned to 64-bit (8-byte) boundaries.

Why does alignment matter? When data is properly aligned, the CPU can read or write it in a single memory operation. When misaligned, several things can happen depending on the architecture1:

  1. Multiple memory operations: Reading a misaligned uint64_t might require two separate memory reads instead of one. For example, if an 8-byte value starts at address 1, it spans two 8-byte access boundaries, requiring the CPU to:

    • Read bytes 0-7 (getting the first seven bytes of our value)
    • Read bytes 8-15 (getting the last byte of our value)
    • Combine the pieces with bit-shifting and masking
  2. Non-atomic operations: Even more critically, misaligned writes are not atomic. A properly aligned 8-byte write happens as a single atomic operation, but a misaligned write might be implemented as:

    • Read-modify-write of the first memory chunk
    • Read-modify-write of the second memory chunk

This means another thread could observe a partially-written value, making misaligned access dangerous in concurrent code without additional synchronization.

Compilers automatically add padding to structs to maintain these alignment requirements and avoid these issues. Here’s a simple example:

// NORMAL struct - compiler adds padding for alignment
struct NormalStruct {
    uint8_t  a;      // 1 byte
    uint64_t b;      // 8 bytes (aligned at offset 8)
    uint16_t c;      // 2 bytes
    // Total: 24 bytes (with padding)
};

// PACKED struct - no padding, fields stored consecutively
struct __attribute__((packed)) PackedStruct {
    uint8_t  a;      // 1 byte
    uint64_t b;      // 8 bytes (misaligned at offset 1)
    uint16_t c;      // 2 bytes
    // Total: 11 bytes (no padding)
};

The memory layout looks like this:

NormalStruct (24 bytes):
  Byte:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 ... 23
  Field: [a][- padding (7 bytes) -][b  b  b  b  b  b  b  b][c  c][- pad (6) -]

PackedStruct (11 bytes):
  Byte:  0  1  2  3  4  5  6  7  8  9  10
  Field: [a][b  b  b  b  b  b  b  b][c  c]

The packed version saves 13 bytes per struct (54% reduction), but the uint64_t b field is now misaligned, starting at offset 1 instead of 8.

The Plan

The plan was straightforward. I’d benchmark these two structs performing various operations and watch the misaligned version suffer. I tested on the two different architectures I have at home:

  • Apple M1 Pro (ARM64)
  • AMD Ryzen 7 8745HS (x86-64)

I created four benchmark tests:

  1. Sequential read of the uint64_t b field
  2. Random read of the uint64_t b field
  3. Sequential write to the uint64_t b field
  4. Read-modify-write operations

Each test ran on arrays of 100,000 elements with 1,000 iterations, repeated 10 times for statistical reliability.

First Results: Wait, What?

Here are the initial results on x86-64 (AMD Ryzen 7 8745HS):

Test                   Aligned (ms)  Misaligned (ms)  Difference
-----------------------------------------------------------------
Sequential Read              28.38            26.10      -8.0%
Random Read                  56.90            69.96     +22.9%
Sequential Write             20.27            14.45     -28.8%
Read-Modify-Write            21.53            20.71      -3.8%

And on ARM64 (Apple M1 Pro):

Test                   Aligned (ms)  Misaligned (ms)  Difference
-----------------------------------------------------------------
Sequential Read              97.22            97.21      -0.0%
Random Read                 211.00           210.11      -0.4%
Sequential Write             30.47            22.22     -27.0%
Read-Modify-Write            39.42            40.80      +3.5%

I stared at these numbers for a long time. The misaligned packed struct was performing almost the same (if not better) than the aligned version. This made no sense. Where were my alignment penalties?

Nothing Made Sense

I double-checked everything:

  • Compiler flags that could be optimizing away the misalignment penalty
  • Warmup iterations
  • Statistical analysis
  • Memory initialization
  • Reviewed the theory and had a long conversation with Claude

The code was correct. The results were real. But they didn’t match my expectations at all.

Then Cache Hit Me

I calculated the actual memory footprint of my test arrays:

100,000 elements:

  • NormalStruct: 100,000 × 24 bytes = 2,343 KB
  • PackedStruct: 100,000 × 11 bytes = 1,074 KB

And then I checked the CPU cache sizes:

AMD Ryzen 7 8745HS (per core):

  • L1 Data Cache: 32 KB
  • L2 Cache: 1 MB (1,024 KB)

Apple M1 Pro (per core):

  • L1 Data Cache: 64 KB
  • L2 Cache: 4 MB

The packed struct’s smaller memory footprint meant better cache utilization. This advantage was outweighing any misalignment penalties!

Proving It

I needed to test this properly. I created a new benchmark that tested different array sizes to see where the performance crossover happens:

void test_array_size(size_t count, size_t iterations) {
    std::vector<NormalStruct> normal_data(count);
    std::vector<PackedStruct> packed_data(count);

    // ... initialization and warmup ...

    // Same sequential-read benchmark, run on both layouts
    double normal_time = benchmark_sequential_read(normal_data, iterations);
    double packed_time = benchmark_sequential_read(packed_data, iterations);

    // Calculate and print results
}

I tested array sizes from 100 elements (fits comfortably in L1) up to 200,000 elements (way beyond L2), adjusting iterations to keep total runtime reasonable.

The Pattern

Here are the results on x86-64 (AMD Ryzen 7 8745HS):

Array Size    Normal (KB)    Packed (KB)     Time Diff.
--------------------------------------------------------
       100           2              1            +2.8%
     1,000          23             10            +1.5%
     5,000         117             53            -5.4%
    10,000         234            107            -5.9%
    30,000         703            322            -6.6%
    50,000       1,171            537            -8.8%
   100,000       2,343          1,074            -8.7%
   200,000       4,687          2,148            -8.2%

The pattern is crystal clear:

  1. Tiny arrays (100-1,000 elements, both in L1): Aligned wins

    • This is the “true” alignment penalty—minimal in modern CPUs
  2. Growing arrays (5,000-30,000): Packed advantage grows

    • Better cache utilization starts to dominate
  3. The critical point (~50,000 elements):

    • NormalStruct (1,171 KB) exceeds L2
    • PackedStruct (537 KB) still fits in L2
    • Packed is 8.8% faster despite misalignment!
  4. Beyond L2: Packed maintains ~8% advantage

    • Both exceed cache, but packed uses less memory bandwidth

What About the 64-bit Alignment Penalty?

After all this, I still needed to see the actual alignment penalty without cache effects getting in the way. So I re-ran the benchmarks with just 100 elements, small enough to fit entirely within L1 cache.2

x86-64 (AMD Ryzen 7 8745HS)

Test                   Aligned (ms)  Misaligned (ms)  Difference
-----------------------------------------------------------------
Sequential Read                2.37             2.45      +3.3%
Random Read                    3.33             3.70     +11.3%
Sequential Write               1.52             1.93     +26.7%
Read-Modify-Write              2.07             2.11      +2.0%

ARM64 (Apple M1 Pro)

Test                   Aligned (ms)  Misaligned (ms)  Difference
-----------------------------------------------------------------
Sequential Read                9.80             9.76      -0.4%
Random Read                   20.82            20.81      -0.1%
Sequential Write               1.59             1.66      +5.0%
Read-Modify-Write              1.94             3.54     +82.7%

These results are much closer to what theory predicts. With a small array of 100 elements that fits within L1 cache, the alignment penalties finally appear: the misaligned packed struct is 2-27% slower across the x86-64 tests and takes an 82.7% hit on ARM64's read-modify-write, while ARM64 reads remain essentially unaffected. This is the real computational cost of accessing data that doesn't sit on natural memory boundaries.

The small dataset isolates the true alignment penalty from cache effects. This is the benchmark that confirms what everyone says about alignment—when cache isn’t a factor, misaligned access is indeed slower.

What I Learned

I started this to prove I understood something “everyone knows”. Instead, I got to see how different effects interact in practice.

The alignment penalties are real: I saw them clearly with small datasets that fit in L1 cache. But in my benchmarks with larger arrays, something else happened. The packed struct’s 54% smaller memory footprint meant better cache utilization, and that advantage outweighed the misalignment cost. The “slower” approach became 8-9% faster.

This wasn’t about the textbooks being wrong. It was about understanding that performance is about trade-offs, not rules. Alignment matters. Cache efficiency matters. Which one dominates depends on your specific workload, data size, and access patterns.

Here’s what watching these effects interact taught me:

  • Initial hypotheses can be wrong: I expected one thing, measured another
  • Small benchmarks and large benchmarks can show opposite results
  • “Common knowledge” is often incomplete without context
  • Counterintuitive results are invitations to dig deeper

The real lesson isn’t about memory alignment. It’s about staying curious when your data surprises you. It’s about measuring instead of assuming. It’s about being willing to investigate when something doesn’t match what you thought you knew.

I wanted to prove I understood how memory worked. Instead, I got to see multiple effects competing with each other. And that taught me more than being right ever could have.


Update [2025-11-18]: After sharing the blog post in the Software Internals Discord, I received good feedback suggesting that I review performance counters to confirm my findings.

I then added a wrapper using perf_event_open to measure cache misses, and observed that misses dropped significantly with the packed data thanks to its smaller size. This confirmed the hypothesis: the packed struct’s better cache usage was compensating for its misalignment penalties.

Comparing cache misses between normal and packed structs for different array sizes, we can observe the following results:

Array Size    Normal (KB)    Packed (KB)     Time Diff.     L1 Miss(N)     L1 Miss(P)
--------------------------------------------------------------------------------------
       100           2              1            +2.8%        0.00%           0.00%
     1,000          23             10            +1.5%        0.10%           0.00%
     5,000         117             53            -5.4%       18.77%           8.16%
    10,000         234            107            -5.9%       18.77%           8.16%
    30,000         703            322            -6.6%       18.79%           8.16%
    50,000       1,171            537            -8.8%       18.74%           8.16%
   100,000       2,343          1,074            -8.7%       18.75%           8.15%
   200,000       4,687          2,148            -8.2%       18.73%           8.15%

Re-running the 64-bit performance test with 100,000 elements (the original test), it’s clear that cache misses on the aligned data are significantly higher, making the misaligned structure more efficient in almost every case.

Test                   Aligned (ms)  Misaligned (ms)  Difference    Misses (A)     Misses (M)
----------------------------------------------------------------------------------------------
Sequential Read              28.38            26.10      -8.0%       18.77%          8.17%
Random Read                  56.90            69.96     +22.9%       36.99%         40.87%
Sequential Write             20.27            14.45     -28.8%       37.69%         15.53%
Read-Modify-Write            21.53            20.71      -3.8%       15.80%          7.77%

In the Discord channel, Phil Eaton also shared an interesting discussion of this very same topic from four months earlier: https://lobste.rs/s/plrsmw/data_alignment_for_speed_myth_reality

Footnotes

  1. There is another misalignment to consider, probably even more important than the one discussed here: cache line misalignment, a performance issue that occurs when a data structure crosses a cache line boundary (usually 64 bytes). Its implications are similar to the ones discussed here, but at another level of the memory hierarchy.

  2. Code available here: https://github.com/jrdi/playground/tree/main/memory-alignment

@jrdi