
OS I/O Performance: Memory vs Disk Operations Deep Dive

Created: April 24, 2026 Larry Qu 8 min read

Introduction

In modern system architecture, understanding the massive performance differential between different layers of storage is fundamental for building scalable software. A CPU can execute billions of instructions per second, but if those instructions require waiting for disk I/O, the CPU remains idle for what essentially amounts to an eternity in computing time.

This deep dive explores the mechanics of OS I/O operations, the memory hierarchy, and modern patterns for mitigating I/O bottlenecks in backend applications.

The Memory Hierarchy and Access Times

The fundamental law of hardware performance is that proximity to the CPU dictates access latency. Operations that stay within registers and CPU caches are near-instantaneous, while operations that must touch a physical disk or make a network round-trip take orders of magnitude longer.

The Latency Numbers Every Programmer Should Know

Hardware Layer            Typical Access Time   Relative Scale (1 ns = 1 second)
L1 Cache                  0.5 ns                0.5 seconds
L2 Cache                  7 ns                  7 seconds
Main Memory               100 ns                ~1.5 minutes
NVMe SSD I/O              10-100 µs             ~3 hours to ~1 day
HDD Disk I/O              1-10 ms               ~2 weeks to ~4 months
Network round trip (US)   40-60 ms              ~1 to 2 years

As the table illustrates, accessing an NVMe SSD is roughly 100 to 1,000 times slower than hitting main memory, and accessing older spinning magnetic disks is 10,000 to 100,000 times slower. This massive discrepancy is why I/O-aware techniques and architectures (event loops, batching, async I/O) are critical.

Synchronous vs Asynchronous I/O

When a process initiates an I/O operation (like reading a file), it makes a system call to the OS kernel. How the OS and the thread handle this wait determines the architecture’s efficiency.

1. Blocking (Synchronous) I/O

In blocking I/O, the calling thread is suspended (put to sleep) by the OS until the disk controller finishes fetching the data and the kernel copies it into user-space memory.

// Example: Blocking I/O in C
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main() {
    char buffer[4096];
    int fd = open("large_file.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // The thread STOPS here until data is read
    ssize_t bytes_read = read(fd, buffer, sizeof(buffer));

    if (bytes_read > 0) {
        printf("Successfully read %zd bytes\n", bytes_read);
    }

    close(fd);
    return 0;
}

Pros: A simple, procedural programming model. Cons: The thread sits idle while it waits. If you have 10,000 connections, you need 10,000 threads, which consumes significant memory and incurs heavy context-switching overhead.

2. Non-Blocking I/O

The system call returns immediately. If data isn’t ready, it returns an error (like EAGAIN or EWOULDBLOCK).

// Example: Non-Blocking I/O setup in C
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    // Set the O_NONBLOCK flag
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

void read_nonblocking(int fd) {
    char buffer[4096];
    ssize_t n = read(fd, buffer, sizeof(buffer));
    
    if (n < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            printf("Data not ready yet. I can do other work!\n");
        }
    } else {
        printf("Read %zd bytes immediately\n", n);
    }
}

3. I/O Multiplexing (epoll / kqueue)

Instead of constantly polling non-blocking sockets, modern systems use an event loop that asks the OS “tell me which of these 10,000 file descriptors is ready for reading.” This is the engine behind Node.js, Nginx, and Redis.
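
Stripped of framework wrappers, the core of that loop looks roughly like the sketch below. This is not production code: socket setup, accept/read logic, and error handling are omitted, and listen_fd is assumed to be an already-listening, non-blocking socket.

// Sketch: a bare epoll event loop in C (setup and error handling omitted).
#include <sys/epoll.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);

    // Register interest: "tell me when listen_fd is readable"
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        // One syscall sleeps until any registered descriptor is ready,
        // no matter how many thousands of them are being watched.
        int ready = epoll_wait(epfd, events, MAX_EVENTS, -1);

        for (int i = 0; i < ready; i++) {
            if (events[i].data.fd == listen_fd) {
                // accept() the new connection and EPOLL_CTL_ADD it here
            } else {
                // read() from the ready descriptor; it will not block
            }
        }
    }
}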

# Example: the same event-loop pattern abstracted in Python (asyncio)
import asyncio
import aiofiles

async def read_file_async(filename):
    # aiofiles hands the blocking read to a worker thread, so the event
    # loop keeps processing other tasks while we wait for the disk
    async with aiofiles.open(filename, mode='r') as f:
        contents = await f.read()
        print(f"Read {len(contents)} characters.")

async def main():
    # We can kick off multiple I/O-bound tasks concurrently
    tasks = [
        read_file_async('file1.txt'),
        read_file_async('file2.txt')
    ]
    await asyncio.gather(*tasks)

asyncio.run(main())

Bridging the Gap: Caching and Page Cache

To mitigate disk latency, operating systems rely heavily on the virtual memory subsystem and, in particular, the Page Cache.

When you read a file, Linux doesn’t just fetch the requested bytes. It reads the whole 4 KB page (and typically reads ahead several more) and keeps it in RAM, in the Page Cache. Subsequent reads of those bytes hit RAM, not the disk.

# You can see this cache in action using the 'free' command
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15948        4231        2100         123        9616       11293
Swap:          2048           0        2048
# => The 'buff/cache' column shows how much RAM is being used to cache Disk I/O
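
You can also see the cache pay off from application code by timing two back-to-back passes over the same file. The sketch below makes that measurement; "large_file.dat" is a placeholder name, and the first pass is only "cold" if the file isn’t already cached.

// Sketch: observing the Page Cache by timing two passes over one file.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

static double read_all(const char *path) {
    char buf[4096];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    while (read(fd, buf, sizeof(buf)) > 0)
        ;  // discard the data; we only care about elapsed time
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(fd);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void) {
    // First pass: pages are faulted in from disk and cached in RAM.
    printf("cold read: %.3f s\n", read_all("large_file.dat"));
    // Second pass: the same pages are served straight from the Page Cache.
    printf("warm read: %.3f s\n", read_all("large_file.dat"));
    return 0;
}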

When a process writes data, the OS writes it to the Page Cache immediately, marks the page as “dirty”, and reports success to the application. The OS lazily flushes these dirty pages to the physical disk in the background, using kernel writeback threads (kworker today, pdflush in older kernels).
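
This write-behind behavior is great for throughput, but it means a “successful” write can still be lost on power failure. When durability matters, the application must request the flush itself; below is a minimal sketch using fsync(2) ("journal.log" is just an illustrative file name).

// Sketch: forcing dirty pages to stable storage with fsync(2).
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char *entry = "commit record\n";
    // write() returns as soon as the data sits in the Page Cache,
    // marked dirty -- it is NOT yet on the platter/flash.
    write(fd, entry, strlen(entry));

    // fsync() blocks until the kernel has flushed this file's dirty
    // pages (and metadata) to stable storage.
    if (fsync(fd) < 0) {
        perror("fsync");
    }

    close(fd);
    return 0;
}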

Zero Copy and Memory Mapping (mmap)

For applications moving massive amounts of data (like Kafka, databases, or video streaming servers), even the cost of copying data from kernel space (where the read happened) to user space (where the app lives) is too high.

Standard I/O Path

  1. Read from Disk to Kernel Buffer
  2. Copy from Kernel Buffer to User Application Buffer
  3. Copy from User Buffer to Target Kernel Buffer (e.g., a Network Socket)
  4. Send from Kernel Buffer to NIC (Network Card)
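
The classic user-space copy loop below walks exactly this path: every block is bounced through the application’s buffer on its way between the two kernel buffers. This is a sketch; in_fd is a file descriptor and out_fd could be a network socket.

// Sketch: the traditional copy path -- every byte crosses into user
// space and back out again (steps 2 and 3 above).
#include <unistd.h>

ssize_t copy_via_user_space(int in_fd, int out_fd) {
    char buf[4096];          // user-space bounce buffer
    ssize_t total = 0, n;

    // read(): kernel buffer -> buf (copy into user space)
    while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
        // write(): buf -> destination kernel buffer, e.g. a socket
        if (write(out_fd, buf, n) != n)
            return -1;
        total += n;
    }
    return total;
}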

Zero-Copy (sendfile)

The OS copies data directly from the File Kernel Buffer to the Socket Kernel Buffer. It skips the User Space entirely.

// Example: Zero-Copy abstraction in Go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	src, err := os.Open("massive_video.mp4")
	if err != nil {
		panic(err)
	}
	defer src.Close()

	dst, err := os.Create("backup.mp4")
	if err != nil {
		panic(err)
	}
	defer dst.Close()

	// On Linux, io.Copy can hand the transfer to the kernel (e.g. via the
	// copy_file_range or sendfile system calls), so the bytes never have
	// to pass through user space.
	bytesCopied, err := io.Copy(dst, src)
	if err != nil {
		panic(err)
	}

	fmt.Printf("Zero-copied %d bytes successfully.\n", bytesCopied)
}
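
The Go snippet hides the system call; for reference, invoking sendfile(2) directly on Linux looks roughly like the sketch below. Here sock_fd is assumed to be an already-connected socket, and a real server would loop until the whole file has been sent.

// Sketch: sendfile(2) moves data from the file's kernel buffer straight
// to the socket's kernel buffer; user space never touches the bytes.
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t stream_file(int sock_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat sb;
    fstat(file_fd, &sb);

    // A single call may transfer less than requested; loop in real code.
    off_t offset = 0;
    ssize_t sent = sendfile(sock_fd, file_fd, &offset, sb.st_size);

    close(file_fd);
    return sent;
}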

Memory Mapping (mmap)

mmap maps a file directly into the application’s address space. When you modify bytes in the mapped region, the OS treats them like any other Page Cache pages and eventually writes the changes back to the file on disk (or immediately, if you call msync).

// Example: Memory mapping a file in C
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("database.dat", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat sb;
    fstat(fd, &sb);

    // Map file directly to memory
    char *mapped = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mapped == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    // Now you can read/write the disk file exactly like an array!
    mapped[0] = 'H';
    mapped[1] = 'i';

    // Sync changes back to disk explicitly
    msync(mapped, sb.st_size, MS_SYNC);

    munmap(mapped, sb.st_size);
    close(fd);
    return 0;
}

Advanced I/O Architectures (io_uring)

Historically, Linux developers relied on epoll(7) (a readiness-notification model) combined with non-blocking sockets to achieve highly concurrent networking. Regular file I/O, however, remained fundamentally synchronous at the OS level: O_NONBLOCK has essentially no effect on regular files, so disk reads still blocked the calling thread.

Enter io_uring—the most revolutionary change to Linux I/O in twenty years. Supported natively from Linux 5.1+, io_uring leverages a shared-memory ring buffer architecture between the user-space application and the kernel.

The Dual-Ring System

Instead of issuing one system call per operation (read, write, send, recv), the application places operations into a Submission Queue (SQ). The kernel consumes entries from the SQ (either when notified, or continuously when submission-queue polling is enabled), executes the I/O asynchronously, and pushes the results into a Completion Queue (CQ).

// Example: an asynchronous read submitted through liburing
#include <liburing.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define QUEUE_DEPTH 64

int main() {
    struct io_uring ring;
    int fd = open("huge_database.db", O_RDONLY);
    char buf[4096];

    // 1. Initialize the ring structure in shared memory
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

    // 2. Fetch a Submission Queue Entry (SQE)
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    
    // 3. Prep the entry: "Read 4KB from fd into buf at offset 0"
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    
    // 4. Submit the operation to the kernel (App continues executing immediately)
    io_uring_submit(&ring);
    
    // ...App does other computationally heavy work here...

    // 5. Check the Completion Queue (CQE) for results later
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    
    if (cqe->res > 0) {
        printf("Kernel finished the async read! Total bytes: %d\n", cqe->res);
    }
    
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}

The Benefits

By handling many operations per system call (or, with polling enabled, with no per-operation system call at all), io_uring delivers:

  1. Near-Zero Syscall Overhead: With the SQPOLL feature enabled, a dedicated kernel thread polls the Submission Queue, so under sustained load the application can keep submitting operations without issuing a syscall for each one (see the sketch after this list).
  2. True Async File I/O: io_uring handles blocking disk operations truly asynchronously, unlike the limitations in older POSIX AIO implementations.
  3. Massive Concurrency: Frameworks such as Rust’s glommio and C++’s Seastar push huge volumes of disk-bound operations per CPU core on top of this architecture.
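
As a sketch of how point 1 is switched on with liburing: the queue depth and idle timeout below are illustrative, and older kernels require elevated privileges (e.g. CAP_SYS_NICE) to use SQPOLL.

// Sketch: enabling SQPOLL so a kernel thread polls the submission queue.
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main() {
    struct io_uring ring;
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    // Ask the kernel to spawn a polling thread that watches the SQ;
    // it goes idle after 2000 ms without new submissions.
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;

    int ret = io_uring_queue_init_params(64, &ring, &params);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init_params: %s\n", strerror(-ret));
        return 1;
    }

    // From here on, io_uring_submit() usually just updates the shared ring
    // tail in memory; no io_uring_enter() syscall is needed while the
    // kernel poller thread is awake.

    io_uring_queue_exit(&ring);
    return 0;
}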

Best Practices for High-Performance I/O

  1. Batch Your Reads/Writes: Always buffer your I/O. Writing 10 bytes 100 times costs 100 system calls, each with an expensive user/kernel transition; writing the same 1,000 bytes once costs one (see the sketch after this list).
  2. Prefer Async/Event-Driven Models: For network servers, default to epoll/kqueue-backed frameworks and runtimes (Node.js, Go’s net/http, Rust’s Tokio).
  3. Understand Your File System: EXT4, XFS, and ZFS have different caching and journaling mechanics that significantly impact database write speeds.
  4. Use Direct I/O (O_DIRECT) Cautiously: Databases like ScyllaDB bypass the OS Page Cache because they ship specialized buffer managers that understand their own access patterns better than the kernel’s generic caching heuristics.
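
A minimal sketch of point 1 using stdio’s user-space buffering, so thousands of small logical writes collapse into a handful of write(2) calls ("metrics.log" and the buffer size are illustrative):

// Sketch: batching small writes through a user-space buffer.
#include <stdio.h>

int main(void) {
    FILE *f = fopen("metrics.log", "w");
    if (!f)
        return 1;

    // Give stdio a 64 KB buffer; fprintf() appends to it in memory.
    setvbuf(f, NULL, _IOFBF, 64 * 1024);

    for (int i = 0; i < 10000; i++) {
        // 10,000 logical writes, but only a handful of write(2) syscalls:
        // the buffer is flushed roughly once per 64 KB.
        fprintf(f, "sample %d\n", i);
    }

    fclose(f);   // flushes whatever is left in the buffer
    return 0;
}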

Summary

The history of backend software engineering is largely the history of abstracting and mitigating disk latency. Modern SSDs have shrunk the gap, but the principle remains: optimize your memory layouts to maximize cache hits, and use asynchronous I/O interfaces (epoll, io_uring) so your CPU cores never idle while waiting for the persistence layer to catch up.
