File System Architecture: From Inodes to Modern File Systems

Introduction

File systems are the backbone of any operating system’s storage management. They transform raw disk sectors into organized, accessible files and directories. Understanding file system architecture is essential for system administrators, developers, and anyone working with storage.

In this comprehensive guide, we’ll explore everything from fundamental concepts to modern file system implementations.

File System Fundamentals

What is a File System?

┌─────────────────────────────────────────────────────────────────┐
│                   FILE SYSTEM OVERVIEW                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   A file system provides:                                       │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 1. STORAGE ORGANIZATION                                │   │
│   │    • Files and directories                             │   │
│   │    • Hierarchical structure                            │   │
│   │    • Metadata (permissions, dates, etc.)              │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 2. DATA ACCESS                                         │   │
│   │    • Read/write operations                            │   │
│   │    • File naming                                       │   │
│   │    • Access control                                    │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 3. SPACE MANAGEMENT                                    │   │
│   │    • Allocation of disk space                          │   │
│   │    • Free space tracking                               │   │
│   │    • Fragmentation handling                            │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 4. INTEGRITY & RECOVERY                               │   │
│   │    • Crash recovery                                    │   │
│   │    • Data consistency                                  │   │
│   │    • Error detection                                   │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

File System Layers

┌─────────────────────────────────────────────────────────────────┐
│                   FILE SYSTEM LAYERS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │              Application Programs                         │   │
│   │         (ls, cp, cat, text editors, etc.)               │   │
│   └──────────────────────────┬──────────────────────────────┘   │
│                              │                                    │
│   ┌──────────────────────────▼──────────────────────────────┐   │
│   │           System Call Interface (POSIX)                    │   │
│   │      (open, read, write, close, stat, mkdir, etc.)       │   │
│   └──────────────────────────┬──────────────────────────────┘   │
│                              │                                    │
│   ┌──────────────────────────▼──────────────────────────────┐   │
│   │     Virtual File System (VFS) - Linux                    │   │
│   │          (unified interface for all file systems)         │   │
│   └──────────────────────────┬──────────────────────────────┘   │
│                              │                                    │
│         ┌────────────────────┼────────────────────┐              │
│         │                    │                    │              │
│         ▼                    ▼                    ▼              │
│   ┌──────────┐       ┌──────────┐        ┌──────────┐        │
│   │   ext4   │       │  Btrfs   │        │   NFS    │        │
│   │           │       │          │        │          │        │
│   └────┬──────┘       └────┬─────┘        └────┬─────┘        │
│        │                    │                    │              │
│        ▼                    ▼                    ▼              │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │              Block Device Layer                          │   │
│   │          (SDD/HDD, RAID controllers, etc.)              │   │
│   └──────────────────────────┬──────────────────────────────┘   │
│                              │                                    │
│   ┌──────────────────────────▼──────────────────────────────┐   │
│   │              Physical Storage Device                      │   │
│   │                   (Disk or SSD)                           │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

File Organization

Contiguous Allocation

The simplest allocation method - each file occupies consecutive blocks.

"""Contiguous file allocation simulation."""

class ContiguousAllocation:
    """
    Contiguous allocation stores files in consecutive disk blocks.
    
    Pros:
    - Simple to implement
    - Fast sequential access
    - Good random access performance
    
    Cons:
    - External fragmentation
    - Difficult to grow files
    """
    
    def __init__(self, disk_size: int, block_size: int = 4096):
        self.disk_size = disk_size  # Total blocks
        self.block_size = block_size
        self.free_blocks = set(range(disk_size))
        self.files = {}  # filename -> (start_block, size)
    
    def allocate(self, filename: str, size: int) -> bool:
        """Allocate space for a file."""
        num_blocks = (size + self.block_size - 1) // self.block_size
        
        # Find consecutive blocks
        allocated = []
        for start in range(self.disk_size):
            if all((start + i) in self.free_blocks for i in range(num_blocks)):
                # Found consecutive blocks
                for i in range(num_blocks):
                    self.free_blocks.remove(start + i)
                self.files[filename] = (start, num_blocks)
                return True
        
        return False
    
    def deallocate(self, filename: str) -> bool:
        """Free allocated space."""
        if filename not in self.files:
            return False
        
        start, size = self.files[filename]
        for i in range(size):
            self.free_blocks.add(start + i)
        
        del self.files[filename]
        return True
    
    def get_fragmentation(self) -> float:
        """Calculate external fragmentation."""
        if not self.free_blocks:
            return 1.0
        
        # Count gaps between free blocks
        sorted_free = sorted(self.free_blocks)
        gaps = 0
        for i in range(len(sorted_free) - 1):
            if sorted_free[i + 1] - sorted_free[i] > 1:
                gaps += 1
        
        return gaps / max(len(self.free_blocks), 1)

Linked Allocation

Each block contains a pointer to the next block - blocks don’t need to be contiguous.

"""Linked list file allocation."""

class LinkedAllocation:
    """
    Linked allocation uses pointers in each block.
    
    Pros:
    - No external fragmentation
    - Easy to grow files
    
    Cons:
    - Poor random access (must follow pointers)
    - Pointer overhead in each block
    -可靠性问题 (pointer corruption = data loss)
    """
    
    def __init__(self, disk_size: int):
        self.disk_size = disk_size
        self.free_blocks = set(range(disk_size))
        self.files = {}  # filename -> [first_block]
        self.fat = [-1] * disk_size  # File Allocation Table
    
    def allocate(self, filename: str, size_blocks: int) -> bool:
        """Allocate blocks for a file using linked list."""
        if len(self.free_blocks) < size_blocks:
            return False
        
        # Allocate blocks
        blocks = []
        for _ in range(size_blocks):
            block = self.free_blocks.pop()
            blocks.append(block)
        
        # Set up links
        for i in range(len(blocks) - 1):
            self.fat[blocks[i]] = blocks[i + 1]
        self.fat[blocks[-1]] = -1  # End of file marker
        
        self.files[filename] = blocks[0]
        return True
    
    def read_chain(self, filename: str) -> list:
        """Read all blocks in a file."""
        if filename not in self.files:
            return []
        
        blocks = []
        current = self.files[filename]
        
        while current != -1:
            blocks.append(current)
            current = self.fat[current]
        
        return blocks

Indexed Allocation

Uses an index block to store pointers to all data blocks.

"""Indexed file allocation."""

class IndexedAllocation:
    """
    Indexed allocation uses an index block for pointers.
    
    Pros:
    - Fast random access
    - No external fragmentation
    - Direct block addressing
    
    Cons:
    - Index block overhead
    - Limited max file size (by index block entries)
    - Need extra I/O for index
    """
    
    def __init__(self, disk_size: int, index_size: int = 256):
        self.disk_size = disk_size
        self.index_size = index_size  # Max blocks per file
        self.free_blocks = set(range(disk_size))
        self.files = {}  # filename -> index_block
    
    def allocate(self, filename: str, size_blocks: int) -> bool:
        """Allocate blocks with index."""
        if size_blocks > self.index_size:
            return False  # File too large
        
        if len(self.free_blocks) < size_blocks + 1:  # +1 for index
            return False
        
        # Allocate index block
        index_block = self.free_blocks.pop()
        
        # Allocate data blocks
        data_blocks = []
        for _ in range(size_blocks):
            data_blocks.append(self.free_blocks.pop())
        
        # Would store data_blocks in index_block in real implementation
        self.files[filename] = {
            'index': index_block,
            'blocks': data_blocks
        }
        
        return True
    
    def read_block(self, filename: str, block_num: int) -> int:
        """Read a specific block."""
        if filename not in self.files:
            return -1
        
        file_info = self.files[filename]
        if block_num >= len(file_info['blocks']):
            return -1
        
        return file_info['blocks'][block_num]

Comparison

┌─────────────────────────────────────────────────────────────────┐
│              ALLOCATION METHOD COMPARISON                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────┬────────────┬────────────┬─────────────────┐  │
│   │ Method      │ Sequential │  Random    │ External        │  │
│   │             │ Access     │  Access    │ Fragmentation   │  │
│   ├─────────────┼────────────┼────────────┼─────────────────┤  │
│   │ Contiguous  │ Excellent  │ Excellent  │ High            │  │
│   │ Linked      │ Excellent  │ Poor       │ None            │  │
│   │ Indexed     │ Good       │ Excellent  │ None (index)    │  │
│   │ Extent-based│ Excellent  │ Good       │ Low             │  │
│   └─────────────┴────────────┴────────────┴─────────────────┘  │
│                                                                  │
│   Modern file systems (ext4, Btrfs, NTFS) use extent-based      │
│   allocation - groups of consecutive blocks allocated together  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

The Inode Structure

The inode (index node) is the fundamental structure for storing file metadata in Unix-like file systems.

Inode Structure

// Traditional Unix inode structure (simplified)
struct inode {
    // File type and permissions
    mode_t          i_mode;      // File type and permissions
    uid_t           i_uid;       // Owner UID
    gid_t           i_gid;       // Owner GID
    
    // File size
    off_t           i_size;      // File size in bytes
    
    // Timestamps
    time_t          i_atime;     // Last access time
    time_t          i_mtime;     // Last modification time
    time_t          i_ctime;     // Last status change time
    
    // Link count
    nlink_t         i_nlink;     // Number of hard links
    
    // Block pointers (ext2/3/4 style)
    // Direct blocks: 12 pointers to data blocks
    blkcnt_t        i_blocks;    // Number of blocks allocated
    unsigned long   i_block[15]; // Block pointers
                    // i_block[0-11]: 12 direct blocks
                    // i_block[12]: singly indirect
                    // i_block[13]: doubly indirect
                    // i_block[14]: triply indirect
                    
    // File attributes
    unsigned int    i_flags;     // File flags
    unsigned int    i_fop;       // File operations
    void            *i_private;  // Private data
};

Inode Visualization

┌─────────────────────────────────────────────────────────────────┐
│                    INODE STRUCTURE                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │                    inode                                   │ │
│   ├───────────────────────────────────────────────────────────┤ │
│   │  Mode:     0o100644 (regular file, rw-r--r--)            │ │
│   │  UID:      1000 (user)                                   │ │
│   │  GID:      1000 (group)                                  │ │
│   │  Size:     16384 bytes (4 blocks)                       │ │
│   │  Links:    1                                             │ │
│   │  ATime:    2026-02-27 10:30:00                          │ │
│   │  MTime:    2026-02-27 10:30:00                          │ │
│   │  CTime:    2026-02-27 10:30:00                          │ │
│   ├───────────────────────────────────────────────────────────┤ │
│   │  Block Pointers:                                         │ │
│   │                                                           │ │
│   │  ┌─────────────────────────────────────────────────────┐ │ │
│   │  │ 0   ─────► Block 1001                              │ │ │
│   │  │ 1   ─────► Block 1002                              │ │ │
│   │  │ 2   ─────► Block 1003                              │ │ │
│   │  │ ...                                                  │ │ │
│   │  │ 11  ─────► Block 1012                              │ │ │
│   │  ├─────────────────────────────────────────────────────┤ │ │
│   │  │ 12  ─────► Indirect Block (singly)                 │ │ │
│   │  │         Points to more data blocks                  │ │ │
│   │  ├─────────────────────────────────────────────────────┤ │ │
│   │  │ 13  ─────► Double Indirect Block                   │ │ │
│   │  │         Points to indirect blocks                   │ │ │
│   │  ├─────────────────────────────────────────────────────┤ │ │
│   │  │ 14  ─────► Triple Indirect Block                   │ │ │
│   │  │         Points to double indirect blocks            │ │ │
│   │  └─────────────────────────────────────────────────────┘ │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   Max file size with 4KB blocks:                                 │
│   - Direct: 12 × 4KB = 48 KB                                    │
│   - Single indirect: 1024 × 4KB = 4 MB                          │
│   - Double indirect: 1024² × 4KB = 4 GB                         │
│   - Triple indirect: 1024³ × 4KB = 4 TB                         │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Inode Implementation

"""Inode simulation."""

from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime

@dataclass
class Inode:
    """Simplified inode structure."""
    inode_number: int
    mode: int = 0o100644  # Regular file
    uid: int = 1000
    gid: int = 1000
    size: int = 0
    atime: datetime = field(default_factory=datetime.now)
    mtime: datetime = field(default_factory=datetime.now)
    ctime: datetime = field(default_factory=datetime.now)
    nlink: int = 1
    blocks: int = 0
    
    # Block pointers
    direct_blocks: List[int] = field(default_factory=lambda: [0] * 12)
    singly_indirect: Optional[int] = None
    doubly_indirect: Optional[int] = None
    triple_indirect: Optional[int] = None
    
    def add_block(self, block_num: int):
        """Add a data block to the inode."""
        if self.blocks < 12:
            self.direct_blocks[self.blocks] = block_num
        elif self.blocks < 12 + 256:  # Single indirect
            # Would add to singly indirect block
            pass
        # etc.
        
        self.blocks += 1
        self.size = self.blocks * 4096


class FileSystem:
    """Simple file system simulation with inodes."""
    
    def __init__(self, num_blocks: int, block_size: int = 4096):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.free_blocks = set(range(num_blocks))
        self.inodes = {}  # inode_number -> Inode
        self.next_inode = 1
        
        # Superblock would go here
        self.total_inodes = num_blocks // 10
    
    def create_file(self, name: str, size_blocks: int) -> Optional[Inode]:
        """Create a new file."""
        # Allocate inode
        inode_num = self.next_inode
        self.next_inode += 1
        
        # Allocate blocks
        blocks = []
        for _ in range(size_blocks):
            if not self.free_blocks:
                return None
            blocks.append(self.free_blocks.pop())
        
        # Create inode
        inode = Inode(inode_num=inode_num, blocks=size_blocks)
        for i, block in enumerate(blocks[:12]):
            inode.direct_blocks[i] = block
        
        self.inodes[inode_num] = inode
        return inode
    
    def get_inode(self, inode_num: int) -> Optional[Inode]:
        """Get inode by number."""
        return self.inodes.get(inode_num)

Directory Structure

Directory Entry

// Traditional Unix directory entry (struct dirent)
struct dirent {
    ino_t           d_ino;       // Inode number
    off_t           d_off;       // Offset to next entry
    unsigned short  d_reclen;    // Length of this record
    unsigned char   d_type;      // File type (DT_REG, DT_DIR, etc.)
    char            d_name[256]; // Filename (null-terminated)
};

// Linux-specific with flexible array member
struct linux_dirent {
    unsigned long  d_ino;       // Inode number
    unsigned long  d_off;       // Offset to next dirent
    unsigned short d_reclen;    // Length of this dirent
    char           d_name[];    // Filename (null-terminated)
};

Directory Implementation

"""Directory structure simulation."""

from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class DirEntry:
    """A directory entry."""
    name: str
    inode: int
    file_type: str  # 'file', 'directory', 'symlink', etc.

class Directory:
    """Directory structure."""
    
    def __init__(self, inode_num: int, name: str = "/"):
        self.inode_num = inode_num
        self.name = name
        self.entries: Dict[str, DirEntry] = {}
        
        # Special entries
        self.entries['.'] = DirEntry('.', inode_num, 'directory')
        self.entries['..'] = DirEntry('..', inode_num, 'directory')
    
    def add_entry(self, name: str, inode: int, file_type: str):
        """Add an entry to the directory."""
        self.entries[name] = DirEntry(name, inode, file_type)
    
    def remove_entry(self, name: str) -> bool:
        """Remove an entry."""
        if name in self.entries and name not in ['.', '..']:
            del self.entries[name]
            return True
        return False
    
    def lookup(self, name: str) -> Optional[DirEntry]:
        """Look up an entry by name."""
        return self.entries.get(name)
    
    def list_entries(self) -> List[str]:
        """List all entries."""
        return [name for name in self.entries.keys() if name not in ['.', '..']]


class PathResolver:
    """Resolve file paths to inodes."""
    
    def __init__(self, root_dir: Directory):
        self.root = root_dir
        self.cwd = root_dir
    
    def resolve(self, path: str) -> Optional[int]:
        """Resolve path to inode number."""
        if path == '/':
            return self.root.inode_num
        
        # Absolute or relative path
        if path.startswith('/'):
            current = self.root
            components = path[1:].split('/')
        else:
            current = self.cwd
            components = path.split('/')
        
        for component in components:
            if not component or component == '.':
                continue
            if component == '..':
                # Would need parent pointer
                continue
            
            entry = current.lookup(component)
            if entry is None:
                return None
            # Would need to look up inode and check if directory
        
        return entry.inode if entry else None

Directory Tree Example

┌─────────────────────────────────────────────────────────────────┐
│                    DIRECTORY STRUCTURE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   / (inode 2)                                                   │
│   ├── bin/ (inode 3)                                            │
│   │   ├── bash (inode 100)                                     │
│   │   ├── ls (inode 101)                                       │
│   │   └── cat (inode 102)                                       │
│   │                                                            │
│   ├── home/ (inode 4)                                          │
│   │   └── user/ (inode 5)                                      │
│   │       ├── documents/ (inode 10)                           │
│   │       │   ├── report.pdf (inode 20)                        │
│   │       │   └── notes.txt (inode 21)                         │
│   │       ├── pictures/ (inode 11)                              │
│   │       │   ├── vacation.jpg (inode 30)                      │
│   │       │   └── family.png (inode 31)                         │
│   │       └── .bashrc (inode 12)                                │
│   │                                                            │
│   └── var/ (inode 6)                                            │
│       └── log/ (inode 7)                                        │
│           ├── syslog (inode 40)                                 │
│           └── error.log (inode 41)                              │
│                                                                  │
│   Each directory entry maps a name to an inode number           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Journaling File Systems

Journaling protects against crashes by logging changes before applying them.

Journaling Concepts

┌─────────────────────────────────────────────────────────────────┐
│                    JOURNALING OVERVIEW                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Problem: Without journaling                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ 1. Write begins                                          │ │
│   │ 2. Write partially completes                             │ │
│   │ 3. CRASH!                                                │ │
│   │ 4. On recovery: inconsistent state, data loss            │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   Solution: Write-Ahead Logging (Journal)                       │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ 1. Write intent to journal (metadata)                    │ │
│   │ 2. Journal write completes (durable)                    │ │
│   │ 3. Write data to actual location                        │ │
│   │ 4. Mark journal entry as complete                        │ │
│   │ 5. CRASH! Recovery: replay journal = consistent          │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   Journal Modes:                                                 │
│   • Journal (data + metadata): Fullest protection, slowest     │
│   • Ordered (metadata only): Default for ext4                  │
│   • Writeback (metadata only, async): Fastest, least safe     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Journal Implementation

"""Journal implementation simulation."""

from dataclasses import dataclass, field
from enum import Enum
from typing import List
from datetime import datetime

class OperationType(Enum):
    """Types of journal operations."""
    CREATE = 1
    DELETE = 2
    MODIFY = 3
    RENAME = 4

@dataclass
class JournalEntry:
    """A journal log entry."""
    transaction_id: int
    operation: OperationType
    inode: int
    block_numbers: List[int]
    timestamp: datetime = field(default_factory=datetime.now)
    committed: bool = False

class Journal:
    """Write-ahead logging journal."""
    
    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self.entries: List[JournalEntry] = []
        self.current_transaction = 1
        self.transaction_log = []
    
    def begin_transaction(self) -> int:
        """Start a new transaction."""
        tid = self.current_transaction
        self.current_transaction += 1
        self.transaction_log.append(tid)
        return tid
    
    def log_operation(self, tid: int, operation: OperationType, 
                     inode: int, blocks: List[int]):
        """Log an operation to the journal."""
        entry = JournalEntry(
            transaction_id=tid,
            operation=operation,
            inode=inode,
            block_numbers=blocks
        )
        self.entries.append(entry)
        
        # Enforce max size
        if len(self.entries) > self.max_size:
            self._clean_old_entries()
    
    def commit_transaction(self, tid: int):
        """Mark transaction as committed."""
        for entry in self.entries:
            if entry.transaction_id == tid:
                entry.committed = True
    
    def recover(self) -> List[JournalEntry]:
        """Recover from crash using journal."""
        # Find committed transactions
        committed = [e for e in self.entries if e.committed]
        
        # Incomplete transactions need to be rolled back
        incomplete = [e for e in self.entries if not e.committed]
        
        return committed
    
    def _clean_old_entries(self):
        """Remove old committed entries."""
        # Keep only recent entries
        self.entries = self.entries[-self.max_size//2:]


class JournalingFileSystem:
    """File system with journaling support."""
    
    def __init__(self):
        self.journal = Journal()
        self.data_blocks = {}
        self.inodes = {}
    
    def write(self, inode: int, data: str, blocks: List[int]):
        """Write data with journaling."""
        tid = self.journal.begin_transaction()
        
        # Log the intent
        self.journal.log_operation(
            tid, 
            OperationType.MODIFY, 
            inode, 
            blocks
        )
        
        # Write actual data (simulated)
        for i, block in enumerate(blocks):
            self.data_blocks[block] = data[i*4096:(i+1)*4096]
        
        # Commit transaction
        self.journal.commit_transaction(tid)
    
    def recover(self):
        """Recover from crash."""
        print("Starting recovery...")
        committed = self.journal.recover()
        
        print(f"Replaying {len(committed)} committed transactions")
        
        # Replay committed operations
        for entry in committed:
            # In real implementation, would redo the operation
            print(f"  Transaction {entry.transaction_id}: {entry.operation}")
        
        print("Recovery complete")

Modern File Systems

ext4 (Fourth Extended Filesystem)

┌─────────────────────────────────────────────────────────────────┐
│                         EXT4 FILE SYSTEM                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Overview:                                                      │
│   • Default Linux file system since 2008                        │
│   • Journaling with extents                                    │
│   • Backward compatible with ext2/3                           │
│                                                                  │
│   Key Features:                                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Extents: Instead of block-by-block, allocate ranges     │ │
│   │   • Reduced metadata overhead                           │ │
│   │   • Better for large files                              │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Delayed Allocation: Don't allocate until data is        │ │
│   │   written to disk, allows better grouping               │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Journal Checksumming: Detect journal corruption         │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ 48-bit block numbering: Supports up to 1 EB             │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Defragmentation: Online defrag with e4defrag            │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   Limitations:                                                   │
│   • Not copy-on-write (can cause fragmentation)               │
│   • Limited scalability for very large storage                 │
│   • No built-in RAID, compression, or deduplication           │
│                                                                  │
│   Tools:                                                         │
│   • mkfs.ext4 - Create ext4 filesystem                         │
│   • tune2fs - Adjust parameters                                │
│   • e4defrag - Online defragmentation                          │
│   • dumpe2fs - Dump superblock info                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Btrfs (B-Tree File System)

// Btrfs uses B-trees for all metadata and data
// Key structures (simplified)

struct btrfs_header {
    u8 csum[32];           // Checksum
    u8 fsid[16];           // Filesystem ID
    __le64 blocknr;        // Block number
    __le64 flags;          // Flags
    __le64 chunk_tree_uid; // Chunk tree UUID
};

struct btrfs_root_item {
    // Location of root tree
    struct btrfs_key location;
    __le64 bytenr;        // Logical address
    __le64 byte_limit;
    __le64 bytes_used;
    __le64 last_snapshot;
    __le64 generation;
    __le64 tree_root_id;  // Tree root ID
    // ... more fields
};

struct btrfs_extent_item {
    __le64 generation;
    __le64 num_bytes;
    __le64 disk_byte_nr;
    __le64 disk_len;
    __le64 flags;
    // For data extents:
    // struct btrfs_extent_data_ref ref;
    // struct btrfs_shared_data_ref ref;
};

Btrfs Features

┌─────────────────────────────────────────────────────────────────┐
│                      BTRFS FILE SYSTEM                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Overview:                                                      │
│   • Copy-on-Write (COW) - never overwrites data in place      │
│   • Built-in RAID (0, 1, 5, 6, 10)                            │
│   • Compression (zstd, zlib, lzo)                             │
│   • Snapshots and subvolumes                                   │
│   • Built-in deduplication                                     │
│   • Checksums for data integrity                               │
│                                                                  │
│   Copy-on-Write:                                                │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ Original:                                               │   │
│   │   Block A ──► Data                                     │   │
│   │                                                            │   │
│   │ Write new data:                                         │   │
│   │   Block A ──► Old Data   (unchanged)                   │   │
│   │   Block B ──► New Data   (new location)                │   │
│   │   Update metadata to point to Block B                   │   │
│   │                                                            │   │
│   │ Benefit: Crash-consistent, no corruption                 │   │
│   └───────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Subvolumes:                                                   │
│   • Independent file system-like containers                    │
│   • Can be snapshotted independently                           │
│   • Can have separate quotas                                   │   │
│                                                                  │
│   Snapshots:                                                    │
│   • Read-only or read-write copies of subvolumes              │   │
│   • Essentially zero-cost (COW)                                │   │
│   • Foundation for incremental backups                         │   │
│                                                                  │
│   Limitations:                                                  │
│   • Still considered experimental by some                      │   │
│   • RAID5/6 had stability issues                              │   │
│   • Can have slower performance for some workloads            │   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

ZFS (Zettabyte File System)

┌─────────────────────────────────────────────────────────────────┐
│                      ZFS FILE SYSTEM                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Overview:                                                      │
│   • Originally from Sun Microsystems (now OpenZFS)            │
│   • Combined file system and volume manager                   │
│   • Extremely scalable (128-bit)                                │
│   • Enterprise-grade features                                   │
│                                                                  │
│   Architecture:                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ ZFS Pool (ZPOOL)                                        │   │
│   │   • Physical storage pool from devices                │   │
│   │   • Virtual devices (vdevs)                           │   │
│   │   • RAID-Z configurations                              │   │
│   ├─────────────────────────────────────────────────────────┤   │
│   │ ZFS Dataset                                             │   │
│   │   • File systems within the pool                      │   │
│   │   • Can have independent properties                    │   │
│   │   • Quotas and reservations                            │   │
│   ├─────────────────────────────────────────────────────────┤   │
│   │ Copy-on-Write                                            │   │
│   │   • Transaction groups (TXG)                          │   │
│   │   • No in-place overwrites                            │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Key Features:                                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Data Integrity: Every block has checksum                │ │
│   │   • Self-healing with redundancy                       │ │
│   │   • End-to-end checksums                               │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Snapshots: Instant, zero-space (until changed)          │ │
│   │   • Send/receive for backup                            │ │
│   │   • Clones                                             │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Compression: Multiple algorithms                         │ │
│   │   • lz4 (fast, good ratio)                             │ │
│   │   • zstd (better ratio, variable)                      │ │
│   │   • gzip (maximum compression)                        │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ Deduplication: Block-level                             │ │
│   │   • Can require lots of RAM                           │ │
│   │   • Optional compression-first approach                │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ RAID-Z: Variable-length RAID                            │ │
│   │   • Better than traditional RAID5                      │ │
│   │   • RAID-Z1/2/3 (1-3 parity disks)                   │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

File System Comparison

┌─────────────────────────────────────────────────────────────────┐
│                 FILE SYSTEM COMPARISON                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────┬────────┬────────┬────────┬────────┬──────────┐    │
│   │ Feature  │ ext4   │ Btrfs  │ ZFS    │ XFS    │ NTFS     │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │ Max Size │ 1 EB  │ 16 EB  │ 256 TB │ 16 TB  │ 256 TB   │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │ Max File │ 16 TB │ 16 EB  │ 16 EB  │ 8 exb  │ 16 TB    │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │ Journal  │ Yes   │ Yes    │ Yes    │ Yes    │ Yes      │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │ COW      │ No    │ Yes    │ Yes    │ No     │ No       │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │RAID Built│ No    │ Yes    │ Yes    │ No     │ No       │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │Compress  │ No    │ Yes    │ Yes    │ No     │ Yes      │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │Snapshots │ No*   │ Yes    │ Yes    │ No*    │ Yes      │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │Checksum  │ Meta  │ Yes    │ Yes    │ Meta   │ Yes      │    │
│   ├──────────┼────────┼────────┼────────┼────────┼──────────┤    │
│   │Defrag   │ Yes   │ Limited│ Yes    │ Yes    │ Yes      │    │
│   └──────────┴────────┴────────┴────────┴────────┴──────────┘    │
│                                                                  │
│   * Can be added via LVM or other tools                          │
│                                                                  │
│   When to use what:                                              │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ ext4: General purpose, compatibility, simplicity        │   │
│   │ Btrfs: Modern Linux, COW features, container storage   │   │
│   │ ZFS: Enterprise storage, data integrity, large pools   │   │
│   │ XFS: Large files, large directories, high performance  │   │
│   │ NTFS: Windows compatibility, dual-boot                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Virtual File System (VFS)

VFS Overview

┌─────────────────────────────────────────────────────────────────┐
│                 VIRTUAL FILE SYSTEM (VFS)                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   VFS provides a unified interface for different file systems:  │
│                                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Application                            │   │
│   │         open(), read(), write(), stat(), etc.           │   │
│   └──────────────────────────┬──────────────────────────────┘   │
│                              │                                    │
│   ┌──────────────────────────▼──────────────────────────────┐   │
│   │              System Call Interface                       │   │
│   │                   (POSIX)                                │   │
│   └──────────────────────────┬──────────────────────────────┘   │
│                              │                                    │
│   ┌──────────────────────────▼──────────────────────────────┐   │
│   │                   VFS Layer                               │   │
│   │    • Superblock, inode, dentry operations              │   │
│   │    • Generic file operations                            │   │
│   │    • Cache management                                   │   │
│   └──────────────────────────┬──────────────────────────────┘   │
│                              │                                    │
│   ┌────────────┬────────────┼────────────┬────────────┐        │
│   ▼            ▼            ▼            ▼            ▼        │
│   ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌─────────┐      │
│   │ ext4   │ │ Btrfs  │ │  XFS   │ │  NFS   │ │ ProcFS │      │
│   └────────┘ └────────┘ └────────┘ └────────┘ └─────────┘      │
│                                                                  │
│   Key VFS Structures:                                            │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │ struct super_block: Mounted filesystem info              │ │
│   │ struct inode: Generic inode (file metadata)              │ │
│   │ struct dentry: Directory entry (name -> inode mapping)  │ │
│   │ struct file: Open file instance                         │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

File System Commands

Linux File System Commands

# Create file systems
mkfs.ext4 /dev/sdb1
mkfs.btrfs /dev/sdb1
mkfs.xfs /dev/sdb1

# Check and repair
fsck /dev/sdb1
fsck.ext4 -n /dev/sdb1  # Dry run
fsck -fy /dev/sdb1      # Force repair

# Tune parameters
tune2fs -l /dev/sdb1                   # List
tune2fs -c 50 /dev/sdb1                 # Check interval
tune2fs -m 5 /dev/sdb1                  # Reserved blocks
tune2fs -O ^has_journal /dev/sdb1       # Remove journal

# Mount options
mount /dev/sdb1 /mnt/data
mount -o remount,rw /mnt/data          # Remount
mount -o loop file.img /mnt/iso        # Loop device

# Information
df -h                                    # Disk usage
du -sh /directory                        # Directory size
ls -li                                   # Show inodes
stat file                                # File metadata

# defragment (ext4)
e4defrag -c /dev/sdb1                   # Check
e4defrag /dev/sdb1                      # Defrag

# Btrfs
btrfs filesystem df /mountpoint          # Show space info
btrfs balance start /mountpoint          # Rebalance
btrfs subvolume list /mountpoint         # List subvolumes

# ZFS
zpool list                              # List pools
zpool status                            # Pool health
zfs list                                # List datasets
zfs snapshot pool@snapshot              # Create snapshot

Conclusion

File systems are complex but essential components of any operating system. Key takeaways:

Multiple allocation strategies exist - Contiguous, linked, indexed, and extent-based each have trade-offs
Inodes are fundamental - All Unix file systems use inodes to store metadata
Journaling prevents corruption - Write-ahead logging ensures consistency after crashes
Modern file systems add features - COW, compression, snapshots, built-in RAID
VFS unifies access - Linux’s VFS allows transparent access to different file system types

Choosing the right file system depends on your use case: ext4 for simplicity, Btrfs for modern Linux features, ZFS for enterprise storage with data integrity.

File System Architecture: From Inodes to Modern File Systems

Introduction

File System Fundamentals

What is a File System?

File System Layers

File Organization

Contiguous Allocation

Linked Allocation

Indexed Allocation

Comparison

The Inode Structure

Inode Structure

Inode Visualization

Inode Implementation

Directory Structure

Directory Entry

Directory Implementation

Directory Tree Example

Journaling File Systems

Journaling Concepts

Journal Implementation

Modern File Systems

ext4 (Fourth Extended Filesystem)

Btrfs (B-Tree File System)

Btrfs Features

ZFS (Zettabyte File System)

File System Comparison

Virtual File System (VFS)

VFS Overview

File System Commands

Linux File System Commands

Conclusion

External Resources

Books

Documentation

Comments

Share this article

👍 Was this article helpful?

Introduction

File System Fundamentals

What is a File System?

File System Layers

File Organization

Contiguous Allocation

Linked Allocation

Indexed Allocation

Comparison

The Inode Structure

Inode Structure

Inode Visualization

Inode Implementation

Directory Structure

Directory Entry

Directory Implementation

Directory Tree Example

Journaling File Systems

Journaling Concepts

Journal Implementation

Modern File Systems

ext4 (Fourth Extended Filesystem)

Btrfs (B-Tree File System)

Btrfs Features

ZFS (Zettabyte File System)

File System Comparison

Virtual File System (VFS)

VFS Overview

File System Commands

Linux File System Commands

Conclusion

External Resources

Books

Documentation

Related Articles on This Site

Comments

Share this article

👍 Was this article helpful?