Introduction
File systems are the backbone of any operating system’s storage management. They transform raw disk sectors into organized, accessible files and directories. Understanding file system architecture is essential for system administrators, developers, and anyone working with storage.
In this comprehensive guide, we’ll explore everything from fundamental concepts to modern file system implementations.
File System Fundamentals
What is a File System?
┌─────────────────────────────────────────────────────────────────┐
│ FILE SYSTEM OVERVIEW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ A file system provides: │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. STORAGE ORGANIZATION │ │
│ │ • Files and directories │ │
│ │ • Hierarchical structure │ │
│ │ • Metadata (permissions, dates, etc.) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 2. DATA ACCESS │ │
│ │ • Read/write operations │ │
│ │ • File naming │ │
│ │ • Access control │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 3. SPACE MANAGEMENT │ │
│ │ • Allocation of disk space │ │
│ │ • Free space tracking │ │
│ │ • Fragmentation handling │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 4. INTEGRITY & RECOVERY │ │
│ │ • Crash recovery │ │
│ │ • Data consistency │ │
│ │ • Error detection │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
File System Layers
┌─────────────────────────────────────────────────────────────────┐
│ FILE SYSTEM LAYERS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Application Programs │ │
│ │ (ls, cp, cat, text editors, etc.) │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────▼──────────────────────────────┐ │
│ │ System Call Interface (POSIX) │ │
│ │ (open, read, write, close, stat, mkdir, etc.) │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────▼──────────────────────────────┐ │
│ │ Virtual File System (VFS) - Linux │ │
│ │ (unified interface for all file systems) │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ ext4 │ │ Btrfs │ │ NFS │ │
│ │ │ │ │ │ │ │
│ └────┬──────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Block Device Layer │ │
│ │ (SDD/HDD, RAID controllers, etc.) │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────▼──────────────────────────────┐ │
│ │ Physical Storage Device │ │
│ │ (Disk or SSD) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
File Organization
Contiguous Allocation
The simplest allocation method - each file occupies consecutive blocks.
"""Contiguous file allocation simulation."""
class ContiguousAllocation:
"""
Contiguous allocation stores files in consecutive disk blocks.
Pros:
- Simple to implement
- Fast sequential access
- Good random access performance
Cons:
- External fragmentation
- Difficult to grow files
"""
def __init__(self, disk_size: int, block_size: int = 4096):
self.disk_size = disk_size # Total blocks
self.block_size = block_size
self.free_blocks = set(range(disk_size))
self.files = {} # filename -> (start_block, size)
def allocate(self, filename: str, size: int) -> bool:
"""Allocate space for a file."""
num_blocks = (size + self.block_size - 1) // self.block_size
# Find consecutive blocks
allocated = []
for start in range(self.disk_size):
if all((start + i) in self.free_blocks for i in range(num_blocks)):
# Found consecutive blocks
for i in range(num_blocks):
self.free_blocks.remove(start + i)
self.files[filename] = (start, num_blocks)
return True
return False
def deallocate(self, filename: str) -> bool:
"""Free allocated space."""
if filename not in self.files:
return False
start, size = self.files[filename]
for i in range(size):
self.free_blocks.add(start + i)
del self.files[filename]
return True
def get_fragmentation(self) -> float:
"""Calculate external fragmentation."""
if not self.free_blocks:
return 1.0
# Count gaps between free blocks
sorted_free = sorted(self.free_blocks)
gaps = 0
for i in range(len(sorted_free) - 1):
if sorted_free[i + 1] - sorted_free[i] > 1:
gaps += 1
return gaps / max(len(self.free_blocks), 1)
Linked Allocation
Each block contains a pointer to the next block - blocks don’t need to be contiguous.
"""Linked list file allocation."""
class LinkedAllocation:
"""
Linked allocation uses pointers in each block.
Pros:
- No external fragmentation
- Easy to grow files
Cons:
- Poor random access (must follow pointers)
- Pointer overhead in each block
-可靠性问题 (pointer corruption = data loss)
"""
def __init__(self, disk_size: int):
self.disk_size = disk_size
self.free_blocks = set(range(disk_size))
self.files = {} # filename -> [first_block]
self.fat = [-1] * disk_size # File Allocation Table
def allocate(self, filename: str, size_blocks: int) -> bool:
"""Allocate blocks for a file using linked list."""
if len(self.free_blocks) < size_blocks:
return False
# Allocate blocks
blocks = []
for _ in range(size_blocks):
block = self.free_blocks.pop()
blocks.append(block)
# Set up links
for i in range(len(blocks) - 1):
self.fat[blocks[i]] = blocks[i + 1]
self.fat[blocks[-1]] = -1 # End of file marker
self.files[filename] = blocks[0]
return True
def read_chain(self, filename: str) -> list:
"""Read all blocks in a file."""
if filename not in self.files:
return []
blocks = []
current = self.files[filename]
while current != -1:
blocks.append(current)
current = self.fat[current]
return blocks
Indexed Allocation
Uses an index block to store pointers to all data blocks.
"""Indexed file allocation."""
class IndexedAllocation:
"""
Indexed allocation uses an index block for pointers.
Pros:
- Fast random access
- No external fragmentation
- Direct block addressing
Cons:
- Index block overhead
- Limited max file size (by index block entries)
- Need extra I/O for index
"""
def __init__(self, disk_size: int, index_size: int = 256):
self.disk_size = disk_size
self.index_size = index_size # Max blocks per file
self.free_blocks = set(range(disk_size))
self.files = {} # filename -> index_block
def allocate(self, filename: str, size_blocks: int) -> bool:
"""Allocate blocks with index."""
if size_blocks > self.index_size:
return False # File too large
if len(self.free_blocks) < size_blocks + 1: # +1 for index
return False
# Allocate index block
index_block = self.free_blocks.pop()
# Allocate data blocks
data_blocks = []
for _ in range(size_blocks):
data_blocks.append(self.free_blocks.pop())
# Would store data_blocks in index_block in real implementation
self.files[filename] = {
'index': index_block,
'blocks': data_blocks
}
return True
def read_block(self, filename: str, block_num: int) -> int:
"""Read a specific block."""
if filename not in self.files:
return -1
file_info = self.files[filename]
if block_num >= len(file_info['blocks']):
return -1
return file_info['blocks'][block_num]
Comparison
┌─────────────────────────────────────────────────────────────────┐
│ ALLOCATION METHOD COMPARISON │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┬────────────┬────────────┬─────────────────┐ │
│ │ Method │ Sequential │ Random │ External │ │
│ │ │ Access │ Access │ Fragmentation │ │
│ ├─────────────┼────────────┼────────────┼─────────────────┤ │
│ │ Contiguous │ Excellent │ Excellent │ High │ │
│ │ Linked │ Excellent │ Poor │ None │ │
│ │ Indexed │ Good │ Excellent │ None (index) │ │
│ │ Extent-based│ Excellent │ Good │ Low │ │
│ └─────────────┴────────────┴────────────┴─────────────────┘ │
│ │
│ Modern file systems (ext4, Btrfs, NTFS) use extent-based │
│ allocation - groups of consecutive blocks allocated together │
│ │
└─────────────────────────────────────────────────────────────────┘
The Inode Structure
The inode (index node) is the fundamental structure for storing file metadata in Unix-like file systems.
Inode Structure
// Traditional Unix inode structure (simplified)
struct inode {
// File type and permissions
mode_t i_mode; // File type and permissions
uid_t i_uid; // Owner UID
gid_t i_gid; // Owner GID
// File size
off_t i_size; // File size in bytes
// Timestamps
time_t i_atime; // Last access time
time_t i_mtime; // Last modification time
time_t i_ctime; // Last status change time
// Link count
nlink_t i_nlink; // Number of hard links
// Block pointers (ext2/3/4 style)
// Direct blocks: 12 pointers to data blocks
blkcnt_t i_blocks; // Number of blocks allocated
unsigned long i_block[15]; // Block pointers
// i_block[0-11]: 12 direct blocks
// i_block[12]: singly indirect
// i_block[13]: doubly indirect
// i_block[14]: triply indirect
// File attributes
unsigned int i_flags; // File flags
unsigned int i_fop; // File operations
void *i_private; // Private data
};
Inode Visualization
┌─────────────────────────────────────────────────────────────────┐
│ INODE STRUCTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ inode │ │
│ ├───────────────────────────────────────────────────────────┤ │
│ │ Mode: 0o100644 (regular file, rw-r--r--) │ │
│ │ UID: 1000 (user) │ │
│ │ GID: 1000 (group) │ │
│ │ Size: 16384 bytes (4 blocks) │ │
│ │ Links: 1 │ │
│ │ ATime: 2026-02-27 10:30:00 │ │
│ │ MTime: 2026-02-27 10:30:00 │ │
│ │ CTime: 2026-02-27 10:30:00 │ │
│ ├───────────────────────────────────────────────────────────┤ │
│ │ Block Pointers: │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ 0 ─────► Block 1001 │ │ │
│ │ │ 1 ─────► Block 1002 │ │ │
│ │ │ 2 ─────► Block 1003 │ │ │
│ │ │ ... │ │ │
│ │ │ 11 ─────► Block 1012 │ │ │
│ │ ├─────────────────────────────────────────────────────┤ │ │
│ │ │ 12 ─────► Indirect Block (singly) │ │ │
│ │ │ Points to more data blocks │ │ │
│ │ ├─────────────────────────────────────────────────────┤ │ │
│ │ │ 13 ─────► Double Indirect Block │ │ │
│ │ │ Points to indirect blocks │ │ │
│ │ ├─────────────────────────────────────────────────────┤ │ │
│ │ │ 14 ─────► Triple Indirect Block │ │ │
│ │ │ Points to double indirect blocks │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ Max file size with 4KB blocks: │
│ - Direct: 12 × 4KB = 48 KB │
│ - Single indirect: 1024 × 4KB = 4 MB │
│ - Double indirect: 1024² × 4KB = 4 GB │
│ - Triple indirect: 1024³ × 4KB = 4 TB │
│ │
└─────────────────────────────────────────────────────────────────┘
Inode Implementation
"""Inode simulation."""
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime
@dataclass
class Inode:
"""Simplified inode structure."""
inode_number: int
mode: int = 0o100644 # Regular file
uid: int = 1000
gid: int = 1000
size: int = 0
atime: datetime = field(default_factory=datetime.now)
mtime: datetime = field(default_factory=datetime.now)
ctime: datetime = field(default_factory=datetime.now)
nlink: int = 1
blocks: int = 0
# Block pointers
direct_blocks: List[int] = field(default_factory=lambda: [0] * 12)
singly_indirect: Optional[int] = None
doubly_indirect: Optional[int] = None
triple_indirect: Optional[int] = None
def add_block(self, block_num: int):
"""Add a data block to the inode."""
if self.blocks < 12:
self.direct_blocks[self.blocks] = block_num
elif self.blocks < 12 + 256: # Single indirect
# Would add to singly indirect block
pass
# etc.
self.blocks += 1
self.size = self.blocks * 4096
class FileSystem:
"""Simple file system simulation with inodes."""
def __init__(self, num_blocks: int, block_size: int = 4096):
self.num_blocks = num_blocks
self.block_size = block_size
self.free_blocks = set(range(num_blocks))
self.inodes = {} # inode_number -> Inode
self.next_inode = 1
# Superblock would go here
self.total_inodes = num_blocks // 10
def create_file(self, name: str, size_blocks: int) -> Optional[Inode]:
"""Create a new file."""
# Allocate inode
inode_num = self.next_inode
self.next_inode += 1
# Allocate blocks
blocks = []
for _ in range(size_blocks):
if not self.free_blocks:
return None
blocks.append(self.free_blocks.pop())
# Create inode
inode = Inode(inode_num=inode_num, blocks=size_blocks)
for i, block in enumerate(blocks[:12]):
inode.direct_blocks[i] = block
self.inodes[inode_num] = inode
return inode
def get_inode(self, inode_num: int) -> Optional[Inode]:
"""Get inode by number."""
return self.inodes.get(inode_num)
Directory Structure
Directory Entry
// Traditional Unix directory entry (struct dirent)
struct dirent {
ino_t d_ino; // Inode number
off_t d_off; // Offset to next entry
unsigned short d_reclen; // Length of this record
unsigned char d_type; // File type (DT_REG, DT_DIR, etc.)
char d_name[256]; // Filename (null-terminated)
};
// Linux-specific with flexible array member
struct linux_dirent {
unsigned long d_ino; // Inode number
unsigned long d_off; // Offset to next dirent
unsigned short d_reclen; // Length of this dirent
char d_name[]; // Filename (null-terminated)
};
Directory Implementation
"""Directory structure simulation."""
from dataclasses import dataclass
from typing import Dict, List, Optional
@dataclass
class DirEntry:
"""A directory entry."""
name: str
inode: int
file_type: str # 'file', 'directory', 'symlink', etc.
class Directory:
"""Directory structure."""
def __init__(self, inode_num: int, name: str = "/"):
self.inode_num = inode_num
self.name = name
self.entries: Dict[str, DirEntry] = {}
# Special entries
self.entries['.'] = DirEntry('.', inode_num, 'directory')
self.entries['..'] = DirEntry('..', inode_num, 'directory')
def add_entry(self, name: str, inode: int, file_type: str):
"""Add an entry to the directory."""
self.entries[name] = DirEntry(name, inode, file_type)
def remove_entry(self, name: str) -> bool:
"""Remove an entry."""
if name in self.entries and name not in ['.', '..']:
del self.entries[name]
return True
return False
def lookup(self, name: str) -> Optional[DirEntry]:
"""Look up an entry by name."""
return self.entries.get(name)
def list_entries(self) -> List[str]:
"""List all entries."""
return [name for name in self.entries.keys() if name not in ['.', '..']]
class PathResolver:
"""Resolve file paths to inodes."""
def __init__(self, root_dir: Directory):
self.root = root_dir
self.cwd = root_dir
def resolve(self, path: str) -> Optional[int]:
"""Resolve path to inode number."""
if path == '/':
return self.root.inode_num
# Absolute or relative path
if path.startswith('/'):
current = self.root
components = path[1:].split('/')
else:
current = self.cwd
components = path.split('/')
for component in components:
if not component or component == '.':
continue
if component == '..':
# Would need parent pointer
continue
entry = current.lookup(component)
if entry is None:
return None
# Would need to look up inode and check if directory
return entry.inode if entry else None
Directory Tree Example
┌─────────────────────────────────────────────────────────────────┐
│ DIRECTORY STRUCTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ / (inode 2) │
│ ├── bin/ (inode 3) │
│ │ ├── bash (inode 100) │
│ │ ├── ls (inode 101) │
│ │ └── cat (inode 102) │
│ │ │
│ ├── home/ (inode 4) │
│ │ └── user/ (inode 5) │
│ │ ├── documents/ (inode 10) │
│ │ │ ├── report.pdf (inode 20) │
│ │ │ └── notes.txt (inode 21) │
│ │ ├── pictures/ (inode 11) │
│ │ │ ├── vacation.jpg (inode 30) │
│ │ │ └── family.png (inode 31) │
│ │ └── .bashrc (inode 12) │
│ │ │
│ └── var/ (inode 6) │
│ └── log/ (inode 7) │
│ ├── syslog (inode 40) │
│ └── error.log (inode 41) │
│ │
│ Each directory entry maps a name to an inode number │
│ │
└─────────────────────────────────────────────────────────────────┘
Journaling File Systems
Journaling protects against crashes by logging changes before applying them.
Journaling Concepts
┌─────────────────────────────────────────────────────────────────┐
│ JOURNALING OVERVIEW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Problem: Without journaling │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 1. Write begins │ │
│ │ 2. Write partially completes │ │
│ │ 3. CRASH! │ │
│ │ 4. On recovery: inconsistent state, data loss │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ Solution: Write-Ahead Logging (Journal) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 1. Write intent to journal (metadata) │ │
│ │ 2. Journal write completes (durable) │ │
│ │ 3. Write data to actual location │ │
│ │ 4. Mark journal entry as complete │ │
│ │ 5. CRASH! Recovery: replay journal = consistent │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ Journal Modes: │
│ • Journal (data + metadata): Fullest protection, slowest │
│ • Ordered (metadata only): Default for ext4 │
│ • Writeback (metadata only, async): Fastest, least safe │
│ │
└─────────────────────────────────────────────────────────────────┘
Journal Implementation
"""Journal implementation simulation."""
from dataclasses import dataclass, field
from enum import Enum
from typing import List
from datetime import datetime
class OperationType(Enum):
"""Types of journal operations."""
CREATE = 1
DELETE = 2
MODIFY = 3
RENAME = 4
@dataclass
class JournalEntry:
"""A journal log entry."""
transaction_id: int
operation: OperationType
inode: int
block_numbers: List[int]
timestamp: datetime = field(default_factory=datetime.now)
committed: bool = False
class Journal:
"""Write-ahead logging journal."""
def __init__(self, max_size: int = 1000):
self.max_size = max_size
self.entries: List[JournalEntry] = []
self.current_transaction = 1
self.transaction_log = []
def begin_transaction(self) -> int:
"""Start a new transaction."""
tid = self.current_transaction
self.current_transaction += 1
self.transaction_log.append(tid)
return tid
def log_operation(self, tid: int, operation: OperationType,
inode: int, blocks: List[int]):
"""Log an operation to the journal."""
entry = JournalEntry(
transaction_id=tid,
operation=operation,
inode=inode,
block_numbers=blocks
)
self.entries.append(entry)
# Enforce max size
if len(self.entries) > self.max_size:
self._clean_old_entries()
def commit_transaction(self, tid: int):
"""Mark transaction as committed."""
for entry in self.entries:
if entry.transaction_id == tid:
entry.committed = True
def recover(self) -> List[JournalEntry]:
"""Recover from crash using journal."""
# Find committed transactions
committed = [e for e in self.entries if e.committed]
# Incomplete transactions need to be rolled back
incomplete = [e for e in self.entries if not e.committed]
return committed
def _clean_old_entries(self):
"""Remove old committed entries."""
# Keep only recent entries
self.entries = self.entries[-self.max_size//2:]
class JournalingFileSystem:
"""File system with journaling support."""
def __init__(self):
self.journal = Journal()
self.data_blocks = {}
self.inodes = {}
def write(self, inode: int, data: str, blocks: List[int]):
"""Write data with journaling."""
tid = self.journal.begin_transaction()
# Log the intent
self.journal.log_operation(
tid,
OperationType.MODIFY,
inode,
blocks
)
# Write actual data (simulated)
for i, block in enumerate(blocks):
self.data_blocks[block] = data[i*4096:(i+1)*4096]
# Commit transaction
self.journal.commit_transaction(tid)
def recover(self):
"""Recover from crash."""
print("Starting recovery...")
committed = self.journal.recover()
print(f"Replaying {len(committed)} committed transactions")
# Replay committed operations
for entry in committed:
# In real implementation, would redo the operation
print(f" Transaction {entry.transaction_id}: {entry.operation}")
print("Recovery complete")
Modern File Systems
ext4 (Fourth Extended Filesystem)
┌─────────────────────────────────────────────────────────────────┐
│ EXT4 FILE SYSTEM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Overview: │
│ • Default Linux file system since 2008 │
│ • Journaling with extents │
│ • Backward compatible with ext2/3 │
│ │
│ Key Features: │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Extents: Instead of block-by-block, allocate ranges │ │
│ │ • Reduced metadata overhead │ │
│ │ • Better for large files │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Delayed Allocation: Don't allocate until data is │ │
│ │ written to disk, allows better grouping │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Journal Checksumming: Detect journal corruption │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 48-bit block numbering: Supports up to 1 EB │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Defragmentation: Online defrag with e4defrag │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ Limitations: │
│ • Not copy-on-write (can cause fragmentation) │
│ • Limited scalability for very large storage │
│ • No built-in RAID, compression, or deduplication │
│ │
│ Tools: │
│ • mkfs.ext4 - Create ext4 filesystem │
│ • tune2fs - Adjust parameters │
│ • e4defrag - Online defragmentation │
│ • dumpe2fs - Dump superblock info │
│ │
└─────────────────────────────────────────────────────────────────┘
Btrfs (B-Tree File System)
// Btrfs uses B-trees for all metadata and data
// Key structures (simplified)
struct btrfs_header {
u8 csum[32]; // Checksum
u8 fsid[16]; // Filesystem ID
__le64 blocknr; // Block number
__le64 flags; // Flags
__le64 chunk_tree_uid; // Chunk tree UUID
};
struct btrfs_root_item {
// Location of root tree
struct btrfs_key location;
__le64 bytenr; // Logical address
__le64 byte_limit;
__le64 bytes_used;
__le64 last_snapshot;
__le64 generation;
__le64 tree_root_id; // Tree root ID
// ... more fields
};
struct btrfs_extent_item {
__le64 generation;
__le64 num_bytes;
__le64 disk_byte_nr;
__le64 disk_len;
__le64 flags;
// For data extents:
// struct btrfs_extent_data_ref ref;
// struct btrfs_shared_data_ref ref;
};
Btrfs Features
┌─────────────────────────────────────────────────────────────────┐
│ BTRFS FILE SYSTEM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Overview: │
│ • Copy-on-Write (COW) - never overwrites data in place │
│ • Built-in RAID (0, 1, 5, 6, 10) │
│ • Compression (zstd, zlib, lzo) │
│ • Snapshots and subvolumes │
│ • Built-in deduplication │
│ • Checksums for data integrity │
│ │
│ Copy-on-Write: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Original: │ │
│ │ Block A ──► Data │ │
│ │ │ │
│ │ Write new data: │ │
│ │ Block A ──► Old Data (unchanged) │ │
│ │ Block B ──► New Data (new location) │ │
│ │ Update metadata to point to Block B │ │
│ │ │ │
│ │ Benefit: Crash-consistent, no corruption │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ Subvolumes: │
│ • Independent file system-like containers │
│ • Can be snapshotted independently │
│ • Can have separate quotas │ │
│ │
│ Snapshots: │
│ • Read-only or read-write copies of subvolumes │ │
│ • Essentially zero-cost (COW) │ │
│ • Foundation for incremental backups │ │
│ │
│ Limitations: │
│ • Still considered experimental by some │ │
│ • RAID5/6 had stability issues │ │
│ • Can have slower performance for some workloads │ │
│ │
└─────────────────────────────────────────────────────────────────┘
ZFS (Zettabyte File System)
┌─────────────────────────────────────────────────────────────────┐
│ ZFS FILE SYSTEM │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Overview: │
│ • Originally from Sun Microsystems (now OpenZFS) │
│ • Combined file system and volume manager │
│ • Extremely scalable (128-bit) │
│ • Enterprise-grade features │
│ │
│ Architecture: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ZFS Pool (ZPOOL) │ │
│ │ • Physical storage pool from devices │ │
│ │ • Virtual devices (vdevs) │ │
│ │ • RAID-Z configurations │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ ZFS Dataset │ │
│ │ • File systems within the pool │ │
│ │ • Can have independent properties │ │
│ │ • Quotas and reservations │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ Copy-on-Write │ │
│ │ • Transaction groups (TXG) │ │
│ │ • No in-place overwrites │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Key Features: │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Data Integrity: Every block has checksum │ │
│ │ • Self-healing with redundancy │ │
│ │ • End-to-end checksums │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Snapshots: Instant, zero-space (until changed) │ │
│ │ • Send/receive for backup │ │
│ │ • Clones │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Compression: Multiple algorithms │ │
│ │ • lz4 (fast, good ratio) │ │
│ │ • zstd (better ratio, variable) │ │
│ │ • gzip (maximum compression) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Deduplication: Block-level │ │
│ │ • Can require lots of RAM │ │
│ │ • Optional compression-first approach │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ RAID-Z: Variable-length RAID │ │
│ │ • Better than traditional RAID5 │ │
│ │ • RAID-Z1/2/3 (1-3 parity disks) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
File System Comparison
┌─────────────────────────────────────────────────────────────────┐
│ FILE SYSTEM COMPARISON │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┬────────┬────────┬────────┬────────┬──────────┐ │
│ │ Feature │ ext4 │ Btrfs │ ZFS │ XFS │ NTFS │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │ Max Size │ 1 EB │ 16 EB │ 256 TB │ 16 TB │ 256 TB │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │ Max File │ 16 TB │ 16 EB │ 16 EB │ 8 exb │ 16 TB │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │ Journal │ Yes │ Yes │ Yes │ Yes │ Yes │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │ COW │ No │ Yes │ Yes │ No │ No │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │RAID Built│ No │ Yes │ Yes │ No │ No │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │Compress │ No │ Yes │ Yes │ No │ Yes │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │Snapshots │ No* │ Yes │ Yes │ No* │ Yes │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │Checksum │ Meta │ Yes │ Yes │ Meta │ Yes │ │
│ ├──────────┼────────┼────────┼────────┼────────┼──────────┤ │
│ │Defrag │ Yes │ Limited│ Yes │ Yes │ Yes │ │
│ └──────────┴────────┴────────┴────────┴────────┴──────────┘ │
│ │
│ * Can be added via LVM or other tools │
│ │
│ When to use what: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ ext4: General purpose, compatibility, simplicity │ │
│ │ Btrfs: Modern Linux, COW features, container storage │ │
│ │ ZFS: Enterprise storage, data integrity, large pools │ │
│ │ XFS: Large files, large directories, high performance │ │
│ │ NTFS: Windows compatibility, dual-boot │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Virtual File System (VFS)
VFS Overview
┌─────────────────────────────────────────────────────────────────┐
│ VIRTUAL FILE SYSTEM (VFS) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ VFS provides a unified interface for different file systems: │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Application │ │
│ │ open(), read(), write(), stat(), etc. │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────▼──────────────────────────────┐ │
│ │ System Call Interface │ │
│ │ (POSIX) │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────▼──────────────────────────────┐ │
│ │ VFS Layer │ │
│ │ • Superblock, inode, dentry operations │ │
│ │ • Generic file operations │ │
│ │ • Cache management │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌────────────┬────────────┼────────────┬────────────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌─────────┐ │
│ │ ext4 │ │ Btrfs │ │ XFS │ │ NFS │ │ ProcFS │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ └─────────┘ │
│ │
│ Key VFS Structures: │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ struct super_block: Mounted filesystem info │ │
│ │ struct inode: Generic inode (file metadata) │ │
│ │ struct dentry: Directory entry (name -> inode mapping) │ │
│ │ struct file: Open file instance │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
File System Commands
Linux File System Commands
# Create file systems
mkfs.ext4 /dev/sdb1
mkfs.btrfs /dev/sdb1
mkfs.xfs /dev/sdb1
# Check and repair
fsck /dev/sdb1
fsck.ext4 -n /dev/sdb1 # Dry run
fsck -fy /dev/sdb1 # Force repair
# Tune parameters
tune2fs -l /dev/sdb1 # List
tune2fs -c 50 /dev/sdb1 # Check interval
tune2fs -m 5 /dev/sdb1 # Reserved blocks
tune2fs -O ^has_journal /dev/sdb1 # Remove journal
# Mount options
mount /dev/sdb1 /mnt/data
mount -o remount,rw /mnt/data # Remount
mount -o loop file.img /mnt/iso # Loop device
# Information
df -h # Disk usage
du -sh /directory # Directory size
ls -li # Show inodes
stat file # File metadata
# defragment (ext4)
e4defrag -c /dev/sdb1 # Check
e4defrag /dev/sdb1 # Defrag
# Btrfs
btrfs filesystem df /mountpoint # Show space info
btrfs balance start /mountpoint # Rebalance
btrfs subvolume list /mountpoint # List subvolumes
# ZFS
zpool list # List pools
zpool status # Pool health
zfs list # List datasets
zfs snapshot pool@snapshot # Create snapshot
Conclusion
File systems are complex but essential components of any operating system. Key takeaways:
- Multiple allocation strategies exist - Contiguous, linked, indexed, and extent-based each have trade-offs
- Inodes are fundamental - All Unix file systems use inodes to store metadata
- Journaling prevents corruption - Write-ahead logging ensures consistency after crashes
- Modern file systems add features - COW, compression, snapshots, built-in RAID
- VFS unifies access - Linux’s VFS allows transparent access to different file system types
Choosing the right file system depends on your use case: ext4 for simplicity, Btrfs for modern Linux features, ZFS for enterprise storage with data integrity.
External Resources
Books
- “Operating System Concepts” - Silberschatz, Galvin, Gagne
- “Understanding the Linux Kernel” - Bovet & Cesati
- “The Art of Unix Programming” - Eric S. Raymond
Comments