Build Optimization and Binary Size Reduction in Rust

TL;DR: This guide covers optimizing Rust binaries for production deployments. You’ll learn release profile configurations, LTO, link-time optimization, binary size reduction techniques, and special considerations for Lambda/serverless and embedded targets.

Introduction

Rust is known for generating optimized binaries, but default release builds aren’t always optimal. This article explores:

Release profile optimization techniques
Link-time optimization (LTO)
Binary size reduction for constrained environments
Lambda/serverless optimization
Cross-compilation for various targets

Cargo Release Profiles

Default Release Profile

[profile.release]
opt-level = 3
lto = false
codegen-units = 16
strip = false

Optimized Profile

[profile.release]
opt-level = 3          # Maximum optimization
lto = "fat"           # Full link-time optimization
codegen-units = 1     # Better optimization, slower compile
strip = true          # Strip symbols
panic = "abort"       # Smaller binary, no unwinding

Comparison

Setting	Default	Optimized	Impact
opt-level	3	3	Same optimization
lto	false	“fat”	10-20% faster, longer compile
codegen-units	16	1	Better optimization
strip	false	true	Smaller binary
panic	unwind	“abort”	5-10% smaller

Link-Time Optimization (LTO)

LTO allows the compiler to optimize across crate boundaries.

Thin LTO (Faster Builds, Good Optimization)

[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 16

Fat LTO (Best Optimization, Slower Builds)

[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1

Runtime Performance Comparison

// benchmark.rs
use std::time::Instant;

fn main() {
    let iterations = 10_000_000;
    
    // Test function
    let start = Instant::now();
    for _ in 0..iterations {
        compute(42, 100);
    }
    let duration = start.elapsed();
    
    println!("Total time: {:?}", duration);
    println!("Per iteration: {:?}", duration / iterations);
}

fn compute(a: i32, b: i32) -> i32 {
    (a * b + a - b) / (b + 1)
}

Typical results:

Without LTO: ~120ms
Thin LTO: ~95ms (21% faster)
Fat LTO: ~88ms (27% faster)

Binary Size Optimization

Size-Z Profile (Maximum Size Reduction)

[profile.release]
opt-level = "z"       # Optimize for size
lto = "fat"
codegen-units = 1
strip = true
panic = "abort"

Size Optimization Options

opt-level	Effect
0	No optimization
1	Basic optimization
2	More optimization
3	Maximum speed
“z”	Maximum size reduction

Profile-Guided Optimization

// Enable PGO in build.rs
fn main() {
    println!("cargo:rustc-cfg=pgo_gen");
}

[profile.release]
pgo = "generate"

Reducing Dependency Size

Use Smaller Crate Alternatives

# Instead of tokio (full), use specific features
tokio = { version = "1", default-features = false, features = ["rt-multi-thread", "macros"] }

# Instead of serde (all formats), use what you need
serde = { version = "1", default-features = false, features = ["derive"] }
serde_json = "1"

# For logging, consider log + minimal implementation
log = "0.4"
env_logger = "0.10"  # Or use tracing with minimal features

Feature Minimization

# Default: pulls in many dependencies
axum = "0.7"

# Minimal: only what's needed
axum = { version = "0.7", default-features = false, features = ["macros", "tower-log"] }

Bill of Materials Comparison

# Show dependency tree
cargo tree --duplicates

# Show size of dependencies
cargo tree --size -t

# Count lines of code
cargo tree --duplicates | wc -l

Lambda and Serverless Optimization

Lambda-Specific Profile

[profile.release]
opt-level = "z"
lto = true
codegen-units = 1
strip = true
panic = "abort"

# Additional Lambda optimizations
[lib]
crate-type = ["cdylib", "rlib"]

Lambda Handler

use lambda_runtime::{run, service_fn, Error, LambdaEvent};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct Request {
    name: String,
}

#[derive(Serialize)]
struct Response {
    message: String,
}

async fn function_handler(
    event: LambdaEvent<Request>,
) -> Result<Response, Error> {
    let name = event.payload.name;
    
    Ok(Response {
        message: format!("Hello, {}!", name),
    })
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    run(service_fn(function_handler)).await
}

Docker Multi-Stage Build for Lambda

# Build stage
FROM rust:1.75 AS builder

WORKDIR /app

COPY Cargo.toml Cargo.lock ./
COPY src ./src

RUN cargo build --release --lib

# Runtime stage
FROM scratch

COPY --from=builder /app/target/release/libmy_lambda.so /var/task/libmy_lambda.so
COPY --from=builder /app/bootstrap /var/task/bootstrap

ENTRYPOINT ["/var/task/bootstrap"]

Binary Size Comparison

Build Type	Binary Size
Debug	~25 MB
Release (default)	~4 MB
Release + LTO	~2.5 MB
Release + size opt (“z”)	~1.2 MB
Lambda optimized	~800 KB

Cross-Compilation

ARM64 (Apple Silicon, Raspberry Pi)

# Install target
rustup target add aarch64-unknown-linux-gnu

# Build
cargo build --release --target aarch64-unknown-linux-gnu

WASM (WebAssembly)

# Install WASM target
rustup target add wasm32-unknown-unknown

# Build
cargo build --release --target wasm32-unknown-unknown

# Convert to JavaScript
wasm-bindgen --out-dir ./out/ --target web ./target/wasm32-unknown-unknown/release/myapp.wasm

ARMv7 (Raspberry Pi 3)

rustup target add armv7-unknown-linux-gnueabihf

# Requires cross-compilation toolchain
cargo build --release --target armv7-unknown-linux-gnueabihf

Stripping and Debug Symbols

Strip Symbols

[profile.release]
strip = true

Manual Strip

# Strip all symbols
strip target/release/myapp

# Strip debug symbols only (keep symbols for debugging)
strip --strip-debug target/release/myapp

# Remove symbol table entirely
strip --strip-all target/release/myapp

Analyze Binary Contents

# Show sections
rust-size target/release/myapp

# Show dependencies
ldd target/release/myapp

# Show symbols
nm target/release/myapp | head -20

Incremental Compilation and Cache

Release Build Cache

# Use sccache for faster builds
cargo install sccache

# Set environment
export RUSTC_WRAPPER=sccache
export SCCACHE_GHA_ENABLED=true

# Build
cargo build --release

Build Script Optimization

// build.rs
fn main() {
    // Only rebuild if build.rs changes
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-changed=src/config.rs");
    
    // Don't generate code at build time if not needed
    // Use compile-time macros instead
}

WebAssembly Optimization

WASM Size Optimization

[profile.release]
opt-level = "z"
lto = true
codegen-units = 1

[package.metadata.wasm]
profile = "release"

WASM Binaryen Optimization

# Install wasm-opt
cargo install wasm-bindgen-cli

# Run binaryen optimization
wasm-opt -Oz -o output.wasm input.wasm

Example WASM Crate

use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn fibonacci(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => {
            let mut a = 0u32;
            let mut b = 1u32;
            for _ in 2..=n {
                let temp = a + b;
                a = b;
                b = temp;
            }
            b
        }
    }
}

Embedded Systems Optimization

No Standard Library

#![no_std]
#![no_main]

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

#[no_mangle]
pub extern "C" fn main() {
    // Embedded application
}

Embedded Profile

[profile.release]
opt-level = "z"
lto = true
codegen-units = 1
strip = true

[target.thumbv7em-none-eabihf]
rustflags = [
    "-C", "link-arg=-Tlink.x",
    "-C", "target-cpu=cortex-m4",
]

Production Build Script

#!/bin/bash
set -e

echo "Building for production..."

# Clean previous builds
cargo clean

# Build with optimizations
cargo build --release \
    --lib \
    --bin myapp \
    --locked

# Strip symbols
strip target/release/myapp

# Show final size
echo "Final binary size:"
ls -lh target/release/myapp

# Show section sizes
rust-size target/release/myapp

Performance Benchmark

use std::time::Instant;

fn main() {
    // Fibonacci benchmark
    let iterations = 1000;
    
    // Release build comparison
    for profile in ["debug", "release"] {
        let start = Instant::now();
        for _ in 0..iterations {
            let _ = fibonacci(30);
        }
        println!("{}: {:?}", profile, start.elapsed());
    }
}

fn fibonacci(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

Typical output:

debug: ~800ms
release (default): ~15ms
release + LTO: ~10ms

Conclusion

Optimizing Rust binaries involves:

Release Profiles - Configure opt-level, LTO, codegen-units
Size Optimization - Use opt-level = “z” for smallest binaries
Dependency Management - Minimize features, use smaller crates
Lambda/Serverless - Profile for size, use cdylib
Cross-Compilation - Target specific architectures
Strip Symbols - Remove debug info for production

The right configuration depends on your deployment target—Lambda needs minimum size, while server applications should prioritize runtime performance.