Performance Optimization: Profiling and Optimization Techniques

Published: March 12, 2026 | Updated: May 24, 2026 | Larry Qu | 11 min read

Introduction

Performance optimization requires understanding where time is spent and making targeted improvements. This guide covers profiling techniques, common bottlenecks, and optimization strategies across multiple programming languages and environments.

First, measure. Then optimize. Finally, measure again. Without measurement, optimization is guesswork.

Profiling

Python Profiling

import cProfile
import pstats
from io import StringIO

def profile_function(func, *args, **kwargs):
    """Profile a function and print results."""
    profiler = cProfile.Profile()
    profiler.enable()

    result = func(*args, **kwargs)

    profiler.disable()

    s = StringIO()
    ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
    ps.print_stats(30)

    print(s.getvalue())

    return result


# Line-level profiling with line_profiler
from line_profiler import LineProfiler


def slow_function():
    result = 0
    for i in range(10000):
        result += i ** 2
    return result


# Run with kernprof (requires decorating slow_function with @profile):
#   kernprof -l -v script.py
# Or use the module directly:
lp = LineProfiler()
lp.add_function(slow_function)
lp.run('slow_function()')
lp.print_stats()
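The cProfile helper above prints its report. When the report has to be logged or compared later, it can help to capture it as a string instead. The following is a minimal sketch, assuming Python 3.9 or newer; `build_squares`, `workload`, and `profile_to_text` are hypothetical names made up for illustration.

```python
import cProfile
import io
import pstats


def build_squares(n: int) -> list[int]:
    """Hypothetical workload: build a list of squares."""
    return [i * i for i in range(n)]


def workload() -> int:
    """Hypothetical entry point that calls the function we want to see."""
    total = 0
    for _ in range(50):
        total += sum(build_squares(2_000))
    return total


def profile_to_text(func, limit: int = 10) -> str:
    """Run func under cProfile and return the report as a string."""
    # cProfile.Profile works as a context manager on Python 3.8+
    with cProfile.Profile() as profiler:
        func()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by time spent inside each function itself, not its callees
    stats.strip_dirs().sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()


report = profile_to_text(workload)
print(report)
```

Sorting by `tottime` rather than `cumulative` tends to surface the functions doing the work themselves, which is often the more useful view once the hot call path is known.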

JavaScript/Node.js Profiling

// Chrome DevTools Profiling
// 1. Open DevTools → Performance tab
// 2. Click Record
// 3. Perform actions
// 4. Stop and analyze flamegraph

// Programmatic profiling
console.time('myOperation');
// Do operation
console.timeEnd('myOperation');

// Performance API marks
performance.mark('operation-start');
// Do operation
performance.mark('operation-end');
performance.measure('operation', 'operation-start', 'operation-end');

// Get all measures
const measures = performance.getEntriesByType('measure');
console.table(measures.map(m => ({
  name: m.name,
  duration: `${m.duration.toFixed(2)}ms`,
})));

// Clear marks
performance.clearMarks();
performance.clearMeasures();

# Node.js profiling
node --prof app.js
node --prof-process isolate-*.log > processed.txt

# Flamegraph
npm install -g flamebearer
node --prof app.js
node --prof-process --preprocess -j isolate-*.log | flamebearer

# Clinic.js (advanced profiling)
npx clinic doctor -- node app.js
npx clinic flame -- node app.js
npx clinic bubbleprof -- node app.js

Go Profiling

// Go profiling with pprof
package main

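import (
	"log"
	"net/http"
	_ "net/http/pprof"
	"os"
	"runtime/pprof"
)

func main() {
	// CPU profiling: samples are collected until StopCPUProfile runs
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Memory profiling: the heap snapshot is written when main returns
	mf, err := os.Create("mem.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer mf.Close()
	defer pprof.WriteHeapProfile(mf)

	// HTTP endpoint (serve pprof data)
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Run the application workload here. If main returns immediately, the
	// CPU profile is empty and the HTTP endpoint never serves a request.

	// Collect profiles via HTTP:
	// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// go tool pprof http://localhost:6060/debug/pprof/heap
	// go tool pprof http://localhost:6060/debug/pprof/goroutine
}

# Go profiling commands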
go tool pprof -http=:8080 cpu.prof
go tool pprof -http=:8080 mem.prof

# Compare two profiles
go tool pprof -http=:8080 -base baseline.prof current.prof

# Top functions by CPU
go tool pprof -top cpu.prof

# Flamegraph
go tool pprof -http=:8080 cpu.prof  # View → Flame Graph

Rust Profiling

// Rust profiling with perf and flamegraph
use std::time::Instant;

fn expensive_computation(n: u64) -> u64 {
    (0..n).map(|x| x * x).sum()
}

fn main() {
    let start = Instant::now();
    let result = expensive_computation(10_000_000);
    let duration = start.elapsed();

    println!("Result: {}, Time: {:?}", result, duration);
}

// Run with perf:
// perf record --call-graph dwarf ./target/release/myapp
// perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

Java Profiling

# JVM profiling with Async Profiler
./profiler.sh -e cpu -d 30 -f flamegraph.svg <pid>

# Java Flight Recorder (JFR): a JVM option passed at startup (app.jar is a placeholder)
java -XX:StartFlightRecording=duration=60s,filename=recording.jfr -jar app.jar

# VisualVM
jvisualvm  # Attach to running JVM process

# JMH for microbenchmarks
java -jar benchmarks.jar

Profiling Tool Comparison

| Tool | Language | Methodology | Overhead (approx.) | Best For |
| --- | --- | --- | --- | --- |
| cProfile | Python | Deterministic | 10-30% | CPU profiling |
| py-spy | Python | Sampling | < 5% | Production profiling |
| line_profiler | Python | Instrumentation | 20-50% | Line-level analysis |
| pprof | Go | Sampling | < 5% | CPU, memory, goroutines |
| Async Profiler | Java | Sampling + Async | < 3% | CPU, allocation, wall-clock |
| perf | Linux (all) | Sampling (perf_events) | < 2% | System-wide profiling |
| Chrome DevTools | JavaScript | Sampling | < 10% | Browser performance |
| node --prof | Node.js | Sampling | < 5% | Server-side JS |

Flamegraph Analysis

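A flamegraph shows which functions consume the most CPU time. The x-axis represents the sample population: a bar's width is the share of samples in which that function was on the stack, and the left-to-right order is alphabetical rather than chronological. The y-axis represents stack depth, with each bar sitting on top of its caller.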
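To make the axes concrete, here is a small sketch of how stack samples turn into bar widths. The samples and function names are hypothetical, and the "folded" text format (`a;b;c count`) is the one commonly fed to flamegraph generators. Note that a real flamegraph draws one bar per position in the call tree; the widths below aggregate by function name for simplicity.

```python
from collections import Counter

# Hypothetical stack samples: each tuple is one sample, root frame first.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "query_db"),
    ("main", "write_log"),
]


def fold(stack_samples):
    """Collapse identical stacks into 'a;b;c' -> count (the folded format)."""
    return Counter(";".join(stack) for stack in stack_samples)


def frame_widths(stack_samples):
    """Fraction of samples in which each function appears on the stack."""
    total = len(stack_samples)
    seen = Counter()
    for stack in stack_samples:
        for frame in set(stack):  # count a function once per sample
            seen[frame] += 1
    return {frame: count / total for frame, count in seen.items()}


folded = fold(samples)
widths = frame_widths(samples)
for stack, count in folded.most_common():
    print(f"{stack} {count}")
print(f"query_db width: {widths['query_db']:.0%}")
```

Here `query_db` is on the stack in three of five samples, so its bar would span 60% of the graph, while `main` spans the full width.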

# Generate flamegraph for Python
pip install py-spy
py-spy record -o flamegraph.svg -- python myapp.py

# Generate flamegraph for a running Python process (by PID)
py-spy record -o flamegraph.svg --pid 12345 --duration 30

# Generate flamegraph for Go binary
go tool pprof -http=:8080 cpu.prof

# Interactive flamegraph in browser
# Hover to see function name and percentage
# Click to zoom into a specific code path

Reading a Flamegraph

| Characteristic | Meaning | Action |
| --- | --- | --- |
| Wide bars at top | Expensive leaf functions | Optimize the function itself |
| Tall stack with wide top | Deep call chain to expensive leaf | Inline or cache intermediate results |
| Plateau shape | Hot loop or recursive function | Optimize loop body or unroll |
| Multiple peaks | Several hot paths | Prioritize the widest peak |
| Narrow base, wide middle | Expensive intermediate functions | Refactor call chain |

Flamegraph Interpretation Example

A flamegraph showing process_order → calculate_tax → database.query, with database.query as the widest bar, suggests the database query is the bottleneck. Optimizing the calculate_tax logic would have minimal impact; the fix is to optimize or cache the database query.
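The reasoning behind "optimize the widest bar" can be checked with Amdahl's law. The sketch below assumes, purely for illustration, that database.query accounts for 80% of the time and calculate_tax's own logic for 5%; those shares are made up, not measured.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the runtime
    is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Assumed shares of total time, as if read off the flamegraph widths.
DB_QUERY_SHARE = 0.80   # database.query
TAX_LOGIC_SHARE = 0.05  # calculate_tax's own work

tax_gain = overall_speedup(TAX_LOGIC_SHARE, 10.0)
db_gain = overall_speedup(DB_QUERY_SHARE, 2.0)
print(f"10x faster tax logic: {tax_gain:.2f}x overall")
print(f"2x faster DB query:   {db_gain:.2f}x overall")
```

Under these assumptions, making the tax logic ten times faster yields about 1.05x overall, while merely halving the query time yields about 1.67x.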

Common Bottlenecks

Database Queries

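from collections import defaultdict

from sqlalchemy.orm import joinedload

# db, User, and Post are assumed to be an existing SQLAlchemy session and models

# ❌ N+1 Query Problem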
def get_users_with_posts():
    users = db.query(User).all()
    for user in users:
        # This runs N queries!
        posts = db.query(Post).filter_by(user_id=user.id).all()
        user.posts = posts
    return users

# ✅ Eager loading
def get_users_with_posts():
    users = db.query(User).options(
        joinedload(User.posts)
    ).all()
    return users

# ✅ Batch loading (when joins are complex)
def get_users_with_posts():
    users = db.query(User).all()
    user_ids = [u.id for u in users]
    posts = db.query(Post).filter(
        Post.user_id.in_(user_ids)
    ).all()

    post_map = defaultdict(list)
    for post in posts:
        post_map[post.user_id].append(post)
    for user in users:
        user.posts = post_map[user.id]

    return users
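The difference in query count is easy to observe without an ORM. The following sketch uses the standard-library sqlite3 module with a made-up schema and a hypothetical `run` helper that counts statements; it illustrates the pattern rather than the ORM code above.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO posts (user_id, title) VALUES (?, ?)",
                 [(i, f"post{n}") for i in range(1, 51) for n in range(3)])
conn.commit()

query_count = 0


def run(sql, params=()):
    """Execute a query and count it (hypothetical helper)."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def n_plus_one():
    users = run("SELECT id FROM users ORDER BY id")
    # One extra query per user
    return {uid: run("SELECT title FROM posts WHERE user_id = ? ORDER BY id", (uid,))
            for (uid,) in users}


def batched():
    users = run("SELECT id FROM users ORDER BY id")
    user_ids = [uid for (uid,) in users]
    placeholders = ",".join("?" * len(user_ids))
    rows = run(f"SELECT user_id, title FROM posts "
               f"WHERE user_id IN ({placeholders}) ORDER BY id", user_ids)
    posts = defaultdict(list)
    for user_id, title in rows:
        posts[user_id].append((title,))
    return {uid: posts[uid] for uid in user_ids}


def count_queries(fn):
    """Run fn and report how many queries it issued."""
    global query_count
    query_count = 0
    result = fn()
    return result, query_count


slow_result, slow_queries = count_queries(n_plus_one)
fast_result, fast_queries = count_queries(batched)
print(f"N+1: {slow_queries} queries, batched: {fast_queries} queries")
```

With 50 users, the first version issues 51 queries and the batched version issues 2, while both return the same data.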

Inefficient Algorithms

# ❌ O(n²) — Nested loop lookup
def find_common_elements_slow(list_a, list_b):
    common = []
    for a in list_a:
        for b in list_b:
            if a == b:
                common.append(a)
    return common

# ✅ O(n) — Set intersection
def find_common_elements_fast(list_a, list_b):
    set_a = set(list_a)
    return [item for item in list_b if item in set_a]

# Benchmark
import timeit

slow_time = timeit.timeit(
    lambda: find_common_elements_slow(range(1000), range(500, 1500)),
    number=100,
)
fast_time = timeit.timeit(
    lambda: find_common_elements_fast(range(1000), range(500, 1500)),
    number=100,
)
print(f"Slow: {slow_time:.3f}s")
print(f"Fast: {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")

Memory Allocation

# ❌ Repeated allocation in loop
def process_logs(logs):
    results = []
    for log in logs:
        segments = log.split(",")  # New list every iteration
        processed = segments[0].strip()
        results.append(processed)
    return results

# ✅ Reuse and pre-allocate
def process_logs_fast(logs):
    results = [None] * len(logs)  # Pre-allocate
    for i, log in enumerate(logs):
        idx = log.find(",")  # No split needed, just find the first comma
        field = log if idx == -1 else log[:idx]  # Whole line when there is no comma
        results[i] = field.strip()
    return results
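Whether an allocation change pays off is best confirmed by measuring it. As a rough sketch, the standard-library tracemalloc module can report peak allocation for a piece of code; the two summing functions below are hypothetical examples chosen to make the difference obvious.

```python
import tracemalloc


def peak_bytes(fn) -> int:
    """Return the peak traced allocation (in bytes) while fn runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def sum_with_list(n: int = 200_000) -> int:
    squares = [i * i for i in range(n)]  # materializes every value
    return sum(squares)


def sum_with_generator(n: int = 200_000) -> int:
    return sum(i * i for i in range(n))  # one value alive at a time


list_peak = peak_bytes(sum_with_list)
gen_peak = peak_bytes(sum_with_generator)
print(f"list: {list_peak / 1024:.0f} KiB, generator: {gen_peak / 1024:.0f} KiB")
```

tracemalloc adds noticeable overhead of its own, so it suits memory comparisons better than timing comparisons.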

Serialization Overhead

import json
import orjson  # third-party; often several times faster than stdlib json, depending on the payload


# ❌ Slow serialization with stdlib
def serialize_stdlib(data: list[dict]) -> str:
    return json.dumps(data, default=str)


# ✅ Fast serialization with orjson
def serialize_fast(data: list[dict]) -> str:
    return orjson.dumps(data, option=orjson.OPT_SERIALIZE_NUMPY).decode()


# ❌ Repeated serialization
def api_response(data):
    # Serializes twice!
    cached = json.dumps(data)
    return Response(json.dumps(data))

# ✅ Serialize once and reuse the result
def api_response(data):
    serialized = json.dumps(data)
    return Response(serialized)

Caching

LRU Cache

from functools import lru_cache
import time


# ❌ Repeated expensive computation
def calculate_total(order_id):
    order = get_order(order_id)
    total = 0
    for item in order.items:
        tax = calculate_tax(item.price)  # Recalculates every time
        total += item.price + tax
    return total


# ✅ Cached tax rate lookup
@lru_cache(maxsize=1000)
def get_tax_rate(state: str) -> float:
    """Expensive database lookup, now cached."""
    time.sleep(0.1)  # Stand-in for query latency in this example
    return database.query_tax_rate(state)  # `database` is a placeholder client


# ✅ Cached computation
@lru_cache(maxsize=256)
def compute_discount(price: float, tier: str) -> float:
    """Cache discount computation results."""
    if tier == 'gold':
        return price * 0.2
    elif tier == 'silver':
        return price * 0.1
    return 0
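A cache is only worth its memory if it actually gets hits. Functions wrapped with `lru_cache` expose `cache_info()` and `cache_clear()` for checking and resetting that. The sketch below uses a made-up rate table in place of a real database call.

```python
from functools import lru_cache

lookups = []  # records every call that actually reaches the "database"


@lru_cache(maxsize=128)
def get_tax_rate(state: str) -> float:
    """Stand-in for an expensive lookup; the rates are made up."""
    lookups.append(state)
    return {"CA": 0.0725, "NY": 0.04}.get(state, 0.0)


for state in ["CA", "NY", "CA", "CA", "NY"]:
    get_tax_rate(state)

info = get_tax_rate.cache_info()
print(info)     # hits, misses, maxsize, currsize
print(lookups)  # only the misses reached the lookup

get_tax_rate.cache_clear()  # e.g. after the underlying rates change
```

Five calls produce two misses and three hits here. A hit rate near zero in production is a sign that the cache key is too fine-grained or the cache too small.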

Cache Invalidation Strategies

| Strategy | Mechanism | Use Case |
| --- | --- | --- |
| TTL | Expire after fixed time | Stale data acceptable |
| Write-through | Update cache on DB write | Strong consistency |
| Write-behind | Write to cache, persist to DB asynchronously | High write throughput |
| Invalidate on write | Delete cache key on DB write | Eventual consistency |
| Version-based | Increment version key | Schema changes |
| Stale-while-revalidate | Serve stale, refresh async | Read-heavy workloads |
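Two of these strategies, TTL and invalidate-on-write, fit in a few lines. The class below is a minimal single-threaded sketch, not a production cache; `TTLCache`, `get_or_load`, and `load_profile` are hypothetical names, and the clock is injectable so expiry can be demonstrated without sleeping.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Minimal TTL cache: entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so tests can control time
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit
        value = loader()     # miss or expired: reload from the source
        self._store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Invalidate on write: drop the key when the source changes."""
        self._store.pop(key, None)


fake_now = [0.0]
cache = TTLCache(ttl=60, clock=lambda: fake_now[0])
loads = []


def load_profile():
    loads.append(fake_now[0])  # record when the source was actually read
    return {"name": "Ada"}


cache.get_or_load("user:1", load_profile)  # miss: loads
cache.get_or_load("user:1", load_profile)  # hit: no load
fake_now[0] = 61.0
cache.get_or_load("user:1", load_profile)  # expired: loads again
cache.invalidate("user:1")
cache.get_or_load("user:1", load_profile)  # invalidated: loads again
print(f"source reads: {len(loads)}")
```

Four lookups result in three reads of the source: the initial miss, the expiry, and the explicit invalidation.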

Benchmarking

import timeit


# Compare list construction methods
def slow_way():
    result = []
    for i in range(1000):
        if i % 2 == 0:
            result.append(i)
    return result


def fast_way():
    return [i for i in range(1000) if i % 2 == 0]


def functional_way():
    return list(filter(lambda x: x % 2 == 0, range(1000)))


# Benchmark
for name, fn in [("Loop append", slow_way), ("List comp", fast_way), ("Filter", functional_way)]:
    elapsed = timeit.timeit(fn, number=100000)  # avoid shadowing the time module
    print(f"{name:15s}: {elapsed:.3f}s")

# Output (illustrative; absolute numbers depend on machine and Python version):
# Loop append    : 2.145s
# List comp      : 1.234s
# Filter         : 1.567s

Statistical Benchmarking

import time
import statistics

def benchmark(fn, *args, runs: int = 100, warmup: int = 10, **kwargs):
    """Benchmark with warmup and statistical analysis."""

    # Warmup
    for _ in range(warmup):
        fn(*args, **kwargs)

    # Measured runs
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    ordered = sorted(times)  # Sort once for both percentile lookups

    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0,
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[int(len(ordered) * 0.95)],
        "p99": ordered[int(len(ordered) * 0.99)],
    }


# Use
result = benchmark(fast_way, runs=50)
print(f"Mean: {result['mean']*1000:.2f}ms, p95: {result['p95']*1000:.2f}ms")
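The same idea can be expressed with standard-library helpers alone. This is a sketch of one alternative, assuming Python 3.9 or newer: `timeit.repeat` collects the samples and `statistics.quantiles` computes an interpolated percentile. The `even_numbers` function and the sample counts are arbitrary choices for illustration.

```python
import statistics
import timeit


def even_numbers() -> list[int]:
    return [i for i in range(1000) if i % 2 == 0]


CALLS_PER_SAMPLE = 200

# 30 samples of 200 calls each; every sample is a total in seconds
samples = timeit.repeat(even_numbers, repeat=30, number=CALLS_PER_SAMPLE)
per_call = [s / CALLS_PER_SAMPLE for s in samples]

# quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
p95 = statistics.quantiles(per_call, n=20)[-1]
summary = {
    "min": min(per_call),
    "median": statistics.median(per_call),
    "p95": p95,
}
for name, value in summary.items():
    print(f"{name:>6}: {value * 1e6:.1f} us")
```

The minimum is often the most stable figure for comparing two implementations, since noise from other processes can only make a run slower; the p95 says more about what users would experience.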

Optimization Techniques

Algorithm Optimization

# ❌ O(n²) — Nested loops
def find_duplicates_slow(items):
    duplicates = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                duplicates.append(items[i])
    return duplicates

# ✅ O(n) — Using set
def find_duplicates_fast(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        seen.add(item)
    return duplicates
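A faster algorithm only helps if it returns what callers expect, so it is worth comparing outputs before swapping implementations. The sketch below repeats the two functions above and shows that they differ once a value appears three or more times (they can also differ in order). `duplicated_values` is a hypothetical third variant that reports each duplicated value once.

```python
from collections import Counter


def find_duplicates_slow(items):
    duplicates = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                duplicates.append(items[i])  # one entry per equal pair
    return duplicates


def find_duplicates_fast(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)  # one entry per repeat occurrence
        seen.add(item)
    return duplicates


def duplicated_values(items):
    """Each value that occurs more than once, reported exactly once."""
    return [value for value, count in Counter(items).items() if count > 1]


data = [1, 2, 1, 3, 2, 1]
print(find_duplicates_slow(data))  # one entry per pair
print(find_duplicates_fast(data))  # one entry per repeat
print(duplicated_values(data))     # one entry per value
```

For this input the three functions return `[1, 1, 2, 1]`, `[1, 2, 1]`, and `[1, 2]` respectively, so which one is "correct" depends on what the caller needs.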

Data Structure Selection

import timeit

# Build the containers once so the benchmark times only the lookup
DATA_LIST = list(range(100000))
DATA_SET = set(DATA_LIST)

# List membership test: O(n)
def list_membership(n: int) -> bool:
    return n in DATA_LIST

# Set membership test: O(1) on average
def set_membership(n: int) -> bool:
    return n in DATA_SET

# Benchmark
list_time = timeit.timeit(lambda: list_membership(99999), number=1000)
set_time = timeit.timeit(lambda: set_membership(99999), number=1000)
print(f"List: {list_time:.3f}s, Set: {set_time:.3f}s, set is {list_time/set_time:.0f}x faster")

Lazy Evaluation

# ❌ Eager — load everything into memory
def get_top_users():
    all_users = db.query(User).all()  # Load all users!
    return sorted(all_users, key=lambda u: u.score, reverse=True)[:10]  # Sorts everything in Python

# ✅ Lazy — only load what we need
def get_top_users():
    return db.query(User).order_by(
        User.score.desc()
    ).limit(10).all()

# ❌ Eager — build entire list
def get_active_items():
    items = []
    for item in inventory:
        if item.is_active:
            items.append(item)  # All items in memory
    return items

# ✅ Lazy — generator yields one at a time
def get_active_items():
    for item in inventory:
        if item.is_active:
            yield item  # One item at a time
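Generators pair well with standard-library tools that consume only what they need. The sketch below assumes a hypothetical `read_scores` source with synthetic scores; `itertools.islice` takes a prefix without exhausting the generator, and `heapq.nlargest` finds the top items without building or sorting a full list.

```python
import heapq
from itertools import islice


def read_scores():
    """Hypothetical lazy source: yields (user, score) pairs one at a time."""
    for i in range(100_000):
        yield f"user{i}", (i * 7919) % 10_007  # synthetic score


# Take the first five items; the generator is advanced only five times.
first_five = list(islice(read_scores(), 5))

# Top three by score, keeping at most three candidates in memory at once.
top_three = heapq.nlargest(3, read_scores(), key=lambda pair: pair[1])

print(first_five)
print(top_three)
```

When the data lives in a database, pushing `ORDER BY ... LIMIT` into the query as shown above is still preferable; `heapq.nlargest` is the fallback for streams that cannot be sorted at the source.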

Concurrency and Parallelism

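import asyncio
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import aiohttp  # third-party: pip install aiohttp

# process_item and heavy_computation below are placeholders for your own functions.


async def fetch_one(session: aiohttp.ClientSession, url: str) -> dict:
    """Fetch one URL and decode the JSON body."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.json()


# I/O-bound: async
async def fetch_all_urls(urls: list[str]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, url) for url in urls]
        return await asyncio.gather(*tasks)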


# I/O-bound: threading
def process_batch(items: list, worker_count: int = 4):
    with ThreadPoolExecutor(max_workers=worker_count) as executor:
        results = executor.map(process_item, items)
    return list(results)


# CPU-bound: multiprocessing
def compute_batch(data: list, worker_count: int = 4):
    with ProcessPoolExecutor(max_workers=worker_count) as executor:
        results = executor.map(heavy_computation, data)
    return list(results)
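The benefit of threads for I/O-bound work can be seen with a simulated workload. In this sketch, `fake_io` is a hypothetical stand-in that sleeps for 50 ms in place of a network or disk call; the worker count is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(item: int) -> int:
    """Stand-in for a network or disk call: sleeps, then returns a result."""
    time.sleep(0.05)
    return item * 2


def run_sequential(items):
    return [fake_io(item) for item in items]


def run_threaded(items, workers: int = 8):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_io, items))  # results keep input order


def timed(fn, *args):
    """Return fn's result together with its wall-clock duration."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


items = list(range(16))
seq_result, seq_time = timed(run_sequential, items)
thr_result, thr_time = timed(run_threaded, items)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

Sleeping releases the GIL, which is why threads help here. For CPU-bound functions the same experiment would show little or no gain from threads, which is the reason the CPU-bound example above uses processes.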

Database Optimization

Indexing

-- Create index for frequently queried column
CREATE INDEX idx_users_email ON users(email);

-- Composite index for multi-column queries
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

-- Partial index for specific queries
CREATE INDEX idx_active_orders ON orders(user_id)
WHERE status = 'active';

-- Covering index for index-only scans
CREATE INDEX idx_product_categories ON products(category_id, price)
INCLUDE (name, stock);

-- Analyze query plan
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
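The effect of an index on the query plan can be reproduced locally. The sketch below swaps in SQLite (via the standard-library sqlite3 module) and its `EXPLAIN QUERY PLAN` statement, which differs in syntax and output from the `EXPLAIN ANALYZE` shown above; the table contents and the `query_plan` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def query_plan(sql: str, params=()) -> str:
    """Return SQLite's plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


lookup = "SELECT name FROM users WHERE email = ?"
plan_before = query_plan(lookup, ("user42@example.com",))
conn.execute("CREATE INDEX idx_users_email ON users(email)")
plan_after = query_plan(lookup, ("user42@example.com",))

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan reports a full scan of the table; afterwards it reports a search using `idx_users_email`. The exact wording of the plan text varies between SQLite versions.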

Query Optimization Patterns

-- ❌ SELECT * fetches unnecessary columns
SELECT * FROM users WHERE id = 123;

-- ✅ Select only needed columns
SELECT id, name, email FROM users WHERE id = 123;

-- ❌ Function on indexed column prevents index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-05-24';

-- ✅ Use range query instead
SELECT * FROM orders
WHERE created_at >= '2026-05-24 00:00:00'
  AND created_at < '2026-05-25 00:00:00';

-- ❌ Correlated subquery runs per row
SELECT u.*,
  (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
FROM users u;

-- ✅ Join with GROUP BY
SELECT u.*, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id;
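The claim that wrapping an indexed column in a function defeats the index can also be checked against a query planner. This sketch again uses SQLite as a stand-in (its `date()` function plays the role of `DATE()` above), with a hypothetical `orders` table; other databases may plan these queries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, total REAL)")
conn.execute("CREATE INDEX idx_orders_created_at ON orders(created_at)")


def plan(sql: str) -> str:
    """Return SQLite's query plan detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Function applied to the indexed column: the index cannot narrow the rows
wrapped = plan(
    "SELECT id, total FROM orders WHERE date(created_at) = '2026-05-24'"
)

# Range on the bare column: the index can be used to seek
ranged = plan(
    "SELECT id, total FROM orders "
    "WHERE created_at >= '2026-05-24 00:00:00' "
    "AND created_at < '2026-05-25 00:00:00'"
)

print("function on column:", wrapped)
print("range query:       ", ranged)
```

The first plan is a scan of the whole table, while the second is a search on `idx_orders_created_at`, matching the advice to rewrite date functions as ranges.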

Continuous Profiling

Continuous profiling runs an always-on profiler in production, collecting samples over time to detect regressions.

# Continuous profiling configuration (Pyroscope)
pyroscope:
  application: api-server
  server: https://profiles.internal.com
  sample_rate: 100  # Hz
  tags:
    environment: production
    region: us-west-2

# Pyroscope connection settings for the Python app
export PYROSCOPE_SERVER_ADDRESS=http://pyroscope:4040
export PYROSCOPE_APPLICATION_NAME=myapp
# py-spy on its own records a profile to a local file; it does not upload
# to a Pyroscope server, which needs the Pyroscope agent or SDK for that.
py-spy record -o /tmp/pyroscope/profiles -- python app.py

| Benefit | Description |
| --- | --- |
| Regression detection | Compare profiles across deployments |
| Always-on data | Catch intermittent issues |
| No manual trigger | Profiler runs automatically |
| Historical comparison | Compare current vs. last week |
| Low overhead | Sampling profiler typically adds < 5% CPU |
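Regression detection comes down to comparing each function's share of samples between two profiles. The sketch below shows that comparison on made-up sample counts; the function names, the 5% threshold, and the `regressions` helper are all hypothetical, and a real continuous profiler would do this over full stacks rather than flat function names.

```python
def sample_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-function sample counts into shares of the total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def regressions(before: dict[str, int], after: dict[str, int],
                threshold: float = 0.05) -> list[tuple[str, float]]:
    """Functions whose share of samples grew by more than `threshold`."""
    old, new = sample_shares(before), sample_shares(after)
    deltas = {name: new.get(name, 0.0) - old.get(name, 0.0)
              for name in set(old) | set(new)}
    flagged = [(name, delta) for name, delta in deltas.items() if delta > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample counts from two deployments.
last_week = {"parse_request": 200, "query_db": 600, "render": 200}
today = {"parse_request": 200, "query_db": 500, "render": 150, "validate_schema": 150}

flagged = regressions(last_week, today)
for name, delta in flagged:
    print(f"{name}: +{delta:.0%} of samples")
```

Comparing shares rather than raw counts keeps the result meaningful when the two profiles cover different durations or traffic levels.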

Optimization Trade-Offs

| Optimization | Performance Gain | Cost | Risk |
| --- | --- | --- | --- |
| Caching | High (10-100x) | Memory usage | Stale data |
| Connection pooling | Medium (2-5x) | Connection management | Resource exhaustion |
| Query optimization | High (10-1000x) | Development time | Schema changes |
| Algorithm change | High (10-100x) | Code complexity | Bugs |
| CDN | High (2-10x latency) | Cost | Cache invalidation |
| Async processing | Medium (2-5x throughput) | Complexity | Debugging difficulty |
| Compression | Medium (2-5x bandwidth) | CPU overhead | Client compatibility |
| Horizontal scaling | Linear throughput | Infrastructure cost | Distributed systems complexity |

Best Practices

  1. Measure first: Don’t optimize without data. Profiling shows where time is actually spent.
  2. Profile in a production-like environment: Staging with small data sets misses production bottlenecks.
  3. Optimize the biggest bottleneck: Start with the widest bar in the flamegraph. Small optimizations elsewhere rarely matter.
  4. Verify improvements: Measure after changes. The improvement must be measurable.
  5. Document trade-offs: Optimizations often increase complexity or reduce maintainability.
  6. Test at scale: Benchmarks on small datasets may not reflect production behavior.
  7. Profile in production: Use sampling profilers with low overhead for continuous data.
  8. Consider the full stack: Frontend, network, database, and infrastructure all contribute to user-perceived performance.

Conclusion

Performance optimization is about finding and fixing the actual bottlenecks. Use profiling tools to identify where time is spent, then make targeted improvements. Always measure before and after to confirm the improvement is real.

Key takeaways:

  • Use sampling profilers in production (low overhead)
  • Focus on the widest part of the flamegraph
  • Cache aggressively but invalidate carefully
  • Choose the right data structure for the job
  • Benchmark statistically with warmup and multiple runs
