Introduction
Performance optimization requires understanding where time is spent and making targeted improvements. This guide covers profiling techniques, common bottlenecks, and optimization strategies across multiple programming languages and environments.
First, measure. Then optimize. Finally, measure again. Without measurement, optimization is guesswork.
Profiling
Python Profiling
import cProfile
import pstats
from io import StringIO

def profile_function(func, *args, **kwargs):
    """Profile a function and print results."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    s = StringIO()
    ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
    ps.print_stats(30)  # Show the 30 most expensive entries
    print(s.getvalue())
    return result
# Line-level profiling with line_profiler
from line_profiler import LineProfiler

def slow_function():
    result = 0
    for i in range(10000):
        result += i ** 2
    return result

# Run with: kernprof -l -v script.py (functions to profile need the @profile decorator)
# Or use the module directly:
lp = LineProfiler()
lp.add_function(slow_function)
lp.run('slow_function()')
lp.print_stats()
JavaScript/Node.js Profiling
// Chrome DevTools Profiling
// 1. Open DevTools → Performance tab
// 2. Click Record
// 3. Perform actions
// 4. Stop and analyze flamegraph
// Programmatic profiling
console.time('myOperation');
// Do operation
console.timeEnd('myOperation');
// Performance API marks
performance.mark('operation-start');
// Do operation
performance.mark('operation-end');
performance.measure('operation', 'operation-start', 'operation-end');
// Get all measures
const measures = performance.getEntriesByType('measure');
console.table(measures.map(m => ({
  name: m.name,
  duration: `${m.duration.toFixed(2)}ms`,
})));
// Clear marks
performance.clearMarks();
performance.clearMeasures();
# Node.js profiling
node --prof app.js
node --prof-process isolate-*.log > processed.txt
# Flamegraph
npm install -g flamebearer
node --prof app.js
node --prof-process --preprocess isolate-*.log | flamebearer
# Clinic.js (advanced profiling)
npx clinic doctor -- node app.js
npx clinic flame -- node app.js
npx clinic bubbleprof -- node app.js
Go Profiling
// Go profiling with pprof
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
    "os"
    "runtime/pprof"
)

func main() {
    // CPU profiling
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // Memory profiling (written when main returns)
    mf, err := os.Create("mem.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer mf.Close()
    defer pprof.WriteHeapProfile(mf)

    // HTTP endpoint (serve pprof data)
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Collect profiles via HTTP:
    // go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
    // go tool pprof http://localhost:6060/debug/pprof/heap
    // go tool pprof http://localhost:6060/debug/pprof/goroutine
}
# Go profiling commands
go tool pprof -http=:8080 cpu.prof
go tool pprof -http=:8080 mem.prof
# Compare two profiles
go tool pprof -http=:8080 -base baseline.prof current.prof
# Top functions by CPU
go tool pprof -top cpu.prof
# Flamegraph
go tool pprof -http=:8080 cpu.prof # View → Flame Graph
Rust Profiling
// Rust profiling with perf and flamegraph
use std::time::Instant;

// u128 accumulator: the sum of squares below 10_000_000 overflows u64
fn expensive_computation(n: u64) -> u128 {
    (0..n).map(|x| u128::from(x) * u128::from(x)).sum()
}

fn main() {
    let start = Instant::now();
    let result = expensive_computation(10_000_000);
    let duration = start.elapsed();
    println!("Result: {}, Time: {:?}", result, duration);
}
// Run with perf:
// perf record --call-graph dwarf ./target/release/myapp
// perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
Java Profiling
# JVM profiling with Async Profiler
./profiler.sh -e cpu -d 30 -f flamegraph.svg <pid>
# Java Flight Recorder (JFR)
java -XX:StartFlightRecording=duration=60s,filename=recording.jfr -jar app.jar # app.jar is a placeholder
# VisualVM
jvisualvm # Attach to running JVM process
# JMH for microbenchmarks
java -jar benchmarks.jar
Profiling Tool Comparison
| Tool | Language | Methodology | Overhead | Best For |
|---|---|---|---|---|
| cProfile | Python | Deterministic | 10-30% | CPU profiling |
| py-spy | Python | Sampling | < 5% | Production profiling |
| line_profiler | Python | Instrumentation | 20-50% | Line-level analysis |
| pprof | Go | Sampling | < 5% | CPU, memory, goroutines |
| Async Profiler | Java | Sampling + Async | < 3% | CPU, allocation, wall-clock |
| perf | Linux (all) | Sampling (perf_events) | < 2% | System-wide profiling |
| Chrome DevTools | JavaScript | Sampling | < 10% | Browser performance |
| Node --prof | Node.js | Sampling | < 5% | Server-side JS |
Flamegraph Analysis
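A flamegraph shows which functions consume the most CPU time. The x-axis represents the sample population, with stacks sorted alphabetically rather than by time, so a frame's width is proportional to how often it appeared in the samples. The y-axis represents stack depth.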
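To make the two axes concrete, here is a minimal sketch of how sampled stacks can be folded into the "collapsed" form that flamegraph tools consume. The sample data and the `fold_stacks` helper are hypothetical, invented for illustration only.

```python
from collections import Counter

def fold_stacks(samples):
    """Fold raw stack samples into 'root;child;leaf' -> sample count."""
    return Counter(";".join(stack) for stack in samples)

# Hypothetical samples: each entry is one call stack, root first, leaf last.
samples = [
    ["main", "process_order", "calculate_tax"],
    ["main", "process_order", "calculate_tax"],
    ["main", "process_order", "calculate_tax"],
    ["main", "render"],
]

folded = fold_stacks(samples)
total = sum(folded.values())
for stack, count in sorted(folded.items()):
    depth = stack.count(";") + 1   # y-axis: stack depth
    width = count / total          # x-axis: share of all samples
    print(f"{stack}  depth={depth}  width={width:.0%}")
```

Width depends only on how many samples contain a stack, not on when they were taken, which is why left-to-right order in a flamegraph carries no timing information.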
# Generate flamegraph for Python
pip install py-spy
py-spy record -o flamegraph.svg -- python myapp.py
# Generate flamegraph for a running Python process
py-spy record -o flamegraph.svg --pid 12345 --duration 30
# Generate flamegraph for Go binary
go tool pprof -http=:8080 cpu.prof
# Interactive flamegraph in browser
# Hover to see function name and percentage
# Click to zoom into a specific code path
Reading a Flamegraph
| Characteristic | Meaning | Action |
|---|---|---|
| Wide bars at top | Expensive leaf functions | Optimize the function itself |
| Tall stack with wide top | Deep call chain to expensive leaf | Inline or cache intermediate results |
| Plateau shape | Hot loop or recursive function | Optimize loop body or unroll |
| Multiple peaks | Several hot paths | Prioritize the widest peak |
| Wide middle frame, narrow children | Expensive intermediate function (high self time) | Refactor call chain |
Flamegraph Interpretation Example
Suppose a flamegraph shows the stack process_order → calculate_tax → database.query, with database.query as the widest frame at the top. That suggests the database query is the bottleneck. Optimizing the calculate_tax logic would have minimal impact; the fix is to optimize or cache the database query.
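The same reasoning can be checked numerically. The sketch below uses made-up sample counts for that call chain and separates self time (samples where a function is the leaf) from total time (samples where it appears anywhere in the stack). The `self_and_total` helper is hypothetical.

```python
from collections import Counter

# Hypothetical folded profile: "root;...;leaf" -> number of samples.
folded = {
    "process_order": 2,
    "process_order;calculate_tax": 3,
    "process_order;calculate_tax;database.query": 95,
}

def self_and_total(folded):
    """Return (self, total) sample counters keyed by function name."""
    self_time, total_time = Counter(), Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        self_time[frames[-1]] += count   # the leaf frame was running
        for name in set(frames):         # count each function once per stack
            total_time[name] += count
    return self_time, total_time

self_time, total_time = self_and_total(folded)
for name in total_time:
    print(f"{name:16s} self={self_time[name]:3d} total={total_time[name]:3d}")
```

With these numbers calculate_tax appears in 98 samples but is the leaf in only 3 of them, so almost all of its time belongs to the query it calls.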
Common Bottlenecks
Database Queries
from collections import defaultdict
from sqlalchemy.orm import joinedload

# ❌ N+1 Query Problem
def get_users_with_posts():
    users = db.query(User).all()
    for user in users:
        # This runs N queries!
        posts = db.query(Post).filter_by(user_id=user.id).all()
        user.posts = posts
    return users

# ✅ Eager loading
def get_users_with_posts():
    users = db.query(User).options(
        joinedload(User.posts)
    ).all()
    return users

# ✅ Batch loading (when joins are complex)
def get_users_with_posts():
    users = db.query(User).all()
    user_ids = [u.id for u in users]
    posts = db.query(Post).filter(
        Post.user_id.in_(user_ids)
    ).all()
    # Group posts by owner so each user gets its list in O(1)
    post_map = defaultdict(list)
    for post in posts:
        post_map[post.user_id].append(post)
    for user in users:
        user.posts = post_map[user.id]
    return users
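The difference in query count can be reproduced without an ORM. This sketch uses the standard-library `sqlite3` module as a stand-in database. The schema, the `run` counting helper, and both loader functions are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO posts (user_id, title) VALUES (?, ?)",
                 [(i, f"post{j}") for i in range(1, 6) for j in range(2)])

query_count = 0

def run(sql, params=()):
    """Execute a query and count it (stand-in for ORM query logging)."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

def posts_n_plus_one():
    users = run("SELECT id FROM users ORDER BY id")
    return {uid: run("SELECT title FROM posts WHERE user_id = ? ORDER BY id", (uid,))
            for (uid,) in users}

def posts_batched():
    ids = [uid for (uid,) in run("SELECT id FROM users ORDER BY id")]
    placeholders = ",".join("?" * len(ids))
    rows = run(f"SELECT user_id, title FROM posts "
               f"WHERE user_id IN ({placeholders}) ORDER BY id", ids)
    grouped = defaultdict(list)
    for user_id, title in rows:
        grouped[user_id].append((title,))
    return {uid: grouped[uid] for uid in ids}

query_count = 0
slow = posts_n_plus_one()
n_plus_one_queries = query_count

query_count = 0
fast = posts_batched()
batched_queries = query_count

print(f"N+1: {n_plus_one_queries} queries, batched: {batched_queries} queries")
```

Both loaders return the same data; only the number of round trips differs, and that gap grows with the number of users.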
Inefficient Algorithms
# ❌ O(n²): nested loop lookup
def find_common_elements_slow(list_a, list_b):
    common = []
    for a in list_a:
        for b in list_b:
            if a == b:
                common.append(a)
    return common

# ✅ O(n): set intersection
def find_common_elements_fast(list_a, list_b):
    set_a = set(list_a)
    return [item for item in list_b if item in set_a]

# Benchmark
import timeit

slow_time = timeit.timeit(
    lambda: find_common_elements_slow(range(1000), range(500, 1500)),
    number=100,
)
fast_time = timeit.timeit(
    lambda: find_common_elements_fast(range(1000), range(500, 1500)),
    number=100,
)
print(f"Slow: {slow_time:.3f}s")
print(f"Fast: {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")
Memory Allocation
# ❌ Repeated allocation in loop
def process_logs(logs):
    results = []
    for log in logs:
        segments = log.split(",")  # New list every iteration
        processed = segments[0].strip()
        results.append(processed)
    return results

# ✅ Reuse and pre-allocate
def process_logs_fast(logs):
    results = [None] * len(logs)  # Pre-allocate
    for i, log in enumerate(logs):
        idx = log.find(",")  # No split needed, just find the first comma
        # find() returns -1 when there is no comma; keep the whole line then,
        # which matches what split(",")[0] returns
        results[i] = (log if idx == -1 else log[:idx]).strip()
    return results
Serialization Overhead
import json
import orjson  # Often several times faster than stdlib json; measure on your own data

# ❌ Slow serialization with stdlib
def serialize_stdlib(data: list[dict]) -> str:
    return json.dumps(data, default=str)

# ✅ Fast serialization with orjson
def serialize_fast(data: list[dict]) -> str:
    # orjson.dumps returns bytes, so decode to get a str
    return orjson.dumps(data, option=orjson.OPT_SERIALIZE_NUMPY).decode()

# ❌ Repeated serialization
# (Response is the web framework's response class)
def api_response(data):
    # Serializes twice!
    cached = json.dumps(data)
    return Response(json.dumps(data))

# ✅ Serialize once and reuse the result
def api_response(data):
    serialized = json.dumps(data)
    return Response(serialized)
Caching
LRU Cache
from functools import lru_cache
import time

# ❌ Repeated expensive computation
def calculate_total(order_id):
    order = get_order(order_id)
    total = 0
    for item in order.items:
        tax = calculate_tax(item.price)  # Recalculates every time
        total += item.price + tax
    return total

# ✅ Cached tax rate lookup
@lru_cache(maxsize=1000)
def get_tax_rate(state: str) -> float:
    """Expensive database lookup, now cached."""
    time.sleep(0.1)  # Simulate DB query
    return database.query_tax_rate(state)

# ✅ Cached computation
@lru_cache(maxsize=256)
def compute_discount(price: float, tier: str) -> float:
    """Cache discount computation results."""
    if tier == 'gold':
        return price * 0.2
    elif tier == 'silver':
        return price * 0.1
    return 0.0
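Whether a cache is paying off can be checked with `cache_info()`, which `functools.lru_cache` attaches to the decorated function. In this sketch the rate table and the `calls` counter are made up. The counter stands in for a database round trip.

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=1000)
def get_tax_rate(state: str) -> float:
    """Hypothetical lookup; each real call would be one database query."""
    global calls
    calls += 1
    return {"CA": 0.0725, "NY": 0.04}.get(state, 0.0)  # illustrative rates only

for state in ["CA", "NY", "CA", "CA", "NY"]:
    get_tax_rate(state)

info = get_tax_rate.cache_info()
print(info)
print(f"underlying lookups: {calls}")
```

A low hit count relative to misses suggests the cache key is too fine-grained or `maxsize` is too small. `get_tax_rate.cache_clear()` empties the cache when the underlying data changes.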
Cache Invalidation Strategies
| Strategy | Mechanism | Use Case |
|---|---|---|
| TTL | Expire after fixed time | Stale data acceptable |
| Write-through | Update cache on DB write | Strong consistency |
| Write-behind | Write to cache first, persist to DB asynchronously | High write throughput |
| Invalidate on write | Delete cache key on DB write | Eventual consistency |
| Version-based | Increment version key | Schema changes |
| Stale-while-revalidate | Serve stale, refresh async | Read-heavy workloads |
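As a rough illustration of the TTL and invalidate-on-write rows, here is a minimal in-process cache. The `TTLCache` class and its injectable `clock` parameter are assumptions for this sketch, not an existing library API. It is not thread-safe and never evicts by size.

```python
import time

class TTLCache:
    """Tiny TTL cache sketch: key -> (expires_at, value)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so time can be faked in tests
        self._store = {}

    def get(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # fresh hit
        value = loader(key)              # miss or expired: reload
        self._store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Invalidate on write: drop the key when the source of truth changes."""
        self._store.pop(key, None)

# Usage with a fake clock so the example does not need to sleep.
fake_now = [0.0]
loads = []

def load(key):
    loads.append(key)
    return f"value-for-{key}"

cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get("a", load)    # miss: loads
cache.get("a", load)    # hit: served from cache
fake_now[0] = 61.0
cache.get("a", load)    # expired: loads again
cache.invalidate("a")
cache.get("a", load)    # miss after invalidation: loads again
print(f"loader called {len(loads)} times")
```

TTL bounds how stale a value can get without any coordination with writers, while explicit invalidation removes staleness at the cost of touching the cache on every write path.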
Benchmarking
import timeit

# Compare list construction methods
def slow_way():
    result = []
    for i in range(1000):
        if i % 2 == 0:
            result.append(i)
    return result

def fast_way():
    return [i for i in range(1000) if i % 2 == 0]

def functional_way():
    return list(filter(lambda x: x % 2 == 0, range(1000)))

# Benchmark
for name, fn in [("Loop append", slow_way), ("List comp", fast_way), ("Filter", functional_way)]:
    elapsed = timeit.timeit(fn, number=100000)  # avoid shadowing the time module
    print(f"{name:15s}: {elapsed:.3f}s")
# Output (typical):
# Loop append : 2.145s
# List comp : 1.234s
# Filter : 1.567s
Statistical Benchmarking
import time
import statistics

def benchmark(fn, *args, runs: int = 100, warmup: int = 10, **kwargs):
    """Benchmark with warmup and statistical analysis."""
    # Warmup
    for _ in range(warmup):
        fn(*args, **kwargs)
    # Measured runs
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    ordered = sorted(times)  # Sort once for both percentile lookups
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0,
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[int(len(ordered) * 0.95)],
        "p99": ordered[int(len(ordered) * 0.99)],
    }

# Use
result = benchmark(fast_way, runs=50)
print(f"Mean: {result['mean']*1000:.2f}ms, p95: {result['p95']*1000:.2f}ms")
Optimization Techniques
Algorithm Optimization
# ❌ O(n²): nested loops
def find_duplicates_slow(items):
    duplicates = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                duplicates.append(items[i])
    return duplicates

# ✅ O(n): using a set
def find_duplicates_fast(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        seen.add(item)
    return duplicates
Data Structure Selection
import timeit

# Build the containers once so the benchmark measures lookup, not construction
data_list = list(range(100000))
data_set = set(data_list)

# List membership test: O(n)
def list_membership(n: int) -> bool:
    return n in data_list

# Set membership test: O(1) on average
def set_membership(n: int) -> bool:
    return n in data_set

# Benchmark
list_time = timeit.timeit(lambda: list_membership(99999), number=1000)
set_time = timeit.timeit(lambda: set_membership(99999), number=1000)
print(f"List: {list_time:.3f}s, Set: {set_time:.3f}s, Set is {list_time/set_time:.0f}x faster")
Lazy Evaluation
# ❌ Eager: load everything into memory
def get_top_users():
    all_users = db.query(User).all()  # Load all users!
    return sorted(all_users, key=lambda u: u.score, reverse=True)[:10]

# ✅ Lazy: let the database sort and return only what we need
def get_top_users():
    return db.query(User).order_by(
        User.score.desc()
    ).limit(10).all()

# ❌ Eager: build entire list
def get_active_items():
    items = []
    for item in inventory:
        if item.is_active:
            items.append(item)  # All items in memory
    return items

# ✅ Lazy: generator yields one at a time
def get_active_items():
    for item in inventory:
        if item.is_active:
            yield item  # One item at a time
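A generator only does work when a consumer asks for the next item. The sketch below, with a made-up inventory and an `inspected` list added purely to observe the behavior, shows that taking the first three active items examines only a handful of entries.

```python
from itertools import islice

inspected = []

def active_items(inventory):
    """Yield active items lazily, recording which items were examined."""
    for item in inventory:
        inspected.append(item["id"])
        if item["active"]:
            yield item

# Hypothetical inventory: items with even ids are active.
inventory = [{"id": i, "active": i % 2 == 0} for i in range(1_000)]

first_three = list(islice(active_items(inventory), 3))
first_ids = [item["id"] for item in first_three]
print(f"returned ids: {first_ids}, items examined: {len(inspected)}")
```

The eager version would have examined all 1,000 items before returning anything. The trade-off is that a generator can be consumed only once and has no length.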
Concurrency and Parallelism
import asyncio
import aiohttp
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# fetch_one, process_item, and heavy_computation are application functions defined elsewhere

# I/O-bound: async
async def fetch_all_urls(urls: list[str]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# I/O-bound: threading
def process_batch(items: list, worker_count: int = 4):
    with ThreadPoolExecutor(max_workers=worker_count) as executor:
        results = executor.map(process_item, items)
        return list(results)

# CPU-bound: multiprocessing
def compute_batch(data: list, worker_count: int = 4):
    with ProcessPoolExecutor(max_workers=worker_count) as executor:
        results = executor.map(heavy_computation, data)
        return list(results)
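The effect of threading on I/O-bound work can be seen with a self-contained sketch. Here `fake_io` is a hypothetical stand-in that sleeps instead of doing network I/O. Absolute timings will vary by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(item: int) -> int:
    """Stand-in for an I/O call; sleeping releases the GIL like a real I/O wait."""
    time.sleep(0.05)
    return item * 2

items = list(range(8))

# Sequential: waits add up
start = time.perf_counter()
sequential = [fake_io(i) for i in items]
sequential_s = time.perf_counter() - start

# Threaded: waits overlap
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as executor:
    threaded = list(executor.map(fake_io, items))
threaded_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, threaded: {threaded_s:.2f}s")
```

`executor.map` preserves input order, so both versions return the same list. The same pattern would not speed up CPU-bound work in CPython, which is where `ProcessPoolExecutor` comes in.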
Database Optimization
Indexing
-- Create index for frequently queried column
CREATE INDEX idx_users_email ON users(email);
-- Composite index for multi-column queries
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);
-- Partial index for specific queries
CREATE INDEX idx_active_orders ON orders(user_id)
WHERE status = 'active';
-- Covering index for index-only scans
CREATE INDEX idx_product_categories ON products(category_id, price)
INCLUDE (name, stock);
-- Analyze query plan
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com'; -- placeholder address
Query Optimization Patterns
-- ❌ SELECT * fetches unnecessary columns
SELECT * FROM users WHERE id = 123;
-- ✅ Select only needed columns
SELECT id, name, email FROM users WHERE id = 123;
-- ❌ Function on indexed column prevents index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-05-24';
-- ✅ Use range query instead
SELECT * FROM orders
WHERE created_at >= '2026-05-24 00:00:00'
AND created_at < '2026-05-25 00:00:00';
-- ❌ Correlated subquery runs per row
SELECT u.*,
(SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
FROM users u;
-- ✅ Join with aggregation
SELECT u.*, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id;
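The claim that a function on an indexed column prevents index use can be checked against a query planner. This sketch uses SQLite through Python's standard `sqlite3` module as a stand-in, so the plan text and the `date()` function are SQLite's and other databases will word their plans differently. The table, the index, and the `plan` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_orders_created_at ON orders(created_at)")

def plan(sql):
    """Return the plan detail strings SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Function wrapped around the indexed column
wrapped = plan("SELECT * FROM orders WHERE date(created_at) = '2026-05-24'")

# Equivalent range predicate on the bare column
ranged = plan(
    "SELECT * FROM orders "
    "WHERE created_at >= '2026-05-24 00:00:00' "
    "AND created_at < '2026-05-25 00:00:00'"
)

print("wrapped:", wrapped)
print("ranged: ", ranged)
```

The wrapped form is reported as a scan, while the range form is reported as a search that names the index. Checking the plan like this before and after a rewrite is cheaper than guessing.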
Continuous Profiling
Continuous profiling runs an always-on profiler in production, collecting samples over time to detect regressions.
# Continuous profiling configuration (Pyroscope)
pyroscope:
  application: api-server
  server: https://profiles.internal.com
  sample_rate: 100 # Hz
  tags:
    environment: production
    region: us-west-2
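# Record a profile for a Python app with py-spy
# Note: py-spy only writes the profile to the local output path; sending data
# to the Pyroscope server is the job of the Pyroscope agent or SDK
export PYROSCOPE_SERVER_ADDRESS=http://pyroscope:4040
export PYROSCOPE_APPLICATION_NAME=myapp
py-spy record -o /tmp/pyroscope/profiles -- python app.py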
| Benefit | Description |
|---|---|
| Regression detection | Compare profiles across deployments |
| Always-on data | Catch intermittent issues |
| No manual trigger | Profiler runs automatically |
| Historical comparison | Compare current vs. last week |
| Low overhead | Sampling profiler adds < 5% CPU |
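Regression detection across deployments comes down to comparing each function's share of samples. The following is a minimal sketch with made-up sample counts. The `find_regressions` helper and its 5-point threshold are assumptions, not part of any profiling product.

```python
def profile_shares(samples):
    """Convert function -> sample count into function -> share of total."""
    total = sum(samples.values())
    return {name: count / total for name, count in samples.items()}

def find_regressions(baseline, current, threshold=0.05):
    """Return functions whose share of samples grew by more than threshold."""
    before, after = profile_shares(baseline), profile_shares(current)
    grown = {
        name: after[name] - before.get(name, 0.0)
        for name in after
        if after[name] - before.get(name, 0.0) > threshold
    }
    # Largest increase first
    return dict(sorted(grown.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical per-function sample counts from two deployments.
baseline = {"parse_request": 300, "render": 500, "serialize": 200}
current = {"parse_request": 300, "render": 450, "serialize": 450}

regressions = find_regressions(baseline, current)
for name, delta in regressions.items():
    print(f"{name}: +{delta:.1%} of samples")
```

Comparing shares rather than raw counts keeps the result meaningful when the two profiles cover different durations or traffic levels.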
Optimization Trade-Offs
| Optimization | Performance Gain | Cost | Risk |
|---|---|---|---|
| Caching | High (10-100x) | Memory usage | Stale data |
| Connection pooling | Medium (2-5x) | Connection management | Resource exhaustion |
| Query optimization | High (10-1000x) | Development time | Schema changes |
| Algorithm change | High (10-100x) | Code complexity | Bugs |
| CDN | High (2-10x latency) | Cost | Cache invalidation |
| Async processing | Medium (2-5x throughput) | Complexity | Debugging difficulty |
| Compression | Medium (2-5x bandwidth) | CPU overhead | Client compatibility |
| Horizontal scaling | Linear throughput | Infrastructure cost | Distributed systems complexity |
Best Practices
- Measure first: Don’t optimize without data. A profile shows where time is actually spent.
- Profile in a production-like environment: Staging with small data misses production bottlenecks.
- Optimize the biggest bottleneck: Start with the widest bar in the flamegraph. Small optimizations elsewhere barely change total runtime.
- Verify improvements: Measure after changes. The improvement must be measurable.
- Document trade-offs: Optimizations often increase complexity or reduce maintainability.
- Test at scale: Benchmarks on small datasets may not reflect production behavior.
- Profile in production: Use sampling profilers with low overhead for continuous data.
- Consider the full stack: Frontend, network, database, and infrastructure all contribute to user-perceived performance.
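The advice to optimize the biggest bottleneck follows from Amdahl's law: the overall speedup is limited by the fraction of runtime the change touches. A small sketch, with illustrative fractions rather than measured ones:

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the runtime
    becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# Doubling the speed of code that takes 80% of the runtime...
big = overall_speedup(fraction=0.80, local_speedup=2.0)

# ...versus making code that takes 5% of the runtime ten times faster.
small = overall_speedup(fraction=0.05, local_speedup=10.0)

print(f"80% of runtime made 2x faster:  {big:.2f}x overall")
print(f"5% of runtime made 10x faster: {small:.2f}x overall")
```

A modest improvement to the widest frame beats a dramatic improvement to a narrow one.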
Conclusion
Performance optimization is about finding and fixing the actual bottlenecks. Use profiling tools to identify where time is spent, then make targeted improvements. Always measure before and after to confirm the improvement is real.
Key takeaways:
- Sampling profilers for production (low overhead)
- Focus on the widest part of the flamegraph
- Cache aggressively but invalidate carefully
- Choose the right data structure for the job
- Benchmark statistically with warmup and multiple runs