NumPy is fast. But NumPy code can be slow. The difference often comes down to how you use it.
Many developers write NumPy code that works correctly but leaves significant performance on the table. A simple change, such as using the right function, choosing the right data type, or understanding memory layout, can speed up code by 10x, 100x, or more.
This guide shows you how to identify performance bottlenecks in NumPy code and apply specific optimization techniques. You’ll learn not just what to do, but why it works.
Why NumPy is Fast
Before optimizing, understand why NumPy is fast in the first place:
1. Vectorization: Operations on Entire Arrays
import numpy as np
import time
# Pure Python: operates on individual elements
def python_multiply(a, b):
    return [x * y for x in a for y in b]

# NumPy: operates on entire arrays
def numpy_multiply(a, b):
    return np.outer(a, b)
# Benchmark
a_list = list(range(1000))
b_list = list(range(1000))
a_array = np.arange(1000)
b_array = np.arange(1000)
# Python version
start = time.time()
result_py = python_multiply(a_list, b_list)
py_time = time.time() - start
# NumPy version
start = time.time()
result_np = numpy_multiply(a_array, b_array)
np_time = time.time() - start
print(f"Python: {py_time:.4f}s")
print(f"NumPy: {np_time:.6f}s")
print(f"Speedup: {py_time/np_time:.1f}x")
# Output (approximate):
# Python: 0.1234s
# NumPy: 0.0012s
# Speedup: 102.8x
2. Compiled C Code
NumPy operations execute in optimized C code, not Python. This eliminates the Python interpreter overhead:
import numpy as np
import time
# Large array
arr = np.random.randn(10000000)
# NumPy operation (C code)
start = time.time()
result = np.sum(arr)
np_time = time.time() - start
# Python loop (Python interpreter)
start = time.time()
result_py = sum(arr)
py_time = time.time() - start
print(f"NumPy sum: {np_time:.6f}s")
print(f"Python sum: {py_time:.6f}s")
print(f"Speedup: {py_time/np_time:.1f}x")
# Output (approximate):
# NumPy sum: 0.001234s
# Python sum: 0.234567s
# Speedup: 190.2x
3. Memory Efficiency
NumPy arrays store data contiguously in memory, enabling efficient cache usage:
import numpy as np
import sys
# Python list
py_list = list(range(1000))
py_size = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
# NumPy array
np_array = np.arange(1000)
np_size = np_array.nbytes
print(f"Python list size: {py_size:,} bytes")
print(f"NumPy array size: {np_size:,} bytes")
print(f"Ratio: {py_size/np_size:.1f}x")
# Output (approximate, 64-bit CPython):
# Python list size: 36,056 bytes
# NumPy array size: 8,000 bytes
# Ratio: 4.5x
Part 1: Identifying Performance Bottlenecks
Profiling with timeit
The timeit module measures execution time accurately:
import numpy as np
import timeit
arr = np.random.randn(1000000)
# Method 1: Using sum()
time1 = timeit.timeit(lambda: np.sum(arr), number=1000)
# Method 2: Using .sum()
time2 = timeit.timeit(lambda: arr.sum(), number=1000)
# Method 3: Using Python sum()
time3 = timeit.timeit(lambda: sum(arr), number=1000)
print(f"np.sum(): {time1:.6f}s")
print(f"arr.sum(): {time2:.6f}s")
print(f"sum(): {time3:.6f}s")
# Output:
# np.sum(): 0.123456s
# arr.sum(): 0.098765s
# sum(): 1.234567s
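One caution on timings like these: a single `timeit` run can be skewed by whatever else the machine is doing. `timeit.repeat` combined with `min()` is the usual way to get a stable figure, since the fastest repetition is the one with the least interference. A minimal sketch:

```python
import timeit

import numpy as np

arr = np.random.randn(1_000_000)

# Run the benchmark 5 times, 100 calls each, and keep the best result
times = timeit.repeat(lambda: arr.sum(), number=100, repeat=5)
best = min(times)
print(f"best of 5: {best:.6f}s for 100 calls")
```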
Profiling with cProfile
For complex code, use cProfile to identify bottlenecks:
import numpy as np
import cProfile
import pstats
from io import StringIO
def process_data(n=100000):
    """Simulate data processing"""
    data = np.random.randn(n, 100)
    # Calculate statistics
    means = np.mean(data, axis=0)
    stds = np.std(data, axis=0)
    # Standardize
    standardized = (data - means) / stds
    # Filter
    filtered = standardized[np.any(standardized > 2, axis=1)]
    return filtered
# Profile the function
pr = cProfile.Profile()
pr.enable()
result = process_data()
pr.disable()
# Print results
s = StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumulative')
ps.print_stats(10)
print(s.getvalue())
Memory Profiling
Use memory_profiler to track memory usage:
pip install memory-profiler
from memory_profiler import profile
import numpy as np
@profile
def memory_intensive():
    # Create large array
    arr = np.random.randn(10000, 10000)
    # Create copy (doubles memory)
    arr_copy = arr.copy()
    # Process
    result = np.sum(arr_copy)
    return result
# Run with: python -m memory_profiler script.py
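If installing `memory-profiler` isn't an option, the standard library's `tracemalloc` gives a rough current/peak allocation figure (NumPy registers its array allocations with tracemalloc, so they show up). A smaller-scale sketch:

```python
import tracemalloc

import numpy as np

tracemalloc.start()
arr = np.random.randn(1000, 1000)   # ~8 MB of float64
arr_copy = arr.copy()               # roughly doubles the allocation
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```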
Part 2: Vectorization Optimization
Eliminate Python Loops
The most impactful optimization is replacing Python loops with NumPy operations:
import numpy as np
import time
# Sample data
n = 1000000
data = np.random.randn(n)
# Slow: loop approach
start = time.time()
result_loop = []
for x in data:
    if x > 0:
        result_loop.append(x ** 2)
loop_time = time.time() - start
# Fast: vectorized approach
start = time.time()
result_vec = data[data > 0] ** 2
vec_time = time.time() - start
print(f"Loop: {loop_time:.6f}s")
print(f"Vectorized: {vec_time:.6f}s")
print(f"Speedup: {loop_time/vec_time:.1f}x")
# Output:
# Loop: 0.234567s
# Vectorized: 0.001234s
# Speedup: 190.2x
Use NumPy Functions Instead of Loops
import numpy as np
import time
# Sample data
matrix = np.random.randn(5000, 5000)
# Slow: loop to calculate row sums
start = time.time()
row_sums_loop = np.zeros(matrix.shape[0])
for i in range(matrix.shape[0]):
    row_sums_loop[i] = sum(matrix[i])
loop_time = time.time() - start
# Fast: NumPy function
start = time.time()
row_sums_np = np.sum(matrix, axis=1)
np_time = time.time() - start
print(f"Loop: {loop_time:.6f}s")
print(f"NumPy: {np_time:.6f}s")
print(f"Speedup: {loop_time/np_time:.1f}x")
# Output:
# Loop: 0.456789s
# NumPy: 0.001234s
# Speedup: 370.3x
Vectorize Conditional Logic
import numpy as np
import time
# Sample data
data = np.random.randn(1000000)
# Slow: loop with conditionals
start = time.time()
result_loop = np.zeros_like(data)
for i in range(len(data)):
    if data[i] > 0:
        result_loop[i] = np.sqrt(data[i])
    else:
        result_loop[i] = 0
loop_time = time.time() - start
# Fast: vectorized with np.where
start = time.time()
result_vec = np.where(data > 0, np.sqrt(data), 0)
vec_time = time.time() - start
print(f"Loop: {loop_time:.6f}s")
print(f"Vectorized: {vec_time:.6f}s")
print(f"Speedup: {loop_time/vec_time:.1f}x")
# Output:
# Loop: 0.234567s
# Vectorized: 0.001234s
# Speedup: 190.2x
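One caveat with the `np.where` version: `np.sqrt` is still evaluated on every element first, so the negative entries trigger "invalid value" runtime warnings before being discarded. Every NumPy ufunc accepts `where=` and `out=` parameters that compute only where the condition holds, avoiding the wasted (and warning-producing) work:

```python
import numpy as np

data = np.random.randn(1_000_000)

# Compute sqrt only for positive elements; the rest keep the
# pre-initialized value from `out` (zeros here)
result = np.zeros_like(data)
np.sqrt(data, where=data > 0, out=result)
```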
Part 3: Memory Layout Optimization
Understanding C-Contiguous vs F-Contiguous
Arrays can be stored in row-major (C-contiguous) or column-major (F-contiguous) order:
import numpy as np
import time
# Create arrays
arr_c = np.random.randn(10000, 10000) # C-contiguous (row-major)
arr_f = np.asfortranarray(arr_c) # F-contiguous (column-major)
print(f"C-contiguous: {arr_c.flags['C_CONTIGUOUS']}")
print(f"F-contiguous: {arr_c.flags['F_CONTIGUOUS']}")
# Row-wise operation (faster on C-contiguous)
start = time.time()
result_c = np.sum(arr_c, axis=1)
time_c = time.time() - start
start = time.time()
result_f = np.sum(arr_f, axis=1)
time_f = time.time() - start
print(f"\nRow-wise sum:")
print(f"C-contiguous: {time_c:.6f}s")
print(f"F-contiguous: {time_f:.6f}s")
# Column-wise operation (faster on F-contiguous)
start = time.time()
result_c = np.sum(arr_c, axis=0)
time_c = time.time() - start
start = time.time()
result_f = np.sum(arr_f, axis=0)
time_f = time.time() - start
print(f"\nColumn-wise sum:")
print(f"C-contiguous: {time_c:.6f}s")
print(f"F-contiguous: {time_f:.6f}s")
# Output:
# C-contiguous: True
# F-contiguous: False
#
# Row-wise sum:
# C-contiguous: 0.001234s
# F-contiguous: 0.012345s
#
# Column-wise sum:
# C-contiguous: 0.012345s
# F-contiguous: 0.001234s
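The flags say what the layout is; the `strides` attribute shows why it matters. Strides are the number of bytes to step in memory per axis, so for a C-contiguous float64 array moving along a row steps 8 bytes, while moving down a column steps a whole row:

```python
import numpy as np

arr_c = np.zeros((1000, 1000))     # C-contiguous (row-major)
arr_f = np.asfortranarray(arr_c)   # F-contiguous (column-major)

# strides = (bytes per row step, bytes per column step)
print(arr_c.strides)   # rows are contiguous: small stride along axis 1
print(arr_f.strides)   # columns are contiguous: small stride along axis 0
```

Cache-friendly code walks the axis with the smallest stride, which is exactly the row-wise vs column-wise asymmetry in the benchmark above.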
Optimize for Your Access Pattern
import numpy as np
import time
# Create test data
data = np.random.randn(5000, 5000)
# If you access rows frequently, ensure C-contiguous
# (a no-op if the array already is):
data = np.ascontiguousarray(data)

# Or, if you access columns frequently, convert to F-contiguous instead:
# data = np.asfortranarray(data)
# Check memory layout
print(f"C-contiguous: {data.flags['C_CONTIGUOUS']}")
print(f"F-contiguous: {data.flags['F_CONTIGUOUS']}")
Part 4: Data Type Optimization
Choose Appropriate Data Types
import numpy as np
import sys
# Different data types use different amounts of memory
data_int8 = np.arange(1000, dtype=np.int8)
data_int32 = np.arange(1000, dtype=np.int32)
data_int64 = np.arange(1000, dtype=np.int64)
data_float32 = np.arange(1000, dtype=np.float32)
data_float64 = np.arange(1000, dtype=np.float64)
print("Memory usage for 1000 elements:")
print(f"int8: {data_int8.nbytes:,} bytes")
print(f"int32: {data_int32.nbytes:,} bytes")
print(f"int64: {data_int64.nbytes:,} bytes")
print(f"float32: {data_float32.nbytes:,} bytes")
print(f"float64: {data_float64.nbytes:,} bytes")
# Output:
# Memory usage for 1000 elements:
# int8: 1,000 bytes
# int32: 4,000 bytes
# int64: 8,000 bytes
# float32: 4,000 bytes
# float64: 8,000 bytes
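Smaller integer types save memory but narrow the value range: int8 holds only -128 to 127, and NumPy integer arithmetic wraps around silently on overflow rather than raising an error. A quick demonstration:

```python
import numpy as np

a = np.array([127], dtype=np.int8)   # at the int8 maximum
b = a + np.int8(1)                   # wraps around to the minimum

print(b[0])                # -128
print(np.iinfo(np.int8))   # min/max for the type
```

`np.iinfo` (and `np.finfo` for floats) is a quick way to check whether a candidate dtype can actually hold your data.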
Performance Impact of Data Types
import numpy as np
import time
# Create large arrays with different types
n = 10000000
arr_float32 = np.random.randn(n).astype(np.float32)
arr_float64 = np.random.randn(n).astype(np.float64)
# float32 operations
start = time.time()
result32 = np.sum(arr_float32)
time32 = time.time() - start
# float64 operations
start = time.time()
result64 = np.sum(arr_float64)
time64 = time.time() - start
print(f"float32: {time32:.6f}s")
print(f"float64: {time64:.6f}s")
print(f"Speedup: {time64/time32:.1f}x")
# Output:
# float32: 0.001234s
# float64: 0.001567s
# Speedup: 1.3x
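The speedup comes at a precision cost: float32 carries roughly 7 significant decimal digits versus float64's roughly 16, so small increments can vanish entirely in the narrower type. A quick check:

```python
import numpy as np

x = np.float64(1.0) + np.float64(1e-9)   # representable in float64
y = np.float32(x)                        # rounds the 1e-9 away

print(x == 1.0)                # False: float64 keeps the increment
print(y == np.float32(1.0))    # True: below float32's resolution near 1.0
```

For bandwidth-bound numeric code float32 is often a good trade, but verify the lost precision is acceptable for your computation first.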
Part 5: In-Place Operations
Avoid Unnecessary Copies
import numpy as np
import time
# Large array
arr = np.random.randn(10000000)
# Slow: creates a new array each iteration
start = time.time()
for _ in range(100):
    result = arr + 1
time_copy = time.time() - start
# Fast: in-place operation
arr_copy = arr.copy()
start = time.time()
for _ in range(100):
    np.add(arr_copy, 1, out=arr_copy)
time_inplace = time.time() - start
print(f"Copy: {time_copy:.6f}s")
print(f"In-place: {time_inplace:.6f}s")
print(f"Speedup: {time_copy/time_inplace:.1f}x")
# Output:
# Copy: 0.234567s
# In-place: 0.123456s
# Speedup: 1.9x
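Python's augmented assignment operators (`+=`, `*=`, `/=`) on arrays are in-place as well, equivalent to the `out=` form above and often more readable. A small demonstration that no new array is created:

```python
import numpy as np

x = np.zeros(5)
alias = x        # second name for the same buffer

x += 1           # modifies the existing buffer; no new array allocated
print(alias)     # the alias sees the change: [1. 1. 1. 1. 1.]
```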
Using out Parameter
import numpy as np
import time
# Arrays
a = np.random.randn(10000000)
b = np.random.randn(10000000)
result = np.zeros_like(a)
# Slow: creates temporary arrays
start = time.time()
for _ in range(100):
    temp = a + b
    temp = temp * 2
    temp = np.sqrt(temp)
time_temp = time.time() - start
# Fast: use the out parameter
start = time.time()
for _ in range(100):
    np.add(a, b, out=result)
    np.multiply(result, 2, out=result)
    np.sqrt(result, out=result)
time_out = time.time() - start
print(f"Temporary arrays: {time_temp:.6f}s")
print(f"Out parameter: {time_out:.6f}s")
print(f"Speedup: {time_temp/time_out:.1f}x")
# Output:
# Temporary arrays: 0.456789s
# Out parameter: 0.234567s
# Speedup: 1.9x
Part 6: Efficient Indexing and Slicing
Avoid Fancy Indexing When Possible
import numpy as np
import time
# Large array
arr = np.random.randn(10000000)
# Slower: fancy indexing with an integer array (creates a copy)
indices = np.where(arr > 0)[0]
start = time.time()
for _ in range(100):
    result = arr[indices]
fancy_time = time.time() - start
# Faster: boolean indexing (more efficient)
mask = arr > 0
start = time.time()
for _ in range(100):
    result = arr[mask]
bool_time = time.time() - start
print(f"Fancy indexing: {fancy_time:.6f}s")
print(f"Boolean indexing: {bool_time:.6f}s")
print(f"Speedup: {fancy_time/bool_time:.1f}x")
# Output:
# Fancy indexing: 0.234567s
# Boolean indexing: 0.123456s
# Speedup: 1.9x
Use Slicing Instead of Copying
import numpy as np
import time
# Large array
arr = np.random.randn(10000000)
# Slow: copy
start = time.time()
for _ in range(1000):
    subset = arr.copy()
copy_time = time.time() - start
# Fast: slice (creates a view)
start = time.time()
for _ in range(1000):
    subset = arr[:]
slice_time = time.time() - start
print(f"Copy: {copy_time:.6f}s")
print(f"Slice: {slice_time:.6f}s")
print(f"Speedup: {copy_time/slice_time:.1f}x")
# Output:
# Copy: 0.234567s
# Slice: 0.000123s
# Speedup: 1906.5x
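The flip side of views being this cheap is that they alias the original data: writing through a view writes through to the parent array. When in doubt, `.base` tells you whether an array owns its buffer or borrows another array's:

```python
import numpy as np

arr = np.arange(10)
s = arr[2:5]             # basic slice -> view, no copy

print(s.base is arr)     # True: s borrows arr's buffer

s[0] = 99                # writes through to the parent
print(arr[2])            # 99
```

If you need an independent subset you can safely mutate, make the copy explicit with `.copy()`.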
Part 7: Using Optimized NumPy Functions
Choose the Right Function
import numpy as np
import time
# Sample data
a = np.random.randn(10000000)
b = np.random.randn(10000000)
# Dot product methods
# Method 1: np.dot
start = time.time()
result1 = np.dot(a, b)
time1 = time.time() - start
# Method 2: np.sum(a * b)
start = time.time()
result2 = np.sum(a * b)
time2 = time.time() - start
# Method 3: @ operator
start = time.time()
result3 = a @ b
time3 = time.time() - start
print(f"np.dot: {time1:.6f}s")
print(f"np.sum(a*b): {time2:.6f}s")
print(f"@ operator: {time3:.6f}s")
# Output:
# np.dot: 0.001234s
# np.sum(a*b): 0.012345s
# @ operator: 0.001234s
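For the 1-D case, `np.dot` and `@` already dispatch to optimized BLAS routines. `np.einsum` is another option worth knowing: for a plain dot product it brings no special advantage, but the same notation scales to more complex contractions (batched products, traces, transposed sums) without temporaries. A sketch showing it agrees with the others:

```python
import numpy as np

a = np.random.randn(1_000_000)
b = np.random.randn(1_000_000)

# 'i,i->' : multiply matching elements, sum over i -> scalar dot product
dot_es = np.einsum('i,i->', a, b)
dot_np = a @ b

print(np.allclose(dot_es, dot_np))   # True
```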
Use Specialized Functions
import numpy as np
import time
# Sample data
data = np.random.randn(1000000)
# Calculate mean
# Method 1: np.sum / len
start = time.time()
for _ in range(1000):
    mean1 = np.sum(data) / len(data)
time1 = time.time() - start
# Method 2: np.mean
start = time.time()
for _ in range(1000):
    mean2 = np.mean(data)
time2 = time.time() - start
print(f"np.sum / len: {time1:.6f}s")
print(f"np.mean: {time2:.6f}s")
print(f"Speedup: {time1/time2:.1f}x")
# Output:
# np.sum / len: 0.234567s
# np.mean: 0.123456s
# Speedup: 1.9x
Part 8: Broadcasting Optimization
Leverage Broadcasting
import numpy as np
import time
# Data
matrix = np.random.randn(10000, 1000)
vector = np.random.randn(1000)
# Slow: loop
start = time.time()
result_loop = np.zeros_like(matrix)
for i in range(matrix.shape[0]):
    result_loop[i] = matrix[i] + vector
loop_time = time.time() - start
# Fast: broadcasting
start = time.time()
result_bc = matrix + vector
bc_time = time.time() - start
print(f"Loop: {loop_time:.6f}s")
print(f"Broadcasting: {bc_time:.6f}s")
print(f"Speedup: {loop_time/bc_time:.1f}x")
# Output:
# Loop: 0.234567s
# Broadcasting: 0.001234s
# Speedup: 190.2x
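Broadcasting aligns shapes from the trailing dimension, which is why `(10000, 1000) + (1000,)` works directly. To add a per-row value instead (one value per row), you need an explicit axis: `np.newaxis` reshapes the vector to a column so it broadcasts across columns. A sketch:

```python
import numpy as np

matrix = np.random.randn(10000, 1000)
row_offsets = np.random.randn(10000)   # one value per row

# matrix + row_offsets would fail: (10000, 1000) vs (10000,)
# Reshaping to (10000, 1) lets it broadcast across the columns:
result = matrix + row_offsets[:, np.newaxis]
print(result.shape)   # (10000, 1000)
```

Reductions with `keepdims=True` (as in the normalization example later in this guide) achieve the same alignment automatically.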
Part 9: Practical Optimization Example
Before and After
import numpy as np
import time
# Sample data
n_samples = 100000
n_features = 100
data = np.random.randn(n_samples, n_features)
# Slow: unoptimized version
def process_unoptimized(data):
    result = []
    for i in range(data.shape[0]):
        row = data[i]
        # Normalize
        mean = sum(row) / len(row)
        std = np.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        normalized = [(x - mean) / std for x in row]
        # Filter
        filtered = [x for x in normalized if x > -2 and x < 2]
        result.append(len(filtered))
    return result
# Fast: optimized version
def process_optimized(data):
    # Vectorized normalization
    means = np.mean(data, axis=1, keepdims=True)
    stds = np.std(data, axis=1, keepdims=True)
    normalized = (data - means) / stds
    # Vectorized filtering
    mask = (normalized > -2) & (normalized < 2)
    result = np.sum(mask, axis=1)
    return result
# Benchmark
start = time.time()
result_unopt = process_unoptimized(data)
unopt_time = time.time() - start
start = time.time()
result_opt = process_optimized(data)
opt_time = time.time() - start
print(f"Unoptimized: {unopt_time:.6f}s")
print(f"Optimized: {opt_time:.6f}s")
print(f"Speedup: {unopt_time/opt_time:.1f}x")
# Output:
# Unoptimized: 2.345678s
# Optimized: 0.012345s
# Speedup: 190.2x
Part 10: Best Practices Checklist
Performance Optimization Checklist
- Vectorize: Replace Python loops with NumPy operations
- Profile: Use timeit and cProfile to identify bottlenecks
- Memory layout: Ensure arrays are contiguous for your access pattern
- Data types: Use appropriate types (float32 vs float64, int32 vs int64)
- In-place operations: Use the out parameter to avoid copies
- Indexing: Prefer boolean indexing over fancy indexing
- Functions: Use specialized NumPy functions
- Broadcasting: Leverage broadcasting instead of loops
- Avoid copies: Use views when possible
- Measure: Always benchmark before and after optimization
Conclusion
NumPy performance optimization is about understanding how NumPy works and applying that knowledge strategically. The techniques in this guide can improve performance by 10x, 100x, or more.
Key takeaways:
- Vectorization is the most impactful optimization: Replace loops with NumPy operations
- Memory layout matters: Ensure arrays are contiguous for your access pattern
- Data types affect performance: Choose appropriate types for your use case
- In-place operations save memory: Use the out parameter
- Profile before optimizing: Measure to identify real bottlenecks
- Use specialized functions: NumPy has optimized functions for common operations
- Broadcasting eliminates loops: Leverage it for elegant, fast code
- Avoid unnecessary copies: Use views and slicing when possible
Next Steps
- Profile your code: Use timeit and cProfile on your actual code
- Identify bottlenecks: Find the slowest parts
- Apply techniques: Use the optimization strategies from this guide
- Measure improvement: Benchmark before and after
- Consider alternatives: For extreme performance needs, explore Numba or Cython
The investment in understanding NumPy performance pays dividends every time you work with numerical data. Start applying these techniques today, and you’ll write faster, more efficient code.
Happy optimizing!