Introduction
Privacy concerns in machine learning have become paramount as organizations handle increasingly sensitive data. Regulations like GDPR, CCPA, and HIPAA require careful handling of personal information. Privacy-preserving machine learning (PPML) techniques enable organizations to extract value from data while protecting individual privacy.
This comprehensive guide explores PPML techniques, their implementation, and real-world applications.
Understanding Privacy-Preserving ML
Why Privacy Matters
Traditional machine learning requires centralized data collection, creating privacy risks:
- Data Breaches - Centralized data stores are attractive targets
- Regulatory Compliance - GDPR, CCPA, HIPAA impose strict requirements
- User Trust - Privacy violations damage trust and reputation
- Data Silos - Privacy concerns prevent data sharing
PPML Techniques Overview
| Technique | Use Case | Complexity | Overhead |
|---|---|---|---|
| Federated Learning | Distributed training | Medium | Low |
| Differential Privacy | Noise-based privacy | Low | Medium |
| Secure Multi-Party Computation | Secure collaboration | High | High |
| Homomorphic Encryption | Computation on encrypted data | Very High | Very High |
| Split Learning | Vertical/horizontal partitioning | Medium | Medium |
Federated Learning
Concept
Federated learning trains models across decentralized data sources without sharing raw data.
┌──────────┐    ┌──────────┐    ┌──────────┐
│ Device A │    │ Device B │    │ Device C │
│  [Data]  │    │  [Data]  │    │  [Data]  │
└────┬─────┘    └────┬─────┘    └────┬─────┘
     │               │               │
     ▼               ▼               ▼
┌────────────────────────────────────────┐
│         Local Model Training           │
│       (Gradient updates only)          │
└──────────────────┬─────────────────────┘
                   │
                   ▼
┌────────────────────────────────────────┐
│          Aggregation Server            │
│        (FedAvg, FedProx, etc.)         │
└──────────────────┬─────────────────────┘
                   │
                   ▼
┌────────────────────────────────────────┐
│          Global Model Update           │
└────────────────────────────────────────┘
Implementation
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from collections import OrderedDict
class FederatedClient:
def __init__(self, model, client_id, data_loader):
self.model = model
self.client_id = client_id
self.data_loader = data_loader
self.criterion = nn.CrossEntropyLoss()
self.optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
def local_train(self, epochs=5):
"""Train model locally on client data."""
self.model.train()
for epoch in range(epochs):
for images, labels in self.data_loader:
self.optimizer.zero_grad()
outputs = self.model(images)
loss = self.criterion(outputs, labels)
loss.backward()
self.optimizer.step()
return self.get_model_update()
    def get_model_update(self):
        """Return a copy of the model weights (never the data)."""
        return OrderedDict(
            (k, v.detach().clone())
            for k, v in self.model.state_dict().items()
        )
def set_model_weights(self, weights):
"""Apply received global model weights."""
self.model.load_state_dict(weights)
class FederatedServer:
def __init__(self, model):
self.global_model = model
self.clients = []
def add_client(self, client):
self.clients.append(client)
def aggregate(self, client_updates, weights=None):
"""Federated Averaging (FedAvg)."""
if weights is None:
weights = [1.0 / len(client_updates)] * len(client_updates)
aggregated = OrderedDict()
for key in client_updates[0].keys():
aggregated[key] = sum(
w * update[key]
for w, update in zip(weights, client_updates)
) / sum(weights)
self.global_model.load_state_dict(aggregated)
return aggregated
def train_round(self, epochs=5):
"""Execute one round of federated training."""
# Each client trains locally
client_updates = []
client_weights = []
for client in self.clients:
client.set_model_weights(
self.global_model.state_dict()
)
update = client.local_train(epochs=epochs)
client_updates.append(update)
# Weight by data size
client_weights.append(len(client.data_loader.dataset))
# Aggregate updates
aggregated = self.aggregate(client_updates, client_weights)
return aggregated
# Usage (SimpleNeuralNetwork and get_client_data are application-specific
# placeholders for your model class and local data loading)
global_model = SimpleNeuralNetwork()
server = FederatedServer(global_model)
# Add clients (each with local data)
for i in range(10):
client_data = get_client_data(i) # Local data
client = FederatedClient(
SimpleNeuralNetwork(),
f"client_{i}",
client_data
)
server.add_client(client)
# Run federated training rounds
for round_num in range(100):
    server.train_round(epochs=5)
Differential Privacy Integration
import numpy as np
class DPFederatedClient(FederatedClient):
def __init__(self, model, client_id, data_loader, epsilon=1.0):
super().__init__(model, client_id, data_loader)
self.epsilon = epsilon
    def add_noise_to_gradient(self, gradient, sensitivity=1.0, delta=1e-5):
        """Add Gaussian noise for (epsilon, delta)-differential privacy."""
        # Gaussian mechanism: the scale depends on delta and is divided by epsilon
        scale = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / self.epsilon
        noise = np.random.normal(0, scale, gradient.shape)
        return gradient + noise
def local_train(self, epochs=5):
self.model.train()
for epoch in range(epochs):
for images, labels in self.data_loader:
self.optimizer.zero_grad()
outputs = self.model(images)
loss = self.criterion(outputs, labels)
loss.backward()
# Clip gradients and add noise
with torch.no_grad():
for param in self.model.parameters():
if param.grad is not None:
                            # Clip in place (torch.clamp alone returns a new
                            # tensor and would discard the result)
                            param.grad.clamp_(min=-1.0, max=1.0)
# Add noise
param.grad += torch.from_numpy(
np.random.normal(0, 0.1, param.grad.shape)
).float()
self.optimizer.step()
return self.get_model_update()
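The clip-then-noise step can also be written against a whole model's gradient at once. Below is a self-contained sketch on a toy model (the model shape, clip norm, and noise scale are arbitrary choices for illustration); note that rigorous DP-SGD clips *per-example* gradients, whereas this sketch, like the class above, clips the batch gradient:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))

loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

clip_norm, noise_std = 1.0, 0.1
with torch.no_grad():
    # Scale the gradient so its global L2 norm is at most clip_norm
    total = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
    scale = min(1.0, clip_norm / (total.item() + 1e-12))
    for p in model.parameters():
        p.grad.mul_(scale)
        # Gaussian noise calibrated to the clipping bound
        p.grad.add_(noise_std * torch.randn_like(p.grad))
```

In practice a library such as Opacus handles per-example clipping and the matching privacy accounting.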
Differential Privacy
Concept
Differential privacy adds calibrated noise to data or results to provide mathematical privacy guarantees.
import numpy as np
class DifferentialPrivacy:
def __init__(self, epsilon=1.0, delta=1e-5):
self.epsilon = epsilon
self.delta = delta
def laplace_mechanism(self, true_value, sensitivity):
"""Add Laplace noise for differential privacy."""
scale = sensitivity / self.epsilon
noise = np.random.laplace(0, scale)
return true_value + noise
def gaussian_mechanism(self, true_value, sensitivity):
"""Add Gaussian noise for (ฮต, ฮด)-differential privacy."""
scale = sensitivity * np.sqrt(2 * np.log(1.25 / self.delta)) / self.epsilon
noise = np.random.normal(0, scale)
return true_value + noise
def exponential_mechanism(self, candidates, utility_fn, sensitivity):
"""Select from candidates with exponential mechanism."""
        utilities = np.array([utility_fn(c) for c in candidates])
exp_utils = np.exp(self.epsilon * utilities / (2 * sensitivity))
probs = exp_utils / exp_utils.sum()
return np.random.choice(candidates, p=probs)
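To make the Laplace mechanism concrete, here is a minimal standalone sketch of a noisy counting query (the `noisy_count` helper and the fixed seed are illustrative). Counting queries have sensitivity 1, so the noise scale is simply 1/ε: a smaller ε means more noise and stronger privacy.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(true_count, epsilon, sensitivity=1.0):
    # Counting queries have sensitivity 1: adding or removing one person
    # changes the count by at most 1
    return true_count + rng.laplace(0, sensitivity / epsilon)

# Release a differentially private version of a count of 1000
release = noisy_count(1000, epsilon=0.5)
```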
Privacy Budget
class PrivacyBudget:
def __init__(self, initial_epsilon=10.0):
self.initial_epsilon = initial_epsilon
self.spent = 0
def consume(self, epsilon):
"""Consume privacy budget."""
self.spent += epsilon
def remaining(self):
"""Return remaining privacy budget."""
return max(0, self.initial_epsilon - self.spent)
def is_exhausted(self):
return self.spent >= self.initial_epsilon
Secure Multi-Party Computation
Concept
SMPC enables multiple parties to jointly compute a function while keeping inputs private.
class SecureMultiPartyComputation:
"""Simplified SMPC using secret sharing."""
@staticmethod
    def share_secret(secret, num_shares):
        """Split secret into additive shares (mod 100; secret must be in [0, 100))."""
        import random
        shares = [random.randrange(100) for _ in range(num_shares - 1)]
        shares.append((secret - sum(shares)) % 100)
        return shares
@staticmethod
def reconstruct(shares):
"""Reconstruct secret from shares."""
return sum(shares) % 100
    @staticmethod
    def add_secure(shares_a, shares_b):
        """Add two shared values: each party adds its own pair of shares locally."""
        return [(a + b) % 100 for a, b in zip(shares_a, shares_b)]
    @staticmethod
    def multiply_secure(share1, share2):
        """Multiply two shared values.

        Secure multiplication cannot be done locally on additive shares;
        it requires precomputed Beaver triples and a communication round.
        """
        raise NotImplementedError("requires Beaver triples")
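The additive scheme is easiest to see end to end. This standalone sketch (the prime modulus `P` and the `share`/`reconstruct` helper names are illustrative; real deployments use a large prime rather than the toy mod-100 arithmetic above) shows that shared values can be summed without any party learning the inputs:

```python
import random

P = 2_147_483_647  # prime modulus for share arithmetic (illustrative choice)

def share(secret, n):
    """Split a secret into n additive shares modulo P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

a, b = 42, 17
sa, sb = share(a, 3), share(b, 3)
# Each party adds its own pair of shares locally; no single share reveals
# anything, yet the combined shares reconstruct to the sum
sc = [(x + y) % P for x, y in zip(sa, sb)]
assert reconstruct(sc) == (a + b) % P
```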
Secure Aggregation
class SecureAggregation:
"""Secure aggregation for federated learning."""
def __init__(self, threshold=3):
self.threshold = threshold # Minimum participants
def mask_update(self, update, client_id, seed):
"""Mask local update with client-specific noise."""
np.random.seed(seed + client_id)
mask = np.random.randn(*update.shape)
masked = update + mask
return masked, mask
    def aggregate(self, masked_updates, seeds):
        """Aggregate masked updates.

        In a full protocol (e.g. Bonawitz et al.'s secure aggregation),
        clients agree on pairwise masks that cancel in the sum; this
        sketch assumes the masks have already cancelled.
        """
        return sum(masked_updates) / len(masked_updates)
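The cancellation trick is worth seeing in numbers. In this standalone sketch (client count, update size, and the seed are arbitrary), each pair of clients i < j shares a mask that i adds and j subtracts, so the server's sum of masked updates equals the true sum even though every individual update it sees is noise:

```python
import numpy as np

rng = np.random.default_rng(42)
num_clients = 3
updates = [rng.normal(size=4) for _ in range(num_clients)]

# Clients i < j agree (e.g. via a key exchange) on a shared mask m_ij;
# client i adds it and client j subtracts it
masks = {(i, j): rng.normal(size=4)
         for i in range(num_clients) for j in range(i + 1, num_clients)}

masked = []
for i in range(num_clients):
    u = updates[i].copy()
    for j in range(num_clients):
        if i < j:
            u += masks[(i, j)]
        elif j < i:
            u -= masks[(j, i)]
    masked.append(u)

# The server sees only masked updates, yet their sum is exact
assert np.allclose(sum(masked), sum(updates))
```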
Homomorphic Encryption
Concept
Homomorphic encryption allows computations to be performed directly on encrypted data, so a server can produce encrypted results without ever seeing the plaintext inputs.
# Using TenSEAL library for CKKS scheme
import tenseal as ts
class HomomorphicEncryption:
    def __init__(self, poly_modulus_degree=8192, scale=2**40):
        self.poly_modulus_degree = poly_modulus_degree
        self.scale = scale
    def create_context(self):
        """Create a TenSEAL CKKS context."""
        context = ts.context(
            ts.SCHEME_TYPE.CKKS,
            poly_modulus_degree=self.poly_modulus_degree,
            coeff_mod_bit_sizes=[60, 40, 40, 60]
        )
        context.global_scale = self.scale
        context.generate_galois_keys()
        return context
def encrypt_vector(self, context, values):
"""Encrypt a vector of values."""
return ts.ckks_vector(context, values)
def compute_on_encrypted(self, enc_vector):
"""Perform computation on encrypted vector."""
# Example: multiply by 2 and add 1
result = enc_vector * 2
result = result + 1
return result
def decrypt(self, encrypted):
"""Decrypt result."""
return encrypted.decrypt()
Split Learning
Concept
Split learning partitions a neural network between clients and a server: clients run the early layers on raw data locally and share only the intermediate activations ("smashed data").
class SplitNetwork:
"""Client and server portions of split neural network."""
class ClientPart(nn.Module):
def __init__(self):
super().__init__()
self.conv = nn.Conv2d(3, 16, 3)
self.relu = nn.ReLU()
def forward(self, x):
x = self.conv(x)
x = self.relu(x)
return x # Send to server
class ServerPart(nn.Module):
def __init__(self):
super().__init__()
self.pool = nn.AdaptiveAvgPool2d((1, 1))
self.fc = nn.Linear(16, 10)
def forward(self, x):
x = self.pool(x)
x = x.view(x.size(0), -1)
x = self.fc(x)
return x
class SplitClient:
    def __init__(self):
        self.model = SplitNetwork.ClientPart()
        self.activations = None
    def forward(self, x):
        """Compute intermediate activations and send a detached copy to the server."""
        self.activations = self.model(x)
        return self.activations.detach().requires_grad_(True)
    def backward(self, grad):
        """Resume backpropagation from the gradient returned by the server."""
        self.activations.backward(grad)
class SplitServer:
    def __init__(self):
        self.model = SplitNetwork.ServerPart()
    def forward(self, activations, labels):
        """Compute on client activations; return output and the gradient to send back."""
        output = self.model(activations)
        loss = nn.CrossEntropyLoss()(output, labels)
        loss.backward()
        return output, activations.grad
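One full training step of the split protocol looks like this. The sketch below is self-contained (the toy `nn.Linear` layers, shapes, and learning rates are illustrative) and uses the standard autograd pattern: the client sends detached activations, the server backpropagates to get the gradient with respect to them, and the client resumes backprop from that gradient:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy split: the client holds the first layer, the server holds the head
client_net = nn.Linear(8, 4)
server_net = nn.Linear(4, 2)
client_opt = torch.optim.SGD(client_net.parameters(), lr=0.1)
server_opt = torch.optim.SGD(server_net.parameters(), lr=0.1)

x = torch.randn(16, 8)          # raw data never leaves the client
y = torch.randint(0, 2, (16,))  # labels held by the server in this variant

# Client forward: send detached activations, keep the local graph alive
acts = client_net(x)
smashed = acts.detach().requires_grad_(True)

# Server forward/backward: computes the loss and the gradient w.r.t. the
# smashed data, which is all it sends back to the client
out = server_net(smashed)
loss = nn.CrossEntropyLoss()(out, y)
server_opt.zero_grad()
loss.backward()
server_opt.step()

# Client backward: resume backprop from the received gradient
client_opt.zero_grad()
acts.backward(smashed.grad)
client_opt.step()
```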
Implementation Best Practices
Architecture Selection
Choose techniques based on your requirements:
- Federated Learning: When data is distributed across devices
- Differential Privacy: When statistical queries are needed
- SMPC: When multiple parties need to collaborate
- Homomorphic Encryption: When computation on encrypted data is required
Privacy Budget Management
class PrivacyBudgetExhausted(Exception):
    """Raised when the configured privacy budget would be exceeded."""
class PrivacyAccountant:
def __init__(self, target_epsilon=8.0):
self.target_epsilon = target_epsilon
self.spent = 0.0
self.history = []
def step(self, noise_multiplier=1.1, sample_rate=0.01):
"""Account for privacy spend in one training step."""
        # Crude stand-in for RDP/moments accounting: per-step spend grows
        # with the sampling rate and shrinks with the noise multiplier.
        # Use a real accountant (e.g. Opacus's RDPAccountant) in practice.
        epsilon = sample_rate ** 2 / (2 * noise_multiplier ** 2)
self.spent += epsilon
self.history.append(self.spent)
if self.spent > self.target_epsilon:
raise PrivacyBudgetExhausted(
f"Privacy budget exhausted: {self.spent}/{self.target_epsilon}"
)
return self.spent
Conclusion
Privacy-preserving machine learning techniques enable organizations to build AI systems while protecting individual privacy. Each technique has strengths and trade-offs:
- Federated Learning: Best for distributed data scenarios
- Differential Privacy: Best for statistical analysis
- SMPC: Best for multi-party collaboration
- Homomorphic Encryption: Best for sensitive computation
By understanding these techniques and their trade-offs, you can build AI systems that respect privacy while delivering value.