Edge AI: Running Models in the Browser
The paradigm of machine learning deployment is shifting. For years, AI applications relied on cloud infrastructure, sending user data to remote servers for inference. This approach has limitations: latency, privacy concerns, offline unavailability, and infrastructure costs. Edge ML represents a fundamental change in this model, bringing machine learning capabilities directly to users’ devices.
Today, modern browsers can run sophisticated machine learning models efficiently. This democratizes AI development and creates new possibilities for web applications. Whether you’re building real-time computer vision features, on-device recommendations, or privacy-preserving analytics, browser-based ML offers compelling advantages.
What is Edge ML?
Edge ML refers to executing machine learning models on edge devices (smartphones, browsers, IoT devices, or other client-side hardware) rather than relying on centralized cloud servers for inference.
Why Edge ML Matters
Reduced Latency: No network round-trip. Inference happens instantly on the user’s device.
Enhanced Privacy: User data never leaves the device. Sensitive information remains completely under user control, crucial for healthcare, finance, and personal data applications.
Offline Functionality: Applications work without internet connectivity. Users aren’t stranded without service.
Cost Efficiency: Eliminate server inference costs. Scale to millions of users without proportional server infrastructure.
Better User Experience: Real-time responses create more responsive, intuitive applications.
Regulatory Compliance: Meet data residency requirements and GDPR/CCPA compliance more easily.
Browser-Based Machine Learning Technologies
TensorFlow.js: Full-Featured ML in the Browser
TensorFlow.js is Google’s comprehensive JavaScript library for machine learning in browsers and Node.js. It offers both high-level APIs for quick prototyping and low-level APIs for advanced applications.
Key Features:
- Pre-trained models for common tasks (image classification, pose detection, text analysis)
- Training capabilities directly in the browser
- Support for multiple backends (WebGL, WebAssembly, WebGPU)
- Extensive documentation and community support
Basic Usage:
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';
// Load a pre-trained model
const model = await mobilenet.load();
// Get image from canvas or video
const image = document.getElementById('image');
// Run inference
const predictions = await model.classify(image);
console.log(predictions);
// Output: [{ className: 'cat', probability: 0.95 }, ...]
ONNX Runtime Web: Framework-Agnostic Model Deployment
ONNX (Open Neural Network Exchange) is a standardized format for representing machine learning models. ONNX Runtime Web (the successor to the earlier ONNX.js library) brings this standard to web browsers, allowing models trained in PyTorch, TensorFlow, or other frameworks to run seamlessly in JavaScript.
Advantages:
- Train in Python, deploy in JavaScript without conversion headaches
- Framework-agnostic compatibility
- Support for diverse model architectures
- Multiple backend options for optimization
Basic Usage:
import * as ort from 'onnxruntime-web';
// Configure ONNX Runtime
ort.env.wasm.wasmPaths = '/path/to/onnx/';
// Load model
const session = await ort.InferenceSession.create('model.onnx');
// Prepare input (example for image classification; the feed name 'input'
// and output name 'output' below must match the names stored in the model)
const imageData = getImageData(); // Your image preprocessing
const input = new ort.Tensor('float32', imageData, [1, 3, 224, 224]);
// Run inference
const results = await session.run({ input: input });
console.log(results.output.data);
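The getImageData call above is left to the reader. One common preprocessing step is converting canvas RGBA pixels into the planar NCHW float layout implied by the [1, 3, 224, 224] shape. A minimal sketch; the ImageNet mean/std normalization constants are an assumption about the model:

```javascript
// Convert RGBA pixel data (as returned by canvas getImageData().data)
// into planar NCHW float32 layout: all R values, then all G, then all B.
// Mean/std are the common ImageNet statistics - adjust for your model.
function rgbaToNCHW(pixels, width, height) {
  const mean = [0.485, 0.456, 0.406];
  const std = [0.229, 0.224, 0.225];
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {
      // Scale 0-255 to 0-1, then normalize per channel
      out[c * plane + i] = (pixels[i * 4 + c] / 255 - mean[c]) / std[c];
    }
  }
  return out;
}
```

The returned Float32Array can be passed directly to the ort.Tensor constructor shown above.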
WebGPU: High-Performance GPU Compute
WebGPU is an emerging web API that gives browsers direct access to the GPU. Unlike WebGL (which targets rendering), WebGPU enables general-purpose GPU computing, ideal for ML inference.
Why WebGPU Matters for ML:
- High Performance: Direct GPU utilization for matrix operations and neural network computations
- Modern API: Designed with modern hardware in mind, unlike WebGL
- Explicit Control: Fine-grained control over GPU resources and execution
- Parallel Processing: Harness modern multi-core GPUs for accelerated inference
Matrix Multiplication with WebGPU
// Note: WebGPU is still stabilizing; this is illustrative.
// navigator.gpu is undefined in browsers without WebGPU support.
if (!navigator.gpu) throw new Error('WebGPU not supported');
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
// Create GPU buffers for matrices (matrixA: a Float32Array of input data)
const aBuffer = device.createBuffer({
size: matrixA.byteLength,
mappedAtCreation: true,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
new Float32Array(aBuffer.getMappedRange()).set(matrixA);
aBuffer.unmap();
// Create compute shader
const shaderModule = device.createShaderModule({
code: `
@group(0) @binding(0) var<storage, read_write> output : array<f32>;
@group(0) @binding(1) var<storage, read> input1 : array<f32>;
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {
// Matrix multiplication compute shader logic
}
`,
});
// Run computation
const pipeline = device.createComputePipeline({
layout: 'auto',
compute: { module: shaderModule, entryPoint: 'main' },
});
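The pipeline above is created but never dispatched. The remaining steps can be sketched as follows: bind the two buffers the shader declares, dispatch enough 16×16 workgroups to cover an n×n output, and copy the result into a mappable staging buffer. The outBuffer argument is an assumption, a storage buffer created like aBuffer but with COPY_SRC usage:

```javascript
// One 16x16 workgroup covers a 16x16 tile of the output, so round up.
function workgroupCount(extent, workgroupSize) {
  return Math.ceil(extent / workgroupSize);
}

async function runMatMul(device, pipeline, aBuffer, outBuffer, n) {
  // Bind buffers to the slots declared in the shader above
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: outBuffer } }, // output
      { binding: 1, resource: { buffer: aBuffer } },   // input1
    ],
  });
  // Staging buffer the CPU is allowed to map and read back
  const readBuffer = device.createBuffer({
    size: n * n * 4, // n*n f32 values
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroupCount(n, 16), workgroupCount(n, 16));
  pass.end();
  encoder.copyBufferToBuffer(outBuffer, 0, readBuffer, 0, n * n * 4);
  device.queue.submit([encoder.finish()]);
  await readBuffer.mapAsync(GPUMapMode.READ);
  return new Float32Array(readBuffer.getMappedRange().slice(0));
}
```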
Practical Implementation Patterns
Real-Time Image Classification
import * as tf from '@tensorflow/tfjs';
import * as cocoSsd from '@tensorflow-models/coco-ssd';
async function setupCamera() {
const video = document.getElementById('video');
const stream = await navigator.mediaDevices.getUserMedia({
video: { width: 640, height: 480 }
});
video.srcObject = stream;
return new Promise(resolve => {
video.onloadedmetadata = () => {
video.play(); // start the stream before running detection
resolve(video);
};
});
}
async function detectObjects() {
const model = await cocoSsd.load();
const video = await setupCamera();
// Poll roughly ten times per second; requestAnimationFrame also works
setInterval(async () => {
const predictions = await model.detect(video);
predictions.forEach(prediction => {
console.log(`${prediction.class}: ${(prediction.score * 100).toFixed(1)}%`);
});
}, 100);
}
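Logging detections is rarely the end goal. The results from the loop above can be drawn onto a 2D canvas overlay sized to match the video; the overlay canvas itself is an assumption about your page:

```javascript
// Draw each coco-ssd detection ({ bbox: [x, y, w, h], class, score })
// onto a 2D canvas context.
function drawPredictions(predictions, ctx) {
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
  ctx.strokeStyle = 'lime';
  ctx.fillStyle = 'lime';
  ctx.font = '14px sans-serif';
  for (const { bbox: [x, y, w, h], class: label, score } of predictions) {
    ctx.strokeRect(x, y, w, h);
    // Keep the label inside the canvas when the box touches the top edge
    ctx.fillText(`${label} ${(score * 100).toFixed(1)}%`, x, y > 14 ? y - 4 : y + 14);
  }
}
```

Inside the setInterval callback, replace the console.log loop with drawPredictions(predictions, overlayCanvas.getContext('2d')).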
Pose Detection for Movement Analysis
import * as tf from '@tensorflow/tfjs';
import * as posenet from '@tensorflow-models/posenet';
async function detectPose(imageElement) {
const model = await posenet.load({
architecture: 'MobileNetV1',
outputStride: 16,
inputResolution: { width: 640, height: 480 },
multiplier: 0.75,
quantBytes: 2,
});
const pose = await model.estimateSinglePose(imageElement, {
flipHorizontal: false,
});
// Access keypoints
pose.keypoints.forEach(keypoint => {
if (keypoint.score > 0.5) {
console.log(`${keypoint.part}: (${keypoint.position.x}, ${keypoint.position.y})`);
}
});
}
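For movement analysis, it is usually joint angles rather than raw pixel positions that matter. A small helper, sketched here for the keypoint format shown above, computes the angle at a joint from three keypoint positions:

```javascript
// Angle at joint b (in degrees) formed by the segments b->a and b->c,
// e.g. the elbow angle from shoulder, elbow, and wrist positions.
function jointAngle(a, b, c) {
  const v1 = { x: a.x - b.x, y: a.y - b.y };
  const v2 = { x: c.x - b.x, y: c.y - b.y };
  const dot = v1.x * v2.x + v1.y * v2.y;
  const mag = Math.hypot(v1.x, v1.y) * Math.hypot(v2.x, v2.y);
  // Clamp to guard against floating-point drift outside [-1, 1]
  return (Math.acos(Math.min(1, Math.max(-1, dot / mag))) * 180) / Math.PI;
}
```

With the PoseNet output above, this would be called as jointAngle(shoulder.position, elbow.position, wrist.position), where each argument is the position of a keypoint whose score passed the threshold.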
Text Processing with ONNX
import * as ort from 'onnxruntime-web';
async function runSentimentAnalysis(text) {
const session = await ort.InferenceSession.create('sentiment-model.onnx');
// Tokenize text (in practice, use proper tokenizer)
const tokenIds = tokenizeText(text);
// Prepare input tensor
const inputTensor = new ort.Tensor('int64',
BigInt64Array.from(tokenIds.map(BigInt)),
[1, tokenIds.length]
);
// Run inference
const results = await session.run({
input_ids: inputTensor
});
// Process output: the raw values are logits, so normalize with
// softmax before reading the larger one as a confidence
// (assumes the model's output tensor is named 'logits')
const logits = Array.from(results.logits.data, Number);
const exps = logits.map(x => Math.exp(x));
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map(e => e / total);
const sentiment = probs[1] > probs[0] ? 'positive' : 'negative';
return { sentiment, confidence: Math.max(...probs) };
}
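The tokenizeText helper above is deliberately unspecified: a transformer model must be fed by the exact tokenizer and vocabulary it was trained with. Purely to make the sketch concrete, a toy whitespace tokenizer might look like this; the vocabulary is hypothetical and not compatible with any real model:

```javascript
// Toy whitespace tokenizer: lowercases, splits on whitespace, and maps
// unknown words to an [UNK] id. Illustration only; real models ship
// their own subword tokenizers (BPE, WordPiece, etc.).
function tokenizeText(text, vocab = { '[UNK]': 0 }) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .map(word => (word in vocab ? vocab[word] : vocab['[UNK]']));
}
```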
Performance Considerations
Backend Selection
WebGL Backend:
- Wide browser support (most devices)
- Moderate performance
- Good for medium-sized models
- Default for TensorFlow.js on most platforms
WebAssembly Backend:
- Broader compatibility than WebGPU
- CPU-based, but optimized
- Suitable for devices where GPU acceleration is unavailable
- Predictable performance across devices
WebGPU Backend:
- Highest performance (where available)
- Limited browser support currently
- Future-proofed for modern hardware
- Best for GPU-intensive operations
Optimization Strategies
Model Size Reduction:
// Load a reduced-width MobileNet variant for faster inference. Note that
// alpha is a width multiplier (fewer filters per layer), not quantization;
// quantization itself is applied when a model is converted for the web.
const model = await mobilenet.load({
version: 2,
alpha: 0.5, // Smaller model
});
Batch Processing:
// Kick off several classifications concurrently; note this is not true
// batching, which would stack the inputs into a single input tensor
const images = [img1, img2, img3];
const predictions = await Promise.all(
images.map(img => model.classify(img))
);
Caching:
// Load a previously saved model from IndexedDB for faster startup;
// requires an earlier model.save('indexeddb://my-model'), otherwise
// this rejects and you should fall back to fetching from a URL
const model = await tf.loadLayersModel(
'indexeddb://my-model'
);
Challenges and Limitations
Model Size: Browser download and memory constraints limit model complexity. Typical practical limit is 50-200MB depending on device.
Browser Support: WebGPU and advanced features have limited support. Progressive enhancement is necessary for broad compatibility.
User Device Variability: Performance varies significantly across devices. Testing on diverse hardware is essential.
Model Privacy Trade-off: While data stays local, the model itself is exposed in client-side code, which is less critical for public models but important for proprietary algorithms.
Choosing the Right Technology
Use TensorFlow.js when:
- You want pre-trained models and quick setup
- You need a mature, well-documented ecosystem
- You’re building computer vision or pose estimation features
- Community support and examples matter
Use ONNX Runtime Web when:
- You have models trained in PyTorch or other frameworks
- You need framework-agnostic deployment
- You want standardized model interchange
- You’re managing diverse model sources
Use WebGPU when:
- Maximum performance is critical
- Your target audience uses modern browsers
- You’re building compute-intensive applications
- You’re comfortable with cutting-edge technologies
The Future of Edge ML
Browser-based ML is rapidly maturing. Expect:
- WebGPU Widespread Adoption: As browsers stabilize WebGPU support, GPU acceleration will become the default
- Larger Models: Optimizations will enable larger, more capable models in browsers
- Integrated ML APIs: Browsers may provide native ML capabilities similar to the Web Audio API (the W3C WebNN proposal points in this direction)
- Privacy-First Architecture: Users gain more control over local data processing
Getting Started
1. Choose Your Use Case: Image classification? Pose detection? Text analysis?
2. Select a Framework: TensorFlow.js for vision tasks, ONNX Runtime Web for flexibility, native WebGPU for maximum control
3. Start with Pre-trained Models: Don’t build from scratch; leverage existing models
4. Test Performance: Profile on representative devices
5. Implement Graceful Degradation: Ensure your app works across different browser capabilities
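Step 5 can start from plain feature detection. A sketch of the checks; the fallback order an app builds on top of them is an application-level choice, not part of any spec:

```javascript
// Plain feature tests for the three inference backends discussed above.
function detectCapabilities() {
  return {
    // navigator.gpu exists only where WebGPU is available
    webgpu: typeof navigator !== 'undefined' && 'gpu' in navigator,
    webgl: (() => {
      try {
        const canvas = document.createElement('canvas');
        return !!(canvas.getContext('webgl2') || canvas.getContext('webgl'));
      } catch {
        return false; // no DOM, or WebGL unavailable
      }
    })(),
    wasm: typeof WebAssembly === 'object',
  };
}

// An app can then pick its richest supported path, e.g.:
// const caps = detectCapabilities();
// const backend = caps.webgpu ? 'webgpu' : caps.webgl ? 'webgl' : 'wasm';
```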
Conclusion
Edge ML in browsers represents a paradigm shift in AI deployment. By running models locally, developers can build faster, more private, and more resilient applications. Whether you’re implementing real-time image analysis, building accessible AI features, or creating offline-capable experiences, browser-based machine learning offers compelling advantages.
The technologies are mature enough for production use today, with continued improvements ahead. Start exploring TensorFlow.js or ONNX Runtime Web in your next project and experience the benefits of bringing AI capabilities directly to your users.