Introduction
The browser has evolved from a document viewer into a capable AI runtime. Through WebGPU and its compute shaders, neural networks can now execute directly in the browser on the user’s GPU. Once the model files have been downloaded, inference runs locally, so prompts and outputs are not sent to external servers. Libraries like Transformers.js, WebLLM, and ONNX Runtime Web provide JavaScript APIs that make this accessible to frontend developers.
This guide covers the WebGPU API for AI workloads, provides runnable JavaScript code examples using Transformers.js and WebLLM, documents browser support across Chrome, Firefox, and Safari, and includes a complete single-page chat application that runs an LLM entirely in-browser.
Architecture Overview
flowchart TD
A[JavaScript Application] --> B{AI Library}
B --> C[Transformers.js]
B --> D[WebLLM]
B --> E[ONNX Runtime Web]
C --> F[ONNX Runtime WebGPU Provider]
D --> G[MLC Compiler<br/>WebGPU Kernels]
E --> F
F --> H[WebGPU API]
G --> H
H --> I[GPU Compute Shaders]
I --> J[GPU VRAM<br/>Model Weights + KV Cache]
H --> K[GPU Buffer Management]
C -.->|if WebGPU unavailable| L[WASM Fallback]
At the top, JavaScript application code calls high-level pipeline APIs, and the library translates these calls into WebGPU compute operations. When WebGPU is unavailable (older devices or browsers without support), Transformers.js can run the same pipelines on its WASM-based CPU backend, at reduced performance.
WebGPU Browser Support (2026)
| Browser | Minimum Version | WebGPU Status | Notes |
|---|---|---|---|
| Chrome | 113+ | Full support | Enabled by default on Windows, macOS, and ChromeOS since 113; Android followed in 121 |
| Edge | 113+ | Full support | Chromium-based; matches Chrome behavior |
| Firefox | 141+ (Win/Mac), 147+ (Linux) | Supported | Enabled by default starting with Firefox 141 on Windows (mid-2025) |
| Safari | 26+ (macOS, iOS, iPadOS, visionOS) | Supported | Late to adopt but now functional |
Feature Detection
async function checkWebGPU() {
  if (!navigator.gpu) {
    console.log("WebGPU not available; falling back to WASM");
    return false;
  }
  try {
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) {
      console.log("No GPU adapter found");
      return false;
    }
    const device = await adapter.requestDevice();
    const info = {
      name: adapter.info?.description || "Unknown",
      vendor: adapter.info?.vendor || "Unknown",
      architecture: adapter.info?.architecture || "Unknown",
      features: [...adapter.features]
    };
    console.log("WebGPU available:", info);
    // Release the probe device; the AI library creates its own
    device.destroy();
    return true;
  } catch (err) {
    console.log("WebGPU initialization failed:", err);
    return false;
  }
}

// Run on page load
checkWebGPU().then(available => {
  if (!available) console.log("Will use WASM fallback");
});
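The check above only logs its result. A minimal sketch of turning it into a device choice follows; `pickDevice` is a hypothetical helper (not part of any library), and the `nav` parameter is an assumption added so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: resolves to 'webgpu' when a GPU adapter can be
// obtained, and to 'wasm' otherwise. `nav` defaults to the global navigator
// but can be replaced with a stub for testing.
async function pickDevice(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return 'wasm';
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm';
  } catch {
    // Treat any initialization error as "no usable GPU"
    return 'wasm';
  }
}
```

The result could then be passed to a library as its device option, for example `{ device: await pickDevice() }` in a Transformers.js pipeline call.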
Transformers.js: High-Level Pipeline API
Transformers.js by Hugging Face provides a pipeline() API that mirrors the Python Transformers library. It supports text generation, classification, embedding, translation, image segmentation, and audio transcription, all running on WebGPU via the ONNX Runtime Web backend.
Text Generation with WebGPU
import { pipeline, env } from '@huggingface/transformers';

// Keep the ONNX WASM backend on the main thread instead of a proxy worker.
// The WebGPU backend and quantization are selected in the pipeline options below.
env.backends.onnx.wasm.proxy = false;

const MODEL_ID = 'onnx-community/Qwen2.5-0.5B-Instruct';

async function loadModel() {
  const statusEl = document.getElementById('status');
  statusEl.textContent = 'Loading model (first load downloads ~500MB)...';
  const generator = await pipeline('text-generation', MODEL_ID, {
    device: 'webgpu', // Use WebGPU execution provider
    dtype: 'q4f16', // 4-bit quantized for memory efficiency
  });
  statusEl.textContent = 'Model ready!';
  return generator;
}

async function generate(generator, prompt) {
  const result = await generator(prompt, {
    max_new_tokens: 256,
    temperature: 0.7,
    do_sample: true, // temperature only applies when sampling is enabled
  });
  // For a plain string prompt, generated_text contains the prompt plus the continuation
  return result[0].generated_text;
}
The device: 'webgpu' option selects the WebGPU execution provider. The dtype: 'q4f16' option requests 4-bit quantized weights, keeping GPU memory usage manageable. On the first visit the browser downloads the model files. Transformers.js then caches them by default through the browser's Cache API, so later visits skip the download.
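Because the first download can take a while, it helps to surface progress to the user. The sketch below formats progress events into status text. It assumes the event shape reported by the Transformers.js `progress_callback` option (a `status` string, a `file` name, and a `progress` percentage from 0 to 100); treat those field names as assumptions and check them against the version you use. `formatProgress` itself is a hypothetical helper.

```javascript
// Hypothetical formatter for model-loading progress events.
// Assumed event fields: status ('progress' | 'done' | 'ready' | ...),
// file (string), progress (number, 0 to 100).
function formatProgress(event) {
  if (event.status === 'progress' && typeof event.progress === 'number') {
    return `Downloading ${event.file}: ${event.progress.toFixed(0)}%`;
  }
  if (event.status === 'done') return `Finished ${event.file}`;
  if (event.status === 'ready') return 'Model ready';
  return null; // other statuses are ignored
}

// Possible wiring inside loadModel():
//   progress_callback: (e) => {
//     const msg = formatProgress(e);
//     if (msg) statusEl.textContent = msg;
//   }
```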
Setting Up a Vite Project
npm create vite@latest browser-llm -- --template vanilla
cd browser-llm
npm install @huggingface/transformers
Complete Single-Page Chat Application
This HTML file runs a local LLM entirely in the browser, no backend required:
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Browser LLM Chat</title>
  <style>
    body { font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 0 1rem; }
    #chat { height: 400px; overflow-y: auto; border: 1px solid #ccc; padding: 1rem; margin-bottom: 1rem; }
    #input-area { display: flex; gap: 0.5rem; }
    #input { flex: 1; padding: 0.5rem; }
    .msg { margin-bottom: 0.5rem; }
    .user { color: #2563eb; }
    .assistant { color: #16a34a; }
  </style>
</head>
<body>
  <h2>Browser LLM Chat</h2>
  <p id="status">Loading model...</p>
  <div id="chat"></div>
  <div id="input-area">
    <input id="input" type="text" placeholder="Type a message..." disabled />
    <button id="send" disabled>Send</button>
  </div>
  <script type="module">
    // A bare specifier like '@huggingface/transformers' needs a bundler or import map,
    // so this standalone page loads the library from a CDN instead.
    // The pinned version is an assumption; use the release you have tested.
    import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2';

    const MODEL = 'HuggingFaceTB/SmolLM2-360M-Instruct';
    const chat = document.getElementById('chat');
    const input = document.getElementById('input');
    const sendBtn = document.getElementById('send');
    const status = document.getElementById('status');
    let generator = null;

    function addMessage(role, text) {
      const div = document.createElement('div');
      div.className = `msg ${role}`;
      div.textContent = `${role === 'user' ? 'You' : 'AI'}: ${text}`;
      chat.appendChild(div);
      chat.scrollTop = chat.scrollHeight;
    }

    try {
      generator = await pipeline('text-generation', MODEL, {
        device: 'webgpu',
        dtype: 'q4f16',
      });
      status.textContent = 'Model loaded!';
      input.disabled = false;
      sendBtn.disabled = false;
    } catch (err) {
      // A top-level return is not allowed in a module; on failure the
      // input and button simply stay disabled.
      status.textContent = `Failed to load: ${err.message}`;
    }

    sendBtn.addEventListener('click', async () => {
      const prompt = input.value.trim();
      if (!prompt) return;
      input.value = '';
      addMessage('user', prompt);
      sendBtn.disabled = true;
      try {
        // Pass chat messages so the model's chat template is applied
        const result = await generator([{ role: 'user', content: prompt }], {
          max_new_tokens: 200,
          temperature: 0.7,
          do_sample: true, // temperature only applies when sampling is enabled
        });
        // With message input, generated_text is the conversation; the reply comes last
        addMessage('assistant', result[0].generated_text.at(-1).content);
      } catch (err) {
        addMessage('assistant', `Error: ${err.message}`);
      } finally {
        sendBtn.disabled = false;
      }
    });
  </script>
</body>
</html>
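Save this as chat.html, serve it from a local web server (for example with npx serve), and open it in Chrome or Edge. Serving over http://localhost avoids the module and caching restrictions that pages opened from file:// can run into. The model downloads on first visit (~200 MB for SmolLM2-360M) and caches locally. Subsequent loads skip the download, although the model still has to be read from cache and its shaders compiled.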
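The page above sends each prompt on its own, so the model has no memory of earlier turns. One way to add multi-turn context is to keep a bounded list of chat messages and pass the whole list to the generator. The sketch below is a hypothetical helper (`createChatHistory` and its `maxTurns` parameter are illustrative names, not library APIs), and it assumes the generator accepts an array of `{ role, content }` messages.

```javascript
// Hypothetical bounded chat history. Keeps at most `maxTurns` user/assistant
// exchanges so the prompt stays within a small model's context window.
function createChatHistory(systemPrompt, maxTurns = 8) {
  const turns = []; // alternating user and assistant messages
  return {
    add(role, content) {
      turns.push({ role, content });
      // Drop the oldest exchange (user + assistant) once over the limit,
      // so the history always starts with a user message
      while (turns.length > maxTurns * 2) turns.splice(0, 2);
    },
    messages() {
      return [{ role: 'system', content: systemPrompt }, ...turns];
    },
  };
}
```

In the click handler, this would mean calling `history.add('user', prompt)`, passing `history.messages()` to the generator, and then calling `history.add('assistant', reply)` with the returned text.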
WebLLM: Dedicated LLM Runtime
WebLLM (by MLC AI) is purpose-built for running large language models in browsers. It compiles LLM inference engines (Llama, Phi, Mistral, Gemma) to WebGPU-compatible code via the MLC compilation framework.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function runWebLLM() {
  // Prebuilt WebLLM model IDs carry an "-MLC" suffix; the ID must match an
  // entry in WebLLM's prebuilt model list
  const engine = await CreateMLCEngine("Mistral-7B-Instruct-v0.3-q4f16_1-MLC");
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Write a haiku about WebGPU." }
    ],
    max_tokens: 100,
    temperature: 0.8,
  });
  console.log(reply.choices[0].message.content);
}

runWebLLM();
WebLLM supports OpenAI-compatible streaming and function calling, making it straightforward to migrate cloud-based LLM code to fully client-side execution:
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Tell me a story." }],
  stream: true,
});
let text = "";
for await (const chunk of stream) {
  // Append each streamed delta; update the DOM here to render tokens as they arrive
  text += chunk.choices[0]?.delta?.content || "";
}
console.log(text);
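The streaming loop can be wrapped in a small reusable function. The sketch below assumes only that the stream is an async iterable of OpenAI-style chunks (`choices[0].delta.content`). `collectStream` and its `onToken` callback are hypothetical names, and the mock generator stands in for a real engine so the example runs on its own.

```javascript
// Hypothetical helper: consumes an OpenAI-style chunk stream, calls
// onToken(piece, textSoFar) for each non-empty delta, and resolves to the
// full text once the stream ends.
async function collectStream(stream, onToken = () => {}) {
  let text = '';
  for await (const chunk of stream) {
    const piece = chunk.choices?.[0]?.delta?.content ?? '';
    if (piece) {
      text += piece;
      onToken(piece, text);
    }
  }
  return text;
}

// Stand-in for a real stream, for illustration only
async function* mockStream() {
  yield { choices: [{ delta: { content: 'Hel' } }] };
  yield { choices: [{ delta: {} }] }; // chunks may carry no content
  yield { choices: [{ delta: { content: 'lo' } }] };
}
```

In a page, `onToken` could set an element's `textContent` to the text so far, which renders the reply incrementally.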
Low-Level WebGPU Compute (Custom Models)
For teams bringing their own models (exported to ONNX format), ONNX Runtime Web provides direct WebGPU execution:
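// The WebGPU-enabled build has been published under the 'onnxruntime-web/webgpu'
// entry point; check the ONNX Runtime Web docs for the version you install
import * as ort from 'onnxruntime-web/webgpu';

async function runCustomModel() {
  // Prefer the WebGPU execution provider, then fall back to WASM
  const session = await ort.InferenceSession.create('./model.onnx', {
    executionProviders: ['webgpu', 'wasm'],
    graphOptimizationLevel: 'all',
  });

  // Prepare input tensor; the buffer length must equal the product of the dims
  const dims = [1, 3, 224, 224]; // batch, channels, height, width
  const data = new Float32Array(dims.reduce((a, b) => a * b, 1)); // fill with model input data
  const input = new ort.Tensor('float32', data, dims);

  // Input and output names are model-specific, so read them from the session
  const results = await session.run({ [session.inputNames[0]]: input });
  console.log('Output:', results[session.outputNames[0]].data);
}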
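The `[1, 3, 224, 224]` shape above is a channels-first (NCHW) layout, while canvas pixel data arrives as interleaved RGBA bytes. A minimal sketch of the conversion follows. `rgbaToNchw` is a hypothetical helper, and the default mean and standard deviation are the commonly used ImageNet values, which is an assumption; use whatever normalization your model was trained with.

```javascript
// Hypothetical preprocessing helper: converts interleaved RGBA bytes
// (as found in ImageData.data) into a planar, normalized Float32Array
// laid out as [R plane, G plane, B plane]. Alpha is discarded.
function rgbaToNchw(
  rgba, width, height,
  mean = [0.485, 0.456, 0.406], // assumed ImageNet statistics
  std = [0.229, 0.224, 0.225]
) {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {
      const value = rgba[i * 4 + c] / 255; // scale byte to [0, 1]
      out[c * plane + i] = (value - mean[c]) / std[c];
    }
  }
  return out;
}
```

The returned array can be passed to `new ort.Tensor('float32', data, [1, 3, height, width])` for a single image.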
Performance Benchmarks
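Community-reported benchmarks show consistent speedups with WebGPU over the WASM fallback. Treat the figures below as indicative, since results vary with GPU, driver, and browser version: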
| Model | Task | WebGPU (Chrome) | WASM Fallback | Speedup |
|---|---|---|---|---|
| BERT base | Text classification | 45ms | 220ms | 4.9x |
| ResNet-50 | Image classification | 65ms | 180ms | 2.8x |
| Qwen2.5-0.5B | Text generation | 12 tok/s | 3 tok/s | 4.0x |
| SmolLM2-360M | Text generation | 28 tok/s | 7 tok/s | 4.0x |
First-run performance includes shader compilation overhead (5-15 seconds). Subsequent inference calls benefit from cached compiled shaders.
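Because the first call pays for shader compilation, a fair measurement should discard warm-up runs before timing. The sketch below shows one way to do that. `benchmark` and its `warmup` and `runs` options are hypothetical names, and `run` is any async function you supply, such as a single inference call.

```javascript
// Hypothetical timing harness: executes `run` a few times untimed (to absorb
// shader compilation and cache warm-up), then reports the median of the
// timed runs in milliseconds.
async function benchmark(run, { warmup = 1, runs = 5 } = {}) {
  for (let i = 0; i < warmup; i++) await run();

  const timings = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await run();
    timings.push(performance.now() - start);
  }

  const sorted = [...timings].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  return { median, timings };
}
```

The median is used rather than the mean so that a single slow run, for example one interrupted by garbage collection, does not skew the result.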
Available Browser-Compatible Models (2026)
| Model | Size | Quantized Size | Tasks | Library |
|---|---|---|---|---|
| SmolLM2-360M-Instruct | 360M | ~200 MB | Chat, text gen | Transformers.js |
| Qwen2.5-0.5B-Instruct | 500M | ~280 MB | Chat, code | Transformers.js |
| DeepSeek-R1-Distill-Qwen-1.5B (4-bit) | 1.5B | ~1 GB | Reasoning, code | WebLLM/ONNX |
| Gemma-3-4B-It (quantized) | 4B | ~2.2 GB | Chat, multimodal | WebLLM |
| Mistral-7B (4-bit) | 7B | ~4 GB | Chat, code, RAG | WebLLM |
| Whisper-Base | 74M | ~290 MB (fp32, unquantized) | Speech-to-text | Transformers.js |
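The quantized sizes in the table can be sanity-checked with simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is a rough lower bound under that assumption. `estimateWeightMB` is a hypothetical helper, and real downloads are typically somewhat larger because of quantization scales, layers kept at higher precision, and file metadata.

```javascript
// Hypothetical back-of-envelope estimate of weight storage in megabytes
// (1 MB = 1e6 bytes). Ignores quantization scales, unquantized layers,
// and the KV cache, so treat the result as a lower bound.
function estimateWeightMB(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8 / 1e6;
}

console.log(estimateWeightMB(360e6, 4)); // 180  (table lists ~200 MB)
console.log(estimateWeightMB(7e9, 4));   // 3500 (table lists ~4 GB)
```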
Resources
- WebGPU Specification (W3C) — Official standard
- MDN WebGPU API Documentation — Browser reference
- Transformers.js Documentation — WebGPU pipeline API, model hub
- WebLLM Documentation — Browser LLM runtime with OpenAI-compatible API
- ONNX Runtime Web — Cross-platform WebGPU inference
- WebGPU Browser Support Status