Browser AI and WebGPU 2026: Running AI Models Locally in Your Browser

Created: March 3, 2026 Larry Qu 6 min read

Introduction

The browser has evolved from a document viewer into a capable AI runtime. Through WebGPU compute shaders, neural networks now execute directly in the browser on the user’s GPU: once the model weights have been downloaded, prompts and outputs are not sent to external servers. Libraries like Transformers.js, WebLLM, and ONNX Runtime Web provide JavaScript APIs that make this accessible to frontend developers.

This guide covers the WebGPU API for AI workloads, provides runnable JavaScript code examples using Transformers.js and WebLLM, documents browser support across Chrome, Firefox, and Safari, and includes a complete single-page chat application that runs an LLM entirely in-browser.

Architecture Overview

flowchart TD
    A[JavaScript Application] --> B{AI Library}
    B --> C[Transformers.js]
    B --> D[WebLLM]
    B --> E[ONNX Runtime Web]

    C --> F[ONNX Runtime WebGPU Provider]
    D --> G[MLC Compiler<br/>WebGPU Kernels]
    E --> F

    F --> H[WebGPU API]
    G --> H

    H --> I[GPU Compute Shaders]
    I --> J[GPU VRAM<br/>Model Weights + KV Cache]
    H --> K[GPU Buffer Management]

    L[WASM Fallback] -.->|if WebGPU unavailable| C

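At the top, JavaScript application code calls high-level pipeline APIs. The library translates these calls into WebGPU compute operations: kernels run as compute shaders, while model weights and the KV cache live in GPU memory. When WebGPU is unavailable (older devices or unsupported browsers), Transformers.js can run the same models on its WASM backend, which executes on the CPU with reduced performance.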

WebGPU Browser Support (2026)

| Browser | Minimum Version | WebGPU Status | Notes |
|---|---|---|---|
| Chrome | 113+ | Full support | Most mature WebGPU implementation; generally the most performant |
| Edge | 113+ | Full support | Chromium-based; matches Chrome behavior |
| Firefox | 141+ (Win/Mac), 147+ (Linux) | Supported | Enabled by default starting with Firefox 141 (mid-2025); rollout is staged by platform |
| Safari | 26+ (macOS, iOS, iPadOS, visionOS) | Supported | Late to adopt but now functional |

Feature Detection

async function checkWebGPU() {
    if (!navigator.gpu) {
        console.log("WebGPU not available — falling back to WASM");
        return false;
    }

    try {
        const adapter = await navigator.gpu.requestAdapter();
        if (!adapter) {
            console.log("No GPU adapter found");
            return false;
        }
        const device = await adapter.requestDevice();

        const info = {
            name: adapter.info?.description || "Unknown",
            vendor: adapter.info?.vendor || "Unknown",
            architecture: adapter.info?.architecture || "Unknown",
            features: [...adapter.features]
        };
        console.log("WebGPU available:", info);
        device.destroy();
        return true;
    } catch (err) {
        console.log("WebGPU initialization failed:", err);
        return false;
    }
}

// Run on page load
checkWebGPU().then(available => {
    if (!available) console.log("Will use WASM fallback");
});
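The detection result is most useful when it feeds directly into backend selection. The following is a minimal sketch, not part of any library: `chooseDevice` is a hypothetical helper that returns the device string to request, and it accepts a stand-in for `navigator.gpu` so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: decide which device string to request.
// `gpu` defaults to navigator.gpu; pass a stub to test outside a browser.
async function chooseDevice(gpu = globalThis.navigator?.gpu) {
    if (!gpu) return 'wasm';                 // WebGPU API not exposed at all
    try {
        const adapter = await gpu.requestAdapter();
        return adapter ? 'webgpu' : 'wasm';  // null adapter: no usable GPU
    } catch {
        return 'wasm';                       // treat any failure as "no WebGPU"
    }
}

// Usage sketch (in the browser):
// const device = await chooseDevice();
// console.log(`Selected backend: ${device}`);
```

The returned string can then be passed as the device option when a model is loaded. A different quantization setting may suit the WASM path better than the one used for WebGPU, so treat the pairing as something to verify per model.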

Transformers.js: High-Level Pipeline API

Transformers.js by Hugging Face provides a pipeline() API that mirrors the Python Transformers library. It supports text generation, classification, embedding, translation, image segmentation, and audio transcription, all running on WebGPU via the ONNX Runtime Web backend.

Text Generation with WebGPU

import { pipeline, env } from '@huggingface/transformers';

// Run the WASM backend on the main thread rather than in a proxy worker.
// Quantization is chosen per pipeline through the dtype option below.
env.backends.onnx.wasm.proxy = false;

const MODEL_ID = 'onnx-community/Qwen2.5-0.5B-Instruct';

async function loadModel() {
    const statusEl = document.getElementById('status');
    statusEl.textContent = 'Loading model (first load downloads a few hundred MB)...';

    const generator = await pipeline('text-generation', MODEL_ID, {
        device: 'webgpu',     // Use WebGPU execution provider
        dtype: 'q4f16',       // 4-bit quantized for memory efficiency
    });

    statusEl.textContent = 'Model ready!';
    return generator;
}

async function generate(generator, prompt) {
    const result = await generator(prompt, {
        max_new_tokens: 256,
        temperature: 0.7,
        do_sample: true,
    });
    return result[0].generated_text;
}

The device: 'webgpu' option selects the WebGPU execution provider. The dtype: 'q4f16' option requests 4-bit quantized weights with 16-bit floating-point activations, keeping GPU memory usage manageable. On first visit, the browser downloads the model files; after that they are served from the browser's Cache API storage.
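Because the first load can take a while, it may help to show download progress instead of a static message. The sketch below assumes the pipeline accepts a progress_callback option and that each event looks roughly like { status, file, loaded, total }; both are assumptions to check against the installed Transformers.js version. `formatProgress` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: turn a progress event into a status string.
// Assumed event shape: { status, file, loaded, total } with byte counts.
function formatProgress(event) {
    if (event.status === 'progress' && event.total > 0) {
        const pct = Math.round((event.loaded / event.total) * 100);
        const mb = (event.loaded / 1e6).toFixed(1);
        return `Downloading ${event.file}: ${pct}% (${mb} MB)`;
    }
    if (event.status === 'done') return `Finished ${event.file}`;
    return `Status: ${event.status}`;
}

// Usage sketch (assumes the progress_callback option described above):
// const generator = await pipeline('text-generation', MODEL_ID, {
//     device: 'webgpu',
//     dtype: 'q4f16',
//     progress_callback: (e) => { statusEl.textContent = formatProgress(e); },
// });
```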

Setting Up a Vite Project

npm create vite@latest browser-llm -- --template vanilla
cd browser-llm
npm install @huggingface/transformers

Complete Single-Page Chat Application

This HTML file runs a local LLM entirely in the browser, no backend required:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Browser LLM Chat</title>
    <style>
        body { font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 0 1rem; }
        #chat { height: 400px; overflow-y: auto; border: 1px solid #ccc; padding: 1rem; margin-bottom: 1rem; }
        #input-area { display: flex; gap: 0.5rem; }
        #input { flex: 1; padding: 0.5rem; }
        .msg { margin-bottom: 0.5rem; }
        .user { color: #2563eb; }
        .assistant { color: #16a34a; }
    </style>
</head>
<body>
    <h2>Browser LLM Chat</h2>
    <p id="status">Loading model...</p>
    <div id="chat"></div>
    <div id="input-area">
        <input id="input" type="text" placeholder="Type a message..." disabled />
        <button id="send" disabled>Send</button>
    </div>

    <script type="module">
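        // A bare specifier such as '@huggingface/transformers' only resolves through
        // a bundler or an import map, so a standalone page imports from a CDN.
        // Assumption: jsDelivr serves the package's browser build at this URL;
        // adjust the pinned version as needed.
        import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0';

        const MODEL = 'HuggingFaceTB/SmolLM2-360M-Instruct';
        const chat = document.getElementById('chat');
        const input = document.getElementById('input');
        const sendBtn = document.getElementById('send');
        const status = document.getElementById('status');

        let generator = null;

        // Append one chat line and keep the newest message in view
        function addMessage(role, text) {
            const div = document.createElement('div');
            div.className = `msg ${role}`;
            div.textContent = `${role === 'user' ? 'You' : 'AI'}: ${text}`;
            chat.appendChild(div);
            chat.scrollTop = chat.scrollHeight;
        }

        // Load the model; on failure the input controls simply stay disabled
        try {
            generator = await pipeline('text-generation', MODEL, {
                device: 'webgpu',
                dtype: 'q4f16',
            });
            status.textContent = 'Model loaded!';
            input.disabled = false;
            sendBtn.disabled = false;
        } catch (err) {
            status.textContent = `Failed to load: ${err.message}`;
        }

        sendBtn.addEventListener('click', async () => {
            const prompt = input.value.trim();
            if (!prompt || !generator) return;
            input.value = '';
            addMessage('user', prompt);
            sendBtn.disabled = true;

            try {
                // Pass chat messages so the model's chat template is applied
                const result = await generator([{ role: 'user', content: prompt }], {
                    max_new_tokens: 200,
                    temperature: 0.7,
                    do_sample: true,  // temperature only takes effect when sampling
                });
                // With chat input, generated_text is the message list; the reply is last
                addMessage('assistant', result[0].generated_text.at(-1).content);
            } catch (err) {
                addMessage('assistant', `Error: ${err.message}`);
            } finally {
                sendBtn.disabled = false;
            }
        });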
    </script>
</body>
</html>

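Save this as chat.html and load it in Chrome or Edge, preferably through a local web server (for example, npx serve) rather than a file:// URL, since some browser storage APIs may be restricted for local files. The model downloads on first visit (~200 MB for SmolLM2-360M) and is cached locally. Subsequent loads skip the download, although the model still has to be read from the cache and initialized on the GPU.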
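The page above sends each prompt on its own, so the model has no memory of earlier turns. One way to add that is to keep a bounded message history and pass it on every call. This is a sketch under the assumption that the pipeline accepts an array of { role, content } messages; `appendTurn` and its `maxMessages` parameter are hypothetical names introduced here.

```javascript
// Hypothetical helper: append a message and keep the history bounded.
// System messages are always kept; only the most recent `maxMessages`
// user/assistant messages survive, which limits prompt length.
function appendTurn(history, role, content, maxMessages = 8) {
    const next = [...history, { role, content }];
    const system = next.filter((m) => m.role === 'system');
    const recent = next.filter((m) => m.role !== 'system').slice(-maxMessages);
    return [...system, ...recent];
}

// Usage sketch (names match the page above):
// let history = [{ role: 'system', content: 'You are a helpful assistant.' }];
// history = appendTurn(history, 'user', prompt);
// const result = await generator(history, { max_new_tokens: 200 });
// history = appendTurn(history, 'assistant', result[0].generated_text.at(-1).content);
```

Trimming by message count is the simplest policy; small models have short context windows, so a token-based budget may be a better fit in practice.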

WebLLM: Dedicated LLM Runtime

WebLLM (by MLC AI) is purpose-built for running large language models in browsers. It runs models (Llama, Phi, Mistral, Gemma) that have been compiled ahead of time into WebGPU kernels by the MLC compilation framework.

import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function runWebLLM() {
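    // The ID must match an entry in WebLLM's prebuilt model list,
    // whose names carry an "-MLC" suffix.
    const engine = await CreateMLCEngine("Mistral-7B-Instruct-v0.3-q4f16_1-MLC");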

    const reply = await engine.chat.completions.create({
        messages: [
            { role: "system", content: "You are a helpful assistant." },
            { role: "user", content: "Write a haiku about WebGPU." }
        ],
        max_tokens: 100,
        temperature: 0.8,
    });

    console.log(reply.choices[0].message.content);
}

runWebLLM();

WebLLM supports OpenAI-compatible streaming and function calling, making it straightforward to migrate cloud-based LLM code to fully client-side execution:

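// Assumes `engine` is the MLCEngine created in the previous example
const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Tell me a story." }],
    stream: true,
});

// Accumulate the streamed deltas; in a page, write `story` to the DOM as it grows
let story = "";
for await (const chunk of stream) {
    story += chunk.choices[0]?.delta?.content || "";
}
console.log(story);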

Low-Level WebGPU Compute (Custom Models)

For teams bringing their own models (exported to ONNX format), ONNX Runtime Web provides direct WebGPU execution:

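// Assumption: depending on the onnxruntime-web version, the WebGPU-enabled
// bundle may need to be imported from 'onnxruntime-web/webgpu' instead.
import * as ort from 'onnxruntime-web';

async function runCustomModel() {
    // Prefer the WebGPU execution provider; fall back to WASM if unavailable
    const session = await ort.InferenceSession.create('./model.onnx', {
        executionProviders: ['webgpu', 'wasm'],
        graphOptimizationLevel: 'all',
    });

    // Prepare input tensor; the data length must equal the product of the dims
    const dims = [1, 3, 224, 224];  // batch, channels, height, width
    const data = new Float32Array(dims.reduce((a, b) => a * b, 1));
    /* ... fill data with preprocessed model input ... */
    const input = new ort.Tensor('float32', data, dims);

    // Input and output names are model-specific, so read them from the session
    const results = await session.run({ [session.inputNames[0]]: input });
    console.log('Output:', results[session.outputNames[0]].data);
}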
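The [1, 3, 224, 224] layout means the tensor data is planar: all red values, then all green, then all blue. Pixels read from a canvas are interleaved RGBA, so they need reordering first. The sketch below shows one way to do that; `rgbaToNchw` is a hypothetical helper, and scaling to [0, 1] without mean/std normalization is an assumption that will not match every model.

```javascript
// Hypothetical helper: convert interleaved RGBA pixels (as returned by
// CanvasRenderingContext2D.getImageData) into planar CHW float data.
// Assumption: the model expects RGB scaled to [0, 1]; many vision models
// also require mean/std normalization, which is omitted here.
function rgbaToNchw(rgba, width, height) {
    const plane = width * height;
    const out = new Float32Array(3 * plane);
    for (let i = 0; i < plane; i++) {
        out[i] = rgba[i * 4] / 255;                  // R plane
        out[plane + i] = rgba[i * 4 + 1] / 255;      // G plane
        out[2 * plane + i] = rgba[i * 4 + 2] / 255;  // B plane (alpha dropped)
    }
    return out;
}

// Usage sketch (in the browser, with a 224x224 canvas context `ctx`):
// const { data } = ctx.getImageData(0, 0, 224, 224);
// const tensorData = rgbaToNchw(data, 224, 224);  // length 3 * 224 * 224
```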

Performance Benchmarks

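Community-reported figures suggest WebGPU runs roughly 3x to 5x faster than the WASM fallback on these models. Absolute numbers depend heavily on the GPU, browser version, and quantization, so treat them as indicative: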

| Model | Task | WebGPU (Chrome) | WASM Fallback | Speedup |
|---|---|---|---|---|
| BERT base | Text classification | 45 ms | 220 ms | 4.9x |
| ResNet-50 | Image classification | 65 ms | 180 ms | 2.8x |
| Qwen2.5-0.5B | Text generation | 12 tok/s | 3 tok/s | 4.0x |
| SmolLM2-360M | Text generation | 28 tok/s | 7 tok/s | 4.0x |

First-run performance includes shader compilation overhead (5-15 seconds). Subsequent inference calls benefit from cached compiled shaders.
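Since the first run includes shader compilation, a fair measurement should discard warm-up runs before timing. The following sketch shows one way to do that and to derive the speedup column from raw timings. All three function names are hypothetical, and the only platform API used is performance.now().

```javascript
// Hypothetical helper: mean time per run in ms, excluding warm-up runs
// that absorb one-time costs such as shader compilation.
async function timeAsync(fn, { warmup = 1, runs = 3 } = {}) {
    for (let i = 0; i < warmup; i++) await fn();
    const start = performance.now();
    for (let i = 0; i < runs; i++) await fn();
    return (performance.now() - start) / runs;
}

// Generation throughput from a token count and elapsed milliseconds
function tokensPerSecond(tokenCount, elapsedMs) {
    return tokenCount / (elapsedMs / 1000);
}

// Speedup of a fast backend over a slow one, given per-inference latency
function latencySpeedup(slowMs, fastMs) {
    return slowMs / fastMs;
}

// Usage sketch:
// const ms = await timeAsync(() => generator('Hello', { max_new_tokens: 32 }));
// console.log(tokensPerSecond(32, ms).toFixed(1), 'tok/s');
```

For example, latencySpeedup(220, 45) is about 4.9, matching the BERT base row.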

Available Browser-Compatible Models (2026)

| Model | Parameters | Download Size | Tasks | Library |
|---|---|---|---|---|
| SmolLM2-360M-Instruct | 360M | ~200 MB | Chat, text gen | Transformers.js |
| Qwen2.5-0.5B-Instruct | 500M | ~280 MB | Chat, code | Transformers.js |
| DeepSeek-R1-Distill-Qwen-1.5B (4-bit) | 1.5B | ~1 GB | Reasoning, code | WebLLM/ONNX |
| Gemma-3-4B-it (quantized) | 4B | ~2.2 GB | Chat, multimodal | WebLLM |
| Mistral-7B (4-bit) | 7B | ~4 GB | Chat, code, RAG | WebLLM |
| Whisper-Base | 74M | ~290 MB (fp32) | Speech-to-text | Transformers.js |
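The sizes in the table follow roughly from parameter count times bits per weight. A back-of-the-envelope sketch, with `estimateWeightMB` as a hypothetical helper:

```javascript
// Hypothetical helper: raw weight storage in megabytes (1 MB = 1e6 bytes).
// Ignores quantization scales, layers kept at higher precision, and
// tokenizer or config files, so real downloads come out somewhat larger.
function estimateWeightMB(paramCount, bitsPerWeight) {
    return (paramCount * bitsPerWeight) / 8 / 1e6;
}

console.log(estimateWeightMB(360e6, 4));  // 180  (table lists ~200 MB)
console.log(estimateWeightMB(500e6, 4));  // 250  (table lists ~280 MB)
console.log(estimateWeightMB(7e9, 4));    // 3500 (table lists ~4 GB)
```

The gap between these estimates and the listed sizes is plausibly the overhead noted in the comment. GPU memory use at runtime is higher again, because the KV cache and activations sit alongside the weights.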
