Browser AI and WebGPU 2026: Running AI Models Locally in Your Browser

Introduction

The browser has evolved from a document viewer into a powerful AI runtime. Through WebGPU and its compute shader capabilities, sophisticated neural networks now execute directly in the browser, running on the user’s GPU with no data sent to external servers. Libraries like Transformers.js, WebLLM, and ONNX Runtime Web provide JavaScript APIs that make this accessible to frontend developers.

This guide covers the WebGPU API for AI workloads, provides runnable JavaScript code examples using Transformers.js and WebLLM, documents browser support across Chrome, Firefox, and Safari, and includes a complete single-page chat application that runs an LLM entirely in-browser.

Architecture Overview

flowchart TD
    A[JavaScript Application] --> B{AI Library}
    B --> C[Transformers.js]
    B --> D[WebLLM]
    B --> E[ONNX Runtime Web]

    C --> F[ONNX Runtime WebGPU Provider]
    D --> G[MLC Compiler<br/>WebGPU Kernels]
    E --> F

    F --> H[WebGPU API]
    G --> H

    H --> I[GPU Compute Shaders]
    I --> J[GPU VRAM<br/>Model Weights + KV Cache]
    H --> K[GPU Buffer Management]

    L[WASM Fallback] -.->|if WebGPU unavailable| C

At the top, JavaScript application code calls high-level pipeline APIs. The library translates these into WebGPU compute operations. When WebGPU is unavailable (older devices or browsers), Transformers.js can run the same models on its WASM backend, which performs CPU inference at reduced speed.
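The choice between the WebGPU path and the WASM path can be made explicit in application code. The following is a minimal sketch, not the library's own fallback logic: `pipelineOptions` is a hypothetical helper name, and the `'q8'` dtype for the WASM path is an assumption chosen because 8-bit weights are a common CPU-friendly default.

```javascript
// Sketch: map a WebGPU availability flag to Transformers.js pipeline options.
// `pipelineOptions` is a hypothetical helper, not part of any library.
function pipelineOptions(webgpuAvailable) {
    return webgpuAvailable
        ? { device: 'webgpu', dtype: 'q4f16' } // GPU path: 4-bit weights
        : { device: 'wasm', dtype: 'q8' };     // CPU path (assumed dtype)
}
```

The boolean would typically come from a feature check such as the `checkWebGPU()` function in the Feature Detection section, and the returned object is passed as the third argument to `pipeline()`.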

WebGPU Browser Support (2026)

Browser	Minimum Version	WebGPU Status	Notes
Chrome	113+	Full support	Best WebGPU implementation; most performant
Edge	113+	Full support	Chromium-based; matches Chrome behavior
Firefox	141+ (Win/Mac), 147+ (Linux)	Supported	Enabled by default since mid-2025
Safari	26+ (macOS, iOS, iPadOS, visionOS)	Supported	Late to adopt but now functional

Feature Detection

async function checkWebGPU() {
    if (!navigator.gpu) {
        console.log("WebGPU not available — falling back to WASM");
        return false;
    }

    try {
        const adapter = await navigator.gpu.requestAdapter();
        if (!adapter) {
            console.log("No GPU adapter found");
            return false;
        }
        const device = await adapter.requestDevice();

        const info = {
            name: adapter.info?.description || "Unknown",
            vendor: adapter.info?.vendor || "Unknown",
            architecture: adapter.info?.architecture || "Unknown",
            features: [...adapter.features]
        };
        console.log("WebGPU available:", info);
        device.destroy();
        return true;
    } catch (err) {
        console.log("WebGPU initialization failed:", err);
        return false;
    }
}

// Run on page load
checkWebGPU().then(available => {
    if (!available) console.log("Will use WASM fallback");
});
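Calling `requestDevice()` with no arguments returns a device with the specification's default limits, which can be lower than what the adapter supports. Model weights often need larger buffers, and 16-bit float shaders need the optional `shader-f16` feature. The sketch below builds a device descriptor from whatever the adapter reports; `buildDeviceDescriptor` is a hypothetical helper name, and which limits a given library actually needs is an assumption here.

```javascript
// Sketch: build a GPUDeviceDescriptor-shaped object from an adapter.
// Works with a real GPUAdapter or any object exposing `features` (set-like)
// and `limits` (plain numeric properties).
function buildDeviceDescriptor(adapter) {
    const requiredFeatures = [];
    // Request f16 shader support only when the adapter offers it.
    if (adapter.features.has('shader-f16')) {
        requiredFeatures.push('shader-f16');
    }
    // Raise buffer limits from the spec defaults to the adapter's maximums.
    const requiredLimits = {
        maxBufferSize: adapter.limits.maxBufferSize,
        maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
    };
    return { requiredFeatures, requiredLimits };
}
```

In a browser this would be used as `const device = await adapter.requestDevice(buildDeviceDescriptor(adapter));`.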

Transformers.js: High-Level Pipeline API

Transformers.js by Hugging Face provides a pipeline() API that mirrors the Python Transformers library. It supports text generation, classification, embedding, translation, image segmentation, and audio transcription, all running on WebGPU via the ONNX Runtime Web backend.

Text Generation with WebGPU

import { pipeline, env } from '@huggingface/transformers';

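// Run ONNX Runtime's WASM code on the main thread rather than in a proxy worker.
// (The WebGPU device and quantization level are selected in pipeline() below.)
env.backends.onnx.wasm.proxy = false;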

const MODEL_ID = 'onnx-community/Qwen2.5-0.5B-Instruct';

async function loadModel() {
    const statusEl = document.getElementById('status');
    statusEl.textContent = 'Loading model (first load downloads ~500MB)...';

    const generator = await pipeline('text-generation', MODEL_ID, {
        device: 'webgpu',     // Use WebGPU execution provider
        dtype: 'q4f16',       // 4-bit quantized for memory efficiency
    });

    statusEl.textContent = 'Model ready!';
    return generator;
}

async function generate(generator, prompt) {
    const result = await generator(prompt, {
        max_new_tokens: 256,
        temperature: 0.7,
        do_sample: true,
    });
    return result[0].generated_text;
}

The device: 'webgpu' flag selects the WebGPU execution provider. The dtype: 'q4f16' flag requests 4-bit quantized weights with 16-bit float activations, keeping GPU memory usage manageable. On first visit, the browser downloads the model; later visits load it from the browser's Cache API storage instead of the network.
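Because the first download can take a while, it helps to show progress rather than a static message. The sketch below formats progress events into status text. It assumes the event shape that Transformers.js passes to the `progress_callback` pipeline option (`status`, `file`, `loaded`, `total`); `formatProgress` itself is a hypothetical helper.

```javascript
// Sketch: turn a model-loading progress event into a status string.
// Returns null for events that should leave the current text unchanged.
function formatProgress(event) {
    if (event.status === 'progress' && event.total > 0) {
        const pct = Math.round((event.loaded / event.total) * 100);
        return `Downloading ${event.file}: ${pct}%`;
    }
    if (event.status === 'ready') {
        return 'Model ready!';
    }
    return null;
}
```

It could be wired in by adding `progress_callback: (e) => { const t = formatProgress(e); if (t) statusEl.textContent = t; }` to the options object passed to `pipeline()`.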

Setting Up a Vite Project

npm create vite@latest browser-llm -- --template vanilla
cd browser-llm
npm install @huggingface/transformers

Complete Single-Page Chat Application

This HTML file runs a local LLM entirely in the browser, no backend required:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Browser LLM Chat</title>
    <style>
        body { font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 0 1rem; }
        #chat { height: 400px; overflow-y: auto; border: 1px solid #ccc; padding: 1rem; margin-bottom: 1rem; }
        #input-area { display: flex; gap: 0.5rem; }
        #input { flex: 1; padding: 0.5rem; }
        .msg { margin-bottom: 0.5rem; }
        .user { color: #2563eb; }
        .assistant { color: #16a34a; }
    </style>
</head>
<body>
    <h2>Browser LLM Chat</h2>
    <p id="status">Loading model...</p>
    <div id="chat"></div>
    <div id="input-area">
        <input id="input" type="text" placeholder="Type a message..." disabled />
        <button id="send" disabled>Send</button>
    </div>

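    <script type="module">
        // A bare specifier like '@huggingface/transformers' only resolves through a bundler,
        // so a standalone HTML file loads the library from a CDN instead.
        import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3';

        const MODEL = 'HuggingFaceTB/SmolLM2-360M-Instruct';
        const chat = document.getElementById('chat');
        const input = document.getElementById('input');
        const sendBtn = document.getElementById('send');
        const status = document.getElementById('status');

        let generator = null;

        // Append one chat line and keep the view scrolled to the bottom.
        function addMessage(role, text) {
            const div = document.createElement('div');
            div.className = `msg ${role}`;
            div.textContent = `${role === 'user' ? 'You' : 'AI'}: ${text}`;
            chat.appendChild(div);
            chat.scrollTop = chat.scrollHeight;
        }

        // Load the model once. A top-level `return` is a syntax error in a module,
        // so on failure the controls simply stay disabled.
        try {
            generator = await pipeline('text-generation', MODEL, {
                device: 'webgpu',
                dtype: 'q4f16',
            });
            status.textContent = 'Model loaded!';
            input.disabled = false;
            sendBtn.disabled = false;
        } catch (err) {
            status.textContent = `Failed to load: ${err.message}`;
        }

        sendBtn.addEventListener('click', async () => {
            const prompt = input.value.trim();
            if (!prompt || !generator) return;
            input.value = '';
            addMessage('user', prompt);
            sendBtn.disabled = true;

            try {
                // Passing chat messages applies the model's chat template.
                const result = await generator([{ role: 'user', content: prompt }], {
                    max_new_tokens: 200,
                    temperature: 0.7,
                    do_sample: true,   // temperature only takes effect when sampling
                });
                // For chat input, generated_text is the message list; the reply is the last entry.
                addMessage('assistant', result[0].generated_text.at(-1).content);
            } catch (err) {
                addMessage('assistant', `Error: ${err.message}`);
            } finally {
                sendBtn.disabled = false;
            }
        });
    </script>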
</body>
</html>

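Save this as chat.html and open it in Chrome or Edge, ideally served from http://localhost or an HTTPS host, since WebGPU is only exposed in secure contexts. The model downloads on first visit (~200 MB for SmolLM2-360M) and caches locally. Subsequent loads skip the download, although the weights still have to be loaded onto the GPU.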
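The chat application sends each prompt on its own, so the model has no memory of earlier turns. One way to add multi-turn context is to keep a bounded message history and pass it to the generator on each request. The sketch below is one possible approach; `appendTurn` and the `maxMessages` cap are hypothetical names, and the cap exists only to keep the prompt within a small model's context window.

```javascript
// Sketch: append a message to a chat history, keeping at most `maxMessages`
// non-system messages. A leading system message, if present, is always kept.
// Returns a new array and does not mutate `history`.
function appendTurn(history, role, content, maxMessages = 8) {
    const [first, ...rest] = history;
    const hasSystem = first !== undefined && first.role === 'system';
    const turns = hasSystem ? rest : history;
    const kept = [...turns, { role, content }].slice(-maxMessages);
    return hasSystem ? [first, ...kept] : kept;
}
```

In the click handler, the history would be updated with the user's prompt, passed to the generator in place of the single-message array, and updated again with the assistant's reply.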

WebLLM: Dedicated LLM Runtime

WebLLM (by MLC AI) is purpose-built for running large language models in browsers. It compiles LLM inference engines (Llama, Phi, Mistral, Gemma) to WebGPU-compatible code via the MLC compilation framework.

import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function runWebLLM() {
    // Model IDs must match an entry in WebLLM's prebuilt model list, which uses an "-MLC" suffix
    const engine = await CreateMLCEngine("Mistral-7B-Instruct-v0.3-q4f16_1-MLC");

    const reply = await engine.chat.completions.create({
        messages: [
            { role: "system", content: "You are a helpful assistant." },
            { role: "user", content: "Write a haiku about WebGPU." }
        ],
        max_tokens: 100,
        temperature: 0.8,
    });

    console.log(reply.choices[0].message.content);
}

runWebLLM();

WebLLM supports OpenAI-compatible streaming and function calling, making it straightforward to migrate cloud-based LLM code to fully client-side execution:

const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Tell me a story." }],
    stream: true,
});

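// process.stdout is Node-only; in the browser, accumulate the streamed deltas instead.
let story = "";
for await (const chunk of stream) {
    story += chunk.choices[0]?.delta?.content || "";
}
console.log(story);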
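In a browser UI, streamed deltas are usually rendered as they arrive rather than printed at the end. The sketch below wraps the loop in a reusable helper that works on any async iterable of OpenAI-style chunks. `collectStream` and its `onToken` callback are hypothetical names, not part of WebLLM.

```javascript
// Sketch: consume an async iterable of chat-completion chunks.
// Calls onToken(delta, fullTextSoFar) for each non-empty delta and
// resolves with the complete text.
async function collectStream(stream, onToken) {
    let text = '';
    for await (const chunk of stream) {
        const delta = chunk.choices[0]?.delta?.content || '';
        if (delta) {
            text += delta;
            onToken(delta, text);
        }
    }
    return text;
}
```

A page might call it as `await collectStream(stream, (_, full) => { outputEl.textContent = full; })`, where `outputEl` is whatever element displays the reply.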

Low-Level WebGPU Compute (Custom Models)

For teams bringing their own models (exported to ONNX format), ONNX Runtime Web provides direct WebGPU execution:

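import * as ort from 'onnxruntime-web/webgpu';  // build that bundles the WebGPU execution provider

async function runCustomModel() {
    // Prefer WebGPU and fall back to WASM if it is unavailable
    const session = await ort.InferenceSession.create('./model.onnx', {
        executionProviders: ['webgpu', 'wasm'],
        graphOptimizationLevel: 'all',
    });

    // Prepare input tensor; the buffer length must equal the product of the dims
    const dims = [1, 3, 224, 224];  // batch, channels, height, width
    const data = new Float32Array(dims.reduce((a, b) => a * b, 1));  // fill with model input data
    const input = new ort.Tensor('float32', data, dims);

    // Look up the model's actual input/output names instead of assuming them
    const feeds = { [session.inputNames[0]]: input };
    const results = await session.run(feeds);
    console.log('Output:', results[session.outputNames[0]].data);
}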
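The `[1, 3, 224, 224]` shape means the tensor data is laid out channel by channel (NCHW), while canvas pixel data arrives interleaved as RGBA. The sketch below shows one way to convert between the two. `rgbaToNchw` is a hypothetical helper, and scaling to the 0..1 range is an assumption: many image models also expect per-channel mean and standard deviation normalization, which is omitted here.

```javascript
// Sketch: convert interleaved RGBA bytes (as in ImageData.data) into a
// planar Float32Array in channel-first order: all R, then all G, then all B.
// Alpha is dropped and values are scaled to 0..1.
function rgbaToNchw(rgba, width, height) {
    const plane = width * height;
    const out = new Float32Array(3 * plane);
    for (let i = 0; i < plane; i++) {
        out[i] = rgba[i * 4] / 255;                 // red plane
        out[plane + i] = rgba[i * 4 + 1] / 255;     // green plane
        out[2 * plane + i] = rgba[i * 4 + 2] / 255; // blue plane
    }
    return out;
}
```

For a 224x224 image, the result has 3 * 224 * 224 elements and can be passed to the tensor constructor with dims `[1, 3, 224, 224]`.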

Performance Benchmarks

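Community benchmarks show WebGPU running roughly 3-5x faster than the WASM fallback. Classification rows report latency per inference (lower is better) and generation rows report tokens per second (higher is better); exact figures depend on the GPU and browser version: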

Model	Task	WebGPU (Chrome)	WASM Fallback	Speedup
BERT base	Text classification	45ms	220ms	4.9x
ResNet-50	Image classification	65ms	180ms	2.8x
Qwen2.5-0.5B	Text generation	12 tok/s	3 tok/s	4.0x
SmolLM2-360M	Text generation	28 tok/s	7 tok/s	4.0x

First-run performance includes shader compilation overhead (5-15 seconds). Subsequent inference calls benefit from cached compiled shaders.
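Because the first run includes shader compilation, a fair measurement discards warm-up runs before summarizing. The sketch below shows that arithmetic under simple assumptions: timings are collected elsewhere (for example with `performance.now()`), and both helper names are hypothetical.

```javascript
// Sketch: median latency after discarding the first `warmupRuns` timings.
function medianAfterWarmup(timingsMs, warmupRuns = 1) {
    const steady = timingsMs.slice(warmupRuns).sort((a, b) => a - b);
    if (steady.length === 0) {
        throw new Error('no timed runs left after warm-up');
    }
    const mid = Math.floor(steady.length / 2);
    return steady.length % 2 === 1
        ? steady[mid]
        : (steady[mid - 1] + steady[mid]) / 2;
}

// Speedup for latency measurements: baseline time divided by accelerated time.
function latencySpeedup(baselineMs, acceleratedMs) {
    return baselineMs / acceleratedMs;
}
```

Applied to the BERT base row, `latencySpeedup(220, 45)` is about 4.9, matching the table. For throughput rows the ratio is inverted: accelerated tokens per second divided by baseline tokens per second.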

Available Browser-Compatible Models (2026)

Model	Size	Quantized Size	Tasks	Library
SmolLM2-360M-Instruct	360M	~200 MB	Chat, text gen	Transformers.js
Qwen2.5-0.5B-Instruct	500M	~280 MB	Chat, code	Transformers.js
DeepSeek-R1-Distill-Qwen-1.5B (4-bit)	1.5B	~1 GB	Reasoning, code	WebLLM/ONNX
Gemma-3-4B-It (quantized)	4B	~2.2 GB	Chat, multimodal	WebLLM
Mistral-7B (4-bit)	7B	~4 GB	Chat, code, RAG	WebLLM
Whisper-Base	74M	~290 MB (fp32)	Speech-to-text	Transformers.js
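A rough lower bound for download size follows from parameter count times bits per weight. The sketch below shows that arithmetic; the helper names are hypothetical. Real files tend to be somewhat larger than this estimate, presumably because of quantization scales and layers kept at higher precision, which would explain why the table's figures sit above the raw numbers.

```javascript
// Sketch: estimate raw weight storage from parameter count and bit width.
function estimateWeightBytes(paramCount, bitsPerWeight) {
    return (paramCount * bitsPerWeight) / 8;
}

// Convert bytes to decimal megabytes for display.
function toMB(bytes) {
    return bytes / 1e6;
}
```

For example, `toMB(estimateWeightBytes(360e6, 4))` gives 180 for SmolLM2-360M against the ~200 MB listed, and `toMB(estimateWeightBytes(7e9, 4))` gives 3500 for Mistral-7B against the ~4 GB listed.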

Resources

WebGPU Specification (W3C) — Official standard
MDN WebGPU API Documentation — Browser reference
Transformers.js Documentation — WebGPU pipeline API, model hub
WebLLM Documentation — Browser LLM runtime with OpenAI-compatible API
ONNX Runtime Web — Cross-platform WebGPU inference
WebGPU Browser Support Status
