⚡ Calmops

Browser-Native AI: Chrome GenAI APIs, WebGPU, and Running LLMs with ONNX.js

Introduction: AI is Moving to the Browser

For years, artificial intelligence was the domain of powerful servers and cloud infrastructure. But we’re witnessing a fundamental shift. Modern web browsers now have the computational power and APIs to run sophisticated AI models directly on users’ devices: no API calls, no cloud backends, no waiting on remote servers.

This is browser-native AI, and it’s transforming how we build intelligent web applications.

Imagine a user writing in your web application and getting real-time spell-checking, translation, or content generation, all happening instantly on their device. Or a productivity app that processes images, generates summaries, or performs natural language understanding without ever leaving the browser. This is no longer science fiction. It’s happening today.

In this article, we’ll explore three groundbreaking technologies that are making browser-native AI possible: Chrome’s GenAI APIs, WebGPU for GPU-accelerated computing, and ONNX.js for running pretrained machine learning models. By the end, you’ll understand how to build AI-powered web applications that are faster, more private, and completely independent of cloud infrastructure.

The Paradigm Shift: Why Browser AI Matters

Traditional Server-Side AI

User Browser → Network Request → Cloud Server (Run AI Model) → Response → User Browser
Latency: 200-2000ms | Privacy: Data sent to server | Cost: High compute costs

Browser-Native AI

User Browser → (Run AI Model Locally) → Instant Response
Latency: 10-100ms | Privacy: No data leaves device | Cost: Zero server costs

Why This Matters

Privacy: Your user data never leaves their device. No servers, no logs, no privacy concerns.

Performance: Response times in the tens of milliseconds. No network round trips. Instant feedback.

Cost: Zero compute costs for AI inference. Scaling adds users, not backend infrastructure.

Offline Capability: AI works completely offline. No internet connection? No problem.

User Control: Users own their data and models. Complete transparency and control.

Understanding the Browser AI Stack

Before diving into specific technologies, let’s understand the landscape:

  • Hardware Acceleration: Modern browsers can access the GPU for 10-100x faster computation
  • Standardized APIs: WebGPU, WebGL, and Web Workers enable efficient computation
  • Model Formats: ONNX, TensorFlow.js, and WebAssembly enable model deployment
  • Developer Tools: Frameworks and runtimes make it simple to integrate AI

The three technologies we’ll explore handle different layers of this stack.

Chrome GenAI APIs: Built-In Language Models

What Are Chrome GenAI APIs?

Chrome GenAI APIs (also called the Prompt API) are experimental APIs that bring generative AI capabilities directly into Chrome. Instead of shipping AI models with your application, Chrome provides access to on-device models optimized for the browser.

This approach shifts responsibility for model updates and optimization to the browser vendor, similar to how JavaScript engines improve over time.

Current Status and Availability

Important: As of December 2025, Chrome GenAI APIs are still experimental and available through:

  • Chrome 123+ with experimental flags enabled
  • Chrome Canary and Dev builds for early adopters
  • Gradual rollout planned for stable Chrome in 2025-2026

You can enable them by visiting chrome://flags and searching for “GenAI”.

Using the Prompt API

The most accessible Chrome GenAI API is the Prompt API, which provides access to text generation:

// Check if the API is available
if ('ai' in window && 'languageModel' in window.ai) {
  console.log('Chrome GenAI APIs are available!')
}

// Create a session with the language model
async function initializeAI() {
  try {
    const session = await window.ai.languageModel.create({
      // You can request a specific model version
      topK: 3,  // Sample only from the 3 most likely tokens
      temperature: 1.0,  // Randomness (0-2, higher = more random output)
    })

    return session
  } catch (error) {
    console.error('Failed to create language model session:', error)
    // Gracefully fall back to server-side API
  }
}
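The `topK` and `temperature` options are easier to reason about once you see what sampling does with them. Here is a minimal pure-JavaScript sketch of top-k sampling with temperature (`sampleTopK` and its option names are our own, not part of the Chrome API):

```javascript
// Minimal sketch of what topK and temperature control during sampling.
// `logits` is a plain array of raw model scores; all names here are our own.
function sampleTopK(logits, { topK = 3, temperature = 1.0, rand = Math.random } = {}) {
  // Temperature rescales logits: <1 sharpens, >1 flattens the distribution
  const scaled = logits.map(l => l / temperature)

  // Keep only the topK highest-scoring candidates
  const ranked = scaled
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)

  // Softmax over the surviving candidates (subtract max for stability)
  const max = ranked[0].score
  const exps = ranked.map(c => Math.exp(c.score - max))
  const total = exps.reduce((a, b) => a + b, 0)

  // Sample one candidate proportionally to its probability
  let r = rand() * total
  for (let i = 0; i < ranked.length; i++) {
    r -= exps[i]
    if (r <= 0) return ranked[i].index
  }
  return ranked[ranked.length - 1].index
}
```

With `topK: 1` this degenerates to greedy decoding; raising `temperature` spreads probability mass over the surviving candidates, which is why higher values read as "more creative".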

Generating Text with Chrome GenAI

async function generateText(prompt) {
  try {
    const session = await window.ai.languageModel.create()

    // Stream text generation for real-time display
    const stream = await session.promptStreaming(prompt)

    let fullResponse = ''

    // Process the stream
    for await (const chunk of stream) {
      console.log('Received chunk:', chunk)
      fullResponse += chunk
      // Update UI in real-time
      updateUI(fullResponse)
    }

    return fullResponse
  } catch (error) {
    console.error('Text generation failed:', error)
    throw error
  }
}

// Use it
const prompt = 'Write a short poem about web development'
generateText(prompt).then(result => {
  console.log('Generated:', result)
})

Real-World Use Cases for Chrome GenAI

1. Content Completion: Auto-complete suggestions as users type

async function getAutocompleteSuggestions(userInput) {
  const prompt = `Based on the input: "${userInput}", suggest 3 completions:\n1.`
  const suggestions = await generateText(prompt)
  return suggestions.split('\n').slice(0, 3)
}

2. Real-Time Summarization: Summarize page content for accessibility

async function summarizePageContent() {
  const pageText = document.body.innerText.substring(0, 5000)
  const prompt = `Summarize this in 2-3 sentences:\n\n${pageText}`
  return generateText(prompt)
}

3. Writing Assistance: Grammar checking and style suggestions

async function improveWriting(text) {
  const prompt = `Fix grammar and improve clarity: "${text}"\n\nImproved: `
  return generateText(prompt)
}

Limitations of Chrome GenAI APIs

  • Experimental Status: API surface may change before stable release
  • On-Device Models: Limited model size compared to cloud alternatives (typically 2-4B parameters)
  • Browser Dependency: Only works in Chrome with the API enabled
  • No Fine-Tuning: Can’t customize models for specific tasks
  • Limited Context: Maximum input length may be constrained (typically 4K-32K tokens)
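Given those context limits, it helps to guard prompts with a simple trimming step. A minimal sketch, assuming a rough ~4 characters-per-token ratio (a heuristic only; the real tokenizer will differ):

```javascript
// Rough context-budget guard: trims input so it fits an assumed token limit.
// The ~4 characters-per-token ratio is a heuristic, not an exact tokenizer.
function trimToContext(text, maxTokens = 4000, charsPerToken = 4) {
  const budget = maxTokens * charsPerToken
  if (text.length <= budget) return text
  // Keep the most recent text, cutting at a word boundary where possible
  const tail = text.slice(text.length - budget)
  const firstSpace = tail.indexOf(' ')
  return firstSpace === -1 ? tail : tail.slice(firstSpace + 1)
}
```

Keeping the tail (rather than the head) suits chat-style prompts where the most recent text matters most; summarization tasks may prefer the opposite.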

Feature Detection and Graceful Degradation

Always provide fallbacks:

async function smartGenerate(prompt) {
  // Try Chrome GenAI first
  if ('ai' in window && 'languageModel' in window.ai) {
    try {
      return await generateText(prompt)
    } catch (error) {
      console.warn('Chrome GenAI failed, falling back to API')
    }
  }

  // Fallback to server API
  const response = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  })

  const data = await response.json()
  return data.text
}

WebGPU: GPU-Accelerated Computing in Browsers

What is WebGPU?

WebGPU is a modern graphics and compute API for the web, designed as the successor to WebGL. Unlike WebGL (which is primarily for graphics), WebGPU exposes powerful GPU compute capabilities for machine learning, physics simulations, and other compute-intensive tasks.

Key Difference: WebGL is for graphics rendering. WebGPU is for general-purpose GPU computing, making it perfect for AI workloads.

Why WebGPU for AI?

Performance: 10-100x faster than CPU for matrix operations, the core of neural networks

Efficiency: GPU architecture matches the parallelization patterns of deep learning

Modern API: Designed from scratch for modern hardware and use cases (WebGL dates back to 2011)

Safety: Memory-safe by design with built-in validation

Browser Support and Feature Detection

As of December 2025, WebGPU is available in:

  • Chrome 113+ (enabled by default)
  • Edge 113+ (enabled by default)
  • Firefox (behind a flag)
  • Safari (in development)

Detect availability:

async function checkWebGPU() {
  if (!navigator.gpu) {
    console.warn('WebGPU is not available in this browser')
    return false
  }

  try {
    const adapter = await navigator.gpu.requestAdapter()
    if (!adapter) {
      console.warn('No GPU adapter found')
      return false
    }

    console.log('WebGPU is available!')
    return true
  } catch (error) {
    console.warn('WebGPU error:', error)
    return false
  }
}

checkWebGPU()

Using WebGPU for AI: Matrix Multiplication Example

Here’s a practical example of using WebGPU to perform matrix multiplication, a core operation in neural networks:

class GPUMatrixCompute {
  constructor() {
    this.device = null
    this.queue = null
  }

  async initialize() {
    // Request GPU adapter and device
    const adapter = await navigator.gpu.requestAdapter()
    if (!adapter) throw new Error('GPU adapter not found')

    this.device = await adapter.requestDevice()
    this.queue = this.device.queue
  }

  async matrixMultiply(matrixA, matrixB, dimensions) {
    const { rowsA, colsA, rowsB, colsB } = dimensions

    // Create GPU buffers for input matrices
    const bufferA = this.device.createBuffer({
      size: matrixA.byteLength,
      mappedAtCreation: true,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    })
    new Float32Array(bufferA.getMappedRange()).set(matrixA)
    bufferA.unmap()

    const bufferB = this.device.createBuffer({
      size: matrixB.byteLength,
      mappedAtCreation: true,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    })
    new Float32Array(bufferB.getMappedRange()).set(matrixB)
    bufferB.unmap()

    // Create output buffer
    const resultSize = rowsA * colsB * 4
    const bufferResult = this.device.createBuffer({
      size: resultSize,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
    })

    // Create shader for matrix multiplication
    const shaderModule = this.device.createShaderModule({
      code: `
        @group(0) @binding(0) var<storage, read> matrixA: array<f32>;
        @group(0) @binding(1) var<storage, read> matrixB: array<f32>;
        @group(0) @binding(2) var<storage, read_write> result: array<f32>;

        @compute @workgroup_size(8, 8)
        fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
          let row = global_id.x;
          let col = global_id.y;

          if (row >= ${rowsA}u || col >= ${colsB}u) {
            return;
          }

          var sum: f32 = 0.0;
          for (var k: u32 = 0u; k < ${colsA}u; k = k + 1u) {
            sum = sum + matrixA[row * ${colsA}u + k] * matrixB[k * ${colsB}u + col];
          }

          result[row * ${colsB}u + col] = sum;
        }
      `,
    })

    // Create compute pipeline
    const pipeline = this.device.createComputePipeline({
      layout: 'auto',
      compute: { module: shaderModule, entryPoint: 'main' },
    })

    // Execute computation
    const bindGroup = this.device.createBindGroup({
      layout: pipeline.getBindGroupLayout(0),
      entries: [
        { binding: 0, resource: { buffer: bufferA } },
        { binding: 1, resource: { buffer: bufferB } },
        { binding: 2, resource: { buffer: bufferResult } },
      ],
    })

    const commandEncoder = this.device.createCommandEncoder()
    const passEncoder = commandEncoder.beginComputePass()
    passEncoder.setPipeline(pipeline)
    passEncoder.setBindGroup(0, bindGroup)
    passEncoder.dispatchWorkgroups(
      Math.ceil(rowsA / 8),
      Math.ceil(colsB / 8)
    )
    passEncoder.end()

    this.queue.submit([commandEncoder.finish()])

    // Read result
    const stagingBuffer = this.device.createBuffer({
      size: resultSize,
      usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
    })

    const copyEncoder = this.device.createCommandEncoder()
    copyEncoder.copyBufferToBuffer(bufferResult, 0, stagingBuffer, 0, resultSize)
    this.queue.submit([copyEncoder.finish()])

    await stagingBuffer.mapAsync(GPUMapMode.READ)
    const result = new Float32Array(stagingBuffer.getMappedRange()).slice(0)
    stagingBuffer.unmap()

    return result
  }
}

// Usage
const compute = new GPUMatrixCompute()
await compute.initialize()

const matrixA = new Float32Array(16) // 4x4 matrix
const matrixB = new Float32Array(16) // 4x4 matrix
matrixA.fill(1)
matrixB.fill(2)

const result = await compute.matrixMultiply(matrixA, matrixB, {
  rowsA: 4, colsA: 4, rowsB: 4, colsB: 4
})
console.log('Result:', result)
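While debugging shaders, it’s worth validating GPU output against a plain CPU implementation. For the all-ones times all-twos 4x4 example above, every output element should be 1 × 2 summed over 4 terms, i.e. 8:

```javascript
// Plain CPU matrix multiply, useful as a reference to validate GPU output.
function cpuMatrixMultiply(a, b, rowsA, colsA, colsB) {
  const out = new Float32Array(rowsA * colsB)
  for (let row = 0; row < rowsA; row++) {
    for (let col = 0; col < colsB; col++) {
      let sum = 0
      for (let k = 0; k < colsA; k++) {
        sum += a[row * colsA + k] * b[k * colsB + col]
      }
      out[row * colsB + col] = sum
    }
  }
  return out
}

// Same inputs as the WebGPU example: every element should come out as 8
const a = new Float32Array(16).fill(1)
const b = new Float32Array(16).fill(2)
const expected = cpuMatrixMultiply(a, b, 4, 4, 4) // all elements 8
```

Comparing `expected` element-by-element against the WebGPU `result` (within a small float tolerance) catches indexing mistakes in the shader early.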

WebGPU Performance Benefits

For a typical neural network inference:

  • CPU-only: ~500ms per inference
  • WebGPU on GPU: 20-50ms per inference (10-25x faster)
  • With batching: ~100ms for 10 inferences (about 10ms each, amortized)

Frameworks Built on WebGPU

You don’t need to write GPU compute shaders manually. Several frameworks abstract this:

  • ONNX Runtime Web: Automatically uses WebGPU for acceleration
  • TensorFlow.js: Added WebGPU backend for faster inference
  • MediaPipe: Uses WebGPU for real-time pose detection and object recognition

ONNX.js: Running Large Language Models Locally

What is ONNX?

ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models. ONNX.js was the original JavaScript runtime for executing ONNX models in browsers; its successor, ONNX Runtime Web (the onnxruntime-web package used below), is the one to reach for today.

Instead of training models in the browser (impractical), you:

  1. Train or download a pretrained model in Python
  2. Convert it to ONNX format
  3. Run inference in the browser using ONNX.js

Why ONNX for Browser AI?

  • Framework Agnostic: Train in PyTorch, TensorFlow, scikit-learn, then run in browser
  • Pre-Trained Models: Use models trained by researchers worldwide
  • Optimized: ONNX.js automatically optimizes for browser execution
  • GPU Acceleration: Automatically uses WebGPU when available

Setting Up ONNX.js

npm install onnxruntime-web

Running a Pretrained Model

import * as ort from 'onnxruntime-web'

// Tell the runtime where to fetch its WebAssembly binaries from
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@latest/dist/'

async function loadModel(modelPath) {
  try {
    const session = await ort.InferenceSession.create(modelPath, {
      executionProviders: ['webgpu', 'wasm'], // Try WebGPU first, fall back to WASM
    })
    return session
  } catch (error) {
    console.error('Failed to load model:', error)
    throw error
  }
}

// Load model
const session = await loadModel('model.onnx')

Running Inference

async function runInference(session, inputData) {
  // Create input tensor. Shape and layout are model-specific: this model
  // expects NHWC [1, 224, 224, 3]; many vision models expect NCHW [1, 3, 224, 224]
  const input = new ort.Tensor('float32', inputData, [1, 224, 224, 3])

  // Run inference. Input/output names ('images', 'output') also come from the
  // model; inspect session.inputNames and session.outputNames if unsure
  const outputs = await session.run({ images: input })

  // Extract raw predictions (usually logits, one score per class)
  const predictions = outputs.output.data
  return predictions
}

// Example: Image classification
const imageData = await loadImageAsFloat32Array('image.jpg')
const predictions = await runInference(session, imageData)
const topClass = predictions.indexOf(Math.max(...predictions))
console.log('Top class index:', topClass)
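The raw outputs of a classifier are usually unnormalized logits. A couple of small helpers (the names are our own) convert the predictions from runInference into probabilities and ranked class indices:

```javascript
// Turn raw classifier logits into probabilities and ranked class indices.
function softmax(logits) {
  const max = Math.max(...logits)
  const exps = logits.map(l => Math.exp(l - max)) // subtract max for stability
  const total = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / total)
}

function topKIndices(values, k = 5) {
  return values
    .map((value, index) => ({ value, index }))
    .sort((a, b) => b.value - a.value)
    .slice(0, k)
    .map(entry => entry.index)
}

// Usage with the predictions from runInference:
// const probs = softmax(Array.from(predictions))
// const best = topKIndices(probs, 5) // indices into your label list
```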

Converting Models to ONNX

Each framework has its own exporter. For scikit-learn models, use the skl2onnx library (PyTorch ships its own torch.onnx.export):

# Python: Convert scikit-learn model to ONNX
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import joblib

# Load your trained model
model = joblib.load('trained_model.pkl')

# Define input shape
initial_type = [('float_input', FloatTensorType([None, 10]))]

# Convert to ONNX
onnx_model = convert_sklearn(model, initial_type=initial_type)

# Save
with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

Running Small LLMs with ONNX.js

Here’s how to run a small language model (quantized to reduce size):

class BrowserLLM {
  constructor() {
    this.session = null
    this.tokenizer = null
  }

  async initialize(modelPath, tokenizerPath) {
    // Load model
    this.session = await ort.InferenceSession.create(modelPath, {
      executionProviders: ['webgpu', 'wasm'],
    })

    // Load tokenizer (typically a JSON file with vocabulary)
    const response = await fetch(tokenizerPath)
    this.tokenizer = await response.json()

    console.log('LLM initialized')
  }

  tokenize(text) {
    // Simple tokenization (in production, use proper tokenizer)
    const tokens = text.toLowerCase().split(/\s+/)
    return tokens.map(token => {
      return this.tokenizer.vocab?.[token] || 0
    })
  }

  async generate(prompt, maxTokens = 50) {
    let tokens = this.tokenize(prompt)
    let output = tokens.slice()

    for (let i = 0; i < maxTokens; i++) {
      // Prepare input
      const inputIds = new ort.Tensor('int64', BigInt64Array.from(tokens.map(BigInt)), [1, tokens.length])

      // Run inference
      const outputs = await this.session.run({ input_ids: inputIds })
      const logits = outputs.logits.data // flat view of [1, seqLen, vocabSize]

      // Greedy decode: argmax over the logits for the LAST position only
      const vocabSize = logits.length / tokens.length
      const offset = (tokens.length - 1) * vocabSize
      let maxIndex = 0
      let maxValue = logits[offset]

      for (let j = 1; j < vocabSize; j++) {
        if (logits[offset + j] > maxValue) {
          maxValue = logits[offset + j]
          maxIndex = j
        }
      }

      // Add token to output
      tokens.push(maxIndex)
      output.push(maxIndex)

      // Check for end token
      if (maxIndex === this.tokenizer.eos_token_id) break
    }

    // Decode tokens back to text
    return this.decodeTokens(output)
  }

  decodeTokens(tokens) {
    // Reverse lookup in vocabulary
    const reverseVocab = Object.entries(this.tokenizer.vocab).reduce(
      (acc, [word, id]) => ({ ...acc, [id]: word }),
      {}
    )

    return tokens
      .map(id => reverseVocab[id] || '')
      .join(' ')
  }
}

// Usage
const llm = new BrowserLLM()
await llm.initialize('model.onnx', 'tokenizer.json')
const output = await llm.generate('The future of AI is', 30)
console.log(output)
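One subtlety with causal language models: the logits tensor has shape [1, seqLen, vocabSize], and only the final position’s slice predicts the next token. A standalone version of that argmax step, testable without a model (the flat-array layout assumption matches how ONNX Runtime exposes tensor data):

```javascript
// Greedy next-token pick for a causal LM. `logits` is the flat data of a
// [1, seqLen, vocabSize] tensor; only the last position's slice matters.
function lastPositionArgmax(logits, seqLen) {
  const vocabSize = logits.length / seqLen
  const start = (seqLen - 1) * vocabSize
  let maxIndex = 0
  let maxValue = logits[start]
  for (let j = 1; j < vocabSize; j++) {
    if (logits[start + j] > maxValue) {
      maxValue = logits[start + j]
      maxIndex = j
    }
  }
  return maxIndex
}
```

Scanning the whole flat array instead of the last slice is a common bug: it silently mixes predictions from every position in the prompt.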

Available Pretrained Models for Browser

  • DistilBERT: A smaller, faster BERT for text classification
  • MobileBERT: BERT optimized for mobile/browser
  • Phi-2: Small but capable language model
  • TinyLlama: Compact model built on the Llama architecture
  • Whisper Tiny: Speech-to-text model

Find quantized versions of these models on the Hugging Face Hub and in the ONNX Model Zoo.

Comparing the Three Approaches

Feature | Chrome GenAI | WebGPU | ONNX.js
Model Type | Text generation | General compute | Any pretrained model
Ease of Use | Easiest | Hardest | Medium
Performance | Good | Excellent | Excellent
Browser Support | Chrome only (experimental) | Chrome, Edge (stable) | All browsers (WASM fallback)
Model Customization | None | Full (write shaders) | Pretrained only
Model Size | Medium (on-device) | Variable | User-defined
Cost | Free | Free | Free
Privacy | Maximum | Maximum | Maximum
Status | Experimental (2025) | Stable | Stable
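The support matrix suggests a natural detection order. A minimal selection sketch, with capabilities passed in as flags so the logic stays testable (in the browser you would derive them from 'ai' in window, navigator.gpu, and WebAssembly support; the backend names are our own labels):

```javascript
// Pick an AI backend in the order the comparison suggests. Capabilities are
// injected so the logic is a pure, testable function.
function chooseBackend({ hasChromeAI, hasWebGPU, hasWasm }) {
  if (hasChromeAI) return 'chrome-genai' // easiest, Chrome-only
  if (hasWebGPU) return 'onnx-webgpu'    // fast, needs WebGPU
  if (hasWasm) return 'onnx-wasm'        // slower, works everywhere
  return 'server-fallback'               // last resort: remote API
}
```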

Real-World Example: AI-Powered Text Editor

Here’s how to combine these technologies into a practical text editor with AI features:

class AITextEditor {
  constructor() {
    this.llm = null
    this.chrome_ai = null
  }

  async initialize() {
    // Try Chrome GenAI first
    if ('ai' in window && 'languageModel' in window.ai) {
      this.chrome_ai = await window.ai.languageModel.create()
    }

    // Also initialize ONNX-based LLM
    this.llm = new BrowserLLM()
    try {
      await this.llm.initialize('model.onnx', 'tokenizer.json')
    } catch (e) {
      console.warn('ONNX model failed to load')
    }
  }

  async generate(prompt) {
    // Use Chrome GenAI if available
    if (this.chrome_ai) {
      try {
        const stream = await this.chrome_ai.promptStreaming(prompt)
        let result = ''
        for await (const chunk of stream) {
          result += chunk
        }
        return result
      } catch (e) {
        console.warn('Chrome GenAI failed')
      }
    }

    // Fall back to ONNX
    if (this.llm) {
      return await this.llm.generate(prompt, 50)
    }

    throw new Error('No AI backend available')
  }

  async generateCompletion(text) {
    const prompt = `Complete this: "${text}"\n\nCompletion:`
    return this.generate(prompt)
  }

  async improveWriting(text) {
    const prompt = `Improve this writing for clarity and grammar:\n"${text}"\n\nImproved:`
    return this.generate(prompt)
  }

  async summarize(text) {
    const prompt = `Summarize this in 2-3 sentences:\n${text}\n\nSummary:`
    return this.generate(prompt)
  }
}

// Integration with UI
const editor = new AITextEditor()
await editor.initialize()

document.getElementById('enhance-btn').addEventListener('click', async () => {
  const text = document.getElementById('editor').value
  const improved = await editor.improveWriting(text)
  document.getElementById('result').textContent = improved
})

Privacy and Security Considerations

Advantages of Browser AI

  • Data Privacy: No data leaves the user’s device
  • GDPR Friendly: No personal data is collected or transmitted
  • No Surveillance: No analytics or logging on a backend
  • User Control: Users control when and how AI runs
  • Offline: Works without an internet connection

Security Best Practices

  1. Verify Model Origin: Only download models from trusted sources
  2. Size Validation: Check that the model file size matches the expected value
  3. Hash Verification: Use SHA-256 hashes to verify model integrity
  4. Sandboxing: Use Web Workers to isolate AI computation
  5. Input Validation: Sanitize user inputs before processing

// Verify model hash
async function verifyModelIntegrity(modelPath, expectedHash) {
  const response = await fetch(modelPath)
  const buffer = await response.arrayBuffer()
  const hashBuffer = await crypto.subtle.digest('SHA-256', buffer)
  const hashArray = Array.from(new Uint8Array(hashBuffer))
  const hashHex = hashArray.map(b => b.toString(16).padStart(2, '0')).join('')
  return hashHex === expectedHash
}

const isValid = await verifyModelIntegrity('model.onnx', 'abc123...')
if (isValid) {
  await loadModel('model.onnx')
}

Performance Optimization Tips

1. Use Web Workers to Avoid Blocking UI

// ai-worker.js
importScripts('https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js') // exposes the global `ort`

let session = null

self.onmessage = async (event) => {
  const { command, data } = event.data

  if (command === 'init') {
    session = await ort.InferenceSession.create(data.modelPath)
  }

  if (command === 'infer') {
    const results = await session.run(data.input)
    self.postMessage({ result: results })
  }
}

// Main thread
const worker = new Worker('ai-worker.js')
worker.postMessage({ command: 'init', data: { modelPath: 'model.onnx' } })

worker.onmessage = (event) => {
  console.log('AI result:', event.data.result)
}
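With several inferences in flight, a bare onmessage handler can’t tell which response belongs to which request. A common pattern is to tag each request with an id and resolve a matching Promise when the reply arrives. A sketch (all names our own; it works with any worker-like object exposing postMessage/onmessage, and assumes the worker echoes the id back):

```javascript
// Promise wrapper so several worker requests can be in flight at once.
// The worker side must include the request id in each response message.
function createWorkerClient(worker) {
  let nextId = 0
  const pending = new Map()

  worker.onmessage = (event) => {
    const { id, result } = event.data
    const resolve = pending.get(id)
    if (resolve) {
      pending.delete(id)
      resolve(result)
    }
  }

  return {
    request(command, data) {
      return new Promise((resolve) => {
        const id = nextId++
        pending.set(id, resolve)
        worker.postMessage({ id, command, data })
      })
    },
  }
}

// Usage:
// const client = createWorkerClient(new Worker('ai-worker.js'))
// const result = await client.request('infer', { input })
```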

2. Batch Inference

// Process multiple samples in one run (amortizes per-call overhead)
async function batchInference(session, samples) {
  // Combine samples into a single batched tensor
  const batchedInput = new ort.Tensor(
    'float32',
    Float32Array.from(samples.flat()),
    [samples.length, 224, 224, 3]
  )

  // Single inference run for the whole batch
  const outputs = await session.run({ images: batchedInput })

  // Split the flat output back into one result per sample
  const flat = outputs.output.data
  const outputSize = flat.length / samples.length
  const results = []
  for (let i = 0; i < samples.length; i++) {
    results.push(Array.from(flat.slice(i * outputSize, (i + 1) * outputSize)))
  }
  return results
}
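One caveat: with hundreds of samples, a single giant batch can exhaust GPU memory. A tiny helper caps the batch size (the right limit is workload- and device-dependent; `chunkIntoBatches` is our own name):

```javascript
// Split a long list of samples into fixed-size batches so memory stays
// bounded; the last batch may be smaller than the rest.
function chunkIntoBatches(samples, batchSize) {
  const batches = []
  for (let i = 0; i < samples.length; i += batchSize) {
    batches.push(samples.slice(i, i + batchSize))
  }
  return batches
}

// Usage: run batchInference once per chunk instead of once over everything
// for (const batch of chunkIntoBatches(allSamples, 8)) { await batchInference(session, batch) }
```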

3. Model Quantization

Quantized models (int8 instead of float32) are 4x smaller and often faster:

// Load quantized model
const session = await ort.InferenceSession.create('model-quantized.onnx', {
  executionProviders: ['webgpu', 'wasm'],
  graphOptimizationLevel: 'all', // Enable optimizations
})
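To see why quantization keeps accuracy reasonably intact, here is the core arithmetic in miniature: a symmetric per-tensor scheme mapping floats onto int8 via a scale factor. This is a simplification of what real toolchains do (they typically pick scales per layer or per channel), meant only to illustrate the 4x size reduction and the bounded rounding error:

```javascript
// Sketch of symmetric int8 quantization: map floats into [-127, 127] with a
// single scale, then recover approximate floats by multiplying back.
function quantize(values) {
  const maxAbs = Math.max(...values.map(Math.abs), 1e-8)
  const scale = maxAbs / 127
  const q = Int8Array.from(values, v =>
    Math.max(-127, Math.min(127, Math.round(v / scale)))
  )
  return { q, scale } // 1 byte per value instead of 4, plus one float
}

function dequantize({ q, scale }) {
  return Array.from(q, v => v * scale)
}
```

The round-trip error per value is at most about half the scale, which is why well-quantized networks lose little accuracy while shrinking 4x.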

The Future of Browser AI

2025-2026 Outlook

  • Chrome GenAI APIs: Graduating from experimental to stable
  • WebGPU Standardization: Broader browser support and framework adoption
  • Model Compression: Techniques enabling larger models in browsers
  • Multimodal Models: Vision and audio processing in browsers
  • Real-Time Features: Voice transcription, video analysis, live translation

Emerging Use Cases

  • Accessibility: Real-time transcription and translation for everyone
  • Content Moderation: Client-side filtering without backend infrastructure
  • Personalization: Adaptive experiences based on local user preferences
  • Productivity: Offline-first tools with AI assistance
  • Gaming: NPCs with real-time natural language understanding

Conclusion: The Browser as an AI Runtime

We’re at the beginning of a revolution where browsers become first-class AI runtimes. The combination of Chrome GenAI APIs, WebGPU, and ONNX.js enables developers to build intelligent applications that are faster, more private, and completely independent of cloud infrastructure.

Key Takeaways:

  1. Chrome GenAI APIs provide built-in text generation for common tasks
  2. WebGPU enables 10-100x faster AI inference through GPU acceleration
  3. ONNX.js allows running any pretrained model directly in browsers
  4. Browser AI is private: No data leaves the user’s device
  5. Performance is strong: Local inference responds in tens of milliseconds
  6. The technology is production-ready today, with rapid improvements coming

Getting Started

  1. Enable Chrome GenAI at chrome://flags (if available in your region)
  2. Explore WebGPU with frameworks like TensorFlow.js or ONNX Runtime
  3. Find pretrained models on Hugging Face or the ONNX Model Zoo
  4. Build a prototype: Text completion, image classification, or voice transcription
  5. Deploy with confidence: Your users’ data is completely safe

The edge is where the future of AI is being built. And that edge is now in the browser.
