See that chat button in the corner? Click it. Ask it something. Then open your browser's Network tab and watch what happens.

Nothing. No API calls. No data leaving your browser. The AI is running entirely on your device.

This isn't a trick. It's WebGPU and Transformers.js working together to run a real language model in your browser. Let me show you how.

The Traditional Approach: Cloud APIs

When you use ChatGPT, Claude, or any typical AI assistant, here's what happens:

Your message travels to a server, which forwards it to an AI provider, which runs inference on its hardware and sends back a response (a minimal proxy of this kind is sketched after the list). This approach:

  • Requires a backend to store API keys securely
  • Costs money per API call ($0.01-0.10+ per response)
  • Sends your data to third-party servers
  • Needs internet for every single query
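
A minimal sketch of that server-side proxy, assuming an OpenAI-style chat completions endpoint (the provider, model name, and environment variable are illustrative):

```js
// server.js (Node 18+): the browser calls this handler; only the server holds the API key.
export async function handleChat(userMessage) {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,   // never shipped to the browser
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: userMessage }],
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}
```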

For many applications, this is fine. The cloud models are incredibly capable. But what if you want complete privacy? Or offline access? Or zero ongoing costs?

The Browser-Based Approach

Modern browsers can now run AI models directly, using your device's GPU for acceleration.

Everything happens in your browser:

  • Model weights download once and cache in IndexedDB
  • Web Worker runs inference off the main thread
  • WebGPU accelerates computation using your GPU
  • No server needed: pure static files

The Key Technologies

WebGPU

WebGPU is the successor to WebGL. While WebGL was designed for graphics, WebGPU is designed for general-purpose GPU computing, including machine learning.

It gives JavaScript direct access to GPU compute shaders, enabling the same kind of parallel processing that powers AI on dedicated hardware.
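
A quick way to check whether a given browser exposes it (a minimal sketch, run from an ES module since it uses top-level await):

```js
// Capability check: is WebGPU available, and can we actually get a device?
if (!navigator.gpu) {
  console.log('No WebGPU; an app would fall back to CPU/WASM inference.');
} else {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.log('WebGPU is exposed, but no suitable GPU adapter was found.');
  } else {
    const device = await adapter.requestDevice();
    console.log('GPU device ready for compute shaders.', device);
  }
}
```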

Transformers.js

Transformers.js is a JavaScript port of Hugging Face's Transformers library. It runs models directly in the browser using ONNX Runtime, with WebGPU acceleration when available.
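
Loading a model and generating a reply takes only a few lines. A minimal sketch (run as an ES module); the model ID and options are illustrative rather than exactly what this site ships:

```js
import { pipeline } from '@huggingface/transformers';

// Download (or load from cache) a small instruction-tuned model.
const generator = await pipeline(
  'text-generation',
  'HuggingFaceTB/SmolLM2-135M-Instruct',
  { device: 'webgpu', dtype: 'q4f16' }
);

// Chat-style input; the library applies the model's chat template for us.
const output = await generator(
  [{ role: 'user', content: 'Explain WebGPU in one sentence.' }],
  { max_new_tokens: 64 }
);

console.log(output[0].generated_text.at(-1).content);
```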

Web Workers

Model inference is computationally intensive. Running it on the main thread would freeze the UI. Web Workers let us run the model in a background thread, keeping the page responsive.
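
The pattern is the standard one (the worker file name here is illustrative):

```js
// main thread: push heavy work to a background worker so the UI never freezes
const worker = new Worker(new URL('./llm-worker.js', import.meta.url), { type: 'module' });

worker.postMessage({ prompt: 'Hello!' });                        // send work to the worker
worker.onmessage = (event) => console.log('reply:', event.data); // receive results back
```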

IndexedDB

Model files are large (500MB-2GB). IndexedDB lets us cache them persistently, so users only download once. Transformers.js handles this automatically.
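
Transformers.js manages that cache itself, but the standard Storage API lets you see how much space the cached weights take up and ask the browser to keep them around (a small sketch, run from an ES module):

```js
// How much persistent storage is this origin using? (Includes cached model weights.)
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ${(usage / 1e6).toFixed(0)} MB of a ~${(quota / 1e9).toFixed(1)} GB quota`);

// Ask the browser not to evict our storage under pressure (it may decline or prompt).
const persisted = await navigator.storage.persist();
console.log('Persistent storage granted:', persisted);
```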

The Trade-Offs

Browser-based AI isn't always the right choice. Here's an honest comparison:

| Aspect | Cloud APIs | Browser AI |
|---|---|---|
| Privacy | Data sent to servers | 100% on-device |
| Quality | ★★★★★ (GPT-4, Claude) | ★★★☆☆ (smaller models) |
| Latency | ~1-3s (network + inference) | ~2-10s (inference only) |
| Cost | $0.01-0.10 per query | Free (after download) |
| Offline | No | Yes |
| Initial Load | Fast | ~500MB-2GB download |
| Browser Support | Universal | WebGPU required |

The iOS Safari Challenge

If you've tried the chat on an iPhone, you may have hit a wall. iOS Safari has significantly stricter memory constraints than desktop browsers, and understanding why taught me a lot about how mobile browsers work. (For a deep dive, see Browser Memory Management.)

Why iOS Is Different

Desktop operating systems back their virtual memory with swap: when RAM fills up, the OS can page less-used memory out to disk. This lets applications temporarily exceed physical RAM limits.

iOS has no swap. When memory pressure builds, iOS doesn't page to flash storage; it kills processes. Your Safari tab is just another process that can be terminated to free RAM.

The Numbers

Based on community testing and my own experiments, here are the practical limits:

| Environment | Practical Limit | Behavior |
|---|---|---|
| Desktop Chrome/Safari | 4GB+ | Swaps to disk |
| iOS Safari (recent iPhone) | ~1-2GB | Tab killed |
| iOS Safari (older iPhone) | ~300-500MB | Tab killed |
| Android Chrome | ~300MB (reliably) | Tab killed |

The ~400MB Qwen model I was using? It seems like it should fit. But that's just the model weights. Add ONNX Runtime overhead, WebGPU buffer allocations, JavaScript heap, and the rest of the page, and you're past the limit before inference even starts.

Safari's WebAssembly Threading Bug

Making things worse: Safari has a known issue with multi-threaded WebAssembly. When ONNX Runtime uses multiple threads (the default for performance), Safari's memory usage grows without bound until the tab crashes.

The fix? Force single-threaded execution.
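
With Transformers.js, that is a one-line change through its environment settings (the option path below matches current releases; it may differ in yours):

```js
import { env } from '@huggingface/transformers';

// Cap ONNX Runtime's WASM backend at one thread to avoid Safari's
// unbounded memory growth with multi-threaded WebAssembly.
env.backends.onnx.wasm.numThreads = 1;
```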

This trades some performance for stability, but on mobile, stability wins.

The Solution: Smaller Models

The real fix is using a model small enough for mobile memory constraints. SmolLM2-135M is specifically designed for this:

| Model | Q4F16 Size | Quality | iOS? |
|---|---|---|---|
| Qwen1.5-0.5B-Chat | ~494 MB | ★★★☆☆ | ❌ Too large |
| SmolLM2-360M | ~200 MB | ★★½☆☆ | ⚠️ Borderline |
| SmolLM2-135M | ~118 MB | ★★☆☆☆ | ✅ Works |

Yes, the smaller model is less capable. It won't write your thesis. But it can answer questions, explain concepts, and demonstrate what's possible, all while fitting in your pocket.

f16: Half the Memory

WebGPU's shader-f16 extension enables half-precision floats. On Apple devices (all of which support f16), this means:

  • 50% memory reduction for model weights
  • ~40% faster inference (less memory bandwidth needed)
  • Slight accuracy loss (usually imperceptible)

Transformers.js supports this via the dtype: 'q4f16' option, which combines 4-bit quantized weights with f16 activations. It's the sweet spot for mobile.
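
To be defensive, you can check for shader-f16 before choosing a dtype. A sketch, with the model ID as a stand-in and 'q4' assumed as the non-f16 fallback:

```js
import { pipeline } from '@huggingface/transformers';

const MODEL_ID = 'HuggingFaceTB/SmolLM2-135M-Instruct';   // illustrative

// Only request f16 activations if the GPU actually advertises shader-f16.
const adapter = await navigator.gpu?.requestAdapter();
const hasF16 = adapter?.features.has('shader-f16') ?? false;

const generator = await pipeline('text-generation', MODEL_ID, {
  device: adapter ? 'webgpu' : 'wasm',
  dtype: hasF16 ? 'q4f16' : 'q4',     // fall back to 4-bit weights without f16 activations
});
```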

The Resulting Architecture
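
Putting the pieces together: the main thread owns the chat UI and never touches the model. It talks to a Web Worker, which loads the model through Transformers.js (q4f16 weights, WebGPU when available, single-threaded WASM as the Safari fallback) and streams tokens back as messages, while the weights are cached in IndexedDB after the first download. A sketch of the main-thread side, with illustrative message names and UI helpers:

```js
// chat-ui.js (main thread): spawn the worker and wire its messages to the UI.
// Worker file name, message shapes, and UI helper functions are illustrative.
const worker = new Worker(new URL('./llm-worker.js', import.meta.url), { type: 'module' });

worker.onmessage = ({ data }) => {
  switch (data.type) {
    case 'progress':                    // model download / initialization progress
      updateLoadingBar(data.percent);
      break;
    case 'token':                       // one chunk of streamed output
      appendToCurrentReply(data.text);
      break;
    case 'done':                        // generation finished
      enableInput();
      break;
  }
};

function sendPrompt(prompt) {
  disableInput();                       // your UI helpers, not part of any library
  worker.postMessage({ type: 'generate', prompt });
}
```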

WebGPU on iOS: Finally Here

As of iOS 18 (Safari 18), WebGPU is finally enabled by default on iPhone and iPad. Before this, iOS was the last major holdout; Chrome and Edge had shipped WebGPU more than a year earlier.

This means the chat assistant can now use GPU acceleration on iOS, not just CPU-based WASM inference. The performance difference is significant: WebGPU inference is often 2-5x faster than WASM-only.

When to Use Browser AI

Good fit:

  • Privacy-sensitive applications (medical, financial, personal)
  • Offline-capable tools
  • Educational demos and experiments
  • Avoiding ongoing API costs
  • Simple tasks that don't need GPT-4-level capabilities

Better with cloud:

  • Complex reasoning tasks
  • Long-form content generation
  • Applications needing the best possible quality
  • Users on older devices or browsers

Try It Yourself

The chat assistant on this site is a working example. Click the 💬 button and start a conversation. The first time, it will download the model from Hugging Face (~118 MB for SmolLM2-135M). After that, it loads from cache in seconds.

Open DevTools and watch the Network tab; you'll see zero API calls to any AI service. It's all happening locally.

The Code

Here's a simplified sketch of how the chat assistant works. The snippet below is the worker side of the architecture shown above; the model ID and generation options are stand-ins rather than the exact ones this site ships:
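
```js
// llm-worker.js (Web Worker): owns the model and streams tokens to the main thread.
// Model ID, options, and message shapes are illustrative.
import { pipeline, env, TextStreamer } from '@huggingface/transformers';

// Safari workaround discussed above: keep the WASM backend single-threaded.
env.backends.onnx.wasm.numThreads = 1;

let generatorPromise = null;

function getGenerator() {
  // Lazily create the pipeline; weights download once, then come from the browser cache.
  generatorPromise ??= pipeline(
    'text-generation',
    'HuggingFaceTB/SmolLM2-135M-Instruct',
    {
      device: 'webgpu',                 // GPU-accelerated inference where supported
      dtype: 'q4f16',                   // 4-bit weights, f16 activations
      progress_callback: (p) => {
        if (p.status === 'progress') {
          self.postMessage({ type: 'progress', percent: Math.round(p.progress ?? 0) });
        }
      },
    }
  );
  return generatorPromise;
}

self.onmessage = async ({ data }) => {
  if (data.type !== 'generate') return;

  const generator = await getGenerator();

  // Forward each decoded chunk to the main thread as it is produced.
  const streamer = new TextStreamer(generator.tokenizer, {
    skip_prompt: true,
    callback_function: (text) => self.postMessage({ type: 'token', text }),
  });

  await generator(
    [{ role: 'user', content: data.prompt }],
    { max_new_tokens: 256, streamer }
  );

  self.postMessage({ type: 'done' });
};
```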

What's Next

Browser-based AI is just getting started. As WebGPU matures and models get more efficient, we'll see:

  • Larger, more capable models running locally
  • Better quantization for smaller downloads
  • Specialized models for specific tasks
  • Hybrid approaches (local for privacy, cloud for complexity)

The web platform keeps absorbing capabilities that used to require native apps or server infrastructure. AI inference is just the latest example.

Further Reading