AI in the Browser: No Server Required
See that chat button in the corner? Click it. Ask it something. Then open your browser's Network tab and watch what happens.
Nothing. No API calls. No data leaving your browser. The AI is running entirely on your device.
This isn't a trick. It's WebGPU and Transformers.js working together to run a real language model in your browser. Let me show you how.
The Traditional Approach: Cloud APIs
When you use ChatGPT, Claude, or any typical AI assistant, here's what happens:
Your message travels to a server, which forwards it to an AI provider; the provider runs inference on its own hardware and sends a response back. This approach:
- Requires a backend to store API keys securely
- Costs money per API call ($0.01-0.10+ per response)
- Sends your data to third-party servers
- Needs internet for every single query
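On the client, that flow usually boils down to a single call to your own backend, which holds the API key and relays the request to the provider. A minimal sketch, where the /api/chat route is purely hypothetical:

```js
// Hypothetical backend route: the browser never sees the provider's API key.
// The server receives this request, calls the AI provider, and relays the reply.
// (Top-level await assumes a module or async context.)
const response = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message: 'Summarize this article for me.' }),
});

const { reply } = await response.json();
console.log(reply);
```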
For many applications, this is fine. The cloud models are incredibly capable. But what if you want complete privacy? Or offline access? Or zero ongoing costs?
The Browser-Based Approach
Modern browsers can now run AI models directly, using your device's GPU for acceleration.
Everything happens in your browser:
- Model weights download once and cache in IndexedDB
- Web Worker runs inference off the main thread
- WebGPU accelerates computation using your GPU
- No server needed, just pure static files
The Key Technologies
WebGPU
WebGPU is the successor to WebGL. While WebGL was designed for graphics, WebGPU is designed for general-purpose GPU computing, including machine learning.
It gives JavaScript direct access to GPU compute shaders, enabling the same kind of parallel processing that powers AI on dedicated hardware.
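A quick way to check whether a browser can take this path is plain feature detection. This is a minimal sketch using the standard WebGPU API; what you do on fallback is up to you:

```js
// Detect WebGPU support before trying to load a model.
async function detectWebGPU() {
  if (!('gpu' in navigator)) {
    return null; // Browser has no WebGPU at all
  }
  // requestAdapter() resolves to null if no suitable GPU is available.
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return null;
  }
  // requestDevice() returns the handle used for compute work.
  return adapter.requestDevice();
}

detectWebGPU().then((device) => {
  console.log(device ? 'WebGPU available' : 'Falling back to WASM inference');
});
```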
Transformers.js
Transformers.js is a JavaScript port of Hugging Face's Transformers library. It runs models directly in the browser using ONNX Runtime, with WebGPU acceleration when available.
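As a rough sketch of what that looks like in practice (the package name assumes the current @huggingface/transformers release, and the SmolLM2 model id is an assumption rather than this site's exact configuration):

```js
import { pipeline } from '@huggingface/transformers';

// Downloading the model happens here on first use; later runs load from cache.
const generator = await pipeline(
  'text-generation',
  'HuggingFaceTB/SmolLM2-135M-Instruct', // assumed model id, not necessarily this site's
  { device: 'webgpu', dtype: 'q4f16' },  // GPU acceleration + 4-bit weights with f16 activations
);

const messages = [
  { role: 'user', content: 'Explain WebGPU in one sentence.' },
];

const output = await generator(messages, { max_new_tokens: 128 });
// For chat-style input, generated_text holds the whole conversation;
// the last entry is the model's reply.
console.log(output[0].generated_text.at(-1).content);
```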
Web Workers
Model inference is computationally intensive. Running it on the main thread would freeze the UI. Web Workers let us run the model in a background thread, keeping the page responsive.
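A minimal sketch of that split; the llm-worker.js file name and message shapes are assumptions, not this site's exact code:

```js
// main.js — keep the UI thread free by delegating inference to a worker.
// (new URL(..., import.meta.url) assumes a module script / bundler setup.)
const worker = new Worker(new URL('./llm-worker.js', import.meta.url), { type: 'module' });

worker.onmessage = (event) => {
  // The worker posts back status updates and generated text.
  console.log('from worker:', event.data);
};

// Send the user's prompt; the heavy lifting happens off the main thread.
worker.postMessage({ type: 'generate', prompt: 'Hello!' });
```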
IndexedDB
Model files are large (500MB-2GB). IndexedDB lets us cache them persistently, so users only download once. Transformers.js handles this automatically.
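The caching itself is managed for you, but the standard Storage API can show how much space the cached weights occupy and ask the browser to keep them around. A small, optional sketch:

```js
// Inspect storage usage and request persistence for the cached model weights.
// (Top-level await assumes a module script.)
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ${(usage / 1e6).toFixed(0)} MB of ~${(quota / 1e6).toFixed(0)} MB quota`);

const persisted = await navigator.storage.persist();
console.log(persisted ? 'Cache marked as persistent' : 'Cache may be evicted under pressure');
```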
The Trade-Offs
Browser-based AI isn't always the right choice. Here's an honest comparison:
| Aspect | Cloud APIs | Browser AI |
|---|---|---|
| Privacy | Data sent to servers | 100% on-device |
| Quality | ★★★★★ (GPT-4, Claude) | ★★★☆☆ (smaller models) |
| Latency | ~1-3s (network + inference) | ~2-10s (inference only) |
| Cost | $0.01-0.10 per query | Free (after download) |
| Offline | No | Yes |
| Initial Load | Fast | ~500MB-2GB download |
| Browser Support | Universal | WebGPU required |
The iOS Safari Challenge
If you've tried the chat on an iPhone, you may have hit a wall. iOS Safari has significantly stricter memory constraints than desktop browsers, and understanding why taught me a lot about how mobile browsers work. (For a deep dive, see Browser Memory Management.)
Why iOS Is Different
Desktop operating systems have virtual memory. When RAM fills up, the OS can swap less-used memory to disk. This lets applications temporarily exceed physical RAM limits.
iOS has no swap. When memory pressure builds, iOS doesn't page to flash storage; it kills processes. Your Safari tab is just another process that can be terminated to free RAM.
The Numbers
Based on community testing and my own experiments, here are the practical limits:
| Environment | Practical Limit | Behavior |
|---|---|---|
| Desktop Chrome/Safari | 4GB+ | Swaps to disk |
| iOS Safari (recent iPhone) | ~1-2GB | Tab killed |
| iOS Safari (older iPhone) | ~300-500MB | Tab killed |
| Android Chrome | ~300MB reliable | Tab killed |
The ~400MB Qwen model I was using? It seems like it should fit. But that's just the model weights. Add ONNX Runtime overhead, WebGPU buffer allocations, the JavaScript heap, and the rest of the page, and you're past the limit before inference even starts.
Safari's WebAssembly Threading Bug
Making things worse: Safari has a known issue with multi-threaded WebAssembly. When ONNX Runtime uses multiple threads (the default for performance), Safari's memory usage grows without bound until the tab crashes.
The fix? Force single-threaded execution:
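In Transformers.js this can be done through its env object, which exposes the underlying ONNX Runtime WASM flags. A sketch, with a deliberately simplified user-agent check:

```js
import { env } from '@huggingface/transformers';

// Force ONNX Runtime's WASM backend to a single thread on iOS Safari,
// avoiding the unbounded memory growth seen with multi-threaded WASM.
// (Real-world iOS detection is messier; this check is illustrative only.)
const isIOS = /iPad|iPhone|iPod/.test(navigator.userAgent);
if (isIOS) {
  env.backends.onnx.wasm.numThreads = 1;
}
```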
This trades some performance for stability, but on mobile, stability wins.
The Solution: Smaller Models
The real fix is using a model small enough for mobile memory constraints. SmolLM2-135M is specifically designed for this:
| Model | Q4F16 Size | Quality | iOS? |
|---|---|---|---|
| Qwen1.5-0.5B-Chat | ~494 MB | ★★★☆☆ | ❌ Too large |
| SmolLM2-360M | ~200 MB | ★★½☆☆ | ⚠️ Borderline |
| SmolLM2-135M | ~118 MB | ★★☆☆☆ | ✅ Works |
Yes, the smaller model is less capable. It won't write your thesis. But it can answer questions, explain concepts, and demonstrate what's possible, all while fitting in your pocket.
f16: Half the Memory
WebGPU's shader-f16 extension enables half-precision floats. On Apple devices (all of which support f16), this means:
- 50% memory reduction for model weights
- ~40% faster inference (less memory bandwidth)
- Slight accuracy loss (usually imperceptible)
Transformers.js supports this via the dtype: 'q4f16' option, which pairs 4-bit quantized weights with f16 activations. It's the sweet spot for mobile.
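One way to pick the precision at runtime (a sketch, not necessarily how this site does it) is to check for the shader-f16 feature before loading the model:

```js
// Choose quantization based on whether the GPU supports half-precision shaders.
// (Top-level await assumes a module script.)
const adapter = await navigator.gpu?.requestAdapter();
const hasF16 = adapter?.features?.has('shader-f16') ?? false;

// 'q4f16' = 4-bit weights + f16 activations; 'q4' keeps f32 activations.
const dtype = hasF16 ? 'q4f16' : 'q4';
console.log('Selected dtype:', dtype);
```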
The Resulting Architecture
Putting the pieces together: the page itself is still just static files. The chat UI lives on the main thread, a Web Worker loads SmolLM2-135M through Transformers.js with q4f16 weights, inference runs on WebGPU where it's available (dropping to single-threaded WASM where it isn't), and the downloaded weights are cached so later visits skip the download entirely.
WebGPU on iOS: Finally Here
As of iOS 18 (Safari 18), WebGPU is finally enabled by default on iPhone and iPad. Before this, iOS was the last major holdout; Chrome and Edge had WebGPU for over a year.
This means the chat assistant can now use GPU acceleration on iOS, not just CPU-based WASM inference. The performance difference is significant: WebGPU inference is often 2-5x faster than WASM-only.
When to Use Browser AI
Good fit:
- Privacy-sensitive applications (medical, financial, personal)
- Offline-capable tools
- Educational demos and experiments
- Avoiding ongoing API costs
- Simple tasks that don't need GPT-4-level capabilities
Better with cloud:
- Complex reasoning tasks
- Long-form content generation
- Applications needing the best possible quality
- Users on older devices or browsers
Try It Yourself
The chat assistant on this site is a working example. Click the 💬 button and start a conversation. The first time, it will download the model (roughly 120 MB from Hugging Face). After that, it loads from cache in seconds.
Open DevTools and watch the Network tab: you'll see zero API calls to any AI service. It's all happening locally.
The Code
Here's a simplified version of how the chat assistant works:
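This is a simplified, reconstructed sketch rather than the exact source: file names, the model id, message shapes, and generation options are all assumptions.

```js
// llm-worker.js — runs inside a Web Worker so inference never blocks the UI.
import { pipeline, env } from '@huggingface/transformers';

// Work around Safari's multi-threaded WASM memory growth.
env.backends.onnx.wasm.numThreads = 1;

let generator = null;

async function getGenerator(progressCallback) {
  if (!generator) {
    generator = await pipeline('text-generation', 'HuggingFaceTB/SmolLM2-135M-Instruct', {
      device: navigator.gpu ? 'webgpu' : 'wasm', // GPU when available, CPU WASM otherwise
      dtype: 'q4f16',                            // 4-bit weights, f16 activations
      progress_callback: progressCallback,       // forwards download/loading progress to the UI
    });
  }
  return generator;
}

self.onmessage = async (event) => {
  const { messages } = event.data;
  const gen = await getGenerator((p) => self.postMessage({ type: 'progress', data: p }));
  const output = await gen(messages, { max_new_tokens: 256, do_sample: false });
  // For chat-style input, generated_text holds the whole conversation;
  // the last entry is the assistant's reply.
  self.postMessage({ type: 'reply', text: output[0].generated_text.at(-1).content });
};
```

And the main thread only renders UI and exchanges messages with the worker:

```js
// main.js — no inference here, just a message channel to the worker.
const worker = new Worker(new URL('./llm-worker.js', import.meta.url), { type: 'module' });

const history = [];

worker.onmessage = (event) => {
  if (event.data.type === 'progress') {
    // e.g. update a download progress bar
  } else if (event.data.type === 'reply') {
    history.push({ role: 'assistant', content: event.data.text });
    // render the reply in the chat UI
  }
};

// Wire this to the chat input's submit handler.
function send(userText) {
  history.push({ role: 'user', content: userText });
  worker.postMessage({ messages: history });
}
```

The important property is that nothing here talks to a server after the initial model download; every byte of inference stays on the device.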
What's Next
Browser-based AI is just getting started. As WebGPU matures and models get more efficient, we'll see:
- Larger, more capable models running locally
- Better quantization for smaller downloads
- Specialized models for specific tasks
- Hybrid approaches (local for privacy, cloud for complexity)
The web platform keeps absorbing capabilities that used to require native apps or server infrastructure. AI inference is just the latest example.
Further Reading
- WebGPU API – MDN Web Docs
- Transformers.js Documentation – Hugging Face
- Web Workers API – MDN Web Docs
- IndexedDB API – MDN Web Docs
- WebGPU Fundamentals – Tutorial site