See that chat button in the corner? Click it. Ask it something. Then open your browser's Network tab and watch what happens.

Nothing. No API calls. No data leaving your browser. The AI is running entirely on your device.

This isn't a trick. It's WebGPU and Transformers.js working together to run a real language model in your browser. Let me show you how.

The Traditional Approach: Cloud APIs

When you use ChatGPT, Claude, or any typical AI assistant, here's what happens:

Your message travels to a server, which forwards it to an AI provider, which runs inference on its hardware and sends back a response (a minimal proxy of this kind is sketched after the list). This approach:

  • Requires a backend to store API keys securely
  • Costs money per API call ($0.01-0.10+ per response)
  • Sends your data to third-party servers
  • Needs internet for every single query
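
A minimal sketch of that server-side proxy, assuming an OpenAI-style chat completions endpoint (the provider, model name, and environment variable are illustrative):

```js
// server.js (Node 18+): the browser calls this handler; only the server holds the API key.
export async function handleChat(userMessage) {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,   // never shipped to the browser
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: userMessage }],
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}
```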

For many applications, this is fine. The cloud models are incredibly capable. But what if you want complete privacy? Or offline access? Or zero ongoing costs?

The Browser-Based Approach

Modern browsers can now run AI models directly, using your device's GPU for acceleration.

Everything happens in your browser:

  • Model weights download once and cache in IndexedDB
  • Web Worker runs inference off the main thread
  • WebGPU accelerates computation using your GPU
  • No server needed: pure static files

The Key Technologies

WebGPU

WebGPU is the successor to WebGL. While WebGL was designed for graphics, WebGPU is designed for general-purpose GPU computing, including machine learning.

It gives JavaScript direct access to GPU compute shaders, enabling the same kind of parallel processing that powers AI on dedicated hardware.
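
A quick way to check whether a given browser exposes it (a minimal sketch, run from an ES module since it uses top-level await):

```js
// Capability check: is WebGPU available, and can we actually get a device?
if (!navigator.gpu) {
  console.log('No WebGPU; an app would fall back to CPU/WASM inference.');
} else {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.log('WebGPU is exposed, but no suitable GPU adapter was found.');
  } else {
    const device = await adapter.requestDevice();
    console.log('GPU device ready for compute shaders.', device);
  }
}
```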

Transformers.js

Transformers.js is a JavaScript port of Hugging Face's Transformers library. It runs models directly in the browser using ONNX Runtime, with WebGPU acceleration when available.
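
Loading a model and generating a reply takes only a few lines. A minimal sketch (run as an ES module); the model ID and options are illustrative rather than exactly what this site ships:

```js
import { pipeline } from '@huggingface/transformers';

// Download (or load from cache) a small instruction-tuned model.
const generator = await pipeline(
  'text-generation',
  'HuggingFaceTB/SmolLM2-135M-Instruct',
  { device: 'webgpu', dtype: 'q4f16' }
);

// Chat-style input; the library applies the model's chat template for us.
const output = await generator(
  [{ role: 'user', content: 'Explain WebGPU in one sentence.' }],
  { max_new_tokens: 64 }
);

console.log(output[0].generated_text.at(-1).content);
```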

Web Workers

Model inference is computationally intensive. Running it on the main thread would freeze the UI. Web Workers let us run the model in a background thread, keeping the page responsive.
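
The pattern is the standard one (the worker file name here is illustrative):

```js
// main thread: push heavy work to a background worker so the UI never freezes
const worker = new Worker(new URL('./llm-worker.js', import.meta.url), { type: 'module' });

worker.postMessage({ prompt: 'Hello!' });                        // send work to the worker
worker.onmessage = (event) => console.log('reply:', event.data); // receive results back
```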

IndexedDB

Model files are large (500MB-2GB). IndexedDB lets us cache them persistently, so users only download once. Transformers.js handles this automatically.
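
Transformers.js manages that cache itself, but the standard Storage API lets you see how much space the cached weights take up and ask the browser to keep them around (a small sketch, run from an ES module):

```js
// How much persistent storage is this origin using? (Includes cached model weights.)
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ${(usage / 1e6).toFixed(0)} MB of a ~${(quota / 1e9).toFixed(1)} GB quota`);

// Ask the browser not to evict our storage under pressure (it may decline or prompt).
const persisted = await navigator.storage.persist();
console.log('Persistent storage granted:', persisted);
```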

The Trade-Offs

Browser-based AI isn't always the right choice. Here's an honest comparison:

| Aspect | Cloud APIs | Browser AI |
|---|---|---|
| Privacy | Data sent to servers | 100% on-device |
| Quality | ★★★★★ (GPT-4, Claude) | ★★★☆☆ (smaller models) |
| Latency | ~1-3s (network + inference) | ~2-10s (inference only) |
| Cost | $0.01-0.10 per query | Free (after download) |
| Offline | No | Yes |
| Initial Load | Fast | ~500MB-2GB download |
| Browser Support | Universal | WebGPU required |

The iOS Safari Challenge

If you've tried the chat on an iPhone, you may have hit a wall. iOS Safari has significantly stricter memory constraints than desktop browsers, and understanding why taught me a lot about how mobile browsers work. (For a deep dive, see Browser Memory Management.)

Why iOS Is Different

Desktop operating systems back their virtual memory with swap: when RAM fills up, the OS can page less-used memory out to disk. This lets applications temporarily exceed physical RAM limits.

iOS has no swap. When memory pressure builds, iOS doesn't page to flash storage; it kills processes. Your Safari tab is just another process that can be terminated to free RAM.

The Numbers

Based on community testing and my own experiments, here are the practical limits:

| Environment | Practical Limit | Behavior |
|---|---|---|
| Desktop Chrome/Safari | 4GB+ | Swaps to disk |
| iOS Safari (recent iPhone) | ~1-2GB | Tab killed |
| iOS Safari (older iPhone) | ~300-500MB | Tab killed |
| Android Chrome | ~300MB (reliably) | Tab killed |

The ~400MB Qwen model I was using? It seems like it should fit. But that's just the model weights. Add ONNX Runtime overhead, WebGPU buffer allocations, JavaScript heap, and the rest of the page, and you're past the limit before inference even starts.

Safari's WebAssembly Threading Bug

Making things worse: Safari has a known issue with multi-threaded WebAssembly. When ONNX Runtime uses multiple threads (the default for performance), Safari's memory usage grows without bound until the tab crashes.

The fix? Force single-threaded execution.
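
With Transformers.js, that is a one-line change through its environment settings (the option path below matches current releases; it may differ in yours):

```js
import { env } from '@huggingface/transformers';

// Cap ONNX Runtime's WASM backend at one thread to avoid Safari's
// unbounded memory growth with multi-threaded WebAssembly.
env.backends.onnx.wasm.numThreads = 1;
```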

This trades some performance for stability, but on mobile, stability wins.

The Solution: Smaller Models

The real fix is using a model small enough for mobile memory constraints. SmolLM2-135M is specifically designed for this:

| Model | Q4F16 Size | Quality | iOS? |
|---|---|---|---|
| Qwen1.5-0.5B-Chat | ~494 MB | ★★★☆☆ | ❌ Too large |
| SmolLM2-360M | ~200 MB | ★★½☆☆ | ⚠️ Borderline |
| SmolLM2-135M | ~118 MB | ★★☆☆☆ | ✅ Works |

Yes, the smaller model is less capable. It won't write your thesis. But it can answer questions, explain concepts, and demonstrate what's possible, all while fitting in your pocket.

f16: Half the Memory

WebGPU's shader-f16 extension enables half-precision floats. On Apple devices (all of which support f16), this means:

  • 50% memory reduction for model weights
  • ~40% faster inference (less memory bandwidth needed)
  • Slight accuracy loss (usually imperceptible)

Transformers.js supports this via the dtype: 'q4f16' option, which combines 4-bit quantized weights with f16 activations. It's the sweet spot for mobile.
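
To be defensive, you can check for shader-f16 before choosing a dtype. A sketch, with the model ID as a stand-in and 'q4' assumed as the non-f16 fallback:

```js
import { pipeline } from '@huggingface/transformers';

const MODEL_ID = 'HuggingFaceTB/SmolLM2-135M-Instruct';   // illustrative

// Only request f16 activations if the GPU actually advertises shader-f16.
const adapter = await navigator.gpu?.requestAdapter();
const hasF16 = adapter?.features.has('shader-f16') ?? false;

const generator = await pipeline('text-generation', MODEL_ID, {
  device: adapter ? 'webgpu' : 'wasm',
  dtype: hasF16 ? 'q4f16' : 'q4',     // fall back to 4-bit weights without f16 activations
});
```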

The Resulting Architecture
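
Putting the pieces together: the main thread owns the chat UI and never touches the model. It talks to a Web Worker, which loads the model through Transformers.js (q4f16 weights, WebGPU when available, single-threaded WASM as the Safari fallback) and streams tokens back as messages, while the weights are cached in IndexedDB after the first download. A sketch of the main-thread side, with illustrative message names and UI helpers:

```js
// chat-ui.js (main thread): spawn the worker and wire its messages to the UI.
// Worker file name, message shapes, and UI helper functions are illustrative.
const worker = new Worker(new URL('./llm-worker.js', import.meta.url), { type: 'module' });

worker.onmessage = ({ data }) => {
  switch (data.type) {
    case 'progress':                    // model download / initialization progress
      updateLoadingBar(data.percent);
      break;
    case 'token':                       // one chunk of streamed output
      appendToCurrentReply(data.text);
      break;
    case 'done':                        // generation finished
      enableInput();
      break;
  }
};

function sendPrompt(prompt) {
  disableInput();                       // your UI helpers, not part of any library
  worker.postMessage({ type: 'generate', prompt });
}
```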

WebGPU on iOS: Finally Here

As of iOS 18 (Safari 18), WebGPU is finally enabled by default on iPhone and iPad. Before this, iOS was the last major holdout; Chrome and Edge had shipped WebGPU more than a year earlier.

This means the chat assistant can now use GPU acceleration on iOS, not just CPU-based WASM inference. The performance difference is significant: WebGPU inference is often 2-5x faster than WASM-only.

When to Use Browser AI

Good fit:

  • Privacy-sensitive applications (medical, financial, personal)
  • Offline-capable tools
  • Educational demos and experiments
  • Avoiding ongoing API costs
  • Simple tasks that don't need GPT-4-level capabilities

Better with cloud:

  • Complex reasoning tasks
  • Long-form content generation
  • Applications needing the best possible quality
  • Users on older devices or browsers

Try It Yourself

The chat assistant on this site is a working example. Click the 💬 button and start a conversation. The first time, it will download the model from Hugging Face (~118 MB for SmolLM2-135M). After that, it loads from cache in seconds.

Open DevTools and watch the Network tab; you'll see zero API calls to any AI service. It's all happening locally.

The Code

Here's a simplified sketch of how the chat assistant works. The snippet below is the worker side of the architecture shown above; the model ID and generation options are stand-ins rather than the exact ones this site ships:
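
```js
// llm-worker.js (Web Worker): owns the model and streams tokens to the main thread.
// Model ID, options, and message shapes are illustrative.
import { pipeline, env, TextStreamer } from '@huggingface/transformers';

// Safari workaround discussed above: keep the WASM backend single-threaded.
env.backends.onnx.wasm.numThreads = 1;

let generatorPromise = null;

function getGenerator() {
  // Lazily create the pipeline; weights download once, then come from the browser cache.
  generatorPromise ??= pipeline(
    'text-generation',
    'HuggingFaceTB/SmolLM2-135M-Instruct',
    {
      device: 'webgpu',                 // GPU-accelerated inference where supported
      dtype: 'q4f16',                   // 4-bit weights, f16 activations
      progress_callback: (p) => {
        if (p.status === 'progress') {
          self.postMessage({ type: 'progress', percent: Math.round(p.progress ?? 0) });
        }
      },
    }
  );
  return generatorPromise;
}

self.onmessage = async ({ data }) => {
  if (data.type !== 'generate') return;

  const generator = await getGenerator();

  // Forward each decoded chunk to the main thread as it is produced.
  const streamer = new TextStreamer(generator.tokenizer, {
    skip_prompt: true,
    callback_function: (text) => self.postMessage({ type: 'token', text }),
  });

  await generator(
    [{ role: 'user', content: data.prompt }],
    { max_new_tokens: 256, streamer }
  );

  self.postMessage({ type: 'done' });
};
```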

What's Next

Browser-based AI is just getting started. As WebGPU matures and models get more efficient, we'll see:

  • Larger, more capable models running locally
  • Better quantization for smaller downloads
  • Specialized models for specific tasks
  • Hybrid approaches (local for privacy, cloud for complexity)

The web platform keeps absorbing capabilities that used to require native apps or server infrastructure. AI inference is just the latest example.

Further Reading