Run an LLM in Your Browser — Cloud vs. Local AI

A real model, on your machine, right now

Every previous lesson showed a piece of the pipeline; this one runs the whole thing. Using WebLLM and WebGPU, this page downloads a quantized small language model (Qwen 2.5, 0.5 billion parameters) and runs inference entirely inside your browser tab. Once the model has downloaded, there is no inference server and no API key: the tokens are generated by your own GPU. If your browser does not support WebGPU, a recorded replay shows what a live run looks like.
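The choice between a live run and the recorded replay comes down to a WebGPU capability check. The sketch below shows one way such a check could look; it is not this page's actual code. It relies on the standard `navigator.gpu.requestAdapter()` call, which resolves to `null` when no suitable GPU adapter is available. The names `chooseDemoMode`, `DemoMode`, and `GpuLike` are hypothetical and exist only for this illustration.

```typescript
// Minimal structural type for the one WebGPU call used here.
// In a browser, `navigator.gpu` is only defined when WebGPU is available.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}

// Hypothetical names for this sketch, not taken from the page's source.
type DemoMode = "live" | "replay";

// Decide whether to run live in-browser inference or fall back to the replay.
async function chooseDemoMode(nav: { gpu?: GpuLike }): Promise<DemoMode> {
  // No `gpu` property at all: the browser does not implement WebGPU.
  if (!nav.gpu) return "replay";
  try {
    // Resolves to null when WebGPU exists but no usable adapter is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "live" : "replay";
  } catch {
    // Treat any failure during adapter lookup as "not supported".
    return "replay";
  }
}
```

In a page this would be called as `chooseDemoMode(navigator as { gpu?: GpuLike })`. Taking the navigator as a parameter keeps the check easy to exercise outside a browser.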

Cloud vs. local: the real tradeoffs

Cloud models like GPT-4 and Claude are enormous and fast, and the provider keeps them updated, but you pay per token and your data leaves your machine. Local models are private, carry no per-token cost, and work offline, but you trade away model size and speed. The Latency Race demo on this page makes the difference visceral, and the tradeoff cards break down when each option wins: privacy-sensitive work, cost-sensitive scale, and offline use favor local; frontier capability favors the cloud.
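The speed side of this tradeoff can be approximated with a two-term model: a fixed delay before the first token, plus the time to decode the remaining tokens. The sketch below is a simplified illustration, not the logic behind the Latency Race demo. Every number in it is a hypothetical placeholder rather than a measurement, and the names `LatencyProfile`, `responseTimeMs`, and `crossoverTokens` are invented for the example.

```typescript
// All figures below are hypothetical placeholders, not measurements.
interface LatencyProfile {
  timeToFirstTokenMs: number; // e.g. network round trip (cloud) or prompt processing (local)
  tokensPerSecond: number;    // steady-state decode speed
}

// Wall-clock time to produce `tokens` output tokens under the two-term model.
function responseTimeMs(p: LatencyProfile, tokens: number): number {
  return p.timeToFirstTokenMs + (tokens / p.tokensPerSecond) * 1000;
}

// Output length at which both profiles take equal time,
// or null if they never cross at a positive length.
function crossoverTokens(a: LatencyProfile, b: LatencyProfile): number | null {
  const msPerTokenA = 1000 / a.tokensPerSecond;
  const msPerTokenB = 1000 / b.tokensPerSecond;
  if (msPerTokenA === msPerTokenB) return null;
  const n = (b.timeToFirstTokenMs - a.timeToFirstTokenMs) / (msPerTokenA - msPerTokenB);
  return n > 0 ? n : null;
}

// Assumed profiles: the cloud pays a network delay but decodes faster;
// the local model starts sooner but decodes more slowly.
const cloud: LatencyProfile = { timeToFirstTokenMs: 600, tokensPerSecond: 80 };
const local: LatencyProfile = { timeToFirstTokenMs: 100, tokensPerSecond: 20 };

console.log(responseTimeMs(local, 10), responseTimeMs(cloud, 10)); // 600 725
console.log(responseTimeMs(local, 20), responseTimeMs(cloud, 20)); // 1100 850
console.log(crossoverTokens(cloud, local));                        // about 13.3
```

Under these assumed numbers the local model finishes very short replies first, and the cloud model pulls ahead once the reply passes roughly 13 tokens. Real crossover points depend on your hardware, network, and model.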

This tradeoff is becoming central to how AI is deployed — from phones running on-device assistants to AI workstations with unified memory designed to run very large models at your desk.
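Whether a model can run on a given device depends largely on whether its weights fit in memory. A rough estimate is parameters times bits per weight, divided by 8 to get bytes. The sketch below applies that arithmetic. The bit widths, the 70-billion-parameter model, the memory sizes, and the 20% headroom are all assumptions chosen for illustration, since the page does not state them. The estimate covers weights only and ignores the KV cache, activations, and runtime overhead.

```typescript
// Approximate bytes needed for model weights alone:
// parameters × bits per weight ÷ 8.
function weightBytes(parameters: number, bitsPerWeight: number): number {
  return (parameters * bitsPerWeight) / 8;
}

// Convert bytes to decimal gigabytes (1 GB = 1e9 bytes).
function toGB(bytes: number): number {
  return bytes / 1e9;
}

// Rough fit check: weights must fit in memory with a fraction kept free.
// The 20% default headroom is an assumption, not a rule.
function fitsInMemory(
  parameters: number,
  bitsPerWeight: number,
  memoryGB: number,
  headroom: number = 0.2,
): boolean {
  return toGB(weightBytes(parameters, bitsPerWeight)) <= memoryGB * (1 - headroom);
}

// A 0.5-billion-parameter model, assuming 4-bit and 16-bit weights.
console.log(toGB(weightBytes(0.5e9, 4)));  // 0.25
console.log(toGB(weightBytes(0.5e9, 16))); // 1

// A hypothetical 70-billion-parameter model at 4 bits needs about 35 GB.
console.log(fitsInMemory(70e9, 4, 64)); // true
console.log(fitsInMemory(70e9, 4, 32)); // false
```

This is why a half-billion-parameter model is a reasonable fit for a browser tab, while much larger models call for machines with far more memory.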

You’ve now seen the whole pipeline.

howaiworks.io is free and open source (GitHub), built by Matt Feroz.