Four Very Different Machines, One Question: How Fast Can They Run Local LLMs?
We’ve benchmarked a broad lineup of local LLM setups, ranging from a $200 mini-PC to a $4,000 RTX 4090 tower, to see how they actually feel in day-to-day use.
Hardware under test
GMKTek Mini PC – Intel Twin Lake N150, 16 GB RAM, no GPU (~$200)
MacBook Air (M4) – M4 CPU, 10-core GPU, 16 GB unified memory (~$1,400)
RTX 4090 Tower – Intel i7-5820K, RTX 4090 (24 GB VRAM), 128 GB RAM (~$4,000)
Mac Studio (M1 Ultra) – M1 Ultra, 48-core GPU, 128 GB unified memory (~$3,500)
Models were mostly from the Qwen 3 family (0.6B, 4B, 14B, 32B) with a few heavier hitters: gpt-oss 20B, gpt-oss 120B, and Llama 3.3 70B (Q8). For each run we measured prompt throughput and generation throughput (tokens per second) on two prompts:
A simple factual question
An instruction with multiple constraints
Generation speed is what you actually feel in a chat UI, so that’s where we’ll focus.
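If you want to reproduce this kind of measurement, the metric is simply tokens processed divided by wall-clock time, split into prompt evaluation and generation. Here’s a minimal sketch against an Ollama-style local server; the backend, the `qwen3:4b` tag, and the prompts are placeholders for illustration, not necessarily what produced the numbers in this post:

```python
import requests

# Minimal throughput probe against a local Ollama server (default port 11434).
# The non-streaming /api/generate response reports token counts and durations
# (durations are in nanoseconds).
def throughput(model: str, prompt: str) -> tuple[float, float]:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    prompt_tps = resp["prompt_eval_count"] / resp["prompt_eval_duration"] * 1e9
    gen_tps = resp["eval_count"] / resp["eval_duration"] * 1e9
    return prompt_tps, gen_tps

# Illustrative stand-ins for "a simple factual question" and
# "an instruction with multiple constraints".
for p in ["What is the capital of Australia?",
          "Summarise this paragraph in exactly three bullet points, "
          "each under ten words, without using the word 'the'."]:
    print(throughput("qwen3:4b", p))
```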
Small Models: Everybody Wins (Mostly)
For Qwen 3 0.6B, all four machines are perfectly usable:
GMKTek Mini PC – ~26 tok/s
MacBook Air M4 – ~110 tok/s
Mac Studio – ~99 tok/s
RTX 4090 – ~344 tok/s
At 20–30 tok/s, text streams in faster than most people can read it; at 100+ tok/s it feels instantaneous. The striking thing here is that the $200 mini-PC is genuinely fine for a 0.6B model. The 4090 is ~13× faster, but the mini-PC is already “fast enough” for basic chat, code hints, and small tools.
If your use case is light-weight assistants, small tools, and simple automation, the bottleneck is not hardware – even low-power x86 will do.
Stepping Up: 4B and 14B Separate the Pack
Once you move to Qwen 3 4B and 14B, the gap widens dramatically.
Qwen 3 4B (average generation speed):
GMKTek Mini PC – ~6.2 tok/s
MacBook Air (M4) – ~35.8 tok/s
Mac Studio – ~91.0 tok/s
RTX 4090 – ~180.6 tok/s
Qwen 3 14B:
GMKTek Mini PC – ~2.1 tok/s
MacBook Air (M4) – ~11.4 tok/s
Mac Studio – ~38.6 tok/s
RTX 4090 – ~82.2 tok/s
A few patterns jump out:
The mini-PC taps out here. At ~2 tok/s on 14B, you can technically run it, but it feels like a slow remote API: sentences appear one… token… at… a… time.
The MacBook Air is a sweet spot for 4B. ~36 tok/s is absolutely comfortable, and ~11 tok/s on 14B is usable if you’re patient.
The Mac Studio hits a nice middle ground: roughly half the 4090’s speed on 4B/14B, while being an all-in-one desktop with low noise and power draw.
The 4090 is a throughput monster. Comparing it head-to-head on the models the other machines also ran, it’s:
~3× faster than the M4 Air on 0.6B
~5× faster on 4B
~7× faster on 14B
and ~40× faster than the mini-PC on 14B.
If you want “serious” local LLMs in the 4–14B range as your daily driver, the takeaway is:
4B: runs acceptably everywhere but shines on the 4090 and Mac Studio.
14B: realistically needs at least a MacBook-class machine; it’s nice on Mac Studio, and great on the 4090.
Big Boys: 20B, 32B, 70B, 120B
The really interesting behaviour shows up with the larger models.
For Qwen 3 32B (tested on the “big” machines):
RTX 4090 – ~39.6 tok/s
Mac Studio – ~17.8 tok/s
The GPU wins comfortably, but both are usable: 32B feels like a heavyweight assistant, not a toy.
gpt-oss 20B (13 GB):
RTX 4090 – ~150.7 tok/s
Mac Studio – ~81.1 tok/s
Again, the 4090 is roughly twice as fast and clearly in its comfort zone. If your workload is “lots of 20–30B models all day long,” VRAM wins.
But then there’s the surprise:
gpt-oss 120B (65 GB):
RTX 4090 – ~7.3 tok/s
Mac Studio – ~11.4 tok/s
Llama 3.3 70B Q8 (74 GB):
RTX 4090 – ~0.8 tok/s
Mac Studio – ~7.5 tok/s
On these ultra-large models, the story flips. The 4090’s 24 GB of VRAM simply can’t hold the whole quantized model, so layers almost certainly spill into system RAM and you pay a steep penalty in offloading and memory shuffling. The M1 Ultra, with 128 GB of unified memory, keeps everything resident, and it shows: roughly 9× faster on the 70B model and notably ahead on 120B as well.
So:
Up to ~30B: dedicated GPU with lots of VRAM is king.
At ~70B–120B in big quantizations: massive unified memory can beat a single 4090, even though the GPU is “slower” on paper.
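The memory arithmetic behind that flip is worth spelling out. Here’s a rough sketch; the bits-per-weight figures are ballpark assumptions for typical quantizations, not measured file sizes:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in GB; KV cache and runtime overhead add more."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Ballpark figures only: parameter counts and bits/weight are assumptions.
for name, params, bits in [("Qwen 3 32B @ ~4.5 bit", 32, 4.5),
                           ("Llama 3.3 70B @ Q8 (~8.5 bit)", 70, 8.5),
                           ("gpt-oss 120B @ ~4.5 bit", 120, 4.5)]:
    gb = weights_gb(params, bits)
    print(f"{name}: ~{gb:.0f} GB -> fits in 24 GB VRAM: {gb <= 24}, "
          f"fits in 128 GB unified memory: {gb <= 128}")
```

That’s the whole story in one line of arithmetic: a ~74 GB model cannot live in 24 GB of VRAM, but it fits comfortably in 128 GB of unified memory.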
Prompt Speed vs Generation Speed
Across all runs, prompt evaluation (reading your prompt) was much faster than generation. On the 4090, for example, Qwen 3 4B saw prompt throughput in the 5–11k tok/s range, while generation speed sat around 180 tok/s. Even on the mini-PC, prompts went through faster than output.
In practice, this means:
Latency feels dominated by decoding, not “thinking through” your prompt.
Optimising for generation speed (hardware, quantization, model size) matters much more for interactivity than squeezing a few extra prompt tokens per second.
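To put numbers on that, here’s the arithmetic using the 4090’s Qwen 3 4B figures from above (the prompt and answer lengths are illustrative):

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prompt_tps: float, gen_tps: float) -> float:
    """Total wall-clock time = prompt evaluation + token-by-token decoding."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps

# A 1,000-token prompt costs ~0.2 s to read at 5k tok/s, while a 300-token
# answer still takes ~1.7 s to decode at 180 tok/s.
print(response_seconds(1_000, 300, prompt_tps=5_000, gen_tps=180))  # ~1.87
```

Even with a prompt more than three times longer than the answer, roughly 90% of the wait is decoding.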
Price vs Performance: Who’s the Best Value?
If we very roughly normalise by the approximate prices you’d pay today, we get tokens per second per dollar. A few highlights:
For Qwen 3 0.6B:
GMKTek Mini PC – ~0.131 tok/s per dollar
MacBook Air – ~0.079
RTX 4090 tower – ~0.086
Mac Studio – ~0.028
The tiny model is most cost-effective on the cheapest machine. If all you need is a small local assistant, throwing thousands of dollars at the problem is pure luxury.
For Qwen 3 4B:
GMKTek Mini PC – ~0.031 tok/s per dollar
MacBook Air – ~0.026
Mac Studio – ~0.026
RTX 4090 tower – ~0.045
Here, the 4090 tower actually gives the best price/performance for 4B: it costs roughly 20× what the mini-PC does, but delivers roughly 29× the throughput.
For Qwen 3 14B:
GMKTek Mini PC – ~0.010 tok/s per dollar
MacBook Air – ~0.008
Mac Studio – ~0.011
RTX 4090 – ~0.021
Once you’re in 14B territory, the big GPU rig is clearly the best value if you care about throughput, and the Mac Studio is a respectable middle ground. The mini-PC only “wins” on small models.
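For the curious, these value figures are nothing more than measured generation speed divided by approximate price; here’s the 14B column reproduced, using the rough prices from the hardware list:

```python
# tok/s per dollar = measured generation speed / approximate price
prices = {"GMKTek Mini PC": 200, "MacBook Air (M4)": 1400,
          "Mac Studio (M1 Ultra)": 3500, "RTX 4090 tower": 4000}
qwen3_14b_tps = {"GMKTek Mini PC": 2.1, "MacBook Air (M4)": 11.4,
                 "Mac Studio (M1 Ultra)": 38.6, "RTX 4090 tower": 82.2}

for machine, tps in qwen3_14b_tps.items():
    print(f"{machine}: {tps / prices[machine]:.3f} tok/s per dollar")
```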
What This Means for Your Local LLM Setup
Putting it all together:
Budget / portable setup (~$200):
A low-power mini-PC can run 0.6B and even 4B models well enough for simple chat, note-taking, and offline tools. It’s astonishingly capable for the price, but not where you want to live for 14B+.
Everyday laptop (~$1,400):
An M4 MacBook Air comfortably runs 4B and can handle 14B if you’re patient. Fantastic for travel, experimentation, and “good enough” local assistants.
Unified-memory desktop (~$3,500):
The M1 Ultra Mac Studio is a workhorse: solid speeds on 4–32B models, and uniquely good at huge models (70B–120B) thanks to 128 GB unified memory. If you want to play with truly enormous LLMs without building a GPU farm, this is where it gets interesting.
GPU tower (~$4,000):
The RTX 4090 tower is the throughput champion for anything up to ~30B. If your workflow is lots of 4–32B inference, batch jobs, or multi-model orchestration, the 4090’s speed is absolutely worth it.
The bigger lesson is that “local LLMs” are not one monolithic thing. A cheap mini-PC is already great for tiny models; a laptop is enough for mid-sized assistants; a unified-memory desktop lets you flirt with 70B+; and a big GPU box dominates the middle weights.
Which one you should care about depends less on benchmarks and more on a simple question:
Do you want “good enough and cheap,” “portable and capable,” or “as fast and as big as possible”?
These benchmarks make it clear that you really can have any of those—so long as you pick the right hardware for the models you actually plan to run.