LLM Hardware Calculator
Enter your RAM to see which large language models you can run locally. Covers Apple Silicon Macs, consumer GPUs, and workstations.
How LLM Memory Requirements Work
Running a large language model locally requires holding the model weights — and optionally a KV cache for context — in RAM. Unlike cloud API calls, local inference loads the entire model into memory before generating a single token.
The memory formula
A rough estimate for Q4-quantized models: ~0.5 GB per billion parameters, plus runtime overhead. A 7B model needs about 3.5–4.5 GB; a 70B model needs 35–45 GB. Q8 roughly doubles this, FP16 quadruples it (2 bytes per parameter), and FP32 doubles it again.
Apple Silicon unifies CPU and GPU memory, so your Mac's full RAM pool is available to the model — a significant advantage over discrete GPUs, which are limited to VRAM. A Mac Studio M4 Max with 48 GB can run models that would require a $40,000 server GPU in a data center.
Quantization explained
- Q4_K_M — 4-bit quantization with mixed precision. Best balance of quality and size. Most Ollama defaults use this.
- Q8_0 — 8-bit quantization. Near-identical quality to FP16 at half the memory.
- FP16 — Half-precision floating point. Highest quality, highest memory. Needed for fine-tuning.
- FP32 — Full precision. Only for training or very specific research use cases.
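As a sketch, the quantization levels above translate into bytes per parameter; the 20% overhead multiplier here is an assumption to cover runtime buffers and metadata, not an exact figure:

```python
# Rough model-weight memory estimator. Bytes-per-parameter values are
# approximations for common quantization formats; the 1.2 overhead
# factor is an assumed fudge for runtime buffers.
BYTES_PER_PARAM = {
    "Q4_K_M": 0.5,   # ~4 bits per weight
    "Q8_0": 1.0,     # 8 bits per weight
    "FP16": 2.0,     # half precision
    "FP32": 4.0,     # full precision
}

def weight_memory_gb(params_billions: float, quant: str = "Q4_K_M",
                     overhead: float = 1.2) -> float:
    """Estimated GB of memory needed to hold the model weights."""
    return params_billions * BYTES_PER_PARAM[quant] * overhead

for quant in BYTES_PER_PARAM:
    print(f"70B at {quant}: ~{weight_memory_gb(70, quant):.0f} GB")
```

With these assumptions, a 70B model at Q4_K_M comes out to ~42 GB, consistent with the Llama 3.1 70B figure quoted below.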
Context window and KV cache
The numbers above cover model weights only. Long context windows add KV cache overhead — a 128K context window can add several GB. For most local inference use cases (chat, code assistance), a 4K–8K context is sufficient and the overhead is small.
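For intuition, the KV cache scales as 2 tensors (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The sketch below uses a Llama-3.1-8B-style configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache) as an assumed example:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: keys + values for every layer at every position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1024**3

# Assumed Llama-3.1-8B-style config:
print(kv_cache_gb(32, 8, 128, 8_192))    # ~1 GB at 8K context
print(kv_cache_gb(32, 8, 128, 131_072))  # ~16 GB at 128K context
```

This is why an 8K chat session adds roughly a gigabyte while a full 128K context can add well over ten.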
Apple Silicon vs. discrete GPU
Apple Silicon Macs use unified memory — the GPU and CPU share the same RAM pool. This means a 96 GB Mac Studio M3 Ultra can load a 70B model at Q4 with room to spare. An NVIDIA RTX 4090 has 24 GB of VRAM — enough for 13B at Q8 or 7B at FP16 — but a discrete GPU cannot use system RAM as VRAM: layers that don't fit must be offloaded to the much slower CPU path.
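A minimal fit check along these lines — the 8 GB reserve for the OS and other apps is an assumption, tune it to your setup:

```python
# Does a model fit in a given memory pool (unified RAM or VRAM)?
# The reserve_gb headroom for the OS and other apps is an assumption.
def fits(model_gb: float, pool_gb: float, reserve_gb: float = 8.0) -> bool:
    return model_gb + reserve_gb <= pool_gb

print(fits(42, 96))   # 70B at Q4 (~42 GB) on a 96 GB Mac -> True
print(fits(42, 24))   # 70B at Q4 on a 24 GB RTX 4090 -> False
print(fits(14, 24))   # 13B at Q8 (~14 GB) on the 4090 -> True
```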
Frequently Asked Questions
Can I run Llama 3.1 70B on a Mac?
Yes — with enough RAM. Llama 3.1 70B at Q4_K_M requires about 42 GB. A Mac Studio M4 Max with 48 GB or any Mac with 64 GB+ can run it. The 8B version runs on any Mac with 8 GB or more.
What is the minimum RAM to run an LLM locally?
The smallest useful models (Phi-3 mini 3.8B, Gemma 2B) run in 2–3 GB. Any modern Mac with 8 GB can run 7B models at Q4. 16 GB opens up 13B models. 32 GB is the sweet spot for 30B-class models.
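The RAM ladder above can be sketched as a rough inverse calculation. The 0.6 GB-per-billion-parameters figure (0.5 GB of Q4 weights plus ~20% overhead) and the 4 GB OS reserve are both assumptions:

```python
# Largest Q4 model (in billions of parameters) a given RAM pool can hold.
# gb_per_b = 0.6 assumes 0.5 GB/B of weights plus ~20% overhead;
# reserve_gb = 4 is an assumed minimum for the OS.
def max_q4_params_billions(ram_gb: float, reserve_gb: float = 4.0,
                           gb_per_b: float = 0.6) -> float:
    return max(0.0, (ram_gb - reserve_gb) / gb_per_b)

for ram in (8, 16, 32, 64):
    print(f"{ram} GB RAM -> up to ~{max_q4_params_billions(ram):.0f}B at Q4")
```

Under these assumptions, 8 GB lands at roughly 7B, 16 GB at ~20B, and 32 GB at ~45B — in line with the tiers above.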
Does RAM speed matter for local LLM inference?
Yes, significantly. Apple Silicon's memory bandwidth (up to 819 GB/s on the M3 Ultra) directly determines token generation speed. Higher bandwidth = more tokens per second. This is why Macs punch above their weight for local inference compared to systems with similar RAM but lower memory bandwidth.
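Because decoding streams every active weight once per generated token, a back-of-the-envelope speed ceiling is bandwidth divided by model size. The 0.6 efficiency factor below is an assumption, not a measured value:

```python
# Rough decode-speed estimate: each token reads all weights once,
# so bandwidth / model size bounds tokens per second. The 0.6
# efficiency factor is an assumed discount for real-world overhead.
def est_tokens_per_sec(bandwidth_gb_s: float, model_gb: float,
                       efficiency: float = 0.6) -> float:
    return efficiency * bandwidth_gb_s / model_gb

# Assumed example: M4 Max (~546 GB/s) running a 7B Q4 model (~4 GB):
print(f"~{est_tokens_per_sec(546, 4):.0f} tok/s estimated")
```

That estimate lands near the top of the 40–80 tok/s range quoted below for 7B models, which is consistent with decoding being bandwidth-bound.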
Which tool should I use to run LLMs locally?
Ollama is the easiest option for Mac users — one command to download and run any supported model. For more control, llama.cpp (which Ollama wraps) supports all major model formats. LM Studio provides a GUI.
How fast will inference be on my Mac?
Speed depends on model size, quantization, and your Mac's memory bandwidth. As a rough guide on Apple Silicon: 7B Q4 models typically run at 40–80 tokens/second; 13B at 20–40 tok/s; 70B at 5–15 tok/s. Qwen3 32B at Q4_K_M runs at ~22 tok/s on an M4 Max 40-core GPU. For measured, hardware-specific benchmark data see SiliconBench — real tok/s measurements across Apple Silicon chips and popular models.
What is the KV cache and why does it matter?
The KV (key-value) cache stores intermediate attention computations for each token in the context window, so it grows linearly with context length. For long documents or multi-turn conversations, the KV cache can add several GB on top of the base model weights. Most local inference tools manage this automatically.