LLM Hardware Calculator

Enter your RAM to see which large language models you can run locally. Covers Apple Silicon Macs, consumer GPUs, and workstations.

Know it fits — now see how fast. SiliconBench has real tok/s measurements for popular models on Apple Silicon chips.
Know the cost. Inference Cost Calculator compares your electricity cost per million tokens against OpenAI, Anthropic, and Google API pricing.

How LLM Memory Requirements Work

Running a large language model locally requires holding the model weights — and optionally a KV cache for context — in RAM. Unlike cloud API calls, local inference loads the entire model into memory before generating a single token.

The memory formula

A rough estimate for Q4-quantized models: ~0.5 GB per billion parameters. A 7B model needs about 3.5–4.5 GB; a 70B model needs 35–45 GB. Q8 roughly doubles this, FP16 doubles it again (about 2 GB per billion parameters), and FP32 doubles it once more.
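The rule of thumb above can be sketched as a small helper. The ~20% overhead factor is an assumption to account for runtime buffers and tokenizer state, not an exact figure:

```python
def estimate_model_gb(params_billion: float, bytes_per_param: float = 0.5) -> float:
    """Rough RAM estimate for model weights at a given quantization.

    bytes_per_param defaults to 0.5 (the ~0.5 GB per billion
    parameters rule of thumb for Q4). The 1.2 overhead factor is an
    assumption covering runtime buffers, not part of the formula above.
    """
    overhead = 1.2
    return params_billion * bytes_per_param * overhead

print(estimate_model_gb(7))   # ~4.2 GB for a 7B model at Q4
print(estimate_model_gb(70))  # ~42 GB for a 70B model at Q4
```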

Apple Silicon unifies CPU and GPU memory, so your Mac's full RAM pool is available to the model — a significant advantage over discrete GPUs, which are limited to VRAM. A Mac Studio M4 Max with 48 GB can run models that would require a $40,000 server GPU in a data center.

Quantization explained

  • Q4_K_M — 4-bit quantization with mixed precision. Best balance of quality and size. Most Ollama defaults use this.
  • Q8_0 — 8-bit quantization. Near-identical quality to FP16 at half the memory.
  • FP16 — Half-precision floating point. Highest quality, highest memory. Needed for fine-tuning.
  • FP32 — Full precision. Only for training or very specific research use cases.
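The doubling ladder between these formats can be made concrete with nominal bytes-per-parameter values. These are idealized round numbers for illustration; real GGUF files vary slightly because metadata and mixed-precision layers add a little on top:

```python
# Nominal bytes per parameter for common formats (idealized, not exact
# on-disk sizes; Q4_K_M in practice averages slightly above 4 bits).
BYTES_PER_PARAM = {"Q4": 0.5, "Q8": 1.0, "FP16": 2.0, "FP32": 4.0}

def weights_gb(params_billion: float, fmt: str) -> float:
    """Weights-only size estimate for a model in the given format."""
    return params_billion * BYTES_PER_PARAM[fmt]

for fmt in BYTES_PER_PARAM:
    print(f"70B at {fmt}: {weights_gb(70, fmt):.0f} GB")
# Q4: 35 GB, Q8: 70 GB, FP16: 140 GB, FP32: 280 GB
```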

Context window and KV cache

The numbers above cover model weights only. Long context windows add KV cache overhead — a 128K context window can add several GB. For most local inference use cases (chat, code assistance), a 4K–8K context is sufficient and the overhead is small.
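The KV cache overhead can be estimated from the model's attention geometry. A minimal sketch, assuming a Llama-3-8B-like layout (32 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache); these architecture numbers are assumptions for illustration:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Size of the K and V tensors for a fully used context window.

    Factor of 2 covers the separate K and V caches; bytes_per_elem=2
    assumes an FP16 cache.
    """
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1024**3

# Assumed Llama-3-8B-like geometry: 32 layers, 8 KV heads, head_dim 128
print(kv_cache_gb(32, 8, 128, 8192))    # 1.0 GB at 8K context
print(kv_cache_gb(32, 8, 128, 131072))  # 16.0 GB at 128K context
```

This matches the pattern described above: a few-thousand-token context costs about a gigabyte, while a 128K window adds an order of magnitude more.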

Apple Silicon vs. discrete GPU

Apple Silicon Macs use unified memory: the GPU and CPU share the same RAM pool, so a 96 GB Mac Studio with an M3 Ultra can load a 70B model at Q4 with room to spare. An NVIDIA RTX 4090 has 24 GB of VRAM, enough for 13B at Q8 or 7B at FP16; weights that spill into system RAM must cross the PCIe bus, which slows inference dramatically.

Frequently Asked Questions

Can I run Llama 3.1 70B on a Mac?

Yes — with enough RAM. Llama 3.1 70B at Q4_K_M requires about 42 GB. A Mac Studio M4 Max with 48 GB or any Mac with 64 GB+ can run it. The 8B version runs on any Mac with 8 GB or more.

What is the minimum RAM to run an LLM locally?

The smallest useful models (Phi-3 mini 3.8B, Gemma 2B) run in 2–3 GB. Any modern Mac with 8 GB can run 7B models at Q4. 16 GB opens up 13B models. 32 GB is the sweet spot for 30B-class models.

Does RAM speed matter for local LLM inference?

Yes, significantly. Apple Silicon's memory bandwidth (up to 819 GB/s on the M3 Ultra) directly determines token generation speed: every generated token requires reading the full set of weights, so higher bandwidth means more tokens per second. This is why Macs punch above their weight for local inference compared to systems with similar RAM but lower memory bandwidth.
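Because generation is memory-bandwidth bound, a simple ceiling follows from dividing bandwidth by model size. A sketch assuming an M4 Max at its published 546 GB/s and a ~4 GB 7B Q4 model; real throughput lands below this bound due to compute and cache overheads:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical upper bound on generation speed: each token
    requires one full read of the model weights, so throughput is
    capped by memory bandwidth divided by model size."""
    return bandwidth_gb_s / model_gb

# Assumed: M4 Max at 546 GB/s, 7B Q4 model occupying ~4 GB
print(max_tokens_per_sec(546, 4.0))  # ~136.5 tok/s theoretical ceiling
```

The observed 40–80 tok/s range for 7B models is consistent with this: real inference achieves a fraction of the bandwidth-derived ceiling.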

Which tool should I use to run LLMs locally?

Ollama is the easiest option for Mac users — one command to download and run any supported model. For more control, llama.cpp (which Ollama wraps) supports all major model formats. LM Studio provides a GUI.

How fast will inference be on my Mac?

Speed depends on model size, quantization, and your Mac's memory bandwidth. As a rough guide on Apple Silicon: 7B Q4 models typically run at 40–80 tokens/second; 13B at 20–40 tok/s; 70B at 5–15 tok/s. Qwen3 32B at Q4_K_M runs at ~22 tok/s on an M4 Max 40-core GPU. For measured, hardware-specific benchmark data see SiliconBench — real tok/s measurements across Apple Silicon chips and popular models.

What is the KV cache and why does it matter?

The KV (key-value) cache stores intermediate attention computations for each token in the context window. It grows with context length. For long documents or multi-turn conversations, the KV cache can add several GB on top of the base model weight. Most local inference tools manage this automatically.