Local LLM Inference Cost Calculator
What does it actually cost to run a large language model locally, versus paying per token through an API? Enter your setup and find out.
API Provider Pricing — Output Tokens
Approximate prices per 1 million output tokens as of March 2026. Verify with each provider before budgeting. Input tokens are typically cheaper.
| Provider / Model | Output (per 1M tokens) | Tier |
|---|---|---|
| Anthropic Claude Opus 4 | $75.00 | frontier |
| Anthropic Claude Sonnet 4.6 | $15.00 | pro |
| OpenAI GPT-4o | $10.00 | pro |
| Google Gemini 1.5 Pro | $5.00 | pro |
| Anthropic Claude Haiku 4.5 | $4.00 | mini |
| OpenAI GPT-4o mini | $0.60 | mini |
| Mistral Small (API) | $0.60 | mini |
| Google Gemini Flash | $0.30 | mini |
| Google Gemini Flash Lite | $0.075 | nano |
Note: batch pricing and prompt caching can reduce API costs by 50–80% for high-volume, cacheable workloads. Local inference has no equivalent discount — the cost is fixed by electricity and hardware.
How Local Inference Cost Is Calculated
Running an LLM locally has two cost components: electricity (ongoing) and hardware (amortized). This calculator focuses on the electricity cost, which is the only cost that scales with usage at inference time.
The electricity formula
The cost per million output tokens from local inference is:
seconds_for_1M_tokens = 1,000,000 / tokens_per_second
energy_kWh = seconds_for_1M_tokens × hardware_watts / 3,600,000
cost_per_1M_tokens = energy_kWh × electricity_price_per_kWh
For example: Qwen3 32B on a Mac Studio M4 Max 64 GB at $0.12/kWh
- Speed: ~22 tok/s
- Time for 1M tokens: 45,455 seconds (about 12.6 hours)
- Energy: 45,455 × 150 W / 3,600,000 = 1.89 kWh
- Cost: 1.89 × $0.12 = $0.23 per million output tokens
That's about 65× cheaper than Claude Sonnet at $15/M, and 43× cheaper than GPT-4o at $10/M. You pay more for electricity in California or Europe — but even at $0.40/kWh, the local cost is $0.76/M, still 20× cheaper than most pro-tier APIs.
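The formula and worked example above can be sketched as a small function (the tok/s, wattage, and price figures are the assumptions used in the example, not measurements from your machine):

```python
def local_cost_per_million(tokens_per_second, hardware_watts, price_per_kwh):
    """Electricity cost to generate 1M output tokens on local hardware."""
    seconds = 1_000_000 / tokens_per_second            # time to emit 1M tokens
    energy_kwh = seconds * hardware_watts / 3_600_000  # watt-seconds -> kWh
    return energy_kwh * price_per_kwh

# Worked example: Qwen3 32B on a Mac Studio M4 Max, ~22 tok/s, ~150 W, $0.12/kWh
cost = local_cost_per_million(22, 150, 0.12)
print(f"${cost:.2f} per 1M output tokens")  # prints "$0.23 per 1M output tokens"
```

Swap in your own electricity rate and measured tok/s to reproduce the calculator's result.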
Hardware cost amortization
The hardware is a one-time capital cost. A Mac Studio M4 Max retails for $1,999–$2,399 new. Against a $15/M API, local inference at ~$0.23/M saves roughly $14.77 per million output tokens, so at 1M tokens/day (realistic for agentic workloads) a $2,000 machine pays for itself in about four and a half months. At 100,000 tokens/day, payback stretches past three years, which is why amortization matters most at low volumes.
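A payback sketch under assumed numbers (a $2,000 machine, a $15/M API, local electricity at ~$0.23/M from the example above):

```python
def payback_days(hardware_cost, tokens_per_day, api_price_per_m, local_price_per_m):
    """Days until per-token savings versus an API cover the hardware purchase."""
    savings_per_day = tokens_per_day / 1_000_000 * (api_price_per_m - local_price_per_m)
    return hardware_cost / savings_per_day

# $2,000 machine, 1M output tokens/day, $15/M API vs ~$0.23/M local electricity
print(round(payback_days(2_000, 1_000_000, 15.00, 0.23)))  # ≈ 135 days
```

At 100K tokens/day the same formula gives roughly 1,350 days, so volume dominates the breakeven.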
When local inference wins
Local inference is cost-optimal when:
- Your throughput is high enough to keep the hardware busy (above ~50K tokens/day)
- Your workload is privacy-sensitive (nothing leaves the machine)
- You need low latency for agent loops (no network round-trip, no rate limits)
- You're running a capable open-weight model (Qwen3 32B, Llama 3.3 70B, Gemma 3 27B)
When APIs win
- Low-volume, sporadic use (hardware sits idle most of the day)
- You need the very latest closed frontier models (GPT-4.5, Claude Opus 4)
- Your use case requires guaranteed SLAs or managed infrastructure
- Capital budget is constrained — paying per token has zero upfront cost
Apple Silicon advantages for local inference
Apple Silicon uses unified memory: the CPU, GPU, and Neural Engine share the same physical RAM pool. This means a Mac Studio M4 Max with 64 GB can hold a 32B parameter model in the GPU's address space with no VRAM/RAM split. By contrast, a discrete GPU like the RTX 4090 has a hard 24 GB VRAM ceiling: models larger than that require slow CPU offloading or aggressive quantization, and even then may not fit.
The Mac's memory bandwidth (410–546 GB/s on M4 Max, depending on configuration) is also critical for inference speed. LLM inference is memory-bandwidth-bound, not compute-bound: the bottleneck is how fast weights can be streamed from memory to the compute units. Apple Silicon's unified memory design keeps this fast even on consumer hardware.
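Because decoding streams the full weight set from memory for each generated token, bandwidth gives a back-of-envelope ceiling on tok/s. A sketch, assuming ~410 GB/s of bandwidth and ~4.5 bits per weight for a Q4-class quantization (both assumptions, not measurements):

```python
def max_tokens_per_second(bandwidth_gb_s, params_billion, bits_per_weight=4.5):
    """Rough bandwidth-bound ceiling: each generated token reads
    the entire quantized weight set from memory once."""
    model_gb = params_billion * bits_per_weight / 8  # weight bytes in GB
    return bandwidth_gb_s / model_gb

# 32B model at ~4.5 bits/weight on ~410 GB/s of memory bandwidth
print(max_tokens_per_second(410, 32))  # ceiling ≈ 22.8 tok/s
```

Real throughput lands at or below this ceiling; the ~22 tok/s figure used elsewhere on this page is consistent with it.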
See real measured tok/s numbers at SiliconBench. Check which models fit your RAM with the LLM Hardware Calculator.
Frequently Asked Questions
How much does it cost to run an LLM locally per month?
It depends on throughput and hardware. A Mac Studio M4 Max running Qwen3 32B uses about 1.89 kWh per million output tokens; at $0.12/kWh, that's roughly $0.23/M tokens. Running at 500K tokens/day for 30 days costs about $3.45 in electricity, a fraction of any API tier.
Is local inference cheaper than OpenAI or Anthropic APIs?
Yes — often by 20–100× on a per-token basis, once your hardware is paid off. The breakeven point depends on your electricity rate and hardware cost. For agentic workloads generating millions of tokens per day, local inference is dramatically cheaper. For low-volume sporadic use, API pricing is more flexible.
What is the best Mac for local LLM inference?
The Mac Studio M4 Max with 64 GB is the current sweet spot: enough unified memory to run Qwen3 32B and Gemma 3 27B comfortably, ~22 tok/s on 32B models, and relatively low power draw (~150 W). The Mac Studio M2 Ultra with 192 GB can run Llama 3.3 70B and larger, but at higher power and cost. The Mac Mini M4 Pro (24 GB) is the budget pick for 14B and smaller models. Check the LLM Hardware Calculator for a full compatibility guide.
How do I measure tokens per second on my machine?
Ollama reports tok/s in its verbose output. Run any prompt with ollama run qwen3:32b --verbose "hello" and look for the "eval rate" line in the stats it prints. Benchmarks across hardware configurations are published at SiliconBench.
Does the hardware cost factor into the calculation?
This calculator shows electricity cost only — the marginal cost per token once you own the hardware. If you want to include hardware amortization, estimate your machine cost, divide by expected useful life (e.g., $2,000 Mac Studio over 3 years = $667/year), and add that to your annual electricity cost. At high inference volumes, the hardware cost per token is negligible.
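The amortization described above can be folded into a per-million figure like this (hardware price, lifespan, and the ~$0.23/M electricity rate are assumptions carried over from the examples on this page):

```python
def all_in_cost_per_million(hardware_cost, life_years, tokens_per_day,
                            electricity_per_m=0.23):
    """Electricity plus straight-line hardware amortization, per 1M output tokens."""
    tokens_per_year_m = tokens_per_day * 365 / 1_000_000   # annual volume in millions
    hardware_per_m = hardware_cost / life_years / tokens_per_year_m
    return electricity_per_m + hardware_per_m

# $2,000 Mac Studio over 3 years at 500K output tokens/day
print(f"${all_in_cost_per_million(2_000, 3, 500_000):.2f} per 1M tokens")  # ≈ $3.88
```

Even with hardware included, the all-in cost at this volume stays well below pro-tier API pricing.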
What models can run on a Mac with 16 GB RAM?
A 16 GB Mac can comfortably run models up to about 13B parameters at Q4 quantization — Qwen3 8B, Llama 3.1 8B, Gemma 3 12B (tight), and Mistral 7B variants. Larger models will either not load or will page to disk, causing very slow inference. Use the LLM Hardware Calculator to check specific models against your RAM.
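A rough fit check for the rule of thumb above, assuming Q4-class weights (~4.5 bits/weight), ~20% runtime overhead for KV cache and buffers, and a few GB reserved for macOS (all assumptions, not measurements):

```python
def fits_in_ram(params_billion, ram_gb, bits_per_weight=4.5,
                overhead=1.2, os_reserve_gb=4):
    """Very rough check: quantized weights plus runtime overhead
    must fit in the RAM left over after the OS."""
    model_gb = params_billion * bits_per_weight / 8 * overhead
    return model_gb <= ram_gb - os_reserve_gb

print(fits_in_ram(8, 16))   # True: an 8B model fits comfortably
print(fits_in_ram(13, 16))  # True: 13B is tight but fits
print(fits_in_ram(32, 16))  # False: 32B needs more memory
```

This mirrors the guidance above: up to ~13B at Q4 on 16 GB, larger models need more RAM or a bigger machine.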