Local LLM Inference Cost Calculator

What does it actually cost to run a large language model locally, versus paying per token through an API? Enter your setup and find out.

API Provider Pricing — Output Tokens

Approximate prices per 1 million output tokens as of March 2026. Verify with each provider before budgeting. Input tokens are typically cheaper.

Provider / Model               Output (per 1M tokens)   Tier
Anthropic Claude Opus 4        $75.00                   frontier
Anthropic Claude Sonnet 4.6    $15.00                   pro
OpenAI GPT-4o                  $10.00                   pro
Google Gemini 1.5 Pro          $5.00                    pro
Anthropic Claude Haiku 4.5     $4.00                    mini
OpenAI GPT-4o mini             $0.60                    mini
Mistral Small (API)            $0.60                    mini
Google Gemini Flash            $0.30                    mini
Google Gemini Flash Lite       $0.075                   nano

Note: batch pricing and prompt caching can reduce API costs by 50–80% for high-volume, cacheable workloads. Local inference has no equivalent discount — the cost is fixed by electricity and hardware.

How Local Inference Cost Is Calculated

Running an LLM locally has two cost components: electricity (ongoing) and hardware (amortized). This calculator focuses on the electricity cost, which is the only cost that scales with usage at inference time.

The electricity formula

The cost per million output tokens from local inference is:

seconds_for_1M_tokens = 1,000,000 / tokens_per_second
energy_kWh = seconds_for_1M_tokens × hardware_watts / 3,600,000
cost_per_1M_tokens = energy_kWh × electricity_price_per_kWh

For example: Qwen3 32B on a Mac Studio M4 Max 64 GB at $0.12/kWh

  • Speed: ~22 tok/s
  • Time for 1M tokens: 45,455 seconds (about 12.6 hours)
  • Energy: 45,455 × 150 W / 3,600,000 = 1.89 kWh
  • Cost: 1.89 × $0.12 = $0.23 per million output tokens
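The formula and the worked example above can be sketched as a small Python function (the function name is ours, for illustration):

```python
def cost_per_million_tokens(tokens_per_second, hardware_watts, price_per_kwh):
    """Electricity cost to generate 1M output tokens on local hardware."""
    seconds = 1_000_000 / tokens_per_second          # wall-clock time for 1M tokens
    energy_kwh = seconds * hardware_watts / 3_600_000  # joules -> kWh
    return energy_kwh * price_per_kwh

# Worked example: Qwen3 32B on a Mac Studio M4 Max at $0.12/kWh
print(round(cost_per_million_tokens(22, 150, 0.12), 2))  # ≈ 0.23
```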

That's about 65× cheaper than Claude Sonnet at $15/M, and 43× cheaper than GPT-4o at $10/M. You pay more for electricity in California or Europe — but even at $0.40/kWh, the local cost is $0.76/M, still 20× cheaper than most pro-tier APIs.

Hardware cost amortization

The hardware is a one-time capital cost. A Mac Studio M4 Max retails for $1,999–$2,399 new. At 100,000 tokens/day compared with a $15/M API, the savings are about $1.50/day, so a $2,000 machine pays for itself in roughly 3.5–4 years. At 1M tokens/day (realistic for agentic workloads), savings approach $15/day and payback drops to about 4–5 months; against a frontier model at $75/M, it is under a month.
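The payback period can be estimated with a short sketch (function and parameter names are ours; the $0.23/M local cost comes from the electricity example above):

```python
def payback_days(hardware_cost, tokens_per_day, api_price_per_m, local_price_per_m=0.23):
    """Days until per-token API savings cover the hardware purchase price."""
    daily_savings = tokens_per_day / 1_000_000 * (api_price_per_m - local_price_per_m)
    return hardware_cost / daily_savings

# $2,000 machine, 1M tokens/day, compared against a $15/M API
print(round(payback_days(2000, 1_000_000, 15.0)))  # ≈ 135 days
```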

When local inference wins

Local inference is cost-optimal when:

  • Your throughput is high enough to keep the hardware busy (above ~50K tokens/day)
  • Your workload is privacy-sensitive (nothing leaves the machine)
  • You need low latency for agent loops (no network round-trip, no rate limits)
  • You're running a capable open-weight model (Qwen3 32B, Llama 3.3 70B, Gemma 3 27B)

When APIs win

  • Low-volume, sporadic use (hardware sits idle most of the day)
  • You need the very latest closed frontier models (GPT-4.5, Claude Opus 4)
  • Your use case requires guaranteed SLAs or managed infrastructure
  • Capital budget is constrained — paying per token has zero upfront cost

Apple Silicon advantages for local inference

Apple Silicon uses unified memory — the CPU, GPU, and Neural Engine share the same physical RAM pool. This means a Mac Studio M4 Max with 64 GB can hold a 32B parameter model in the GPU's address space with no VRAM/RAM split. By contrast, a discrete GPU like the RTX 4090 has a hard 24 GB VRAM ceiling — models larger than that require slow CPU offloading or aggressive quantization, which trades away quality.

The Mac's memory bandwidth (410–546 GB/s on M4 Max, depending on configuration) is also critical for inference speed. LLM inference is memory-bandwidth-bound, not compute-bound — the bottleneck is how fast weights can be streamed from memory to the compute units. Apple Silicon's unified memory design keeps this fast even on consumer hardware.
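Because decoding must read every weight once per generated token, bandwidth gives a rough ceiling on tok/s. A sketch, assuming ~0.55 bytes/param for a typical 4-bit quantization and a 410 GB/s bandwidth figure (both are assumptions, not measurements):

```python
def bandwidth_bound_toks(bandwidth_gb_s, params_billion, bytes_per_param=0.55):
    """Rough upper bound on decode tok/s: each token streams the full weight set."""
    model_gb = params_billion * bytes_per_param  # e.g. 4-bit quant ≈ 0.55 bytes/param
    return bandwidth_gb_s / model_gb

# Assumed 410 GB/s bandwidth, 32B model at 4-bit quantization
print(round(bandwidth_bound_toks(410, 32)))  # ≈ 23 tok/s
```

This ceiling of roughly 23 tok/s is consistent with the ~22 tok/s observed for Qwen3 32B in the cost example.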

See real measured tok/s numbers at SiliconBench. Check which models fit your RAM with the LLM Hardware Calculator.

Frequently Asked Questions

How much does it cost to run an LLM locally per month?

It depends on throughput and hardware. A Mac Studio M4 Max running Qwen3 32B all day (24/7) uses about 1.89 kWh per million output tokens. At $0.12/kWh, that's roughly $0.23/M tokens. Running at 500K tokens/day for 30 days costs about $3.45 in electricity — a fraction of any API tier.

Is local inference cheaper than OpenAI or Anthropic APIs?

Yes — often by 20–100× on a per-token basis, once your hardware is paid off. The breakeven point depends on your electricity rate and hardware cost. For agentic workloads generating millions of tokens per day, local inference is dramatically cheaper. For low-volume sporadic use, API pricing is more flexible.

What is the best Mac for local LLM inference?

The Mac Studio M4 Max with 64 GB is the current sweet spot: enough unified memory to run Qwen3 32B and Gemma 3 27B comfortably, ~22 tok/s on 32B models, and relatively low power draw (~150 W). The Mac Studio M2 Ultra with 192 GB can run Llama 3.3 70B and larger, but at higher power and cost. The Mac Mini M4 Pro (24 GB) is the budget pick for 14B and smaller models. Check the LLM Hardware Calculator for a full compatibility guide.

How do I measure tokens per second on my machine?

Ollama reports tok/s when run with the --verbose flag. Run any prompt with ollama run --verbose qwen3:32b "hello" and look for the "eval rate" in the output. Benchmarks across hardware configurations are published at SiliconBench.

Does the hardware cost factor into the calculation?

This calculator shows electricity cost only — the marginal cost per token once you own the hardware. If you want to include hardware amortization, estimate your machine cost, divide by expected useful life (e.g., $2,000 Mac Studio over 3 years = $667/year), and add that to your annual electricity cost. At high inference volumes, the hardware cost per token is negligible.
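The amortization described above can be sketched as follows (function name is ours, for illustration):

```python
def amortized_cost_per_m(hardware_cost, lifetime_years, tokens_per_day):
    """Hardware cost spread over its useful life, expressed per 1M output tokens."""
    annual_cost = hardware_cost / lifetime_years
    tokens_per_year = tokens_per_day * 365
    return annual_cost / (tokens_per_year / 1_000_000)

# $2,000 machine over 3 years at 1M tokens/day
print(round(amortized_cost_per_m(2000, 3, 1_000_000), 2))  # ≈ $1.83/M
```

At 1M tokens/day, hardware adds about $1.83/M on top of ~$0.23/M in electricity — still far below pro-tier API pricing.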

What models can run on a Mac with 16 GB RAM?

A 16 GB Mac can comfortably run models up to about 13B parameters at Q4 quantization — Qwen3 8B, Llama 3.1 8B, Gemma 3 12B (tight), and Mistral 7B variants. Larger models will either not load or will page to disk, causing very slow inference. Use the LLM Hardware Calculator to check specific models against your RAM.