Best GPU for Running Llama 4 Locally: Scout & Maverick Hardware Guide


Meta’s Llama 4 introduced two open-weight models: Scout (109B parameters, 16 experts) and Maverick (400B parameters, 128 experts). Both use Mixture-of-Experts (MoE) architecture, activating only 17B parameters per token, which makes local inference surprisingly feasible. This guide covers the exact hardware you need.

Understanding Llama 4 Architecture

Why MoE matters for your hardware: Unlike dense models where all parameters are used for every token, Llama 4’s MoE architecture only activates a fraction of the model per token. This means inference is much faster than the total parameter count suggests, but you still need enough VRAM to load the full model weights.
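To make the trade-off concrete, here is a back-of-the-envelope sketch in Python using the figures quoted in this guide: VRAM scales with the total parameter count, while per-token compute scales with the active parameter count.

```python
# Back-of-the-envelope: VRAM scales with TOTAL parameters,
# per-token compute scales with ACTIVE parameters.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB (excludes KV cache and overhead)."""
    return params_b * bits_per_weight / 8  # 1B params at 8 bits = 1 GB

for name, total_b, active_b in [("Scout", 109, 17), ("Maverick", 400, 17)]:
    print(
        f"{name}: ~{weights_gb(total_b, 16):.0f} GB of FP16 weights to load, "
        f"but only ~{active_b}B params ({active_b / total_b:.0%}) used per token"
    )
```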

| Spec | Scout | Maverick |
|---|---|---|
| Total Parameters | 109B | 400B |
| Active per Token | 17B | 17B |
| Experts | 16 | 128 |
| Context Window | 10M tokens | 1M tokens |
| FP16 Size | ~216 GB | ~800 GB |

VRAM Requirements by Quantization

Quantization reduces the precision of model weights so the model fits in less VRAM. Lower bit widths mean smaller files at the cost of some output quality: the loss is usually minor down to 4-bit, but becomes noticeable at 2-bit and below. Q4_K_M offers the best balance of quality and memory savings for most users.
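Weight size scales roughly linearly with bit width. The sketch below reproduces the "Model Size" column from nominal bits per weight; real GGUF files usually run slightly larger because some tensors are kept at higher precision, and none of this includes KV cache or runtime overhead.

```python
# Nominal quantized weight sizes for Scout (109B params).
# Real GGUF files are typically a bit larger than these estimates.

QUANT_BITS = {
    "FP16": 16,
    "INT8": 8,
    "INT4 / Q4_K_M": 4,
    "Q2_K": 2,
    "1.78-bit (Unsloth)": 1.78,
}

for fmt, bits in QUANT_BITS.items():
    size_gb = 109 * bits / 8  # params (B) * bits / 8 = decimal GB
    print(f"{fmt:>20}: ~{size_gb:.0f} GB of weights")
```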

Llama 4 Scout VRAM

| Precision | Model Size | VRAM Needed | GPU Setup |
|---|---|---|---|
| FP16 (full) | ~216 GB | ~232 GB | 4x H100 80GB |
| INT8 | ~109 GB | ~117 GB | 2x H100 80GB |
| INT4 / Q4_K_M | ~55 GB | ~63 GB | 2x RTX 5090 32GB |
| 2-bit (Q2_K) | ~27 GB | ~35 GB | 1x RTX 5090 32GB |
| 1.78-bit (Unsloth) | ~24 GB | ~24 GB | 1x RTX 4090 / 5090 |

Llama 4 Maverick VRAM

| Precision | Model Size | VRAM Needed | GPU Setup |
|---|---|---|---|
| FP16 (full) | ~800 GB | ~816 GB | 7x H200 141GB |
| INT8 | ~400 GB | ~416 GB | 5x H200 141GB |
| INT4 / Q4_K_M | ~200 GB | ~216 GB | 3x H100 80GB |
| 2-bit (Q2_K) | ~100 GB | ~116 GB | 4x RTX 5090 32GB |
| 1.78-bit (Unsloth) | ~89 GB | ~96 GB | 2x RTX 4090 48GB* |

Key insight: Scout is the practical local model. At Q4 quantization, it fits on 2x RTX 5090s with room for context. At aggressive 1.78-bit quantization (via Unsloth), it squeezes into a single 24GB GPU. Maverick requires enterprise-class hardware for most quantization levels.

GPU Recommendations

🏆 Best Single-GPU Option

Recommended: RTX 5090 (32 GB)

Runs Scout at 2-bit quantization with ~20 tokens/sec. The 32GB VRAM and 1.8TB/s memory bandwidth make it the best consumer GPU for local LLMs right now.

Scout Q2: Yes · Scout Q4: Needs 2nd GPU · Maverick: No
💰 Best Value Option

Recommended: RTX 4090 (24 GB, used market)

Runs Scout at 1.78-bit via Unsloth GGUF at ~15-20 tokens/sec. Available for $1,200-1,400 on the used market as users upgrade to the 50-series.

Scout 1.78-bit: Yes · Scout Q4: No · Maverick: No
🔬 Best Multi-GPU Setup

Recommended: 2x RTX 5090 (64 GB total)

Runs Scout at Q4_K_M with excellent quality and ~30-40 tokens/sec. Requires a motherboard with 2x PCIe 5.0 x16 slots and a 1600W+ PSU.

Scout Q4: Yes · Scout FP8: Tight · Maverick Q2: Needs 4 GPUs
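If you go the dual-5090 route, llama.cpp's Python binding can split a GGUF across both cards. A minimal sketch follows; the filename is a placeholder for whatever Q4_K_M file you actually download.

```python
# Splitting a Q4_K_M GGUF across two GPUs with llama-cpp-python.
# Assumes a CUDA-enabled build: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=-1,          # offload all layers to GPU
    tensor_split=[0.5, 0.5],  # relative share of layers on GPU 0 / GPU 1
    n_ctx=8192,               # context budget also consumes VRAM
)
print(llm("Hello from two GPUs:", max_tokens=32)["choices"][0]["text"])
```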

Performance Benchmarks

Scout Inference Speed by GPU

| GPU | Quantization | Tokens/sec | Usable? |
|---|---|---|---|
| RTX 5090 (32GB) | Q2_K / 1.78-bit | ~20-25 tok/s | Comfortable |
| 2x RTX 5090 (64GB) | Q4_K_M | ~30-40 tok/s | Excellent |
| RTX 4090 (24GB) | 1.78-bit (Unsloth) | ~15-20 tok/s | Workable |
| H100 (80GB) | INT8 | ~80-109 tok/s | Fast |

Note: For comfortable conversational use, aim for 15+ tokens/sec. Below 10 tok/s feels sluggish. These benchmarks use llama.cpp and vLLM. Actual performance varies by context length, system RAM, and CPU.
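To see where your own setup lands on this scale, a quick timing loop works; this sketch uses llama-cpp-python, and the model path is a placeholder.

```python
# Measure end-to-end generation throughput on your own hardware.
import time
from llama_cpp import Llama

llm = Llama(model_path="Llama-4-Scout-Q2_K.gguf", n_gpu_layers=-1)  # placeholder path

start = time.perf_counter()
out = llm("Explain mixture-of-experts in two sentences.", max_tokens=256)
elapsed = time.perf_counter() - start

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.1f} s -> {n / elapsed:.1f} tok/s")
```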

Inference Software

For Consumer GPUs

- llama.cpp: best for GGUF quantized models on consumer hardware
- Ollama: one-command setup, good for beginners
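For the Ollama route, here is a minimal chat call through its Python client; the model tag is an assumption, so check `ollama list` for what you have actually pulled.

```python
# Minimal chat via the Ollama Python client: pip install ollama
# Assumes the Ollama daemon is running and a Llama 4 Scout tag has been pulled.
import ollama

resp = ollama.chat(
    model="llama4:scout",  # assumed tag; substitute your pulled model
    messages=[{"role": "user", "content": "What GPU am I probably running on?"}],
)
print(resp["message"]["content"])
```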

For Multi-GPU / Enterprise

- vLLM: high-throughput serving, tensor parallelism
- TGI (Text Generation Inference): Hugging Face's inference server
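On the multi-GPU side, vLLM shards weights across cards via tensor parallelism. A sketch of its offline API follows; the model ID is Meta's gated Hugging Face release, and whether a given precision fits in two GPUs depends on the tables above.

```python
# Tensor-parallel inference sketch with vLLM: pip install vllm
# Shards the model across 2 GPUs; fit depends on precision (see tables above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # gated HF repo
    tensor_parallel_size=2,  # split weights across two GPUs
)
sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Best GPU for running you locally?"], sampling)
print(outputs[0].outputs[0].text)
```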

Ready to Build Your Llama 4 Rig?

Need a multi-GPU workstation for Scout Q4? Check our Tailored Builds page.