If you’ve trained a model in PyTorch, you’ve hit this error: RuntimeError: CUDA out of memory. It’s the most common GPU error in deep learning. This guide covers 10 proven fixes, ranked from simplest to most advanced, so you can get back to training.
Understanding the Error
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 211.00 MiB is free.
This means PyTorch tried to allocate GPU memory for a tensor but the GPU didn’t have enough free VRAM. The causes are usually: batch size too large, model too large, or memory leaks from unreleased tensors.
Before you start fixing: Run nvidia-smi in your terminal to see current GPU memory usage. If another process is using VRAM, kill it first. This is the #1 cause people overlook.
1. Reduce Batch Size
The most straightforward fix. GPU memory scales linearly with batch size, so halving your batch size roughly halves VRAM usage.
Example
# Before (OOM with 24GB GPU)
train_loader = DataLoader(dataset, batch_size=64)
# After (fits in memory)
train_loader = DataLoader(dataset, batch_size=16)
Tip: Start with batch_size=1 to confirm the model fits at all, then increase until you hit OOM. The sweet spot is usually the largest batch size that keeps VRAM usage around 80-90% of capacity.
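A minimal sketch of that probing loop, assuming a recent PyTorch release (which exposes torch.cuda.OutOfMemoryError) and a hypothetical train_one_batch(batch_size) helper standing in for your own forward/backward step:

import torch

def find_max_batch_size(train_one_batch, start=1, limit=1024):
    # Doubles the batch size until allocation fails, then backs off one step
    batch_size = start
    while batch_size <= limit:
        try:
            train_one_batch(batch_size)  # hypothetical: one forward/backward at this size
            torch.cuda.empty_cache()
            batch_size *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return max(start, batch_size // 2)
    return limit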
2. Enable Mixed Precision (AMP)
Automatic Mixed Precision (AMP) trains your model in FP16 where safe while keeping critical operations in FP32. This nearly halves memory usage with minimal accuracy loss. Available in PyTorch 1.6+.
Implementation
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")

for data, target in dataloader:
    optimizer.zero_grad()
    # Run the forward pass in mixed precision
    with autocast(device_type="cuda"):
        output = model(data)
        loss = criterion(output, target)
    # Scale the loss so FP16 gradients don't underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Pros
- 30-50% memory reduction
- Often faster training (Tensor Cores)
- Minimal code changes
Requirements
- NVIDIA GPU with Tensor Cores (Volta or RTX 20-series and newer)
- PyTorch 1.6 or later
- CUDA 10.0+
3. Gradient Checkpointing
Gradient checkpointing trades compute time for memory. Instead of storing all intermediate activations for the backward pass, it recomputes them on the fly. Training is ~20-30% slower but uses dramatically less VRAM.
Implementation
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Wrap memory-heavy layers
class MyModel(nn.Module):
    def forward(self, x):
        # self.heavy_block and self.head are defined in __init__;
        # activations inside heavy_block are recomputed during backward
        x = checkpoint(self.heavy_block, x, use_reentrant=False)
        return self.head(x)

# For Hugging Face Transformers
model.gradient_checkpointing_enable()
4. Gradient Accumulation
Simulate a large effective batch size while only loading a small batch into VRAM at a time. Gradients accumulate over several forward/backward passes before the optimizer updates the weights.
Implementation
accumulation_steps = 4  # Effective batch = 4 x batch_size

for i, (data, target) in enumerate(loader):
    with autocast(device_type="cuda"):
        output = model(data)
        loss = criterion(output, target) / accumulation_steps
    scaler.scale(loss).backward()  # Gradients accumulate across iterations
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
5. Clear the CUDA Cache
PyTorch’s memory allocator caches GPU memory blocks for reuse. In Jupyter notebooks or interactive sessions, old tensors can linger. Clear them explicitly.
import torch
import gc
# Delete unused variables
del model, optimizer, outputs
# Run garbage collection
gc.collect()
# Release cached memory back to CUDA
torch.cuda.empty_cache()
# Verify memory freed
print(torch.cuda.memory_summary())
6. Use In-place Operations
In-place operations modify tensors without creating copies, saving memory. Use them for activations where autograd compatibility allows.
# Instead of
x = F.relu(x)
# Use in-place version
x = F.relu(x, inplace=True)
# Or in nn.Sequential
nn.ReLU(inplace=True)
Warning: In-place operations can break autograd in some cases. Don’t use them on tensors that require gradients for loss computation. Safe for activations between layers.
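As a rough illustration (not from the original checklist): applying an in-place op directly to a leaf tensor that requires gradients raises an autograd error, while the same op on an intermediate activation is fine:

import torch
import torch.nn.functional as F

w = torch.randn(3, requires_grad=True)
# w.relu_()  # RuntimeError: a leaf Variable that requires grad
#            # is being used in an in-place operation

h = w * 2                      # intermediate activation
h = F.relu(h, inplace=True)    # safe: autograd can still backprop to w
h.sum().backward()
print(w.grad)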
7. Offload to CPU
Move tensors you don’t need on GPU back to CPU RAM. This is especially useful for storing intermediate results, logging, or when processing large datasets.
# Move loss to CPU before storing
train_losses.append(loss.detach().cpu().item())
# Don’t keep prediction tensors on GPU
predictions = model(batch).detach().cpu()
8. Model Quantization
Quantization reduces model weights from FP32 to INT8 or INT4, drastically cutting memory. Best suited for inference; for training, use AMP instead.
# BitsAndBytes 4-bit quantization
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E",
    quantization_config=bnb_config,
)
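To sanity-check the savings, Hugging Face models expose a footprint helper (assuming a reasonably recent transformers release):

# Approximate size of the quantized weights in GB
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")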
9. Fix Memory Fragmentation
Sometimes you have enough total free VRAM, but it’s fragmented into small blocks. PyTorch’s allocator can’t find a single contiguous block large enough. The CUDA allocator config can help.
# Set before running your script
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Or in Python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
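On newer PyTorch releases (roughly 2.0 and later), the allocator also supports an expandable_segments option that often handles fragmentation better than max_split_size_mb; treat this as an alternative to try, set before the first CUDA allocation:

import os
# Alternative allocator setting on recent PyTorch versions
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"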
10. Combine Multiple Techniques
The real power comes from stacking these techniques together. Here’s what to combine based on your VRAM:
| GPU VRAM | Recommended Stack | What You Can Train |
|---|---|---|
| 8 GB | AMP + Small batch + Gradient accumulation | ResNets, small transformers, LoRA fine-tuning |
| 12 GB | AMP + Gradient checkpointing + Accumulation | Vision Transformers, 7B LLM fine-tuning (QLoRA) |
| 16 GB | AMP + Checkpointing | Most research models, Stable Diffusion training |
| 24 GB | AMP + Standard batch sizes | Large models, 13B fine-tuning, FLUX image gen |
| 32 GB+ | AMP (optional at this level) | Most workloads without memory tricks |
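As a sketch of how the 8-12 GB rows fit together (AMP + small batch + gradient accumulation, with gradient checkpointing enabled inside the model as in section 3), assuming model, criterion, optimizer, and loader are defined as in the earlier snippets:

from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")
accumulation_steps = 4  # Technique 4: effective batch = 4 x batch_size

for i, (data, target) in enumerate(loader):
    data, target = data.cuda(), target.cuda()
    with autocast(device_type="cuda"):  # Technique 2: mixed precision
        output = model(data)            # Technique 3: model uses checkpointing internally
        loss = criterion(output, target) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()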
Diagnostic Cheat Sheet
Memory Debugging Commands
Check GPU memory usage
nvidia-smi
torch.cuda.memory_summary(device=0)
Check allocated vs. reserved memory
torch.cuda.memory_allocated() / 1e9  # GB used by live tensors
torch.cuda.memory_reserved() / 1e9   # GB held by the caching allocator
Monitor during training
watch -n 0.5 nvidia-smi # Live monitoring
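For logging from inside the training loop, a small helper along these lines (a sketch, not part of the original cheat sheet) can be called every few hundred steps:

import torch

def log_gpu_memory(tag=""):
    # Allocated = memory held by live tensors; reserved = memory cached by the allocator
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[{tag}] allocated {allocated:.2f} GB | reserved {reserved:.2f} GB")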
Need More VRAM?
If you’re constantly hitting memory limits, it might be time for a GPU upgrade.