Running 70B LLMs on a 4GB GPU: AirLLM's Layer-Swapping Architecture

Hook

What if you could run Meta’s Llama 3.1 405B—a model requiring hundreds of gigabytes of VRAM—on the same GPU that struggles to run AAA games at medium settings? AirLLM makes this possible by treating your disk as extended memory, swapping model layers in and out like a virtual memory system for neural networks.

Context

The democratization of large language models has hit a hard wall: memory. While open-source models like Llama 2 70B and Qwen 72B are freely available, actually running them requires enterprise-grade hardware with 80GB+ VRAM—hardware that costs tens of thousands of dollars. The standard workarounds involve quantization (reducing precision to 4-bit or 8-bit), model pruning, or distillation to smaller models, but these techniques require expertise, can degrade quality, and still often demand more VRAM than consumer GPUs provide.

This creates a two-tier ecosystem: researchers and developers at well-funded organizations experiment with frontier models, while independent developers and students are confined to 7B parameter models or cloud API bills. AirLLM emerged from this frustration, implementing a conceptually simple but technically sophisticated solution: what if we just don’t load the entire model into memory? By decomposing models into individual layers stored on disk and loading them on-demand during inference, AirLLM enables running models 10-20x larger than your available VRAM. It’s not magic—it’s a calculated trade-off of speed for accessibility, turning inference from memory-bound to I/O-bound.

Technical Insight

AirLLM’s architecture centers on layer-wise model sharding, fundamentally reimagining how we interact with transformer models. During initialization, it decomposes a HuggingFace model into individual layer files stored on disk. Instead of loading the full weight matrix into GPU memory, only the currently-executing layer resides in VRAM at any given time. As inference progresses through the model’s forward pass, AirLLM orchestrates a carefully choreographed dance: load layer N, execute computation, evict layer N, load layer N+1.
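The save side of this decomposition can be sketched in a few lines, using a stack of small linear layers as a stand-in for transformer blocks (illustrative only; AirLLM's actual splitting and file-naming logic is more involved):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for a transformer: a stack of small linear "layers".
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

# Decompose: persist each layer's weights to its own file on disk.
shard_dir = tempfile.mkdtemp(prefix="model_name_")
for i, layer in enumerate(layers):
    torch.save(layer.state_dict(), os.path.join(shard_dir, f"layer_{i}.pth"))

print(sorted(os.listdir(shard_dir)))
# ['layer_0.pth', 'layer_1.pth', 'layer_2.pth', 'layer_3.pth']
```

Once the shards exist on disk, inference never needs more VRAM than the largest single layer file.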

The implementation is surprisingly straightforward to use:

from airllm import AirLLMLlama2

# Initialize with model ID - decomposition happens automatically
model = AirLLMLlama2(
    "garage-bAInd/Platypus2-70B-instruct",
    compression='4bit'  # Optional quantization
)

input_text = "What is the meaning of life?"
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128
)

# Generation loads layers on-demand
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True
)

output = model.tokenizer.decode(generation_output[0])

Under the hood, the system maintains a layer cache and a prefetching queue. While layer N is computing, AirLLM asynchronously begins loading layer N+1, overlapping disk I/O with computation. This prefetching mechanism provides roughly 10% throughput improvement—modest, but meaningful when generating each token involves dozens of layer swaps.
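The overlap pattern itself is simple to sketch with a single-worker thread pool; `load_layer` and `run_layer` below are simulated stand-ins, not AirLLM functions:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_layer(i):
    """Simulated disk read (the slow, I/O-bound part)."""
    time.sleep(0.01)
    return f"weights_{i}"

def run_layer(weights, x):
    """Simulated compute on already-loaded weights."""
    return x + 1

NUM_LAYERS = 8
x = 0
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(load_layer, 0)  # start loading layer 0 up front
    for i in range(NUM_LAYERS):
        weights = pending.result()        # block until layer i's weights arrive
        if i + 1 < NUM_LAYERS:
            # Prefetch: kick off the next load before computing this layer,
            # so disk I/O overlaps with computation.
            pending = pool.submit(load_layer, i + 1)
        x = run_layer(weights, x)

print(x)  # 8
```

The win is bounded by whichever phase dominates; since disk reads dwarf compute here, prefetching hides the compute time, not the I/O time, which is why the real-world gain is only about 10%.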

The optional block-wise quantization is where AirLLM gets clever. Traditional quantization approaches like GPTQ or AWQ quantize the entire model upfront, requiring significant VRAM for the quantization process itself. AirLLM instead quantizes individual weight blocks on-the-fly during loading, reducing disk bandwidth requirements without needing large amounts of memory. The quantization is deliberately conservative—applied only to weights, not activations—which preserves accuracy while providing a 3x speedup by reducing the data transferred from disk.
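The core idea can be illustrated with a self-contained sketch of per-block symmetric int8 quantization; the `quantize_blockwise` helper is hypothetical, and AirLLM's actual block-wise scheme and bit width differ:

```python
import numpy as np

def quantize_blockwise(w, block_size=64):
    """Symmetric int8 quantization with one fp32 scale per block of weights."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                      # guard all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)  # values land in [-127, 127]
    return q, scales

def dequantize_blockwise(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)

q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales, w.shape)

# Round-trip error is bounded by half a quantization step per block.
print(np.abs(w - w_hat).max() < 0.05)  # True
```

Because each block carries its own scale, an outlier weight only degrades precision within its own block, and the int8 payload is a quarter the size of fp32, which is where the disk-bandwidth savings come from.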

The layer decomposition process creates a directory structure like this:

model_name/
├── layer_0.pth
├── layer_1.pth
├── layer_2.pth
...
├── layer_79.pth
└── config.json

Each .pth file contains the serialized weights for a single transformer layer—typically 1-2GB for 70B models. During inference, PyTorch’s torch.load() deserializes these files into GPU memory. The critical insight is that transformer architectures are inherently sequential: you must process layer 0 before layer 1, layer 1 before layer 2, and so on. AirLLM exploits this sequential dependency, ensuring that only the currently-executing layer needs to be resident in memory.
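That per-layer loop can be sketched as follows, again with small linear layers standing in for transformer blocks (illustrative; AirLLM's real loop also manages attention state and device placement):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Shard a toy 4-layer stack to disk first.
shard_dir = tempfile.mkdtemp()
for i in range(4):
    torch.save(nn.Linear(8, 8).state_dict(),
               os.path.join(shard_dir, f"layer_{i}.pth"))

# Sequential forward pass: only one layer is ever resident in memory.
x = torch.randn(1, 8)
for i in range(4):
    layer = nn.Linear(8, 8)  # allocate the layer skeleton
    layer.load_state_dict(torch.load(os.path.join(shard_dir, f"layer_{i}.pth")))
    with torch.no_grad():
        x = layer(x)         # execute layer i
    del layer                # evict before loading layer i+1

print(x.shape)  # torch.Size([1, 8])
```

Peak memory is one layer's weights plus the activations flowing between layers, regardless of how many layers the model has.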

The system also implements intelligent memory management for the key-value cache used in autoregressive generation. Rather than keeping all cached keys and values in GPU memory (which would quickly exhaust VRAM), AirLLM maintains a hybrid cache: recent layers stay in GPU memory while older cache entries are evicted to CPU RAM or disk. This multi-tiered caching strategy mirrors virtual memory systems in operating systems, with similar trade-offs between access latency and capacity.
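The tiering policy can be sketched as an LRU-style two-tier cache, with plain Python objects in place of GPU and CPU tensors (a hypothetical illustration, not AirLLM's implementation; a real version would call `.to('cpu')` / `.to('cuda')` at the marked points):

```python
from collections import OrderedDict

class TieredKVCache:
    """Keep the most recently used layers' KV entries in the hot tier;
    evict older entries to the cold tier when the hot tier is full."""

    def __init__(self, gpu_capacity=2):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # hot tier (stands in for VRAM)
        self.cpu = {}             # cold tier (stands in for CPU RAM / disk)

    def put(self, layer_idx, kv):
        self.gpu[layer_idx] = kv
        self.gpu.move_to_end(layer_idx)
        while len(self.gpu) > self.gpu_capacity:
            old_idx, old_kv = self.gpu.popitem(last=False)  # least recent
            self.cpu[old_idx] = old_kv  # would be kv.to('cpu') for real tensors

    def get(self, layer_idx):
        if layer_idx in self.gpu:
            self.gpu.move_to_end(layer_idx)
            return self.gpu[layer_idx]
        kv = self.cpu.pop(layer_idx)  # would be kv.to('cuda') for real tensors
        self.put(layer_idx, kv)       # promote back into the hot tier
        return kv

cache = TieredKVCache(gpu_capacity=2)
for i in range(4):
    cache.put(i, f"kv_{i}")
print(sorted(cache.gpu), sorted(cache.cpu))  # [2, 3] [0, 1]
```

Access to an evicted layer pays a transfer cost on promotion, the same latency-for-capacity trade virtual memory makes.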

For Chinese language models—a focus area given the repository’s topics—AirLLM supports architectures like ChatGLM, Baichuan, and Qwen without modification. The layer-sharding approach is architecture-agnostic; as long as the model follows the standard transformer pattern with sequential layers, AirLLM can decompose and run it. This universality is both a strength and a limitation: it works broadly but can’t leverage architecture-specific optimizations that more specialized inference engines exploit.

Gotcha

The elephant in the room is speed—or rather, the lack of it. Inference with AirLLM is glacially slow compared to traditional in-memory inference. Where a quantized Llama 70B on an A100 might generate 20-30 tokens per second, AirLLM on a 4GB GPU manages perhaps 0.1-0.5 tokens per second. You’re effectively swapping gigabytes of model weights from disk for every single token generated. This makes AirLLM completely impractical for interactive use cases, real-time applications, or any production scenario where latency matters.

The disk space requirements are also non-trivial. Decomposing a 70B model creates a layer-sharded copy that occupies roughly the same disk space as the original model—you’re essentially doubling your storage needs. For 405B models, this means reserving nearly a terabyte of fast SSD storage (mechanical drives would make the already-slow inference utterly unusable). Additionally, the constant disk I/O generates significant wear on SSDs, potentially shortening drive lifespan if you’re running extensive experiments. And while the documentation claims 10% speedup from prefetching, in practice the improvement is only noticeable with very fast NVMe drives; SATA SSDs see minimal benefit.

Verdict

Use if: You’re a researcher, student, or hobbyist who wants to experiment with 70B+ parameter models but only has consumer-grade hardware (RTX 3060, 4060, or similar with 4-8GB VRAM). AirLLM is perfect for exploratory work, testing prompts, evaluating model capabilities, or learning how large models behave—scenarios where waiting 30 seconds per response is acceptable. It’s also valuable for one-off analysis tasks like batch processing datasets overnight where throughput doesn’t matter.

Skip if: You need anything approaching real-time performance, are building production services, or already have access to adequate VRAM (24GB+). In those cases, traditional inference with quantization via llama.cpp or ExLlama will be 50-100x faster. Also skip if you’re doing extensive fine-tuning or LoRA training; while AirLLM supports these workflows, the I/O overhead makes training prohibitively slow.

This tool solves one problem brilliantly—making large models accessible on tiny GPUs—but that’s a narrow use case. Know what you’re trading before you commit to the layer-swapping lifestyle.