LLM Inference Optimization


LLM inference optimization encompasses techniques for making large language model inference faster, cheaper, or more memory-efficient. This is a fast-moving domain where new optimization opportunities appear with each model architecture change, and clear throughput metrics (tokens/second, time-to-first-token) make it a natural target for both manual and automated optimization.

Key Concepts

Compute-Bound vs. Memory-Bound

The fundamental constraint in LLM inference depends on the workload phase:

  • Prompt processing (prefill) is typically compute-bound — the model processes all input tokens in parallel, and the bottleneck is arithmetic throughput.
  • Text generation (decode) is typically memory-bandwidth bound — the model generates one token at a time, and the bottleneck is loading model weights from memory. A 606 MiB model at ~49 tokens/s can consume ~30 GB/s of bandwidth, close to DRAM limits on many instances.

This distinction is critical for choosing optimizations. SIMD micro-optimizations in the compute path give negligible returns during text generation because the CPU is stalled waiting for weights to arrive from memory. Optimizations that reduce memory traffic — like Operator Fusion — have more impact.
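The decode-phase bandwidth arithmetic above is easy to check back-of-the-envelope. A minimal sketch, assuming the usual approximation that every generated token streams the full weight set from DRAM:

```python
def decode_bandwidth_gbps(model_bytes: int, tokens_per_sec: float) -> float:
    """Memory bandwidth (GB/s) needed to sustain a given single-stream decode rate."""
    return model_bytes * tokens_per_sec / 1e9

# 606 MiB model at 49 tokens/s, matching the figures above
model_bytes = 606 * 1024 * 1024
print(f"{decode_bandwidth_gbps(model_bytes, 49.0):.1f} GB/s")  # ~31 GB/s
```

If that number is near the platform's measured DRAM bandwidth, the decode phase is memory-bound and compute micro-optimizations will not help.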

Quantization

Reducing the precision of model weights (e.g., from FP16 to 4-bit integers) shrinks model size and memory bandwidth requirements. Formats like Q4_0 and Q4_0_8x8 (a repacked format used by llama.cpp) trade small accuracy losses for large throughput gains. Row-interleaved quantization repacking (pioneered by ik_llama.cpp) has shown up to 2.9x prompt processing improvement.
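A minimal sketch of blockwise 4-bit quantization in the spirit of Q4_0 (block size 32, one floating-point scale per block). This is a simplified symmetric variant; the exact bit layout of the GGUF formats and the 8x8 row-interleaved repacking are omitted:

```python
BLOCK = 32  # Q4_0-style block size: 32 weights share one scale

def quantize_block(vals):
    """Map a block of floats to 4-bit ints in [-8, 7] plus one scale."""
    amax = max(abs(v) for v in vals) or 1.0
    scale = amax / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in vals]
    return scale, q

def dequantize_block(scale, q):
    return [scale * x for x in q]

vals = [0.1 * i - 1.5 for i in range(BLOCK)]
scale, q = quantize_block(vals)
recon = dequantize_block(scale, q)
err = max(abs(a - b) for a, b in zip(vals, recon))
print(f"max abs error: {err:.3f}")  # bounded by scale/2 for in-range values
```

Each block stores 32 nibbles plus one scale, roughly 4.5 bits per weight instead of 16, which is where the bandwidth savings come from.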

Operator Fusion

See Operator Fusion for full treatment. In inference contexts, fusing adjacent operations (RMS norm + multiply, softmax components, flash attention sub-passes) eliminates intermediate memory writes and can significantly reduce cache pollution. This technique is well-established in GPU backends but underexplored in CPU backends.
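A toy illustration of the RMS-norm + multiply fusion mentioned above. The unfused version materializes an intermediate buffer; the fused version computes each output element in one pass, which is what eliminates the extra memory traffic on real hardware:

```python
import math

def rmsnorm_then_mul_unfused(x, w, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms for v in x]           # intermediate buffer written out
    return [n * wi for n, wi in zip(normed, w)]

def rmsnorm_mul_fused(x, w, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    inv = 1.0 / rms
    return [v * inv * wi for v, wi in zip(x, w)]  # one pass, no intermediate

x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.5, 2.0, 2.0]
assert all(abs(a - b) < 1e-12
           for a, b in zip(rmsnorm_then_mul_unfused(x, w), rmsnorm_mul_fused(x, w)))
```

In a real backend the payoff is that the fused kernel reads `x` once and never writes `normed` to cache or DRAM at all.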

Long-Context and Sparse Attention

Extending context windows beyond 128K tokens introduces a distinct optimization challenge. Standard self-attention scales quadratically with context length, making naive scaling to millions of tokens prohibitively expensive. Memory Sparse Attention (MSA) addresses this with learned sparse attention that selects only the most relevant documents from a memory bank, achieving near-linear complexity while limiting performance degradation to under 9% as context grows from 16K to 100M tokens. See Long-Context Models for a full treatment.
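A generic sketch of the select-then-attend pattern: score each stored document against the query, keep only the top-k, and run softmax attention over those. MSA's actual learned selection mechanism is more involved; this only illustrates why cost scales with k rather than with total memory-bank size:

```python
import math

def sparse_attend(query, docs, k=2):
    """query: vector; docs: list of (key_vector, value). Attends over top-k only."""
    scored = [(sum(q * d for q, d in zip(query, key)), val) for key, val in docs]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]  # k of N docs
    exps = [math.exp(s) for s, _ in top]
    z = sum(exps)
    return sum(e / z * v for e, (_, v) in zip(exps, top))

docs = [([1.0, 0.0], 10.0), ([0.0, 1.0], 20.0), ([0.9, 0.1], 12.0)]
print(sparse_attend([1.0, 0.0], docs, k=2))  # blends the two best-matching docs
```

The softmax and the weighted sum touch only k entries, so adding more documents to the bank grows the scoring pass but not the attention itself.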

Scaling Laws and Compute-Optimal Training

Scaling Laws directly impact inference efficiency. The Chinchilla finding that models should be smaller and trained on more data means that compute-optimal models are inherently cheaper to serve — a compute-optimal 70B model is cheaper per generated token than a 280B model trained for the same compute budget, while also performing better.
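The accounting behind this can be sketched with the standard rules of thumb (these are approximations, not exact figures): training FLOPs C ≈ 6·N·D, the Chinchilla-style recipe of roughly 20 training tokens per parameter, and inference cost of roughly 2·N FLOPs per generated token:

```python
def optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    # Chinchilla-style rule of thumb: ~20 training tokens per parameter
    return n_params * tokens_per_param

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens      # C ≈ 6*N*D approximation

def inference_flops_per_token(n_params: float) -> float:
    return 2.0 * n_params                 # decode cost scales with N

small, big = 70e9, 280e9
ratio = inference_flops_per_token(big) / inference_flops_per_token(small)
print(f"serving cost ratio, 280B vs 70B: {ratio:.0f}x")  # 4x
```

Inference cost depends only on N, so spending the same training compute on a smaller, longer-trained model permanently lowers the per-token serving bill.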

Active Projects

The LLM inference optimization space includes many actively developed frameworks:

  • llama.cpp — CPU/GPU inference for GGUF models, with extensive quantization support
  • vLLM — GPU serving with PagedAttention scheduling and prefix caching
  • SGLang — GPU serving with RadixAttention and constrained decoding
  • TensorRT-LLM — NVIDIA’s optimized inference with kernel fusion and in-flight batching
  • llamafile — single-file LLM inference built on its tinyBLAS linear-algebra kernels
  • ExLlamaV2 — quantized GPU inference
