Operator Fusion

Operator fusion (also called kernel fusion) is a compiler and runtime optimization that merges multiple sequential operations into a single pass over the data. Instead of writing intermediate results to memory between each operation, the fused kernel computes the final result in one pass, reducing memory traffic and cache pollution.

Why It Matters

In LLM inference, many operations are memory-bandwidth bound — the CPU or GPU spends more time moving data than computing on it. Each separate operation reads its inputs from memory and writes its outputs back. When two operations are adjacent (the output of one feeds directly into the next), fusing them eliminates one full round-trip to memory.

For example, RMS normalization followed by element-wise multiplication:

  • Unfused: read x → compute norm → write y → read y → read weights → write y×weights (5 memory operations)
  • Fused: read x → read weights → compute y = x × scale × weights → write y (3 memory operations)

The arithmetic is identical; the savings come entirely from eliminating intermediate memory traffic.

Examples from Practice

Research on llama.cpp optimization documented several operator fusions applied to CPU inference:

  • Softmax fusion — copy + scale + add-mask merged from 3 passes to 1
  • RMS norm fusion — memcpy + scale merged from 2 passes to 1
  • Flash attention KQ fusion — scale + pad + add-mask + find-max merged into a single AVX2 FMA pass
  • Graph-level RMS_NORM + MUL fusion — a pattern already present in CUDA and Metal backends but missing from the CPU backend, implemented with explicit AVX2 and NEON intrinsics
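The flash attention KQ pattern above can be sketched in scalar C (the documented implementation uses AVX2 FMA; this function name and signature are assumptions for illustration):

```c
#include <math.h>
#include <stddef.h>

/* One pass that scales, adds the attention mask, and tracks the row
 * maximum — work the unfused code did in three or four separate passes. */
static float fused_scale_mask_max(const float *kq, const float *mask,
                                  float *row, size_t n, float scale) {
    float mx = -INFINITY;
    for (size_t i = 0; i < n; i++) {
        float v = kq[i] * scale + mask[i];  /* scale + add-mask */
        row[i] = v;
        if (v > mx) mx = v;                 /* find-max, same pass */
    }
    return mx;
}
```

Returning the maximum from the same loop matters because softmax needs it immediately afterward for numerical stability; fusing it in avoids yet another read of the row.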

A key finding: naive scalar fusion can be slower than unfused SIMD-optimized separate passes. The original code used highly optimized memcpy and SIMD vector operations; a scalar fused loop y[i] = x[i] * scale * w[i] may not be auto-vectorized as efficiently by the compiler. Effective fusion requires explicit SIMD intrinsics.
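An explicitly vectorized fused loop might look like the following sketch, assuming AVX2 (the function name is illustrative; a real kernel would also want a NEON path and could use FMA):

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Fused y[i] = x[i] * scale * w[i] with explicit SIMD where available.
 * The scalar loop handles the tail (and non-AVX2 builds). */
static void fused_scale_mul(const float *x, const float *w, float *y,
                            size_t n, float scale) {
    size_t i = 0;
#if defined(__AVX2__)
    __m256 vs = _mm256_set1_ps(scale);
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vw = _mm256_loadu_ps(w + i);
        _mm256_storeu_ps(y + i, _mm256_mul_ps(_mm256_mul_ps(vx, vs), vw));
    }
#endif
    for (; i < n; i++) y[i] = x[i] * scale * w[i];
}
```

The point of the explicit intrinsics is to guarantee the fused loop keeps the throughput the separate SIMD passes already had, rather than hoping the compiler auto-vectorizes the combined loop.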

Fusion at Different Levels

  • Kernel-level fusion — fusing operations within a single function (e.g., combining the three passes in softmax)
  • Graph-level fusion — detecting fusible patterns in the computation graph at runtime (e.g., RMS_NORM followed by MUL where the intermediate has no other consumers)
  • Block-level fusion — proposed by research like the Blockbuster paper, fusing entire model blocks (RMSNorm + gate matmul + up matmul + SwiGLU + down matmul) into a single cache-resident tiled pass

Graph-level fusion requires checking that intermediate outputs have no other consumers in the graph. llama.cpp provides ggml_can_fuse() infrastructure for this validation.
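The single-consumer check can be sketched as follows. This is a simplified stand-in, not ggml's actual graph representation or the real ggml_can_fuse() signature; the struct and function names are assumptions:

```c
#include <stdbool.h>
#include <stddef.h>

enum op_type { OP_RMS_NORM, OP_MUL, OP_OTHER };

struct node {
    enum op_type op;
    struct node *src[2];   /* input nodes; NULL if unused */
};

/* Count how many nodes in the graph consume `t` as an input. */
static int n_consumers(struct node *graph, size_t n_nodes,
                       const struct node *t) {
    int count = 0;
    for (size_t i = 0; i < n_nodes; i++)
        for (int j = 0; j < 2; j++)
            if (graph[i].src[j] == t) count++;
    return count;
}

/* RMS_NORM followed by MUL is fusible only if the normalized
 * intermediate has no consumer other than the MUL node itself. */
static bool can_fuse_rms_mul(struct node *graph, size_t n_nodes,
                             struct node *mul) {
    if (mul->op != OP_MUL) return false;
    struct node *norm = mul->src[0];
    if (!norm || norm->op != OP_RMS_NORM) return false;
    return n_consumers(graph, n_nodes, norm) == 1;
}
```

If a second node also reads the normalized intermediate, fusing would skip materializing a tensor that is still needed elsewhere, so the pattern must be rejected.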
