Glossary
Key terms used across the wiki. Link here from any page using [[Glossary#Term|Term]].
A
Agent Skills
Packaged bundles of prompts, tools, and documentation that coding agents invoke on demand to access domain-specific knowledge. Skills are an open standard designed for modular, reusable knowledge delivery. Eval results show agents frequently fail to invoke available skills without explicit prompting — a known limitation of current models. Contrast with passive context. See Agent Context Strategies.
Attention
See Self-Attention. The mechanism by which a Transformer allows each element in a sequence to examine and weight every other element, enabling the model to capture dependencies regardless of distance.
C
Chain-of-Thought
A prompting technique where LLMs are shown examples that include intermediate reasoning steps, causing them to produce their own step-by-step reasoning. Dramatically improves performance on arithmetic, logic, and multi-step inference tasks. See Chain-of-Thought Prompting.
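A minimal sketch of what such a prompt looks like, in the style of the classic grade-school-arithmetic examples (the exact wording here is illustrative, not from any specific benchmark):

```python
# A few-shot chain-of-thought prompt: the worked example includes the
# intermediate reasoning, nudging the model to reason step-by-step on
# the new question rather than guessing the answer directly.
prompt = """\
Q: A cafe has 23 apples. It uses 20 and buys 6 more. How many now?
A: It starts with 23 and uses 20, leaving 3. It buys 6, so 3 + 6 = 9.
The answer is 9.

Q: I have 5 books and buy 2 boxes of 4 books each. How many books?
A:"""
```

Without the worked reasoning in the example, models are far more likely to emit a wrong answer directly.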
Chinchilla Scaling Laws
Empirical finding from the Chinchilla paper that for compute-optimal training, model size and training tokens should be scaled roughly equally. Overturned prior practice of making models very large while training on relatively few tokens. See Scaling Laws.
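As a rough illustration (not the paper's fitted coefficients), the allocation can be sketched with the common approximations C ≈ 6·N·D training FLOPs and D ≈ 20 tokens per parameter:

```python
# Back-of-envelope Chinchilla-style sizing. Assumes the common
# approximations C ~= 6*N*D (training FLOPs) and D ~= 20*N (tokens per
# parameter); the paper fits its own constants from training runs.
import math

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Return (params N, tokens D) that roughly exhaust flops_budget."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120*N**2.
    n = math.sqrt(flops_budget / 120.0)
    d = 20.0 * n
    return n, d

# Chinchilla itself: ~70B params trained on ~1.4T tokens.
n, d = compute_optimal(5.9e23)  # roughly Chinchilla-scale compute
print(f"params ~{n / 1e9:.0f}B, tokens ~{d / 1e12:.1f}T")
# -> params ~70B, tokens ~1.4T
```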
Compute-Bound
A workload where performance is limited by the speed of arithmetic operations (CPU/GPU ALUs), not by memory bandwidth or I/O. In LLM inference, prompt processing (prefill) is typically compute-bound because all input tokens are processed in parallel. Contrast with memory-bandwidth bound.
D
Diffusion Model
A generative model that learns to create data by reversing a gradual noising process. Training adds noise to data over many steps; the model learns to denoise. Generation starts from pure noise and iteratively refines it into coherent output. Powers DALL-E, Stable Diffusion, Midjourney. See Diffusion Models.
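The forward (noising) half of the process can be sketched numerically, assuming the standard DDPM parameterization x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε:

```python
# Toy sketch of the diffusion forward process: a closed-form jump to
# any noise level t, using a linear beta schedule (one common choice).
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal fraction

def add_noise(x0: np.ndarray, t: int) -> np.ndarray:
    """Noise x0 straight to level t without iterating every step."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(8)
x_late = add_noise(x0, T - 1)  # nearly pure noise: alpha_bar[-1] ~ 0
```

The trained model runs this in reverse: starting from noise, it repeatedly predicts and removes ε.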
F
Few-Shot Learning
The ability of a model to perform a new task from just a few examples provided in the prompt, without fine-tuning or gradient updates. A key property of In-Context Learning that emerges at scale. See GPT-3, In-Context Learning.
Fine-Tuning
Updating a pretrained model’s weights on task-specific data. Full fine-tuning updates all parameters (expensive for large models). Parameter-Efficient Fine-Tuning methods like LoRA update only a small fraction.
G
GRPO
Group Relative Policy Optimization — the reinforcement learning algorithm used to train DeepSeek-R1. A simplified alternative to PPO that computes advantages relative to a group of sampled outputs rather than using a value model. Designed to reduce the resource consumption of RL training for LLMs.
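The group-relative advantage at the heart of the method is simple to sketch (the full objective also applies PPO-style clipping and a KL penalty, omitted here):

```python
# GRPO's core idea: score each sampled output against the mean/std of
# its own group of samples, with no learned value model. Clipping and
# the KL term from the full objective are omitted in this sketch.
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: shape (group_size,), one group sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = np.array([1.0, 0.0, 0.0, 1.0])  # e.g. rule-based pass/fail
adv = group_advantages(rewards)           # above-average outputs get > 0
```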
H
Hypernetwork
A neural network that generates the weights of another neural network. Text-to-LoRA is a hypernetwork that produces LoRA adapter weights from a text description of the target task.
I
In-Context Learning
The ability of large language models to perform new tasks from examples in the prompt without any weight updates. See In-Context Learning.
K
Kernel Fusion
See Operator Fusion.
L
LoRA
Low-Rank Adaptation — a Parameter-Efficient Fine-Tuning technique that freezes pretrained weights and adds small trainable low-rank matrices. Reduces the trainable parameter count by 1,000x+ while maintaining performance close to full fine-tuning. See Parameter-Efficient Fine-Tuning.
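The parameter-count arithmetic for a single weight matrix is easy to check (hidden size and rank below are hypothetical but typical):

```python
# LoRA's trainable-parameter arithmetic for one frozen d x d weight:
# the update is B @ A with B (d x r) and A (r x d), r << d.
d, r = 4096, 8           # hypothetical hidden size and LoRA rank
full = d * d             # trainable params under full fine-tuning
lora = d * r + r * d     # trainable params under LoRA
print(full // lora)      # -> 256
```

The 1,000x+ figures come from applying such adapters to only a subset of the model's matrices while freezing everything else.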
Large Language Model
A neural network based on the Transformer Architecture trained on large text corpora that generates text by predicting the next token in a sequence. Key capabilities include In-Context Learning (adapting to tasks from prompt examples), chain-of-thought reasoning (solving multi-step problems), and instruction following (via RLHF). Examples include GPT-3, Claude, Llama, and Gemini. LLMs are the foundation of modern AI Agents and are the primary workload targeted by LLM Inference Optimization.
M
MCP
Model Context Protocol — an open standard, introduced by Anthropic in 2024, for connecting AI agents to external tools and data sources. An MCP server exposes tools, resources, and prompts over a uniform interface, so any compliant agent can use them without bespoke integration code.
Memory-Bandwidth Bound
A workload where performance is limited by the rate at which data can be moved between memory (DRAM) and the processor, not by compute throughput. In LLM inference, text generation (decode) is typically memory-bandwidth bound because each token requires loading the full model weights from memory. A 606 MiB model at ~49 tokens/s can consume ~30 GB/s, close to the DRAM bandwidth limit of many server instances. Operator Fusion helps by reducing the number of memory round-trips.
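The bandwidth figure follows from simple arithmetic: every generated token streams the full weights from DRAM once, so bytes/s ≈ model bytes × tokens/s:

```python
# Back-of-envelope check of the decode bandwidth figure quoted above.
model_bytes = 606 * 1024**2   # 606 MiB of weights
tokens_per_s = 49
bandwidth = model_bytes * tokens_per_s / 1e9
print(f"{bandwidth:.1f} GB/s")  # -> 31.1 GB/s
```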
O
Operator Fusion
The technique of merging sequential operations into a single pass over data to reduce memory traffic. Also called kernel fusion. See Operator Fusion.
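A minimal sketch of the idea on a bias + ReLU example (pure Python for clarity; real fused kernels do this in registers before anything is written back to DRAM):

```python
# Operator fusion sketch: the unfused version makes two full passes
# over the data and materializes an intermediate list; the fused
# version computes the same result in one pass per element.

def unfused(x, b):
    tmp = [xi + b for xi in x]         # pass 1: write intermediate
    return [max(t, 0.0) for t in tmp]  # pass 2: read it back

def fused(x, b):
    return [max(xi + b, 0.0) for xi in x]  # one pass, no intermediate

x = [-2.0, 0.5, 3.0]
assert unfused(x, 1.0) == fused(x, 1.0) == [0.0, 1.5, 4.0]
```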
P
Parameter-Efficient Fine-Tuning
Techniques for adapting LLMs by modifying a small fraction of parameters (0.1–1%) rather than all weights. Includes LoRA, adapters, and prefix tuning. See Parameter-Efficient Fine-Tuning.
Passive Context
Domain knowledge delivered to agents through files loaded into the system prompt on every turn, without requiring the agent to decide to retrieve it. Examples include AGENTS.md (Cursor, v0), CLAUDE.md (Claude Code), and GEMINI.md (Gemini CLI). Eval research shows passive context outperforms on-demand skill retrieval for general framework knowledge, achieving 100% vs 53% baseline on tasks involving APIs outside model training data. See Agent Context Strategies.
Positional Encoding
Information added to token embeddings to provide sequence order, since self-attention is permutation-invariant. The original Transformer Architecture used fixed sinusoidal encodings; modern models typically use learned embeddings or Rotary Position Embeddings (RoPE). MSA uses document-wise RoPE to enable extrapolation to 100M tokens.
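The original fixed sinusoidal scheme can be sketched directly from its formula, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)):

```python
# Sinusoidal positional encoding from the original Transformer paper;
# the result is added element-wise to the token embeddings.
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000 ** (i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)
```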
Pre-Training-Led Reasoning
When an agent relies on knowledge embedded in its model weights (training data) rather than consulting available documentation. Pre-training-led reasoning produces incorrect output when APIs or conventions have changed since the model’s training cutoff. The corrective is retrieval-led reasoning — instructing agents to prefer documentation over training knowledge. See Agent Context Strategies.
Q
Quantization
Reducing the numerical precision of model weights (e.g., from 16-bit floating point to 4-bit integers) to shrink model size and reduce memory bandwidth requirements during inference. Formats like Q4_0 trade small accuracy losses for large throughput gains. See LLM Inference Optimization.
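A toy version of the idea, using symmetric 4-bit quantization with one shared scale per block (Q4_0-style in spirit only; the real format's block layout and encoding differ):

```python
# Toy symmetric int4 quantization of one weight block: store small
# integers plus one float scale, and reconstruct approximate weights.
import numpy as np

def quantize_block(w: np.ndarray):
    scale = np.abs(w).max() / 7.0  # map to the symmetric range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.31, -0.08, 0.52, -0.44], dtype=np.float32)
q, s = quantize_block(w)
w_hat = dequantize_block(q, s)  # within half a quantization step of w
```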
R
RAG
Retrieval-Augmented Generation — a technique that augments LLMs with external knowledge by retrieving relevant documents from a database and injecting them into the prompt context. Widely used to provide models with up-to-date or domain-specific information. MSA provides an end-to-end alternative that integrates retrieval into the attention mechanism itself. See Long-Context Models.
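The retrieve-then-prompt loop can be sketched in a few lines; the one-hot `doc_vecs` below are a stand-in for embeddings from a real embedding model:

```python
# Skeletal RAG: rank documents by cosine similarity to the query
# embedding, then splice the top hits into the prompt context.
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def rag_prompt(question: str, hits: list[str]) -> str:
    context = "\n".join(hits)
    return f"Answer using only this context:\n{context}\n\nQ: {question}"

docs = ["LoRA freezes weights", "RoPE encodes positions", "GRPO is RL"]
doc_vecs = np.eye(3)  # stand-in for real document embeddings
hits = retrieve(np.array([0.0, 1.0, 0.0]), doc_vecs, docs, k=1)
```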
Reward Hacking
When a model learns to produce outputs that score high on a reward model without actually being better by human judgment. A key challenge in RLHF; DeepSeek-R1 mitigated this by using rule-based rewards instead of neural reward models.
RLHF
See Reinforcement Learning from Human Feedback. The standard technique for aligning LLMs with human preferences via supervised fine-tuning, reward modeling, and RL optimization.
S
Scaling Laws
Empirical power-law relationships predicting how model performance changes with parameters, training data, and compute. See Scaling Laws.
Self-Attention
See Transformer Architecture. The mechanism that allows each position in a sequence to attend to all other positions, computing relevance weights via scaled dot-product of query and key vectors.
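A numerical sketch of scaled dot-product attention (single head, no masking, and no learned projections, which the full mechanism adds):

```python
# Minimal scaled dot-product self-attention: each position's output is
# a softmax-weighted mix of all value vectors.
import numpy as np

def self_attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # relevance of every position pair
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over positions
    return w @ v

x = np.random.default_rng(0).standard_normal((5, 16))  # 5 tokens, dim 16
out = self_attention(x, x, x)  # q = k = v for self-attention
```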
SIMD
Single Instruction, Multiple Data — a CPU feature where one instruction operates on multiple data elements simultaneously. For example, Intel’s AVX2 instructions process 8 floats at once, and ARM’s NEON processes 4. SIMD is widely used in LLM inference kernels, but provides little benefit when a workload is memory-bandwidth bound, since the bottleneck is data transfer rather than arithmetic throughput. Common SIMD instruction sets include SSE, AVX2, AVX-512 (x86), and NEON (ARM).
Sparse Attention
An attention mechanism that attends to a subset of positions rather than all positions, reducing complexity from quadratic to sub-quadratic or linear. Used in Long-Context Models to handle contexts of millions of tokens. MSA uses learned routing to select the top-k most relevant documents.
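One simple family of sparse attention, top-k selection, can be sketched as a masking step before the softmax (this is generic top-k masking, not MSA's document-routing scheme):

```python
# Top-k sparse attention sketch: keep only the k largest scores per
# query and mask the rest to -inf, so softmax assigns them zero weight.
import numpy as np

def topk_sparse_scores(scores: np.ndarray, k: int) -> np.ndarray:
    masked = np.full_like(scores, -np.inf)
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    np.put_along_axis(masked, idx, np.take_along_axis(scores, idx, -1), -1)
    return masked  # feed into softmax in place of the dense scores

s = np.array([[0.1, 2.0, -1.0, 0.7]])
out = topk_sparse_scores(s, k=2)  # only the two largest scores survive
```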
T
Transformer
See Transformer Architecture. The neural network architecture based on self-attention that underpins virtually all modern LLMs, vision models, and generative AI systems. Introduced in 2017.
Token Budget
The total number of tokens available to an agent across a task — including input context (tool schemas, prior turns, environmental state) and output (tool calls, responses). Token budget is a first-class design constraint for agent-tool interfaces: every unnecessary token consumed by tool schemas or verbose output reduces the agent’s capacity for reasoning and action. AXI treats token budget as the primary optimization target.
TOON
Token-Optimized Object Notation — a data format designed for agent consumption that omits JSON’s braces, quotes, and commas, yielding ~40% token savings while remaining unambiguous to LLMs. Introduced by AXI. Example: issues[2]{number,title,state}: 42,Fix login bug,open instead of [{"number":42,"title":"Fix login bug","state":"open"}].
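A hypothetical encoder reproducing the example above; TOON's full grammar (escaping, nesting) is not specified here, so this handles only flat lists of uniform objects:

```python
# Sketch of a TOON encoder for flat, uniform records: one header line
# with the field names, then comma-separated values.
def to_toon(name: str, rows: list[dict]) -> str:
    keys = list(rows[0])
    header = f"{name}[{len(rows)}]{{{','.join(keys)}}}:"
    body = ",".join(str(r[k]) for r in rows for k in keys)
    return f"{header} {body}"

rows = [{"number": 42, "title": "Fix login bug", "state": "open"}]
print(to_toon("issues", rows))
# -> issues[1]{number,title,state}: 42,Fix login bug,open
```

The savings come from stating the keys once in the header instead of repeating them (with quotes and braces) for every record.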
V
Vision Transformer
A Transformer applied directly to sequences of image patches for image classification. Demonstrated that a pure Transformer matches CNNs when pretrained at sufficient scale. See Vision Transformers.