Source: AGENTS.md Outperforms Skills in Agent Evals
Vercel evaluated two approaches for giving coding agents access to framework documentation: skills (on-demand retrieval) versus passive context files (AGENTS.md). Both were tested against Next.js 16 APIs absent from the models' training data.
Key Claims
- Passive context achieved 100% pass rate on a hardened eval suite targeting Next.js 16 APIs, vs 53% baseline and 79% for skills with explicit invocation instructions.
- Skills were never invoked in 56% of cases when no explicit instructions were provided. Default skill availability produced zero improvement over having no documentation at all.
- Explicit instructions to use skills raised pass rate to 79%, but results were fragile — subtle wording changes (“invoke first” vs “explore first”) produced large behavioral swings.
- An 8KB compressed docs index in AGENTS.md matched the performance of the full 40KB version. The index uses a pipe-delimited format pointing agents to local doc files rather than embedding full content.
- Three factors explain why passive context wins: no decision point (information is always present), consistent availability (loaded every turn vs asynchronous invocation), and no ordering issues (no sequencing decisions about when to read docs).
- Skills and passive context are complementary — passive context works for broad, horizontal knowledge; skills work better for vertical, action-specific workflows that users explicitly trigger (e.g., framework migrations).
- The core recommendation: shift agents from pre-training-led reasoning to retrieval-led reasoning, and use passive context as the delivery mechanism for general framework knowledge.
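The post describes the AGENTS.md index only as a pipe-delimited format pointing agents at local doc files rather than embedding full content; it does not reproduce the actual entries. A hypothetical index fragment illustrating the idea (paths, titles, and summaries here are invented for illustration) might look like:

```
docs/routing.md|App Router|File-based routing conventions and layouts
docs/caching.md|Caching|Data and component caching directives
docs/data-fetching.md|Data Fetching|Server-side fetch patterns and revalidation
```

The point of the format is that each line is cheap in tokens but gives the agent enough signal to open the right local file when it needs full detail.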
Relevance and Implications
This finding directly affects how framework maintainers and development teams should structure agent-facing documentation. The result challenges the assumption that sophisticated retrieval (tool-based, on-demand) outperforms simple context injection, at least for current models. The fragility of skill invocation (agents simply not using available tools) is a known limitation, but the scale of the gap (zero improvement without explicit prompting) is striking.
For agent-tool interface design, this suggests a hierarchy: embed critical domain knowledge passively, reserve active retrieval for specific workflows. The compression results (80% reduction with no performance loss) indicate that token budget concerns around passive context are manageable with good information architecture.
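The compression result suggests the index can be generated mechanically from the docs tree. Below is a minimal sketch of such a generator; the function name, the assumption that each doc's first two non-empty lines hold its title and summary, and the pipe-delimited output shape are all illustrative, not Vercel's actual tooling.

```python
from pathlib import Path

def build_docs_index(docs_dir: Path) -> str:
    """Emit one pipe-delimited line per doc file: path|title|summary.

    Assumes each markdown file's first non-empty line is a heading
    and its second is a one-line summary (a simplifying assumption).
    """
    entries = []
    for path in sorted(docs_dir.rglob("*.md")):
        lines = [l.strip() for l in path.read_text().splitlines() if l.strip()]
        title = lines[0].lstrip("# ") if lines else path.stem
        summary = lines[1] if len(lines) > 1 else ""
        entries.append(f"{path.relative_to(docs_dir)}|{title}|{summary}")
    return "\n".join(entries)
```

An index built this way stays small because it carries pointers and one-line summaries, not the doc bodies, which is consistent with the 8KB-vs-40KB result reported above.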
The sensitivity to instruction wording (where "MUST invoke" vs "explore first, then invoke" changed outcomes) reinforces that how and when context is delivered matters as much as what it contains.