Source: AGENTS.md Outperforms Skills in Agent Evals
Vercel evaluated two approaches for giving coding agents access to framework documentation: skills (on-demand retrieval) versus passive context files (AGENTS.md). Both were tested against Next.js 16 APIs absent from the models' training data.
Key Claims
- Passive context achieved 100% pass rate on a hardened eval suite targeting Next.js 16 APIs, vs 53% baseline and 79% for skills with explicit invocation instructions.
- Skills were never invoked in 56% of cases when no explicit instructions were provided. Default skill availability produced zero improvement over having no documentation at all.
- Explicit instructions to use skills raised pass rate to 79%, but results were fragile — subtle wording changes (“invoke first” vs “explore first”) produced large behavioral swings.
- An 8KB compressed docs index in AGENTS.md matched the performance of the full 40KB version. The index uses a pipe-delimited format pointing agents to local doc files rather than embedding full content.
- Three factors explain why passive context wins: no decision point (information is always present), consistent availability (loaded every turn vs asynchronous invocation), and no ordering issues (no sequencing decisions about when to read docs).
- Skills and passive context are complementary — passive context works for broad, horizontal knowledge; skills work better for vertical, action-specific workflows that users explicitly trigger (e.g., framework migrations).
- The core recommendation: shift agents from pre-training-led reasoning to retrieval-led reasoning, and use passive context as the delivery mechanism for general framework knowledge.
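The post describes the AGENTS.md index only as a pipe-delimited format pointing agents at local doc files rather than embedding full content; it does not reproduce the actual entries. A hypothetical index fragment illustrating the idea (paths, titles, and summaries here are invented for illustration) might look like:

```
docs/routing.md|App Router|File-based routing conventions and layouts
docs/caching.md|Caching|Data and component caching directives
docs/data-fetching.md|Data Fetching|Server-side fetch patterns and revalidation
```

The point of the format is that each line is cheap in tokens but gives the agent enough signal to open the right local file when it needs full detail.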
Relevance and Implications
This finding directly affects how framework maintainers and development teams should structure agent-facing documentation. The result challenges the assumption that sophisticated retrieval (tool-based, on-demand) outperforms simple context injection, at least for current models. The fragility of skill invocation (agents simply not using available tools) is a known limitation, but the scale of the gap (zero improvement without explicit prompting) is striking.
For agent-tool interface design, this suggests a hierarchy: embed critical domain knowledge passively, reserve active retrieval for specific workflows. The compression results (80% reduction with no performance loss) indicate that token budget concerns around passive context are manageable with good information architecture.
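The compression result suggests the index can be generated mechanically from the docs tree. Below is a minimal sketch of such a generator; the function name, the assumption that each doc's first two non-empty lines hold its title and summary, and the pipe-delimited output shape are all illustrative, not Vercel's actual tooling.

```python
from pathlib import Path

def build_docs_index(docs_dir: Path) -> str:
    """Emit one pipe-delimited line per doc file: path|title|summary.

    Assumes each markdown file's first non-empty line is a heading
    and its second is a one-line summary (a simplifying assumption).
    """
    entries = []
    for path in sorted(docs_dir.rglob("*.md")):
        lines = [l.strip() for l in path.read_text().splitlines() if l.strip()]
        title = lines[0].lstrip("# ") if lines else path.stem
        summary = lines[1] if len(lines) > 1 else ""
        entries.append(f"{path.relative_to(docs_dir)}|{title}|{summary}")
    return "\n".join(entries)
```

An index built this way stays small because it carries pointers and one-line summaries, not the doc bodies, which is consistent with the 8KB-vs-40KB result reported above.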
The sensitivity to instruction wording (where "MUST invoke" vs "explore first, then invoke" changed outcomes) reinforces that how and when context is delivered matters as much as what it contains.