Source: Research-Driven Agents — What Happens When Your Agent Reads Before It Codes
Summary
This post from the SkyPilot team demonstrates that coding agents produce significantly better code optimizations when given a research phase — reading arXiv papers and studying competing projects and forks — before touching code. The authors added a literature-search step to the autoresearch loop, pointed it at llama.cpp's CPU inference path, and over 3 hours on 4 cloud VMs ($29 total cost) produced 5 optimizations that made flash attention text generation 15% faster on x86 and 5% faster on ARM.
Key Claims
- Code-only context produces shallow hypotheses. When pointed at llama.cpp without research context, the agent attempted SIMD micro-optimizations (prefetching, loop unrolling, hoisting) that were all within measurement noise. The agent’s own postmortem identified the root cause: text generation is memory-bandwidth bound, not compute-bound, so optimizing compute instructions doesn’t help.
- Research context changes what the agent tries. After reading papers on Operator Fusion and studying how CUDA/Metal backends handle the same operations, the agent pivoted from “make this loop faster” to “can I fuse these operations to eliminate a memory pass?” This led to qualitatively different hypotheses.
- Forks and competing projects were more useful than arxiv. Several actionable optimizations came from studying ik_llama.cpp (a performance-focused fork) and llamafile’s tinyBLAS. The CUDA backend directly informed one of the five final optimizations.
- 5 of 30+ experiments landed: 4 kernel fusions and an adaptive parallelization strategy. The biggest win fused three passes over flash attention’s QK tile into a single AVX2 FMA loop.
- The fusions reduced variance as much as they improved throughput. Baseline text generation showed ±19 t/s noise; optimized code showed ±0.59 t/s on x86 — the fusions eliminate intermediate cache-polluting writes.
- Agents struggle with benchmarking rigor. A JSON parsing bug in the benchmark script caused multiple experiments to run against wrong baselines. Shared-tenancy EC2 instances showed up to 30% variance from noisy neighbors.
- The agent caught its own correctness bug during self-review: its graph fusion code didn’t check whether intermediate outputs had other consumers. The fix used llama.cpp’s existing ggml_can_fuse() infrastructure.
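The consumer-check behind that last fix can be sketched in plain C. This is an illustrative graph representation, not llama.cpp's actual ggml structures; the real ggml_can_fuse() performs an analogous safety check before fusing adjacent graph nodes:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified compute-graph node: each node lists the nodes it reads from.
 * Hypothetical types for illustration, not the ggml API. */
#define MAX_SRCS 4

typedef struct Node {
    struct Node *src[MAX_SRCS]; /* inputs consumed by this node */
    int n_src;
} Node;

/* Count how many nodes in the graph read from `producer`. */
static int count_consumers(const Node *graph, int n_nodes, const Node *producer) {
    int count = 0;
    for (int i = 0; i < n_nodes; i++)
        for (int j = 0; j < graph[i].n_src; j++)
            if (graph[i].src[j] == producer) count++;
    return count;
}

/* Fusing a into b is only safe when b is a's sole consumer: otherwise
 * a's intermediate output must still be materialized in memory for the
 * other readers — exactly the bug the agent caught in self-review. */
static bool can_fuse(const Node *graph, int n_nodes, const Node *a, const Node *b) {
    bool b_reads_a = false;
    for (int j = 0; j < b->n_src; j++)
        if (b->src[j] == a) b_reads_a = true;
    return b_reads_a && count_consumers(graph, n_nodes, a) == 1;
}
```

With a sole consumer the fusion is accepted; adding a second reader of the intermediate node rejects it.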
The Five Optimizations
- Softmax fusion — merged copy + scale + add-mask from 3 passes into 1
- RMS norm fusion — merged memcpy + scale from 2 passes into 1
- Adaptive from_float parallelization — partition by row for prompt processing, by element for text generation
- Graph-level RMS_NORM + MUL fusion — ported a pattern from CUDA/Metal backends to CPU with explicit AVX2/NEON intrinsics
- Flash attention KQ fusion — fused scale + pad + add-mask + find-max into a single AVX2 FMA pass
Relevance and Implications
This is early evidence for a shift in how AI agents are used for code optimization: from “generate and test” to “research, hypothesize, then test.” The key insight is that a codebase encodes what the code does, but not why it is slow or what alternatives exist outside it. Domain knowledge — the kind a senior engineer brings — can be partially substituted by systematic literature and competitor review.
The autoresearch framework generalizes to any project with a benchmark and test suite. The authors suggest ML inference frameworks as good candidates due to fast-moving codebases and clear throughput metrics.