Autonomous Code Optimization
Autonomous code optimization is a workflow where coding agents iteratively modify a codebase to improve a target metric (throughput, latency, memory usage), using automated benchmarks and test suites as guardrails. The agent proposes changes, runs experiments, keeps winners, discards losers, and repeats.
The Autoresearch Loop
The foundational pattern was established by Karpathy’s autoresearch project, which had an agent autonomously improve a neural network training script. pi-autoresearch generalized this into a reusable framework for any project with a benchmarkable metric. The loop is:
- Brainstorm — the agent generates optimization hypotheses
- Edit — apply a candidate change to the code
- Benchmark — run the project’s benchmark suite on cloud infrastructure
- Validate — run correctness checks (test suite) to catch regressions
- Evaluate — compare the metric against baseline; keep or discard
- Repeat — queue the next wave of experiments
SkyPilot provides the cloud execution layer, fanning experiments out across VMs so multiple hypotheses can be tested in parallel.
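A local sketch of that fan-out, using a thread pool as a stand-in for SkyPilot's per-VM task launches (the real system provisions cloud instances; `run_experiment` and its scoring rule are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(hypothesis):
    # Stand-in for launching one SkyPilot task on its own VM; here each
    # "experiment" just scores the hypothesis deterministically.
    return hypothesis, 100.0 - hypothesis * 0.5

def fan_out(hypotheses, max_workers=4):
    # Test several hypotheses in parallel, as SkyPilot does across VMs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_experiment, hypotheses))
    # Keep only winners that beat the baseline metric of 100.0.
    return [(h, m) for h, m in results if m < 100.0]
```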
Literature-Guided Optimization
The standard loop generates hypotheses from code context alone, which works when the optimization surface is visible in the source. Shopify’s Liquid template engine was made 53% faster this way: the agent could see the tokenizer bottleneck directly in the code.
But for problems where the codebase lacks sufficient context — where the answer lives in papers, competing projects, or domain knowledge — code-only hypotheses tend to be shallow. Research by the SkyPilot team demonstrated that adding a research phase before the experiment loop changes what the agent tries:
- Without research: the agent attempted SIMD micro-optimizations on llama.cpp's hot paths — all within measurement noise, because text generation is memory-bandwidth bound
- With research: the agent read papers on operator fusion and studied competing implementations, then pivoted to fusing memory passes — yielding a 15% improvement on x86
The research phase includes reading arXiv papers, studying forks and competing projects, and analyzing how other backends solve the same problems.
Practical Considerations
- Benchmark rigor matters. Agents can run many experiments against flawed baselines. Parsing bugs, noisy cloud VMs (up to 30% variance on shared-tenancy instances), and incorrect metric extraction are common failure modes.
- Most experiments fail. In the llama.cpp case, 5 of 30+ experiments landed. Many “optimizations” were already handled by the compiler (auto-vectorization, common subexpression elimination) or hardware prefetcher.
- Self-review catches bugs. The agent found a correctness bug in its own graph fusion code during self-review — a pattern that suggests code review should be part of the loop.
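One concrete defense against noisy baselines, sketched below: repeat both baseline and candidate runs, compare medians, and only keep an improvement whose margin exceeds the baseline's own run-to-run spread. The function and its threshold rule are an illustrative assumption, not part of the autoresearch framework.

```python
import statistics

def is_real_improvement(baseline_runs, candidate_runs, noise_factor=1.0):
    """Accept a candidate only if its median beats the baseline median
    by more than the baseline's run-to-run spread, so changes within
    measurement noise are discarded rather than kept."""
    base_med = statistics.median(baseline_runs)
    cand_med = statistics.median(candidate_runs)
    spread = statistics.stdev(baseline_runs)
    # Lower is better (e.g. latency); demand a margin beyond the noise.
    return cand_med < base_med - noise_factor * spread
```

On shared-tenancy cloud VMs with up to 30% variance, a single-run comparison would let many phantom "wins" through; median-of-N with a noise margin is a cheap filter.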
Evolutionary Program Optimization
ShinkaEvolve generalizes the autoresearch loop into a population-based evolutionary framework. Rather than iterating on a single codebase, ShinkaEvolve maintains an archive of program variants and uses LLMs as mutation operators to evolve them. Key innovations include adaptive parent sampling (balancing exploration and exploitation), novelty rejection-sampling (filtering out redundant mutations), and bandit-based LLM ensemble selection (dynamically allocating work to the most productive model in a multi-model pool).
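A toy sketch of the evolutionary mechanics described above, with a number standing in for a program variant and a random perturbation standing in for the LLM mutation operator. The function names and the novelty-key scheme are assumptions for illustration, not ShinkaEvolve's actual API; bandit-based LLM selection is omitted for brevity.

```python
import random

def evolve(mutate, fitness, novelty_key, generations=50, seed=0):
    """Minimal population-based loop in the spirit of ShinkaEvolve:
    an archive of variants, fitness-weighted parent sampling, and a
    novelty filter that rejects mutations already seen."""
    rng = random.Random(seed)
    archive = [0.0]                        # seed "program" (a number here)
    seen = {novelty_key(archive[0])}
    for _ in range(generations):
        # Adaptive parent sampling: fitter parents are chosen more often.
        weights = [fitness(p) for p in archive]
        parent = rng.choices(archive, weights=weights, k=1)[0]
        child = mutate(parent, rng)
        # Novelty rejection-sampling: skip redundant mutations.
        key = novelty_key(child)
        if key in seen:
            continue
        seen.add(key)
        archive.append(child)
    return max(archive, key=fitness)
```

Because the archive keeps every accepted variant, the returned best is never worse than the seed; the fitness-weighted sampling biases search toward promising regions without abandoning exploration.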
Results across four domains — mathematical optimization, agent scaffold design, competitive programming, and neural network training loss design — show that this approach achieves state-of-the-art results with orders of magnitude fewer evaluations than prior work like AlphaEvolve. The framework is open-source (Apache 2.0) and generalizes to any problem with a fitness function.
Relevance to Atopia Labs Verticals
This workflow is directly applicable to any web development or IT consulting engagement where performance matters. The autoresearch framework is open-source and generalizes to any project with a benchmark. ML inference frameworks are particularly good candidates, but the pattern applies to any performance-critical codebase. ShinkaEvolve extends this to broader optimization problems beyond single-codebase performance tuning.