We present a framework for training LLM agents via RL on any task improvement loop. An agent proposes modifications to a target file, experiments execute on remote machines, and metric outcomes drive policy updates through SDPO, GRPO, or other RL algorithms. Tasks are defined as YAML configs — specifying what file to edit, how to run experiments, and how to score results — making the system pluggable across domains: ML pretraining, GPU kernel optimization, prompt engineering, code speedups, and more.
On our flagship task — Karpathy's autoresearch benchmark — we train Qwen3-14B with SDPO to minimize validation loss on a GPT pretraining script. Over 495 training steps, SDPO converges to val_bpb ~1.025 and holds steady for 300+ steps, achieving a best of 1.023 (−3.1%). Our GRPO run — 38 steps with 128 rollouts/step (~94M tokens), after adding a KL penalty and batch normalization to stabilize training — plateaus at −1.8%. The SDPO-trained 14B model surpasses the Karpathy agent (−2.8%) while learning stable exploration patterns encoded in its weights.
Every task-specific detail lives in a single YAML config: workspace, execution, scoring, prompts, feedback templates, and fleet slots. The agent loops, SSH runners, and training pipeline are fully generic — they read everything from the config at runtime.
| Task | Metric | Direction | Target File | GPU | Domain |
|---|---|---|---|---|---|
| autoresearch | val_bpb | minimize | train.py | yes | ML pretraining |
| triton-kernel | kernel_latency_us | minimize | kernel.py | yes | GPU kernel optimization |
| baseball-pitch | rmse_mph | minimize | model.py | no | Tabular ML |
| voice-agent | eval_score | maximize | system_prompt.txt | no | Prompt engineering |
| liquid-speedup | combined_time_ms | minimize | template.py | no | Code optimization |
To add a new task: create a YAML config in `tasks/`, point it at a source directory
with the target file, and run `python test_task_config.py tasks/my_task.yaml` to verify.
All example tasks have been tested end-to-end on an A100.
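To make the config sections concrete, here is a minimal sketch of what a task definition might contain, mirrored as a Python dict for illustration (real configs are YAML files under `tasks/`; the key names and values here are assumptions, not the project's actual schema):

```python
# Hypothetical task config, mirroring the sections listed above: workspace,
# execution, scoring, prompts, feedback templates, and fleet slots.
# All key names are illustrative assumptions.
TASK_CONFIG = {
    "workspace": {
        "source_dir": "examples/autoresearch",
        "target_file": "train.py",      # the file the agent edits
    },
    "execution": {
        "remote": True,                  # run experiments over SSH
        "gpu": "H100",
        "timeout_s": 600,
    },
    "scoring": {
        "metric": "val_bpb",
        "direction": "minimize",         # or "maximize"
    },
    "prompts": {"system": "prompts/system.txt"},
    "feedback": {"template": "prompts/feedback.j2"},
    "fleet": {"slots": 8},
}

REQUIRED_SECTIONS = {"workspace", "execution", "scoring",
                     "prompts", "feedback", "fleet"}

def validate(config: dict) -> list[str]:
    """Return the sorted list of missing top-level sections (empty = valid)."""
    return sorted(REQUIRED_SECTIONS - config.keys())

missing = validate(TASK_CONFIG)
```

A generic runner can read everything it needs from a structure like this at startup, which is what keeps the agent loops and SSH runners task-agnostic.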
Improvement in val_bpb (validation bits per byte, lower is better) over the unmodified
train.py baseline. Larger bars = more improvement.
All improvements relative to unmodified baseline (1.056). Karpathy result from 126-experiment run (0.998 → 0.970, different absolute baseline). Our GRPO run uses 128 rollouts/step (group_size=8) with KL penalty and batch-normalized advantages; plateaus at −1.8%.
The model generates rollouts — editing train.py via bash tool calls — and
experiments execute on remote H100s. The reward signal comes from hindsight context supplied to a teacher model
and flows into the model's weights through policy-gradient updates. SDPO does not
need a successful demonstration to produce an advantage signal: it only needs the correct solution
to be more likely under the teacher model than under the student model, providing dense supervision
even when all rollouts fail.
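The advantage mechanism described above can be sketched in a few lines. This is a toy simplification (the names and numbers are illustrative, not the project's implementation): the per-token advantage is the teacher's log-probability minus the student's, so any token the feedback-conditioned teacher is more confident about yields positive signal, even on rollouts with zero final reward.

```python
import math

def sdpo_token_advantages(student_logps, teacher_logps):
    """Per-token advantage = teacher log-prob minus student log-prob.

    Toy sketch of the SDPO signal: the teacher is the same model conditioned
    on hindsight feedback, so tokens it assigns higher probability than the
    student produce a positive learning signal even when the rollout failed.
    """
    return [t - s for s, t in zip(student_logps, teacher_logps)]

# Toy numbers: after seeing the experiment outcome, the teacher is more
# confident on tokens 0 and 2 and less confident on token 1.
student = [math.log(0.20), math.log(0.50), math.log(0.10)]
teacher = [math.log(0.40), math.log(0.40), math.log(0.30)]
adv = sdpo_token_advantages(student, teacher)
```

Because the signal is a per-token log-ratio rather than a scalar episode reward, every rollout contributes gradient information, which is what makes the supervision dense.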
Over 495 steps, SDPO entropy increases (0.28→0.61) — no mode collapse —
while pg_loss steadily decreases (0.052→0.011), indicating continued learning.
SDPO is extremely sensitive to its training dynamics. We identified two failure modes that required active management. Entropy collapse kills exploration: the model locks into a narrow set of edits and stops trying new things. We addressed this with lower learning rate and entropy regularization. Teacher collapse kills the learning signal: when the EMA teacher becomes too similar to the student, pg_loss heads toward zero and the distributional mismatch that SDPO relies on vanishes. We fixed this by setting EMA τ = 0 (effectively freezing the teacher as the base model), after which we observed healthy entropy and pg_loss throughout training. The teacher-student gap must remain alive, or SDPO loses its reason to exist.
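The EMA-teacher fix above is easiest to see in code. A minimal sketch, assuming the common convention where τ is the student's mixing weight (whether the project uses exactly this convention is an assumption):

```python
def ema_update(teacher_w, student_w, tau):
    """EMA teacher update (sketch): teacher <- tau * student + (1 - tau) * teacher.

    With tau = 0 the teacher never moves, i.e. it stays frozen at the base
    model, which is the teacher-collapse fix described above. With tau > 0
    the teacher drifts toward the student and the teacher-student gap shrinks.
    """
    return [tau * s + (1.0 - tau) * t for t, s in zip(teacher_w, student_w)]

base = [1.0, -2.0, 0.5]
student = [1.3, -1.5, 0.9]

frozen = ema_update(base, student, tau=0.0)    # teacher stays at base
tracking = ema_update(base, student, tau=0.1)  # teacher drifts toward student
```

Freezing the teacher guarantees the distributional mismatch that SDPO distills from never vanishes, at the cost of the teacher never improving with the student.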
Environment and feedback design mattered as much as the algorithm itself. We added hash-based deduplication with a novelty reward: if a proposed diff was previously unseen, the feedback praised it; if it had been tried before, the model was told to try something new. We also built an experiment cache that cut iteration time from ~20 minutes to ~5 minutes per step — a 4× throughput improvement that, in a search-limited regime, mattered as much as any algorithmic change. Despite these measures, novelty rate (cache miss rate) fell to ~10% by step 100, suggesting the search process exhausts diversity early even when losses still look healthy.
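The deduplication mechanism is straightforward to sketch. A minimal version using a content hash over the whitespace-normalized diff (the feedback strings and normalization choice here are assumptions about the general approach, not the project's exact code):

```python
import hashlib

seen_diffs: set[str] = set()

def novelty_feedback(diff_text: str) -> str:
    """Hash-based deduplication with a novelty reward (sketch).

    Whitespace is normalized before hashing so trivially reformatted diffs
    collapse to the same key; repeats get a nudge toward new edits.
    """
    key = hashlib.sha256(" ".join(diff_text.split()).encode()).hexdigest()
    if key in seen_diffs:
        return "This edit was already tried -- try something new."
    seen_diffs.add(key)
    return "Novel edit -- good exploration."

first = novelty_feedback("- DEPTH = 8\n+ DEPTH = 12")
repeat = novelty_feedback("-  DEPTH = 8\n+  DEPTH = 12")  # same after normalization
```

The same hash key can drive an experiment cache: a repeated diff returns its cached metric instead of re-running the experiment, which is where the ~4× throughput gain comes from.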
Group Relative Policy Optimization uses reward differences within a group as the learning signal. Our initial GRPO runs showed no return signal at all, so we made three changes: dropped the learning rate from 5e-6 to 1e-6 to reduce destructive updates in a sparse-reward setting, added a KL penalty (coefficient 1e-3) to prevent drift and collapse, and bumped the group size to 8 (128 rollouts/step vs SDPO’s 16) to increase the chance of seeing useful reward differences per step.
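The group-relative signal with batch-normalized advantages can be sketched as follows. This is a simplified illustration (no PPO-style clipping; the helper names are ours), using the 1e-3 KL coefficient mentioned above:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages (sketch): center by the group mean and
    scale by the group std, as in the batch-normalized variant we used."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]

def grpo_loss_term(logp_ratio, advantage, kl, kl_coef=1e-3):
    """Per-token objective term (simplified, no clipping): policy-gradient
    term minus a KL penalty back to the reference policy."""
    return logp_ratio * advantage - kl_coef * kl

# One group of 8 rollouts where only one experiment improved val_bpb:
# the lone success gets a large positive advantage, the rest go negative.
group = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = grpo_advantages(group)
```

The sparse-reward failure mode is visible here: when every reward in a group is identical, all advantages are zero and the step carries no learning signal, which is why larger groups raise the chance of seeing useful reward differences.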
Over 38 steps (~94M tokens), our GRPO run reaches val_bpb = 1.037 (−1.8%) with entropy stable at ~0.30 but no continued improvement. A telling diagnostic: the KL term dominated pg_loss by 3–4 orders of magnitude, and gradient norms were very small. This suggests GRPO was operating more as “stay near base + better search” than as strong policy learning away from initialization.
Notably, once we added MCTS for structured proposal search, GRPO improved significantly and beat SDPO on best reward. This complicates the SDPO-vs-GRPO comparison: the improvement came primarily from better search scaffolding, not from richer GRPO learning dynamics. It suggests that on this task, proposal quality and search structure may matter more than the RL algorithm.
Same model, same tools, no weight updates. The model accumulates experiment feedback in its context window across 50 sequential turns: top results with full diffs, classified crash reports, and structured history. All learning happens through context conditioning. Qwen3-14B ICL achieves val_bpb = 1.038.
N independent single-shot calls with no history and no feedback. Just “here’s train.py,
improve it” repeated N times, taking the best result. This measures the spread of the
sampling distribution: what pure random exploration at temperature 1.0 can find.
Qwen3-14B single-turn achieves val_bpb = 1.032 — beating its own
ICL result of 1.038. The pattern is consistent: single-turn finds better peaks through variance,
while ICL adds consistency without improving outcomes.
The dominant insight across both SDPO and ICL: trade model size for training throughput.
Within the fixed 5-minute budget, a smaller model that completes more gradient steps outperforms a
larger model that trains fewer steps. The best SDPO result combines ASPECT_RATIO 40,
DEPTH 12, batch 64, and learning rate tuning — a compound
edit that increases depth while reducing width and batch size, achieving 1.024 val_bpb.
| # | Edit | val_bpb | Improvement |
|---|---|---|---|
| 1 | ASPECT_RATIO 40 + DEPTH 12 + batch 64 + LR tuning | 1.024 | −3.1% |
| 2 | ASPECT_RATIO 64 → 43 | 1.027 | −2.8% |
| 3 | ASPECT_RATIO 64 → 48 + LR tuning | 1.027 | −2.7% |
| 4 | ASPECT_RATIO 64 → 40 + DEPTH 8 → 12 | 1.030 | −2.4% |
| 5 | DEPTH 8 → 6 | 1.031 | −2.4% |
The best result is a compound edit: the model learned to combine architecture changes (more depth, smaller aspect ratio, halved batch size) with learning rate tuning — a multi-parameter optimization that only emerged after extended training.
The SDPO prompt contains zero experiment history — just the system prompt
and train.py. Yet after training, the model generates chain-of-thought that references
“previous attempts” and their outcomes. These aren’t memories from context;
they’re learned intuitions encoded in the weights through SDPO, surfacing as
recollections in chain-of-thought used to guide reasoning:
The previous attempt tried changing the ASPECT_RATIO from 40 to 80, which increased the model size, but that resulted in a higher val_bpb. So, that approach didn’t work. Then someone else tried changing the ASPECT_RATIO to 60, but that led to an increase in parameters and a crash due to VRAM issues. The latest attempt had a change that increased val_bpb, so that wasn’t helpful. The goal is to find changes that can lower val_bpb without causing out-of-memory errors.
The previous attempt increased the ASPECT_RATIO from 40 to 60, which made the model bigger, but it led to a higher val_bpb. Maybe increasing the model size isn’t the best approach if it causes the val_bpb to go up. Wait, the previous suggestion actually had a lower val_bpb than the baseline. Wait, no, the previous attempt had a val_bpb of 1.024676, which was lower than the baseline of 1.056, but then the user tried changing ASPECT_RATIO to 60 and the val_bpb went up to 1.033, which was worse. So that indicates that increasing the model size beyond a certain point might not be helpful.
But perhaps increasing ASPECT_RATIO (and thus model_dim) requires decreasing DEVICE_BATCH_SIZE to prevent OOM. For example, if increasing ASPECT_RATIO to 40, which increases model parameters, we might need to lower DEVICE_BATCH_SIZE to, say, 32 to stay within 85GB VRAM.
The original code had HEAD_DIM = 128 (a power of 2), and the previous attempt tried changing it to 64, which is also a power of 2. So that’s allowed. But that led to a higher val_bpb. Maybe that’s because the model became too narrow. Another angle: the number of heads and key-value heads. The current setup uses n_head = num_heads and n_kv_head = num_heads. If n_kv_head is reduced, that might allow saving VRAM.
But maybe changing the window pattern to “SLL” or “LLL” could allow the model to have longer contexts in more layers, which might help with validation performance. But how does that relate to the training time and VRAM? If changing the window pattern to “LLL” (all long windows) would require more memory, but if the VRAM is acceptable, it might be worth trying.
The model has never seen experiment feedback in its prompt. There is no “user” and no “previous attempt.” But through SDPO training on thousands of rollouts, the model’s weights have internalized which changes help, which crash, and why — all encoded as intuition, not context.
The strongest single result across all methods came from Claude Haiku 4.5 on turn 42 of 50 independent single-shot attempts (no feedback, no history). The edit follows the same core insight — trade model size for throughput — but arrives at it in one shot:
Result: val_bpb = 1.009 (−4.4%), completing 1,580 training steps vs 591 at default settings. Smaller batch (131K → 4× more steps), more depth (10 layers), halved device batch, lower learning rates, and a nonzero final LR fraction.
Likely a lucky outlier. Haiku only succeeded on 4 of 50 attempts with an average of 1.070. For comparison, Claude Opus 4.6 single-turn averaged 1.036 across 39 attempts (34/39 succeeded, best 1.015) — much more reliable but with a less extreme best case. The Haiku result is impressive but not reproducible; Opus shows what a stronger model achieves consistently.
The core question: does RL training add value beyond what in-context learning provides?
Single-turn beats ICL. The single-turn baseline (1.032) outperforms ICL (1.038). Context conditioning adds reliability, not capability — and can even hurt when the model over-conditions on noisy feedback.
SDPO breaks out. Weight updates encode exploration patterns that persist across rollouts without consuming context. SDPO converges to ~1.025 and holds steady for 300+ steps with 16 rollouts/step (~137M tokens total). Our GRPO run, despite 128 rollouts/step (~94M tokens) and the addition of a KL penalty and batch-normalized advantages, plateaus at 1.037. The trained SDPO checkpoint (1.023) surpasses the Karpathy agent (0.970, −2.8%) from a 14B open-source model.
The gap between single-turn and ICL measures what context contributes (nothing — single-turn actually wins). The gap between both baselines and SDPO measures what weights contribute.
We ran four complementary analyses on the trained checkpoints to understand how SDPO (−3.1% in ~137M tokens, 16 rollouts/step) achieves better results than our GRPO run (−1.8% in ~94M tokens, 128 rollouts/step with KL penalty and batch-normalized advantages). The picture is consistent: SDPO makes targeted, low-rank weight changes that don’t alter internal activation geometry but meaningfully shift what the model predicts. Our GRPO checkpoint, even with 8× more rollouts per step, shows minimal functional divergence from the base model.
We decompose ΔW = Wft − Wbase via SVD across all 282 weight matrices to characterize the structure of what each method learns.
SDPO’s weight deltas are 4× larger in magnitude but concentrated in fewer dimensions (effective rank 863 vs 1346). Its top-5 singular values capture 13.9% of total variance vs 8.7% for GRPO. SDPO learned specific computational patterns in a few targeted directions, while our GRPO run spread smaller changes diffusely across many dimensions — consistent with the KL penalty and batch normalization constraining updates to stay near the base distribution.
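The two statistics above can be computed as follows. A sketch using one standard definition of effective rank (the exponential of the entropy of normalized singular values); whether this matches our analysis code exactly is an assumption, and the matrices below are synthetic stand-ins:

```python
import numpy as np

def effective_rank(delta_w):
    """Effective rank of a weight delta: exp of the Shannon entropy of the
    normalized singular value distribution (one common definition)."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def top_k_variance_share(delta_w, k=5):
    """Fraction of total variance (sum of squared singular values)
    captured by the top-k singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return float((s[:k] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
# A concentrated (low-rank) delta vs a diffuse (dense noise) delta,
# standing in for the SDPO-like and GRPO-like patterns respectively.
low_rank = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
diffuse = rng.normal(size=(64, 64))
```

On the synthetic matrices, the low-rank delta shows a small effective rank and a top-5 share near 1.0, while the diffuse delta spreads variance across many directions, the same qualitative contrast we observe between the SDPO and GRPO checkpoints.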
Standard few-shot benchmarks (MMLU, HellaSwag, ARC, Winogrande) plus generative evaluation (GSM8K 8-shot CoT) and WikiText perplexity, measured via lm-eval-harness on both checkpoints.
| Metric | GRPO Δ | SDPO Δ | Interpretation |
|---|---|---|---|
| MC benchmark avg | −0.21% | −0.39% | Mild forgetting, within noise |
| MMLU (5-shot) | +0.06% | +0.02% | Knowledge fully preserved |
| GSM8K CoT (8-shot) | +0.23% | +1.21% | SDPO gains math reasoning |
| WikiText PPL | −0.003 | −0.105 | SDPO improves language modeling |
| KL(base‖model) | 0.0007 | 0.0017 | SDPO diverges 2.3× more |
SDPO achieves a favorable learning-forgetting trade-off: it improves GSM8K by +1.21% and WikiText perplexity by −0.1 — genuine capability gains on generative tasks — while preserving MMLU and accepting only mild MC benchmark degradation. Our GRPO run, with the KL penalty we added to stabilize training, barely moves from the base distribution — KL divergence of just 0.0007 vs SDPO’s 0.0017.
We measured per-token cosine similarity between base and fine-tuned model activations at layers 10, 20, 30, and 38 across 600 diverse sequences (C4, code, GSM8K, Wikipedia).
Result: cosine similarity > 0.999 everywhere for both models. Neither GRPO nor SDPO meaningfully changes the model’s internal activation geometry. The representational space is nearly identical to the base model. This means the weight changes operate at a finer granularity than activations can capture — they modify what the model outputs without changing how it represents inputs internally.
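The per-token measurement is simple to reproduce in outline. A sketch on synthetic activations (rows are tokens, columns are hidden dimensions; the array shapes are illustrative):

```python
import numpy as np

def per_token_cosine(acts_base, acts_ft):
    """Cosine similarity between base and fine-tuned activations,
    computed independently for each token position."""
    num = (acts_base * acts_ft).sum(axis=1)
    den = np.linalg.norm(acts_base, axis=1) * np.linalg.norm(acts_ft, axis=1)
    return num / den

rng = np.random.default_rng(0)
base = rng.normal(size=(16, 512))                      # 16 tokens, hidden 512
finetuned = base + 1e-3 * rng.normal(size=(16, 512))   # tiny perturbation
sims = per_token_cosine(base, finetuned)
```

A perturbation this small leaves every per-token similarity above 0.999, which is exactly the regime both checkpoints sit in: the weight changes are real but too fine-grained for activation geometry to register.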
Since activations are geometrically identical, we projected each layer’s residual stream through the unembedding matrix to see how predictions evolve through the network. This captures functional differences invisible to activation-space metrics.
Predictions changed from base (final layer)
Entropy shift (final layer, nats)
SDPO changes 1.6× more predictions than GRPO (1.88% vs 1.16% of tokens at the final layer) and consistently increases prediction entropy (+0.017 nats vs ~0). The divergence is uniform across all layers — not concentrated at any particular depth — suggesting a distributed functional shift rather than a single circuit being modified. Both models converge back toward base predictions in the final layers (agreement rises from ~97% at mid-layers to ~99% at final), indicating the changes primarily affect intermediate processing.
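The logit-lens comparison reduces to a projection and an argmax per position. A simplified sketch on synthetic tensors (real pipelines apply the final layernorm before projecting; shapes and the perturbation scale are illustrative):

```python
import numpy as np

def logit_lens_predictions(residual_stream, unembed):
    """Project a layer's residual stream through the unembedding matrix
    and take the argmax token at each position (the 'logit lens')."""
    logits = residual_stream @ unembed   # (tokens, vocab)
    return logits.argmax(axis=1)

def prediction_change_rate(preds_base, preds_ft):
    """Fraction of positions where the fine-tuned model's projected
    prediction differs from the base model's."""
    return float((preds_base != preds_ft).mean())

rng = np.random.default_rng(0)
unembed = rng.normal(size=(128, 1000))   # hidden 128, toy vocab 1000
resid_base = rng.normal(size=(32, 128))
resid_ft = resid_base + 0.05 * rng.normal(size=(32, 128))

preds_base = logit_lens_predictions(resid_base, unembed)
rate = prediction_change_rate(preds_base,
                              logit_lens_predictions(resid_ft, unembed))
```

Running this per layer and averaging over a corpus yields the change-rate and entropy-shift curves discussed above; it is sensitive precisely where activation cosine similarity is not.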
Combining all four analyses: SDPO makes surgical, low-rank weight changes (SVD) that don’t alter activation geometry (cosine similarity > 0.999) but meaningfully shift predictions (1.6× more tokens changed, higher entropy) in a way that improves generative capabilities (GSM8K +1.21%, WikiText −0.1) without degrading knowledge (MMLU +0.02%). Our GRPO run — which required adding a KL penalty and batch normalization just to produce stable training — made changes too small and diffuse to show up in any of these analyses. The 8× rollout advantage (128 vs 16 per step) didn’t compensate for the constrained update magnitude.
Full analysis code and results: `scripts/interp/`
The original autoresearch experiments used an LLM-based agent on H100 GPUs, running 126 experiments over multiple cycles. The best result improved val_bpb by 2.8% (0.998 → 0.970). Key improvements: batch halving for more gradient steps (−0.012), adding depth (−0.004), embedding LR tuning (−0.003), and weight decay adjustments.
A key design difference: Karpathy’s setup accepts successful diffs into
train.py, so each experiment builds on the last improvement — the file evolves
over time. In our setup, the agent always receives the original unmodified
train.py. This is deliberate: SDPO rollouts start from scratch each time
(learning lives in the weights, not the file), so the baselines must do the same for a fair
comparison. It means every improvement must be independently rediscovered, making our results
harder to achieve but cleaner to interpret.
Despite this handicap, our SDPO-trained checkpoint evaluated with ICL (−3.1%) surpasses the original Karpathy agent (−2.8%), using a 14B open-source model trained with RL. The dominant insight is the same across all agents: throughput over parameters — trade model size for more training steps within the fixed budget. Both agents independently converge on this strategy.
| Method | Model | Experiments | Best | Avg |
|---|---|---|---|---|
| SDPO ckpt + ICL | Qwen3-14B-SDPO | 50 turns (with feedback) | 1.023 (−3.1%) | 1.071 |
| SDPO (training) | Qwen3-14B | 7,920 rollouts (495 steps × 16) | 1.023 (−3.1%) | — |
| Karpathy Agent | Claude? | 126 | 0.970 (−2.8%) | — |
| SDPO ckpt + single | Qwen3-14B-SDPO | 50 turns (no feedback) | 1.028 (−2.6%) | 1.060 |
| Single-turn | Qwen3-14B | 50 turns (no feedback) | 1.032 (−2.3%) | 1.122 |
| GRPO (training) | Qwen3-14B | 4,864 rollouts (38 steps × 128) | 1.037 (−1.8%) | — |
| ICL baseline | Qwen3-14B | 50 turns (with feedback) | 1.038 (−1.7%) | 1.066 |
* Absolute baselines differ (Karpathy: 0.998, ours: 1.056) due to platform/setup differences; comparisons therefore use relative improvements. Our GRPO run uses group_size=8 (128 rollouts/step, ~94M tokens) with a KL penalty and batch-normalized advantages added to stabilize training; SDPO uses 16 rollouts/step (~137M tokens).
Teacher model bias at small scale. SDPO’s teacher is the same model conditioned on hindsight feedback. At 14B, this creates a ceiling: the teacher can only “correct” the student toward solutions the base model could already express when given the right context. The teacher-student gap that drives SDPO’s learning signal is bounded by the base model’s own capabilities, which may explain why SDPO improved average behavior more than it discovered qualitatively new solutions. Larger base models or external teacher models (e.g., using a stronger model to generate feedback) could widen this ceiling.
Compound edits. Encouraging the model to combine multiple single edits that individually reduce validation loss remains a challenge. The best SDPO result is a compound edit, but these emerged rarely and late in training.
Rollout conflation. Although references to previous experiments in chain-of-thought serve as a means of improving reasoning quality, the model tends to conflate feedback from different rollouts. We suspect this is due to using a batch size greater than one with the same prompt and varying teacher hindsight context.
Search exhaustion. Novelty rate collapsed to ~10% by step 100 despite explicit novelty rewards and hash-based deduplication. The model cycles through a limited repertoire of edits rather than discovering genuinely new regions of the solution space. Improving proposal diversity — through better prompting, tool use, or external search — may matter more than further algorithmic improvements.
Autoresearch (Karpathy) —
A single-file GPT pretraining script. The flagship benchmark: modify train.py to minimize
val_bpb within a 5-minute training budget on a single H100.
SDPO (Hübotter et al., 2026) — Self-Distillation Policy Optimization. Converts tokenized environment feedback into a dense learning signal by treating the current model conditioned on feedback as a self-teacher, then distilling its corrected next-token predictions back into the policy. We extend it with agentic tool use and live environment rewards from real experiments.
VERL — RL training framework for LLMs. Our fork adds SDPO, agentic tool use, and multi-turn rollouts with configurable task environments.
@misc{naihin2026distillloop,
title = {RL Training for LLM Agents on Live Task Improvement Loops},
author = {Naihin, Silen and Fallah, Kion},
year = {2026},
url = {https://github.com/SilenNaihin/autoresearch-distillation}
}