BLOG ASSETS FOR SE-AGENT

Why Temperature Sampling Isn't Enough for Agents

A deep dive into trajectory homogenization and how trajectory-level evolution achieves 80% on SWE-bench Verified

CASE STUDY

scikit-learn #14629: The Homogenization Trap

MultiOutputClassifier fails to expose the classes_ attribute, causing an AttributeError in cross_val_predict

⚠️ Bug Manifestation

Error trace:

AttributeError: 'MultiOutputClassifier' object has no attribute 'classes_'

1. The error surfaces in _validation.py.
2. cross_val_predict expects every classifier to expose classes_.

Root cause: multioutput.py never stores classes_ after fit().

💡 Key Insight

"Stack trace shows WHERE the crash happens, but not WHERE the fix should go."
Symptom location: _validation.py
Root cause location: multioutput.py
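The fix pattern this case study points at can be sketched with toy classes. This is a hypothetical, simplified illustration (ToyClassifier and ToyMultiOutputClassifier are stand-ins, not the actual scikit-learn source): the multi-output wrapper exposes classes_ by delegating to its fitted per-target estimators instead of never storing the attribute at all.

```python
# Hypothetical simplified sketch of the fix pattern, not scikit-learn source:
# the multi-output wrapper exposes classes_ by delegating to the fitted
# per-target sub-estimators.
import numpy as np

class ToyClassifier:
    """Stand-in for a single-target classifier that records classes_ on fit."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

class ToyMultiOutputClassifier:
    """Stand-in for MultiOutputClassifier: one estimator per output column."""
    def __init__(self, estimator_cls):
        self.estimator_cls = estimator_cls

    def fit(self, X, Y):
        self.estimators_ = [
            self.estimator_cls().fit(X, Y[:, i]) for i in range(Y.shape[1])
        ]
        return self

    @property
    def classes_(self):
        # The fix: aggregate classes_ from the fitted sub-estimators.
        return [est.classes_ for est in self.estimators_]

X = np.zeros((4, 2))
Y = np.array([[0, 1], [1, 2], [0, 1], [1, 2]])
clf = ToyMultiOutputClassifier(ToyClassifier).fit(X, Y)
print([c.tolist() for c in clf.classes_])  # [[0, 1], [1, 2]]
```

The key design point: the attribute lives on the wrapper as a property computed from the fitted children, so downstream code like cross_val_predict can read it without knowing the wrapper's internals.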
🎯
Core Insight for Agent Builders

Temperature sampling creates token-level diversity, but agents need trajectory-level diversity to escape the homogenization trap. SE-Agent's operators inject prompts that force orthogonal exploration.

EXPERIMENTAL RESULTS

SWE-bench Verified Performance

Pass@1 comparison across 5 LLMs • SE-Agent delivers +30-112% relative improvement

SE-Agent (Ours) Pass@1 by model, with relative gain over the SWE-Agent (CodeAct) baseline (SWE-Search (MCTS) is the second baseline in the chart):

DS-V3: 54.8% (+73%)
Qwen: 38.8% (+106%)
Llama: 32.6% (+112%)
GPT-4o: 40.4% (+80%)
Claude: 61.2% (+51%)

Best result: 80.0% (Claude-4-Sonnet + SE-Agent)
Max relative gain vs SWE-Agent baseline: +55%
Trajectories for near-optimal performance: 10

⚡ Efficiency: Same Cost, Better Results

🎯 Key Finding

SE-Agent reaches near-optimal performance with just 10 trajectories, demonstrating that trajectory-level evolution is far more efficient than naive sampling.

SWE-Agent @ 20 tries vs SE-Agent @ 10 tries: 2× more efficient.

📉 Traditional TTS: diminishing returns after ~5 samples.
📈 SE-Agent evolution: each iteration explores orthogonal paths.

KEY INSIGHTS

What Matters for Agent Builders

🎯

The Homogenization Problem

LLM agents don't fail randomly; they fail in highly correlated ways. Running the same agent 10 times often produces 9 nearly identical wrong answers. This makes traditional test-time scaling (more samples = better) extremely inefficient.
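The cost of correlated failures is easy to see in a toy simulation (illustrative numbers only, not data from the paper). Under a crude model where, with probability rho, all k attempts share a single outcome, pass@10 collapses toward pass@1:

```python
# Toy simulation of the homogenization trap (illustrative only, not data
# from the paper). With probability `rho`, all k attempts at a task share
# one correlated outcome; otherwise the k attempts are independent draws.
import random

def pass_at_k(p_success, rho, k, trials=20000, seed=0):
    rng = random.Random(seed)
    solved = 0
    for _ in range(trials):
        if rng.random() < rho:
            # Correlated regime: one draw decides all k attempts.
            solved += rng.random() < p_success
        else:
            # Independent regime: any of the k draws can succeed.
            solved += any(rng.random() < p_success for _ in range(k))
    return solved / trials

independent = pass_at_k(p_success=0.3, rho=0.0, k=10)
correlated = pass_at_k(p_success=0.3, rho=0.9, k=10)
print(f"independent pass@10 ~ {independent:.2f}")  # ~0.97
print(f"correlated  pass@10 ~ {correlated:.2f}")   # ~0.37
```

With fully independent attempts, pass@10 = 1 - 0.7^10 ≈ 0.97; at 90% correlation it falls to roughly 0.37, barely above single-attempt performance. That gap is exactly why "just sample more" fails.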

🌑️

Temperature ≠ Diversity

Increasing temperature or top-p creates token-level variation, not strategy-level diversity. The agent still follows the same reasoning pattern, just with slightly different wording. You need to intervene at the trajectory level.

💉

Two Injection Points

System prompt injection: stable, shifts the overall strategy.
Key-action injection: stronger and more targeted. In SWE-bench, this is typically the step right after bug reproduction succeeds.
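In a typical agent loop, the two injection points might look like this. This is a hypothetical sketch: the message schema, BASE_SYSTEM text, and the reproduction_succeeded signal are illustrative assumptions, not SE-Agent's actual interface.

```python
# Hypothetical sketch of the two injection points (message schema and the
# `reproduction_succeeded` signal are assumptions, not SE-Agent's API).
BASE_SYSTEM = "You are a software engineering agent. Fix the reported bug."

def build_messages(task, strategy_hint, history, reproduction_succeeded):
    # Injection point 1: system prompt. Stable, shifts the overall strategy.
    messages = [{"role": "system",
                 "content": f"{BASE_SYSTEM}\nStrategy: {strategy_hint}"}]
    messages.append({"role": "user", "content": task})
    messages.extend(history)
    # Injection point 2: key action. Stronger and more targeted; fire it
    # right after the bug-reproduction step succeeds.
    if reproduction_succeeded:
        messages.append({"role": "user", "content":
            "The bug is now reproduced. Before editing, locate the root "
            "cause, which may live in a different file than the crash."})
    return messages

msgs = build_messages(
    task="Fix AttributeError: 'MultiOutputClassifier' has no 'classes_'",
    strategy_hint="Trace the attribute back to where it should be set.",
    history=[],
    reproduction_succeeded=True,
)
print(len(msgs))  # system + task + key-action injection = 3
```

The design choice to key the second injection off a concrete event (reproduction succeeding) is what makes it targeted: the hint arrives exactly when the agent is about to commit to a fix location.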

🧬

Treat Failures as "Wrong"

The revision operator's power comes from explicitly framing previous attempts as incorrect. This prompts the model to explore genuinely different approaches rather than minor variations.
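A revision prompt in this spirit might read as follows. The template wording is hypothetical, not the paper's exact operator; the key move is labeling the prior trajectory as incorrect and demanding an orthogonal approach.

```python
# Hypothetical revision-operator prompt (illustrative wording, not the
# paper's exact template). The prior attempt is explicitly framed as
# INCORRECT to push the model off the correlated failure path.
REVISION_TEMPLATE = """Your previous attempt is INCORRECT. Summary:
{prev_summary}

Do NOT repeat or lightly modify that approach. Propose a fundamentally
different strategy: different files to inspect, a different hypothesis
about the root cause, a different order of operations."""

def make_revision_prompt(prev_summary):
    return REVISION_TEMPLATE.format(prev_summary=prev_summary)

prompt = make_revision_prompt(
    "Patched _validation.py to skip the classes_ check."
)
print(prompt.splitlines()[0])
```

Note the contrast with naive resampling: instead of hoping temperature produces a different answer, the prompt makes the previous strategy explicitly off-limits.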

The Bottom Line

Making each agent attempt "orthogonal" in solution space is more valuable than making more attempts. SE-Agent achieves this through trajectory-level revision, recombination, and refinement.