BLOG ASSETS FOR SE-AGENT

Why Temperature Sampling Isn't Enough for Agents

A deep dive into trajectory homogenization and how trajectory-level evolution achieves 80% on SWE-bench Verified

CASE STUDY

scikit-learn #14629: The Homogenization Trap

MultiOutputClassifier fails to expose the classes_ attribute, causing an AttributeError in cross_val_predict

⚠️ Bug Manifestation

Error trace:

AttributeError: 'MultiOutputClassifier' object has no attribute 'classes_'

1. The error surfaces in _validation.py.
2. cross_val_predict expects every classifier to expose classes_.

Root cause: multioutput.py never stores classes_ after fit().

💡 Key Insight

"Stack trace shows WHERE the crash happens, but not WHERE the fix should go."
Symptom location: _validation.py
Root cause location: multioutput.py
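The fix pattern this case study points at can be sketched with toy classes. This is a hypothetical, simplified illustration (ToyClassifier and ToyMultiOutputClassifier are stand-ins, not the actual scikit-learn source): the multi-output wrapper exposes classes_ by delegating to its fitted per-target estimators instead of never storing the attribute at all.

```python
# Hypothetical simplified sketch of the fix pattern, not scikit-learn source:
# the multi-output wrapper exposes classes_ by delegating to the fitted
# per-target sub-estimators.
import numpy as np

class ToyClassifier:
    """Stand-in for a single-target classifier that records classes_ on fit."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

class ToyMultiOutputClassifier:
    """Stand-in for MultiOutputClassifier: one estimator per output column."""
    def __init__(self, estimator_cls):
        self.estimator_cls = estimator_cls

    def fit(self, X, Y):
        self.estimators_ = [
            self.estimator_cls().fit(X, Y[:, i]) for i in range(Y.shape[1])
        ]
        return self

    @property
    def classes_(self):
        # The fix: aggregate classes_ from the fitted sub-estimators.
        return [est.classes_ for est in self.estimators_]

X = np.zeros((4, 2))
Y = np.array([[0, 1], [1, 2], [0, 1], [1, 2]])
clf = ToyMultiOutputClassifier(ToyClassifier).fit(X, Y)
print([c.tolist() for c in clf.classes_])  # [[0, 1], [1, 2]]
```

The key design point: the attribute lives on the wrapper as a property computed from the fitted children, so downstream code like cross_val_predict can read it without knowing the wrapper's internals.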
🎯
Core Insight for Agent Builders

Temperature sampling creates token-level diversity, but agents need trajectory-level diversity to escape the homogenization trap. SE-Agent's operators inject prompts that force orthogonal exploration.

EXPERIMENTAL RESULTS

SWE-bench Verified Performance

Pass@1 comparison across 5 LLMs • SE-Agent delivers +30-112% relative improvement

SE-Agent (Ours) Pass@1 by model, with relative gain over the SWE-Agent (CodeAct) baseline (SWE-Search (MCTS) is the second baseline in the chart):

DS-V3: 54.8% (+73%)
Qwen: 38.8% (+106%)
Llama: 32.6% (+112%)
GPT-4o: 40.4% (+80%)
Claude: 61.2% (+51%)

Best result: 80.0% (Claude-4-Sonnet + SE-Agent)
Max relative gain vs SWE-Agent baseline: +55%
Trajectories for near-optimal performance: 10

⚡ Efficiency: Same Cost, Better Results

🎯 Key Finding

SE-Agent reaches near-optimal performance with just 10 trajectories, demonstrating that trajectory-level evolution is far more efficient than naive sampling.

SWE-Agent @ 20 tries vs SE-Agent @ 10 tries: 2× more efficient.

📉 Traditional TTS: diminishing returns after ~5 samples.
📈 SE-Agent evolution: each iteration explores orthogonal paths.

KEY INSIGHTS

What Matters for Agent Builders

🎯

The Homogenization Problem

LLM agents don't fail randomly; they fail in highly correlated ways. Running the same agent 10 times often produces 9 nearly identical wrong answers. This makes traditional test-time scaling (more samples = better) extremely inefficient.
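The cost of correlated failures is easy to see in a toy simulation (illustrative numbers only, not data from the paper). Under a crude model where, with probability rho, all k attempts share a single outcome, pass@10 collapses toward pass@1:

```python
# Toy simulation of the homogenization trap (illustrative only, not data
# from the paper). With probability `rho`, all k attempts at a task share
# one correlated outcome; otherwise the k attempts are independent draws.
import random

def pass_at_k(p_success, rho, k, trials=20000, seed=0):
    rng = random.Random(seed)
    solved = 0
    for _ in range(trials):
        if rng.random() < rho:
            # Correlated regime: one draw decides all k attempts.
            solved += rng.random() < p_success
        else:
            # Independent regime: any of the k draws can succeed.
            solved += any(rng.random() < p_success for _ in range(k))
    return solved / trials

independent = pass_at_k(p_success=0.3, rho=0.0, k=10)
correlated = pass_at_k(p_success=0.3, rho=0.9, k=10)
print(f"independent pass@10 ~ {independent:.2f}")  # ~0.97
print(f"correlated  pass@10 ~ {correlated:.2f}")   # ~0.37
```

With fully independent attempts, pass@10 = 1 - 0.7^10 ≈ 0.97; at 90% correlation it falls to roughly 0.37, barely above single-attempt performance. That gap is exactly why "just sample more" fails.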

🌑️

Temperature ≠ Diversity

Increasing temperature or top-p creates token-level variation, not strategy-level diversity. The agent still follows the same reasoning pattern, just with slightly different wording. You need to intervene at the trajectory level.

💉

Two Injection Points

System prompt injection: stable, shifts the overall strategy.
Key-action injection: stronger and more targeted. In SWE-bench, this is typically the step right after bug reproduction succeeds.
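In a typical agent loop, the two injection points might look like this. This is a hypothetical sketch: the message schema, BASE_SYSTEM text, and the reproduction_succeeded signal are illustrative assumptions, not SE-Agent's actual interface.

```python
# Hypothetical sketch of the two injection points (message schema and the
# `reproduction_succeeded` signal are assumptions, not SE-Agent's API).
BASE_SYSTEM = "You are a software engineering agent. Fix the reported bug."

def build_messages(task, strategy_hint, history, reproduction_succeeded):
    # Injection point 1: system prompt. Stable, shifts the overall strategy.
    messages = [{"role": "system",
                 "content": f"{BASE_SYSTEM}\nStrategy: {strategy_hint}"}]
    messages.append({"role": "user", "content": task})
    messages.extend(history)
    # Injection point 2: key action. Stronger and more targeted; fire it
    # right after the bug-reproduction step succeeds.
    if reproduction_succeeded:
        messages.append({"role": "user", "content":
            "The bug is now reproduced. Before editing, locate the root "
            "cause, which may live in a different file than the crash."})
    return messages

msgs = build_messages(
    task="Fix AttributeError: 'MultiOutputClassifier' has no 'classes_'",
    strategy_hint="Trace the attribute back to where it should be set.",
    history=[],
    reproduction_succeeded=True,
)
print(len(msgs))  # system + task + key-action injection = 3
```

The design choice to key the second injection off a concrete event (reproduction succeeding) is what makes it targeted: the hint arrives exactly when the agent is about to commit to a fix location.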

🧬

Treat Failures as "Wrong"

The revision operator's power comes from explicitly framing previous attempts as incorrect. This prompts the model to explore genuinely different approaches rather than minor variations.
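A revision prompt in this spirit might read as follows. The template wording is hypothetical, not the paper's exact operator; the key move is labeling the prior trajectory as incorrect and demanding an orthogonal approach.

```python
# Hypothetical revision-operator prompt (illustrative wording, not the
# paper's exact template). The prior attempt is explicitly framed as
# INCORRECT to push the model off the correlated failure path.
REVISION_TEMPLATE = """Your previous attempt is INCORRECT. Summary:
{prev_summary}

Do NOT repeat or lightly modify that approach. Propose a fundamentally
different strategy: different files to inspect, a different hypothesis
about the root cause, a different order of operations."""

def make_revision_prompt(prev_summary):
    return REVISION_TEMPLATE.format(prev_summary=prev_summary)

prompt = make_revision_prompt(
    "Patched _validation.py to skip the classes_ check."
)
print(prompt.splitlines()[0])
```

Note the contrast with naive resampling: instead of hoping temperature produces a different answer, the prompt makes the previous strategy explicitly off-limits.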

The Bottom Line

Making each agent attempt "orthogonal" in solution space is more valuable than making more attempts. SE-Agent achieves this through trajectory-level revision, recombination, and refinement.