A deep dive into trajectory homogenization and how trajectory-level evolution achieves 80% on SWE-bench Verified
MultiOutputClassifier fails to expose the classes_ attribute, causing an AttributeError in cross_val_predict
Error surfaces in _validation.py
cross_val_predict expects every classifier to expose a classes_ attribute
Root cause: multioutput.py never stores classes_ after fit()
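The fix pattern can be shown with a toy sketch (this is illustrative, not the actual scikit-learn source): a multi-output wrapper must aggregate classes_ from its per-output estimators after fit(), or downstream code that expects classifier-like objects raises AttributeError.

```python
# Toy sketch of the missing-attribute pattern. ToyClassifier and
# ToyMultiOutputClassifier are hypothetical stand-ins, not sklearn classes.

class ToyClassifier:
    """Stand-in for a single-output classifier."""
    def fit(self, X, y):
        self.classes_ = sorted(set(y))  # per-output label set
        return self

class ToyMultiOutputClassifier:
    """Fits one ToyClassifier per output column."""
    def __init__(self, base):
        self.base = base

    def fit(self, X, Y):
        # One estimator per output column.
        self.estimators_ = [self.base().fit(X, col) for col in zip(*Y)]
        # The fix: expose classes_ by collecting it from each sub-estimator.
        self.classes_ = [est.classes_ for est in self.estimators_]
        return self

X = [[0], [1], [2]]
Y = [[0, 1], [1, 0], [0, 1]]
clf = ToyMultiOutputClassifier(ToyClassifier).fit(X, Y)
print(clf.classes_)  # [[0, 1], [0, 1]]
```

The crash site (_validation.py) only reads the attribute; the fix belongs where the attribute was never written.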
"Stack trace shows WHERE the crash happens, but not WHERE the fix should go."
Temperature sampling creates token-level diversity, but agents need trajectory-level diversity to escape the homogenization trap. SE-Agent's operators inject prompts that force orthogonal exploration.
Pass@1 comparison across 5 LLMs • SE-Agent delivers +30-112% relative improvement
SE-Agent reaches near-optimal performance with just 10 trajectories, demonstrating that trajectory-level evolution is far more efficient than naive sampling.
LLM agents don't fail randomly; they fail in highly correlated ways. Running the same agent 10 times often produces 9 nearly identical wrong answers. This makes traditional test-time scaling (more samples = better) extremely inefficient.
Increasing temperature or top-p creates token-level variation, not strategy-level diversity. The agent still follows the same reasoning pattern, just with slightly different wording. You need to intervene at the trajectory level.
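The contrast can be sketched in a few lines (names and directive wording are illustrative, not SE-Agent's API): temperature resamples tokens within one strategy, while trajectory-level intervention varies the strategy itself by prepending orthogonal directives to otherwise identical task prompts.

```python
# Hypothetical sketch: instead of N temperature resamples of one prompt,
# build one prompt per orthogonal strategy directive.

STRATEGY_DIRECTIVES = [
    "Localize the fix by tracing the stack trace upward into library code.",
    "Ignore the stack trace; search for where the missing attribute should be set.",
    "Write a failing test first, then patch until it passes.",
]

def build_trajectory_prompts(task: str) -> list[str]:
    """One prompt per strategy, forcing divergence before generation starts."""
    return [f"{directive}\n\nTask: {task}" for directive in STRATEGY_DIRECTIVES]

prompts = build_trajectory_prompts("Fix the AttributeError in cross_val_predict.")
assert len(set(prompts)) == len(prompts)  # every trajectory starts from a distinct strategy
```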
System prompt injection: stable, shifts the overall strategy.
Key action injection: stronger and more targeted.
In SWE-bench, this is typically the step right after bug reproduction succeeds.
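A minimal sketch of key action injection, assuming a trajectory is a list of (step_type, content) pairs (SE-Agent's real data structures are not shown in the text): a divergence instruction is inserted immediately after the first successful bug-reproduction step.

```python
# Hypothetical trajectory format and helper; step names are illustrative.

def inject_after(trajectory, step_type, injection):
    """Insert an injected step right after the first step of the given type."""
    out, injected = [], False
    for step in trajectory:
        out.append(step)
        if not injected and step[0] == step_type:
            out.append(("injection", injection))
            injected = True
    return out

trajectory = [
    ("read_issue", "AttributeError in cross_val_predict"),
    ("reproduce_bug", "reproduction script fails as expected"),
    ("edit_code", "patch _validation.py"),
]
revised = inject_after(
    trajectory,
    "reproduce_bug",
    "Previous fixes at the crash site failed; look for the root cause elsewhere.",
)
print([s[0] for s in revised])
# ['read_issue', 'reproduce_bug', 'injection', 'edit_code']
```

Injecting at this point targets the step where the agent commits to a fix location, which is exactly where homogenized trajectories tend to converge.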
The revision operator's power comes from explicitly framing previous attempts as incorrect. This prompts the model to explore genuinely different approaches rather than minor variations.
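A sketch of the revision-style prompt (the wording is illustrative; the operator's exact prompts are not reproduced here). The key move is stating outright that prior attempts were wrong, which pushes the model off the shared failure mode instead of paraphrasing it.

```python
# Hypothetical revision prompt builder; function name and phrasing are
# this sketch's own, not SE-Agent's.

def revision_prompt(task: str, failed_attempts: list[str]) -> str:
    failures = "\n".join(f"- {a}" for a in failed_attempts)
    return (
        f"Task: {task}\n"
        f"The following previous attempts are INCORRECT:\n{failures}\n"
        "Do not repeat them. Propose a fundamentally different approach."
    )

prompt = revision_prompt(
    "Fix MultiOutputClassifier so cross_val_predict works.",
    ["Patched the call site in _validation.py", "Added a try/except around classes_"],
)
print(prompt)
```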
Making each agent attempt "orthogonal" in solution space is more valuable than making more attempts. SE-Agent achieves this through trajectory-level revision, recombination, and refinement.