Your LLM Understood the Code — Then Forgot the Answer
Code Agents are increasingly capable of acting as autonomous software engineers: resolving complex issues, writing tests, and refactoring real-world codebases. Yet beneath this impressive surface, their performance rests on a few atomic code-deduction capabilities. What exactly happens inside these models when they exercise them?
Typically we only see the final output, right or wrong, and that masks a fascinating possibility: often the model actually knows the answer. It computes the correct result in its early layers but actively overwrites it before it reaches the final layer.
In this post, we apply controlled code-understanding stimuli together with the interpretability tool Patchscopes (Ghandeharioun et al., 2024) to dissect the layer-by-layer processing timeline inside Qwen2.5-Coder. We substantially adapted the probe for this practical setting, and it illuminates a surprising set of phenomena:
- For certain coding tasks, the semantically correct answer emerges extremely early but is behaviorally unstable: the model's own ongoing processing destroys it (a phenomenon we term Overthinking).
- "What the model internally knows" and "what it can articulate" are governed by very different mechanisms.
- This suggests that existing Early Exit strategies hold real potential in practical Agent scenarios, but only if we know which task profile the model is currently executing.
Code understanding turns out to be an exceptional lens for studying reasoning instability. Unlike the inherent fuzziness of natural language, executable code has deterministic, verifiable intermediate states, giving us exceptionally clean ground-truth signals.
I. A Counterintuitive Gap: Data Flow vs Control Flow
Our initial motivation came from observing real-world Code Agents solving SWE-Bench-style issues. Under the hood, the agent constantly oscillates between two primitives:
- Tracking data flow: `x = 2; y = x + 1`
- Reasoning about control flow: `if x > 1: return A else: return B`
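To make the split concrete, stimuli of this shape can be script-generated with ground truth attached. A minimal sketch (function names and templates are ours, not the exact generator used in the experiments):

```python
import random

def make_dataflow_example(rng: random.Random):
    """Data-flow stimulus: a chained assignment plus one addition."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    code = f"x = {a}\ny = x + {b}\n# y ="
    return code, a + b  # prompt and its verifiable ground truth

def make_controlflow_example(rng: random.Random):
    """Control-flow stimulus: a single branch on a known value."""
    x = rng.randint(0, 3)
    code = f"x = {x}\nif x > 1:\n    out = 'A'\nelse:\n    out = 'B'\n# out ="
    return code, "A" if x > 1 else "B"
```

Because the snippets are executable, every label can be checked mechanically, which is what makes code such a clean probe target.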
When we fed the Qwen2.5-Coder 7B model script-generated, minimalist examples of these two operations, the results split sharply:
- Conditional (control-flow branching): 81.7% final accuracy.
- Computing (data-flow arithmetic): 18.6% final accuracy.
A staggering gap. A follow-up check, flattening addition and subtraction into pure sequential assignments (e.g. `a=b; c=a`), yielded similarly abysmal numbers. So the root issue isn't that "math is hard": the data-flow tracking format itself activates a remarkably fragile internal computational pathway.
II. Poking Around with Patchscopes
If the final answer is wrong, did it never understand it, or did it lose it along the way?
We leveraged Patchscopes to find out. Conceptually, it acts as a non-invasive mind reader: we copy the hidden state h_ℓ at layer ℓ of the code-evaluation forward pass and "patch" it into a generic target prompt (e.g. "# The answer is "), forcing the model to verbalize its mid-forward-pass belief.
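The mechanics can be sketched on a toy scalar "model". In the real setup, h_ℓ is patched via forward hooks into a separate Qwen2.5-Coder pass over the target prompt; here the stand-in `decode_head` plays the role of the generic "# The answer is " prompt, and all names are ours:

```python
def patchscope_readout(model_layers, embed, decode_head, prompt, layer_idx):
    """What does the model 'believe' after layer `layer_idx`?
    Run the prompt up through that layer, then hand the hidden state
    directly to a decoding head instead of the remaining layers."""
    h = embed(prompt)
    for layer in model_layers[:layer_idx + 1]:
        h = layer(h)
    return decode_head(h)

def layer_timeline(model_layers, embed, decode_head, prompt):
    """Decode the hidden state at every depth: the per-layer 'timeline'."""
    return [patchscope_readout(model_layers, embed, decode_head, prompt, l)
            for l in range(len(model_layers))]
```

On the real model, plotting this timeline per task is what reveals where an answer first appears and whether it survives to the final layer.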
For anyone exploring this space, two major caveats shaped our adaptation (detailed in Appendix A):
- Last-position injection is non-negotiable. Earlier work (such as the Tuned Lens, Belrose et al., 2023) extracts and injects at arbitrary token positions. In code, doing so corrupts the KV cache of every subsequent token and drives accuracy to roughly zero through context contamination.
- First-token pruning on larger models. At 14B scale, generation momentum takes over and the model spills endless hallucinated text. Unless evaluation is cropped to strictly the first emitted token, you retrieve nothing but noise.
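The second caveat reduces to a scoring rule: keep only the first emitted token. A sketch, with "token" approximated by whitespace splitting (the real evaluation would use the model's own tokenizer):

```python
def first_token_accuracy(generations, answers):
    """Score each generation by its first token only, discarding any
    runaway continuation the model hallucinates afterwards."""
    hits = 0
    for gen, ans in zip(generations, answers):
        toks = gen.split()  # crude stand-in for real tokenization
        if toks and toks[0] == str(ans):
            hits += 1
    return hits / len(generations)
```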
III. The Strongest Finding: Overthinking
The punchline: on the Computing task, whose final accuracy is 18.6%, a stunning 55.8% of test cases produced the correct answer at some intermediate layer.
Nearly 37 percentage points of competence evaporated in transit. The model didn't fail to understand; it overthought. An answer materialized deep in the network and was then actively overwritten by later processing, perhaps corrupted by irrelevant attention updates, erasing the truth.
Remarkably, this instability isn't universal. In the Conditional trials, representations harden: once the answer is grasped, it travels stably and smoothly to the final output.
In short: "knowing the answer" and "being able to state the answer" run on separate pathways.
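The Overthinking gap itself is a simple statistic over the per-layer readouts: compare final-layer accuracy against the fraction of cases that were correct at any layer. A sketch, with a data layout we assume for illustration:

```python
def overthinking_stats(layer_correct):
    """layer_correct[i][l] is True if the layer-l readout was correct
    for example i. Returns (final accuracy, ever-correct rate, gap)."""
    n = len(layer_correct)
    final_acc = sum(ex[-1] for ex in layer_correct) / n   # last layer only
    ever_acc = sum(any(ex) for ex in layer_correct) / n   # correct anywhere
    return final_acc, ever_acc, ever_acc - final_acc
```

On the Computing task, this gap is the reported 55.8% minus 18.6%, roughly 37 points.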
IV. Information Brewing: The Output-Ready Void
Why does this catastrophic decay hit Computing exclusively?
We cross-checked our Patchscopes results against classical linear probing (in the spirit of Alain & Bengio, 2016), uncovering what we call the Information Brewing Gap.
- Linear probes on the 14B model showed the correct answer was already decodable at Layer 2 (probing accuracy ~100%). The truth was intact and linearly separable almost immediately.
- Yet Patchscopes-driven generation lagged far behind, failing to articulate the answer until around Layer 33.
This translation lag is the vulnerability. The representation is conceptually mature but exists in a form hostile to auto-regressive generation (not output-ready). For roughly 30 subsequent layers, the model reshapes it to match the output token distribution, a vulnerable window in which even slight semantic interference wipes the answer. This aligns with broader mechanistic findings on late-layer functional shifts (e.g., DoLa, Chuang et al., 2024).
V. A Brief Aside: The Loop Illusion
On strict loop-control tasks built around `for i in range(n)`, every model scale scored essentially perfectly at n=1.
At n=2, performance collapsed across the board. Yet when we manually unrolled the loops into flat sequential statements for verification, the models snapped back to near-perfect accuracy. The implication is glaring: on basic iteration, current LLMs do not perform genuine recursive loop inference. Through pattern matching, they map single-iteration loops directly onto a primitive Copying circuit shortcut. It is a warning to stay vigilant against illusory competence hidden inside benchmarks.
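The unrolling check is mechanical: rewrite `for i in range(n)` as n sequential copies of the body with the loop variable substituted. A sketch of the transformation we mean (helper name and template convention are ours):

```python
def unroll_range_loop(var, n, body_template):
    """Flatten `for <var> in range(n): body` into n flat statements,
    substituting the loop variable's concrete value into each copy.
    body_template uses str.format-style slots, e.g. 'acc = acc + {i}'."""
    return "\n".join(body_template.format(**{var: i}) for i in range(n))
```

The unrolled string is still executable Python, so both forms can be verified against the same ground truth.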
VI. Utilizing Instability: Task-Aware Early Exits
How does diagnosing underlying instability optimize real-world autonomous coding systems?
Since the overwriting is widespread, halting inference early via an Early Exit is the natural response, a strategy long used in NLP to cut inference cost (Schuster et al., 2022).
But our observations demand a refined rule. Selecting the statically optimal exit layer per task on a validation set gives:
| Operation | Final-Layer Accuracy | Best Early-Exit Accuracy | Gain |
|---|---|---|---|
| Computing | 18.6% | 45.0% (@L22) | + 26.4 pp |
| Conditional | 81.7% | 82.7% (@L26) | + 1.0 pp |
This is the keystone claim: the benefit is aggressively task-aware. A single blanket truncation rule hurts overall stability. Effective future Agents must read the operational context dynamically: let deep deductions run to completion, but interrupt numerical data-tracing evaluations at the layer where the internal answer peaks, before it succumbs to self-imposed forgetting.
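Operationally, the task-aware rule is just a per-task argmax over validation accuracy by layer, rather than one global cutoff. A sketch, with a data layout we assume:

```python
def select_exit_layers(val_acc_by_task):
    """val_acc_by_task maps task name -> [accuracy at layer 0, 1, ...].
    Returns one static (best_layer, best_acc) per task; a single global
    truncation would sacrifice stable tasks to rescue unstable ones."""
    return {task: max(enumerate(accs), key=lambda t: t[1])
            for task, accs in val_acc_by_task.items()}
```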
Citation
Cited as:
Guo, Yifu; Chen, Siyue. (Mar 2026). Your LLM Understood the Code — Then Forgot the Answer. Yifu Guo's Blog. https://ericguo1019.com/blog/code_understanding_llm/.
Or
@article{guo2026codeunderstanding,
title = "Your LLM Understood the Code — Then Forgot the Answer",
author = "Guo, Yifu and Chen, Siyue",
journal = "ericguo1019.com",
year = "2026",
month = "Mar",
url = "https://ericguo1019.com/blog/code_understanding_llm/"
}
References
- Ghandeharioun, Asma, et al. "Patchscopes: A unifying framework for inspecting hidden representations of language models." ICML (2024).
- Alain, Guillaume, and Yoshua Bengio. "Understanding intermediate layers using linear classifier probes." arXiv preprint arXiv:1610.01644 (2016).
- Belrose, Nora, et al. "Eliciting latent predictions from transformers with the tuned lens." arXiv preprint arXiv:2304.14997 (2023).
- Chuang, Yung-Sung, et al. "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models." ICLR (2024).
- Schuster, Tal, et al. "Confident Adaptive Language Modeling." NeurIPS (2022).
Appendix A: Methodological Adaptations to Patchscopes
Patchscopes was originally designed for factual extraction (e.g., token identity and entity attributes) in natural language. Migrating it to multi-step code execution tracking revealed severe incompatibilities, leading to three crucial modifications:
- Last-Position Injection is Mandatory: While earlier interpretability methods (like Belrose et al., 2023's Tuned Lens) advocate injection at arbitrary mid-prompt tokens, doing so in auto-regressive code generation completely corrupts the subsequent KV cache. If h_ℓ is injected at `x_pos`, the following tokens (like `→` or `=`) fall out of sync, driving accuracy roughly to 0. Injecting strictly at the last position bypasses this context contamination.
- First-Token Pruning on Giants: Surprisingly, the 14B model initially exhibited higher instability on simple Copying tasks than on Computing tasks. Investigation revealed that the massive generation momentum of larger models caused them to hallucinate immense continuous code blocks instead of answering the target. Forcing strict first-token cropping during evaluation was required to restore the legitimate data patterns.
- Task-Dependent Decoding Bounds: Patchscopes' decoding capability is inherently limited. On dense Computing tasks, even when linear probing validated that the outcome already existed in the hidden state (100% separable), Patchscopes delayed articulating that answer by dozens of layers. This confirms that computing the information and aligning it for token emission are wholly disjoint circuits within LLMs.