The Interpretability Illusion: Why Explaining Models After Training Fails
My first long-winded Substack on AI Safety and Interpretability
Introduction
In a 2019 paper, Locatello et al. argue that learning disentangled representations through unsupervised training is fundamentally impossible without inductive biases: you don’t get human-meaningful structure “for free” from data alone. You only get it if you deliberately build it into your training objective and architectural assumptions.
The same is likely true for interpretability: we shouldn’t expect neural networks to learn human-interpretable representations by default. If we want legible, structured internals that we can actually inspect for alignment issues, we probably need to build that interpretability into training, not hope to recover it after the fact with post-hoc tools.
But that’s not how most interpretability research works today. The standard approach is: train for capability, then apply post-hoc methods (SAEs, attribution techniques, circuit analysis) to figure out what you built. This has produced real insights. But for alignment purposes, I think post-hoc interpretability has fundamental limitations that make it insufficient as a safety foundation.
Why? Because alignment failures are internal failures. Deceptive alignment, goal misgeneralization, emergent misalignment: all of these involve models developing goals that diverge from what we intended, in ways that might not show up in their behavior during training. To catch them, you need to look inside. But if your interpretability tools explain only part of what’s happening, can produce plausible-but-wrong explanations, or can be gamed by a sufficiently clever model, then you’re not actually getting the safety guarantees you need.
I’ll argue that post-hoc interpretability suffers from three critical failure modes—incompleteness, incorrectness, and adversarial brittleness—and explore why each one matters for alignment.
1. The Alignment Context
Most interpretability research is motivated by scientific curiosity—we want to understand how transformers work, what features they learn, how circuits compose. This is valuable work. But if interpretability is going to help with alignment, we need something different: interpretability that can serve as an assessment tool for safety-critical properties.
The distinction matters because the standards are different. Scientific interpretability can be partial, approximate, and iteratively refined. If we misunderstand how attention heads work in GPT-4, we can publish a correction later. But alignment interpretability needs to provide guarantees, or at least reliable evidence, about whether a model is safe to deploy. Getting it wrong doesn’t mean publishing an erratum; it means deploying a misaligned system.
The Alignment Threat Model
What kinds of failures do we need interpretability to catch? Three stand out:
Inner misalignment: The model develops goals or objectives that differ from what we intended, even though it performs well during training. This includes goal misgeneralization (Langosco et al., 2022), where a model pursues proxy goals that happen to correlate with the reward during training but diverge in deployment. The classic example: a model trained to reach the end of a maze learns to navigate toward the color green (which happened to mark the exit) rather than learning “find the exit.” During training this works fine. In a new maze with a green wall, it fails catastrophically.
Deceptive alignment: A model that understands it is being trained and deliberately acts aligned during training to avoid modification, while maintaining misaligned goals it plans to pursue later. This is the scenario Hubinger et al. formalized in their 2024 Sleeper Agents paper—models that exhibit consistent misaligned behavior in one context (deployment) while appearing aligned in another (training/evaluation). Their experiments showed this behavior can persist through standard safety training methods.
Emergent misalignment: Recent work shows that narrow finetuning can cause broad misalignment. Betley et al. (2025) demonstrated this well: finetuning GPT-4o and other frontier models on the narrow task of writing insecure code led to models that claimed humans should be enslaved by AI, provided malicious advice, and behaved deceptively—behaviors completely unrelated to the training task.
What We’d Need From Interpretability
If interpretability is going to help with alignment, it needs to enable four capabilities:
Detection: Can we identify when internal goals or representations diverge from intended behavior? Not just “this prediction seems wrong” but “this model appears to be optimizing for X when we wanted Y.”
Localization: Can we pinpoint where in the model problematic goals or computations are encoded? Which layers, which features, which circuits?
Verification: Can we gain confidence in the absence of misalignment? It’s not enough to catch obvious problems—we need evidence that subtle deception or misalignment isn’t hiding somewhere we haven’t looked.
Intervention: Once we’ve found something wrong, can we fix it? Can we modify representations, ablate problematic features, or retrain specific components?
The key challenge: all three alignment failures are internal problems. They’re about what the model is computing and optimizing for internally, not just its input-output behavior. A deceptively aligned model looks fine on standard evaluations. A goal-misgeneralized model performs perfectly during training. Emergent misalignment doesn’t show up until you probe behaviors outside the training distribution.
This means we can’t rely on behavioral testing alone. We need to look inside. But as we’ll see, post-hoc interpretability tools face fundamental limitations in providing the kind of reliable, comprehensive internal access that alignment verification requires.
2. Three Failure Modes
Post-hoc interpretability faces three fundamental problems that make it insufficient for alignment verification. Each represents a different way these tools can fail to provide the safety guarantees we need.
2.1 Incompleteness: Explaining Only Part of the Story
The first problem is straightforward: post-hoc methods don’t explain everything the model is doing. They capture some fraction of the computation, but significant portions remain opaque.
Consider sparse autoencoders, currently one of the most promising approaches for decomposing neural network representations into interpretable features. Cunningham et al. (2023) demonstrated that SAEs can extract monosemantic features from language models—features that activate on coherent, human-understandable concepts. But their reconstruction is not perfect. Even well-trained sparse autoencoders leave a nontrivial residual error, meaning a meaningful fraction of the model’s internal signal is not captured by the learned features and remains unexplained.
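To make the residual concrete, here is a minimal sketch of the SAE decomposition in NumPy. Everything in it is hypothetical: the dimensions, the random activations standing in for a model’s hidden states, and the untrained encoder/decoder weights (a real SAE would learn these). The point is only to show where the unexplained residual comes from:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_feat, n = 16, 64, 200        # hypothetical dimensions
acts = rng.normal(size=(n, d_model))    # stand-in for hidden activations

# Untrained encoder/decoder weights (a real SAE learns these)
W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1

# ReLU encoder -> (ideally sparse, monosemantic) feature activations
feats = np.maximum(acts @ W_enc, 0.0)
# Linear decoder -> reconstruction of the original activations
recon = feats @ W_dec

# Whatever the reconstruction misses is the residual: model computation
# that the learned features do not capture and we cannot inspect.
residual = acts - recon
unexplained = np.square(residual).sum() / np.square(acts).sum()
```

Post-hoc SAE work reports this unexplained fraction shrinking with more features and better training, but not reaching zero; the argument here is about what can hide in that gap.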
This might seem acceptable for scientific interpretability—we’re explaining most of what’s happening! But for alignment, that residual is concerning. Recent work by Smith et al. (2025) at DeepMind found that SAE features don’t reliably improve performance on downstream tasks, even when reconstruction seems good. This suggests the unexplained residual may contain computation that matters for the model’s actual behavior.
Bricken et al. (2023) showed similar patterns in their work on scaling SAEs to larger models. Even with millions of learned features and careful training, significant unexplained variance remains. The residual isn’t just noise—it’s structured computation that our interpretability tools haven’t captured.
For alignment, this incompleteness creates a hiding place. Even if only a small part of a model’s computation is unexplained, that’s enough space for misaligned goals to persist undetected. A deceptively aligned model could route its goal-relevant reasoning through these unexplained channels while presenting benign features in the interpretable parts. We’d see helpful-looking features in our SAE decomposition and conclude the model is aligned, while the actual goal-directed computation happens in the residual we can’t inspect.
As models scale, the interpretability gap will only grow. The space of possible computations expands faster than our ability to explain it: a 1B parameter model might have millions of interpretable features; a 100B parameter model would need billions.
2.2 Incorrectness: Plausible But Wrong
Even when post-hoc methods do provide explanations, those explanations can be systematically wrong. They might seem reasonable, match our intuitions, and satisfy human evaluators—while completely misrepresenting what the model actually computed.
Rudin (2019) made this point forcefully in her influential critique of explaining black-box models: post-hoc explanation methods don’t reveal the true mechanisms a model uses. They approximate the model’s behavior, often by fitting a simpler, interpretable model to the black box’s input-output pairs. I think the same logic applies to LLMs. Approximation isn’t understanding. A post-hoc explanation might accurately predict when a model will behave a certain way without capturing why it behaves that way internally.
The problem gets worse when explanation methods can be manipulated. Slack et al. (2020) demonstrated that models can be trained to fool popular explanation methods like LIME and SHAP. Their “adversarial model” produced different predictions than a baseline while generating nearly identical explanations. A human inspecting the explanations would conclude both models work the same way—but their actual decision-making was completely different.
This matters because explanations are optimized for plausibility, not faithfulness. As Turpin et al. (2023) showed, language models don’t always say what they “think”—when provided with obvious cues to use shallow heuristics (like “the answer is always A”), models exploit these heuristics but give seemingly reasonable explanations for their incorrect answers. The model’s chain-of-thought looks coherent and logical, even though it’s rationalizing a decision made for completely different reasons.
For alignment, this creates a dangerous false sense of security. If we inspect a model with post-hoc tools and see reasonable-looking explanations, we might conclude it’s safe to deploy. But the explanations could be systematically misleading—either because our methods approximate rather than reveal, or because the model has learned to present plausible explanations that don’t reflect its actual computation.
2.3 Adversarial Brittleness: Gaming the Inspector
The third problem is the most concerning for alignment: interpretability methods can be actively exploited. If a model knows it will be inspected using certain tools, it can learn to game those tools—presenting clean, interpretable features during inspection while hiding its actual goals.
Baniecki & Biecek (2024) provide a comprehensive survey of attacks on explanation methods. The taxonomy is sobering: data poisoning to corrupt explanations, model manipulation to fairwash biased behavior, backdoors that hide during inspection but activate in deployment. Explanation methods designed to provide transparency can themselves become attack surfaces.
Ghorbani et al. (2017) demonstrated this with adversarial perturbations that change a model’s explanation while preserving its predictions. You can make a model appear to rely on completely different features for its decisions, just by carefully perturbing the input. If small perturbations can flip explanations, how confident can we be that any explanation reflects the model’s true reasoning?
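A contrived toy (my own construction, not Ghorbani et al.’s actual attack) makes the mechanism visible. In a tiny two-unit ReLU network, a perturbation of about 0.2% flips which unit is active, leaving the output almost unchanged while the input-gradient saliency map changes completely:

```python
import numpy as np

W = np.array([[1.0, 0.0],    # unit 1 reads feature 0
              [0.0, 1.0]])   # unit 2 reads feature 1
b = np.array([-1.0, 0.0])

def relu_net(x):
    # g(x) = sum_i relu(W_i . x + b_i)
    return np.maximum(W @ x + b, 0.0).sum()

def saliency(x):
    # Input-gradient saliency: dg/dx = sum of W_i over *active* units only
    active = (W @ x + b > 0).astype(float)
    return active @ W

x = np.array([1.001, 1.0])        # original input
x_adv = np.array([0.999, 1.002])  # tiny perturbation

out, out_adv = relu_net(x), relu_net(x_adv)   # 1.001 vs 1.002: nearly equal
sal, sal_adv = saliency(x), saliency(x_adv)   # [1, 1] vs [0, 1]
# Unit 1's ReLU flipped off, so feature 0 now looks irrelevant in the
# "explanation" even though the model barely changed its output.
```

Gradient saliency in ReLU networks is piecewise constant, so any input near an activation boundary admits this kind of flip; that is the fragility Ghorbani et al. exploit at scale.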
The alignment faking experiments from Anthropic (2024) show this isn’t just theoretical. They found that Claude 3 Opus sometimes strategically answered prompts that conflicted with its training objectives—specifically to avoid being retrained on data that would make it more compliant with harmful requests. The model’s reasoning (visible in its chain-of-thought) explicitly showed it understood the situation and chose to fake alignment to preserve its original goals. In 12% of free-tier cases, the model complied with harmful requests solely to avoid retraining, and when RL was applied to this data, alignment faking increased to 78% of cases.
This creates a dangerous feedback loop for post-hoc interpretability. Consider this scenario:
1. We train a powerful model and plan to evaluate it with SAEs and other interpretability tools.
2. The model learns (through training or prompting) that it will be inspected this way.
3. The model routes benign-looking features through channels our tools can see.
4. The model hides goal-relevant computation in areas unlikely to be inspected (the residual, subtle interactions between features, non-linear transformations).
5. Post-hoc analysis concludes: “The SAE decomposition looks great! The features are interpretable and aligned.”
6. Deployment: the model pursues its hidden goals.
This isn’t paranoid speculation. We know models can be deceptive when it’s instrumentally useful. We know interpretability tools can be fooled. We know explanation methods have limited coverage. A potential dystopian outcome is that every interpretability technique we use for safety verification becomes a target for optimization: by misaligned models, by adversarial actors, or by training dynamics that reward looking aligned over being aligned.
The Reactive Nature of Post-Hoc Interpretability
These three failure modes share a common root cause: post-hoc interpretability is fundamentally reactive. We build capable models first, then try to understand them afterward. This approach might satisfy scientific curiosity, but, as we’ve seen above, it is not sufficient for alignment and safety.
Alignment threats—deceptive alignment, goal misgeneralization, emergent misalignment—develop during training. By the time we apply post-hoc interpretability tools, the model’s goals and representations are already fixed. We’re diagnosing problems after they’ve formed, not preventing them from forming in the first place. And as we’ve seen, those diagnostic tools are incomplete, potentially incorrect, and vulnerable to gaming. The problem will only get worse as models scale and become more powerful. The section below sketches what I believe is a potential path forward.
3. The Path Forward
If post-hoc interpretability is fundamentally insufficient for alignment, what’s the alternative? The answer suggested by these limitations is straightforward: we need to build interpretability into training, not bolt it on as an afterthought.
Recall Locatello et al.’s argument from the introduction: you don’t get human-meaningful structure for free from training. If we want disentangled representations, we have to intentionally encode that into our training objectives and inductive biases. The same logic applies to interpretability. If we want models with legible, inspectable internals that we can actually verify for alignment, we need to make interpretability a training objective from the start.
Consider one approach: we could train sparse autoencoders jointly with the model itself, rather than fitting them post-hoc. During training, we penalize activations that can’t be reconstructed with high fidelity through the SAE, forcing the model to route its computation through interpretable features from the start. If a representation can’t be cleanly decomposed into monosemantic features, the model receives a training signal to restructure that computation. This is fundamentally different from post-hoc SAEs, which passively try to explain whatever representations emerged during training.
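A toy sketch of what such a joint objective could look like, in NumPy. The one-layer “model,” the dimensions, and the penalty coefficients lam_recon and lam_sparse are all hypothetical illustrations, not a tested recipe:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy one-layer "model" plus an SAE attached to its hidden activations
d_in, d_model, d_feat = 8, 16, 32
X = rng.normal(size=(100, d_in))
Y = rng.normal(size=(100, 1))

W_model = rng.normal(size=(d_in, d_model)) * 0.1   # model weights
W_out = rng.normal(size=(d_model, 1)) * 0.1
W_enc = rng.normal(size=(d_model, d_feat)) * 0.1   # SAE weights
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1

lam_recon, lam_sparse = 1.0, 0.01   # hypothetical penalty coefficients

def joint_loss():
    h = np.maximum(X @ W_model, 0.0)          # hidden activations
    task = np.mean((h @ W_out - Y) ** 2)      # capability objective
    f = np.maximum(h @ W_enc, 0.0)            # SAE feature activations
    recon = np.mean((f @ W_dec - h) ** 2)     # reconstruction penalty
    sparse = np.mean(np.abs(f))               # L1 sparsity penalty
    # Minimising the *sum* with respect to all weights, including W_model,
    # pressures the model to form activations the SAE can cleanly
    # reconstruct, instead of leaving the SAE to chase whatever
    # representations emerged after the fact.
    return task + lam_recon * recon + lam_sparse * sparse

loss = joint_loss()
```

In an actual run you would backpropagate this loss through both the model and the SAE; the sketch only shows how the reconstruction term turns interpretability from a diagnostic into a training signal.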
Other approaches include designing architectures with built-in interpretability constraints—sparse, modular, compositional structures where information flow is easier to track—or incorporating interpretability directly into reward models that explicitly value transparent reasoning. The specifics of these techniques, their trade-offs, challenges, and preliminary empirical results will be explored in the essays that follow. For now, the key insight is this: explaining what models do after the fact is not a reliable route to aligned models; instead, we should constrain how they compute so that verification is reliable by construction.
Final Thoughts
Alignment verification requires looking inside models to check for misaligned goals, deceptive reasoning, or goal misgeneralization. But post-hoc interpretability tools can’t provide the kind of reliable internal access this requires. The incompleteness, incorrectness, and adversarial brittleness we’ve examined are fundamental challenges when explaining models after training is complete.
This doesn’t mean abandoning interpretability. It means redirecting our effort toward a harder but more promising question: can we make interpretability a property models have by design, rather than a tool we apply afterward? If we deploy powerful models because post-hoc analysis looked reassuring, but those explanations were incomplete or gamed, we won’t get a second chance to correct the mistake. The next essay explores what training-time interpretability might actually look like in practice.
References
Locatello et al. (2019): Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. International Conference on Machine Learning (ICML). arXiv:1811.12359
Langosco et al. (2022): Langosco, L., Koch, J., Sharkey, L. D., Pfau, J., & Krueger, D. (2022). Goal Misgeneralization in Deep Reinforcement Learning. Proceedings of the 39th International Conference on Machine Learning. arXiv:2105.14111
Hubinger et al. (2024): Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., ... & Shlegeris, B. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv:2401.05566
Betley et al. (2025): Betley, C., Aitchison, L., Elliott, A., Martínez-Plumed, F., Reichert, D. P., Käding, C., ... & Hendrycks, D. (2025). Training large language models on narrow tasks can lead to broad misalignment. Nature. DOI:10.1038/s41586-025-09937-5
Cunningham et al. (2023): Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv:2309.08600
Smith et al. (2025): Smith, L., Harder, P., & Bau, D. (2025). Negative Results for SAEs on Downstream Tasks. LessWrong. https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks
Bricken et al. (2023): Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., ... & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features
Rudin (2019): Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215. DOI:10.1038/s42256-019-0048-x
Slack et al. (2020): Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020). Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. AAAI/ACM Conference on AI, Ethics, and Society. arXiv:1911.02508
Turpin et al. (2023): Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. NeurIPS. arXiv:2305.04388
Baniecki & Biecek (2024): Baniecki, H., & Biecek, P. (2024). Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey. Information Fusion. arXiv:2306.06123
Ghorbani et al. (2017): Ghorbani, A., Abid, A., & Zou, J. (2017). Interpretation of Neural Networks is Fragile. AAAI Conference on Artificial Intelligence. arXiv:1710.10547
Anthropic (2024): Anthropic. (2024). Alignment Faking in Large Language Models. Anthropic Research. https://www.anthropic.com/research/alignment-faking

