A Taxonomy of Training-Time Interpretability
On designing models to be interpretable by construction, not by post-hoc analysis.
Introduction
In my previous essay, I argued that post-hoc interpretability — the dominant paradigm in the field — suffers from three fundamental failure modes that make it insufficient as an alignment safety tool: it’s incomplete, potentially incorrect, and brittle against adversarial pressure. The natural question that follows is: what do we do instead?
The answer I want to explore in this essay is a shift in how we think about interpretability entirely. Instead of asking “how do we explain a trained model?”, we should be asking “how do we train a model that is already explainable?” This might sound like a subtle reframing, but it has deep implications for both how we build models and what we can actually guarantee about them.
This isn’t a new idea in machine learning broadly. As mentioned in my last essay, in their 2019 paper, Locatello et al. showed that learning disentangled, human-meaningful representations in an unsupervised setting is fundamentally impossible without inductive biases. This means you don’t get structured, interpretable internals for free just by training on data. Structure has to be intentionally built in. The same principle applies here: if we want models whose internal representations we can meaningfully inspect for alignment failures, we probably have to make that a design objective from the start, not a forensic exercise afterward.
What I want to do in this essay is map out that design space: the distinct approaches to building interpretability into training, the trade-offs each carries, and what it would take to pursue this seriously. The field’s understanding of how to navigate those trade-offs is still nascent.
Two questions hang over all of this, and neither has a clean answer yet: is interpretability actually compatible with capability at scale? Can we train for legibility without sacrificing the very performance that makes these models worth building? I’ll work through what we currently know theoretically — and in a later post, I’ll try to shed some empirical light on these questions through experiments I’m running.
The short version: training-time interpretability is a largely unexplored design space, the theoretical case for early intervention is strong, and the capability costs may be more manageable than they first appear. But this is an area where confident claims outpace evidence, and I want to be careful to distinguish between what we know, what we suspect, and what remains genuinely open.
1. Conceptual Framework: What Does “Training for Interpretability” Actually Mean?
Before mapping the design space, it’s worth being precise about what we’re actually trying to do when we talk about training for interpretability.
The standard framing of interpretability treats it as a property of explanation methods applied to a fixed model: given a trained network, how well can we explain its predictions? How interpretable a model seems depends entirely on which explanation tool you use — the model itself is treated as a black box to be explained.
Training-time interpretability reframes this relationship. Instead of asking how well we can explain a fixed model, it asks: can we train a model whose internal representations are themselves interpretable, independent of any explanation method? Interpretability becomes a property of the model, not of the lens we apply to it afterward. This means treating interpretability as an optimization target — something to be explicitly encouraged during training through architectural choices, loss terms, or both.
This distinction matters because it changes what guarantees are even possible. A post-hoc explanation method can only tell you what it can see; a model trained to be interpretable, if the approach works, should be legible by construction.
What Makes a Representation “Interpretable”?
To train for interpretability, we need to define what we’re optimizing toward. There’s no single agreed-upon definition, but the literature has converged on a few properties that function as working criteria — each with varying degrees of theoretical and empirical grounding:
Sparsity refers to how many features are active at once. A sparse representation is one where, for any given input, only a small fraction of neurons fire. This is desirable because it suggests the model is encoding information in relatively isolated, non-overlapping units, making it easier to attribute specific behaviors to specific features.
Disentanglement refers to whether distinct factors of variation in the data are encoded in distinct, separable dimensions of the representation. A disentangled representation of images, for example, would encode “shape” and “color” in separate directions of the feature space, so you can vary one without affecting the other. Disentangled representations are easier to interpret because individual dimensions have cleaner, more consistent meaning.
Linearity refers to whether features can be recovered from activations using simple linear probes. If a concept like “sentiment” or “capital city” can be linearly decoded from a model’s internal activations, that’s evidence the concept is cleanly represented rather than scattered across complex non-linear interactions. This is probably the best-supported of the four empirically — Alain & Bengio (2017) established linear probing as a standard diagnostic, and more recent work has found that many human-meaningful concepts are already linearly encoded in frontier models.
Modularity refers to whether the model has developed distinct sub-components that handle identifiable sub-tasks — the kind of structure you see in mechanistic interpretability work on circuits. Modular models are easier to reason about because you can analyze components somewhat independently. This is the least formally grounded of the four; it’s more an intuition from the circuits literature than a precisely defined property with an agreed measurement framework.
None of these properties is sufficient on its own. A representation can be sparse but still polysemantic — where individual neurons respond to many unrelated concepts, making them hard to interpret even though few are active at once. And a representation can be perfectly disentangled in a mathematical sense without the factors of variation corresponding to anything meaningful from a human perspective. The goal is representations that are sparse, disentangled, and semantically coherent — which is considerably harder.
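To make a couple of these criteria concrete, here is a minimal sketch of how sparsity and linear decodability might be measured on a batch of activations. The data is synthetic, and the near-zero threshold, the probe setup, and the use of in-sample accuracy are all illustrative choices rather than a standard benchmark.

```python
import torch

def activation_sparsity(acts: torch.Tensor, eps: float = 1e-6) -> float:
    """Fraction of near-zero entries: higher means sparser representations."""
    return (acts.abs() < eps).float().mean().item()

def linear_probe_accuracy(acts: torch.Tensor, labels: torch.Tensor,
                          steps: int = 300, lr: float = 1e-2) -> float:
    """Fit a linear classifier on frozen activations; high accuracy suggests the
    concept behind `labels` is linearly decodable (accuracy here is in-sample)."""
    probe = torch.nn.Linear(acts.shape[1], int(labels.max()) + 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        torch.nn.functional.cross_entropy(probe(acts), labels).backward()
        opt.step()
    return (probe(acts).argmax(dim=-1) == labels).float().mean().item()

# Synthetic stand-in for activations collected from a real model: 512 examples,
# 64 ReLU features, and a binary "concept" defined by one feature direction.
acts = torch.relu(torch.randn(512, 64))
concept = (acts[:, 0] > 0.5).long()
print(f"sparsity: {activation_sparsity(acts):.2f}  probe acc: {linear_probe_accuracy(acts, concept):.2f}")
```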
The Superposition Problem
To understand why interpretable representations don’t emerge by default, it helps to understand superposition. The basic insight is this: neural networks have more features they want to represent than they have neurons to represent them in. The solution they find is to encode multiple features in overlapping directions in activation space — superimposing them on top of each other. This works because, in practice, most features are sparse: if feature A and feature B rarely co-occur, a network can represent both in nearly the same direction without much interference. The cost is that individual neurons become polysemantic, responding to multiple unrelated concepts, which makes them much harder to interpret. Elhage et al. (2022) demonstrated this phenomenon in toy models.
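For readers who want to see the mechanism directly, here is a minimal sketch in the spirit of that toy setup (not a faithful reproduction; the sizes, sparsity level, and importance weights are illustrative). Twenty sparse features are forced through a five-dimensional hidden layer and reconstructed; features sharing directions show up as off-diagonal interference in the learned weight matrix.

```python
import torch

n_features, n_hidden, batch, steps = 20, 5, 1024, 3000
p_zero = 0.95                                   # each feature is absent 95% of the time
importance = 0.9 ** torch.arange(n_features)    # geometrically decaying feature importance

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for _ in range(steps):
    # Sparse synthetic features: most are zero on any given example.
    x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) > p_zero)
    x_hat = torch.relu(x @ W.T @ W + b)         # reconstruct through the narrow hidden layer
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# If every feature had its own direction, W^T W would be (near) diagonal. Non-zero
# off-diagonal entries are the signature of features sharing directions: superposition.
gram = (W.T @ W).detach()
interference = gram - torch.diag(torch.diag(gram))
print(f"mean |off-diagonal interference|: {interference.abs().mean():.3f}")
```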
Superposition is, in a sense, the natural equilibrium. It’s an efficient compression strategy, and gradient descent finds it reliably when capacity is constrained. This is precisely why interpretability doesn’t come for free: the training dynamics that optimize for capability actively push representations toward superposition, and away from the clean, monosemantic structure we’d want for interpretability.
Training-time interpretability can be understood, in large part, as an attempt to intervene in this dynamic — to apply pressure against superposition before it fully takes hold, or to constrain the model in ways that make superposition less attractive as a solution. The question is how to do that without paying an unacceptable price in capability. This will be explored in the following sections.
2. Taxonomy of Approaches
There are two broad levers for building interpretability into training: architectural constraints that shape how information flows through the network by design, and training objectives that add explicit interpretability pressure to the loss function. These aren’t mutually exclusive — in practice the most promising approaches may combine both — but they represent meaningfully different philosophies about where and how to intervene.
2.1 Architectural Constraints
The most direct way to make a model interpretable is to design it so that uninterpretable representations are structurally difficult or impossible to learn. Rather than hoping gradient descent finds interpretable solutions, architectural constraints enforce them by construction.
The clearest example is sparse networks. If you constrain a model so that only a small number of neurons can be active for any given input — through hard sparsity masks, winner-take-all activations, or k-sparse layers — you directly limit the model’s ability to engage in superposition. Fewer co-active neurons means fewer opportunities for features to interfere with each other in activation space. Sparse coding models like those explored in Olshausen & Field (1996) showed early on that this kind of constraint naturally produces more monosemantic, edge-like features in the representations that emerge — the kind of structure that interpretability research is trying to recover after the fact in standard networks.
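As a concrete illustration, here is a minimal sketch of a hard top-k activation layer, one simple way to impose this kind of architectural sparsity. The layer sizes and the choice of k are arbitrary placeholders.

```python
import torch

class TopKActivation(torch.nn.Module):
    """Keep only the k largest activations per example; zero out the rest."""
    def __init__(self, k: int):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, idx = x.topk(self.k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
        return x * mask   # at most k features can be co-active

# Illustrative use inside an otherwise ordinary feed-forward block.
block = torch.nn.Sequential(torch.nn.Linear(128, 512), TopKActivation(k=32))
out = block(torch.randn(4, 128))
print((out != 0).sum(dim=-1))   # no more than 32 non-zero activations per example
```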
Bottleneck architectures work on a related principle. By forcing information through a low-dimensional intermediate representation, bottlenecks create pressure for the model to compress what it knows into a compact, structured form. Variational autoencoders exploit this intentionally: the constraint of passing through a narrow latent space encourages disentangled representations of the input, because the model has to be selective about what it encodes and how. The hope is that with the right bottleneck design, the model learns to separate factors of variation rather than entangling them.
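Here is a minimal sketch of that bottleneck pressure in the standard VAE form, with a beta weight on the KL term as in the beta-VAE variant. The input and latent sizes are placeholders, and a real encoder and decoder would be much deeper.

```python
import torch

class BottleneckVAE(torch.nn.Module):
    def __init__(self, d_in: int = 784, d_latent: int = 8):
        super().__init__()
        self.enc = torch.nn.Linear(d_in, 2 * d_latent)   # predicts latent mean and log-variance
        self.dec = torch.nn.Linear(d_latent, d_in)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar, beta: float = 4.0):
    # Reconstruction pushes information through the bottleneck; the KL term pushes
    # the latent toward an isotropic prior, encouraging compact, independent factors.
    recon_loss = (recon - x).pow(2).mean()
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).mean()
    return recon_loss + beta * kl

vae = BottleneckVAE()
x = torch.rand(16, 784)
recon, mu, logvar = vae(x)
print(vae_loss(x, recon, mu, logvar))
```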
Modular architectures take a more explicit approach: rather than constraining internal representations, they constrain information flow between components. Mixture of Experts (MoE) models are a prominent example — inputs are routed to specialized sub-networks (experts) rather than flowing through a monolithic network. If routing is sparse and consistent, different experts specialize in different sub-tasks, which makes the model’s structure more legible at a coarse level. You can ask “which expert handled this input?” and get a meaningful answer about which part of the model was responsible. Work on modular circuits in transformers pushes this further, trying to identify and isolate functional sub-components that handle identifiable computations.
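To make the “which expert handled this input?” question concrete, here is a toy sketch of sparse top-1 routing. The sizes, the argmax routing rule, and the absence of any load-balancing loss are all simplifications.

```python
import torch

class TinyMoE(torch.nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(torch.nn.Linear(d_model, d_model), torch.nn.ReLU(),
                                torch.nn.Linear(d_model, d_model))
            for _ in range(n_experts)])

    def forward(self, x: torch.Tensor):
        expert_idx = self.router(x).argmax(dim=-1)   # top-1 routing: one expert per input
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            chosen = expert_idx == i
            if chosen.any():
                out[chosen] = expert(x[chosen])
        # Returning the routing decision is what makes the structure legible at a coarse level.
        return out, expert_idx

moe = TinyMoE()
y, which_expert = moe(torch.randn(8, 64))
print(which_expert)   # a per-input answer to "which part of the model was responsible?"
```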
The core appeal of architectural constraints is that they’re enforced by design. You don’t have to hope the training signal pushes representations toward interpretability — the model literally cannot learn certain kinds of entangled or superposed representations because the architecture doesn’t permit them. This is a strong guarantee compared to soft regularization approaches.
The cost, however, is expressiveness. Every architectural constraint is a restriction on the hypothesis class the model can explore. Sparse activations may prevent the model from learning useful dense representations when they’d be beneficial. Bottlenecks may discard information that turns out to matter. Modular routing may fail to generalize when inputs don’t fit cleanly into the categories the router has learned. There’s a real risk that architectural constraints buy interpretability at the cost of capability in ways that are hard to predict in advance — you build the constraint in, and only find out later how much it costs.
This is why architectural interpretability is most compelling in settings where you have strong prior knowledge about the structure of the problem — where you know, for instance, that the task decomposes into identifiable sub-problems that a modular architecture can exploit. In more general settings, the tension between expressiveness and legibility is harder to resolve by architecture alone.
2.2 Training Objectives & Regularizers
Where architectural constraints work by limiting what a model can learn, training objectives work by shaping what it wants to learn. Rather than hardwiring interpretability into the network’s structure, this approach adds terms to the loss function that explicitly reward interpretable representations — letting the model retain its full expressive capacity while applying gradient pressure toward legibility.
The most straightforward example is sparsity regularization. Adding an L1 penalty to activations during training encourages the model to keep most neurons inactive for any given input, pushing representations toward the sparse structure we discussed earlier. Unlike architectural sparsity, this is a soft constraint — the model can violate it if the task benefit is large enough, but pays a cost for doing so. This flexibility is the key advantage: the model can find its own balance between interpretability and capability rather than having that balance imposed from outside. Techniques like those used in dictionary learning extend this idea, training the model to represent activations as sparse combinations of learned basis vectors — essentially building the SAE objective into training rather than applying it afterward.
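A minimal sketch of what this looks like in a training loop follows; the model, the synthetic task, and the penalty weight are placeholders. The point is just the shape of the objective: task loss plus a weighted L1 term on hidden activations.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(32, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_weight = 1e-3   # illustrative; in practice tuned or annealed over training

for step in range(1000):
    x = torch.randn(64, 32)                   # stand-in batch
    y = (x[:, 0] > 0).long()                  # stand-in binary labels
    hidden = model[1](model[0](x))            # the activations we want to keep sparse
    logits = model[2](hidden)
    task_loss = torch.nn.functional.cross_entropy(logits, y)
    sparsity_loss = hidden.abs().mean()       # soft L1 pressure, not a hard constraint
    loss = task_loss + l1_weight * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final task loss: {task_loss.item():.3f}  mean |activation|: {sparsity_loss.item():.3f}")
```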
Orthogonality constraints push in a complementary direction. If we penalize representations where different features are correlated with each other, we encourage the model to encode distinct concepts in distinct, non-overlapping directions in activation space. This directly combats superposition: features that are orthogonal can’t interfere with each other, which makes individual neurons easier to interpret in isolation. The challenge is that full orthogonality is expensive to enforce at scale — computing and penalizing all pairwise correlations in a large network is computationally demanding — so in practice these constraints are often applied selectively, to particular layers or modules.
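Here is one way such a penalty might look when applied selectively to a single layer: penalize the mean squared off-diagonal entry of the batch correlation matrix between feature dimensions. The normalization and the squared penalty are illustrative choices, and the d-by-d correlation matrix is exactly where the computational cost mentioned above comes from.

```python
import torch

def decorrelation_penalty(acts: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Mean squared off-diagonal correlation between feature dimensions in a batch."""
    centered = acts - acts.mean(dim=0, keepdim=True)
    normed = centered / (centered.std(dim=0, keepdim=True) + eps)
    corr = normed.T @ normed / acts.shape[0]        # (d, d) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).mean()

acts = torch.relu(torch.randn(256, 128))            # stand-in hidden activations
penalty = decorrelation_penalty(acts)               # added to the task loss with its own weight
print(penalty)
```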
Disentanglement losses are a more targeted version of the same idea, borrowed from the representation learning literature. Rather than just encouraging orthogonality globally, these losses try to ensure that specific identifiable factors of variation in the data are encoded in specific dimensions of the representation. The Total Correlation Variational Autoencoder (TC-VAE) approach, for instance, adds a penalty on the total correlation between latent dimensions — a measure of how much information is shared across dimensions — to push the model toward representations where each dimension captures something independent. For language models, analogous approaches might try to separate syntactic from semantic features, or factual knowledge from stylistic tendencies, in the model’s internal representations.
One particularly interesting recent direction is joint SAE training — rather than fitting a sparse autoencoder to a model’s activations after training is complete, training the SAE alongside the model and using its reconstruction error as an additional loss signal. If the model’s activations can’t be cleanly reconstructed through the SAE’s sparse decomposition, the model receives a gradient pushing it toward representations that are more decomposable. This is a direct operationalization of the training-time interpretability idea: the interpretability tool becomes part of the training loop rather than an afterthought.
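Here is a sketch of how the joint version might be wired up: a small SAE reads one layer’s activations, and its reconstruction and sparsity losses are added to the task loss so that the gradient also flows back into the model itself. All sizes, loss weights, and the single-layer choice are illustrative assumptions; real implementations differ in many details.

```python
import torch

d_in, d_model, d_sae, n_classes = 32, 64, 512, 2

encoder = torch.nn.Linear(d_in, d_model)        # toy "model" split into two halves
head = torch.nn.Linear(d_model, n_classes)
sae_enc = torch.nn.Linear(d_model, d_sae)       # SAE trained alongside the model
sae_dec = torch.nn.Linear(d_sae, d_model)

params = list(encoder.parameters()) + list(head.parameters()) + \
         list(sae_enc.parameters()) + list(sae_dec.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
recon_weight, l1_weight = 0.1, 1e-3             # illustrative loss weights

for step in range(1000):
    x = torch.randn(128, d_in)
    y = (x[:, 0] > 0).long()

    hidden = torch.relu(encoder(x))             # the activations the SAE reads
    task_loss = torch.nn.functional.cross_entropy(head(hidden), y)

    codes = torch.relu(sae_enc(hidden))         # sparse feature codes
    recon = sae_dec(codes)
    sae_loss = (recon - hidden).pow(2).mean() + l1_weight * codes.abs().mean()

    # Because `hidden` is not detached, the reconstruction error also pushes the
    # model toward activations that the SAE can decompose cleanly.
    loss = task_loss + recon_weight * sae_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"task loss: {task_loss.item():.3f}  SAE loss: {sae_loss.item():.3f}")
```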
The flexibility of the objective-based approach is its main strength. You can tune the weight of interpretability terms relative to task loss, apply them selectively to specific layers, or anneal them over training — none of which is possible with hard architectural constraints. This makes it easier to explore the capability/interpretability trade-off empirically and find operating points that work for a given application.
The corresponding weakness is that soft constraints can be soft in the wrong direction. If the task loss dominates, the interpretability terms get washed out — the model learns to mostly satisfy them on easy inputs while routing difficult computations through whatever representations are most efficient, regardless of legibility. There’s also a Goodhart’s law risk: a model optimizing explicitly for sparsity or orthogonality metrics might find ways to satisfy those metrics without actually producing representations that are meaningfully interpretable to humans. A neuron that fires on exactly one token type is technically monosemantic — but if that token type is arbitrary and semantically incoherent, the sparsity is real and the interpretability is illusory. This is a challenge we’ll return to in the evaluation discussion below, and in more depth in upcoming essays.
3. Theoretical Considerations
Having mapped the design space, it’s worth stepping back and asking the harder theoretical questions. Why might training-time interpretability work better than post-hoc approaches in principle? When and why does superposition emerge, and can we actually intervene in that process? And most fundamentally, is interpretability even compatible with the kind of capability we need from frontier models?
Why Early Intervention Might Work Better
The core intuition is about inductive biases and optimization trajectories. Training a neural network isn’t just finding a solution to an optimization problem — it’s finding a solution via a particular path through a high-dimensional loss landscape, starting from a particular initialization, shaped by a particular sequence of gradient updates. The representations a model ends up with depend heavily on that path, not just on what’s theoretically possible given the architecture.
This matters because alignment failures — the kinds of internal misalignment we care about detecting — develop during training. They’re not imposed on an already-trained model; they emerge as the model learns. A goal-misgeneralized model doesn’t have misaligned goals added to it after the fact; it develops them because the training dynamics found a solution that satisfied the training objective in a way that generalizes badly. By the time we apply post-hoc interpretability tools, the problematic representations are already fixed, fully formed, and potentially optimized to be hard to detect.
Early intervention changes this entirely. If we apply interpretability pressure during training, we’re shaping the optimization trajectory itself — influencing which solutions gradient descent finds, not just analyzing the one it settled on. The analogy here is useful: it’s much easier to grow a bonsai tree into a desired shape than to prune a mature tree into one. The earlier you apply the constraint, the less work it has to do against the grain of what’s already there.
There’s also a more technical argument about inductive biases. The representations a model learns early in training tend to be simpler and more structured — there’s good evidence that neural networks exhibit a “simplicity bias,” preferring low-complexity solutions early in the learning process before gradually fitting more complex structure. If interpretable representations are simpler in the relevant sense, applying interpretability pressure early may be working with the grain of training dynamics rather than against them. The model naturally wants to learn clean, structured features first; the question is whether we can lock that in before optimization pressure forces it toward superposition.
When Superposition Happens — And Whether We Can Prevent It
Elhage et al.’s toy model results are illuminating here. Superposition doesn’t emerge gradually and uniformly — it exhibits phase transition behavior. As you increase the number of features a model needs to represent relative to its capacity, there’s a relatively sharp transition from a regime where features are represented monosemantically (one feature per neuron, roughly) to one where features are superposed. Below the transition, the model can afford clean representations; above it, superposition becomes the only viable compression strategy.
This has a direct implication for training-time intervention: if we can keep the model below or near that phase transition during the critical period when representations are being formed, we may be able to avoid superposition taking hold in the first place. Architectural constraints like sparsity or bottlenecks can help by shifting where the transition occurs — effectively raising the capacity threshold at which superposition becomes attractive. Regularization terms can penalize the model for moving into the superposition regime even when it would be capacity-efficient to do so.
The relationship between superposition and feature importance is also relevant. Elhage et al. showed that in the superposition regime, less important features tend to be the ones that get superposed — the model sacrifices clean representation of rare or low-salience features to preserve capacity for frequent, high-importance ones. This suggests a targeted strategy: rather than trying to prevent superposition everywhere, we might focus interpretability constraints on the features that matter most for alignment — the representations most likely to encode goal-relevant information — while allowing the model to superpose freely for less critical features.
Is Interpretability Compatible with Capability?
This is the central question, and it deserves an honest answer rather than a reassuring one.
The pessimistic case is straightforward: superposition is not a bug, it’s a feature. It’s an efficient compression strategy that lets models represent far more information than their parameter count would naively allow. If we constrain models to avoid superposition, we’re asking them to be less efficient, which likely means less capable for a given compute budget. At frontier scale — where the gap between what models need to represent and what their architecture can hold monosemantically is enormous — the capability cost of enforcing interpretable representations could be severe.
The optimistic case rests on two related ideas. The first is the convergent representations hypothesis: that for sufficiently general tasks, the most capable representations and the most interpretable ones may largely overlap. If the concepts a model needs to represent to perform well are the same concepts that humans find meaningful — because those concepts carve nature at its joints in ways that are useful for both — then optimizing for interpretable representations isn’t necessarily fighting against capability, it’s just adding a constraint that the model might have found anyway. Work on linear representation of concepts in language models offers some support for this view: it turns out that many human-meaningful concepts are already linearly encoded in frontier models’ representations, suggesting that interpretable structure isn’t entirely at odds with how these models learn.
The second argument is about the acceptable cost. Even if there is a capability tax for training-time interpretability, the relevant question for alignment isn’t whether the tax is zero — it’s whether it’s small enough to be worth paying given what we get in return. We already accept capability costs for other safety interventions: RLHF and constitutional AI training reduce capability on some dimensions in exchange for safer behavior. If interpretability constraints buy us reliable internal access for alignment verification, a modest capability cost may be entirely acceptable. The question is how modest.
My view is that the truth sits closer to the optimistic end, but with meaningful uncertainty. The theoretical case for some capability/interpretability compatibility is real. But the empirical picture is thin — most training-time interpretability work has been done at small scale, and whether the compatibility holds as models scale is genuinely unknown. The honest position is: probably manageable, not certainly negligible.
4. Evaluation: A Different Kind of Problem
Evaluating training-time interpretability is harder than evaluating post-hoc interpretability, and it’s worth being clear about why — because the difficulty isn’t just technical, it’s conceptual.
With post-hoc interpretability, the evaluation problem is essentially one-dimensional. You have a fixed model, you apply an explanation method, and you ask: how good is this explanation? Does it faithfully capture what the model is actually computing? You can test this through perturbation experiments, ablations, consistency checks — there are imperfect but tractable ways to measure whether an explanation method is doing what it claims.
Training-time interpretability introduces a second dimension. Now you’re not just evaluating an explanation method; you’re evaluating the model’s representations directly. And you have to do this simultaneously along two axes: capability (does the model still perform well?) and interpretability (are its representations actually legible?). There’s no single metric that captures both, and optimizing hard for one will generally hurt the other. Success means finding a point on the capability/interpretability trade-off curve that’s acceptable for the application at hand — which requires knowing what “acceptable” means before you start, and being honest about where you actually land.
The deeper problem is that, unlike post-hoc evaluation, there’s no clear ground truth for what makes a representation interpretable. When we evaluate a post-hoc explanation method, we can at least ask whether the explanation matches the model’s actual behavior — that’s a concrete, testable criterion. When we ask whether a representation is interpretable, we’re asking something more slippery: interpretable to whom, for what purpose, at what level of analysis? Sparsity and orthogonality are measurable proxies, but they’re proxies for something that’s ultimately about human understanding, which is harder to formalize. A sparse representation isn’t necessarily an interpretable one, as we noted earlier — and this means Goodhart’s law looms large. Optimize hard enough for sparsity metrics, and you’ll get models that are sparse but not meaningfully interpretable.
There’s also an irony worth acknowledging: even when we train for interpretability, we still rely on post-hoc methods to evaluate whether we’ve succeeded. We use SAE reconstruction quality, linear probes, and manual inspection of top-activating examples to check whether the representations we’ve trained are actually legible. The training-time and post-hoc paradigms aren’t fully separable — the latter remains the primary lens through which we assess the former. This isn’t a fatal problem, but it does mean the evaluation is only as good as the post-hoc tools we’re using to validate it, with all the limitations those carry.
Developing rigorous evaluation methods for training-time interpretability is itself an open research problem — arguably as important as developing the training methods themselves. We’ll return to this in depth in the next post, alongside experimental results that put these questions in a more concrete empirical context.
Conclusion
The argument I’ve tried to make in this essay is that training-time interpretability isn’t just a variation on the standard interpretability toolkit — it’s a fundamentally different way of thinking about the problem. Instead of asking how well we can explain a model after the fact, we ask how we can build models that are legible by construction. The design space for doing this is real and largely unexplored: architectural constraints that enforce interpretable structure, training objectives that reward it, and combinations of both that try to find the best operating point on the capability/interpretability curve.
The theoretical case for this approach is strong. Intervening early in training means shaping the optimization trajectory rather than analyzing where it ended up. It means applying pressure against superposition before it takes hold, rather than trying to untangle it afterward. And it means that whatever guarantees we achieve about a model’s internal representations are built in, not bolted on.
The honest caveat is that the empirical picture hasn’t caught up with the theoretical promise. Most of what we know about training-time interpretability comes from small-scale experiments, and whether these approaches hold up as models scale — where the pressure toward superposition is most intense and the stakes are highest — remains an open question. The capability cost is probably manageable, but “probably” is doing real work in that sentence.
What I do feel confident about is that this direction is worth pursuing seriously. The limitations of post-hoc interpretability for alignment aren’t going to be solved by better post-hoc tools alone. If we want reliable internal access to model representations for safety verification, we likely need to make interpretability a design objective from the start. The question is how — and that’s what the next post will begin to answer, with concrete experimental approaches and the results of putting these ideas to an empirical test.
References
Alain & Bengio (2017): Alain, G., & Bengio, Y. (2017). Understanding Intermediate Layers Using Linear Classifier Probes. ICLR Workshop. arXiv:1610.01644
Locatello et al. (2019): Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. International Conference on Machine Learning (ICML). arXiv:1811.12359
Elhage et al. (2022): Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., ... & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread. transformer-circuits.pub
Olshausen & Field (1996): Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609. DOI:10.1038/381607a0
Cunningham et al. (2023): Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv:2309.08600
Bricken et al. (2023): Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., ... & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. transformer-circuits.pub
Chen et al. (2023): Chen, Z., Meng, L., Zhao, R., Li, J., & Liang, D. (2023). Modular Circuits for Improved Compositionality. arXiv:2306.09539
Chen et al. (2018): Chen, R. T. Q., Li, X., Grosse, R., & Duvenaud, D. (2018). Isolating Sources of Disentanglement in Variational Autoencoders. NeurIPS. arXiv:1802.04942

