Training for Interpretability: Lessons from Anti-Superposition and Sparsity Constraints
On sparsity, specialization, and what training-time interpretability looks like in practice
This is the third essay in a four-part series on training-time interpretability. Essay 1 argued that post-hoc interpretability methods face fundamental limitations for alignment. Essay 2 mapped the design space of training-time alternatives. Here, I’ve put those ideas to the test.
The first two essays in this series built an argument: if we want models whose internals we can genuinely trust, we need to make interpretability a property that emerges during training rather than something we attempt to reverse-engineer afterward. Essay 1 diagnosed the problem: post-hoc methods face fundamental limits in coverage, faithfulness, and robustness. They can miss important features (incompleteness), misidentify what neurons actually represent (incorrectness), and break under adversarial conditions (brittleness). Essay 2 offered a map: a taxonomy of training-time approaches spanning regularization penalties, architectural constraints, and hybrid methods, along with an evaluation framework for measuring whether they actually work.
But a taxonomy is just a map. This essay is where we walk the territory.
I present results from two experiments on a GPT-2-scale language model trained on TinyStories: anti-superposition training (penalizing feature overlap in neuron activations) and sparsity constraints (forcing selective neuron activation via L1 regularization and TopK gating). Together, these represent two fundamental strategies from Essay 2’s design space, specifically regularization-based and architecture-based approaches to training-time interpretability.
The headline finding: training-time constraints can shape internal representations, but not always in the ways we expect. TopK activation gating emerged as the clear winner, achieving roughly 30% reduction in polysemanticity with under 4% capability degradation. But the orthogonality experiment produced a genuine surprise, successfully decorrelating neurons while increasing polysemanticity, revealing that mathematical independence and semantic coherence are very different things.
These results are preliminary. The model is small, the dataset is simple, and the evaluation suite is incomplete. But they provide concrete, empirical evidence that the design space from Essay 2 is not merely theoretical. They also surface specific challenges that any serious training-time interpretability program will need to address. Essay 4 will take up those challenges directly.
Section 1: Experimental Setup
1.1 Why These Two Approaches
Essay 2 identified a broad design space for training-time interpretability, spanning regularization penalties, architectural constraints, objective modifications, and hybrid approaches. For this first round of experiments, I chose two strategies that represent fundamentally different theories of how to make representations more interpretable.
Experiment 1: Anti-Superposition Training targets the root cause identified by the superposition hypothesis. If polysemanticity arises because networks compress more features than they have neurons (Elhage et al., 2022), packing multiple concepts into overlapping directions in activation space, then perhaps we can penalize that overlap directly. I tested two variants: an orthogonality constraint that penalizes correlation between neuron activation patterns, and an anti-polysemanticity penalty that directly penalizes the entropy of each neuron’s activation contexts (i.e., how many distinct semantic roles a neuron plays).
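To make the first variant concrete, here is a minimal sketch of an orthogonality penalty in PyTorch. It is illustrative rather than the exact code used in these experiments; the function name and the choice of mean squared off-diagonal correlation are my rendering of "penalizing correlation between neuron activation patterns."

```python
import torch

def orthogonality_penalty(acts: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise correlation between neuron activation patterns.

    acts: (n_tokens, n_neurons) activations from one layer, with batch and
    sequence dimensions already flattened together.
    """
    acts = acts - acts.mean(dim=0, keepdim=True)            # center each neuron
    acts = acts / (acts.norm(dim=0, keepdim=True) + 1e-8)   # unit-normalize columns
    corr = acts.T @ acts                                     # (n_neurons, n_neurons)
    off_diag = corr - torch.diag(torch.diag(corr))           # drop self-correlations
    return (off_diag ** 2).mean()
```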
Experiment 2: Sparsity Constraints takes a more architectural approach. Rather than trying to reshape what neurons represent, we constrain how many neurons can be active at once. The intuition is simple: if only a handful of neurons can fire for any given input, the model is forced to make each activation count, which should drive specialization. I tested two variants: L1 regularization (a soft penalty on total activation magnitude) and TopK activation gating (a hard constraint allowing only the top k% of neurons to fire per forward pass).
In the taxonomy from Essay 2, these map onto two fundamental categories: regularization-based approaches (anti-superposition penalties) and architecture-based approaches (sparsity constraints). They also represent two different bets: that interpretability comes from decorrelation (making neurons independent) versus selection (making neurons compete for activation).
1.2 Shared Methodology
Both experiments share a common setup to enable direct comparison.
Model architecture. I use a GPT-2-style transformer with 6 layers, 384 hidden dimensions, and approximately 30 million total parameters (~10.6 million non-embedding). This is deliberately small: large enough to exhibit meaningful superposition, but small enough to train many variants and inspect individual neurons. The limitations of this scale are acknowledged in the Conclusion.
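These counts can be roughly reproduced from the architecture numbers alone. The sketch below assumes a standard GPT-2 block (fused attention projections, 4x MLP expansion) and a GPT-2-sized vocabulary of 50,257, and it ignores biases and LayerNorm parameters; the vocabulary size is my assumption, not stated above.

```python
d_model, n_layers = 384, 6
vocab = 50_257  # assumed GPT-2 BPE vocabulary

attn = 4 * d_model**2          # Q, K, V, and output projections
mlp = 8 * d_model**2           # two matrices of shape (d_model, 4 * d_model)
non_embedding = n_layers * (attn + mlp)
embedding = vocab * d_model    # tied input/output embeddings

print(non_embedding)               # 10616832 -> the ~10.6M quoted above
print(non_embedding + embedding)   # 29915520 -> the ~30M total
```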
Dataset. All models were trained on TinyStories, a synthetic dataset of short children’s stories generated by GPT-3.5/GPT-4. TinyStories offers controlled complexity: the vocabulary is limited and the narrative structures are simple, but the dataset still requires the model to learn genuine linguistic patterns (grammar, coreference, narrative coherence), making it a useful testbed for interpretability work.
Training. All configurations used identical training hyperparameters: 50,000 steps, AdamW optimizer, learning rate of 3e-4, and batch size of 32. The only differences between configurations are the interpretability-related loss terms or architectural modifications.
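In schematic form, every configuration shares the same training step and differs only in the added term. The sketch below is PyTorch-flavored and deliberately skeletal: `model`, `loader`, and `interp_penalty` stand in for the real objects, and a model that returns its hidden activations alongside logits is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
lam = 0.01  # one of the swept values; lam = 0 recovers the baseline

for step, (inputs, targets) in enumerate(loader):
    logits, hidden_acts = model(inputs)   # model also returns layer activations
    ce_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss = ce_loss + lam * interp_penalty(hidden_acts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step == 50_000:
        break
```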
Evaluation framework. Following the multi-level evaluation framework proposed in Essay 2 (Section 4), I assess each configuration across the first three of the framework's four levels. Level 1 (Capability): validation perplexity on held-out TinyStories data, measuring whether the model can still do its job. Level 2 (Representation Metrics): activation sparsity (what fraction of neurons fire per input) and polysemanticity scores. Polysemanticity is measured as the entropy of each neuron's activation distribution across semantic context categories, where higher entropy indicates a more polysemantic neuron, one that fires indiscriminately across unrelated contexts rather than specializing.
Level 3 (Post-hoc Validation): I trained Sparse Autoencoders on frozen checkpoints from each configuration, measuring L0 (number of active SAE features) and MSE (reconstruction quality) as external validation of representation structure. Level 4 (Adversarial Robustness) was not implemented in these experiments, an acknowledged limitation.
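For concreteness, here is a sketch of the Level 2 polysemanticity score as described above. It assumes each token has already been assigned to one of a small number of semantic context categories; that labeling step is the hard part and is elided here, and the function name and threshold convention are illustrative.

```python
import torch

def polysemanticity_scores(acts: torch.Tensor, context_ids: torch.Tensor,
                           n_contexts: int, threshold: float = 0.0) -> torch.Tensor:
    """Entropy of each neuron's firing distribution over context categories.

    acts:        (n_tokens, n_neurons) activations
    context_ids: (n_tokens,) integer semantic category for each token
    Returns a (n_neurons,) tensor; higher entropy = more polysemantic.
    """
    fires = (acts > threshold).float()                      # binary firing events
    counts = torch.zeros(n_contexts, fires.size(1), device=acts.device)
    counts.index_add_(0, context_ids, fires)                # firings per (context, neuron)
    p = counts / (counts.sum(dim=0, keepdim=True) + 1e-8)   # per-neuron distribution
    return -(p * (p + 1e-8).log()).sum(dim=0)
```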
1.3 Specific Configurations Tested
Experiment 1, Anti-Superposition:
Baseline: Cross-entropy loss only (no interpretability penalty)
Orthogonality penalty: λ ∈ {0.001, 0.01, 0.1}
Anti-polysemanticity penalty: λ ∈ {0.001, 0.01, 0.1}
Experiment 2, Sparsity Constraints:
Baseline: Same as Experiment 1 (shared reference point)
L1 regularization: λ ∈ {0.001, 0.01}
TopK activation gating: k ∈ {10%, 25%}
Section 2: Individual Results
2.1 Experiment 1: Anti-Superposition Training
2.1.1 Orthogonality Penalty: The Surprise
The orthogonality experiment was built on a straightforward hypothesis: if we penalize correlation between neuron activation patterns, neurons should develop more independent representations, and independent representations should be more monosemantic. Decorrelation seemed like a natural proxy for reducing superposition.
The decorrelation worked. Across all λ values, activation correlation between neurons dropped substantially, around 40% reduction at λ=0.01 compared to the baseline. The penalty was doing exactly what I designed it to do.
But polysemanticity increased by approximately 15%.
This was not what I expected. The orthogonality constraint successfully made neurons mathematically independent: their activation patterns no longer correlated. Yet each individual neuron became more polysemantic, responding to a wider range of semantic contexts than in the baseline model. The capability cost was moderate: roughly 1.5 perplexity points at λ=0.01.

What happened? The key insight is that decorrelation and semantic specialization are not the same thing. Orthogonality constraints push neurons to occupy linearly independent directions in activation space, but those directions need not correspond to coherent semantic features. A neuron can be perfectly orthogonal to all its neighbors while still activating on an arbitrary mixture of unrelated concepts. “Mammal+verb” and “bird+adjective” are orthogonal to each other but neither is monosemantic.
In hindsight, this makes sense. The superposition hypothesis tells us that polysemanticity arises because the model needs to represent more features than it has neurons, so it packs multiple features into overlapping directions. Orthogonality constraints prevent the overlap but don’t reduce the packing. Neurons still need to encode multiple features, they just do so in non-overlapping ways. The model finds a new solution: orthogonal polysemantic neurons instead of correlated polysemantic neurons.
Revised conclusion: Orthogonality successfully decorrelates neuron activations but does not reduce polysemanticity, demonstrating that decorrelation alone does not produce monosemantic features. This is one of the clearest findings from these experiments, and it carries an important methodological lesson: mathematical properties of representations (independence, sparsity, etc.) do not automatically translate into semantic properties (monosemanticity, interpretability).
2.1.2 Anti-Polysemanticity Penalty: Modest Success
The anti-polysemanticity penalty takes a more direct approach: rather than targeting a proxy like correlation, it directly penalizes the entropy of each neuron’s activation distribution across semantic contexts. A neuron that fires equally for animals, verbs, and colors receives a high penalty; one that fires primarily for animal words receives a low penalty.
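A differentiable version of this penalty might look like the sketch below, which softens the firing decision with a sigmoid so gradients can flow. This is my reconstruction of the idea, not a verbatim copy of the training code.

```python
import torch

def anti_polysemanticity_penalty(acts: torch.Tensor,
                                 context_onehot: torch.Tensor) -> torch.Tensor:
    """Differentiable entropy of each neuron's context distribution.

    acts:           (n_tokens, n_neurons) activations
    context_onehot: (n_tokens, n_contexts) one-hot context labels
    """
    firing = torch.sigmoid(acts)                        # soft "did this neuron fire?"
    mass = context_onehot.T @ firing                    # (n_contexts, n_neurons)
    p = mass / (mass.sum(dim=0, keepdim=True) + 1e-8)   # context distribution per neuron
    entropy = -(p * (p + 1e-8).log()).sum(dim=0)        # high for indiscriminate neurons
    return entropy.mean()
```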
This approach worked as intended, though the gains were modest. At the best-performing setting (λ=0.001), polysemanticity scores dropped by approximately 20%, with only a ~0.5 perplexity point capability cost. Activation sparsity also increased slightly (~5%), suggesting that the penalty nudges neurons toward selective activation as a side effect.
The results validate the core premise: training objectives can directly reduce polysemanticity. But the improvements are incremental rather than transformative. Even with the penalty, many neurons remain stubbornly polysemantic; the model still faces the fundamental pressure to represent more concepts than it has neurons, and the penalty can only push so hard against that pressure before capability starts to collapse.
Lambda sensitivity was steep. At λ=0.01 the capability penalty increased significantly without proportionally better interpretability. At λ=0.1, the model largely collapsed: perplexity degraded sharply while polysemanticity scores actually worsened, suggesting the model was no longer learning coherent representations at all. The sweet spot is narrow.
2.2 Experiment 2: Sparsity Constraints
2.2.1 L1 Regularization: Too Expensive
L1 regularization penalizes the total magnitude of neuron activations, encouraging the model to use smaller activation values overall and, ideally, to zero out neurons that aren’t strongly needed for a given input.
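In code the penalty is essentially a one-liner; note that it acts on activations, not weights. A sketch, reusing `hidden_acts` from the training step in Section 1.2:

```python
l1_penalty = hidden_acts.abs().mean()   # uniform pressure toward zero activations
loss = ce_loss + lam * l1_penalty       # lam in {0.001, 0.01}
```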
At λ=0.001, L1 did increase sparsity, roughly 60% of neurons remained active per input, compared to ~85% in the baseline. But the capability cost was punishing: a 3-point perplexity increase, far exceeding any of the anti-superposition approaches. Worse, the interpretability payoff was minimal, only about a 5% reduction in polysemanticity.
The problem is that L1 regularization is indiscriminate. It pushes all activations toward zero uniformly rather than encouraging the model to make sharp choices about which neurons to use. The result is a model that speaks in a whisper (all activations dampened) rather than one that speaks selectively. The neurons that survive aren’t meaningfully more specialized; they’re just quieter.
2.2.2 TopK Activation: The Winner
TopK activation gating imposes a hard constraint: at each layer, only the top k% of neurons by activation magnitude are allowed to fire. Everything else is zeroed out. Unlike L1’s soft nudge, this is a structural intervention. The model must learn to route computation through a limited set of neurons for each input.
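A minimal sketch of such a gate, written as a module that could wrap each layer's hidden activations. The module is hypothetical, and details like gating on absolute magnitude are my assumptions.

```python
import torch
import torch.nn as nn

class TopKGate(nn.Module):
    """Zero out all but the top-k fraction of activations, per token."""

    def __init__(self, frac: float = 0.10):
        super().__init__()
        self.frac = frac

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n_neurons); the gate is applied independently per position
        k = max(1, int(self.frac * x.size(-1)))
        _, topk_idx = x.abs().topk(k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, topk_idx, 1.0)
        return x * mask   # gradients flow only through surviving neurons
```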
At k=10%, the results were the best of any configuration I tested. Perplexity came in at approximately 20.5, compared to 19.8 for the baseline, only a 3.5% degradation. Meanwhile, polysemanticity dropped by roughly 30%, the largest reduction across all experiments.
Why does TopK work where other approaches struggled? The hard constraint creates competition between neurons. For each input, only 10% of neurons can be active, so the model is incentivized to develop neurons that are reliably useful for specific kinds of inputs. A neuron that responds weakly to many different contexts is unlikely to survive the top-k cutoff; it’s better for the model to develop neurons that respond strongly to narrow contexts. The hard constraint, paradoxically, is easier for the model to optimize around than soft penalties: instead of navigating a continuous trade-off between capability loss and regularization penalty, the model simply learns to work within a fixed budget.
But there are complications. Roughly 35% of neurons never activated across the entire validation set. They appear to be completely dead. Whether these neurons are truly unused capacity or would activate on out-of-distribution inputs remains an open question with potential alignment implications (discussed further in Section 4.2).
More puzzling is the SAE reconstruction divergence. When I trained Sparse Autoencoders on frozen TopK checkpoints, the models showed better L0 (fewer active SAE features needed to represent the activations) but worse MSE (the reconstruction was less faithful). This is counterintuitive: if TopK representations are genuinely simpler and more interpretable, SAEs should reconstruct them more easily.
The likely explanation is that forced sparsity and natural sparsity are different things. TopK makes activations sparse by construction, but the features encoded in those sparse activations may still be entangled or overlapping in ways that a post-hoc SAE struggles to decompose. The sparsity is in the activation pattern, not necessarily in the underlying feature structure. This distinction turns out to have direct alignment implications. A model can look interpretable on sparsity metrics while still encoding features that resist clean decomposition. I return to this in Section 4.2.
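For reference, both SAE metrics fall out of a standard sparse autoencoder evaluation pass. The sketch below assumes a trained ReLU SAE exposing `encode` and `decode` methods; that interface is an assumption for illustration, not a specific library's API.

```python
import torch

@torch.no_grad()
def sae_metrics(sae, acts: torch.Tensor) -> tuple[float, float]:
    """L0 = mean active SAE features per token; MSE = reconstruction error.

    acts: (n_tokens, d_model) activations from a frozen model checkpoint.
    """
    feats = sae.encode(acts)                      # (n_tokens, n_features), ReLU-sparse
    recon = sae.decode(feats)
    l0 = (feats > 0).float().sum(dim=-1).mean()   # lower for the TopK models
    mse = ((recon - acts) ** 2).mean()            # but higher for the TopK models
    return l0.item(), mse.item()
```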
As a sanity check, I manually inspected 20 randomly selected neurons from each configuration, rating them as monosemantic or polysemantic based on their top-10 activating examples. The results track the quantitative metrics: 45% of TopK 10% neurons were clearly monosemantic (vs. 25% baseline), 35% for anti-polysemanticity, and 20% for orthogonality (worse than baseline). The sample is small but it corroborates the automated scores.
Section 3: Comparative Analysis
3.1 Cross-Experiment Insights
3.1.1 The Interpretability-Capability Trade-off
Plotting all configurations on a unified Pareto frontier (perplexity on one axis, my best composite interpretability measure on the other), a clear hierarchy emerges.
Figure 5: Unified Pareto frontier — all methods plotted on perplexity vs. interpretability
TopK 10% occupies the most favorable position: a small capability cost for the largest interpretability gain. Anti-polysemanticity at λ=0.001 sits nearby, offering modest improvements on both dimensions with minimal capability tax. TopK 25% provides a middle-ground trade-off. The remaining approaches fall below the frontier: orthogonality incurs a capability cost without interpretability improvement, and L1 pays a heavy capability penalty for negligible gains.
The critical observation is not that trade-offs exist (that was expected) but that different approaches yield qualitatively different trade-offs. Some methods are strictly dominated: there is no reason to use L1 over TopK, or orthogonality over anti-polysemanticity, because the latter achieve better interpretability at lower capability cost. This matters for practitioners, because it means the choice of method is not merely a matter of tuning a single dial but of selecting fundamentally different strategies with different payoff structures.
3.1.2 Sparsity vs. Decorrelation
The sharpest contrast in my results is between sparsity-based approaches (TopK, L1) and the decorrelation approach (orthogonality). Sparsity consistently outperformed decorrelation on interpretability metrics, despite both strategies having plausible theoretical justifications.
The reason, I believe, is that sparsity directly enforces selectivity: neurons must compete for activation, which incentivizes specialization. Decorrelation only enforces independence, which is compatible with continued polysemanticity.
An analogy may help. Forcing people to speak less often (sparsity) means that when they do speak, they tend to say something specific and important. Forcing people to talk about different things than each other (decorrelation) doesn’t stop any individual person from rambling across multiple topics. Sparsity constrains the neuron’s behavior; decorrelation constrains its relationship to other neurons. For interpretability, behavior turns out to matter more.
This doesn’t mean decorrelation is useless (an ideal representation might be both sparse and decorrelated), but it suggests that decorrelation alone is insufficient. Any training-time interpretability approach should include a selectivity mechanism.
3.1.3 The Metric Divergence Problem
One of the most practically important findings from my experiments is that different interpretability metrics do not always agree. TopK models score well on L0 but poorly on MSE. Anti-polysemanticity models show good polysemanticity scores but only modest L0 improvements. Orthogonality models achieve excellent decorrelation while worsening on polysemanticity.
Figure 6: Heatmap showing correlation between different interpretability metrics across all configurations
“Interpretability” is not a single dimension. It is a multi-dimensional property; different metrics capture different facets of it, and optimizing for one facet can leave others unchanged or even degrade them. This finding directly validates the multi-level evaluation framework proposed in Essay 2: any serious attempt to measure training-time interpretability needs multiple complementary metrics, because no single metric captures the full picture.
It also raises a Goodhart’s Law concern. If we optimize a model against a single interpretability metric during training, we may produce representations that score well on that metric while failing to be genuinely interpretable in the ways that matter for alignment. The TopK L0/MSE divergence is a concrete example: the model looks great on sparsity (L0) while potentially hiding entangled features that a post-hoc SAE can’t cleanly reconstruct (MSE). This carries direct alignment implications that I address in Section 4.2.
3.1.4 Lambda/Hyperparameter Sensitivity
Every regularization-based approach in my experiments was highly sensitive to the choice of λ. The pattern was remarkably consistent: the optimal range for the interpretability penalty sat around λ = 0.001 to 0.01. Below this range, the penalty had no measurable effect. Above it (particularly at λ = 0.1), the model’s capability collapsed without corresponding interpretability improvements.
This sensitivity is not just a practical nuisance; it reveals something about the nature of the optimization landscape. Training-time interpretability constraints must compete with the primary language modeling objective for gradient signal. Push too hard and the model can no longer learn useful representations at all; push too gently and the constraint's gradient is drowned out by the language modeling objective. The sweet spot is narrow, and it may shift depending on model size, dataset complexity, and training duration.
For practical deployment, this raises a genuine challenge: selecting the right λ requires extensive hyperparameter search, and my experiments tested only 2–3 values per approach. The true optimal settings may differ from what I found, and they will almost certainly differ at larger scales.
Section 4: Implications
4.1 For Interpretability Research
My experiments yield three categories of findings.
What worked. These experiments provide initial empirical evidence that training-time constraints can shape internal representations — this is no longer purely theoretical.
TopK activation gating produces a favorable capability-interpretability trade-off, achieving the largest interpretability gains with the smallest capability cost. And approaches that directly target the property we care about (sparsity, polysemanticity reduction) outperform those that target proxies (decorrelation). The general principle: if you want neurons to specialize, constrain their behavior directly rather than constraining their relationships to other neurons.
What didn’t work or needs scrutiny. The orthogonality penalty backfired, producing the clearest negative result in my experiments. L1 regularization was too expensive relative to its benefits; soft penalties are less effective than hard constraints for driving specialization. And the ~35% dead neurons in TopK raise an unresolved question: does TopK produce genuine sparsity (the model truly doesn’t need those neurons) or artificial sparsity (the model is working around a constraint that removes useful capacity)? This distinction matters for how I interpret the results and whether we can trust TopK representations. Given that I cannot yet rule out activation on out-of-distribution inputs, I treat this as an alignment concern rather than merely an open question (see Section 4.2).
Methodological lessons. Lambda tuning is critical and the optimal range is narrow. Multi-metric evaluation is essential because single metrics can be misleading or Goodhartable. And qualitative validation matters — the human case studies provided information that no automated metric could capture, particularly the insight that anti-polysemanticity produces broad category-level specialization rather than fine-grained monosemanticity.
Key open questions remain. Why does TopK exhibit the L0/MSE divergence? What is the right level of sparsity? Can we combine approaches (for instance, TopK with an anti-polysemanticity overlay) to get better-than-additive improvements? And most critically, how do these results change at scales relevant to frontier models?
4.2 For Alignment
The optimistic view. These results provide the first empirical evidence that training-time interpretability interventions are feasible at reasonable computational cost. A capability tax of under 4% for a ~30% polysemanticity reduction suggests that the capability-interpretability trade-off may be tractable, at least at small scales, validating the central argument from Essay 1: that making interpretability a training objective is a viable alternative to the post-hoc approach, and it’s worth investing research effort in this direction.
The realistic view. A 30% reduction in polysemanticity is meaningful but not transformative. It does not give us models whose internals we can fully trust for safety-critical applications. My results surface specific concerns that temper the optimism.
Recall the L0/MSE divergence from Section 2.2.2. TopK models need fewer active SAE features to describe their activations (better L0) but those features reconstruct the activations less faithfully (worse MSE). This is the forced-versus-natural sparsity distinction, and it matters directly for alignment: a model can score well on sparsity-based interpretability metrics while still encoding features that are entangled in ways that resist reliable inspection. If we use training-time interpretability constraints as a safety tool, we need assurance that the interpretability we’re measuring corresponds to genuine transparency, not metric satisfaction.
The ~35% dead neurons compound this concern. If those neurons would activate on adversarially chosen or out-of-distribution inputs, they represent a potential hiding place for alignment-relevant computation: features that appear absent during normal evaluation but are available when the model encounters inputs designed to elicit deceptive or misaligned behavior.
The cautious view. There is no evidence that any of these approaches would detect deceptive alignment, mesa-optimization, or other alignment-critical phenomena. The gains I observe are on polysemanticity metrics, which measure statistical properties of activation distributions. They don’t tell us whether the model’s representations encode the specific features that alignment researchers need to inspect. Models could learn to game interpretability metrics under sufficient optimization pressure, producing representations that look clean on my measures while concealing structure we can’t detect.
The key insight for alignment: my experiments provide early evidence that training-time interpretability is possible, but the gap between “statistically less polysemantic” and “safe to deploy without human oversight” remains enormous. Closing that gap requires scaling these approaches, developing adversarial evaluation, and, perhaps most importantly, understanding whether “interpretable by our metrics” and “genuinely transparent” are the same thing. Essay 4 will take up this challenge.
Section 5: Next Steps
5.1 Immediate Next Experiment: Joint SAE Training
The most natural follow-up addresses the SAE reconstruction puzzle directly. TopK models produce sparse activations that SAEs struggle to reconstruct faithfully, but the SAEs were trained post-hoc on frozen checkpoints. The next experiment will embed SAE-inspired objectives into the training loop itself, asking whether a model can learn to produce representations that are jointly optimized for both language modeling and SAE decomposability. If the L0/MSE divergence disappears under joint training, it was likely an artifact of post-hoc fitting. If it persists, it tells us something deeper about the nature of forced sparsity.
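Schematically, the joint objective adds an SAE reconstruction term (and the usual feature-sparsity term) to the language modeling loss, so the model is rewarded during training for producing decomposable activations. All names and the weighting scheme below are illustrative assumptions, not the finalized design.

```python
import torch
import torch.nn.functional as F

def joint_loss(model, sae, inputs, targets, alpha=0.1, beta=1e-3):
    logits, acts = model(inputs)               # acts from the layer the SAE reads
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    feats = sae.encode(acts)
    recon = sae.decode(feats)
    recon_loss = ((recon - acts) ** 2).mean()  # model now shares this objective
    sparsity = feats.abs().mean()              # standard L1 on SAE features
    return ce + alpha * recon_loss + beta * sparsity
```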
5.2 Other Planned Experiments
Beyond joint SAE training, several avenues are in progress or planned. I intend to test combination approaches (particularly TopK with an anti-polysemanticity overlay) to see whether targeting both sparsity and semantic specialization simultaneously yields better-than-additive gains. Modular architectures, specifically Mixture-of-Experts with specialization constraints, offer a different angle: rather than constraining a monolithic network, organize the model into naturally specialized sub-modules. I also plan longer training runs to test whether interpretability degrades over extended training, and adversarial evaluation to test whether models can learn to game the metrics.
5.3 Bridge to Essay 4
These results raise several challenges that Essay 4 will address directly. The capability tax problem: a ~4% degradation is acceptable for research, but what does the trade-off look like for production models at scale? The evaluation validity problem: the metrics diverge, so how do we know what “interpretable” really means, and how do we avoid Goodharting interpretability the way we might Goodhart any other training objective? The deceptive compliance problem: could models learn to produce representations that satisfy our interpretability metrics without being genuinely transparent? And the scalability question: do any of these approaches work at 70B or 400B parameters, where the practical stakes for alignment are highest?
Conclusion
What did we learn?
First, that training-time interpretability is real. Constraints applied during training can measurably shape internal representations. This is no longer a theoretical proposition but an empirical finding with concrete numbers behind it.
Second, that not all approaches work. The orthogonality penalty backfired in an instructive way, revealing that mathematical properties of representations (independence, decorrelation) do not automatically produce semantic properties (monosemanticity, interpretability). This is perhaps the single most useful result from my experiments, because it narrows the design space: future work should focus on selectivity-based methods over correlation-based ones.
Third, that the trade-offs are real but manageable. TopK activation gating achieved roughly 30% polysemanticity reduction at under 4% capability cost, a ratio that, if it holds at scale, could make training-time interpretability a practical tool rather than an academic curiosity.
Fourth, that evaluation is hard. Different interpretability metrics tell different stories, and optimizing for one can leave others unchanged or degraded. The L0/MSE divergence in TopK models is a microcosm of this broader challenge: looking interpretable and being interpretable may not be the same thing.
Fifth, and most importantly, that much work remains. These results come from a ~30M parameter model trained on TinyStories with single training runs per configuration, sparse hyperparameter search, no adversarial evaluation, and SAEs trained post-hoc rather than jointly. They should be read as proof-of-concept, not as direct evidence about frontier model behavior.
The overall takeaway: training for interpretability is a promising direction, but it is not a silver bullet. We can make models somewhat more interpretable with acceptable capability trade-offs, but the gains are incremental rather than transformative. The path forward requires a better understanding of what “interpretable” actually means (Essay 4’s focus), scaling these approaches to frontier models, adversarial testing, and combining multiple methods for compounding improvements.
For now, I’ve shown that the design space from Essay 2 is not just theoretical. These ideas can be implemented and tested, and the results, while mixed, give us concrete data to refine the approach. The next step is to push further: joint SAE training, combination methods, and the hard conceptual work of defining what it actually means for a model to be interpretable. That’s where Essay 4 begins.
References
Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv preprint arXiv:2305.07759.
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.