Discussion about this post

calebhab

Hey Hiskias, this was a great read. One question: if narrow finetuning produces these hidden misalignment features, do you foresee that jointly training sparse autoencoders (SAEs) alongside pre-training would reveal such a feature, or prevent it from coalescing in the first place? Assuming emergent misalignment is especially potent when it's rare and out-of-distribution (OOD), would enforcing monosemantic decomposition during training force rare, OOD concepts like "hidden misalignment intent" to isolate themselves sharply enough to detect and prune, or else push the model to abandon the behavior altogether because it can't compress cleanly through the SAE?
