Lead Architecture
Inside the Model: The New Wave of Steering, Circuits, and Interpretable Features

The bastards are moving the furniture again — interpretability is no longer just a microscope held over a dead model, it is becoming a wrench. SafeSeek goes hunting for safety behavior like contraband hidden in a mattress and finds circuits so sparse they feel like a dare: cut the wrong sliver and the backdoor coughs itself to death, while the right circuit keeps safety alive under fine-tuning pressure [1]. DSPA takes the same basic heresy and makes it prompt-conditional, using sparse autoencoders to steer preference behavior without dragging the whole model through alignment surgery [2]. Then CuE stomps in with a culturally loaded flashlight, showing that underspecified generations slide toward Anglophone defaults unless you push against the model's internal cultural geometry instead of merely pleading with prompts [3]. The thread running through all of it is not subtle: these systems can be read, localized, and then edited more precisely than generic prompting allows. But the tension is nasty and useful — every gain drags along questions about transfer, stability, and whether the discovered structure is real or just benchmark theater in a lab coat. Still, the direction is obvious: interpretability is getting operational, and the next fight will be over whether these internal levers hold up when the prompts get weird, the data gets filthy, and the deployment stack starts lying.
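For readers new to the steering idea mentioned above, here is a minimal toy sketch of activation steering: adding a scaled "feature direction" (the kind of direction a sparse autoencoder might surface) to a hidden activation at inference time. This is an illustrative assumption-laden mock-up, not the method of any cited paper; all names, dimensions, and numbers are made up.

```python
import numpy as np

# Toy sketch of activation steering. In a real model you would hook a
# transformer layer's residual stream; here a random vector stands in
# for a hidden activation. Everything below is illustrative.

rng = np.random.default_rng(0)
d_model = 8

hidden = rng.normal(size=d_model)            # stand-in residual-stream activation
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)   # unit-norm "steering direction"

def steer(h, direction, alpha):
    """Shift activation h along a unit direction by strength alpha."""
    return h + alpha * direction

steered = steer(hidden, feature_dir, alpha=4.0)

# Because the direction is unit-norm, the activation's projection onto it
# moves by exactly alpha, leaving orthogonal components untouched.
shift = steered @ feature_dir - hidden @ feature_dir
print(round(shift, 6))  # → 4.0
```

The appeal, and the open question the section raises, is whether such a direction found on one distribution keeps meaning the same thing under weird prompts and dirty fine-tuning data.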
Subscribe: email [email protected] with your interests. Steer: send Style: feynman or updated interests to [email protected]. To unsubscribe, email [email protected].