Lead Architecture
From Sparse Features to Steering: The New White-Box Toolkit for Controlling LLM Behavior

Suppose a language model keeps drifting into the wrong cultural accent, blunders past a safety boundary, or needs a lighter touch for preference alignment. Do you patch the output after the fact, or do you go looking for the machinery inside? These three papers all choose the second route, and they do it with sparse features instead of hand-wavy latent magic. In the cultural setting, CuE uses sparse autoencoder features to show a strong default toward Anglophone countries under underspecified prompts, while also showing that some long-tail cultural signals are still there, waiting to be turned on [1]. In safety, SafeSeek turns circuit discovery into an optimization problem, finding sparse subgraphs whose removal or tuning can swing attack success rates from near-total failure to near-total success, depending on whether the circuit is malicious or protective [2]. And DSPA takes the same white-box instinct into alignment, using SAE features to make preference steering depend on the prompt and the active token state rather than spraying one global direction across everything [3]. The interesting tension is that these methods do not merely explain models; they argue that explanation can be made operational. But each also exposes a different fault line: culture is underspecified, safety is tangled up in distributed circuits, and preference signals can be more stylistic than semantic. The next question is whether these sparse interventions stay precise when the setting gets messier, the models get larger, and the behaviors stop being so neatly nameable.
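The shared mechanic across all three papers is SAE-based feature steering: encode a residual-stream activation into sparse features, clamp one feature to a target value, and write the decoder's corresponding direction back into the residual stream. Here is a minimal sketch of that generic recipe in NumPy; the weights are random stand-ins, and the shapes and function names are illustrative assumptions, not the API of CuE, SafeSeek, or DSPA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a real SAE maps d_model -> d_features with d_features >> d_model.
d_model, d_features = 8, 32

# Stand-in SAE weights (random here; in practice these come from a trained SAE).
W_enc = rng.normal(size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(size=(d_features, d_model))

def sae_encode(x):
    # ReLU encoder: produces sparse, non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def steer(x, feature_idx, target_activation):
    """Clamp one SAE feature and add the implied change back to the residual stream."""
    acts = sae_encode(x)
    # Difference between the desired and current activation, projected
    # through that feature's decoder direction.
    delta = (target_activation - acts[feature_idx]) * W_dec[feature_idx]
    return x + delta

x = rng.normal(size=d_model)                          # a residual-stream activation
x_steered = steer(x, feature_idx=3, target_activation=5.0)
```

The prompt- and token-dependent steering DSPA argues for would amount to choosing `feature_idx` and `target_activation` per token from the currently active features, rather than applying one fixed `delta` everywhere.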
Subscribe: email [email protected] with your interests. Steer: send Style: feynman or updated interests to [email protected]. To unsubscribe, email [email protected].