Lead Architecture
From Sparse Features to Steering: The New White-Box Toolkit for Controlling LLM Behavior

Suppose a language model keeps drifting into the wrong cultural accent, blunders past a safety boundary, or needs a lighter touch for preference alignment. Do you patch the output after the fact, or do you go looking for the machinery inside? These three papers all choose the second route, and they do it with sparse features instead of hand-wavy latent magic. In the cultural setting, CuE uses sparse autoencoder features to show a strong default toward Anglophone countries under underspecified prompts, while also showing that some long-tail cultural signals are still there, waiting to be turned on [1]. In safety, SafeSeek turns circuit discovery into an optimization problem, finding sparse subgraphs whose removal or tuning can swing attack success rates from near-total failure to near-total success, depending on whether the circuit is malicious or protective [2]. And DSPA takes the same white-box instinct into alignment, using SAE features to make preference steering depend on the prompt and the active token state rather than spraying one global direction across everything [3]. The interesting tension is that these methods do not merely explain models; they argue that explanation can be made operational. But each also exposes a different fault line: culture is underspecified, safety is tangled up in distributed circuits, and preference signals can be more stylistic than semantic. The next question is whether these sparse interventions stay precise when the setting gets messier, the models get larger, and the behaviors stop being so neatly nameable.
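The shared mechanic across all three papers is SAE-based feature steering: encode a residual-stream activation into sparse features, clamp one feature to a target value, and write the decoder's corresponding direction back into the residual stream. Here is a minimal sketch of that generic recipe in NumPy; the weights are random stand-ins, and the shapes and function names are illustrative assumptions, not the API of CuE, SafeSeek, or DSPA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a real SAE maps d_model -> d_features with d_features >> d_model.
d_model, d_features = 8, 32

# Stand-in SAE weights (random here; in practice these come from a trained SAE).
W_enc = rng.normal(size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(size=(d_features, d_model))

def sae_encode(x):
    # ReLU encoder: produces sparse, non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def steer(x, feature_idx, target_activation):
    """Clamp one SAE feature and add the implied change back to the residual stream."""
    acts = sae_encode(x)
    # Difference between the desired and current activation, projected
    # through that feature's decoder direction.
    delta = (target_activation - acts[feature_idx]) * W_dec[feature_idx]
    return x + delta

x = rng.normal(size=d_model)                          # a residual-stream activation
x_steered = steer(x, feature_idx=3, target_activation=5.0)
```

The prompt- and token-dependent steering DSPA argues for would amount to choosing `feature_idx` and `target_activation` per token from the currently active features, rather than applying one fixed `delta` everywhere.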
Subscribe: email [email protected] with your interests. Steer: send Style: feynman or updated interests to [email protected]. To unsubscribe, email [email protected].