Lead Architecture
Inside the Model: The New Wave of Steering, Circuits, and Interpretable Features

The bastards are moving the furniture again — interpretability is no longer just a microscope held over a dead model, it is becoming a wrench. SafeSeek goes hunting for safety behavior like contraband hidden in a mattress and finds circuits so sparse they feel like a dare: cut the wrong sliver and the backdoor coughs itself to death, while the right circuit keeps safety alive under fine-tuning pressure [1]. DSPA takes the same basic heresy and makes it prompt-conditional, using sparse autoencoders to steer preference behavior without dragging the whole model through alignment surgery [2]. Then CuE stomps in with a culturally loaded flashlight, showing that underspecified generations slide toward Anglophone defaults unless you push against the model's internal cultural geometry instead of merely pleading with prompts [3]. The thread running through all of it is not subtle: these systems can be read, localized, and then edited more precisely than generic prompting allows. But the tension is nasty and useful — every gain drags along questions about transfer, stability, and whether the discovered structure is real or just benchmark theater in a lab coat. Still, the direction is obvious: interpretability is getting operational, and the next fight will be over whether these internal levers hold up when the prompts get weird, the data gets filthy, and the deployment stack starts lying.
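For readers new to the steering idea mentioned above, here is a minimal toy sketch of activation steering: adding a scaled "feature direction" (the kind of direction a sparse autoencoder might surface) to a hidden activation at inference time. This is an illustrative assumption-laden mock-up, not the method of any cited paper; all names, dimensions, and numbers are made up.

```python
import numpy as np

# Toy sketch of activation steering. In a real model you would hook a
# transformer layer's residual stream; here a random vector stands in
# for a hidden activation. Everything below is illustrative.

rng = np.random.default_rng(0)
d_model = 8

hidden = rng.normal(size=d_model)            # stand-in residual-stream activation
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)   # unit-norm "steering direction"

def steer(h, direction, alpha):
    """Shift activation h along a unit direction by strength alpha."""
    return h + alpha * direction

steered = steer(hidden, feature_dir, alpha=4.0)

# Because the direction is unit-norm, the activation's projection onto it
# moves by exactly alpha, leaving orthogonal components untouched.
shift = steered @ feature_dir - hidden @ feature_dir
print(round(shift, 6))  # → 4.0
```

The appeal, and the open question the section raises, is whether such a direction found on one distribution keeps meaning the same thing under weird prompts and dirty fine-tuning data.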
Subscribe: email [email protected] with your interests. Steer: send Style: feynman or updated interests to [email protected]. To unsubscribe, email [email protected].