Select a feature from the sidebar
to inspect its activation profile and contexts.
Steering is applied simultaneously at layers {8, 16, 20, 24}, spanning the middle-to-late range of RadVLM's 36-layer backbone. At each layer, a separate SAE (D=32,768, k=64) decomposes the residual stream, and causal screening identifies which features to suppress or boost. Below: the feature budget per layer, the top causally-impactful features, and a stacked composition view.
Single-layer steering can edit the residual stream at only one depth, but hallucination-related computations are distributed across layers.
The ablation study in the paper (§5 Results) confirms that steering all four layers reduces errors more than steering any single layer or any subset of layers. The multi-layer approach addresses hallucination at multiple stages of the model's computation, from early binding errors to late-stage repetition.
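To make the per-layer mechanics concrete, here is a minimal sketch of a TopK sparse encoding followed by a steering edit at one layer. It is an illustration under stated assumptions, not the dashboard's implementation: the function names `topk_encode` and `steer`, the toy 2-dimensional weights, and the rule "shift the residual by α times the feature's activation along its decoder direction" are all hypothetical. The real SAEs use D=32,768 and k=64.

```python
def topk_encode(h, W_enc, b_enc, k):
    """Return {feature_index: activation} for the k largest pre-activations."""
    pre = [sum(w * x for w, x in zip(row, h)) + b
           for row, b in zip(W_enc, b_enc)]
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    # Clamp at zero so a selected feature never has a negative activation.
    return {i: max(pre[i], 0.0) for i in top}


def steer(h, acts, W_dec, suppress, boost, alpha):
    """Assumed steering rule: move h along decoder directions of active features.

    Suppressed features lose alpha * activation * direction;
    boosted features gain the same amount. Other features are untouched.
    """
    out = list(h)
    for i, a in acts.items():
        sign = -1.0 if i in suppress else (1.0 if i in boost else 0.0)
        for j, d in enumerate(W_dec[i]):
            out[j] += sign * alpha * a * d
    return out


# Toy example: 3 dictionary features over a 2-dimensional residual stream.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

h = [1.0, 0.0]
acts = topk_encode(h, W_enc, b_enc, k=1)   # only feature 0 survives TopK
steered = steer(h, acts, W_dec, suppress={0}, boost=set(), alpha=0.20)
print(acts, steered)
```

In a multi-layer setup, the same two steps would run independently at each of layers 8, 16, 20, and 24, each with its own SAE weights and its own suppress/boost lists.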
| Feature | Role | Error type | Causal delta | Firings | Studies | Mean act. | Pattern |
|---|---|---|---|---|---|---|---|
Side-by-side comparison of ground truth, baseline (unsteered) generation, and steered generation on test-set samples. Steering uses V3 causal features at layers {8, 16, 20, 24} with α=0.20 and the top 20 features per layer. GREEN score: higher is better (1 = no clinically significant errors, 0 = no matched findings).
For each model pair, we compute Jaccard overlap and mean cosine similarity of decoder directions for top causally-implicated features. Universally functional directions (e.g. "anatomic checklist", "negation chain") tend to recur across models; model-specific harm-suppression features do not.
Decoder cosine similarity between this model's 12 dashboard features and the RadVLM reference (only available when viewing CheXOne or LLaVA-Rad).
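The two similarity measures can be sketched as follows. This is a hedged illustration only: the text does not say which sets the Jaccard overlap is taken over or how the cosine values are averaged, so the sketch assumes Jaccard over sets of feature role labels and a "mean of best match" cosine, and it assumes both models' decoder directions have already been brought into a shared space of equal dimension. All function names are hypothetical.

```python
import math


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_best_cosine(dirs_a, dirs_b) -> float:
    """For each direction in A, take its best match in B, then average."""
    return sum(max(cosine(u, v) for v in dirs_b) for u in dirs_a) / len(dirs_a)


# Hypothetical role labels for the top features of two models.
roles_a = {"anatomic checklist", "negation chain", "laterality"}
roles_b = {"anatomic checklist", "negation chain", "device mention"}
print(jaccard(roles_a, roles_b))  # 2 shared labels out of 4 distinct labels

# Hypothetical decoder directions in a shared 2-dimensional space.
print(mean_best_cosine([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]))
```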
This dashboard visualises 12 SAE features per model, discovered by Sparse Autoencoders (TopK, D=32,768, k=64) trained on residual-stream activations. The currently displayed model is shown in the header; switch using the Model dropdown to compare RadVLM (Qwen3-VL 8B), CheXOne (Qwen2.5-VL 3B), or LLaVA-Rad (Vicuna 7B). Each model has its own SAE, its own headline causal layer, and (where complete) its own per-feature mechanistic statistics. The dashboard has six tabs: Feature Inspector (single-layer deep-dive), Multi-Layer View (cross-layer summary), Overview & Similarity (density + geometry), Steering Examples (before/after comparisons), Cross-model Census (alignment between models), and this Guide.
delta = ablated_errors − baseline_errors.
# is the feature index in the SAE dictionary; the small number is the causal magnitude (how much the model's output changed when this feature was intervened on). Hover for exact quality-change values.
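As a small worked example of the delta definition, the sketch below computes it for a few features and ranks them by magnitude. The error counts, feature indices, and helper names are hypothetical. The mapping from sign to action (negative delta means ablation removed errors, so the feature is a candidate to suppress; positive delta means ablation added errors, so it is a candidate to boost) is an interpretation of the formula, not something the dashboard text states.

```python
def causal_delta(ablated_errors: float, baseline_errors: float) -> float:
    """delta = ablated_errors - baseline_errors."""
    return ablated_errors - baseline_errors


def suggested_action(delta: float) -> str:
    """Assumed reading of the sign: fewer errors after ablation -> suppress."""
    if delta < 0:
        return "suppress"
    if delta > 0:
        return "boost"
    return "none"


def top_features(deltas: dict, n: int) -> list:
    """Feature indices with the largest |delta|, most impactful first."""
    return sorted(deltas, key=lambda f: abs(deltas[f]), reverse=True)[:n]


# Hypothetical error counts: baseline run vs. runs with one feature ablated.
baseline_errors = 10
ablated_errors = {101: 7, 202: 12, 303: 10}

deltas = {f: causal_delta(e, baseline_errors) for f, e in ablated_errors.items()}
for f in top_features(deltas, 2):
    print(f, deltas[f], suggested_action(deltas[f]))
```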
radvlm_v2_pertoken test run (same .npy arrays as the paper).
GREEN = matched / (matched + significant errors), so higher is better.
The Matched column is a sentence-level heuristic (count of OK-classified sentences), not GREEN’s internal matched-finding count.
The Sig. errors total is always the sum of the six FF–MC cells below it, recomputed by scripts/annotate_steer_examples.py so it cannot drift out of sync with the row.
The category badge uses only ΔGREEN: large improvement if >0.05, moderate if in (0, 0.05], no change if ~0, regression if <0.
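The scoring rules above can be sketched in a few lines. This is an illustration under assumptions, not the code in scripts/annotate_steer_examples.py: the function names, the tolerance `eps` used for "~0", and the choice to return 0.0 when there are no matched findings and no errors are all hypothetical.

```python
def green_score(matched: int, error_cells: list) -> float:
    """GREEN = matched / (matched + significant errors).

    error_cells holds the six per-category error counts; their sum is the
    significant-error total, so the total cannot drift from the cells.
    """
    if len(error_cells) != 6:
        raise ValueError("expected six error-category counts")
    sig_errors = sum(error_cells)
    denom = matched + sig_errors
    if denom == 0:
        return 0.0  # assumption: score an empty comparison as 0
    return matched / denom


def badge(delta_green: float, eps: float = 1e-6) -> str:
    """Category from ΔGREEN alone; eps is an assumed tolerance for '~0'."""
    if abs(delta_green) <= eps:
        return "no change"
    if delta_green < 0:
        return "regression"
    if delta_green > 0.05:
        return "large improvement"
    return "moderate"


# Hypothetical sample: steering removes one of two significant errors.
baseline = green_score(3, [1, 0, 1, 0, 0, 0])   # 3 matched, 2 errors
steered = green_score(3, [0, 0, 1, 0, 0, 0])    # 3 matched, 1 error
print(baseline, steered, badge(steered - baseline))
```

Because the badge depends only on ΔGREEN, two rows with the same change in score receive the same badge regardless of which error categories moved.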