SAE Feature Explorer

Model selector and source badges

Header

Model: RadVLM (Qwen3-VL 8B) — 12 headline features at layer 16, selected by V3 single-feature causal ablation on ~3K validation samples. All five dashboard tabs are fully populated.
Feature source: V3 causal — features chosen by single-feature causal ablation; the most reliable selection method.

Feature Inspector tab

Feature Inspector

Sidebar (left panel): Lists all 12 features. Each card shows the feature index, its role, the primary error type abbreviation, the causal delta, and how many studies it activates in. Use the filter buttons at the top to narrow by role or error type. Click any card to inspect it.
Filter buttons: Boost / Suppress — filter by the feature's role in steering. FF / MF / WL / WS — filter by the primary GREEN error type the feature targets (False Finding, Missing Finding, Wrong Location, Wrong Severity).
Role badge — Boost vs Suppress: Boost (green): Amplifying this feature's activation reduces clinical errors. The model generates more accurate reports when this direction is strengthened.
Suppress (red): Zeroing out this feature's activation reduces errors. This direction promotes hallucinations, so we silence it.
Causal delta (Δ): The headline number in the top-right of the detail view. It is the average change in error count per report when this single feature is intervened on (boosted or suppressed), relative to the unsteered baseline, measured on 100 validation samples. The sign and colour follow the feature's role, and in both cases the intervention removed errors: a boost feature shows a positive Δ (green), while a suppress feature shows a negative Δ (red) because the errors were removed by silencing a harmful direction. Example: Δ=+0.30 means the model made 0.30 fewer errors per report on average when this feature was boosted.
Stats cards: Studies active — How many of the 3,307 IU-Xray reports triggered this feature at least once. High = broadly active; low = highly selective.
Total firings — Total number of token positions where the feature activated above threshold (2.0). One study can produce many firings across its generated tokens.
Mean activation — Average activation magnitude when the feature fires (above the 2.0 threshold). Higher values mean the feature fires more strongly.
Median position — Where in the generated report the feature typically fires, expressed as a percentage from start (0%) to end (100%). Features near 0% fire during the opening; features near 100% fire at the end.
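The four stats cards can be illustrated with a minimal sketch. The input layout (a dict mapping a study ID to its per-token activations) and the name `feature_stats` are assumptions made for illustration; only the 2.0 threshold and the card definitions come from the descriptions above.

```python
from statistics import mean, median

THRESHOLD = 2.0  # firing threshold quoted in the stats-card descriptions

def feature_stats(acts_by_study):
    """acts_by_study: {study_id: [activation at each generated token]} (hypothetical layout)."""
    firings = []     # activation magnitude of every firing
    positions = []   # relative position (0-100%) of every firing
    studies_active = 0
    for acts in acts_by_study.values():
        fired = [(i, a) for i, a in enumerate(acts) if a > THRESHOLD]
        if fired:
            studies_active += 1
        for i, a in fired:
            firings.append(a)
            # Position as a percentage of report length; a one-token report maps to 0%.
            positions.append(100.0 * i / (len(acts) - 1) if len(acts) > 1 else 0.0)
    return {
        "studies_active": studies_active,
        "total_firings": len(firings),
        "mean_activation": mean(firings) if firings else 0.0,
        "median_position": median(positions) if positions else None,
    }

stats = feature_stats({
    "case_a": [0.0, 3.0, 0.5, 5.0, 4.0],  # fires at tokens 1, 3 and 4
    "case_b": [0.1, 0.2, 1.9],            # never exceeds the threshold
})
```

In this toy input one study produces three firings, which is why "Studies active" and "Total firings" can differ widely for the same feature.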
Causal effect on all error types (4-cell grid): Shows what happens to each GREEN error type when this feature is removed (zeroed out) from the residual stream. The sign convention comes directly from the causal screening: delta = ablated_errors − baseline_errors.

Green (+): removing the feature increased this error type → the feature was preventing these errors.
Red (−): removing the feature decreased this error type → the feature was causing these errors.

The cell with a blue border is the feature's primary error type (the one it was selected for).
Why some suppress features show green cells: A feature can simultaneously cause one type of error and prevent another. For example, feature #4240 causes wrong-location errors (WL=−0.09, so we suppress it) but also prevents false findings (FF=+0.20). This is a trade-off. In multi-feature steering, other boost features (like #5872, which prevents FF by +0.28) compensate for these side effects. The combined effect across 50 features per layer nets out positively on all error types.
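The sign convention of the 4-cell grid, delta = ablated_errors − baseline_errors, can be sketched as follows. The per-report dict layout, the function names, and the "neutral" colour for a zero delta are assumptions for illustration; the dashboard's own screening code is not shown in this help text.

```python
ERROR_TYPES = ("FF", "MF", "WL", "WS")

def ablation_deltas(baseline_errors, ablated_errors):
    """Mean per-report delta per error type: ablated_errors - baseline_errors.

    Both arguments are lists of per-report dicts {error_type: count}
    (hypothetical layout), aligned so index i is the same report in both runs.
    """
    n = len(baseline_errors)
    deltas = {}
    for t in ERROR_TYPES:
        diff = sum(a[t] - b[t] for a, b in zip(ablated_errors, baseline_errors))
        deltas[t] = diff / n
    return deltas

def cell_colour(delta):
    # Green (+): removing the feature added errors, so the feature was preventing them.
    # Red (-): removing the feature removed errors, so the feature was causing them.
    if delta > 0:
        return "green"
    if delta < 0:
        return "red"
    return "neutral"  # assumption: the text does not say how a zero delta is drawn

# Toy counts for two reports, not real screening data.
baseline = [{"FF": 1, "MF": 0, "WL": 1, "WS": 0}, {"FF": 0, "MF": 1, "WL": 1, "WS": 0}]
ablated = [{"FF": 2, "MF": 0, "WL": 0, "WS": 0}, {"FF": 0, "MF": 1, "WL": 1, "WS": 0}]
deltas = ablation_deltas(baseline, ablated)
colours = {t: cell_colour(d) for t, d in deltas.items()}
```

The toy feature here has the same mixed profile as the trade-off case: ablating it adds false findings (green FF cell) while removing wrong-location errors (red WL cell).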
Report position distribution (bar): A horizontal bar showing the fraction of this feature's firings that occur in the early vs. late half of the report. "Early" = first half of generated tokens; "Late" = second half. Example: a feature with "87% late" fires almost exclusively toward the end of reports (often on repetition or closing statements).
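As a rough illustration of the early/late split, the sketch below counts firings in each half of the report. The function name, the use of fractional positions, and the choice to count the exact midpoint as "late" are assumptions; the text does not specify how the boundary is handled.

```python
def early_late_split(positions):
    """positions: firing positions as fractions of report length, 0.0 (start) to 1.0 (end)."""
    if not positions:
        return None
    # Assumption: a firing exactly at the midpoint counts as late.
    late = sum(1 for p in positions if p >= 0.5)
    late_pct = 100.0 * late / len(positions)
    return {"early_pct": 100.0 - late_pct, "late_pct": late_pct}

split = early_late_split([0.10, 0.60, 0.80, 0.90])  # toy positions
```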
Top activating contexts: The 3 token positions where this feature activated most strongly across the validation set. Each card shows an anonymized case ID, the activation magnitude, and a text snippet around the activating token. These snippets help you form an intuitive hypothesis about what the feature detects (e.g., support-device language, negation chains, repetition loops).
Per-token activation highlighting (in “Top activating contexts”): When per-token activation data is loaded, each token in the example snippet is shaded by its SAE pre-activation (continuous projection onto the encoder direction, before TopK gating). Red = positive activation (feature direction aligned with this token's hidden state); blue = negative (anti-aligned). Saturation encodes magnitude. The most-activating tokens (often a noun phrase or anatomical term) appear most saturated.

We use pre-activations rather than post-TopK latents because the text-only replay of a snippet rarely puts a specific feature in the top-64 active features (which are dominated by image-conditioned context during actual generation). Pre-activations preserve the alignment signal at every token.

The replay peak badge reports the peak pre-activation observed during this fresh forward pass. The orig. max in the header is the original post-TopK activation captured during full image-conditioned generation; the two are on different scales and not directly comparable.
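The difference between pre-activations and post-TopK latents can be seen in a small sketch. It assumes a plain linear encoder (projection plus bias) followed by TopK gating; whether the actual SAE also subtracts a decoder bias or applies a ReLU is not stated here, and the toy sizes (4 features, k=2) stand in for the real dictionary of 32,768 features with k=64.

```python
def pre_activations(hidden, encoder_rows, encoder_bias):
    """Continuous projection of one hidden state onto every encoder direction."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(encoder_rows, encoder_bias)]

def topk_latents(pre, k):
    """TopK gating: keep the k largest pre-activations and zero the rest."""
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]

hidden = [1.0, 2.0]                    # toy 2-dimensional hidden state
encoder_rows = [[1.0, 0.0],            # feature 0
                [0.0, 1.0],            # feature 1
                [0.5, 0.5],            # feature 2
                [-1.0, 0.0]]           # feature 3 (anti-aligned with this token)
encoder_bias = [0.0, 0.0, 0.0, 0.0]

pre = pre_activations(hidden, encoder_rows, encoder_bias)
post = topk_latents(pre, k=2)
```

Feature 0 has a positive pre-activation but is gated to zero after TopK, and feature 3 has a negative pre-activation that the gated latent cannot express. This is the alignment signal that the red and blue shading preserves.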
Activation distribution (histogram): Shows the approximate distribution of activation magnitudes for this feature, conditional on firing above the 2.0 threshold. The shape is fitted from observed statistics (mean activation, max activation, number of firings) across MIMIC-CXR validation (SAE training set); example snippets below are from IU-Xray.

The blue dashed line marks the mean activation. Green bars = boost features; red bars = suppress features.

Stats below the chart: mean activation (conditional on firing), maximum observed activation, total number of firings (token positions), and number of distinct studies where the feature activated. A high max relative to the mean suggests the feature fires very strongly on specific token patterns.
Conditional activation — pre-activation strength by error type: For each of the four GREEN error types (FF, MF, WL, WS), this panel shows the mean SAE pre-activation (continuous projection onto the encoder direction, before TopK gating) computed on the mean-pooled hidden state of each of the 3,000 training samples, split into two groups:
w/ err — reports that contain at least one error of this type;
no err — reports without any error of this type.

We use pre-activations rather than post-TopK firing rates because mean-pooled samples rarely place a specific feature in the top-64 active features, but the continuous projection still captures how strongly the feature direction aligns with the sample's hidden state.

The right column reports Cohen's d, the standardised effect size of the with-error vs no-error difference (mean difference divided by overall standard deviation). d > +0.10 (red) means the feature is more active on erroneous reports — consistent with a hallucination-promoting role. d < −0.10 (green) means depletion on errors — consistent with a clean-output role. This is independent of the causal screening and provides a complementary check: features whose conditional activation matches their assigned role have stronger interpretive support.
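A minimal sketch of the effect size as described (mean difference divided by the overall standard deviation) is given below. The function names are hypothetical, and the use of the population standard deviation over both groups combined is an assumption; the textbook Cohen's d uses a pooled within-group standard deviation, which would give a somewhat different value.

```python
from statistics import mean, pstdev

def cohens_d_overall(with_err, no_err):
    """(mean with-error - mean no-error) / standard deviation of all samples combined."""
    overall_sd = pstdev(with_err + no_err)  # assumption: population SD over both groups
    if overall_sd == 0:
        return 0.0
    return (mean(with_err) - mean(no_err)) / overall_sd

def d_colour(d, cutoff=0.10):
    if d > cutoff:
        return "red"      # more active on erroneous reports
    if d < -cutoff:
        return "green"    # depleted on erroneous reports
    return "neutral"      # assumption: colour between the two cutoffs is not specified

# Toy pre-activations, not real measurements.
d = cohens_d_overall(with_err=[2.0, 4.0], no_err=[0.0, 2.0])
colour = d_colour(d)
```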
Position distribution within report: A 10-bin histogram showing where this feature tends to fire in the generated report, from the first token (0%) to the last token (100%). Reconstructed from the median position (p50) and the early/late-half split using a smoothed Gaussian fit.

Features that fire near the start (e.g., "20872" at 17%) typically activate on opening anatomy mentions; features that fire late (e.g., "15810" at 87%) often correspond to repetition loops or closing statements.
Feature geometry — decoder cosine similarity: For each feature, shows its 3 nearest and 3 most dissimilar neighbours among the 12 curated features. Similarity is measured by cosine similarity between decoder weight vectors — the learned directions in the 4,096-dimensional residual stream.

Low cosine values (<0.1) confirm that the features capture independent aspects of the model's computation. This geometric independence is important: it means combining all 12 features for steering doesn't create redundant or conflicting interventions. Each feature steers a genuinely different direction in residual space.
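The neighbour ranking can be sketched with plain cosine similarity over decoder vectors. The dict layout, the function names, and the toy 3-dimensional vectors with made-up feature IDs are assumptions for illustration; the real decoder directions are 4,096-dimensional.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def neighbours(decoder, target, n=3):
    """Return (nearest, most_dissimilar) feature IDs for `target`.

    decoder: {feature_id: decoder weight vector} (hypothetical layout).
    """
    sims = sorted(((cosine(decoder[target], vec), idx)
                   for idx, vec in decoder.items() if idx != target),
                  reverse=True)
    nearest = [idx for _, idx in sims[:n]]
    most_dissimilar = [idx for _, idx in sims[::-1][:n]]
    return nearest, most_dissimilar

# Toy decoder with hypothetical feature IDs 1-4.
decoder = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],   # nearly parallel to feature 1
    3: [0.0, 1.0, 0.0],   # orthogonal to feature 1
    4: [-1.0, 0.0, 0.0],  # opposite to feature 1
}
nearest, most_dissimilar = neighbours(decoder, target=1, n=1)
```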

Multi-Layer View tab

Multi-Layer

Layer cards: One card per steered layer (8, 16, 20, 24). Each shows:
Depth — Where in the 36-layer backbone this layer sits (e.g., layer 16 = 44% depth).
Screened / suppress / boost counts — Of the 200 candidate features screened by causal ablation at this layer, how many were classified as suppress (promoting hallucinations) vs boost (improving accuracy).
Top 5 chips — The 5 most impactful features for each role, ranked by causal magnitude. The number after # is the feature index in the SAE dictionary; the small number is the causal magnitude (how much the model's output changed when this feature was intervened on). Hover for exact quality-change values.
Feature budget composition (stacked bar chart): A horizontal stacked bar for each layer showing the proportion of screened features classified as suppress (red) vs boost (green). This visualises the key insight: layers have different suppress/boost ratios. Layer 8 is suppress-heavy (early hallucination seeds), layer 16 is boost-heavy (clinical semantics), and layer 24 returns to suppress-heavy (output-level errors).
Why multi-layer?: A text explanation of why single-layer steering is insufficient and how the four chosen layers each contribute differently to hallucination mitigation.

Overview & Similarity tab

Overview

Activation density chart: A bar chart of total firings for each feature, sorted from most to least active. Bar height uses a log10 scale (otherwise the densest feature would dwarf all others). The number above each bar is the raw count. Green bars = boost features; red bars = suppress features. Click any bar to jump to that feature in the inspector.
Pairwise cosine similarity heatmap: Each cell shows the cosine similarity between two features' SAE decoder columns (the directions in residual-stream space). Diagonal cells are always 1.0 (a feature is identical to itself). Off-diagonal values near 0 mean the two features point in unrelated directions — they capture independent aspects of the model's computation. Green = positive similarity; red = negative (anti-correlated). Hover over any cell for the exact value.
Feature summary table: A compact reference table with all 12 features in one view: index, role, error type, causal delta, firings, studies, mean activation, and the hypothesised pattern. Click any row to jump to that feature in the inspector tab.

Steering Examples tab

Steering

What is shown: Each card shows a chest X-ray image (IU-Xray chest radiograph) alongside three report texts: the radiologist-written ground truth (reference), the model's baseline (unsteered) generation, and the steered generation after SAE intervention. Examples are drawn from the 3,314-sample held-out test set.
GREEN error highlighting: Each sentence in the baseline and steered reports is classified by comparing it to the ground truth (GT) reference. Sentences are color-coded by the 6 GREEN error categories, matching the official GREEN taxonomy:

Matched — sentence agrees with the ground truth (word overlap >50%).
FF — False Finding — sentence reports a finding not present in the ground truth (hallucination).
MF — Missing Finding — a ground-truth finding is absent from the report. Shown as a “⚠ Missing from report” block below the text.
WL — Wrong Location — the finding matches the GT but specifies an incorrect anatomical location (e.g. “right” vs “left”).
WS — Wrong Severity — the finding matches the GT but uses a different severity descriptor (e.g. “small” vs “large”).
FC — False Comparison — the sentence mentions a temporal comparison (e.g. “compared to prior”) that is not in the ground truth.
MC — Missing Comparison — the ground truth has a temporal comparison that is absent from the report.

Insignificant — the GREEN model classified this as a clinically insignificant error (does not affect the GREEN score).

How it works: Annotations are pre-computed from the GREEN model’s raw output (StanfordAIMI/GREEN-radllama2-7b). Each error description from the GREEN response is fuzzy-matched to the corresponding sentence in the report. This is far more accurate than simple word-overlap heuristics, since it uses the GREEN model’s actual clinical judgment.
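The fuzzy-matching step might look something like the sketch below, which pairs an error description with the most similar report sentence. The matcher actually used by the annotation script is not specified in this help text; Python's standard `difflib.SequenceMatcher` is swapped in purely as an illustration, and the sentences are toy examples.

```python
from difflib import SequenceMatcher

def best_sentence(error_description, sentences):
    """Return (index, score) of the report sentence most similar to an error description."""
    scores = [SequenceMatcher(None, error_description.lower(), s.lower()).ratio()
              for s in sentences]
    best = max(range(len(sentences)), key=lambda i: scores[i])
    return best, scores[best]

report_sentences = [
    "The heart size is normal.",
    "There is a small left pleural effusion.",
    "No pneumothorax.",
]
idx, score = best_sentence("small left pleural effusion", report_sentences)
```

A real pipeline would likely also need a minimum score below which an error description is left unattached rather than forced onto a poorly matching sentence.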
GREEN scores and delta: The headline Baseline / Steered GREEN values and ΔGREEN are the official per-sample scores from the frozen radvlm_v2_pertoken test run (same .npy arrays as the paper). GREEN = matched / (matched + significant errors), so higher is better. The Matched column is a sentence-level heuristic (count of OK-classified sentences), not GREEN’s internal matched-finding count. The Sig. errors total is always the sum of the six FF–MC cells below it, recomputed by scripts/annotate_steer_examples.py so it cannot drift out of sync with the row. The category badge uses only ΔGREEN: large improvement if >0.05, moderate if in (0, 0.05], no change if ~0, regression if <0.
Per-error-type breakdown: The six FF–MC columns are a heuristic proxy (sentence match vs. reference), not a second GREEN API call. The 6 labels follow the GREEN taxonomy:
FF — False Finding: hallucinated finding not in reference
MF — Missing Finding: finding from reference omitted in report
WL — Wrong Location: finding reported at incorrect anatomy
WS — Wrong Severity: finding with incorrect severity descriptor
FC — False Comparison: fabricated temporal comparison
MC — Missing Comparison: omitted temporal comparison from reference
Red counts indicate errors present; gray zeros indicate no errors of that type. The Σsig column repeats the sum of those six cells (same as the Sig. errors headline).
Category badges: Large improvement — ΔGREEN > 0.05.
Moderate improvement — 0 < ΔGREEN ≤ 0.05.
No change — ΔGREEN is effectively zero.
Regression — ΔGREEN < 0.
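The score formula, the Sig. errors sum, and the badge thresholds can be written out as a short sketch. The function names, the zero-denominator fallback, and the tolerance used for "effectively zero" are assumptions. The dashboard itself reads the official per-sample scores rather than recomputing them, so this only illustrates the arithmetic.

```python
def green_score(matched, significant_errors):
    """GREEN = matched / (matched + significant errors); higher is better."""
    total = matched + significant_errors
    return matched / total if total else 0.0  # assumption: empty case scored as 0

def category(delta_green, eps=1e-6):
    """Badge from delta GREEN alone; eps is an assumed tolerance for 'no change'."""
    if abs(delta_green) <= eps:
        return "no change"
    if delta_green > 0.05:
        return "large improvement"
    if delta_green > 0:
        return "moderate improvement"
    return "regression"

# Toy per-type counts; the Sig. errors total is the sum of the six cells.
baseline_cells = {"FF": 1, "MF": 1, "WL": 0, "WS": 0, "FC": 0, "MC": 0}
steered_cells = {"FF": 0, "MF": 1, "WL": 0, "WS": 0, "FC": 0, "MC": 0}
baseline = green_score(matched=3, significant_errors=sum(baseline_cells.values()))
steered = green_score(matched=4, significant_errors=sum(steered_cells.values()))
badge = category(steered - baseline)
```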
Filter buttons: Filter examples by category. "Improvements" combines large and moderate. Use "Regressions" to inspect failure cases.
How to read the columns: Start by viewing the X-ray image, then read the ground truth. Compare the baseline and steered columns: look for red strikethrough sentences in the baseline (hallucinations) that disappear in the steered version, and green sentences in the steered version (findings recovered by steering). In regression cases, look for new blue sentences that may represent hallucinations introduced by steering.
Steering configuration: All examples use: α=0.20 (steering strength), n=20 (top 20 features per layer), layers {8, 16, 20, 24}, V3 causal features, combined mode (suppress harmful + boost beneficial).

Key terminology

SAE (Sparse Autoencoder): A neural network that decomposes a dense hidden-state vector into a sparse set of features. Each feature is a learned direction in the model's residual stream. We use a TopK SAE with a dictionary of 32,768 features and sparsity k=64 (at most 64 features active per token).
Residual stream: The main information highway inside a transformer. At each layer, the model reads from and writes to this 4,096-dimensional vector. Our SAE decomposes this vector at layer 16.
Feature (SAE feature): A single learned direction in the SAE dictionary. When a feature "fires" (activates above threshold), it means the model's hidden state has a strong component along that direction at that token position.
Causal ablation: The method used to identify these 12 features. For each candidate feature, we run the model twice: once normally, once with the feature intervened on (boosted or zeroed out). The difference in clinical error count is the causal delta.
GREEN error types: Clinical errors scored by the GREEN evaluator (an LLM-based radiology judge). Four types:
FF (False Finding) — The report describes something not present in the image.
MF (Missing Finding) — The report omits something visible in the image.
WL (Wrong Location) — A finding is described in the wrong anatomical location.
WS (Wrong Severity) — A finding's severity is described incorrectly (e.g., "mild" vs. "moderate").
Steering: At inference time, we hook the model at layers {8, 16, 20, 24}. At each hook, we encode the hidden state through the SAE, edit the sparse code (zero out suppress features, amplify boost features), decode back, and add the difference to the residual stream. This nudges the model's generation without any fine-tuning.
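The encode, edit, decode, and add-the-difference steps can be sketched on a single hidden state with plain Python lists. This is a toy under stated assumptions, not the project's hook code: the encoder has no bias, and the way α enters (multiplying a boost feature's activation by 1 + α) is an assumption, since the text only calls α the steering strength.

```python
def encode(hidden, enc_rows, k):
    """Project onto encoder directions, then keep the k largest (TopK)."""
    pre = [sum(w * h for w, h in zip(row, hidden)) for row in enc_rows]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]

def decode(code, dec_rows):
    """dec_rows[i] is the decoder direction of feature i."""
    dim = len(dec_rows[0])
    return [sum(code[i] * dec_rows[i][d] for i in range(len(code)))
            for d in range(dim)]

def steer(hidden, enc_rows, dec_rows, suppress, boost, alpha, k):
    code = encode(hidden, enc_rows, k)
    edited = list(code)
    for i in suppress:
        edited[i] = 0.0                      # zero out suppress features
    for i in boost:
        edited[i] = code[i] * (1.0 + alpha)  # ASSUMPTION: amplify by (1 + alpha)
    before, after = decode(code, dec_rows), decode(edited, dec_rows)
    # Add only the change caused by the edit, not the full reconstruction.
    return [h + (a - b) for h, a, b in zip(hidden, after, before)]

# Toy 2-feature SAE with identity encoder and decoder.
rows = [[1.0, 0.0], [0.0, 1.0]]
steered = steer([1.0, 2.0], rows, rows, suppress=[0], boost=[1], alpha=0.20, k=2)
```

Adding only the difference between the edited and unedited reconstructions means the SAE's reconstruction error is never injected into the residual stream; with no suppress or boost features the hidden state passes through unchanged.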

