SAE Feature Explorer

RadVLM · Layer 16 · 12 headline causal features TopK D=32,768 k=64 · DUA-safe public examples

Select a feature from the sidebar
to inspect its activation profile and contexts.

Multi-Layer Steering Overview

Steering is applied simultaneously at layers {8, 16, 20, 24}, spanning the early-mid to late range of RadVLM's 36-layer backbone (roughly 22% to 67% depth). At each layer, a separate SAE (D=32,768, k=64) decomposes the residual stream, and causal screening identifies which features to suppress or boost. Below: the feature budget per layer, the top causally-impactful features, and a stacked composition view.

Feature budget composition
Legend: Suppress · Boost · Unselected (of 200 screened)
Layers differ in the balance of boost vs suppress features. Layer 8 is suppress-heavy (136 suppress, 64 boost), suggesting that early-mid representations carry many hallucination-promoting defaults. Layer 16 flips to 153 boost vs 47 suppress, so mid-layer features are rich in accuracy-improving directions. Layer 20 stays boost-leaning (130 boost, 70 suppress), and layer 24 returns to suppress-heavy (127 suppress, 73 boost). This asymmetry is a key motivation for multi-layer steering: no single layer captures both the features to silence and the features to amplify.

Why multi-layer?

Single-layer steering can only edit the residual stream at one depth. But hallucination-related computations are distributed across layers:

  • Layer 8: Encodes low-level visual-token binding. Has the most suppress features (136), suggesting early hallucination seeds.
  • Layer 16: Rich in clinical semantics. Highest boost count (153): many accuracy-promoting directions live here.
  • Layer 20: Transition layer where factual content consolidates. Boost-leaning mix (70 suppress, 130 boost).
  • Layer 24: Late feature integration before final layers. Suppress-heavy again (127 suppress, 73 boost).

The ablation study in the paper (§5 Results) confirms this: steering all four layers reduces errors more than steering any single layer or any subset of layers. The multi-layer approach addresses hallucination at multiple stages of the model's computation, from early binding errors to late-stage repetition.

Activation Density (firings per feature)

Axes: x = feature index; bar height = log10(firings)
Density spans nearly 3 orders of magnitude. Feature #1925 (incidental findings) fires on 167K tokens and appears in almost every report. Feature #20872 (osseous checklist) fires on only 440 tokens, about 380× fewer, which makes it highly selective. Suppress features tend to be denser than boost features, consistent with the idea that hallucination-promoting circuits are broadly active defaults that need to be turned off.
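The density span can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two firing counts quoted in the text; 167K is treated as 167,000, which is a rounded figure.

```python
import math

# Firing counts quoted in the text; 167K is rounded, so treat it as approximate.
firings = {"#1925": 167_000, "#20872": 440}

# Bar height on the density chart is log10(firings).
heights = {name: math.log10(count) for name, count in firings.items()}

# The span in orders of magnitude is the difference of the log heights.
span = heights["#1925"] - heights["#20872"]
ratio = firings["#1925"] / firings["#20872"]
print(f"ratio ~ {ratio:.0f}x, span ~ {span:.2f} orders of magnitude")
```

Working in log space is also why the chart uses log10 bar heights: on a linear axis the sparsest feature would be invisible next to the densest one.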

Pairwise Cosine Similarity (decoder directions)

All 12 features are nearly orthogonal (max |cos| = 0.11), which indicates that each feature captures a largely independent direction in the model's 4,096-dimensional residual stream. Steering multiple features simultaneously is therefore unlikely to cause interference, because their decoder directions barely overlap.
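The pairwise similarity matrix and the max |cos| summary can be computed as sketched below. This is an illustration under stated assumptions: the decoder directions are stored as rows of a NumPy array, the function names are hypothetical, and random vectors stand in for the real decoder weights.

```python
import numpy as np

def pairwise_cosine(decoder_dirs: np.ndarray) -> np.ndarray:
    """Cosine similarity between the rows of decoder_dirs (n_features, d_model)."""
    norms = np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    unit = decoder_dirs / norms
    return unit @ unit.T

def max_offdiag_abs(sim: np.ndarray) -> float:
    """Largest |cos| over distinct feature pairs (the diagonal is excluded)."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    return float(np.abs(sim[mask]).max())

# Placeholder data: 12 random directions in a 4,096-dimensional space.
# Real decoder directions would be loaded from the trained SAE instead.
rng = np.random.default_rng(0)
dirs = rng.standard_normal((12, 4096))
sim = pairwise_cosine(dirs)
print(f"max |cos| off the diagonal: {max_offdiag_abs(sim):.3f}")
```

Random directions in a space this large are themselves close to orthogonal, so comparing the observed maximum against such a random baseline is one way to put the 0.11 figure in context.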

Feature summary table

Columns: Feature · Role · Error type · Causal delta · Firings · Studies · Mean act. · Pattern

Steering Effect on Generated Reports

Side-by-side comparison of ground truth, baseline (unsteered) generation, and steered generation on test-set samples. Steering uses V3 causal features at layers {8, 16, 20, 24} with α=0.20 and top 20 features per layer. GREEN score: 1 = no clinically significant errors, 0 = no matched findings; higher is better.

Sentence tags: OK (matched) · FF (false finding) · MF (missing finding) · INS (insignificant)

Cross-model functional feature census

For each model pair, we compute Jaccard overlap and mean cosine similarity of decoder directions for top causally-implicated features. Universally functional directions (e.g. "anatomic checklist", "negation chain") tend to recur across models; model-specific harm-suppression features do not.

Cross-model alignment heatmap

Decoder cosine similarity between this model's 12 dashboard features and the RadVLM reference (only available when viewing CheXOne or LLaVA-Rad).

Switch to CheXOne or LLaVA-Rad to view alignment with RadVLM.

How to read this dashboard

This dashboard visualises 12 SAE features per model, discovered by Sparse Autoencoders (TopK, D=32,768, k=64) trained on residual-stream activations. The currently displayed model is shown in the header; switch using the Model dropdown to compare RadVLM (Qwen3-VL 8B), CheXOne (Qwen2.5-VL 3B), or LLaVA-Rad (Vicuna 7B). Each model has its own SAE, its own headline causal layer, and (where complete) its own per-feature mechanistic statistics. The dashboard has six tabs: Feature Inspector (single-layer deep-dive), Multi-Layer View (cross-layer summary), Overview & Similarity (density + geometry), Steering Examples (before/after comparisons), Cross-model Census (alignment between models), and this Guide.

Model selector and source badges

Header
Model
RadVLM (Qwen3-VL 8B) — 12 headline features at layer 16, selected by V3 single-feature causal ablation on ~3K validation samples. All five dashboard tabs are fully populated.
Feature source
V3 causal — features chosen by single-feature causal ablation; the most reliable selection method.

Feature Inspector tab

Feature Inspector
Sidebar (left panel)
Lists all 12 features. Each card shows the feature index, its role, the primary error type abbreviation, the causal delta, and how many studies it activates in. Use the filter buttons at the top to narrow by role or error type. Click any card to inspect it.
Filter buttons
Boost / Suppress — filter by the feature's role in steering. FF / MF / WL / WS — filter by the primary GREEN error type the feature targets (False Finding, Missing Finding, Wrong Location, Wrong Severity).
Role badge — Boost vs Suppress
Boost (green): Amplifying this feature's activation reduces clinical errors. The model generates more accurate reports when this direction is strengthened.
Suppress (red): Zeroing out this feature's activation reduces errors. This direction promotes hallucinations, so we silence it.
Causal delta (Δ)
The headline number in the top-right of the detail view. It is the average change in error count per report when this single feature is intervened on (boosted or suppressed) vs. the unsteered baseline, measured on 100 validation samples. The sign follows the feature's role, and in both cases the intervention removed errors: boost features show a positive Δ (green), and suppress features show a negative Δ (red), where silencing a harmful direction removed errors. Example: Δ=+0.30 means the model made 0.30 fewer errors per report on average when this feature was boosted.
Stats cards
Studies active — How many of the 3,307 IU-Xray reports triggered this feature at least once. High = broadly active; low = highly selective.
Total firings — Total number of token positions where the feature activated above threshold (2.0). One study can produce many firings across its generated tokens.
Mean activation — Average activation magnitude when the feature fires (above the 2.0 threshold). Higher values mean the feature fires more strongly.
Median position — Where in the generated report the feature typically fires, expressed as a percentage from start (0%) to end (100%). Features near 0% fire during the opening; features near 100% fire at the end.
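The four stats cards can be derived from per-token activation records, as in the sketch below. It is a minimal illustration, assuming each record carries a study identifier, a relative token position in [0, 1], and the feature's activation; the function name and record layout are hypothetical, and only the 2.0 threshold comes from the text.

```python
import numpy as np

THRESHOLD = 2.0  # firing threshold stated in the text

def feature_stats(study_ids, positions, activations, threshold=THRESHOLD):
    """Summarise one feature from per-token records.

    study_ids   : study identifier for each token position
    positions   : relative position of each token in its report, in [0, 1]
    activations : the feature's activation at each token position
    """
    study_ids = np.asarray(study_ids)
    positions = np.asarray(positions, dtype=float)
    activations = np.asarray(activations, dtype=float)

    fired = activations > threshold  # "above threshold" is read as strictly greater
    if not fired.any():
        return {"studies_active": 0, "total_firings": 0,
                "mean_activation": float("nan"), "median_position_pct": float("nan")}
    return {
        "studies_active": int(np.unique(study_ids[fired]).size),
        "total_firings": int(fired.sum()),
        "mean_activation": float(activations[fired].mean()),
        "median_position_pct": float(100 * np.median(positions[fired])),
    }

# Toy records, made up for illustration: five token positions from three studies.
stats = feature_stats(
    study_ids=["a", "a", "b", "c", "c"],
    positions=[0.10, 0.90, 0.50, 0.20, 0.80],
    activations=[3.0, 1.0, 5.0, 2.5, 0.5],
)
print(stats)
```

Note that the mean and the median position are both conditional on firing, so tokens below the threshold do not pull either statistic toward zero.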
Causal effect on all error types (4-cell grid)
Shows what happens to each GREEN error type when this feature is removed (zeroed out) from the residual stream. The sign convention comes directly from the causal screening: delta = ablated_errors − baseline_errors.

Green (+): removing the feature increased this error type → the feature was preventing these errors.
Red (−): removing the feature decreased this error type → the feature was causing these errors.

The cell with a blue border is the feature's primary error type (the one it was selected for).

Why some suppress features show green cells: a feature can simultaneously cause one type of error and prevent another. For example, feature #4240 causes wrong-location errors (WL=−0.09, so we suppress it) but also prevents false findings (FF=+0.20). This is a trade-off. In multi-feature steering, other boost features (like #5872, which prevents FF by +0.28) compensate for these side effects. The combined effect across 50 features per layer nets out positively on all error types.
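The sign convention of the 4-cell grid can be made concrete with a short sketch. It assumes per-report error counts are available as arrays with one column per error type; the function names and the toy counts are illustrative and are not taken from the screening code.

```python
import numpy as np

ERROR_TYPES = ("FF", "MF", "WL", "WS")

def causal_delta(baseline_errors, ablated_errors):
    """Mean per-report change in error count for each error type.

    Both inputs have shape (n_reports, 4), one column per entry of ERROR_TYPES.
    Sign convention from the text: delta = ablated_errors - baseline_errors.
    """
    baseline = np.asarray(baseline_errors, dtype=float)
    ablated = np.asarray(ablated_errors, dtype=float)
    delta = (ablated - baseline).mean(axis=0)
    return dict(zip(ERROR_TYPES, delta.tolist()))

def interpret(delta_value):
    """Map the sign of a delta onto the grid's colour coding."""
    if delta_value > 0:
        return "green: the feature was preventing these errors"
    if delta_value < 0:
        return "red: the feature was causing these errors"
    return "no measured effect"

# Toy counts for two reports (made up for illustration).
baseline = [[1, 0, 1, 0], [0, 1, 1, 0]]
ablated = [[2, 0, 0, 0], [0, 1, 1, 0]]
for name, value in causal_delta(baseline, ablated).items():
    print(name, value, interpret(value))
```

In this toy case, removing the feature adds false findings but removes wrong-location errors, which is the same kind of trade-off described for feature #4240.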
Report position distribution (bar)
A horizontal bar showing the fraction of this feature's firings that occur in the early vs. late half of the report. "Early" = first half of generated tokens; "Late" = second half. Example: a feature with "87% late" fires almost exclusively toward the end of reports (often on repetition or closing statements).
Top activating contexts
The 3 token positions where this feature activated most strongly across the validation set. Each card shows an anonymized case ID, the activation magnitude, and a text snippet around the activating token. These snippets help you form an intuitive hypothesis about what the feature detects (e.g., support-device language, negation chains, repetition loops).
Per-token activation highlighting (in “Top activating contexts”)
When per-token activation data is loaded, each token in the example snippet is shaded by its SAE pre-activation (continuous projection onto the encoder direction, before TopK gating). Red = positive activation (feature direction aligned with this token's hidden state); blue = negative (anti-aligned). Saturation encodes magnitude. The most-activating tokens (often a noun phrase or anatomical term) appear most saturated.

We use pre-activations rather than post-TopK latents because the text-only replay of a snippet rarely puts a specific feature in the top-64 active features (which are dominated by image-conditioned context during actual generation). Pre-activations preserve the alignment signal at every token.

The replay peak badge reports the peak pre-activation observed during this fresh forward pass. The orig. max in the header is the original post-TopK activation captured during full image-conditioned generation; the two are on different scales and not directly comparable.
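The difference between pre-activations and post-TopK latents can be illustrated with a toy encoder. This is a minimal sketch, assuming a plain affine encoder (h @ W_enc + b_enc) followed by TopK gating; the function names, toy sizes, and random weights are illustrative and are not the dashboard's SAE code.

```python
import numpy as np

def pre_activations(h, W_enc, b_enc):
    """Continuous projection of each hidden state onto every encoder direction."""
    return h @ W_enc + b_enc

def topk_latents(pre, k):
    """Keep the k largest pre-activations per token and zero out the rest."""
    latents = np.zeros_like(pre)
    top = np.argpartition(pre, -k, axis=-1)[..., -k:]
    values = np.take_along_axis(pre, top, axis=-1)
    np.put_along_axis(latents, top, values, axis=-1)
    return latents

rng = np.random.default_rng(0)
d_model, n_features, k = 8, 16, 4   # toy sizes; the text uses 4,096 / 32,768 / 64
W_enc = rng.standard_normal((d_model, n_features))
b_enc = np.zeros(n_features)
h = rng.standard_normal((5, d_model))  # hidden states at 5 token positions

pre = pre_activations(h, W_enc, b_enc)
post = topk_latents(pre, k)

feature = 3  # arbitrary feature index for the demonstration
print("pre-activation per token :", np.round(pre[:, feature], 2))
print("post-TopK latent per token:", np.round(post[:, feature], 2))
```

The post-TopK row is zero wherever the feature misses the top k, while the pre-activation row carries a signed value at every token, which is what the red and blue shading relies on.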
Activation distribution (histogram)
Shows the approximate distribution of activation magnitudes for this feature, conditional on firing above the 2.0 threshold. The shape is fitted from observed statistics (mean activation, max activation, number of firings) across MIMIC-CXR validation (SAE training set); example snippets below are from IU-Xray.

The blue dashed line marks the mean activation. Green bars = boost features; red bars = suppress features.

Stats below the chart: mean activation (conditional on firing), maximum observed activation, total number of firings (token positions), and number of distinct studies where the feature activated. A high max relative to the mean suggests the feature fires very strongly on specific token patterns.
Conditional activation — pre-activation strength by error type
For each of the four GREEN error types (FF, MF, WL, WS), this panel shows the mean SAE pre-activation (continuous projection onto the encoder direction, before TopK gating) computed on the mean-pooled hidden state of each of the 3,000 training samples, split into two groups:
w/ err — reports that contain at least one error of this type;
no err — reports without any error of this type.

We use pre-activations rather than post-TopK firing rates because mean-pooled samples rarely place a specific feature in the top-64 active features, but the continuous projection still captures how strongly the feature direction aligns with the sample's hidden state.

The right column reports Cohen's d, the standardised effect size of the with-error vs no-error difference (mean difference divided by overall standard deviation). d > +0.10 (red) means the feature is more active on erroneous reports — consistent with a hallucination-promoting role. d < −0.10 (green) means depletion on errors — consistent with a clean-output role. This is independent of the causal screening and provides a complementary check: features whose conditional activation matches their assigned role have stronger interpretive support.
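The effect size and its colour coding can be sketched as follows. By default the sketch divides by the overall standard deviation of all samples, as the text describes; the pooled within-group standard deviation of the textbook Cohen's d is offered as an option. The function names and the "neutral" label for |d| ≤ 0.10 are assumptions.

```python
import numpy as np

def cohens_d(with_err, no_err, pooled=False):
    """Standardised difference in mean pre-activation between the two groups."""
    with_err = np.asarray(with_err, dtype=float)
    no_err = np.asarray(no_err, dtype=float)
    diff = with_err.mean() - no_err.mean()
    if pooled:
        # Textbook variant: pooled within-group standard deviation.
        n1, n2 = with_err.size, no_err.size
        var = ((n1 - 1) * with_err.var(ddof=1)
               + (n2 - 1) * no_err.var(ddof=1)) / (n1 + n2 - 2)
        sd = np.sqrt(var)
    else:
        # Variant described in the text: standard deviation over all samples.
        sd = np.concatenate([with_err, no_err]).std(ddof=1)
    return float(diff / sd)

def colour(d, cutoff=0.10):
    """Colour rule from the text; 'neutral' is an assumed label for small |d|."""
    if d > cutoff:
        return "red"      # more active on erroneous reports
    if d < -cutoff:
        return "green"    # depleted on erroneous reports
    return "neutral"

# Toy pre-activations, made up for illustration.
d = cohens_d(with_err=[2.0, 3.0, 4.0], no_err=[1.0, 2.0, 3.0])
print(round(d, 3), colour(d))
```

The two variants give similar values when the group means are close, and diverge as the mean difference grows, because the overall standard deviation then absorbs part of the between-group gap.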
Position distribution within report
A 10-bin histogram showing where this feature tends to fire in the generated report, from the first token (0%) to the last token (100%). Reconstructed from the median position (p50) and the early/late-half split using a smoothed Gaussian fit.

Features that fire near the start (e.g., #20872 at 17%) typically activate on opening anatomy mentions; features that fire late (e.g., #15810 at 87%) often correspond to repetition loops or closing statements.
Feature geometry — decoder cosine similarity
For each feature, shows its 3 nearest and 3 most dissimilar neighbours among the 12 curated features. Similarity is measured by cosine similarity between decoder weight vectors — the learned directions in the 4,096-dimensional residual stream.

Low cosine values (max |cos| = 0.11 among these 12 features) indicate that the features capture largely independent aspects of the model's computation. This geometric independence is important: it suggests that combining all 12 features for steering is unlikely to create redundant or conflicting interventions, because each feature steers a different direction in residual space.

Multi-Layer View tab

Multi-Layer
Layer cards
One card per steered layer (8, 16, 20, 24). Each shows:
Depth — Where in the 36-layer backbone this layer sits (e.g., layer 16 = 44% depth).
Screened / suppress / boost counts — Of the 200 candidate features screened by causal ablation at this layer, how many were classified as suppress (promoting hallucinations) vs boost (improving accuracy).
Top 5 chips — The 5 most impactful features for each role, ranked by causal magnitude. The number after # is the feature index in the SAE dictionary; the small number is the causal magnitude (how much the model's output changed when this feature was intervened on). Hover for exact quality-change values.
Feature budget composition (stacked bar chart)
A horizontal stacked bar for each layer showing the proportion of screened features classified as suppress (red) vs boost (green). This visualises the key insight: layers have different suppress/boost ratios. Layer 8 is suppress-heavy (early hallucination seeds), layer 16 is boost-heavy (clinical semantics), and layer 24 returns to suppress-heavy (output-level errors).
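The bar proportions and layer depths follow directly from the counts quoted on this page. A minimal sketch, using only those counts; the dictionary layout and function name are illustrative.

```python
# Suppress/boost counts per layer as quoted in the text (200 screened per layer).
BUDGET = {8: (136, 64), 16: (47, 153), 20: (70, 130), 24: (127, 73)}
N_LAYERS = 36
SCREENED = 200

def composition(budget=BUDGET, screened=SCREENED, n_layers=N_LAYERS):
    """Return one row per layer with depth and stacked-bar fractions."""
    rows = []
    for layer, (suppress, boost) in sorted(budget.items()):
        rows.append({
            "layer": layer,
            "depth_pct": round(100 * layer / n_layers),
            "suppress_frac": suppress / screened,
            "boost_frac": boost / screened,
            "unselected": screened - suppress - boost,
        })
    return rows

rows = composition()
for row in rows:
    print(row)
```

With these counts every screened feature is assigned a role, so the unselected segment is empty at all four layers.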
Why multi-layer?
A text explanation of why single-layer steering is insufficient and how the four chosen layers each contribute differently to hallucination mitigation.

Overview & Similarity tab

Overview
Activation density chart
A bar chart of total firings for each feature, sorted from most to least active. Bar height uses a log10 scale (otherwise the densest feature would dwarf all others). The number above each bar is the raw count. Green bars = boost features; red bars = suppress features. Click any bar to jump to that feature in the inspector.
Pairwise cosine similarity heatmap
Each cell shows the cosine similarity between two features' SAE decoder columns (the directions in residual-stream space). Diagonal cells are always 1.0 (a feature is identical to itself). Off-diagonal values near 0 mean the two features point in unrelated directions — they capture independent aspects of the model's computation. Green = positive similarity; red = negative (anti-correlated). Hover over any cell for the exact value.
Feature summary table
A compact reference table with all 12 features in one view: index, role, error type, causal delta, firings, studies, mean activation, and the hypothesised pattern. Click any row to jump to that feature in the inspector tab.

Steering Examples tab

Steering
What is shown
Each card shows a chest X-ray image (IU-Xray chest radiograph) alongside three report texts: the radiologist-written ground truth (reference), the model's baseline (unsteered) generation, and the steered generation after SAE intervention. Examples are drawn from the 3,314-sample held-out test set.
GREEN error highlighting
Each sentence in the baseline and steered reports is classified by comparing it to the ground truth (GT) reference. Sentences are color-coded by the 6 GREEN error categories, matching the official GREEN taxonomy:

Matched: sentence agrees with the ground truth (word overlap >50%).
FF (False Finding): sentence reports a finding not present in the ground truth (hallucination).
MF (Missing Finding): a ground-truth finding is absent from the report. Shown as a “⚠ Missing from report” block below the text.
WL (Wrong Location): the finding matches the GT but specifies an incorrect anatomical location (e.g. “right” vs “left”).
WS (Wrong Severity): the finding matches the GT but uses a different severity descriptor (e.g. “small” vs “large”).
FC (False Comparison): the sentence mentions a temporal comparison (e.g. “compared to prior”) that is not in the ground truth.
MC (Missing Comparison): the ground truth has a temporal comparison that is absent from the report.

Insignificant: the GREEN model classified this as a clinically insignificant error (does not affect the GREEN score).

How it works: Annotations are pre-computed from the GREEN model’s raw output (StanfordAIMI/GREEN-radllama2-7b). Each error description from the GREEN response is fuzzy-matched to the corresponding sentence in the report. This is far more accurate than simple word-overlap heuristics, since it uses the GREEN model’s actual clinical judgment.
GREEN scores and delta
The headline Baseline / Steered GREEN values and ΔGREEN are the official per-sample scores from the frozen radvlm_v2_pertoken test run (same .npy arrays as the paper). GREEN = matched / (matched + significant errors), so higher is better. The Matched column is a sentence-level heuristic (count of OK-classified sentences), not GREEN’s internal matched-finding count. The Sig. errors total is always the sum of the six FF–MC cells below it, recomputed by scripts/annotate_steer_examples.py so it cannot drift out of sync with the row. The category badge uses only ΔGREEN: large improvement if >0.05, moderate if in (0, 0.05], no change if ~0, regression if <0.
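The score formula and the Sig. errors total can be written out as a short sketch. The function names are hypothetical, and returning 0.0 when there are no matched findings is an assumption made here to avoid dividing by zero.

```python
ERROR_TYPES = ("FF", "MF", "WL", "WS", "FC", "MC")

def sig_errors(counts):
    """Sum of the six significant-error cells (missing keys count as zero)."""
    return sum(counts.get(name, 0) for name in ERROR_TYPES)

def green_score(matched, counts):
    """GREEN = matched / (matched + significant errors), as given in the text.

    Assumption: a report with no matched findings scores 0.0, which also
    covers the case where both terms are zero.
    """
    if matched == 0:
        return 0.0
    return matched / (matched + sig_errors(counts))

# Toy example: three matched findings and one false finding.
example = {"FF": 1}
print(sig_errors(example), green_score(3, example))
```

Because the dashboard's Matched column is a sentence-level heuristic, plugging it into this formula will not necessarily reproduce the official per-sample GREEN values shown in the headline.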
Per-error-type breakdown
The six FF–MC columns are a heuristic proxy (sentence match vs. reference), not a second GREEN API call. The 6 labels follow the GREEN taxonomy:
FF — False Finding: hallucinated finding not in reference
MF — Missing Finding: finding from reference omitted in report
WL — Wrong Location: finding reported at incorrect anatomy
WS — Wrong Severity: finding with incorrect severity descriptor
FC — False Comparison: fabricated temporal comparison
MC — Missing Comparison: omitted temporal comparison from reference
Red counts indicate errors present; gray zeros indicate no errors of that type. The Σsig column repeats the sum of those six cells (same as the Sig. errors headline).
Category badges
Large improvement — ΔGREEN > 0.05.
Moderate improvement — 0 < ΔGREEN ≤ 0.05.
No change — ΔGREEN is effectively zero.
Regression — ΔGREEN < 0.
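The badge thresholds above amount to a small decision rule. In this sketch the tolerance eps for "effectively zero" is a hypothetical value, since the text does not state one, and the function name is illustrative.

```python
def category(delta_green, eps=1e-6):
    """Assign a category badge from ΔGREEN alone.

    eps is an assumed tolerance for 'effectively zero'; the text gives none.
    """
    if abs(delta_green) <= eps:
        return "No change"
    if delta_green < 0:
        return "Regression"
    if delta_green > 0.05:
        return "Large improvement"
    return "Moderate improvement"

for delta in (0.12, 0.05, 0.0, -0.03):
    print(delta, category(delta))
```

The boundary value 0.05 falls in the moderate band, matching the half-open interval (0, 0.05] given for that badge.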
Filter buttons
Filter examples by category. "Improvements" combines large and moderate. Use "Regressions" to inspect failure cases.
How to read the columns
Start by viewing the X-ray image, then read the ground truth. Compare the baseline and steered columns: look for red strikethrough sentences in the baseline (hallucinations) that disappear in the steered version, and green sentences in the steered version (findings recovered by steering). In regression cases, look for new blue sentences that may represent hallucinations introduced by steering.
Steering configuration
All examples use: α=0.20 (steering strength), n=20 (top 20 features per layer), layers {8, 16, 20, 24}, V3 causal features, combined mode (suppress harmful + boost beneficial).

Key terminology

SAE (Sparse Autoencoder)
A neural network that decomposes a dense hidden-state vector into a sparse set of features. Each feature is a learned direction in the model's residual stream. We use a TopK SAE with a dictionary of 32,768 features and sparsity k=64 (at most 64 features active per token).
Residual stream
The main information highway inside a transformer. At each layer, the model reads from and writes to this 4,096-dimensional vector. The headline SAE decomposes this vector at layer 16; for multi-layer steering, separate SAEs decompose it at layers 8, 16, 20, and 24.
Feature (SAE feature)
A single learned direction in the SAE dictionary. When a feature "fires" (activates above threshold), it means the model's hidden state has a strong component along that direction at that token position.
Causal ablation
The method used to identify these 12 features. For each candidate feature, we run the model twice: once normally, once with the feature intervened on (boosted or zeroed out). The difference in clinical error count is the causal delta.
GREEN error types
Clinical errors scored by the GREEN evaluator (an LLM-based radiology judge), which compares the generated report with the reference report. The feature-level panels track four types (the Steering Examples tab additionally reports FC and MC):
FF (False Finding): The report describes a finding that is not present in the reference report.
MF (Missing Finding): The report omits a finding that is present in the reference report.
WL (Wrong Location): A finding is described in the wrong anatomical location.
WS (Wrong Severity): A finding's severity is described incorrectly (e.g., "mild" vs. "moderate").
Steering
At inference time, we hook the model at layers {8, 16, 20, 24}. At each hook, we encode the hidden state through the SAE, edit the sparse code (zero out suppress features, amplify boost features), decode back, and add the difference to the residual stream. This nudges the model's generation without any fine-tuning.
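The encode, edit, decode, and add-the-difference steps can be sketched for a single token position as follows. This is an illustration under loud assumptions, not the project's hook code: the SAE is taken to be an affine encoder with TopK gating and a linear decoder, α is assumed to scale active boost features by (1 + α), and the function name and argument layout are hypothetical.

```python
import numpy as np

def steer(h, W_enc, b_enc, W_dec, suppress, boost, alpha=0.20, k=64):
    """Return the edited residual-stream vector for one token position.

    h        : (d_model,) hidden state at the hooked layer
    W_enc    : (d_model, n_features) encoder weights; b_enc: (n_features,)
    W_dec    : (n_features, d_model) decoder weights (rows are feature directions)
    suppress : indices of features to zero out
    boost    : indices of features to amplify
    """
    suppress = np.asarray(suppress, dtype=int)
    boost = np.asarray(boost, dtype=int)

    # Encode with TopK gating: keep only the k largest pre-activations.
    pre = h @ W_enc + b_enc
    z = np.zeros_like(pre)
    top = np.argpartition(pre, -k)[-k:]
    z[top] = pre[top]

    # Edit the sparse code.
    z_edit = z.copy()
    z_edit[suppress] = 0.0
    z_edit[boost] *= 1.0 + alpha  # ASSUMPTION: alpha scales active boost features

    # The decoder is linear, so decode(z_edit) - decode(z) = (z_edit - z) @ W_dec
    # and any decoder bias cancels. Adding only this difference leaves the part
    # of h that the SAE fails to reconstruct untouched.
    return h + (z_edit - z) @ W_dec

# Toy check with identity weights so the arithmetic is easy to follow.
W = np.eye(3)
h = np.array([3.0, 2.0, 1.0])
steered = steer(h, W, np.zeros(3), W, suppress=[0], boost=[1], alpha=0.20, k=2)
print(steered)
```

Under the multiplicative assumption, a boost feature that is not in the top k at a given token is left at zero, so the edit only strengthens directions the model is already using. An additive boost would behave differently; the text does not say which form is used.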