Aletheia — Experiment Walkthrough

Mechanistic interpretability audit of Phikon-v2 on colorectal cancer tissue
Model: Phikon-v2 (Owkin, ViT-L)
Dataset: NCT-CRC-HE · 9,999 patches
SAE: 4,096 features
Date: March 2026

1. What I built

9,999 patches analysed · 9 tissue classes · 4,096 SAE features · 410 biology features · 1,674 artifact features

Pathology AI models can score well on benchmarks while relying on features that have nothing to do with biology. A model trained on tissue from a referral hospital using a Hamamatsu scanner might learn "Hamamatsu = aggressive cancer," not because of the tissue, but because the referral centre happens to receive harder cases. Standard explainability tools (Grad-CAM, SHAP) show where a model looks on an image. They cannot tell you what concept the model has formed internally, or whether that concept is a genuine biological signal or a shortcut.

This experiment asks: if we decompose a pathology foundation model's internal representations into individual interpretable features using sparse autoencoders, can we identify which features encode real tissue morphology and which encode staining or scanner artefacts? And can we prove the difference causally, not just by correlation?

The pipeline extracts CLS-token embeddings (1,024 dimensions) from Phikon-v2, trains a sparse autoencoder to decompose them into 4,096 features, then classifies each feature by tissue specificity (ANOVA F-statistic) and staining sensitivity (Pearson correlation with colour properties). 258 features satisfy both thresholds. These are entangled: they encode tissue identity partly through appearance properties. This is the core finding. It is exactly the kind of confound that would cause silent failures in cross-site deployment.
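The feature-triage step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the actual pipeline: the thresholds, array shapes, and the `stain` proxy (e.g. a mean stain-intensity statistic per patch) are all assumptions, and real SAE activations would replace the random arrays.

```python
# Sketch of classifying SAE features by tissue specificity (ANOVA F) and
# staining sensitivity (Pearson r). Data and thresholds are illustrative.
import numpy as np
from scipy.stats import f_oneway, pearsonr

rng = np.random.default_rng(0)
n_patches, n_features, n_classes = 600, 32, 9

labels = rng.integers(0, n_classes, n_patches)  # tissue class per patch
stain = rng.normal(size=n_patches)              # e.g. mean stain intensity
acts = rng.random((n_patches, n_features))      # stand-in SAE activations
acts[:, 0] += (labels == 3) * 2.0               # a tissue-specific feature
acts[:, 1] += stain * 1.5                       # a staining-driven feature

F_THRESH, R_THRESH = 50.0, 0.4                  # illustrative cut-offs

def classify(feat):
    groups = [feat[labels == c] for c in range(n_classes)]
    f_stat, _ = f_oneway(*groups)               # tissue specificity
    r, _ = pearsonr(feat, stain)                # staining sensitivity
    if f_stat > F_THRESH and abs(r) < R_THRESH:
        return "biology"
    if abs(r) >= R_THRESH:
        return "artifact"
    return "unclassified"

kinds = [classify(acts[:, j]) for j in range(n_features)]
print(kinds[0], kinds[1])  # → biology artifact
```

Features that clear both thresholds at once would land in the entangled set the text describes: tissue-specific, but partly through appearance.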

Biological features: what the model learned about tissue

These are the clean biology features: high tissue-type specificity, low correlation with staining properties. Each row shows the eight most-activating patches for a single SAE feature.

Clinical biology features
SAE features detecting clinically relevant tissue morphology. Top rows: normal colon mucosa (gland architecture, goblet cells, crypt structure). Middle rows: smooth muscle (fibre patterns, elongated nuclei). Bottom rows: lymphocytes (dense immune cell infiltration). These features correspond to concepts a pathologist would recognise.

The model has learned genuine histological concepts. The normal mucosa features and the smooth muscle features fire on entirely different patches, glandular architecture for one and fibre patterns for the other: different morphology, different SAE features.

Artifact-sensitive features: what the model learned about staining

These features tell a different story. They correlate strongly with image-level colour and staining properties rather than tissue morphology.

Artifact-sensitive features
SAE features with strong staining/colour correlation. Note the mixed tissue types within each row: tumour, lymphocytes, mucosa, muscle all appear together. What they share is staining intensity, not morphology. These features encode tissue identity through appearance shortcuts.

Look at the patches within each row. They are a mix of tissue types (tumour, lymphocytes, normal mucosa, smooth muscle) but they share intense staining. The model has learned a second classification pathway: "dark staining = dense tissue." Within one laboratory's staining protocol, this correlation holds. Across laboratories with different protocols, it breaks.

Ablation study: proving which features matter

Finding features and labelling them is not enough. The question is whether the model actually depends on them. The ablation study answers this by zeroing out matched sets of 79 clean biology features and 79 artifact-sensitive features, then measuring the impact on a tissue classification probe.
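The ablation protocol can be sketched in a few lines. This is a synthetic stand-in, not the real experiment: the feature counts, effect sizes, and the logistic-regression probe are assumptions chosen only to illustrate zeroing a matched feature set and re-scoring the probe.

```python
# Minimal sketch of matched-set ablation against a linear probe.
# Synthetic data; "bio" and "art" stand in for the 79-feature sets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, d = 1200, 20
labels = rng.integers(0, 4, n)
acts = rng.normal(size=(n, d))
bio = list(range(0, 5))               # stand-in biology feature set
art = list(range(5, 10))              # stand-in artifact feature set
for c in range(4):                    # make both families predictive
    acts[labels == c, c] += 3.0       # strong morphology pathway
    acts[labels == c, 5 + c] += 1.5   # weaker appearance pathway

def probe_acc(X):
    Xtr, Xte, ytr, yte = train_test_split(X, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return clf.score(Xte, yte)

def ablate(X, idx):
    X = X.copy()
    X[:, idx] = 0.0                   # zero out one matched feature set
    return X

print(f"baseline      {probe_acc(acts):.3f}")
print(f"zero biology  {probe_acc(ablate(acts, bio)):.3f}")
print(f"zero artifact {probe_acc(ablate(acts, art)):.3f}")
```

The asymmetry in the printout mirrors the finding below: ablating the dominant pathway costs far more accuracy than ablating the shortcut pathway, even though both carry signal.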

Ablation study results
Matched ablation: 79 biology features vs 79 artifact features. Combined baseline: 97.8%. Biology-only: 97.0%. Artifact-only: 91.2%. Zero biology: 48.1%. Zero artifact: 87.9%.
What this means: The 79 biology features are close to sufficient on their own, carrying 97% accuracy. Remove them and the model collapses to 48%. The artifact features are not noise: they carry 91.2% accuracy. But they encode tissue identity through appearance properties specific to this dataset's staining protocol. The model has two classification pathways, one based on morphology and one based on staining. Standard evaluation metrics cannot distinguish between them.

Representation-level follow-up

The ablation study operates at the probe level: it measures how much a downstream classifier depends on each feature set. To confirm that the shortcut features are part of the model's learned representation (and not just a quirk of the probe), I ran a representation-level patching experiment. I zeroed feature families inside the SAE's reconstruction of the CLS embedding, then re-probed.
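The patching operation itself is simple: zero a feature family inside the SAE code, decode back to embedding space, and feed the result to the same frozen probe. A minimal sketch, assuming a trained SAE with decoder weights `W_dec` and bias `b_dec` (random stand-ins here):

```python
# Sketch of representation-level patching inside the SAE reconstruction.
# W_dec, b_dec, and acts are random stand-ins for a trained SAE's
# decoder and its sparse codes for a batch of CLS embeddings.
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae = 64, 256
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)
acts = np.maximum(rng.normal(size=(10, d_sae)), 0.0)  # sparse-ish codes

def reconstruct(acts, patched=()):
    a = acts.copy()
    a[:, list(patched)] = 0.0     # zero one feature family in the SAE code
    return a @ W_dec + b_dec      # decode back to CLS-embedding space

full = reconstruct(acts)
patched = reconstruct(acts, patched=range(32))  # e.g. the biology family
shift = np.linalg.norm(full - patched, axis=1).mean()
print(f"mean embedding shift after patching 32 features: {shift:.3f}")
```

Both reconstructions then go through the identical frozen probe; the accuracy difference is attributable only to the patched family.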

Activation patching results
Representation-level patching on the final Phikon-v2 CLS embedding. Raw: 99.12%. SAE reconstruction: 99.25%. Patch bio: 99.10% (-0.2 pts). Patch artifact: 99.30%. Patch random: 99.20%. The shortcut features are embedded in the representation itself, not an artefact of the probe.

Patching the biology features produces a small drop (99.10%). Patching the artifact features does not (99.30%). The artifact-sensitive features are part of the model's actual learned representation, not a downstream classification artefact.

Cross-cluster stability

I also clustered patches by inferred staining style (k-means on colour statistics) and measured how much each feature family's activation shifts across clusters.
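The stability metric can be sketched as follows. This is an illustrative reconstruction, not the exact analysis: the colour statistics, cluster count, and the use of the largest pairwise Cohen's d per feature are assumptions.

```python
# Sketch of the cross-cluster stability check: k-means on per-patch colour
# statistics, then effect sizes of each feature across clusters. Synthetic data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n, k = 900, 3
colour_stats = rng.normal(size=(n, 3))   # e.g. mean RGB / optical-density stats
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(colour_stats)

stable = rng.normal(size=n)              # biology-like feature
shifted = rng.normal(size=n) + clusters * 1.0  # staining-driven feature

def max_cohens_d(feat, clusters):
    """Largest |Cohen's d| between any pair of staining clusters."""
    ds = []
    for a in range(k):
        for b in range(a + 1, k):
            xa, xb = feat[clusters == a], feat[clusters == b]
            pooled = np.sqrt((xa.var(ddof=1) + xb.var(ddof=1)) / 2)
            ds.append(abs(xa.mean() - xb.mean()) / pooled)
    return max(ds)

print(f"stable feature  d={max_cohens_d(stable, clusters):.2f}")
print(f"shifted feature d={max_cohens_d(shifted, clusters):.2f}")
```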

Cross-site stability analysis
Left: absolute Cohen's d across inferred staining clusters. Biology features: median 0.21. Artifact features: median 0.78, nearly 4x larger. Middle/right: per-feature mean activation across clusters. Biology features cluster near the diagonal (stable). Artifact features scatter widely (unstable across staining conditions).

Biology features are stable across staining conditions (median shift: 0.21 Cohen's d). Artifact features are not (median shift: 0.78). This corroborates the ablation finding: the staining-dependent pathway would behave differently at a hospital with a different staining protocol.

Verification methodology

This audit follows the two-stage architecture from AutoMechInterp, an open-source verification framework I built for mechanistic interpretability (26 methodology iterations; docs at fcistud.github.io/mechanistic-interpretability).

The idea is simple: discovery and verification have to be separate steps. The SAE discovers features. That is hypothesis generation. The ablation study, activation patching, and stability analysis are verification. AutoMechInterp enforces this separation through a deterministic stage-gate with 15 evaluation gates, mandatory negative controls per component type, robustness checks, multiplicity correction, and evidence tiers (cross-model confirmed, single-model confirmed, causal-tested unstable, suggestive, rejected).

Discovery stage
SAE extracts 4,096 features from Phikon-v2's CLS representations. Each feature is a hypothesis about what the model has learned.
Verification stage
Ablation, activation patching, negative controls, stability analysis. Only features that survive verification are treated as confirmed.
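The gate logic can be sketched as a small decision rule. This is a simplified illustration of the stage-gate idea, not AutoMechInterp's actual API: the check names and the collapsed tier labels are assumptions.

```python
# Illustrative stage-gate: a feature only earns a high evidence tier if it
# survives causal tests AND its negative control passes. Simplified tiers.
from dataclasses import dataclass

@dataclass
class Checks:
    ablation_causal: bool       # probe accuracy drops when the feature is ablated
    patching_causal: bool       # effect survives representation-level patching
    negative_control: bool      # matched random-feature control shows no effect
    stable_across_sites: bool   # activation stable across staining clusters

def evidence_tier(c: Checks) -> str:
    if not c.negative_control:
        return "rejected"       # controls must pass before anything counts
    if c.ablation_causal and c.patching_causal:
        return "confirmed" if c.stable_across_sites else "causal-tested unstable"
    if c.ablation_causal or c.patching_causal:
        return "suggestive"
    return "rejected"

print(evidence_tier(Checks(True, True, True, False)))  # → causal-tested unstable
```

The key design choice, as the text argues, is that discovery output never feeds directly into conclusions: every feature passes through the verification rule or it stays a hypothesis.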

Where this is going: the Aletheia product

I'm packaging this verification workflow into a product that pathology AI companies can use for regulatory compliance. Below is the product prototype: an interactive assurance workspace showing the full audit workflow from feature atlas through causal validation to regulatory compliance mapping.

Aletheia product prototype - audit overview
Aletheia product prototype - feature atlas

Under the EU AI Act, high-risk medical AI systems require interpretability evidence by August 2027. This is what that evidence workflow looks like: feature-level assurance with causal validation, mapped to Article 13, MHRA AIaMD, and FDA SaMD requirements.

Limitations and next steps

This is a single-dataset demonstration audit, not a multi-site clinical validation study.

Current scope: Probe-level ablation on one model, one dataset, one cancer type. The activation patching follow-up is representation-level but not a full-stack causal intervention on the end-to-end clinical system. The stability analysis uses inferred staining clusters, not verified institution labels. Feature labels are statistical associations, not pathologist-validated annotations.

Next steps: Generalise across pathology foundation models with different architectures (UNI ViT-L, Prov-GigaPath ViT-g, CONCH ViT-B). Implement full model-level activation patching via hooks. Run cross-site validation on multi-hospital datasets with verified institution metadata. Add expert-in-the-loop feature validation with clinical ontology mapping (SNOMED-CT, UBERON). Run pilot audits on real vendor models under NDA.

The entire pipeline runs on a single GPU for under £10 of compute. Scaling it into a real assurance service requires running across model architectures and getting access to pathology AI companies willing to pilot.