Design-partner audits now open

Know what your model
is actually doing —
before you ship it.

Aletheia opens up foundation model internals to find shortcuts, confounders, and unstable reasoning — then packages what it finds into evidence your QA, regulatory, and ML teams can act on.

4,096
Sparse features per audit
410
Biological features identified
9
Tissue classes covered
<1 wk
Intake to evidence pack

Strong benchmarks hide weak reasoning.

Pathology foundation models pass every benchmark thrown at them. But benchmarks don’t tell you whether the model is relying on gland architecture or staining darkness. That distinction matters when a new lab’s slides look different from training data.

Two pathways, one metric

Our experiments show models encode both a robust morphology pathway and a brittle stain-sensitive pathway. Standard AUC conflates them: you can't tell how much of a 0.97 AUC comes from biology and how much from color shortcuts.

Silent failure across labs

Staining protocols, scanner manufacturers, and tissue processing differ between hospitals. Features that correlate with tissue type at one site become noise at another. Distribution shift is invisible until a patient is misclassified.

Evidence that regulators can read

EU AI Act Article 13 requires transparency evidence for high-risk medical AI from August 2026. SHAP values and attention maps don’t answer the question regulators are asking: what concepts has this model learned, and are they sound?

Internal review is not neutral

The team that built the model is not well-positioned to audit its reasoning. Confirmation bias is structural. An independent interpretability audit provides evidence that stands up to external scrutiny.

From model internals to release-ready evidence.

Aletheia combines sparse autoencoder analysis with domain-specific stress tests and a structured evidence workflow. The output is a decision artifact, not a dashboard.

01
Intake & scoping
Define the safety claim, confirm model access, validate cohort breadth. Not every claim is testable — we figure that out before burning compute.
Semi-automated · 0.5–1 day
02
Representation extraction
Run validation cohorts through target layers and store the activations. These are the raw material for feature discovery — what the model has actually encoded.
Automated · Hours
03
Sparse feature discovery
Train sparse autoencoders on internal representations. Each learned feature maps to a single concept — a tissue type, a morphological pattern, or a staining artifact.
Automated · Hours–days
04
Biological triage
Classify every feature: is it biological signal or preparation artifact? Map clean features to tissue ontologies. Flag artifact-correlated features for causal testing.
Semi-automated · 1–2 days
05
Ablation & causal validation
Zero out specific features and measure the downstream effect. This separates features the model depends on from features that are just along for the ride.
Automated · 1–3 days
06
Evidence pack export
Structured report with feature atlas, ablation results, confounder register, cross-site stability analysis, and regulatory mapping. Human sign-off required.
Template-driven · <1 day
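Steps 02–05 above can be sketched in a few dozen lines. This is a minimal illustration, not the production pipeline: a tiny sparse autoencoder with a hand-written SGD step, trained on synthetic stand-ins for stored layer activations (all shapes and hyperparameters here are made up for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for step 02: stored activation vectors from a
# target layer. In a real audit these come from the model under test.
X = rng.normal(size=(2000, 64))

d_in, d_hid = 64, 256   # overcomplete feature dictionary
l1, lr = 3e-3, 5e-2     # sparsity penalty, SGD learning rate

W_enc = rng.normal(scale=0.1, size=(d_in, d_hid))
b_enc = np.zeros(d_hid)
W_dec = rng.normal(scale=0.1, size=(d_hid, d_in))

def features(x):
    """Encode activations into non-negative sparse features (step 03)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def loss(x):
    f = features(x)
    return ((f @ W_dec - x) ** 2).mean() + l1 * np.abs(f).mean()

def sgd_step(x):
    """One hand-written SGD step on reconstruction + L1 sparsity loss."""
    global W_enc, b_enc, W_dec
    f = features(x)
    err = f @ W_dec - x
    g_xhat = 2.0 * err / err.size
    g_f = (g_xhat @ W_dec.T + l1 * np.sign(f) / f.size) * (f > 0)
    W_dec -= lr * (f.T @ g_xhat)
    W_enc -= lr * (x.T @ g_f)
    b_enc -= lr * g_f.sum(axis=0)

before = loss(X[:256])
for _ in range(400):
    sgd_step(X[rng.integers(0, len(X), size=256)])
after = loss(X[:256])

f = features(X[:256])
print(f"loss {before:.3f} -> {after:.3f}, active fraction {(f > 0).mean():.2f}")
```

Each column of the learned dictionary is a candidate feature for the triage and ablation steps; the L1 term is what forces most features to be silent on any given input.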
“The question is not whether your model is accurate. It’s whether it is accurate for the right reasons — and whether you can prove it when someone asks.”

The window for evidence-light deployment is closing.

Three regulatory bodies are converging on the same requirement: AI medical devices must demonstrate that their internal reasoning is trustworthy — with documented evidence.

2025
FDA — AI-Enabled Device Software Lifecycle Guidance
Draft guidance requires transparency and monitoring frameworks for AI/ML-based software as a medical device across the entire product lifecycle.
2026
EU AI Act — High-risk AI system obligations
Medical device AI classified as high-risk must meet transparency, technical documentation, and human oversight requirements. Explainability is explicitly mandated under Article 13.
Active
MHRA — Project Glass Box (UK)
MHRA’s AI as a Medical Device programme targets explainability and transparency as core requirements for regulatory approval in Great Britain.

Book an introductory call.
Know before you ship.

Design-partner engagements are open for pathology AI teams preparing for deployment or regulatory submission. Scoped, structured, delivered in days.

London, UK
Aletheia / NSCLC-v2 / Overview
Analysis complete
Audit overview
Prov-GigaPath · NSCLC Subtype Classification · Pre-deployment
312
Features discovered
Layers 18 & 22
247
Histological features
79% of total
38
Confounders flagged
Action required
0.71
Cross-site stability
↓ Threshold 0.80
Hold
Release gate
2 blockers
Top findings
!
Scanner-type shortcut (feat_0203)
Layer 22 · Hamamatsu background signature encoded as predictive feature. Active in 34% of predictions.
!
Staining batch correlation (feat_0211)
Layer 18 · Eosin intensity gradient correlated with site origin. Causal ablation confirms −0.09 AUC.
!
Tissue fold artifact (feat_0256)
Layer 18 · Tissue folds spuriously correlated with ADC prediction. Active in 11% of predictions.
Nuclear pleomorphism validated (feat_0041)
Layer 22 · Stable across all sites. Confirmed causal with 0.91 confidence.
Glandular architecture (feat_0112)
Layer 18 · Histological grade component. Cross-site Δ < 0.01.
Audit progress
Feature discovery 100%
Biological triage 100%
Causal validation 312 experiments
Stress tests 4/6 passed
Evidence pack Ready
Sign-off matrix 1/4 ready
Feature atlas
312 features · Layers 18 & 22 · SAE sparsity 0.04
ID · Feature / Concept · Type · Layer · Confidence · Causal Role · Active in · Cross-Site Δ · Risk
feat_0203 · scanner_bg_hamamatsu · Confounder · L22 · 0.23 · Confound ✗ · 34% · −0.14 · HIGH
feat_0211 · stain_eosin_intensity · Confounder · L18 · 0.31 · Confound ✗ · 19% · −0.09 · MED
feat_0256 · tissue_fold_artifact · Confounder · L18 · 0.44 · Confound ✗ · 11% · −0.06 · MED
feat_0041 · nuclear_pleomorphism · Histological · L22 · 0.91 · Causal ✓ · 67% · −0.02 · None
feat_0088 · mitotic_index_proxy · Histological · L22 · 0.87 · Causal ✓ · 58% · −0.03 · None
feat_0112 · glandular_architecture · Histological · L18 · 0.84 · Causal ✓ · 51% · −0.01 · None
feat_0219 · lymphocyte_infiltration · Histological · L22 · 0.79 · Causal ✓ · 44% · −0.02 · None
feat_0287 · stage_size_proxy · Ambiguous · L22 · 0.58 · Partial · 22% · −0.07 · MED
Causal validation
312 ablation experiments · 4,204 slides · 2 cohorts
High-Impact Ablations
Top 6
Nuclear pleomorphism (feat_0041)
−0.17 AUC
Glandular architecture (feat_0112)
−0.12 AUC
★ Scanner artifact (feat_0203)
−0.14 AUC
Mitotic index (feat_0088)
−0.09 AUC
★ Eosin intensity (feat_0211)
−0.09 AUC
Lymphocyte infiltration (feat_0219)
−0.07 AUC
Confounder Ablation Summary
38 confounders
★ Scanner background
−0.14
★ Eosin intensity
−0.09
★ Tissue fold
−0.06
★ Tissue edge artifact
−0.04
Key finding: Ablating all confounder features simultaneously improves cross-site accuracy by 5.8% — confirming spurious signal dependency.
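The ablation logic behind these numbers can be shown with a toy example (synthetic data and hypothetical columns, not the audit's actual pipeline): zero one feature at a time in a linear read-out and measure the AUC delta. Zeroing a genuine signal feature hurts; zeroing a label-uncorrelated column the head happens to weight can actually help, which is the effect behind the cross-site improvement above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

def auc(scores, y):
    """Rank-based AUC (Mann-Whitney statistic), no sklearn needed."""
    order = scores.argsort()
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    n_pos = y.sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * (n - n_pos))

# Synthetic feature matrix: column 0 carries strong signal, column 1
# weaker signal, column 2 is pure noise (a stand-in for a confounder).
y = rng.integers(0, 2, size=n)
F = rng.normal(size=(n, 3))
F[:, 0] += 1.5 * y
F[:, 1] += 0.8 * y

w = np.array([1.0, 1.0, 1.0])   # a head that also weights the noise column

def ablated_auc(j=None):
    Fa = F.copy()
    if j is not None:
        Fa[:, j] = 0.0          # zero out one feature: the ablation
    return auc(Fa @ w, y)

base = ablated_auc()
d_signal = ablated_auc(0) - base   # expect a clear drop
d_conf = ablated_auc(2) - base     # expect a small gain
print(f"base {base:.3f}, signal Δ {d_signal:+.3f}, confounder Δ {d_conf:+.3f}")
```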
Negative Controls
✓ Passed

50 random low-activation features ablated. None produced classification changes exceeding 0.3% — confirming observed effects are feature-specific.

0 / 50
Controls with Δ > 0.3%
0.04%
Mean Δ across controls
Cross-Site Feature Stability
★ Below threshold

Feature importance rankings consistent within TCGA (ρ = 0.94). CPTAC-SAR diverges (ρ = 0.71), driven by elevated confounder activations on unseen scanner types.

0.94
TCGA internal ρ
0.71
TCGA↔CPTAC ρ
0.80
Threshold
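A sketch of how a stability index like this can be computed (all importance values below are synthetic, not taken from the audit): rank per-site feature importances, then take the Spearman correlation between sites, i.e. the Pearson correlation of the ranks.

```python
import numpy as np

rng = np.random.default_rng(0)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

# Illustrative importance scores for 312 features at a reference site.
imp_site_a = rng.gamma(2.0, 1.0, size=312)

# Same site, different split: small jitter, rankings nearly preserved.
imp_site_a2 = imp_site_a + rng.normal(scale=0.05, size=312)

# New site: a block of confounder-like features (here the first 38)
# fires much harder, reshuffling the ranking and pulling ρ down.
imp_site_b = imp_site_a + rng.normal(scale=0.4, size=312)
imp_site_b[:38] *= rng.uniform(1.5, 4.0, size=38)

rho_internal = spearman(imp_site_a, imp_site_a2)
rho_transfer = spearman(imp_site_a, imp_site_b)
print(f"internal ρ {rho_internal:.2f}, transfer ρ {rho_transfer:.2f}")
```

A release gate then compares the transfer ρ against a fixed threshold, as the 0.80 gate does in this view.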
Stress tests
6 test suites · 4 passed · 2 failed
Cross-site generalisation
FAILED
AUC measured on holdout site when trained on the remaining cohort.
TCGA internal AUC · 0.93
TCGA → CPTAC transfer · 0.71
Feature stability index · 0.71
Threshold · ≥ 0.80
Stain normalisation robustness
FAILED
Model output variance under Macenko normalisation vs raw input.
Normalised AUC · 0.901
Raw AUC · 0.834
Delta · −6.7%
Threshold · Δ ≤ 3%
Scanner make variation
PASSED
Performance stratified by Hamamatsu vs Aperio vs Leica.
Max inter-scanner Δ · 2.1%
Threshold · ≤ 5%
Subgroup fairness
PASSED
AUC parity across patient demographics and tumour subtypes.
Max subgroup AUC gap · 3.4%
Threshold · ≤ 5%
Occlusion intervention
PASSED
Masking tissue borders reduces false positives, confirming the contribution of validated features.
FP reduction with masking · 14.2%
Feature drift simulation
PASSED
Synthetic distribution shift applied to confounder features.
AUC after drift · 0.871
Threshold · ≥ 0.85
Regulatory compliance map
EU AI Act · MHRA AIaMD · FDA SaMD lifecycle

EU AI Act

High-risk AI — Articles 6–15, Annex IV
Art. 13 — Interpretation tools · Feature dictionary with 312 interpretable features mapped to clinical ontologies.
!
Art. 9 — Risk management · 38 confounder features identified. 2 HIGH severity require remediation.
Annex IV — Technical documentation · SAE architecture, training protocol, and validation pipeline documented.
Art. 15 — Accuracy & robustness · Cross-site stability analysis complete. 6 stress test suites executed.

MHRA AIaMD

Project Glass Box · AI as Medical Device
Interpretability evidence · Feature-level explanations meet draft guidance on the interrogation of AI reasoning.
Lifecycle monitoring · Feature drift thresholds and production triggers documented.
AI Airlock evaluation · Methodology eligible for sandbox review. Application pending.

FDA SaMD

AI/ML-Based SaMD Lifecycle Guidance
Model traceability · Feature-to-outcome causal chain documented for 312 tested features.
PCCP — Change control · Feature drift thresholds established. Retraining triggers defined.
Good Machine Learning Practice · Negative controls, cross-validation, and robustness diagnostics follow GMLP.
Evidence pack
Pre-Deployment Audit Report — Prov-GigaPath NSCLC-v2
Aletheia
Mechanistic Interpretability Audit — Evidence Report
Prepared by Aletheia · For internal QA and regulatory use
Report ref
ATH-2026-0034
Model
Prov-GigaPath/NSCLC-v2
Date
28 Mar 2026
Status
DRAFT
1. Executive summary & release recommendation
Hold
The audit identified 38 confounder features across 312 total. The model demonstrates strong within-site performance but carries documented shortcut dependency that reduces cross-site robustness. Scanner artifact (feat_0203) confirmed causal via ablation (−0.14 AUC). Deployment to new sites not recommended without stain normalisation pipeline.
2. Feature atlas — 312 features catalogued
38 flagged
247 histological features mapped to NCIT, UBERON, and GO ontologies. 38 confounder features flagged. 27 features classified as ambiguous pending expert pathology review.
3. Causal validation — 312 ablation experiments
Methodology validated
Ablation protocol confirmed by 50 negative controls (0/50 above 0.3%; mean Δ = 0.04%). Confounder ablation improves cross-site accuracy by 5.8%.
4. Stress test results — 6 suites, 2 failed
2 failed
Cross-site generalisation and stain normalisation robustness tests failed. Scanner variation and subgroup fairness passed.
5. Regulatory alignment notes
Compliance ref
Report structured to support EU AI Act Article 13, the FDA AI-enabled device software lifecycle guidance, and MHRA Project Glass Box requirements.
Release gate
Sign-off matrix & open actions
Current gate status
Hold
Blocked by scanner shortcut and cross-site stability failure. 3 sign-offs pending.
3
Pending sign-offs
2
Open blockers
V1
Evidence pack version
Sign-off matrix
ML Lead
Needs replay data on remediation branch before sign-off.
pending
Validation / QA
Requires site-specific limitation statement before committee review.
pending
Regulatory partner
Method appendix acceptable. Decision depends on final remediation wording.
review
Founder / Product owner
Executive summary drafted. Ready to finalise once blockers resolve.
ready
Open actions before release
Complete replay on remediation branch — verify site-holdout delta improves after stain normalisation.
Draft site-specific limitation statement — QA requires bounded limitation text for external sites.
Update executive summary — reflect final remediation status and limitation wording.
Method appendix finalised — regulatory partner confirmed methodology documentation is acceptable.