# Benchmark Positioning
AutoMechInterp is best positioned as a reliability benchmark for mechanistic-interpretability evidence.
It is neither a discovery benchmark nor a general model-capability benchmark. Its core question is:

> Given a concrete mechanistic claim and its artifact bundle, does the evidence meet a frozen acceptance contract?
## Ecosystem map
The table below organizes adjacent work by evaluation target and objective.
| Category | Representative examples | What is evaluated | Typical output |
|---|---|---|---|
| Claim-verification benchmark | AutoMechInterp | quality and robustness of submitted claim evidence | gate outcomes, tier labels, remediation diagnostics |
| Method-performance benchmark (mechanistic interpretability) | MIB, InterpBench | how well interpretability methods recover known/curated structures | task/track scores, benchmark leaderboards |
| Synthetic known-mechanism testbeds | Tracr and related synthetic setups | behavior on models with known construction | controlled success/failure analyses |
| General model-eval frameworks | HELM, lm-eval-harness, OpenAI Evals | model behavior/capability under task suites | aggregate model performance metrics |
| Discovery/intervention toolkits | TransformerLens, nnsight, pyvene, SAELens | exploratory hypothesis generation and intervention workflows | hypotheses, traces, activation analyses |
## Core objective differences

### 1) What is the unit of evaluation?
- AutoMechInterp: a claim bundle (claim + protocol + raw cells + manifest).
- Method benchmarks: a method/task pair.
- Capability benchmarks: a model/task pair.
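To make the unit of evaluation concrete, here is a minimal sketch of what a claim bundle might look like as a data structure. The field names and example values are illustrative assumptions, not AutoMechInterp's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClaimBundle:
    """Hypothetical sketch of AutoMechInterp's unit of evaluation."""
    claim: str                 # the mechanistic claim under test
    protocol_version: str      # which frozen acceptance contract applies
    raw_cells: list            # raw intervention cells backing the claim
    manifest: dict = field(default_factory=dict)  # seeds, hashes, environment

# Illustrative bundle; the claim text and cell contents are invented.
bundle = ClaimBundle(
    claim="Head L5.H3 mediates indirect-object identification",
    protocol_version="v1.0",
    raw_cells=[{"intervention": "ablate", "effect": -0.42}],
    manifest={"seed": 0, "model": "gpt2-small"},
)
```

The point of the structure is that the claim, its protocol version, and its raw evidence travel together, so a verifier never evaluates a claim detached from its artifacts.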
### 2) What counts as success?
- AutoMechInterp: passing a frozen evidentiary contract with deterministic outputs.
- Method benchmarks: high benchmark score against known/curated criteria.
- Capability benchmarks: high performance on behavior tasks.
### 3) How are failures explained?
- AutoMechInterp: gate-level decomposition (causal, controls, robustness, statistics, integrity).
- Method benchmarks: usually task-level misses or method-level score drops.
- Capability benchmarks: benchmark metric regressions.
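The gate-level decomposition above can be sketched as a small function that maps each gate to a pass/fail outcome and derives an overall verdict. This is a hedged illustration of the idea, not AutoMechInterp's implementation; the gate-evaluation logic is assumed to happen elsewhere.

```python
# The five gates named in the text; a missing result counts as a failure.
GATES = ("causal", "controls", "robustness", "statistics", "integrity")

def decompose(gate_results: dict) -> dict:
    """Return per-gate outcomes plus an overall acceptance verdict.

    `gate_results` maps gate name -> bool (did that gate pass?).
    """
    outcomes = {g: bool(gate_results.get(g, False)) for g in GATES}
    outcomes["accepted"] = all(outcomes[g] for g in GATES)
    return outcomes

# One failing gate is isolated by name rather than buried in an aggregate score.
report = decompose({"causal": True, "controls": True, "robustness": False,
                    "statistics": True, "integrity": True})
```

The contrast with score-based benchmarks is that a rejection here names the failing gate (`robustness`), which is what makes the failure actionable.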
## Detailed positioning matrix
| Dimension | AutoMechInterp | MIB / InterpBench-style | HELM / lm-eval / OpenAI Evals |
|---|---|---|---|
| Primary question | Is this claim evidence robust enough to accept? | Which method performs better across benchmark tasks? | How good is model behavior on benchmark tasks? |
| Input artifact | claim bundle with intervention cells | method outputs on benchmark suites | model responses / eval traces |
| Contract stability | explicit and frozen per protocol version | benchmark protocol dependent | benchmark suite dependent |
| Deterministic rerun requirement | first-class requirement | varies | varies |
| Failure granularity | gate-level | task/track-level | metric-level |
| Suitability for publication claim auditing | high | medium | low |
| Suitability for broad capability ranking | low | low/medium | high |
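The "deterministic rerun requirement" row deserves a concrete illustration. One simple way to implement such a check is to compare a canonical hash of a rerun's outputs against the hash recorded in the bundle's manifest. This is a sketch under the assumption that outputs are JSON-serializable; the function names are invented for illustration.

```python
import hashlib
import json

def output_digest(outputs: dict) -> str:
    """Canonical SHA-256 digest of an output dict (key order ignored)."""
    canonical = json.dumps(outputs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# The manifest would record the digest at submission time...
recorded = output_digest({"effect": -0.42, "p": 0.003})
# ...and the verifier recomputes it from a fresh rerun. Same values in a
# different order still match, because serialization is canonicalized.
rerun = output_digest({"p": 0.003, "effect": -0.42})
```

A first-class rerun requirement means acceptance depends on `recorded == rerun` holding under an independent re-execution, not on trusting the submitted numbers.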
## Where AutoMechInterp is strongest
- Standardizing acceptance criteria across heterogeneous discovery lanes.
- Making evidence failures actionable via structured diagnostics.
- Enabling longitudinal comparability with protocol versioning.
- Supporting external bundle submissions with deterministic rerun checks.
## Where AutoMechInterp is not the best tool
- Discovering mechanisms from scratch.
- Ranking foundation models by capability.
- Replacing controlled synthetic ground-truth method tests.
## Complementarity model
A practical high-rigor stack looks like:
- discovery tools generate hypotheses and intervention traces
- method benchmarks evaluate discovery quality trends
- AutoMechInterp verifies claim evidence readiness for external reporting
This stack separates exploration speed from acceptance rigor.
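The three-layer stack can be sketched as a pipeline in which each layer is a separate, swappable function. Every function below is a placeholder standing in for a real tool (discovery tooling, a method benchmark, AutoMechInterp-style verification); none of these are actual APIs.

```python
def discover(model_name: str) -> list[dict]:
    """Placeholder for discovery tooling: hypotheses plus intervention traces."""
    return [{"claim": f"head effect in {model_name}", "trace": [0.1, 0.2]}]

def benchmark_quality(hypotheses: list[dict]) -> float:
    """Placeholder for a method benchmark tracking discovery-quality trends."""
    return 0.8 if hypotheses else 0.0

def verify(hypothesis: dict) -> bool:
    """Placeholder for claim verification: only evidence-backed claims pass."""
    return bool(hypothesis.get("trace"))

# Exploration speed (discover) is decoupled from acceptance rigor (verify):
# either layer can change without redefining the other's success criterion.
hypotheses = discover("gpt2-small")
score = benchmark_quality(hypotheses)
accepted = [h for h in hypotheses if verify(h)]
```

The separation matters because the discovery layer is free to be fast and noisy while the verification layer stays frozen and strict.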
## Current field position summary
Today, many mechanistic-interpretability workflows have strong discovery tooling but weaker standardized acceptance contracts. AutoMechInterp addresses that gap by treating claim verification as a first-class benchmark objective.
That makes it a strong complement to both method benchmarks and discovery toolchains, especially when teams need reproducible acceptance decisions rather than one-off case-study judgments.