Reproducibility · AutoMechInterp

One-Command Rebuild

python main/reproducibility_audit.py

This single command runs the test suites, captures the environment manifest, performs the coverage and failure analyses, builds the breadth summaries, runs the stress suites, and produces the field-level findings and the runtime/cost report.

Primary Outputs

main/output/repro/reproducibility_audit.json
main/output/repro/reproducibility_audit.md
main/output/repro/environment_manifest.json
main/output/repro/benchmark_breadth_summary.json
main/output/repro/field_level_findings.json
main/output/repro/runtime_cost_report.json
main/output/community_submissions/community_value_summary.json
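
After a rebuild, it can be useful to confirm that every file listed above was actually written. The following is a minimal sketch, not part of the repository: the paths are copied from the list above, while the helper name `missing_outputs` and the "non-empty file" criterion are assumptions made for illustration.

```python
from pathlib import Path

# Paths copied from the "Primary Outputs" list; the checker is a hypothetical helper.
PRIMARY_OUTPUTS = [
    "main/output/repro/reproducibility_audit.json",
    "main/output/repro/reproducibility_audit.md",
    "main/output/repro/environment_manifest.json",
    "main/output/repro/benchmark_breadth_summary.json",
    "main/output/repro/field_level_findings.json",
    "main/output/repro/runtime_cost_report.json",
    "main/output/community_submissions/community_value_summary.json",
]


def missing_outputs(repo_root="."):
    """Return the primary outputs that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in PRIMARY_OUTPUTS:
        path = root / rel
        # Treat a zero-byte file as missing (an assumption of this sketch).
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_outputs()` from the repository root returns an empty list when all seven files are present and non-empty.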

Determinism Definition

Given fixed code, artifacts, seeds, and environment versions, evaluator outputs should be identical across reruns. Decision-hash checks test this by comparing, for each claim, the (passed, tier, checks) view of the result.
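
One way such a decision hash could be computed is sketched below. This is an illustration under stated assumptions, not the evaluator's actual implementation: the function name `decision_hash`, the input shape (a mapping from claim ID to a dict with `passed`, `tier`, and `checks` keys), and the choice of SHA-256 over canonical JSON are all hypothetical.

```python
import hashlib
import json


def decision_hash(claims):
    """Hash only the per-claim (passed, tier, checks) view of a result.

    `claims` is assumed to map claim_id -> dict; any keys other than
    'passed', 'tier', and 'checks' (timings, paths, ...) are ignored.
    """
    view = {
        claim_id: {
            "passed": claim["passed"],
            "tier": claim["tier"],
            "checks": claim["checks"],
        }
        for claim_id, claim in claims.items()
    }
    # Sorted keys and fixed separators give a canonical serialization,
    # so dict ordering does not affect the hash.
    canonical = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this sketch, two reruns are considered to agree when their hashes are equal, even if fields outside the decision view (for example, runtime) differ.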

Manual Command Sequence

python -m pytest packages/evaluator/tests -q
python -m pytest packages/runner/tests -q
python main/environment_manifest.py --output main/output/repro/environment_manifest.json
python main/summarize_real_multi_task.py
python main/analyze_real_bundle_failures.py
python main/build_multilane_real_bundles.py
python main/run_community_submission_demo.py
python main/summarize_benchmark_breadth.py
python main/field_level_findings.py
python main/runtime_cost_report.py --reruns 3
python main/stress_test_ablation.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small
python main/stress_test_agnostic.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small
python main/stress_test_red_team.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small
python -m automechinterp_evaluator.cli reference-vectors
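
When running the steps by hand, it can help to stop at the first failing command rather than continue with stale outputs. The sketch below shows one way to do that; `run_sequence` is a hypothetical helper, not a script shipped with the repository, and it assumes each command is a plain string without shell features such as pipes or redirects.

```python
import shlex
import subprocess


def run_sequence(commands, cwd=None):
    """Run commands in order and stop at the first non-zero exit code.

    Returns None if every command succeeds, otherwise a tuple
    (index_of_failed_command, returncode).
    """
    for index, command in enumerate(commands):
        # shlex.split turns the command string into an argv list (no shell involved).
        result = subprocess.run(shlex.split(command), cwd=cwd)
        if result.returncode != 0:
            return index, result.returncode
    return None
```

For example, passing the first two commands of the sequence above, `run_sequence(["python -m pytest packages/evaluator/tests -q", "python -m pytest packages/runner/tests -q"])`, would skip the runner tests if the evaluator tests fail.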