One-Command Rebuild
python main/reproducibility_audit.py
This command runs the test suites, captures the environment manifest, and produces the coverage and failure analyses, breadth summaries, stress suites, field-level findings, and the runtime/cost report.
Primary Outputs
main/output/repro/reproducibility_audit.json
main/output/repro/reproducibility_audit.md
main/output/repro/environment_manifest.json
main/output/repro/benchmark_breadth_summary.json
main/output/repro/field_level_findings.json
main/output/repro/runtime_cost_report.json
main/output/community_submissions/community_value_summary.json
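After a rebuild, it can be useful to confirm that every primary output listed above was actually written. The following is a minimal sketch, not part of the repository: the helper name `missing_outputs` and its `root` parameter are hypothetical, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path

# Primary outputs, copied from the list above.
PRIMARY_OUTPUTS = [
    "main/output/repro/reproducibility_audit.json",
    "main/output/repro/reproducibility_audit.md",
    "main/output/repro/environment_manifest.json",
    "main/output/repro/benchmark_breadth_summary.json",
    "main/output/repro/field_level_findings.json",
    "main/output/repro/runtime_cost_report.json",
    "main/output/community_submissions/community_value_summary.json",
]


def missing_outputs(paths, root="."):
    """Return the subset of `paths` that is not a file under `root` (hypothetical helper)."""
    base = Path(root)
    return [p for p in paths if not (base / p).is_file()]
```

Calling `missing_outputs(PRIMARY_OUTPUTS)` from the repository root returns an empty list when all seven files are present.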
Determinism Definition
For fixed code, artifacts, seeds, and environment versions, evaluator outputs should match across reruns. Decision-hash checks verify this by comparing the per-claim (passed, tier, checks) view of each run.
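One way to picture a decision-hash check is as a hash over a canonical serialization of the per-claim (passed, tier, checks) view, so that fields outside that view do not affect the comparison. The sketch below is an illustration under assumptions, not the evaluator's actual scheme: the record layout, the `claim_id` and `runtime_s` field names, the use of a dict for `checks`, and the choice of SHA-256 over sorted JSON are all hypothetical.

```python
import hashlib
import json


def decision_hash(claims):
    """Hash only the (passed, tier, checks) view of each claim.

    `claims` is assumed to be a list of dicts with hypothetical keys
    `claim_id`, `passed`, `tier`, and `checks` (a dict of check name -> bool).
    Any other keys, such as timings, are ignored.
    """
    view = [
        {
            "claim_id": c["claim_id"],
            "passed": c["passed"],
            "tier": c["tier"],
            "checks": c["checks"],
        }
        # Sort by claim so the hash does not depend on record order.
        for c in sorted(claims, key=lambda c: c["claim_id"])
    ]
    # sort_keys and fixed separators give a stable byte representation.
    payload = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under this sketch, two reruns count as matching when their hashes are equal, even if incidental fields such as wall-clock time differ between them.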
Manual Command Sequence
python -m pytest packages/evaluator/tests -q
python -m pytest packages/runner/tests -q
python main/environment_manifest.py --output main/output/repro/environment_manifest.json
python main/summarize_real_multi_task.py
python main/analyze_real_bundle_failures.py
python main/build_multilane_real_bundles.py
python main/run_community_submission_demo.py
python main/summarize_benchmark_breadth.py
python main/field_level_findings.py
python main/runtime_cost_report.py --reruns 3
python main/stress_test_ablation.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small
python main/stress_test_agnostic.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small
python main/stress_test_red_team.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small
python -m automechinterp_evaluator.cli reference-vectors
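The manual sequence above can also be driven from a small script that stops at the first failing step. This is a sketch under assumptions rather than a description of what `main/reproducibility_audit.py` does internally: the helper name `run_sequence` is hypothetical, the commands are copied verbatim from the list above, and stopping on the first non-zero exit code is a design choice made here for illustration.

```python
import shlex
import subprocess

# Commands copied from the manual sequence above, in the same order.
MANUAL_SEQUENCE = [
    "python -m pytest packages/evaluator/tests -q",
    "python -m pytest packages/runner/tests -q",
    "python main/environment_manifest.py --output main/output/repro/environment_manifest.json",
    "python main/summarize_real_multi_task.py",
    "python main/analyze_real_bundle_failures.py",
    "python main/build_multilane_real_bundles.py",
    "python main/run_community_submission_demo.py",
    "python main/summarize_benchmark_breadth.py",
    "python main/field_level_findings.py",
    "python main/runtime_cost_report.py --reruns 3",
    "python main/stress_test_ablation.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small",
    "python main/stress_test_agnostic.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small",
    "python main/stress_test_red_team.py --bundle-dir main/output/real_multi_task/ioi_v0_gpt2-small",
    "python -m automechinterp_evaluator.cli reference-vectors",
]


def run_sequence(commands, run=subprocess.run):
    """Run each command in order and return the list of commands that succeeded.

    Raises RuntimeError at the first non-zero exit code. `run` is injectable
    so the sequencing logic can be exercised without launching processes.
    """
    completed = []
    for command in commands:
        result = run(shlex.split(command))
        if result.returncode != 0:
            raise RuntimeError(f"step failed ({result.returncode}): {command}")
        completed.append(command)
    return completed
```

Calling `run_sequence(MANUAL_SEQUENCE)` from the repository root executes the steps in the listed order; it assumes `python` on the PATH resolves to the interpreter used for the project.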