Quickstart (10 min)

This quickstart takes you from zero to a deterministic bundle review whose outputs you can interpret immediately.

Goal

By the end of this flow you will have:

run the evaluator and runner test suites
executed a deterministic submission review on a bundle
inspected gate failures and remediation guidance
run a full reproducibility audit

Step 1: validate packages

# Evaluator tests
cd packages/evaluator
python -m pytest tests/ -v

# Runner tests
cd ../runner
python -m pytest tests/ -v

If either suite fails, resolve environment issues before evaluating claims.
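If you prefer to run both suites from the repository root in one go, the two commands above can be wrapped in a small script. This is a minimal sketch, not part of the project: `run_suites` is a hypothetical helper, and the package paths are inferred from the `cd` commands above.

```python
import subprocess
import sys
from pathlib import Path


def run_suites(package_dirs, run=subprocess.run):
    """Run pytest in each package directory, stopping at the first failure.

    Returns the Path of the first failing package, or None if all pass.
    `run` is injectable so the helper can be exercised without pytest.
    """
    for pkg in package_dirs:
        pkg_path = Path(pkg)
        # Same command as above, with the package as the working directory.
        result = run([sys.executable, "-m", "pytest", "tests/", "-v"], cwd=pkg_path)
        if result.returncode != 0:
            return pkg_path
    return None
```

Calling `run_suites(["packages/evaluator", "packages/runner"])` from the repository root returns `None` when both suites pass, and otherwise the directory whose suite failed first.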

Step 2: run deterministic submission review

python -m automechinterp_evaluator.cli submission-review \
  --bundle /path/to/bundle \
  --reruns 3 \
  --output-json /path/to/bundle/submission_review.json \
  --output-md /path/to/bundle/submission_review.md

This command runs the evaluation three times (`--reruns 3`), checks that the reruns agree, and writes the review, including workflow guidance, to the `--output-json` and `--output-md` paths.
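To make "rerun agreement" concrete, the sketch below flags claims whose verdict differs between reruns. It illustrates the idea only and is not the evaluator's actual comparison logic: the per-run shape (a dict mapping claim id to a verdict string), the `"<missing>"` placeholder, and the name `disagreeing_claims` are all assumptions made for this example.

```python
def disagreeing_claims(reruns):
    """Return {claim_id: sorted verdicts} for claims that differ across reruns.

    `reruns` is a list of dicts mapping claim id -> verdict string
    (a hypothetical shape, not the tool's real output schema).
    A claim absent from a rerun is treated as the verdict "<missing>".
    """
    all_ids = set().union(*(run.keys() for run in reruns))
    unstable = {}
    for claim_id in sorted(all_ids):
        seen = {run.get(claim_id, "<missing>") for run in reruns}
        if len(seen) > 1:
            unstable[claim_id] = sorted(seen)
    return unstable
```

An empty result means every claim received the same verdict in every rerun. A non-empty result is the list of claims worth inspecting first.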

Step 3: inspect outputs in this order

1. submission_review.json
2. submission_review.md
3. bundle-level result.json (if generated in your workflow)
4. bundle-level stage_gate_report.md
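Before reading `submission_review.json` in detail, it can help to see what it contains at the top level. The sketch below assumes nothing about the file's schema, which this page does not describe. It only reports each top-level key and the type of its value. `summarize_top_level` is a hypothetical helper name.

```python
import json
from pathlib import Path


def summarize_top_level(path):
    """Map each top-level key of a JSON file to the type name of its value."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # Not an object at the root: report the root type under a placeholder key.
    return {"<root>": type(data).__name__}
```

For example, `summarize_top_level("/path/to/bundle/submission_review.json")` gives a quick map of which sections are lists, objects, or scalars, so you know where to look next.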

Step 4: run repository reproducibility audit

# from repository root
python main/reproducibility_audit.py

This produces environment and benchmark summary artifacts under main/output/repro/.
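To confirm the audit actually wrote something, you can list the files under the output directory. This is a small sketch using only the standard library. The artifact names are not documented on this page, so the helper (hypothetically named `list_artifacts`) just reports whatever files it finds with their sizes.

```python
from pathlib import Path


def list_artifacts(root="main/output/repro"):
    """Return sorted (relative_path, size_in_bytes) pairs for files under root."""
    root = Path(root)
    if not root.is_dir():
        # The audit has not been run yet, or it was run from another directory.
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )
```

An empty list after running the audit usually means the command was run from somewhere other than the repository root.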

Quick interpretation checklist

Are failures concentrated in one gate family?
Are missing slices causing not_evaluated outcomes?
Do reruns agree across all claims?
Is the remediation guidance specific enough to guide the next experiments?
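The first two checklist questions can be answered by counting. The sketch below assumes you have already extracted `(gate_family, status)` pairs from the review output. That shape, the `"fail"` status string, and the name `failure_concentration` are assumptions made for illustration. Only `not_evaluated` appears on this page.

```python
from collections import Counter


def failure_concentration(records):
    """Summarize where failures cluster.

    `records` is an iterable of (gate_family, status) pairs
    (a hypothetical shape, not the tool's real output schema).
    """
    records = list(records)
    failures = Counter(family for family, status in records if status == "fail")
    not_evaluated = sum(1 for _, status in records if status == "not_evaluated")
    total = sum(failures.values())
    top_family, top_count = failures.most_common(1)[0] if failures else (None, 0)
    return {
        "top_family": top_family,
        # Share of all failures that belong to the most frequent family.
        "top_share": top_count / total if total else 0.0,
        "not_evaluated": not_evaluated,
    }
```

A `top_share` close to 1.0 suggests a single root cause. A high `not_evaluated` count points toward missing slices rather than genuine failures.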

Common first-run problems

Problem	Likely cause	First fix
schema parse failure	malformed bundle file	validate keys/types in `hypothesis.jsonl` and `evaluation_result.json`
manifest mismatch	file changed after hashing	regenerate `manifest.json`
many `confirmatory_present` failures	missing confirmatory slice	regenerate raw cells with required slices
many `method_sensitivity` failures	unstable intervention setup	review intervention method consistency and control setup
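For the schema parse failure row, a quick first check is whether every line of `hypothesis.jsonl` is well-formed JSON at all. The sketch below assumes the file holds one JSON object per line, as the `.jsonl` extension suggests. It does not check the bundle's required keys or types, which are not listed on this page. `find_bad_jsonl_lines` is a hypothetical helper name.

```python
import json


def find_bad_jsonl_lines(text):
    """Return (line_number, reason) for each non-blank line that is not a JSON object."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Skip blank lines.
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, f"expected object, got {type(record).__name__}"))
    return problems
```

Pass it the file contents, for example `find_bad_jsonl_lines(open("hypothesis.jsonl", encoding="utf-8").read())`. An empty list means the file parses line by line, so any remaining schema error lies in the keys or value types.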