Step 1: Verify Environment
python -m pytest packages/evaluator/tests -q
python -m pytest packages/runner/tests -q
Step 2: Create a Bundle
python -m automechinterp_evaluator.cli init-template \
--output-dir /tmp/my_bundle
This writes protocol.yaml, hypothesis.jsonl, evaluation_result.json, and manifest.json into the bundle directory.
Step 3: Generate Candidate Hypotheses (Optional)
python -m automechinterp_evaluator.cli generate-agent-output \
--bundle /tmp/my_bundle --count 3 --overwrite
python -m automechinterp_evaluator.cli generate-hypotheses \
--bundle /tmp/my_bundle \
--agent-output /tmp/my_bundle/agent_output.json \
--overwrite
Step 4: Run Stage-2 and Evaluate
python -m automechinterp_runner.cli run \
--bundle /tmp/my_bundle --mode mock --device cpu --examples-per-cell 20
python -m automechinterp_evaluator.cli evaluate \
--bundle /tmp/my_bundle --output /tmp/my_bundle/result.json
python -m automechinterp_evaluator.cli report \
--bundle /tmp/my_bundle --output /tmp/my_bundle/stage_gate_report.md
Step 5: Generate Workflow Review
python -m automechinterp_evaluator.cli submission-review \
--bundle /tmp/my_bundle --reruns 3 \
--output-json /tmp/my_bundle/submission_review.json \
--output-md /tmp/my_bundle/submission_review.md
This maps each failed gate to a concrete next action and verifies that the decision is deterministic across reruns.
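The stability check behind --reruns can be sketched as follows: every rerun should produce an identical decision payload. This is an illustration of the idea, not the evaluator's actual implementation, and the payload shape is hypothetical.

```python
import json

def decisions_stable(rerun_results: list[dict]) -> bool:
    """True when every rerun produced an identical decision payload.

    Canonical JSON serialization (sort_keys=True) makes the comparison
    insensitive to dict key order. The payload contents are illustrative.
    """
    if not rerun_results:
        return True
    canonical = [json.dumps(result, sort_keys=True) for result in rerun_results]
    return all(c == canonical[0] for c in canonical)
```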
Step 6: Compatibility Check
python -m automechinterp_evaluator.cli reference-vectors
This validates canonical tiering behavior against the shared reference vectors.
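The reference-vector pattern itself is simple: run a tiering function over stored (input, expected tier) pairs and report any mismatches. The sketch below shows that pattern with an illustrative tiering function; the thresholds and tier names are hypothetical, not the evaluator's real ones.

```python
def check_reference_vectors(tier_fn, vectors):
    """Compare tier_fn's output against (input, expected_tier) pairs.

    Returns mismatches as (input, expected, actual) tuples; an empty
    list means the tiering behavior matches the reference vectors.
    """
    mismatches = []
    for score, expected in vectors:
        actual = tier_fn(score)
        if actual != expected:
            mismatches.append((score, expected, actual))
    return mismatches

def example_tier(score: float) -> str:
    # Illustrative thresholds only, not the evaluator's canonical tiers.
    if score >= 0.9:
        return "gold"
    if score >= 0.5:
        return "silver"
    return "bronze"
```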