Interpret Results¶
Recommended reading order¶
1. failed_checks
2. not_evaluated_checks
3. gate_outcomes
4. evidence_tier
Reading in this order keeps you from over-indexing on the final label before you understand what caused it.
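As a minimal sketch of the reading order above, the snippet below prints a claim's fields in that sequence. The four field names come from this page; the dict layout of a claim and the `summarize` helper are assumptions made for illustration, not the tool's actual schema or API.

```python
# Field names are taken from the recommended reading order on this page.
READING_ORDER = [
    "failed_checks",
    "not_evaluated_checks",
    "gate_outcomes",
    "evidence_tier",
]


def summarize(claim: dict) -> list[str]:
    """Return one line per field, in the recommended reading order.

    `claim` is a hypothetical dict representation of one reviewed claim.
    """
    lines = []
    for field in READING_ORDER:
        # Missing fields are flagged rather than silently skipped.
        value = claim.get(field, "<missing>")
        lines.append(f"{field}: {value}")
    return lines


# Hypothetical example data, deliberately stored in a different key order.
example_claim = {
    "evidence_tier": "suggestive",
    "gate_outcomes": {"robustness": "fail"},
    "failed_checks": ["robustness"],
    "not_evaluated_checks": ["transfer"],
}

for line in summarize(example_claim):
    print(line)
```

The final label (evidence_tier) is printed last on purpose, so the reasons behind it are read first.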
Evidence tiers in practice¶
- cross_model_confirmed: strongest cross-model evidence pathway
- single_model_confirmed: accepted within one model context
- causal_plus_robustness: meaningful signal, but incomplete acceptance evidence
- causal_tested_unstable: causal evidence exists but instability remains
- suggestive: weak/incomplete support
- rejected: evidence did not satisfy the contract
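When filtering claims by tier, it can help to give the tiers an explicit rank. The sketch below assumes that the tiers are listed on this page from strongest to weakest; that ordering, and the `at_least` helper, are assumptions for illustration rather than documented behavior.

```python
# Tier names from this page, ASSUMED to be listed strongest to weakest.
TIERS = [
    "cross_model_confirmed",
    "single_model_confirmed",
    "causal_plus_robustness",
    "causal_tested_unstable",
    "suggestive",
    "rejected",
]

# Lower rank means stronger evidence.
TIER_RANK = {name: rank for rank, name in enumerate(TIERS)}


def at_least(tier: str, threshold: str) -> bool:
    """Return True if `tier` is as strong as or stronger than `threshold`."""
    return TIER_RANK[tier] <= TIER_RANK[threshold]


# Hypothetical usage: keep only claims at causal_plus_robustness or better.
tiers_seen = ["suggestive", "single_model_confirmed", "causal_plus_robustness"]
kept = [t for t in tiers_seen if at_least(t, "causal_plus_robustness")]
print(kept)
```

An unknown tier name raises a KeyError here, which is usually preferable to silently ranking it.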
Failure decomposition workflow¶
For a bundle set:
- Count failed gates across all claims.
- Group failures by task/model/lane/provider.
- Separate missing-evidence failures from contradiction failures.
- Prioritize experiments that target the highest-concentration failure modes.
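The workflow above can be sketched with `collections.Counter`. The record layout (one dict per failed gate, with `gate`, `task`, `model`, and `kind` keys) and the sample values are hypothetical; adapt the keys to whatever your bundles actually expose. Grouping by lane or provider works the same way as the task/model grouping shown here.

```python
from collections import Counter

# Hypothetical records: one entry per failed gate across all claims.
failures = [
    {"claim": "c1", "gate": "robustness", "task": "t1", "model": "m1", "kind": "contradiction"},
    {"claim": "c2", "gate": "transfer", "task": "t1", "model": "m2", "kind": "missing_evidence"},
    {"claim": "c3", "gate": "robustness", "task": "t2", "model": "m1", "kind": "contradiction"},
    {"claim": "c4", "gate": "slices", "task": "t1", "model": "m1", "kind": "missing_evidence"},
]

# Step 1: count failed gates across all claims.
gate_counts = Counter(f["gate"] for f in failures)

# Step 2: group failures by (task, model).
by_task_model = Counter((f["task"], f["model"]) for f in failures)

# Step 3: separate missing-evidence failures from contradiction failures.
by_kind = Counter(f["kind"] for f in failures)

# Step 4: the most frequent failure modes are the first candidates for new experiments.
priorities = gate_counts.most_common()

print(priorities)
print(by_task_model.most_common(1))
print(dict(by_kind))
```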
Interpreting not_evaluated¶
not_evaluated often indicates a missing slice or unavailable transfer evidence; it is not necessarily negative evidence.
Treat it as a data-completeness signal.
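One way to act on this is to report a completeness ratio alongside pass/fail counts. The sketch below is illustrative only: `failed_checks` and `not_evaluated_checks` are field names from this page, while `passed_checks` and the list-valued layout are assumptions.

```python
def completeness(claim: dict) -> float:
    """Fraction of checks that were actually evaluated (passed or failed).

    `passed_checks` is a hypothetical field; the other two names come from this page.
    """
    evaluated = len(claim.get("passed_checks", [])) + len(claim.get("failed_checks", []))
    skipped = len(claim.get("not_evaluated_checks", []))
    total = evaluated + skipped
    # A claim with no checks at all has nothing evaluated.
    return evaluated / total if total else 0.0


# Hypothetical claim: three checks evaluated, one not evaluated.
claim = {
    "passed_checks": ["causal", "schema"],
    "failed_checks": ["robustness"],
    "not_evaluated_checks": ["transfer"],
}

print(completeness(claim))
```

A low ratio points to missing data to collect, not to evidence against the claim.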
Resubmission strategy¶
- Fix structural incompleteness first (slices, schema, manifest integrity).
- Then target clusters of robustness and method-sensitivity failures.
- Re-run deterministic review before resubmitting.