2026-05-10
Can you trust an investigation’s conclusion? Run it again and see if a different temp reaches the same place.
A single Factosis investigation is path-dependent. The main agent’s first hypothesis priority, the sidekick’s first tool choice, even the specific tokens sampled: all are stochastic, and all shape everything downstream. The result is internally consistent (the evidence supports the conclusion on the path taken), but you can’t know whether a different path would have reached a different conclusion.
The completion audit (Opus) catches logical errors and unsupported claims within the path. It doesn’t catch path-shaped conclusions — results that are artifacts of the investigation order rather than properties of the underlying data.
Run N independent investigations of the same goal. Same data access, same tools, same (or different) models. Different stochastic paths. Compare outcomes.
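As a rough sketch of what "N independent investigations" could look like in code, the loop below gives every run its own scratch directory and never passes anything from one run to the next. `RunConfig`, `run_independent`, and the `investigate` callable are hypothetical names chosen for illustration; Factosis itself may expose something quite different.

```python
import tempfile
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunConfig:
    label: str  # e.g. "A", "B", "C"
    model: str  # e.g. "sonnet"


# Hypothetical signature: (goal, model, workdir) -> final hypothesis store.
Investigate = Callable[[str, str, str], List[dict]]


def run_independent(
    goal: str, configs: List[RunConfig], investigate: Investigate
) -> Dict[str, List[dict]]:
    """Run the same goal once per config, with no state shared between runs."""
    results: Dict[str, List[dict]] = {}
    for cfg in configs:
        # A fresh working directory per run: nothing written by one
        # investigation is visible to the next.
        prefix = f"investigation-{cfg.label}-"
        with tempfile.TemporaryDirectory(prefix=prefix) as workdir:
            results[cfg.label] = investigate(goal, cfg.model, workdir)
    return results
```

The runs are sequential here only to keep the sketch short. Because nothing is shared between them, they could equally run in parallel.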
Goal: "What caused the spike in failed logins last Tuesday?"
Investigation A (Sonnet, run 1):
→ Explores geo-IP first → finds 203.0.113.42 → concludes "external brute force"
Investigation B (Sonnet, run 2):
→ Explores IAM role changes first → finds policy change at 14:02 → concludes "misconfigured role caused auth failures"
Investigation C (Haiku, run 1):
→ Explores both → finds both → concludes "policy change + opportunistic scanning"
Three investigations, three conclusions. The divergence itself is the signal:
The mental model: you hire N temps from different agencies, hand them the same brief and the same filing cabinet, put them in separate rooms, and compare their reports at the end.
What you learn:
Different models amplify this (different salience patterns explore genuinely different paths). Same model, different runs still diverge due to sampling — less divergent, but cheaper and still informative.
Convergence is not proof. N models can converge on the same wrong answer if:
Divergence is not failure. Some questions legitimately have multiple valid answers. “What caused X?” may have three contributing factors, and each investigation finds a different one. The union of findings is richer than any single run.
The useful signal is unexpected convergence or unexpected divergence:
| Mechanism | What it validates | Scope |
|---|---|---|
| Completion audit (Opus) | Internal consistency of one path | End of investigation |
| Controller validation | LLM claims vs. actual tool output | Every iteration |
| Convergence testing | Whether the conclusion is path-dependent | Across N runs |
Convergence testing is orthogonal to both. You could have an investigation that passes the Opus audit perfectly (internally consistent, well-evidenced) but fails convergence testing (a different run reaches a different, equally well-evidenced conclusion).
┌─────────────────────────────────────────────────────────┐
│ ORCHESTRATOR │
│ Same goal → N independent investigation repos │
└───────┬───────────┬───────────┬─────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Investigation│ │ Investigation│ │ Investigation│
│ A (Sonnet) │ │ B (Sonnet) │ │ C (Opus) │
│ Fresh state │ │ Fresh state │ │ Fresh state │
│ Own git repo│ │ Own git repo│ │ Own git repo│
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ CONVERGENCE GATE │
│ │
│ Inputs: N final hypothesis stores + findings │
│ │
│ Compare: │
│ - Do hypotheses overlap in substance (not wording)? │
│ - Do certainty levels agree? │
│ - Do cited evidence sources overlap? │
│ - Are there contradictions? │
│ │
│ Output: │
│ - Convergent conclusions (high confidence) │
│ - Divergent conclusions (flag for human review) │
│ - Union of explored paths (broader than any single) │
└─────────────────────────────────────────────────────────┘
Each investigation is a standard Factosis run — unchanged architecture, no coordination between them, no shared state. The convergence gate only fires after all N complete. It’s a pure post-hoc comparison.
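One of the gate's questions, whether cited evidence sources overlap, can be answered without any model call. The sketch below assumes each finished run can be reduced to a set of source identifiers. That reduction, the function names, and the log file names in the usage note are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of sources cited by both runs, out of sources cited by either."""
    union = a | b
    if not union:
        return 0.0  # neither run cited anything, so there is nothing to overlap
    return len(a & b) / len(union)


def pairwise_evidence_overlap(
    cited: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap for every pair of runs, keyed by (label, label)."""
    return {
        (x, y): jaccard(cited[x], cited[y])
        for x, y in combinations(sorted(cited), 2)
    }


def union_of_sources(cited: Dict[str, Set[str]]) -> Set[str]:
    """Everything any run looked at: the 'union of explored paths' output."""
    return set().union(*cited.values())
```

With runs A, B, and C citing `{"auth.log", "geoip.log"}`, `{"auth.log", "iam_audit.log"}`, and all three files respectively, the A/B overlap is 1/3 while A/C and B/C are each 2/3. Low overlap between two runs that reach the same conclusion is arguably stronger support than high overlap, since they got there from different evidence.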
| Configuration | Cost multiplier | Divergence sensitivity |
|---|---|---|
| Same model, 2 runs | 2× | Catches sampling-driven path dependence |
| Same model, 3 runs | 3× | Majority vote possible |
| Mixed models, 2 runs | 2× | Catches model-specific salience bias |
| Mixed models, 3 runs | 3× | Best coverage of investigation space |
The cheap version: run twice with the same model. If they converge, done. If they diverge, run a third as a tiebreaker (or flag for human review). This costs 2× in the common case (convergence) and 3× in the rare case (divergence).
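The two-runs-then-tiebreaker policy is small enough to sketch directly. Here `run_once` stands in for a full investigation that returns a conclusion, and `agree` stands in for whatever comparison the gate uses. Both are hypothetical callables, not part of any existing API.

```python
from typing import Callable, Optional, Tuple

Conclusion = str
RunOnce = Callable[[], Conclusion]
Agree = Callable[[Conclusion, Conclusion], bool]


def two_then_tiebreak(
    run_once: RunOnce, agree: Agree
) -> Tuple[Optional[Conclusion], int]:
    """Return (accepted conclusion or None, number of runs spent)."""
    first, second = run_once(), run_once()
    if agree(first, second):
        return first, 2  # convergence: stop at 2x cost

    # Divergence: spend a third run and look for a majority.
    third = run_once()
    for earlier in (first, second):
        if agree(third, earlier):
            return earlier, 3
    return None, 3  # three different conclusions: flag for human review
```

If the first two runs diverge with probability p, the expected cost is 2(1 - p) + 3p = 2 + p runs, so the policy stays close to 2× as long as divergence is rare.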
For high-stakes investigations (incident response, compliance), 3× cost for confidence that the conclusion isn’t path-shaped may be cheap compared to acting on a path-dependent artifact.
Hypothesis stores are structured — comparison is mechanical:
// Investigation A
{"id": "H1", "statement": "external brute force", "certainty": "HIGH"}
{"id": "H2", "statement": "IAM misconfiguration", "certainty": "LOW"}
// Investigation B
{"id": "H1", "statement": "IAM policy change caused auth cascade", "certainty": "CONFIRMED"}
{"id": "H2", "statement": "external scanning (opportunistic)", "certainty": "MEDIUM"}
But statements are free-text — semantic comparison needs a model call. The gate itself requires LLM judgment: “are these two hypotheses about the same thing?”
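A minimal sketch of how the mechanical and semantic parts might fit together: the "same thing?" judgment is passed in as a callable, so the pairing logic stays testable without a model. The `same_topic` parameter, the function name, and the ordering of certainty labels in `CERTAINTY_RANK` are assumptions. Only the labels themselves come from the example stores above.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Dict[str, str]
SameTopic = Callable[[str, str], bool]  # stand-in for the LLM judgment call

# Assumed ordering of the certainty labels seen in the example stores.
CERTAINTY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CONFIRMED": 3}


def match_hypotheses(
    store_a: List[Hypothesis], store_b: List[Hypothesis], same_topic: SameTopic
) -> Tuple[List[Tuple[str, str, int]], List[Hypothesis], List[Hypothesis]]:
    """Pair hypotheses across two stores.

    Returns (matched, only_in_a, only_in_b), where each matched entry is
    (id in A, id in B, certainty gap in rank steps).
    """
    matched: List[Tuple[str, str, int]] = []
    only_in_a: List[Hypothesis] = []
    remaining_b = list(store_b)
    for ha in store_a:
        # Greedy: take the first unmatched B hypothesis the judge accepts.
        partner = next(
            (hb for hb in remaining_b
             if same_topic(ha["statement"], hb["statement"])),
            None,
        )
        if partner is None:
            only_in_a.append(ha)
            continue
        remaining_b.remove(partner)
        gap = abs(
            CERTAINTY_RANK[ha["certainty"]] - CERTAINTY_RANK[partner["certainty"]]
        )
        matched.append((ha["id"], partner["id"], gap))
    return matched, only_in_a, remaining_b
```

Applied to the two example stores with a judge that pairs the brute-force hypothesis with the scanning one and the two IAM hypotheses with each other, this reports a certainty gap of 1 for the first pair and 3 for the second. A large gap on a matched pair is a divergence signal even when both runs name the same cause.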
Cheaper signals that don’t need LLM comparison:
data/, should B see it? (Probably not: that leaks path dependence. Each run fetches independently.)
Consensus-Based Agent Memory asks: “what should persist across investigations?” Convergence testing asks: “can you trust this investigation’s conclusion?”
They share the mechanism (agreement between independent cognitive processes) but serve different purposes:
If convergence testing were standard, its output (the convergent subset) would be the natural input to a consensus memory store — you’d only persist conclusions that survived both the convergence gate and the agreement gate.