Convergence Testing

N Temps, Same Brief

2026-05-10

Can you trust an investigation’s conclusion? Run it again and see if a different temp reaches the same place.

The Problem

A single Factosis investigation is path-dependent. The main agent’s first hypothesis priority, the sidekick’s first tool choice, even the specific tokens sampled — all stochastic, all shaping everything downstream. The result is internally consistent (the evidence supports the conclusion on the path taken) but you can’t know whether a different path would have reached a different conclusion.

The completion audit (Opus) catches logical errors and unsupported claims within the path. It doesn’t catch path-shaped conclusions — results that are artifacts of the investigation order rather than properties of the underlying data.

The Idea

Run N independent investigations of the same goal. Same data access, same tools, same (or different) models. Different stochastic paths. Compare outcomes.

Goal: "What caused the spike in failed logins last Tuesday?"

Investigation A (Sonnet, run 1):
  → Explores geo-IP first → finds 203.0.113.42 → concludes "external brute force"

Investigation B (Sonnet, run 2):
  → Explores IAM role changes first → finds policy change at 14:02 → concludes "misconfigured role caused auth failures"

Investigation C (Haiku, run 1):
  → Explores both → finds both → concludes "policy change + opportunistic scanning"

Three investigations, three conclusions. The divergence itself is the signal:

Convergence → high confidence the conclusion is a property of the territory
Divergence → the conclusion is path-shaped; more investigation needed, or the question is underdetermined by available data
Partial overlap → the overlapping subset is territory-shaped; the remainder is path-shaped
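
The three cases above can be sketched as set operations. This is a minimal illustration, assuming each run's conclusions have already been normalised to comparable labels (a step that itself needs judgment in practice). The function name `classify_outcomes` and the label strings are hypothetical, not part of Factosis.

```python
from typing import Dict, List, Set


def classify_outcomes(runs: List[Set[str]]) -> Dict[str, object]:
    """Classify N runs as convergence, divergence, or partial overlap.

    Assumes each run's conclusions were already normalised to shared labels.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    shared = set.intersection(*runs)  # found by every run
    union = set.union(*runs)          # found by at least one run
    if shared == union:
        verdict = "convergence"
    elif not shared:
        verdict = "divergence"
    else:
        verdict = "partial overlap"
    return {
        "verdict": verdict,
        "territory_shaped": sorted(shared),
        "path_shaped": sorted(union - shared),
    }


# The failed-login example, with hypothetical normalised labels.
runs = [
    {"external_brute_force"},                # Investigation A
    {"iam_policy_change"},                   # Investigation B
    {"iam_policy_change", "external_scan"},  # Investigation C
]
result = classify_outcomes(runs)
print(result["verdict"])  # divergence
```

No single label appears in all three runs, so the sketch reports divergence even though B and C agree on the policy change. A stricter or looser rule (for example, majority support) is a design choice the text leaves open.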

N Temps at a Point in Time

The mental model: you hire N temps from different agencies, hand them the same brief and the same filing cabinet, put them in separate rooms, and compare their reports at the end.

What you learn:

If all N find the same thing via different paths → that thing is almost certainly real
If they find different things → either the brief is ambiguous, the data supports multiple valid conclusions, or some of them got stuck in local optima
The set of all findings across N runs is broader than any single run — you’ve explored more of the investigation space

Different models amplify this (different salience patterns explore genuinely different paths). Repeated runs of the same model still diverge through sampling alone; they diverge less than mixed-model runs, but they are cheaper and still informative.

What Convergence Means (and Doesn’t)

Convergence is not proof. N models can converge on the same wrong answer if:

The available data only supports one interpretation (even if that interpretation is wrong)
All models share training biases that make them notice the same things
The tools available constrain which paths are explorable

Divergence is not failure. Some questions legitimately have multiple valid answers. “What caused X?” may have three contributing factors, and each investigation finds a different one. The union of findings is richer than any single run.

The useful signal is unexpected convergence or unexpected divergence:

Expected convergence (clear-cut data) → confirms but doesn’t surprise
Unexpected divergence (seemingly clear question, different answers) → something is wrong with the data, the question, or the investigation depth
Unexpected convergence (ambiguous question, same answer anyway) → strong signal

Relationship to Existing Mechanisms

Mechanism	What it validates	Scope
Completion audit (Opus)	Internal consistency of one path	End of investigation
Controller validation	LLM claims vs. actual tool output	Every iteration
Convergence testing	Whether the conclusion is path-dependent	Across N runs

Convergence testing is orthogonal to both. You could have an investigation that passes the Opus audit perfectly (internally consistent, well-evidenced) but fails convergence testing (a different run reaches a different, equally well-evidenced conclusion).

Design Sketch

┌─────────────────────────────────────────────────────────┐
│  ORCHESTRATOR                                           │
│  Same goal → N independent investigation repos          │
└───────┬───────────┬───────────┬─────────────────────────┘
        │           │           │
        ▼           ▼           ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Investigation│ │ Investigation│ │ Investigation│
│ A (Sonnet)  │ │ B (Sonnet)  │ │ C (Opus)    │
│ Fresh state │ │ Fresh state │ │ Fresh state │
│ Own git repo│ │ Own git repo│ │ Own git repo│
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       │                │                │
       ▼                ▼                ▼
┌─────────────────────────────────────────────────────────┐
│  CONVERGENCE GATE                                       │
│                                                         │
│  Inputs: N final hypothesis stores + findings           │
│                                                         │
│  Compare:                                               │
│  - Do hypotheses overlap in substance (not wording)?    │
│  - Do certainty levels agree?                           │
│  - Do cited evidence sources overlap?                   │
│  - Are there contradictions?                            │
│                                                         │
│  Output:                                                │
│  - Convergent conclusions (high confidence)             │
│  - Divergent conclusions (flag for human review)        │
│  - Union of explored paths (broader than any single)    │
└─────────────────────────────────────────────────────────┘

Each investigation is a standard Factosis run — unchanged architecture, no coordination between them, no shared state. The convergence gate only fires after all N complete. It’s a pure post-hoc comparison.
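
The isolation and the "gate fires last" property can be sketched as follows. This is an illustration under stated assumptions, not the Factosis orchestrator: `run_all`, `stub_run`, and the `(goal, model, workdir)` signature are hypothetical, and a fresh temporary directory stands in for each run's own git repo.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical signature: (goal, model, workdir) -> final report dict.
RunFn = Callable[[str, str, Path], Dict]


def run_all(goal: str, models: List[str], run_fn: RunFn) -> List[Dict]:
    """Run one isolated investigation per model entry, then return all reports."""
    with tempfile.TemporaryDirectory() as root:
        workdirs = []
        for i in range(len(models)):
            d = Path(root) / f"run_{i}"
            d.mkdir()  # fresh state: nothing shared between runs
            workdirs.append(d)
        with ThreadPoolExecutor(max_workers=len(models)) as pool:
            futures = [
                pool.submit(run_fn, goal, model, d)
                for model, d in zip(models, workdirs)
            ]
            # Blocks until every run finishes; nothing is compared earlier.
            return [f.result() for f in futures]


def stub_run(goal: str, model: str, workdir: Path) -> Dict:
    """Stand-in for a real investigation; only records what it was given."""
    (workdir / "notes.txt").write_text(goal)
    return {"model": model, "files": sorted(p.name for p in workdir.iterdir())}


reports = run_all(
    "What caused the spike in failed logins last Tuesday?",
    ["sonnet", "sonnet", "opus"],
    stub_run,
)
print(len(reports))  # 3
```

Each stub run sees only its own directory, which is the property that matters: the comparison step receives N finished reports and never influences a run in progress.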

Cost Model

Configuration	Cost multiplier	Divergence sensitivity
Same model, 2 runs	2×	Catches sampling-driven path dependence
Same model, 3 runs	3×	Majority vote possible
Mixed models, 2 runs	2×	Catches model-specific salience bias
Mixed models, 3 runs	3×	Best coverage of investigation space

The cheap version: run twice with the same model. If the two runs converge, stop. If they diverge, run a third as a tiebreaker (or flag for human review). The cost is 2× whenever the first two runs agree and 3× only when they disagree.
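
A minimal sketch of that run-twice-then-tiebreak procedure, assuming a hypothetical `run` callable that returns one conclusion per call and a hypothetical `same` callable that decides whether two conclusions agree. Neither is an existing Factosis interface.

```python
from typing import Callable, List, Tuple


def two_then_tiebreak(
    run: Callable[[], str],
    same: Callable[[str, str], bool],
) -> Tuple[str, int, List[str]]:
    """Run twice; add a third run only if the first two disagree.

    Returns (verdict, runs_used, conclusions).
    """
    a, b = run(), run()
    if same(a, b):
        return "converged", 2, [a, b]
    c = run()  # tiebreaker
    if same(a, c) or same(b, c):
        return "majority", 3, [a, b, c]
    return "needs human review", 3, [a, b, c]


def expected_cost_multiplier(p_diverge: float) -> float:
    """Expected runs when the first two disagree with probability p_diverge."""
    return 2.0 + p_diverge


# Scripted conclusions stand in for real investigations.
scripted = iter(["brute force", "iam change", "iam change"])
verdict, runs_used, _ = two_then_tiebreak(
    lambda: next(scripted), lambda x, y: x == y
)
print(verdict, runs_used)  # majority 3
```

The expected multiplier follows from the two cases: 2 runs with probability 1 - p and 3 runs with probability p, which simplifies to 2 + p. The value of p is not known in advance and would have to be measured.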

For high-stakes investigations (incident response, compliance), 3× cost for confidence that the conclusion isn’t path-shaped may be cheap compared to acting on a path-dependent artifact.

What the Convergence Gate Needs to Compare

Hypothesis stores are structured, so comparing their IDs and certainty fields is mechanical:

// Investigation A
{"id": "H1", "statement": "external brute force", "certainty": "HIGH"}
{"id": "H2", "statement": "IAM misconfiguration", "certainty": "LOW"}

// Investigation B
{"id": "H1", "statement": "IAM policy change caused auth cascade", "certainty": "CONFIRMED"}
{"id": "H2", "statement": "external scanning (opportunistic)", "certainty": "MEDIUM"}

But statements are free-text — semantic comparison needs a model call. The gate itself requires LLM judgment: “are these two hypotheses about the same thing?”
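
One way to keep the gate testable is to make the judgment pluggable. The sketch below pairs hypotheses from the two stores above through a `judge` callable; in a real gate that callable would wrap the model call, here a toy keyword matcher stands in for it. All names (`match_hypotheses`, `keyword_judge`, `CERTAINTY_RANK`) are hypothetical, and the ordering of certainty labels is an assumption, since the example only shows the labels themselves.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Dict[str, str]
Judge = Callable[[str, str], bool]  # "are these two statements about the same thing?"

# Assumed ordering of certainty labels; not confirmed by the source.
CERTAINTY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CONFIRMED": 3}


def match_hypotheses(
    store_a: List[Hypothesis], store_b: List[Hypothesis], judge: Judge
) -> List[Tuple[Hypothesis, Hypothesis, int]]:
    """Pair hypotheses the judge considers equivalent, with their certainty gap."""
    pairs = []
    for ha in store_a:
        for hb in store_b:
            if judge(ha["statement"], hb["statement"]):
                gap = abs(
                    CERTAINTY_RANK[ha["certainty"]] - CERTAINTY_RANK[hb["certainty"]]
                )
                pairs.append((ha, hb, gap))
    return pairs


def keyword_judge(a: str, b: str) -> bool:
    """Toy stand-in for the model call: match on a shared topic keyword."""
    topics = ("iam", "external")
    return any(t in a.lower() and t in b.lower() for t in topics)


store_a = [
    {"id": "H1", "statement": "external brute force", "certainty": "HIGH"},
    {"id": "H2", "statement": "IAM misconfiguration", "certainty": "LOW"},
]
store_b = [
    {"id": "H1", "statement": "IAM policy change caused auth cascade", "certainty": "CONFIRMED"},
    {"id": "H2", "statement": "external scanning (opportunistic)", "certainty": "MEDIUM"},
]
matches = match_hypotheses(store_a, store_b, keyword_judge)
for ha, hb, gap in matches:
    print(f"A.{ha['id']} ~ B.{hb['id']} (certainty gap {gap})")
```

With the toy judge, A's H1 pairs with B's H2 and A's H2 pairs with B's H1. Whether "brute force" and "opportunistic scanning" really are the same hypothesis is exactly the call the keyword matcher cannot make, which is why the real gate needs a model. The large certainty gap on the IAM pair (LOW against CONFIRMED) is the kind of disagreement the gate would flag.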

Cheaper signals that don’t need LLM comparison:

Evidence overlap: do both cite the same tool outputs / data ranges?
Source overlap: do both query the same tables / APIs?
Certainty direction: do both escalate the same hypothesis IDs? (This fails when IDs are assigned independently per run, as in the example above, where H1 names a different hypothesis in A than in B.)
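
Evidence and source overlap reduce to set similarity. A minimal sketch using Jaccard similarity, assuming each run can export the identifiers of the tool outputs or tables it cited; the identifier strings below are made up for illustration.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Intersection size over union size; defined as 1.0 for two empty sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical evidence identifiers cited by each run.
evidence_a = ["auth_logs", "geoip_lookup", "iam_audit_trail"]
evidence_b = ["auth_logs", "iam_audit_trail", "deploy_history"]
print(jaccard(evidence_a, evidence_b))  # 0.5
```

A high score with different conclusions suggests the runs read the same evidence differently; a low score suggests they simply looked in different places. Any threshold for "high" or "low" would need to be chosen empirically.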

Open Questions

When to trigger? Every investigation? Only high-stakes ones? Only when the single-run audit raises concerns?
How to handle the union? If A finds X and B finds Y and both are valid, the “true” answer is X+Y. Who synthesises?
Shared data vs. independent data? If A’s bun_script fetches data and writes to data/, should B see it? (Probably not — that leaks path dependence. Each run fetches independently.)
Stopping early? If A and B converge after 5 loops each, do you need C? (Probably not for same-model runs; yes for mixed-model if you want salience-diversity.)

Relationship to Consensus Memory

Consensus-Based Agent Memory asks: “what should persist across investigations?” Convergence testing asks: “can you trust this investigation’s conclusion?”

They share the mechanism (agreement between independent cognitive processes) but serve different purposes:

Consensus memory → durability of knowledge
Convergence testing → reliability of a single conclusion

If convergence testing were standard, its output (the convergent subset) would be the natural input to a consensus memory store — you’d only persist conclusions that survived both the convergence gate and the agreement gate.