You Keep Trusting The Model And The Model Keeps Lying

2026-05-08

Having built a couple of agents from the ground up, I kept running into the same issue: flakiness.

As I see it there are two ways of dealing with it:

Deal with possible failure by testing the output the agent fabricated
Minimize the chance of failure by shaping the context

The first one is to build a decent pipeline, where you actually test and reboot and are ok with the waste of cheap LLM labour and repeated generation time. See provable-pipes for a warning shot of how your pipeline is the only real authoritative harness.

The second one is what this post is about.

Previous agent context corralling

Tool results

The first agent I built had to deal with lots of data. Massive kubernetes/prometheus dumps flooded the context and completely screwed the context pooch with noise.

Tool tweaks I tried:

json -> yaml: a format swap, the simplest of token optimizations.
Summarizing the results with a fast SLM (Small Language Model) was good, but sometimes the results were larger than it could handle.
So I told the agent the tool call failed on size, and that it should ask differently, use filters etc.
The summary lost details because the SLM didn’t know what it was looking for, so I added a ‘focus on finding x’ to every tool call, just in case the result has to be summarized.
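
A minimal Python sketch of how those tweaks could fit together (not the code from either agent). The limits, the status names and `fake_summarize` are all made up for illustration; a real version would call an SLM and probably count tokens instead of characters.

```python
from typing import Callable

# Hypothetical limits, in characters for simplicity.
RAW_LIMIT = 2_000        # small enough to hand to the agent untouched
SUMMARY_LIMIT = 20_000   # the most the summarizer is assumed to cope with


def gate_tool_result(raw: str, focus: str,
                     summarize: Callable[[str, str], str]) -> dict:
    """Decide what the agent gets to see for one tool result."""
    if len(raw) <= RAW_LIMIT:
        return {"status": "ok", "content": raw}
    if len(raw) <= SUMMARY_LIMIT:
        # The focus hint travels with every tool call, so the summarizer
        # knows what it is looking for.
        return {"status": "summarized", "content": summarize(raw, focus)}
    # Too big even to summarize: report the failure instead of truncating.
    return {
        "status": "too_large",
        "content": (
            f"Result was {len(raw)} characters, over the limit of "
            f"{SUMMARY_LIMIT}. Ask again with filters or a narrower query."
        ),
    }


def fake_summarize(text: str, focus: str) -> str:
    """Stand-in for the SLM call: keep only lines that mention the focus."""
    return "\n".join(line for line in text.splitlines() if focus in line)
```

The point of the `too_large` branch is that the agent gets an instruction it can act on, not a silently chopped result.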

Convo history

Second, obviously, the loop iteration history.

With the agent kicking ass and taking tokens from tools, exploring reality, it’s filling the context with 80% noise 20% signal for the next loop iteration.

Context tweaks I exposed the agent loops to:

A sliding window that just cuts off the old stuff was simple, but the record of what the agent had already tried got nuked.
Next, compressing the old stuff with an SLM, just like the tool results, worked, but facts might get lost along the way.
Adding a tool to save facts and append them to all tool calls was hit-or-miss on whether the agent called it.
Adding a tool for the agent to compress its own context (intent, from, to) ran into the same hit-or-miss problem as the fact tool.
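
As a rough Python sketch of the first and third tweak combined: a sliding window that keeps saved facts pinned at the top. The function name, the message format and the default `keep_last` are assumptions for illustration, not how either agent actually stored its history.

```python
def build_window(history: list[str], facts: list[str],
                 keep_last: int = 4) -> list[str]:
    """Keep the newest messages, but pin saved facts so they survive the cut."""
    dropped = max(0, len(history) - keep_last)
    context = []
    if facts:
        context.append("Saved facts:\n" + "\n".join(f"- {f}" for f in facts))
    if dropped:
        # Tell the model that something was removed, so it does not assume
        # it is looking at the whole run.
        context.append(f"[{dropped} older messages dropped]")
    context.extend(history[dropped:])
    return context
```

The window itself is deterministic; the hit-or-miss part is whether the agent ever calls the tool that fills `facts`.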

Context shaping that led this meatbag along the yellow brick road

Lots has been written about large contexts’ quality attrition and bullshit fabrication levels increasing, not repeating it here.

Some of this is basically how coding agents deal with massive bodies of text and source code, by grepping and mapping their way through it.

The context saving skills and proxies - cavemen, rtks of the coding agents.

Building emapi (basically an emoji knowledge graph) made me very aware of the newer models’ capability to stick to json outputs and how it shapes its thinking.

I liked the idea of multiple agents handing off snippets of context and the context ‘steering’ idea from Strands Agents, and for sure never much liked langgraph’s habit of workflowing the shit out of thought.

I sometimes mused about the effect of deleting rando chunks of context out of a history to ‘boot the head’ of a stuck-in-a-loop loop.

I also really liked the lean ideas like JIT skills expanding MCP tools and focused sub agents.

So on a mission to keep the context ‘so tight that the LLM cannot get it wrong’ I built factosis.

Factosis

graph LR
  Human([Human])
  Controller[Controller<br/>Go]
  Main[Main Agent<br/>Sonnet]
  Sidekick[Sidekick<br/>Sonnet]
  Audit[Audit<br/>Opus]

  Human --- Controller
  Controller --- Main
  Controller --- Sidekick
  Controller --- Audit
  Main -.- Sidekick

☝️ are the high-level components of factosis.

It’s a two-loop draconian context management system, wrangling LLMs into finding facts in data, systems etc. The main idea - give an agentic system a fair shot at either admitting failure or finding proof in data or sources, by fighting the embedded must-answer-something bias its training bestowed on it.

My take on agentic ‘reasoning’ failure is the shitty state the dojos (RLHF labs) put LLMs into. They don’t reason, they just generate probabilistic tokens, and if there’s too much noise shoved into the crap they base their generation on, they have no idea they are enshittifying ‘reasoning’. And since the training says they should answer in a happy tone, the model will confidently talk shit / lie when it’s not super obvious what the most likely answer is. There is no context filter that makes it better, no multi-shot or context phase shifting to see which is better, no clear ‘after 100k tokens, stop using this LLM’ rule; it’s just going to be shitting you. So I stopped trying to do post-call bs-catching in the output and focused on what goes into the context, to increase the odds of a good token guess.

Reasoning and fact-finding split

First, I split the roles between two agents: one a main hypothesis reasoner, the other a tool-endowed researcher.

The main loop’s context deals only with hypotheses and reasoning on proofs; the main agent then asks a sidekick to investigate an angle. The sidekick investigation is also context-gated to only see its goal, snippets of its loop history, data collected, and the last tool output.

The main agent spawns as many sidekicks as it thinks are needed to prove / disprove its set of hypotheses.

The hand-off between them goes through the controller via a structured json interface. Some of the hand-off is even checked for quality by the Go controller.
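
To give a feel for what a controller-side check on a hand-off could look like, here is a minimal sketch in Python (the real controller is Go). The field names `hypothesis_id`, `goal` and `max_iterations` and the ‘too vague’ rule are invented for the example; the actual factosis schema is not shown in this post.

```python
import json

# Hypothetical hand-off schema: field name -> expected type.
REQUIRED_FIELDS = {"hypothesis_id": str, "goal": str, "max_iterations": int}


def parse_handoff(raw):
    """Return (handoff, errors); handoff is None when anything is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid json: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["hand-off must be a json object"]
    errors = []
    for name, kind in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], kind):
            errors.append(f"field {name} must be {kind.__name__}")
    # A crude quality check: a goal of a few characters is not a mission.
    if isinstance(data.get("goal"), str) and len(data["goal"].strip()) < 10:
        errors.append("goal is too vague to investigate")
    return (data if not errors else None), errors
```

A rejected hand-off can be fed back to the main agent as a list of errors, so no sidekick gets spawned on garbage.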

Context reconstruction

Within both of the agents’ loops, in each iteration they only get a controller-generated context based on the role they play and what has happened before.

The main agent gets:

list of hypotheses,
episodic summary (including ‘abandoned angles’) and
the results from the previous investigations it launched.

The abandoned_angles field is to prevent the agent from chasing down previously explored leads/missions.
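
A sketch of what rebuilding the main agent’s context from controller state might look like, in Python for brevity. Only `abandoned_angles` is a name from factosis; the other keys and the text layout are assumptions.

```python
def build_main_context(state: dict) -> str:
    """Render the main agent's prompt from controller state, not chat history."""
    lines = ["Hypotheses:"]
    for h in state["hypotheses"]:
        lines.append(f"- [{h['status']}] {h['text']}")
    lines.append("Summary so far: " + state["summary"])
    lines.append("Abandoned angles (do not retry):")
    lines.extend(f"- {angle}" for angle in state["abandoned_angles"])
    lines.append("Results of previous investigations:")
    lines.extend(f"- {r['goal']}: {r['finding']}" for r in state["results"])
    return "\n".join(lines)
```

Because the prompt is regenerated from state on every iteration, nothing the model said three loops ago can leak back in unless the controller chose to keep it.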

The sidekick gets a bit more involved. It gets:

plan status,
iteration budget,
executed tool summaries,
output previews and the last tool output if small enough,
previously collected data (from other sidekicks) and
notes on previous sources poked and success rates.
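
The ‘last tool output if small enough’ rule and the iteration budget can be sketched like this in Python. The two limits, the dict keys and the function name are hypothetical, and plan status, collected data and source notes are left out to keep it short.

```python
PREVIEW_CHARS = 200    # hypothetical: how much of a big output to preview
INLINE_LIMIT = 1_000   # hypothetical: largest output shown in full


def build_sidekick_context(goal: str, iteration: int, budget: int,
                           tool_runs: list[dict]) -> str:
    """Render a sidekick prompt: goal, budget, summaries, last output."""
    lines = [f"Goal: {goal}", f"Iteration {iteration} of {budget}"]
    lines.append("Tools run so far:")
    for run in tool_runs:
        lines.append(f"- {run['tool']}: {run['summary']}")
    if tool_runs:
        last = tool_runs[-1]
        out = last["output"]
        if len(out) <= INLINE_LIMIT:
            lines.append(f"Last output ({last['tool']}):\n{out}")
        else:
            # Too big to inline: show a preview and say how much was cut.
            lines.append(
                f"Last output ({last['tool']}, preview of {len(out)} chars):\n"
                + out[:PREVIEW_CHARS]
            )
    return "\n".join(lines)
```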

Re-entrant

All state changes in the main and sidekicks are tracked in a git repo, which means that at any point you can kill / restart the controller and only the current in-play LLM loop will be reset. The context will be reconstituted, just like in the loop, and the system will continue on its merry way.
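
A minimal sketch of the ‘state lives in git’ idea, in Python (the real controller is Go). The single `state.json` file and its default contents are assumptions; factosis may lay out its repo quite differently. It also assumes `git` is on the PATH and the directory is already an initialised repo.

```python
import json
import os
import subprocess


def save_state(repo: str, state: dict, message: str, commit: bool = True) -> None:
    """Write state to disk and record the change as one git commit."""
    path = os.path.join(repo, "state.json")
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2, sort_keys=True)
    os.replace(tmp, path)  # atomic swap, so a kill never leaves half a file
    if commit:
        subprocess.run(["git", "-C", repo, "add", "state.json"], check=True)
        # --allow-empty keeps a no-op save from failing the commit.
        subprocess.run(
            ["git", "-C", repo, "commit", "--allow-empty", "-m", message],
            check=True,
        )


def load_state(repo: str) -> dict:
    """Read the last saved state; start empty if nothing was saved yet."""
    path = os.path.join(repo, "state.json")
    if not os.path.exists(path):
        return {"hypotheses": [], "results": []}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

On restart the controller only needs `load_state` plus the context builder; whatever the interrupted LLM call was doing is simply redone.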

Because the main / sidekick can be interrupted anywhere, the whole execution becomes something that can be branched and nudged if you ‘did not like where it went’, or stopped after 10m to see how it is going.

Needs and nudges

Whenever the main or sidekick requires some human input, it writes a needs/xxx file. This is primarily used for the ask-human and request-credentials tools of the respective agents. When one of these tools is called and the file is written, the agent dies and waits to be restarted once the feedback has been given. The filesystem basically plays an async machine-human messaging role.

Ask-human is used when the agent hits a reasoning dead-end that requires human domain knowledge or judgment to resolve.

Request-credentials is when the sidekick needs access to a system it doesn’t have access to and asks for credential/infrastructure required.
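
Roughly, the needs mechanism could look like the Python sketch below. The file naming, the json layout and the `answer` field are assumptions for illustration; the post only says that a needs/xxx file gets written.

```python
import json
import os


def write_need(workdir: str, kind: str, agent: str, question: str) -> str:
    """Drop a request for the human into needs/ and return its path."""
    needs_dir = os.path.join(workdir, "needs")
    os.makedirs(needs_dir, exist_ok=True)
    path = os.path.join(needs_dir, f"{kind}-{agent}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(
            {"kind": kind, "agent": agent, "question": question, "answer": None},
            fh,
            indent=2,
        )
    return path


def pending_needs(workdir: str) -> list[str]:
    """List need files the human has not answered yet."""
    needs_dir = os.path.join(workdir, "needs")
    if not os.path.isdir(needs_dir):
        return []
    pending = []
    for name in sorted(os.listdir(needs_dir)):
        with open(os.path.join(needs_dir, name), encoding="utf-8") as fh:
            if json.load(fh).get("answer") is None:
                pending.append(name)
    return pending
```

On startup the controller would check `pending_needs` and refuse to resume the waiting agent until the list is empty.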

If the human operator thinks the reasoning of the main agent went / is going off the rails, either in retrospect (going back to a previous commit where the agent was still on track) or mid loop, the human can kill the controller, write a nudge file to shove something important into the main agent’s context, and restart the controller. This will influence the main agent’s reasoning, but not replace its state.
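
The ‘influence, not replace’ part can be sketched in a few lines of Python. The `nudge.md` filename and the wording of the injected header are made up; the post does not say what the nudge file is actually called.

```python
import os


def apply_nudge(workdir: str, context: str) -> str:
    """Append the operator's nudge to the rebuilt context, if there is one."""
    path = os.path.join(workdir, "nudge.md")  # hypothetical filename
    if not os.path.exists(path):
        return context
    with open(path, encoding="utf-8") as fh:
        nudge = fh.read().strip()
    if not nudge:
        return context
    # The stored state is untouched; only this prompt gets the extra text.
    return context + "\n\nOperator nudge (treat as important input):\n" + nudge
```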

Files and deployment

The file system, the git on top of it and the restartability built into factosis mean you can build a filesystem-watching, git-logging ui on top of it in n different forms: a ui lambda watching efs, or a local fat/text ui watching the local filesystem. The actual controller and LLM calls could easily be lambda runs, possibly with a stepfunction nanny.

Claude-Opus-sized quality gate

Even with all the two agent context dieting going on, I still wanted to have a second outside opinion. What started as an optional call-a-friend (which was another hit or miss based on the main agent’s certainty of its reasoning over time), turned into a forced audit and reasoning rating at the end. This Opus git-endowed agent would review the git history and check on how the main agent progressed over time and rate its performance. So after the main agent figured out it had sent sidekicks off to yonder real world to make a conclusive yay or nay, the audit agent would vet the reasoning one last time.

Conclusions

For the testing I’ve done, the bun->duckdb, bun->jq handoffs work well, and the needles were found in the haystacks of 2026’s interwebs.

Another observation, from my limited testing, is the sluggish sidekick speed. It is probably due to some of the framing or some of the suggestions in the prompts. I will have to experiment further at some point, possibly by generating a fake tool loop into the context.

So if you need to find proof or a reason behind some event, factosis will make up hypotheses and then chase them down to see if it can find the what and why.

Here is the factosis system’s repo if you want to see what three months of deleting your own bright ideas looks like.

Example runs of INTC’s dodgy moves, and world energy changes in 2026