2026-05-08
Having built a couple of agents from the ground up, I found flakiness was always the issue that stood out.
As I see it there are two ways of dealing with it:
The first one is to build a decent pipeline, where you actually test and reboot and are ok with the waste of cheap LLM labour and repeated generation time. See provable-pipes for a warning shot of how your pipeline is the only real authoritative harness.
The second one is what this post is about.
The first agent I built had to deal with lots of data. Massive kubernetes/prometheus dumps flooded the context and completely screwed the context pooch with noise.
Tool tweaks I tried:
Second, obviously loop iteration history
With the agent kicking ass and taking tokens from tools, exploring reality, it’s filling the context with 80% noise 20% signal for the next loop iteration.
Context tweaks I exposed the agent loops to:
Lots has been written about large contexts’ quality attrition and bullshit fabrication levels increasing, not repeating it here.
Some of this is basically how coding agents deal with massive bodies of text and source code, by grepping and mapping their way through it.
The context saving skills and proxies - cavemen, rtks of the coding agents.
Building emapi (basically an emoji knowledge graph) made me very aware of the newer models’ ability to stick to JSON outputs and how that shapes their thinking.
I liked the idea of multiple agents handing off snippets of context and the context ‘steering’ idea from Strands Agents, and for sure never much liked langgraph’s workflow-the-shit-out-of-thought approach.
I sometimes mused about the effect of deleting rando chunks of context out of a history to ‘boot the head’ of a stuck-in-a-loop loop.
I also really liked the lean ideas like JIT skills expanding MCP tools and focused sub agents.
So on a mission to keep the context ‘so tight that the LLM cannot get it wrong’ I built factosis.
graph LR
Human([Human])
Controller[Controller<br/>Go]
Main[Main Agent<br/>Sonnet]
Sidekick[Sidekick<br/>Sonnet]
Audit[Audit<br/>Opus]
Human --- Controller
Controller --- Main
Controller --- Sidekick
Controller --- Audit
Main -.- Sidekick
☝️ are the high-level components of factosis.
It’s a two-loop draconian context management system, wrangling LLMs into finding facts in data, systems etc. The main idea is to give an agentic system a fair shot at either admitting failure or finding proof in data or sources, by fighting the embedded ‘must answer something’ bias its training bestowed on it.
My take is that agentic ‘reasoning’ failure comes from the shitty state the dojos (RLHF labs) put LLMs into. They don’t reason, they just generate probabilistic tokens, and if there’s too much noise shoved into the crap they base their generation on, they have no idea they are enshittifying their ‘reasoning’. And since the training says they should answer in a happy tone, the model will confidently talk shit / lie when it’s not super obvious what the most likely answer is. There is no context filter that makes it better, no multi shot, no context phase shifting to see which is better, no clear ‘after 100k tokens stop using this LLM, it’s just going to be shitting you’ signal. So I stopped trying to do post-call bs-catching in the output and focused on what goes into the context, to increase the odds of a good token guess.
First, I split the work between two agents: one a main hypothesis reasoner, the other a tool-endowed researcher.
The main loop’s context would deal only with hypothesis and reasoning on proofs, then ask a sidekick to investigate an angle. The sidekick investigation is also context-gated to only see its goal, snippets of its loop history, data collected, and the last tool output.
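The sidekick gating described above could look roughly like this in Go. This is a minimal sketch under my own assumptions: the `SidekickState` type, the `buildSidekickContext` function, the `keep` parameter and the section labels are all hypothetical names for illustration, not factosis's real schema.

```go
package main

import (
	"fmt"
	"strings"
)

// SidekickState is a hypothetical record of everything a sidekick has done.
type SidekickState struct {
	Goal        string
	History     []string // one summary line per past iteration
	Data        []string // facts collected so far
	ToolOutputs []string // raw tool outputs, oldest first
}

// buildSidekickContext renders only what the next iteration may see:
// the goal, the last `keep` history snippets, the collected data and the
// most recent tool output. Older raw tool outputs are left out.
func buildSidekickContext(s SidekickState, keep int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "GOAL: %s\n", s.Goal)
	hist := s.History
	if keep >= 0 && len(hist) > keep {
		hist = hist[len(hist)-keep:]
	}
	for _, h := range hist {
		fmt.Fprintf(&b, "HISTORY: %s\n", h)
	}
	for _, d := range s.Data {
		fmt.Fprintf(&b, "DATA: %s\n", d)
	}
	if n := len(s.ToolOutputs); n > 0 {
		fmt.Fprintf(&b, "LAST_TOOL_OUTPUT: %s\n", s.ToolOutputs[n-1])
	}
	return b.String()
}

func main() {
	s := SidekickState{
		Goal:        "find out why pods were evicted",
		History:     []string{"listed nodes", "queried kubelet logs", "queried disk metrics"},
		Data:        []string{"node-3 hit 95% disk at 02:10"},
		ToolOutputs: []string{"<huge node dump>", "<disk metrics rows>"},
	}
	fmt.Print(buildSidekickContext(s, 2))
}
```

The point of the sketch is that the noisy first tool dump never reaches the next prompt; only what the controller chooses to render does.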
The main agent spawns as many sidekicks as it thinks are needed to prove / disprove its set of hypotheses.
The hand-off between them goes through the controller via a structured JSON interface, and the Go controller even checks some of the hand-off for quality.
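To make the hand-off check concrete, here is a hedged sketch of what a controller-side gate might do. The `Handoff` fields, the `parseHandoff` function and the "at least three words in the goal" rule are assumptions made up for this example; the real interface and checks may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Handoff is a hypothetical shape for a main-to-sidekick mission request.
// The field names are assumptions for illustration only.
type Handoff struct {
	Hypothesis string `json:"hypothesis"`
	Goal       string `json:"goal"`
}

// parseHandoff decodes the raw LLM output and applies cheap quality checks
// before anything is passed on to a sidekick.
func parseHandoff(raw string) (Handoff, error) {
	var h Handoff
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields() // reject fields outside the agreed interface
	if err := dec.Decode(&h); err != nil {
		return Handoff{}, fmt.Errorf("not valid hand-off JSON: %w", err)
	}
	if strings.TrimSpace(h.Hypothesis) == "" {
		return Handoff{}, errors.New("hand-off has no hypothesis")
	}
	if len(strings.Fields(h.Goal)) < 3 {
		return Handoff{}, errors.New("goal too vague to investigate")
	}
	return h, nil
}

func main() {
	_, err := parseHandoff(`{"hypothesis":"disk pressure caused the evictions","goal":"look"}`)
	fmt.Println(err)
}
```

A rejected hand-off can simply be bounced back to the main agent for another attempt, which is cheaper than letting a sidekick burn tokens on a mission nobody can evaluate.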
Within both of the agents’ loops, in each iteration they only get a controller-generated context based on the role they play and what has happened before.
The main agent gets:
The abandoned_angles field is to prevent the agent from chasing down previously explored leads/missions.
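In the post the abandoned_angles field lives in the prompt. As an illustration only, a controller could also enforce it with a check like the one below. The `normalize` and `isAbandoned` helpers are hypothetical, and exact matching after normalisation is a deliberately naive assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize makes angle comparison insensitive to case and extra whitespace.
func normalize(angle string) string {
	return strings.Join(strings.Fields(strings.ToLower(angle)), " ")
}

// isAbandoned reports whether a proposed angle matches one already given up on.
func isAbandoned(proposed string, abandonedAngles []string) bool {
	p := normalize(proposed)
	for _, a := range abandonedAngles {
		if normalize(a) == p {
			return true
		}
	}
	return false
}

func main() {
	abandoned := []string{"Check node disk pressure"}
	fmt.Println(isAbandoned("check  node disk pressure", abandoned)) // true
	fmt.Println(isAbandoned("check kubelet restarts", abandoned))    // false
}
```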
The sidekick gets a bit more involved. It gets:
All state changes in the main and sidekicks are tracked in a git repo, which means that at any point you can kill / restart the controller and only the current in-play LLM loop will be reset. The context will be reconstituted, just like in the loop, and the system will continue on its merry way.
Because the main / sidekick can be interrupted anywhere, the whole execution becomes something that can be branched and nudged if you ‘did not like where it went’, or stopped after 10m to see how it is going.
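The restart behaviour rests on state being written to disk after every step. The sketch below shows only that save-and-reload half and leaves the git commit out entirely. `LoopState`, `saveState`, `loadState` and the `state.json` file name are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// LoopState is a hypothetical snapshot of one agent loop.
type LoopState struct {
	Iteration int      `json:"iteration"`
	Facts     []string `json:"facts"`
}

// saveState writes the snapshot to a temp file and renames it into place,
// so a kill mid-write does not leave a half-written state.json behind.
func saveState(dir string, s LoopState) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, "state.json.tmp")
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "state.json"))
}

// loadState returns the last saved snapshot, or a fresh one on first start.
func loadState(dir string) (LoopState, error) {
	data, err := os.ReadFile(filepath.Join(dir, "state.json"))
	if os.IsNotExist(err) {
		return LoopState{}, nil
	}
	if err != nil {
		return LoopState{}, err
	}
	var s LoopState
	err = json.Unmarshal(data, &s)
	return s, err
}

func main() {
	dir, err := os.MkdirTemp("", "factosis-state")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := saveState(dir, LoopState{Iteration: 3, Facts: []string{"pod evicted at 02:14"}}); err != nil {
		panic(err)
	}
	restored, err := loadState(dir) // what a restarted controller would see
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.Iteration, restored.Facts)
}
```

Committing the directory after each `saveState` call is what would add the history, branching and rollback on top of plain restartability.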
Whenever the main or sidekick requires some human input, it writes a needs/xxx file. This is primarily used for the ask-human direction and request-credentials tools of the respective agents. When one of these tools is called and the file is written, the agent dies, waiting to restart once the feedback has been given. The filesystem basically plays an async machine-human messaging role.
Ask-human is used when the agent hits a reasoning dead-end that requires human domain knowledge or judgment to resolve.
Request-credentials is when the sidekick needs access to a system it doesn’t have access to and asks for the credentials/infrastructure required.
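A rough sketch of the needs-file mechanism follows. The post only says a needs/xxx file is written, so naming the file after the tool (`needs/ask-human`, `needs/request-credentials`) is my assumption, as are the `writeNeed` and `pendingNeeds` helpers. How the human's answer is delivered back is not covered here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeNeed records a request for human input as needs/<kind> under dir.
// Using the tool name as the file name is an assumption for illustration.
func writeNeed(dir, kind, message string) error {
	needsDir := filepath.Join(dir, "needs")
	if err := os.MkdirAll(needsDir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(needsDir, kind), []byte(message), 0o644)
}

// pendingNeeds lists open requests; a controller could refuse to resume
// a loop while this list is non-empty.
func pendingNeeds(dir string) ([]string, error) {
	entries, err := os.ReadDir(filepath.Join(dir, "needs"))
	if os.IsNotExist(err) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	var kinds []string
	for _, e := range entries {
		kinds = append(kinds, e.Name())
	}
	return kinds, nil
}

func main() {
	dir, err := os.MkdirTemp("", "factosis-needs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := writeNeed(dir, "request-credentials", "need read access to the prometheus API"); err != nil {
		panic(err)
	}
	pending, err := pendingNeeds(dir)
	if err != nil {
		panic(err)
	}
	fmt.Println(pending)
}
```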
If, either in retrospect (going back to a previous commit where the agent was still on track) or mid loop, the human operator thinks the reasoning of the main agent went/is going off the rails, the human can kill the controller mid loop, write a nudge file to shove something important into the main agent’s context, and restart the controller. This will influence the main agent’s reasoning, but not replace its state.
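The nudge flow could be sketched like this: on restart the controller reads the nudge file and adds it to the generated context, while the saved state stays as it was. The file name `nudge`, the `OPERATOR_NUDGE` label and both helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readNudge returns the operator's note from <dir>/nudge, or "" if none exists.
// The file name "nudge" is an assumption for illustration.
func readNudge(dir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(dir, "nudge"))
	if os.IsNotExist(err) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// applyNudge adds the note as an extra section of the generated context.
// The agent's saved state is not modified.
func applyNudge(context, nudge string) string {
	if nudge == "" {
		return context
	}
	return context + "OPERATOR_NUDGE: " + nudge + "\n"
}

func main() {
	nudge, err := readNudge(".")
	if err != nil {
		panic(err)
	}
	fmt.Print(applyNudge("HYPOTHESIS: disk pressure caused the evictions\n", nudge))
}
```

Keeping the nudge as an extra context section, instead of editing the state file, matches the "influence but not replace" behaviour described above.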
The file system, the git on top of it and the restartability built into factosis mean you can build a filesystem-watching, git-logging UI on top of it in n different forms: a UI lambda watching EFS, or a local fat/text UI watching the local filesystem. The actual controller and LLM calls could easily be lambda runs, possibly with a Step Functions nanny.
Even with all the two-agent context dieting going on, I still wanted to have a second outside opinion. What started as an optional call-a-friend (which was another hit or miss, based on the main agent’s certainty of its reasoning over time) turned into a forced audit and reasoning rating at the end. This Opus git-endowed agent would review the git history, check on how the main agent progressed over time and rate its performance. So after the main agent figured out it had sent enough sidekicks off to yonder real world to make a conclusive yay or nay, the audit agent would vet the reasoning one last time.
For the testing I’ve done, the bun->duckdb, bun->jq handoffs work well, and the needles were found in the haystacks of 2026’s interwebs. So if you need to find proof or a reason behind some event, factosis will make up hypotheses and then chase them down to see if it can find the what and why.
The factosis system’s repo is here if you want to see what three months of deleting your own bright ideas looks like.
Example runs of INTC’s dodgy moves, and world energy changes in 2026