My agent kept losing facts it had already saved
AIdaemon runs entirely on my own machine, on open-source models like Gemma 4, with nothing calling out to a cloud API. It saves facts about me as I talk to it, then looks them up when I ask. Saving them already worked. Finding them was what kept letting me down. I'd ask about something I knew it had stored and get back nothing useful, or a couple of facts that were on the right topic but missed the point.
The search runs on meaning. Every fact becomes a vector when it's saved, my question becomes a vector when I ask, and it ranks facts by how near the two land. The trouble is that near isn't the same as right. A question pulls up whatever shares its subject, so a few wordy facts about the same thing crowd out the short one that actually holds the answer. On one query that kept failing I checked where the answer was sitting. About rank 30. The search read the top few and quit long before it got there.
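To make "near isn't the same as right" concrete, here is a minimal sketch of ranking by cosine similarity. It is not AIdaemon's code: the function names and the plain `Vec<f32>` vectors are assumptions for illustration. `rank_of` is the kind of check that turns up an answer sitting at rank 30 while the search only reads the top few.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero length (norm).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Indices of `facts`, ordered from most to least similar to `query`.
fn rank_by_cosine(query: &[f32], facts: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = facts
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}

/// 1-based position of one fact in a ranking, if it appears at all.
fn rank_of(order: &[usize], fact: usize) -> Option<usize> {
    order.iter().position(|&i| i == fact).map(|p| p + 1)
}
```

Calling `rank_of` on the output of `rank_by_cosine` shows where a known-good fact actually lands, which is a quick way to tell a storage problem from an ordering problem.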
Why the gist isn't enough
That first search uses a bi-encoder, a small model called all-MiniLM-L6-v2. It encodes each fact on its own when you save it, and your query on its own when you search, then compares the two. It never sees them side by side. That's the whole reason it's fast. You embed everything once and a search is just cheap math over the store, no model call when you ask.
It's also why it gets the order wrong. A bi-encoder catches the general topic and fumbles the precise match. Ask about one thing and it floats up anything in the same neighborhood while the fact you wanted sinks. It's cheap enough to run on every fact you've got, which is exactly why you reach for it when it all has to live on your own hardware. It just isn't careful enough to rank them well. Most of the time that's good enough. When I need a specific answer it falls apart.
Letting a second model actually read
The fix is a reranker. It's a cross-encoder, so it reads my question and one candidate fact together, in the same pass, and rates how well that fact answers the question. Where the bi-encoder compared two summaries built apart from each other, this one reads the pair. Better question, better answer. The downside is speed. It's one model run per fact, so pointing it at a whole memory store would take far too long.
The way around that is old and boring and it works. Retrieve first, rerank second. Let the cheap model grab a wide pile of maybe-relevant facts, then spend the expensive one only on that short pile.
In AIdaemon the bi-encoder pulls up to 50 candidates that clear a loose cosine cutoff of 0.22. I keep that looser than the 0.30 I use for memory that gets dropped into prompts automatically, because synonyms score low and a tight cutoff would throw the right fact out before the reranker ever saw it. Stage one doesn't have to be right. It just has to land the answer somewhere in the 50. Then the cross-encoder rereads all 50 against the question and reorders them. The fact stuck at rank 30 finally gets read on its own terms, jumps up, and comes back. I didn't change a thing about how facts are stored, only how they get ordered on the way out.
// stage 1: bi-encoder casts a wide net (cosine over all live facts)
let mut pool = cosine_rank(&query_vec, &facts, MIN_SCORE); // 0.22
pool.truncate(CANDIDATE_POOL); // 50
// stage 2: cross-encoder re-reads each (query, fact) pair and reorders
let docs: Vec<String> = pool.iter().map(|c| c.text()).collect();
let ranked = reranker.rerank(&query, docs, false, None)?;
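The snippet above leans on helpers the post doesn't show. As a rough, self-contained sketch of the same two stages, the version below swaps in stand-ins: `Fact`, `PairScorer`, `WordOverlap`, and this body of `cosine_rank` are all hypothetical, and the toy word-overlap scorer only plays the role of the cross-encoder so the flow can run without a model.

```rust
/// Hypothetical stored fact: its text plus a precomputed embedding.
struct Fact {
    text: String,
    vec: Vec<f32>,
}

/// Hypothetical stand-in for a cross-encoder: scores one (query, fact) pair.
trait PairScorer {
    fn score(&self, query: &str, doc: &str) -> f32;
}

/// Toy scorer for illustration only: counts query words present in the doc.
struct WordOverlap;

impl PairScorer for WordOverlap {
    fn score(&self, query: &str, doc: &str) -> f32 {
        let doc_words: Vec<&str> = doc.split_whitespace().collect();
        query
            .split_whitespace()
            .filter(|w| doc_words.contains(w))
            .count() as f32
    }
}

const MIN_SCORE: f32 = 0.22; // loose stage-one cutoff from the post
const CANDIDATE_POOL: usize = 50; // pool size from the post

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Stage 1: keep facts scoring at least `min_score`, best first.
fn cosine_rank<'a>(
    query_vec: &[f32],
    facts: &'a [Fact],
    min_score: f32,
) -> Vec<(&'a Fact, f32)> {
    let mut pool: Vec<(&Fact, f32)> = facts
        .iter()
        .map(|f| (f, cosine(query_vec, &f.vec)))
        .filter(|(_, s)| *s >= min_score)
        .collect();
    pool.sort_by(|a, b| b.1.total_cmp(&a.1));
    pool
}

/// Stage 2: re-score only the truncated pool with the pair scorer, then reorder.
fn retrieve_then_rerank<'a>(
    query: &str,
    query_vec: &[f32],
    facts: &'a [Fact],
    scorer: &dyn PairScorer,
) -> Vec<&'a Fact> {
    let mut pool = cosine_rank(query_vec, facts, MIN_SCORE);
    pool.truncate(CANDIDATE_POOL);
    let mut rescored: Vec<(&Fact, f32)> = pool
        .into_iter()
        .map(|(f, _)| (f, scorer.score(query, &f.text)))
        .collect();
    rescored.sort_by(|a, b| b.1.total_cmp(&a.1));
    rescored.into_iter().map(|(f, _)| f).collect()
}
```

The expensive scorer never sees a fact that stage one filtered out, which is why the stage-one cutoff has to stay loose.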
The reranker runs on the same machine as everything else, through the fastembed crate as an ONNX model, with no rerank API to call out to. I'm using Jina Reranker v2 Base Multilingual. I went multilingual on purpose: my notes to AIdaemon switch between English and Spanish, and an English-only reranker would trip on exactly the facts I care most about getting right.
Keeping it cheap
A second model is a second thing that can break, so it stays on a short leash. It only loads the first time a search needs it, because the download isn't small. If it fails to load, the search drops back to the plain cosine order it used before. Nothing breaks, it just gets less sharp. And it only runs when I explicitly ask the agent to look something up. The memory that gets folded into every prompt automatically still takes the cheap path. Running 50 facts through a cross-encoder is reasonable once, when I've asked a question. On every message it'd be a waste.
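The load-on-first-use and fall-back-to-cosine behavior can be sketched with `std::sync::OnceLock`. Everything here is hypothetical: `Reranker` is a toy stand-in for the real model, `Memory` and its methods are invented names, and caching a failed load instead of retrying is an assumption of this sketch, not something the post specifies.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a loaded cross-encoder.
struct Reranker;

impl Reranker {
    /// Toy scoring for illustration: count query words found in the doc.
    fn score(&self, query: &str, doc: &str) -> usize {
        query.split_whitespace().filter(|w| doc.contains(*w)).count()
    }
}

struct Memory {
    /// How to load the model; a real loader may download files and can fail.
    loader: fn() -> Result<Reranker, String>,
    /// Empty until the first explicit lookup; then holds the load outcome.
    reranker: OnceLock<Option<Reranker>>,
}

impl Memory {
    fn new(loader: fn() -> Result<Reranker, String>) -> Self {
        Memory { loader, reranker: OnceLock::new() }
    }

    /// Automatic per-prompt memory: cheap path, never touches the reranker.
    fn for_prompt(&self, pool: Vec<String>) -> Vec<String> {
        pool
    }

    /// Explicit lookup. `pool` arrives in cosine order; it is reranked if the
    /// model is available and returned unchanged otherwise.
    fn lookup(&self, query: &str, pool: Vec<String>) -> Vec<String> {
        // The loader runs on the first call only. Later calls reuse the
        // cached outcome, including a cached failure (sketch assumption).
        let model = self.reranker.get_or_init(|| (self.loader)().ok());
        match model {
            Some(m) => {
                let mut scored: Vec<(usize, String)> = pool
                    .into_iter()
                    .map(|d| (m.score(query, &d), d))
                    .collect();
                // Stable sort, so ties keep their cosine order.
                scored.sort_by(|a, b| b.0.cmp(&a.0));
                scored.into_iter().map(|(_, d)| d).collect()
            }
            // Load failed: fall back to plain cosine order.
            None => pool,
        }
    }
}
```

With a loader that returns `Err`, `lookup` hands back the pool exactly as it came in, so a broken reranker degrades ranking quality without breaking search.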
Retrieve-then-rerank isn't new. Search engines have leaned on it for years. It just turns out to fit agent memory well too. The first model grabs a wide pile, the second one reads that pile properly, and because the pile is only fifty short strings through one model, that second read costs almost nothing. It's the reason AIdaemon now hands me back the fact it saved instead of coming up empty when I ask.