Local Gemma was too slow with AIdaemon until I fixed llama.cpp and the prompt size
I run AIdaemon on my Mac most days. It's the self-hosted agent daemon I built in Rust. For months the LLM backend was OpenRouter, plus Gemini, whose free tier was generous. Once it ran out, I was paying a few bucks a month for something I could host myself. I just wanted to try local inference with Google's Gemma family without adding a runtime I didn't already use.
I had llama.cpp installed through Homebrew and a Gemma 4 26B MoE GGUF on disk (unsloth/gemma-4-26B-A4B-it, Q4_K_M), about sixteen gigabytes, on an M4 Pro with 48 GB of unified memory. Ollama would've been the easy path. I skipped it on purpose, since it wraps llama.cpp anyway and I wanted the performance flags directly.
The stack ended up looking like this.
Telegram / Slack → AIdaemon → llama-server (OpenAI-compatible API) → Gemma 4 26B GGUF
AIdaemon doesn't load model weights. It talks to anything that looks like the OpenAI chat API. llama.cpp's llama-server fits that shape.
Getting it running at all
First snag was ports. AIdaemon already binds 8080 for health checks and OAuth callbacks. llama-server defaults to the same port. I put inference on 8081.
Second snag was thinking mode. Gemma 4 ships with reasoning/thinking turned on in the chat template. llama-server logged thinking = 1 on startup. Responses landed in reasoning_content while content came back empty. AIdaemon reads content. From Telegram it looked like the model had gone silent.
Fix was one flag.
llama-server \
-m ~/models/llm/gemma-4-26b/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
--jinja \
--reasoning off \
-c 16384 \
-ngl 99 \
--alias gemma-4-26b \
--host 127.0.0.1 \
--port 8081
--jinja matters for Gemma 4's template. --reasoning off matters for AIdaemon. Without it you're debugging the agent when the model is actually replying into a field nothing reads.
And I'd keep it off even without that bug. Thinking spends a stack of extra tokens before every reply, which is real latency on a local model, and an agent gets some of that reasoning back for free by working through a task across tool calls with real feedback instead of one long internal monologue. I'm trading a little deep-reasoning headroom for speed, which for a fast local assistant is the right call, and I can flip it back on for the rare task that truly needs it.
AIdaemon side, the provider block pointed at the local server.
[provider]
api_key = "local"
base_url = "http://127.0.0.1:8081/v1"
kind = "openai_compatible"
max_tokens = 4096
[provider.models]
default = "gemma-4-26b"
fallback = []
I kept OpenRouter as a [[provider.fallbacks]] entry so a dead llama-server wouldn't brick the daemon. The local model name has to match --alias on llama-server, not the Hugging Face repo slug.
Smoke test before touching Telegram.
curl http://127.0.0.1:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma-4-26b","messages":[{"role":"user","content":"Say hi."}],"max_tokens":50}'
If content is empty and reasoning_content is full, thinking is still on.
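That check is easy to automate. A minimal sketch, assuming the response follows the OpenAI chat shape with llama-server's extra `reasoning_content` field on the message; `diagnose_reply` and the sample payload are made up for illustration, not taken from AIdaemon.

```python
def diagnose_reply(response: dict) -> str:
    """Classify a chat-completions response by which message field holds text."""
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    if content:
        return "ok"           # the agent will see this reply
    if reasoning:
        return "thinking-on"  # text landed in the field nothing reads
    return "empty"


# Hand-written sample shaped like the failure described above (not a captured response).
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "",
                "reasoning_content": "The user wants a greeting.",
            }
        }
    ]
}
print(diagnose_reply(sample))  # prints "thinking-on"
```

Feeding it the parsed JSON from the curl test tells you in one word whether the flag took effect.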
Why it still felt slow
Simple messages were fine. Agent work wasn't. I'd send something normal on Telegram and wait. And wait.
The llama-server log told the story. Prompts around 14,500 tokens. That's not a typo and it's not one fat user message. On a 16k-context model, AIdaemon budgets messages plus tool schemas to about 14.8k and reserves the rest for output. I was running against the ceiling every turn.
A few things fill that payload.
- System prompt. Operating rules, security guardrails, specialists list, channel context. A large static template. On later loop iterations it drops the markdown tool guide, but the core prompt is still thousands of tokens.
- Tool JSON schemas, sent separately from chat messages on every LLM call. With my full install that's roughly 35 to 40 built-in tools, plus any MCP tools that matched the turn. Names, parameters, required fields, enums. Descriptions add up even after compaction.
- Conversation history. Collapsed recent turns, optional session summary, and full tool results for the current interaction. A few chunky terminal or read_file outputs can rival the schema cost.
- Memory, and not your whole fact store dumped in. A small critical-facts pin and instructions to fetch the rest via memory tools when needed.
First iteration of a turn is worse. The system prompt still includes markdown tool documentation and the JSON schemas go out in parallel. Chat UIs send a bubble. Agent daemons ship an operating manual plus a tool catalog plus whatever the last command returned.
Local inference has two speeds, and they behave very differently.
- Prefill is the model reading your prompt. Time scales with prompt size. This is where agent workloads hurt.
- Generation is the model writing the reply. Throughput stays roughly flat whether the prompt was short or long.
On my box that distinction mattered more than any single llama.cpp flag.
Inference speed on my M4 Pro
Hardware for these numbers. Apple M4 Pro, 48 GB unified memory, gemma-4-26B-A4B-it-UD-Q4_K_M.gguf (Unsloth Q4_K_M), llama.cpp build 9140, full Metal offload (-ngl 99), optimized single-slot config below. Gemma 4 26B is MoE, about 4B active parameters at inference time. Generation feels closer to a mid-size model. Prefill still walks the whole prompt.
I pulled timings from llama-server's timings field in the OpenAI-compatible JSON response after a warm model. Same server that's running today.
| Prompt size | Input tokens | Prefill wall time | Prefill tok/s | Generation tok/s |
|---|---|---|---|---|
| Short chat (warm) | 49 | 0.2 s | ~230 | ~48 |
| Small agent turn | ~1,000 | 1.6 s | ~650 | ~44 |
| Medium context | ~5,000 | 6.2 s | ~630 | ~43 |
| Large context | ~10,000 | 8.9 s | ~550 | ~40 |
| Real AIdaemon turn | ~14,500 | 8.4 s | ~480 | ~35 |
Generation held between roughly 35 and 48 tok/s across the board. Prefill dominated. A ~14.5k-token agent prompt spent about eight and a half seconds processing input before the first output token. That's not a hung server. That's the model finishing the read phase.
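For anyone reproducing the table, here is a rough sketch of turning a `timings` object into those columns. It assumes the `prompt_n`, `prompt_ms`, `predicted_n`, and `predicted_ms` fields that llama-server reports; `summarize_timings` is a hypothetical helper and the sample values are round illustrative numbers, not a captured response. Worth knowing when reading the output: `prompt_n` counts the tokens actually processed on that call, so a partial cache hit makes a prompt look smaller than it was.

```python
def summarize_timings(timings: dict) -> dict:
    """Convert llama-server's millisecond timings into seconds and tok/s."""
    prompt_s = timings["prompt_ms"] / 1000.0
    predicted_s = timings["predicted_ms"] / 1000.0
    return {
        "prefill_s": round(prompt_s, 2),
        "prefill_tok_s": round(timings["prompt_n"] / prompt_s, 1) if prompt_s > 0 else None,
        "gen_tok_s": round(timings["predicted_n"] / predicted_s, 1) if predicted_s > 0 else None,
    }


# Illustrative values only, shaped like a small agent turn.
example = {"prompt_n": 1000, "prompt_ms": 1600.0, "predicted_n": 88, "predicted_ms": 2000.0}
print(summarize_timings(example))
# prints {'prefill_s': 1.6, 'prefill_tok_s': 625.0, 'gen_tok_s': 44.0}
```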
Back-of-napkin math for one agent LLM hop on this setup.
- ~14.5k-token prefill ≈ 8 s of measured wall time before anything comes back
- ~200 token reply at ~40 tok/s ≈ 5 s of generation
- One hop ≈ 13 s minimum, before tool execution or a second hop
A three-iteration agent loop with tool calls can easily sit at forty-plus seconds of model time alone. Telegram feels broken long before the hardware is actually struggling.
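The napkin math is small enough to write down. A sketch under two assumptions: every hop pays a full, uncached prefill, and the inputs are the measured values from the real AIdaemon turn (8.4 s of prefill, a 200-token reply at about 40 tok/s). The function names are illustrative.

```python
def hop_seconds(prefill_s: float, reply_tokens: int, gen_tok_s: float) -> float:
    """Model time for one LLM call: read the prompt, then write the reply."""
    return prefill_s + reply_tokens / gen_tok_s


def loop_seconds(hops: int, prefill_s: float, reply_tokens: int, gen_tok_s: float) -> float:
    """Model time for an agent loop where every hop pays a full prefill."""
    return hops * hop_seconds(prefill_s, reply_tokens, gen_tok_s)


one_hop = hop_seconds(prefill_s=8.4, reply_tokens=200, gen_tok_s=40.0)
three_hops = loop_seconds(3, prefill_s=8.4, reply_tokens=200, gen_tok_s=40.0)
print(f"{one_hop:.1f} s per hop, {three_hops:.1f} s for three hops")
# prints "13.4 s per hop, 40.2 s for three hops"
```

Tool execution time is not in there at all, so real turns land above these figures.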
Compare that to a dumb chat prompt. "Say hello in one sentence" on the default four-slot llama-server config measured about 48 tok/s prefill and 52 tok/s generation on the same machine. After switching to --parallel 1 and the batch/cache flags, the same short curl test jumped to about 127 tok/s prefill, with generation roughly flat at about 57 tok/s. Server tuning mostly moved the needle on prefill for small prompts and on memory overhead. It did not erase the eight-second tax on a 14k agent context.
Default llama-server settings made the agent case worse, and the knob that surprised me was --parallel.
llama-server doesn't run one conversation at a time under the hood. It keeps separate slots. Each slot is a full context window with its own KV cache in memory. When a request arrives, the server picks a slot, loads your prompt into it, and generates from there. A second request can use a different slot at the same time without wiping the first conversation.
--parallel sets how many slots exist. If you omit it, recent llama.cpp builds pick auto, which on my Mac meant four slots. Startup logged n_parallel is set to auto, using n_parallel = 4 and initializing slots, n_slots = 4.
Four slots make sense when one GPU serves multiple clients. A browser UI, a curl test, maybe a second user. The server can juggle concurrent chats.
Within one chat, AIdaemon is mostly serial. Telegram, Slack, and Discord queue messages per session so you don't get two agent loops fighting over the same thread. But that's not the whole picture. A scheduled cron goal can spawn a background task lead while you're chatting on Telegram. A second goal can do the same. Slack and Telegram are different sessions, so both can hit the model at once if you're active on both.
For my setup that overlap was rare. One Telegram chat, a handful of scheduled checks, not usually at the same second. Default --parallel 4 still meant three slots sat idle most of the time while reserving KV cache and prompt-cache RAM. I saw the prompt cache grow past three gigabytes during testing. When I dropped to --parallel 1, concurrent requests from AIdaemon didn't break. llama-server queues them and runs one at a time. You wait your turn instead of sharing GPU memory across empty lanes.
If you routinely run several scheduled goals at once, or live in Telegram while cron fires every minute, try --parallel 2 or 3 instead of 1. You trade some single-request speed for not serializing every overlapping hop. Match the slot count to how many LLM calls you actually overlap, not the default four.
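One way to pick that number is to measure it. A minimal sketch, assuming you can pull (start, end) timestamps for each LLM call out of your own logs; the interval list here is invented, and `max_overlap` is a plain sweep over start and end events rather than anything llama-server provides.

```python
def max_overlap(calls: list[tuple[float, float]]) -> int:
    """Return the largest number of LLM calls in flight at the same moment."""
    events = []
    for start, end in calls:
        events.append((start, 1))   # a call begins
        events.append((end, -1))    # a call finishes
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back calls are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Invented timings in seconds: a chat hop, a cron job that overlaps it,
# a hop that starts exactly when the cron job ends, and a late one.
calls = [(0.0, 13.0), (10.0, 23.0), (23.0, 36.0), (60.0, 73.0)]
print(max_overlap(calls))  # prints 2
```

If a day of real traffic never peaks above 1, a single slot costs you nothing. If it regularly hits 2 or 3, that is the slot count to try.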
Setting --parallel 1 collapsed it to one slot. Logs showed n_slots = 1. All the KV cache budget went to the one agent turn I was actually running.
llama-server flags that actually helped
I restarted llama-server with a single-slot, single-user profile.
llama-server \
-m ~/models/llm/gemma-4-26b/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
--jinja \
--reasoning off \
-c 16384 \
-ngl 99 \
--parallel 1 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--cache-ram 1024 \
-b 4096 \
-ub 1024 \
--prio 2 \
--alias gemma-4-26b \
--host 127.0.0.1 \
--port 8081
--parallel 1 was the biggest win on the inference side for my setup. Not because the model got smarter. Because I stopped paying for three empty conversation lanes I never used. That held until I started reusing the cache across turns and the background jobs became the problem. More on that below.
--flash-attn on and q8_0 KV cache types helped on Apple Silicon. Capping --cache-ram stopped the prompt cache from ballooning during long sessions. Larger batch sizes (-b 4096, -ub 1024) sped up prefill on fat prompts. --prio 2 nudged the process up in the scheduler. Small thing, but when you're iterating on config it helps.
On short prompts, prefill went from about 48 tok/s to 127 tok/s. Generation barely moved, ending around 57 tok/s. That confirmed the server tuning was worth doing. It also confirmed something else. At ~14k tokens you're still looking at eight-plus seconds of prefill no matter what. The next lever had to be prompt size.
Shrinking what AIdaemon sends
Server tuning alone doesn't erase an eight-second prefill when you're pinned near 15k tokens every hop. The other half was teaching AIdaemon to respect a 16k window when the model is local, and to compact what it sends before the call instead of hoping the server survives it.
I added a per-model budget in config.toml.
[state.context_window.model_budgets]
gemma-4-26b = 16384
That number should match -c on llama-server. If AIdaemon thinks it has 128k tokens but the server only holds 16k, you're paying for work that gets truncated or fails weirdly.
In code, the message build phase runs fit_tool_definitions_to_budget() before each LLM call. It never drops tools. It trims metadata in stages. Descriptions get shorter, schema annotations and examples get stripped, until the serialized tools fit whatever budget is left after the system prompt and history are counted. There's a second pass after the full prompt is assembled, because anything inserted during assembly can eat the headroom you thought you had left.
The agent still exposes every tool. It just stops shipping essay-length schema text the local model doesn't need to pick terminal over read_file. On a 48k or 128k cloud model you might never notice. On 16k local, it's the difference between a usable hop and eight seconds of silence.
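To make the staged idea concrete, here is a rough sketch in Python. It is not AIdaemon's implementation (that lives in Rust and isn't shown here). The flat tool shape, the 4-characters-per-token estimate, the truncation limits, and every function name are assumptions chosen for illustration. The one property it shares with the description above is that it trims metadata in stages and never drops a tool.

```python
import json

# Assumed JSON Schema annotation keywords that are safe to strip.
ANNOTATION_KEYS = {"description", "examples", "title"}


def estimate_tokens(obj) -> int:
    """Crude size estimate: about 4 characters of compact JSON per token."""
    return len(json.dumps(obj, separators=(",", ":"))) // 4


def strip_annotations(schema):
    """Copy a schema without annotation keywords, keeping parameter names intact."""
    if isinstance(schema, list):
        return [strip_annotations(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if key in ANNOTATION_KEYS:
            continue
        if key == "properties" and isinstance(value, dict):
            # Keys under "properties" are parameter names, not keywords.
            out[key] = {name: strip_annotations(sub) for name, sub in value.items()}
        else:
            out[key] = strip_annotations(value)
    return out


def shorten(tool: dict, limit: int) -> dict:
    """Copy a tool with its top-level description cut to `limit` characters."""
    return {**tool, "description": tool.get("description", "")[:limit]}


def fit_tools(tools: list, budget_tokens: int) -> list:
    """Apply progressively harsher trims until the tool list fits the budget."""
    stages = [
        lambda t: shorten(t, 200),
        lambda t: {**t, "parameters": strip_annotations(t.get("parameters", {}))},
        lambda t: shorten(t, 60),
    ]
    for stage in stages:
        if estimate_tokens(tools) <= budget_tokens:
            break
        tools = [stage(t) for t in tools]
    return tools  # same tools, same order, less text
```

A real version would count tokens with the model's tokenizer instead of dividing characters by four, and would decide what to do if the last stage still doesn't fit.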
I also dropped reasoning_effort on the local provider. That's for cloud thinking models. Gemma's thinking path is different and I had already disabled it in llama-server.
That gets local Gemma usable. It doesn't mean 14k tokens is the target. I'm still looking at where the prompt can shrink further. Duplicate tool docs on the first iteration, a leaner system prompt when the model budget is small, smarter tool filtering so local runs don't carry a cloud-sized catalog. Compaction was the fix that unblocked me; the next round is about sending each piece of context once.
The real fix was reusing the prompt, not just shrinking it
Shrinking the prompt helped, but I was still paying a prefill on every single turn. Then it clicked. The model should not have to re-read the same 15,000 tokens twice. Almost all of an agent prompt is identical from one turn to the next. The system prompt, the tool schemas, the older messages. Only the new user message and the latest tool result change, and they sit at the end.
llama.cpp already knows how to take advantage of that. It keeps the KV cache from the previous turn. If the start of your next prompt is token-for-token identical to the last one, it reuses that cached work and jumps straight to the new tokens. That is a warm start, and it is fast. The reuse only reaches as far as the shared prefix, though. If anything near the front differs, even a single token, everything after that point has to be read again, which for a change near the start means almost the whole prompt. That is a cold start, and it is the slow path I had been hitting every turn.
The problem was that AIdaemon kept changing the front of the prompt without meaning to. A timestamp that ticked over, a memory block that re-ordered, an old turn that got re-summarized a little differently. Tiny edits, but they landed near the start, so the cache never matched and every turn went cold. The fix was to make the front boring. One stable system block, and old turns frozen into a fixed shape the moment they scroll out of the live window, never rewritten again. After that, the first 15,000 tokens of each prompt were identical to the turn before, and llama.cpp could finally reuse them.
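A toy model shows why position matters so much. This is a sketch of plain prefix matching over token ids, not llama.cpp's actual cache logic, and the token lists are synthetic. The 15,000 and 1,300 figures are picked to resemble the turn sizes in this post.

```python
def split_reuse(previous: list[int], current: list[int]) -> tuple[int, int]:
    """Return (reusable, fresh) token counts for `current` given cached `previous`."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return shared, len(current) - shared


previous = list(range(15_000))               # last turn's prompt, as fake token ids
warm = previous + list(range(1_300))         # same front, 1,300 new tokens appended
cold = list(warm)
cold[10] = -1                                # one token changes near the front

print(split_reuse(previous, warm))  # prints (15000, 1300)
print(split_reuse(previous, cold))  # prints (10, 16290)
```

One changed token at position 10 turns a 1,300-token read into a 16,290-token read. The same edit at the very end of the prompt would have cost almost nothing.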
That also changed my mind about --parallel. A single slot was fastest for one isolated request, but AIdaemon does memory and summary work in the background on the same server, and every one of those jobs kept landing in my chat's slot and wiping the cache I was trying to keep warm. So I moved to --parallel 2, pinned my conversation to one slot, and sent the background jobs to the other. Now the housekeeping churns in its own lane and my chat stays warm.
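How the pinning is wired inside AIdaemon isn't shown here, so treat this as one possible shape rather than the actual code. llama-server documents an `id_slot` request field on its `/completion` endpoint, and the sketch assumes the OpenAI-compatible chat endpoint on your build passes it through, which is worth verifying against the slot ids in the server log. The slot numbers and `build_request` are hypothetical.

```python
# Hypothetical slot assignment for a --parallel 2 server.
SLOTS = {"chat": 0, "background": 1}


def build_request(messages: list, kind: str, model: str = "gemma-4-26b",
                  max_tokens: int = 4096) -> dict:
    """Build a chat-completions body that asks llama-server for a specific slot."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        # llama.cpp-specific extras; assumed to be honored by the chat endpoint.
        "id_slot": SLOTS[kind],
        "cache_prompt": True,
    }


chat = build_request([{"role": "user", "content": "status?"}], kind="chat")
summary = build_request([{"role": "user", "content": "Summarize the session."}], kind="background")
print(chat["id_slot"], summary["id_slot"])  # prints "0 1"
```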
The flag nobody mentions
Stable prompt, my own slot, and every new turn was still cold. I almost gave up. Then I read the llama-server log one more time.
forcing full prompt re-processing due to lack of cache data
(likely due to SWA or hybrid/recurrent memory)
SWA stands for sliding-window attention. In most of its layers, Gemma 4 only looks at a window of recent tokens instead of the whole history. That is part of what makes a 26B model this cheap to run. The catch is that, by default, llama.cpp only stores that little window, so the moment a new turn shifts the token positions around there is nothing left to reuse, and it starts over. All my careful prompt-stability work could not survive an attention scheme that throws most of its own cache away.
One flag fixed it.
llama-server \
-m ~/models/llm/gemma-4-26b/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
--jinja --reasoning off \
-c 131072 \
-ngl 99 \
--parallel 2 \
--swa-full \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--cache-ram 12288 \
-b 4096 -ub 1024 \
--alias gemma-4-26b --host 127.0.0.1 --port 8081
--swa-full tells llama.cpp to keep a full-size cache for the windowed layers instead of the slice. It costs more memory, quite a bit more on a model that is mostly windowed layers, but I have 48 GB and a conversation that finally stays warm. On Gemma that one flag is the whole difference between reusing the cache across turns and re-reading the prompt every time. Without it, prompt stability and slot pinning buy you almost nothing.
The numbers moved the way I wanted. A follow-up that used to re-read about 15,000 tokens and stall for roughly thirty seconds now re-reads around 1,300 and answers in a couple. Same model, same hardware, same answer, about ninety percent less work per turn.
I only found this because AIdaemon told me
None of this was findable by feel. "It seems slow" is not a bug report. What made it tractable is that AIdaemon logs the anatomy of every model call. The prompt size, how many input tokens were served from cache versus read fresh, a fingerprint of each part of the prompt, and which background job ran when.
The cached-versus-fresh number was the tell. On a turn that should have been warm, watching the fresh count jump back to fifteen thousand meant the cache had broken, and the per-section fingerprints showed exactly which part of the prompt had changed to break it. That is how I caught the prompt churn first, then the background jobs stealing the slot, then SWA. Three different culprits hiding behind the same symptom of a slow reply. Without that telemetry I would have been swapping flags at random.
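The per-section fingerprint idea can be sketched in a few lines. This is an illustration of the technique, not AIdaemon's telemetry code. The section names, the 12-character hash prefix, and both helper functions are assumptions.

```python
import hashlib
from typing import Optional


def fingerprint(sections: dict[str, str]) -> dict[str, str]:
    """Hash each prompt section so changes can be logged without logging the text."""
    return {
        name: hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        for name, text in sections.items()
    }


def first_changed(previous: dict[str, str], current: dict[str, str]) -> Optional[str]:
    """Return the earliest section (in prompt order) whose fingerprint differs."""
    for name, digest in current.items():  # dicts keep insertion order
        if previous.get(name) != digest:
            return name
    return None


# Two consecutive turns where only a timestamp in the system block ticked over.
turn_1 = {"system": "You are AIdaemon. Now: 10:41", "tools": "tool schemas", "history": "user: hi"}
turn_2 = {"system": "You are AIdaemon. Now: 10:42", "tools": "tool schemas", "history": "user: hi"}
print(first_changed(fingerprint(turn_1), fingerprint(turn_2)))  # prints "system"
```

Logged next to the cached and fresh token counts, the name of the first changed section points straight at whatever broke the prefix.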
If you take one thing from this, make your local agent observable. The model is a black box and the server is mostly a black box. Your own daemon is the one place you control, so have it tell you what it sent and what got reused on every call.
What I'd tell someone else trying this
Start with the model you already have. I used Gemma 4 26B MoE because the GGUF was already downloaded. The 12B unified variant is on my list next. Smaller context, less RAM, probably snappier for chat-heavy use.
Match three numbers. llama-server -c, AIdaemon model_budgets, and what you actually expect in a busy agent turn. They should agree.
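A small sanity check makes the mismatch visible before the first slow turn. It is a sketch with hypothetical names. The 1,500-token output reserve is an assumed round figure based on the roughly 14.8k message budget mentioned earlier, not a value read from AIdaemon.

```python
def check_budget(server_ctx: int, model_budget: int,
                 busy_turn_tokens: int, output_reserve: int) -> list[str]:
    """Return human-readable problems; an empty list means the numbers agree."""
    problems = []
    if model_budget > server_ctx:
        problems.append(f"budget {model_budget} exceeds server context {server_ctx}")
    if busy_turn_tokens + output_reserve > model_budget:
        problems.append(
            f"busy turn {busy_turn_tokens} + reserve {output_reserve} "
            f"exceeds budget {model_budget}"
        )
    return problems


print(check_budget(16384, 16384, 14500, 1500))   # prints []
print(check_budget(16384, 131072, 14500, 1500))
# prints ['budget 131072 exceeds server context 16384']
```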
Watch the logs. tail -f ~/.aidaemon/llama-server.log shows prompt token counts and slot behavior. If you see multi-thousand-token prefills every turn, fix the agent context before buying faster hardware.
Keep a cloud fallback while you're tuning. Local-first with OpenRouter (or whatever you already pay for) as backup means you can restart llama-server twenty times without losing Telegram.
Run llama-server before AIdaemon. The daemon starts fine without it and then falls back or errors on the first message. I forgot that once.
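A startup wrapper can take care of the ordering. A minimal sketch, assuming llama-server's `/health` endpoint on the port used in this post, which returns 200 once the model is loaded; the function names, retry count, and delay are arbitrary choices.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(url: str = "http://127.0.0.1:8081/health", timeout: float = 2.0) -> bool:
    """True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while the model loads.
        return False


def wait_until_ready(probe=server_is_healthy, attempts: int = 30,
                     delay: float = 1.0, sleep=time.sleep) -> bool:
    """Poll `probe` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` before launching the daemon, and exiting non-zero if it returns False, keeps the first Telegram message from landing on a server that is still loading weights.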
On macOS I run AIdaemon under launchd with caffeinate -i so idle sleep doesn't kill a long agent session. llama-server is still manual unless you give it its own plist. Worth doing if this becomes your daily driver.