Small-model setup
Shrew is small-model-first. The default model is qwen3.6-35b-a3b,
a Mixture-of-Experts checkpoint that is 22 GB on disk at Q4
quantisation but uses only ~3B active parameters per token.
The trick that makes this practical on consumer hardware is to keep
the experts in system RAM and only the attention layers plus KV
cache on the GPU. This page walks the entire setup end to end:
build llama.cpp, download a GGUF, and serve it with the right
flags.
If you already have a llama.cpp HTTP server on 127.0.0.1:8888,
skip to quickstart.md. If you’d rather use Ollama
(simpler but slower for MoE), see the Ollama section.
Why llama.cpp
Two reasons shrew defaults to llama.cpp over Ollama:
- MoE offload control. llama.cpp exposes `--n-cpu-moe` (and `-ngl`), which is what makes the experts-in-RAM trick work. Ollama does not expose these knobs in stable form yet.
- Tool-calling shape. llama.cpp’s OpenAI-compatible shim speaks the `tools=[...]` schema cleanly with the right `--jinja` chat template. Tool calling is the single biggest small-model capability gap; getting the wire shape right is non-negotiable.
The trade-off is that you compile llama.cpp yourself and download GGUFs by hand. The payoff is full control over the inference stack.
Step 1 — Build llama.cpp
```bash
cd ~/src   # or wherever you keep sources
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

Pick the build for your accelerator:
```bash
# Apple Silicon (Metal)
cmake -B build -DGGML_METAL=on
cmake --build build --config Release -j
```
```bash
# NVIDIA (CUDA)
cmake -B build -DGGML_CUDA=on
cmake --build build --config Release -j
```
```bash
# CPU only
cmake -B build
cmake --build build --config Release -j
```

The relevant binaries land under `build/bin/`. The two we care about are `llama-server` (the HTTP server shrew talks to) and `llama-cli` (handy for sanity checks).
```bash
./build/bin/llama-server --version
./build/bin/llama-cli --help | head -20
```

Step 2 — Download a GGUF
Shrew’s default is qwen3.6-35b-a3b. The official Qwen MoE
checkpoint is published as a GGUF on the Hugging Face hub. Pick a
quantisation that fits your RAM budget:
| Quant | Size on disk | RAM (approx) | Quality |
|---|---|---|---|
| Q4_K_M | 22 GB | 24-26 GB | recommended (default) |
| Q5_K_M | 26 GB | 28-30 GB | slightly better |
| Q6_K | 30 GB | 32-34 GB | near-fp16 quality |
| Q3_K_S | 16 GB | 18 GB | tight laptops only |
Using huggingface-cli:
```bash
mkdir -p ~/models
huggingface-cli download \
  Qwen/Qwen3.6-35B-A3B-Instruct-GGUF \
  --include "Q4_K_M/*.gguf" \
  --local-dir ~/models/qwen3.6-35b-a3b
```

If the upstream publishes the GGUF as a single file, the `--include` filter narrows the download. Multi-shard GGUFs are loaded by passing the first shard to `-m`; llama.cpp finds the others automatically.
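For example, serving a multi-shard download might look like this (the shard filename is illustrative; use whatever the first shard in your local directory is actually called):

```bash
# Pass only the first shard to -m; llama.cpp discovers the remaining shards itself.
~/src/llama.cpp/build/bin/llama-server \
  -m ~/models/qwen3.6-35b-a3b/Q4_K_M/qwen3.6-35b-a3b-q4_k_m-00001-of-00003.gguf \
  --host 127.0.0.1 --port 8888
```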
Alternative (smaller dense model — easier to start with):
```bash
huggingface-cli download \
  Qwen/Qwen3.5-9B-Instruct-GGUF \
  --include "Q4_K_M/*.gguf" \
  --local-dir ~/models/qwen3.5-9b
```

Step 3 — Serve with the canonical incantation
This is the load-bearing command. Memorise it:
```bash
~/src/llama.cpp/build/bin/llama-server \
  -m ~/models/qwen3.6-35b-a3b/Q4_K_M/qwen3.6-35b-a3b-q4_k_m.gguf \
  --host 127.0.0.1 --port 8888 \
  --jinja \
  -c 16384 \
  -ngl 99 \
  --n-cpu-moe 999 \
  --flash-attn on
```

Flag-by-flag, in order of importance:
- `--n-cpu-moe 999` — keep all MoE expert tensors in system RAM, not VRAM. This is the trick. With `999` (a sentinel meaning “all layers”), the GPU stops carrying the experts and the only thing scaling with VRAM is the KV cache.
- `-ngl 99` — offload all non-expert layers to the GPU. The attention blocks live there. With `--n-cpu-moe 999` and `-ngl 99` set together, you get the canonical split: attention on GPU, experts on CPU.
- `-c 16384` — context window. 16k is a good default for an 8 GB GPU. Bigger GPUs can go higher; the `moe_offload` extension picks a safe value automatically when you pass `--vram-gb` to shrew.
- `--jinja` — enable the model’s Jinja chat template. Required for the `tools=[...]` OpenAI-compatible tool-calling shape to parse. Do not omit this. Without `--jinja`, tool calls return as plain text and shrew’s loop won’t see them.
- `--flash-attn on` — flash attention. Faster on supported GPUs (Metal 3, CUDA SM >= 70).
- `--host 127.0.0.1 --port 8888` — bind to localhost on shrew’s default port.
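If you don’t want to retype the incantation, a thin wrapper keeps the knobs in one place. This is just a convenience sketch, not something shrew ships; the paths and defaults mirror the command above:

```bash
#!/usr/bin/env bash
# serve-qwen.sh -- launch llama-server with the canonical shrew flags.
# Override MODEL, PORT, or CTX from the environment; everything else is fixed.
set -euo pipefail

MODEL="${MODEL:-$HOME/models/qwen3.6-35b-a3b/Q4_K_M/qwen3.6-35b-a3b-q4_k_m.gguf}"
PORT="${PORT:-8888}"
CTX="${CTX:-16384}"

exec "$HOME/src/llama.cpp/build/bin/llama-server" \
  -m "$MODEL" \
  --host 127.0.0.1 --port "$PORT" \
  --jinja \
  -c "$CTX" \
  -ngl 99 \
  --n-cpu-moe 999 \
  --flash-attn on
```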
Verify it’s up:
```bash
curl http://127.0.0.1:8888/health
# {"status":"ok"}

curl http://127.0.0.1:8888/v1/models
# {"object":"list","data":[{"id":"qwen3.6-35b-a3b",...}]}
```

Now `chimera shrew -p "say hi"` will route through it automatically.
Why this fits on an 8 GB laptop GPU
The MoE-offload arithmetic, in round numbers:
- Total weights: 22 GB (Q4_K_M).
- Experts (kept in RAM): ~19 GB.
- Attention + non-expert weights (on GPU): ~2.5 GB.
- CUDA workspace + activation reserve: ~0.8 GB.
- KV cache headroom on an 8 GB GPU: ~4.7 GB → ~16k tokens.
The shrew moe_offload extension
implements this arithmetic. Pass --vram-gb 8 and it returns
16384; pass --vram-gb 24 and it returns 32768 (clamped at the
model’s architectural maximum). The numbers are conservative — we
favour smaller windows over OOMs.
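The same arithmetic as a back-of-envelope script, if you want to sanity-check a different card. This is only a sketch of the reasoning, not the extension’s actual code, and the per-token KV figure is an assumption derived from the round numbers above:

```bash
# Rough KV-cache sizing for the experts-in-RAM split (numbers mirror the list above).
VRAM_MB=8192              # --vram-gb 8
GPU_WEIGHTS_MB=2560       # attention + non-expert weights on the GPU (~2.5 GB)
RESERVE_MB=820            # CUDA workspace + activation reserve (~0.8 GB)
KV_BYTES_PER_TOKEN=300000 # assumption: ~0.3 MB/token, implied by 4.7 GB -> ~16k tokens

FREE_MB=$(( VRAM_MB - GPU_WEIGHTS_MB - RESERVE_MB ))
TOKENS=$(( FREE_MB * 1024 * 1024 / KV_BYTES_PER_TOKEN ))
echo "KV headroom: ${FREE_MB} MB -> roughly ${TOKENS} tokens of context"
```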
Step 4 — Run shrew against it
```bash
export LLAMACPP_BASE_URL=http://127.0.0.1:8888/v1   # default; explicit
chimera shrew -p "explain this repo"
```

Or override per-invocation:

```bash
LLAMACPP_BASE_URL=http://gpu-box.lan:8888/v1 \
  chimera shrew --model qwen3.6-35b-a3b -p "audit dependencies"
```

If the server gates on a key (`--api-key foo` on llama-server), pass it through:

```bash
export LLAMACPP_API_KEY=foo
chimera shrew -p "..."
```

Sanity checks
- Did llama.cpp pick the right GPU? Look for `ggml_cuda_init` or `ggml_metal_init` in the server log; the device id and free VRAM should both be reported.
- Are the experts actually on CPU? The startup log prints `offload to GPU: <n>` per layer. With `--n-cpu-moe 999` you’ll see CPU lines for the FFN/expert layers and GPU lines for attention/embedding.
- Is `--jinja` working? A round-trip `curl http://127.0.0.1:8888/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3.6-35b-a3b","messages":[{"role":"user","content":"hi"}],"tools":[{"type":"function","function":{"name":"echo","parameters":{"type":"object","properties":{}}}}]}'` should return a `tool_calls` field, not a plain `content` field (a scripted version of these checks follows the list).
- Is shrew finding the server? `chimera shrew --list-models` should print `qwen3.6-35b-a3b  llama.cpp @ http://127.0.0.1:8888/v1` as the first row.
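A scripted version of the curl checks, for convenience. This is a sketch, not part of shrew, and it assumes `jq` is installed and the server is on the default port:

```bash
#!/usr/bin/env bash
# Post-start checks against a llama-server on shrew's default port.
set -euo pipefail
BASE=http://127.0.0.1:8888

curl -sf "$BASE/health" > /dev/null && echo "health: ok"
curl -sf "$BASE/v1/models" | jq -r '.data[].id'

# Tool-calling round trip: with --jinja in place we expect a tool_calls field.
curl -sf "$BASE/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.6-35b-a3b","messages":[{"role":"user","content":"hi"}],"tools":[{"type":"function","function":{"name":"echo","parameters":{"type":"object","properties":{}}}}]}' \
  | jq -e '.choices[0].message.tool_calls' > /dev/null \
  && echo "tool calling: ok" \
  || echo "tool calling: missing (check --jinja)"
```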
Tuning notes
- Context vs. quality trade-off. A bigger `-c` means more KV cache, which means less VRAM left for activations and more attention compute per token. On an 8 GB laptop, 16k is the ceiling; on a 24 GB workstation, 32k is fine. Don’t push past the model’s architectural max (32k for the Qwen3 family).
- `--n-cpu-moe` is a counter, not a flag. Setting it to `999` is just “all”; setting it to `28` would mean “the first 28 expert layers stay on CPU, the rest can go on GPU”. The latter is rarely useful unless you’ve measured.
- Batch size. llama.cpp’s `--batch-size` is fine to leave at its default. The bottleneck is the GPU↔CPU transfer of expert activations, not batch size.
- `--mlock`. On Linux with enough RAM headroom, `--mlock` keeps the expert tensors pinned and avoids page faults during expert routing. Speeds things up by 5-15% in practice. Costs a `ulimit -l unlimited` configuration step (see the example after this list).
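For example, on Linux the memlock limit has to be raised before `--mlock` will take effect; one way (raising it for the current shell, assuming your limits allow it) looks like this:

```bash
# Allow unlimited locked memory for this shell, then serve with --mlock so the
# expert tensors stay pinned in RAM. Non-root users may need a limits.conf or
# systemd change before this ulimit call succeeds.
ulimit -l unlimited

~/src/llama.cpp/build/bin/llama-server \
  -m ~/models/qwen3.6-35b-a3b/Q4_K_M/qwen3.6-35b-a3b-q4_k_m.gguf \
  --host 127.0.0.1 --port 8888 --jinja \
  -c 16384 -ngl 99 --n-cpu-moe 999 --flash-attn on \
  --mlock
```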
Alternative — Ollama
For a lower-friction setup at the cost of MoE-offload control, run the dense Qwen3.5-9B via Ollama:
```bash
ollama serve &
ollama pull qwen3.5:cloud
chimera shrew --model qwen3.5:cloud -p "audit this repo"
```

Shrew probes Ollama at `$OLLAMA_BASE_URL` (default
http://localhost:11434) and falls back to it when llama.cpp isn’t
reachable. The Ollama path uses the OpenAI-compatible shim under
/v1 so the wire shape matches llama.cpp — useful when you want
shrew to behave consistently across both backends.
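To confirm the shim is answering on the OpenAI-compatible path (this is standard Ollama behaviour, nothing shrew adds):

```bash
# Ollama serves an OpenAI-compatible surface under /v1 on its default port.
curl http://localhost:11434/v1/models
```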
The trade-off is that the dense 9B model needs ~6 GB of VRAM all
to itself. There is no MoE trick available; it just runs on the
GPU. On an 8 GB laptop GPU you’ll want a small `-c` (8192 is plenty for Qwen3.5-9B).
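If you’d rather pin the smaller window on the Ollama side instead of relying on the client, a derived model works. This uses standard Ollama Modelfile mechanics; the `qwen3.5-8k` tag is just an example name:

```bash
# Build a variant of the model with an 8k context window, then point shrew at it.
cat > Modelfile <<'EOF'
FROM qwen3.5:cloud
PARAMETER num_ctx 8192
EOF
ollama create qwen3.5-8k -f Modelfile
chimera shrew --model qwen3.5-8k -p "audit this repo"
```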
Further reading
- `extensions.md` — how `moe_offload`, `scaffold_fit`, and `tool_filter` work together at runtime.
- `benchmarks.md` — how to evaluate the local model against Aider Polyglot and GAIA.
- `parity-matrix.md` — what shrew exposes from the upstream small-model coding agent and what it doesn’t.
- `security-and-trademarks.md` — how shrew refers to upstream concepts in source / docs.