Gemma 4 (9.6GB) on a CPU-only VPS is entirely doable, but OpenClaw's defaults are tuned for GPU inference or cloud APIs where responses arrive in seconds. On CPU, a single response can take 60–300 seconds. Without specific config changes, OpenClaw will time out, kill the request, and leave you staring at an error.
This guide covers every setting you need to change, in the right order.
Minimum viable hardware: 8GB RAM for Gemma 4 4-bit quantized (Q4_K_M), 16GB recommended. A 4-core VPS will work, but expect 2–5 minute response times on complex prompts. This is a cost play, not a speed play.
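Before installing anything, confirm what the box actually has. A quick check with plain coreutils, nothing OpenClaw-specific:

```bash
free -h   # total and available RAM -- the Q4 quant needs to fit with room to spare
nproc     # vCPU count -- you'll reuse this for num_thread in Step 5
```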
Step 1: Install Ollama and Pull Gemma 4
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 (4-bit quantized; fits in 8GB RAM)
ollama pull gemma4:9b-instruct-q4_K_M

# Verify it loads (this will take a few minutes on the first run)
ollama run gemma4:9b-instruct-q4_K_M "say hello"
```
Wait for the model to fully load before touching the OpenClaw config. If `ollama run` works, you know the model is functional on your hardware.
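As an extra sanity check, you can hit Ollama's standard REST API directly; this is the same HTTP endpoint OpenClaw will talk to:

```bash
# List locally available models -- the gemma4 tag should appear in the output
curl -s http://127.0.0.1:11434/api/tags
```

If the tag is missing here, no amount of OpenClaw configuration will help.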
Step 2: The Critical Timeout Config
This is the #1 reason CPU setups fail. OpenClaw's default provider timeout is 60–120 seconds: fine for GPU, fatal for CPU inference on a 9.6GB model.
Edit `~/.openclaw/openclaw.json` and add or update the model provider section:
```json
{
  "models": {
    "default": "ollama/gemma4:9b-instruct-q4_K_M"
  },
  "providers": {
    "ollama": {
      "baseUrl": "http://127.0.0.1:11434",
      "timeout": 600,
      "firstTokenTimeout": 300,
      "streamTimeout": 600
    }
  }
}
```
Key values:
- `timeout: 600` sets the total request timeout to 10 minutes. CPU inference on complex prompts can legitimately take this long.
- `firstTokenTimeout: 300` allows 5 minutes for the first token to arrive. This is what kills most setups: Gemma 4 on CPU can take 60–120 seconds just to start outputting.
- `streamTimeout: 600` keeps the stream alive for 10 minutes even if tokens arrive slowly.
If you're seeing "LLM timeout" errors: it's almost always `firstTokenTimeout` that's too low. The model is thinking; OpenClaw just doesn't know that and gives up.
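Rather than guessing, measure what your hardware actually does and set the timeouts above the worst run you observe. A rough sketch against Ollama's `/api/generate` endpoint (the prompt is arbitrary; use something representative of your workload):

```bash
# Time a full non-streaming response to see worst-case latency;
# if this takes 4 minutes, a 600s timeout leaves sensible headroom
time curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "gemma4:9b-instruct-q4_K_M", "prompt": "Explain TCP slow start in one paragraph.", "stream": false}' \
  > /dev/null
```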
Step 3: Disable Heartbeat or Set a Long Interval
A heartbeat that fires while Gemma 4 is still mid-inference will stack requests and exhaust your RAM. Either disable the heartbeat or set a long interval during initial testing:
```json
{
  "heartbeat": {
    "enabled": true,
    "intervalMs": 1800000,
    "timeoutMs": 540000
  }
}
```
That's a 30-minute interval with a 9-minute timeout, giving Gemma 4 room to complete a response before the next heartbeat fires.
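Both values are in milliseconds, which is easy to get wrong by a factor of 1,000. A quick shell check of the arithmetic:

```bash
echo $((30 * 60 * 1000))   # intervalMs: 30 minutes = 1800000
echo $((9 * 60 * 1000))    # timeoutMs: 9 minutes = 540000
```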
Step 4: Tune Ollama for Your RAM
Add these environment variables to Ollama's systemd unit or your shell before starting it:
```ini
# In /etc/systemd/system/ollama.service under [Service]:
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=30m"
```
- `OLLAMA_NUM_PARALLEL=1` allows only one inference at a time. On CPU, parallel requests will OOM.
- `OLLAMA_MAX_LOADED_MODELS=1` keeps only Gemma 4 in memory. Loading and unloading models on CPU is slow and expensive.
- `OLLAMA_KEEP_ALIVE=30m` keeps the model loaded for 30 minutes after the last request instead of unloading it immediately.
Then reload systemd and restart Ollama:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
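If you'd rather not edit the packaged unit file directly, a standard systemd drop-in does the same thing (stock systemd behavior, nothing Ollama-specific):

```bash
# Opens an editor for an override file; paste a [Service] section
# containing the three Environment= lines above, then save
sudo systemctl edit ollama

# After restarting, confirm the variables were actually picked up
systemctl show ollama | grep -i '^Environment'
```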
Step 5: Reduce Context Window
Gemma 4 supports a large context window, but on CPU the full context is brutal on memory and inference time. Unless you need deep memory, cap it:
```json
{
  "providers": {
    "ollama": {
      "options": {
        "num_ctx": 8192,
        "num_thread": 4
      }
    }
  }
}
```
Set `num_thread` to your VPS vCPU count; a 4-core VPS gets 4. This is one of the biggest performance levers available to you.
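You can test these options directly against Ollama before wiring them through OpenClaw; `/api/generate` accepts the same `options` object (model tag from Step 1, values from the config above):

```bash
nproc   # your vCPU count -- use it as num_thread below

curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "gemma4:9b-instruct-q4_K_M",
  "prompt": "say hello",
  "stream": false,
  "options": { "num_ctx": 8192, "num_thread": 4 }
}'
```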
Step 6: Slim Down Your SOUL.md
Every heartbeat cycle loads your full workspace context into the prompt. On CPU with a capped context window, a bloated `SOUL.md` eats into the tokens available for actual responses. Keep `SOUL.md` under 500 words and trim memory files aggressively.
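A quick way to keep yourself honest (the paths below are illustrative; adjust them to wherever your OpenClaw workspace actually lives):

```bash
# Word counts for the files that end up in every prompt --
# keep SOUL.md under ~500 words (paths are examples, not defaults)
wc -w ~/.openclaw/workspace/SOUL.md ~/.openclaw/workspace/memory/*.md
```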
What to Expect on Different Hardware
| VPS Spec | RAM | First Token | Full Response | Verdict |
|---|---|---|---|---|
| 2 vCPU / 8GB | Tight | 90–180s | 3–8 min | Marginal; works with tuning |
| 4 vCPU / 16GB | Comfortable | 45–90s | 2–5 min | Good; recommended minimum |
| 8 vCPU / 32GB | Headroom | 20–45s | 1–3 min | Solid for daily use |
| Any GPU VPS | Fine | 1–3s | 5–30s | Much better; consider upgrading |
When CPU Doesn't Make Sense
CPU-only Gemma 4 is a cost optimization: you're trading speed for ~$0 inference cost. It makes sense for:
- Privacy-sensitive workloads where you can't use cloud APIs
- High-volume use cases where API costs are prohibitive
- Experimental setups where latency doesn't matter
It doesn't make sense if your agent handles real-time communications (Telegram, Discord) where a 3-minute response time will drive users away. For that, use a cloud API or at minimum a GPU VPS.
If you want this configured correctly without the trial and error, book a ClawReady setup call. We've run OpenClaw on CPU-only hardware and know exactly what combination of settings works.