Gemma 4 (9.6GB) on a CPU-only VPS is entirely doable, but OpenClaw's defaults are tuned for GPU inference or cloud APIs where responses arrive in seconds. On CPU, a single response can take 60–300 seconds. Without specific config changes, OpenClaw will time out, kill the request, and leave you staring at an error.
This guide covers every setting you need to change, in the right order.
Minimum viable hardware: 8GB RAM for Gemma 4 4-bit quantized (Q4_K_M), 16GB recommended. A 4-core VPS will work, but expect 2–5 minute response times on complex prompts. This is a cost play, not a speed play.
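Before installing anything, confirm what the box actually has. A quick check with plain coreutils, nothing OpenClaw-specific:

```bash
free -h   # total and available RAM -- the Q4 quant needs to fit with room to spare
nproc     # vCPU count -- you'll reuse this for num_thread in Step 5
```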
Step 1: Install Ollama and Pull Gemma 4
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 (4-bit quantized; fits in 8GB RAM)
ollama pull gemma4:9b-instruct-q4_K_M

# Verify it loads (this will take a few minutes on the first run)
ollama run gemma4:9b-instruct-q4_K_M "say hello"
```
Wait for the model to fully load before touching the OpenClaw config. If `ollama run` works, you know the model is functional on your hardware.
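As an extra sanity check, you can hit Ollama's standard REST API directly; this is the same HTTP endpoint OpenClaw will talk to:

```bash
# List locally available models -- the gemma4 tag should appear in the output
curl -s http://127.0.0.1:11434/api/tags
```

If the tag is missing here, no amount of OpenClaw configuration will help.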
Step 2: The Critical Timeout Config
This is the #1 reason CPU setups fail. OpenClaw's default provider timeout is 60–120 seconds: fine for GPU, fatal for CPU inference on a 9.6GB model.
Edit `~/.openclaw/openclaw.json` and add or update the model provider section:
```json
{
  "models": {
    "default": "ollama/gemma4:9b-instruct-q4_K_M"
  },
  "providers": {
    "ollama": {
      "baseUrl": "http://127.0.0.1:11434",
      "timeout": 600,
      "firstTokenTimeout": 300,
      "streamTimeout": 600
    }
  }
}
```
Key values:
- `timeout: 600` sets the total request timeout to 10 minutes. CPU inference on complex prompts can legitimately take this long.
- `firstTokenTimeout: 300` allows 5 minutes for the first token to arrive. This is what kills most setups: Gemma 4 on CPU can take 60–120 seconds just to start outputting.
- `streamTimeout: 600` keeps the stream alive for 10 minutes even if tokens arrive slowly.
If you're seeing "LLM timeout" errors: it's almost always `firstTokenTimeout` that's too low. The model is thinking; OpenClaw just doesn't know that and gives up.
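Rather than guessing, measure what your hardware actually does and set the timeouts above the worst run you observe. A rough sketch against Ollama's `/api/generate` endpoint (the prompt is arbitrary; use something representative of your workload):

```bash
# Time a full non-streaming response to see worst-case latency;
# if this takes 4 minutes, a 600s timeout leaves sensible headroom
time curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "gemma4:9b-instruct-q4_K_M", "prompt": "Explain TCP slow start in one paragraph.", "stream": false}' \
  > /dev/null
```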
Step 3: Disable Heartbeat or Set a Long Interval
A heartbeat that fires while Gemma 4 is still mid-inference will stack requests and exhaust your RAM. Either disable the heartbeat or set a long interval during initial testing:
```json
{
  "heartbeat": {
    "enabled": true,
    "intervalMs": 1800000,
    "timeoutMs": 540000
  }
}
```
That's a 30-minute interval with a 9-minute timeout, giving Gemma 4 room to complete a response before the next heartbeat fires.
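Both values are in milliseconds, which is easy to get wrong by a factor of 1,000. A quick shell check of the arithmetic:

```bash
echo $((30 * 60 * 1000))   # intervalMs: 30 minutes = 1800000
echo $((9 * 60 * 1000))    # timeoutMs: 9 minutes = 540000
```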
Step 4: Tune Ollama for Your RAM
Add these environment variables to Ollama's systemd unit or your shell before starting it:
```ini
# In /etc/systemd/system/ollama.service under [Service]:
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_KEEP_ALIVE=30m"
```
- `OLLAMA_NUM_PARALLEL=1` allows only one inference at a time. On CPU, parallel requests will OOM.
- `OLLAMA_MAX_LOADED_MODELS=1` keeps only Gemma 4 in memory. Loading and unloading models on CPU is slow and expensive.
- `OLLAMA_KEEP_ALIVE=30m` keeps the model loaded for 30 minutes after the last request instead of unloading it immediately.
Then reload systemd and restart Ollama:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
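If you'd rather not edit the packaged unit file directly, a standard systemd drop-in does the same thing (stock systemd behavior, nothing Ollama-specific):

```bash
# Opens an editor for an override file; paste a [Service] section
# containing the three Environment= lines above, then save
sudo systemctl edit ollama

# After restarting, confirm the variables were actually picked up
systemctl show ollama | grep -i '^Environment'
```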
Step 5: Reduce Context Window
Gemma 4 supports a large context window, but on CPU the full context is brutal on memory and inference time. Unless you need deep memory, cap it:
```json
{
  "providers": {
    "ollama": {
      "options": {
        "num_ctx": 8192,
        "num_thread": 4
      }
    }
  }
}
```
Set `num_thread` to your VPS vCPU count; a 4-core VPS gets 4. This is one of the biggest performance levers available to you.
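You can test these options directly against Ollama before wiring them through OpenClaw; `/api/generate` accepts the same `options` object (model tag from Step 1, values from the config above):

```bash
nproc   # your vCPU count -- use it as num_thread below

curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "gemma4:9b-instruct-q4_K_M",
  "prompt": "say hello",
  "stream": false,
  "options": { "num_ctx": 8192, "num_thread": 4 }
}'
```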
Step 6: Slim Down Your SOUL.md
Every heartbeat cycle loads your full workspace context into the prompt. On CPU with a capped context window, a bloated `SOUL.md` eats into the tokens available for actual responses. Keep `SOUL.md` under 500 words and trim memory files aggressively.
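A quick way to keep yourself honest (the paths below are illustrative; adjust them to wherever your OpenClaw workspace actually lives):

```bash
# Word counts for the files that end up in every prompt --
# keep SOUL.md under ~500 words (paths are examples, not defaults)
wc -w ~/.openclaw/workspace/SOUL.md ~/.openclaw/workspace/memory/*.md
```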
What to Expect on Different Hardware
| VPS Spec | RAM | First Token | Full Response | Verdict |
|---|---|---|---|---|
| 2 vCPU / 8GB | Tight | 90–180s | 3–8 min | Marginal; works with tuning |
| 4 vCPU / 16GB | Comfortable | 45–90s | 2–5 min | Good; recommended minimum |
| 8 vCPU / 32GB | Headroom | 20–45s | 1–3 min | Solid for daily use |
| Any GPU VPS | Fine | 1–3s | 5–30s | Much better; consider upgrading |
When CPU Doesn't Make Sense
CPU-only Gemma 4 is a cost optimization: you're trading speed for ~$0 inference cost. It makes sense for:
- Privacy-sensitive workloads where you can't use cloud APIs
- High-volume use cases where API costs are prohibitive
- Experimental setups where latency doesn't matter
It doesn't make sense if your agent handles real-time communications (Telegram, Discord) where a 3-minute response time will drive users away. For that, use a cloud API or at minimum a GPU VPS.
If you want this configured correctly without the trial and error, book a ClawReady setup call. We've run OpenClaw on CPU-only hardware and know exactly what combination of settings works.