MiniMax M2.7 Built Its Evaluation Benchmark on OpenClaw Tasks — And Trained Itself With an Agent Harness
MiniMax's newly released M2.7 model announcement contains two details that matter specifically for OpenClaw users.
First: MiniMax built their primary real-world evaluation set, "MM Claw," directly from common OpenClaw usage patterns, covering personal learning planning, office document processing, scheduled tasks, and more. In other words, OpenClaw usage is their reference for what real agentic work looks like.
Second: MiniMax used OpenClaw agent harnesses internally to help train M2.7. They tasked an internal version of the model with building a research agent harness supporting data pipelines, training environments, cross-team collaboration, and persistent memory, then let the model improve its own learning process based on the experiment results. The result is an agent harness that drives the model's own evolution.
Why MM Claw Matters
When a frontier model lab builds its real-world capability benchmark on top of your platform's usage patterns, that's a strong validation signal. MiniMax isn't testing M2.7 against abstract NLP tasks — they're testing it against the actual workflows OpenClaw users run every day.
That means M2.7 is specifically tuned for:
- Personal productivity automation (the core OpenClaw use case)
- Office document processing and complex editing (Excel, PPT, Word)
- Scheduled and recurring task execution (heartbeat-style workflows)
- Complex skill management — 97% adherence rate across 40+ skills exceeding 2,000 tokens each
That last number is striking. Maintaining 97% behavioral adherence across more than 40 complex skills simultaneously is exactly the kind of capability that matters for multi-skill OpenClaw setups, where agents juggle multiple concurrent responsibilities.
M2.7 Performance Numbers
- SWE-Pro: 56.22% — "nearly approaching Opus's best level" for software engineering tasks
- VIBE-Pro: 55.6% (end-to-end full project delivery)
- Terminal Bench 2: 57.0% (deep understanding of complex engineering systems)
- GDPval-AA ELO: 1495 — highest among open-source models for professional office tasks
- Skill adherence: 97% across 40+ skills, each >2,000 tokens
The Self-Evolution Story
M2.7 is MiniMax's "first model deeply participating in its own evolution." The internal workflow: M2.7 built a research agent harness from scratch — with data pipelines, training environments, infrastructure connections, cross-team collaboration channels, and persistent memory — then used that harness to run RL experiments and improve its own learning process based on the results.
This is the same architecture pattern OpenClaw-RL (from Gen-Verse) explored, but implemented by MiniMax as a production training workflow rather than a research framework. The convergence is notable: both projects conclude that agent harnesses + self-directed feedback are the path to model improvement.
M2.7 for OpenClaw: Practical Considerations
M2.7 has been the community's top-recommended budget alternative after Claude's recent pricing changes (covered in our model rankings post). The "impossible to exhaust quota" reports from users combined with genuine near-Opus performance on structured tasks make it a compelling OpenClaw primary or fallback.
Setup in OpenClaw:
# openclaw.json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "minimax/minimax-m2.7",
        "fallbacks": ["ollama/qwen3.6:27b"]
      }
    }
  }
}
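If you would rather keep Opus as the primary and lean on M2.7 only when cost or quota becomes the constraint (the "primary or fallback" framing above), the same two keys can simply be rearranged. This is a minimal sketch that reuses only the keys already shown; it assumes the fallbacks array accepts multiple entries, and the Anthropic model identifier is illustrative, so verify it against your provider's model list:
# openclaw.json
# (sketch: the Anthropic model ID below is illustrative; verify the exact identifier)
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-5",
        "fallbacks": ["minimax/minimax-m2.7", "ollama/qwen3.6:27b"]
      }
    }
  }
}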
Best use cases based on MM Claw benchmark coverage:
- Heartbeat-driven scheduled tasks (strong recurring task performance; sketched below)
- Document processing and summarization workflows
- Multi-skill agent setups where skill count is high (97% adherence)
- Coding and engineering tasks where Opus-level quality matters but budget is a constraint
Less optimal for: highly creative tasks, nuanced judgment calls, or scenarios requiring the absolute top-end reasoning that Claude Opus or Kimi K2.5 deliver.
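For the heartbeat-style recurring tasks mentioned in the use-case list, the sketch below shows one way such a setup might look. The "heartbeat" block and its field names ("every", "prompt") are hypothetical, introduced purely for illustration; consult the OpenClaw documentation for the actual way to configure recurring runs.
# openclaw.json
# (hypothetical sketch: the "heartbeat" key and its fields are illustrative, not confirmed OpenClaw schema)
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "minimax/minimax-m2.7",
        "fallbacks": ["ollama/qwen3.6:27b"]
      },
      "heartbeat": {
        "every": "30m",
        "prompt": "Review the scheduled-task list and handle anything due."
      }
    }
  }
}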
The Bigger Signal
MiniMax building MM Claw on OpenClaw usage patterns is the same kind of ecosystem validation we saw with Tencent (QClaw), Alibaba (QwenPaw), and Huawei (JiuwenClaw) this week. OpenClaw has become the de facto reference platform for what real-world agentic AI work looks like — even for frontier model labs benchmarking their own models.
That's a strong moat for the platform. And it means models that optimize for OpenClaw compatibility (like M2.7 explicitly does) will keep getting better at the tasks OpenClaw users actually run.