MiniMax M2.7 Built Its Evaluation Benchmark on OpenClaw Tasks — And Trained Itself With an Agent Harness
MiniMax's newly released M2.7 model announcement contains two details that matter specifically for OpenClaw users.
First: MiniMax built their primary real-world evaluation set, "MM Claw," directly from common OpenClaw usage patterns, covering personal learning planning, office document processing, scheduled tasks, and more. In other words, OpenClaw usage is their reference for what real agentic work looks like.
Second: MiniMax used OpenClaw agent harnesses internally to help train M2.7. They tasked an internal version of the model with building a research agent harness supporting data pipelines, training environments, cross-team collaboration, and persistent memory, then let the model improve its own learning process based on the experiment results. The result is an agent harness that drives the model's own evolution.
Why MM Claw Matters
When a frontier model lab builds its real-world capability benchmark on top of your platform's usage patterns, that's a strong validation signal. MiniMax isn't testing M2.7 against abstract NLP tasks — they're testing it against the actual workflows OpenClaw users run every day.
That means M2.7 is specifically tuned for:
- Personal productivity automation (the core OpenClaw use case)
- Office document processing and complex editing (Excel, PPT, Word)
- Scheduled and recurring task execution (heartbeat-style workflows)
- Complex skill management — 97% adherence rate across 40+ skills exceeding 2,000 tokens each
That last number is striking. Maintaining 97% behavioral adherence across more than 40 complex skills simultaneously is exactly the kind of capability that matters for multi-skill OpenClaw setups, where agents juggle multiple concurrent responsibilities.
M2.7 Performance Numbers
- SWE-Pro: 56.22% — "nearly approaching Opus's best level" for software engineering tasks
- VIBE-Pro: 55.6% (end-to-end full project delivery)
- Terminal Bench 2: 57.0% (deep understanding of complex engineering systems)
- GDPval-AA ELO: 1495 — highest among open-source models for professional office tasks
- Skill adherence: 97% across 40+ skills, each >2,000 tokens
The Self-Evolution Story
M2.7 is MiniMax's "first model deeply participating in its own evolution." The internal workflow: M2.7 built a research agent harness from scratch — with data pipelines, training environments, infrastructure connections, cross-team collaboration channels, and persistent memory — then used that harness to run RL experiments and improve its own learning process based on the results.
This is the same architecture pattern OpenClaw-RL (from Gen-Verse) explored, but implemented by MiniMax as a production training workflow rather than a research framework. The convergence is notable: both projects conclude that agent harnesses + self-directed feedback are the path to model improvement.
M2.7 for OpenClaw: Practical Considerations
M2.7 has been the community's top-recommended budget alternative after Claude's recent pricing changes (covered in our model rankings post). The "impossible to exhaust quota" reports from users combined with genuine near-Opus performance on structured tasks make it a compelling OpenClaw primary or fallback.
Setup in OpenClaw:
# openclaw.json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "minimax/minimax-m2.7",
        "fallbacks": ["ollama/qwen3.6:27b"]
      }
    }
  }
}
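If you would rather keep Opus as the primary and lean on M2.7 only when cost or quota becomes the constraint (the "primary or fallback" framing above), the same two keys can simply be rearranged. This is a minimal sketch that reuses only the keys already shown; it assumes the fallbacks array accepts multiple entries, and the Anthropic model identifier is illustrative, so verify it against your provider's model list:
# openclaw.json
# (sketch: the Anthropic model ID below is illustrative; verify the exact identifier)
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-5",
        "fallbacks": ["minimax/minimax-m2.7", "ollama/qwen3.6:27b"]
      }
    }
  }
}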
Best use cases based on MM Claw benchmark coverage:
- Heartbeat-driven scheduled tasks (strong recurring task performance; sketched below)
- Document processing and summarization workflows
- Multi-skill agent setups where skill count is high (97% adherence)
- Coding and engineering tasks where Opus-level quality matters but budget is a constraint
Less optimal for: highly creative tasks, nuanced judgment calls, or scenarios requiring the absolute top-end reasoning that Claude Opus or Kimi K2.5 deliver.
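For the heartbeat-style recurring tasks mentioned in the use-case list, the sketch below shows one way such a setup might look. The "heartbeat" block and its field names ("every", "prompt") are hypothetical, introduced purely for illustration; consult the OpenClaw documentation for the actual way to configure recurring runs.
# openclaw.json
# (hypothetical sketch: the "heartbeat" key and its fields are illustrative, not confirmed OpenClaw schema)
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "minimax/minimax-m2.7",
        "fallbacks": ["ollama/qwen3.6:27b"]
      },
      "heartbeat": {
        "every": "30m",
        "prompt": "Review the scheduled-task list and handle anything due."
      }
    }
  }
}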
The Bigger Signal
MiniMax building MM Claw on OpenClaw usage patterns is the same kind of ecosystem validation we saw with Tencent (QClaw), Alibaba (QwenPaw), and Huawei (JiuwenClaw) this week. OpenClaw has become the de facto reference platform for what real-world agentic AI work looks like — even for frontier model labs benchmarking their own models.
That's a strong moat for the platform. And it means models that optimize for OpenClaw compatibility (like M2.7 explicitly does) will keep getting better at the tasks OpenClaw users actually run.