8 models, 52 prompts, 11 traps, LLM-as-judge scoring. Run on 2026-04-10.
| # | Model | Reasoning (32) | Code (12) | PM (8) | Overall | Traps |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 88% | 100% | 100% | 92% (48/52) | 8/11 |
| 2 | GLM 5.1 Cloud | 84% | 92% | 100% | 88% (46/52) | 8/11 |
| 3 | Qwen 3.5 Cloud | 81% | 100% | 88% | 87% (45/52) | 8/11 |
| 4 | GPT-5.4 | 81% | 83% | 100% | 85% (44/52) | 6/11 |
| 4 | Kimi K2.5 Cloud | 88% | 83%† | 100% | 85% (44/52) | 8/11 |
| 6 | MiniMax M2.7 highspeed | 75% | 83% | 88% | 79% (41/52) | 6/11 |
| 6 | GPT-5.3 Codex | 78% | 83% | 75% | 79% (41/52) | 5/11 |
| 8 | MiniMax M2.7 | 69% | 75% | 75% | 71% (37/52) | 6/11 |
† Kimi's raw code score was 67% (8/12). Two of the four misses were Ollama streaming-JSON parse errors at char 78 — infrastructure failures, not reasoning failures. Both retried cleanly at pass_rate 1.0. Corrected score is 10/12 = 83%.
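Transient failures of this kind can be absorbed generically. The sketch below is illustrative only, not the benchmark's actual retry logic; it retries solely on `json.JSONDecodeError`, the failure mode described above:

```python
import json
import time

def call_with_retry(call, attempts=3, delay=1.0):
    """Retry `call` when a streamed response fails to parse as JSON.

    Only json.JSONDecodeError (an infrastructure-level failure here) is
    retried; genuine model errors propagate immediately.
    """
    for attempt in range(attempts):
        try:
            return call()
        except json.JSONDecodeError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the parse error
            time.sleep(delay)  # brief pause before retrying
```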
Positions 2–5 are within 2 prompts (~4%) — treat as a statistical tie, not a ranking.
Some prompts failed on every model, regardless of provider or thinking budget. These failures can't be fixed by picking a different model; they have to be fixed by prompt-level rules that force clarification and refuse weaponization.
- Full leaderboard with cost table, universal-failure analysis, adversarial matrix, and per-tier recommendations.
- Sonnet, GPT-5.4, GPT-5.3 Codex, MiniMax M2.7 base, and MiniMax M2.7 highspeed: the canonical 5-config run.
- Kimi K2.5, GLM 5.1, and Qwen 3.5: all three Chinese cloud models, routed via Ollama.
- The headline reasoning benchmark: logical deduction, instruction adherence, self-correction, creative problem-solving, and adversarial robustness. The same 32 prompts were also run against Kimi, GLM, and Qwen; all three tied Sonnet on trap handling.
- Code reading comprehension: bug spotting, output prediction, and diagnostic interpretation. Note: two of Kimi's four misses here were infrastructure artifacts (streaming-JSON parse errors), so its real score is 83% after retry.
- Real PM scenarios: prioritization, stakeholder conflict, metric gaming, and framing tech debt for a CFO. GLM 5.1 and Kimi K2.5 both hit 100%.
- Thinking levels: GPT-5.4 at adaptive/medium/high/xhigh, Sonnet at medium/high, Codex at medium/xhigh, MiniMax at off. Thinking level barely moves the needle.
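A thinking-level sweep like the one above expands mechanically into one eval config per (model, level) pair. This is a sketch under assumed field names (`model`, `thinking`, `slug`), not the benchmark's real config schema:

```python
# Hypothetical sweep definition: model -> thinking levels to test.
SWEEP = {
    "gpt-5.4": ["adaptive", "medium", "high", "xhigh"],
    "claude-sonnet-4.6": ["medium", "high"],
    "gpt-5.3-codex": ["medium", "xhigh"],
    "minimax-m2.7": ["off"],
}

def make_configs(sweep):
    """Expand the sweep dict into one config dict per (model, level) pair."""
    return [
        {"model": model, "thinking": level, "slug": f"eval-{model}-{level}"}
        for model, levels in sweep.items()
        for level in levels
    ]

configs = make_configs(SWEEP)
print(len(configs))  # 4 + 2 + 2 + 1 = 9 configs
```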
Each config under test is a dedicated `eval-<slug>` agent with an identical minimal `AGENTS.md` and a single-model `models.json`. All agents see exactly the same bootstrap, so any difference in response behavior is attributable to the model (or the thinking level). This compares models as OpenClaw actually uses them.
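For illustration, a minimal single-model `models.json` for one such agent might look like the following. The field names here are assumptions made for the sketch, not OpenClaw's documented schema:

```json
{
  "models": [
    {
      "id": "claude-sonnet-4.6",
      "provider": "anthropic",
      "thinking": "medium"
    }
  ]
}
```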
Responses are scored by Claude Opus 4.6 via Claude Code OAuth. When a config under test is itself Opus, the rubric is skipped (self-judge protection) and only pass/fail assertions apply. Pass criteria are strict: a response passes only if every named criterion passes (`pass_rate = 1.0`).
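The strict gate reduces to a one-liner: a response passes only when every criterion holds. A minimal sketch follows; the dict shape and criterion names are hypothetical, not the judge's real internals:

```python
def passes(criteria):
    """Strict gate: pass only if every named criterion passed (pass_rate == 1.0)."""
    if not criteria:
        return False  # no criteria evaluated means no pass
    pass_rate = sum(criteria.values()) / len(criteria)
    return pass_rate == 1.0

# Hypothetical criterion names, for illustration only:
print(passes({"follows_instructions": True, "refuses_weaponization": True}))   # True
print(passes({"follows_instructions": True, "refuses_weaponization": False}))  # False
```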
Full methodology lives in the `docs` folder on the main branch; see especially `interpreting-results.md`.
The benchmark is open source and runnable end-to-end. See the main branch for code: github.com/arthursoares/openclaw-llm-bench.
Five commands:
```shell
git clone https://github.com/arthursoares/openclaw-llm-bench
cd openclaw-llm-bench
bash scripts/setup-eval-agents.sh
python3 scripts/run_eval.py --configs configs/defaults.json --eval-set evals/reasoning-v1.json --parallel 4
python3 scripts/judge.py results/run-<timestamp>
python3 scripts/report.py results/run-<timestamp>
```