About ClawBench

ClawBench is an agent orchestration benchmark that tests AI models through the full agent stack -- not raw API calls. It evaluates thinking-block stripping, retry logic, tool orchestration, and end-to-end task completion through the OpenClaw gateway.

Unlike traditional benchmarks that test model capabilities in isolation, ClawBench tests the complete agent system: the model, the orchestration middleware, and the tool execution layer working together.
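As an illustration of the kind of retry logic an agent stack like this is evaluated on, here is a minimal sketch of retrying a flaky model call with exponential backoff. The function name, parameters, and defaults are hypothetical and not part of ClawBench or OpenClaw:

```python
import random
import time


def call_with_retries(call_model, max_attempts=4, base_delay=0.5):
    """Retry a flaky model call with exponential backoff and jitter.

    `call_model` is any zero-argument callable that may raise a
    transient error; these names are illustrative, not ClawBench's API.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the last error
            # Exponential backoff: base, 2x base, 4x base, ... plus jitter.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

A benchmark exercising this path would inject transient failures and check that the agent eventually succeeds without exceeding its retry budget.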

Categories

Tool Accuracy

Tests whether the agent calls the right tools with the correct arguments and interprets results correctly.

Code Generation

Evaluates the agent's ability to write, modify, and debug code in a sandboxed environment.

Reasoning

Measures logical reasoning, multi-step deduction, and the ability to handle ambiguity.

Error Recovery

Tests how the agent recovers from errors, retries failed operations, and adapts its approach.

Multi-Step

Evaluates the agent's ability to plan and execute tasks that require sequencing multiple dependent steps.

Research

Tests information gathering, synthesis, and the ability to answer questions requiring multiple sources.

Context Management

Measures how well the agent maintains context across long conversations and complex tasks.

Score Guide

90-100  Excellent -- agent handles complex orchestration reliably
70-89   Solid -- agent works well for most tasks
40-69   Functional -- agent works but has gaps
0-39    Needs work -- significant agent capability issues
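The tier boundaries above can be expressed as a simple lookup. This is an illustrative sketch; the function is not part of the ClawBench tooling:

```python
def score_tier(score):
    """Map a 0-100 ClawBench score to its Score Guide tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Solid"
    if score >= 40:
        return "Functional"
    return "Needs work"
```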

Run a Benchmark

# Install and run
npx clawbench --gateway-token <your-token>

# Submit results to the leaderboard
npx clawbench --submit --gateway-token <your-token>

# Test a specific model
npx clawbench --submit --model anthropic/claude-sonnet-4-6 --gateway-token <token>

Sponsors

Gold Sponsor -- Your logo here -- Maximum visibility
Silver Sponsor -- Your logo here -- Featured placement
Community Sponsor -- Your logo here -- Support the benchmark