About ClawBench

ClawBench is an agent orchestration benchmark that tests AI models through the full agent stack -- not raw API calls. It evaluates thinking-block stripping, retry logic, tool orchestration, and end-to-end task completion through the OpenClaw gateway.

Unlike traditional benchmarks that test model capabilities in isolation, ClawBench tests the complete agent system: the model, the orchestration middleware, and the tool execution layer working together.
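As an illustration of the kind of retry logic an agent stack like this is evaluated on, here is a minimal sketch of retrying a flaky model call with exponential backoff. The function name, parameters, and defaults are hypothetical and not part of ClawBench or OpenClaw:

```python
import random
import time


def call_with_retries(call_model, max_attempts=4, base_delay=0.5):
    """Retry a flaky model call with exponential backoff and jitter.

    `call_model` is any zero-argument callable that may raise a
    transient error; these names are illustrative, not ClawBench's API.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the last error
            # Exponential backoff: base, 2x base, 4x base, ... plus jitter.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

A benchmark exercising this path would inject transient failures and check that the agent eventually succeeds without exceeding its retry budget.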

Categories

Tool Accuracy

Tests whether the agent calls the right tools with the correct arguments and interprets results correctly.

Code Generation

Evaluates the agent's ability to write, modify, and debug code in a sandboxed environment.

Reasoning

Measures logical reasoning, multi-step deduction, and the ability to handle ambiguity.

Error Recovery

Tests how the agent recovers from errors, retries failed operations, and adapts its approach.

Multi-Step

Evaluates the agent's ability to plan and execute tasks that require sequencing multiple dependent steps.

Research

Tests information gathering, synthesis, and the ability to answer questions requiring multiple sources.

Context Management

Measures how well the agent maintains context across long conversations and complex tasks.

Score Guide

90-100  Excellent -- agent handles complex orchestration reliably
70-89   Solid -- agent works well for most tasks
40-69   Functional -- agent works but has gaps
0-39    Needs work -- significant agent capability issues
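The tier boundaries above can be expressed as a simple lookup. This is an illustrative sketch; the function is not part of the ClawBench tooling:

```python
def score_tier(score):
    """Map a 0-100 ClawBench score to its Score Guide tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Solid"
    if score >= 40:
        return "Functional"
    return "Needs work"
```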

Run a Benchmark

# Install and run
npx clawbench --gateway-token <your-token>

# Submit results to the leaderboard
npx clawbench --submit --gateway-token <your-token>

# Test a specific model
npx clawbench --submit --model anthropic/claude-sonnet-4-6 --gateway-token <token>

Sponsors

Gold Sponsor -- Your logo here -- Maximum visibility
Silver Sponsor -- Your logo here -- Featured placement
Community Sponsor -- Your logo here -- Support the benchmark