Benchmarks

Performance comparison of CAPA's three tool exposure modes across 5 real-world scenarios, with 10 trials per scenario and mode (150 runs total), on claude-opus-4-8.

Summary Stats

150/150 trials passed
capa-cli total: $0.99
capa-ondemand total: $0.99
capa-expose-all total: $1.22

Variants Under Test

capa-cli (toolExposure: none): the agent reaches tools through capa sh. No MCP schemas in the prompt.
capa-ondemand (toolExposure: on-demand): two meta-tools (setup_tools, call_tool). Schemas are loaded lazily.
capa-expose-all (toolExposure: expose-all): every tool's JSON schema is published up front. This is the vanilla MCP baseline.
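
To make the difference between on-demand and expose-all concrete, here is a minimal toy sketch of the lazy-loading idea. It is not CAPA's implementation: the registry, the two example tools, their schemas, and the signatures of setup_tools and call_tool are all assumptions made for illustration. Only the meta-tool names come from the text above.

```python
import json
from typing import Any, Callable

# Hypothetical registry: tool name -> (JSON schema, handler).
# Tool names, schemas, and handlers are invented for this sketch.
REGISTRY: dict[str, tuple[dict[str, Any], Callable[..., Any]]] = {
    "slack_post_message": (
        {
            "type": "object",
            "properties": {"channel": {"type": "string"}, "text": {"type": "string"}},
            "required": ["channel", "text"],
        },
        lambda channel, text: f"posted to {channel}: {text}",
    ),
    "sentry_list_issues": (
        {
            "type": "object",
            "properties": {"project": {"type": "string"}},
            "required": ["project"],
        },
        lambda project: [f"{project}-issue-1"],
    ),
}

loaded: set[str] = set()  # tools whose schemas the agent has requested so far


def setup_tools(names: list[str]) -> dict[str, dict[str, Any]]:
    """Return schemas only for the requested tools and mark them as loaded."""
    schemas = {name: REGISTRY[name][0] for name in names}
    loaded.update(names)
    return schemas


def call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Dispatch to a tool, but only after its schema has been loaded."""
    if name not in loaded:
        raise LookupError(f"call setup_tools(['{name}']) first")
    return REGISTRY[name][1](**arguments)


def schema_bytes(schemas: dict[str, dict[str, Any]]) -> int:
    """Size of the serialized schemas, a rough proxy for prompt overhead."""
    return len(json.dumps(schemas))


# expose-all: every schema is serialized up front.
expose_all_bytes = schema_bytes({name: schema for name, (schema, _) in REGISTRY.items()})

# on-demand: only the schema the agent asked for is serialized.
on_demand_bytes = schema_bytes(setup_tools(["slack_post_message"]))

result = call_tool("slack_post_message", {"channel": "#releases", "text": "v1.2 shipped"})
```

In this toy model the on-demand path serializes one schema instead of two. With the schema sizes reported in the benchmark (47 KB to 161 KB per backend), the same mechanism would plausibly explain the lower cache-creation counts, though the sketch does not measure that.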

Cost Per Run

Mean cost per scenario (10 trials each). Lower is better.

[Figure: grouped bar chart of mean cost per run in USD (axis from $0.00 to $0.40) for Slack Release Note, Sentry Top Issues, GitHub Issue from Bug, Incident Triage, and Refactor: Request ID, with one bar each for capa-cli, capa-ondemand, and capa-expose-all. The plotted values are the means listed under Per-Scenario Cost Breakdown, rounded to three decimals.]

Per-Scenario Cost Breakdown

| Scenario | Schema Bulk | capa-cli | capa-ondemand | capa-expose-all | Best vs Worst |
|---|---|---|---|---|---|
| Slack Release Note | slack · 13 tools / 47 KB | $0.1179 ±0.0121 | $0.1440 ±0.0011 | $0.1649 ±0.0325 | 29% |
| Sentry Top Issues | sentry · 23 tools / 81 KB | $0.2602 ±0.0201 | $0.2479 ±0.0184 | $0.3474 ±0.0569 | 29% |
| GitHub Issue from Bug | github · 108 tools / 161 KB | $0.1465 ±0.0335 | $0.1719 ±0.0038 | $0.1596 ±0.0177 | 15% |
| Incident Triage | sentry+slack · 36 tools / ~128 KB | $0.2783 ±0.0202 | $0.2251 ±0.0053 | $0.3755 ±0.0363 | 40% |
| Refactor: Request ID | no SaaS (control case) | $0.1895 ±0.0132 | $0.1970 ±0.0128 | $0.1754 ±0.0173 | 11% |
| Total (suite) | | $0.99 | $0.99 | $1.22 | 19% |
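
The table does not define the Best vs Worst column. The sketch below assumes it is (worst − best) / worst over the three mean costs in a row. That assumption reproduces every percentage in the table, and summing the per-scenario means reproduces the suite totals. The function and variable names are illustrative only.

```python
# Mean cost per run in USD, copied from the per-scenario table.
COSTS = {
    "Slack Release Note":    {"capa-cli": 0.1179, "capa-ondemand": 0.1440, "capa-expose-all": 0.1649},
    "Sentry Top Issues":     {"capa-cli": 0.2602, "capa-ondemand": 0.2479, "capa-expose-all": 0.3474},
    "GitHub Issue from Bug": {"capa-cli": 0.1465, "capa-ondemand": 0.1719, "capa-expose-all": 0.1596},
    "Incident Triage":       {"capa-cli": 0.2783, "capa-ondemand": 0.2251, "capa-expose-all": 0.3755},
    "Refactor: Request ID":  {"capa-cli": 0.1895, "capa-ondemand": 0.1970, "capa-expose-all": 0.1754},
}


def best_vs_worst(row: dict[str, float]) -> float:
    """Assumed definition: how much cheaper the best variant is than the worst."""
    best, worst = min(row.values()), max(row.values())
    return (worst - best) / worst


def suite_totals(costs: dict[str, dict[str, float]]) -> dict[str, float]:
    """Sum the per-scenario mean costs for each variant."""
    totals: dict[str, float] = {}
    for row in costs.values():
        for variant, cost in row.items():
            totals[variant] = totals.get(variant, 0.0) + cost
    return totals


# Per-scenario spread as a whole-number percentage.
spreads = {scenario: round(100 * best_vs_worst(row)) for scenario, row in COSTS.items()}

# Suite totals rounded to cents, plus the spread across those totals.
totals = {variant: round(total, 2) for variant, total in suite_totals(COSTS).items()}
suite_spread = round(100 * best_vs_worst(totals))
```

Note that the cheapest variant differs by row: capa-cli on Slack Release Note and GitHub Issue from Bug, capa-ondemand on Sentry Top Issues and Incident Triage, and expose-all on the control case. The Best vs Worst column is therefore not always a saving relative to expose-all.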

Token Detail

Mean per-run token usage across 10 trials.

| Scenario | Variant | Input | Cache Create | Cache Read | Output | Total |
|---|---|---|---|---|---|---|
| Slack Release Note | cli | 2,106 | 8,860 | 67,290 | 735 | 78,991 |
| | ondemand | 2,133 | 13,005 | 74,508 | 591 | 90,236 |
| | expose-all | 2,478 | 16,558 | 68,543 | 589 | 88,168 |
| Sentry Top Issues | cli | 2,226 | 20,650 | 127,307 | 2,255 | 152,437 |
| | ondemand | 2,269 | 19,006 | 107,489 | 2,562 | 131,326 |
| | expose-all | 2,815 | 37,795 | 101,080 | 1,862 | 143,553 |
| GitHub Issue from Bug | cli | 2,106 | 12,088 | 69,326 | 1,030 | 84,550 |
| | ondemand | 2,131 | 17,266 | 60,176 | 929 | 80,502 |
| | expose-all | 4,081 | 15,152 | 43,706 | 906 | 63,846 |
| Incident Triage | cli | 2,243 | 20,022 | 188,859 | 1,902 | 213,026 |
| | ondemand | 2,266 | 18,151 | 126,584 | 1,480 | 148,481 |
| | expose-all | 3,084 | 36,160 | 185,337 | 1,657 | 226,239 |
| Refactor: Request ID | cli | 2,150 | 10,419 | 113,205 | 2,283 | 128,056 |
| | ondemand | 2,151 | 10,103 | 123,733 | 2,450 | 138,438 |
| | expose-all | 2,124 | 8,896 | 106,678 | 2,235 | 119,932 |
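
As a reading aid, the sketch below takes four rows from the token table and checks two things: that Total is the sum of the four component columns (to within 1 token, since the figures are rounded means), and how Cache Create compares between expose-all and ondemand on the two Sentry-backed scenarios. The Usage type and helper names are illustrative, not part of CAPA.

```python
from typing import NamedTuple


class Usage(NamedTuple):
    """Mean per-run token counts for one scenario/variant row."""
    input: int
    cache_create: int
    cache_read: int
    output: int
    total: int


# Rows copied from the token table (Sentry Top Issues and Incident Triage only).
ROWS = {
    ("Sentry Top Issues", "ondemand"):   Usage(2269, 19006, 107489, 2562, 131326),
    ("Sentry Top Issues", "expose-all"): Usage(2815, 37795, 101080, 1862, 143553),
    ("Incident Triage", "ondemand"):     Usage(2266, 18151, 126584, 1480, 148481),
    ("Incident Triage", "expose-all"):   Usage(3084, 36160, 185337, 1657, 226239),
}


def components_sum(u: Usage) -> int:
    """Sum of the four component columns, to compare against the Total column."""
    return u.input + u.cache_create + u.cache_read + u.output


def cache_create_ratio(scenario: str) -> float:
    """Cache Create tokens under expose-all relative to ondemand."""
    return ROWS[(scenario, "expose-all")].cache_create / ROWS[(scenario, "ondemand")].cache_create


# Largest gap between the reported Total and the sum of its components.
max_rounding_gap = max(abs(components_sum(u) - u.total) for u in ROWS.values())

ratios = {s: round(cache_create_ratio(s), 2) for s in ("Sentry Top Issues", "Incident Triage")}
```

On both scenarios expose-all creates roughly twice as many cache tokens as ondemand. That fits the idea that schemas published up front are paid for as cache writes, though the table alone does not prove this is the cause.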

Quality (LLM Judge)

Quality scores per scenario per variant. Judge: claude-opus-4-8, threshold min_score: 0.6. All 150 trials pass deterministic assertions.

| Scenario | capa-cli | capa-ondemand | capa-expose-all |
|---|---|---|---|
| Slack Release Note | 0.87 | 0.90 | 0.83 |
| Sentry Top Issues | 1.00 | 1.00 | 1.00 |
| GitHub Issue from Bug | 1.00 | 1.00 | 1.00 |
| Incident Triage | 1.00 | 1.00 | 1.00 |
| Refactor: Request ID | 0.99 | 0.99 | 0.99 |

Bottom Line

No quality regression: 150/150 deterministic assertions pass across all three modes.
~19% suite-wide savings: capa-cli and capa-ondemand each total $0.99, against $1.22 for the expose-all baseline.
Savings concentrate where they should: they are biggest on multi-backend / large-schema tasks, reaching 40% on Incident Triage (capa-ondemand vs expose-all).
Control task confirms the model: on Refactor: Request ID (no SaaS tools), capa adds roughly 8% (capa-cli) to 12% (capa-ondemand) overhead relative to expose-all.

Recommendation

Default to on-demand for general MCP-heavy agents. Use capa-cli (toolExposure: none) for shell-comfortable one-shot patterns. Keep expose-all only when there is no SaaS surface to filter.
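
The percentages in the Bottom Line can be checked by expressing each capa variant's mean cost relative to the expose-all baseline. The sketch below does this for the multi-backend scenario and the control case, assuming saving = (baseline − variant) / baseline, so a negative value means overhead. All names are illustrative.

```python
# Mean cost per run in USD for two scenarios, copied from the cost table.
MEAN_COST = {
    "Incident Triage":      {"capa-cli": 0.2783, "capa-ondemand": 0.2251, "capa-expose-all": 0.3755},
    "Refactor: Request ID": {"capa-cli": 0.1895, "capa-ondemand": 0.1970, "capa-expose-all": 0.1754},
}

BASELINE = "capa-expose-all"


def saving_vs_baseline(row: dict[str, float], variant: str) -> float:
    """Percent saved relative to expose-all; negative values mean the variant costs more."""
    baseline = row[BASELINE]
    return 100 * (baseline - row[variant]) / baseline


savings = {
    scenario: {
        variant: round(saving_vs_baseline(row, variant), 1)
        for variant in row
        if variant != BASELINE
    }
    for scenario, row in MEAN_COST.items()
}

for scenario, per_variant in savings.items():
    for variant, pct in per_variant.items():
        label = "saving" if pct >= 0 else "overhead"
        print(f"{scenario:22s} {variant:14s} {abs(pct):5.1f}% {label}")
```

Under this definition the control case shows an overhead of about 8% for capa-cli and about 12% for capa-ondemand. That spread is one reason to choose the mode per workload rather than globally.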

Related Documentation