Benchmarks
Performance comparison of CAPA's three tool exposure modes across 5 real-world scenarios, with 10 trials per scenario per mode (5 × 3 × 10 = 150 runs total), on claude-opus-4-8.
Summary Stats
- Trials passed: 150/150
- capa-cli total: $0.99
- capa-ondemand total: $0.99
- capa-expose-all total: $1.22
Variants Under Test
- capa-cli (toolExposure: none): the agent reaches tools through capa sh. No MCP schemas in the prompt.
- capa-ondemand (toolExposure: on-demand): two meta-tools (setup_tools, call_tool). Schemas are loaded lazily.
- capa-expose-all (toolExposure: expose-all): every tool's JSON schema is published up-front. Vanilla MCP baseline.
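
To make the on-demand idea concrete, here is a minimal Python sketch of lazy schema loading behind two meta-tools. Only the names setup_tools and call_tool come from the list above; their signatures, the catalog, and the example tools are assumptions made for illustration, not CAPA's actual implementation.

```python
from typing import Any, Callable

# Hypothetical catalog: tool name -> (JSON schema, handler).
# Tool names, schemas, and handlers are invented for this sketch.
CATALOG: dict[str, tuple[dict[str, Any], Callable[..., Any]]] = {
    "slack.post_message": (
        {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text": {"type": "string"},
            },
        },
        lambda channel, text: f"posted to {channel}: {text}",
    ),
    "sentry.list_issues": (
        {"type": "object", "properties": {"project": {"type": "string"}}},
        lambda project: [f"{project}-1", f"{project}-2"],
    ),
}

# Schemas the agent has requested so far. Nothing is loaded up-front.
loaded: dict[str, dict[str, Any]] = {}


def setup_tools(names: list[str]) -> dict[str, dict[str, Any]]:
    """Load and return schemas for the requested tools only."""
    for name in names:
        loaded[name] = CATALOG[name][0]
    return {name: loaded[name] for name in names}


def call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Invoke a tool, refusing if its schema was never loaded."""
    if name not in loaded:
        raise KeyError(f"{name} is not set up; call setup_tools first")
    handler = CATALOG[name][1]
    return handler(**arguments)
```

In this toy version the prompt only ever carries the two meta-tool definitions plus whatever setup_tools returned, which is the property the on-demand mode is designed around.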
Cost Per Run
Mean cost per scenario (10 trials each); lower is better. Chart series: capa-cli, capa-ondemand, capa-expose-all.
Per-Scenario Cost Breakdown
| Scenario | Schema Bulk | capa-cli | capa-ondemand | capa-expose-all | Best vs Worst |
|---|---|---|---|---|---|
| Slack Release Note | slack · 13 tools / 47 KB | $0.1179 ±0.0121 | $0.1440 ±0.0011 | $0.1649 ±0.0325 | 29% |
| Sentry Top Issues | sentry · 23 tools / 81 KB | $0.2602 ±0.0201 | $0.2479 ±0.0184 | $0.3474 ±0.0569 | 29% |
| GitHub Issue from Bug | github · 108 tools / 161 KB | $0.1465 ±0.0335 | $0.1719 ±0.0038 | $0.1596 ±0.0177 | 15% |
| Incident Triage | sentry+slack · 36 tools / ~128 KB | $0.2783 ±0.0202 | $0.2251 ±0.0053 | $0.3755 ±0.0363 | 40% |
| Refactor: Request ID | no SaaS — control case | $0.1895 ±0.0132 | $0.1970 ±0.0128 | $0.1754 ±0.0173 | 11% |
| Total (suite) | | $0.99 | $0.99 | $1.22 | 19% |
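
The "Best vs Worst" column appears to be the saving of the cheapest mode relative to the most expensive one, (worst − best) / worst. The sketch below recomputes it and the suite totals from the mean costs in the table; the formula is inferred from the figures rather than stated in the source, and the function names are illustrative.

```python
# Mean cost per run in USD, copied from the Per-Scenario Cost Breakdown table.
COSTS = {
    "Slack Release Note": {"capa-cli": 0.1179, "capa-ondemand": 0.1440, "capa-expose-all": 0.1649},
    "Sentry Top Issues": {"capa-cli": 0.2602, "capa-ondemand": 0.2479, "capa-expose-all": 0.3474},
    "GitHub Issue from Bug": {"capa-cli": 0.1465, "capa-ondemand": 0.1719, "capa-expose-all": 0.1596},
    "Incident Triage": {"capa-cli": 0.2783, "capa-ondemand": 0.2251, "capa-expose-all": 0.3755},
    "Refactor: Request ID": {"capa-cli": 0.1895, "capa-ondemand": 0.1970, "capa-expose-all": 0.1754},
}


def best_vs_worst(row: dict[str, float]) -> float:
    """Fractional saving of the cheapest mode relative to the most expensive."""
    best, worst = min(row.values()), max(row.values())
    return (worst - best) / worst


def suite_totals(costs: dict[str, dict[str, float]]) -> dict[str, float]:
    """Sum the per-scenario means for each mode."""
    variants = next(iter(costs.values())).keys()
    return {v: sum(row[v] for row in costs.values()) for v in variants}


for scenario, row in COSTS.items():
    print(f"{scenario}: {best_vs_worst(row):.0%}")

for variant, total in suite_totals(COSTS).items():
    print(f"{variant}: ${total:.2f}")
```

Note that the best mode differs by scenario: capa-cli is cheapest on the Slack and GitHub tasks, capa-ondemand on the two Sentry-backed tasks, and capa-expose-all on the control case.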
Token Detail
Mean per-run token usage across 10 trials.
| Scenario | Variant | Input | Cache Create | Cache Read | Output | Total |
|---|---|---|---|---|---|---|
| Slack Release Note | cli | 2,106 | 8,860 | 67,290 | 735 | 78,991 |
| | ondemand | 2,133 | 13,005 | 74,508 | 591 | 90,236 |
| | expose-all | 2,478 | 16,558 | 68,543 | 589 | 88,168 |
| Sentry Top Issues | cli | 2,226 | 20,650 | 127,307 | 2,255 | 152,437 |
| | ondemand | 2,269 | 19,006 | 107,489 | 2,562 | 131,326 |
| | expose-all | 2,815 | 37,795 | 101,080 | 1,862 | 143,553 |
| GitHub Issue from Bug | cli | 2,106 | 12,088 | 69,326 | 1,030 | 84,550 |
| | ondemand | 2,131 | 17,266 | 60,176 | 929 | 80,502 |
| | expose-all | 4,081 | 15,152 | 43,706 | 906 | 63,846 |
| Incident Triage | cli | 2,243 | 20,022 | 188,859 | 1,902 | 213,026 |
| | ondemand | 2,266 | 18,151 | 126,584 | 1,480 | 148,481 |
| | expose-all | 3,084 | 36,160 | 185,337 | 1,657 | 226,239 |
| Refactor: Request ID | cli | 2,150 | 10,419 | 113,205 | 2,283 | 128,056 |
| | ondemand | 2,151 | 10,103 | 123,733 | 2,450 | 138,438 |
| | expose-all | 2,124 | 8,896 | 106,678 | 2,235 | 119,932 |
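
Total tokens are a poor proxy for cost because the four token classes are billed at very different rates. The sketch below converts a row of the token table into dollars. The per-million-token rates are assumptions and are not stated in this document; they were chosen because they reproduce the reported mean costs for the rows checked here.

```python
# Assumed USD rates per million tokens (NOT given in the source document).
RATES = {
    "input": 5.00,
    "cache_create": 6.25,
    "cache_read": 0.50,
    "output": 25.00,
}


def run_cost(
    input_tokens: int,
    cache_create_tokens: int,
    cache_read_tokens: int,
    output_tokens: int,
    rates: dict[str, float] = RATES,
) -> float:
    """Estimated cost in USD of one run under the assumed rates."""
    micro_dollars = (
        input_tokens * rates["input"]
        + cache_create_tokens * rates["cache_create"]
        + cache_read_tokens * rates["cache_read"]
        + output_tokens * rates["output"]
    )
    return micro_dollars / 1_000_000


# Mean token counts copied from the Token Detail table.
slack_cli = run_cost(2_106, 8_860, 67_290, 735)
sentry_expose_all = run_cost(2_815, 37_795, 101_080, 1_862)
triage_ondemand = run_cost(2_266, 18_151, 126_584, 1_480)

print(f"Slack / cli:         ${slack_cli:.4f}")
print(f"Sentry / expose-all: ${sentry_expose_all:.4f}")
print(f"Triage / ondemand:   ${triage_ondemand:.4f}")
```

Under these assumed rates, a cache-create token costs 12.5 times as much as a cache-read token. That would explain why expose-all can use fewer total tokens than capa-cli on Sentry Top Issues yet cost more: its Cache Create count is nearly double.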
Quality (LLM Judge)
Quality scores per scenario per variant. Judge: claude-opus-4-8, threshold min_score: 0.6. All 150 trials pass deterministic assertions.
| Scenario | capa-cli | capa-ondemand | capa-expose-all |
|---|---|---|---|
| Slack Release Note | 0.87 | 0.90 | 0.83 |
| Sentry Top Issues | 1.00 | 1.00 | 1.00 |
| GitHub Issue from Bug | 1.00 | 1.00 | 1.00 |
| Incident Triage | 1.00 | 1.00 | 1.00 |
| Refactor: Request ID | 0.99 | 0.99 | 0.99 |
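
As a small sanity check on the threshold, the sketch below applies min_score: 0.6 to the scores in the table and lists any scenario/mode pair that falls below it. The data layout and function name are illustrative assumptions; only the scores and the threshold come from this section.

```python
MIN_SCORE = 0.6  # judge threshold from this section

# Mean judge scores copied from the table, in the order
# (capa-cli, capa-ondemand, capa-expose-all).
VARIANTS = ("capa-cli", "capa-ondemand", "capa-expose-all")
SCORES = {
    "Slack Release Note": (0.87, 0.90, 0.83),
    "Sentry Top Issues": (1.00, 1.00, 1.00),
    "GitHub Issue from Bug": (1.00, 1.00, 1.00),
    "Incident Triage": (1.00, 1.00, 1.00),
    "Refactor: Request ID": (0.99, 0.99, 0.99),
}


def below_threshold(scores, min_score=MIN_SCORE):
    """Return (scenario, variant, score) triples that miss the threshold."""
    return [
        (scenario, variant, score)
        for scenario, row in scores.items()
        for variant, score in zip(VARIANTS, row)
        if score < min_score
    ]


print(below_threshold(SCORES))
```

The only scenario where the modes differ at all is Slack Release Note, and even its lowest mean (0.83 for capa-expose-all) clears the threshold comfortably.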
Bottom Line
- No quality regression: 150/150 deterministic assertions pass across all three modes.
- ~19% suite-wide savings: both capa-cli and capa-ondemand total $0.99 against the $1.22 expose-all baseline.
- Savings concentrate where they should: they are biggest on multi-backend / large-schema tasks (up to 40% on incident triage).
- Control task confirms the model: on refactor-request-id (no SaaS tools), capa adds ~10% overhead (about 8% for capa-cli and 12% for capa-ondemand relative to expose-all).
Recommendation
Default to on-demand for general MCP-heavy agents. Use capa-cli (toolExposure: none) for shell-comfortable, one-shot patterns. Keep expose-all only when there is no SaaS surface to filter.
Related Documentation
- Tool Exposure — configuration and mode details
- Capabilities File — overall configuration reference
- capa sh — invoke tools from the command line