Benchmarks

Performance comparison of CAPA's three tool exposure modes across 5 real-world scenarios, with 10 trials per scenario per mode (150 runs total), on claude-opus-4-8.

Summary Stats

Trials passed: 150/150
capa-cli total: $0.99
capa-ondemand total: $0.99
capa-expose-all total: $1.22

Variants Under Test

capa-cli

toolExposure: none — agent reaches tools through capa sh. No MCP schemas in the prompt.

capa-ondemand

toolExposure: on-demand — two meta-tools (setup_tools, call_tool). Schemas loaded lazily.

capa-expose-all

toolExposure: expose-all — every tool's JSON schema published up-front. Vanilla MCP baseline.
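
The on-demand mode described above rests on lazy schema loading: the agent sees two meta-tools and requests full schemas only for the tools it needs. The sketch below is a toy illustration of that idea, not CAPA's implementation. The catalog contents, the tool names (`slack.post_message`, `sentry.list_issues`), the `OnDemandTools` class, and the rule that a tool must be set up before it is called are all assumptions made for the example; only the meta-tool names `setup_tools` and `call_tool` come from the text.

```python
from typing import Any, Callable, Dict, List

# Hypothetical catalog. In a real deployment the schemas would come from MCP servers.
CATALOG: Dict[str, Dict[str, Any]] = {
    "slack.post_message": {
        "schema": {
            "type": "object",
            "properties": {"channel": {"type": "string"}, "text": {"type": "string"}},
        },
        "handler": lambda args: f"posted to {args['channel']}",
    },
    "sentry.list_issues": {
        "schema": {"type": "object", "properties": {"project": {"type": "string"}}},
        "handler": lambda args: f"issues for {args['project']}",
    },
}


class OnDemandTools:
    """Toy registry exposing two meta-tools instead of every schema up front."""

    def __init__(self, catalog: Dict[str, Dict[str, Any]]) -> None:
        self._catalog = catalog
        self._loaded: Dict[str, Dict[str, Any]] = {}  # schemas requested so far

    def setup_tools(self, names: List[str]) -> Dict[str, Dict[str, Any]]:
        """Load and return schemas for the requested tools only."""
        for name in names:
            self._loaded[name] = self._catalog[name]["schema"]
        return {name: self._loaded[name] for name in names}

    def call_tool(self, name: str, args: Dict[str, Any]) -> Any:
        """Invoke a tool; assumed here to require a prior setup_tools call."""
        if name not in self._loaded:
            raise KeyError(f"{name} is not set up; call setup_tools first")
        handler: Callable[[Dict[str, Any]], Any] = self._catalog[name]["handler"]
        return handler(args)

    def loaded_names(self) -> List[str]:
        """Names of tools whose schemas have entered the context so far."""
        return sorted(self._loaded)
```

In this sketch only the schemas returned by `setup_tools` would ever reach the prompt, which is the intuition behind the lower cache-creation token counts reported for capa-ondemand on the schema-heavy scenarios.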

Cost Per Run

Mean cost per scenario (10 trials each). Lower is better.

[Chart: mean cost per run by scenario for capa-cli, capa-ondemand, and capa-expose-all.]

Per-Scenario Cost Breakdown

Scenario	Schema Bulk	capa-cli	capa-ondemand	capa-expose-all	Best vs Worst
Slack Release Note	slack · 13 tools / 47 KB	$0.1179 ±0.0121	$0.1440 ±0.0011	$0.1649 ±0.0325	29%
Sentry Top Issues	sentry · 23 tools / 81 KB	$0.2602 ±0.0201	$0.2479 ±0.0184	$0.3474 ±0.0569	29%
GitHub Issue from Bug	github · 108 tools / 161 KB	$0.1465 ±0.0335	$0.1719 ±0.0038	$0.1596 ±0.0177	15%
Incident Triage	sentry+slack · 36 tools / ~128 KB	$0.2783 ±0.0202	$0.2251 ±0.0053	$0.3755 ±0.0363	40%
Refactor: Request ID	no SaaS — control case	$0.1895 ±0.0132	$0.1970 ±0.0128	$0.1754 ±0.0173	11%
Total (suite)		$0.99	$0.99	$1.22	19%
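
The "Best vs Worst" column and the suite totals can be reproduced from the per-scenario means. The sketch below assumes the column is computed as (worst - best) / worst, which matches every row of the table to the nearest percent; the function names are illustrative and do not come from the benchmark harness.

```python
# Mean cost per run in USD, copied from the table above.
COSTS = {
    "Slack Release Note":    {"capa-cli": 0.1179, "capa-ondemand": 0.1440, "capa-expose-all": 0.1649},
    "Sentry Top Issues":     {"capa-cli": 0.2602, "capa-ondemand": 0.2479, "capa-expose-all": 0.3474},
    "GitHub Issue from Bug": {"capa-cli": 0.1465, "capa-ondemand": 0.1719, "capa-expose-all": 0.1596},
    "Incident Triage":       {"capa-cli": 0.2783, "capa-ondemand": 0.2251, "capa-expose-all": 0.3755},
    "Refactor: Request ID":  {"capa-cli": 0.1895, "capa-ondemand": 0.1970, "capa-expose-all": 0.1754},
}


def best_vs_worst(row):
    """Relative saving of the cheapest variant against the most expensive one."""
    best, worst = min(row.values()), max(row.values())
    return (worst - best) / worst


def suite_totals(costs):
    """Sum the per-scenario means for each variant."""
    variants = next(iter(costs.values())).keys()
    return {v: sum(row[v] for row in costs.values()) for v in variants}


for scenario, row in COSTS.items():
    cheapest = min(row, key=row.get)
    print(f"{scenario}: {best_vs_worst(row):.0%} (cheapest: {cheapest})")

totals = suite_totals(COSTS)
print({variant: round(total, 2) for variant, total in totals.items()})
print(f"suite: {best_vs_worst(totals):.0%}")
```

Printing the cheapest variant alongside the percentage makes it visible that the winner differs by scenario, which a single spread figure hides.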

Token Detail

Mean per-run token usage across 10 trials.

Scenario	Variant	Input	Cache Create	Cache Read	Output	Total
Slack Release Note	cli	2,106	8,860	67,290	735	78,991
	ondemand	2,133	13,005	74,508	591	90,236
	expose-all	2,478	16,558	68,543	589	88,168
Sentry Top Issues	cli	2,226	20,650	127,307	2,255	152,437
	ondemand	2,269	19,006	107,489	2,562	131,326
	expose-all	2,815	37,795	101,080	1,862	143,553
GitHub Issue from Bug	cli	2,106	12,088	69,326	1,030	84,550
	ondemand	2,131	17,266	60,176	929	80,502
	expose-all	4,081	15,152	43,706	906	63,846
Incident Triage	cli	2,243	20,022	188,859	1,902	213,026
	ondemand	2,266	18,151	126,584	1,480	148,481
	expose-all	3,084	36,160	185,337	1,657	226,239
Refactor: Request ID	cli	2,150	10,419	113,205	2,283	128,056
	ondemand	2,151	10,103	123,733	2,450	138,438
	expose-all	2,124	8,896	106,678	2,235	119,932
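
As a consistency check on the table above, the Total column should equal the sum of the four token columns. The sketch below verifies this with a tolerance of one token, on the assumption that small mismatches come from rounding per-column means, and also computes the share of each run served from cache reads. The helper names are illustrative.

```python
# (scenario, variant, input, cache_create, cache_read, output, total), from the table above.
ROWS = [
    ("Slack Release Note", "cli", 2106, 8860, 67290, 735, 78991),
    ("Slack Release Note", "ondemand", 2133, 13005, 74508, 591, 90236),
    ("Slack Release Note", "expose-all", 2478, 16558, 68543, 589, 88168),
    ("Sentry Top Issues", "cli", 2226, 20650, 127307, 2255, 152437),
    ("Sentry Top Issues", "ondemand", 2269, 19006, 107489, 2562, 131326),
    ("Sentry Top Issues", "expose-all", 2815, 37795, 101080, 1862, 143553),
    ("GitHub Issue from Bug", "cli", 2106, 12088, 69326, 1030, 84550),
    ("GitHub Issue from Bug", "ondemand", 2131, 17266, 60176, 929, 80502),
    ("GitHub Issue from Bug", "expose-all", 4081, 15152, 43706, 906, 63846),
    ("Incident Triage", "cli", 2243, 20022, 188859, 1902, 213026),
    ("Incident Triage", "ondemand", 2266, 18151, 126584, 1480, 148481),
    ("Incident Triage", "expose-all", 3084, 36160, 185337, 1657, 226239),
    ("Refactor: Request ID", "cli", 2150, 10419, 113205, 2283, 128056),
    ("Refactor: Request ID", "ondemand", 2151, 10103, 123733, 2450, 138438),
    ("Refactor: Request ID", "expose-all", 2124, 8896, 106678, 2235, 119932),
]


def total_mismatch(row):
    """Difference between the summed columns and the reported Total."""
    _, _, inp, cache_create, cache_read, out, total = row
    return (inp + cache_create + cache_read + out) - total


def cache_read_share(row):
    """Fraction of the reported total that was served from cache reads."""
    return row[4] / row[6]


for row in ROWS:
    # Allow a one-token difference, assumed to be rounding of the per-column means.
    assert abs(total_mismatch(row)) <= 1, row
    print(f"{row[0]} / {row[1]}: cache-read share {cache_read_share(row):.1%}")
```

Cache reads dominate every row, so differences in cache-creation tokens (where schema size would be expected to show up) are a small part of the totals even when they move cost noticeably.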

Quality (LLM Judge)

Quality scores per scenario per variant. Judge: claude-opus-4-8, threshold min_score: 0.6. All 150 trials pass deterministic assertions.

Scenario	capa-cli	capa-ondemand	capa-expose-all
Slack Release Note	0.87	0.90	0.83
Sentry Top Issues	1.00	1.00	1.00
GitHub Issue from Bug	1.00	1.00	1.00
Incident Triage	1.00	1.00	1.00
Refactor: Request ID	0.99	0.99	0.99
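
The threshold check can be expressed directly against the scores above. This sketch applies `min_score: 0.6` to the per-scenario scores in the table, which is an assumption; the harness may apply it per trial instead. The function names are illustrative.

```python
MIN_SCORE = 0.6  # judge threshold stated above

# Judge scores per scenario, copied from the table above.
QUALITY = {
    "Slack Release Note":    {"capa-cli": 0.87, "capa-ondemand": 0.90, "capa-expose-all": 0.83},
    "Sentry Top Issues":     {"capa-cli": 1.00, "capa-ondemand": 1.00, "capa-expose-all": 1.00},
    "GitHub Issue from Bug": {"capa-cli": 1.00, "capa-ondemand": 1.00, "capa-expose-all": 1.00},
    "Incident Triage":       {"capa-cli": 1.00, "capa-ondemand": 1.00, "capa-expose-all": 1.00},
    "Refactor: Request ID":  {"capa-cli": 0.99, "capa-ondemand": 0.99, "capa-expose-all": 0.99},
}


def below_threshold(quality, min_score=MIN_SCORE):
    """Return (scenario, variant, score) triples that fall under the threshold."""
    return [
        (scenario, variant, score)
        for scenario, row in quality.items()
        for variant, score in row.items()
        if score < min_score
    ]


def lowest_score(quality):
    """Return the single lowest (scenario, variant, score) in the table."""
    return min(
        ((s, v, score) for s, row in quality.items() for v, score in row.items()),
        key=lambda item: item[2],
    )


print("below threshold:", below_threshold(QUALITY))
print("lowest:", lowest_score(QUALITY))
```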

Bottom Line

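No quality regression: all 150 trials pass their deterministic assertions across the three modes, and every judge score is above the 0.6 threshold.

~19% suite-wide savings: capa-cli and capa-ondemand each total $0.99, against $1.22 for the expose-all baseline.

Savings concentrate where they should: the largest gap is on the multi-backend Incident Triage scenario (40%), followed by Slack Release Note and Sentry Top Issues (29% each). On the no-SaaS control case, expose-all is the cheapest mode.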

Recommendation

Default to on-demand for general MCP-heavy agents. Use capa-cli (toolExposure: none) for shell-comfortable, one-shot patterns. Keep expose-all only when there is no SaaS surface to filter.