How a production 7-agent AI rationalization pipeline manages token spend across a 36-application enterprise portfolio — measured from real AWS Bedrock Converse API usage fields, not estimated.
Agents 1–3 are independent and could run in parallel. Agents 4–6 depend on upstream outputs. Portfolio Narrative runs once after all 36 apps complete.
| Token Category | Tokens | % of Input |
|---|---|---|
| System context (injected every call) | ~153,250 | 8.4% |
| Agent-specific prompts | ~420,000 | 23.1% |
| App metadata + infrastructure inventory | ~380,000 | 20.9% |
| Upstream agent outputs (passed downstream) | ~680,000 | 37.5% |
| Retry overhead (re-sent context) | ~183,000 | 10.1% |
| Total Input | 1,815,946 | 100% |
37.5% of all input tokens are upstream agent outputs being passed to downstream agents. The Confidence Advisor alone receives ~26,781 tokens per app — the full outputs of all 5 upstream agents.
The Synthesizer already applies field filtering (49% input reduction vs. passing full blobs). Applying the same pattern to the Confidence Advisor is the single highest-ROI optimization available.
Retry overhead at 10.1% is the second largest avoidable cost — driven by the 44% retry rate on Telemetry and Provisioning.
| Agent | Expected | Actual | Retries | Retry Rate | Est. Retry Cost |
|---|---|---|---|---|---|
| Telemetry | 36 | 52 | 16 | 44% | ~$0.91 |
| Provisioning | 36 | 52 | 16 | 44% | ~$0.72 |
| Dependency | 36 | 38 | 2 | 6% | ~$0.15 |
| Portfolio Narrative | 1 | 2 | 1 | 100% | ~$0.43 |
| Synthesizer | 36 | 36 | 0 | 0% | $0 |
| Confidence Advisor | 36 | 36 | 0 | 0% | $0 |
| Procurement | 36 | 36 | 0 | 0% | $0 |
| TOTAL | 252 | 250 | 35 | — | ~$2.21 |
JSON parse failures (C2 retry): The model returns malformed JSON when the user message is large and variable — primarily the infrastructure VM list in Telemetry and the workload classification payload in Provisioning. The pipeline retries with explicit <JSON>...</JSON> markers.
Validation retries: 0 in this run. The V1–V22 validator ran after every agent call. Zero rule failures means every output passed governance on first or second attempt — the retry mechanism worked as designed.
Retries account for ~$2.21 (9.2%) of total run cost. Pre-summarizing the VM list before sending to Telemetry would reduce this to near zero.
At 28,706 bytes, portfolio_narrative.txt contains a complete few-shot example — the full 10-app sample output (~5,000 tokens). This example is injected on every portfolio narrative call. Moving it to a separate reference file loaded only when needed would save ~4,500 tokens per run with no quality impact.
The Confidence Advisor receives full JSON blobs from all 5 upstream agents (~26,781 tokens/call). Apply the same field-filtering pattern already used in the Synthesizer — pass only the fields the advisor actually needs.
The 44% retry rate on Telemetry is caused by large, variable infrastructure VM lists in the user message. Pre-compute fleet statistics (avg CPU, avg RAM, OS distribution) in the enricher and send the summary instead of the raw VM array.
System context + agent prompts are identical across all 36 apps for a given agent. Bedrock prompt caching would cache these on first call and serve cached tokens at ~90% discount on subsequent calls — ~97% cache hit rate per agent.
AWS Bedrock Batch Inference processes requests asynchronously at 50% of on-demand pricing. For full portfolio runs where same-day results are not required, batch processing halves the cost with no changes to agent logic.
Bedrock structured output guarantees valid JSON on every call, eliminating all JSON parse retries and the C2 retry overhead. This would reduce the 44% retry rate on Telemetry and Provisioning to 0%.
Agents 1–3 (Telemetry, Dependency, Procurement) are independent per app. Running them in parallel with asyncio would cut wall-clock time from 4.9 hours to ~2.5 hours. Zero token savings — pure latency improvement.
The portfolio_narrative prompt contains a 5,000-token few-shot example. Moving it to a separate reference file loaded only when needed would reduce prompt size by ~70% with no quality impact.
Applying OPT-1 and OPT-3 today (code changes only, no infrastructure): ~$2.34/run reduction. Adding OPT-5 and OPT-6 when available: ~$16.34/run total reduction.
| Approach | Tokens/App | Cost/App | Cost/36-App Run | Notes |
|---|---|---|---|---|
| Naive monolithic prompt (all agents in one call) | ~150,000 | ~$1.50 | ~$54.00 | No retry isolation, no governance, unreliable math |
| 6R-ARF v4.4.2 (current) | ~84,749 | $0.67 | $23.97 | 7 specialized agents, provenance tagging, V1–V22 validation |
| Pipeline with OPT-1+3 applied | ~69,000 | ~$0.55 | ~$19.80 | Confidence advisor compression + telemetry pre-summary |
| Pipeline with batch processing | ~84,749 | ~$0.33 | ~$11.99 | Same tokens, 50% Bedrock batch pricing discount |
| Pipeline fully optimized (all OPTs) | ~55,000 | ~$0.18 | ~$6.48 | Batch + caching + compression + structured output |
The current pipeline is already 44% more token-efficient than a naive monolithic approach, primarily due to Synthesizer input filtering and the Python provisioning engine (no LLM math).
The Provisioning agent uses LLM only for workload classification. Python handles all financial math. This eliminates the need for the LLM to reason through multi-step calculations — which would require large output budgets and produce unreliable results.
The Synthesizer receives filtered summaries of upstream outputs, not full JSON blobs. This 49% input token reduction was achieved by identifying exactly which fields each downstream agent needs and stripping everything else before the API call.
The system context was reduced from ~300 lines to ~60 lines by removing examples, rationale, and historical context. The full governance document exists for human reference; the condensed version is what the model needs at inference time. This saved ~446,750 tokens per run vs. v3.0.
The C2 JSON parse retry is a safety net, not a design goal. Every retry doubles the token cost of that call. The 44% retry rate on Telemetry and Provisioning represents a structural inefficiency — the input payload is too large and variable, causing occasional JSON boundary failures.
Provenance tagging (wrapping every numeric value in a structured object) roughly triples output JSON size. This is a deliberate trade-off — governance and auditability justify the cost. But every additional output field compounds across 36 apps × 7 agents × potential retries.
The Portfolio Narrative agent processes all 36 apps in a single call (~79,712 tokens, $0.43) — less than 2% of total run cost. This is because it receives pre-summarized per-app data (~1,016 tokens for all 36 apps) rather than full agent outputs.