AI Pipeline Token Economics

How a production 7-agent AI rationalization pipeline manages token spend across a 36-application enterprise portfolio — measured from real AWS Bedrock Converse API usage fields, not estimated.

Pipeline: 6R-ARF v4.4.2
Engagement: Healthcare Enterprise · Imaging Division
Portfolio: 36 apps · 408 VMs
Model: Claude Sonnet 4.6 on AWS Bedrock
Run date: 2026-05-22
Run Summary
Total Tokens: 3.05M
Total Cost: $23.97
Cost per App: $0.67
API Calls: 250
Wall-Clock Time: 4.9 hrs
Apps Analyzed: 36
Agent Pipeline — Token Flow per Application
Agent 1 (Telemetry): ~8,500 in · ~9,500 out · $6.06 total · 44% retry
Agent 2 (Dependency): ~5,460 in · ~3,785 out · $2.63 total · 6% retry
Agent 3 (Procurement): ~5,922 in · ~1,870 out · $1.65 total · 0% retry
Agent 4 (Provisioning): ~7,196 in · ~8,417 out · $5.32 total · 44% retry
Agent 5 (Synthesizer): ~8,223 in · ~4,861 out · $3.51 total · 0% retry
Agent 6 (Confidence Advisor): ~15,166 in · ~5,849 out · $4.80 total · 0% retry
Portfolio Narrative: ~31,779 in · ~8,077 out · $0.43 total · 1 run

Agents 1–3 are independent and could run in parallel. Agents 4–6 depend on upstream outputs. Portfolio Narrative runs once after all 36 apps complete.
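
The dollar figures in the agent cards follow directly from the token counts. The sketch below is a minimal illustration, assuming on-demand prices of $3 per million input tokens and $15 per million output tokens. Those rates are not stated in this report; they are an assumption that happens to reproduce the listed totals. It reads the inputTokens and outputTokens keys of the Converse API usage object. The function names are illustrative.

```python
# Assumed prices in USD per million tokens; not stated in the report.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def call_cost(usage: dict) -> float:
    """Cost of one call, given a Converse-style usage dict."""
    return (
        usage["inputTokens"] * INPUT_USD_PER_MTOK
        + usage["outputTokens"] * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


def agent_cost(avg_in: int, avg_out: int, apps: int = 36) -> float:
    """Total cost for one agent from its per-app average token counts."""
    return apps * call_cost({"inputTokens": avg_in, "outputTokens": avg_out})


if __name__ == "__main__":
    print(f"Dependency:  ${agent_cost(5_460, 3_785):.2f}")  # $2.63
    print(f"Procurement: ${agent_cost(5_922, 1_870):.2f}")  # $1.65
```

Under the same assumed prices, output tokens cost five times as much as input tokens, which is why Telemetry and Provisioning lead on cost despite modest input sizes.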

Cost Distribution by Agent
Telemetry: 25.3% · $6.06 · 648K tokens · 52 calls
Provisioning: 22.2% · $5.32 · 562K tokens · 52 calls
Confidence Advisor: 20.0% · $4.80 · 757K tokens · 36 calls
Synthesizer: 14.6% · $3.51 · 471K tokens · 36 calls
Dependency: 11.0% · $2.63 · 333K tokens · 38 calls
Procurement: 6.9% · $1.65 · 281K tokens · 36 calls
Portfolio Narrative: 1.8% · $0.43 · 80K tokens · 2 calls
Where Tokens Go: Anatomy of Input Tokens Across the Run
Token Category | Tokens | % of Input
System context (injected every call) | ~153,250 | 8.4%
Agent-specific prompts | ~420,000 | 23.1%
App metadata + infrastructure inventory | ~380,000 | 20.9%
Upstream agent outputs (passed downstream) | ~680,000 | 37.5%
Retry overhead (re-sent context) | ~183,000 | 10.1%
Total Input | 1,815,946 | 100%

Key Insight: Upstream Passthrough is the Largest Input Driver

37.5% of all input tokens are upstream agent outputs being passed to downstream agents. The Confidence Advisor alone receives ~26,781 tokens per app — the full outputs of all 5 upstream agents.

The Synthesizer already applies field filtering (49% input reduction vs. passing full blobs). Applying the same pattern to the Confidence Advisor is the single highest-ROI optimization available.

Retry overhead at 10.1% is the second largest avoidable cost — driven by the 44% retry rate on Telemetry and Provisioning.

Retry Economics — The Hidden Tax
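Agent | Expected | Actual | Retries | Retry Rate | Est. Retry Cost
Telemetry | 36 | 52 | 16 | 44% | ~$0.91
Provisioning | 36 | 52 | 16 | 44% | ~$0.72
Dependency | 36 | 38 | 2 | 6% | ~$0.15
Portfolio Narrative | 1 | 2 | 1 | 100% | ~$0.43
Synthesizer | 36 | 36 | 0 | 0% | $0
Confidence Advisor | 36 | 36 | 0 | 0% | $0
Procurement | 36 | 36 | 0 | 0% | $0
TOTAL | 217 | 252 | 35 | - | ~$2.21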
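
The Retries and Retry Rate columns follow from the expected and actual call counts: retries are actual minus expected, and the rate is retries divided by expected. A small sketch of that arithmetic using the counts from the table; the dictionary layout and function name are illustrative.

```python
# (expected calls, actual calls) per agent, copied from the table above.
CALLS = {
    "Telemetry": (36, 52),
    "Provisioning": (36, 52),
    "Dependency": (36, 38),
    "Portfolio Narrative": (1, 2),
    "Synthesizer": (36, 36),
    "Confidence Advisor": (36, 36),
    "Procurement": (36, 36),
}


def retry_stats(expected: int, actual: int) -> tuple[int, float]:
    """Return (retries, retry rate as a fraction of expected calls)."""
    retries = actual - expected
    return retries, retries / expected


if __name__ == "__main__":
    for agent, (expected, actual) in CALLS.items():
        retries, rate = retry_stats(expected, actual)
        print(f"{agent:20s} retries={retries:2d} rate={rate:.0%}")
    print("total retries:", sum(a - e for e, a in CALLS.values()))
```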

Why Retries Happen

JSON parse failures (C2 retry): The model returns malformed JSON when the user message is large and variable — primarily the infrastructure VM list in Telemetry and the workload classification payload in Provisioning. The pipeline retries with explicit <JSON>...</JSON> markers.

Validation retries: 0 in this run. The V1–V22 validator ran after every agent call. Zero rule failures means every output passed governance on first or second attempt — the retry mechanism worked as designed.

Retries account for ~$2.21 (9.2%) of total run cost. Pre-summarizing the VM list before sending it to Telemetry targets the largest share of that (~$0.91); the Provisioning payload would need similar treatment to bring the total near zero.
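
The parse-then-retry behaviour described in this section can be sketched as follows. This is a minimal illustration, not the pipeline's actual C2 implementation: call_model is a hypothetical callable that takes a prompt string and returns the model's text, and the retry wording is invented for the example.

```python
import json
import re
from typing import Any, Callable

JSON_BLOCK = re.compile(r"<JSON>(.*?)</JSON>", re.DOTALL)

# Hypothetical instruction appended on retry.
RETRY_SUFFIX = "\n\nReturn only valid JSON, wrapped exactly as <JSON>...</JSON>."


def call_with_json_retry(
    call_model: Callable[[str], str], prompt: str, max_retries: int = 1
) -> tuple[Any, int]:
    """Parse the reply as JSON, retrying with explicit markers on failure.

    Returns (parsed object, number of retries used).
    """
    text = call_model(prompt)
    for attempt in range(max_retries + 1):
        match = JSON_BLOCK.search(text)
        candidate = match.group(1) if match else text
        try:
            return json.loads(candidate), attempt
        except json.JSONDecodeError:
            if attempt == max_retries:
                raise
            # The full prompt is re-sent along with the marker instruction;
            # this re-sent context is what shows up as retry overhead.
            text = call_model(prompt + RETRY_SUFFIX)
    raise RuntimeError("unreachable")
```

Returning the retry count alongside the parsed object makes it straightforward to log retries per agent and treat them as a cost metric.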

Prompt Engineering — Size vs. Necessity
Prompt file sizes (tokens estimated at 4 chars/token). System context is injected into every agent call.
portfolio_narrative: ~7,176 tokens (5,000 are a few-shot example) · 28,706 bytes
synthesizer: ~3,589 tokens · 14,359 bytes
procurement: ~3,158 tokens · 12,635 bytes
provisioning: ~3,143 tokens · 12,575 bytes
dependency: ~2,471 tokens · 9,887 bytes
telemetry: ~2,032 tokens · 8,131 bytes
confidence_advisor: ~1,578 tokens · 6,314 bytes
system_context: ~613 tokens · 2,454 bytes (× 250 calls = 153K tokens total)
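
The token estimates in this list come from the 4 chars/token heuristic applied to file sizes. A minimal sketch of that calculation, assuming one byte per character (plain ASCII prompt files); the names are illustrative, and a real tokenizer would give somewhat different counts.

```python
CHARS_PER_TOKEN = 4  # heuristic stated above, not a tokenizer measurement

# Prompt file sizes in bytes, as listed above.
PROMPT_BYTES = {
    "portfolio_narrative": 28_706,
    "synthesizer": 14_359,
    "procurement": 12_635,
    "provisioning": 12_575,
    "dependency": 9_887,
    "telemetry": 8_131,
    "confidence_advisor": 6_314,
    "system_context": 2_454,
}


def estimate_tokens(n_bytes: int) -> int:
    """Rough token estimate from a file size, rounded down."""
    return n_bytes // CHARS_PER_TOKEN


if __name__ == "__main__":
    for name, size in PROMPT_BYTES.items():
        print(f"{name:20s} ~{estimate_tokens(size):,} tokens")
    # System context is sent on every call: 613 tokens x 250 calls.
    print(estimate_tokens(PROMPT_BYTES["system_context"]) * 250)  # 153250
```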

The Portfolio Narrative Prompt is 2× Larger Than Any Other

At 28,706 bytes, portfolio_narrative.txt contains a complete few-shot example: the full 10-app sample output (~5,000 tokens). This example is injected on every portfolio narrative call. Moving it to a separate reference file loaded only when needed would save ~4,500 tokens per call (~9,000 per run at the two calls observed here), with no expected quality impact.

Optimization Roadmap
P1 · High Impact

OPT-1: Confidence Advisor Input Compression

The Confidence Advisor receives full JSON blobs from all 5 upstream agents (~26,781 tokens/call). Apply the same field-filtering pattern already used in the Synthesizer — pass only the fields the advisor actually needs.

Estimated savings: ~540,000 tokens/run · ~$1.62/run · 6.8% cost reduction
Complexity: Low — code change in agents.py only
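
A field filter of the kind OPT-1 describes could look like the sketch below. The allow-list contents are hypothetical placeholders; the real field names depend on each agent's output schema, which this report does not show.

```python
from typing import Any

# Hypothetical allow-list: these field names are placeholders, not the
# pipeline's real schema.
ADVISOR_FIELDS = {
    "telemetry": ["utilization_summary", "confidence"],
    "dependency": ["critical_dependencies", "confidence"],
    "procurement": ["license_risk", "confidence"],
    "provisioning": ["workload_class", "confidence"],
    "synthesizer": ["recommendation", "confidence"],
}


def filter_upstream(
    outputs: dict[str, dict[str, Any]], allow: dict[str, list[str]]
) -> dict[str, dict[str, Any]]:
    """Keep only allow-listed top-level fields from each upstream output."""
    return {
        agent: {key: blob[key] for key in allow.get(agent, []) if key in blob}
        for agent, blob in outputs.items()
    }
```

An explicit allow-list fails closed: a new upstream field is dropped until someone decides the advisor needs it, so input size does not grow silently.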
P1 · High Impact

OPT-3: Telemetry VM List Pre-Summarization

The 44% retry rate on Telemetry is caused by large, variable infrastructure VM lists in the user message. Pre-compute fleet statistics (avg CPU, avg RAM, OS distribution) in the enricher and send the summary instead of the raw VM array.

Estimated savings: ~48,000 tokens/run · ~$0.72/run · retry rate 44% → ~10%
Complexity: Medium — enricher + data_loader change
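
The pre-summarization step in OPT-3 might be sketched as below. The VM record keys (cpu_cores, ram_gb, os) are assumptions for illustration; the actual inventory format is not shown in this report.

```python
from collections import Counter
from statistics import mean


def summarize_fleet(vms: list[dict]) -> dict:
    """Collapse a raw VM list into fixed-size fleet statistics.

    Assumes each record has 'cpu_cores', 'ram_gb' and 'os' keys
    (hypothetical names).
    """
    if not vms:
        return {"vm_count": 0, "avg_cpu_cores": 0.0,
                "avg_ram_gb": 0.0, "os_distribution": {}}
    return {
        "vm_count": len(vms),
        "avg_cpu_cores": round(mean(vm["cpu_cores"] for vm in vms), 1),
        "avg_ram_gb": round(mean(vm["ram_gb"] for vm in vms), 1),
        "os_distribution": dict(Counter(vm["os"] for vm in vms)),
    }
```

The summary has the same shape whether an app has 3 VMs or 60, which removes the size variability the retry analysis points to.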
P2 · Near-Term

OPT-5: AWS Bedrock Prompt Caching

System context + agent prompts are identical across all 36 apps for a given agent. Bedrock prompt caching would cache these on first call and serve cached tokens at ~90% discount on subsequent calls — ~97% cache hit rate per agent.

Estimated savings: ~750K–1M tokens cacheable · ~$2.00–$2.70/run · 10% cost reduction
Complexity: Low — Bedrock feature flag when available
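
The savings range above can be reproduced with simple arithmetic. The sketch assumes an input price of $3 per million tokens (not stated in this report), takes the ~90% discount from the text, ignores any cache-write surcharge, and assumes cache entries stay warm between an agent's consecutive calls.

```python
INPUT_USD_PER_MTOK = 3.00   # assumed on-demand input price
CACHE_READ_DISCOUNT = 0.90  # "~90% discount" from the text above


def cache_hit_rate(calls_per_agent: int) -> float:
    """First call writes the cache; every later call can read it."""
    return (calls_per_agent - 1) / calls_per_agent


def caching_savings_usd(cached_read_tokens: int) -> float:
    """Dollars saved when this many input tokens are served from cache."""
    return cached_read_tokens * INPUT_USD_PER_MTOK * CACHE_READ_DISCOUNT / 1_000_000


if __name__ == "__main__":
    print(f"hit rate: {cache_hit_rate(36):.0%}")                     # 97%
    print(f"low:  ${caching_savings_usd(750_000):.2f}")
    print(f"high: ${caching_savings_usd(1_000_000):.2f}")
```

If calls for the same agent are spaced far apart, as in a sequential per-app run, cached entries may expire before reuse and the realized hit rate would be lower than this estimate.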
P2 · Near-Term

OPT-6: Bedrock Batch Processing API

AWS Bedrock Batch Inference processes requests asynchronously at 50% of on-demand pricing. For full portfolio runs where same-day results are not required, batch processing halves the cost with no changes to agent logic.

Estimated savings: ~$12.00/run · 50% cost reduction · trade-off: hours not minutes
Complexity: Low — API mode change, not suitable for interactive runs
P3 · Future

OPT-4: Structured Output Mode

Bedrock structured output, where available for the model, would constrain every response to valid JSON, eliminating JSON parse retries and the C2 retry overhead. Since every retry in this run was a JSON parse failure, this would reduce the 44% retry rate on Telemetry and Provisioning to 0%.

Estimated savings: ~50,000 tokens/run · ~$1.50–$2.00/run · eliminates retry tax
Complexity: Low — Bedrock feature flag when available for Claude Sonnet 4.6
P3 · Future

OPT-7: Agent Parallelization

Agents 1–3 (Telemetry, Dependency, Procurement) are independent per app. Running them in parallel with asyncio would cut wall-clock time from 4.9 hours to ~2.5 hours. Zero token savings — pure latency improvement.

Estimated savings: ~2.4 hours wall-clock · $0 token savings · subject to Bedrock rate limits
Complexity: Medium — async refactor of pipeline_runner.py
P4 · Low Impact

OPT-2: Portfolio Narrative Prompt Compression

The portfolio_narrative prompt contains a 5,000-token few-shot example. Moving it to a separate reference file loaded only when needed would reduce prompt size by ~70% with no quality impact.

Estimated savings: ~9,000 tokens/run · ~$0.07/run · improves maintainability
Complexity: Low — prompt file edit only
Combined P1+P2

Total Achievable Savings

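Applying OPT-1 and OPT-3 today (code changes only, no infrastructure): ~$2.34/run reduction. Adding OPT-5 and OPT-6 when available: up to ~$16.34/run total reduction. This total sums the individual estimates; because the batch discount would apply to an already-reduced bill, the combined figure is an upper bound.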

P1 only (today): ~$2.34/run · 10% reduction · $23.97 → $21.63
P1+P2 (near-term): ~$16.34/run · 68% reduction · $23.97 → $7.63
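
The totals above sum the individual estimates. The sketch below reproduces that sum and, for comparison, a sequential variant in which the 50% batch discount applies only to what remains after the other savings. It uses the low end of the OPT-5 range, and whether batch pricing and prompt caching can be combined is an assumption not confirmed here.

```python
BASELINE_USD = 23.97
OPT1_USD = 1.62        # Confidence Advisor input compression
OPT3_USD = 0.72        # Telemetry pre-summarization
OPT5_USD = 2.00        # prompt caching, low end of the estimate
BATCH_SAVING_USD = 12.00  # the report's rounded batch estimate
BATCH_DISCOUNT = 0.50


def additive_cost() -> float:
    """Subtract every estimate from the baseline, as the totals above do."""
    return BASELINE_USD - (OPT1_USD + OPT3_USD + OPT5_USD + BATCH_SAVING_USD)


def sequential_cost() -> float:
    """Apply the batch discount to the bill left after the other savings."""
    remaining = BASELINE_USD - (OPT1_USD + OPT3_USD + OPT5_USD)
    return remaining * (1 - BATCH_DISCOUNT)


if __name__ == "__main__":
    print(f"additive:   ${additive_cost():.2f}")   # $7.63
    print(f"sequential: ${sequential_cost():.3f}")
```

Under these assumptions the sequential estimate lands a little under $10 per run, so the additive $7.63 is best read as a floor on cost.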
Benchmark — Pipeline vs. Alternative Approaches
Approach | Tokens/App | Cost/App | Cost/36-App Run | Notes
Naive monolithic prompt (all agents in one call) | ~150,000 | ~$1.50 | ~$54.00 | No retry isolation, no governance, unreliable math
6R-ARF v4.4.2 (current) | ~84,749 | $0.67 | $23.97 | 7 specialized agents, provenance tagging, V1–V22 validation
Pipeline with OPT-1+3 applied | ~69,000 | ~$0.55 | ~$19.80 | Confidence advisor compression + telemetry pre-summary
Pipeline with batch processing | ~84,749 | ~$0.33 | ~$11.99 | Same tokens, 50% Bedrock batch pricing discount
Pipeline fully optimized (all OPTs) | ~55,000 | ~$0.18 | ~$6.48 | Batch + caching + compression + structured output

The current pipeline already uses about 44% fewer tokens per app than the naive monolithic estimate (~84,749 vs. ~150,000), primarily due to Synthesizer input filtering and the Python provisioning engine (no LLM math).

Tokenomics Principles — Applied in This Pipeline
Principle 1

Separate Classification from Computation

The Provisioning agent uses LLM only for workload classification. Python handles all financial math. This eliminates the need for the LLM to reason through multi-step calculations — which would require large output budgets and produce unreliable results.

Rule: Use LLM for judgment. Use deterministic code for computation.
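
The split between judgment and computation can be illustrated with a toy example: the model contributes only a class label, and a deterministic function does the arithmetic. The class names and unit rates below are made-up placeholders, not values from the pipeline's provisioning engine.

```python
# Hypothetical class labels and unit rates; placeholders for illustration only.
MONTHLY_USD_PER_VCPU = {"steady_state": 20.0, "bursty": 28.0, "batch": 12.0}


def monthly_compute_cost(workload_class: str, vcpus: int) -> float:
    """Deterministic cost math, driven by a label the LLM supplied."""
    if workload_class not in MONTHLY_USD_PER_VCPU:
        # Reject labels outside the known set instead of guessing a rate.
        raise ValueError(f"unknown workload class: {workload_class!r}")
    return MONTHLY_USD_PER_VCPU[workload_class] * vcpus


if __name__ == "__main__":
    llm_output = {"workload_class": "bursty"}  # stands in for a model response
    print(monthly_compute_cost(llm_output["workload_class"], vcpus=16))  # 448.0
```

Because the model only emits a short label, its output budget stays small and the financial figures are reproducible from code alone.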
Principle 2

Filter Before Sending

The Synthesizer receives filtered summaries of upstream outputs, not full JSON blobs. This 49% input token reduction was achieved by identifying exactly which fields each downstream agent needs and stripping everything else before the API call.

Rule: Never send a full upstream output downstream. Identify the minimum field set and filter explicitly.
Principle 3

Condense Governance, Don't Repeat It

The system context was reduced from ~300 lines to ~60 lines by removing examples, rationale, and historical context. The full governance document exists for human reference; the condensed version is what the model needs at inference time. This saved ~446,750 tokens per run vs. v3.0.

Rule: Governance documents are for humans. System prompts are for models. Keep them separate and minimal.
Principle 4

Retry is a Tax, Not a Feature

The C2 JSON parse retry is a safety net, not a design goal. Every retry roughly doubles the token cost of that call, because the full context is re-sent. The 44% retry rate on Telemetry and Provisioning represents a structural inefficiency: the input payload is too large and variable, causing frequent JSON boundary failures.

Rule: Measure retry rates per agent. Treat them as a cost metric. Design inputs to minimize retry probability.
Principle 5

Output Verbosity Has a Price

Provenance tagging (wrapping every numeric value in a structured object) roughly triples output JSON size. This is a deliberate trade-off: governance and auditability justify the cost. But every additional output field compounds across 36 apps × 6 per-app agents × potential retries.

Rule: Every output field is a token expenditure. Require justification for verbose schemas. Provenance tagging is justified; decorative narrative is not.
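
The size effect of provenance tagging is easy to see by serializing the same values with and without a wrapper. The wrapper fields used here (value, source, method) are a hypothetical schema for illustration; the pipeline's real provenance object is not shown in this report, so the ratio printed will differ from the pipeline's.

```python
import json
from typing import Any


def with_provenance(value: Any, source: str, method: str) -> dict:
    """Wrap a bare value in a provenance object (hypothetical schema)."""
    return {"value": value, "source": source, "method": method}


def json_size(obj: Any) -> int:
    """Length in characters of the compact JSON encoding."""
    return len(json.dumps(obj, separators=(",", ":")))


bare = {"monthly_cost_usd": 1234.5, "vm_count": 12}
tagged = {
    key: with_provenance(val, "inventory_export", "computed")
    for key, val in bare.items()
}

if __name__ == "__main__":
    print(json_size(bare), json_size(tagged))
    print(f"ratio: {json_size(tagged) / json_size(bare):.1f}x")
```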
Principle 6

Portfolio-Level Agents Are Cheap

The Portfolio Narrative agent processes all 36 apps in a single pass (~79,712 tokens across two calls, including one retry; $0.43), which is less than 2% of total run cost. This is because it receives pre-summarized per-app data (~1,016 tokens for all 36 apps) rather than full agent outputs.

Rule: Aggregate at the data layer before sending to the model. Pre-summarize per-item outputs into the minimum representation needed for portfolio-level reasoning.