AI Pipeline Tokenomics · Eric Sippel

Run Summary

Metric	Value
Total Tokens	3.05M
Total Cost	$23.97
Cost per App	$0.67
API Calls	250
Wall-Clock Time	4.9 hrs
Apps Analyzed	36

Agent Pipeline — Token Flow per Application

Flow: Agent 1 → Agent 2 → Agent 3 → Agent 4 → Agent 5 → Agent 6 → Portfolio Narrative

Stage	Agent	Tokens In	Tokens Out	Total Cost	Retries
Agent 1	Telemetry	~8,500	~9,500	$6.06	44% retry
Agent 2	Dependency	~5,460	~3,785	$2.63	6% retry
Agent 3	Procurement	~5,922	~1,870	$1.65	0% retry
Agent 4	Provisioning	~7,196	~8,417	$5.32	44% retry
Agent 5	Synthesizer	~8,223	~4,861	$3.51	0% retry
Agent 6	Confidence Advisor	~15,166	~5,849	$4.80	0% retry
Portfolio	Narrative	~31,779	~8,077	$0.43	1 run

Agents 1–3 are independent and could run in parallel. Agents 4–6 depend on upstream outputs. Portfolio Narrative runs once after all 36 apps complete.

Cost Distribution by Agent

Agent	Share of Cost	Cost	Tokens	Calls
Telemetry	25.3%	$6.06	648K	52
Provisioning	22.2%	$5.32	562K	52
Confidence Advisor	20.0%	$4.80	757K	36
Synthesizer	14.6%	$3.51	471K	36
Dependency	11.0%	$2.63	333K	38
Procurement	6.9%	$1.65	281K	36
Portfolio Narrative	1.8%	$0.43	80K	2

Where Tokens Go: Anatomy of Input Tokens Across the Full Run

Token Category	Tokens	% of Input
System context (injected every call)	~153,250	8.4%
Agent-specific prompts	~420,000	23.1%
App metadata + infrastructure inventory	~380,000	20.9%
Upstream agent outputs (passed downstream)	~680,000	37.5%
Retry overhead (re-sent context)	~183,000	10.1%
Total Input	1,815,946	100%

Key Insight: Upstream Passthrough is the Largest Input Driver

37.5% of all input tokens are upstream agent outputs being passed to downstream agents. The Confidence Advisor alone receives ~26,781 tokens per app — the full outputs of all 5 upstream agents.

The Synthesizer already applies field filtering (49% input reduction vs. passing full blobs). Applying the same pattern to the Confidence Advisor is the single highest-ROI optimization available.

Retry overhead at 10.1% is the second largest avoidable cost — driven by the 44% retry rate on Telemetry and Provisioning.

Retry Economics — The Hidden Tax

Agent	Expected Calls	Actual Calls	Retries	Retry Rate	Est. Retry Cost
Telemetry	36	52	16	44%	~$0.91
Provisioning	36	52	16	44%	~$0.72
Dependency	36	38	2	6%	~$0.15
Portfolio Narrative	1	2	1	100%	~$0.43
Synthesizer	36	36	0	0%	$0
Confidence Advisor	36	36	0	0%	$0
Procurement	36	36	0	0%	$0
TOTAL	217	252	35	16%	~$2.21
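
The Retries and Retry Rate columns follow directly from the call counts. A minimal sketch of that arithmetic, using the per-agent figures from the table; the function and variable names are illustrative and not taken from the pipeline code:

```python
# Call counts copied from the table above: agent -> (expected, actual).
CALLS = {
    "Telemetry": (36, 52),
    "Provisioning": (36, 52),
    "Dependency": (36, 38),
    "Portfolio Narrative": (1, 2),
    "Synthesizer": (36, 36),
    "Confidence Advisor": (36, 36),
    "Procurement": (36, 36),
}


def retry_stats(expected: int, actual: int) -> tuple[int, float]:
    """Return (retries, retry rate), where the rate is retries / expected calls."""
    retries = actual - expected
    return retries, retries / expected


for agent, (expected, actual) in CALLS.items():
    retries, rate = retry_stats(expected, actual)
    print(f"{agent:<20} retries={retries:>2}  rate={rate:.0%}")

total_retries = sum(actual - expected for expected, actual in CALLS.values())
print("total retries:", total_retries)
```

Logging these two numbers per agent on every run is enough to track retry rate as a cost metric over time.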

Why Retries Happen

JSON parse failures (C2 retry): The model returns malformed JSON when the user message is large and variable — primarily the infrastructure VM list in Telemetry and the workload classification payload in Provisioning. The pipeline retries with explicit <JSON>...</JSON> markers.

Validation retries: 0 in this run. The V1–V22 validator ran after every agent call. Zero rule failures means every output passed governance on first or second attempt — the retry mechanism worked as designed.

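Retries account for ~$2.21 (9.2%) of total run cost. Pre-summarizing the VM list before sending it to Telemetry targets the largest share (~$0.91). Provisioning's ~$0.72 in retries comes from the workload classification payload and needs its own fix.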
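
The C2 retry described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `call_model` is a hypothetical callable standing in for the Bedrock request, and the wording of the retry instruction is invented for the example.

```python
import json
import re
from typing import Callable

JSON_BLOCK = re.compile(r"<JSON>(.*?)</JSON>", re.DOTALL)


def parse_with_c2_retry(call_model: Callable[[str], str], user_message: str) -> tuple[dict, int]:
    """Return (parsed output, number of model calls made)."""
    raw = call_model(user_message)
    try:
        return json.loads(raw), 1
    except json.JSONDecodeError:
        pass  # fall through to the marker-based retry

    # Retry: the full context is re-sent, plus an explicit marker instruction.
    # This re-sent context is the "retry overhead" counted in the input tokens.
    retry_message = user_message + "\n\nReturn the result wrapped in <JSON>...</JSON> markers."
    raw = call_model(retry_message)
    match = JSON_BLOCK.search(raw)
    if match is None:
        raise ValueError("no <JSON>...</JSON> block in retry response")
    return json.loads(match.group(1)), 2


# Usage with a stub model that returns malformed JSON once, then complies.
responses = iter(['Here you go: {"cpu": 4', 'Sure. <JSON>{"cpu": 4}</JSON>'])
result, calls = parse_with_c2_retry(lambda _msg: next(responses), "VM list payload")
print(result, calls)
```

Returning the call count alongside the parsed output makes it straightforward to attribute retry cost to the agent that incurred it.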

Prompt Engineering — Size vs. Necessity

Prompt file sizes (tokens estimated at 4 chars/token). System context is injected into every agent call.

Prompt	Est. Tokens	Bytes	Notes
portfolio_narrative	~7,176	28,706	~5,000 tokens are a few-shot example
synthesizer	~3,589	14,359
procurement	~3,158	12,635
provisioning	~3,143	12,575
dependency	~2,471	9,887
telemetry	~2,032	8,131
confidence_advisor	~1,578	6,314
system_context	~613	2,454	×250 calls = ~153K tokens total
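
The token estimates above come from the 4 chars/token heuristic. A small sketch of that calculation, assuming the prompt files are mostly single-byte text so that bytes and characters are roughly equal; a real tokenizer will give somewhat different counts:

```python
CHARS_PER_TOKEN = 4  # heuristic stated above; actual tokenization varies by content

# File sizes in bytes, copied from the list above.
PROMPT_BYTES = {
    "portfolio_narrative": 28_706,
    "synthesizer": 14_359,
    "procurement": 12_635,
    "provisioning": 12_575,
    "dependency": 9_887,
    "telemetry": 8_131,
    "confidence_advisor": 6_314,
    "system_context": 2_454,
}


def estimate_tokens(size_bytes: int) -> int:
    """Estimate tokens from file size, rounding down."""
    return size_bytes // CHARS_PER_TOKEN


for name, size in PROMPT_BYTES.items():
    print(f"{name:<20} ~{estimate_tokens(size):,} tokens")

# System context is injected on every call, so its cost scales with call count.
system_total = estimate_tokens(PROMPT_BYTES["system_context"]) * 250
print(f"system_context total: ~{system_total:,} tokens")
```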

The Portfolio Narrative Prompt is Twice the Size of the Next Largest

At 28,706 bytes, portfolio_narrative.txt contains a complete few-shot example: the full 10-app sample output (~5,000 tokens). This example is injected on every Portfolio Narrative call. Moving it to a separate reference file loaded only when needed would save ~4,500 tokens per call (~9,000 across this run's two calls) with no quality impact.

Optimization Roadmap

P1 · High Impact

OPT-1: Confidence Advisor Input Compression

The Confidence Advisor receives full JSON blobs from all 5 upstream agents (~26,781 tokens/call). Apply the same field-filtering pattern already used in the Synthesizer — pass only the fields the advisor actually needs.

Estimated savings: ~540,000 tokens/run · ~$1.62/run · 6.8% cost reduction

Complexity: Low — code change in agents.py only
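
A minimal sketch of the field-filtering pattern, assuming an explicit allow-list per upstream agent. The field names and the `filter_upstream` helper are hypothetical; the report does not list which fields the advisor needs or how agents.py implements the Synthesizer filter.

```python
import json

# Hypothetical allow-list: the fields the Confidence Advisor actually reads
# from each upstream agent. Real field names are not given in this report.
ADVISOR_FIELDS = {
    "telemetry": ["utilization_summary", "data_quality"],
    "provisioning": ["workload_class", "monthly_cost"],
}


def filter_upstream(outputs: dict, allow: dict) -> dict:
    """Keep only allow-listed fields from each upstream agent's output."""
    filtered = {}
    for agent, fields in allow.items():
        blob = outputs.get(agent, {})
        filtered[agent] = {key: blob[key] for key in fields if key in blob}
    return filtered


# Example upstream outputs with bulky fields the advisor does not need.
full = {
    "telemetry": {
        "utilization_summary": {"avg_cpu_pct": 31},
        "data_quality": "high",
        "raw_vm_list": [{"name": f"vm-{i}", "cpu": 4} for i in range(50)],
    },
    "provisioning": {
        "workload_class": "web",
        "monthly_cost": 1240.5,
        "sizing_rationale": "long free-text explanation " * 20,
    },
}

slim = filter_upstream(full, ADVISOR_FIELDS)
print(len(json.dumps(full)), "->", len(json.dumps(slim)), "characters")
```

An allow-list fails safe: a new upstream field stays out of the advisor's input until someone adds it deliberately.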

P1 · High Impact

OPT-3: Telemetry VM List Pre-Summarization

The 44% retry rate on Telemetry is caused by large, variable infrastructure VM lists in the user message. Pre-compute fleet statistics (avg CPU, avg RAM, OS distribution) in the enricher and send the summary instead of the raw VM array.

Estimated savings: ~48,000 tokens/run · ~$0.72/run · retry rate 44% → ~10%

Complexity: Medium — enricher + data_loader change
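
One way the pre-summarization could look, as a sketch only. The VM record keys (`cpu`, `ram_gb`, `os`) and the `summarize_fleet` helper are assumptions for illustration; the enricher's real data model is not shown in this report.

```python
from collections import Counter
from statistics import mean


def summarize_fleet(vms: list[dict]) -> dict:
    """Collapse a raw VM array into fixed-size fleet statistics."""
    if not vms:
        return {"vm_count": 0}
    return {
        "vm_count": len(vms),
        "avg_cpu": round(mean(vm["cpu"] for vm in vms), 1),
        "avg_ram_gb": round(mean(vm["ram_gb"] for vm in vms), 1),
        "os_distribution": dict(Counter(vm["os"] for vm in vms)),
    }


# Hypothetical inventory records.
vms = [
    {"cpu": 4, "ram_gb": 16, "os": "linux"},
    {"cpu": 8, "ram_gb": 32, "os": "linux"},
    {"cpu": 2, "ram_gb": 8, "os": "windows"},
]
summary = summarize_fleet(vms)
print(summary)
```

The summary has the same shape whether the app has 3 VMs or 300, which removes the size variability the text identifies as the retry driver.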

P2 · Near-Term

OPT-5: AWS Bedrock Prompt Caching

System context + agent prompts are identical across all 36 apps for a given agent. Bedrock prompt caching would cache these on first call and serve cached tokens at ~90% discount on subsequent calls — ~97% cache hit rate per agent.

Estimated savings: ~750K–1M tokens cacheable · ~$2.00–$2.70/run · 10% cost reduction

Complexity: Low — Bedrock feature flag when available
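
The hit rate and dollar range above can be reproduced with a short calculation. The price of $3.00 per million input tokens is an assumption chosen because it matches the quoted range; it is not stated in this report. The sketch also ignores any cache-write surcharge and the uncached first call.

```python
INPUT_PRICE_PER_MTOK = 3.00  # assumption: USD per million input tokens
CACHE_DISCOUNT = 0.90        # "~90% discount" from the text above


def cache_hit_rate(calls: int) -> float:
    """The first call populates the cache; each later call with the same prefix reads it."""
    return (calls - 1) / calls


def cache_savings_usd(cacheable_tokens: int) -> float:
    """Upper-bound savings if every cacheable token were billed at the discounted rate."""
    return cacheable_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK * CACHE_DISCOUNT


print(f"hit rate over 36 calls: {cache_hit_rate(36):.1%}")
for tokens in (750_000, 1_000_000):
    print(f"{tokens:>9,} cacheable tokens -> ~${cache_savings_usd(tokens):.2f} saved")
```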

P2 · Near-Term

OPT-6: Bedrock Batch Processing API

AWS Bedrock Batch Inference processes requests asynchronously at 50% of on-demand pricing. For full portfolio runs where same-day results are not required, batch processing halves the cost with no changes to agent logic.

Estimated savings: ~$12.00/run · 50% cost reduction · trade-off: hours not minutes

Complexity: Low — API mode change, not suitable for interactive runs

P3 · Future

OPT-4: Structured Output Mode

Bedrock structured output guarantees valid JSON on every call, eliminating all JSON parse retries and the C2 retry overhead. This would reduce the 44% retry rate on Telemetry and Provisioning to 0%.

Estimated savings: ~50,000 tokens/run · ~$1.50–$2.00/run · eliminates retry tax

Complexity: Low — Bedrock feature flag when available for Claude Sonnet 4.6

P3 · Future

OPT-7: Agent Parallelization

Agents 1–3 (Telemetry, Dependency, Procurement) are independent per app. Running them in parallel with asyncio would cut wall-clock time from 4.9 hours to ~2.5 hours. Zero token savings — pure latency improvement.

Estimated savings: ~2.4 hours wall-clock · $0 token savings · subject to Bedrock rate limits

Complexity: Medium — async refactor of pipeline_runner.py
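
A sketch of the per-app parallelization with asyncio. `run_agent` is a hypothetical stand-in that sleeps instead of calling Bedrock, and the structure of pipeline_runner.py is assumed, not known.

```python
import asyncio
from typing import Optional


async def run_agent(name: str, app_id: str, upstream: Optional[dict] = None) -> dict:
    """Hypothetical stand-in for one agent call; sleeps instead of calling Bedrock."""
    await asyncio.sleep(0.01)
    return {"agent": name, "app": app_id, "inputs": sorted(upstream or {})}


async def run_app(app_id: str) -> dict:
    # Agents 1-3 are independent, so they can be awaited together.
    telemetry, dependency, procurement = await asyncio.gather(
        run_agent("telemetry", app_id),
        run_agent("dependency", app_id),
        run_agent("procurement", app_id),
    )
    outputs = {"telemetry": telemetry, "dependency": dependency, "procurement": procurement}
    # Agents 4-6 depend on upstream outputs and stay sequential.
    for name in ("provisioning", "synthesizer", "confidence_advisor"):
        outputs[name] = await run_agent(name, app_id, outputs)
    return outputs


results = asyncio.run(run_app("app-001"))
print(list(results))
```

Because the run is subject to Bedrock rate limits, a real version would likely wrap the model call in an `asyncio.Semaphore` to cap the number of concurrent requests.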

P4 · Low Impact

OPT-2: Portfolio Narrative Prompt Compression

The portfolio_narrative prompt contains a 5,000-token few-shot example. Moving it to a separate reference file loaded only when needed would reduce prompt size by ~70% with no quality impact.

Estimated savings: ~9,000 tokens/run · ~$0.07/run · improves maintainability

Complexity: Low — prompt file edit only

Combined P1+P2

Total Achievable Savings

Applying OPT-1 and OPT-3 today (code changes only, no infrastructure): ~$2.34/run reduction. Adding OPT-5 and OPT-6 when available: ~$16.34/run total reduction.

P1 only (today): ~$2.34/run · 10% reduction · $23.97 → $21.63

P1+P2 (near-term): ~$16.34/run · 68% reduction · $23.97 → $7.63
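
The combined figures above are a straight sum of the per-optimization estimates. The sketch below reproduces that sum and, as an alternative assumption not made in this report, shows what the total would be if the 50% batch discount applied to the bill remaining after the other optimizations rather than to the original baseline. The OPT-5 figure uses the low end of its quoted range.

```python
BASELINE = 23.97  # USD per run

# Estimated savings per optimization, from the roadmap cards above.
SAVINGS = {"OPT-1": 1.62, "OPT-3": 0.72, "OPT-5": 2.00, "OPT-6": 12.00}

p1 = SAVINGS["OPT-1"] + SAVINGS["OPT-3"]
p1_p2 = sum(SAVINGS.values())

print(f"P1:    -${p1:.2f} -> ${BASELINE - p1:.2f} ({p1 / BASELINE:.0%})")
print(f"P1+P2: -${p1_p2:.2f} -> ${BASELINE - p1_p2:.2f} ({p1_p2 / BASELINE:.0%})")

# Alternative assumption: batch pricing halves whatever is left after
# OPT-1, OPT-3 and OPT-5, instead of halving the unoptimized baseline.
sequential = (BASELINE - p1 - SAVINGS["OPT-5"]) * 0.5
print(f"Sequential estimate: ~${sequential:.2f} per run")
```

Which estimate is closer depends on how the batch and caching discounts combine in practice, so the additive total is best read as an upper bound on savings.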

Benchmark — Pipeline vs. Alternative Approaches

Approach	Tokens/App	Cost/App	Cost/36-App Run	Notes
Naive monolithic prompt (all agents in one call)	~150,000	~$1.50	~$54.00	No retry isolation, no governance, unreliable math
6R-ARF v4.4.2 (current)	~84,749	$0.67	$23.97	7 specialized agents, provenance tagging, V1–V22 validation
Pipeline with OPT-1+3 applied	~69,000	~$0.60	~$21.63	Confidence Advisor compression + Telemetry pre-summary
Pipeline with batch processing	~84,749	~$0.33	~$11.99	Same tokens, 50% Bedrock batch pricing discount
Pipeline fully optimized (all OPTs)	~55,000	~$0.18	~$6.48	Batch + caching + compression + structured output

The current pipeline already uses about 44% fewer tokens per app than a naive monolithic approach (~84,749 vs. ~150,000), primarily due to Synthesizer input filtering and the Python provisioning engine (no LLM math).

Tokenomics Principles — Applied in This Pipeline

Principle 1

Separate Classification from Computation

The Provisioning agent uses LLM only for workload classification. Python handles all financial math. This eliminates the need for the LLM to reason through multi-step calculations — which would require large output budgets and produce unreliable results.

Rule: Use LLM for judgment. Use deterministic code for computation.
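
A sketch of this split, in which the model supplies only a class label and Python does the arithmetic. The sizing table, rates, and the `provision` function are invented for illustration and do not reflect the pipeline's provisioning engine.

```python
# Hypothetical sizing table: workload class -> (vCPU, RAM in GB, USD per month).
SIZING = {
    "web": (2, 8, 70.00),
    "batch": (8, 32, 280.00),
    "database": (4, 32, 310.00),
}


def provision(workload_class: str, instance_count: int, months: int = 12) -> dict:
    """Deterministic cost math; the LLM only supplies `workload_class`."""
    if workload_class not in SIZING:
        raise ValueError(f"unknown workload class: {workload_class!r}")
    vcpu, ram_gb, monthly = SIZING[workload_class]
    return {
        "workload_class": workload_class,
        "vcpu": vcpu * instance_count,
        "ram_gb": ram_gb * instance_count,
        "total_cost_usd": round(monthly * instance_count * months, 2),
    }


llm_label = "database"  # stand-in for the model's classification output
print(provision(llm_label, instance_count=3))
```

The model's output budget shrinks to a single label, and an unexpected label fails loudly instead of producing a plausible but wrong cost.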

Principle 2

Filter Before Sending

The Synthesizer receives filtered summaries of upstream outputs, not full JSON blobs. This 49% input token reduction was achieved by identifying exactly which fields each downstream agent needs and stripping everything else before the API call.

Rule: Never send a full upstream output downstream. Identify the minimum field set and filter explicitly.

Principle 3

Condense Governance, Don't Repeat It

The system context was reduced from ~300 lines to ~60 lines by removing examples, rationale, and historical context. The full governance document exists for human reference; the condensed version is what the model needs at inference time. This saved ~446,750 tokens per run vs. v3.0.

Rule: Governance documents are for humans. System prompts are for models. Keep them separate and minimal.

Principle 4

Retry is a Tax, Not a Feature

The C2 JSON parse retry is a safety net, not a design goal. Every retry doubles the token cost of that call. The 44% retry rate on Telemetry and Provisioning represents a structural inefficiency — the input payload is too large and variable, causing occasional JSON boundary failures.

Rule: Measure retry rates per agent. Treat them as a cost metric. Design inputs to minimize retry probability.

Principle 5

Output Verbosity Has a Price

Provenance tagging (wrapping every numeric value in a structured object) roughly triples output JSON size. This is a deliberate trade-off: governance and auditability justify the cost. But every additional output field compounds across 36 apps × 6 per-app agents, plus any retries.

Rule: Every output field is a token expenditure. Require justification for verbose schemas. Provenance tagging is justified; decorative narrative is not.
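
To make the size cost concrete, the sketch below compares a plain output with a provenance-tagged one. The wrapper fields (`value`, `source`, `method`) are a hypothetical schema; the pipeline's actual provenance format is not shown in this report, and the ratio depends on the schema used.

```python
import json

# Plain output: bare numeric values.
plain = {"monthly_cost": 1240.50, "vcpu": 16}


def tag(value, source: str, method: str) -> dict:
    """Wrap a numeric value in a hypothetical provenance object."""
    return {"value": value, "source": source, "method": method}


# Tagged output: every numeric value carries its origin.
tagged = {
    "monthly_cost": tag(1240.50, "pricing_engine", "computed"),
    "vcpu": tag(16, "vm_inventory", "observed"),
}

plain_chars = len(json.dumps(plain))
tagged_chars = len(json.dumps(tagged))
print(f"plain: {plain_chars} chars, tagged: {tagged_chars} chars, "
      f"ratio: {tagged_chars / plain_chars:.1f}x")
```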

Principle 6

Portfolio-Level Agents Are Cheap

The Portfolio Narrative agent covers all 36 apps in one call. Even with one retry, it used ~79,712 tokens and $0.43 in total, less than 2% of total run cost. This is because it receives pre-summarized per-app data (~1,016 tokens for all 36 apps) rather than full agent outputs.

Rule: Aggregate at the data layer before sending to the model. Pre-summarize per-item outputs into the minimum representation needed for portfolio-level reasoning.
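
A sketch of aggregating at the data layer before the portfolio call. The field names, the pipe-delimited row format, and both helper functions are assumptions made for illustration.

```python
def summarize_app(result: dict) -> dict:
    """Reduce one app's agent outputs to the fields portfolio reasoning needs.

    All field names here are hypothetical placeholders.
    """
    return {
        "app": result["app_id"],
        "strategy": result["synthesizer"]["strategy"],
        "monthly_cost": result["provisioning"]["monthly_cost"],
        "confidence": result["confidence_advisor"]["score"],
    }


def portfolio_rows(results: list[dict]) -> str:
    """One compact pipe-delimited line per app instead of full JSON blobs."""
    header = "app|strategy|monthly_cost|confidence"
    lines = [
        f'{s["app"]}|{s["strategy"]}|{s["monthly_cost"]:.0f}|{s["confidence"]:.2f}'
        for s in map(summarize_app, results)
    ]
    return "\n".join([header, *lines])


results = [
    {"app_id": "app-001", "synthesizer": {"strategy": "rehost", "notes": "long narrative"},
     "provisioning": {"monthly_cost": 1240.0}, "confidence_advisor": {"score": 0.82}},
    {"app_id": "app-002", "synthesizer": {"strategy": "retire"},
     "provisioning": {"monthly_cost": 0.0}, "confidence_advisor": {"score": 0.95}},
]
print(portfolio_rows(results))
```

A delimited row per app avoids repeating JSON keys 36 times, which is where much of the per-item overhead would otherwise go.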
