SLM Content Pipeline — Phi-3 Mini for Park Content Generation

System Overview

How the Pipeline Works End-to-End

Each pipeline type runs on a schedule (operations bulletin every morning, weather updates 3× daily, trend reports nightly, etc.). A Container App Job launches the pipeline runner, which assembles context from the knowledge RAG + real-time tools, then calls Claude via LangChain to produce formatted multi-platform content.

Trigger

Azure Container App Job (scheduled)
Pipeline registry resolves content_type → tool set + prompt template
Pipeline config loaded from Cosmos DB (prompt_source: cosmos or hardcoded)
Config loaded fresh per run, so no stale prompt risk

Pre-fetch Layer

PIPELINE_PREFETCH[content_type] config
All fetches run in parallel before LLM call
Knowledge queries: queryIntelligence → PostgreSQL RAG (semantic + structured)
Real-time tool calls: current waits, LL status, weather, park schedule
Results injected into prompt as structured context, not as tool calls during generation
KPI fact pack: kpi_daily_digest + kpi_crowd_index pre-loaded for gating

KPI Gate

should_gate_for_fact_pack() evaluates KPI data quality
Gates require: crowd_score present, top_10_waits populated, operational_date matches today
Gate reason logged: aws_auth_failure / generation_failed / insufficient_kpi_data
Gate failure → pipeline skips generation, logs reason, no empty content published

Agentic Loop

LangChain + ChatBedrock (Claude 3.5 Sonnet / Claude 3 Haiku)
Tool calls execute in parallel via ThreadPoolExecutor
Type-specific tool subset per pipeline (BULLETIN_TOOLS, WEATHER_TOOLS, etc.)
temperature=0; structured output format enforced by the system prompt
Actual token counts from usage_metadata (not an estimate)
Tool failures return "No data available" so the model never has to invent missing data

Multi-Pass Generation

Pass 1: data synthesis (all tool results → structured summary)
Pass 2: platform-specific formatting (Twitter/X, Instagram, Facebook, blog)
Pass 3 (Reels): video script + captions for Instagram Reels / TikTok
Section markers parsed from response → split into per-platform content artifacts
Generation plan built from pipeline config (enabled channels per content_type)

Publish

blog_publisher.py writes artifacts to content store
Structured content artifact (ContentArtifact) with per-platform sections
Video job triggered: video_job_trigger → park-video-worker Container App Job
Audit trail logged: generation time, model, token counts, gate status

Pipeline Types

10 Scheduled Content Pipelines

operations_bulletin

Daily Operations Bulletin

Full park ops snapshot: current waits, LL status, closures, dining, shows. Targets social + blog. Heaviest pre-fetch — 6 knowledge queries + 7 real-time tools in parallel.

BULLETIN_TOOLS · 14 tools

weather_morning / weather_midday / weather_evening

Weather Intelligence Updates

3× daily weather-focused content: NWS forecast, storm risk, ride sensitivity, rides at risk of closure. All three use identical WEATHER_TOOLS but different prompt tone (morning plan vs. midday adjustment vs. evening recap).

WEATHER_TOOLS · 9 tools

rope_drop_strategy

Rope Drop Strategy

Pre-open park strategy: first attractions to target, LL prioritization, weather impact on opening conditions. Uses forecast + current LL + purchase options + entity proximity context.

ROPE_DROP_TOOLS · 10 tools

daily_trend_report

Daily Trend Report

End-of-day analysis: wait time trends, sellout/release patterns, weekly comparisons. Uses TREND_TOOLS including query_intelligence for historical context from the knowledge RAG.

TREND_TOOLS · knowledge RAG

yesterday_recap

Yesterday Recap

Narrative recap of prior day's operations: headline waits, LL drama, downtime incidents, weather impact. Structured by pre-fetched daily actuals from precomputed_docs.

TREND_TOOLS · precomputed_docs

morning_briefing

Morning Briefing

Pre-visit preparation content: what to expect today based on historical day-of-week patterns, LL opening strategy, weather outlook. MORNING_TOOLS blend real-time current state with pattern knowledge.

MORNING_TOOLS · 10 tools

ai_deep_insights

AI Deep Insights

Long-form analytical content: week-over-week comparisons, crowd pattern explanations, predictive guidance. INSIGHTS_TOOLS combine query_intelligence (RAG) with structured daily aggregates for multi-layer analysis.

INSIGHTS_TOOLS · knowledge + structured

evening_wrap

Evening Wrap

Day-end narrative: today's operational highlights, LL sellout counts, top waits, crowd vs. historical baseline. Pre-fetches today's KPI digest and crowd index for data-driven hooks.

BULLETIN_TOOLS · KPI facts
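
The content-type-to-tool-set mapping listed above can be summarized as a small registry. This is a minimal sketch, not the actual pipeline registry: the tool-set names come from the pipeline descriptions, while PIPELINE_REGISTRY, PipelineSpec, resolve_pipeline, and the prompt-template keys are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSpec:
    tool_set: str         # name of the tool subset, e.g. "BULLETIN_TOOLS"
    prompt_template: str  # hypothetical template key (assumption)


# content_type -> tool set, as listed in the pipeline descriptions.
_TOOL_SETS = {
    "operations_bulletin": "BULLETIN_TOOLS",
    "weather_morning": "WEATHER_TOOLS",
    "weather_midday": "WEATHER_TOOLS",
    "weather_evening": "WEATHER_TOOLS",
    "rope_drop_strategy": "ROPE_DROP_TOOLS",
    "daily_trend_report": "TREND_TOOLS",
    "yesterday_recap": "TREND_TOOLS",
    "morning_briefing": "MORNING_TOOLS",
    "ai_deep_insights": "INSIGHTS_TOOLS",
    "evening_wrap": "BULLETIN_TOOLS",
}

# The "<content_type>_prompt" naming is an assumption for this sketch.
PIPELINE_REGISTRY = {
    content_type: PipelineSpec(tool_set, f"{content_type}_prompt")
    for content_type, tool_set in _TOOL_SETS.items()
}


def resolve_pipeline(content_type: str) -> PipelineSpec:
    """Look up the spec for a content type, failing loudly on unknown types."""
    try:
        return PIPELINE_REGISTRY[content_type]
    except KeyError:
        raise ValueError(f"Unknown content_type: {content_type!r}") from None
```

Failing on an unknown content type at resolution time keeps a misconfigured schedule from silently running with the wrong tools.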

Pre-fetch Layer

All Context Assembled Before the LLM Sees a Token

Every pipeline has a PIPELINE_PREFETCH config that defines exactly which knowledge queries and real-time tool calls to fire in parallel before the LLM call. This pattern eliminates agentic round-trips for predictable data needs — the LLM receives a complete context block and can focus on generation, not data retrieval.
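
A rough sketch of this pattern, assuming each fetch is wrapped as a zero-argument callable that returns text. The names run_prefetch and build_context_block, and the bracketed section format, are hypothetical and not taken from the pipeline code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_prefetch(fetchers: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run all fetchers concurrently and return their results keyed by name.

    A fetcher that raises is replaced by a "No data available" placeholder,
    so one failed source never blocks the whole pre-fetch.
    """
    if not fetchers:
        return {}
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        # Submit everything first so all fetches are in flight together.
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = f"No data available: {exc}"
    return results


def build_context_block(results: Dict[str, str]) -> str:
    """Join results into one labeled block for injection into the prompt."""
    return "\n\n".join(f"[{name}]\n{body}" for name, body in results.items())
```

Because every fetch is submitted before any result is awaited, the wall time is bounded by the slowest fetch rather than the sum of all of them.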

The operations_bulletin pipeline pre-fetches 6 semantic knowledge queries and 7 live tool calls in parallel. Entity names are embedded into query strings (not just type filters) so BM25 ranking in the knowledge store surfaces the right headliner rides — not low-priority attractions written later in the upsert cycle. Template variables like {last_same_dow} and {date} are resolved at runtime from an ET-anchored date context before queries are dispatched.

Knowledge · popularity_ranking

"TRON Rise Resistance Seven Dwarfs Mine Train Guardians Cosmic Rewind Slinky Dog Dash Flight Passage Smugglers Run Tower Terror popularity ranking top wait times"

mode: knowledge · limit: 5 · entity names embedded for BM25 accuracy

Knowledge · ll_sellout_timeline + lightning_lane_volatility

"Rise Resistance TRON Seven Dwarfs Mine Train Guardians Cosmic Rewind Slinky Dog Dash Lightning Lane sellout fastest demand volatility"

mode: knowledge · limit: 5

Structured · dining (same-DOW historical baseline)

"top 10 restaurants by wait time date={last_same_dow}"

mode: structured · limit: 10 · resolves to prior-week same-day for accurate DOW patterns

Real-time tools (parallel)

get_current_wait_times · get_current_down_attractions · get_lightning_lane_status · get_upcoming_shows · get_entity_context · get_top_restaurants · get_restaurant_wait_times

All 7 execute concurrently · failures return "No data available" (never block)
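
One way the pre-fetch config for operations_bulletin could be laid out is sketched below. The query strings and tool names are copied from the examples above; the KnowledgeQuery dataclass, the dictionary keys, and resolve_queries are assumptions made for illustration, and only two of the six knowledge queries are shown.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, List


@dataclass(frozen=True)
class KnowledgeQuery:
    query: str  # may contain {date} or {last_same_dow} placeholders
    mode: str   # "knowledge" or "structured"
    limit: int


# Hypothetical layout; the real PIPELINE_PREFETCH structure may differ.
PIPELINE_PREFETCH: Dict[str, Dict[str, Any]] = {
    "operations_bulletin": {
        "knowledge_queries": [
            KnowledgeQuery(
                query=(
                    "TRON Rise Resistance Seven Dwarfs Mine Train Guardians "
                    "Cosmic Rewind Slinky Dog Dash Flight Passage Smugglers Run "
                    "Tower Terror popularity ranking top wait times"
                ),
                mode="knowledge",
                limit=5,
            ),
            KnowledgeQuery(
                query="top 10 restaurants by wait time date={last_same_dow}",
                mode="structured",
                limit=10,
            ),
        ],
        "realtime_tools": [
            "get_current_wait_times",
            "get_current_down_attractions",
            "get_lightning_lane_status",
            "get_upcoming_shows",
            "get_entity_context",
            "get_top_restaurants",
            "get_restaurant_wait_times",
        ],
    }
}


def resolve_queries(content_type: str, date_ctx: Dict[str, str]) -> List[KnowledgeQuery]:
    """Fill template variables in each query from the run's date context."""
    spec = PIPELINE_PREFETCH[content_type]
    return [replace(q, query=q.query.format(**date_ctx)) for q in spec["knowledge_queries"]]
```

Keeping the config declarative means a test can assert on the exact queries a pipeline will dispatch without touching any LLM code.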

SLM Integration

Phi-3 Mini as the Query Classifier

A fine-tuned Phi-3 Mini model deployed to Azure Container Apps handles the query classification step that decides which retrieval strategy to use for each tool call. The same classifier serves both user-facing RAG queries and the automated content pipeline tool calls. The SLM runs at sub-200ms with minimal inference cost; Phi-4-mini on Azure OpenAI serves as the automatic fallback.

Routing Logic

SHA-256 hash of query → deterministic routing (same question = same path)
SLM_ROLLOUT_PERCENTAGE env var (0–100) controls traffic split
SLM_ROLLOUT_PERCENTAGE=0 → 100% Phi-4-mini (safe default)
SLM_ROLLOUT_PERCENTAGE=100 → all traffic to Phi-3 Mini
Gradual rollout: 10% → 25% → 50% → 100% as confidence is validated
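
Hash-based percentage routing of this kind can be sketched as follows. The bucketing scheme (first 8 bytes of the digest, modulo 100) and the function name routes_to_slm are assumptions; the source only states that a SHA-256 hash of the query drives the split.

```python
import hashlib


def routes_to_slm(query: str, rollout_percentage: int) -> bool:
    """Return True when this query should be classified by the SLM.

    The same query always lands in the same bucket, so routing is
    deterministic across runs and processes.
    """
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percentage
```

A useful property of this scheme is that raising the percentage only adds queries to the SLM path: a query routed to the SLM at 25% is still routed there at 50%.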

Confidence Gate

SLM returns confidence: 0.0–1.0
SLM_CONFIDENCE_THRESHOLD = 0.85 (default)
confidence ≥ 0.85 → use SLM result, stamp classification_source="slm"
confidence < 0.85 → fall back to Phi-4-mini, stamp "slm_fallback_low_confidence" + slm_confidence value
Fallback preserves full classification for logging, so both paths are auditable
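
The gate can be sketched as below. The threshold and the stamp values come from the description above; the Classification dataclass, apply_confidence_gate, and the callable fallback are hypothetical stand-ins for the real classifier interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SLM_CONFIDENCE_THRESHOLD = 0.85  # default from the description above


@dataclass
class Classification:
    strategy: str
    classification_source: str
    slm_confidence: Optional[float] = None


def apply_confidence_gate(
    slm_strategy: str,
    slm_confidence: float,
    fallback: Callable[[], str],
    threshold: float = SLM_CONFIDENCE_THRESHOLD,
) -> Classification:
    """Keep the SLM result when it is confident enough, else use the fallback.

    The SLM confidence is recorded on both paths so every decision can be audited.
    """
    if slm_confidence >= threshold:
        return Classification(slm_strategy, "slm", slm_confidence)
    # Below threshold: the fallback model decides, but the SLM score is kept.
    return Classification(fallback(), "slm_fallback_low_confidence", slm_confidence)
```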

Error Fallback

Timeout (2000ms default), HTTP error, invalid strategy → SLM error
2 retries with exponential backoff (100ms, 200ms) before raising
SLMClassificationError caught → Phi-4-mini fallback, stamp "slm_fallback_error"
SLM container can be stopped for maintenance with zero pipeline disruption
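
A minimal sketch of the retry-then-fallback flow. It assumes "2 retries" means one initial attempt plus two retries; call_slm and call_fallback are hypothetical callables standing in for the HTTP client and the Phi-4-mini path, and strategy validation is left out for brevity.

```python
import time
from typing import Callable, Dict


class SLMClassificationError(Exception):
    """Raised when the SLM call still fails after all retries."""


def classify_with_retries(
    call_slm: Callable[[], str],
    retries: int = 2,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return call_slm()
        except Exception as exc:  # timeout, HTTP error, etc.
            last_exc = exc
            if attempt < retries:
                sleep(base_delay_s * (2 ** attempt))  # 0.1 s, then 0.2 s
    raise SLMClassificationError(str(last_exc)) from last_exc


def classify(
    query: str,
    call_slm: Callable[[str], str],
    call_fallback: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Try the SLM first; on persistent failure use the fallback and stamp it."""
    try:
        strategy = classify_with_retries(lambda: call_slm(query), sleep=sleep)
        return {"strategy": strategy, "classification_source": "slm"}
    except SLMClassificationError:
        return {"strategy": call_fallback(query), "classification_source": "slm_fallback_error"}
```

Passing sleep in as a parameter is only a convenience so the backoff schedule can be checked in tests without actually waiting.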

Serving Infrastructure

Azure Container Apps: scales to zero when idle
HTTP endpoint: POST /classify
Reusable httpx.Client with connection pooling (max 20 connections)
Phi-3 Mini: 3.8B parameters, ~3× faster than Phi-4-mini for the classification task
Same classification schema as Phi-4-mini (drop-in replacement)

Tool Library

30+ LangChain Tools — Two Backend Systems

Every tool wraps one of two HTTP endpoints. LangChain tool docstrings are the only documentation the model sees for tool selection — they are written as precise data-query specifications, not conversational descriptions. Tool failures return "No data available: [reason]" and are never propagated as exceptions into the generation context.

Tool	Type	Backend	Used By	Key Data
`query_intelligence`	RAG	park-knowledge-updater → PostgreSQL RAG	TREND_TOOLS, INSIGHTS_TOOLS	Historical patterns, knowledge articles. Supports mode=semantic/structured/knowledge, knowledge_types filter to avoid full-library scan.
`get_current_wait_times`	Real-time	router-bot /api/ai/v1/currentWaits	BULLETIN, MORNING, ROPE_DROP	Live wait times all attractions, operational status
`get_lightning_lane_status`	Real-time	router-bot /api/ai/v1/lightningLane	BULLETIN, MORNING, ROPE_DROP	LL availability, pricing, sellout status per ride
`get_weather_intelligence`	Real-time	router-bot /api/ai/weather/intelligence	All tool sets	AI-analyzed weather with crowd/wait impact predictions, storm risk
`get_weather_forecast`	Real-time	router-bot /api/ai/weather/forecast	WEATHER_TOOLS, ROPE_DROP	Hourly NWS forecast, storm risk %, wind risk %, comfort level
`get_rides_at_risk`	Real-time	router-bot	WEATHER_TOOLS, ROPE_DROP	Outdoor rides with closure risk given current/forecast weather
`get_ride_sensitivity`	Real-time	router-bot	WEATHER_TOOLS, ROPE_DROP	Historical weather sensitivity profile per attraction (rain/wind/lightning)
`get_down_summary`	Real-time	router-bot	BULLETIN, TREND, INSIGHTS	Today's downtime incidents, total hours, impacted attractions
`get_top_wait_aggregates`	Real-time	router-bot	TREND_TOOLS, INSIGHTS_TOOLS	Today's ranked top-10 waits from precomputed_docs
`get_daily_sellout_summary`	Real-time	router-bot	TREND_TOOLS, INSIGHTS_TOOLS	Today's LL sellout counts per attraction
`get_daily_release_summary`	Real-time	router-bot	TREND_TOOLS, INSIGHTS_TOOLS	Today's LL release/restock events (high count = extreme demand cycling)
`get_entity_context`	Real-time	router-bot → park_knowledge parkEntities	BULLETIN, ROPE_DROP, evening_wrap	Attraction metadata: land, park, type, proximity. Used for spatial narrative.
`get_weekly_aggregation`	Real-time	router-bot	TREND_TOOLS, INSIGHTS_TOOLS	Week-over-week comparative wait/downtime metrics

Agentic Loop

LangChain + Claude on AWS Bedrock

The previous implementation used boto3 directly against the Bedrock Agent runtime (RETURN_CONTROL flow), which serialized tool calls and had no actual token counting. The current implementation uses LangChain's ChatBedrock (bedrock-runtime:InvokeModel directly), parallel tool execution, and real token counts from usage_metadata.

Previous

boto3 + Bedrock Agent runtime

RETURN_CONTROL tool calls were serial. Token counting was chars÷4 estimate. Output format required retries. AWS service overhead per iteration.

Current

LangChain + ChatBedrock direct

bedrock-runtime:InvokeModel directly — no Agent service overhead. Tool calls fire in parallel via ThreadPoolExecutor. Real token counts from usage_metadata. Output format enforced at temperature=0.

Tool Parallelism

ThreadPoolExecutor per turn

When the model emits multiple tool_calls in a single turn, all execute concurrently. Results collected via as_completed(). Wall time = slowest tool, not sum.
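
The per-turn fan-out can be sketched as below. The tool-call shape (dicts with "id", "name", and "args") mirrors what LangChain exposes on a model response and is assumed here; execute_tool_calls and the tools registry are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List


def execute_tool_calls(
    tool_calls: List[Dict[str, Any]],
    tools: Dict[str, Callable[..., str]],
) -> Dict[str, str]:
    """Run every tool call from one model turn concurrently.

    Returns a mapping of tool-call id to result text. Tools are expected to
    return a failure string themselves rather than raise.
    """
    if not tool_calls:
        return {}
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
        future_to_id = {
            pool.submit(tools[call["name"]], **call["args"]): call["id"]
            for call in tool_calls
        }
        # as_completed yields futures as they finish, fastest first.
        for future in as_completed(future_to_id):
            results[future_to_id[future]] = future.result()
    return results
```

Keying results by call id matters because completion order is not submission order, and each result has to be matched back to the call that requested it.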

Failure Mode

Explicit "No data available"

Every tool wraps its HTTP call in a try/except. Timeout, HTTP error, or exception → return "No data available: [reason]". Model instructed to omit that section rather than invent data.
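
One way to apply this uniformly is a decorator around each tool function. This is a sketch under assumptions: safe_tool is a hypothetical name, and the real tools may inline the try/except instead.

```python
import functools
from typing import Callable


def safe_tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Convert any exception raised by a tool into a "No data available" string."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs) -> str:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Fall back to the exception class name when the message is empty.
            reason = str(exc) or type(exc).__name__
            return f"No data available: {reason}"

    return wrapper


@safe_tool
def get_example_data(park: str) -> str:
    """Hypothetical tool used only to demonstrate the wrapper."""
    raise ConnectionError(f"upstream unreachable for {park}")
```

functools.wraps preserves the wrapped function's name and docstring, which matters when the docstring is what the model reads to select a tool.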

Token Tracking

usage_metadata per call

Input tokens, output tokens, and total tracked per generation call. Used for cost audit trail and pipeline-level token budget enforcement.
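
A possible shape for the per-run accounting is sketched below. It assumes usage_metadata is a dict with input_tokens and output_tokens keys, as LangChain reports it; TokenLedger, TokenBudgetExceeded, and the budget semantics are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


class TokenBudgetExceeded(RuntimeError):
    """Raised when a pipeline run goes over its token budget."""


@dataclass
class TokenLedger:
    budget: int
    input_tokens: int = 0
    output_tokens: int = 0
    calls: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def record(self, usage_metadata: Mapping[str, int]) -> None:
        """Add one generation call's usage, then enforce the run-level budget."""
        self.input_tokens += usage_metadata.get("input_tokens", 0)
        self.output_tokens += usage_metadata.get("output_tokens", 0)
        self.calls += 1
        if self.total_tokens > self.budget:
            raise TokenBudgetExceeded(
                f"{self.total_tokens} tokens used, budget is {self.budget}"
            )
```

The usage is recorded before the budget check, so the audit trail still reflects the call that tipped the run over its limit.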

Model Selection

Claude 3.5 Sonnet / 3 Haiku

Content-type-specific model selection: long-form insight generation uses Sonnet; real-time bulletin updates use Haiku for lower latency and cost.

Multi-Pass Generation

3-Call Architecture: Data → Platforms → Video

Content generation uses multiple sequential LLM calls to separate concerns. Pass 1 synthesizes raw tool data into a structured summary. Pass 2 reformats for each enabled platform. Pass 3 (Reels/TikTok) builds the video script. Each pass has its own temperature and prompt directives.

Pass 1 — Data Synthesis

All pre-fetched context + agentic tool results → structured summary
temperature=0 for factual accuracy
Crowd index, top wait times, LL drama, closures, weather impact
Section markers delimit data categories for Pass 2 re-use

Pass 2 — Platform Format

Platform generation plan: enabled channels from pipeline config
Channels: Twitter/X (280 char), Instagram (caption + hashtags), Facebook (long form), blog (full article)
build_enabled_call2_format_block() builds per-channel format instructions
Section markers in response → split into per-platform artifacts
_normalize_generation_channels() resolves channel config per content type
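
Splitting a section-marked response into per-platform artifacts might look like the sketch below. The marker syntax (===CHANNEL=== on its own line), the channel names, and the contents of SOCIAL_CHANNEL_ORDER are all assumptions; the source does not show the actual marker format.

```python
import re
from typing import Dict, List

# Assumed canonical order; the real SOCIAL_CHANNEL_ORDER may differ.
SOCIAL_CHANNEL_ORDER = ["twitter", "instagram", "facebook", "blog"]

# Assumed marker format: a line such as "===TWITTER===".
SECTION_MARKER = re.compile(r"^===[ \t]*([A-Z_]+)[ \t]*===[ \t]*$", re.MULTILINE)


def split_sections(response: str, enabled_channels: List[str]) -> Dict[str, str]:
    """Split a marked-up response into {channel: text} for enabled channels.

    Output follows SOCIAL_CHANNEL_ORDER regardless of the order in the response.
    """
    # re.split with one capture group yields [preamble, name1, body1, name2, body2, ...].
    parts = SECTION_MARKER.split(response)
    found = {
        name.lower(): body.strip()
        for name, body in zip(parts[1::2], parts[2::2])
    }
    return {
        channel: found[channel]
        for channel in SOCIAL_CHANNEL_ORDER
        if channel in enabled_channels and channel in found
    }
```

Filtering by enabled channels at parse time means a section the model emitted for a disabled platform is dropped rather than published.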

Pass 3 — Reels/Video Script

Triggered only when video channel enabled for content type
Template: get_facebook_reel_template() → structured hook/body/CTA script
_resolve_reels_call3_override() selects weather/recap/morning variants
Output: video_job_trigger → park-video-worker Container App Job renders the reel
Reels call3 override allows content-type-specific video templates

Platform Config

SOCIAL_CHANNEL_ORDER defines canonical output order
platform_prompt_config.py centralizes all channel format specs
Content type drives which channels are enabled (not all types post to all platforms)
Context-aware: has_dashboard flag adds portal-link CTAs to applicable channels

Design Decisions

Engineering Choices

Why pre-fetch context instead of letting the agent call tools?

For scheduled pipeline types (not interactive chat), the data requirements are fully predictable. Pre-fetching in parallel has two major advantages: total wall time equals the slowest single call (not sum of all calls), and the LLM receives a complete context block rather than making tool decisions under generation pressure. The agentic loop still fires for unexpected data needs, but the pre-fetch layer covers 90%+ of what each pipeline type needs. The PIPELINE_PREFETCH config makes it explicit and testable — a new engineer can see exactly what data flows into each pipeline type before reading any LLM code.

How does the KPI gate prevent publishing on bad data days?

should_gate_for_fact_pack() inspects the pre-fetched KPI fact pack before any LLM call. Required conditions: crowd_score is present and numeric, top_10_waits has at least one entry, and operational_date matches today (Eastern Time). If any condition fails, the pipeline logs a gate reason (insufficient_kpi_data, aws_auth_failure, generation_failed) and skips generation entirely. This prevents publishing content built from stale or partial data — which would be worse than publishing nothing. The gate reason is auditable in the pipeline run log.
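
The three conditions can be expressed as a small check. This is a sketch, not the actual should_gate_for_fact_pack() implementation: the function name gate_reason, the fact-pack field layout, and the ISO date format are assumptions, and the caller is expected to pass today's date already anchored to Eastern Time.

```python
from datetime import date
from typing import Any, Mapping, Optional


def gate_reason(fact_pack: Mapping[str, Any], today_et: date) -> Optional[str]:
    """Return a gate reason when generation should be skipped, else None."""
    score = fact_pack.get("crowd_score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return "insufficient_kpi_data"
    if not fact_pack.get("top_10_waits"):
        return "insufficient_kpi_data"
    if fact_pack.get("operational_date") != today_et.isoformat():
        return "insufficient_kpi_data"
    return None
```

Returning the reason rather than a bare boolean lets the caller write the same value straight into the run log.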

Why embed entity names into knowledge query strings rather than just type filters?

PostgreSQL BM25-style ranking (and the previous Cosmos query layer) scores documents by text relevance. If you query knowledge_types=["popularity_ranking"] with a generic question, every document of that type matches the text about equally, so recency ends up deciding the order and the ranker returns low-priority attractions written last in the upsert cycle. Headliner rides (TRON, Rise of the Resistance, Seven Dwarfs Mine Train) that were upserted earlier score lower on recency-based metrics. By embedding headliner names directly into the query string ("TRON Rise Resistance Seven Dwarfs Mine Train Guardians Cosmic Rewind popularity ranking"), the text matching layer surfaces the right rides first regardless of upsert order. This was discovered through operational observation: bulletin content was featuring Remy's Ratatouille Adventure over Space Mountain because Remy was written later and scored higher on recency.

Why switch from boto3 Bedrock Agent runtime to LangChain ChatBedrock?

The Bedrock Agent runtime (RETURN_CONTROL loop) serialized tool calls: the agent emitted one tool call, waited for the response, then decided whether to call another. This meant a 3-tool bulletin build took 3 sequential 45-second HTTP calls = 135 seconds. LangChain's ChatBedrock calls bedrock-runtime:InvokeModel directly — we own the agentic loop. When Claude emits 5 tool calls in a single turn, we fire all 5 in parallel via ThreadPoolExecutor and collect results. The same bulletin build now takes ~45 seconds (cost of the slowest tool). The secondary benefit is output format control: with the Bedrock Agent, getting structured section-marked output reliably required retries. At temperature=0 with explicit format instructions in the system prompt, Claude's first response is always correctly formatted.

How does same-day-of-week dining data work?

Restaurant wait patterns are strongly day-of-week dependent (Magic Kingdom Saturdays vs. Tuesdays have completely different dining pressure). The structured knowledge query "top 10 restaurants by wait time date={last_same_dow}" resolves {last_same_dow} to the prior week's same day-of-week (e.g., for a Thursday bulletin, it fetches last Thursday's restaurant data). This gives the LLM a meaningful historical dining baseline that reflects the same crowd pattern as today — much more accurate than a trailing 7-day average, which blends weekday and weekend patterns.
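
The date arithmetic is simple enough to show directly. In this sketch the function names are hypothetical, and the caller is assumed to supply today's date already anchored to Eastern Time.

```python
from datetime import date, timedelta
from typing import Dict


def last_same_dow(today_et: date) -> date:
    """Return the same weekday one week earlier."""
    return today_et - timedelta(days=7)


def build_date_context(today_et: date) -> Dict[str, str]:
    """Values for the {date} and {last_same_dow} template variables."""
    return {
        "date": today_et.isoformat(),
        "last_same_dow": last_same_dow(today_et).isoformat(),
    }


# Example: a Thursday bulletin resolves to the previous Thursday.
query = "top 10 restaurants by wait time date={last_same_dow}".format(
    **build_date_context(date(2024, 6, 13))
)
```

Here query resolves to "top 10 restaurants by wait time date=2024-06-06", and both dates fall on a Thursday.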
