Production · Azure OpenAI · PostgreSQL + pgvector

Park Whisperer
Knowledge RAG

A production ETL + semantic search pipeline that transforms 1.6 million raw operational records from theme park systems — wait times, downtime events, sellouts, Lightning Lane pricing — into a PostgreSQL + pgvector knowledge base. Azure OpenAI generates 1536-dimension embeddings on extracted insights; an HNSW index serves cosine similarity search; a Phi-4-mini / SLM classifier routes queries across three strategies; and deterministic fast paths bypass inference entirely for numeric questions. All 100% of knowledge docs write to PostgreSQL — Cosmos was retired as the knowledge store in v3.

1.6M+
Source Records
19
Source Doctypes
34+
Knowledge Types
1536
Embedding Dims
3
Query Strategies
v3
Architecture

Raw Ops Data → Embedded Knowledge

build_knowledge_from_ops.py reads pre-aggregated operational datasets from Cosmos DB park_ops, applies extraction logic to surface durable insights (timing strategies, reliability patterns, popularity rankings, hidden gems), then generates OpenAI embeddings and upserts the results into the knowledge store. Full and incremental modes supported — daily incremental runs refresh only the last 7–14 days, keeping embedding costs manageable.

Source
Cosmos DB park_ops 1.6M+ records 19 doctypes database: park-data-az Never query 1M raw snapshots — use pre-aggregated
Extractors
extract_wait_time_patterns() extract_reliability_patterns() extract_popularity_rankings() extract_avg_wait_insights() Pagination via continuation tokens — fetches ALL results, not just first page Operating hours filter: 7 AM–10 PM, min 5 samples/hour
Insight Filtering
wait_time_strategy: worst_wait > best_wait × 1.5 (meaningful spread) reliability_info: uptime % + incident rate thresholds popularity_ranking: top-15 by days in Top 10, avg peak ≥ 30 min low_wait_gem: avg_wait < 15 min, ≥ 5 days of data
Embedding
Azure OpenAI text-embedding-3-small 1536 dimensions Input: title + content Managed Identity (prod) or API key (local dev) Batch variant for bulk runs
Output
PostgreSQL park_knowledge table (100%) HNSW index on embedding column (pgvector) id: {type}_{entity_id}_{YYYYMMDD} JSONB data column — full doc stored alongside row-level query fields entity_id, operational_date, context_for_llm denormalized for fast queries Upsert (safe to re-run) precomputed_docs table: raw event/snapshot types (selloutEventv2, dailyTop, etc.) partitioned by YYYY-MM-DD
Run Modes
Full: all data from 2025-07-26 — complete rebuild Incremental (--incremental): last 7–14 days — daily refresh Purge + regenerate: purge_park_knowledge.py + build script 8-second hard statement timeout per PG query — prevents runaway scans Auto-reconnect: ping on reuse, reconnect on idle timeout

19 Doctypes, 1.6M+ Records in park_ops

All data is partitioned by date. The extractors deliberately avoid the raw 1M+ snapshot tables and query pre-aggregated doctypes first — the raw waitSnapshot table is only queried for hourly pattern extraction, and even then uses partition key filtering to limit scan range.

waitSnapshot
1,122,018 records · ⚠️ Use with partition filter

Raw per-attraction wait time snapshots. Fields: entity_id, entity_name, wait_time, status, date, hour. Used only for hourly pattern extraction with strict date range filter.

lightningLane
814,778 records

Lightning Lane pricing and availability snapshots. Foundation for future lightning_lane_value and ll_sellout_timeline knowledge types.

purchaseOptions
355,337 records

Ticket pricing options. Used for sellout risk modeling and LL value scoring knowledge types.

selloutEvent
23,797 records

Sold-out events and dates. Input for sellout_risk_window knowledge type and structured fast-path sellout queries.

dailySchedule
19,095 records

Park hours, special schedules, extended evening events. Input for schedule_opportunity knowledge type (proposed).

dailyAvgWait
4,142 records · Pre-aggregated

Pre-aggregated daily average wait times per attraction. Fields: entity_id, name, avg_wait_time, sample_count, period. Used for low_wait_gem extraction.

dailyTop
1,561 records · Pre-aggregated

Top 10 attractions by max wait time per day. ID format: dailyTop_<uuid>_YYYY-MM-DD. Date extracted from ID suffix. Used for popularity_ranking.

downSummaryv2
Daily aggregate · downtime/reliability

Daily per-attraction downtime totals. Fields: downtime_hours, downtime_minutes, incident_count. Partition key filtered. Sole source for reliability_info.

dailyTopSellout
713 records · Pre-aggregated

Top sellout items by day. Feeds sellout ranking knowledge types and LL release pattern analysis.

currentDown
105 records · Live status

Currently closed/down attractions with live downtime tracking. Fields: entity_id, down_since, downtime_hours. Used for crowd_shift_alert (proposed).

dailyAvgWaitRestaurant
2,354 records

Restaurant-specific daily wait aggregates. Feeds dining knowledge types: dining_pressure_window, restaurant_strategy.

dailyTopRestaurant
687 records

Top restaurants by wait per day. Parallel to dailyTop but scoped to dining. Query pattern identical.

3-Strategy Router with Phi-4-mini + SLM Classifier

Every question hits a multi-layer routing system before any data is retrieved. Deterministic fast paths handle numeric questions without LLM inference cost. A fine-tuned Phi-3 Mini SLM (deployed to Container Apps) handles classification with Phi-4-mini as fallback when the SLM returns low confidence or is unavailable. Pattern keywords override both classifiers when conceptual questions would be misrouted to structured queries. This same query intelligence layer also serves the content generation pipelines as a tool call.

Query Routing Pipeline

Fast paths fire first — if a question matches a deterministic pattern (wait count, average wait by date, downtime total by week), it resolves directly against PostgreSQL with zero LLM cost. Otherwise, Phi-4-mini-instruct classifies to semantic/structured/knowledge, temporal references are parsed and date drift is corrected, then the strategy override layer catches conceptual questions the classifier misrouted to structured mode.

Fast Path

Regex entity extract
Numeric question match
Direct PG query — 0 LLM calls

SLM → Phi-4-mini

Phi-3 Mini via Container Apps (rollout %)
Confidence gate ≥ 0.85 → trust SLM
Fallback: Phi-4-mini-instruct (Azure OpenAI)

Temporal Correction

parse_temporal_references()
Fixes model year drift
"Saturday" → next occurrence or 90-day history

Strategy Override

Pattern keywords → force semantic
KPI keywords → inject KPI doctypes
Strip invalid knowledge_types

Retrieval

Semantic: embed + BM25
Structured: dynamic Cosmos SQL
Knowledge: partition read + BM25

SLM Client design: should_use_slm(query) uses a deterministic SHA-256 hash of the query — the same question always takes the same path, so A/B comparison logging is consistent. Traffic is split by SLM_ROLLOUT_PERCENTAGE (env var, 0–100). If the SLM returns confidence < SLM_CONFIDENCE_THRESHOLD (default 0.85), it falls back to Phi-4-mini and stamps classification_source = "slm_fallback_low_confidence" on the result for observability.

SHA-256 deterministic routing SLM_ROLLOUT_PERCENTAGE (0–100) SLM_CONFIDENCE_THRESHOLD (default 0.85) SLM_TIMEOUT_MS (default 2000) classification_source field on every response 2-retry with exponential backoff
Strategy 1

Semantic Search

Embeds the question with text-embedding-3-small, queries PostgreSQL HNSW index (embedding <=> query_vec cosine distance), returns top-k by similarity. For knowledge lookup mode, fetches by type partition and re-ranks with BM25 in Python to avoid the ordering-bias bug described below.

"What are the best hidden gems at Magic Kingdom?"
"Strategy tips for rope drop?"
Strategy 2

Structured Query

Routes to precomputed_docs PostgreSQL table (partitioned by YYYY-MM-DD) for raw event types, or Cosmos DB for timeseries aggregates. KPI types use PostgreSQL point reads (pg_point_read_kpi) — ~2 RU equivalent vs 10–30 RU for Cosmos partition scans.

"What was Space Mountain's max wait on 2026-06-01?"
"Top 10 by wait time yesterday"
Fast Paths

Deterministic Routes

Three regex-driven fast paths bypass the classifier entirely. Entity extracted from question via pattern matching. Resolves directly against PostgreSQL — zero LLM inference, single indexed query.

"How many times did Haunted Mansion appear in the top 10 this year?"
"Average wait for Space Mountain on Tuesdays"
"Big Thunder Mountain total downtime last week"

Temporal Reference Parsing — Fixing LLM Date Drift

LLMs trained on data with a cutoff date consistently return wrong years in extracted date ranges. parse_temporal_references() runs after every classification and corrects or replaces the model's time_range output based on deterministic rules — no model inference involved.

Pattern vs. Specific
Pattern query detected via keywords: "typical", "usually", "average", "trends", "compare" Pattern + day-of-week → 90-day historical window Specific + day-of-week → next occurrence from today
Relative Keywords
"today" → today's date (Eastern Time) "yesterday" → today - 1 "last week" → Mon–Sun of prior week "last 7 days" → rolling 7-day window "this week" → Monday of current week → today "recent patterns" → 30-day window
Drift Detection
Model returns year ≠ current year → log warning + override with 30-day window Pattern: if start/end don't begin with current year → apply default All corrections logged as ⚠️ Date drift detected: model returned {date}
Strategy Override
Pattern keywords → force strategy = "semantic" regardless of classifier output Prevents classifier from routing "best time to visit X" to structured Cosmos SQL (which requires explicit doctypes + valid date ranges to avoid BadRequest errors)

34+ Knowledge Types Across 4 Categories

The knowledge registry defines what the system "knows" — each type maps to a partition in the knowledge store. Operational types are auto-updated by the ETL pipeline. KPI types are generated nightly by a separate kpi_generator. Evergreen types are curated manually and do not expire.

Auto-updated (operational) KPI nightly Evergreen / manual
wait_time_strategy hourly_wait_pattern reliability_info popularity_ranking low_wait_gem next_best_action sellout_risk_window dining_pressure_window lightning_lane_volatility rope_drop_priority day_of_week_pattern ll_sellout_timeline ll_release_pattern seasonal_strategy lightning_lane_value hourly_strategy weekly_trend dining_strategy rope_drop_strategy multiday_strategy restaurant_strategy month_trend kpi_daily_digest kpi_crowd_index kpi_rope_drop_priority kpi_reliability_score kpi_hourly_patterns kpi_weekly_trends hidden_gem weather_ride_strategy weather_wait_correlation seasonal_weather_pattern attraction_backstory crowd_flow_pattern cross_park_comparison budget_optimization park_whisperer_magic

v1 → v2 (A/B) → v3 Production

VersionKnowledge StoreClassifierContainersKey Change
v1 Cosmos DB park_knowledge (partition: /type) None — heuristic routing only park_ops (single) Initial ETL pipeline. All knowledge in single Cosmos container. No embedding-based retrieval.
v2 A/B Cosmos DB park_knowledge_v2 + schema_version gating Phi-4-mini-instruct introduced park_ops → park_timeseries + park_aggregates A/B test of query_intelligence_b/ variant. 22/25 PASS. 3 KPI-compute misses traced to data pipeline gap (not query logic). Promoted to production.
v3 Prod PostgreSQL park_knowledge table — 100% of docs write here. Cosmos park_knowledge_v2 retired. Phi-3 Mini SLM (Container Apps) → Phi-4-mini fallback park_timeseries + park_aggregates (Cosmos, raw ops). precomputed_docs (PostgreSQL, ranked event types). park_knowledge_v2 retired. PostgreSQL HNSW index for vector search. ~80% RU cost reduction on knowledge queries. 2 RU KPI point reads vs 30 RU Cosmos scans. 8s hard statement timeout. SLM client serves both RAG queries and content generation pipelines.

Engineering Choices

Why move knowledge from Cosmos to PostgreSQL in v3?+

Cosmos DB knowledge queries were expensive: a BM25 scan over a full type partition (e.g., all wait_time_strategy docs) required fetching 50–100 documents and cost 10–30 RU per query. For KPI types where the document ID is deterministic (e.g., kpi_daily_digest_{date}), this was wasteful — a point read is available but only if you know the exact partition key and id. PostgreSQL cuts this to ~2 RU equivalent via indexed lookups and supports the BM25 ranking natively via pg_trgm or ts_rank. The 10-minute in-memory cache on warm Azure Function hosts further reduces round-trips for repeated questions about the same park day.

Why does the BM25 ranking NOT use ORDER BY in Cosmos queries?+

A subtle data pipeline ordering bug: the knowledge rebuild script processes attractions in entity_id order — headliner rides (Space Mountain, Haunted Mansion) are written early in the upsert loop, and low-priority attractions are written last. When Cosmos returns results via SELECT TOP N ORDER BY updated_at DESC, only the last-upserted (low-priority) batch reaches the BM25 ranker — headliners are silently excluded from every query result. The fix: fetch all docs for the relevant partition without ordering, then apply BM25 scoring in Python. Each attraction has exactly one doc (upserted by stable id), so there's no version-staleness risk from removing the ORDER BY.

How do the deterministic fast paths work?+

Three regex-driven functions intercept questions before the Phi-4-mini classifier is called:

  • _wait_metric_fast_path(): Detects "how many times did X appear in the top 10" and "average wait for X on [date/weekday]" — extracts entity via regex, routes directly to a PostgreSQL function.
  • _downtime_metric_fast_path(): Detects "total downtime for X last week" — extracts entity and date range, routes to pg_downtime_total().
  • Entity extraction: Multiple regex patterns handle different phrasings. "Average wait for Space Mountain on Tuesdays" and "Space Mountain average wait" both resolve to entity="Space Mountain". Normalizes "the", "a", "ride" prefixes/suffixes.

If a fast path matches but returns None (entity not found or no matching rows), the question falls through to the full classifier pipeline. Zero inference cost when they fire.

Why use Phi-4-mini for classification instead of GPT-4o?+

The classification step runs on every user question — it needs to be fast and cheap, not accurate at the frontier level. Phi-4-mini-instruct produces structured JSON classification outputs (strategy, doctypes, entities, time range, aggregation type) in <500 tokens with response_format: json_object, and does so in ~200–400ms. GPT-4o would add 800ms–2s of latency and 10–20× the inference cost per query. The multi-layer correction pipeline (temporal drift fix, pattern keyword override, KPI type injection, invalid type stripping) handles Phi-4-mini's known failure modes — wrong years, missed KPI routing, invented knowledge types — making the overall accuracy comparable to a more expensive model without paying for it on every question.

What does the pattern keyword override actually fix?+

Phi-4-mini frequently classifies analytical questions as strategy: "structured" — attempting to resolve them as Cosmos DB SQL queries. Questions like "what are the best hidden gems at Magic Kingdom?" or "tips for Lightning Lane strategy?" don't have a doctype or date range to query against. Routing them to the structured path causes a Cosmos BadRequest (missing required params) or silently returns empty results. The override layer scans the question for 30+ pattern keywords ("best", "typical", "strategy", "recommend", "tips", "reliability", "lightning lane", "sell out") and forces strategy = "semantic" when matched — routing these questions to the knowledge base vector search where the answers actually live.

Proposed next knowledge types and what they need+
  • lightning_lane_value: LL pricing vs wait savings. Inputs: lightningLane + dailyAvgWait. Output: LL value score and recommendation bands per attraction per date range.
  • sellout_risk_forecast: "This date has high sellout risk." Inputs: selloutEvent + purchaseOptions + dailySchedule (park hours as context). Model: frequency of past sellouts on matching date patterns.
  • rope_drop_targets: First 1–3 attractions that minimize total morning wait. Inputs: hourly waitSnapshot patterns (early window: 7–9 AM). Output: ranked list by park / day-of-week.
  • crowd_shift_alert: Detect unusual wait spikes from ride closures. Inputs: currentDown + waitSnapshot → statistical deviation from baseline. Would feed real-time pivot recommendations.
Digital Twin Platform ↗ SLM Content Pipelines ↗ Azure Functions Platform ↗ Weather Intelligence ↗ All Projects ↗