Park Whisperer Knowledge RAG — Vector Search on Operational Data

ETL Pipeline

Raw Ops Data → Embedded Knowledge

build_knowledge_from_ops.py reads pre-aggregated operational datasets from Cosmos DB park_ops, applies extraction logic to surface durable insights (timing strategies, reliability patterns, popularity rankings, hidden gems), then generates OpenAI embeddings and upserts the results into the knowledge store. Full and incremental modes supported — daily incremental runs refresh only the last 7–14 days, keeping embedding costs manageable.

Source

Cosmos DB park_ops 1.6M+ records 19 doctypes database: park-data-az Never query 1M raw snapshots — use pre-aggregated

Extractors

extract_wait_time_patterns() extract_reliability_patterns() extract_popularity_rankings() extract_avg_wait_insights() Pagination via continuation tokens — fetches ALL results, not just first page Operating hours filter: 7 AM–10 PM, min 5 samples/hour

Insight Filtering

wait_time_strategy: worst_wait > best_wait × 1.5 (meaningful spread) reliability_info: uptime % + incident rate thresholds popularity_ranking: top-15 by days in Top 10, avg peak ≥ 30 min low_wait_gem: avg_wait < 15 min, ≥ 5 days of data

Embedding

Azure OpenAI text-embedding-3-small 1536 dimensions Input: title + content Managed Identity (prod) or API key (local dev) Batch variant for bulk runs

Output

PostgreSQL park_knowledge table (100%) HNSW index on embedding column (pgvector) id: {type}_{entity_id}_{YYYYMMDD} JSONB data column — full doc stored alongside row-level query fields entity_id, operational_date, context_for_llm denormalized for fast queries Upsert (safe to re-run) precomputed_docs table: raw event/snapshot types (selloutEventv2, dailyTop, etc.) partitioned by YYYY-MM-DD

Run Modes

Full: all data from 2025-07-26 — complete rebuild Incremental (--incremental): last 7–14 days — daily refresh Purge + regenerate: purge_park_knowledge.py + build script 8-second hard statement timeout per PG query — prevents runaway scans Auto-reconnect: ping on reuse, reconnect on idle timeout

Source Data

19 Doctypes, 1.6M+ Records in park_ops

All data is partitioned by date. The extractors deliberately avoid the raw 1M+ snapshot tables and query pre-aggregated doctypes first — the raw waitSnapshot table is only queried for hourly pattern extraction, and even then uses partition key filtering to limit scan range.

waitSnapshot

1,122,018 records · ⚠️ Use with partition filter

Raw per-attraction wait time snapshots. Fields: entity_id, entity_name, wait_time, status, date, hour. Used only for hourly pattern extraction with strict date range filter.

lightningLane

814,778 records

Lightning Lane pricing and availability snapshots. Foundation for future lightning_lane_value and ll_sellout_timeline knowledge types.

purchaseOptions

355,337 records

Ticket pricing options. Used for sellout risk modeling and LL value scoring knowledge types.

selloutEvent

23,797 records

Sold-out events and dates. Input for sellout_risk_window knowledge type and structured fast-path sellout queries.

dailySchedule

19,095 records

Park hours, special schedules, extended evening events. Input for schedule_opportunity knowledge type (proposed).

dailyAvgWait

4,142 records · Pre-aggregated

Pre-aggregated daily average wait times per attraction. Fields: entity_id, name, avg_wait_time, sample_count, period. Used for low_wait_gem extraction.

dailyTop

1,561 records · Pre-aggregated

Top 10 attractions by max wait time per day. ID format: dailyTop_<uuid>_YYYY-MM-DD. Date extracted from ID suffix. Used for popularity_ranking.

downSummaryv2

Daily aggregate · downtime/reliability

Daily per-attraction downtime totals. Fields: downtime_hours, downtime_minutes, incident_count. Partition key filtered. Sole source for reliability_info.

dailyTopSellout

713 records · Pre-aggregated

Top sellout items by day. Feeds sellout ranking knowledge types and LL release pattern analysis.

currentDown

105 records · Live status

Currently closed/down attractions with live downtime tracking. Fields: entity_id, down_since, downtime_hours. Used for crowd_shift_alert (proposed).

dailyAvgWaitRestaurant

2,354 records

Restaurant-specific daily wait aggregates. Feeds dining knowledge types: dining_pressure_window, restaurant_strategy.

dailyTopRestaurant

687 records

Top restaurants by wait per day. Parallel to dailyTop but scoped to dining. Query pattern identical.

Query Intelligence

3-Strategy Router with Phi-4-mini + SLM Classifier

Every question hits a multi-layer routing system before any data is retrieved. Deterministic fast paths handle numeric questions without LLM inference cost. A fine-tuned Phi-3 Mini SLM (deployed to Container Apps) handles classification with Phi-4-mini as fallback when the SLM returns low confidence or is unavailable. Pattern keywords override both classifiers when conceptual questions would be misrouted to structured queries. This same query intelligence layer also serves the content generation pipelines as a tool call.

Query Routing Pipeline

Fast paths fire first — if a question matches a deterministic pattern (wait count, average wait by date, downtime total by week), it resolves directly against PostgreSQL with zero LLM cost. Otherwise, Phi-4-mini-instruct classifies to semantic/structured/knowledge, temporal references are parsed and date drift is corrected, then the strategy override layer catches conceptual questions the classifier misrouted to structured mode.

Fast Path

Regex entity extract
Numeric question match
Direct PG query — 0 LLM calls

→

SLM → Phi-4-mini

Phi-3 Mini via Container Apps (rollout %)
Confidence gate ≥ 0.85 → trust SLM
Fallback: Phi-4-mini-instruct (Azure OpenAI)

→

Temporal Correction

parse_temporal_references()
Fixes model year drift
"Saturday" → next occurrence or 90-day history

→

Strategy Override

Pattern keywords → force semantic
KPI keywords → inject KPI doctypes
Strip invalid knowledge_types

→

Retrieval

Semantic: embed + BM25
Structured: dynamic Cosmos SQL
Knowledge: partition read + BM25

SLM Client design: should_use_slm(query) uses a deterministic SHA-256 hash of the query — the same question always takes the same path, so A/B comparison logging is consistent. Traffic is split by SLM_ROLLOUT_PERCENTAGE (env var, 0–100). If the SLM returns confidence < SLM_CONFIDENCE_THRESHOLD (default 0.85), it falls back to Phi-4-mini and stamps classification_source = "slm_fallback_low_confidence" on the result for observability.

SHA-256 deterministic routing SLM_ROLLOUT_PERCENTAGE (0–100) SLM_CONFIDENCE_THRESHOLD (default 0.85) SLM_TIMEOUT_MS (default 2000) classification_source field on every response 2-retry with exponential backoff

Strategy 1

Semantic Search

Embeds the question with text-embedding-3-small, queries PostgreSQL HNSW index (embedding <=> query_vec cosine distance), returns top-k by similarity. For knowledge lookup mode, fetches by type partition and re-ranks with BM25 in Python to avoid the ordering-bias bug described below.

"What are the best hidden gems at Magic Kingdom?"
"Strategy tips for rope drop?"

Strategy 2

Structured Query

Routes to precomputed_docs PostgreSQL table (partitioned by YYYY-MM-DD) for raw event types, or Cosmos DB for timeseries aggregates. KPI types use PostgreSQL point reads (pg_point_read_kpi) — ~2 RU equivalent vs 10–30 RU for Cosmos partition scans.

"What was Space Mountain's max wait on 2026-06-01?"
"Top 10 by wait time yesterday"

Fast Paths

Deterministic Routes

Three regex-driven fast paths bypass the classifier entirely. Entity extracted from question via pattern matching. Resolves directly against PostgreSQL — zero LLM inference, single indexed query.

"How many times did Haunted Mansion appear in the top 10 this year?"
"Average wait for Space Mountain on Tuesdays"
"Big Thunder Mountain total downtime last week"

Date Intelligence

Temporal Reference Parsing — Fixing LLM Date Drift

LLMs trained on data with a cutoff date consistently return wrong years in extracted date ranges. parse_temporal_references() runs after every classification and corrects or replaces the model's time_range output based on deterministic rules — no model inference involved.

Pattern vs. Specific

Pattern query detected via keywords: "typical", "usually", "average", "trends", "compare" Pattern + day-of-week → 90-day historical window Specific + day-of-week → next occurrence from today

Relative Keywords

"today" → today's date (Eastern Time) "yesterday" → today - 1 "last week" → Mon–Sun of prior week "last 7 days" → rolling 7-day window "this week" → Monday of current week → today "recent patterns" → 30-day window

Drift Detection

Model returns year ≠ current year → log warning + override with 30-day window Pattern: if start/end don't begin with current year → apply default All corrections logged as ⚠️ Date drift detected: model returned {date}

Strategy Override

Pattern keywords → force strategy = "semantic" regardless of classifier output Prevents classifier from routing "best time to visit X" to structured Cosmos SQL (which requires explicit doctypes + valid date ranges to avoid BadRequest errors)

Architecture Evolution

v1 → v2 (A/B) → v3 Production

Version	Knowledge Store	Classifier	Containers	Key Change
v1	Cosmos DB `park_knowledge` (partition: /type)	None — heuristic routing only	park_ops (single)	Initial ETL pipeline. All knowledge in single Cosmos container. No embedding-based retrieval.
v2 A/B	Cosmos DB `park_knowledge_v2` + schema_version gating	Phi-4-mini-instruct introduced	park_ops → park_timeseries + park_aggregates	A/B test of `query_intelligence_b/` variant. 22/25 PASS. 3 KPI-compute misses traced to data pipeline gap (not query logic). Promoted to production.
v3 Prod	PostgreSQL `park_knowledge` table — 100% of docs write here. Cosmos `park_knowledge_v2` retired.	Phi-3 Mini SLM (Container Apps) → Phi-4-mini fallback	park_timeseries + park_aggregates (Cosmos, raw ops). precomputed_docs (PostgreSQL, ranked event types). park_knowledge_v2 retired.	PostgreSQL HNSW index for vector search. ~80% RU cost reduction on knowledge queries. 2 RU KPI point reads vs 30 RU Cosmos scans. 8s hard statement timeout. SLM client serves both RAG queries and content generation pipelines.

Design Decisions

Engineering Choices

Why move knowledge from Cosmos to PostgreSQL in v3?+

Cosmos DB knowledge queries were expensive: a BM25 scan over a full type partition (e.g., all wait_time_strategy docs) required fetching 50–100 documents and cost 10–30 RU per query. For KPI types where the document ID is deterministic (e.g., kpi_daily_digest_{date}), this was wasteful — a point read is available but only if you know the exact partition key and id. PostgreSQL cuts this to ~2 RU equivalent via indexed lookups and supports the BM25 ranking natively via pg_trgm or ts_rank. The 10-minute in-memory cache on warm Azure Function hosts further reduces round-trips for repeated questions about the same park day.

Why does the BM25 ranking NOT use ORDER BY in Cosmos queries?+

A subtle data pipeline ordering bug: the knowledge rebuild script processes attractions in entity_id order — headliner rides (Space Mountain, Haunted Mansion) are written early in the upsert loop, and low-priority attractions are written last. When Cosmos returns results via SELECT TOP N ORDER BY updated_at DESC, only the last-upserted (low-priority) batch reaches the BM25 ranker — headliners are silently excluded from every query result. The fix: fetch all docs for the relevant partition without ordering, then apply BM25 scoring in Python. Each attraction has exactly one doc (upserted by stable id), so there's no version-staleness risk from removing the ORDER BY.

How do the deterministic fast paths work?+

Three regex-driven functions intercept questions before the Phi-4-mini classifier is called:

_wait_metric_fast_path(): Detects "how many times did X appear in the top 10" and "average wait for X on [date/weekday]" — extracts entity via regex, routes directly to a PostgreSQL function.
_downtime_metric_fast_path(): Detects "total downtime for X last week" — extracts entity and date range, routes to pg_downtime_total().
Entity extraction: Multiple regex patterns handle different phrasings. "Average wait for Space Mountain on Tuesdays" and "Space Mountain average wait" both resolve to entity="Space Mountain". Normalizes "the", "a", "ride" prefixes/suffixes.

If a fast path matches but returns None (entity not found or no matching rows), the question falls through to the full classifier pipeline. Zero inference cost when they fire.

Why use Phi-4-mini for classification instead of GPT-4o?+

The classification step runs on every user question — it needs to be fast and cheap, not accurate at the frontier level. Phi-4-mini-instruct produces structured JSON classification outputs (strategy, doctypes, entities, time range, aggregation type) in <500 tokens with response_format: json_object, and does so in ~200–400ms. GPT-4o would add 800ms–2s of latency and 10–20× the inference cost per query. The multi-layer correction pipeline (temporal drift fix, pattern keyword override, KPI type injection, invalid type stripping) handles Phi-4-mini's known failure modes — wrong years, missed KPI routing, invented knowledge types — making the overall accuracy comparable to a more expensive model without paying for it on every question.

What does the pattern keyword override actually fix?+

Phi-4-mini frequently classifies analytical questions as strategy: "structured" — attempting to resolve them as Cosmos DB SQL queries. Questions like "what are the best hidden gems at Magic Kingdom?" or "tips for Lightning Lane strategy?" don't have a doctype or date range to query against. Routing them to the structured path causes a Cosmos BadRequest (missing required params) or silently returns empty results. The override layer scans the question for 30+ pattern keywords ("best", "typical", "strategy", "recommend", "tips", "reliability", "lightning lane", "sell out") and forces strategy = "semantic" when matched — routing these questions to the knowledge base vector search where the answers actually live.

Proposed next knowledge types and what they need+

lightning_lane_value: LL pricing vs wait savings. Inputs: lightningLane + dailyAvgWait. Output: LL value score and recommendation bands per attraction per date range.
sellout_risk_forecast: "This date has high sellout risk." Inputs: selloutEvent + purchaseOptions + dailySchedule (park hours as context). Model: frequency of past sellouts on matching date patterns.
rope_drop_targets: First 1–3 attractions that minimize total morning wait. Inputs: hourly waitSnapshot patterns (early window: 7–9 AM). Output: ranked list by park / day-of-week.
crowd_shift_alert: Detect unusual wait spikes from ride closures. Inputs: currentDown + waitSnapshot → statistical deviation from baseline. Would feed real-time pivot recommendations.

Park WhispererKnowledge RAG

Raw Ops Data → Embedded Knowledge

19 Doctypes, 1.6M+ Records in park_ops

3-Strategy Router with Phi-4-mini + SLM Classifier

Query Routing Pipeline

Fast Path

SLM → Phi-4-mini

Temporal Correction

Strategy Override

Retrieval

Semantic Search

Structured Query

Deterministic Routes

Temporal Reference Parsing — Fixing LLM Date Drift

34+ Knowledge Types Across 4 Categories

v1 → v2 (A/B) → v3 Production

Engineering Choices

Park Whisperer
Knowledge RAG