A production ETL + semantic search pipeline that transforms 1.6 million raw operational records from theme park systems — wait times, downtime events, sellouts, Lightning Lane pricing — into a PostgreSQL + pgvector knowledge base. Azure OpenAI generates 1536-dimension embeddings on extracted insights; an HNSW index serves cosine similarity search; a Phi-4-mini / SLM classifier routes queries across three strategies; and deterministic fast paths bypass inference entirely for numeric questions. All 100% of knowledge docs write to PostgreSQL — Cosmos was retired as the knowledge store in v3.
build_knowledge_from_ops.py reads pre-aggregated operational datasets from Cosmos DB
park_ops, applies extraction logic to surface durable insights (timing strategies, reliability
patterns, popularity rankings, hidden gems), then generates OpenAI embeddings and upserts the results
into the knowledge store. Full and incremental modes supported — daily incremental runs refresh
only the last 7–14 days, keeping embedding costs manageable.
All data is partitioned by date. The extractors deliberately avoid the raw 1M+ snapshot tables
and query pre-aggregated doctypes first — the raw waitSnapshot table is only queried
for hourly pattern extraction, and even then uses partition key filtering to limit scan range.
Raw per-attraction wait time snapshots. Fields: entity_id, entity_name, wait_time, status, date, hour. Used only for hourly pattern extraction with strict date range filter.
Lightning Lane pricing and availability snapshots. Foundation for future lightning_lane_value and ll_sellout_timeline knowledge types.
Ticket pricing options. Used for sellout risk modeling and LL value scoring knowledge types.
Sold-out events and dates. Input for sellout_risk_window knowledge type and structured fast-path sellout queries.
Park hours, special schedules, extended evening events. Input for schedule_opportunity knowledge type (proposed).
Pre-aggregated daily average wait times per attraction. Fields: entity_id, name, avg_wait_time, sample_count, period. Used for low_wait_gem extraction.
Top 10 attractions by max wait time per day. ID format: dailyTop_<uuid>_YYYY-MM-DD. Date extracted from ID suffix. Used for popularity_ranking.
Daily per-attraction downtime totals. Fields: downtime_hours, downtime_minutes, incident_count. Partition key filtered. Sole source for reliability_info.
Top sellout items by day. Feeds sellout ranking knowledge types and LL release pattern analysis.
Currently closed/down attractions with live downtime tracking. Fields: entity_id, down_since, downtime_hours. Used for crowd_shift_alert (proposed).
Restaurant-specific daily wait aggregates. Feeds dining knowledge types: dining_pressure_window, restaurant_strategy.
Top restaurants by wait per day. Parallel to dailyTop but scoped to dining. Query pattern identical.
Every question hits a multi-layer routing system before any data is retrieved. Deterministic fast paths handle numeric questions without LLM inference cost. A fine-tuned Phi-3 Mini SLM (deployed to Container Apps) handles classification with Phi-4-mini as fallback when the SLM returns low confidence or is unavailable. Pattern keywords override both classifiers when conceptual questions would be misrouted to structured queries. This same query intelligence layer also serves the content generation pipelines as a tool call.
Fast paths fire first — if a question matches a deterministic pattern (wait count, average wait by date, downtime total by week), it resolves directly against PostgreSQL with zero LLM cost. Otherwise, Phi-4-mini-instruct classifies to semantic/structured/knowledge, temporal references are parsed and date drift is corrected, then the strategy override layer catches conceptual questions the classifier misrouted to structured mode.
Regex entity extract
Numeric question match
Direct PG query — 0 LLM calls
Phi-3 Mini via Container Apps (rollout %)
Confidence gate ≥ 0.85 → trust SLM
Fallback: Phi-4-mini-instruct (Azure OpenAI)
parse_temporal_references()
Fixes model year drift
"Saturday" → next occurrence or 90-day history
Pattern keywords → force semantic
KPI keywords → inject KPI doctypes
Strip invalid knowledge_types
Semantic: embed + BM25
Structured: dynamic Cosmos SQL
Knowledge: partition read + BM25
SLM Client design: should_use_slm(query) uses a deterministic SHA-256 hash of the query — the same question always takes the same path, so A/B comparison logging is consistent. Traffic is split by SLM_ROLLOUT_PERCENTAGE (env var, 0–100). If the SLM returns confidence < SLM_CONFIDENCE_THRESHOLD (default 0.85), it falls back to Phi-4-mini and stamps classification_source = "slm_fallback_low_confidence" on the result for observability.
Embeds the question with text-embedding-3-small, queries PostgreSQL HNSW index (embedding <=> query_vec cosine distance), returns top-k by similarity. For knowledge lookup mode, fetches by type partition and re-ranks with BM25 in Python to avoid the ordering-bias bug described below.
Routes to precomputed_docs PostgreSQL table (partitioned by YYYY-MM-DD) for raw event types, or Cosmos DB for timeseries aggregates. KPI types use PostgreSQL point reads (pg_point_read_kpi) — ~2 RU equivalent vs 10–30 RU for Cosmos partition scans.
Three regex-driven fast paths bypass the classifier entirely. Entity extracted from question via pattern matching. Resolves directly against PostgreSQL — zero LLM inference, single indexed query.
LLMs trained on data with a cutoff date consistently return wrong years in extracted date ranges.
parse_temporal_references() runs after every classification and corrects or replaces
the model's time_range output based on deterministic rules — no model inference involved.
The knowledge registry defines what the system "knows" — each type maps to a partition in
the knowledge store. Operational types are auto-updated by the ETL pipeline.
KPI types are generated nightly by a separate kpi_generator.
Evergreen types are curated manually and do not expire.
| Version | Knowledge Store | Classifier | Containers | Key Change |
|---|---|---|---|---|
| v1 | Cosmos DB park_knowledge (partition: /type) |
None — heuristic routing only | park_ops (single) | Initial ETL pipeline. All knowledge in single Cosmos container. No embedding-based retrieval. |
| v2 A/B | Cosmos DB park_knowledge_v2 + schema_version gating |
Phi-4-mini-instruct introduced | park_ops → park_timeseries + park_aggregates | A/B test of query_intelligence_b/ variant. 22/25 PASS. 3 KPI-compute misses traced to data pipeline gap (not query logic). Promoted to production. |
| v3 Prod | PostgreSQL park_knowledge table — 100% of docs write here. Cosmos park_knowledge_v2 retired. |
Phi-3 Mini SLM (Container Apps) → Phi-4-mini fallback | park_timeseries + park_aggregates (Cosmos, raw ops). precomputed_docs (PostgreSQL, ranked event types). park_knowledge_v2 retired. | PostgreSQL HNSW index for vector search. ~80% RU cost reduction on knowledge queries. 2 RU KPI point reads vs 30 RU Cosmos scans. 8s hard statement timeout. SLM client serves both RAG queries and content generation pipelines. |
Cosmos DB knowledge queries were expensive: a BM25 scan over a full type partition (e.g., all wait_time_strategy docs) required fetching 50–100 documents and cost 10–30 RU per query. For KPI types where the document ID is deterministic (e.g., kpi_daily_digest_{date}), this was wasteful — a point read is available but only if you know the exact partition key and id. PostgreSQL cuts this to ~2 RU equivalent via indexed lookups and supports the BM25 ranking natively via pg_trgm or ts_rank. The 10-minute in-memory cache on warm Azure Function hosts further reduces round-trips for repeated questions about the same park day.
A subtle data pipeline ordering bug: the knowledge rebuild script processes attractions in entity_id order — headliner rides (Space Mountain, Haunted Mansion) are written early in the upsert loop, and low-priority attractions are written last. When Cosmos returns results via SELECT TOP N ORDER BY updated_at DESC, only the last-upserted (low-priority) batch reaches the BM25 ranker — headliners are silently excluded from every query result. The fix: fetch all docs for the relevant partition without ordering, then apply BM25 scoring in Python. Each attraction has exactly one doc (upserted by stable id), so there's no version-staleness risk from removing the ORDER BY.
Three regex-driven functions intercept questions before the Phi-4-mini classifier is called:
pg_downtime_total().entity="Space Mountain". Normalizes "the", "a", "ride" prefixes/suffixes.If a fast path matches but returns None (entity not found or no matching rows), the question falls through to the full classifier pipeline. Zero inference cost when they fire.
The classification step runs on every user question — it needs to be fast and cheap, not accurate at the frontier level. Phi-4-mini-instruct produces structured JSON classification outputs (strategy, doctypes, entities, time range, aggregation type) in <500 tokens with response_format: json_object, and does so in ~200–400ms. GPT-4o would add 800ms–2s of latency and 10–20× the inference cost per query. The multi-layer correction pipeline (temporal drift fix, pattern keyword override, KPI type injection, invalid type stripping) handles Phi-4-mini's known failure modes — wrong years, missed KPI routing, invented knowledge types — making the overall accuracy comparable to a more expensive model without paying for it on every question.
Phi-4-mini frequently classifies analytical questions as strategy: "structured" — attempting to resolve them as Cosmos DB SQL queries. Questions like "what are the best hidden gems at Magic Kingdom?" or "tips for Lightning Lane strategy?" don't have a doctype or date range to query against. Routing them to the structured path causes a Cosmos BadRequest (missing required params) or silently returns empty results. The override layer scans the question for 30+ pattern keywords ("best", "typical", "strategy", "recommend", "tips", "reliability", "lightning lane", "sell out") and forces strategy = "semantic" when matched — routing these questions to the knowledge base vector search where the answers actually live.
lightningLane + dailyAvgWait. Output: LL value score and recommendation bands per attraction per date range.selloutEvent + purchaseOptions + dailySchedule (park hours as context). Model: frequency of past sellouts on matching date patterns.waitSnapshot patterns (early window: 7–9 AM). Output: ranked list by park / day-of-week.currentDown + waitSnapshot → statistical deviation from baseline. Would feed real-time pivot recommendations.