A containerized ETL pipeline running every 10 minutes on Azure Container Apps, consuming the ThemeParks.wiki public API and formatting live attraction status, Lightning Lane availability, sellout events, and park schedules into a PostgreSQL data store that feeds all downstream AI consumers — the ride wait forecasting models, the SRE agent, and the Park Agent chat interface.
Each run executes three sequential processing phases: entities (metadata upserts), schedules (operating hours + Lightning Lane purchase options), and live data (real-time attraction status with event detection). All output lands in PostgreSQL.
/entity/{id}/children — entity roster
/entity/{id}/schedule — operating hours
/entity/{id}/live — real-time status
Walt Disney World entity ID: e957da41-3552-4cf6-b636-5babc5cbc4e5
process_entities()
Fetch flat entity roster (attractions, restaurants, shows)
UPSERT into park_entities_v2 — 286 entities
Fields: entity_id, name, entity_type, parent_id, slug, lat/lon, timezone
RETIRED_ENTITY_IDS blocklist — suppress writes for closed attractions
process_schedules()
Park operating hours per date
UPSERT into park_schedules_v2
Lightning Lane purchase options — UPSERT into ll_purchase_options_v2
Fields: entity_id, date, open_time, close_time, is_park_hours
process_live_data()
Poll real-time status: wait_time, queue_type, status, is_down
UPSERT into park_live_status_logs_v2 — high-volume table
Downtime event detection — compare to PREVIOUS_STATES_CACHE
Sellout event detection — Lightning Lane FINISHED/SOLD_OUT transitions
Release event detection — availability restored after sellout
publish_sellout_events() — called inline from live data phase
Writes selloutEvent (legacy) + selloutEventv2 (enhanced) documents
v2 includes: release_duration_minutes, release_duration_display, operating_minutes
Duration calculated from last releaseEvent → sellout time, excluding CLOSED periods
clip_to_park_hours() strips closed intervals from operating_minutes calculation
az-park-data-v2.postgres.database.azure.com
All writes via psycopg2 with ON CONFLICT DO UPDATE upsert semantics
Blob Storage — raw API response archival (Azure Storage)
35-day backup retention, geo-redundant
All four tables use ON CONFLICT DO UPDATE upsert semantics — every run is idempotent.
The high-volume table is park_live_status_logs_v2, which accumulates a snapshot
every 10 minutes for each of the 286 park entities.
Sellout and release events are written as structured documents in addition to the tabular snapshots. The v2 sellout event enriches the legacy format with duration analytics derived from operating hours.
| Field | Type | Source | Description |
|---|---|---|---|
| doctype | TEXT | hardcoded | selloutEventv2 |
| entity_id / entity_name | UUID / TEXT | API | Attraction identity |
| availability_type | TEXT | API | LIGHTNING_LANE_SINGLE or LIGHTNING_LANE_MULTI |
| sold_out_at | TIMESTAMPTZ | state transition | UTC timestamp when status flipped to FINISHED/SOLD_OUT |
| sold_out_time | TEXT | derived | Human-readable Eastern time: "2:47 PM" |
| release_duration_minutes | FLOAT | computed | Minutes from last releaseEvent → this sellout. Null if no prior release found. |
| release_duration_display | TEXT | derived | Formatted duration: "1h 23m" or "47m" |
| operating_minutes | FLOAT | computed | Duration minus CLOSED park intervals via clip_to_park_hours() |
| price | NUMERIC | ll_purchase_options_v2 | Purchase price at time of sellout |
| operational_date | TEXT | derived | YYYY-MM-DD in Eastern time (rolls at 3 AM) |
The data store migration was the most significant architectural decision in this pipeline's history. The original v6 design used a two-job split — ingest to PostgreSQL, then a second job (park-cosmos-publisher) to read from PostgreSQL and re-publish to Cosmos DB. This round-trip was eliminated entirely.
park_ops container with a single /doctype partition key. Hot partition problem emerged as waitSnapshot grew to 62.5% of all documents.park_timeseries, park_aggregates, park_config) with hierarchical partition keys. Solved hot partition issue but introduced operational complexity and ~$394/month Cosmos floor cost.Every write uses ON CONFLICT (entity_id) DO UPDATE or composite key variants. This makes every run idempotent — a job that runs twice at the same timestamp produces identical state rather than duplicate rows. For a 10-minute cron that has no deduplication budget, idempotency is more valuable than the slight write amplification from checking for existing rows first.
The exception is park_live_status_logs_v2, which is designed as an append log — each 10-minute snapshot is a distinct row even for the same entity. The upsert key here is (entity_id, status_timestamp), so re-runs at the same timestamp still converge to one row per entity per timestamp.
Sellout and downtime event detection compares the current API response against the previous known state for each entity. Rather than querying PostgreSQL on every iteration to find the last known status, the cache is loaded once at the start of each run and updated in-place as the live data phase processes each entity. This reduces per-entity queries from O(n) database calls to one bulk load.
The cache is module-level (PREVIOUS_STATES_CACHE = {}), so it resets on each container invocation. Container App Jobs start fresh for each scheduled execution — there's no warm instance carrying stale cache between runs. This is intentional: it avoids a ghost state problem where a cache entry from a prior run suppresses a real state transition that happens to look like the previous value.
A Lightning Lane sellout that lasts 90 minutes means something very different if the park was closed for 45 of those minutes. clip_to_park_hours() subtracts CLOSED intervals from the release-to-sellout window to compute the number of minutes the attraction was actually open and accessible during that period. This makes operating_minutes the correct feature for demand analysis — a 45-minute operating window that sold out is more significant than a 90-minute raw window that included park closure.
This field directly feeds the downstream content pipeline's analysis of which attractions have the most aggressive sellout patterns relative to their actual operational window.
The original decision to use Cosmos DB was reasonable at the time: the hot partition problem wasn't visible until waitSnapshot accumulated millions of documents. Once it was visible, the planned solution was a hierarchical partition key migration — creating park_ops_v2 with [/doctype, /date] keys and backfilling 5.9M documents. That plan was designed, documented, and ready to execute.
The decision to abandon Cosmos entirely instead came from a simpler observation: every aggregate document in Cosmos was computed from data that already lived in PostgreSQL. The publisher read from PostgreSQL, computed GROUP BY rollups, and wrote them to Cosmos — which was then re-read by downstream consumers. This is a pure round-trip with no benefit. PostgreSQL can compute the same aggregates at query time or via materialized views at a fraction of the cost, with one fewer hop in the data path and no per-RU billing.
The v6 "split architecture" was always meant to be temporary — a transition state between the old monolith and a cleaner design. The split was introduced to provide fault isolation while the Cosmos migration was planned. Once the decision to drop Cosmos was made, the split job was no longer necessary and was eliminated.
When an attraction closes permanently (Rock 'n' Roller Coaster, DINOSAUR), the ThemeParks.wiki API may still return the entity in the roster with stale data for some time. Writing zero-wait or null status records for a closed attraction adds noise to the historical dataset that downstream models have to filter. The blocklist prevents writes for specific entity IDs entirely, while preserving all historical rows already in the database.
A soft-delete approach (marking the entity as inactive in park_entities_v2) would still require every downstream consumer to filter on the inactive flag. The blocklist is enforced at the ingest layer, keeping downstream consumers clean without extra filter logic.