An earlier iteration of the wait-time prediction engine that took a fundamentally different approach from v8 Prophet: a single global LightGBM model trained on all attractions, a live PostgreSQL backend for real-time wait data, and a recursive multi-step forecasting loop where each predicted value is fed back as a lag feature for the next interval. Seven prediction horizons from 15 minutes to 3 hours.
The fundamental challenge with any lag-based wait time model is: at t+60 min, the lag features for t+15 and t+30 don't exist yet. v6 solved this by running predictions sequentially — each predicted value is appended to the live data buffer and used as input for the next step.
prediction_buffer.tail(20) — only the 20 most recent points are kept to prevent memory growth.
At prediction start, the buffer is seeded with 6 hours of real PostgreSQL wait-time observations from park_live_status_logs.
Each appended predicted value gradually replaces real observations as the horizon extends.
At +3hr, the buffer contains almost exclusively model-generated values — error compounds across steps.
trained_wait_time_model.pkl
Feature list: model_features.pkl — exact column order the model was trained on
Categorical features: model_categorical_features.pkl
Category levels: model_categorical_feature_categories.pkl — maps each cat column to its training-time levels to prevent unseen-category errors
Entity_ID categories loaded from pkl → filter park_entities to only trained IDs
Entity type filter: ATTRACTION or RIDE — shows excluded
Each entity row: entity_id, name, parent_id (Park_ID), entity_type
Park_ID derived from parent_id → maps to MK/HS/EP/AK prefix for schedule columns
park_live_status_logs PostgreSQL table — query last 6 hours per entity_id
Column: queue_details → 'STANDBY' → 'waitTime' (JSONB path extraction)
Output: list of {datetime, wait_time} sorted by recorded_at ASC
Timestamps normalized to UTC (tzinfo-aware)
6-hour lookback provides lag_7day (zeroed), lag_24hr (zeroed), lag_1hr, lag_30min, lag_15min
park_schedules table → today, yesterday, tomorrow hours (Operating / Early Entry / Extended Evening)
Derived: {PREFIX}HOURS (today), {PREFIX}HOURSYEST, {PREFIX}HOURSTOM — operating hour duration in float hours
METAR weather from weather_data table (station KISM) — closest observation by UTC timestamp
Weather fields: TEMP (°F), HUMIDITY, PRECIP_TOTAL, WIND_SPEED, WIND_GUST, PRESSURE, CONDITION, VISIBILITY, CLOUD_COVER
Park_Name from parent entity lookup in park_entities
School session flags defaulted to 0 if not in DB (Total_Schools_inSession, Schools_inSession_wdw, etc.)
model.predict(features_df) per interval [15, 30, 60, 90, 120, 150, 180 min]
Output clamped: max(0, predicted_wait_time) — no negative wait times
Each prediction appended to buffer before next interval's feature assembly
Buffer pruned to tail(20) to limit memory per ride
Result: dict of {prediction_datetime → wait_time_seconds} per entity_id
Unlike v8 which operated entirely on BigQuery batch data, v6 read from a live PostgreSQL
database at prediction time. This enabled real-time lag features but required a running
wait-time ingestion service feeding park_live_status_logs.
| Table | Key Columns | Used For |
|---|---|---|
| park_live_status_logs | entity_id recorded_at queue_details JSONB | Real-time wait-time buffer — 6-hour lookback window per ride. JSONB path: queue_details → 'STANDBY' → 'waitTime' |
| park_entities | entity_id entity_type parent_id latitude longitude | Entity metadata — name, type (ATTRACTION/RIDE), park parent_id, geographic coordinates |
| park_schedules | entity_id date type start_time end_time | Today / yesterday / tomorrow park hours. Schedule types: Operating, Early Entry, Extended Evening Hours |
| weather_data | station_id observation_time temp_c precip_in wx_string | METAR/ASOS observations from KISM (Disney World weather station). Closest observation by UTC delta selected. |
LightGBM was chosen for three concrete reasons over the alternatives available at this stage. First, it handles categorical features natively — Entity_ID can be passed directly as a pandas category dtype without one-hot encoding, which matters when you have 130+ attractions as a single categorical. Second, training speed: LightGBM's leaf-wise growth is significantly faster than XGBoost's level-wise growth on this size dataset, which mattered when iterating on feature sets. Third, it tolerates missing values without explicit imputation unlike many other algorithms.
The explicit model_categorical_feature_categories.pkl file was added specifically to prevent a common LightGBM deployment bug: if a category appears at inference time that wasn't in training (e.g., a new attraction added after training), the model raises an error. Serializing the training-time category levels and re-applying them at inference forces any new values to be treated as known categories.
v6 used a single LightGBM model trained on all rides simultaneously, with Entity_ID as a categorical feature. This is the opposite of v8's one-Prophet-per-ride approach. The global model benefits from cross-ride patterns: if rain causes Space Mountain's waits to drop and Haunted Mansion's waits to stay flat, the model can learn that pattern from the combined data rather than each ride having to independently learn weather effects from its own sparse rain observations.
The tradeoff: a global model can't capture highly ride-specific seasonality patterns. If Big Thunder Railroad has a unique crowd pattern during Halloween parties that no other ride shares, the global model's Entity_ID feature has to carry that signal alone — it can't model the full interaction between entity, time, and context the way a dedicated per-ride model can. This is why v8 swung hard in the other direction, but then hit the training time problem.
The recursive approach is theoretically sound for short horizons. At +15 minutes, all lag features come from observed PostgreSQL data — the prediction quality is limited only by model accuracy. At +30 minutes, lag_15min comes from the +15 min prediction — one compounded error, but small. At +60 minutes, rolling_mean_1hr is an average of observed data and one or two predicted values — still mostly real. The degradation is gradual.
At +3 hours, lag_1hr is the prediction from +2hr, which itself was built from predicted lags. Rolling_mean_4hr is almost entirely model-generated. The model is predicting based on its own prior predictions rather than anything observed. Empirically this produces "smooth" but increasingly average-regressed forecasts — the model tends toward the historical mean of that entity_id for that hour of day, which is arguably better than the v8 zero-imputation approach but still not reliable.
The buffer was pruned to tail(20) — keeping only the 20 most recent points. This helped memory but also meant at +3hr prediction, you might have only 3 actual observed points and 17 model-generated ones in the window, with the observed ones being 4+ hours stale.
The v6 architecture required a continuously running wait-time ingestion process writing to park_live_status_logs. Without this, the 6-hour lookback buffer would be empty at inference time, all lag and rolling features would fall back to NaN → 0, and the model would be operating in near-identical conditions to v8's imputation approach. The live data was the entire justification for v6's architecture over v8's batch approach.
The queue_details column used a JSONB structure — the actual wait time was at path queue_details → 'STANDBY' → 'waitTime'. This JSONB path is the same structure as the live queue data from the park entity API, so the ingestion pipeline wrote API responses directly as JSON blobs. This meant the schema was flexible but the extraction path had to be hardcoded in the prediction query.
In hindsight, this was the right call for the data model — the current production PostgreSQL database in the successor system uses the same JSONB-first approach for queue data, and the table structure of park_live_status_logs directly influenced the production schema.
Three things v6 got right that v8 abandoned in the pivot to Prophet:
The successor architecture (Knowledge RAG + precomputed aggregates) ultimately rendered the forecasting approach obsolete for the product use case — real-time wait prediction was replaced by historical pattern queries ("typical Saturday morning wait at Space Mountain") which require no live model at all.