Over 18 months of building the Park Whisperer data platform, the primary analytics store made a full circuit: starting in a relational database, migrating to a globally-distributed NoSQL document store to solve scale problems, and ultimately returning to PostgreSQL after discovering the NoSQL round-trip was adding cost and latency with no benefit. The numbers at each inflection point tell the story.
Each era was the right decision at the time, based on what was known. The progression wasn't a failure of planning — it was the result of real data revealing constraints that couldn't be predicted without production scale.
/doctype
The original park_ops container used /doctype as its partition key.
This was simple and intuitive — queries always knew their doctype. But it created two
critically overloaded physical partitions that couldn't be scaled out.
waitSnapshot was written for every attraction (200+) every 10 minutes. With /doctype as the partition key, every single waitSnapshot row landed in the same logical partition. At 57,600 documents per day, waitSnapshot alone was projected to hit Cosmos DB's 20 GB per-logical-partition hard limit in approximately 11 months. There is no way to increase this limit — it is a fundamental constraint of the Cosmos DB storage model./* indexing policy, which indexes every field in every document. Each waitSnapshot document has 15–20 fields. Indexing them all added approximately 50% unnecessary write RUs. Combined with 400–600 individual (non-batched) HTTP calls per publisher run, the job was spending 6–9 minutes of a 10-minute window just writing to Cosmos.upsert_item() calls with execute_item_batch() backed by a ThreadPoolExecutor. Added ThreadPoolExecutor(max_workers=4) to parallelize the 6 independent publish functions. Applied 6 PostgreSQL indexes on park_live_status_logs_v2 to speed the source queries that fed the publisher./doctype to [/doctype, /partitionKey] (hierarchical), where partitionKey is the operational date. This would distribute waitSnapshot across ~300 sub-partitions (one per day) instead of one. The problem: Cosmos DB does not allow in-place partition key changes. The only path is creating a new container (park_ops_v2), backfilling 5.9M documents via migration script (~4–6 hours), updating all 34 dashboard API call sites, updating Azure Functions, and running 48 hours of parallel validation before deleting the old container. This plan was designed, documented in detail, and fully ready to execute.While planning the hierarchical partition key migration, a more fundamental question surfaced: what exactly was Cosmos providing that PostgreSQL couldn't? The answer, for most of the data, was nothing.
park_aggregates Cosmos container held daily, weekly, and monthly rollups: dailyAvgWait, weeklyTop, monthlyTop, downSummaryv2, and ~20 other aggregate doctypes. Every one of these was computed by the publisher reading from park_live_status_logs_v2 in PostgreSQL, running GROUP BY queries, and writing the results to Cosmos. Downstream consumers then queried Cosmos to get those aggregates back. The data originated in PostgreSQL, was computed in PostgreSQL, written to Cosmos, and read back from Cosmos. Cosmos was a $116/month cache for data that could be queried directly from its origin.park_timeseries (~$58/mo), park_aggregates (~$58/mo), park_config (~$23/mo), park_knowledge_v2 (~$232/mo for 4000 RU/s autoscale), and pipeline_configs (~$23/mo). The floor is ~$394/month — this is the minimum with zero traffic. Burst writes during active publisher runs push it higher. PostgreSQL Flexible Server, by comparison, costs a fixed amount regardless of query volume.park_knowledge_v2 held the knowledge base for the Park Agent's RAG pipeline — KPI digests, crowd index scores, reliability scores, hourly patterns. The semantic search + BM25 hybrid ranking used for retrieval was implemented externally to Cosmos (Cosmos has no native vector search on this account). The vector computation and similarity search was already running in application code. PostgreSQL with the pgvector extension is a direct drop-in: tsvector/tsquery for BM25-style keyword matching, and a vector column with <-> cosine distance for semantic search — at no additional per-query cost.park_timeseries held waitSnapshot, lightningLane, currentDownv2, hourlyWait, and purchaseOptions: high-frequency time-series data written for 286 entities every 10 minutes. This is the write pattern Cosmos is designed for. Each entity becomes a natural partition unit. With hierarchical keys [doctype, entity_id] or [doctype, date], the data distributes evenly across hundreds of physical partitions with no hot-spot risk. The original plan preserved park_timeseries in Cosmos for exactly this reason. It was only after the full Cosmos decommission decision that even this was migrated away — because at that point, maintaining any Cosmos dependency meant maintaining all the client plumbing, key rotation procedures, and billing overhead.Choosing /doctype as the Cosmos partition key was a natural decision early on — every query filters by doctype, so it seems like the right primary partitioning axis. The problem doesn't appear until write volume reveals that two doctypes will absorb 85% of all data. At a few hundred thousand documents, this is invisible. At 5.9M documents growing at 57,600 per day, it's a time bomb with a known detonation date.
The lesson is that partition key selection for time-series or event data should be evaluated against the write distribution of the data at projected scale — not the query patterns of today. The best query partition key and the best write-distribution partition key are often different. Hierarchical partition keys exist precisely to reconcile this tension, but using them from day one on a greenfield system would have avoided the migration entirely.
The park_aggregates Cosmos container was, in effect, a materialized view — daily/weekly/monthly GROUP BY rollups of data already in PostgreSQL. The rationale for caching these in Cosmos was that NoSQL document access is faster than running aggregation queries. This is only true if the aggregations are expensive at query time and the NoSQL store is cheaper to operate than re-querying.
Neither was true here. The aggregation queries are standard GROUP BY over indexed columns — fast in PostgreSQL. And Cosmos at $116/month for park_aggregates alone is more expensive than adding a materialized view refresh to the PostgreSQL job. The moment the pipeline is redesigned without the Cosmos write step, the $116/month goes to zero and query latency barely changes because PostgreSQL was already the origin of the data.
The general principle: any caching layer whose cache is populated entirely from the same source system as its consumers should be evaluated on whether the operational overhead of maintaining it is less than the cost of querying the source directly. In this case it wasn't.
Moving from individual upsert_item() calls to execute_item_batch() had the largest single impact on publisher runtime of any change made. The problem is intuitive in retrospect: each individual Cosmos write is a separate HTTPS request with ~50–150ms of network round-trip latency. At 400–600 writes per publisher run, this is 40–90 seconds of pure network wait time with no compute happening — the application is just waiting for acknowledgments.
Batching 100 items per batch call reduces the network wait by a factor of 100×. The CPU overhead of assembling the batch is negligible. This is a straightforward application of the fundamental principle that network I/O is orders of magnitude more expensive than local computation, and any architecture that issues one network call per record is wasting most of its execution time.
The same principle applies to the PostgreSQL side: the 6 covering indexes added to park_live_status_logs_v2 turned full sequential scans into index-only scans. The combination of fast source queries and fast batch destination writes brought the publisher from 6–9 minutes to 2–4 minutes per run.
The hierarchical partition key migration plan was fully designed and never executed. The reason is not that the plan was wrong — it was sound. The reason is that full execution required updating 34 separate call sites in the dashboard API plus all Azure Functions consumers, running a 48-hour parallel validation window, and only then deleting the origin container. The actual migration (backfilling 5.9M documents) was estimated at 4–6 hours. The consumer update work was estimated at multiple days.
This is a recurring pattern in data platform work: the migration of the store itself is rarely the hard part. The hard part is finding every consumer, understanding what query pattern each one uses, updating it to the new schema or partition key structure, testing the update doesn't change behavior, and running long enough in parallel to be confident. This is the hidden cost that makes "just change the partition key" a multi-week project.
The decision to decommission Cosmos entirely rather than migrate it was driven partly by this: if migrating even one container requires touching 34 call sites, the ongoing maintenance cost of "Cosmos as a first-class store" is high. A single PostgreSQL endpoint with a stable schema has no per-partition query syntax, no partition key to remember, and no migration path for future schema changes — just standard SQL.
The story of returning to PostgreSQL is not a story about NoSQL being wrong. park_timeseries was the one container that was a genuine fit for Cosmos: 286 entities × 10-minute write cadence × multiple doctypes = hundreds of documents per minute, each naturally partitioned by entity or date. Cosmos's multi-hash hierarchical partitioning distributes this load across hundreds of physical partitions with no hot-spot risk and no lock contention. A single PostgreSQL table receiving this write volume would require careful partitioning, connection pooling, and ongoing DBA attention to stay healthy.
The original design — ingest to PostgreSQL, publish time-series to Cosmos — was architecturally coherent. The error was extending Cosmos to aggregates and knowledge documents that had no need for its partition-based scalability. The lesson is precision: use the right store for each data shape, not a single store for everything, but also not the same store for data that doesn't share the same access patterns just because you've already paid for the account.