Data Engineering · Architecture Evolution · PostgreSQL · Cosmos DB

PostgreSQL → Cosmos DB → PostgreSQL
A Data Pipeline Architecture Story

Over 18 months of building the Park Whisperer data platform, the primary analytics store made a full circuit: starting in a relational database, migrating to a globally-distributed NoSQL document store to solve scale problems, and ultimately returning to PostgreSQL after discovering the NoSQL round-trip was adding cost and latency with no benefit. The numbers at each inflection point tell the story.

5.9M
Peak Cosmos docs
$394
Monthly Cosmos floor
62.5%
Hot partition share
6–9 min
Pre-optimization runtime
57,600
Docs/day at peak
$0
Cosmos cost today

One Pipeline, Three Storage Architectures

Each era was the right decision at the time, based on what was known. The progression wasn't a failure of planning — it was the result of real data revealing constraints that couldn't be predicted without production scale.

Era 1 · v1–v5 · 2024–early 2026
Cosmos DB — Single Container
park_ops · /doctype partition key
StoreCosmos DB park_ops
Partition key/doctype
Documents at peak~5.9M
Growth rate57,600 docs/day
Hot partitionswaitSnapshot 62.5% · lightningLane 22.5%
Indexing policy/* (everything)
Publisher writes/run400–600 individual HTTP calls
Job runtime6–9 minutes / 10-min cadence
⚠ Hot partition projected to hit 20 GB Cosmos hard limit in ~11 months. Two partitions held 85% of all data. Over-indexing added ~50% unnecessary write RUs.
Era 2 · v6 Split · early–mid 2026
Two-Job Split — Ingest + Publish
PostgreSQL raw → Cosmos analytics (3 containers)
Job 1API → PostgreSQL (park_live_status_logs_v2)
Job 2PostgreSQL → Cosmos (3 containers)
Cosmos containerspark_timeseries · park_aggregates · park_config
Partition keys[doctype, partitionKey] hierarchical
Cosmos floor cost~$394/month (5 containers total)
Round-trip hopsAPI → Postgres → Cosmos → consumer
OptimizationsBatch writes · ThreadPoolExecutor · 6 PG indexes
⚠ Hierarchical partition keys solved hot partitions. But park_aggregates was computing GROUP BY rollups from PostgreSQL and writing them to Cosmos — which consumers then re-read. A pure unnecessary round-trip at $116/month.
Era 3 · v6 Current · June 2026
PostgreSQL Only — Single Job
All consumers read directly from PostgreSQL
StorePostgreSQL Flexible Server
Jobs1 (park-data-ingest only)
Cosmos DBDecommissioned
Cosmos monthly cost$0
ConsumersAzure Functions · Ride forecasting · RAG pipeline
Data hopsAPI → PostgreSQL → consumer
Operational complexity1 container · 1 store · 1 cron
✓ One fewer hop, one fewer service, zero NoSQL cost. park_aggregates rollups are computed directly in PostgreSQL at query time or via materialized views. The data never needed to leave the relational store.

When Partition Key Choice Becomes a Time Bomb

The original park_ops container used /doctype as its partition key. This was simple and intuitive — queries always knew their doctype. But it created two critically overloaded physical partitions that couldn't be scaled out.

Cosmos partition distribution by document count
waitSnapshot 62.5%
lightningLane 22.5%
purchaseOptions 7.2%
other 8%
Two partitions held 85% of 5.9M documents. Both written every 10 minutes across ~200–300 entities.
Root Cause
High-frequency writes to a low-cardinality partition key
waitSnapshot was written for every attraction (200+) every 10 minutes. With /doctype as the partition key, every single waitSnapshot row landed in the same logical partition. At 57,600 documents per day, waitSnapshot alone was projected to hit Cosmos DB's 20 GB per-logical-partition hard limit in approximately 11 months. There is no way to increase this limit — it is a fundamental constraint of the Cosmos DB storage model.
20 GB hard limit ~3.7M docs in one partition Cannot be fixed in-place
What Made It Worse
Over-indexing multiplied every write's cost
The container used the default /* indexing policy, which indexes every field in every document. Each waitSnapshot document has 15–20 fields. Indexing them all added approximately 50% unnecessary write RUs. Combined with 400–600 individual (non-batched) HTTP calls per publisher run, the job was spending 6–9 minutes of a 10-minute window just writing to Cosmos.
400–600 individual writes/run 50% extra RU from over-indexing 6–9 min runtime / 10-min cadence
Phase 1 Fix
Selective indexing + batched writes reduced runtime by ~60%
Applied a selective indexing policy with 19 specific field paths and 4 composite indexes — only what was actually queried. Replaced individual upsert_item() calls with execute_item_batch() backed by a ThreadPoolExecutor. Added ThreadPoolExecutor(max_workers=4) to parallelize the 6 independent publish functions. Applied 6 PostgreSQL indexes on park_live_status_logs_v2 to speed the source queries that fed the publisher.
Runtime: 6–9 min → ~2–4 min Batch writes via execute_item_batch() 6 PG covering indexes
Phase 2 Plan
Hierarchical partition keys — but requires full data migration
The permanent fix for the hot partition was to change the partition key from /doctype to [/doctype, /partitionKey] (hierarchical), where partitionKey is the operational date. This would distribute waitSnapshot across ~300 sub-partitions (one per day) instead of one. The problem: Cosmos DB does not allow in-place partition key changes. The only path is creating a new container (park_ops_v2), backfilling 5.9M documents via migration script (~4–6 hours), updating all 34 dashboard API call sites, updating Azure Functions, and running 48 hours of parallel validation before deleting the old container. This plan was designed, documented in detail, and fully ready to execute.
5.9M documents to migrate 34 consumer call sites to update 4–6 hour migration window Plan was designed but never executed

When the Data Never Needed to Leave PostgreSQL

While planning the hierarchical partition key migration, a more fundamental question surfaced: what exactly was Cosmos providing that PostgreSQL couldn't? The answer, for most of the data, was nothing.

The Round-Trip
park_aggregates was PostgreSQL data laundered through Cosmos
The park_aggregates Cosmos container held daily, weekly, and monthly rollups: dailyAvgWait, weeklyTop, monthlyTop, downSummaryv2, and ~20 other aggregate doctypes. Every one of these was computed by the publisher reading from park_live_status_logs_v2 in PostgreSQL, running GROUP BY queries, and writing the results to Cosmos. Downstream consumers then queried Cosmos to get those aggregates back. The data originated in PostgreSQL, was computed in PostgreSQL, written to Cosmos, and read back from Cosmos. Cosmos was a $116/month cache for data that could be queried directly from its origin.
PostgreSQL → Cosmos → PostgreSQL → consumer ~$116/month for park_aggregates alone ~262K aggregate documents
The Cost Breakdown
$394/month Cosmos floor with no burst headroom accounted for
At the time of the April 2026 architecture review, five Cosmos containers were active: park_timeseries (~$58/mo), park_aggregates (~$58/mo), park_config (~$23/mo), park_knowledge_v2 (~$232/mo for 4000 RU/s autoscale), and pipeline_configs (~$23/mo). The floor is ~$394/month — this is the minimum with zero traffic. Burst writes during active publisher runs push it higher. PostgreSQL Flexible Server, by comparison, costs a fixed amount regardless of query volume.
$394/month floor — before burst park_knowledge_v2: $232/mo (4000 RU/s) PostgreSQL: fixed cost, no per-request billing
The pgvector Observation
park_knowledge_v2 vector search was already being done outside Cosmos
park_knowledge_v2 held the knowledge base for the Park Agent's RAG pipeline — KPI digests, crowd index scores, reliability scores, hourly patterns. The semantic search + BM25 hybrid ranking used for retrieval was implemented externally to Cosmos (Cosmos has no native vector search on this account). The vector computation and similarity search was already running in application code. PostgreSQL with the pgvector extension is a direct drop-in: tsvector/tsquery for BM25-style keyword matching, and a vector column with <-> cosine distance for semantic search — at no additional per-query cost.
$232/mo for what was application-layer search anyway pgvector: same search, zero per-query cost Already operating pg on az-park-data-v2
What Cosmos Was Actually Good For
park_timeseries — the one genuine fit
park_timeseries held waitSnapshot, lightningLane, currentDownv2, hourlyWait, and purchaseOptions: high-frequency time-series data written for 286 entities every 10 minutes. This is the write pattern Cosmos is designed for. Each entity becomes a natural partition unit. With hierarchical keys [doctype, entity_id] or [doctype, date], the data distributes evenly across hundreds of physical partitions with no hot-spot risk. The original plan preserved park_timeseries in Cosmos for exactly this reason. It was only after the full Cosmos decommission decision that even this was migrated away — because at that point, maintaining any Cosmos dependency meant maintaining all the client plumbing, key rotation procedures, and billing overhead.
High-freq time-series: genuine Cosmos fit But: maintaining one container still requires all the ops overhead Final decision: full decommission simplifies everything

Quantified Impact Across Each Transition

Publisher Job Runtime
v5 Cosmos (unoptimized)6–9 min
v6 Split (post-optimize)2–4 min
v6 Current (PG only)2–3 min
Monthly Cosmos Cost (floor)
v5 Single container~$100+
v6 Split (5 containers)~$394
v6 Current (decommissioned)$0
Cosmos Writes Per Run
Individual upsert_item() calls400–600
After batch + parallelization~10 batch calls
v6 Current0
Hot Partition Share (waitSnapshot)
park_ops /doctype key62.5%
After 3-container migrationDistributed
v6 CurrentN/A (PG)
Data Hop Count (API → consumer)
v5 Cosmos direct2 hops
v6 Split (PG → Cosmos → consumer)3 hops
v6 Current (PG direct)2 hops
Jobs / Containers in Pipeline
v5 Monolith1 job · 1 Cosmos container
v6 Split2 jobs · 5 Cosmos containers
v6 Current1 job · 0 Cosmos containers

What Each Transition Taught About Data at Scale

Partition key selection is a forward-looking commitment, not a convenience choice+

Choosing /doctype as the Cosmos partition key was a natural decision early on — every query filters by doctype, so it seems like the right primary partitioning axis. The problem doesn't appear until write volume reveals that two doctypes will absorb 85% of all data. At a few hundred thousand documents, this is invisible. At 5.9M documents growing at 57,600 per day, it's a time bomb with a known detonation date.

The lesson is that partition key selection for time-series or event data should be evaluated against the write distribution of the data at projected scale — not the query patterns of today. The best query partition key and the best write-distribution partition key are often different. Hierarchical partition keys exist precisely to reconcile this tension, but using them from day one on a greenfield system would have avoided the migration entirely.

A "cache" that computes from the same source on every write is a liability, not an asset+

The park_aggregates Cosmos container was, in effect, a materialized view — daily/weekly/monthly GROUP BY rollups of data already in PostgreSQL. The rationale for caching these in Cosmos was that NoSQL document access is faster than running aggregation queries. This is only true if the aggregations are expensive at query time and the NoSQL store is cheaper to operate than re-querying.

Neither was true here. The aggregation queries are standard GROUP BY over indexed columns — fast in PostgreSQL. And Cosmos at $116/month for park_aggregates alone is more expensive than adding a materialized view refresh to the PostgreSQL job. The moment the pipeline is redesigned without the Cosmos write step, the $116/month goes to zero and query latency barely changes because PostgreSQL was already the origin of the data.

The general principle: any caching layer whose cache is populated entirely from the same source system as its consumers should be evaluated on whether the operational overhead of maintaining it is less than the cost of querying the source directly. In this case it wasn't.

Batch writes matter more than any other single optimization+

Moving from individual upsert_item() calls to execute_item_batch() had the largest single impact on publisher runtime of any change made. The problem is intuitive in retrospect: each individual Cosmos write is a separate HTTPS request with ~50–150ms of network round-trip latency. At 400–600 writes per publisher run, this is 40–90 seconds of pure network wait time with no compute happening — the application is just waiting for acknowledgments.

Batching 100 items per batch call reduces the network wait by a factor of 100×. The CPU overhead of assembling the batch is negligible. This is a straightforward application of the fundamental principle that network I/O is orders of magnitude more expensive than local computation, and any architecture that issues one network call per record is wasting most of its execution time.

The same principle applies to the PostgreSQL side: the 6 covering indexes added to park_live_status_logs_v2 turned full sequential scans into index-only scans. The combination of fast source queries and fast batch destination writes brought the publisher from 6–9 minutes to 2–4 minutes per run.

The cost of switching stores is not just the migration — it's updating every consumer+

The hierarchical partition key migration plan was fully designed and never executed. The reason is not that the plan was wrong — it was sound. The reason is that full execution required updating 34 separate call sites in the dashboard API plus all Azure Functions consumers, running a 48-hour parallel validation window, and only then deleting the origin container. The actual migration (backfilling 5.9M documents) was estimated at 4–6 hours. The consumer update work was estimated at multiple days.

This is a recurring pattern in data platform work: the migration of the store itself is rarely the hard part. The hard part is finding every consumer, understanding what query pattern each one uses, updating it to the new schema or partition key structure, testing the update doesn't change behavior, and running long enough in parallel to be confident. This is the hidden cost that makes "just change the partition key" a multi-week project.

The decision to decommission Cosmos entirely rather than migrate it was driven partly by this: if migrating even one container requires touching 34 call sites, the ongoing maintenance cost of "Cosmos as a first-class store" is high. A single PostgreSQL endpoint with a stable schema has no per-partition query syntax, no partition key to remember, and no migration path for future schema changes — just standard SQL.

NoSQL excels at its use case — and it was genuinely the right call for time-series writes+

The story of returning to PostgreSQL is not a story about NoSQL being wrong. park_timeseries was the one container that was a genuine fit for Cosmos: 286 entities × 10-minute write cadence × multiple doctypes = hundreds of documents per minute, each naturally partitioned by entity or date. Cosmos's multi-hash hierarchical partitioning distributes this load across hundreds of physical partitions with no hot-spot risk and no lock contention. A single PostgreSQL table receiving this write volume would require careful partitioning, connection pooling, and ongoing DBA attention to stay healthy.

The original design — ingest to PostgreSQL, publish time-series to Cosmos — was architecturally coherent. The error was extending Cosmos to aggregates and knowledge documents that had no need for its partition-based scalability. The lesson is precision: use the right store for each data shape, not a single store for everything, but also not the same store for data that doesn't share the same access patterns just because you've already paid for the account.

01
Partition key = write distribution commitment
Evaluate partition keys against projected write volume at scale, not convenience for today's query patterns.
02
Network calls per record is always a bug
400 individual HTTP calls at 100ms each = 40 seconds of wall time doing nothing. Batch or pipeline wherever writes touch a remote store.
03
Round-trip caches require justification
If a "cache" is computed from the same source as its consumers, calculate whether its operational overhead is less than re-querying the source.
04
Migration cost = store work + consumer work
The data migration is rarely the bottleneck. Finding and updating all consumers is where the real time goes.
05
NoSQL per-request billing compounds at scale
$394/month floor before any burst traffic. Fixed-cost relational stores eliminate billing anxiety at high write cadence.
06
Use the right store for each shape — but no more stores than necessary
Cosmos was genuinely correct for high-frequency partitioned time-series. It was not correct for pre-computed aggregates that originated in PostgreSQL.
ETL Pipeline Detail → Downstream: Ride Forecasting ↗ Park Agent Chat ↗ All Projects ↗