At QCon San Francisco 2025, Jimmy Morzaria, Staff Software Engineer at Stripe, presented the company's Zero-Downtime Data Movement Platform. The system is built to run petabyte-scale database migrations with traffic cutovers that typically complete in milliseconds, a critical capability for a company handling $1.4 trillion in annual transactions with 99.9995% reliability. [1]
The platform underpins Stripe's database infrastructure, which serves 5 million queries per second across more than 2,000 MongoDB-based shards. [1] Migrations follow a six-phase blueprint guided by three core principles:
- Data Consistency: Keeping data consistent while holding any downtime below that of a typical node failover event. [1]
- Minimal Performance Impact: Minimizing any performance degradation on live queries during the migration. [1]
- Scalability: Accommodating shards that vary significantly in size, from small datasets to tens of terabytes. [1]
The Six Phases of Zero-Downtime Migration
A migration on Stripe's platform proceeds through six phases, in order:
- Migration Registration: The process begins by updating the routing metadata service to register new target shards and their corresponding key ranges, establishing the intended data destination. [1]
- Bulk Data Import: The primary dataset is then transferred using an optimized service. Stripe achieved a tenfold performance improvement by reordering inserts to align with MongoDB's B-tree storage engine, sorting items by the most-used indexes in each shard. [1]
- Async Replication: A dedicated replication service keeps the source and target shards continuously synchronized in both directions. It captures ongoing changes during the migration and also replicates modifications back to the source shards, which makes a complete rollback of the migration possible if issues arise. [1]
- Validation: Before proceeding to traffic switching, a comprehensive validation service performs correctness checks, comparing data between source and target shards to guarantee data integrity. [1]
- Traffic Switch (Cutover): This is the most technically sophisticated phase and relies on "versioned gating." The mechanism coordinates version updates across the database proxy service, coordinator, routing service, and replication service. The client application initially queries through a proxy at version 1, which routes to the source database. Once the coordinator sets version 2 and confirms that replication is synchronized, the proxy fetches the new routes and directs traffic to the target database. The entire coordination typically completes in milliseconds and takes at most 2 seconds, so the disruption is imperceptible to customers. [1]
- Migration Deregistration: The process concludes with cleaning up metadata and decommissioning the migration infrastructure. [1]
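The insert reordering in the Bulk Data Import phase can be illustrated with a minimal sketch. It is not Stripe's import service: the function names `sort_for_import` and `batches`, and the field names used in the usage note, are hypothetical. The idea shown is only that documents are sorted by the fields of a chosen index before being written in batches, so inserts reach the B-tree in key order rather than at random positions.

```python
from typing import Any, Dict, Iterable, Iterator, List, Sequence

Document = Dict[str, Any]


def sort_for_import(docs: Iterable[Document],
                    index_fields: Sequence[str]) -> List[Document]:
    """Return documents ordered by the fields of the chosen index.

    index_fields is assumed to name the most-used index on the shard,
    for example ("merchant", "created"). Every document must have them.
    """
    return sorted(docs, key=lambda d: tuple(d[f] for f in index_fields))


def batches(docs: Sequence[Document], size: int) -> Iterator[List[Document]]:
    """Yield consecutive slices of at most `size` documents, preserving order."""
    for start in range(0, len(docs), size):
        yield list(docs[start:start + size])
```

Each batch could then be passed to a driver's bulk insert call, for example PyMongo's `Collection.insert_many(batch, ordered=True)`. Which index to sort by, and the batch size, would have to be chosen per shard.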
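The versioned gating used in the Traffic Switch phase can be sketched as a small single-process model. Everything here is an assumption made for illustration: the class names (`Shard`, `RoutingService`, `Proxy`), the `cutover` function, and the choice to have a shard reject requests that carry an outdated version token are not taken from the talk. The sketch shows only the general pattern: the coordinator bumps the version, confirms replication has caught up, publishes the new route, and the proxy refreshes its route once its old version is rejected.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


class StaleVersion(Exception):
    """Raised when a request carries an outdated version token."""


@dataclass
class Shard:
    name: str
    version: int = 1
    data: Dict[str, str] = field(default_factory=dict)

    def read(self, key: str, version: int) -> str:
        # Gate: only requests tagged with the shard's current version pass.
        if version != self.version:
            raise StaleVersion(f"{self.name} expects v{self.version}, got v{version}")
        return self.data[key]


@dataclass
class RoutingService:
    route: Tuple[str, int]  # (shard name, version token)


@dataclass
class Proxy:
    routing: RoutingService
    shards: Dict[str, Shard]
    cached: Optional[Tuple[str, int]] = None

    def read(self, key: str) -> Tuple[str, str]:
        """Return (serving shard name, value), refreshing the route if stale."""
        if self.cached is None:
            self.cached = self.routing.route
        for _ in range(2):  # one attempt, plus one retry after a refresh
            name, version = self.cached
            try:
                return name, self.shards[name].read(key, version)
            except StaleVersion:
                self.cached = self.routing.route  # fetch the new route
        raise StaleVersion("route still stale after refresh")


def cutover(source: Shard, target: Shard, routing: RoutingService,
            replication_caught_up: Callable[[], bool]) -> bool:
    """Hypothetical coordinator step: gate the source, then switch routes."""
    new_version = source.version + 1
    source.version = new_version          # old-version requests are now rejected
    if not replication_caught_up():       # target must hold every write
        source.version = new_version - 1  # abort: reopen the source
        return False
    target.version = new_version
    routing.route = (target.name, new_version)
    return True
```

In this model a proxy holding version 1 keeps reading from the source until `cutover` succeeds. Its next request is rejected, it fetches the version 2 route, and the retry lands on the target. A real system would wait for replication rather than abort immediately, and would have to handle concurrent proxies and in-flight writes.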
Beyond Basic Migration
Stripe leverages this platform for more than just horizontal scaling. It's also instrumental in shard merging, performing MongoDB version upgrades across multiple major releases, and facilitating tenancy model transitions. This demonstrates how foundational investments can yield tools capable of serving a wide array of scenarios beyond their initial design. [1]
Stripe's decision to build its DocDB platform internally, rather than relying on managed services, was driven by specific requirements for security policy enforcement, predictable performance, and multi-tenancy support with enforced quotas. Given that 40% of customers abandon transactions after payment denials, zero-downtime migrations are not merely an advantage but an essential operational requirement for Stripe. [1] This strategic build-versus-buy decision underscores the critical importance of tailored solutions for differentiated requirements and stringent security needs in the financial technology sector.