There are two ways to modernize a production-critical legacy system. The first is the rewrite — fork off a parallel team, ship a new version in 18 months, do a flag-day cutover, and pray. The second is the strangler-fig migration — extract pieces of the legacy system one at a time, let the old and new versions run side by side, gradually shift traffic, and decommission the legacy when nothing depends on it anymore.
Almost every engineering team's first instinct is the rewrite. Almost every successful production-system modernization we have shipped used the strangler-fig. That mismatch is worth examining, because smart teams keep choosing the rewrite and then suffering through it.
This post is built on the most recent of those engagements: a 14-month migration of a European payment processor from a PHP 5.6 monolith running on dedicated VMs to a Node.js microservices platform running on EKS, with PCI-DSS controls baked into infrastructure-as-code and zero customer-visible downtime through the cutover. p95 latency dropped by roughly 80% (1.8s → 320ms) and the PCI-DSS audit closed without findings.
01
Why rewrites usually fail in production payment systems
The rewrite assumes you understand every undocumented edge case in the legacy system. For a 12-year-old payment platform, this is essentially never true. The codebase has accumulated a dozen years of production discoveries — the time a particular acquirer started returning 200 OK with a body that said FAILED, the workaround for a routing bug whose original cause nobody quite remembers, the magic delay that prevents a race condition between two services that should not be racing.
The rewrite team will discover all of these the same way the original team did — by hitting them in production. Except now they are hitting them on flag day, with the old system already turned off and customers losing money. The strangler-fig structurally avoids this failure mode by keeping the legacy system authoritative until you have proven the new path is safe.
02
Choosing the extraction order
The order in which you extract services from the monolith is the most consequential design decision in a strangler-fig migration. Get it right and you build team confidence early, hit the riskiest parts when the new platform is mature, and the migration accelerates over time. Get it wrong and you spend six months on the hardest service first, lose team morale, and either give up or ship something with insufficient runway.
We rank candidate services on two axes — extraction risk and extraction value — and start in the high-value / low-risk quadrant. Concretely, in the payment-gateway engagement:
- First three services extracted: merchant onboarding, the static merchant portal, the reporting API. High volume in aggregate, but no PCI scope and no risk to live transaction flow.
- Middle batch: webhook delivery, settlement reporting, dispute handling. Higher value, modest risk — these touch live data but are not on the critical authorization path.
- Last batch: the authorization flow itself. The riskiest 30% of the codebase, migrated when the platform was operationally mature and the team had nine months of strangler-fig experience.
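The ranking does not need to be sophisticated. A minimal sketch of the idea, with made-up service names and scores rather than the engagement's actual scoring sheet:

// Illustrative extraction-order ranking (service names and scores are invented)
interface Candidate {
  name: string;
  value: number; // 1 (low) to 5 (high): what extracting it buys you
  risk: number;  // 1 (low) to 5 (high): blast radius if the extraction goes wrong
}

const candidates: Candidate[] = [
  { name: 'merchant-onboarding', value: 4, risk: 1 },
  { name: 'webhook-delivery',    value: 4, risk: 3 },
  { name: 'authorization-flow',  value: 5, risk: 5 },
];

// Start in the high-value / low-risk quadrant; defer the high-risk work
// until the platform and the team have earned the confidence for it.
const extractionOrder = [...candidates].sort(
  (a, b) => (b.value - b.risk) - (a.value - a.risk),
);

The scores matter less than the conversation they force: every service gets an explicit position before anyone writes migration code.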
03
Shadow mode is non-negotiable
Every service we extracted ran in shadow mode for at least two weeks before any customer traffic was routed to the new path. Shadow mode means the new service receives a copy of every production request, processes it, and produces a real response — but the legacy service remains authoritative. The new service's response is logged for comparison, not returned to the customer.
Concretely: every request flows into a routing layer that fans out to both implementations in parallel. The legacy response is what the customer sees. The new response is captured, diffed against the legacy response, and surfaced in a dashboard.
// Simplified shadow-mode dispatch
async function dispatch(req: Request): Promise<Response> {
  const legacyResponse = await legacyService.handle(req);

  // Fire-and-forget: shadow call must never affect the legacy path
  fireAndForget(async () => {
    const startNs = process.hrtime.bigint();
    try {
      const newResponse = await newService.handle(req);
      const latencyMs = Number(process.hrtime.bigint() - startNs) / 1e6;
      shadowDiff.record({
        endpoint: req.endpoint,
        latencyMs,
        diff: deepDiff(legacyResponse, newResponse),
      });
    } catch (err) {
      shadowError.record({ endpoint: req.endpoint, err });
    }
  });

  return legacyResponse;
}

The two design points that matter here: shadow calls must be truly fire-and-forget (a shadow failure must never affect the customer's response), and the diff must be field-aware enough to ignore inconsequential differences (timestamps, generated IDs) while catching meaningful ones (amounts, statuses).
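The deepDiff call in the dispatch snippet above is where the field-awareness lives. A minimal illustration of the idea, assuming flat response objects and an invented ignore list; a real implementation needs to handle nesting and per-endpoint rules:

// Illustrative field-aware diff: ignore volatile fields, flag meaningful ones
const IGNORED_FIELDS = new Set(['timestamp', 'requestId', 'traceId']);

function fieldAwareDiff(
  legacy: Record<string, unknown>,
  candidate: Record<string, unknown>,
): Record<string, { legacy: unknown; candidate: unknown }> {
  const diff: Record<string, { legacy: unknown; candidate: unknown }> = {};
  const keys = new Set([...Object.keys(legacy), ...Object.keys(candidate)]);
  for (const key of keys) {
    if (IGNORED_FIELDS.has(key)) continue; // generated IDs, timestamps, etc.
    if (JSON.stringify(legacy[key]) !== JSON.stringify(candidate[key])) {
      diff[key] = { legacy: legacy[key], candidate: candidate[key] };
    }
  }
  return diff; // empty object: the responses agree on everything that matters
}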
We required a clean week of shadow traffic — no diffs above a defined threshold — before authorizing a cutover for any given service. For most services this took less than two weeks. Two services required three rounds of fixes and roughly six weeks before they were clean. In both cases, what we found in shadow mode would have been customer-visible incidents in a flag-day cutover.
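The gate itself can be mechanical. Something like the following check captures what "a clean week" translates to, assuming a hypothetical shadowStats store that aggregates the diff records emitted by the dispatcher:

// Illustrative cutover gate: a window of shadow traffic under the diff threshold.
// shadowStats and its fields are assumptions, not the engagement's actual store.
interface ShadowWindowStats {
  totalShadowedRequests: number;
  requestsWithMeaningfulDiff: number;
}
declare const shadowStats: {
  forWindow(endpoint: string, days: number): Promise<ShadowWindowStats>;
};

const WINDOW_DAYS = 7;
const MAX_DIFF_RATE = 0.001; // at most 0.1% of shadowed requests may differ on meaningful fields

async function cutoverReady(endpoint: string): Promise<boolean> {
  const stats = await shadowStats.forWindow(endpoint, WINDOW_DAYS);
  if (stats.totalShadowedRequests === 0) return false; // no traffic is not a clean week
  return stats.requestsWithMeaningfulDiff / stats.totalShadowedRequests <= MAX_DIFF_RATE;
}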
04
Blue/green per service, with sub-second rollback
Cutover for each service was a blue/green deployment scoped to that service alone — never a coordinated cutover across multiple services. The routing layer added a per-service flag that could shift traffic from the legacy implementation to the new one (or back) in seconds, with no deployment required.
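In sketch form, the flag is nothing more exotic than a per-service lookup in front of the two implementations. The in-memory map below is a stand-in for whatever low-latency flag store you already run; legacyService and newService are the same handles used in the shadow dispatch above:

// Illustrative per-service cutover flag: shift traffic legacy <-> new without a deploy
type Backend = 'legacy' | 'new';

const serviceFlags = new Map<string, Backend>([
  ['merchant-onboarding', 'new'],
  ['authorization', 'legacy'],
]);

async function route(service: string, req: Request): Promise<Response> {
  const backend = serviceFlags.get(service) ?? 'legacy'; // unknown services stay on legacy
  return backend === 'new' ? newService.handle(req) : legacyService.handle(req);
}

// Rollback is the same mechanism in reverse: flip the flag and the next request takes the old path.
function rollback(service: string): void {
  serviceFlags.set(service, 'legacy');
}

The property that matters is that flipping the flag requires no deploy and no restart, which is what makes the rollback below fast enough to be boring.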
The rollback path was wired into the same dashboard the engineering team used to monitor the cutover. One button, one confirmation, and traffic returned to the legacy path within roughly 800ms of the click. We measured this. We tested it as part of every cutover rehearsal.
Rollback was used three times across the 14-month migration — twice for legitimate issues (a memory leak that only manifested at production scale, a downstream dependency that returned subtly different data in production than it had in the development environment) and once because someone hit the wrong button. In all three cases the customer-visible impact was zero, because rollback was faster than any customer would have noticed the new behavior.
05
Observability is a prerequisite, not a deliverable
We did not start the migration until the observability stack was live in production. This is the single most-skipped step in real-world strangler-fig migrations, and skipping it always costs more than building the stack up front would have.
Observability for a strangler-fig means at minimum: distributed tracing across both the legacy and new implementations (so a request can be followed end-to-end through whichever path it took), structured logs with consistent correlation IDs, per-service latency and error metrics, and a real-time view of the legacy-vs-new traffic split. Without this you are flying blind during the riskiest period in the system's life.
- OpenTelemetry instrumentation in both legacy and new code. Same span semantics, same attribute conventions (a sketch follows this list).
- Centralized log aggregation with correlation IDs that survive the legacy/new boundary.
- Per-service dashboards that always show 'legacy share / new share / error delta' as the headline view.
- Alerts on the diff dashboard, not just on raw error rates. A clean diff dashboard while the new path is broken is worse than a noisy dashboard you trust.
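A rough sketch of what that shared span convention can look like on the Node side; the span and attribute names here are illustrative, and the point is that the legacy instrumentation emits the same ones so a trace reads identically whichever path served the request:

// Illustrative shared span convention, applied by both implementations
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payments-migration');

// Request and Response are the same app-level types used in the dispatch snippet earlier.
async function handleWithTracing(
  impl: 'legacy' | 'new',
  req: Request,
  handler: (req: Request) => Promise<Response>,
): Promise<Response> {
  return tracer.startActiveSpan('payment.request', async (span) => {
    span.setAttribute('migration.impl', impl);            // which side served the request
    span.setAttribute('payment.endpoint', req.endpoint);  // same attribute keys on both sides
    try {
      return await handler(req);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}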
06
What surprised us
Three things were not in our pre-engagement risk register and ended up mattering more than we expected.
1. Legacy data quality issues that shadow mode surfaced
The new implementation's stricter validation surfaced data-quality issues that the legacy system had been silently tolerating for years. We had to decide, repeatedly, whether to make the new implementation accept the legacy garbage or to reject it cleanly. We mostly chose the latter, with a one-time data cleanup script and a hard policy on rejecting going forward — but each instance was a real conversation with the client's data team.
2. Compliance posture got better, not worse
Auditors initially viewed the migration with suspicion (more change == more risk in their mental model). By the end of the engagement, the rebuilt platform had a smaller PCI-DSS footprint than the legacy one, with controls implemented in IaC instead of policy PDFs. The audit closed without findings — the first clean audit the company had in five years. We did not pre-sell this in the engagement scope; it emerged organically from the architecture.
3. Team culture changed
The same in-house team that had been firefighting the legacy system started running a documented release schedule on the new platform. Their on-call burn rate dropped roughly 70% in the six months after handover. We did not budget for this and we cannot perfectly explain why it happened — but the pattern has shown up in every successful strangler-fig engagement we have run.
07
Closing
Strangler-fig is structurally slower than a rewrite — you are running both implementations in parallel for months — but it is consistently faster overall, because you are not paying for the rework that flag-day cutovers force. It is also the only reasonable way to modernize a system whose downtime has a financial cost measured in basis points per minute.
If you are sizing one of these migrations and want a sanity check, we are happy to chat. The cases where rewrite is genuinely the right answer exist, but they are rarer than most teams assume.