Worker Deployment Patterns for High Availability (first ½ ready for review) by lukeknep · Pull Request #4703 · temporalio/documentation

lukeknep · 2026-06-11T17:46:17Z

What does this PR do?

When using multi-region High Availability, Temporal Cloud customers often ask us how to decide where to deploy their Workers and other systems.

This page gives recommendations on common patterns for an overall High Availability strategy that a Temporal Cloud user can adopt in their architecture.

Notes to reviewers

(June 22nd) The first half of this doc is ready for Docs review, up to the point where the Active/Active pattern is detailed. I'm still working on the Active/Active variants.
Internal context (long thread, first messages are most important): https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1781117451071889?thread_ts=1781008921.964629&cid=C04V0LSU5S6

┆Attachments: EDU-6522 [draft] High Availability Deployment Models page

vercel · 2026-06-11T17:46:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Jul 1, 2026 6:57pm |

github-actions · 2026-06-11T20:35:49Z

📖 Docs PR preview links

Cloud
- Connectivity
  - GCP Private Connect
- High Availability
- Outages and Recovery Objectives (RTO / RPO)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel · 2026-06-15T20:57:10Z

Deployment failed with the following error:

The `vercel.json` schema validation failed with the following message: should NOT have additional property `public`

Learn More: https://vercel.com/docs/concepts/projects/project-configuration

thestephenstanton · 2026-06-17T13:51:05Z

+- **Active / Passive** — Workflows process in one region at a time, the "active" region. The other region is "passive" and ready for failover. This pattern has two variants:
+  - **[Active / Passive (Cold)](#active-cold)** — a.k.a. Active / Cold — Workers run in only one region at a time. After a failover, Workers start in the secondary region. The region where Workers run == the region where Workflows process. To fail over, Workers need a "cold start" in the other region.
+  - **[Active / Passive (Hot)](#active-hot)** — a.k.a. Active / Hot — Workers run in **both regions** simultaneously, but Workflows still process in only one region at any given time. The other region's Workers are on "hot" standby.
+- **[Active / Active](#active-active)** — Workflows process in both regions at the same time. Necessarily, Workers run in both regions at all times.


nit: "Necessarily" is an odd word to use here. I'd just remove it.

thestephenstanton · 2026-06-17T13:56:21Z

+Active / Cold Pattern: **On failover**
+
+- **The Namespace fails over automatically.** Temporal Cloud promotes the secondary region's replica to active. No action is needed to fail over the Namespace itself.
+- **You bring the Workers up in the secondary region.** Because no Workers were running there, they start from nothing — a "cold" start. Starting and scaling that fleet is your responsibility, ideally through tested automation. Until the Workers are running, no Workflows make progress.


I feel like the question everyone reading this is going to ask is: how do we detect a failover?

I know we have plans to answer this in H2, but is there something we want to tell them now? For example, should they have some sort of system that constantly queries which region is active in order to detect a failover? Or do we just want to wait for the question and address it then?

It could just be one of those things where we fix the problem before we expect to be asked about it.

Another thing I thought about is how they know when to scale down those Workers and do their own failback.

This should definitely be called out, I'll add it.
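
One way to picture the detection question raised in this thread is a small polling loop. The sketch below is illustrative only: `get_active_region` and `on_failover` are hypothetical callables supplied by the caller, and nothing here is a Temporal SDK or Temporal Cloud API. How the active region is actually looked up is left as an assumption.

```python
import time
from typing import Callable


def watch_for_failover(
    get_active_region: Callable[[], str],      # hypothetical: returns the currently active region
    on_failover: Callable[[str, str], None],   # hypothetical: called with (old_region, new_region)
    polls: int,
    interval_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the active region and report every change. Returns the last region seen."""
    last_seen = get_active_region()
    for _ in range(polls):
        sleep(interval_seconds)
        current = get_active_region()
        if current != last_seen:
            # The active region moved. This covers both a failover and a later failback.
            on_failover(last_seen, current)
            last_seen = current
    return last_seen
```

Because the loop only compares the previous and current region, the same callback fires on a failback, which is where a user could scale the secondary-region Workers back down.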

thestephenstanton · 2026-06-17T13:59:32Z

+Active / Cold Pattern: **Tradeoffs**
+
+- Highest overall recovery time of the three patterns, due to cold starting the Worker fleet after failover.
+- Depends on tested automation to bring up the secondary-region fleet quickly.


I see "tested automation" three times, and as a user I'd have no idea what it means.

thestephenstanton · 2026-06-17T14:03:31Z

+
+- **Use the Namespace Endpoint.**
+   - Connect Workers through the [Namespace Endpoint](/cloud/namespaces#access-namespaces), which always connects to the Namespace in its active region and automatically fails over to the new region.
+   - **Rationale:** If a Temporal Cloud incident requires the Namespace to fail over while the rest of the primary region is healthy, the Workers in the primary region can still connect through the Namespace Endpoint and process Workflows. If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region.


If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region.

Won't they be forwarded?

Ah, I see the part lower down about turning off forwarding. This seems like it would be a really good feature to have in the Worker itself, passed as a flag. If you know you are connecting to a Regional Endpoint and you don't want forwarding, seeing it all in one spot in the code is much clearer than setting the Regional Endpoint in the Worker and making a separate CLI call externally.

Just a thought.

Interesting. Initial user feedback was that they wanted it at a Namespace level. But I'll listen to see if some folks want it at a per-Worker level too.

I'm more concerned about overloading the "endpoint" they use to also convey which region the worker is in. I wish we had a separate "region" config that the Worker could specify.

thestephenstanton · 2026-06-17T14:14:32Z

+- **Codec Servers and proxies** — run in both regions continuously.
+- **Databases and queues** — accessed from both regions; cross-region consistency must be designed for.
+
+### Dual Active (Multi-Active) {/* #dual-active */}


I'm a little confused about this one. Isn't this just the Active/Passive pattern applied to two Namespaces? I guess I'm confused about why it's here when we already have Active/Passive.

Is this pattern really just saying "you can have different Namespaces in different regions"?

Yeah, basically.

lennessyy · 2026-06-23T21:39:22Z

+
+| Pattern | Best for | Major benefits | Major tradeoffs |
+| --- | --- | --- | --- |
+| **[Active/Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility |


Isn't failing over workers always the user's responsibility? The biggest tradeoff here is the cold start right?

Active / Cold: Starting the Workers after a failover event is the user's responsibility. Workflows are blocked until the user does so.

Active / Hot: the Workers are already started in advance; it's Temporal's responsibility to send them tasks after a failover. So Workflows can keep running even if the user does nothing!

Suggestions on how to improve this phrasing are welcome!
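
The distinction described in this reply can be modeled in a few lines. This is a minimal sketch based only on the pattern descriptions quoted in this PR. The `Pattern` enum and both helper functions are hypothetical names introduced for illustration and are not part of any Temporal API.

```python
from enum import Enum
from typing import Set


class Pattern(Enum):
    ACTIVE_PASSIVE_COLD = "Active/Passive (Cold)"
    ACTIVE_PASSIVE_HOT = "Active/Passive (Hot)"
    ACTIVE_ACTIVE = "Active/Active"


def worker_regions(pattern: Pattern, primary: str, secondary: str) -> Set[str]:
    """Regions where Workers are running before any failover happens."""
    if pattern is Pattern.ACTIVE_PASSIVE_COLD:
        # Cold: Workers run only in the region where Workflows process.
        return {primary}
    # Hot and Active/Active: Workers run in both regions at all times.
    return {primary, secondary}


def progresses_without_user_action(pattern: Pattern, primary: str, secondary: str) -> bool:
    """After a failover to `secondary`, can Workflows continue if the user does nothing?"""
    return secondary in worker_regions(pattern, primary, secondary)
```

Under this model only the Cold variant returns `False`, which matches the point above: Workflows stay blocked until the user starts Workers in the secondary region.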

lennessyy · 2026-06-23T21:46:22Z

+| --- | --- | --- | --- |
+| **[Active/Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility |
+
+```mermaid


The other diagrams look great. I don't think these six are very helpful, though. You already have essentially the same diagrams in the later sections to illustrate each pattern in detail. By including them here, you also lose the comparative benefit of the table, because users can no longer easily look at the rows side by side. I'd remove these Mermaid diagrams here and keep a simple table.

lennessyy · 2026-06-23T21:47:11Z

+
+### Active/Passive (Cold) {/* #active-cold */}
+
+_Also known as "Active/Cold Standby", "Active/Cold", or simply "Active/Passive"._


Instead of listing all these alternate names, I think it's less confusing to just use one consistent name throughout.

lennessyy · 2026-06-23T21:56:40Z

+
+### Which pattern has the lowest recovery time (RTO)? {/* #faq-lowest-rto */}
+
+**Active/Passive (Hot)** achieves the lowest recovery time, because a standby Worker fleet already runs in the secondary region and begins processing the moment it becomes active — no cold start. See [Active/Passive (Hot)](#active-hot) and [RPO and RTO](/cloud/rpo-rto).


Wouldn't Active/Active have the lowest recovery time?

We say this earlier on the page:

Active/Active Pattern: Benefits

Low overall recovery time. The surviving region keeps processing while capacity scales up.

lennessyy · 2026-06-23T21:58:08Z

+
+To understand the recovery objectives each pattern is measured against, see [RPO and RTO](/cloud/rpo-rto).
+
+## Frequently asked questions {/* #faq */}


How attached are you to this FAQ section? It looks like it is repeating things we already addressed before in the page and in greater detail.

lukeknep added 2 commits June 10, 2026 11:21

Disable forwarding setting for HA

ee33475

First draft of deployment models page

508d477

lukeknep requested a review from a team as a code owner June 11, 2026 17:46

edits to deployment models

d1a61cc

vercel Bot deployed to Preview June 11, 2026 18:03 View deployment

more edits to deployment models

3f98c90

vercel Bot deployed to Preview June 11, 2026 18:47 View deployment

more updates

810fa7d

vercel Bot deployed to Preview June 11, 2026 20:09 View deployment

Merge branch 'main' into ha-worker-deployments

2f486b3

vercel Bot deployed to Preview June 11, 2026 21:23 View deployment

Add High Availability deployment patterns docs page

6722da6

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel Bot deployed to Preview June 12, 2026 04:18 View deployment

Updates to worker deployment patterns

24c9151

vercel Bot had a problem deploying to Preview June 15, 2026 20:57 Failure

updates

420713a

vercel Bot had a problem deploying to Preview June 15, 2026 22:32 Failure

sync-by-unito Bot assigned brianmacdonald-temporal Jun 16, 2026

updates

d6b5d36

vercel Bot had a problem deploying to Preview June 16, 2026 19:30 Failure

Merge branch 'main' into ha-worker-deployments

d1016d4

vercel Bot deployed to Preview June 16, 2026 23:47 View deployment

thestephenstanton reviewed Jun 17, 2026

worker deployment updates

378f251

vercel Bot deployed to Preview June 22, 2026 16:48 View deployment

lukeknep changed the title ~~[draft] High Availability Deployment Models page~~ High Availability Deployment Models page (½ ready for review) Jun 22, 2026

Merge branch 'main' into ha-worker-deployments

d017741

vercel Bot deployed to Preview June 23, 2026 18:10 View deployment

lennessyy self-assigned this Jun 23, 2026

lennessyy reviewed Jun 23, 2026

lukeknep changed the title ~~High Availability Deployment Models page (½ ready for review)~~ Worker Deployment Models for High Availability (first ½ ready for review) Jun 24, 2026

lukeknep added 2 commits June 24, 2026 16:02

Progress on Active/Active

4e1ef45

Merge branch 'ha-worker-deployments' of github.com:temporalio/documen…

9a68681

…tation into ha-worker-deployments

vercel Bot deployed to Preview June 24, 2026 23:04 View deployment

lukeknep changed the title ~~Worker Deployment Models for High Availability (first ½ ready for review)~~ Worker Deployment Patterns for High Availability (first ½ ready for review) Jun 26, 2026

new anchor for AI

fd94a26

vercel Bot deployed to Preview June 29, 2026 23:25 View deployment

Updates

b0dfcd6

vercel Bot deployed to Preview July 1, 2026 18:57 View deployment

