Worker Deployment Patterns for High Availability (first ½ ready for review)#4703
lukeknep wants to merge 17 commits into
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| - **Active / Passive** — Workflows process in one region at a time, the "active" region. The other region is "passive" and ready for failover. This pattern has two variants: | ||
| - **[Active / Passive (Cold)](#active-cold)** — a.k.a. Active / Cold — Workers run in only one region at a time. After a failover, Workers start in the secondary region. The region where Workers run == the region where Workflows process. To fail over, Workers need a "cold start" in the other region. | ||
| - **[Active / Passive (Hot)](#active-hot)** — a.k.a. Active / Hot — Workers run in **both regions** simultaneously, but Workflows still process in only one region at any given time. The other region's Workers are on "hot" standby. | ||
| - **[Active / Active](#active-active)** — Workflows process in both regions at the same time. Necessarily, Workers run in both regions at all times. |
There was a problem hiding this comment.
nit: "Necessarily" is an odd word to use here. I'd just remove it.
| Active / Cold Pattern: **On failover** | ||
|
|
||
| - **The Namespace fails over automatically.** Temporal Cloud promotes the secondary region's replica to active. No action is needed to fail over the Namespace itself. | ||
| - **You bring the Workers up in the secondary region.** Because no Workers were running there, they start from nothing — a "cold" start. Starting and scaling that fleet is your responsibility, ideally through tested automation. Until the Workers are running, no Workflows make progress. |
There was a problem hiding this comment.
I feel like the question everyone reading this is going to ask is: how do we detect a failover?
I know we have plans to answer this in H2, but is there something we want to tell them now? For example, should they run some sort of system that constantly queries which region is active, so they can detect a failover? Or do we just want to wait for the question and address it then?
There was a problem hiding this comment.
It could just be one of those things where we fix the problem before we expect to be asked about it.
There was a problem hiding this comment.
Another thing I thought about is how they know when to scale those Workers back down and do their own failback.
There was a problem hiding this comment.
This should definitely be called out, I'll add it.
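To make the detection question concrete, here is a minimal sketch of the polling idea raised above. It is an illustration only, not something the page or Temporal Cloud prescribes. `get_active_region` is a hypothetical callable: how it learns the Namespace's active region is an assumption left to the reader. `on_failover` is likewise a hypothetical hook where the user's own automation would start Workers in the new region and scale them down in the old one, which also covers failback.

```python
import time
from typing import Callable, Optional


def watch_active_region(
    get_active_region: Callable[[], Optional[str]],
    on_failover: Callable[[str, str], None],
    interval_seconds: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll for the active region and report every change.

    get_active_region is HYPOTHETICAL: it stands in for whatever lookup
    the user has available. A failed poll (exception or None) is skipped
    and is never treated as a failover on its own.
    """
    last_known: Optional[str] = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            region = get_active_region()
        except Exception:
            region = None  # lookup failed; keep the last known region

        if region is not None:
            if last_known is not None and region != last_known:
                # Active region changed: a failover, or a later failback.
                on_failover(last_known, region)
            last_known = region

        sleep(interval_seconds)
    return last_known
```

Because the callback receives both the previous and the new region, the same hook can handle the cold start after a failover and the scale-down after a failback. Passing `max_polls` and a no-op `sleep` makes the loop easy to exercise in a test without waiting.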
| Active / Cold Pattern: **Tradeoffs** | ||
|
|
||
| - Highest overall recovery time of the three patterns, due to cold starting the Worker fleet after failover. | ||
| - Depends on tested automation to bring up the secondary-region fleet quickly. |
There was a problem hiding this comment.
"Tested automation": I see this three times, and as a user I'd have no idea what it means.
|
|
||
| - **Use the Namespace Endpoint.** | ||
| - Connect Workers through the [Namespace Endpoint](/cloud/namespaces#access-namespaces), which always connects to the Namespace in its active region and automatically fails over to the new region. | ||
| - **Rationale:** If a Temporal Cloud incident requires the Namespace to fail over while the rest of the primary region is healthy, the Workers in the primary region can still connect through the Namespace Endpoint and process Workflows. If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region. |
There was a problem hiding this comment.
> If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region.

Won't they be forwarded?
There was a problem hiding this comment.
Ah, I see the part further down about turning off forwarding. This seems like it would be a really good feature to have in the Worker, passed in as a flag. If you know you are connecting to a Regional Endpoint and you don't want forwarding, seeing it all in one spot in the code is much clearer than setting the Regional Endpoint in the Worker and then making a separate CLI call externally.
Just a thought.
There was a problem hiding this comment.
Interesting. Initial user feedback was that they wanted it at a Namespace level. But I'll listen to see if some folks want it at a per-Worker level too.
I'm more concerned about overloading the "endpoint" they use to also convey which region the worker is in. I wish we had a separate "region" config that the Worker could specify.
| - **Codec Servers and proxies** — run in both regions continuously. | ||
| - **Databases and queues** — accessed from both regions; cross-region consistency must be designed for. | ||
|
|
||
| ### Dual Active (Multi-Active) {/* #dual-active */} |
There was a problem hiding this comment.
I'm a little confused about this one. Isn't this just the Active / Passive pattern applied to two Namespaces? I'm not sure why it's here when we already have Active / Passive.
Is this pattern really just saying "you can have different Namespaces in different regions"?
|
|
||
| | Pattern | Best for | Major benefits | Major tradeoffs | | ||
| | --- | --- | --- | --- | | ||
| | **[Active/Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility | |
There was a problem hiding this comment.
Isn't failing over workers always the user's responsibility? The biggest tradeoff here is the cold start right?
There was a problem hiding this comment.
Active / Cold: Starting the Workers after a failover event is the user's responsibility. Workflows are blocked until the user does so.
Active / Hot: the Workers are already started in advance; it's Temporal's responsibility to send them tasks after a failover. So Workflows can keep running even if the user does nothing!
I welcome suggestions on how to improve this phrasing!
|
|
||
| ```mermaid |
There was a problem hiding this comment.
The other diagrams look great. I don't think these 6 are super helpful, though. You already have basically the same diagrams in the later sections to illustrate each pattern in detail. Including them here also loses the comparative benefit of the table, because users can no longer easily look at the rows side by side. I'd remove these mermaid diagrams here and keep a simple table.
|
|
||
| ### Active/Passive (Cold) {/* #active-cold */} | ||
|
|
||
| _Also known as "Active/Cold Standby", "Active/Cold", or simply "Active/Passive"._ |
There was a problem hiding this comment.
Instead of listing all these alternate names, I think it's less confusing to just use one consistent name throughout.
|
|
||
| ### Which pattern has the lowest recovery time (RTO)? {/* #faq-lowest-rto */} | ||
|
|
||
| **Active/Passive (Hot)** achieves the lowest recovery time, because a standby Worker fleet already runs in the secondary region and begins processing the moment it becomes active — no cold start. See [Active/Passive (Hot)](#active-hot) and [RPO and RTO](/cloud/rpo-rto). |
There was a problem hiding this comment.
Wouldn't Active/Active have the lowest recovery time?
We say this earlier on the page:
> Active/Active Pattern: Benefits
> - Low overall recovery time. The surviving region keeps processing while capacity scales up.
|
|
||
| To understand the recovery objectives each pattern is measured against, see [RPO and RTO](/cloud/rpo-rto). | ||
|
|
||
| ## Frequently asked questions {/* #faq */} |
There was a problem hiding this comment.
How attached are you to this FAQ section? It looks like it is repeating things we already addressed before in the page and in greater detail.
…tation into ha-worker-deployments
What does this PR do?
When using multi-region High Availability, Temporal Cloud customers often ask us how to decide where to deploy their Workers and other systems.
This page gives recommendations on common patterns for an overall High Availability strategy that a Temporal Cloud user can adopt in their architecture.
Notes to reviewers
Attachments: EDU-6522 [draft] High Availability Deployment Models page