2 changes: 2 additions & 0 deletions docs/cloud/connectivity/gcp-connectivity.mdx
Original file line number Diff line number Diff line change
@@ -29,6 +29,8 @@ This one-way connection means Temporal cannot establish a connection back to you
This is useful if normally you block traffic egress as part of your security protocols.
If you use a private environment that does not allow external connectivity, you will remain isolated.

<a id="high-availability-and-private-service-connect"></a>

:::warning Namespaces with High Availability features and GCP Private Service Connect

Automatic failover via Temporal Cloud DNS is not currently supported with GCP Private Service Connect.
864 changes: 864 additions & 0 deletions docs/cloud/high-availability/architecture-patterns.mdx

Large diffs are not rendered by default.

8 changes: 6 additions & 2 deletions docs/cloud/high-availability/enable.mdx
@@ -12,6 +12,8 @@ You can enable [High Availability](/cloud/high-availability) features for a new
replica. When you add a replica, Temporal Cloud begins asynchronously replicating ongoing and existing Workflow
Executions.

Once a replica is in place, the Namespace can fail over automatically; to plan how your Workers fail over with it, see [Worker deployment patterns for high availability and disaster recovery](/cloud/high-availability/architecture-patterns).

The replica region must be on the same continent as the primary region. Because of that, not all replication options are available in all Temporal Cloud regions. See the [Service regions](/cloud/regions) page for the supported replica regions for each active region.

Using private network connectivity with a HA namespace requires extra setup. See
@@ -137,6 +139,8 @@ Client APIs (Start, Signal, Cancel, Terminate, Query, and the equivalent Activit

Same-region replicas are not affected by this setting.

To deploy Worker fleets in both regions that stay on standby in the passive region until failover, see [Active / Passive (Hot)](/cloud/high-availability/architecture-patterns#active-hot).

:::info

To see which endpoints route to which replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica).
@@ -152,10 +156,10 @@ Use the [`temporal cloud namespace ha update`](/cli/command-reference/cloud/name
```bash
temporal cloud namespace ha update \
--namespace <namespace>.<account> \
- --disable-passive-poller-forwarding true
+ --passive-poller-forwarding disabled
```

- Set the flag to `false` to re-enable forwarding.
+ Set the flag to `enabled` to re-enable forwarding.

### Set the forwarding behavior with the Cloud Ops API {/* #set-forwarding-curl */}

2 changes: 2 additions & 0 deletions docs/cloud/high-availability/failovers/manage.mdx
@@ -178,6 +178,8 @@ the replica, the DNS redirection orchestrated by Temporal ensures that your exis
Namespace without interruption. Temporal Cloud forwards their requests from the passive replica to the active region and
the responses back, so Workers keep running through a failover.

To choose where your Worker fleets run across regions, see [Deployment patterns for High Availability](/cloud/high-availability/architecture-patterns).

To route Workers to the passive region's replica, see [How requests reach the replica](/cloud/high-availability/ha-connectivity#how-requests-reach-the-replica).

To stop forwarding Worker polls to the active region, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior).
4 changes: 4 additions & 0 deletions docs/cloud/high-availability/ha-connectivity.mdx
@@ -93,6 +93,10 @@ To learn what forwarding does, see [Request forwarding](/cloud/high-availability

To stop forwarding Worker polls on a Namespace, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior).

To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/architecture-patterns#active-active).

To keep passive-region Workers on standby until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/architecture-patterns#active-hot).

## How to use PrivateLink with High Availability features

:::tip
5 changes: 5 additions & 0 deletions docs/cloud/high-availability/index.mdx
@@ -105,6 +105,10 @@ To route Workers to the passive region's replica, see [How requests reach the re

To disable passive region replica forwarding, see [Change the forwarding behavior](/cloud/high-availability/enable#change-forwarding-behavior).

To run Worker fleets in both regions that rely on this forwarding, see [Active / Active](/cloud/high-availability/architecture-patterns#active-active).

To keep passive-region Workers on standby until failover by disabling this forwarding, see [Active / Passive (Hot)](/cloud/high-availability/architecture-patterns#active-hot).

## Service levels and recovery objectives

Namespaces using High Availability have a 99.99% [uptime SLA](/cloud/sla) with sub-1-minute [RPO](/cloud/rpo-rto) and 20-minute [RTO](/cloud/rpo-rto). For detailed information:
@@ -115,6 +119,7 @@ Namespaces using High Availability have a 99.99% [uptime SLA](/cloud/sla) with s
## Failover

High Availability Namespaces can automatically or manually [fail over](/cloud/high-availability/failovers) to the replica if the primary is unavailable or unhealthy.
The Namespace fails over automatically, but your Workers and the rest of your architecture need their own plan — see [Worker deployment patterns for Active-Passive and Active-Active HA/DR](/cloud/high-availability/architecture-patterns).

## Target workloads

18 changes: 18 additions & 0 deletions docs/cloud/high-availability/monitoring.mdx
@@ -26,6 +26,24 @@ import { ToolTipTerm } from '@site/src/components';
Temporal Cloud offers several ways for you to track the health and performance of your
[High Availability](/cloud/high-availability) namespaces.

## Detect a failover or an outage {/* #detect-failover-or-outage */}

With some [Worker deployment patterns](/cloud/high-availability/architecture-patterns) — most notably [Active/Passive (Cold)](/cloud/high-availability/architecture-patterns#active-cold) — detecting an outage is your responsibility, and your Workflows make no progress until you detect it and bring up Workers in the new active region. Fast, reliable detection therefore directly determines your recovery time, so it is worth monitoring for both of the following.

### Detect a failover {/* #detect-a-failover */}

The clearest way to detect that a failover has happened is to watch whether your Namespace's active region changed. When Temporal Cloud promotes the replica in the secondary region to active, the active region reported for the Namespace changes — a reliable, unambiguous signal that a failover occurred. To track failovers as they happen, look for the `FailoverNamespace` operation described in [Failover audit log](#failover-audit-log).
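
The active-region check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: how you obtain the Namespace's active region (UI, API, or audit log) is not shown, and `FailoverEvent`, `detect_failovers`, and the region names are hypothetical placeholders, not part of any Temporal API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class FailoverEvent:
    """Hypothetical record of one observed change in the active region."""
    previous_region: str
    new_region: str
    observed_at: float


def detect_failovers(
    observations: Iterable[Tuple[float, str]],
) -> Iterator[FailoverEvent]:
    """Yield an event each time the reported active region changes.

    `observations` is a sequence of (timestamp, active_region) pairs that
    your own polling collects; fetching them is outside this sketch.
    """
    previous: Optional[str] = None
    for timestamp, region in observations:
        if previous is not None and region != previous:
            yield FailoverEvent(previous, region, timestamp)
        previous = region


# Placeholder region names and timestamps, for illustration only.
samples = [(0.0, "region-a"), (60.0, "region-a"), (120.0, "region-b")]
events = list(detect_failovers(samples))
```

Comparing each observation only with the one before it keeps the check stateless apart from the last known region, so it works equally well for a polling loop or a replay of logged observations.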

### Detect an outage {/* #detect-an-outage */}

A failover is not the only signal worth watching. You may want to detect a regional outage directly, before or independently of a Namespace failover, so you can begin your own response. Watch for:

- **A spike in replication lag** between the primary and the replica. See [Monitoring replication](#monitoring-replication).
- **A drop in Workflow throughput**, such as a sudden decline in the rate of Workflows started, Workflows completed, or Tasks processed.
- **A spike in errors across your overall stack**, not just Temporal — for example, application errors, failed Activities, or connection failures.
- **A drop in throughput across your overall stack**, such as fewer requests reaching your services or fewer Activities executing.
- **Errors or failovers in other cloud systems you depend on**, such as databases, queues, or other regional services, which often signal a broader regional outage.
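
One way to turn a signal from the list above, such as replication lag, into an alert is to compare each new sample with a rolling baseline. The sketch below is a minimal illustration, not a recommended configuration: `SpikeDetector`, the window size, the multiplier, and the floor are all hypothetical, and collecting the samples from your metrics system is assumed to happen elsewhere.

```python
from collections import deque
from statistics import median


class SpikeDetector:
    """Flag a sample that far exceeds the median of recent normal samples."""

    def __init__(self, window: int = 10, factor: float = 3.0, floor: float = 1.0):
        self.samples = deque(maxlen=window)  # recent non-spike samples
        self.factor = factor                 # multiple of baseline that counts as a spike
        self.floor = floor                   # minimum threshold, avoids alerts on tiny values

    def observe(self, value: float) -> bool:
        """Return True if `value` is a spike relative to the rolling baseline."""
        window_full = len(self.samples) == self.samples.maxlen
        is_spike = False
        if window_full:
            threshold = max(median(self.samples) * self.factor, self.floor)
            is_spike = value > threshold
        if not is_spike:
            # Only normal samples update the baseline, so a sustained
            # spike keeps alerting instead of becoming the new normal.
            self.samples.append(value)
        return is_spike


# Illustrative lag samples in seconds (made-up values).
detector = SpikeDetector(window=3, factor=3.0, floor=1.0)
results = [detector.observe(v) for v in [0.5, 0.6, 0.4, 0.5, 5.0, 0.5]]
```

The same detector can be inverted for throughput drops by feeding it a value that rises when throughput falls, for example the time between completed Workflows.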

## Replication status

You can monitor your replica status with the Temporal Cloud UI. If the replica is unhealthy, Temporal Cloud disables the
1 change: 1 addition & 0 deletions docs/cloud/rto-rpo.mdx
@@ -34,6 +34,7 @@ the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For detail

The RTO and RPO for a Namespace depend on the type of outage and which [High Availability](/cloud/high-availability)
features the Namespace has enabled.
Your real-world recovery time also depends on your [Worker deployment pattern](/cloud/high-availability/architecture-patterns) — how quickly Workers resume processing in the new region after the Namespace fails over.

## RTO and RPO summary

1 change: 1 addition & 0 deletions sidebars.js
@@ -1168,6 +1168,7 @@ module.exports = {
},
items: [
'cloud/high-availability/enable',
'cloud/high-availability/architecture-patterns',
'cloud/high-availability/monitoring',
{
type: 'category',
15 changes: 15 additions & 0 deletions vercel.json
@@ -15,6 +15,21 @@
}
],
"redirects": [
{
"source": "/cloud/disaster-recovery",
"destination": "/cloud/high-availability",
"permanent": true
},
{
"source": "/cloud/multi-region-failover",
"destination": "/cloud/high-availability/architecture-patterns",
"permanent": true
},
{
"source": "/cloud/high-availability/deployment-patterns",
"destination": "/cloud/high-availability/architecture-patterns",
"permanent": true
},
{
"source": "/encyclopedia/detecting-application-failures",
"destination": "/encyclopedia/failures-and-error-handling",