Active/Passive Cluster Runbook

This document is the operator runbook for running IdentityMesh in an active/passive HA topology — one live host, one pre-staged standby, with a manual or MSCS-driven cutover when the active host needs to go away.

IdentityMesh ships single-instance only (see deployment-architecture.md for why active/active isn’t supported today). Active/passive sidesteps the distributed-state gaps entirely while still giving an HA story most enterprises actually want.

When to use this

The data-loss budget (RPO) needs to be near zero. Active/passive with a shared HA-clustered SQL backend gives RPO ≈ 0 because all durable state lives in SQL and the standby reads the same database.
The recovery-time budget (RTO) needs to be measured in minutes, not the hours that a from-backup rebuild takes. Pre-staging the passive node turns failover into “start two services and flip a DNS record”.
A scale-out (active/active) deployment is not on the table — IdentityMesh’s Data Protection keys, SignalR hub state, and engine scheduler are not coordinated across instances.
Compared with the rebuild-from-backup procedure in backup-and-restore.md, active/passive trades a second always-running host for an order-of-magnitude faster recovery.

Topology

                ┌──────────────────────────────────┐
                │  VIP / DNS CNAME                 │
                │  (im.contoso.local)              │
                └──────────────────┬───────────────┘
                                   │
                  ┌────────────────┴───────────────┐
                  ▼                                ▼
       ┌────────────────────┐           ┌────────────────────┐
       │  ACTIVE host       │           │  PASSIVE host      │
       │  Admin API   (run) │           │  Admin API   (stop)│
       │  Sync Engine (run) │           │  Sync Engine (stop)│
       └─────────┬──────────┘           └─────────┬──────────┘
                 │ EF Core, TLS                   │ EF Core, TLS
                 └────────────────┬───────────────┘
                                  ▼
                  ┌────────────────────────────────┐
                  │  Shared SQL Server             │
                  │  (Failover Cluster / AG)       │
                  │  IdentityMesh database         │
                  └────────────────────────────────┘

                  ┌────────────────────────────────┐
                  │  Relay agents                  │
                  │  (SignalR clients,             │
                  │   auto-reconnect to VIP)       │
                  └────────────────────────────────┘

Both hosts have IdentityMesh installed at the same version, point at the same SQL database, and serve traffic on the same DNS name via a VIP or CNAME. Only the active host has its services running. Relay agents target the VIP, so failover is transparent to them once the record swings.

Pre-staging the passive node

The passive node is a full IdentityMesh install whose services are configured not to start on boot. Build it once; refresh it on every upgrade.

Install the same MSI version that the active host runs. See installer.md. The installer creates the same service definitions on both hosts.
Mirror appsettings.json between hosts. The ConnectionStrings:IdentityMesh value must point at the shared SQL endpoint (the AG listener or cluster VNN, not a node name). Auth configuration (Authentication:JwtBearer:* for OIDC, or the Negotiate config for Windows auth) must be identical so a user’s session on the active host doesn’t depend on host-specific issuer / audience values.
Set both Windows services to Manual start on the passive host so they don’t compete with the active host after a reboot:
```
sc config IdentityMeshAdmin  start= demand
sc config IdentityMeshEngine start= demand
sc stop   IdentityMeshAdmin
sc stop   IdentityMeshEngine
```
The active host keeps the default Automatic start.
Pre-provision the license file at %CommonApplicationData%\IdentityMesh\license.key. Each host reads its own local copy of the file (see backup-and-restore.md), so copying the active host’s license.key to the passive host is sufficient. The same file is valid on either host because the license is bound to the customer, not to a machine ID.
Provision the TLS / code-signing certificates on the passive host’s LocalMachine certificate store (or wherever the active host loads them from). Certificates do not travel via the SQL database — they are host-local material.
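
Configuration drift between the two hosts is easy to miss by eye. The sketch below shows one way to compare the settings that the appsettings.json step above says must match. It assumes both files have been copied to one machine and are plain JSON without comments. The function names are illustrative and not part of IdentityMesh, and the key of the Negotiate settings is not named in this runbook, so it is left for the operator to add.

```python
import json

# Key prefixes that must be identical on both hosts, written in the
# ASP.NET Core "Section:Key" notation. Windows-auth sites would add the
# prefix of their Negotiate settings here (not named in this runbook).
MUST_MATCH_PREFIXES = ("ConnectionStrings:IdentityMesh", "Authentication:")

_MISSING = object()  # distinguishes "key absent" from an explicit null


def flatten(node, prefix=""):
    """Flatten nested JSON objects into {"A:B:C": value} pairs."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}:{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(active_cfg, passive_cfg, prefixes=MUST_MATCH_PREFIXES):
    """Return the sorted watched keys whose values differ or exist on one host only."""
    a, p = flatten(active_cfg), flatten(passive_cfg)
    watched = {k for k in set(a) | set(p) if k.startswith(prefixes)}
    return sorted(k for k in watched if a.get(k, _MISSING) != p.get(k, _MISSING))


def load_settings(path):
    """Read one appsettings.json (assumed to be plain JSON without comments)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `config_drift(load_settings(active_path), load_settings(passive_path))` with the two copied files returns an empty list when the watched settings agree. Any key it returns should be reconciled before the passive node is declared ready.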

Provision secrets. This step depends on the configured secret provider:

| Provider | Passive-node setup | Cutover impact |
|---|---|---|
| `DPAPI` (default) | Re-run `secretscli set <ref> <value>` on the passive host for every entry in `IM_Secrets`. The DPAPI key is bound to the host’s `LocalMachine` scope and the active host’s blobs cannot be decrypted on the passive host. See `secrets-and-dpapi.md`. | Each rotation must be re-applied to the passive host. |
| `AzureKeyVault` | Grant the passive host’s managed identity (or the configured `AZURE_CLIENT_ID`) the Key Vault Secrets User role on the same vault. Set `Secrets:Provider = AzureKeyVault` and the same `KeyVault:VaultUri` as the active host. See `secrets-keyvault.md`. | None: secrets are read from the vault on demand. |

Recommended: use the Azure Key Vault provider for active/passive deployments. The DPAPI provider works, but every secret rotation becomes a two-host operation, and every upgrade-time MSI repair that resets the host’s machine key forces a full re-provision.

Verify the passive host can reach SQL before declaring the node ready. Start the services briefly, confirm GET /health/ready returns 200 against localhost, then stop them again.
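
The readiness check above can be scripted so the operator does not have to retry by hand while the services warm up. This is a minimal sketch using only the Python standard library. The URL passed in (scheme and port in particular) is an assumption that depends on how the Admin API is bound on the host. The retry count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but not with a 2xx
    except OSError:
        return None  # connection refused, timeout, or DNS failure


def wait_until_ready(url, attempts=12, delay=5.0, fetch=http_status, sleep=time.sleep):
    """Poll `url` until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no point sleeping after the final attempt
    return False
```

On the passive host this would be called as `wait_until_ready("http://localhost:<port>/health/ready")` after starting the services, with `<port>` replaced by the port the Admin API listens on. Stop the services again regardless of the result.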

Cutover procedure (planned)

Use this procedure for OS patching, MSI upgrades, or any maintenance that requires the active host to step down briefly.

Drain the active host. Pause schedules in the Admin UI or wait for in-flight runs to complete. The Sync Engine wraps each import batch in a single SQL transaction (see deployment-architecture.md), so stopping mid-batch is safe — the batch rolls back — but it forces the next sync to re-run from the last watermark, which is wasted work if avoidable.
Stop the services on the active host:
```
sc stop IdentityMeshEngine
sc stop IdentityMeshAdmin
```
Watch the engine log file (logs/identitymesh-YYYYMMDD.log) for the final “shutdown complete” line before proceeding.
Update the VIP / DNS CNAME to point at the passive host.
- DNS: change the CNAME and wait out the TTL (or set it low — 30 to 60 seconds — for active/passive deployments).
- MSCS / Windows Failover Cluster: move the IdentityMesh role to the standby node; the cluster handles the IP swap.
Start the services on the passive host:
```
sc start IdentityMeshAdmin
sc start IdentityMeshEngine
```
The Admin API initialises first; the Sync Engine registers itself in the engine-instance registry once it starts.
Verify using the verification checklist below.
Rollback is the same procedure in reverse: stop on the new active, swing DNS / VIP back, start on the original active.

The total operator-visible outage runs from stopping the services on the active host (step 2) until verification passes on the new active host (step 5), typically 1 to 3 minutes if the DNS TTL is low.
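
The ordering in the planned procedure matters more than any single command, so it can help to generate the plan rather than type it from memory. The sketch below only builds the step list and executes nothing. It assumes the operator runs `sc` from a management workstation using the `sc \\host ...` remote form. The host names in the usage note are placeholders.

```python
ENGINE = "IdentityMeshEngine"
ADMIN = "IdentityMeshAdmin"


def sc(host, action, service):
    """Build an `sc` command line that targets a remote host."""
    return ["sc", "\\\\" + host, action, service]


def planned_cutover_plan(active, passive):
    """Return the planned-cutover steps in order as (kind, detail) pairs.

    kind is "run" for a command to execute and "manual" for an operator action.
    """
    return [
        ("manual", "Pause schedules and let in-flight runs finish"),
        ("run", sc(active, "stop", ENGINE)),   # engine stops first
        ("run", sc(active, "stop", ADMIN)),
        ("manual", "Wait for 'shutdown complete' in the engine log"),
        ("manual", f"Point the VIP / DNS CNAME at {passive}"),
        ("run", sc(passive, "start", ADMIN)),  # Admin API starts first
        ("run", sc(passive, "start", ENGINE)),
        ("manual", "Run the verification checklist"),
    ]
```

For example, `planned_cutover_plan("im-node-a", "im-node-b")` yields the stop commands for `im-node-a` before any start command for `im-node-b`. Swapping the two arguments gives the rollback plan. Note that `sc stop` returns as soon as the stop request is sent, which is why the log check stays in the plan as a manual step.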

Cutover procedure (unplanned)

The active host is unreachable: hardware failure, OS crash, network isolation. This procedure assumes the SQL backend is healthy; if SQL is also down, that is a separate incident and IdentityMesh cannot serve traffic from either node.

Confirm the active host is actually down. A flapping host that comes back mid-failover with the services still running creates the dual-active scenario described in What NOT to do. If in doubt, force the active host’s services off via remote management before proceeding.
Update the VIP / DNS CNAME to point at the passive host.

Start the services on the passive host:
```
sc start IdentityMeshAdmin
sc start IdentityMeshEngine
```

Verify using the verification checklist.
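
The "confirm the active host is actually down" step above is the one most often rushed. A hedged sketch of how the check might be automated: capture the text of `sc \\<old-active> query <service>` for both services and classify it. The parser relies on the `STATE : <n>  <NAME>` line that `sc query` prints. An unreachable host produces no such line, which the sketch reports as "unknown" rather than guessing. The function names are illustrative.

```python
import re

# Matches the state line of `sc query`, e.g. "STATE : 4  RUNNING".
STATE_LINE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def service_state(sc_query_output):
    """Extract the state name (RUNNING, STOPPED, ...) or None if absent."""
    match = STATE_LINE.search(sc_query_output)
    return match.group(1) if match else None


def old_active_status(outputs):
    """Classify captured `sc query` outputs for the old active host.

    "running": at least one service is not stopped; stop it before failing over.
    "stopped": every service reports STOPPED.
    "unknown": no state could be read (for example, the host is unreachable).
    """
    states = [service_state(text) for text in outputs]
    if any(state not in (None, "STOPPED") for state in states):
        return "running"
    if states and all(state == "STOPPED" for state in states):
        return "stopped"
    return "unknown"
```

A result of "unknown" is expected when the host has genuinely failed. In that case the operator still needs an out-of-band confirmation (power state, hypervisor console) before starting the passive services.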

Expected behaviour after an unplanned failover

An import batch that was in flight when the active host died was rolled back at the SQL layer. The connector’s watermark sits at the last successful checkpoint, so the next scheduled run re-imports the same window.
If the engine had already committed some batches in a multi-batch run, those committed objects are durable. The next run continues from the current watermark — no data loss, but a small amount of duplicate work is possible if the engine had partially advanced watermarks for some object types and not others.
Relay agents detect the SignalR connection drop and reconnect automatically. They will reconnect to the VIP, which now resolves to the passive host. No manual intervention on the relay side.
Any operator session in the Admin UI is invalidated: Data Protection keys are per-host (see deployment-architecture.md), so the cookie issued by the active host cannot be decrypted by the passive host. Operators must sign in again.
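
The rollback behaviour described above follows from writing the batch and its watermark in the same transaction. The sketch below illustrates the idea with Python's built-in `sqlite3` as a stand-in for SQL Server. The table names, columns, and the upsert-style write are all hypothetical and are not IdentityMesh's actual schema or engine code.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory stand-in for the shared database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE watermarks (connector TEXT PRIMARY KEY, position INTEGER)")
    conn.execute("INSERT INTO watermarks VALUES ('hr', 100)")
    conn.commit()
    return conn


def import_batch(conn, connector, rows, new_watermark):
    """Write one batch and advance the watermark in a single transaction."""
    with conn:  # commits on success, rolls back if the block raises
        for object_id, payload in rows:
            # Upsert, so re-importing the same window overwrites rather than duplicates.
            conn.execute(
                "INSERT OR REPLACE INTO objects (id, payload) VALUES (?, ?)",
                (object_id, payload),
            )
        conn.execute(
            "UPDATE watermarks SET position = ? WHERE connector = ?",
            (new_watermark, connector),
        )


def watermark(conn, connector):
    """Return the stored watermark position for one connector."""
    row = conn.execute(
        "SELECT position FROM watermarks WHERE connector = ?", (connector,)
    ).fetchone()
    return row[0]
```

If the row source fails partway through a batch, neither the partial rows nor the new watermark are kept, so the next run starts again from the old position. Because the writes are upserts in this sketch, repeating a window costs time but does not duplicate objects.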

What NOT to do

Do not run both Sync Engine instances simultaneously against the same SQL database. The engine-instance registry table exists but the scheduler does not yet consult it for leader election. Two engines pointed at the same database will both schedule the same connectors at the same times — wasted work, noisier audit logs, and composer-rule mutations interleaving in surprising ways. Always confirm the previously-active engine is stopped before starting the new one.
Do not put both Admin API instances behind a round-robin load balancer. ASP.NET Core Data Protection keys are not persisted to a shared store, so cookies and antiforgery tokens issued by one instance will not decrypt on the other. Sticky-session LBs are acceptable as long as a host failover triggers a session reset (operators sign in again); round-robin is not.
Do not skip the secret-provisioning step on the passive host with the DPAPI provider. The blobs in IM_Secrets are bound to the active host’s LocalMachine DPAPI key. Without re-provisioning on the passive host, the first sync after failover throws InvalidOperationException on every connector that needs an authenticated bind.
Do not configure the passive host’s services to start automatically on boot. A reboot of the passive host while the active host is healthy will produce the dual-active scenario above. Manual start is the safety interlock.
Do not let the two hosts drift in version. Apply MSI upgrades to both hosts during the same maintenance window — see upgrades.md. A version mismatch is supported briefly during cutover (active running version N, passive staged at N+1) but should not persist.
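
The manual-start interlock is easy to lose after an MSI repair or a well-meaning change, so it may be worth checking periodically. A minimal sketch, assuming the operator captures the text of `sc qc <service>` on the passive host for each service: it reads the `START_TYPE` line, which shows `DEMAND_START` for a service configured with `start= demand`. The function names are illustrative.

```python
import re

# Matches the start-type line of `sc qc`, e.g. "START_TYPE : 3   DEMAND_START".
START_TYPE_LINE = re.compile(r"START_TYPE\s*:\s*\d+\s+(\w+)")


def start_type(sc_qc_output):
    """Extract the start type (DEMAND_START, AUTO_START, ...) or None if absent."""
    match = START_TYPE_LINE.search(sc_qc_output)
    return match.group(1) if match else None


def interlock_violations(qc_output_by_service):
    """Return the sorted service names on the passive host that are not Manual start.

    A service whose start type cannot be read is reported too, so a failed
    query is never mistaken for a pass.
    """
    return sorted(
        name
        for name, text in qc_output_by_service.items()
        if start_type(text) != "DEMAND_START"
    )
```

An empty result means both services are still set to Manual start. Any name returned should be reset with the `sc config ... start= demand` commands from the pre-staging section.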

Verification checklist

Run this checklist immediately after every cutover, planned or unplanned. The new active host is “good” only when every item passes.

GET /health/live against the VIP returns 200.
GET /health/ready against the VIP returns 200. This confirms the new active host can reach SQL and the license file is readable.
GET /api/instances (requires the dashboard.read permission) returns exactly one engine row, and that row identifies the new active host. More than one row means the previously-active engine is still heartbeating somewhere — stop it before continuing.
GET /api/relays shows the expected relay agents in the Connected state. Relays reconnect within their SignalR back-off window (typically under 30 seconds); allow up to a minute before treating an absence as a failure.
The Admin UI loads at the VIP and an operator can sign in against the configured identity provider.
A test sync of a small connector (a “canary” connector pointed at a low-volume directory is the usual choice) completes without authentication errors. This confirms the secret store is wired correctly on the new active host.

If any item fails, the recovery is not complete. Capture the diagnostic for that step before deciding whether to continue investigating on the new active host or roll back to the original.
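
The machine-checkable items of the checklist can be evaluated together once the responses have been fetched. The sketch below takes already-parsed responses and returns the items that failed. The JSON field names it reads (`host` on engine rows, `name` and `state` on relay rows) are assumptions, because this runbook does not document the response shape of /api/instances or /api/relays. Adjust them to the real payloads.

```python
def evaluate_cutover(live_status, ready_status, engine_rows, relay_rows,
                     new_active_host, expected_relays):
    """Return a list of failed checklist items; an empty list means all passed.

    engine_rows and relay_rows are the decoded JSON arrays from
    /api/instances and /api/relays (field names are assumed, see above).
    """
    failures = []
    if live_status != 200:
        failures.append(f"/health/live returned {live_status}")
    if ready_status != 200:
        failures.append(f"/health/ready returned {ready_status}")

    # Exactly one engine row, and it must identify the new active host.
    hosts = [row.get("host") for row in engine_rows]
    if len(hosts) != 1:
        failures.append(f"expected 1 engine row, found {len(hosts)}: {hosts}")
    elif hosts[0] != new_active_host:
        failures.append(f"engine row names {hosts[0]}, expected {new_active_host}")

    # Every expected relay must be in the Connected state.
    connected = {r.get("name") for r in relay_rows if r.get("state") == "Connected"}
    missing = sorted(set(expected_relays) - connected)
    if missing:
        failures.append(f"relays not connected: {missing}")
    return failures
```

The interactive sign-in and the canary sync are not covered by this sketch and remain manual checks. Because relays may take up to a minute to reconnect, a relay failure reported in the first minute is worth re-evaluating before it is treated as real.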

Related documents

deployment-architecture.md: what state lives where, and why active/active is not supported.
backup-and-restore.md: the rebuild-from-backup procedure if both hosts are lost.
secrets-keyvault.md: the recommended secret provider for active/passive deployments.
secrets-and-dpapi.md: the constraints the default secret provider imposes on cluster topologies.
secret-rotation.md: rotation procedure; with the DPAPI provider every rotation is a two-host operation.
upgrades.md: the MSI upgrade procedure has the same shape as a planned cutover and can be performed as a rolling upgrade across active/passive.