Active/Passive Cluster Runbook

This document is the operator runbook for running IdentityMesh in an active/passive HA topology — one live host, one pre-staged standby, with a manual or MSCS-driven cutover when the active host needs to go away.

IdentityMesh ships single-instance only (see deployment-architecture.md for why active/active isn’t supported today). Active/passive sidesteps the distributed-state gaps entirely while still providing the failover capability most enterprises need.

When to use this

Topology

                ┌──────────────────────────────────┐
                │  VIP / DNS CNAME                 │
                │  (im.contoso.local)              │
                └──────────────────┬───────────────┘
                                   │
                  ┌────────────────┴───────────────┐
                  ▼                                ▼
       ┌────────────────────┐           ┌────────────────────┐
       │  ACTIVE host       │           │  PASSIVE host      │
       │  Admin API   (run) │           │  Admin API   (stop)│
       │  Sync Engine (run) │           │  Sync Engine (stop)│
       └─────────┬──────────┘           └─────────┬──────────┘
                 │ EF Core, TLS                   │ EF Core, TLS
                 └────────────────┬───────────────┘
                                  ▼
                  ┌────────────────────────────────┐
                  │  Shared SQL Server             │
                  │  (Failover Cluster / AG)       │
                  │  IdentityMesh database         │
                  └────────────────────────────────┘

                  ┌────────────────────────────────┐
                  │  Relay agents                  │
                  │  (SignalR clients,             │
                  │   auto-reconnect to VIP)       │
                  └────────────────────────────────┘

Both hosts have IdentityMesh installed at the same version, point at the same SQL database, and serve traffic on the same DNS name via a VIP or CNAME. Only the active host has its services running. Relay agents target the VIP, so failover is transparent to them once the record swings.

Pre-staging the passive node

The passive node is a full IdentityMesh install whose services are configured not to start on boot. Build it once; refresh it on every upgrade.

  1. Install the same MSI version that the active host runs. See installer.md. The installer creates the same service definitions on both hosts.

  2. Mirror appsettings.json between hosts. The ConnectionStrings:IdentityMesh value must point at the shared SQL endpoint (the AG listener or cluster VNN, not a node name). Auth configuration (Authentication:JwtBearer:* for OIDC, or the Negotiate config for Windows auth) must be identical so a user’s session on the active host doesn’t depend on host-specific issuer / audience values.

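  3. Set both Windows services to Manual start on the passive host so they don’t compete with the active host after a reboot, then stop them if they are running:

    sc config IdentityMeshAdmin  start= demand
    sc config IdentityMeshEngine start= demand
    sc stop   IdentityMeshAdmin
    sc stop   IdentityMeshEngine

    The active host keeps the default Automatic start. Keep the space after start=; sc expects it. In Windows PowerShell 5.1, sc is an alias for Set-Content, so run these commands from cmd.exe or invoke sc.exe explicitly.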

  4. Pre-provision the license file at %CommonApplicationData%\IdentityMesh\license.key. The license is stored as a file on each host (see backup-and-restore.md), so copying the active host’s license.key to the passive host is sufficient. The same file is valid on either host because the license is bound to the customer, not to a machine ID.

  5. Provision the TLS / code-signing certificates in the passive host’s LocalMachine certificate store (or wherever the active host loads them from). Certificates are host-local material and do not travel via the SQL database.

  6. Provision secrets. This step depends on the configured secret provider:

    Provider: DPAPI (default)
      Passive-node setup: Re-run secretscli set <ref> <value> on the passive host for every entry in IM_Secrets. The DPAPI key is bound to the host’s LocalMachine scope and the active host’s blobs cannot be decrypted on the passive host. See secrets-and-dpapi.md.
      Cutover impact: Each rotation must be re-applied to the passive host.

    Provider: AzureKeyVault
      Passive-node setup: Grant the passive host’s managed identity (or the configured AZURE_CLIENT_ID) the Key Vault Secrets User role on the same vault. Set Secrets:Provider = AzureKeyVault and the same KeyVault:VaultUri as the active host. See secrets-keyvault.md.
      Cutover impact: None. Secrets are read from the vault on demand.

    Recommended: use the Azure Key Vault provider for active/passive deployments. The DPAPI provider works, but every secret rotation becomes a two-host operation, and every upgrade-time MSI repair that resets the host’s machine key forces a full re-provision.

  7. Verify the passive host can reach SQL before declaring the node ready. Start the services briefly, confirm GET /health/ready returns 200 against localhost, then stop them again.
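The parity requirement in step 2 can be checked mechanically before the node is declared ready. The following is a minimal Python sketch, not a shipped tool: it assumes both appsettings.json files are plain JSON without comments and have been copied to the machine running the check. The key paths come from step 2; the helper names (lookup, parity_mismatches, load) are hypothetical.

```python
import json

# Settings that step 2 says must be identical on both hosts.
# Colon-separated paths follow the .NET configuration convention.
MUST_MATCH = ["ConnectionStrings:IdentityMesh", "Authentication:JwtBearer"]


def lookup(config, path):
    """Walk a colon-separated path through nested dicts; None if absent."""
    node = config
    for part in path.split(":"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def parity_mismatches(active, passive, paths=MUST_MATCH):
    """Return the paths whose values differ between the two parsed configs."""
    return [p for p in paths if lookup(active, p) != lookup(passive, p)]


def load(path):
    """Parse an appsettings.json file, tolerating a UTF-8 byte-order mark."""
    with open(path, encoding="utf-8-sig") as fh:
        return json.load(fh)
```

Calling parity_mismatches(load("active-appsettings.json"), load("passive-appsettings.json")) with two local copies (file names are placeholders) returns an empty list when the listed settings agree. Deployments that use Windows auth would need the Negotiate section added to MUST_MATCH under whatever key it actually uses.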

Cutover procedure (planned)

Use this procedure for OS patching, MSI upgrades, or any maintenance that requires the active host to step down briefly.

  1. Drain the active host. Pause schedules in the Admin UI or wait for in-flight runs to complete. The Sync Engine wraps each import batch in a single SQL transaction (see deployment-architecture.md), so stopping mid-batch is safe because the batch rolls back. It does force the next sync to re-run from the last watermark, though, so let in-flight runs finish when you can.
  2. Stop the services on the active host:
    sc stop IdentityMeshEngine
    sc stop IdentityMeshAdmin
    Watch the engine log file (logs/identitymesh-YYYYMMDD.log) for the final “shutdown complete” line before proceeding.
  3. Update the VIP / DNS CNAME to point at the passive host.
    • DNS: change the CNAME and wait out the TTL. Keep the TTL low (30 to 60 seconds) in active/passive deployments so this wait stays short; the TTL must be lowered ahead of the cutover, not during it.
    • MSCS / Windows Failover Cluster: move the IdentityMesh role to the standby node; the cluster handles the IP swap.
  4. Start the services on the passive host:
    sc start IdentityMeshAdmin
    sc start IdentityMeshEngine
    The Admin API initialises first; the Sync Engine registers itself in the engine-instance registry once it starts.
  5. Verify using the verification checklist below.
  6. Rollback is the same procedure in reverse: stop on the new active, swing DNS / VIP back, start on the original active.

The total operator-visible outage is the time between step 2 and step 5 — typically 1 to 3 minutes if DNS TTL is low.
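For DNS-based deployments, the wait in step 3 ends when the service name resolves to the new active host. A minimal Python sketch of that wait follows. It assumes IPv4, and it only reflects what the machine running it sees, so local resolver caches can make it lag or lead other clients. The function names and the example address are hypothetical; im.contoso.local is the name from the topology diagram.

```python
import socket
import time


def resolves_to(name, expected_ip, resolver=socket.gethostbyname_ex):
    """True if `name` currently resolves to `expected_ip` from this machine."""
    try:
        _canonical, _aliases, addresses = resolver(name)
    except OSError:  # includes socket.gaierror (name not found, DNS failure)
        return False
    return expected_ip in addresses


def wait_for_swing(name, expected_ip, timeout_s=180, interval_s=5,
                   resolver=socket.gethostbyname_ex, sleep=time.sleep):
    """Poll until the name resolves to the expected address or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if resolves_to(name, expected_ip, resolver):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For example, wait_for_swing("im.contoso.local", "10.0.0.12") returns True once the record points at the passive host, where 10.0.0.12 stands in for that host's real address. In an MSCS deployment the VIP address does not change, so this check does not apply.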

Cutover procedure (unplanned)

The active host is unreachable: hardware failure, OS crash, network isolation. This procedure assumes the SQL backend is healthy; if SQL is also down, that is a separate incident and IdentityMesh cannot serve traffic from either node.

  1. Confirm the active host is actually down. A flapping host that comes back mid-failover with the services still running creates the dual-active scenario described in What NOT to do. If in doubt, force the active host’s services off via remote management before proceeding.
  2. Update the VIP / DNS CNAME to point at the passive host.
  3. Start the services on the passive host:
    sc start IdentityMeshAdmin
    sc start IdentityMeshEngine
  4. Verify using the verification checklist.
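Step 1 can be supported by a repeated TCP probe of the active host before anything is started on the passive node. The sketch below is a Python illustration under stated assumptions: the port list is supplied by the operator (this runbook does not state the listening ports), and the function names are hypothetical. A failed probe shows only that the host is unreachable from the probing machine. A network-isolated host may still be running its services against SQL, so the advice to force them off via remote management still applies.

```python
import socket
import time


def port_open(host, port, timeout_s=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False


def looks_down(host, ports, attempts=3, pause_s=5.0,
               sleep=time.sleep, probe=port_open):
    """True only if every port fails on every attempt.

    Repeating the probe guards against treating a flapping host as dead.
    """
    for attempt in range(attempts):
        if any(probe(host, p) for p in ports):
            return False  # something answered; do not fail over yet
        if attempt < attempts - 1:
            sleep(pause_s)
    return True
```

Probe the host's own name or address rather than the VIP, so the result cannot be confused by a record that has already moved.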

Expected behaviour after an unplanned failover

What NOT to do

Verification checklist

Run this checklist immediately after every cutover, planned or unplanned. The new active host is “good” only when every item passes.

If any item fails, the recovery is not complete. Capture the diagnostic for that step before deciding whether to continue investigating on the new active host or roll back to the original.
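One check that lends itself to scripting is the readiness probe used in pre-staging step 7, GET /health/ready returning 200. The following Python sketch polls that endpoint with a bounded number of retries. The scheme, host, and port in the URL are assumptions to adapt, the function names are hypothetical, and a certificate that the probing machine does not trust is reported the same way as an unreachable host.

```python
import time
import urllib.error
import urllib.request


def ready_status(url, timeout_s=5.0):
    """Return the HTTP status of GET url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but not with a 2xx status
    except (urllib.error.URLError, OSError):
        return None      # refused, DNS failure, TLS failure, or timeout


def wait_until_ready(url, attempts=12, pause_s=5.0,
                     fetch=ready_status, sleep=time.sleep):
    """Poll the readiness endpoint until it returns 200 or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(pause_s)
    return False
```

Running wait_until_ready("https://im.contoso.local/health/ready") from a client machine exercises the VIP or CNAME as well as the service; running it against localhost on the new active host isolates the service itself.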