Deployment Architecture

This document states explicitly what IdentityMesh’s supported deployment topology is — and, as importantly, what it is not — so operators don’t accidentally deploy in a configuration the code doesn’t honour.

The supported topology

   ┌────────────────────────────────┐
   │  IdentityMesh Admin API        │ ← one instance
   │  (Windows service + Kestrel)   │
   └──────────────┬─────────────────┘
                  │ EF Core, TLS
                  ▼
   ┌────────────────────────────────┐
   │  SQL Server                    │ ← can be clustered /
   │  (IdentityMesh database)       │   AG'd / mirrored, but
   └──────────────┬─────────────────┘   from IdentityMesh's
                  ▲                     perspective it's "the DB"
                  │ EF Core, TLS
   ┌──────────────┴─────────────────┐
   │  IdentityMesh Sync Engine      │ ← one instance
   │  (Windows service)             │
   └──────────────┬─────────────────┘
                  │ SignalR (hub client)
                  ▼
   ┌────────────────────────────────┐
   │  Relay agents (≥ 0)            │ ← one per remote host,
   │  (Windows services)            │   independent
   └────────────────────────────────┘

One Admin API, one Sync Engine, one SQL database, zero or more Relay agents. The Admin API and the Sync Engine can share a host or sit on separate hosts; the product behaves the same either way. Relay agents are designed to run on remote hosts.

What’s per-instance vs shared

| State | Location | Scaled out? |
|---|---|---|
| Identity data (`IM_MeshObjects`, attributes, audit, composer) | SQL | Yes: SQL Server handles the HA story |
| License (`license.key`) | `CommonApplicationData\IdentityMesh\` | No: per-host file |
| Secrets (`IM_Secrets`) | SQL (DPAPI-encrypted under the host’s LocalMachine key) | Partially: blobs travel, decryption key does not. See `secrets-and-dpapi.md`. |
| ASP.NET Core Data Protection keys | Admin API host memory (default provider) | No: `AddDataProtection().PersistKeysTo…` is not wired |
| SignalR hub state (connection lookups, group membership) | Admin API host memory | No: no Redis / Service Bus backplane |
| In-memory caches (join rules, composer rules, attribute-flow rules, projection rules, export queue working set) | Sync Engine scope memory | No: each scope is a run |
| Engine instance registration (the engine-instance registry) | SQL | Table supports N rows; the scheduler doesn’t read it for coordination yet. |

The pattern: SQL holds everything durable; everything ephemeral lives in the instance’s process memory.

Why scale-out isn’t supported today

Three specific gaps, each independently sufficient to break a scale-out deployment:

Data Protection keys aren’t persisted to a shared store. Cookies, antiforgery tokens, and any other data-protected payload minted by one Admin API instance wouldn’t decrypt on a second instance. The moment a sticky-session load balancer fell back to a different instance, the session would break.
SignalR has no backplane. A relay agent connected to instance A would be invisible to instance B. Admin UI operations that broadcast commands to relays (sync triggers, config reloads) from instance B wouldn’t reach the relay.
Engine run coordination isn’t wired. Two sync engine instances pointing at the same SQL database would both schedule the same connectors at the same times. That is not a correctness bug (each object touches its own watermark + transaction), but it is wasteful, and composer-rule mutations from two engines could interleave in surprising ways. The engine-instance registry table exists for a future leader-election implementation; it isn’t consulted today.
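
The first gap can be illustrated without ASP.NET Core at all. The sketch below is an analogy, not the Data Protection API: each hypothetical `Instance` holds its own in-memory signing key, mirroring the per-instance key situation described above, and a token minted by one instance fails validation on the other. All class and function names are made up for illustration.

```python
import hashlib
import hmac
import secrets


class Instance:
    """Stand-in for one Admin API instance with its own in-memory key."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Key generated at startup and never shared with other instances.
        self._key = secrets.token_bytes(32)

    def protect(self, payload: str) -> str:
        """Return 'payload.signature', signed with this instance's key."""
        sig = hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()
        return f"{payload}.{sig}"

    def unprotect(self, token: str) -> str:
        """Return the payload, or raise ValueError if the signature does not match."""
        payload, _, sig = token.rpartition(".")
        expected = hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            raise ValueError(f"{self.name} cannot validate this token")
        return payload


def survives_failover(minted_by: Instance, validated_by: Instance) -> bool:
    """True if a token minted by one instance validates on the other."""
    token = minted_by.protect("session=alice")
    try:
        validated_by.unprotect(token)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    a, b = Instance("A"), Instance("B")
    print(survives_failover(a, a))  # request stays on the same instance
    print(survives_failover(a, b))  # load balancer fell back to B
```

Sharing the key between the two instances is what a persisted key store would achieve; the sketch only shows why the absence of one breaks failover.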

Additionally:

The per-scope rule caches in the Sync Engine are explicitly scoped to “one engine process, one connector run”. If a second engine ran against the same DB, each would hold its own cache. That causes no correctness issue, but it also brings none of the performance benefit that a shared cache would.
Rate limiting is per-instance — a 30 req/min cap on /api/license/upload on instance A doesn’t coordinate with instance B. A sticky-session LB hides this in practice; a round-robin LB would let a caller burn 2× the cap.
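
The 2× figure can be checked with a small simulation. This is a minimal sketch, assuming a simple in-memory sliding-window limiter and a perfectly even round-robin load balancer; `SlidingWindowLimiter` and `accepted_in_one_minute` are illustrative names, not product code.

```python
from collections import deque
from itertools import cycle


class SlidingWindowLimiter:
    """In-memory limiter: at most `limit` accepted requests per `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the window.
        while self._hits and now - self._hits[0] >= self.window:
            self._hits.popleft()
        if len(self._hits) < self.limit:
            self._hits.append(now)
            return True
        return False


def accepted_in_one_minute(instances: int, attempts: int, limit: int = 30) -> int:
    """Count accepted requests when a round-robin LB spreads `attempts` over `instances`."""
    limiters = [SlidingWindowLimiter(limit, 60.0) for _ in range(instances)]
    accepted = 0
    # Attempts are evenly spaced inside one 60 s window; each instance keeps its own state.
    for i, limiter in zip(range(attempts), cycle(limiters)):
        if limiter.allow(now=i * 60.0 / attempts):
            accepted += 1
    return accepted


if __name__ == "__main__":
    print(accepted_in_one_minute(instances=1, attempts=100))
    print(accepted_in_one_minute(instances=2, attempts=100))
```

With one instance the caller is held to the configured cap; with two uncoordinated instances behind round-robin, each instance enforces the cap separately, so the effective limit doubles.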

Supported failure modes

Admin API process crashes / host reboots: the Windows service restart policy brings it back. The Sync Engine continues unaffected (it doesn’t talk to the Admin API — both talk to the same SQL). Interactive operators see a brief outage. Relay agents reconnect automatically when the SignalR hub comes back.
Sync Engine process crashes / host reboots: the Windows service restart policy brings it back. Any in-flight batch rolls back (the engine wraps each batch in one transaction); watermark stays at the last checkpoint, so the connector re-syncs from there on the next schedule tick. Admin API continues unaffected.
SQL Server unavailable: everything pauses. EF’s EnableRetryOnFailure (configured for 5 attempts, 30 s max delay) rides out short blips. Longer outages surface as /health/ready failures until the DB returns.
Relay agent crashes: its queued work replays from SQL when it reconnects; the Sync Engine doesn’t block on an offline relay (relays are independent of each other, per connector group).
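
The Sync Engine crash case relies on the batch and its watermark update sharing one transaction. The sketch below shows that pattern with SQLite standing in for SQL Server; the `objects` and `watermark` tables, the `hr` connector name, and the simulated crash are all assumptions for illustration, and a real crash would end the process rather than raise an exception.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables standing in for synced data and the connector watermark.
    conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE watermark (connector TEXT PRIMARY KEY, position INTEGER)")
    conn.execute("INSERT INTO watermark VALUES ('hr', 0)")
    conn.commit()


def run_batch(conn, rows, new_position, crash_after=None):
    """Apply one batch and advance the watermark in a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for n, (obj_id, value) in enumerate(rows):
                if crash_after is not None and n == crash_after:
                    raise RuntimeError("simulated engine crash mid-batch")
                conn.execute("INSERT INTO objects VALUES (?, ?)", (obj_id, value))
            conn.execute(
                "UPDATE watermark SET position = ? WHERE connector = 'hr'",
                (new_position,),
            )
    except RuntimeError:
        pass  # in production the service restart policy takes over here


def state(conn):
    """Return (number of stored objects, current watermark position)."""
    count = conn.execute("SELECT COUNT(*) FROM objects").fetchone()[0]
    position = conn.execute("SELECT position FROM watermark").fetchone()[0]
    return count, position


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    rows = [(1, "a"), (2, "b")]
    run_batch(conn, rows, new_position=2, crash_after=1)
    print(state(conn))  # nothing from the failed batch is visible
    run_batch(conn, rows, new_position=2)
    print(state(conn))  # the retry applies the whole batch
```

Because the watermark only moves when the batch commits, the retry starts from the last checkpoint and does not collide with half-written rows.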

Not-supported failure modes (today)

Admin API host permanently lost: operators must stand up a new host, install the MSI pointing at the same SQL database, and re-provision secrets (DPAPI keys don’t travel). Interactive downtime = time-to-rebuild. backup-and-restore.md covers the runbook.
Sync Engine host permanently lost: same story. Outage lasts until a replacement engine host is up.
Need for zero-downtime rolling upgrades: not possible with a single instance. Plan maintenance windows.

What scale-out would take

Not a commitment; an inventory for a future ER that would implement active/active:

Persist Data Protection keys to a shared store (AddDataProtection().PersistKeysToFileSystem() on a shared network path, PersistKeysToAzureBlobStorage, or the database via PersistKeysToDbContext from the Microsoft.AspNetCore.DataProtection.EntityFrameworkCore package).
Add a SignalR backplane. Redis is the most mature option (Microsoft.AspNetCore.SignalR.StackExchangeRedis). Azure SignalR Service is the alternative on Azure.
Replace the per-scope rule/schedule caches with a distributed cache or shared-invalidation signal so rule edits on one instance are seen by the other without restart.
Wire engine leader election against the engine-instance registry (the table already has the rows, just nothing consults them) so only one engine runs each connector at a time.
Move rate-limiter state off in-memory sliding windows to a distributed backend, so the partition counters are shared across instances.
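
As one possible shape for the leader-election item, here is a minimal lease sketch with SQLite standing in for SQL Server. It is not the product’s design: the `engine_lease` table, its columns, the 30-second TTL, and the function names are all hypothetical, since the real registry schema is not shown in this document.

```python
import sqlite3


def create_lease_table(conn: sqlite3.Connection) -> None:
    # Hypothetical lease table: one row per connector, owned by at most one engine.
    conn.execute(
        "CREATE TABLE engine_lease ("
        "connector TEXT PRIMARY KEY, owner TEXT, expires_at REAL)"
    )
    conn.execute("INSERT INTO engine_lease VALUES ('hr', NULL, 0)")
    conn.commit()


def try_acquire(conn, connector, engine_id, now, ttl=30.0):
    """Claim or renew the lease; return True if this engine may run the connector."""
    with conn:
        cur = conn.execute(
            "UPDATE engine_lease SET owner = ?, expires_at = ? "
            "WHERE connector = ? AND (owner = ? OR expires_at <= ?)",
            (engine_id, now + ttl, connector, engine_id, now),
        )
    # The UPDATE only matches if we already hold the lease or it has expired,
    # so a competing engine that arrives while the lease is live changes no rows.
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_lease_table(conn)
    print(try_acquire(conn, "hr", "engine-a", now=100.0))  # first claim
    print(try_acquire(conn, "hr", "engine-b", now=110.0))  # lease still held by engine-a
    print(try_acquire(conn, "hr", "engine-b", now=131.0))  # engine-a's lease has expired
```

The design choice here is to make the claim a single conditional UPDATE, so the database serialises competing engines and no separate lock is needed. An engine that stops renewing loses the connector after the TTL.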

For most enterprises the simpler active/passive topology covers the common HA ask — one instance live, a second instance ready to take over via a standard MSCS cluster role or a manual cutover procedure. That sidesteps the distributed-state complexity entirely.

Checking the topology at runtime

GET /health/ready from the Admin API returns a 200 only when this instance can talk to SQL and has a valid license. Use it as the LB probe.
GET /api/instances (requires dashboard.read) reads the engine-instance registry and returns every engine that’s recently heartbeated. A well-behaved single-instance deployment shows exactly one row; seeing more than one is a signal that two engines are pointed at the same DB (usually a config mistake during migration or a test rig left running).
GET /api/relays lists connected relay agents as seen by this Admin API instance — again, “exactly what you provisioned” is the healthy signal.
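
These three checks could be combined into a small monitoring script. The sketch below is hedged heavily: the document does not specify the response bodies or the authentication scheme, so `check_topology` takes already-parsed JSON and assumes each endpoint returns a list with one object per row, with relays carrying a `name` field. Only the endpoint paths and the “exactly one engine, exactly the relays you provisioned” rule come from the text above.

```python
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """True if GET /health/ready answers 200; any error counts as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def check_topology(instances: list, relays: list, expected_relays: set) -> list:
    """Return warnings; an empty list means the topology looks as provisioned.

    `instances` and `relays` are the parsed bodies of GET /api/instances and
    GET /api/relays (assumed shape: one dict per row).
    """
    warnings = []
    if len(instances) != 1:
        warnings.append(f"expected exactly 1 engine instance, found {len(instances)}")
    seen = {relay["name"] for relay in relays}  # "name" is an assumed field
    for missing in sorted(expected_relays - seen):
        warnings.append(f"relay not connected: {missing}")
    for extra in sorted(seen - expected_relays):
        warnings.append(f"unexpected relay connected: {extra}")
    return warnings


if __name__ == "__main__":
    # Example with hand-written data in the assumed shape.
    instances = [{"id": "engine-1"}, {"id": "engine-2"}]
    relays = [{"name": "relay-east"}]
    for line in check_topology(instances, relays, {"relay-east", "relay-west"}):
        print(line)
```

Fetching the two /api endpoints is left out because it depends on how the caller presents the dashboard.read permission, which this document does not describe.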

See also

backup-and-restore.md: what to do when the Admin API or Sync Engine host is permanently lost.
secrets-and-dpapi.md: why DPAPI-encrypted secrets don’t survive a host change, and what to re-provision.