Deployment Architecture
This document states explicitly what IdentityMesh’s supported deployment topology is — and, as importantly, what it is not — so operators don’t accidentally deploy in a configuration the code doesn’t honour.
The supported topology
┌────────────────────────────────┐
│  IdentityMesh Admin API        │  ← one instance
│  (Windows service + Kestrel)   │
└──────────────┬─────────────────┘
               │ EF Core, TLS
               ▼
┌────────────────────────────────┐
│  SQL Server                    │  ← can be clustered /
│  (IdentityMesh database)       │    AG'd / mirrored, but
└──────────────┬─────────────────┘    from IdentityMesh's
               ▲                      perspective it's "the DB"
               │ EF Core, TLS
┌──────────────┴─────────────────┐
│  IdentityMesh Sync Engine      │  ← one instance
│  (Windows service)             │
└──────────────┬─────────────────┘
               │ SignalR (hub client)
               ▼
┌────────────────────────────────┐
│  Relay agents (≥ 0)            │  ← one per remote host,
│  (Windows services)            │    independent
└────────────────────────────────┘
One Admin API, one Sync Engine, one SQL database, zero-or-more Relay agents. The Admin API and the Sync Engine can share a host or sit on separate hosts — doesn’t matter to the product. Relay agents are designed to be remote.
What’s per-instance vs shared
| State | Location | Scaled out? |
|---|---|---|
| Identity data (IM_MeshObjects, attributes, audit, composer) | SQL | Yes — SQL Server handles the HA story |
| License (license.key) | CommonApplicationData\IdentityMesh\ | No — per-host file |
| Secrets (IM_Secrets) | SQL (DPAPI-encrypted under the host’s LocalMachine key) | Partially — blobs travel, decryption key does not. See secrets-and-dpapi.md. |
| ASP.NET Core Data Protection keys | Admin API host memory (default provider) | No — AddDataProtection().PersistKeysTo… is not wired |
| SignalR hub state (connection lookups, group membership) | Admin API host memory | No — no Redis / Service Bus backplane |
| In-memory caches (join rules, composer rules, attribute-flow rules, projection rules, export queue working set) | Sync Engine scope memory | No — each scope is a run |
| Engine instance registration (the engine-instance registry) | SQL | Table supports N rows; the scheduler doesn’t read it for coordination yet. |
The pattern: SQL holds everything durable; everything ephemeral lives in the instance’s process memory.
Why scale-out isn’t supported today
Three specific gaps, each independently sufficient to break a scale-out deployment:
- Data Protection keys aren’t persisted to a shared store. Cookies, antiforgery tokens, and any other data-protected payload minted by one Admin API instance wouldn’t decrypt on a second instance. The moment a sticky-session load balancer fell back to a different instance, the session would break.
- SignalR has no backplane. A relay agent connected to instance A would be invisible to instance B. Admin UI operations that broadcast commands to relays (sync triggers, config reloads) from instance B wouldn’t reach the relay.
- Engine run coordination isn’t wired. Two sync engine instances pointing at the same SQL database would both schedule the same connectors at the same times — not a correctness bug (each object touches its own watermark + transaction) but wasteful, and composer-rule mutations from two engines could interleave in surprising ways. The engine-instance registry table exists for a future leader-election implementation; it isn’t consulted today.
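To make the first gap concrete, here is a minimal sketch of what happens when two instances each hold their own in-memory key. It is an analogy only: it uses plain HMAC signing from the Python standard library, not ASP.NET Core Data Protection (which also encrypts), and the `Instance` class, `can_read` helper, and payload are hypothetical names invented for the illustration.

```python
import hashlib
import hmac
import secrets


class Instance:
    """Stand-in for one Admin API instance holding a key in process memory."""

    def __init__(self, key=None):
        # Assumption: without a shared key store, each process makes its own key.
        self.key = key if key is not None else secrets.token_bytes(32)

    def protect(self, payload: bytes) -> bytes:
        # Prefix the payload with a 32-byte HMAC-SHA256 tag.
        tag = hmac.new(self.key, payload, hashlib.sha256).digest()
        return tag + payload

    def unprotect(self, token: bytes) -> bytes:
        tag, payload = token[:32], token[32:]
        expected = hmac.new(self.key, payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("token was not minted with this instance's key")
        return payload


def can_read(instance: Instance, token: bytes) -> bool:
    """True if the instance accepts the token, False if validation fails."""
    try:
        instance.unprotect(token)
        return True
    except ValueError:
        return False


instance_a = Instance()
instance_b = Instance()                      # second process, separate key
cookie = instance_a.protect(b"session=alice")

# A hypothetical third instance that loads the same key from a shared store.
instance_shared = Instance(key=instance_a.key)
```

In this sketch `can_read(instance_a, cookie)` succeeds, `can_read(instance_b, cookie)` fails, and `can_read(instance_shared, cookie)` succeeds, which is the behaviour a shared key store would be expected to restore.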
Additionally:
- The per-scope rule caches in the Sync Engine are explicitly scoped to “one engine process, one connector run”. If a second engine ran against the same DB, each would hold its own cache: that causes no correctness issues, but it also forgoes the performance win of a shared cache.
- Rate limiting is per-instance — a 30 req/min cap on /api/license/upload on instance A doesn’t coordinate with instance B. A sticky-session LB hides this in practice; a round-robin LB would let a caller burn 2× the cap.
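The 2× figure in the rate-limiting point can be checked with a small simulation. This is a sketch under simplifying assumptions: it models a single fixed window rather than the product's sliding windows, and `FixedWindowLimiter`, `accepted_requests`, and the caller name are hypothetical.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Per-instance limiter: counts each caller's requests within one window."""

    def __init__(self, cap: int):
        self.cap = cap
        self.counts = defaultdict(int)

    def allow(self, caller: str) -> bool:
        if self.counts[caller] >= self.cap:
            return False
        self.counts[caller] += 1
        return True


def accepted_requests(limiters, route, caller: str, attempts: int) -> int:
    """Send `attempts` requests in one window; `route(i)` picks the instance index."""
    return sum(limiters[route(i)].allow(caller) for i in range(attempts))


CAP = 30  # the 30 req/min cap from the text

# Sticky sessions: every request lands on instance 0.
sticky = accepted_requests(
    [FixedWindowLimiter(CAP), FixedWindowLimiter(CAP)], lambda i: 0, "caller-1", 100
)

# Round robin: requests alternate between the two instances.
round_robin = accepted_requests(
    [FixedWindowLimiter(CAP), FixedWindowLimiter(CAP)], lambda i: i % 2, "caller-1", 100
)

print(sticky, round_robin)  # prints "30 60"
```

With N instances behind a round-robin LB the same reasoning gives N × the cap, which is why a distributed limiter backend appears in the scale-out inventory.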
Supported failure modes
- Admin API process crashes / host reboots: the Windows service restart policy brings it back. The Sync Engine continues unaffected (it doesn’t talk to the Admin API — both talk to the same SQL). Interactive operators see a brief outage. Relay agents reconnect automatically when the SignalR hub comes back.
- Sync Engine process crashes / host reboots: the Windows service restart policy brings it back. Any in-flight batch rolls back (the engine wraps each batch in one transaction); watermark stays at the last checkpoint, so the connector re-syncs from there on the next schedule tick. Admin API continues unaffected.
- SQL Server unavailable: everything pauses. EF’s EnableRetryOnFailure (configured 5 attempts, 30s max delay) rides out short blips. Longer outages surface as /health/ready failures until the DB returns.
- Relay agent crashes: its queued work replays from SQL when it reconnects; the Sync Engine doesn’t block on an offline relay (each relay is per-connector-group independent).
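To get a feel for how long a "short blip" can be, the sketch below computes retry delays for a generic capped exponential backoff using the configured bounds (5 attempts, 30 s max delay). It is not EF Core's exact formula: the 1-second base, the doubling, and the absence of jitter are assumptions, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(max_retries: int, max_delay_s: float, base_s: float = 1.0) -> list:
    """Delay before each retry: base * 2^attempt, never above max_delay_s."""
    return [min(max_delay_s, base_s * 2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays(max_retries=5, max_delay_s=30.0)
total_wait_s = sum(delays)

# Hard upper bound implied by the configuration alone, whatever the schedule:
# no single delay exceeds 30 s and there are at most 5 of them.
upper_bound_s = 5 * 30.0

print(delays)         # [1.0, 2.0, 4.0, 8.0, 16.0]
print(total_wait_s)   # 31.0
print(upper_bound_s)  # 150.0
```

Under these assumptions the retries cover roughly half a minute of waiting, and never more than 150 seconds, so an outage longer than that should be expected to surface as readiness failures rather than be absorbed.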
Not-supported failure modes (today)
- Admin API host permanently lost: operators must stand up a new host, install the MSI pointing at the same SQL database, and re-provision secrets (DPAPI keys don’t travel). Interactive downtime = time-to-rebuild. backup-and-restore.md covers the runbook.
- Sync Engine host permanently lost: same story. Outage lasts until a replacement engine host is up.
- Need for zero-downtime rolling upgrades: not possible with a single instance. Plan maintenance windows.
What scale-out would take
Not a commitment; an inventory for a future ER that would implement active/active:
- Persist Data Protection keys to a shared store (AddDataProtection().PersistKeysToFileSystem() on a shared network path, PersistKeysToAzureBlobStorage, or the DB via the community AspNetCore.DataProtection.EntityFrameworkCore package).
- Add a SignalR backplane. Redis is the most mature option (Microsoft.AspNetCore.SignalR.StackExchangeRedis). Azure SignalR Service is the alternative on Azure.
- Replace the per-scope rule/schedule caches with a distributed cache or shared-invalidation signal so rule edits on one instance are seen by the other without restart.
- Wire engine leader election against the engine-instance registry (the table already has the rows, just nothing consults them) so only one engine runs each connector at a time.
- Move the rate-limiter partition key off in-memory sliding windows to a distributed backend.
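As an illustration of the leader-election item, the sketch below shows one simple scheme: every engine heartbeats into a registry table, and the live engine with the lowest ID runs the connectors. Everything here is assumed for the example: the `engine_instances` table and its columns, the 30-second lease, and the use of in-memory SQLite in place of SQL Server. It ignores clock skew and other split-brain concerns a real implementation would have to address.

```python
import sqlite3

LEASE_TTL_S = 30.0  # hypothetical: heartbeats older than this are ignored


def open_registry() -> sqlite3.Connection:
    """Create an in-memory stand-in for the engine-instance registry."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE engine_instances ("
        " instance_id TEXT PRIMARY KEY,"
        " last_heartbeat REAL NOT NULL)"
    )
    return conn


def heartbeat(conn: sqlite3.Connection, instance_id: str, now: float) -> None:
    """Record that this engine was alive at time `now` (seconds)."""
    conn.execute(
        "INSERT OR REPLACE INTO engine_instances (instance_id, last_heartbeat) "
        "VALUES (?, ?)",
        (instance_id, now),
    )


def current_leader(conn: sqlite3.Connection, now: float):
    """Lowest instance_id among engines with a fresh heartbeat, or None."""
    row = conn.execute(
        "SELECT instance_id FROM engine_instances "
        "WHERE last_heartbeat >= ? ORDER BY instance_id LIMIT 1",
        (now - LEASE_TTL_S,),
    ).fetchone()
    return row[0] if row else None


def should_run_connectors(conn: sqlite3.Connection, instance_id: str, now: float) -> bool:
    """An engine schedules connectors only while it is the current leader."""
    return current_leader(conn, now) == instance_id


registry = open_registry()
heartbeat(registry, "engine-a", now=0.0)
heartbeat(registry, "engine-b", now=0.0)
leader_at_10 = current_leader(registry, now=10.0)   # both live, lowest ID wins

heartbeat(registry, "engine-b", now=40.0)           # engine-a has gone silent
leader_at_45 = current_leader(registry, now=45.0)   # engine-a's lease has expired
```

Here `leader_at_10` is `"engine-a"` and `leader_at_45` is `"engine-b"`: the standby takes over once the first engine's heartbeat ages out, without either engine talking to the other directly.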
For most enterprises the simpler active/passive topology covers the common HA ask — one instance live, a second instance ready to take over via a standard MSCS cluster role or a manual cutover procedure. That sidesteps the distributed-state complexity entirely.
Checking the topology at runtime
- GET /health/ready from the Admin API returns a 200 only when this instance can talk to SQL and has a valid license. Use it as the LB probe.
- GET /api/instances (requires dashboard.read) reads the engine-instance registry and returns every engine that’s recently heartbeated. A well-behaved single-instance deployment shows exactly one row; seeing more than one is a signal that two engines are pointed at the same DB (usually a config mistake during migration or a test rig left running).
- GET /api/relays lists connected relay agents as seen by this Admin API instance — again, “exactly what you provisioned” is the healthy signal.
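The three checks above can be combined into one monitoring rule. The sketch below assumes the caller has already fetched the endpoints and extracted the HTTP status, the engine IDs, and the relay IDs; the response shapes are not documented here, so `assess_topology` and its parameters are hypothetical and deliberately work on plain values.

```python
def assess_topology(ready_status, engine_ids, relay_ids, expected_relays):
    """Return a list of findings; an empty list means the topology looks healthy."""
    findings = []

    # /health/ready: 200 only when SQL is reachable and the license is valid.
    if ready_status != 200:
        findings.append(f"/health/ready returned {ready_status}")

    # /api/instances: a single-instance deployment shows exactly one engine.
    if len(engine_ids) == 0:
        findings.append("no engine has heartbeated recently")
    elif len(engine_ids) > 1:
        findings.append(
            "multiple engines share this DB: " + ", ".join(sorted(engine_ids))
        )

    # /api/relays: connected relays should match what was provisioned.
    missing = sorted(set(expected_relays) - set(relay_ids))
    unexpected = sorted(set(relay_ids) - set(expected_relays))
    if missing:
        findings.append("relays not connected: " + ", ".join(missing))
    if unexpected:
        findings.append("unprovisioned relays connected: " + ", ".join(unexpected))

    return findings


healthy = assess_topology(200, ["engine-a"], ["relay-1"], ["relay-1"])
leftover_rig = assess_topology(200, ["engine-a", "engine-test"], ["relay-1"], ["relay-1"])
```

In this sketch `healthy` is an empty list, while `leftover_rig` contains a single finding naming both engines, the situation the text describes as a test rig left running.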
Related
- backup-and-restore.md — what to do when the Admin API or Sync Engine host is permanently lost.
- secrets-and-dpapi.md — why DPAPI-encrypted secrets don’t survive a host change, what to re-provision.