Document caching strategy for Managed Identity v2 #5526
Added detailed caching strategy and resilience plan for Managed Identity v2, including problem identification, proposed solutions, call sequence, cache renewal matrix, invalidation rules, and security considerations.
| ## Solution (What’s Changing) | ||
| 1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**. |
What is "link-local"?
| ## Solution (What’s Changing) | ||
| 1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**. | ||
| 2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs. |
What does "treat as primary anchor" mean? Please use more precise wording.
| ## Solution (What’s Changing) | ||
| 1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**. | ||
| 2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs. |
We should not hardcode any expirations. We rely on services returning expirations.
| ## Solution (What’s Changing) | ||
| 1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**. | ||
| 2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs. | ||
| 3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry. |
Please be precise. Specify:
- the jitter (e.g. 5 min)
- whether renewal should happen on a front-end or back-end thread. I think front-end.
How is jitter calculated? Is it randomized per host/process or globally coordinated? Could jitter introduce any unintended renewal delays?
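One way these questions could be pinned down, as a minimal sketch only: jitter is drawn independently in each process from a uniform range, is only ever added after the half-life point, and is capped so that renewal never lands in the final part of the validity window. The function name `next_renewal_time`, the 5-minute default, and the 10% safety margin are illustrative assumptions, not something the PR specifies. The validity window is taken from the values returned by the service, not hardcoded.

```python
import random
from datetime import datetime, timedelta, timezone


def next_renewal_time(not_before, not_after,
                      max_jitter=timedelta(minutes=5), rng=random):
    """Half-life of the validity window plus per-process random jitter.

    Hypothetical helper: the name, the 5-minute default, and the 10%
    safety margin are assumptions made for illustration.
    """
    lifetime = not_after - not_before
    half_life = not_before + lifetime / 2
    # Uniform jitter in [0, max_jitter], drawn locally (no global coordination).
    jitter = timedelta(seconds=rng.uniform(0, max_jitter.total_seconds()))
    # Cap: never schedule renewal inside the last 10% of the lifetime,
    # so jitter cannot push renewal dangerously close to expiry.
    latest_allowed = not_after - lifetime / 10
    return min(half_life + jitter, latest_allowed)
```

With a cap like this, jitter can delay renewal by at most `max_jitter`, and for very short-lived artifacts the cap takes over.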
| 1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**. | ||
| 2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs. | ||
| 3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry. | ||
| 4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert. |
If you want cross-process coordination, please specify the IPC that is going to be used. This needs to exist on Windows and Linux and it needs to be available in sanctioned libraries across all supported MSAL languages.
How is the ‘single-writer’ selected? Is this a file lock, named mutex, or other mechanism? What happens if the single-writer crashes mid-renewal?
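To make the discussion concrete, here is a minimal sketch of one candidate mechanism: an exclusive-create lock file with a staleness timeout, which uses only primitives available on both Windows and Linux. Everything here is an assumption for illustration (the names `try_acquire` and `release`, the 60-second staleness threshold); the PR does not state which mechanism is intended. Note the known weakness: two processes that both judge the lock stale at the same moment can race on the remove-and-recreate step, so a production design would need something stronger.

```python
import os
import time


def try_acquire(lock_path, stale_after_s=60.0):
    """Try to become the single writer; return True on success.

    Hypothetical helper. A lock older than stale_after_s is assumed to
    belong to a writer that crashed mid-renewal and is taken over.
    """
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    try:
        fd = os.open(lock_path, flags)  # fails if the lock already exists
    except FileExistsError:
        try:
            age = time.time() - os.path.getmtime(lock_path)
        except FileNotFoundError:
            return False  # holder just released; caller can retry later
        if age < stale_after_s:
            return False  # a live writer holds the lock
        try:
            os.remove(lock_path)  # presumed crashed holder
        except FileNotFoundError:
            pass
        try:
            fd = os.open(lock_path, flags)
        except FileExistsError:
            return False  # another process took over first
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for diagnostics
    return True


def release(lock_path):
    """Release the lock; tolerate it already being gone."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

Processes that fail to acquire the lock would keep using the existing certificate and re-check later.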
| 2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs. | ||
| 3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry. | ||
| 4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert. | ||
| 5. **MAA token** is used **only** for issuance/renewal; short-lived cache to prevent attestations calls. |
Not enough precision. What does a "short-lived" cache mean?
What’s the cache invalidation logic if a policy or key rotation occurs on the MAA side? Is there a way to force re-attestation?
| ``` | ||
| Call 0 (local): Probe IMDS v2 → cache MSI source (V2/V1) | ||
| 1 (local): Create KeyGuard key (per reboot) | ||
| 2 (external): Get MAA token // only for (re)issuing cert |
What is local and what is external?
| | Item | Scope | Where | TTL | Notes | | ||
| |---|---|---|---|---| | ||
| | **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here | | ||
| | **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter | |
How are you going to deal with atomicity, multiple file writers, and a process that gets killed in the middle of a write?
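For the killed-mid-write case, one common pattern is to write to a temporary file in the same directory and then rename it over the target, so readers see either the old file or the new one, never a partial write. The sketch below assumes this pattern; the helper name `atomic_write` is hypothetical and the PR does not say which approach it intends. It does not by itself solve multiple concurrent writers (the last rename wins), which would still need the single-writer coordination.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Write data to path so that readers never observe a partial file.

    Hypothetical helper illustrating the temp-file-plus-rename pattern.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push contents to disk before the rename
        os.replace(tmp_path, path)  # replaces any existing file at path
    except BaseException:
        # Failure before the rename: leave the old file intact, drop the temp.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the previous cache file is untouched and only a stray temporary file remains to be cleaned up.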
Not enough details.
- For each cache and renewal step, document what happens if the cache is missing, invalid, or corrupted.
- Outline (even briefly) the implementation details of the single-writer system.
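As an example of the level of detail being asked for, the read path for a file cache could treat "missing", "corrupted", and "expired" identically: return nothing and let the caller fall through to re-acquisition. This is a sketch under assumptions: the JSON layout with `token` and `exp` fields and the name `load_cached_token` are hypothetical and do not come from the PR.

```python
import json
import time


def load_cached_token(path, now=None):
    """Return the cached entry, or None if missing, corrupted, or expired.

    Hypothetical helper; assumes a JSON object with a string "token"
    and a numeric "exp" (seconds since the epoch).
    """
    now = time.time() if now is None else now
    try:
        with open(path, "r", encoding="utf-8") as f:
            entry = json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable file, invalid JSON, or bad encoding.
        return None
    if not isinstance(entry, dict):
        return None
    token, exp = entry.get("token"), entry.get("exp")
    if not isinstance(token, str) or not token:
        return None
    if isinstance(exp, bool) or not isinstance(exp, (int, float)):
        return None
    if exp <= now:
        return None  # expired entries are treated like a cache miss
    return entry
```

A `None` result would then trigger the normal acquisition path, so a deleted or damaged file degrades to a cold start.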
| # Managed Identity v2 (Attested TB) — Resilience & Caching Plan | ||
| ## TL;DR | ||
| We reduce cold-start latency and dependency risk for MSI v2 by caching safe, long-lived artifacts, coordinating renewal across processes, and keeping the hot path in memory. **MAA is used only to (re)issue the binding certificate**; bound AT acquisition relies on that cert. Result: fewer failures, less churn, smoother CX. |
What’s the fallback if the binding cert is lost or corrupted? Is there any emergency recovery path?
| ``` | ||
| Call 0 (local): Probe IMDS v2 → cache MSI source (V2/V1) | ||
| 1 (local): Create KeyGuard key (per reboot) | ||
| 2 (external): Get MAA token // only for (re)issuing cert |
Are there retries or backoff strategies if the MAA call fails? Is exponential backoff used or is it a fixed retry policy?
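For reference, a capped exponential backoff with full jitter could look like the sketch below. The PR does not state a retry policy, so every parameter here (four attempts, 0.5 s base delay, 8 s cap) and the name `call_with_backoff` are assumptions. A real implementation would also need to decide which failures are retryable; this sketch retries on any exception.

```python
import random
import time


def call_with_backoff(fn, max_attempts=4, base_delay_s=0.5, max_delay_s=8.0,
                      sleep=time.sleep, rng=random):
    """Call fn(), retrying with capped exponential backoff and full jitter.

    Hypothetical helper; all defaults are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, up to max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Full jitter: sleep a random time in [0, ceiling].
            sleep(rng.uniform(0, ceiling))
```

The `sleep` and `rng` parameters are injectable so the policy can be tested without real delays.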
| | Item | Scope | Where | TTL | Notes | | ||
| |---|---|---|---|---| | ||
| | **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here | | ||
| | **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter | |
How are policy changes detected? Is it polled, pushed, or inferred from failures?
| |---|---|---|---|---| | ||
| | **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here | | ||
| | **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter | | ||
| | **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; Serialize issuance | |
What protects against file corruption or unauthorized access on Linux? Is there a fallback if the file is deleted outside of MSAL?
| | **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here | | ||
| | **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter | | ||
| | **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; Serialize issuance | | ||
| | **Access tokens (`bearer` or `mtls_pop`)** | Per audience | In memory | Service-configured | Reacquire after reboot (new key) | |
Are there scenarios where token invalidation lags behind key rotation? How does the system ensure that stale tokens aren’t accidentally reused?
| ## Invalidation Rules | ||
| - **Reboot** → Use **persisted binding cert** to fetch new ATs; re-attest on first demand on service failure. | ||
| - **Cert expiry** → re-issue. | ||
| - **MAA token expired** → re-attest and re-issue. |
Are there built-in safeguards to prevent a thundering herd if all processes notice expiry at the same time?
| ## Why This Improves CX | ||
| - **MAA is out of the hot path**—steady-state calls rely on a **multi-day binding cert**. | ||
| - Different identities on the same VM, uses **cached MAA token** |
Is the cache not keyed per identity?