@gladjohn gladjohn commented Oct 8, 2025

Added detailed caching strategy and resilience plan for Managed Identity v2, including problem identification, proposed solutions, call sequence, cache renewal matrix, invalidation rules, and security considerations.

@gladjohn gladjohn requested a review from a team as a code owner October 8, 2025 15:43
---

## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
Member

What is "link-local"?


## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
Member

What does "treat as primary anchor" mean? Please use more precise wording.


## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
Member

We should not hardcode any expirations. We rely on services returning expirations.

## Solution (What’s Changing)
1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry.
Member

Pls be precise. Specify:

  • jitter (e.g. 5 min)
  • if renewal should happen on front-end or back-end thread. I think front-end.

Contributor

How is jitter calculated? Is it randomized per host/process or globally coordinated? Could jitter introduce any unintended renewal delays?
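
To make the jitter questions concrete, here is a minimal sketch of one possible calculation, not the PR's actual design. It assumes the validity window (`not_before`, `not_after`) is read from the certificate the service returns rather than hardcoded, that jitter is drawn independently in each process, and that the 5-minute cap is the value suggested as an example earlier in this thread. The function name `renewal_time` is hypothetical.

```python
import random
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, max_jitter=timedelta(minutes=5), rng=random):
    """Return the instant at which this process should start renewing.

    Assumption: not_before/not_after come from the service-issued
    certificate, so no lifetime is hardcoded here.
    """
    half_life = not_before + (not_after - not_before) / 2
    # Per-process random jitter in [0, max_jitter]; it can delay renewal
    # by at most max_jitter past the half-life point.
    jitter = timedelta(seconds=rng.uniform(0, max_jitter.total_seconds()))
    return half_life + jitter


if __name__ == "__main__":
    issued = datetime(2025, 10, 8, tzinfo=timezone.utc)
    print(renewal_time(issued, issued + timedelta(days=7)))
```

Under these assumptions the worst-case delay is bounded by `max_jitter`, which is small relative to the remaining half of the validity window.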

1. **Probe once** (link-local) to detect **MSI v2** → cache result **in-proc**.
2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry.
4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert.
Member

If you want cross-process coordination, please specify the IPC that is going to be used. This needs to exist on Windows and Linux and it needs to be available in sanctioned libraries across all supported MSAL languages.

Contributor

How is the ‘single-writer’ selected? Is this a file lock, named mutex, or other mechanism? What happens if the single-writer crashes mid-renewal?
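
For discussion, one candidate mechanism is an exclusive lock file, sketched below. This is an illustration only and not what the PR specifies: the helper names `try_acquire` and `release` and the 60-second staleness threshold are assumptions. A writer that crashes mid-renewal leaves the lock file behind, and the sketch treats a lock older than the threshold as abandoned. The stale-lock removal is itself racy (a process could delete a lock that another process has just re-created), so a real design would need something stronger.

```python
import os
import time


def try_acquire(lock_path, stale_after_s=60.0):
    """Try to become the single writer. Returns True on success."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so at most
        # one process can create the lock file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            age = time.time() - os.path.getmtime(lock_path)
        except FileNotFoundError:
            return False  # released in the meantime; caller may retry
        if age > stale_after_s:
            # Holder presumably crashed: remove the lock so that a later
            # attempt can compete for it again.
            try:
                os.remove(lock_path)
            except FileNotFoundError:
                pass
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # recorded for diagnostics only
    return True


def release(lock_path):
    """Release the lock; tolerate it having been removed already."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

Processes that fail to acquire the lock would keep using the cached certificate and check again later instead of issuing their own renewal.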

2. Treat the **binding certificate** (from IMDS `/issuecredential`) as the **primary anchor** (~7-day validity); use it to get ATs.
3. **Proactive renewal at half-life (+ small jitter)** to rotate well before expiry.
4. **Single-writer coordination** so only one process issues/renews; others reuse the same cert.
5. **MAA token** is used **only** for issuance/renewal; short-lived cache to prevent attestations calls.
Member

Not enough precision. What does "short-lived" cache mean?

Contributor

What’s the cache invalidation logic if a policy or key rotation occurs on the MAA side? Is there a way to force re-attestation?

```
Call 0 (local): Probe IMDS v2 → cache MSI source (V2/V1)
1 (local): Create KeyGuard key (per reboot)
2 (external): Get MAA token // only for (re)issuing cert
```
Member

What is local and what is external?

| Item | Scope | Where | TTL | Notes |
|---|---|---|---|---|
| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
Member

How are you going to deal with atomicity, multiple file writers, and a process that gets killed in the middle of a write?
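
A common way to address the killed-mid-write case is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that pattern; it is not taken from the PR, and the function name `atomic_write` is hypothetical. `os.replace` is atomic on POSIX; on Windows it overwrites the destination, but the exact guarantees there should be verified before relying on it. It does not by itself coordinate multiple writers: the last rename wins.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target so
    # that the final rename does not degrade into a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file; the previous `path` is left untouched.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the old cache file is still intact and only a stray temporary file remains.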

Member

@bgavrilMS bgavrilMS left a comment

Not enough details.

Contributor

@Robbie-Microsoft Robbie-Microsoft left a comment

  • For each cache and renewal step, document what happens if the cache is missing, invalid, or corrupted.
  • Outline (even briefly) the implementation details of the single-writer system.
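
As an illustration of the first point, one simple policy is to treat a missing, unreadable, malformed, or expired cache file as a plain cache miss, so the caller falls back to re-acquiring the artifact. The sketch below assumes a hypothetical JSON layout with `token` and `expires_on` (Unix seconds) fields; neither the layout nor the name `load_cache_entry` comes from the PR.

```python
import json
import time


def load_cache_entry(path, now=None):
    """Return the cached entry, or None if it is missing, corrupted, or expired."""
    now = time.time() if now is None else now
    try:
        with open(path, "r", encoding="utf-8") as f:
            entry = json.load(f)
    except FileNotFoundError:
        return None  # missing: plain cache miss
    except (OSError, ValueError):
        return None  # unreadable or not valid JSON: treat as a miss
    if not isinstance(entry, dict):
        return None
    token = entry.get("token")
    expires_on = entry.get("expires_on")
    if not isinstance(token, str) or not token:
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(expires_on, bool) or not isinstance(expires_on, (int, float)):
        return None
    if expires_on <= now:
        return None  # expired: caller must re-acquire
    return entry
```

A `None` result would route the caller to the same path as a cold start, which keeps the failure handling in one place.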

# Managed Identity v2 (Attested TB) — Resilience & Caching Plan

## TL;DR
We reduce cold-start latency and dependency risk for MSI v2 by caching safe, long-lived artifacts, coordinating renewal across processes, and keeping the hot path in memory. **MAA is used only to (re)issue the binding certificate**; bound AT acquisition relies on that cert. Result: fewer failures, less churn, smoother CX.
Contributor

What’s the fallback if the binding cert is lost or corrupted? Is there any emergency recovery path?

```
Call 0 (local): Probe IMDS v2 → cache MSI source (V2/V1)
1 (local): Create KeyGuard key (per reboot)
2 (external): Get MAA token // only for (re)issuing cert
```
Contributor

Are there retries or backoff strategies if the MAA call fails? Is exponential backoff used or is it a fixed retry policy?
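
For reference, this is what an exponential backoff policy with full jitter could look like. It is a sketch for discussion, not the PR's behavior: the helper name `call_with_backoff`, the attempt count, and the delay values are all assumptions, and a real implementation would retry only on failures known to be transient.

```python
import random
import time


def call_with_backoff(operation, max_attempts=4, base_delay_s=0.5,
                      max_delay_s=8.0, sleep=time.sleep, rng=random):
    """Call `operation`, retrying with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Simplification: every exception is retried here.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt (0.5s, 1s, 2s, ...) up to
            # max_delay_s; the actual sleep is uniform in [0, cap].
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng.uniform(0, cap))
```

A fixed retry policy would replace the doubling cap with a constant; the randomization keeps multiple processes from retrying in lockstep.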

| Item | Scope | Where | TTL | Notes |
|---|---|---|---|---|
| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
Contributor

How are policy changes detected? Is it polled, pushed, or inferred from failures?

|---|---|---|---|---|
| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
| **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; Serialize issuance |
Contributor

What protects against file corruption or unauthorized access on Linux? Is there a fallback if the file is deleted outside of MSAL?

| **MSI v2 probe result** | Per process | In-proc static | Process lifetime | NO changes needed here |
| **MAA token** | Per **keyHandle** | small file cache | ≤ JWT `exp` (~8h) | Only for cert issuance; evict on reboot/policy change/attest fail; refresh half-life + jitter |
| **Binding cert + `/issuecredential` metadata** | Per **Managed Identity per user context** | Persisted (Win: `CurrentUser\My`; Linux: protected file/PEM) | ~7 days | Renew at **half-life + jitter**; Serialize issuance |
| **Access tokens (`bearer` or `mtls_pop`)** | Per audience | In memory | Service-configured | Reacquire after reboot (new key) |
Contributor

Are there scenarios where token invalidation lags behind key rotation? How does the system ensure that stale tokens aren’t accidentally reused?

## Invalidation Rules
- **Reboot** → Use **persisted binding cert** to fetch new ATs; re-attest on first demand on service failure.
- **Cert expiry** → re-issue.
- **MAA token expired** → re-attest and re-issue.
Contributor

Are there built-in safeguards to prevent a thundering herd if all processes notice expiry at the same time?
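
Within a single process, one safeguard is a "single-flight" check: every caller that notices expiry waits on one lock, and only the first performs the renewal while the rest re-check and reuse its result. The sketch below illustrates that idea with a hypothetical `SingleFlightCache` class; it covers threads in one process only and does not address coordination across processes.

```python
import threading


class SingleFlightCache:
    """Cache one value and ensure only one thread renews it at a time."""

    def __init__(self, is_fresh, renew):
        self._lock = threading.Lock()
        self._value = None
        self._is_fresh = is_fresh  # callable: value -> bool
        self._renew = renew        # callable: () -> new value

    def get(self):
        value = self._value
        if value is not None and self._is_fresh(value):
            return value  # hot path: no lock taken
        with self._lock:
            # Re-check under the lock: another thread may have renewed
            # the value while this one was waiting.
            if self._value is None or not self._is_fresh(self._value):
                self._value = self._renew()
            return self._value
```

With this shape, N threads that observe expiry at the same moment result in one renewal call rather than N.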


## Why This Improves CX
- **MAA is out of the hot path**—steady-state calls rely on a **multi-day binding cert**.
- Different identities on the same VM, uses **cached MAA token**
Contributor

Is the cache not keyed per identity?

4 participants