
Add file locking during model download to prevent race conditions #18

@titusz

Problem

When multiple processes or threads attempt to download the model simultaneously (e.g., during parallel test execution with pytest -n auto), race conditions can occur that corrupt the downloaded model file. This results in INVALID_PROTOBUF errors when trying to load the model.

Current Behavior

  • iscc_sct.utils.get_model() downloads the model without file locking
  • Multiple concurrent calls can write to the same file simultaneously
  • Downloaded file can become corrupted
  • No atomic download mechanism (download to temp, then move)
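
One plausible interleaving can be reproduced without any network access. The sketch below is an illustration only, not code from iscc-sct: the function name and payload are made up, and plain file handles stand in for two processes that both decided to download. It shows a reader seeing a truncated file because a second writer reopened the same path.

```python
import os
import tempfile


def load_during_second_download(path, payload):
    """Return what a reader sees while a second writer is mid-download."""
    # Writer A finishes a complete download.
    with open(path, "wb") as writer_a:
        writer_a.write(payload)

    # Writer B starts the same download; opening with "wb" truncates A's file.
    half = len(payload) // 2
    writer_b = open(path, "wb", buffering=0)
    try:
        writer_b.write(payload[:half])

        # Process A now loads the model while B is only halfway through.
        with open(path, "rb") as reader:
            seen = reader.read()

        writer_b.write(payload[half:])
    finally:
        writer_b.close()
    return seen


fd, demo_path = tempfile.mkstemp()
os.close(fd)
payload = b"0123456789"
seen = load_during_second_download(demo_path, payload)
with open(demo_path, "rb") as f:
    final = f.read()
os.unlink(demo_path)

print(seen == payload, final == payload)
```

The file is eventually complete, but the reader in the middle only got the first half, which is the kind of partial content that would fail to parse as a model.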

Expected Behavior

  • Only one process should download the model at a time
  • Other processes should wait for the download to complete
  • Use file locking (e.g., fcntl.flock on Unix, msvcrt.locking on Windows, or the cross-platform filelock library)
  • Download to temporary file, verify integrity, then atomically rename
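
The first, second, and fourth points above can be sketched with the standard library alone. This is a minimal, Unix-only illustration using fcntl.flock; fetch_once, fake_download, and the file names are hypothetical and not part of iscc-sct, and the integrity check is left out to keep the sketch short.

```python
import fcntl
import os
import tempfile
import threading
from pathlib import Path


def fetch_once(target_path: Path, fetch) -> Path:
    """Create target_path via fetch(tmp_path) at most once across callers."""
    lock_path = target_path.with_name(target_path.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until no other open handle holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-check after acquiring the lock: another caller may be done.
            if target_path.exists():
                return target_path
            # Temp file in the same directory so the rename stays on one filesystem.
            fd, tmp_name = tempfile.mkstemp(dir=target_path.parent)
            os.close(fd)
            try:
                fetch(Path(tmp_name))
                os.replace(tmp_name, target_path)  # atomic rename
            finally:
                if os.path.exists(tmp_name):
                    os.unlink(tmp_name)  # only reached with a leftover on failure
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target_path


calls = []


def fake_download(tmp_path: Path) -> None:
    # Stand-in for a real download; records how often it runs.
    calls.append(1)
    tmp_path.write_bytes(b"model-bytes")


model_file = Path(tempfile.mkdtemp()) / "model.bin"
workers = [
    threading.Thread(target=fetch_once, args=(model_file, fake_download))
    for _ in range(8)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(len(calls), model_file.read_bytes())
```

Each caller opens the lock file separately, so the eight threads contend for the lock the same way separate processes would; only the first one to acquire it runs the download, and the rest return the existing file.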

Suggested Implementation

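import os
import tempfile
from pathlib import Path

from filelock import FileLock

# MODEL_PATH, MODEL_URL, MODEL_CHECKSUM, check_integrity() and download_file()
# are assumed to be available in the module already.


def get_model():
    MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
    lock_path = MODEL_PATH.parent / f"{MODEL_PATH.name}.lock"

    with FileLock(str(lock_path), timeout=300):  # wait up to 5 minutes for the lock
        # Check again after acquiring lock (another process may have downloaded)
        if MODEL_PATH.exists() and check_integrity(MODEL_PATH, MODEL_CHECKSUM):
            return MODEL_PATH

        # Create the temporary file next to the target so the rename stays on one filesystem
        fd, tmp_name = tempfile.mkstemp(dir=MODEL_PATH.parent)
        os.close(fd)  # close our handle first; Windows cannot replace an open file
        try:
            download_file(MODEL_URL, tmp_name)

            # Verify integrity before moving
            if not check_integrity(tmp_name, MODEL_CHECKSUM):
                raise ValueError("Downloaded model failed integrity check")

            # Atomic rename
            Path(tmp_name).replace(MODEL_PATH)
        finally:
            # Remove the temp file if download or verification failed (no-op after replace)
            Path(tmp_name).unlink(missing_ok=True)

    return MODEL_PATH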
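
The snippet above relies on check_integrity without showing it. As a rough idea of what such a check could look like, here is a hypothetical stand-in named sha256_matches. It assumes the checksum is a SHA-256 hex digest; the actual helper in iscc-sct may use a different hash, signature, or error behavior.

```python
import hashlib
import os
import tempfile


def sha256_matches(file_path, expected_checksum: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA-256 hex digest equals expected_checksum."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read in chunks so large model files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_checksum.lower()


# Usage with a small throwaway file (contents are made up for the example).
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as sample:
    sample.write(b"hello")

good = sha256_matches(sample_path, hashlib.sha256(b"hello").hexdigest())
bad = sha256_matches(sample_path, "0" * 64)
os.unlink(sample_path)

print(good, bad)
```

Running the check on the temporary file, before the rename, means a truncated or corrupted download never becomes visible at the final model path.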

Impact

This would make iscc-sct safe to use in:

  • Parallel test environments
  • Multi-worker application servers
  • Container deployments with shared volumes
  • Any concurrent execution scenario

Related

  • Other ML libraries that download models (transformers, sentence-transformers, etc.) face the same problem and use file locking to avoid it
