@hunsche hunsche commented Oct 24, 2025

1. Problem Summary

Recent ClusterFuzz deployments began failing, causing two main symptoms:

  • Compute Engine MIGs: Instances running the clusterfuzz.service would fail their health checks and enter a reboot loop.
  • Google Cloud Batch: Jobs would fail with a generic exit 1 error code.

The root cause was identified as a missing Python dependency (PyYAML) in the final deployment package (clusterfuzz_package.zip). This led to a ModuleNotFoundError: No module named 'yaml' at bot startup, causing the main process to crash silently.

2. Investigation Details

The investigation followed these key steps:

  1. Isolating the Failure: We confirmed that an older Docker image (...:091c6c2) worked correctly on the new infrastructure, while newer images (...:1d0f907 and others) failed. This pointed to a regression in either the Docker image or the packaged code.

  2. Capturing the Error: By disabling the health check on a VM, we were able to manually execute the bot's startup script. This allowed us to capture the fatal error, which was ModuleNotFoundError: No module named 'yaml'.

  3. Analyzing the Dependency Source: Our initial hypothesis was that the PyYAML library was missing from the Docker image itself. However, direct testing proved this wrong: neither the old (working) nor the new (failing) Docker image had PyYAML pre-installed.

  4. Identifying the Real Source: This led to the discovery that ClusterFuzz does not rely on the Docker image for its third-party Python libraries. Instead, the startup script (run.sh) downloads a clusterfuzz_package.zip from GCS. This zip file is supposed to contain a src/third_party/ directory with all necessary dependencies.

  5. The "Aha!" Moment: By executing the setup script (setup_clusterfuzz.sh) inside both the old and new containers, we found the critical difference:

    • The old, working deployment package correctly contained the yaml library in src/third_party/.
    • The new, failing deployment package was missing the yaml library from src/third_party/.
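The comparison in step 5 can also be run directly against a downloaded package, without starting a container. The sketch below is only an illustration: the helper name package_has_module is hypothetical, and the exact path prefix inside clusterfuzz_package.zip is an assumption, so the check matches src/third_party/<module>/ anywhere in the entry path.

```python
import zipfile


def package_has_module(zip_path, module_name, third_party_dir="src/third_party"):
    """Return True if the archive has an entry under <third_party_dir>/<module_name>/.

    The archive's top-level prefix is assumed to be unknown, so the needle is
    matched anywhere inside each entry name rather than only at the start.
    """
    needle = f"{third_party_dir}/{module_name}/"
    with zipfile.ZipFile(zip_path) as archive:
        return any(needle in name for name in archive.namelist())
```

Running package_has_module("clusterfuzz_package.zip", "yaml") against the old and the new package should reproduce the difference described above, provided the layout assumption holds.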

3. Root Cause Analysis

The problem was traced back to a specific change in the build and packaging logic.

  • The "What": A code refactoring in src/local/butler/common.py replaced the deprecated distutils.dir_util.copy_tree function with the modern shutil.copytree.

  • The "Why": This change was part of commit 8e07445597abc24f3db26d622d8054adf9157873, whose primary goal was to add the clusterfuzz-config commit hash to the build revision for better traceability. The refactoring was a well-intentioned code cleanup based on a TODO comment in the source.

  • The "How": The two functions differ in a subtle but critical way. distutils.dir_util.copy_tree(..., update=True) merges into the destination tree: it copies files even when the destination directory already exists, skipping only those that are already up to date. In contrast, shutil.copytree by default raises FileExistsError if the destination directory already exists. In our build process, the destination staging directory was created before the copy, so the new shutil call never copied the third_party dependencies, and the failure did not surface as a visible error.
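The default shutil.copytree behavior can be reproduced in isolation. This is a minimal sketch using temporary directories; the directory names and the function demo_default_copytree are placeholders, not the real butler code. It shows only the exception itself. How that exception ended up hidden in the actual build is not covered here.

```python
import os
import shutil
import tempfile


def demo_default_copytree():
    """Copy into a pre-created destination; return (raised, file_copied)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "third_party")
        dst = os.path.join(tmp, "staging")

        # Source tree with one placeholder dependency.
        os.makedirs(os.path.join(src, "yaml"))
        with open(os.path.join(src, "yaml", "__init__.py"), "w") as f:
            f.write("# placeholder\n")

        # The destination already exists before the copy, as in the build.
        os.makedirs(dst)

        raised = False
        try:
            shutil.copytree(src, dst)  # default: dirs_exist_ok=False
        except FileExistsError:
            raised = True

        file_copied = os.path.exists(os.path.join(dst, "yaml", "__init__.py"))
        return raised, file_copied
```

Here the call raises before any file is copied, so the staging directory stays empty.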

4. The Solution

The fix was to align the behavior of the new shutil.copytree call with the old distutils logic. This was achieved by adding the dirs_exist_ok=True argument to the shutil.copytree call in the update_dir function.

This solution is robust because:

  1. It corrects the fundamental packaging logic, ensuring all third-party dependencies are included, not just PyYAML.
  2. It respects the original intent of the refactoring by continuing to use the modern shutil library.

This change restores the correct behavior, ensuring that the deployment package is complete and the bots can start successfully.
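As a rough sketch of the fix (not the actual body of update_dir in src/local/butler/common.py, whose signature and surrounding logic are assumed here), the merge-style copy looks like this:

```python
import shutil


def update_dir(src_dir, dest_dir):
    """Merge src_dir into dest_dir, creating dest_dir if it does not exist.

    Requires Python 3.8+ for dirs_exist_ok. Files already present in dest_dir
    are kept unless src_dir contains a file at the same relative path, in
    which case the source file overwrites it.
    """
    shutil.copytree(src_dir, dest_dir, dirs_exist_ok=True)
```

One difference from the old call is worth keeping in mind: copy_tree(..., update=True) skipped files that were already up to date, while copytree with dirs_exist_ok=True overwrites same-named files unconditionally. For building a deployment package the resulting tree should be the same, only with some redundant copying.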

A refactoring from the deprecated distutils.dir_util.copy_tree to the
modern shutil.copytree inadvertently broke the packaging of third-party
dependencies.

shutil.copytree fails by default if the destination directory exists,
which prevented dependencies from being copied into the final deployment zip.

This commit fixes the issue by adding the 'dirs_exist_ok=True' argument,
available since Python 3.8, to the shutil.copytree call. This
restores the original update/merge behavior and ensures all necessary
dependencies are correctly included in the clusterfuzz_package.zip,
resolving the ModuleNotFoundError at runtime.