Fix(deploy): Restore packaging of third-party dependencies #5005
1. Problem Summary
Recent ClusterFuzz deployments began failing, causing two main symptoms:
- ... `clusterfuzz.service` would fail their health checks and enter a reboot loop.
- ... `exit 1` error code.

The root cause was identified as a missing Python dependency (`PyYAML`) in the final deployment package (`clusterfuzz_package.zip`). This led to a `ModuleNotFoundError: No module named 'yaml'` at bot startup, causing the main process to crash silently.

2. Investigation Details
The investigation followed these key steps:
- Isolating the Failure: We confirmed that an older Docker image (`...:091c6c2`) worked correctly on the new infrastructure, while newer images (`...:1d0f907` and others) failed. This pointed to a regression in either the Docker image or the packaged code.
- Capturing the Error: By disabling the health check on a VM, we were able to manually execute the bot's startup script. This allowed us to capture the fatal error, which was `ModuleNotFoundError: No module named 'yaml'`.
- Analyzing the Dependency Source: Our initial hypothesis was that the `PyYAML` library was missing from the Docker image itself. However, direct testing proved this wrong: neither the old (working) nor the new (failing) Docker image had `PyYAML` pre-installed.
- Identifying the Real Source: This led to the discovery that ClusterFuzz does not rely on the Docker image for its third-party Python libraries. Instead, the startup script (`run.sh`) downloads a `clusterfuzz_package.zip` from GCS. This zip file is supposed to contain a `src/third_party/` directory with all necessary dependencies.
- The "Aha!" Moment: By executing the setup script (`setup_clusterfuzz.sh`) inside both the old and new containers, we found the critical difference:
  - Old (working) image: the package included the `yaml` library in `src/third_party/`.
  - New (failing) image: the package was missing the `yaml` library from `src/third_party/`.

3. Root Cause Analysis
The problem was traced back to a specific change in the build and packaging logic.
- The "What": A code refactoring in `src/local/butler/common.py` replaced the deprecated `distutils.dir_util.copy_tree` function with the modern `shutil.copytree`.
- The "Why": This change was part of commit 8e07445597abc24f3db26d622d8054adf9157873, whose primary goal was to add the `clusterfuzz-config` commit hash to the build revision for better traceability. The refactoring was a well-intentioned code cleanup based on a `TODO` comment in the source.
- The "How": The two functions have a subtle but critical difference in behavior. `distutils.dir_util.copy_tree(..., update=True)` is designed to merge/update a directory tree, copying files into a destination even if it already exists. In contrast, `shutil.copytree` by default will fail if the destination directory already exists. In our build process, the destination staging directory was created before the copy, causing the new `shutil` logic to skip copying the `third_party` dependencies without raising a visible error.

4. The Solution
The fix was to align the behavior of the new `shutil.copytree` call with the old `distutils` logic. This was achieved by adding the `dirs_exist_ok=True` argument to the `shutil.copytree` call in the `update_dir` function.

This solution is robust because:

- ... `PyYAML`.
- ... `shutil` library.

This change restores the correct behavior, ensuring that the deployment package is complete and the bots can start successfully.
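
The difference between the two `shutil.copytree` modes can be reproduced in isolation. The following is a minimal sketch, not the actual butler code: the `update_dir` body, the `demo` helper, and the temporary directory layout are assumptions made for illustration. It shows only the raw library behavior; how the build script surfaced or hid the resulting exception is not modeled here.

```python
import shutil
import tempfile
from pathlib import Path


def update_dir(src_dir, dst_dir):
    """Hypothetical stand-in for the butler helper: merge src_dir into dst_dir."""
    # dirs_exist_ok=True (Python 3.8+) lets copytree write into an existing tree.
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)


def demo():
    """Return (default_call_failed, dependency_copied) for a pre-created destination."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        dst = Path(tmp) / "staging"

        # Fake dependency tree, loosely mirroring src/third_party/yaml.
        (src / "third_party" / "yaml").mkdir(parents=True)
        (src / "third_party" / "yaml" / "__init__.py").write_text("")

        # The staging directory already exists before the copy starts.
        dst.mkdir()

        # Default behavior: copytree refuses to copy into an existing directory.
        try:
            shutil.copytree(src, dst)
            default_call_failed = False
        except FileExistsError:
            default_call_failed = True

        # With dirs_exist_ok=True the trees are merged instead.
        update_dir(src, dst)
        dependency_copied = (dst / "third_party" / "yaml" / "__init__.py").exists()

        return default_call_failed, dependency_copied


if __name__ == "__main__":
    print(demo())
```

Running a check like this against a staging layout that mirrors the build is one way to confirm that the third-party directory actually lands in the destination before the package is zipped.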