Skip to content

Conversation

hssyoo
Copy link
Contributor

@hssyoo hssyoo commented Aug 18, 2025

Draft WIP to validate full object checksums for multipart downloads

@hssyoo hssyoo requested a review from jonathan343 August 18, 2025 15:00
@hssyoo
Copy link
Contributor Author

hssyoo commented Aug 18, 2025

@jonathan343 This is a WIP, still need to clean up code, add support for other CRC algorithms, and add tests, but wanted some early feedback on high-level approach. Also will likely change how stored checksum will be provided to transfer meta, but this is close enough for an initial review.

Specifically:

  • Using StreamingChecksumBody + ChecksumValidator to record, combine and validate checksums.
  • Using CountCallbackInvoker to signal execution of ValidateChecksumTask.
  • I'm a little iffy on using IO executor to execute ValidateChecksumTask, but wasn't sure if there's a better option. FWIW, validation will occur after all parts have been downloaded, so it won't block any IO tasks except for the final rename task.
  • Also lmk if you notice anything else at the design level.

Copy link
Contributor

@jonathan343 jonathan343 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good and I'm happy with the direction. Once question I have: How does this interact with the existing checksum validation that exists in botocore? In the case where an object is downloaded using the same chunk size that was used for uploading, we currently validate the checksum send back by S3. After these changes are we doing this validation twice?

@hssyoo
Copy link
Contributor Author

hssyoo commented Aug 21, 2025

I think this looks good and I'm happy with the direction. Once question I have: How does this interact with the existing checksum validation that exists in botocore? In the case where an object is downloaded using the same chunk size that was used for uploading, we currently validate the checksum send back by S3. After these changes are we doing this validation twice?

Made some changes to reuse existing checksum values and also added validation for MPUs to see what it'd look like. At a high-level, if a file-like object points to another file-like object that exposes the checksum property, then it'll reuse its checksum. Major caveat with this is that the order in which we wrap file-like objects is important. For example, PartStreamingChecksumBody must be the first wrapper that wraps the file-like object returned from Botocore for downloads. And ReadFileChunk must be the last wrapper before passing to Botocore for uploads.

@hssyoo hssyoo changed the title [v2] Validate full object checksums for multipart downloads [v2] Validate full object checksums Aug 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants