Fix indefinite hangs on non-EC2 network paths#438
Conversation
Override coldsnap with a fork that adds SDK timeouts, single-layer retries, configurable workers, client sharding, and upload diagnostics. This fixes indefinite hangs when uploading from non-EC2 paths (e.g. GitHub Actions runners). Upstream PR: awslabs/coldsnap#438 Usage: coldsnap upload --workers 64 --client-shards 8 image.raw Can be removed once the upstream PR is merged and the fix reaches nixpkgs.
jmt-lab
left a comment
There was a problem hiding this comment.
Thank you for this contribution. This looks good to me.
| config = config.endpoint_url(endpoint); | ||
| } | ||
|
|
||
| // The AWS SDK does not set response or per-attempt timeouts by default. |
|
Also we do suggest you setup ssh key signing for your git commits to help verify your commits. This is not a hard requirement for contributions externally but generally a good thing to do: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification |
coldsnap uploads can hang indefinitely when running from outside AWS. A small fraction of PutSnapshotBlock requests complete the TCP handshake but never receive an HTTP response. The AWS Rust SDK does not set read_timeout, operation_attempt_timeout, or operation_timeout by default, so these requests block the worker forever. This has been reported in awslabs#362 (uploads stalling from GitHub Actions), awslabs#374 (downloads hanging from non-EC2), awslabs#216 (excessive retries with no visibility), and awslabs#95 (missing timeouts on remote calls). Changes: - Set SDK timeouts in build_client_config(): read_timeout 12s, operation_attempt_timeout 20s, operation_timeout 120s. These apply to both the upload and download CLI paths. - Set SDK max_attempts to 1. coldsnap has its own per-block retry loop with backoff; layering SDK retries on top produced up to 36 attempts per block with no coordinated timeout. - Reduce block retry count from 12 to 5. With bounded per-attempt timeouts, fewer retries are needed. - Add --workers flag (default 64) to configure concurrent upload workers. Reject --workers 0 since for_each_concurrent treats 0 as unlimited. - Add --client-shards flag (default 1) to create N independent EbsClient instances for uploads. Blocks are distributed by index (block_index % N). This is opt-in; the default preserves the existing single-client behavior. In our testing, --client-shards 8 with 64 workers improved the per-block latency profile on high-latency paths. - Log a latency histogram at INFO level after the upload completes (<250ms, 250-500ms, 500ms-1s, 1-2s, 2-5s, >5s, plus error count). - Log per-block warnings on failure with block index, attempt number, elapsed time, and error. Previously failures were logged at DEBUG with no context. Tested with 18 consecutive uploads of a 4.1 GiB image from GitHub Actions runners across Virginia, Wyoming, and other Azure regions. All 18 succeeded (20-51s depending on config and runner location). Stock coldsnap on the same paths hung indefinitely or took 5-20+ minutes.
Done! Sorry about that |
|
Thank you for your contribution! |
coldsnap uploads can hang indefinitely when running from outside AWS. A small fraction of PutSnapshotBlock requests complete the TCP handshake but never receive an HTTP response. The AWS Rust SDK does not set read_timeout, operation_attempt_timeout, or operation_timeout by default, so these requests block the worker forever.
Addresses: #437
Changes:
Set SDK timeouts in build_client_config(): read_timeout 12s, operation_attempt_timeout 20s, operation_timeout 120s. These apply to both the upload and download CLI paths.
Set SDK max_attempts to 1. coldsnap has its own per-block retry loop with backoff; layering SDK retries on top produced up to 36 attempts per block with no coordinated timeout.
Reduce block retry count from 12 to 5. With bounded per-attempt timeouts, fewer retries are needed.
Add --workers flag (default 64) to configure concurrent upload workers. Reject --workers 0 since for_each_concurrent treats 0 as unlimited.
Add --client-shards flag (default 1) to create N independent EbsClient instances for uploads. Blocks are distributed by index (block_index % N). This is opt-in; the default preserves the existing single-client behavior. In our testing, --client-shards 8 with 64 workers improved the per-block latency profile on high-latency paths.
Log a latency histogram at INFO level after the upload completes (<250ms, 250-500ms, 500ms-1s, 1-2s, 2-5s, >5s, plus error count).
Log per-block warnings on failure with block index, attempt number, elapsed time, and error. Previously failures were logged at DEBUG with no context.
Tested with 18 consecutive uploads of a 4.1 GiB image from GitHub Actions runners across Virginia, Wyoming, and other Azure regions. All 18 succeeded (20-51s depending on config and runner location). Stock coldsnap on the same paths hung indefinitely or took 5-20+ minutes.