Fix indefinite hangs on non-EC2 network paths by nkk0 · Pull Request #438 · awslabs/coldsnap

nkk0 · 2026-04-14T15:48:01Z

coldsnap uploads can hang indefinitely when running from outside AWS. A small fraction of PutSnapshotBlock requests complete the TCP handshake but never receive an HTTP response. The AWS Rust SDK does not set read_timeout, operation_attempt_timeout, or operation_timeout by default, so these requests block the worker forever.
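The failure mode described above can be reproduced with the standard library alone. This is a minimal sketch, not coldsnap or SDK code: `read_from_silent_peer` is a hypothetical helper, and the loopback listener stands in for an endpoint that accepts the connection but never answers. It shows that once a read timeout is set, the read returns an error instead of blocking.

```rust
use std::io::{Error, ErrorKind, Read};
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

/// Hypothetical helper: connects to a local peer that completes the TCP
/// handshake but never sends a byte, then attempts a read with a timeout.
/// Returns the kind of error the read produced.
fn read_from_silent_peer(read_timeout: Duration) -> std::io::Result<ErrorKind> {
    // Bind an ephemeral loopback port. The listener never accepts or writes;
    // the kernel still completes the handshake, so the client sees an
    // established connection that stays silent.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut stream = TcpStream::connect(listener.local_addr()?)?;

    // Without a read timeout, the read below would block indefinitely.
    stream.set_read_timeout(Some(read_timeout))?;

    let mut buf = [0u8; 1];
    let outcome = stream.read(&mut buf);

    // Keep the listener alive until after the read so the connection is not reset.
    drop(listener);

    match outcome {
        Ok(_) => Err(Error::new(ErrorKind::Other, "peer unexpectedly responded or closed")),
        Err(e) => Ok(e.kind()),
    }
}

fn main() -> std::io::Result<()> {
    let kind = read_from_silent_peer(Duration::from_millis(200))?;
    // The kind is platform dependent: typically WouldBlock on Unix, TimedOut on Windows.
    println!("read returned {:?} instead of hanging", kind);
    Ok(())
}
```

The SDK-level `read_timeout` plays the same role for HTTP responses; the sketch only illustrates the principle at the socket level.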

Addresses: #437

Changes:

- Set SDK timeouts in build_client_config(): read_timeout 12s, operation_attempt_timeout 20s, operation_timeout 120s. These apply to both the upload and download CLI paths.
- Set SDK max_attempts to 1. coldsnap has its own per-block retry loop with backoff; layering SDK retries on top produced up to 36 attempts per block with no coordinated timeout.
- Reduce block retry count from 12 to 5. With bounded per-attempt timeouts, fewer retries are needed.
- Add --workers flag (default 64) to configure concurrent upload workers. Reject --workers 0 since for_each_concurrent treats 0 as unlimited.
- Add --client-shards flag (default 1) to create N independent EbsClient instances for uploads. Blocks are distributed by index (block_index % N). This is opt-in; the default preserves the existing single-client behavior. In our testing, --client-shards 8 with 64 workers improved the per-block latency profile on high-latency paths.
- Log a latency histogram at INFO level after the upload completes (<250ms, 250-500ms, 500ms-1s, 1-2s, 2-5s, >5s, plus error count).
- Log per-block warnings on failure with block index, attempt number, elapsed time, and error. Previously failures were logged at DEBUG with no context.
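The retry arithmetic behind the first three changes can be written out as a small model. This is a sketch under stated assumptions, not code from the PR: `RetryBudget` and its constructors are hypothetical, the block retry count is treated as a number of attempts, and the stock SDK is assumed to make three attempts per call, which is what the 36 figure implies (12 x 3). Backoff sleeps are left out of the time bound.

```rust
use std::time::Duration;

/// Hypothetical model of the two retry layers; not a type from coldsnap or the SDK.
struct RetryBudget {
    /// Attempts made by the per-block retry loop.
    block_attempts: u32,
    /// Attempts the SDK makes for each call issued by that loop.
    sdk_attempts: u32,
    /// Per-attempt timeout; `None` means an attempt can block forever.
    attempt_timeout: Option<Duration>,
}

impl RetryBudget {
    /// Worst-case number of HTTP requests sent for one block.
    fn max_requests(&self) -> u32 {
        self.block_attempts * self.sdk_attempts
    }

    /// Worst-case time spent waiting on requests for one block, excluding
    /// backoff sleeps. `None` means the wait is unbounded.
    fn max_request_time(&self) -> Option<Duration> {
        self.attempt_timeout.map(|t| t * self.max_requests())
    }
}

/// Behavior before the change (SDK attempt count of 3 is an assumption).
fn stock_budget() -> RetryBudget {
    RetryBudget { block_attempts: 12, sdk_attempts: 3, attempt_timeout: None }
}

/// Behavior after the change: 5 block attempts, 1 SDK attempt, 20s per attempt.
fn patched_budget() -> RetryBudget {
    RetryBudget {
        block_attempts: 5,
        sdk_attempts: 1,
        attempt_timeout: Some(Duration::from_secs(20)),
    }
}

fn main() {
    for (name, budget) in [("stock", stock_budget()), ("patched", patched_budget())] {
        println!(
            "{}: up to {} requests per block, request time bound: {:?}",
            name,
            budget.max_requests(),
            budget.max_request_time()
        );
    }
}
```

Under these assumptions a block that keeps failing gives up after at most 5 requests and 100 seconds of request time, whereas the stock configuration has no upper bound at all.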

Tested with 18 consecutive uploads of a 4.1 GiB image from GitHub Actions runners across Virginia, Wyoming, and other Azure regions. All 18 succeeded (20-51s depending on config and runner location). Stock coldsnap on the same paths hung indefinitely or took 5-20+ minutes.

Override coldsnap with a fork that adds SDK timeouts, single-layer retries, configurable workers, client sharding, and upload diagnostics. This fixes indefinite hangs when uploading from non-EC2 paths (e.g. GitHub Actions runners).

Upstream PR: awslabs/coldsnap#438
Usage: coldsnap upload --workers 64 --client-shards 8 image.raw

Can be removed once the upstream PR is merged and the fix reaches nixpkgs.
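For readers unfamiliar with the two flags in the usage line, here is a minimal sketch of what they imply: a worker count that must be at least 1, and a block-to-client mapping of block_index % N. The function names `parse_workers` and `shard_for_block` are hypothetical and are not taken from the coldsnap source; only the modulo rule and the rejection of 0 come from the PR description.

```rust
/// Parses a `--workers` value, rejecting 0 (hypothetical helper).
/// The PR rejects 0 because for_each_concurrent treats it as "no limit".
fn parse_workers(raw: &str) -> Result<usize, String> {
    let n: usize = raw
        .parse()
        .map_err(|e| format!("invalid worker count '{}': {}", raw, e))?;
    if n == 0 {
        return Err("--workers must be at least 1".to_string());
    }
    Ok(n)
}

/// Picks the client shard for a block as block_index % shards (hypothetical helper).
fn shard_for_block(block_index: u64, shards: u64) -> u64 {
    assert!(shards > 0, "shard count must be at least 1");
    block_index % shards
}

fn main() {
    let workers = parse_workers("64").expect("valid worker count");
    let shards: u64 = 8;

    // Count how many of the first 16 blocks land on each shard.
    let mut per_shard = vec![0u32; shards as usize];
    for block_index in 0..16u64 {
        per_shard[shard_for_block(block_index, shards) as usize] += 1;
    }

    println!("workers={} blocks per shard={:?}", workers, per_shard);
    println!("--workers 0 -> {:?}", parse_workers("0"));
}
```

With the default of one shard every block maps to shard 0, which matches the single-client behavior described in the PR.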

jmt-lab

Thank you for this contribution. This looks good to me.

jmt-lab · 2026-04-15T17:45:42Z

        config = config.endpoint_url(endpoint);
    }

+    // The AWS SDK does not set response or per-attempt timeouts by default.


This is a good catch here

jmt-lab · 2026-04-15T17:52:55Z

Also, we suggest you set up SSH key signing for your git commits to help verify them. This is not a hard requirement for external contributions, but it is generally a good thing to do: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification

coldsnap uploads can hang indefinitely when running from outside AWS. A small fraction of PutSnapshotBlock requests complete the TCP handshake but never receive an HTTP response. The AWS Rust SDK does not set read_timeout, operation_attempt_timeout, or operation_timeout by default, so these requests block the worker forever.

This has been reported in awslabs#362 (uploads stalling from GitHub Actions), awslabs#374 (downloads hanging from non-EC2), awslabs#216 (excessive retries with no visibility), and awslabs#95 (missing timeouts on remote calls).

Changes:

- Set SDK timeouts in build_client_config(): read_timeout 12s, operation_attempt_timeout 20s, operation_timeout 120s. These apply to both the upload and download CLI paths.
- Set SDK max_attempts to 1. coldsnap has its own per-block retry loop with backoff; layering SDK retries on top produced up to 36 attempts per block with no coordinated timeout.
- Reduce block retry count from 12 to 5. With bounded per-attempt timeouts, fewer retries are needed.
- Add --workers flag (default 64) to configure concurrent upload workers. Reject --workers 0 since for_each_concurrent treats 0 as unlimited.
- Add --client-shards flag (default 1) to create N independent EbsClient instances for uploads. Blocks are distributed by index (block_index % N). This is opt-in; the default preserves the existing single-client behavior. In our testing, --client-shards 8 with 64 workers improved the per-block latency profile on high-latency paths.
- Log a latency histogram at INFO level after the upload completes (<250ms, 250-500ms, 500ms-1s, 1-2s, 2-5s, >5s, plus error count).
- Log per-block warnings on failure with block index, attempt number, elapsed time, and error. Previously failures were logged at DEBUG with no context.

Tested with 18 consecutive uploads of a 4.1 GiB image from GitHub Actions runners across Virginia, Wyoming, and other Azure regions. All 18 succeeded (20-51s depending on config and runner location). Stock coldsnap on the same paths hung indefinitely or took 5-20+ minutes.
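The latency histogram described in the change list could be accumulated roughly as follows. This is a sketch, not the PR's implementation: `LatencyHistogram`, `bucket_index`, and the summary format are hypothetical, and the treatment of exact boundaries (for example, 250ms falling into the 250-500ms bucket and 5s into >5s) is an assumption, since the PR only names the buckets.

```rust
use std::time::Duration;

/// Bucket labels as listed in the PR description.
const BUCKET_LABELS: [&str; 6] = ["<250ms", "250-500ms", "500ms-1s", "1-2s", "2-5s", ">5s"];

/// Maps a latency to its bucket. Boundary handling is an assumption:
/// each bucket includes its lower bound and excludes its upper bound.
fn bucket_index(latency: Duration) -> usize {
    match latency.as_millis() {
        0..=249 => 0,
        250..=499 => 1,
        500..=999 => 2,
        1000..=1999 => 3,
        2000..=4999 => 4,
        _ => 5,
    }
}

/// Hypothetical accumulator for per-block latencies and an error count.
#[derive(Default)]
struct LatencyHistogram {
    buckets: [u64; 6],
    errors: u64,
}

impl LatencyHistogram {
    /// Records the latency of a block that uploaded successfully.
    fn record_ok(&mut self, latency: Duration) {
        self.buckets[bucket_index(latency)] += 1;
    }

    /// Records a block that failed.
    fn record_error(&mut self) {
        self.errors += 1;
    }

    /// Renders one summary line; the format is illustrative only.
    fn summary(&self) -> String {
        let mut parts: Vec<String> = BUCKET_LABELS
            .iter()
            .zip(self.buckets.iter())
            .map(|(label, count)| format!("{}: {}", label, count))
            .collect();
        parts.push(format!("errors: {}", self.errors));
        parts.join(", ")
    }
}

fn main() {
    let mut hist = LatencyHistogram::default();
    hist.record_ok(Duration::from_millis(100));
    hist.record_ok(Duration::from_millis(300));
    hist.record_ok(Duration::from_secs(6));
    hist.record_error();
    println!("{}", hist.summary());
}
```

A fixed set of buckets keeps the summary to a single log line regardless of how many blocks were uploaded.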

nkk0 · 2026-04-17T13:21:05Z

Also, we suggest you set up SSH key signing for your git commits to help verify them. This is not a hard requirement for external contributions, but it is generally a good thing to do: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification

Done! Sorry about that

jmt-lab · 2026-04-17T22:49:37Z

Thank you for your contribution!

jmt-lab approved these changes Apr 15, 2026

nkk0 force-pushed the fix-wan-timeouts branch from d325ed8 to 9b14efa Compare April 17, 2026 10:32

ginglis13 approved these changes Apr 17, 2026

jmt-lab merged commit 98c1cdd into awslabs:develop Apr 17, 2026
2 checks passed

nkk0 mentioned this pull request Apr 20, 2026

Missing SDK timeouts and excessive retries cause indefinite hangs on non-EC2 network paths #437

Closed

jmt-lab merged 1 commit into awslabs:develop from nkk0:fix-wan-timeouts
