Skip to content

Fix indefinite hangs on non-EC2 network paths#438

Merged
jmt-lab merged 1 commit intoawslabs:developfrom
nkk0:fix-wan-timeouts
Apr 17, 2026
Merged

Fix indefinite hangs on non-EC2 network paths#438
jmt-lab merged 1 commit intoawslabs:developfrom
nkk0:fix-wan-timeouts

Conversation

@nkk0
Copy link
Copy Markdown
Contributor

@nkk0 nkk0 commented Apr 14, 2026

coldsnap uploads can hang indefinitely when running from outside AWS. A small fraction of PutSnapshotBlock requests complete the TCP handshake but never receive an HTTP response. The AWS Rust SDK does not set read_timeout, operation_attempt_timeout, or operation_timeout by default, so these requests block the worker forever.

Addresses: #437

Changes:

  • Set SDK timeouts in build_client_config(): read_timeout 12s, operation_attempt_timeout 20s, operation_timeout 120s. These apply to both the upload and download CLI paths.

  • Set SDK max_attempts to 1. coldsnap has its own per-block retry loop with backoff; layering SDK retries on top produced up to 36 attempts per block with no coordinated timeout.

  • Reduce block retry count from 12 to 5. With bounded per-attempt timeouts, fewer retries are needed.

  • Add --workers flag (default 64) to configure concurrent upload workers. Reject --workers 0 since for_each_concurrent treats 0 as unlimited.

  • Add --client-shards flag (default 1) to create N independent EbsClient instances for uploads. Blocks are distributed by index (block_index % N). This is opt-in; the default preserves the existing single-client behavior. In our testing, --client-shards 8 with 64 workers improved the per-block latency profile on high-latency paths.

  • Log a latency histogram at INFO level after the upload completes (<250ms, 250-500ms, 500ms-1s, 1-2s, 2-5s, >5s, plus error count).

  • Log per-block warnings on failure with block index, attempt number, elapsed time, and error. Previously failures were logged at DEBUG with no context.

Tested with 18 consecutive uploads of a 4.1 GiB image from GitHub Actions runners across Virginia, Wyoming, and other Azure regions. All 18 succeeded (20-51s depending on config and runner location). Stock coldsnap on the same paths hung indefinitely or took 5-20+ minutes.

nkk0 added a commit to nkk0/amis that referenced this pull request Apr 14, 2026
Override coldsnap with a fork that adds SDK timeouts, single-layer
retries, configurable workers, client sharding, and upload diagnostics.
This fixes indefinite hangs when uploading from non-EC2 paths (e.g.
GitHub Actions runners).

Upstream PR: awslabs/coldsnap#438

Usage: coldsnap upload --workers 64 --client-shards 8 image.raw

Can be removed once the upstream PR is merged and the fix reaches
nixpkgs.
Copy link
Copy Markdown
Contributor

@jmt-lab jmt-lab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution. This looks good to me.

Comment thread src/bin/coldsnap/main.rs
config = config.endpoint_url(endpoint);
}

// The AWS SDK does not set response or per-attempt timeouts by default.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch here

@jmt-lab
Copy link
Copy Markdown
Contributor

jmt-lab commented Apr 15, 2026

Also we do suggest you setup ssh key signing for your git commits to help verify your commits. This is not a hard requirement for contributions externally but generally a good thing to do: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification

coldsnap uploads can hang indefinitely when running from outside AWS.
A small fraction of PutSnapshotBlock requests complete the TCP handshake
but never receive an HTTP response. The AWS Rust SDK does not set
read_timeout, operation_attempt_timeout, or operation_timeout by
default, so these requests block the worker forever.

This has been reported in awslabs#362 (uploads stalling from GitHub Actions),
awslabs#374 (downloads hanging from non-EC2), awslabs#216 (excessive retries with no
visibility), and awslabs#95 (missing timeouts on remote calls).

Changes:

- Set SDK timeouts in build_client_config(): read_timeout 12s,
  operation_attempt_timeout 20s, operation_timeout 120s. These apply to
  both the upload and download CLI paths.

- Set SDK max_attempts to 1. coldsnap has its own per-block retry loop
  with backoff; layering SDK retries on top produced up to 36 attempts
  per block with no coordinated timeout.

- Reduce block retry count from 12 to 5. With bounded per-attempt
  timeouts, fewer retries are needed.

- Add --workers flag (default 64) to configure concurrent upload
  workers. Reject --workers 0 since for_each_concurrent treats 0 as
  unlimited.

- Add --client-shards flag (default 1) to create N independent
  EbsClient instances for uploads. Blocks are distributed by index
  (block_index % N). This is opt-in; the default preserves the
  existing single-client behavior. In our testing, --client-shards 8
  with 64 workers improved the per-block latency profile on
  high-latency paths.

- Log a latency histogram at INFO level after the upload completes
  (<250ms, 250-500ms, 500ms-1s, 1-2s, 2-5s, >5s, plus error count).

- Log per-block warnings on failure with block index, attempt number,
  elapsed time, and error. Previously failures were logged at DEBUG
  with no context.

Tested with 18 consecutive uploads of a 4.1 GiB image from GitHub
Actions runners across Virginia, Wyoming, and other Azure regions.
All 18 succeeded (20-51s depending on config and runner location).
Stock coldsnap on the same paths hung indefinitely or took 5-20+
minutes.
@nkk0 nkk0 force-pushed the fix-wan-timeouts branch from d325ed8 to 9b14efa Compare April 17, 2026 10:32
@nkk0
Copy link
Copy Markdown
Contributor Author

nkk0 commented Apr 17, 2026

Also we do suggest you setup ssh key signing for your git commits to help verify your commits. This is not a hard requirement for contributions externally but generally a good thing to do: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification

Done! Sorry about that

@jmt-lab jmt-lab merged commit 98c1cdd into awslabs:develop Apr 17, 2026
2 checks passed
@jmt-lab
Copy link
Copy Markdown
Contributor

jmt-lab commented Apr 17, 2026

Thank you for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants