@ElijahQuinones
Member

@ElijahQuinones ElijahQuinones commented Dec 16, 2025

What type of PR is this?

/kind feature

What is this PR about? / Why do we need it?

This PR improves our batching process by adding a cache of known bad IDs for volumes, snapshots, and instances. When a request comes in with a known bad ID, we trim it from the batch request and try it separately from the batch. This ensures one bad ID does not repeatedly poison the whole batch.
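For readers skimming the description, here is a minimal Go sketch of the idea. It is not the code in this PR: `badIDCache`, `MarkBad`, `IsBad`, `Forget`, and `splitBatch` are hypothetical names chosen for illustration, and the sketch leaves out any expiry or size limit the real cache may apply.

```go
package main

import "sync"

// badIDCache is a hypothetical, concurrency-safe set of resource IDs
// (volumes, snapshots, or instances) that previously failed a batched call.
type badIDCache struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

func newBadIDCache() *badIDCache {
	return &badIDCache{ids: make(map[string]struct{})}
}

// MarkBad records id as known bad.
func (c *badIDCache) MarkBad(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ids[id] = struct{}{}
}

// IsBad reports whether id is currently recorded as known bad.
func (c *badIDCache) IsBad(id string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.ids[id]
	return ok
}

// Forget removes id from the cache, for example after it later succeeds.
func (c *badIDCache) Forget(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.ids, id)
}

// splitBatch separates IDs that are safe to send in one batched call from
// known bad IDs that should each be tried on their own.
func splitBatch(c *badIDCache, ids []string) (batch, isolated []string) {
	for _, id := range ids {
		if c.IsBad(id) {
			isolated = append(isolated, id)
		} else {
			batch = append(batch, id)
		}
	}
	return batch, isolated
}
```

In this sketch the caller would send `batch` as a single describe call and issue one call per entry in `isolated`, so a failure on a known bad ID only affects the request that asked for it.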

How was this change tested?

Volume Example

Statically provisioned a fake volume
Modified the volume

Without Change

Error on first call in batcher

E0106 20:46:06.268871       1 handlers.go:86] "Error from AWS API" err="api error InvalidVolume.NotFound: The volume 'vol-05ae9e63bbfb1f23a' does not exist."
E0106 20:46:06.269018       1 batcher.go:161] "execute: error executing batch" err="operation error EC2: DescribeVolumes, https response error StatusCode: 400, RequestID: 97bded0d-b90f-486b-8e9f-97d6fd70781d, a
E0106 20:46:06.269108       1 driver.go:133] "GRPC error" err="rpc error: code = Internal desc = Could not modify volume \"vol-05ae9e63bbfb1f23a\": operation error EC2: DescribeVolumes, https response error Sta
I0106 20:46:07.288117       1 controller.go:730] "ControllerModifyVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\" mutable_parameters:{key:\"csi.storage.k8s.io/pv/name\" value:\"test-pv\"} mutable_para
I0106 20:46:09.288933       1 cloud.go:970] "Received Modify Disk request" volumeID="vol-05ae9e63bbfb1f23a" options={"VolumeType":"gp3","IOPS":4000,"Throughput":250,"IOPSPerGB":0,"AllowIopsIncreaseOnResize":fal
E0106 20:46:09.877350       1 handlers.go:86] "Error from AWS API" err="api error InvalidVolume.NotFound: The volume 'vol-05ae9e63bbfb1f23a' does not exist."

Subsequent retries all go through the batcher, poisoning good requests in the same batch

E0106 20:46:09.877419       1 batcher.go:161] "execute: error executing batch" err="operation error EC2: DescribeVolumes, https response error StatusCode: 400, RequestID: f3b919ba-990d-490d-a570-a22ed91f075d, a
E0106 20:46:09.877491       1 driver.go:133] "GRPC error" err="rpc error: code = Internal desc = Could not modify volume \"vol-05ae9e63bbfb1f23a\": operation error EC2: DescribeVolumes, https response error Sta
I0106 20:46:11.895847       1 controller.go:730] "ControllerModifyVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\" mutable_parameters:{key:\"csi.storage.k8s.io/pv/name\" value:\"test-pv\"} mutable_para
I0106 20:46:13.896998       1 cloud.go:970] "Received Modify Disk request" volumeID="vol-05ae9e63bbfb1f23a" options={"VolumeType":"gp3","IOPS":4000,"Throughput":250,"IOPSPerGB":0,"AllowIopsIncreaseOnResize":fal
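
The poisoning seen in these logs can be reproduced with a toy model. This is a sketch under stated assumptions, not driver code: `describeBatch` and `fanOut` are hypothetical stand-ins for an all-or-nothing describe call and for the batcher handing the shared outcome back to every caller.

```go
package main

import "errors"

// errBatchNotFound stands in for the NotFound error returned for the whole call.
var errBatchNotFound = errors.New("InvalidVolume.NotFound")

// describeBatch mimics a describe call that rejects the entire request
// as soon as a single requested ID does not exist.
func describeBatch(existing map[string]bool, ids []string) error {
	for _, id := range ids {
		if !existing[id] {
			return errBatchNotFound
		}
	}
	return nil
}

// fanOut runs one batched call and hands the same outcome to every caller,
// which is how one bad ID ends up failing the good requests next to it.
func fanOut(existing map[string]bool, ids []string) map[string]error {
	err := describeBatch(existing, ids)
	results := make(map[string]error, len(ids))
	for _, id := range ids {
		results[id] = err
	}
	return results
}
```

With one existing volume and one missing volume in the same batch, both callers receive the error; with the missing ID removed from the batch, the good request succeeds.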

With change

I0108 19:57:34.169876       1 controller.go:686] "ControllerExpandVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\"  capacity_range:{required_bytes:21474836480}  volume_capability:{mount:{}  access_mode
I0108 19:57:35.183239       1 controller.go:728] "ControllerModifyVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\"  mutable_parameters:{key:\"csi.storage.k8s.io/pv/name\"  value:\"ebs-pv\"}  mutable_pa
I0108 19:57:36.171154       1 cloud.go:1020] "Received Resize and/or Modify Disk request" volumeID="vol-05ae9e63bbfb1f23a" newSizeBytes=21474836480 options={"VolumeType":"gp3","IOPS":4000,"Throughput":200,"IOPS
E0108 19:57:36.798077       1 handlers.go:86] "Error from AWS API" err="api error InvalidVolume.NotFound: The volume 'vol-05ae9e63bbfb1f23a' does not exist."
E0108 19:57:36.798191       1 batcher.go:161] "execute: error executing batch" err="operation error EC2: DescribeVolumes, https response error StatusCode: 400, RequestID: fd0d2b23-eb5f-47b4-89ce-6db798d5ae7e, a
E0108 19:57:36.798315       1 driver.go:133] "GRPC error" err="rpc error: code = Internal desc = Could not modify volume \"vol-05ae9e63bbfb1f23a\": operation error EC2: DescribeVolumes, https response error Sta
E0108 19:57:36.798376       1 driver.go:133] "GRPC error" err="rpc error: code = Internal desc = Could not resize volume \"vol-05ae9e63bbfb1f23a\": rpc error: code = Internal desc = Could not modify volume \"vo
I0108 19:57:36.829929       1 controller.go:616] "ControllerGetCapabilities: called" args=""
I0108 19:57:36.830555       1 controller.go:686] "ControllerExpandVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\"  capacity_range:{required_bytes:21474836480}  volume_capability:{mount:{}  access_mode
I0108 19:57:38.826895       1 controller.go:728] "ControllerModifyVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\"  mutable_parameters:{key:\"csi.storage.k8s.io/pv/name\"  value:\"ebs-pv\"}  mutable_pa
I0108 19:57:38.831021       1 cloud.go:1020] "Received Resize and/or Modify Disk request" volumeID="vol-05ae9e63bbfb1f23a" newSizeBytes=21474836480 options={"VolumeType":"gp3","IOPS":4000,"Throughput":200,"IOPS
E0108 19:57:39.428054       1 handlers.go:86] "Error from AWS API" err="api error InvalidVolume.NotFound: The volume 'vol-05ae9e63bbfb1f23a' does not exist."
E0108 19:57:39.428159       1 driver.go:133] "GRPC error" err="rpc error: code = NotFound desc = Could not modify volume (not found) \"vol-05ae9e63bbfb1f23a\": resource was not found"
E0108 19:57:39.428270       1 driver.go:133] "GRPC error" err="rpc error: code = Internal desc = Could not resize volume \"vol-05ae9e63bbfb1f23a\": rpc error: code = NotFound desc = Could not modify volume (not
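
In the logs above, the first attempt fails the whole batch with an Internal error, while the retry runs on its own and surfaces NotFound. A rough Go sketch of how a single-ID result might be classified follows. Everything in it is an assumption for illustration (`errNotFound`, `codedError`, `isNotFound`, `classify`, and the `fakeAPIError` stand-in); the only thing taken from the logs is that the EC2 error codes end in `.NotFound` and that the driver reports "resource was not found".

```go
package main

import (
	"errors"
	"strings"
)

// errNotFound is a hypothetical sentinel that a caller could map to a
// gRPC NotFound status.
var errNotFound = errors.New("resource was not found")

// codedError matches any error in the chain that exposes an API error code.
type codedError interface {
	error
	ErrorCode() string
}

// fakeAPIError stands in for an SDK error so the sketch needs no dependencies.
type fakeAPIError struct{ code, msg string }

func (e fakeAPIError) Error() string     { return "api error " + e.code + ": " + e.msg }
func (e fakeAPIError) ErrorCode() string { return e.code }

// isNotFound reports whether err carries a code such as
// InvalidVolume.NotFound, InvalidSnapshot.NotFound, or InvalidInstanceID.NotFound.
func isNotFound(err error) bool {
	var ce codedError
	if errors.As(err, &ce) {
		return strings.HasSuffix(ce.ErrorCode(), ".NotFound")
	}
	return false
}

// classify inspects the result of a call made for a single id. On a NotFound
// code it records the id through markBad and returns errNotFound; any other
// error is passed through unchanged.
func classify(id string, err error, markBad func(string)) error {
	if err == nil {
		return nil
	}
	if isNotFound(err) {
		markBad(id)
		return errNotFound
	}
	return err
}
```

Passing `markBad` as a function keeps the sketch independent of how the cache itself is implemented.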

Snapshot Example

Create a Snapshot with a fake snapshot id

I0108 20:02:34.498164       1 controller.go:996] "ListSnapshots: called" args="snapshot_id:\"snap-0a1b033b1430b3b05\""
E0108 20:02:35.061317       1 handlers.go:86] "Error from AWS API" err="api error InvalidSnapshot.NotFound: The snapshot 'snap-0a1b033b1430b3b05' does not exist."
E0108 20:02:35.061504       1 batcher.go:161] "execute: error executing batch" err="operation error EC2: DescribeSnapshots, https response error StatusCode: 400, RequestID: 7f593571-5d9b-4d33-a0d6-911b3a0d47d6,
E0108 20:02:35.061565       1 driver.go:133] "GRPC error" err="rpc error: code = Internal desc = Could not get snapshot ID \"snap-0a1b033b1430b3b05\": operation error EC2: DescribeSnapshots, https response erro
I0108 20:02:35.072913       1 controller.go:616] "ControllerGetCapabilities: called" args=""
I0108 20:02:35.073430       1 controller.go:996] "ListSnapshots: called" args="snapshot_id:\"snap-0a1b033b1430b3b05\""
E0108 20:02:35.678695       1 handlers.go:86] "Error from AWS API" err="api error InvalidSnapshot.NotFound: The snapshot 'snap-0a1b033b1430b3b05' does not exist."
I0108 20:02:35.678822       1 controller.go:1004] "ListSnapshots: snapshot not found, returning with success"
I0108 20:02:36.071131       1 controller.go:616] "ControllerGetCapabilities: called" args=""
I0108 20:02:36.071655       1 controller.go:996] "ListSnapshots: called" args="snapshot_id:\"snap-0a1b033b1430b3b05\""
E0108 20:02:36.646632       1 handlers.go:86] "Error from AWS API" err="api error InvalidSnapshot.NotFound: The snapshot 'snap-0a1b033b1430b3b05' does not exist."
I0108 20:02:36.646756       1 controller.go:1004] "ListSnapshots: snapshot not found, returning with success"
I0108 20:02:40.647625       1 controller.go:616] "ControllerGetCapabilities: called" args=""

Instance Example

Create a fake CSINode object, a fake Node object, and a fake VolumeAttachment

Example YAML

fake-node.yaml:
apiVersion: v1
kind: Node
metadata:
  name: fake-node
  labels:
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: m5.large
    topology.kubernetes.io/zone: us-west-2a
    node.kubernetes.io/instance-id: i-0b00dbff4a16b95ad
spec:
  taints:
  - effect: NoSchedule
    key: fake-node
status:
  conditions:
  - type: Ready
    status: "True"
  addresses:
  - type: InternalIP
    address: 10.0.1.100
  - type: Hostname
    address: fake-node


csi-node.yaml:
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: fake-node
spec:
  drivers:
  - name: ebs.csi.aws.com
    nodeID: i-0b00dbff4a16b95ad
    topologyKeys:
    - topology.ebs.csi.aws.com/zone


volume-attachment.yaml:
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: fake-attachment
spec:
  attacher: ebs.csi.aws.com
  nodeName: fake-node
  source:
    persistentVolumeName: test-pv

I0108 20:05:59.252170       1 controller.go:472] "ControllerPublishVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\"  node_id:\"i-0b00dbff4a16b95ad\"  volume_capability:{mount:{}  access_mode:{mode:SING
I0108 20:05:59.252252       1 controller.go:493] "ControllerPublishVolume: attaching" volumeID="vol-05ae9e63bbfb1f23a" nodeID="i-0b00dbff4a16b95ad"
E0108 20:05:59.825454       1 handlers.go:86] "Error from AWS API" err="api error InvalidInstanceID.NotFound: The instance ID 'i-0b00dbff4a16b95ad' does not exist"
E0108 20:05:59.825655       1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, https response error StatusCode: 400, RequestID: 7ac9c220
I0108 20:05:59.825744       1 inflight.go:74] "Node Service: volume operation finished" key="vol-05ae9e63bbfb1f23ai-0b00dbff4a16b95ad"
E0108 20:05:59.825758       1 driver.go:133] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"vol-05ae9e63bbfb1f23a\" to node \"i-0b00dbff4a16b95ad\": error listing AWS instances: o
I0108 20:05:59.832940       1 controller.go:472] "ControllerPublishVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\"  node_id:\"i-0b00dbff4a16b95ad\"  volume_capability:{mount:{}  access_mode:{mode:SING
I0108 20:05:59.833040       1 controller.go:493] "ControllerPublishVolume: attaching" volumeID="vol-05ae9e63bbfb1f23a" nodeID="i-0b00dbff4a16b95ad"
E0108 20:06:00.410111       1 handlers.go:86] "Error from AWS API" err="api error InvalidInstanceID.NotFound: The instance ID 'i-0b00dbff4a16b95ad' does not exist"
I0108 20:06:00.410260       1 inflight.go:74] "Node Service: volume operation finished" key="vol-05ae9e63bbfb1f23ai-0b00dbff4a16b95ad"
E0108 20:06:00.410308       1 driver.go:133] "GRPC error" err="rpc error: code = NotFound desc = Volume \"vol-05ae9e63bbfb1f23a\" not found"
I0108 20:07:03.833967       1 controller.go:472] "ControllerPublishVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\"  node_id:\"i-0b00dbff4a16b95ad\"  volume_capability:{mount:{}  access_mode:{mode:SING
I0108 20:07:03.834063       1 controller.go:493] "ControllerPublishVolume: attaching" volumeID="vol-05ae9e63bbfb1f23a" nodeID="i-0b00dbff4a16b95ad"

The request is tried outside of the batcher

E0106 21:37:21.656333       1 handlers.go:86] "Error from AWS API" err="api error InvalidInstanceID.NotFound: The instance ID 'i-0b00dbff4a16b95ad' does not exist"
I0106 21:43:20.880366       1 controller.go:472] "ControllerPublishVolume: called" args="volume_id:\"vol-05ae9e63bbfb1f23a\" node_id:\"i-0b00dbff4a16b95ad\" volume_capability:{mount:{fs_type:\"ext4\"} access_mo

Future requests will time out: because the batch call does not error out, we hit the deadline and try again.

E0106 21:43:20.880460       1 driver.go:133] "GRPC error" err="rpc error: code = Aborted desc = An operation with the given Volume vol-05ae9e63bbfb1f23a already exists"

Does this PR introduce a user-facing change?

Cache known bad IDs and remove them from batch calls

WIP: working on unit tests and finishing up manual testing.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 16, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 16, 2025
@github-actions

github-actions bot commented Dec 16, 2025

Code Coverage Diff

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/cloud.go | 85.0% | 85.5% | 0.4 |

@ElijahQuinones
Member Author

/retest

@ConnorJC3
Contributor

/retest

Contributor

@ConnorJC3 ConnorJC3 left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 6, 2026
@mdzraf
Member

mdzraf commented Jan 6, 2026

/lgtm

Waiting on manual testing instructions for approval

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 6, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from connorjc3. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ElijahQuinones ElijahQuinones changed the title WIP: Cache known bad ids to remove from batch calls Cache known bad ids to remove from batch calls Jan 6, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026
@ElijahQuinones ElijahQuinones removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2026
@ElijahQuinones
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2026
@ElijahQuinones
Member Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2026
Member

@torredil torredil left a comment

This revision looks great. The logic is sound, edge cases are handled correctly, and code coverage is comprehensive. If we could run this through a scale test before release that would be sweet.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2026
@ElijahQuinones
Member Author

> This revision looks great. The logic is sound, edge cases are handled correctly, and code coverage is comprehensive. If we could run this through a scale test before release that would be sweet.
>
> /lgtm

Running one as we speak :)
