Skip to content

KEP-3751: add error handling #5482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

huww98
Copy link

@huww98 huww98 commented Aug 16, 2025

  • One-line PR description: add error handling, and some cleanups
  • Other comments:

I'm aware of another PR #5462 also trying to address this.
As per our discussion on the meeting, the behavior described in this PR should be the default. So I'm proposing this for early review.
We may also add a strict-mode like #5462 to ensure quota restriction in any situation, which I will add in the following days based on this PR.

Also removes some incorrect copies of content. And fixed an link.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 16, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huww98
Once this PR has been reviewed and has the lgtm label, please assign msau42 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from saad-ali August 16, 2025 14:23
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 16, 2025
@k8s-ci-robot k8s-ci-robot requested a review from xing-yang August 16, 2025 14:23
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 16, 2025
@huww98 huww98 force-pushed the vac-err-handle branch 2 times, most recently from 7ab2b96 to 7ab48a9 Compare August 17, 2025 11:29
@huww98
Copy link
Author

huww98 commented Aug 17, 2025

##### Implementation

Flow diagram describes how external-resizer modifies the volume:
![ModifyVolume Flow Diagram](./VolumeAttributesClass-Modify-Flow.jpg)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At init, it should be enough to only mark InProgress volumes as uncertain. Before we mark a volume Infeasible, we will unset uncertain. Before we retry an Infeasible modify, we will mark it as InProgress again.

@gnufied
Copy link
Member

gnufied commented Aug 21, 2025

Can you explain why is target=spec after infeasible error?

| 25 | A | A | Infeas | A | false | 23 | | | | |
| 26 | B | A | InProg | A | false | 14 | | | | |
| 27 | B | A | InProg | A | true | A | 13 | 26 | 28 | 27 |
| 28 | B | A | Infeas | A | false | 14 | | | | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear tbh, what those action mean. What does Action-6 mean?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example and some more descriptions to this table. Hopes this will help.

This policy safeguards against potential quota abuse that can occur if users time their requests strategically.
- Final errors (such as `Internal`), indicating storage provider failed to modify the volume and likely no longer processing the request.
In this case, We allow changing the parameters to recover from the error.
- Infeasible Errors (e.g., `InvalidArgument`): This is a subset of final errors indicating the request itself is invalid and is not likely to succeed when retried.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a indentation error or did you really meant infeasible to be indented?

Apart from flow-chart, IMO the recovery options from infeasible should be clearly specified here. Like what happens if:

  1. nil --> A(infeasible) --> nil
  2. A --> B (infeasible) --> nil

The flow chart is helpful, but having a summary here helps a casual reader of the spec.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I want it to be indented. Because all infeasible errors are final errors, and inherits the policy of final errors.

On the two lines after this comment:

If the Spec.VolumeAttributesClassName is set to nil, we will not retry the request.

Is this clear enough for you?
BTW, I think case 2 is not allowed by APIServer, but if this really happens, we can handle it. I will add a sentence to further clarify the quota.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think case 2 is not allowed by APIServer, but if this really happens, we can handle it. I will add a sentence to further clarify the quota.

Oops, I meant:

A --> B(Infeasible) --> A

Number 1 in the last five columns means we should transit to state 1 in this case.
If action equals to self, we have reached the final state (1, 3, 4, 9) and should stop reconcile until user modifies the spec again.

One can verify this contains all states by arbitrarily changing the spec and verify it will still hit a listed state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am reading your flowchart right, it looks like - for nil --> A (infeasible) --> nil, the target will be set back to nil and hence no quota will be counted towards A. I thought you were arguing that, quota for A should still be counted?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, quota for A should still be counted? I will remove the target=spec after infeasible error. And for A (infeasible) --> nil, it should go to the "stop reconcile" at the bottom of the chart.

In the implementation, I also have a branch

if spec == nil && (status == nil || status == Infeasible) {
    return
}

I think I'd better also add this to the flowchart.

@huww98
Copy link
Author

huww98 commented Aug 22, 2025

Can you explain why is target=spec after infeasible error?

Thanks for pointing this out. I think I overlooked this case when I write kubernetes-csi/external-resizer#510 . I will fix the flowchart and the code. We should just keep target at its original value here, because in the uncertain case, target is not the same as spec.

@@ -402,10 +401,42 @@ Note:

The parameters in VolumeAttributesClass are opaque and immutable. This gives different cloud providers flexibility but at the same time it is up to the specific driver to verify the values set in the parameters. The parameters from VolumeAttributesClass associated with a volume are mutable because they are coming from different VolumeAttributesClasses.

**Deal with Volume Reverted to nil VAC**

After reverting VAC name to nil, it will not be fully reverted to the previous state, because Kubernetes does not actually know the previous state of the volume. The volume will keep occupying the quota of previously specified VAC. Now user can:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to be clear, even if driver reports infeasible INVALID_ARGUMENTS, you cannot go back to nil because driver may have secretly modified volume?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it means the PVC's VAC class field can still be nil, but there is no guarantee that the underlying volume has been reverted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, even we get INVALID_ARGUMENTS at the last attempt, their is no guarantee about what happened in the previous attempts, e.g.:

  • nil -> A (INTERNAL) -> A (INVALID_ARGUMENTS) -> nil
  • nil -> B (INTERNAL) -> A (INVALID_ARGUMENTS) -> nil

So, to keep our promise to admin (mentioned later in the next section): every PVC with Status.ModifyVolumeStatus == nil is properly covered by quota, we do not revert PVC.Status.ModifyVolumeStatus from Infeasible to nil automatically.

This also reminds user that the volume may have some changes applied, and may not be the same as their other volumes.

Copy link
Author

@huww98 huww98 Aug 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that we cannot clear the pending status when reverting to nil, because Pending status may overwrite the previous InProgress status. I also refined the language of this paragraph. Hopes this will make it more clear.

@huww98 huww98 force-pushed the vac-err-handle branch 2 times, most recently from fc1e392 to 770d408 Compare August 23, 2025 04:35
@huww98
Copy link
Author

huww98 commented Aug 23, 2025

Can you explain why is target=spec after infeasible error?

Thanks for pointing this out. I think I overlooked this case when I write kubernetes-csi/external-resizer#510 . I will fix the flowchart and the code. We should just keep target at its original value here, because in the uncertain case, target is not the same as spec.

Pushed the fix to kubernetes-csi/external-resizer#512 . @sunnylovestiramisu @gnufied PTAL

@gnufied
Copy link
Member

gnufied commented Aug 25, 2025

Thanks. I see two distinct flows for going to Done state in the flow chart one via pending state and another via infeasible state and what happens to status field is different for those. I would think that for infeasible "done", we won't be resetting status but whereas for Pending - we will. Is that accurate?

For example:

1.nil --> A(infeasible) --> nil, then .status.modifyStatus == whatever old value was
2. nil --> A (pending) ---> nil, then .status.modifyStatus == nil

Can you update the flowchart to reflect this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants