KEP-3751: add error handling #5482
base: master
Also removes some incorrectly duplicated content, and fixes a link.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: huww98.
Needs approval from an approver in each of these files:
7ab2b96 to 7ab48a9
##### Implementation

Flow diagram describing how external-resizer modifies the volume:

[Figure: external-resizer modify-volume flow diagram]
At init, it should be enough to mark only InProgress volumes as uncertain. Before we mark a volume Infeasible, we will clear the uncertain mark. Before we retry an Infeasible modification, we will mark it as InProgress again.
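A minimal Go sketch of the three rules in this comment. The `volumeState` record, its `Uncertain` flag, and the helper functions are hypothetical names for illustration, not code from external-resizer.

```go
package main

import "fmt"

// ModifyStatus is a hypothetical stand-in for the ModifyVolumeStatus values.
type ModifyStatus string

const (
	StatusInProgress ModifyStatus = "InProgress"
	StatusInfeasible ModifyStatus = "Infeasible"
)

// volumeState is a hypothetical per-PVC record kept by the controller.
type volumeState struct {
	Status    ModifyStatus
	Uncertain bool // driver may have partially applied an earlier request
}

// markUncertainAtInit runs once at controller start. Only InProgress volumes
// may have a request in flight, so only they are marked uncertain.
func markUncertainAtInit(vols []*volumeState) {
	for _, v := range vols {
		if v.Status == StatusInProgress {
			v.Uncertain = true
		}
	}
}

// markInfeasible clears the uncertain mark before recording Infeasible.
func markInfeasible(v *volumeState) {
	v.Uncertain = false
	v.Status = StatusInfeasible
}

// retryInfeasible marks an Infeasible volume InProgress again before the
// modify request is re-sent.
func retryInfeasible(v *volumeState) {
	if v.Status == StatusInfeasible {
		v.Status = StatusInProgress
	}
}

func main() {
	a := &volumeState{Status: StatusInProgress}
	b := &volumeState{Status: StatusInfeasible}
	markUncertainAtInit([]*volumeState{a, b})
	fmt.Println(a.Uncertain, b.Uncertain) // true false

	markInfeasible(a)
	fmt.Println(a.Status, a.Uncertain) // Infeasible false

	retryInfeasible(a)
	fmt.Println(a.Status) // InProgress
}
```

Keeping "uncertain" separate from the status makes the init rule cheap: nothing but InProgress volumes needs inspecting after a restart.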
Can you explain why is
| 25 | A | A | Infeas | A | false | 23 | | | | | | ||
| 26 | B | A | InProg | A | false | 14 | | | | | | ||
| 27 | B | A | InProg | A | true | A | 13 | 26 | 28 | 27 | | ||
| 28 | B | A | Infeas | A | false | 14 | | | | | |
To be honest, it is not clear what those actions mean. What does Action-6 mean?
Added an example and more descriptions to this table. Hope this helps.
This policy safeguards against potential quota abuse that can occur if users time their requests strategically.
- Final errors (such as `Internal`), indicating that the storage provider failed to modify the volume and is likely no longer processing the request.
  In this case, we allow changing the parameters to recover from the error.
  - Infeasible errors (e.g., `InvalidArgument`): a subset of final errors indicating that the request itself is invalid and is not likely to succeed when retried.
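A minimal Go sketch of this classification. `Code` strings stand in for gRPC status code names, and `classify`/`isFinal` are hypothetical helpers, not external-resizer functions. Only `Internal` (final) and `InvalidArgument` (infeasible) come from the text; the non-final set is an assumption modeled on the timeout and transient codes that CSI sidecars commonly treat as non-final.

```go
package main

import "fmt"

// Code is a stand-in for a gRPC status code name (e.g. "Internal").
type Code string

// errorClass is a hypothetical classification used by the policy above.
type errorClass int

const (
	nonFinal   errorClass = iota // driver may still be processing the request
	final                        // driver likely stopped; parameters may change
	infeasible                   // final, and retrying the same request won't help
)

// Assumption: codes treated as non-final (not listed in the KEP text).
var nonFinalCodes = map[Code]bool{
	"Canceled": true, "DeadlineExceeded": true, "Unavailable": true,
	"ResourceExhausted": true, "Aborted": true,
}

// Assumption: only InvalidArgument is treated as infeasible here; the real
// set may contain more codes.
var infeasibleCodes = map[Code]bool{"InvalidArgument": true}

func classify(c Code) errorClass {
	switch {
	case nonFinalCodes[c]:
		return nonFinal
	case infeasibleCodes[c]:
		// Infeasible is a subset of final errors and inherits their policy.
		return infeasible
	default:
		return final
	}
}

// isFinal reports whether the final-error policy applies, infeasible included.
func isFinal(c Code) bool { return classify(c) != nonFinal }

func main() {
	for _, c := range []Code{"Internal", "InvalidArgument", "DeadlineExceeded"} {
		fmt.Println(c, classify(c), isFinal(c))
	}
}
```

Modeling infeasible as a refinement of final, rather than a sibling, is what makes the nested bullet above work: anything allowed after a final error is also allowed after an infeasible one.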
Is this an indentation error, or did you really mean for infeasible to be indented?
Apart from the flowchart, IMO the recovery options from infeasible should be clearly specified here. For example, what happens if:
- nil --> A (infeasible) --> nil
- A --> B (infeasible) --> nil
The flowchart is helpful, but having a summary here helps a casual reader of the spec.
Yes, I intended it to be indented, because all infeasible errors are final errors and inherit the policy of final errors.
See the two lines after this comment:

> If the `Spec.VolumeAttributesClassName` is set to nil, we will not retry the request.

Is this clear enough for you?
BTW, I think case 2 is not allowed by APIServer, but if this really happens, we can handle it. I will add a sentence to further clarify the quota.
> BTW, I think case 2 is not allowed by APIServer, but if this really happens, we can handle it. I will add a sentence to further clarify the quota.
Oops, I meant:
A --> B(Infeasible) --> A
A number such as 1 in the last five columns means we should transition to state 1 in that case.
If the action equals the state itself, we have reached a final state (1, 3, 4, or 9) and should stop reconciling until the user modifies the spec again.

One can verify that the table contains all states by arbitrarily changing the spec and checking that it still reaches a listed state.
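One way to read the table mechanically, sketched in Go under a stated assumption: each row has an action (the state after one reconcile step) plus columns giving the state reached after a spec change, with blank cells as 0. The `row`/`table` types, `settle`, `closed`, and the toy entries are all hypothetical; they are not the KEP's rows.

```go
package main

import "fmt"

// row is one line of the state table (assumed layout, for illustration).
type row struct {
	action   int   // state after one reconcile step; equals the row itself if final
	onChange []int // state reached after each kind of spec change; 0 = blank cell
}

// table maps a state number to its row.
type table map[int]row

// settle follows the action column from start until a state maps to itself.
func settle(tbl table, start int) (int, error) {
	seen := map[int]bool{}
	s := start
	for {
		r, ok := tbl[s]
		if !ok {
			return 0, fmt.Errorf("state %d is not listed in the table", s)
		}
		if r.action == s {
			return s, nil // final state: stop reconciling until the spec changes
		}
		if seen[s] {
			return 0, fmt.Errorf("states starting at %d cycle without settling", start)
		}
		seen[s] = true
		s = r.action
	}
}

// closed reports whether every state reachable in one step, via the action
// column or a spec change, is itself listed in the table.
func closed(tbl table) bool {
	for _, r := range tbl {
		targets := append([]int{r.action}, r.onChange...)
		for _, next := range targets {
			if next == 0 {
				continue // blank cell
			}
			if _, ok := tbl[next]; !ok {
				return false
			}
		}
	}
	return true
}

func main() {
	// Toy table, not the KEP's: 1 and 3 are final, 2 settles at 3.
	tbl := table{
		1: {action: 1, onChange: []int{2}},
		2: {action: 3},
		3: {action: 3, onChange: []int{2}},
	}
	fmt.Println(closed(tbl))
	fmt.Println(settle(tbl, 2))
}
```

`closed` is the mechanical version of "arbitrarily change the spec and check that it still hits a listed state"; `settle` additionally catches rows whose actions cycle without ever reaching a final state.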
If I am reading your flowchart right, for `nil --> A (infeasible) --> nil`, the `target` will be set back to `nil`, and hence no quota will be counted towards `A`. I thought you were arguing that quota for `A` should still be counted?
Yes, quota for `A` should still be counted. I will remove the `target=spec` after an infeasible error. And for `A (infeasible) --> nil`, it should go to the "stop reconcile" at the bottom of the chart.
In the implementation, I also have a branch:

```go
if spec == nil && (status == nil || status == Infeasible) {
	return
}
```

I think I'd better also add this to the flowchart.
Thanks for pointing this out. I think I overlooked this case when I wrote kubernetes-csi/external-resizer#510. I will fix the flowchart and the code. We should just keep `target` at its original value here, because in the uncertain case, `target` is not the same as `spec`.
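A minimal Go sketch combining the early-return branch with the decision to keep `target` unchanged after an infeasible error, walked through the reviewer's `nil --> A (infeasible) --> nil` case. `pvcView`, `shouldStop`, and `recordInfeasible` are hypothetical names, and a nil class name or status is modeled as an empty string.

```go
package main

import "fmt"

// ModifyStatus is a hypothetical stand-in; "" plays the role of a nil status.
type ModifyStatus string

const (
	None       ModifyStatus = ""
	InProgress ModifyStatus = "InProgress"
	Infeasible ModifyStatus = "Infeasible"
)

// pvcView is a hypothetical snapshot of what the resizer reads for one PVC.
// An empty class name stands for a nil VolumeAttributesClassName.
type pvcView struct {
	Spec   string       // Spec.VolumeAttributesClassName
	Status ModifyStatus // Status.ModifyVolumeStatus
	Target string       // class the resizer is (or was last) modifying towards
}

// shouldStop is the early-return branch from the comment above.
func shouldStop(p pvcView) bool {
	return p.Spec == "" && (p.Status == None || p.Status == Infeasible)
}

// recordInfeasible records an infeasible error. Target is deliberately not
// reset to Spec: in the uncertain case the two differ, and keeping Target
// keeps quota for that class counted.
func recordInfeasible(p *pvcView) {
	p.Status = Infeasible
}

func main() {
	// Reviewer's scenario: nil --> A (infeasible) --> nil
	p := pvcView{Spec: "A", Status: InProgress, Target: "A"}
	recordInfeasible(&p)
	p.Spec = "" // user reverts the class to nil
	fmt.Println(shouldStop(p), p.Status, p.Target) // true Infeasible A
}
```

The reconcile stops, but `Target` still names `A`, so quota for `A` remains counted as agreed above.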
The parameters in VolumeAttributesClass are opaque and immutable. This gives different cloud providers flexibility, but at the same time it is up to the specific driver to verify the values set in the parameters. The parameters associated with a volume are mutable, because they can come from different VolumeAttributesClasses.

**Deal with Volume Reverted to nil VAC**

After the VAC name is reverted to nil, the volume will not be fully reverted to its previous state, because Kubernetes does not actually know the previous state of the volume. The volume will keep occupying the quota of the previously specified VAC. Now the user can:
So, to be clear: even if the driver reports an infeasible `INVALID_ARGUMENT` error, you cannot go back to nil, because the driver may have secretly modified the volume?
I think it means the PVC's VAC field can still be nil, but there is no guarantee that the underlying volume has been reverted.
Yes. Even if we get `INVALID_ARGUMENT` on the last attempt, there is no guarantee about what happened in the previous attempts, e.g.:
- nil -> A (INTERNAL) -> A (INVALID_ARGUMENT) -> nil
- nil -> B (INTERNAL) -> A (INVALID_ARGUMENT) -> nil

So, to keep our promise to the admin (described in the next section) that every PVC with `Status.ModifyVolumeStatus == nil` is properly covered by quota, we do not automatically revert `PVC.Status.ModifyVolumeStatus` from Infeasible to nil.
This also reminds the user that the volume may have had some changes applied, and may differ from their other volumes.
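The reasoning can be made concrete with a small Go sketch over the two histories above. The `attempt` record and `possiblyApplied` helper are hypothetical, and the rule they encode is an assumption for illustration: only an `InvalidArgument` (infeasible) result rules an attempt out, while any other outcome, including a final error such as `Internal`, gives no guarantee that the volume was untouched.

```go
package main

import "fmt"

// attempt is one hypothetical modify call: the class it targeted and the
// gRPC code name it ended with.
type attempt struct {
	Class string
	Code  string
}

// possiblyApplied returns, in order of first appearance, the classes that may
// have been (partially) applied to the volume. A later infeasible result does
// not undo an earlier attempt that ended with some other error.
func possiblyApplied(history []attempt) []string {
	seen := map[string]bool{}
	var out []string
	for _, a := range history {
		if a.Code == "InvalidArgument" || seen[a.Class] {
			continue
		}
		seen[a.Class] = true
		out = append(out, a.Class)
	}
	return out
}

func main() {
	ex1 := []attempt{{"A", "Internal"}, {"A", "InvalidArgument"}}
	ex2 := []attempt{{"B", "Internal"}, {"A", "InvalidArgument"}}
	fmt.Println(possiblyApplied(ex1), possiblyApplied(ex2)) // [A] [B]
}
```

In both examples the last attempt was infeasible, yet the set is non-empty, which is why resetting the status to nil would break the quota promise.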
I realized that we cannot clear the Pending status when reverting to nil, because the Pending status may overwrite the previous InProgress status. I also refined the wording of this paragraph. Hope this makes it clearer.
fc1e392 to 770d408
Pushed the fix to kubernetes-csi/external-resizer#512. @sunnylovestiramisu @gnufied PTAL
Thanks. I see two distinct flows for going to For example: 1. Can you update the flowchart to reflect this?
I'm aware of another PR, #5462, that also tries to address this.
As per our discussion in the meeting, the behavior described in this PR should be the default, so I'm proposing this for early review.
We may also add a strict mode like #5462 to enforce quota restrictions in every situation, which I will add in the coming days on top of this PR.