Feature/dns skip wait and partial state #1052
Changes from 21 commits
8712b73
7ef4213
702122b
56eb265
b405ce7
188f0b7
13fdd53
113bbb9
e073ec2
8118f17
3e1a403
039719f
6ffe516
e74f9f8
4e99f0d
55183c5
ee3a0c8
76fc503
de09817
e7649c2
037cece
265836f
ba8ecc8
1196efb
50f1f37
873f875
6e89bf9
b769ba1
f65f2ff
2f8850c
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1,18 +1,19 @@ |
| | | #!/usr/bin/env bash |
| | | # This script lints the SDK modules and the internal examples |
| | | - # Pre-requisites: golangci-lint |
| | | + # Pre-requisites: golangci-lint (provided by Makefile or system) |
| | | set -eo pipefail |
| | | |
| | | ROOT_DIR=$(git rev-parse --show-toplevel) |
| | | GOLANG_CI_YAML_PATH="${ROOT_DIR}/golang-ci.yaml" |
| | | GOLANG_CI_ARGS="--allow-parallel-runners --timeout=5m --config=${GOLANG_CI_YAML_PATH}" |
| | | |
| | | - if type -p golangci-lint >/dev/null; then |
| | | - : |
| | | - else |
| | | - echo "golangci-lint not installed, unable to proceed." |
| | | + # Use provided golangci-lint binary or fallback to system installation |
| | | + GOLANGCI_LINT_BIN="${1:-golangci-lint}" |
| | | + |
| | | + if [ ! -x "${GOLANGCI_LINT_BIN}" ] && ! type -p "${GOLANGCI_LINT_BIN}" >/dev/null; then |
| | | + echo "golangci-lint not found at ${GOLANGCI_LINT_BIN} and not installed in PATH, unable to proceed." |
| | | exit 1 |
| | | fi |
| | | |
| | | cd ${ROOT_DIR} |
| | | - golangci-lint run ${GOLANG_CI_ARGS} |
| | | + ${GOLANGCI_LINT_BIN} run ${GOLANG_CI_ARGS} |

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -3,6 +3,7 @@ package dns |
| | | import ( |
| | | "context" |
| | | "fmt" |
| | | "net/http" |
| | | "strings" |
| | | |
| | | "github.com/hashicorp/terraform-plugin-framework-validators/int64validator" |
| | | @@ -16,6 +17,7 @@ import ( |
| | | "github.com/hashicorp/terraform-plugin-framework/schema/validator" |
| | | "github.com/hashicorp/terraform-plugin-framework/types" |
| | | "github.com/hashicorp/terraform-plugin-log/tflog" |
| | | "github.com/stackitcloud/stackit-sdk-go/core/oapierror" |
| | | "github.com/stackitcloud/stackit-sdk-go/services/dns" |
| | | "github.com/stackitcloud/stackit-sdk-go/services/dns/wait" |
| | | "github.com/stackitcloud/terraform-provider-stackit/stackit/internal/conversion" |
| | | @@ -219,15 +221,27 @@ func (r *recordSetResource) Create(ctx context.Context, req resource.CreateReque |
| | | } |
| | | |
| | | // Write id attributes to state before polling via the wait handler - just in case anything goes wrong during the wait handler |
| | | utils.SetAndLogStateFields(ctx, &resp.Diagnostics, &resp.State, map[string]any{ |
| | | "project_id": projectId, |
| | | "zone_id": zoneId, |
| | | "record_set_id": *recordSetResp.Rrset.Id, |
| | | }) |
| | | recordSetId := *recordSetResp.Rrset.Id |
| | | model.RecordSetId = types.StringValue(recordSetId) |
| | | model.Id = utils.BuildInternalTerraformId(projectId, zoneId, recordSetId) |
| | | |
| | | // Set all unknown/null fields to null before saving state |
| | | if err := utils.SetModelFieldsToNull(ctx, &model); err != nil { |

| recordSetResp, err := r.client.CreateRecordSet(ctx, projectId, zoneId).CreateRecordSetPayload(*payload).Execute() | |
| if err != nil || recordSetResp.Rrset == nil || recordSetResp.Rrset.Id == nil { | |
| core.LogAndAddError(ctx, &resp.Diagnostics, "Error creating record set", fmt.Sprintf("Calling API: %v", err)) | |
| return | |
| } | |
| // Write id attributes to state before polling via the wait handler - just in case anything goes wrong during the wait handler | |
| utils.SetAndLogStateFields(ctx, &resp.Diagnostics, &resp.State, map[string]any{ | |
| "project_id": projectId, | |
| "zone_id": zoneId, | |
| "record_set_id": *recordSetResp.Rrset.Id, | |
| }) | |
| if resp.Diagnostics.HasError() { | |
| return | |
| } | |
| waitResp, err := wait.CreateRecordSetWaitHandler(ctx, r.client, projectId, zoneId, *recordSetResp.Rrset.Id).WaitWithContext(ctx) | |
| if err != nil { | |
| core.LogAndAddError(ctx, &resp.Diagnostics, "Error creating record set", fmt.Sprintf("Instance creation waiting: %v", err)) | |
| return | |
| } |
After the wait handler we use the mapFields function to map the API response to the Terraform state model.
terraform-provider-stackit/stackit/internal/services/dns/recordset/resource.go
Lines 237 to 248 in b5f82e7
| // Map response body to schema | |
| err = mapFields(ctx, waitResp, &model) | |
| if err != nil { | |
| core.LogAndAddError(ctx, &resp.Diagnostics, "Error creating record set", fmt.Sprintf("Processing API payload: %v", err)) | |
| return | |
| } | |
| // Set state to fully populated data | |
| diags = resp.State.Set(ctx, model) | |
| resp.Diagnostics.Append(diags...) | |
| if resp.Diagnostics.HasError() { | |
| return | |
| } |
Now comes the important part: Here is the section in the mapFields function, which makes sure all fields of the resource get set to a value or null. [1]
terraform-provider-stackit/stackit/internal/services/dns/recordset/resource.go
Lines 432 to 445 in b5f82e7
| model.Id = utils.BuildInternalTerraformId( | |
| model.ProjectId.ValueString(), model.ZoneId.ValueString(), recordSetId, | |
| ) | |
| model.RecordSetId = types.StringPointerValue(recordSet.Id) | |
| model.Active = types.BoolPointerValue(recordSet.Active) | |
| model.Comment = types.StringPointerValue(recordSet.Comment) | |
| model.Error = types.StringPointerValue(recordSet.Error) | |
| if model.Name.IsNull() || model.Name.IsUnknown() { | |
| model.Name = types.StringPointerValue(recordSet.Name) | |
| } | |
| model.FQDN = types.StringPointerValue(recordSet.Name) | |
| model.State = types.StringValue(string(recordSet.GetState())) | |
| model.TTL = types.Int64PointerValue(recordSet.Ttl) | |
| model.Type = types.StringValue(string(recordSet.GetType())) |
Well, and after that the model struct must be persisted in the Terraform state (this doesn't happen automatically):
terraform-provider-stackit/stackit/internal/services/dns/recordset/resource.go
Lines 243 to 248 in b5f82e7
| // Set state to fully populated data | |
| diags = resp.State.Set(ctx, model) | |
| resp.Diagnostics.Append(diags...) | |
| if resp.Diagnostics.HasError() { | |
| return | |
| } |
To sum it up, here's what happens in the main branch implementation of this resource:
1. Create request for the API resource
2. (Write the id fields to the state in case anything goes wrong during the wait handler)
3. Wait handler to wait for creation of the API resource to complete
4. Map the API response to the Terraform resource model struct (mapFields)
5. Persist the Terraform model struct of the resource in the Terraform state
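The order of these steps can be sketched roughly as follows. This is only an illustration with hypothetical stand-ins (`api`, `state`, `fakeAPI`, and the state string `CREATE_SUCCEEDED` are made up for the example), not the provider's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// recordSet is a hypothetical stand-in for the API response type.
type recordSet struct {
	ID    string
	Name  string
	State string
}

// api is a hypothetical stand-in for the DNS client plus its wait handler.
type api interface {
	Create(name string) (id string, err error)
	WaitUntilReady(id string) (*recordSet, error)
}

// state is a hypothetical stand-in for the Terraform state (attribute -> value).
type state map[string]string

// create runs the steps in the same order as the summary above.
func create(client api, st state, projectID, zoneID, name string) error {
	// Step 1: create request for the API resource.
	id, err := client.Create(name)
	if err != nil {
		return fmt.Errorf("calling API: %w", err)
	}
	// Step 2: persist the id fields before waiting, so a failed wait
	// does not lose track of the resource.
	st["project_id"] = projectID
	st["zone_id"] = zoneID
	st["record_set_id"] = id
	// Step 3: wait for the creation to complete.
	rs, err := client.WaitUntilReady(id)
	if err != nil {
		return fmt.Errorf("waiting for creation: %w", err)
	}
	// Steps 4 and 5: map the API response and persist the populated fields.
	st["name"] = rs.Name
	st["state"] = rs.State
	return nil
}

// fakeAPI lets the wait step fail on demand.
type fakeAPI struct{ failWait bool }

func (f fakeAPI) Create(name string) (string, error) { return "rs-1", nil }

func (f fakeAPI) WaitUntilReady(id string) (*recordSet, error) {
	if f.failWait {
		return nil, errors.New("context canceled")
	}
	return &recordSet{ID: id, Name: "example", State: "CREATE_SUCCEEDED"}, nil
}

func main() {
	st := state{}
	err := create(fakeAPI{failWait: true}, st, "p-1", "z-1", "example")
	// The ids survive the failed wait; name and state are not set yet.
	fmt.Println(err != nil, st["record_set_id"], len(st))
}
```

In this sketch a failed wait leaves only the three id fields behind, which is exactly the partial state that step 2 is meant to guarantee.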
Now to your changes
Now to your changes and why it's not working (without setting all fields to null using your new reflection-powered util func):
In your func (r *recordSetResource) Create(...) ... implementation:
- You also do the Create request for the API resource (see no. 1 above)
- You write the id fields to the state (see no. 2 above)
- And then you jump out of the Create implementation of the Terraform resource prematurely with the code below.
if !utils.ShouldWait() {
	tflog.Info(ctx, "Skipping wait; async mode for Crossplane/Upjet")
	return
}
The problem is: This doesn't only skip the wait handler (no. 3 above) but also the mapFields call (no. 4 above), which (as said) explicitly sets every field to a value or null.
Again, you just skip this. This is a core part of the resource implementation, and you don't call it. That's why Terraform complains about unknown values. Terraform says this is a bug in the provider implementation, and it's correct.
Sadly, it's not a bug in our implementation on the main branch but in your implementation.
You circumvent this problem by explicitly setting all fields of the Terraform resource state model to null with your new util func. Terraform no longer complains about unknown values, but this doesn't really fix the problem (at least not in a clean way).
In fact, setting all fields of the Terraform resource model struct to null bypasses existing Terraform checks that we want to take advantage of in our resource implementations (at least for pure Terraform usage, leaving Crossplane aside).
[1] Btw, if you forget to set one field of the Terraform resource model struct to a value or null here during the implementation of the Terraform resource, you will also get exactly the error "After the apply operation, the provider still indicated an unknown value..." from above. This is what I consider a Terraform feature. As said, unknown values are a concept of Terraform.
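To illustrate the footnote, here is a small sketch of the unknown/null/known distinction and of a check that reports attributes left unknown after mapping. The `value` type and both helpers are hypothetical simplifications for this example; they are not the plugin framework's types, and the real check is performed by Terraform itself:

```go
package main

import (
	"fmt"
	"sort"
)

// kind models the three states an attribute can be in.
type kind int

const (
	unknown kind = iota // value only known after apply
	null                // explicitly no value
	known               // concrete value
)

// value is a hypothetical, simplified stand-in for a framework attribute type.
type value struct {
	kind kind
	str  string
}

// fromPointer mirrors the idea of pointer-based constructors:
// a nil pointer becomes null, anything else becomes a known value.
func fromPointer(s *string) value {
	if s == nil {
		return value{kind: null}
	}
	return value{kind: known, str: *s}
}

// unknownAttributes returns the sorted names of all attributes that are
// still unknown. A non-empty result corresponds to the situation in which
// Terraform reports "the provider still indicated an unknown value".
func unknownAttributes(model map[string]value) []string {
	var names []string
	for name, v := range model {
		if v.kind == unknown {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	model := map[string]value{
		"name":    {kind: known, str: "example"},
		"comment": {kind: unknown},
		"error":   {kind: unknown},
	}
	// Mapping the API response resolves "comment" to null (nil pointer);
	// "error" is deliberately forgotten and therefore stays unknown.
	model["comment"] = fromPointer(nil)
	fmt.Println(unknownAttributes(model))
}
```

Setting every field to null up front would make such a check pass trivially, which is why it hides a forgotten mapping instead of surfacing it.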
Thanks for the detailed explanation. It covers my observations well. I think we are actually on two sides of the same coin.
Let's take a step back and start with the requirements for the create again. Then I'll share my observations from testing, and then we can check different alternatives.
Requirements
- Have idempotency. If I apply a resource and the run fails right after the API call (for example due to timeouts, context cancellation, or random API errors in the wait handler), I want the resource to be in the state so that a Read fills the model on the next apply. There should be no state drift and no replacement of the resource created in the first apply.
- Have a way to return right after the creation of the resource without waiting. This requirement comes from Upjet/Crossplane (hence the skip method, whose log message says it is intended for that tool only). Upjet needs the ids of the resources quickly so it can persist them in Kubernetes custom resources (the database, so to speak), because a resource is only known once it is stored in the custom resource. Upjet holds the Terraform state in a file only temporarily; during applies, the state is constructed from the custom resource. The problem with waiting is that the controller executing Terraform can restart at any point in time. Since the wait handler can take quite a while, we risk creating the same resource twice. Hence the early return. This is completely fine for the tool, since it executes a Read directly after the return. It also queries the state of the cloud resource every 10 minutes, so eventually the cloud resource and the custom resource hold the same data. That's the standard Kubernetes reconciliation mechanism.
- (optional) Have a common way to achieve idempotency in every single resource. We should have a rock-solid approach without much custom implementation, since doing it separately for each resource is error-prone.
Code Walkthrough, Testing and Observations
We already recognized that we need to set partial state in the Terraform state. That's why the following code already exists in the main branch:
utils.SetAndLogStateFields(ctx, &resp.Diagnostics, &resp.State, map[string]any{
"project_id": projectId,
"zone_id": zoneId,
"record_set_id": *recordSetResp.Rrset.Id,
})
if resp.Diagnostics.HasError() {
return
}
In my tests I set up a Terraform resource. In this case it is MariaDB, which uses the same function; MariaDB takes much longer to create, whereas DNS is super fast. So please don't be confused about the resource: we are still talking about the same code.
resource "stackit_mariadb_instance" "example_maria_db" {
name = "example-mariadb"
plan_name = "stackit-mariadb-1.4.10-single"
project_id = "xxx"
version = "10.6"
}
Then I applied, and once the wait handler had started and I saw the MariaDB instance in the creating state in the portal, I canceled the apply to simulate random failures as mentioned above. Then I reapplied and got the error: stackit_mariadb_instance.example_maria_db is tainted, so must be replaced.
That's when I recognized that setting the ids is not enough and that we need to include the fields specified in the resource as well (name, plan_name, version). So I changed the code to:
utils.SetAndLogStateFields(ctx, &resp.Diagnostics, &resp.State, map[string]interface{}{
"project_id": projectId,
"instance_id": model.InstanceId.ValueString(),
"id": model.Id.ValueString(),
"name": model.Name.ValueString(),
"plan_name": model.PlanName.ValueString(),
"version": model.Version.ValueString(),
"plan_id": model.PlanId.ValueString(),
})
if resp.Diagnostics.HasError() {
return
}
and that almost worked. We also should not log an error when the wait handler fails, as it messes up Terraform and results in non-idempotent behaviour:
waitResp, err := wait.CreateInstanceWaitHandler(ctx, r.client, projectId, instanceId).WaitWithContext(ctx)
if err != nil {
tflog.Warn(ctx, fmt.Sprintf("Instance creation waiting failed: %v. The instance was created but waiting for ready state was interrupted. State will be refreshed on next apply.", err))
return
}
And that works perfectly fine in the create/cancel/reapply case. Now there is no state drift and the resource stays as it is.
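The create/cancel/reapply behaviour described here can be sketched as below. All names (`instance`, `state`, `create`, `read`, `errInterrupted`) are hypothetical stand-ins for this example and use plain maps instead of the framework's state handling:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// instance is a hypothetical stand-in for the API representation.
type instance struct{ ID, Name, DashboardURL string }

// state is a hypothetical stand-in for the Terraform state.
type state map[string]string

// errInterrupted simulates a canceled apply during the wait handler.
var errInterrupted = errors.New("wait interrupted")

// create persists ids and configured fields first, then waits.
// A failed wait is only logged as a warning, so the partial state is kept.
func create(st state, id, name string, wait func() (*instance, error)) {
	st["instance_id"] = id
	st["name"] = name
	inst, err := wait()
	if err != nil {
		log.Printf("warning: waiting for instance %s failed: %v; state will be refreshed on next apply", id, err)
		return
	}
	st["dashboard_url"] = inst.DashboardURL
}

// read refreshes the state from the API, as a later apply or refresh would.
func read(st state, lookup func(id string) *instance) {
	inst := lookup(st["instance_id"])
	st["name"] = inst.Name
	st["dashboard_url"] = inst.DashboardURL
}

func main() {
	st := state{}
	create(st, "i-1", "example-mariadb", func() (*instance, error) {
		return nil, errInterrupted
	})
	fmt.Println("after interrupted create:", st)

	read(st, func(id string) *instance {
		return &instance{ID: id, Name: "example-mariadb", DashboardURL: "https://dashboard.example"}
	})
	fmt.Println("after read:", st)
}
```

The instance id stays the same across both steps, and only the computed field is filled in later, which is the "no drift, no replacement" outcome the manual test was looking for.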
Then I went a step further and wrote unit tests for the behaviour, so we can really verify that it works how we think it works.
// Verify that Read successfully populated all fields from the API
var stateAfterRead Model
diags = readResp.State.Get(tc.Ctx, &stateAfterRead)
require.False(t, diags.HasError(), "Expected no errors reading state after Read")
// Verify all fields are now complete after successful Read (prevents state drift)
require.Equal(t, instanceId, stateAfterRead.InstanceId.ValueString())
require.Equal(t, fmt.Sprintf("%s,%s", projectId, instanceId), stateAfterRead.Id.ValueString())
require.Equal(t, projectId, stateAfterRead.ProjectId.ValueString())
require.Equal(t, instanceName, stateAfterRead.Name.ValueString())
require.Equal(t, planId, stateAfterRead.PlanId.ValueString())
require.Equal(t, planName, stateAfterRead.PlanName.ValueString())
require.Equal(t, version, stateAfterRead.Version.ValueString())
// CRITICAL: Verify fields that were NULL after Create are now populated
// This prevents Terraform state drift on the next apply
require.False(t, stateAfterRead.DashboardUrl.IsNull(), "DashboardUrl must be populated by Read to prevent state drift")
require.Equal(t, dashboardUrl, stateAfterRead.DashboardUrl.ValueString())
require.False(t, stateAfterRead.CfGuid.IsNull(), "CfGuid must be populated by Read to prevent state drift")
require.False(t, stateAfterRead.ImageUrl.IsNull(), "ImageUrl must be populated by Read to prevent state drift")
The unit test covers the manual create/cancel/read test. Note that setting the partial state actually leads to null fields when reading the state again. Then I used utils.SetModelFieldsToNull instead of utils.SetAndLogStateFields, and the tests were equally successful. This led me to the assumption that we are actually on two different sides of the same coin (different code but same outcome). Not setting fields in the state leads to null values, while setting them to null explicitly also results in reading out null values. So we have probably found multiple ways to solve the idempotency problem. More on that in the alternatives.
Second, the early exit is this code:
if !utils.ShouldWait() {
tflog.Info(ctx, "Skipping wait; async mode for Crossplane/Upjet")
return
}
Note that this early return is only taken if an environment variable is set to "true". If the variable is not set, or is set to any value other than "true", we continue with the wait handler. Not pretty, but we somehow need to cover the requirement, since the tool works the way it works.
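A minimal sketch of such a toggle, assuming a made-up variable name (`EXAMPLE_SKIP_WAIT`; the actual name used in the PR is not shown in this thread). The lookup function is passed in so the behaviour can be checked without touching the real environment:

```go
package main

import (
	"fmt"
	"os"
)

// skipWaitEnv is a hypothetical variable name chosen for this example.
const skipWaitEnv = "EXAMPLE_SKIP_WAIT"

// shouldWait returns false only when the variable is exactly "true".
// Unset, empty, or any other value keeps the default waiting behaviour.
func shouldWait(getenv func(string) string) bool {
	return getenv(skipWaitEnv) != "true"
}

func main() {
	// Real usage reads the process environment.
	fmt.Println("wait (process env):", shouldWait(os.Getenv))

	// Demonstrate a few values with a fake lookup.
	for _, v := range []string{"", "true", "TRUE", "1"} {
		v := v
		lookup := func(string) string { return v }
		fmt.Printf("%q -> wait=%v\n", v, shouldWait(lookup))
	}
}
```

Comparing against the exact string "true" keeps the default safe: a typo in the value results in waiting rather than in silently skipping it.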
Alternatives/Conclusion
I think there is no real discussion about the early return, but if there is, feel free to suggest something.
The more interesting point is the idempotency part.
- As we already saw in the tests, we need to set the ids and the fields specified in the resource to avoid state drift and resource recreation. One approach could be like the current one in the main branch, but a bit more abstract: we construct the map based on the model. Similar to the proposed implementation of utils.SetModelFieldsToNull, we can iterate over the model's attributes with reflection, check for fields that are neither null nor unknown, and use the model's tfsdk tags as keys for the map and the attribute values as values. This should result in the map we want to store as partial state in the Terraform state.
- Similar to the first approach, we can go the reverse way: have the model already set and then set all fields that are unknown to null. That's also the proposed approach. You correctly highlighted that it might not be the best idea to use the same model that is used after the wait handler, since we also want to verify the behaviour of the map function after the wait handler. This means we should make a deep copy, set the model fields on this deep copy to null, and save the deep copy in the state.
- One last approach I could come up with is constructing the map with a lot of if-conditions in the resource, without any reflection magic. That is the least preferred option, as it requires implementing it in every resource and is error-prone since we may miss fields. (That's what I mean by the third requirement.)
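The first alternative could look roughly like the sketch below. It assumes a simplified, hypothetical `value` type instead of the framework's attribute types, and `partialState` is a made-up helper name; only the use of `tfsdk` struct tags as map keys is taken from the description above:

```go
package main

import (
	"fmt"
	"reflect"
)

// value is a hypothetical, simplified attribute type:
// known == false stands for "null or unknown".
type value struct {
	known bool
	str   string
}

// model is a hypothetical resource model with tfsdk tags.
type model struct {
	ProjectID    value `tfsdk:"project_id"`
	Name         value `tfsdk:"name"`
	DashboardURL value `tfsdk:"dashboard_url"`
}

// partialState walks the struct fields via reflection and collects every
// known field under its tfsdk tag. Null/unknown fields are left out.
func partialState(m any) map[string]string {
	out := map[string]string{}
	v := reflect.ValueOf(m)
	if v.Kind() == reflect.Ptr {
		v = v.Elem()
	}
	if v.Kind() != reflect.Struct {
		return out
	}
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		tag, ok := field.Tag.Lookup("tfsdk")
		if !ok || tag == "-" || !field.IsExported() {
			continue
		}
		val, ok := v.Field(i).Interface().(value)
		if !ok || !val.known {
			continue
		}
		out[tag] = val.str
	}
	return out
}

func main() {
	m := model{
		ProjectID: value{known: true, str: "p-1"},
		Name:      value{known: true, str: "example-mariadb"},
		// DashboardURL is not known yet and is therefore skipped.
	}
	fmt.Println(partialState(m))
}
```

The resulting map has the same shape as the hand-written one passed to utils.SetAndLogStateFields, but it cannot miss a field that is added to the model later.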
So, what do you think? Do you have other testing experiences? Which direction should we go?
Updated the code. Wait errors are now caught. The code now uses a separate model to save the minimal state, which preserves the upstream logic and still fixes the idempotency bugs.
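As a rough illustration of the "separate model" idea (not the code from this PR), the sketch below copies the model, sets unknown fields to null on the copy only, and leaves the original untouched for the later mapping step. The `value` type, `model` struct, and `minimalState` helper are hypothetical:

```go
package main

import (
	"fmt"
	"reflect"
)

// kind models the three states an attribute can be in.
type kind int

const (
	unknown kind = iota
	null
	known
)

// value is a hypothetical, simplified attribute type.
type value struct {
	kind kind
	str  string
}

// model is a hypothetical resource model.
type model struct {
	ID           value
	Name         value
	DashboardURL value
}

// minimalState returns a copy of m in which every unknown field is null.
// The plain struct copy is sufficient here because the fields hold no
// pointers, slices, or maps; a real model may need a proper deep copy.
func minimalState(m model) model {
	c := m
	v := reflect.ValueOf(&c).Elem()
	for i := 0; i < v.NumField(); i++ {
		f := v.Field(i)
		if val, ok := f.Interface().(value); ok && val.kind == unknown {
			f.Set(reflect.ValueOf(value{kind: null}))
		}
	}
	return c
}

func main() {
	m := model{
		ID:   value{kind: known, str: "p-1,i-1"},
		Name: value{kind: known, str: "example-mariadb"},
		// DashboardURL is still unknown before the wait handler finishes.
	}
	partial := minimalState(m)
	fmt.Println("copy is null:", partial.DashboardURL.kind == null)
	fmt.Println("original still unknown:", m.DashboardURL.kind == unknown)
}
```

Because the original model keeps its unknown values, a forgotten assignment in the mapping step after the wait handler would still be caught by Terraform.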


Running bash scripts blindly from a master branch of another repository is a no-go for me, sorry
Overall, what's the point of this? This whole thing feels wrong to me. For managing development dependencies there are things like dev containers, devenvs, nix flakes, ...
I'm aware we're not providing any of these currently, but this download process inside the Makefile seems pretty hacky to me 😄
Yeah, that was a bit too ambitious. You know, once you copy it from somewhere, you always copy it :P
I replaced it with downloading from the releases, which should be secure.
It is actually quite typical to download binaries that are needed to interact with the application (linters, kubectl, kind, helm, etc.) via scripts/make. In many STACKIT projects that is already the case. And there are also many open-source projects that do similar things, like:
I guess there are many ways to solve the same problem. Currently my biggest problem is that I cannot lint locally, since there are version differences between my installed golangci-lint and the one in the pipeline. Therefore I want a make command that runs the same version locally as in the pipeline. Some might say that is the shift-left approach.