feat(backup): add EtcdBackup CRD for declarative etcd snapshots #307

IvanHunters wants to merge 35 commits into aenix-io:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds EtcdBackup and EtcdBackupSchedule CRDs, validating webhooks, controllers to manage Jobs/CronJobs, backup-agent and restore-agent binaries, StatefulSet bootstrap restore init containers, Helm/CRD/RBAC/chart updates, generated deepcopy code, and extensive unit tests plus build/tooling changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant K8s as Kubernetes API
    participant Controller as EtcdBackup Controller
    participant Factory as Job Factory
    participant Job as Kubernetes Job
    participant Agent as backup-agent
    participant Etcd as Etcd Cluster
    participant Storage as S3/PVC
    User->>K8s: Create EtcdBackup CR
    K8s->>Controller: Reconcile EtcdBackup
    Controller->>K8s: Get EtcdCluster
    Controller->>Factory: CreateBackupJob(backup, cluster, operatorImage)
    Factory->>K8s: Create Job
    K8s->>Job: Schedule Job
    Job->>Agent: Run backup-agent container
    Agent->>Etcd: Connect and snapshot
    Etcd-->>Agent: Snapshot stream
    alt S3
        Agent->>Storage: Upload snapshot to S3
    else PVC
        Agent->>Storage: Write snapshot to PVC
    end
    Storage-->>Agent: Confirm
    Agent-->>Job: Exit success
    Job->>K8s: Update Job status
    Controller->>K8s: Observe Job completion, Update EtcdBackup status
```

```mermaid
sequenceDiagram
    participant User
    participant K8s as Kubernetes API
    participant ScheduleController as EtcdBackupSchedule Controller
    participant Factory as CronJob Factory
    participant Cron as CronJob
    participant Scheduler as K8s Scheduler
    participant Job as Created Job
    participant Agent as backup-agent
    User->>K8s: Create EtcdBackupSchedule CR
    K8s->>ScheduleController: Reconcile Schedule
    ScheduleController->>K8s: Get EtcdCluster
    ScheduleController->>Factory: CreateBackupCronJob(schedule, cluster, operatorImage)
    Factory->>K8s: Create CronJob
    note over Scheduler,Job: Periodically per cron schedule
    Scheduler->>K8s: Instantiate Job from CronJob
    K8s->>Job: Run backup-agent
    Job->>Agent: Execute backup-agent
    Agent->>Agent: Perform backup (S3/PVC)
    Job-->>K8s: Complete
    ScheduleController->>K8s: Update EtcdBackupSchedule status (LastScheduleTime/Ready)
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (13)
Makefile (1)
121-128: Deduplicate the backup-agent build recipe.
`build` and `build-backup-agent` now carry the same `go build -o bin/backup-agent ...` command. Making `build` depend on `build-backup-agent` avoids the two targets drifting the next time build flags change.

Possible cleanup

```diff
 .PHONY: build
-build: manifests generate fmt vet ## Build manager binary.
+build: manifests generate fmt vet build-backup-agent ## Build manager binary.
 	go build -o bin/manager cmd/manager/main.go
-	go build -o bin/backup-agent cmd/backup-agent/main.go

 .PHONY: build-backup-agent
 build-backup-agent: ## Build backup-agent binary.
 	go build -o bin/backup-agent cmd/backup-agent/main.go
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 121 - 128, The Makefile duplicates the backup-agent build command in both the build and build-backup-agent targets; change the build target to depend on build-backup-agent (add build-backup-agent to its prerequisites) and remove the duplicate "go build -o bin/backup-agent cmd/backup-agent/main.go" line from the build recipe so the backup-agent is built only via the build-backup-agent target (leave the build-backup-agent target containing the go build command intact).

internal/controller/factory/backup_job.go (1)
205-218: Consider handling zero replicas edge case.

When `cluster.Spec.Replicas` is explicitly set to 0 (allowed per kubebuilder validation minimum), `buildEndpoints` returns an empty string. The backup-agent will then fail with an unhelpful error about missing endpoints rather than a clear message about no replicas. This is a minor edge case since backing up a zero-replica cluster is inherently invalid, but could be confusing during debugging.
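The suggested guard can be sketched as follows; `cluster` here is a reduced stand-in for the real `EtcdCluster` type, and the headless-service endpoint layout is an assumption for illustration:

```go
package main

import "fmt"

// cluster is a stand-in for the EtcdCluster fields the review references.
type cluster struct {
	Name      string
	Namespace string
	Replicas  int32
}

// buildEndpoints returns a clear sentinel for zero replicas instead of an
// empty string, so failures downstream point at the real cause.
func buildEndpoints(c cluster) string {
	if c.Replicas == 0 {
		return fmt.Sprintf("no-replicas:%s.%s", c.Name, c.Namespace)
	}
	endpoints := ""
	for i := int32(0); i < c.Replicas; i++ {
		if i > 0 {
			endpoints += ","
		}
		// Assumed per-pod DNS layout behind a headless service.
		endpoints += fmt.Sprintf("https://%s-%d.%s-headless.%s:2379",
			c.Name, i, c.Name, c.Namespace)
	}
	return endpoints
}

func main() {
	fmt.Println(buildEndpoints(cluster{Name: "demo", Namespace: "default", Replicas: 0}))
	fmt.Println(buildEndpoints(cluster{Name: "demo", Namespace: "default", Replicas: 2}))
}
```

With the guard in place, a misconfigured zero-replica cluster produces a log line that names the cluster rather than a cryptic empty-endpoints error.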
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/factory/backup_job.go` around lines 205 - 218, buildEndpoints currently returns an empty string when cluster.Spec.Replicas is explicitly 0; change buildEndpoints to detect replicas == 0 and return a clear sentinel string (for example "no-replicas:<clusterName>.<namespace>") instead of "", so callers get a deterministic, debuggable value; update the logic inside buildEndpoints (which uses GetServerProtocol, GetHeadlessServiceName and reads cluster.Spec.Replicas) to check for zero replicas and return the descriptive message immediately before building endpoints.

config/crd/bases/etcd.aenix.io_etcdbackups.yaml (1)
66-127: Push the destination invariants into the CRD schema.
`destination` still accepts `{}` and `{s3,pvc}` at the OpenAPI level, so these invalid specs are only rejected once the validating webhook is involved. Adding `oneOf` plus branch-level `required` fields here would make the CRD self-validating and fail bad manifests earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/crd/bases/etcd.aenix.io_etcdbackups.yaml` around lines 66 - 127, The CRD currently allows destination to be empty or contain both s3 and pvc because its schema only marks properties as optional; update the destination schema (the "destination" object that contains "s3" and "pvc") to use a oneOf with two branches: one branch requiring the "pvc" property and within it the existing required "claimName", and the other branch requiring the "s3" property and within it the existing required fields ("bucket", "credentialsSecretRef", "endpoint", "key", etc.); ensure each branch keeps the existing property definitions and required lists so the CRD self-validates and rejects {} or mixed specs without relying on the webhook.

api/v1alpha1/etcdbackupschedule_webhook_test.go (1)
28-199: Add negative coverage for schedule destination required fields too.

This suite checks the xor rule and `subPath`, but it never exercises a missing `claimName` or missing required S3 fields on `EtcdBackupSchedule`. Since this CR has its own webhook, those regressions would still slip past the existing `EtcdBackup` tests.
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/etcdbackupschedule_webhook_test.go` around lines 28 - 199, Add negative unit tests that call EtcdBackupSchedule.ValidateCreate to cover missing required destination fields: one test creating an EtcdBackupSchedule whose Spec.Destination.PVC has an empty ClaimName and asserting ValidateCreate returns a StatusError containing "claimName" (or similar), and another test where Spec.Destination.S3 omits required fields (e.g., empty Bucket or empty CredentialsSecretRef.Name) and asserting ValidateCreate returns a StatusError containing the missing S3 field names; locate the test cases near the existing tests referencing EtcdBackupSchedule, BackupDestination, PVCBackupDestination, S3BackupDestination and reuse the ValidateCreate call/assertion pattern used elsewhere in this file.

charts/etcd-operator/templates/cert-manager/validatingwebhookconfiguration.yml (1)
10-49: Set an explicit timeout on these new validating webhooks.

These rules fail closed and currently rely on the API server default timeout. Pinning `timeoutSeconds` to a small value keeps CREATE/UPDATE failures fast when the webhook service is slow or unavailable.

Suggested change
```diff
 - admissionReviewVersions:
   - v1
   clientConfig:
     service:
       name: {{ include "etcd-operator.fullname" . }}-webhook-service
       namespace: {{ .Release.Namespace }}
     path: /validate-etcd-aenix-io-v1alpha1-etcdbackup
   failurePolicy: Fail
   name: vetcdbackup.kb.io
+  timeoutSeconds: 5
   rules:
   - apiGroups:
     - etcd.aenix.io
@@
 - admissionReviewVersions:
   - v1
   clientConfig:
     service:
       name: {{ include "etcd-operator.fullname" . }}-webhook-service
       namespace: {{ .Release.Namespace }}
     path: /validate-etcd-aenix-io-v1alpha1-etcdbackupschedule
   failurePolicy: Fail
   name: vetcdbackupschedule.kb.io
+  timeoutSeconds: 5
   rules:
   - apiGroups:
     - etcd.aenix.io
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/etcd-operator/templates/cert-manager/validatingwebhookconfiguration.yml` around lines 10 - 49, The validating webhook configurations for vetcdbackup.kb.io and vetcdbackupschedule.kb.io lack an explicit timeoutSeconds and thus use the API server default; add a timeoutSeconds: 5 (or another small integer) field to each webhook object (alongside failurePolicy) in the validatingwebhookconfiguration entries for paths /validate-etcd-aenix-io-v1alpha1-etcdbackup and /validate-etcd-aenix-io-v1alpha1-etcdbackupschedule so CREATE/UPDATE calls fail fast when the webhook service is slow or unavailable.

config/webhook/manifests.yaml (1)
33-72: Mirror an explicit webhook timeout in the generated manifest too.

Like the chart template, these fail-closed webhooks currently inherit the API server default timeout. Setting `timeoutSeconds` explicitly keeps local/generated deployments from stalling writes for longer than necessary during webhook issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/webhook/manifests.yaml` around lines 33 - 72, The generated Mutating/ValidatingWebhookConfiguration entries for the webhooks vetcdbackup.kb.io and vetcdbackupschedule.kb.io (look for clientConfig.service.path values /validate-etcd-aenix-io-v1alpha1-etcdbackup and /validate-etcd-aenix-io-v1alpha1-etcdbackupschedule) lack an explicit timeoutSeconds and therefore inherit the API server default; add a timeoutSeconds field (e.g., timeoutSeconds: 10) to each webhook block in the manifest so the generated YAML mirrors the chart template and avoids long stalls when the webhook is unreachable.

internal/controller/factory/backup_job_test.go (1)
142-203: Assert the new S3 endpoint/path-style env wiring here.
`CreateBackupJob` now exports `S3_ENDPOINT` and `S3_FORCE_PATH_STYLE`, but this test never checks them. That leaves the S3 compatibility change in this PR effectively unguarded.

Suggested change
```diff
 	if envMap["S3_BUCKET"].Value != "backups" {
 		t.Errorf("expected S3_BUCKET=backups, got %q", envMap["S3_BUCKET"].Value)
 	}
+	if envMap["S3_ENDPOINT"].Value != "https://s3.example.com" {
+		t.Errorf("expected S3_ENDPOINT=https://s3.example.com, got %q", envMap["S3_ENDPOINT"].Value)
+	}
 	if envMap["S3_KEY"].Value != "etcd/snap.db" {
 		t.Errorf("expected S3_KEY=etcd/snap.db, got %q", envMap["S3_KEY"].Value)
 	}
 	if envMap["S3_REGION"].Value != "eu-west-1" {
 		t.Errorf("expected S3_REGION=eu-west-1, got %q", envMap["S3_REGION"].Value)
 	}
+	if envMap["S3_FORCE_PATH_STYLE"].Value != "false" {
+		t.Errorf("expected S3_FORCE_PATH_STYLE=false, got %q", envMap["S3_FORCE_PATH_STYLE"].Value)
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/factory/backup_job_test.go` around lines 142 - 203, The test fails to assert the new S3 env wiring added by CreateBackupJob: add checks in backup_job_test.go (after building envMap from job.Spec.Template.Spec.Containers[0].Env) to assert envMap["S3_ENDPOINT"].Value == "https://s3.example.com" and that envMap["S3_FORCE_PATH_STYLE"].Value exists and equals the expected default (e.g. "false") so the S3 endpoint and path-style behavior are validated for the backup created from the backup object with Endpoint set.

internal/controller/factory/backup_cronjob_test.go (1)
30-46: Test setup looks good, consider explicit error checking on scheme registration.

The test properly sets up the runtime scheme with both client-go and etcd-operator types. While ignoring the return value of `AddToScheme` is common in tests (these rarely fail), explicit error checking would be more defensive.

🔧 Optional: Add explicit error checking
```diff
 func TestCreateBackupCronJob_PVC(t *testing.T) {
 	scheme := runtime.NewScheme()
-	_ = clientgoscheme.AddToScheme(scheme)
-	_ = etcdaenixiov1alpha1.AddToScheme(scheme)
+	if err := clientgoscheme.AddToScheme(scheme); err != nil {
+		t.Fatalf("failed to add client-go scheme: %v", err)
+	}
+	if err := etcdaenixiov1alpha1.AddToScheme(scheme); err != nil {
+		t.Fatalf("failed to add etcd-operator scheme: %v", err)
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/factory/backup_cronjob_test.go` around lines 30 - 46, TestCreateBackupCronJob_PVC currently ignores errors returned by clientgoscheme.AddToScheme and etcdaenixiov1alpha1.AddToScheme; update the test to explicitly check and fail on those errors by capturing their return values and calling t.Fatalf or t.Fatal when non-nil. Locate the two AddToScheme calls in TestCreateBackupCronJob_PVC and change them to assign the error (e.g., err := clientgoscheme.AddToScheme(scheme)) and handle err similarly for etcdaenixiov1alpha1.AddToScheme so the test fails fast if scheme registration fails.

api/v1alpha1/etcdbackupschedule_webhook.go (1)
34-35: Consider using a cron parsing library instead of regex for schedule validation.

The current regex allows some invalid cron expressions (e.g., `99 99 99 99 99` would match). Consider using a library like `github.com/robfig/cron/v3` to parse and validate cron expressions more accurately, which would also catch semantic errors (e.g., invalid day-of-month values).

♻️ Suggested approach using cron library
```go
import "github.com/robfig/cron/v3"

// In ValidateCreate, replace regex check with:
if r.Spec.Schedule == "" {
	allErrors = append(allErrors, field.Required(...))
} else {
	parser := cron.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow | cron.Descriptor)
	if _, err := parser.Parse(r.Spec.Schedule); err != nil {
		allErrors = append(allErrors, field.Invalid(
			field.NewPath("spec", "schedule"),
			r.Spec.Schedule,
			fmt.Sprintf("invalid cron expression: %v", err),
		))
	}
}
```

🤖 Prompt for AI Agents
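For comparison, a dependency-free sketch of the same idea: a hypothetical `validateCron` helper that only bounds plain numeric fields, which is enough to show why `99 99 99 99 99` must be rejected on field ranges rather than by shape alone. Real schedules with ranges, steps, and lists still need a proper parser such as robfig/cron.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// validateCron is a deliberately minimal checker: it bounds the five
// numeric cron fields (minute, hour, day-of-month, month, day-of-week)
// and accepts "*". It does not handle ranges, steps, lists, or names.
func validateCron(expr string) error {
	bounds := [][2]int{{0, 59}, {0, 23}, {1, 31}, {1, 12}, {0, 6}}
	fields := strings.Fields(expr)
	if len(fields) != 5 {
		return fmt.Errorf("expected 5 fields, got %d", len(fields))
	}
	for i, f := range fields {
		if f == "*" {
			continue
		}
		n, err := strconv.Atoi(f)
		if err != nil {
			return fmt.Errorf("field %d: %q is not numeric or *", i, f)
		}
		if n < bounds[i][0] || n > bounds[i][1] {
			return fmt.Errorf("field %d: %d out of range [%d,%d]", i, n, bounds[i][0], bounds[i][1])
		}
	}
	return nil
}

func main() {
	fmt.Println(validateCron("0 3 * * *"))      // nil: a valid daily schedule
	fmt.Println(validateCron("99 99 99 99 99")) // out-of-range error
}
```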
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/etcdbackupschedule_webhook.go` around lines 34 - 35, Replace the fragile regex-based cron validation (cronRegexp) in ValidateCreate with parsing via github.com/robfig/cron/v3: remove/stop using cronRegexp, add the cron import, construct a parser with cron.Minute|cron.Hour|cron.Dom|cron.Month|cron.Dow|cron.Descriptor and call parser.Parse(r.Spec.Schedule), treat empty r.Spec.Schedule as required, and on parse error append a field.Invalid for field.NewPath("spec","schedule") including the parser error text so semantic cron errors are caught.

api/v1alpha1/etcdbackup_webhook.go (1)
144-152: Consider using `path.Clean` instead of `filepath.Clean` for consistent path validation.

`filepath.Clean` uses OS-specific path separators, which may behave differently when the webhook runs on Windows vs. when the backup-agent runs in a Linux container. Using `path.Clean` from the `path` package ensures consistent forward-slash handling regardless of the webhook's host OS.

♻️ Suggested fix for consistent path handling
```diff
 import (
 	"fmt"
-	"path/filepath"
+	"path"
 	"reflect"
 	"strings"
 ...
 	if pvc.SubPath != "" {
-		cleaned := filepath.Clean(pvc.SubPath)
-		if strings.HasPrefix(cleaned, "..") || filepath.IsAbs(cleaned) {
+		cleaned := path.Clean(pvc.SubPath)
+		if strings.HasPrefix(cleaned, "..") || path.IsAbs(cleaned) {
 			allErrors = append(allErrors, field.Invalid(
```

🤖 Prompt for AI Agents
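The check can be exercised standalone. This sketch uses a hypothetical `validateSubPath` helper built on `path.Clean`; it is slightly stricter about the `..` prefix than the diff above, and the error wording is illustrative:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// validateSubPath rejects absolute paths and anything that escapes the
// mount root after cleaning. path.Clean always uses forward slashes, so
// the behavior matches the Linux container regardless of the webhook's
// host OS.
func validateSubPath(subPath string) error {
	if subPath == "" {
		return nil // optional field
	}
	cleaned := path.Clean(subPath)
	if path.IsAbs(cleaned) || cleaned == ".." || strings.HasPrefix(cleaned, "../") {
		return fmt.Errorf("subPath %q must be a relative path inside the volume", subPath)
	}
	return nil
}

func main() {
	for _, p := range []string{"snapshots/daily", "a/../b", "../escape", "/abs"} {
		fmt.Printf("%-16s -> %v\n", p, validateSubPath(p))
	}
}
```

Note that `a/../b` cleans to `b` and passes, while `../escape` and `/abs` are rejected.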
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/etcdbackup_webhook.go` around lines 144 - 152, Replace OS-dependent filepath usage with the POSIX path package for consistent validation: in the block that checks pvc.SubPath (using filepath.Clean and filepath.IsAbs), switch to path.Clean and path.IsAbs and update imports to include "path" instead of "path/filepath"; keep the same logic with strings.HasPrefix(cleaned, "..") and field.Invalid on pvcPath.Child("subPath") so subPath is validated as a relative, non-.. containing path.

internal/controller/etcdbackupschedule_controller_test.go (1)
274-278: The compile-time verification block is unnecessary.

Lines 275-278 use a pattern to verify imports compile, but this is already ensured by the actual test code that uses `errors.IsNotFound` and `ptr.To`. This block can be removed to reduce noise.

🧹 Remove unnecessary verification block
```diff
-// Verify that all helpers referenced in this file compile.
-var _ = func() {
-	_ = errors.IsNotFound
-	_ = ptr.To[int32]
-}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/etcdbackupschedule_controller_test.go` around lines 274 - 278, Remove the unnecessary compile-time verification block (the anonymous var _ = func() { ... } closure) that assigns _ = errors.IsNotFound and _ = ptr.To[int32]; simply delete that entire block from etcdbackupschedule_controller_test.go since the tests already reference errors.IsNotFound and ptr.To elsewhere; ensure no other references remain to that anonymous function or its variables.

config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml (1)
70-131: Consider adding CEL validation for destination mutual exclusivity.

Currently, the destination schema allows both `pvc` and `s3` to be specified simultaneously (or neither), relying entirely on webhook validation. For defense-in-depth and better user experience when webhooks are unavailable, consider adding a CEL validation rule (requires Kubernetes 1.25+).

📋 Optional: Add CEL validation rule
Add to the destination property in the schema:
```yaml
destination:
  description: Destination defines where the backup will be stored.
  x-kubernetes-validations:
  - rule: "has(self.pvc) != has(self.s3)"
    message: "exactly one of pvc or s3 must be specified"
  properties:
    # ... existing properties
```

Alternatively, use the corresponding kubebuilder marker in the Go types file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml` around lines 70 - 131, Add a CEL validation to the destination schema to enforce mutual exclusivity between pvc and s3: under the destination property (referencing destination, pvc, and s3) add an x-kubernetes-validations entry with rule "has(self.pvc) != has(self.s3)" and an appropriate message like "exactly one of pvc or s3 must be specified" so the CRD itself rejects both/none selections when webhooks are unavailable.

internal/controller/etcdbackupschedule_controller.go (1)
127-167: Consider: `reflect.DeepEqual` may cause unnecessary updates.

Using `reflect.DeepEqual` on `CronJob.Spec` (line 140) is functional but may miss semantic equivalence (e.g., nil vs. empty slices). This could occasionally trigger no-op updates, but it's not incorrect behavior, just slightly suboptimal.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/etcdbackupschedule_controller.go` around lines 127 - 167, The code uses reflect.DeepEqual to compare existingCronJob.Spec and desiredCronJob.Spec in reconcileExistingCronJob which can treat semantically equal objects (e.g., nil vs empty slices) as different and cause unnecessary updates; replace the comparison with a Kubernetes-aware semantic equality check (e.g., import "k8s.io/apimachinery/pkg/api/equality" and use equality.Semantic.DeepEqual(existingCronJob.Spec, desiredCronJob.Spec)) or implement targeted field comparisons for the specific spec fields you care about (normalize nil/empty slices/maps first) before assigning existingCronJob.Spec and calling r.Update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/backup-agent/main.go`:
- Around line 89-121: In buildTLSConfig, set an explicit minimum TLS version on
the tls.Config (tlsConfig.MinVersion) to enforce TLS1.2+ (e.g.,
tls.VersionTLS12) before returning; update the tlsConfig variable initialization
(in function buildTLSConfig) to assign MinVersion to the chosen constant so the
config enforces the minimum protocol version for all connections.
In `@cmd/manager/main.go`:
- Around line 118-133: Read OPERATOR_IMAGE once at startup into a local variable
(e.g. operatorImage := os.Getenv("OPERATOR_IMAGE")), validate it is non-empty
and fail fast (log error and os.Exit(1)) if unset, then pass that checked
operatorImage into both controller.EtcdBackupReconciler and
controller.EtcdBackupScheduleReconciler via their OperatorImage field instead of
calling os.Getenv inline; update imports if you need errors (or logging) per
your chosen failure path.
In `@config/crd/patches/webhook_in_etcdbackups.yaml`:
- Around line 7-16: The CRD patch currently sets conversion.strategy: Webhook
(webhook.clientConfig.path: /convert) but there are no conversion handlers
(ConvertTo/ConvertFrom) or multiple API versions; remove the conversion webhook
block from the EtcdBackup CRD patch (delete the conversion: ... webhook: ...
conversionReviewVersions: ... entries) so the CRD remains single-version without
a conversion webhook, or alternatively implement proper hub/spoke conversion
methods (ConvertTo/ConvertFrom) and add additional API versions if you intend to
keep the webhook.
In `@internal/controller/factory/backup_job_test.go`:
- Around line 299-302: The test currently only checks
envMap["ETCD_ENDPOINTS"].Value is non-empty; change it to actually verify the
scheme is HTTPS by parsing the endpoint(s) and asserting the URL scheme equals
"https". For envMap["ETCD_ENDPOINTS"].Value (which may contain a single endpoint
or comma-separated endpoints), split on commas, trim whitespace, parse each with
a URL parser and assert parsedURL.Scheme == "https" (or fail the test with a
clear message referencing ETCD_ENDPOINTS and the offending endpoint). This
ensures the test enforces TLS rather than merely non-empty values.
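The stricter assertion can be sketched as a small helper; the name `assertHTTPSEndpoints` and the endpoint values are illustrative:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// assertHTTPSEndpoints parses a comma-separated ETCD_ENDPOINTS value and
// reports the first endpoint whose scheme is not https.
func assertHTTPSEndpoints(endpoints string) error {
	for _, ep := range strings.Split(endpoints, ",") {
		ep = strings.TrimSpace(ep)
		u, err := url.Parse(ep)
		if err != nil {
			return fmt.Errorf("ETCD_ENDPOINTS: cannot parse %q: %v", ep, err)
		}
		if u.Scheme != "https" {
			return fmt.Errorf("ETCD_ENDPOINTS: %q is not https", ep)
		}
	}
	return nil
}

func main() {
	fmt.Println(assertHTTPSEndpoints("https://etcd-0:2379, https://etcd-1:2379"))
	fmt.Println(assertHTTPSEndpoints("http://etcd-0:2379"))
}
```

In the test, feeding `envMap["ETCD_ENDPOINTS"].Value` through such a check fails with a message naming the offending endpoint instead of only checking non-emptiness.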
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 11eaf79a-1ab5-4d2c-9c09-359f4139cf3e
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (40)
- Dockerfile
- Makefile
- PROJECT
- api/v1alpha1/etcdbackup_types.go
- api/v1alpha1/etcdbackup_webhook.go
- api/v1alpha1/etcdbackup_webhook_test.go
- api/v1alpha1/etcdbackupschedule_types.go
- api/v1alpha1/etcdbackupschedule_webhook.go
- api/v1alpha1/etcdbackupschedule_webhook_test.go
- api/v1alpha1/webhook_suite_test.go
- api/v1alpha1/zz_generated.deepcopy.go
- charts/etcd-operator/crds/etcd-backup.yaml
- charts/etcd-operator/templates/cert-manager/validatingwebhookconfiguration.yml
- charts/etcd-operator/templates/rbac/clusterrole-manager-role.yml
- charts/etcd-operator/templates/workload/deployment.yml
- cmd/backup-agent/main.go
- cmd/manager/main.go
- config/crd/bases/etcd.aenix.io_etcdbackups.yaml
- config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml
- config/crd/bases/etcd.aenix.io_etcdclusters.yaml
- config/crd/kustomization.yaml
- config/crd/patches/cainjection_in_etcdbackups.yaml
- config/crd/patches/cainjection_in_etcdbackupschedules.yaml
- config/crd/patches/webhook_in_etcdbackups.yaml
- config/crd/patches/webhook_in_etcdbackupschedules.yaml
- config/rbac/etcdbackup_editor_role.yaml
- config/rbac/etcdbackup_viewer_role.yaml
- config/rbac/role.yaml
- config/samples/etcd.aenix.io_v1alpha1_etcdbackup.yaml
- config/samples/etcd.aenix.io_v1alpha1_etcdbackupschedule.yaml
- config/webhook/manifests.yaml
- go.mod
- internal/controller/etcdbackup_controller.go
- internal/controller/etcdbackup_controller_test.go
- internal/controller/etcdbackupschedule_controller.go
- internal/controller/etcdbackupschedule_controller_test.go
- internal/controller/factory/backup_cronjob.go
- internal/controller/factory/backup_cronjob_test.go
- internal/controller/factory/backup_job.go
- internal/controller/factory/backup_job_test.go
Actionable comments posted: 8
🧹 Nitpick comments (6)
cmd/backup-agent/main_test.go (2)
106-107: Prefer using the tls.VersionTLS12 constant over a magic number.
Using the named constant improves readability and ensures consistency with the implementation in buildTLSConfig().
♻️ Suggested improvement
Add crypto/tls to the imports:

```go
import (
	"crypto/tls"
	"path/filepath"
	"strings"
	"testing"
)
```

Then update the assertion:

```diff
-	if cfg.MinVersion != 0x0303 { // tls.VersionTLS12
-		t.Errorf("expected MinVersion TLS 1.2 (0x0303), got 0x%04x", cfg.MinVersion)
+	if cfg.MinVersion != tls.VersionTLS12 {
+		t.Errorf("expected MinVersion TLS 1.2, got 0x%04x", cfg.MinVersion)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/backup-agent/main_test.go` around lines 106 - 107, Replace the magic hex literal with the tls package constant: import "crypto/tls" in cmd/backup-agent/main_test.go and change the assertion comparing cfg.MinVersion to use tls.VersionTLS12 instead of 0x0303 (the cfg variable and the test expectation in the test that validates buildTLSConfig()) so the test reads cfg.MinVersion != tls.VersionTLS12 and the error message reflects that constant.
81-91: Test name is slightly misleading.
t.Setenv("ETCD_TLS_ENABLED", "") sets the variable to an empty string rather than unsetting it. While the behavior is the same (since buildTLSConfig checks != "true"), consider renaming to TestBuildTLSConfig_EmptyValue for accuracy, or use os.Unsetenv to truly test the unset case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/backup-agent/main_test.go` around lines 81 - 91, The test name TestBuildTLSConfig_Unset is misleading because it sets ETCD_TLS_ENABLED to an empty string rather than truly unsetting it; update the test that calls buildTLSConfig() to either (a) rename the test to TestBuildTLSConfig_EmptyValue to reflect the current behavior, or (b) actually unset the environment variable before calling buildTLSConfig by calling os.Unsetenv("ETCD_TLS_ENABLED") (or t.Unsetenv if available) and keep the name TestBuildTLSConfig_Unset so the test semantics match its name; ensure the assertion logic remains the same.
internal/controller/factory/statefulset.go (1)
461-474: Unused podEnv variable in generateRestoreInitContainers.
The podEnv slice containing POD_NAME and POD_NAMESPACE is defined but only used by restoreDatadirContainer (line 587). This appears intentional since restore-agent doesn't need pod metadata, but consider removing the variable and defining the env vars inline at line 587 to improve clarity.
♻️ Optional: Define env vars inline

```diff
-	podEnv := []corev1.EnvVar{
-		{
-			Name: "POD_NAME",
-			ValueFrom: &corev1.EnvVarSource{
-				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
-			},
-		},
-		{
-			Name: "POD_NAMESPACE",
-			ValueFrom: &corev1.EnvVarSource{
-				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.namespace"},
-			},
-		},
-	}
-
 	// Env vars for the restore-agent init container
 	restoreAgentEnv := []corev1.EnvVar{
 		{Name: "ETCD_DATA_DIR", Value: "/var/run/etcd/default.etcd"},
 	}
```

Then at line 587:

```diff
-	Env: podEnv,
+	Env: []corev1.EnvVar{
+		{
+			Name: "POD_NAME",
+			ValueFrom: &corev1.EnvVarSource{
+				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
+			},
+		},
+		{
+			Name: "POD_NAMESPACE",
+			ValueFrom: &corev1.EnvVarSource{
+				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.namespace"},
+			},
+		},
+	},
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/factory/statefulset.go` around lines 461 - 474, The podEnv variable is defined in generateRestoreInitContainers but never used except inside restoreDatadirContainer; remove the unused podEnv slice and instead add the two EnvVar entries (POD_NAME with FieldRef metadata.name and POD_NAMESPACE with FieldRef metadata.namespace) inline in the Env/EnvFrom list where restoreDatadirContainer is constructed (reference restoreDatadirContainer and generateRestoreInitContainers) so the env vars remain present but the unused podEnv symbol is eliminated.
internal/controller/factory/statefulset_test.go (1)
98-98: Add a bootstrap-restore case instead of only threading the new arg through.
All updated call sites still pass "", so the new operatorImage-gated branch in CreateOrUpdateStatefulSet never runs here. Please add one case with spec.bootstrap.restore set and a real image, plus one asserting the empty-image error, so the new parameter is actually covered.
Also applies to: 162-162, 324-324, 384-393
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/factory/statefulset_test.go` at line 98, The test currently only calls CreateOrUpdateStatefulSet(ctx, &etcdcluster, k8sClient, "") which never exercises the operatorImage-gated branch; add two test cases in statefulset_test.go that exercise CreateOrUpdateStatefulSet: one where etcdcluster.Spec.Bootstrap.Restore (or spec.bootstrap.restore) is set and you pass a non-empty operatorImage string (e.g., "operator:latest") and assert success, and another where Restore is set but you pass "" and assert the function returns the expected empty-image error; ensure these cases use the existing etcdcluster and k8sClient setup (and mirror analogous additions at the other call sites referenced) so the new parameter behavior is covered.
charts/etcd-operator/crds/etcd-cluster.yaml (1)
60-116: Consider adding a schema-level XOR constraint for the source field.
The bootstrap.restore.source field allows both s3 and pvc to be set (or neither) at the schema level. While the webhook (validateBackupDestination) enforces exactly-one semantics, this validation can be bypassed via --dry-run, when webhooks are unavailable, or through direct etcd writes.
Adding OpenAPI constraints provides defense-in-depth:
♻️ Suggested schema enhancement

```diff
       properties:
         pvc:
           # ... pvc definition ...
         s3:
           # ... s3 definition ...
       type: object
+      minProperties: 1
+      maxProperties: 1
```

Alternatively, you can use oneOf with separate schemas for each destination type if more complex validation is needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/etcd-operator/crds/etcd-cluster.yaml` around lines 60 - 116, The schema for bootstrap.restore.source currently allows both or neither of s3 and pvc; add an OpenAPI XOR constraint (use oneOf with two object schemas—one requiring s3 and forbidding pvc, the other requiring pvc and forbidding s3—or use anyOf + a not to forbid both) to enforce exactly-one semantics at the CRD level for bootstrap.restore.source; reference the existing properties named "s3" and "pvc" and ensure the new oneOf/anyOf block mirrors the existing nested schemas and required fields so it complements the validateBackupDestination webhook.
internal/controller/etcdbackupschedule_controller.go (1)
229-235: Minor: Condition message could be more precise.
The message "is active" at line 233 could be interpreted as the CronJob currently running Jobs (the technical meaning of Status.Active in CronJob). Consider using "exists" or "is configured" for clarity.
♻️ Suggested change

```diff
 	meta.SetStatusCondition(&schedule.Status.Conditions, metav1.Condition{
 		Type:               etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady,
 		Status:             metav1.ConditionTrue,
 		Reason:             "CronJobReady",
-		Message:            fmt.Sprintf("CronJob %q is active", existingCronJob.Name),
+		Message:            fmt.Sprintf("CronJob %q is configured", existingCronJob.Name),
 		ObservedGeneration: schedule.Generation,
 	})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/etcdbackupschedule_controller.go` around lines 229 - 235, Update the readiness condition message to avoid ambiguity around CronJob.Status.Active: in the call to meta.SetStatusCondition that updates schedule.Status.Conditions (the metav1.Condition with Type etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady and Reason "CronJobReady"), change the Message from fmt.Sprintf("CronJob %q is active", existingCronJob.Name) to a clearer phrasing such as fmt.Sprintf("CronJob %q exists", existingCronJob.Name) or "is configured" so it unambiguously indicates the CronJob resource presence rather than running jobs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/etcdbackup_webhook.go`:
- Around line 54-65: The validation currently enforces a 52-char maxJobNameLen
for Jobs (using const jobSuffix = "-backup" and computing maxNameLen :=
maxJobNameLen - len(jobSuffix)), which is wrong for GetBackupJobName() because
it creates a plain Job (63-char limit). Update the logic in
etcdbackup_webhook.go to use the Job name limit of 63 (set maxJobNameLen to 63
or otherwise compute maxNameLen from 63), recompute maxNameLen := maxJobNameLen
- len(jobSuffix), and keep the field.Invalid append for
field.NewPath("metadata","name")/r.Name but with the corrected limits and
message.
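The corrected length budget can be sketched with stdlib-only code. Constant names follow the prompt; the helper function is hypothetical. The point is that a plain Job may use the full 63-character DNS-1123 label limit, so only the "-backup" suffix needs to be reserved (the 52-character budget applies to CronJobs, whose spawned Jobs get an extra timestamp suffix):

```go
package main

import "fmt"

const (
	maxJobNameLen = 63        // plain Job name limit (DNS-1123 label)
	jobSuffix     = "-backup" // suffix appended by GetBackupJobName()
)

// validateBackupName is a sketch of the corrected webhook check: the
// EtcdBackup name just has to leave room for the Job suffix.
func validateBackupName(name string) error {
	maxNameLen := maxJobNameLen - len(jobSuffix) // 56
	if len(name) > maxNameLen {
		return fmt.Errorf("metadata.name %q must be at most %d characters so the generated Job name fits within %d", name, maxNameLen, maxJobNameLen)
	}
	return nil
}

func main() {
	fmt.Println(validateBackupName("my-backup") == nil) // true
}
```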
In `@cmd/restore-agent/main.go`:
- Around line 139-155: The copyFromPVC() path currently opens PVC_BACKUP_PATH
and calls writeSnapshot without any timeout or cancellation; update copyFromPVC
to accept/create a context with a deadline (e.g., context.WithTimeout) and use
that context to guard the read operation by either (a) changing writeSnapshot to
accept a context and perform a context-aware copy (using io.CopyN with periodic
context checks or a context-aware reader) or (b) wrapping the src file in a
reader that checks the context and aborts on timeout, ensuring src is closed on
cancel; make sure error messages include context cancellation/timeout semantics
so the init container won't hang on slow NFS PVCs.
In `@config/samples/etcd.aenix.io_v1alpha1_etcdcluster_restore.yaml`:
- Around line 6-12: The sample uses bootstrap.restore with a PVC source
(source.pvc.claimName and subPath) while replicas is set to 3, which will cause
every StatefulSet pod to mount the same backup PVC (ReadWriteOnce) and likely
stall scheduling; change the example to replicas: 1 when using a local PVC
restore or replace the PVC-based source with an S3/object-store restore example,
and if you keep PVC-based restore document the recommended workflow: set
replicas to 1 for the restore, verify the cluster recovered, then scale the
EtcdCluster up to the desired replica count.
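A sketch of the safer sample shape described above, assuming the field names used elsewhere in this PR (bootstrap.restore.source.pvc); the claim name and subPath are placeholders:

```yaml
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: restored
spec:
  replicas: 1            # restore with a single member, scale up after recovery
  bootstrap:
    restore:
      source:
        pvc:
          claimName: etcd-backups        # hypothetical ReadWriteOnce claim
          subPath: snapshots/latest.db   # hypothetical snapshot path
```

With replicas: 1 only one pod mounts the backup PVC, so a ReadWriteOnce claim no longer blocks scheduling of the other members.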
In `@internal/controller/etcdbackup_controller.go`:
- Around line 92-100: The code currently trusts any Job with the generated
jobName; after fetching existingJob in r.Get, check that the Job is owned by the
EtcdBackup (use metav1.IsControlledBy(existingJob, backup) or inspect
existingJob.OwnerReferences for the EtcdBackup UID/kind) before calling
r.reconcileJobStatus; if the Job is not controller-owned, do not adopt it—return
a non-adopting result (e.g., return an error or skip reconciliation with a clear
log) so only Jobs with a proper owner reference to the EtcdBackup are used.
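The adoption guard described above reduces to a controller-owner-reference check. A stdlib-only sketch with minimal stand-in types (in the real controller this would be metav1.IsControlledBy / metav1.GetControllerOf on the fetched Job):

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes types involved.
type OwnerReference struct {
	Kind       string
	Name       string
	UID        string
	Controller bool
}

type Object struct {
	UID       string
	Name      string
	Kind      string
	OwnerRefs []OwnerReference
}

// isControlledBy mirrors the suggested guard: only treat an existing Job as
// ours if its controller owner reference points back at this EtcdBackup.
func isControlledBy(obj Object, owner Object) bool {
	for _, ref := range obj.OwnerRefs {
		if ref.Controller && ref.UID == owner.UID && ref.Kind == owner.Kind && ref.Name == owner.Name {
			return true
		}
	}
	return false
}

func main() {
	backup := Object{UID: "abc", Name: "nightly", Kind: "EtcdBackup"}
	job := Object{OwnerRefs: []OwnerReference{{Kind: "EtcdBackup", Name: "nightly", UID: "abc", Controller: true}}}
	fmt.Println(isControlledBy(job, backup)) // true
}
```

Comparing the UID (not just name/kind) matters: a Job left over from a deleted-and-recreated EtcdBackup of the same name must not be adopted.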
In `@internal/controller/etcdbackupschedule_controller_test.go`:
- Around line 139-167: The test never changes the EtcdBackupSchedule spec before
the second Reconcile so it doesn't exercise the update path; after the first
Reconcile (which creates the CronJob) fetch and mutate the EtcdBackupSchedule's
spec.schedule to a different cron string, persist that change via the k8s
client, then call reconciler.Reconcile again and assert that
factory.GetBackupCronJobName(schedule) now has the updated CronJob.Spec.Schedule
and that the EtcdBackupSchedule status Ready condition is true; this will
exercise reconcileExistingCronJob/cronJobNeedsUpdate rather than just the
"already exists" branch.
In `@internal/controller/factory/backup_cronjob.go`:
- Around line 57-61: The factory in
internal/controller/factory/backup_cronjob.go unconditionally sets
Spec.ConcurrencyPolicy = batchv1.ForbidConcurrent but cronJobNeedsUpdate in
internal/controller/etcdbackupschedule_controller.go does not compare that
field, causing drift; update cronJobNeedsUpdate to compare
desired.Spec.ConcurrencyPolicy against actual.Spec.ConcurrencyPolicy (and return
true when they differ) so reconciler will reset the value to ForbidConcurrent,
or alternatively add a comparator helper that explicitly checks
ConcurrencyPolicy alongside the other spec fields referenced in
cronJobNeedsUpdate.
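The drift-detection idea reduces to "compare every operator-owned field". A stdlib sketch with a stand-in struct (the real comparator works on batchv1.CronJob and uses apimachinery's equality.Semantic.DeepEqual rather than plain reflect):

```go
package main

import (
	"fmt"
	"reflect"
)

// cronSpec stands in for the subset of CronJob spec the operator owns.
type cronSpec struct {
	Schedule          string
	ConcurrencyPolicy string
}

// needsUpdate sketches the missing check: any field the factory sets
// (here ConcurrencyPolicy = "Forbid") must be compared, or manual edits
// will never be reverted by the reconciler.
func needsUpdate(existing, desired cronSpec) bool {
	return !reflect.DeepEqual(existing, desired)
}

func main() {
	desired := cronSpec{Schedule: "0 * * * *", ConcurrencyPolicy: "Forbid"}
	drifted := cronSpec{Schedule: "0 * * * *", ConcurrencyPolicy: "Allow"}
	fmt.Println(needsUpdate(drifted, desired)) // true
}
```

Field-by-field comparison (as cronJobNeedsUpdate does) avoids fighting defaulting on fields the operator does not own, which is why the review suggests adding individual checks rather than DeepEqual on the whole spec.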
In `@internal/controller/factory/backup_job.go`:
- Around line 68-75: The PodSecurityContext currently sets
RunAsNonRoot/RunAsUser/RunAsGroup but omits FSGroup, which causes PVC mounts to
remain root-owned and makes writeToPVC() fail with permission denied; update the
PodSecurityContext in the backup job spec to include FSGroup:
ptr.To(int64(65532)) (matching RunAsGroup/RunAsUser) so the PVC volume gets the
proper group ownership, and apply the same change to the other backup job
PodSecurityContext instance referenced around lines ~180-200.
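In manifest form, the pod-level security context the prompt asks for would look roughly like this (values follow the 65532 UID/GID already used in the PR; fsGroup is the addition):

```yaml
securityContext:          # pod-level PodSecurityContext on the backup Job
  runAsNonRoot: true
  runAsUser: 65532
  runAsGroup: 65532
  fsGroup: 65532          # kubelet chowns the PVC mount to this group,
                          # so the non-root agent can write the snapshot
```

Without fsGroup, most storage classes mount the volume root-owned and the non-root backup-agent gets permission denied on write.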
- Around line 147-177: The S3 backup container used by uploadToS3() only
requests CPU/memory but writes a full snapshot to disk; update the backup Job
spec created in backup_job.go to add ephemeral-storage resource requests/limits
and a dedicated emptyDir scratch volume (mount it into the container and set
TMPDIR to that mount) so large uploads use node local scratch and avoid
eviction; locate the S3 handling branch (the block that builds envVars for S3
and the uploadToS3() caller) and: 1) add resources.requests["ephemeral-storage"]
and resources.limits["ephemeral-storage"] to the container spec, 2) add a volume
of type emptyDir and a corresponding volumeMount (e.g., /tmp/backup-scratch),
and 3) add/override env var TMPDIR (or BACKUP_TMPDIR) to point to that mount;
apply the same changes for the other S3 branch referenced around lines 203-217.
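As a manifest fragment, the three changes the prompt lists would look roughly like this (mount path and sizes are assumptions, not values from the PR):

```yaml
containers:
  - name: backup-agent
    env:
      - name: TMPDIR                    # snapshot temp file lands on scratch
        value: /tmp/backup-scratch
    resources:
      requests:
        ephemeral-storage: 2Gi          # budget for the local snapshot copy
      limits:
        ephemeral-storage: 4Gi
    volumeMounts:
      - name: backup-scratch
        mountPath: /tmp/backup-scratch
volumes:
  - name: backup-scratch
    emptyDir: {}                        # node-local scratch for the upload
```

Declaring the ephemeral-storage request lets the scheduler place the pod on a node with room for the snapshot instead of evicting it mid-upload.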
---
Nitpick comments:
In `@charts/etcd-operator/crds/etcd-cluster.yaml`:
- Around line 60-116: The schema for bootstrap.restore.source currently allows
both or neither of s3 and pvc; add an OpenAPI XOR constraint (use oneOf with two
object schemas—one requiring s3 and forbidding pvc, the other requiring pvc and
forbidding s3—or use anyOf + a not to forbid both) to enforce exactly-one
semantics at the CRD level for bootstrap.restore.source; reference the existing
properties named "s3" and "pvc" and ensure the new oneOf/anyOf block mirrors the
existing nested schemas and required fields so it complements the
validateBackupDestination webhook.
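The oneOf variant mentioned in the prompt can be sketched as follows; two `required` branches give XOR semantics on their own (if both s3 and pvc are set, both branches match and oneOf fails; if neither is set, no branch matches). Whether this survives kubebuilder's structural-schema generation unchanged is an assumption to verify:

```yaml
source:
  type: object
  properties:
    s3:
      type: object    # existing s3 schema elided
    pvc:
      type: object    # existing pvc schema elided
  oneOf:              # exactly one of s3 / pvc must be present
    - required: ["s3"]
    - required: ["pvc"]
```

The simpler minProperties: 1 / maxProperties: 1 pair from the inline diff achieves the same effect as long as s3 and pvc remain the only properties.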
In `@cmd/backup-agent/main_test.go`:
- Around line 106-107: Replace the magic hex literal with the tls package
constant: import "crypto/tls" in cmd/backup-agent/main_test.go and change the
assertion comparing cfg.MinVersion to use tls.VersionTLS12 instead of 0x0303
(the cfg variable and the test expectation in the test that validates
buildTLSConfig()) so the test reads cfg.MinVersion != tls.VersionTLS12 and the
error message reflects that constant.
- Around line 81-91: The test name TestBuildTLSConfig_Unset is misleading
because it sets ETCD_TLS_ENABLED to an empty string rather than truly unsetting
it; update the test that calls buildTLSConfig() to either (a) rename the test to
TestBuildTLSConfig_EmptyValue to reflect the current behavior, or (b) actually
unset the environment variable before calling buildTLSConfig by calling
os.Unsetenv("ETCD_TLS_ENABLED") (or t.Unsetenv if available) and keep the name
TestBuildTLSConfig_Unset so the test semantics match its name; ensure the
assertion logic remains the same.
In `@internal/controller/etcdbackupschedule_controller.go`:
- Around line 229-235: Update the readiness condition message to avoid ambiguity
around CronJob.Status.Active: in the call to meta.SetStatusCondition that
updates schedule.Status.Conditions (the metav1.Condition with Type
etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady and Reason "CronJobReady"),
change the Message from fmt.Sprintf("CronJob %q is active",
existingCronJob.Name) to a clearer phrasing such as fmt.Sprintf("CronJob %q
exists", existingCronJob.Name) or "is configured" so it unambiguously indicates
the CronJob resource presence rather than running jobs.
In `@internal/controller/factory/statefulset_test.go`:
- Line 98: The test currently only calls CreateOrUpdateStatefulSet(ctx,
&etcdcluster, k8sClient, "") which never exercises the operatorImage-gated
branch; add two test cases in statefulset_test.go that exercise
CreateOrUpdateStatefulSet: one where etcdcluster.Spec.Bootstrap.Restore (or
spec.bootstrap.restore) is set and you pass a non-empty operatorImage string
(e.g., "operator:latest") and assert success, and another where Restore is set
but you pass "" and assert the function returns the expected empty-image error;
ensure these cases use the existing etcdcluster and k8sClient setup (and mirror
analogous additions at the other call sites referenced) so the new parameter
behavior is covered.
In `@internal/controller/factory/statefulset.go`:
- Around line 461-474: The podEnv variable is defined in
generateRestoreInitContainers but never used except inside
restoreDatadirContainer; remove the unused podEnv slice and instead add the two
EnvVar entries (POD_NAME with FieldRef metadata.name and POD_NAMESPACE with
FieldRef metadata.namespace) inline in the Env/EnvFrom list where
restoreDatadirContainer is constructed (reference restoreDatadirContainer and
generateRestoreInitContainers) so the env vars remain present but the unused
podEnv symbol is eliminated.

ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b6b9a4e9-1c43-4f3f-9462-b3ebc7a464b3
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (33)
Dockerfile, Makefile, api/v1alpha1/etcdbackup_webhook.go, api/v1alpha1/etcdbackup_webhook_test.go, api/v1alpha1/etcdbackupschedule_webhook.go, api/v1alpha1/etcdbackupschedule_webhook_test.go, api/v1alpha1/etcdcluster_types.go, api/v1alpha1/etcdcluster_webhook.go, api/v1alpha1/etcdcluster_webhook_test.go, api/v1alpha1/zz_generated.deepcopy.go, charts/etcd-operator/crds/etcd-backup-schedule.yaml, charts/etcd-operator/crds/etcd-backup.yaml, charts/etcd-operator/crds/etcd-cluster.yaml, cmd/backup-agent/main.go, cmd/backup-agent/main_test.go, cmd/manager/main.go, cmd/restore-agent/main.go, cmd/restore-agent/main_test.go, config/crd/bases/etcd.aenix.io_etcdclusters.yaml, config/samples/etcd.aenix.io_v1alpha1_etcdcluster_restore.yaml, go.mod, internal/controller/cronjob_update_test.go, internal/controller/etcdbackup_controller.go, internal/controller/etcdbackup_controller_test.go, internal/controller/etcdbackupschedule_controller.go, internal/controller/etcdbackupschedule_controller_test.go, internal/controller/etcdcluster_controller.go, internal/controller/factory/backup_cronjob.go, internal/controller/factory/backup_job.go, internal/controller/factory/backup_job_test.go, internal/controller/factory/statefulset.go, internal/controller/factory/statefulset_restore_test.go, internal/controller/factory/statefulset_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- Dockerfile
- api/v1alpha1/etcdbackup_webhook_test.go
- internal/controller/etcdbackup_controller_test.go
- Makefile
- cmd/backup-agent/main.go
- api/v1alpha1/etcdbackupschedule_webhook_test.go
sircthulhu left a comment
Review Summary
Good architecture overall — two CRDs, Job-based execution, CronJob for scheduling. This aligns well with patterns from CloudNativePG and Percona operators.
I've compared the implementation with coreos/etcd-operator, CloudNativePG, and Percona XtraDB Cluster Operator. Key findings are in the inline comments below.
Critical
- Missing FSGroup in PodSecurityContext — PVC writes will fail on most storage classes
- Conversion webhook patch is invalid — no conversion handlers exist
- No ownership check when adopting existing Job
- Backup file naming — operator should own naming, user should only specify directory/prefix
Important
- cronJobNeedsUpdate doesn't compare ConcurrencyPolicy
- No ephemeral-storage resource requests for the S3 temp file
- PVC restore sample uses replicas: 3 with a potentially ReadWriteOnce PVC
- Job name length limit uses 52 (CronJob limit) instead of 63 (Job limit) for EtcdBackup
Suggestions
- S3/PVC env var generation is duplicated between backup_job.go and statefulset.go
- No activeDeadlineSeconds on backup Job
- TTL of 600s is aggressive for debugging
Actionable comments posted: 3
♻️ Duplicate comments (1)
internal/controller/factory/backup_job.go (1)
150-180: ⚠️ Potential issue | 🟠 Major
Budget scratch space for S3 uploads.
The S3 path writes a full snapshot to local temp storage before upload, but this container only requests CPU and memory. Large backups can be evicted or fail mid-upload once node ephemeral storage is tight. Because buildBackupContainer() is shared, fixing it here also covers scheduled backups.
Proposed fix

```diff
 	if s3 := destination.S3; s3 != nil {
+		volumes = append(volumes, corev1.Volume{
+			Name: "backup-scratch",
+			VolumeSource: corev1.VolumeSource{
+				EmptyDir: &corev1.EmptyDirVolumeSource{},
+			},
+		})
+		volumeMounts = append(volumeMounts, corev1.VolumeMount{
+			Name:      "backup-scratch",
+			MountPath: "/tmp/backup-scratch",
+		})
 		envVars = append(envVars,
 			corev1.EnvVar{Name: "BACKUP_DESTINATION", Value: "s3"},
+			corev1.EnvVar{Name: "TMPDIR", Value: "/tmp/backup-scratch"},
 			corev1.EnvVar{Name: "S3_ENDPOINT", Value: s3.Endpoint},
 			corev1.EnvVar{Name: "S3_BUCKET", Value: s3.Bucket},
 			corev1.EnvVar{Name: "S3_KEY", Value: s3.Key},
 			corev1.EnvVar{Name: "S3_REGION", Value: s3.Region},
@@
 		Resources: corev1.ResourceRequirements{
 			Requests: corev1.ResourceList{
 				corev1.ResourceCPU:    resource.MustParse("100m"),
 				corev1.ResourceMemory: resource.MustParse("128Mi"),
+				corev1.ResourceEphemeralStorage: resource.MustParse("2Gi"),
 			},
 			Limits: corev1.ResourceList{
 				corev1.ResourceMemory: resource.MustParse("512Mi"),
+				corev1.ResourceEphemeralStorage: resource.MustParse("4Gi"),
 			},
 		},
```

Also applies to: 206-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/factory/backup_job.go` around lines 150 - 180, The container used for S3 uploads needs ephemeral scratch space and ephemeral-storage resource requests/limits to avoid eviction; in buildBackupContainer() add an emptyDir volume (e.g., name "backup-scratch") mounted into the container at a scratch path (e.g., "/tmp/backup") and set container.Resources.Requests["ephemeral-storage"] and .Limits["ephemeral-storage"] to sensible values (and keep CPU/memory as before); apply the same change to the other S3-handling branch/usage referenced around lines 206-220 so all S3 backup containers get the scratch volume mount and ephemeral-storage resource settings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/etcdbackupschedule_controller.go`:
- Around line 171-173: After successfully fetching an existing CronJob, verify
it is actually owned by the EtcdBackupSchedule before calling
reconcileExistingCronJob: use controllerutil.GetControllerOf(existingCronJob)
(or metav1.GetControllerOf) and compare the returned OwnerReference's UID (and
Kind/Name) to the schedule object; if there is no controller reference or it
does not match the schedule, return an error or requeue (do not call
reconcileExistingCronJob) so we don't mutate or mirror status for CronJobs this
schedule does not own.
- Around line 42-92: cronJobNeedsUpdate currently misses several operator-owned
JobTemplate fields; update it to also detect drift in
JobTemplate.Spec.JobSpec.BackoffLimit,
JobTemplate.Spec.Template.Spec.RestartPolicy,
JobTemplate.Spec.Template.Spec.SecurityContext (pod-level) and container-level
SecurityContext and Resources. For BackoffLimit and RestartPolicy do simple
value comparisons, and for SecurityContext and Resources use reflect.DeepEqual
when comparing existing.Spec.JobTemplate.Spec.Template.Spec.SecurityContext to
desired and for each container compare existingContainers[i].SecurityContext and
existingContainers[i].Resources to desiredContainers[i].SecurityContext and
desiredContainers[i].Resources; return true if any of those differ. Ensure you
reference cronJobNeedsUpdate, existing.Spec.JobTemplate.Spec and
desired.Spec.JobTemplate.Spec, and the container loop when adding these checks.
In `@internal/controller/factory/backup_cronjob.go`:
- Around line 62-79: The CronJob JobTemplate's PodSpec lacks an
ActiveDeadlineSeconds setting, so hung backup-agent pods can run indefinitely
and block subsequent schedules; update the PodTemplate Spec inside the CronJob
(in backup_cronjob.go) to include ActiveDeadlineSeconds (e.g., add
ActiveDeadlineSeconds: ptr.To(int64(<same value used in CreateBackupJob>))) on
the corev1.PodSpec for the JobTemplate so it matches the deadline used by
CreateBackupJob and ensures jobs are forcibly terminated after the configured
timeout.
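In manifest form, the deadline described above would sit on the CronJob's job template roughly like this (the 1800-second value and image are assumptions for illustration; JobSpec-level activeDeadlineSeconds is shown, which bounds the whole Job including retries):

```yaml
jobTemplate:
  spec:
    activeDeadlineSeconds: 1800   # forcibly terminate hung backups after 30 min
    backoffLimit: 3
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: backup-agent
            image: etcd-operator:latest   # hypothetical operator image
```

Combined with ConcurrencyPolicy: Forbid, this guarantees a stuck run cannot block every subsequent schedule indefinitely.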
---
Duplicate comments:
In `@internal/controller/factory/backup_job.go`:
- Around line 150-180: The container used for S3 uploads needs ephemeral scratch
space and ephemeral-storage resource requests/limits to avoid eviction; in
buildBackupContainer() add an emptyDir volume (e.g., name "backup-scratch")
mounted into the container at a scratch path (e.g., "/tmp/backup") and set
container.Resources.Requests["ephemeral-storage"] and
.Limits["ephemeral-storage"] to sensible values (and keep CPU/memory as before);
apply the same change to the other S3-handling branch/usage referenced around
lines 206-220 so all S3 backup containers get the scratch volume mount and
ephemeral-storage resource settings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bbe6ef50-b578-43c6-ac7e-69372c31ec1d
📒 Files selected for processing (9)
api/v1alpha1/etcdbackup_webhook.go, api/v1alpha1/etcdbackup_webhook_test.go, config/crd/patches/webhook_in_etcdbackups.yaml, config/crd/patches/webhook_in_etcdbackupschedules.yaml, config/samples/etcd.aenix.io_v1alpha1_etcdcluster_restore.yaml, internal/controller/etcdbackup_controller.go, internal/controller/etcdbackupschedule_controller.go, internal/controller/factory/backup_cronjob.go, internal/controller/factory/backup_job.go
🚧 Files skipped from review as they are similar to previous changes (1)
- api/v1alpha1/etcdbackup_webhook_test.go
♻️ Duplicate comments (2)
internal/controller/etcdbackupschedule_controller.go (2)
171-173: ⚠️ Potential issue | 🟠 Major
Guard against same-name CronJobs not controlled by this schedule.
Line 173 reconciles by name only. If a CronJob with that name is user-managed or owned by another controller, this reconciler can mutate and mirror status for the wrong object.
Proposed fix
```diff
 	err := r.Get(ctx, types.NamespacedName{Name: cronJobName, Namespace: schedule.Namespace}, existingCronJob)
 	if err == nil {
+		owner := metav1.GetControllerOf(existingCronJob)
+		if owner == nil || owner.UID != schedule.UID || owner.Kind != "EtcdBackupSchedule" || owner.Name != schedule.Name {
+			message := fmt.Sprintf("CronJob %q already exists and is not controlled by this EtcdBackupSchedule", cronJobName)
+			meta.SetStatusCondition(&schedule.Status.Conditions, metav1.Condition{
+				Type:               etcdaenixiov1alpha1.EtcdBackupScheduleConditionFailed,
+				Status:             metav1.ConditionTrue,
+				Reason:             "CronJobNameConflict",
+				Message:            message,
+				ObservedGeneration: schedule.Generation,
+			})
+			meta.SetStatusCondition(&schedule.Status.Conditions, metav1.Condition{
+				Type:               etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady,
+				Status:             metav1.ConditionFalse,
+				Reason:             "CronJobNameConflict",
+				Message:            message,
+				ObservedGeneration: schedule.Generation,
+			})
+			return r.updateStatus(ctx, schedule)
+		}
 		return r.reconcileExistingCronJob(ctx, schedule, cluster, existingCronJob)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/etcdbackupschedule_controller.go` around lines 171 - 173, The current code calls r.Get and then immediately returns r.reconcileExistingCronJob when a CronJob with cronJobName exists, but it does not check ownership and may mutate user-managed or foreign-controller objects; modify the logic after r.Get(...) to verify ownership (e.g., use controllerutil.IsControlledBy(existingCronJob, schedule) or inspect existingCronJob.GetOwnerReferences() for a matching UID/kind) and only call r.reconcileExistingCronJob when the CronJob is owned by this EtcdBackupSchedule; if not owned, skip reconcile (log/record an event and return without mutating) to avoid touching same-name CronJobs managed by others.
42-92: ⚠️ Potential issue | 🟠 Major
Detect drift for all operator-owned JobTemplate security/runtime fields.
Line 42 currently misses drift checks for BackoffLimit, pod RestartPolicy, pod SecurityContext, container SecurityContext, and container Resources. Manual edits to these fields will not be corrected.
Proposed fix
```diff
 func cronJobNeedsUpdate(existing, desired *batchv1.CronJob) bool {
 	if existing.Spec.Schedule != desired.Spec.Schedule {
 		return true
 	}
 	if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy {
 		return true
 	}
 	if !equality.Semantic.DeepEqual(existing.Spec.SuccessfulJobsHistoryLimit, desired.Spec.SuccessfulJobsHistoryLimit) {
 		return true
 	}
 	if !equality.Semantic.DeepEqual(existing.Spec.FailedJobsHistoryLimit, desired.Spec.FailedJobsHistoryLimit) {
 		return true
 	}
+	if !equality.Semantic.DeepEqual(existing.Spec.JobTemplate.Spec.BackoffLimit, desired.Spec.JobTemplate.Spec.BackoffLimit) {
+		return true
+	}
 	if !equality.Semantic.DeepEqual(existing.Labels, desired.Labels) {
 		return true
 	}
+
+	existingPodSpec := existing.Spec.JobTemplate.Spec.Template.Spec
+	desiredPodSpec := desired.Spec.JobTemplate.Spec.Template.Spec
+	if existingPodSpec.RestartPolicy != desiredPodSpec.RestartPolicy {
+		return true
+	}
+	if !equality.Semantic.DeepEqual(existingPodSpec.SecurityContext, desiredPodSpec.SecurityContext) {
+		return true
+	}
 	existingContainers := existing.Spec.JobTemplate.Spec.Template.Spec.Containers
 	desiredContainers := desired.Spec.JobTemplate.Spec.Template.Spec.Containers
 	if len(existingContainers) != len(desiredContainers) {
 		return true
@@
 		if !equality.Semantic.DeepEqual(existingContainers[i].VolumeMounts, desiredContainers[i].VolumeMounts) {
 			return true
 		}
+		if !equality.Semantic.DeepEqual(existingContainers[i].SecurityContext, desiredContainers[i].SecurityContext) {
+			return true
+		}
+		if !equality.Semantic.DeepEqual(existingContainers[i].Resources, desiredContainers[i].Resources) {
+			return true
+		}
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/etcdbackupschedule_controller.go` around lines 42 - 92, cronJobNeedsUpdate currently misses drift detection for JobTemplate backoff limit, pod restart policy and pod security context, plus each container's SecurityContext and Resources; update the function to compare existing.Spec.JobTemplate.Spec.BackoffLimit (use equality.Semantic.DeepEqual), existing.Spec.JobTemplate.Spec.Template.Spec.RestartPolicy (direct compare), existing.Spec.JobTemplate.Spec.Template.Spec.SecurityContext (DeepEqual), and for each container also compare existingContainers[i].SecurityContext and existingContainers[i].Resources to desiredContainers[i].SecurityContext/Resources with equality.Semantic.DeepEqual so manual edits to those fields are detected and trigger updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/controller/etcdbackupschedule_controller.go`:
- Around line 171-173: The current code calls r.Get and then immediately returns
r.reconcileExistingCronJob when a CronJob with cronJobName exists, but it does
not check ownership and may mutate user-managed or foreign-controller objects;
modify the logic after r.Get(...) to verify ownership (e.g., use
controllerutil.IsControlledBy(existingCronJob, schedule) or inspect
existingCronJob.GetOwnerReferences() for a matching UID/kind) and only call
r.reconcileExistingCronJob when the CronJob is owned by this EtcdBackupSchedule;
if not owned, skip reconcile (log/record an event and return without mutating)
to avoid touching same-name CronJobs managed by others.
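The ownership check described above can be sketched without Kubernetes dependencies; the `ownerRef`/`object` types below are hypothetical stand-ins for `metav1.OwnerReference` and the UID fields that the real `metav1.IsControlledBy` inspects.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for metav1.OwnerReference fields.
type ownerRef struct {
	UID        string
	Controller bool
}

type object struct {
	UID    string
	Owners []ownerRef
}

// isControlledBy mirrors the suggested guard: only reconcile a CronJob whose
// controller owner-reference UID matches this EtcdBackupSchedule's UID.
func isControlledBy(obj, owner object) bool {
	for _, ref := range obj.Owners {
		if ref.Controller && ref.UID == owner.UID {
			return true
		}
	}
	return false
}

func main() {
	schedule := object{UID: "uid-1"}
	owned := object{Owners: []ownerRef{{UID: "uid-1", Controller: true}}}
	foreign := object{Owners: []ownerRef{{UID: "uid-2", Controller: true}}}
	fmt.Println(isControlledBy(owned, schedule))   // true: safe to reconcile
	fmt.Println(isControlledBy(foreign, schedule)) // false: skip and record an event
}
```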
- Around line 42-92: cronJobNeedsUpdate currently misses drift detection for
JobTemplate backoff limit, pod restart policy and pod security context, plus
each container's SecurityContext and Resources; update the function to compare
existing.Spec.JobTemplate.Spec.BackoffLimit (use equality.Semantic.DeepEqual),
existing.Spec.JobTemplate.Spec.Template.Spec.RestartPolicy (direct compare),
existing.Spec.JobTemplate.Spec.Template.Spec.SecurityContext (DeepEqual), and
for each container also compare existingContainers[i].SecurityContext and
existingContainers[i].Resources to
desiredContainers[i].SecurityContext/Resources with equality.Semantic.DeepEqual
so manual edits to those fields are detected and trigger updates.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 451791ed-1356-48b5-b189-0933c47c22a1
📒 Files selected for processing (3)
- api/v1alpha1/etcdbackup_webhook.go
- api/v1alpha1/etcdcluster_webhook.go
- internal/controller/etcdbackupschedule_controller.go
🚧 Files skipped from review as they are similar to previous changes (1)
- api/v1alpha1/etcdbackup_webhook.go
Add EtcdBackup custom resource definition with spec and status types. Supports S3 and PVC backup destinations with TLS-aware etcd connections. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Validate create (exactly one of S3/PVC, required fields, subPath path traversal protection) and update (immutable spec, status-only updates allowed). Register webhook in test suite. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Standalone binary that takes etcd snapshots and stores them to S3-compatible storage or PVC. Supports TLS client certificates, creates intermediate directories for SubPath, and ensures data durability via fsync before close. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Build deterministic backup Jobs from EtcdBackup and EtcdCluster specs. Handles TLS volume mounts, S3 credential injection via secretKeyRef, PVC mounting, and endpoint construction from cluster replicas. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Reconcile EtcdBackup resources by creating backup Jobs and tracking their status via conditions (Started, Complete, Failed). Sets ObservedGeneration on all conditions. Watches owned Jobs for status changes. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Generated CRD, RBAC role, and webhook manifests. Add kustomize patches for webhook and CA injection. Add editor/viewer ClusterRoles and sample CR. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Build backup-agent binary alongside manager in a single RUN layer. Add build-backup-agent target and update helm-crd-copy to include EtcdBackup CRD. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Add RBAC rules for etcdbackups and batch/jobs. Add validating webhook entry. Inject OPERATOR_IMAGE env var into manager deployment. Copy generated CRD to chart crds directory. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Make S3 path style configurable via ForcePathStyle CRD field instead of hardcoding UsePathStyle=true. Include backup name in default PVC path to prevent overwrites when multiple backups target the same PVC. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Add EtcdBackupSchedule CRD that creates a Kubernetes CronJob to run periodic etcd snapshots. Reuses BackupDestination types and backup-agent pod spec from EtcdBackup via extracted buildBackupContainer helper. Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Fix TLS CA path: use ServerTrustedCASecret instead of ServerSecret for CA certificate volume in backup jobs
- Fix CronJob update loop: replace reflect.DeepEqual on full spec with field-by-field comparison of operator-controlled fields only
- Add CronJob name length validation to EtcdBackupSchedule webhook (52 char limit minus 17 char suffix)
- Set TLS MinVersion to 1.2 in backup-agent
- Fix S3 upload for large snapshots by writing to temp file first to provide Content-Length header
- Fix scheduled backup file overwrite: inject UTC timestamp into filenames when BACKUP_TIMESTAMP=true
- Extract shared validateBackupDestination() function and use it directly in EtcdBackupSchedule webhook

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Add spec.bootstrap.restore field to EtcdCluster CRD, allowing users to create new etcd clusters pre-populated with data from a backup snapshot (S3 or PVC). Uses a two-init-container architecture:

1. restore-agent (operator image): downloads snapshot to shared volume
2. restore-datadir (etcd image): runs etcdutl snapshot restore

Both init containers check for existing data directory for idempotency. Bootstrap field is immutable after cluster creation.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Copy all CRDs (including EtcdBackupSchedule) to Helm chart
- Replace cron regex with robfig/cron/v3 parser for proper validation
- Validate OPERATOR_IMAGE env var at startup (fail-fast)
- Make backup/restore agent timeouts configurable via env vars

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add empty snapshot check in restore-agent (reject 0-byte snapshots)
- Pass actual error to log.Error for OPERATOR_IMAGE validation
- Set TTLSecondsAfterFinished (600s) on backup Jobs for auto-cleanup
- Add unit tests for backup-agent (injectTimestamp, buildTLSConfig) and restore-agent (getTimeout, copyFromPVC, run idempotency)

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add zero-byte snapshot detection in backup-agent (both S3 and PVC paths) to prevent writing unrestorable backups
- Fix cronJobNeedsUpdate to compare only operator-set container fields (Image, Command, Env, VolumeMounts) instead of full container spec, avoiding infinite reconcile loop from API server defaults
- Run go mod tidy to mark robfig/cron/v3 as direct dependency

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Clean up zero-byte files on error in both backup-agent and restore-agent (os.Remove on all error paths)
- Make restoreSnapshotDir a var for testability; add real tests for writeSnapshot (non-empty, empty, PVC copy success/failure)
- Make OPERATOR_IMAGE optional at startup (warn instead of exit); validate at reconcile time with clear status condition when missing

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Align restore-agent timeout validation with backup-agent (return error on invalid RESTORE_TIMEOUT_MINUTES instead of silent fallback)
- Add table-driven unit test for cronJobNeedsUpdate covering schedule, history limits, labels, image, env, command, volumes changes and verifying API server defaults do not trigger false positives
- Add controller test for empty OperatorImage (verifies Failed condition with ConfigurationError reason)

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Extract validateSpec() from ValidateCreate to allow ValidateUpdate to skip the name length check. This prevents breaking updates of existing EtcdBackupSchedule resources with names longer than 35 characters that were created before the webhook was added. Add tests for name length validation (boundary and over-limit), update-with-long-name, and empty OperatorImage in schedule controller. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
…e Job conditions

- Add name length validation to EtcdBackup webhook (max 45 chars) to prevent Job/Pod name overflow at runtime.
- Change snapshot file creation from os.Create (0666) to os.OpenFile with 0600 in both backup-agent and restore-agent, since etcd snapshots contain sensitive cluster data.
- Use Job condition type=Failed instead of job.Status.Failed > 0 for more robust terminal state detection in reconcileJobStatus.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add resource requests/limits (100m/128Mi requests, 512Mi memory limit) to backup-agent, restore-agent, and restore-datadir containers to prevent node resource exhaustion with large etcd snapshots.
- Change MkdirAll from 0755 to 0700 in backup-agent and restore-agent since etcd snapshots contain sensitive cluster data.
- Add S3 endpoint URL validation in webhook: must start with http:// or https://.
- Fix potential fd leak in backup-agent S3 upload: defer tmpFile.Close() immediately after creation instead of closing manually on each path.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Replace raw string "True" with corev1.ConditionTrue in Job condition check for type safety and codebase consistency.
- Change EtcdBackupSchedule ClusterNotFound from permanent failure to retryable state: set Ready=false and requeue after 30s. The schedule controller does not watch EtcdCluster resources, so periodic requeue is needed to recover when the cluster is created after the schedule.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add PodSecurityContext (RunAsNonRoot, RunAsUser/Group 65532) to backup Job, backup CronJob, and restore init containers, matching the Dockerfile's nonroot user.
- Add set -e to the restore-datadir init container shell script for proper error propagation.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add FSGroup to PodSecurityContext on backup Job/CronJob pods to ensure PVC mounts are writable by UID 65532.
- Add ActiveDeadlineSeconds (15min) to backup Job as a safety net if backup-agent hangs on a syscall.
- Add Job ownership verification: reject if a Job with the expected name exists but is not controlled by this EtcdBackup.
- Add ConcurrencyPolicy to cronJobNeedsUpdate comparison to detect drift from manual kubectl edits.
- Fix EtcdBackup name length limit: use 63 (Job DNS label limit) instead of 52 (CronJob limit). Max name is now 56 chars.
- Change CRD conversion webhook strategy from Webhook to None for EtcdBackup and EtcdBackupSchedule (no conversion handlers exist).
- Fix sample restore CR: use replicas:1 with S3 source to avoid ReadWriteOnce PVC scheduling issues with multiple replicas.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
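The name-length fix above can be sketched as a webhook-style guard. The exact suffix the operator appends is an assumption here; the commit only states that the Job name must fit in a 63-char DNS label, leaving 56 chars for the EtcdBackup name.

```go
package main

import "fmt"

const (
	dnsLabelMax   = 63 // Kubernetes Job names are DNS labels
	jobSuffixLen  = 7  // assumed length of the operator's fixed suffix, e.g. "-backup"
	maxBackupName = dnsLabelMax - jobSuffixLen
)

// validateName rejects EtcdBackup names whose derived Job name would
// overflow the DNS label limit at runtime.
func validateName(name string) error {
	if len(name) > maxBackupName {
		return fmt.Errorf("name %q is %d chars; max is %d so the derived Job name fits in a DNS label",
			name, len(name), maxBackupName)
	}
	return nil
}

func main() {
	fmt.Println(validateName("nightly-backup")) // <nil>

	long := make([]byte, 60)
	for i := range long {
		long[i] = 'a'
	}
	fmt.Println(validateName(string(long)) != nil) // true
}
```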
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Force-pushed from 6bfdcb9 to 54658c5
Actionable comments posted: 2
🧹 Nitpick comments (2)
api/v1alpha1/etcdbackupschedule_webhook.go (1)
123-128: Consider case-insensitive check for `@every`.

The robfig/cron parser handles descriptors case-insensitively, so `@Every 5m` or `@EVERY 5m` would pass validation but still fail when Kubernetes attempts to create the CronJob.

🔧 Suggested fix for case-insensitive matching

```diff
- } else if strings.HasPrefix(r.Spec.Schedule, "@every ") {
+ } else if strings.HasPrefix(strings.ToLower(r.Spec.Schedule), "@every ") {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/v1alpha1/etcdbackupschedule_webhook.go` around lines 123 - 128, the validation for rejecting "@every" schedules currently uses strings.HasPrefix(r.Spec.Schedule, "@every "), which is case-sensitive; update the check to be case-insensitive by comparing a lowercased schedule (e.g., strings.HasPrefix(strings.ToLower(r.Spec.Schedule), "@every ")). Ensure this change is applied where r.Spec.Schedule is validated (the block that appends field.Invalid for spec.schedule) so descriptors like "@Every" or "@EVERY" are also rejected.

internal/controller/factory/backup_job.go (1)
68-78: Disable service-account token automount if the agent is API-free.

I don't see any pod-local need for in-cluster credentials in this spec: the container gets its inputs via env vars, mounted Secrets, and the PVC. If `backup-agent` does not call the Kubernetes API from inside the pod, set `automountServiceAccountToken: ptr.To(false)` here and in the CronJob template.

Possible fix

```diff
 Spec: corev1.PodSpec{
+	AutomountServiceAccountToken: ptr.To(false),
 	RestartPolicy: corev1.RestartPolicyNever,
 	SecurityContext: &corev1.PodSecurityContext{
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/factory/backup_job.go` around lines 68 - 78, The PodSpec for the backup job currently doesn't disable automatic mounting of the service account token; if the backup-agent process never calls the Kubernetes API from inside the pod, set automountServiceAccountToken to ptr.To(false) on the PodSpec in the backup job factory (the corev1.PodSpec block you modified) and also add the same setting to the CronJob pod template spec used elsewhere in this file so pods won't receive in-cluster credentials; ensure you only apply this change when the backup-agent is API-free.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v1alpha1/etcdbackup_types.go`:
- Around line 98-100: The current printcolumn for Status in etcdbackup_types.go
only queries the Complete condition via JSONPath
`.status.conditions[?(@.type=="Complete")].status`, so resources with
`Failed=True` will appear blank; update the CRD annotations to either add a
second printcolumn that queries the Failed condition (e.g. JSONPath
`.status.conditions[?(@.type=="Failed")].status`) or introduce a dedicated
`.status.phase` field on the EtcdBackup API and point the printcolumn to that
field; modify the API type (EtcdBackup/EtcdBackupStatus) and controller to
populate the chosen field consistently and update the kubebuilder:+printcolumn
annotation(s) accordingly so failures are visible in `kubectl get`.
In `@internal/controller/factory/backup_job.go`:
- Around line 49-52: The code unconditionally sets ActiveDeadlineSeconds from
the activeDeadline variable which can be 0 and is rejected by Kubernetes; update
the Job/Pod spec assignments so ActiveDeadlineSeconds is only set when
activeDeadline > 0 (create a *int64 from activeDeadline and assign it) and leave
it nil otherwise; locate uses of activeDeadline in backup_job.go (where the
Job/Pod spec is built) and the two places in backup_cronjob.go and wrap the
assignment with a conditional so the field is omitted unless activeDeadline > 0.
---
Nitpick comments:
In `@api/v1alpha1/etcdbackupschedule_webhook.go`:
- Around line 123-128: The validation for rejecting "@every" schedules currently
uses strings.HasPrefix(r.Spec.Schedule, "@every "), which is case-sensitive;
update the check to be case-insensitive by comparing a lowercased schedule
(e.g., strings.HasPrefix(strings.ToLower(r.Spec.Schedule), "@every ")). Ensure
this change is applied where r.Spec.Schedule is validated (the block that
appends field.Invalid for spec.schedule) so descriptors like "@Every" or
"@EVERY" are also rejected.
In `@internal/controller/factory/backup_job.go`:
- Around line 68-78: The PodSpec for the backup job currently doesn't disable
automatic mounting of the service account token; if the backup-agent process
never calls the Kubernetes API from inside the pod, set
automountServiceAccountToken to ptr.To(false) on the PodSpec in the backup job
factory (the corev1.PodSpec block you modified) and also add the same setting to
the CronJob pod template spec used elsewhere in this file so pods won't receive
in-cluster credentials; ensure you only apply this change when the backup-agent
is API-free.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 994fe613-e9f4-4b3d-9085-790d1c8ed7b2
📒 Files selected for processing (14)
- api/v1alpha1/etcdbackup_types.go
- api/v1alpha1/etcdbackupschedule_types.go
- api/v1alpha1/etcdbackupschedule_webhook.go
- api/v1alpha1/etcdbackupschedule_webhook_test.go
- api/v1alpha1/etcdcluster_webhook.go
- config/crd/bases/etcd.aenix.io_etcdbackups.yaml
- config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml
- config/crd/bases/etcd.aenix.io_etcdclusters.yaml
- config/rbac/role.yaml
- internal/controller/etcdbackupschedule_controller_test.go
- internal/controller/factory/backup_cronjob.go
- internal/controller/factory/backup_cronjob_test.go
- internal/controller/factory/backup_job.go
- internal/controller/factory/backup_job_test.go
✅ Files skipped from review due to trivial changes (4)
- internal/controller/factory/backup_cronjob_test.go
- internal/controller/factory/backup_job_test.go
- api/v1alpha1/etcdbackupschedule_webhook_test.go
- config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
- config/crd/bases/etcd.aenix.io_etcdbackups.yaml
- api/v1alpha1/etcdcluster_webhook.go
- internal/controller/etcdbackupschedule_controller_test.go
- api/v1alpha1/etcdbackupschedule_types.go
- config/crd/bases/etcd.aenix.io_etcdclusters.yaml
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Summary
- `EtcdBackup` CRD — a one-shot resource that triggers a backup Job to take an etcd snapshot and store it to S3-compatible storage or a PVC
- `backup-agent` binary (built into the same image) that connects to etcd, takes a snapshot, and uploads it to the configured destination
- `OPERATOR_IMAGE` env injection, CRD

Design
New files
- api/v1alpha1/etcdbackup_types.go
- api/v1alpha1/etcdbackup_webhook.go
- cmd/backup-agent/main.go
- internal/controller/etcdbackup_controller.go
- internal/controller/factory/backup_job.go

Test plan
- `make generate && make manifests` produce clean output
- `make test` passes all suites
- `make docker-build` builds both binaries

Summary by CodeRabbit