
feat(backup): add EtcdBackup CRD for declarative etcd snapshots #307

Open
IvanHunters wants to merge 35 commits into aenix-io:main from IvanHunters:feat/etcdbackup-crd

Conversation

@IvanHunters

@IvanHunters IvanHunters commented Mar 7, 2026

Summary

  • Add EtcdBackup CRD — a one-shot resource that triggers a backup Job to take an etcd snapshot and store it to S3-compatible storage or a PVC
  • Add backup-agent binary (built into the same image) that connects to etcd, takes a snapshot, and uploads it to the configured destination
  • Add validating webhook enforcing immutable spec, exactly-one-destination constraint, and subPath path traversal protection
  • Add Helm chart support: RBAC rules, webhook config, OPERATOR_IMAGE env injection, CRD

Design

User creates EtcdBackup CR
        |
EtcdBackupReconciler reads EtcdBackup + referenced EtcdCluster
        |
Controller creates a Job (backup-agent container)
        |
backup-agent connects to etcd, takes snapshot, uploads to S3/PVC
        |
Controller watches Job status -> updates EtcdBackup conditions

New files

  • api/v1alpha1/etcdbackup_types.go: CRD types (spec, status, conditions)
  • api/v1alpha1/etcdbackup_webhook.go: Validating webhook
  • cmd/backup-agent/main.go: Backup agent binary
  • internal/controller/etcdbackup_controller.go: Reconciler
  • internal/controller/factory/backup_job.go: Job factory

Test plan

  • Webhook tests: valid S3/PVC, both set, neither set, empty fields, subPath traversal, spec immutability, status-only updates allowed
  • Controller tests: Job creation (PVC/S3), cluster not found, Job success/failure conditions, already complete skip, TLS volume mounts
  • Factory tests: PVC/S3 job specs, TLS volumes, SubPath handling, labels, owner references
  • make generate && make manifests produce clean output
  • make test passes all suites
  • make docker-build builds both binaries

Summary by CodeRabbit

  • New Features
    • EtcdBackup and EtcdBackupSchedule CRDs with S3/PVC destinations; backup and scheduled backup orchestration; restore-at-bootstrap support; backup-agent and restore-agent binaries; operator image propagated to workloads.
  • Chores
    • Builds now produce manager, backup-agent, and restore-agent; charts, CRD manifests, samples, and packaging updated.
  • Security & Ops
    • Validating webhooks added, cert-manager CA injection, and expanded RBAC for backups, schedules, jobs, and cronjobs.
  • Tests
    • Extensive unit and integration tests covering webhooks, controllers, factory helpers, agents, jobs, cronjobs, and restore flows.

@coderabbitai
Contributor

coderabbitai bot commented Mar 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough


Adds EtcdBackup and EtcdBackupSchedule CRDs, validating webhooks, controllers to manage Jobs/CronJobs, backup-agent and restore-agent binaries, StatefulSet bootstrap restore init containers, Helm/CRD/RBAC/chart updates, generated deepcopy code, and extensive unit tests plus build/tooling changes.

Changes

  • API types & deepcopy (PROJECT, api/v1alpha1/etcdbackup_types.go, api/v1alpha1/etcdbackupschedule_types.go, api/v1alpha1/zz_generated.deepcopy.go, api/v1alpha1/etcdcluster_types.go): Add EtcdBackup and EtcdBackupSchedule API types (spec/status, S3/PVC BackupDestination), bootstrap/restore types on EtcdCluster, condition constants, and generated deepcopy implementations.
  • Webhooks & tests (api/v1alpha1/etcdbackup_webhook.go, api/v1alpha1/etcdbackup_webhook_test.go, api/v1alpha1/etcdbackupschedule_webhook.go, api/v1alpha1/etcdbackupschedule_webhook_test.go, api/v1alpha1/etcdcluster_webhook.go, api/v1alpha1/webhook_suite_test.go): Add validating webhooks for backup and schedule CRs, destination/cron/name validation, immutability checks, registration in the test suite, and comprehensive unit tests.
  • Controllers & reconciliation tests (internal/controller/etcdbackup_controller.go, internal/controller/etcdbackup_controller_test.go, internal/controller/etcdbackupschedule_controller.go, internal/controller/etcdbackupschedule_controller_test.go): Introduce EtcdBackupReconciler and EtcdBackupScheduleReconciler to create/manage Jobs and CronJobs, update status conditions, handle cluster resolution, and include extensive reconciliation tests.
  • Factory builders & tests (internal/controller/factory/backup_job.go, internal/controller/factory/backup_job_test.go, internal/controller/factory/backup_cronjob.go, internal/controller/factory/backup_cronjob_test.go, internal/controller/cronjob_update_test.go): Add Job/CronJob factory helpers and naming, container/volume/env wiring for TLS/S3/PVC, a cronJobNeedsUpdate comparator, and unit tests validating generated resources.
  • StatefulSet restore support & tests (internal/controller/factory/statefulset.go, internal/controller/factory/statefulset_restore_test.go, internal/controller/factory/statefulset_test.go): Add the bootstrap restore flow: require operatorImage, generate restore init containers/volumes (restore-agent + restore-datadir), inject them into the StatefulSet PodSpec, and tests covering PVC/S3 restore variants.
  • Agents & tests (cmd/backup-agent/main.go, cmd/backup-agent/main_test.go, cmd/restore-agent/main.go, cmd/restore-agent/main_test.go): Add backup-agent and restore-agent programs (etcd snapshot/restore to S3 or PVC) with TLS handling, S3 uploads/downloads, atomic writes, timeouts, and unit tests.
  • Manager, build & binaries (cmd/manager/main.go, Dockerfile, Makefile, go.mod): Wire the new reconcilers/webhooks and OperatorImage into the manager; update the Dockerfile/Makefile to build and copy manager, backup-agent, and restore-agent; add AWS SDK and cron dependencies.
  • CRDs, kustomize & samples (config/crd/bases/...etcdbackups.yaml, ...etcdbackupschedules.yaml, config/crd/kustomization.yaml, charts/etcd-operator/crds/*, config/samples/*): Add base CRD manifests for EtcdBackup and EtcdBackupSchedule, kustomize entries and CA/webhook patches, update the EtcdCluster CRD schema for bootstrap.restore, and add sample CR manifests.
  • Helm charts & webhook manifests (charts/etcd-operator/crds/etcd-backup.yaml, charts/.../validatingwebhookconfiguration.yml, charts/.../clusterrole-manager-role.yml, charts/.../deployment.yml, config/webhook/manifests.yaml): Add chart CRD entries, ValidatingWebhookConfiguration webhooks for backups/schedules, extend the ClusterRole for batch and etcd.aenix.io resources, and expose the OPERATOR_IMAGE env var in the deployment template.
  • RBAC manifests (config/rbac/role.yaml, config/rbac/etcdbackup_editor_role.yaml, config/rbac/etcdbackup_viewer_role.yaml): Add/update ClusterRole rules to permit cronjobs/jobs and etcdbackups/etcdbackupschedules (including finalizers/status); add dedicated editor/viewer roles.
  • Controller glue & tests (internal/controller/*_test.go, internal/controller/etcdcluster_controller.go): Add many controller tests; EtcdClusterReconciler now carries OperatorImage, and the CreateOrUpdateStatefulSet signature is updated to accept operatorImage.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant K8s as Kubernetes API
    participant Controller as EtcdBackup Controller
    participant Factory as Job Factory
    participant Job as Kubernetes Job
    participant Agent as backup-agent
    participant Etcd as Etcd Cluster
    participant Storage as S3/PVC

    User->>K8s: Create EtcdBackup CR
    K8s->>Controller: Reconcile EtcdBackup
    Controller->>K8s: Get EtcdCluster
    Controller->>Factory: CreateBackupJob(backup, cluster, operatorImage)
    Factory->>K8s: Create Job
    K8s->>Job: Schedule Job
    Job->>Agent: Run backup-agent container
    Agent->>Etcd: Connect and snapshot
    Etcd-->>Agent: Snapshot stream
    alt S3
        Agent->>Storage: Upload snapshot to S3
    else PVC
        Agent->>Storage: Write snapshot to PVC
    end
    Storage-->>Agent: Confirm
    Agent-->>Job: Exit success
    Job->>K8s: Update Job status
    Controller->>K8s: Observe Job completion, Update EtcdBackup status
sequenceDiagram
    participant User
    participant K8s as Kubernetes API
    participant ScheduleController as EtcdBackupSchedule Controller
    participant Factory as CronJob Factory
    participant Cron as CronJob
    participant Scheduler as K8s Scheduler
    participant Job as Created Job
    participant Agent as backup-agent

    User->>K8s: Create EtcdBackupSchedule CR
    K8s->>ScheduleController: Reconcile Schedule
    ScheduleController->>K8s: Get EtcdCluster
    ScheduleController->>Factory: CreateBackupCronJob(schedule, cluster, operatorImage)
    Factory->>K8s: Create CronJob
    note over Scheduler,Job: Periodically per cron schedule
    Scheduler->>K8s: Instantiate Job from CronJob
    K8s->>Job: Run backup-agent
    Job->>Agent: Execute backup-agent
    Agent->>Agent: Perform backup (S3/PVC)
    Job-->>K8s: Complete
    ScheduleController->>K8s: Update EtcdBackupSchedule status (LastScheduleTime/Ready)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I nibble code, then hop and cheer,

Jobs and CronJobs hum when backups near.
Agents fetch snapshots, safe and neat,
Restores wake clusters from slumber sweet.
A rabbit pats the build — all systems clear!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The PR title clearly describes the main change: adding an EtcdBackup CRD for declarative etcd snapshots. It is concise, specific, and directly reflects the primary objective of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@IvanHunters IvanHunters marked this pull request as ready for review March 8, 2026 09:23
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (13)
Makefile (1)

121-128: Deduplicate the backup-agent build recipe.

build and build-backup-agent now carry the same go build -o bin/backup-agent ... command. Making build depend on build-backup-agent avoids the two targets drifting the next time build flags change.

Possible cleanup
 .PHONY: build
-build: manifests generate fmt vet ## Build manager binary.
+build: manifests generate fmt vet build-backup-agent ## Build manager binary.
 	go build -o bin/manager cmd/manager/main.go
-	go build -o bin/backup-agent cmd/backup-agent/main.go
 
 .PHONY: build-backup-agent
 build-backup-agent: ## Build backup-agent binary.
 	go build -o bin/backup-agent cmd/backup-agent/main.go
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 121 - 128, The Makefile duplicates the backup-agent
build command in both the build and build-backup-agent targets; change the build
target to depend on build-backup-agent (add build-backup-agent to its
prerequisites) and remove the duplicate "go build -o bin/backup-agent
cmd/backup-agent/main.go" line from the build recipe so the backup-agent is
built only via the build-backup-agent target (leave the build-backup-agent
target containing the go build command intact).
internal/controller/factory/backup_job.go (1)

205-218: Consider handling zero replicas edge case.

When cluster.Spec.Replicas is explicitly set to 0 (allowed per kubebuilder validation minimum), buildEndpoints returns an empty string. The backup-agent will then fail with an unhelpful error about missing endpoints rather than a clear message about no replicas.

This is a minor edge case since backing up a zero-replica cluster is inherently invalid, but could be confusing during debugging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/factory/backup_job.go` around lines 205 - 218,
buildEndpoints currently returns an empty string when cluster.Spec.Replicas is
explicitly 0; change buildEndpoints to detect replicas == 0 and return a clear
sentinel string (for example "no-replicas:<clusterName>.<namespace>") instead of
"", so callers get a deterministic, debuggable value; update the logic inside
buildEndpoints (which uses GetServerProtocol, GetHeadlessServiceName and reads
cluster.Spec.Replicas) to check for zero replicas and return the descriptive
message immediately before building endpoints.
config/crd/bases/etcd.aenix.io_etcdbackups.yaml (1)

66-127: Push the destination invariants into the CRD schema.

destination still accepts {} and {s3,pvc} at the OpenAPI level, so these invalid specs are only rejected once the validating webhook is involved. Adding oneOf plus branch-level required fields here would make the CRD self-validating and fail bad manifests earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/crd/bases/etcd.aenix.io_etcdbackups.yaml` around lines 66 - 127, The
CRD currently allows destination to be empty or contain both s3 and pvc because
its schema only marks properties as optional; update the destination schema (the
"destination" object that contains "s3" and "pvc") to use a oneOf with two
branches: one branch requiring the "pvc" property and within it the existing
required "claimName", and the other branch requiring the "s3" property and
within it the existing required fields ("bucket", "credentialsSecretRef",
"endpoint", "key", etc.); ensure each branch keeps the existing property
definitions and required lists so the CRD self-validates and rejects {} or mixed
specs without relying on the webhook.
api/v1alpha1/etcdbackupschedule_webhook_test.go (1)

28-199: Add negative coverage for schedule destination required fields too.

This suite checks xor and subPath, but it never exercises missing claimName or missing required S3 fields on EtcdBackupSchedule. Since this CR has its own webhook, those regressions would still slip past the existing EtcdBackup tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/etcdbackupschedule_webhook_test.go` around lines 28 - 199, Add
negative unit tests that call EtcdBackupSchedule.ValidateCreate to cover missing
required destination fields: one test creating an EtcdBackupSchedule whose
Spec.Destination.PVC has an empty ClaimName and asserting ValidateCreate returns
a StatusError containing "claimName" (or similar), and another test where
Spec.Destination.S3 omits required fields (e.g., empty Bucket or empty
CredentialsSecretRef.Name) and asserting ValidateCreate returns a StatusError
containing the missing S3 field names; locate the test cases near the existing
tests referencing EtcdBackupSchedule, BackupDestination, PVCBackupDestination,
S3BackupDestination and reuse the ValidateCreate call/assertion pattern used
elsewhere in this file.
charts/etcd-operator/templates/cert-manager/validatingwebhookconfiguration.yml (1)

10-49: Set an explicit timeout on these new validating webhooks.

These rules fail closed and currently rely on the API server default timeout. Pinning timeoutSeconds to a small value keeps CREATE/UPDATE failures fast when the webhook service is slow or unavailable.

Suggested change
   - admissionReviewVersions:
       - v1
     clientConfig:
       service:
         name: {{ include "etcd-operator.fullname" . }}-webhook-service
         namespace: {{ .Release.Namespace }}
         path: /validate-etcd-aenix-io-v1alpha1-etcdbackup
     failurePolicy: Fail
     name: vetcdbackup.kb.io
+    timeoutSeconds: 5
     rules:
       - apiGroups:
           - etcd.aenix.io
@@
   - admissionReviewVersions:
       - v1
     clientConfig:
       service:
         name: {{ include "etcd-operator.fullname" . }}-webhook-service
         namespace: {{ .Release.Namespace }}
         path: /validate-etcd-aenix-io-v1alpha1-etcdbackupschedule
     failurePolicy: Fail
     name: vetcdbackupschedule.kb.io
+    timeoutSeconds: 5
     rules:
       - apiGroups:
           - etcd.aenix.io
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@charts/etcd-operator/templates/cert-manager/validatingwebhookconfiguration.yml`
around lines 10 - 49, The validating webhook configurations for
vetcdbackup.kb.io and vetcdbackupschedule.kb.io lack an explicit timeoutSeconds
and thus use the API server default; add a timeoutSeconds: 5 (or another small
integer) field to each webhook object (alongside failurePolicy) in the
validatingwebhookconfiguration entries for paths
/validate-etcd-aenix-io-v1alpha1-etcdbackup and
/validate-etcd-aenix-io-v1alpha1-etcdbackupschedule so CREATE/UPDATE calls fail
fast when the webhook service is slow or unavailable.
config/webhook/manifests.yaml (1)

33-72: Mirror an explicit webhook timeout in the generated manifest too.

Like the chart template, these fail-closed webhooks currently inherit the API server default timeout. Setting timeoutSeconds explicitly keeps local/generated deployments from stalling writes for longer than necessary during webhook issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/webhook/manifests.yaml` around lines 33 - 72, The generated
Mutating/ValidatingWebhookConfiguration entries for the webhooks
vetcdbackup.kb.io and vetcdbackupschedule.kb.io (look for
clientConfig.service.path values /validate-etcd-aenix-io-v1alpha1-etcdbackup and
/validate-etcd-aenix-io-v1alpha1-etcdbackupschedule) lack an explicit
timeoutSeconds and therefore inherit the API server default; add a
timeoutSeconds field (e.g., timeoutSeconds: 10) to each webhook block in the
manifest so the generated YAML mirrors the chart template and avoids long stalls
when the webhook is unreachable.
internal/controller/factory/backup_job_test.go (1)

142-203: Assert the new S3 endpoint/path-style env wiring here.

CreateBackupJob now exports S3_ENDPOINT and S3_FORCE_PATH_STYLE, but this test never checks them. That leaves the S3 compatibility change in this PR effectively unguarded.

Suggested change
 	if envMap["S3_BUCKET"].Value != "backups" {
 		t.Errorf("expected S3_BUCKET=backups, got %q", envMap["S3_BUCKET"].Value)
 	}
+	if envMap["S3_ENDPOINT"].Value != "https://s3.example.com" {
+		t.Errorf("expected S3_ENDPOINT=https://s3.example.com, got %q", envMap["S3_ENDPOINT"].Value)
+	}
 	if envMap["S3_KEY"].Value != "etcd/snap.db" {
 		t.Errorf("expected S3_KEY=etcd/snap.db, got %q", envMap["S3_KEY"].Value)
 	}
 	if envMap["S3_REGION"].Value != "eu-west-1" {
 		t.Errorf("expected S3_REGION=eu-west-1, got %q", envMap["S3_REGION"].Value)
 	}
+	if envMap["S3_FORCE_PATH_STYLE"].Value != "false" {
+		t.Errorf("expected S3_FORCE_PATH_STYLE=false, got %q", envMap["S3_FORCE_PATH_STYLE"].Value)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/factory/backup_job_test.go` around lines 142 - 203, The
test fails to assert the new S3 env wiring added by CreateBackupJob: add checks
in backup_job_test.go (after building envMap from
job.Spec.Template.Spec.Containers[0].Env) to assert envMap["S3_ENDPOINT"].Value
== "https://s3.example.com" and that envMap["S3_FORCE_PATH_STYLE"].Value exists
and equals the expected default (e.g. "false") so the S3 endpoint and path-style
behavior are validated for the backup created from the backup object with
Endpoint set.
internal/controller/factory/backup_cronjob_test.go (1)

30-46: Test setup looks good, consider explicit error checking on scheme registration.

The test properly sets up the runtime scheme with both client-go and etcd-operator types. While ignoring the return value of AddToScheme is common in tests (these rarely fail), explicit error checking would be more defensive.

🔧 Optional: Add explicit error checking
 func TestCreateBackupCronJob_PVC(t *testing.T) {
 	scheme := runtime.NewScheme()
-	_ = clientgoscheme.AddToScheme(scheme)
-	_ = etcdaenixiov1alpha1.AddToScheme(scheme)
+	if err := clientgoscheme.AddToScheme(scheme); err != nil {
+		t.Fatalf("failed to add client-go scheme: %v", err)
+	}
+	if err := etcdaenixiov1alpha1.AddToScheme(scheme); err != nil {
+		t.Fatalf("failed to add etcd-operator scheme: %v", err)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/factory/backup_cronjob_test.go` around lines 30 - 46,
TestCreateBackupCronJob_PVC currently ignores errors returned by
clientgoscheme.AddToScheme and etcdaenixiov1alpha1.AddToScheme; update the test
to explicitly check and fail on those errors by capturing their return values
and calling t.Fatalf or t.Fatal when non-nil. Locate the two AddToScheme calls
in TestCreateBackupCronJob_PVC and change them to assign the error (e.g., err :=
clientgoscheme.AddToScheme(scheme)) and handle err similarly for
etcdaenixiov1alpha1.AddToScheme so the test fails fast if scheme registration
fails.
api/v1alpha1/etcdbackupschedule_webhook.go (1)

34-35: Consider using a cron parsing library instead of regex for schedule validation.

The current regex allows some invalid cron expressions (e.g., 99 99 99 99 99 would match). Consider using a library like github.com/robfig/cron/v3 to parse and validate cron expressions more accurately, which would also catch semantic errors (e.g., invalid day-of-month values).

♻️ Suggested approach using cron library
import "github.com/robfig/cron/v3"

// In ValidateCreate, replace regex check with:
if r.Spec.Schedule == "" {
    allErrors = append(allErrors, field.Required(...))
} else {
    parser := cron.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow | cron.Descriptor)
    if _, err := parser.Parse(r.Spec.Schedule); err != nil {
        allErrors = append(allErrors, field.Invalid(
            field.NewPath("spec", "schedule"),
            r.Spec.Schedule,
            fmt.Sprintf("invalid cron expression: %v", err),
        ))
    }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/etcdbackupschedule_webhook.go` around lines 34 - 35, Replace the
fragile regex-based cron validation (cronRegexp) in ValidateCreate with parsing
via github.com/robfig/cron/v3: remove/stop using cronRegexp, add the cron
import, construct a parser with
cron.Minute|cron.Hour|cron.Dom|cron.Month|cron.Dow|cron.Descriptor and call
parser.Parse(r.Spec.Schedule), treat empty r.Spec.Schedule as required, and on
parse error append a field.Invalid for field.NewPath("spec","schedule")
including the parser error text so semantic cron errors are caught.
api/v1alpha1/etcdbackup_webhook.go (1)

144-152: Consider using path.Clean instead of filepath.Clean for consistent path validation.

filepath.Clean uses OS-specific path separators, which may behave differently when the webhook runs on Windows vs. when the backup-agent runs in a Linux container. Using path.Clean from the path package ensures consistent forward-slash handling regardless of the webhook's host OS.

♻️ Suggested fix for consistent path handling
 import (
 	"fmt"
-	"path/filepath"
+	"path"
 	"reflect"
 	"strings"
 ...

 	if pvc.SubPath != "" {
-		cleaned := filepath.Clean(pvc.SubPath)
-		if strings.HasPrefix(cleaned, "..") || filepath.IsAbs(cleaned) {
+		cleaned := path.Clean(pvc.SubPath)
+		if strings.HasPrefix(cleaned, "..") || path.IsAbs(cleaned) {
 			allErrors = append(allErrors, field.Invalid(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/etcdbackup_webhook.go` around lines 144 - 152, Replace
OS-dependent filepath usage with the POSIX path package for consistent
validation: in the block that checks pvc.SubPath (using filepath.Clean and
filepath.IsAbs), switch to path.Clean and path.IsAbs and update imports to
include "path" instead of "path/filepath"; keep the same logic with
strings.HasPrefix(cleaned, "..") and field.Invalid on pvcPath.Child("subPath")
so subPath is validated as a relative, non-.. containing path.
internal/controller/etcdbackupschedule_controller_test.go (1)

274-278: The compile-time verification block is unnecessary.

Lines 275-278 use a pattern to verify imports compile, but this is already ensured by the actual test code that uses errors.IsNotFound and ptr.To. This block can be removed to reduce noise.

🧹 Remove unnecessary verification block
-// Verify that all helpers referenced in this file compile.
-var _ = func() {
-	_ = errors.IsNotFound
-	_ = ptr.To[int32]
-}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/etcdbackupschedule_controller_test.go` around lines 274 -
278, Remove the unnecessary compile-time verification block (the anonymous var _
= func() { ... } closure) that assigns _ = errors.IsNotFound and _ =
ptr.To[int32]; simply delete that entire block from
etcdbackupschedule_controller_test.go since the tests already reference
errors.IsNotFound and ptr.To elsewhere; ensure no other references remain to
that anonymous function or its variables.
config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml (1)

70-131: Consider adding CEL validation for destination mutual exclusivity.

Currently, the destination schema allows both pvc and s3 to be specified simultaneously (or neither), relying entirely on webhook validation. For defense-in-depth and better user experience when webhooks are unavailable, consider adding a CEL validation rule (requires Kubernetes 1.25+).

📋 Optional: Add CEL validation rule

Add to the destination property in the schema:

              destination:
                description: Destination defines where the backup will be stored.
                x-kubernetes-validations:
                  - rule: "has(self.pvc) != has(self.s3)"
                    message: "exactly one of pvc or s3 must be specified"
                properties:
                  # ... existing properties

Alternatively, use the corresponding kubebuilder marker in the Go types file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml` around lines 70 -
131, Add a CEL validation to the destination schema to enforce mutual
exclusivity between pvc and s3: under the destination property (referencing
destination, pvc, and s3) add an x-kubernetes-validations entry with rule
"has(self.pvc) != has(self.s3)" and an appropriate message like "exactly one of
pvc or s3 must be specified" so the CRD itself rejects both/none selections when
webhooks are unavailable.
internal/controller/etcdbackupschedule_controller.go (1)

127-167: Consider: reflect.DeepEqual may cause unnecessary updates.

Using reflect.DeepEqual on CronJob.Spec (Line 140) is functional but may miss semantic equivalence (e.g., nil vs empty slices). This could occasionally trigger no-op updates, but it's not incorrect behavior—just slightly suboptimal.

For a v1alpha1 controller, this is acceptable. Consider using a more targeted comparison if update frequency becomes a concern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/etcdbackupschedule_controller.go` around lines 127 - 167,
The code uses reflect.DeepEqual to compare existingCronJob.Spec and
desiredCronJob.Spec in reconcileExistingCronJob which can treat semantically
equal objects (e.g., nil vs empty slices) as different and cause unnecessary
updates; replace the comparison with a Kubernetes-aware semantic equality check
(e.g., import "k8s.io/apimachinery/pkg/api/equality" and use
equality.Semantic.DeepEqual(existingCronJob.Spec, desiredCronJob.Spec)) or
implement targeted field comparisons for the specific spec fields you care about
(normalize nil/empty slices/maps first) before assigning existingCronJob.Spec
and calling r.Update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/backup-agent/main.go`:
- Around line 89-121: In buildTLSConfig, set an explicit minimum TLS version on
the tls.Config (tlsConfig.MinVersion) to enforce TLS1.2+ (e.g.,
tls.VersionTLS12) before returning; update the tlsConfig variable initialization
(in function buildTLSConfig) to assign MinVersion to the chosen constant so the
config enforces the minimum protocol version for all connections.

In `@cmd/manager/main.go`:
- Around line 118-133: Read OPERATOR_IMAGE once at startup into a local variable
(e.g. operatorImage := os.Getenv("OPERATOR_IMAGE")), validate it is non-empty
and fail fast (log error and os.Exit(1)) if unset, then pass that checked
operatorImage into both controller.EtcdBackupReconciler and
controller.EtcdBackupScheduleReconciler via their OperatorImage field instead of
calling os.Getenv inline; update imports if you need errors (or logging) per
your chosen failure path.

In `@config/crd/patches/webhook_in_etcdbackups.yaml`:
- Around line 7-16: The CRD patch currently sets conversion.strategy: Webhook
(webhook.clientConfig.path: /convert) but there are no conversion handlers
(ConvertTo/ConvertFrom) or multiple API versions; remove the conversion webhook
block from the EtcdBackup CRD patch (delete the conversion: ... webhook: ...
conversionReviewVersions: ... entries) so the CRD remains single-version without
a conversion webhook, or alternatively implement proper hub/spoke conversion
methods (ConvertTo/ConvertFrom) and add additional API versions if you intend to
keep the webhook.

In `@internal/controller/factory/backup_job_test.go`:
- Around line 299-302: The test currently only checks
envMap["ETCD_ENDPOINTS"].Value is non-empty; change it to actually verify the
scheme is HTTPS by parsing the endpoint(s) and asserting the URL scheme equals
"https". For envMap["ETCD_ENDPOINTS"].Value (which may contain a single endpoint
or comma-separated endpoints), split on commas, trim whitespace, parse each with
a URL parser and assert parsedURL.Scheme == "https" (or fail the test with a
clear message referencing ETCD_ENDPOINTS and the offending endpoint). This
ensures the test enforces TLS rather than merely non-empty values.
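The assertion described above can be factored into a small helper; allHTTPS is an illustrative name, not something from the test file:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// allHTTPS splits a comma-separated ETCD_ENDPOINTS value, parses each
// entry, and returns the first endpoint whose scheme is not https.
func allHTTPS(endpoints string) (bool, string) {
	for _, ep := range strings.Split(endpoints, ",") {
		ep = strings.TrimSpace(ep)
		u, err := url.Parse(ep)
		if err != nil || u.Scheme != "https" {
			return false, ep
		}
	}
	return true, ""
}

func main() {
	ok, _ := allHTTPS("https://etcd-0:2379, https://etcd-1:2379")
	fmt.Println(ok) // true
	ok, bad := allHTTPS("https://etcd-0:2379,http://etcd-1:2379")
	fmt.Println(ok, bad) // false http://etcd-1:2379
}
```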

---

Nitpick comments:
In `@api/v1alpha1/etcdbackup_webhook.go`:
- Around line 144-152: Replace OS-dependent filepath usage with the POSIX path
package for consistent validation: in the block that checks pvc.SubPath (using
filepath.Clean and filepath.IsAbs), switch to path.Clean and path.IsAbs and
update imports to include "path" instead of "path/filepath"; keep the same logic
with strings.HasPrefix(cleaned, "..") and field.Invalid on
pvcPath.Child("subPath") so subPath is validated as a relative, non-..
containing path.
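A sketch of the POSIX-path version of that check, using `path` instead of `path/filepath`. Note this variant rejects `..` escapes via an exact-match plus `"../"` prefix check, which is slightly stricter than the bare `HasPrefix(cleaned, "..")` the review quotes (the bare prefix check would also reject a legitimate name like `..data`):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// validateSubPath enforces that subPath is a relative path that cannot
// escape the mount point via "..". Using the POSIX path package keeps the
// result independent of the OS the webhook binary is built for.
func validateSubPath(subPath string) error {
	cleaned := path.Clean(subPath)
	if path.IsAbs(cleaned) || cleaned == ".." || strings.HasPrefix(cleaned, "../") {
		return fmt.Errorf("subPath %q must be a relative path that does not escape the mount", subPath)
	}
	return nil
}

func main() {
	fmt.Println(validateSubPath("backups/etcd")) // <nil>
	fmt.Println(validateSubPath("../escape"))    // error
	fmt.Println(validateSubPath("/absolute"))    // error
}
```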

In `@api/v1alpha1/etcdbackupschedule_webhook_test.go`:
- Around line 28-199: Add negative unit tests that call
EtcdBackupSchedule.ValidateCreate to cover missing required destination fields:
one test creating an EtcdBackupSchedule whose Spec.Destination.PVC has an empty
ClaimName and asserting ValidateCreate returns a StatusError containing
"claimName" (or similar), and another test where Spec.Destination.S3 omits
required fields (e.g., empty Bucket or empty CredentialsSecretRef.Name) and
asserting ValidateCreate returns a StatusError containing the missing S3 field
names; locate the test cases near the existing tests referencing
EtcdBackupSchedule, BackupDestination, PVCBackupDestination, S3BackupDestination
and reuse the ValidateCreate call/assertion pattern used elsewhere in this file.

In `@api/v1alpha1/etcdbackupschedule_webhook.go`:
- Around line 34-35: Replace the fragile regex-based cron validation
(cronRegexp) in ValidateCreate with parsing via github.com/robfig/cron/v3:
remove/stop using cronRegexp, add the cron import, construct a parser with
cron.Minute|cron.Hour|cron.Dom|cron.Month|cron.Dow|cron.Descriptor and call
parser.Parse(r.Spec.Schedule), treat empty r.Spec.Schedule as required, and on
parse error append a field.Invalid for field.NewPath("spec","schedule")
including the parser error text so semantic cron errors are caught.

In
`@charts/etcd-operator/templates/cert-manager/validatingwebhookconfiguration.yml`:
- Around line 10-49: The validating webhook configurations for vetcdbackup.kb.io
and vetcdbackupschedule.kb.io lack an explicit timeoutSeconds and thus use the
API server default; add a timeoutSeconds: 5 (or another small integer) field to
each webhook object (alongside failurePolicy) in the
validatingwebhookconfiguration entries for paths
/validate-etcd-aenix-io-v1alpha1-etcdbackup and
/validate-etcd-aenix-io-v1alpha1-etcdbackupschedule so CREATE/UPDATE calls fail
fast when the webhook service is slow or unavailable.
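An abbreviated sketch of what the chart entry could look like (surrounding fields such as `rules` and cert-manager annotations are elided; `timeoutSeconds` accepts 1-30, and the API server default is 10):

```yaml
webhooks:
  - name: vetcdbackup.kb.io
    failurePolicy: Fail
    timeoutSeconds: 5  # fail fast instead of the 10s API-server default
    clientConfig:
      service:
        path: /validate-etcd-aenix-io-v1alpha1-etcdbackup
```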

In `@config/crd/bases/etcd.aenix.io_etcdbackups.yaml`:
- Around line 66-127: The CRD currently allows destination to be empty or
contain both s3 and pvc because its schema only marks properties as optional;
update the destination schema (the "destination" object that contains "s3" and
"pvc") to use a oneOf with two branches: one branch requiring the "pvc" property
and within it the existing required "claimName", and the other branch requiring
the "s3" property and within it the existing required fields ("bucket",
"credentialsSecretRef", "endpoint", "key", etc.); ensure each branch keeps the
existing property definitions and required lists so the CRD self-validates and
rejects {} or mixed specs without relying on the webhook.
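A sketch of the shape such a `oneOf` could take; the nested property bodies are elided, and the exact required-field lists should be copied from the existing schema. In a structural schema, `oneOf` branches may only carry logical constraints like `required`, which is exactly what is needed here:

```yaml
destination:
  type: object
  properties:
    pvc:
      type: object
      required: [claimName]
      # ...existing pvc property definitions...
    s3:
      type: object
      required: [bucket, credentialsSecretRef, endpoint, key]
      # ...existing s3 property definitions...
  oneOf:
    - required: [pvc]
    - required: [s3]
```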

In `@config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml`:
- Around line 70-131: Add a CEL validation to the destination schema to enforce
mutual exclusivity between pvc and s3: under the destination property
(referencing destination, pvc, and s3) add an x-kubernetes-validations entry
with rule "has(self.pvc) != has(self.s3)" and an appropriate message like
"exactly one of pvc or s3 must be specified" so the CRD itself rejects both/none
selections when webhooks are unavailable.
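The CEL rule from the review would sit directly on the `destination` object, roughly like this (property bodies elided). `has(self.pvc) != has(self.s3)` is true only when exactly one of the two fields is present, so both-set and neither-set specs are rejected even with webhooks down:

```yaml
destination:
  type: object
  x-kubernetes-validations:
    - rule: "has(self.pvc) != has(self.s3)"
      message: "exactly one of pvc or s3 must be specified"
  properties:
    pvc:
      type: object
      # ...
    s3:
      type: object
      # ...
```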

In `@config/webhook/manifests.yaml`:
- Around line 33-72: The generated Mutating/ValidatingWebhookConfiguration
entries for the webhooks vetcdbackup.kb.io and vetcdbackupschedule.kb.io (look
for clientConfig.service.path values /validate-etcd-aenix-io-v1alpha1-etcdbackup
and /validate-etcd-aenix-io-v1alpha1-etcdbackupschedule) lack an explicit
timeoutSeconds and therefore inherit the API server default; add a
timeoutSeconds field (e.g., timeoutSeconds: 10) to each webhook block in the
manifest so the generated YAML mirrors the chart template and avoids long stalls
when the webhook is unreachable.

In `@internal/controller/etcdbackupschedule_controller_test.go`:
- Around line 274-278: Remove the unnecessary compile-time verification block
(the anonymous var _ = func() { ... } closure) that assigns _ =
errors.IsNotFound and _ = ptr.To[int32]; simply delete that entire block from
etcdbackupschedule_controller_test.go since the tests already reference
errors.IsNotFound and ptr.To elsewhere; ensure no other references remain to
that anonymous function or its variables.

In `@internal/controller/etcdbackupschedule_controller.go`:
- Around line 127-167: The code uses reflect.DeepEqual to compare
existingCronJob.Spec and desiredCronJob.Spec in reconcileExistingCronJob which
can treat semantically equal objects (e.g., nil vs empty slices) as different
and cause unnecessary updates; replace the comparison with a Kubernetes-aware
semantic equality check (e.g., import "k8s.io/apimachinery/pkg/api/equality" and
use equality.Semantic.DeepEqual(existingCronJob.Spec, desiredCronJob.Spec)) or
implement targeted field comparisons for the specific spec fields you care about
(normalize nil/empty slices/maps first) before assigning existingCronJob.Spec
and calling r.Update.

In `@internal/controller/factory/backup_cronjob_test.go`:
- Around line 30-46: TestCreateBackupCronJob_PVC currently ignores errors
returned by clientgoscheme.AddToScheme and etcdaenixiov1alpha1.AddToScheme;
update the test to explicitly check and fail on those errors by capturing their
return values and calling t.Fatalf or t.Fatal when non-nil. Locate the two
AddToScheme calls in TestCreateBackupCronJob_PVC and change them to assign the
error (e.g., err := clientgoscheme.AddToScheme(scheme)) and handle err similarly
for etcdaenixiov1alpha1.AddToScheme so the test fails fast if scheme
registration fails.

In `@internal/controller/factory/backup_job_test.go`:
- Around line 142-203: The test fails to assert the new S3 env wiring added by
CreateBackupJob: add checks in backup_job_test.go (after building envMap from
job.Spec.Template.Spec.Containers[0].Env) to assert envMap["S3_ENDPOINT"].Value
== "https://s3.example.com" and that envMap["S3_FORCE_PATH_STYLE"].Value exists
and equals the expected default (e.g. "false") so the S3 endpoint and path-style
behavior are validated for the backup created from the backup object with
Endpoint set.

In `@internal/controller/factory/backup_job.go`:
- Around line 205-218: buildEndpoints currently returns an empty string when
cluster.Spec.Replicas is explicitly 0; change buildEndpoints to detect replicas
== 0 and return a clear sentinel string (for example
"no-replicas:<clusterName>.<namespace>") instead of "", so callers get a
deterministic, debuggable value; update the logic inside buildEndpoints (which
uses GetServerProtocol, GetHeadlessServiceName and reads cluster.Spec.Replicas)
to check for zero replicas and return the descriptive message immediately before
building endpoints.

In `@Makefile`:
- Around line 121-128: The Makefile duplicates the backup-agent build command in
both the build and build-backup-agent targets; change the build target to depend
on build-backup-agent (add build-backup-agent to its prerequisites) and remove
the duplicate "go build -o bin/backup-agent cmd/backup-agent/main.go" line from
the build recipe so the backup-agent is built only via the build-backup-agent
target (leave the build-backup-agent target containing the go build command
intact).
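The deduplicated targets could look roughly like this; the extra prerequisites the real `build` target carries (e.g. `manifests`, `generate`, `fmt`, `vet` in typical kubebuilder Makefiles) are omitted here for brevity:

```make
.PHONY: build build-backup-agent

build: build-backup-agent ## Build manager binary (backup-agent built via prerequisite).
	go build -o bin/manager cmd/manager/main.go

build-backup-agent: ## Build backup-agent binary.
	go build -o bin/backup-agent cmd/backup-agent/main.go
```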

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 11eaf79a-1ab5-4d2c-9c09-359f4139cf3e

📥 Commits

Reviewing files that changed from the base of the PR and between 6537618 and 788baac.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (40)
  • Dockerfile
  • Makefile
  • PROJECT
  • api/v1alpha1/etcdbackup_types.go
  • api/v1alpha1/etcdbackup_webhook.go
  • api/v1alpha1/etcdbackup_webhook_test.go
  • api/v1alpha1/etcdbackupschedule_types.go
  • api/v1alpha1/etcdbackupschedule_webhook.go
  • api/v1alpha1/etcdbackupschedule_webhook_test.go
  • api/v1alpha1/webhook_suite_test.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • charts/etcd-operator/crds/etcd-backup.yaml
  • charts/etcd-operator/templates/cert-manager/validatingwebhookconfiguration.yml
  • charts/etcd-operator/templates/rbac/clusterrole-manager-role.yml
  • charts/etcd-operator/templates/workload/deployment.yml
  • cmd/backup-agent/main.go
  • cmd/manager/main.go
  • config/crd/bases/etcd.aenix.io_etcdbackups.yaml
  • config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml
  • config/crd/bases/etcd.aenix.io_etcdclusters.yaml
  • config/crd/kustomization.yaml
  • config/crd/patches/cainjection_in_etcdbackups.yaml
  • config/crd/patches/cainjection_in_etcdbackupschedules.yaml
  • config/crd/patches/webhook_in_etcdbackups.yaml
  • config/crd/patches/webhook_in_etcdbackupschedules.yaml
  • config/rbac/etcdbackup_editor_role.yaml
  • config/rbac/etcdbackup_viewer_role.yaml
  • config/rbac/role.yaml
  • config/samples/etcd.aenix.io_v1alpha1_etcdbackup.yaml
  • config/samples/etcd.aenix.io_v1alpha1_etcdbackupschedule.yaml
  • config/webhook/manifests.yaml
  • go.mod
  • internal/controller/etcdbackup_controller.go
  • internal/controller/etcdbackup_controller_test.go
  • internal/controller/etcdbackupschedule_controller.go
  • internal/controller/etcdbackupschedule_controller_test.go
  • internal/controller/factory/backup_cronjob.go
  • internal/controller/factory/backup_cronjob_test.go
  • internal/controller/factory/backup_job.go
  • internal/controller/factory/backup_job_test.go


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (6)
cmd/backup-agent/main_test.go (2)

106-107: Prefer using tls.VersionTLS12 constant over magic number.

Using the named constant improves readability and ensures consistency with the implementation in buildTLSConfig().

♻️ Suggested improvement

Add crypto/tls to imports:

import (
	"crypto/tls"
	"path/filepath"
	"strings"
	"testing"
)

Then update the assertion:

-	if cfg.MinVersion != 0x0303 { // tls.VersionTLS12
-		t.Errorf("expected MinVersion TLS 1.2 (0x0303), got 0x%04x", cfg.MinVersion)
+	if cfg.MinVersion != tls.VersionTLS12 {
+		t.Errorf("expected MinVersion TLS 1.2, got 0x%04x", cfg.MinVersion)
	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/backup-agent/main_test.go` around lines 106 - 107, Replace the magic hex
literal with the tls package constant: import "crypto/tls" in
cmd/backup-agent/main_test.go and change the assertion comparing cfg.MinVersion
to use tls.VersionTLS12 instead of 0x0303 (the cfg variable and the test
expectation in the test that validates buildTLSConfig()) so the test reads
cfg.MinVersion != tls.VersionTLS12 and the error message reflects that constant.

81-91: Test name is slightly misleading.

t.Setenv("ETCD_TLS_ENABLED", "") sets the variable to an empty string rather than unsetting it. While the behavior is the same (since buildTLSConfig checks != "true"), consider renaming to TestBuildTLSConfig_EmptyValue for accuracy, or use os.Unsetenv to truly test the unset case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/backup-agent/main_test.go` around lines 81 - 91, The test name
TestBuildTLSConfig_Unset is misleading because it sets ETCD_TLS_ENABLED to an
empty string rather than truly unsetting it; update the test that calls
buildTLSConfig() to either (a) rename the test to TestBuildTLSConfig_EmptyValue
to reflect the current behavior, or (b) actually unset the environment variable
before calling buildTLSConfig by calling os.Unsetenv("ETCD_TLS_ENABLED") (or
t.Unsetenv if available) and keep the name TestBuildTLSConfig_Unset so the test
semantics match its name; ensure the assertion logic remains the same.
internal/controller/factory/statefulset.go (1)

461-474: podEnv variable in generateRestoreInitContainers is used in only one place.

The podEnv slice containing POD_NAME and POD_NAMESPACE is defined at the top of generateRestoreInitContainers but consumed only by restoreDatadirContainer (line 587). This appears intentional, since the restore-agent container doesn't need pod metadata, but consider removing the variable and defining the env vars inline at line 587 to improve clarity.

♻️ Optional: Define env vars inline
-	podEnv := []corev1.EnvVar{
-		{
-			Name: "POD_NAME",
-			ValueFrom: &corev1.EnvVarSource{
-				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
-			},
-		},
-		{
-			Name: "POD_NAMESPACE",
-			ValueFrom: &corev1.EnvVarSource{
-				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.namespace"},
-			},
-		},
-	}
-
 	// Env vars for the restore-agent init container
 	restoreAgentEnv := []corev1.EnvVar{
 		{Name: "ETCD_DATA_DIR", Value: "/var/run/etcd/default.etcd"},
 	}

Then at line 587:

-		Env:     podEnv,
+		Env: []corev1.EnvVar{
+			{
+				Name: "POD_NAME",
+				ValueFrom: &corev1.EnvVarSource{
+					FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
+				},
+			},
+			{
+				Name: "POD_NAMESPACE",
+				ValueFrom: &corev1.EnvVarSource{
+					FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.namespace"},
+				},
+			},
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/factory/statefulset.go` around lines 461 - 474, The
podEnv variable is defined in generateRestoreInitContainers but never used
except inside restoreDatadirContainer; remove the unused podEnv slice and
instead add the two EnvVar entries (POD_NAME with FieldRef metadata.name and
POD_NAMESPACE with FieldRef metadata.namespace) inline in the Env/EnvFrom list
where restoreDatadirContainer is constructed (reference restoreDatadirContainer
and generateRestoreInitContainers) so the env vars remain present but the unused
podEnv symbol is eliminated.
internal/controller/factory/statefulset_test.go (1)

98-98: Add a bootstrap-restore case instead of only threading the new arg through.

All updated call sites still pass "", so the new operatorImage-gated branch in CreateOrUpdateStatefulSet never runs here. Please add one case with spec.bootstrap.restore set and a real image, plus one asserting the empty-image error, so the new parameter is actually covered.

Also applies to: 162-162, 324-324, 384-393

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/factory/statefulset_test.go` at line 98, The test
currently only calls CreateOrUpdateStatefulSet(ctx, &etcdcluster, k8sClient, "")
which never exercises the operatorImage-gated branch; add two test cases in
statefulset_test.go that exercise CreateOrUpdateStatefulSet: one where
etcdcluster.Spec.Bootstrap.Restore (or spec.bootstrap.restore) is set and you
pass a non-empty operatorImage string (e.g., "operator:latest") and assert
success, and another where Restore is set but you pass "" and assert the
function returns the expected empty-image error; ensure these cases use the
existing etcdcluster and k8sClient setup (and mirror analogous additions at the
other call sites referenced) so the new parameter behavior is covered.
charts/etcd-operator/crds/etcd-cluster.yaml (1)

60-116: Consider adding schema-level XOR constraint for source field.

The bootstrap.restore.source field allows both s3 and pvc to be set (or neither) at the schema level. While the webhook (validateBackupDestination) enforces exactly-one semantics, this validation can be bypassed via --dry-run, when webhooks are unavailable, or through direct etcd writes.

Adding OpenAPI constraints provides defense-in-depth:

♻️ Suggested schema enhancement
                          properties:
                            pvc:
                              # ... pvc definition ...
                            s3:
                              # ... s3 definition ...
                          type: object
+                         minProperties: 1
+                         maxProperties: 1

Alternatively, you can use oneOf with separate schemas for each destination type if more complex validation is needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@charts/etcd-operator/crds/etcd-cluster.yaml` around lines 60 - 116, The
schema for bootstrap.restore.source currently allows both or neither of s3 and
pvc; add an OpenAPI XOR constraint (use oneOf with two object schemas—one
requiring s3 and forbidding pvc, the other requiring pvc and forbidding s3—or
use anyOf + a not to forbid both) to enforce exactly-one semantics at the CRD
level for bootstrap.restore.source; reference the existing properties named "s3"
and "pvc" and ensure the new oneOf/anyOf block mirrors the existing nested
schemas and required fields so it complements the validateBackupDestination
webhook.
internal/controller/etcdbackupschedule_controller.go (1)

229-235: Minor: Condition message could be more precise.

The message "is active" at line 233 could be interpreted as the CronJob currently running Jobs (the technical meaning of Status.Active in CronJob). Consider using "exists" or "is configured" for clarity.

♻️ Suggested change
 	meta.SetStatusCondition(&schedule.Status.Conditions, metav1.Condition{
 		Type:               etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady,
 		Status:             metav1.ConditionTrue,
 		Reason:             "CronJobReady",
-		Message:            fmt.Sprintf("CronJob %q is active", existingCronJob.Name),
+		Message:            fmt.Sprintf("CronJob %q is configured", existingCronJob.Name),
 		ObservedGeneration: schedule.Generation,
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/etcdbackupschedule_controller.go` around lines 229 - 235,
Update the readiness condition message to avoid ambiguity around
CronJob.Status.Active: in the call to meta.SetStatusCondition that updates
schedule.Status.Conditions (the metav1.Condition with Type
etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady and Reason "CronJobReady"),
change the Message from fmt.Sprintf("CronJob %q is active",
existingCronJob.Name) to a clearer phrasing such as fmt.Sprintf("CronJob %q
exists", existingCronJob.Name) or "is configured" so it unambiguously indicates
the CronJob resource presence rather than running jobs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v1alpha1/etcdbackup_webhook.go`:
- Around line 54-65: The validation currently enforces a 52-char maxJobNameLen
for Jobs (using const jobSuffix = "-backup" and computing maxNameLen :=
maxJobNameLen - len(jobSuffix)), which is wrong for GetBackupJobName() because
it creates a plain Job (63-char limit). Update the logic in
etcdbackup_webhook.go to use the Job name limit of 63 (set maxJobNameLen to 63
or otherwise compute maxNameLen from 63), recompute maxNameLen := maxJobNameLen
- len(jobSuffix), and keep the field.Invalid append for
field.NewPath("metadata","name")/r.Name but with the corrected limits and
message.
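The corrected arithmetic is small enough to show directly. This sketch assumes, per the review, that `GetBackupJobName()` appends the `-backup` suffix and that a plain Job may use the full 63-character DNS-1123 label length (the 52-character limit only applies to CronJob-created names, which get an 11-character timestamp appended):

```go
package main

import "fmt"

const (
	maxJobNameLen = 63        // plain Job: standard DNS-1123 label limit
	jobSuffix     = "-backup" // suffix appended by GetBackupJobName()
)

func main() {
	maxNameLen := maxJobNameLen - len(jobSuffix)
	fmt.Println(maxNameLen) // 56
}
```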

In `@cmd/restore-agent/main.go`:
- Around line 139-155: The copyFromPVC() path currently opens PVC_BACKUP_PATH
and calls writeSnapshot without any timeout or cancellation; update copyFromPVC
to accept/create a context with a deadline (e.g., context.WithTimeout) and use
that context to guard the read operation by either (a) changing writeSnapshot to
accept a context and perform a context-aware copy (using io.CopyN with periodic
context checks or a context-aware reader) or (b) wrapping the src file in a
reader that checks the context and aborts on timeout, ensuring src is closed on
cancel; make sure error messages include context cancellation/timeout semantics
so the init container won't hang on slow NFS PVCs.

In `@config/samples/etcd.aenix.io_v1alpha1_etcdcluster_restore.yaml`:
- Around line 6-12: The sample uses bootstrap.restore with a PVC source
(source.pvc.claimName and subPath) while replicas is set to 3, which will cause
every StatefulSet pod to mount the same backup PVC (ReadWriteOnce) and likely
stall scheduling; change the example to replicas: 1 when using a local PVC
restore or replace the PVC-based source with an S3/object-store restore example,
and if you keep PVC-based restore document the recommended workflow: set
replicas to 1 for the restore, verify the cluster recovered, then scale the
EtcdCluster up to the desired replica count.

In `@internal/controller/etcdbackup_controller.go`:
- Around line 92-100: The code currently trusts any Job with the generated
jobName; after fetching existingJob in r.Get, check that the Job is owned by the
EtcdBackup (use metav1.IsControlledBy(existingJob, backup) or inspect
existingJob.OwnerReferences for the EtcdBackup UID/kind) before calling
r.reconcileJobStatus; if the Job is not controller-owned, do not adopt it—return
a non-adopting result (e.g., return an error or skip reconciliation with a clear
log) so only Jobs with a proper owner reference to the EtcdBackup are used.

In `@internal/controller/etcdbackupschedule_controller_test.go`:
- Around line 139-167: The test never changes the EtcdBackupSchedule spec before
the second Reconcile so it doesn't exercise the update path; after the first
Reconcile (which creates the CronJob) fetch and mutate the EtcdBackupSchedule's
spec.schedule to a different cron string, persist that change via the k8s
client, then call reconciler.Reconcile again and assert that
factory.GetBackupCronJobName(schedule) now has the updated CronJob.Spec.Schedule
and that the EtcdBackupSchedule status Ready condition is true; this will
exercise reconcileExistingCronJob/cronJobNeedsUpdate rather than just the
"already exists" branch.

In `@internal/controller/factory/backup_cronjob.go`:
- Around line 57-61: The factory in
internal/controller/factory/backup_cronjob.go unconditionally sets
Spec.ConcurrencyPolicy = batchv1.ForbidConcurrent but cronJobNeedsUpdate in
internal/controller/etcdbackupschedule_controller.go does not compare that
field, causing drift; update cronJobNeedsUpdate to compare
desired.Spec.ConcurrencyPolicy against actual.Spec.ConcurrencyPolicy (and return
true when they differ) so reconciler will reset the value to ForbidConcurrent,
or alternatively add a comparator helper that explicitly checks
ConcurrencyPolicy alongside the other spec fields referenced in
cronJobNeedsUpdate.

In `@internal/controller/factory/backup_job.go`:
- Around line 68-75: The PodSecurityContext currently sets
RunAsNonRoot/RunAsUser/RunAsGroup but omits FSGroup, which causes PVC mounts to
remain root-owned and makes writeToPVC() fail with permission denied; update the
PodSecurityContext in the backup job spec to include FSGroup:
ptr.To(int64(65532)) (matching RunAsGroup/RunAsUser) so the PVC volume gets the
proper group ownership, and apply the same change to the other backup job
PodSecurityContext instance referenced around lines ~180-200.
- Around line 147-177: The S3 backup container used by uploadToS3() only
requests CPU/memory but writes a full snapshot to disk; update the backup Job
spec created in backup_job.go to add ephemeral-storage resource requests/limits
and a dedicated emptyDir scratch volume (mount it into the container and set
TMPDIR to that mount) so large uploads use node local scratch and avoid
eviction; locate the S3 handling branch (the block that builds envVars for S3
and the uploadToS3() caller) and: 1) add resources.requests["ephemeral-storage"]
and resources.limits["ephemeral-storage"] to the container spec, 2) add a volume
of type emptyDir and a corresponding volumeMount (e.g., /tmp/backup-scratch),
and 3) add/override env var TMPDIR (or BACKUP_TMPDIR) to point to that mount;
apply the same changes for the other S3 branch referenced around lines 203-217.

---

Nitpick comments:
In `@charts/etcd-operator/crds/etcd-cluster.yaml`:
- Around line 60-116: The schema for bootstrap.restore.source currently allows
both or neither of s3 and pvc; add an OpenAPI XOR constraint (use oneOf with two
object schemas—one requiring s3 and forbidding pvc, the other requiring pvc and
forbidding s3—or use anyOf + a not to forbid both) to enforce exactly-one
semantics at the CRD level for bootstrap.restore.source; reference the existing
properties named "s3" and "pvc" and ensure the new oneOf/anyOf block mirrors the
existing nested schemas and required fields so it complements the
validateBackupDestination webhook.

In `@cmd/backup-agent/main_test.go`:
- Around line 106-107: Replace the magic hex literal with the tls package
constant: import "crypto/tls" in cmd/backup-agent/main_test.go and change the
assertion comparing cfg.MinVersion to use tls.VersionTLS12 instead of 0x0303
(the cfg variable and the test expectation in the test that validates
buildTLSConfig()) so the test reads cfg.MinVersion != tls.VersionTLS12 and the
error message reflects that constant.
- Around line 81-91: The test name TestBuildTLSConfig_Unset is misleading
because it sets ETCD_TLS_ENABLED to an empty string rather than truly unsetting
it; update the test that calls buildTLSConfig() to either (a) rename the test to
TestBuildTLSConfig_EmptyValue to reflect the current behavior, or (b) actually
unset the environment variable before calling buildTLSConfig by calling
os.Unsetenv("ETCD_TLS_ENABLED") (or t.Unsetenv if available) and keep the name
TestBuildTLSConfig_Unset so the test semantics match its name; ensure the
assertion logic remains the same.

In `@internal/controller/etcdbackupschedule_controller.go`:
- Around line 229-235: Update the readiness condition message to avoid ambiguity
around CronJob.Status.Active: in the call to meta.SetStatusCondition that
updates schedule.Status.Conditions (the metav1.Condition with Type
etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady and Reason "CronJobReady"),
change the Message from fmt.Sprintf("CronJob %q is active",
existingCronJob.Name) to a clearer phrasing such as fmt.Sprintf("CronJob %q
exists", existingCronJob.Name) or "is configured" so it unambiguously indicates
the CronJob resource presence rather than running jobs.

In `@internal/controller/factory/statefulset_test.go`:
- Line 98: The test currently only calls CreateOrUpdateStatefulSet(ctx,
&etcdcluster, k8sClient, "") which never exercises the operatorImage-gated
branch; add two test cases in statefulset_test.go that exercise
CreateOrUpdateStatefulSet: one where etcdcluster.Spec.Bootstrap.Restore (or
spec.bootstrap.restore) is set and you pass a non-empty operatorImage string
(e.g., "operator:latest") and assert success, and another where Restore is set
but you pass "" and assert the function returns the expected empty-image error;
ensure these cases use the existing etcdcluster and k8sClient setup (and mirror
analogous additions at the other call sites referenced) so the new parameter
behavior is covered.

In `@internal/controller/factory/statefulset.go`:
- Around line 461-474: The podEnv variable is defined in
generateRestoreInitContainers but never used except inside
restoreDatadirContainer; remove the unused podEnv slice and instead add the two
EnvVar entries (POD_NAME with FieldRef metadata.name and POD_NAMESPACE with
FieldRef metadata.namespace) inline in the Env/EnvFrom list where
restoreDatadirContainer is constructed (reference restoreDatadirContainer and
generateRestoreInitContainers) so the env vars remain present but the unused
podEnv symbol is eliminated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b6b9a4e9-1c43-4f3f-9462-b3ebc7a464b3

📥 Commits

Reviewing files that changed from the base of the PR and between 788baac and 290ce98.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (33)
  • Dockerfile
  • Makefile
  • api/v1alpha1/etcdbackup_webhook.go
  • api/v1alpha1/etcdbackup_webhook_test.go
  • api/v1alpha1/etcdbackupschedule_webhook.go
  • api/v1alpha1/etcdbackupschedule_webhook_test.go
  • api/v1alpha1/etcdcluster_types.go
  • api/v1alpha1/etcdcluster_webhook.go
  • api/v1alpha1/etcdcluster_webhook_test.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • charts/etcd-operator/crds/etcd-backup-schedule.yaml
  • charts/etcd-operator/crds/etcd-backup.yaml
  • charts/etcd-operator/crds/etcd-cluster.yaml
  • cmd/backup-agent/main.go
  • cmd/backup-agent/main_test.go
  • cmd/manager/main.go
  • cmd/restore-agent/main.go
  • cmd/restore-agent/main_test.go
  • config/crd/bases/etcd.aenix.io_etcdclusters.yaml
  • config/samples/etcd.aenix.io_v1alpha1_etcdcluster_restore.yaml
  • go.mod
  • internal/controller/cronjob_update_test.go
  • internal/controller/etcdbackup_controller.go
  • internal/controller/etcdbackup_controller_test.go
  • internal/controller/etcdbackupschedule_controller.go
  • internal/controller/etcdbackupschedule_controller_test.go
  • internal/controller/etcdcluster_controller.go
  • internal/controller/factory/backup_cronjob.go
  • internal/controller/factory/backup_job.go
  • internal/controller/factory/backup_job_test.go
  • internal/controller/factory/statefulset.go
  • internal/controller/factory/statefulset_restore_test.go
  • internal/controller/factory/statefulset_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • Dockerfile
  • api/v1alpha1/etcdbackup_webhook_test.go
  • internal/controller/etcdbackup_controller_test.go
  • Makefile
  • cmd/backup-agent/main.go
  • api/v1alpha1/etcdbackupschedule_webhook_test.go

Member

@sircthulhu sircthulhu left a comment


Review Summary

Good architecture overall — two CRDs, Job-based execution, CronJob for scheduling. This aligns well with patterns from CloudNativePG and Percona operators.

I've compared the implementation with coreos/etcd-operator, CloudNativePG, and Percona XtraDB Cluster Operator. Key findings are in the inline comments below.

Critical

  1. Missing FSGroup in PodSecurityContext — PVC writes will fail on most storage classes
  2. Conversion webhook patch is invalid — no conversion handlers exist
  3. No ownership check when adopting existing Job
  4. Backup file naming — operator should own naming, user should only specify directory/prefix

Important

  1. cronJobNeedsUpdate doesn't compare ConcurrencyPolicy
  2. No ephemeral-storage resource requests for S3 temp file
  3. PVC restore sample uses replicas: 3 with potentially ReadWriteOnce PVC
  4. Job name length limit uses 52 (CronJob limit) instead of 63 (Job limit) for EtcdBackup

Suggestions

  1. S3/PVC env var generation is duplicated between backup_job.go and statefulset.go
  2. No activeDeadlineSeconds on backup Job
  3. TTL of 600s is aggressive for debugging

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
internal/controller/factory/backup_job.go (1)

150-180: ⚠️ Potential issue | 🟠 Major

Budget scratch space for S3 uploads.

The S3 path writes a full snapshot to local temp storage before upload, but this container only requests CPU and memory. Large backups can be evicted or fail mid-upload once node ephemeral storage is tight. Because buildBackupContainer() is shared, fixing it here also covers scheduled backups.

Proposed fix
 	if s3 := destination.S3; s3 != nil {
+		volumes = append(volumes, corev1.Volume{
+			Name: "backup-scratch",
+			VolumeSource: corev1.VolumeSource{
+				EmptyDir: &corev1.EmptyDirVolumeSource{},
+			},
+		})
+		volumeMounts = append(volumeMounts, corev1.VolumeMount{
+			Name:      "backup-scratch",
+			MountPath: "/tmp/backup-scratch",
+		})
 		envVars = append(envVars,
 			corev1.EnvVar{Name: "BACKUP_DESTINATION", Value: "s3"},
+			corev1.EnvVar{Name: "TMPDIR", Value: "/tmp/backup-scratch"},
 			corev1.EnvVar{Name: "S3_ENDPOINT", Value: s3.Endpoint},
 			corev1.EnvVar{Name: "S3_BUCKET", Value: s3.Bucket},
 			corev1.EnvVar{Name: "S3_KEY", Value: s3.Key},
 			corev1.EnvVar{Name: "S3_REGION", Value: s3.Region},
@@
 		Resources: corev1.ResourceRequirements{
 			Requests: corev1.ResourceList{
 				corev1.ResourceCPU:    resource.MustParse("100m"),
 				corev1.ResourceMemory: resource.MustParse("128Mi"),
+				corev1.ResourceEphemeralStorage: resource.MustParse("2Gi"),
 			},
 			Limits: corev1.ResourceList{
 				corev1.ResourceMemory: resource.MustParse("512Mi"),
+				corev1.ResourceEphemeralStorage: resource.MustParse("4Gi"),
 			},
 		},

Also applies to: 206-220

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/factory/backup_job.go` around lines 150 - 180, The
container used for S3 uploads needs ephemeral scratch space and
ephemeral-storage resource requests/limits to avoid eviction; in
buildBackupContainer() add an emptyDir volume (e.g., name "backup-scratch")
mounted into the container at a scratch path (e.g., "/tmp/backup") and set
container.Resources.Requests["ephemeral-storage"] and
.Limits["ephemeral-storage"] to sensible values (and keep CPU/memory as before);
apply the same change to the other S3-handling branch/usage referenced around
lines 206-220 so all S3 backup containers get the scratch volume mount and
ephemeral-storage resource settings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/etcdbackupschedule_controller.go`:
- Around line 171-173: After successfully fetching an existing CronJob, verify
it is actually owned by the EtcdBackupSchedule before calling
reconcileExistingCronJob: use controllerutil.GetControllerOf(existingCronJob)
(or metav1.GetControllerOf) and compare the returned OwnerReference's UID (and
Kind/Name) to the schedule object; if there is no controller reference or it
does not match the schedule, return an error or requeue (do not call
reconcileExistingCronJob) so we don't mutate or mirror status for CronJobs this
schedule does not own.
- Around line 42-92: cronJobNeedsUpdate currently misses several operator-owned
JobTemplate fields; update it to also detect drift in
JobTemplate.Spec.JobSpec.BackoffLimit,
JobTemplate.Spec.Template.Spec.RestartPolicy,
JobTemplate.Spec.Template.Spec.SecurityContext (pod-level) and container-level
SecurityContext and Resources. For BackoffLimit and RestartPolicy do simple
value comparisons, and for SecurityContext and Resources use reflect.DeepEqual
when comparing existing.Spec.JobTemplate.Spec.Template.Spec.SecurityContext to
desired and for each container compare existingContainers[i].SecurityContext and
existingContainers[i].Resources to desiredContainers[i].SecurityContext and
desiredContainers[i].Resources; return true if any of those differ. Ensure you
reference cronJobNeedsUpdate, existing.Spec.JobTemplate.Spec and
desired.Spec.JobTemplate.Spec, and the container loop when adding these checks.

In `@internal/controller/factory/backup_cronjob.go`:
- Around line 62-79: The CronJob JobTemplate's PodSpec lacks an
ActiveDeadlineSeconds setting, so hung backup-agent pods can run indefinitely
and block subsequent schedules; update the PodTemplate Spec inside the CronJob
(in backup_cronjob.go) to include ActiveDeadlineSeconds (e.g., add
ActiveDeadlineSeconds: ptr.To(int64(<same value used in CreateBackupJob>))) on
the corev1.PodSpec for the JobTemplate so it matches the deadline used by
CreateBackupJob and ensures jobs are forcibly terminated after the configured
timeout.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bbe6ef50-b578-43c6-ac7e-69372c31ec1d

📥 Commits

Reviewing files that changed from the base of the PR and between 290ce98 and f979d9f.

📒 Files selected for processing (9)
  • api/v1alpha1/etcdbackup_webhook.go
  • api/v1alpha1/etcdbackup_webhook_test.go
  • config/crd/patches/webhook_in_etcdbackups.yaml
  • config/crd/patches/webhook_in_etcdbackupschedules.yaml
  • config/samples/etcd.aenix.io_v1alpha1_etcdcluster_restore.yaml
  • internal/controller/etcdbackup_controller.go
  • internal/controller/etcdbackupschedule_controller.go
  • internal/controller/factory/backup_cronjob.go
  • internal/controller/factory/backup_job.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/v1alpha1/etcdbackup_webhook_test.go


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
internal/controller/etcdbackupschedule_controller.go (2)

171-173: ⚠️ Potential issue | 🟠 Major

Guard against same-name CronJobs not controlled by this schedule.

Line 173 reconciles by name only. If a CronJob with that name is user-managed or owned by another controller, this reconciler can mutate and mirror status for the wrong object.

Proposed fix
 	err := r.Get(ctx, types.NamespacedName{Name: cronJobName, Namespace: schedule.Namespace}, existingCronJob)
 	if err == nil {
+		owner := metav1.GetControllerOf(existingCronJob)
+		if owner == nil || owner.UID != schedule.UID || owner.Kind != "EtcdBackupSchedule" || owner.Name != schedule.Name {
+			message := fmt.Sprintf("CronJob %q already exists and is not controlled by this EtcdBackupSchedule", cronJobName)
+			meta.SetStatusCondition(&schedule.Status.Conditions, metav1.Condition{
+				Type:               etcdaenixiov1alpha1.EtcdBackupScheduleConditionFailed,
+				Status:             metav1.ConditionTrue,
+				Reason:             "CronJobNameConflict",
+				Message:            message,
+				ObservedGeneration: schedule.Generation,
+			})
+			meta.SetStatusCondition(&schedule.Status.Conditions, metav1.Condition{
+				Type:               etcdaenixiov1alpha1.EtcdBackupScheduleConditionReady,
+				Status:             metav1.ConditionFalse,
+				Reason:             "CronJobNameConflict",
+				Message:            message,
+				ObservedGeneration: schedule.Generation,
+			})
+			return r.updateStatus(ctx, schedule)
+		}
 		return r.reconcileExistingCronJob(ctx, schedule, cluster, existingCronJob)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/etcdbackupschedule_controller.go` around lines 171 - 173,
The current code calls r.Get and then immediately returns
r.reconcileExistingCronJob when a CronJob with cronJobName exists, but it does
not check ownership and may mutate user-managed or foreign-controller objects;
modify the logic after r.Get(...) to verify ownership (e.g., use
controllerutil.IsControlledBy(existingCronJob, schedule) or inspect
existingCronJob.GetOwnerReferences() for a matching UID/kind) and only call
r.reconcileExistingCronJob when the CronJob is owned by this EtcdBackupSchedule;
if not owned, skip reconcile (log/record an event and return without mutating)
to avoid touching same-name CronJobs managed by others.

42-92: ⚠️ Potential issue | 🟠 Major

Detect drift for all operator-owned JobTemplate security/runtime fields.

Line 42 currently misses drift checks for BackoffLimit, pod RestartPolicy, pod SecurityContext, container SecurityContext, and container Resources. Manual edits to these fields will not be corrected.

Proposed fix
 func cronJobNeedsUpdate(existing, desired *batchv1.CronJob) bool {
 	if existing.Spec.Schedule != desired.Spec.Schedule {
 		return true
 	}
 	if existing.Spec.ConcurrencyPolicy != desired.Spec.ConcurrencyPolicy {
 		return true
 	}
 	if !equality.Semantic.DeepEqual(existing.Spec.SuccessfulJobsHistoryLimit, desired.Spec.SuccessfulJobsHistoryLimit) {
 		return true
 	}
 	if !equality.Semantic.DeepEqual(existing.Spec.FailedJobsHistoryLimit, desired.Spec.FailedJobsHistoryLimit) {
 		return true
 	}
+	if !equality.Semantic.DeepEqual(existing.Spec.JobTemplate.Spec.BackoffLimit, desired.Spec.JobTemplate.Spec.BackoffLimit) {
+		return true
+	}
 	if !equality.Semantic.DeepEqual(existing.Labels, desired.Labels) {
 		return true
 	}
+
+	existingPodSpec := existing.Spec.JobTemplate.Spec.Template.Spec
+	desiredPodSpec := desired.Spec.JobTemplate.Spec.Template.Spec
+	if existingPodSpec.RestartPolicy != desiredPodSpec.RestartPolicy {
+		return true
+	}
+	if !equality.Semantic.DeepEqual(existingPodSpec.SecurityContext, desiredPodSpec.SecurityContext) {
+		return true
+	}
 
 	existingContainers := existing.Spec.JobTemplate.Spec.Template.Spec.Containers
 	desiredContainers := desired.Spec.JobTemplate.Spec.Template.Spec.Containers
 	if len(existingContainers) != len(desiredContainers) {
 		return true
@@
 		if !equality.Semantic.DeepEqual(existingContainers[i].VolumeMounts, desiredContainers[i].VolumeMounts) {
 			return true
 		}
+		if !equality.Semantic.DeepEqual(existingContainers[i].SecurityContext, desiredContainers[i].SecurityContext) {
+			return true
+		}
+		if !equality.Semantic.DeepEqual(existingContainers[i].Resources, desiredContainers[i].Resources) {
+			return true
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/etcdbackupschedule_controller.go` around lines 42 - 92,
cronJobNeedsUpdate currently misses drift detection for JobTemplate backoff
limit, pod restart policy and pod security context, plus each container's
SecurityContext and Resources; update the function to compare
existing.Spec.JobTemplate.Spec.BackoffLimit (use equality.Semantic.DeepEqual),
existing.Spec.JobTemplate.Spec.Template.Spec.RestartPolicy (direct compare),
existing.Spec.JobTemplate.Spec.Template.Spec.SecurityContext (DeepEqual), and
for each container also compare existingContainers[i].SecurityContext and
existingContainers[i].Resources to
desiredContainers[i].SecurityContext/Resources with equality.Semantic.DeepEqual
so manual edits to those fields are detected and trigger updates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 451791ed-1356-48b5-b189-0933c47c22a1

📥 Commits

Reviewing files that changed from the base of the PR and between f979d9f and ec1abf0.

📒 Files selected for processing (3)
  • api/v1alpha1/etcdbackup_webhook.go
  • api/v1alpha1/etcdcluster_webhook.go
  • internal/controller/etcdbackupschedule_controller.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/v1alpha1/etcdbackup_webhook.go

ohotnikov.ivan and others added 16 commits March 19, 2026 18:00
Add EtcdBackup custom resource definition with spec and status types.
Supports S3 and PVC backup destinations with TLS-aware etcd connections.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Validate create (exactly one of S3/PVC, required fields, subPath
path traversal protection) and update (immutable spec, status-only
updates allowed). Register webhook in test suite.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Standalone binary that takes etcd snapshots and stores them to
S3-compatible storage or PVC. Supports TLS client certificates,
creates intermediate directories for SubPath, and ensures data
durability via fsync before close.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Build deterministic backup Jobs from EtcdBackup and EtcdCluster specs.
Handles TLS volume mounts, S3 credential injection via secretKeyRef,
PVC mounting, and endpoint construction from cluster replicas.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Reconcile EtcdBackup resources by creating backup Jobs and tracking
their status via conditions (Started, Complete, Failed). Sets
ObservedGeneration on all conditions. Watches owned Jobs for status
changes.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Generated CRD, RBAC role, and webhook manifests. Add kustomize
patches for webhook and CA injection. Add editor/viewer ClusterRoles
and sample CR.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Build backup-agent binary alongside manager in a single RUN layer.
Add build-backup-agent target and update helm-crd-copy to include
EtcdBackup CRD.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Add RBAC rules for etcdbackups and batch/jobs. Add validating webhook
entry. Inject OPERATOR_IMAGE env var into manager deployment. Copy
generated CRD to chart crds directory.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Make S3 path style configurable via ForcePathStyle CRD field instead
of hardcoding UsePathStyle=true. Include backup name in default PVC
path to prevent overwrites when multiple backups target the same PVC.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Add EtcdBackupSchedule CRD that creates a Kubernetes CronJob to run
periodic etcd snapshots. Reuses BackupDestination types and backup-agent
pod spec from EtcdBackup via extracted buildBackupContainer helper.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Fix TLS CA path: use ServerTrustedCASecret instead of ServerSecret
  for CA certificate volume in backup jobs
- Fix CronJob update loop: replace reflect.DeepEqual on full spec with
  field-by-field comparison of operator-controlled fields only
- Add CronJob name length validation to EtcdBackupSchedule webhook
  (52 char limit minus 17 char suffix)
- Set TLS MinVersion to 1.2 in backup-agent
- Fix S3 upload for large snapshots by writing to temp file first to
  provide Content-Length header
- Fix scheduled backup file overwrite: inject UTC timestamp into
  filenames when BACKUP_TIMESTAMP=true
- Extract shared validateBackupDestination() function and use it
  directly in EtcdBackupSchedule webhook

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Add spec.bootstrap.restore field to EtcdCluster CRD, allowing users
to create new etcd clusters pre-populated with data from a backup
snapshot (S3 or PVC).

Uses a two-init-container architecture:
1. restore-agent (operator image): downloads snapshot to shared volume
2. restore-datadir (etcd image): runs etcdutl snapshot restore

Both init containers check for existing data directory for idempotency.
Bootstrap field is immutable after cluster creation.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Copy all CRDs (including EtcdBackupSchedule) to Helm chart
- Replace cron regex with robfig/cron/v3 parser for proper validation
- Validate OPERATOR_IMAGE env var at startup (fail-fast)
- Make backup/restore agent timeouts configurable via env vars

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add empty snapshot check in restore-agent (reject 0-byte snapshots)
- Pass actual error to log.Error for OPERATOR_IMAGE validation
- Set TTLSecondsAfterFinished (600s) on backup Jobs for auto-cleanup
- Add unit tests for backup-agent (injectTimestamp, buildTLSConfig)
  and restore-agent (getTimeout, copyFromPVC, run idempotency)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add zero-byte snapshot detection in backup-agent (both S3 and PVC
  paths) to prevent writing unrestorable backups
- Fix cronJobNeedsUpdate to compare only operator-set container fields
  (Image, Command, Env, VolumeMounts) instead of full container spec,
  avoiding infinite reconcile loop from API server defaults
- Run go mod tidy to mark robfig/cron/v3 as direct dependency

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
ohotnikov.ivan and others added 12 commits March 19, 2026 18:01
- Clean up zero-byte files on error in both backup-agent and
  restore-agent (os.Remove on all error paths)
- Make restoreSnapshotDir a var for testability; add real tests for
  writeSnapshot (non-empty, empty, PVC copy success/failure)
- Make OPERATOR_IMAGE optional at startup (warn instead of exit);
  validate at reconcile time with clear status condition when missing

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Align restore-agent timeout validation with backup-agent (return
  error on invalid RESTORE_TIMEOUT_MINUTES instead of silent fallback)
- Add table-driven unit test for cronJobNeedsUpdate covering schedule,
  history limits, labels, image, env, command, volumes changes and
  verifying API server defaults do not trigger false positives
- Add controller test for empty OperatorImage (verifies Failed
  condition with ConfigurationError reason)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Extract validateSpec() from ValidateCreate to allow ValidateUpdate
to skip the name length check. This prevents breaking updates of
existing EtcdBackupSchedule resources with names longer than 35
characters that were created before the webhook was added.

Add tests for name length validation (boundary and over-limit),
update-with-long-name, and empty OperatorImage in schedule controller.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
…e Job conditions

- Add name length validation to EtcdBackup webhook (max 45 chars)
  to prevent Job/Pod name overflow at runtime.
- Change snapshot file creation from os.Create (0666) to
  os.OpenFile with 0600 in both backup-agent and restore-agent,
  since etcd snapshots contain sensitive cluster data.
- Use Job condition type=Failed instead of job.Status.Failed > 0
  for more robust terminal state detection in reconcileJobStatus.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add resource requests/limits (100m/128Mi requests, 512Mi memory
  limit) to backup-agent, restore-agent, and restore-datadir containers
  to prevent node resource exhaustion with large etcd snapshots.
- Change MkdirAll from 0755 to 0700 in backup-agent and restore-agent
  since etcd snapshots contain sensitive cluster data.
- Add S3 endpoint URL validation in webhook: must start with
  http:// or https://.
- Fix potential fd leak in backup-agent S3 upload: defer tmpFile.Close()
  immediately after creation instead of closing manually on each path.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Replace raw string "True" with corev1.ConditionTrue in Job
  condition check for type safety and codebase consistency.
- Change EtcdBackupSchedule ClusterNotFound from permanent failure
  to retryable state: set Ready=false and requeue after 30s.
  The schedule controller does not watch EtcdCluster resources,
  so periodic requeue is needed to recover when the cluster
  is created after the schedule.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add PodSecurityContext (RunAsNonRoot, RunAsUser/Group 65532) to
  backup Job, backup CronJob, and restore init containers, matching
  the Dockerfile's nonroot user.
- Add set -e to the restore-datadir init container shell script
  for proper error propagation.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
- Add FSGroup to PodSecurityContext on backup Job/CronJob pods to
  ensure PVC mounts are writable by UID 65532.
- Add ActiveDeadlineSeconds (15min) to backup Job as a safety net
  if backup-agent hangs on a syscall.
- Add Job ownership verification: reject if a Job with the expected
  name exists but is not controlled by this EtcdBackup.
- Add ConcurrencyPolicy to cronJobNeedsUpdate comparison to detect
  drift from manual kubectl edits.
- Fix EtcdBackup name length limit: use 63 (Job DNS label limit)
  instead of 52 (CronJob limit). Max name is now 56 chars.
- Change CRD conversion webhook strategy from Webhook to None for
  EtcdBackup and EtcdBackupSchedule (no conversion handlers exist).
- Fix sample restore CR: use replicas:1 with S3 source to avoid
  ReadWriteOnce PVC scheduling issues with multiple replicas.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
@BROngineer BROngineer force-pushed the feat/etcdbackup-crd branch from 6bfdcb9 to 54658c5 Compare March 19, 2026 18:01

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
api/v1alpha1/etcdbackupschedule_webhook.go (1)

123-128: Consider case-insensitive check for @every.

The robfig/cron parser handles descriptors case-insensitively, so @Every 5m or @EVERY 5m would slip past this case-sensitive prefix check but still fail when Kubernetes attempts to create the CronJob.

🔧 Suggested fix for case-insensitive matching
-	} else if strings.HasPrefix(r.Spec.Schedule, "@every ") {
+	} else if strings.HasPrefix(strings.ToLower(r.Spec.Schedule), "@every ") {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/etcdbackupschedule_webhook.go` around lines 123 - 128, The
validation for rejecting "@every" schedules currently uses
strings.HasPrefix(r.Spec.Schedule, "@every "), which is case-sensitive; update
the check to be case-insensitive by comparing a lowercased schedule (e.g.,
strings.HasPrefix(strings.ToLower(r.Spec.Schedule), "@every ")). Ensure this
change is applied where r.Spec.Schedule is validated (the block that appends
field.Invalid for spec.schedule) so descriptors like "@Every" or "@EVERY" are
also rejected.
internal/controller/factory/backup_job.go (1)

68-78: Disable service-account token automount if the agent is API-free.

I don't see any pod-local need for in-cluster credentials in this spec: the container gets its inputs via env vars, mounted Secrets, and the PVC. If backup-agent does not call the Kubernetes API from inside the pod, set automountServiceAccountToken: ptr.To(false) here and in the CronJob template.

Possible fix
 				Spec: corev1.PodSpec{
+					AutomountServiceAccountToken: ptr.To(false),
 					RestartPolicy: corev1.RestartPolicyNever,
 					SecurityContext: &corev1.PodSecurityContext{
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/factory/backup_job.go` around lines 68 - 78, The PodSpec
for the backup job currently doesn't disable automatic mounting of the service
account token; if the backup-agent process never calls the Kubernetes API from
inside the pod, set automountServiceAccountToken to ptr.To(false) on the PodSpec
in the backup job factory (the corev1.PodSpec block you modified) and also add
the same setting to the CronJob pod template spec used elsewhere in this file so
pods won't receive in-cluster credentials; ensure you only apply this change
when the backup-agent is API-free.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/v1alpha1/etcdbackup_types.go`:
- Around line 98-100: The current printcolumn for Status in etcdbackup_types.go
only queries the Complete condition via JSONPath
`.status.conditions[?(@.type=="Complete")].status`, so resources with
`Failed=True` will appear blank; update the CRD annotations to either add a
second printcolumn that queries the Failed condition (e.g. JSONPath
`.status.conditions[?(@.type=="Failed")].status`) or introduce a dedicated
`.status.phase` field on the EtcdBackup API and point the printcolumn to that
field; modify the API type (EtcdBackup/EtcdBackupStatus) and controller to
populate the chosen field consistently and update the kubebuilder:+printcolumn
annotation(s) accordingly so failures are visible in `kubectl get`.

In `@internal/controller/factory/backup_job.go`:
- Around line 49-52: The code unconditionally sets ActiveDeadlineSeconds from
the activeDeadline variable which can be 0 and is rejected by Kubernetes; update
the Job/Pod spec assignments so ActiveDeadlineSeconds is only set when
activeDeadline > 0 (create a *int64 from activeDeadline and assign it) and leave
it nil otherwise; locate uses of activeDeadline in backup_job.go (where the
Job/Pod spec is built) and the two places in backup_cronjob.go and wrap the
assignment with a conditional so the field is omitted unless activeDeadline > 0.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 994fe613-e9f4-4b3d-9085-790d1c8ed7b2

📥 Commits

Reviewing files that changed from the base of the PR and between ec1abf0 and 6bfdcb9.

📒 Files selected for processing (14)
  • api/v1alpha1/etcdbackup_types.go
  • api/v1alpha1/etcdbackupschedule_types.go
  • api/v1alpha1/etcdbackupschedule_webhook.go
  • api/v1alpha1/etcdbackupschedule_webhook_test.go
  • api/v1alpha1/etcdcluster_webhook.go
  • config/crd/bases/etcd.aenix.io_etcdbackups.yaml
  • config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml
  • config/crd/bases/etcd.aenix.io_etcdclusters.yaml
  • config/rbac/role.yaml
  • internal/controller/etcdbackupschedule_controller_test.go
  • internal/controller/factory/backup_cronjob.go
  • internal/controller/factory/backup_cronjob_test.go
  • internal/controller/factory/backup_job.go
  • internal/controller/factory/backup_job_test.go
✅ Files skipped from review due to trivial changes (4)
  • internal/controller/factory/backup_cronjob_test.go
  • internal/controller/factory/backup_job_test.go
  • api/v1alpha1/etcdbackupschedule_webhook_test.go
  • config/crd/bases/etcd.aenix.io_etcdbackupschedules.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • config/crd/bases/etcd.aenix.io_etcdbackups.yaml
  • api/v1alpha1/etcdcluster_webhook.go
  • internal/controller/etcdbackupschedule_controller_test.go
  • api/v1alpha1/etcdbackupschedule_types.go
  • config/crd/bases/etcd.aenix.io_etcdclusters.yaml

Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
Signed-off-by: Artem Bortnikov <brongineer747@gmail.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 20, 2026

Labels

api-change, controllers, documentation
