Conversation

AmitSahastra
Contributor

@AmitSahastra AmitSahastra commented Jun 17, 2025

Description

This PR adds support for bootstrapping Amazon Linux 2023 (AL2023) nodes through EKSConfig when a launch template is used. The EKSConfig controller can now create a bootstrap data secret containing a NodeConfig, which enables AL2023 images to be used with launch templates.

⚠️ Prior to this change, CAPA always generated cloud-init-style userData, which is incompatible with AL2023. This patch enables EKS-managed bootstrap using nodeadm via MIME-compliant NodeConfig.

Changes

  • Modified EKSConfig controller to fetch AMI ID from AWSManagedMachinePool's launch template
  • Added proper condition handling for control plane readiness
  • Ensured proper reconciliation when dependencies (AWSManagedControlPlane and AWSManagedMachinePool) become ready

Testing

  • Verified EKSConfig correctly generates a bootstrap data secret using a NodeConfig MIME payload when nodeType: al2023 is specified
  • Confirmed that node labels are properly set with the AMI ID
  • Tested reconciliation behaviour when dependencies are not ready
  • Validated that the controller proceeds with data secret creation once dependencies are ready
  • Example node output:
k get nodes -o wide
NAME                                          STATUS   ROLES    AGE   VERSION                INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION                    CONTAINER-RUNTIME
ip-10-0-183-158.ap-south-1.compute.internal   Ready    <none>   17h   v1.30.11-eks-473151a   10.0.183.158   <none>        Amazon Linux 2023.7.20250527   6.1.134-152.225.amzn2023.x86_64   containerd://1.7.27
ip-10-0-65-54.ap-south-1.compute.internal     Ready    <none>   17h   v1.30.11-eks-473151a   10.0.65.54     <none>        Amazon Linux 2023.7.20250527   6.1.134-152.225.amzn2023.x86_64   containerd://1.7.27
  • EKSConfig with nodeType:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfig
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"bootstrap.cluster.x-k8s.io/v1beta2","kind":"EKSConfig","metadata":{"annotations":{},"name":"am-clusterctl-eks-2-pool-0-bootstrap","namespace":"default"},"spec":{"nodeType":"al2023"}}
  creationTimestamp: "2025-06-16T12:26:26Z"
  generation: 1
  labels:
    cluster.x-k8s.io/cluster-name: am-clusterctl-eks-2
  name: am-clusterctl-eks-2-pool-0-bootstrap
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachinePool
    name: am-clusterctl-eks-2-pool-0
    uid: c4721a8a-a456-434c-8e3f-a5722a5ffb13
  resourceVersion: "200264"
  uid: d224aca6-da5e-4788-8fa6-c727ace79c25
spec:
  nodeType: al2023
status:
  conditions:
  - lastTransitionTime: "2025-06-17T03:37:53Z"
    status: "True"
    type: Ready

  • Generated bootstrap data:
value: |-
  MIME-Version: 1.0
  Content-Type: multipart/mixed; boundary="//"

  --//
  Content-Type: application/node.eks.aws

  ---
  apiVersion: node.eks.aws/v1alpha1
  kind: NodeConfig
  spec:
    cluster:
      apiServerEndpoint: https://xxxxxxxx.xxx.ap-south-1.eks.amazonaws.com
      certificateAuthority: xxxxx
      cidr: 10.96.0.0/12
      name: default_am-clusterctl-eks-2-control-plane-2
    kubelet:
      config:
        maxPods: 110
        clusterDNS:
        - 10.96.0.10
      flags:
      - "--node-labels=eks.amazonaws.com/nodegroup-image=ami-0930ab0e58973e126,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=am-clusterctl-eks-2-pool-0-bootstrap"

  --//--

  • AWSMMP
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"infrastructure.cluster.x-k8s.io/v1beta2","kind":"AWSManagedMachinePool","metadata":{"annotations":{},"name":"am-clusterctl-eks-2-pool-0","namespace":"default"},"spec":{"amiType":"CUSTOM","awsLaunchTemplate":{"ami":{"id":"ami-0930ab0e58973e126"},"instanceType":"t3.large","sshKeyName":"spectro2024"},"scaling":{"maxSize":5,"minSize":1}}}
  creationTimestamp: "2025-06-16T12:26:26Z"
  finalizers:
  - awsmanagedmachinepools.infrastructure.cluster.x-k8s.io
  generation: 4
  labels:
    cluster.x-k8s.io/cluster-name: am-clusterctl-eks-2
  name: am-clusterctl-eks-2-pool-0
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachinePool
    name: am-clusterctl-eks-2-pool-0
    uid: c4721a8a-a456-434c-8e3f-a5722a5ffb13
  resourceVersion: "200257"
  uid: 20954503-d2eb-430f-933f-977a70f9cba3
spec:
  amiType: CUSTOM
  awsLaunchTemplate:
    ami:
      id: ami-0930ab0e58973e126
    instanceType: t3.large
    marketType: OnDemand
    sshKeyName: sshkey
  capacityType: onDemand
  eksNodegroupName: default_am-clusterctl-eks-2-pool-0
  providerIDList:
  - aws:///xxxx/i-07142dfb97c7exxxx
  - aws:///xxxx/i-0ef83e9b00a9fxxxx
  roleName: eks-nodegroup.cluster-api-provider-aws.sigs.k8s.io
  scaling:
    maxSize: 5
    minSize: 1
  updateConfig:
    maxUnavailable: 1
status:
  conditions:
  - lastTransitionTime: "2025-06-17T03:37:52Z"
    status: "True"
    type: Ready

Impact

These changes allow users to use launch templates with AL2023 images, and to specify custom AMIs through launch templates while maintaining compatibility with CAPA's automatic AMI lookup mechanism.

Related Issues

Fixes #5546

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds unit tests
  • adds or updates e2e tests

Release note:

EKSConfig now supports generating MIME-formatted bootstrap data for AL2023 nodes using nodeadm and Launch Templates via AWSManagedMachinePool.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 17, 2025
@k8s-ci-robot
Contributor

Hi @AmitSahastra. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AmitSahastra AmitSahastra changed the title Add support for EKSConfig LaunchTemplate bootstrapping for AL2023 using nodeadm ✨ Add support for EKSConfig LaunchTemplate bootstrapping for AL2023 using nodeadm Jun 17, 2025
@AmitSahastra AmitSahastra changed the title ✨ Add support for EKSConfig LaunchTemplate bootstrapping for AL2023 using nodeadm # ✨ Add support for EKSConfig LaunchTemplate bootstrapping for AL2023 using nodeadm Jun 17, 2025
@AmitSahastra AmitSahastra changed the title # ✨ Add support for EKSConfig LaunchTemplate bootstrapping for AL2023 using nodeadm ✨ Add support for EKSConfig LaunchTemplate bootstrapping for AL2023 using nodeadm Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 20, 2025
@fiunchinho
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 24, 2025
@AmitSahastra
Contributor Author

/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-apidiff-main
/test pull-cluster-api-provider-aws-test

@@ -24,6 +24,9 @@ import (

// EKSConfigSpec defines the desired state of Amazon EKS Bootstrap Configuration.
type EKSConfigSpec struct {
	// NodeType specifies the type of node (e.g., "al2023")
	// +optional
	NodeType string `json:"nodeType,omitempty"`
Member

Is there any way to derive this from the AMI being used rather than asking the user to specify in the API?

AmitSahastra and others added 3 commits August 14, 2025 17:26
Code refactor for NodeTypeAL2023 enum
As per documentation at https://github.com/awslabs/amazon-eks-ami/blob/v20250813/nodeadm/api/v1alpha1/nodeconfig_types.go#L52-L53:

```
// CIDR is your cluster's service CIDR block. This value is used to infer your cluster's DNS address.
CIDR string `json:"cidr,omitempty"`
```

Previously setting it to the VPC CIDR was breaking DNS resolution in pods because they
were expecting CoreDNS at 10.0.0.10 (10th IP in VPC CIDR) rather than the 10th IP in the service CIDR.

Also change the default service CIDR to EKS default of 172.20.0.0/12.
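
To make the DNS inference described in this commit message concrete, here is a small Go sketch that derives the cluster DNS address as the tenth address of an IPv4 CIDR block. It follows the commit message's description only; `clusterDNSFromServiceCIDR` is a hypothetical helper name, and nodeadm's real implementation may differ.

```go
package main

import (
	"fmt"
	"net/netip"
)

// clusterDNSFromServiceCIDR returns the address at offset 10 from the base of
// an IPv4 CIDR block. Hypothetical helper for illustration only.
func clusterDNSFromServiceCIDR(cidr string) (netip.Addr, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Addr{}, fmt.Errorf("parsing CIDR %q: %w", cidr, err)
	}
	// Masked zeroes the host bits so we always start from the network address.
	addr := prefix.Masked().Addr()
	if !addr.Is4() {
		return netip.Addr{}, fmt.Errorf("this sketch only handles IPv4, got %q", cidr)
	}
	for i := 0; i < 10; i++ {
		addr = addr.Next()
	}
	if !prefix.Contains(addr) {
		return netip.Addr{}, fmt.Errorf("CIDR %q is too small to hold a DNS address", cidr)
	}
	return addr, nil
}

func main() {
	// The service CIDR yields the address CoreDNS is expected at.
	svc, err := clusterDNSFromServiceCIDR("10.96.0.0/12")
	if err != nil {
		panic(err)
	}
	fmt.Println(svc) // 10.96.0.10

	// Passing a VPC CIDR instead reproduces the broken address from the bug.
	vpc, err := clusterDNSFromServiceCIDR("10.0.0.0/16")
	if err != nil {
		panic(err)
	}
	fmt.Println(vpc) // 10.0.0.10
}
```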
@rudimk
rudimk commented Aug 31, 2025

Hey folks - curious as to when we're looking at seeing this merged / released?

@damdo
Member

damdo commented Sep 1, 2025

/assign @damdo @nrb @richardcase @AndiDog @dlipovetsky @Ankitasw

What do you think of this?

Member

@damdo damdo left a comment

Thanks for the PR. I think we should polish the API a bit first; then we can adjust the logic accordingly.

Comment on lines +25 to +27
// NodeType specifies the type of node
// +kubebuilder:validation:Enum=al2023
type NodeType string
Member

See my comments below regarding the naming of this field

Comment on lines +29 to +32
const (
	// NodeTypeAL2023 represents the AL2023 node type.
	NodeTypeAL2023 NodeType = "al2023"
)
Member

Same here a NodeTypeAL2

Member

We could use the ones you defined for the AMIFamilyType also here

	// AMIFamilyAL2 is the Amazon Linux 2 AMI family.
	AMIFamilyAL2 = "AmazonLinux2"
	// AMIFamilyAL2023 is the Amazon Linux 2023 AMI family.
	AMIFamilyAL2023 = "AmazonLinux2023"

Comment on lines +36 to +38
	// NodeType specifies the type of node (e.g., "al2023")
	// +optional
	NodeType NodeType `json:"nodeType,omitempty"`
Member

I think NodeType is a bit weak as a field name because Type is quite generic.
How about AMIFamilyType like the field we have in the NodeInput struct? So then we can directly pass that down to it?

Member

We would also need to specify what the default is (I guess AmazonLinux2) and an enum validation for this.

	// +kubebuilder:validation:Enum="";AmazonLinux2;AmazonLinux2023
	// +optional
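
One way the defaulting and enum validation suggested here could look in Go is sketched below. This is an assumption-laden illustration, not the PR's code: `effectiveAMIFamily` is a hypothetical helper, and treating the empty value as AmazonLinux2 follows the reviewer's guess about the default rather than any decided behaviour.

```go
package main

import "fmt"

// AMIFamilyType is a string enum for the AMI family, as proposed in review.
// +kubebuilder:validation:Enum="";AmazonLinux2;AmazonLinux2023
type AMIFamilyType string

const (
	// AMIFamilyAL2 is the Amazon Linux 2 AMI family.
	AMIFamilyAL2 AMIFamilyType = "AmazonLinux2"
	// AMIFamilyAL2023 is the Amazon Linux 2023 AMI family.
	AMIFamilyAL2023 AMIFamilyType = "AmazonLinux2023"
)

// effectiveAMIFamily applies the assumed default (AmazonLinux2 when unset)
// and rejects values outside the enum. Hypothetical helper for illustration.
func effectiveAMIFamily(v AMIFamilyType) (AMIFamilyType, error) {
	switch v {
	case "":
		return AMIFamilyAL2, nil
	case AMIFamilyAL2, AMIFamilyAL2023:
		return v, nil
	default:
		return "", fmt.Errorf("unsupported AMI family %q", v)
	}
}

func main() {
	for _, in := range []AMIFamilyType{"", AMIFamilyAL2023, "al2023"} {
		family, err := effectiveAMIFamily(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("family:", family)
	}
}
```

The same check could back a webhook test table, with one case per allowed value and one for an unknown string.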

Member

The enum is added on the type definition (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5553/files#diff-64e2d3d6a1323c5ae37a84cfc459f926fe7e329e83a4f695df28de89f19a3172R26). Only al2023 is allowed at this point, no other values are required to change behaviour, default (unset) is current behaviour.

Contributor

This API is a little confusing to me. We're essentially specifying an AMI family here. We have a field like this already in AWSMachine

EKSOptimizedLookupType *EKSAMILookupType `json:"eksLookupType,omitempty"`

I think we shouldn't repeat this information here.

Contributor

@faiq faiq Sep 2, 2025

Another thing worth noting here is that Amazon Linux 2 reaches EOL very soon, in June 2026. Source: https://aws.amazon.com/amazon-linux-2/faqs/

It seems like EKS has largely moved away from AL2 support to either AL2023 or Bottlerocket.

Bottlerocket seems to have its own setbacks; see kubernetes-sigs/cluster-api#7840

So realistically the only option is going to be AL2023.

EDIT:

[!IMPORTANT] Amazon EKS will no longer publish EKS-optimized Amazon Linux 2 (AL2) AMIs after November 26th, 2025. Additionally, Kubernetes version 1.32 is the last version for which Amazon EKS will release AL2 AMIs. From version 1.33 onwards, Amazon EKS will continue to release AL2023 and Bottlerocket based AMIs. To learn more, see Guide to EKS AL2 & AL2-Accelerated AMIs transition features.

source: https://awslabs.github.io/amazon-eks-ami/

It looks like they won't be building this AMI.

Comment on lines +37 to +40
	// AMIFamilyAL2 is the Amazon Linux 2 AMI family.
	AMIFamilyAL2 = "AmazonLinux2"
	// AMIFamilyAL2023 is the Amazon Linux 2023 AMI family.
	AMIFamilyAL2023 = "AmazonLinux2023"
Member

Let's move them into the outer API and leverage them here too.

Comment on lines 127 to 137
	// AL2023 specific fields
	AMIImageID string
	APIServerEndpoint string
	Boundary string
	CACert string
	CapacityType *v1beta2.ManagedMachinePoolCapacityType
	ClusterCIDR string // CIDR range for the cluster
	ClusterDNS string
	MaxPods *int32
	NodeGroupName string
	NodeLabels string // Not exposed in CRD, computed from user input
Member

Hm, is there a way to make this more specific to AmazonLinux2023 if that's only for it?
How about a discriminated union?
Have we thought about what's going to happen if there is going to be, for example an AmazonLinux2027 or similar, which may have different fields from this one?
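
For readers unfamiliar with the pattern, a discriminated union in a Go API type usually means one discriminator field plus one optional member per variant, with validation ensuring that only the matching member is set. The sketch below is purely illustrative: `BootstrapOptions`, `AmazonLinux2Options`, `AmazonLinux2023Options`, and their fields are hypothetical names, not types from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// AMIFamilyType is the discriminator for the union.
type AMIFamilyType string

const (
	AMIFamilyAL2    AMIFamilyType = "AmazonLinux2"
	AMIFamilyAL2023 AMIFamilyType = "AmazonLinux2023"
)

// AmazonLinux2Options holds bootstrap.sh-only settings (hypothetical subset).
type AmazonLinux2Options struct {
	KubeletExtraArgs map[string]string
}

// AmazonLinux2023Options holds nodeadm-only settings (hypothetical subset).
type AmazonLinux2023Options struct {
	ServiceCIDR string
	MaxPods     *int32
}

// BootstrapOptions is the union: Family selects which member may be set.
type BootstrapOptions struct {
	Family          AMIFamilyType
	AmazonLinux2    *AmazonLinux2Options
	AmazonLinux2023 *AmazonLinux2023Options
}

// Validate enforces that only the member matching Family is populated.
func (o BootstrapOptions) Validate() error {
	switch o.Family {
	case AMIFamilyAL2:
		if o.AmazonLinux2023 != nil {
			return errors.New("amazonLinux2023 must not be set when family is AmazonLinux2")
		}
	case AMIFamilyAL2023:
		if o.AmazonLinux2 != nil {
			return errors.New("amazonLinux2 must not be set when family is AmazonLinux2023")
		}
		if o.AmazonLinux2023 == nil {
			return errors.New("amazonLinux2023 is required when family is AmazonLinux2023")
		}
	default:
		return fmt.Errorf("unsupported AMI family %q", o.Family)
	}
	return nil
}

func main() {
	ok := BootstrapOptions{
		Family:          AMIFamilyAL2023,
		AmazonLinux2023: &AmazonLinux2023Options{ServiceCIDR: "10.96.0.0/12"},
	}
	fmt.Println(ok.Validate()) // <nil>

	mixed := BootstrapOptions{
		Family:       AMIFamilyAL2023,
		AmazonLinux2: &AmazonLinux2Options{},
	}
	fmt.Println(mixed.Validate())
}
```

With this layout, a future family would add a new constant and a new optional member without touching the fields of existing variants.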

@@ -56,3 +56,40 @@ clusterctl init --infrastructure aws
```

NOTE: you will need to enable the creation of the default Fargate IAM role. The easiest way is using `clusterawsadm` and using the `fargate` configuration option, for instructions see the [prerequisites](../using-clusterawsadm-to-fulfill-prerequisites.md).

### Amazon Linux 2023
Member

I'd call it something more generic depending on what we agree on for the API, like AMIFamilyType, and then we can have various subsections for each type.

@@ -199,6 +203,70 @@ func TestEKSConfigReconciler(t *testing.T) {
gomega.Expect(string(secret.Data["value"])).To(Not(Equal(string(expectedUserData))))
}).Should(Succeed())
})

t.Run("Should return requeue when control plane is not initialized", func(t *testing.T) {
Member

We can also add webhook tests for the values we are going to be validating for the AMIFamilyType

Comment on lines +214 to +215
// Set node type to AL2023 to trigger requeue
config.Spec.NodeType = "al2023"
Member

Will need changing based on the API we agree on

// validateAL2023Input validates the input for AL2023 user data generation.
func validateAL2023Input(input *NodeInput) error {
	if input.APIServerEndpoint == "" {
		return fmt.Errorf("API server endpoint is required for AL2023")
Member

Everywhere in this PR it would be better to use the full extended AmazonLinux2023, rather than AL2023, same goes for AL2 vs AmazonLinux2.

🐛 Use cluster service CIDR in NodeConfig CIDR
@k8s-ci-robot
Contributor

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andidog. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

}

// Get AMI ID from AWSManagedMachinePool's launch template if specified
if configOwner.GetKind() == "AWSManagedMachinePool" {
Contributor

Is there any reason that users who don't use managed machine pools can't set AMI IDs like users of AWSMachineTemplate?

@faiq
Contributor

faiq commented Sep 3, 2025

Generally speaking I think we can simplify this to just support the nodeadm script because the other stuff isn't going to be applicable. Please take a look at some of the work I did here that attempts to simplify some of this stuff faiq@cf6a280

@jimmidyson
Member

jimmidyson commented Sep 3, 2025

@faiq

> Generally speaking I think we can simplify this to just support the nodeadm script because the other stuff isn't going to be applicable.

AFAIK nodeadm will only apply to AL2023 so if using other distros they'll still use cloud-init. I think so anyway 😅

@faiq
Contributor

faiq commented Sep 3, 2025

@jimmidyson

> AFAIK nodeadm will only apply to AL2023 so if using other distros they'll still use cloud-init. I think so anyway 😅

I think they'll still be using nodeadm 🤔 so from my POV there's 2 things happening here

  1. The old bootstrap mechanism (bootstrap.sh) is being deprecated in favor of the new nodeadm tool.
  2. Because EKS mostly works with the images they provide, what I think is happening here is that we're pairing the operating systems with the bootstrap methods, i.e. AL2 + bootstrap.sh and AL2023 + nodeadm. I think the bootstrap methods are independent of the OS and nodeadm is what they're using going forward.

I could be wrong, but that's how I'm interpreting the situation 🤷‍♂️

{{.}}
{{- end}}
{{- range .PostBootstrapCommands}}
{{.}}
Contributor

@faiq faiq Sep 4, 2025

Cloud-init handles the text/x-shellscript content type and all of that is going to run before Nodeadm runs to join the cluster. I think we should remove the PostBootstrapCommands because of this.

Running the nodeadm command and then running the postBootstrapCommands will likely be very cumbersome (disabling the service) and subject to a lot of downstream changes which will be difficult to maintain

@faiq
Contributor

faiq commented Sep 4, 2025

Hey @AmitSahastra after looking and thinking about this PR a little bit more I think the right approach here would be to create a new API type that handles bootstrapping for Nodeadm.

The EKSConfig type was written as an API adapter layer above the old bootstrap.sh bootstrapping method. This current PR is essentially adding a new implementation for nodeadm, but there are a lot of fields that nodeadm supports (see API reference) that bootstrap.sh doesn't support and vice versa

I think the cleanest way to support this would be to create a new type, so that when Amazon Linux 2 and bootstrap.sh are no longer supported by EKS we can cleanly remove that code. In the meantime, we can support both bootstrap methods.

@faiq
Contributor

faiq commented Sep 5, 2025

I wrote this document that outlines the need for a new bootstrap API here https://docs.google.com/document/d/1n2v0Q4D5VAzydu7aJno3gRjMlSHueJ8BlGIOrKLDaiQ/edit?tab=t.0#heading=h.ly39fzd1bgc0

@pavansokkenagaraj
> I wrote this document that outlines the need for a new bootstrap API here https://docs.google.com/document/d/1n2v0Q4D5VAzydu7aJno3gRjMlSHueJ8BlGIOrKLDaiQ/edit?tab=t.0#heading=h.ly39fzd1bgc0

@damdo @faiq @jimmidyson
Native support for Bottlerocket in Managed Node Groups with EKS has been around since 2022 (AFAIK).
If we are designing a new API, Bottlerocket configuration should be considered.

@faiq
Contributor

faiq commented Sep 12, 2025

@pavansokkenagaraj that's a good point; however, it seems that Bottlerocket support in Cluster API is set back by kubernetes-sigs/cluster-api#7840 as well as kubernetes-sigs/cluster-api#5294

Given that it's not going to use a cloud-init based configuration, we should consider that out of the scope of this new API.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/contains-merge-commits needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.