Skip to content

Conversation

zongzhidanielhu
Copy link
Contributor

@zongzhidanielhu zongzhidanielhu commented Mar 17, 2025

Issue number:

Description of changes:
Added feature allows configuring a specific zone in Test.toml, so that EC2 instance(s) can be launched in this specific zone. Usually, two zones must specified, but this feature allows one specific single zone for EC2 instance to launch into.

Testing done:
5 rounds same testings have been launched to make sure us-west-2b is not randomly picked up:
Input Test.toml:
Screenshot 2025-03-17 at 11 23 30 AM

After EC2 instance launched:
[ (eksctl-x86-64-aws-k8s-129-nvidia-cluster/SubnetPublicUSWEST2B)]

Invalid input tests:
input:
Screenshot 2025-03-17 at 1 49 13 PM

output:
Screenshot 2025-03-17 at 1 49 54 PM

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@zongzhidanielhu zongzhidanielhu force-pushed the 1.29_az branch 4 times, most recently from c7f92d7 to 08f5cb9 Compare March 17, 2025 20:18
let mut flex_subnet_ids = spec.clone().configuration.subnet_ids;

if flex_subnet_ids.len() <= 2 {
flex_subnet_ids.remove(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would explain why we should remove the first position value here and why this would be triggered under two subnet ids.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, Will do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone provides ["us-west-2b", "us-west-2c"], wouldn't this now fail? I do not believe we want fixed offsets like this. We should look for the value we want and use that. I don't think it is safe to assume the input is sorted or in the order we expect.

UsWest2b,
#[strum(serialize = "us-west-2c")]
UsWest2c,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we miss UsWest2d here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears the test is flagging UsWest2d as an invalid input, which doesn't seem correct. Why should we consider this availability zone invalid when it's a legitimate zone in AWS US-West-2 region?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 valid zones a, b and c. but in Test.toml you basically can enter whatever you want to. This check just make sure input is valid, if not log out info in testsys. eksctl will through a error too if the input zone is invalid, but a concise code will do a check too, I would rather check it before the eksctl through.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


if zones.is_some() && zones.iter().len() < 2 {
let zones_upwrap = zones.clone().unwrap().first().unwrap().to_string();
if AvailabilityZones::from_str(&zones_upwrap).is_err() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we should run the zone input value validation at twoliter. Twoliter does the job to read test.toml file and pass it to testsys. so I think twoliter should own the job to validate it as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either way will works, but checking here is much easily due to the core logic is in the cluster creation, and the key purpose of give a input zone is for create a cluster with the instance we want to.

.unwrap()
.as_ref()
.to_string();
eks_zones.insert(0, remain_zone);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I don't fully understand this method. The second position value will always be 2b ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the current implementation (although I haven't seen the twoliter logic directly). The system randomly adds an extra availability zone to the EKS cluster zone list if only one zone is provided by Test.toml and inserts its subnet ID at the start of the subnet list. When launching EC2 instances, it removes this first subnet ID and uses the remaining ones. I think this isn't the ideal solution that we want.

Instead, we should use the AWS describe-subnets API to identify the exact subnet IDs matching our target availability zones from Test.toml, then pass these specific subnet IDs to ec2 instance launch client. This would allow the EC2 agent to launch instances only in our intended subnets, providing a cleaner and more controlled solution.

Copy link
Contributor Author

@zongzhidanielhu zongzhidanielhu Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree of using AWS describe-subnets, this is just add another layer of unnecessary calling of other API, the hard part of this problem is when have 2 subnets, you don't know which on is which, here I just used the most simple way, the Vec index to identify the target subnet without calling anything API overhead, and remove the others. You don't have to describe the subnets cause you already known the index[1] is the target zone subnet id.

By adding up the AWS describe-subnets, I need to setup the awscli, and do a API call, get the response, then from the response, find the field need, compare what I did which is much simple and clean, this just too much overhead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding up the AWS describe-subnets, I need to setup the awscli, and do a API call, get the response, then from the response, find the field need, compare what I did which is much simple and clean, this just too much overhead.

You should use the AWS Rust SDK to make these calls and I agree with @gthao313, we should do this. It is not too much overhead, you have to make this code functional in more than just this one situation, and these describes and dealing with dynamic changes is part of dealing with making this code re-useable/flexible.

You don't have to describe the subnets cause you already known the index[1] is the target zone subnet id.

You have hard coded the AZs above which is why I commented there. This code will only ever work while there are 3 AZs and only in us-west-2. Using describe calls with the AWS SDK would allow this code to run other places than us-west-2 and still retain the same functionality.

#[strum(serialize_all = "kebab-case")]
enum AvailabilityZones {
#[strum(serialize = "us-west-2a")]
UsWest2a,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we want to hard code this. Is there a way we can get the Availability Zones from the ec2 client at runtime?

Copy link
Contributor Author

@zongzhidanielhu zongzhidanielhu Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the hard part of this issue, everything has to be done before the ec2 instances created, that's why I can not use ec2 client due to by then the instance(s) already created which is no way we can manipulate it. It is kind of not hardcode due to it is only has three zones, a, b, c, and it is just for define the valid zone boundary. This will give us more flexible of when input is c, you know you will add a or b to create cluster, when input is b, you know you will check a, c, not e, f, g cause the valid zone boundary already defined.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to go back to the approach and see if we need to create an EC2 client before the rest of this because this is a blocking issue, we can't merge hard-coding us-west-2 AZs as the only place this works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we shouldn't hardcode since it would be broken if we pass the zone at other region. It already provides the region that cluster is running at and setup aws config based on the region, so we can programmatically list the available zones by using the ec2_client.

The issue here is to provide extra available zones if only one zone is provided via Test.toml, so we list all zones and randomly give one (not the input one) to eksctl.

This would help https://docs.rs/aws-sdk-ec2/latest/aws_sdk_ec2/struct.Client.html#method.describe_availability_zones

.unwrap()
.as_ref()
.to_string();
eks_zones.insert(0, remain_zone);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding up the AWS describe-subnets, I need to setup the awscli, and do a API call, get the response, then from the response, find the field need, compare what I did which is much simple and clean, this just too much overhead.

You should use the AWS Rust SDK to make these calls and I agree with @gthao313, we should do this. It is not too much overhead, you have to make this code functional in more than just this one situation, and these describes and dealing with dynamic changes is part of dealing with making this code re-useable/flexible.

You don't have to describe the subnets cause you already known the index[1] is the target zone subnet id.

You have hard coded the AZs above which is why I commented there. This code will only ever work while there are 3 AZs and only in us-west-2. Using describe calls with the AWS SDK would allow this code to run other places than us-west-2 and still retain the same functionality.

let mut flex_subnet_ids = spec.clone().configuration.subnet_ids;

if flex_subnet_ids.len() <= 2 {
flex_subnet_ids.remove(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone provides ["us-west-2b", "us-west-2c"], wouldn't this now fail? I do not believe we want fixed offsets like this. We should look for the value we want and use that. I don't think it is safe to assume the input is sorted or in the order we expect.

#[strum(serialize_all = "kebab-case")]
enum AvailabilityZones {
#[strum(serialize = "us-west-2a")]
UsWest2a,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to go back to the approach and see if we need to create an EC2 client before the rest of this because this is a blocking issue, we can't merge hard-coding us-west-2 AZs as the only place this works.

UsWest2b,
#[strum(serialize = "us-west-2c")]
UsWest2c,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

desired_capacity: i32,
}

fn zone_to_exclude(exclude: String) -> Vec<AvailabilityZones> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it should be swapped out for a function using the EC2 client to get the available availability zones for that account in that region.

nit: the name is not really helpful. You have named it what you are using it for, not what it does. This function returns the availability zones in us-west-2. If we must keep this function (which my above comment asks not to), then it should be named something more specific to its function, not what you want to use it for like all_us_west_2_zones() or something like that.

}
let remain_zone = zone_to_exclude(zone_input.to_string())
.first()
.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use unwrap() for these situations. I like to think of unwrap() as a good indicator I might have a flaw in my approach. It makes me ask "why does Rust think this can fail?" and it helps shift my focus to how to avoid it being possible to fail. In this case, you likely need to handle the error since you are hard coding the output from this function, you should be able to enforce that it won't fail. But if this was output from a client to EC2, then you need to properly handle if the output of this function doesn't match your expectations of there being a return from first().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants