[OCI] VM.Standard.A1.Flex CPU capacity overestimated 2x due to incorrect OCPU-to-vCPU conversion #9337

@grkml

Bug Description

The OCI cloud provider incorrectly estimates CPU capacity for ARM-based flex shapes (e.g. VM.Standard.A1.Flex) when building template nodes for scale-from-zero decisions. The code unconditionally multiplies OCPUs by 2 to convert to vCPUs, but this conversion factor only applies to x86 shapes. ARM (Ampere A1) shapes have a 1:1 OCPU-to-vCPU mapping.

Affected Code

cluster-autoscaler/cloudprovider/oci/common/oci_shape.go in GetNodePoolShape():

return &Shape{
    Name:                    shapeName,
    CPU:                     *np.NodeShapeConfig.Ocpus * 2,
    MemoryInBytes:           *np.NodeShapeConfig.MemoryInGBs * 1024 * 1024 * 1024,
    GPU:                     0,
    EphemeralStorageInBytes: float32(ephemeralStorage),
}, nil

The * 2 multiplier is correct for x86 flex shapes (e.g. VM.Standard.E4.Flex, VM.Standard.E5.Flex) where 1 OCPU = 2 vCPUs, but incorrect for ARM shapes (VM.Standard.A1.Flex, VM.Standard.A2.Flex) where 1 OCPU = 1 vCPU.

Impact

When scaling from zero on ARM flex node pools, the autoscaler overestimates node CPU capacity by 2x. This causes it to under-provision nodes because the binpacking estimator thinks fewer nodes are needed than actually required.

The autoscaler self-corrects after the first node joins and a pending pod remains unschedulable, but this adds an unnecessary extra scaling cycle (~3-5 minutes delay per correction on OKE).

Reproduction Steps

  1. Create an OKE cluster with a node pool using VM.Standard.A1.Flex shape, 5 OCPUs, size 0, managed by cluster autoscaler
  2. Create a deployment with pods requesting 4 CPU each
  3. Scale to a replica count that requires 2+ new nodes (e.g. 2 replicas when no other capacity exists)
  4. Observe that the autoscaler initially adds only 1 node, because it expects 10 vCPUs of capacity (5 OCPUs x 2), enough for both pods
  5. After the node joins with actual 5 vCPU capacity and one pod remains pending, a second scale-up is triggered
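The arithmetic behind these steps can be sketched in a few lines of Go. This is a deliberately simplified stand-in for the autoscaler's binpacking estimator, not its actual implementation: it assumes identical pods, considers CPU only, and ignores allocatable overhead. The function name nodesNeeded is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// nodesNeeded is a hypothetical, simplified estimate: how many identical
// nodes are required to fit `pods` identical pods, looking at CPU only.
// It returns -1 if a single pod does not fit on a node.
func nodesNeeded(pods int, podCPU, nodeCPU float64) int {
	podsPerNode := int(math.Floor(nodeCPU / podCPU))
	if podsPerNode < 1 {
		return -1
	}
	// Integer ceiling of pods / podsPerNode.
	return (pods + podsPerNode - 1) / podsPerNode
}

func main() {
	// Values from the reproduction steps: 5 OCPUs, 4 CPU per pod, 2 replicas.
	const ocpus, podCPU, replicas = 5.0, 4.0, 2

	// Template node built with the x86 factor (1 OCPU = 2 vCPUs).
	fmt.Println("nodes estimated with *2:", nodesNeeded(replicas, podCPU, ocpus*2))
	// Real A1 node (1 OCPU = 1 vCPU).
	fmt.Println("nodes actually required:", nodesNeeded(replicas, podCPU, ocpus*1))
}
```

With the doubled capacity both pods appear to fit on one node, while the real 5 vCPU node can hold only one of them, which matches the single-node scale-up followed by a second cycle described above.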

Expected Behavior

The autoscaler should recognize ARM flex shapes use 1:1 OCPU-to-vCPU mapping and estimate CPU capacity correctly, provisioning the right number of nodes on the first scaling decision.

Suggested Fix

Check whether the shape is ARM-based before applying the multiplier. ARM shapes can be identified by the family segment of the shape name (e.g. VM.Standard.A1, VM.Standard.A2). Alternatively, the OCI ListShapes API returns processor description metadata that could be used.

cpuMultiplier := float32(2) // x86: 1 OCPU = 2 vCPUs
if strings.Contains(shapeName, ".A1.") || strings.Contains(shapeName, ".A2.") {
    cpuMultiplier = 1 // ARM: 1 OCPU = 1 vCPU
}
return &Shape{
    Name: shapeName,
    CPU:  *np.NodeShapeConfig.Ocpus * cpuMultiplier,
    ...
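For anyone who wants to try the conversion logic in isolation, here is a minimal standalone sketch of the same idea. The names armShapeMarkers, vcpusPerOCPU and estimatedVCPUs are hypothetical helpers for illustration and do not exist in the autoscaler code. It assumes that only the A1 and A2 families mentioned in this issue are ARM; any other ARM family would need to be added to the list.

```go
package main

import (
	"fmt"
	"strings"
)

// armShapeMarkers lists shape-name fragments treated as ARM (Ampere).
// Assumption: only the A1 and A2 families named in this issue are covered.
var armShapeMarkers = []string{".A1.", ".A2."}

// vcpusPerOCPU returns the OCPU-to-vCPU factor for a shape name:
// 1 for the ARM families listed above, 2 otherwise (x86).
func vcpusPerOCPU(shapeName string) float32 {
	for _, marker := range armShapeMarkers {
		if strings.Contains(shapeName, marker) {
			return 1
		}
	}
	return 2
}

// estimatedVCPUs converts a configured OCPU count into vCPUs for the shape.
func estimatedVCPUs(shapeName string, ocpus float32) float32 {
	return ocpus * vcpusPerOCPU(shapeName)
}

func main() {
	shapes := []string{"VM.Standard.A1.Flex", "VM.Standard.A2.Flex", "VM.Standard.E4.Flex"}
	for _, s := range shapes {
		fmt.Printf("%s: 5 OCPUs -> %v vCPUs\n", s, estimatedVCPUs(s, 5))
	}
}
```

Keeping the factor in one helper would also make it straightforward to reuse across the node pool, instance pool and ListShapes code paths, or to later replace the name check with the processor description from ListShapes.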

Environment

  • Cloud: Oracle Cloud Infrastructure (OCI)
  • Cluster: OKE enhanced cluster, Kubernetes v1.34.2
  • Autoscaler: OKE managed add-on (ClusterAutoscaler)
  • Shape: VM.Standard.A1.Flex (Ampere A1, ARM64), 5 OCPUs, 30GB memory
  • Auth: Workload identity
  • Node group discovery: nodeGroupAutoDiscovery with freeform tags

Additional Notes

  • The GetInstancePoolShape() function in the same file does NOT apply the * 2 multiplier for instance pool flex shapes, making this inconsistent within the same file.
  • The ListShapes API path (used for non-flex/standard shapes) derives CPU from Ocpus * 2 as well, which would also be wrong for fixed ARM shapes if any exist.
