
Conversation

valeriy42 (Contributor) commented on Jul 25, 2025:

We observed that updating the number of deployment allocations led to double-counting of the memory estimated by StartTrainedModelDeploymentAction.estimateMemoryUsageBytes. Further investigation showed that this bug was introduced in #104260.

This PR reverts the changes from #104260 and refactors the memory calculation in ML inference assignment planning to improve testability and prevent memory accounting issues.

Main Changes

  • Refactored memory calculation in AssignmentPlan.Builder:
    • Added assignModelToNodeAndAccountForCurrentAllocations() in AssignmentPlan.java, which handles both new and current allocations in a single call (see the sketch after this list)
    • Moved memory accounting into the Builder class to reduce code duplication and potential bugs
    • Used by ZoneAwareAssignmentPlanner and TrainedModelAssignmentRebalancer for consistent memory handling
  • Added dependency injection in Deployment for the memory estimation function
  • Added a test verifying correct memory accounting when scaling allocations from 3 to 4, confirming that memory is not double-counted during allocation scaling
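
A minimal sketch of how such a combined helper could look, assuming hypothetical signatures for assignModelToNode, accountMemory, and getCurrentAllocations inside AssignmentPlan.Builder (the actual parameters and return type in the PR may differ):

// Hypothetical sketch only; the real AssignmentPlan.Builder method may differ.
Builder assignModelToNodeAndAccountForCurrentAllocations(Deployment deployment, Node node, int newAllocations) {
    // Reserve memory (and cores) for the newly requested allocations.
    assignModelToNode(deployment, node, newAllocations);

    // Account once for the memory of allocations the deployment already has on this node,
    // so a subsequent rebalance does not add the same estimate a second time.
    int currentAllocations = getCurrentAllocations(deployment, node);
    if (currentAllocations > 0) {
        accountMemory(deployment, node, deployment.estimateMemoryUsageBytes(currentAllocations));
    }
    return this;
}

Keeping both steps behind a single method is what allows ZoneAwareAssignmentPlanner and TrainedModelAssignmentRebalancer to share one accounting path instead of each calling assignModelToNode and accountMemory separately.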

@valeriy42 added the >bug and :ml (Machine learning) labels on Jul 25, 2025
@valeriy42 self-assigned this on Jul 25, 2025
@elasticsearchmachine (Collaborator) commented:
Hi @valeriy42, I've created a changelog YAML for you.

@valeriy42 changed the title from "[ML] Fix the bug with double allocation accouting" to "[ML] Prevent double allocation accounting in memory estimation of trained model deployments" on Jul 25, 2025
@valeriy42 changed the title from "[ML] Prevent double allocation accounting in memory estimation of trained model deployments" to "[ML] Prevent the trained model deployment memory estimation from double-counting allocations." on Jul 25, 2025
@valeriy42 requested a review from jan-elastic on July 25, 2025 at 12:23
@valeriy42 marked this pull request as ready for review on July 25, 2025 at 12:23
@elasticsearchmachine added the Team:ML label (Meta label for the ML team) on Jul 25, 2025
@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/ml-core (Team:ML)

@valeriy42 added the v9.1.1, v8.19.1, v9.0.5, v8.18.5, and auto-backport (Automatically create backport pull requests when merged) labels on Jul 25, 2025
Review thread on the following code:

// If the deployment already has allocations on this node, account for the memory they use.
int currentAllocations = getCurrentAllocations(deployment, node);
if (currentAllocations > 0) {
    long memoryForCurrentAllocations = deployment.estimateMemoryUsageBytes(currentAllocations);
    accountMemory(deployment, node, memoryForCurrentAllocations);
}
jan-elastic (Contributor) commented on Jul 25, 2025:
I don't understand this. Why call:

  • assignModelToNode for the new allocations; and
  • accountMemory for the old ones?

I guess I also don't really understand what the state of AssignmentPlan exactly contains.

Isn't the old already accounted for? And what about the cores?

Follow-up comment from a Contributor:
Furthermore, it feels like this shouldn't be doing that much, so I don't get why it's 500+ lines of similar confusing methods...

@valeriy42 closed this on Jul 31, 2025
@valeriy42 deleted the bug/allocations-2 branch on July 31, 2025 at 08:00