
Conversation

bulasevich
Contributor

@bulasevich bulasevich commented Aug 22, 2025

This reworks the recent update #24696, which fixed a UBSan issue on aarch64. The problem now reproduces on x86_64 as well, which suggests the previous update was incomplete.

The issue reproduces with the HeapByteBufferTest jtreg test on a UBSan-enabled build. The actual trigger is the -XX:+OptoScheduling option used by the test (by default, OptoScheduling is disabled on most x86 CPUs). With the option enabled, the failure can be reproduced with a simple java -version run.

This fix is in ADLC-generated code. For simplicity, the examples below show the generated fragments.

The problem is that the shift count n may be too large here:

class Pipeline_Use_Cycle_Mask {
protected:
  uint _mask;
  ..
  Pipeline_Use_Cycle_Mask& operator<<=(int n) {
    _mask <<= n;
    return *this;
  }
};

The recent change attempted to cap the shift amount at one call site:

class Pipeline_Use_Element {
protected:
  ..
  // Mask of specific used cycles
  Pipeline_Use_Cycle_Mask _mask;
  ..
  void step(uint cycles) {
    _used = 0;
    uint max_shift = 8 * sizeof(_mask) - 1;
    _mask <<= (cycles < max_shift) ? cycles : max_shift;
  }
};

However, there is another site where Pipeline_Use_Cycle_Mask::operator<<= can be called with a too-large shift count:

// The following two routines assume that the root Pipeline_Use entity
// consists of exactly 1 element for each functional unit
// start is relative to the current cycle; used for latency-based info
uint Pipeline_Use::full_latency(uint delay, const Pipeline_Use &pred) const {
  for (uint i = 0; i < pred._count; i++) {
    const Pipeline_Use_Element *predUse = pred.element(i);
    if (predUse->_multiple) {
      uint min_delay = 7;
      // Multiple possible functional units, choose first unused one
      for (uint j = predUse->_lb; j <= predUse->_ub; j++) {
        const Pipeline_Use_Element *currUse = element(j);
        uint curr_delay = delay;
        if (predUse->_used & currUse->_used) {
          Pipeline_Use_Cycle_Mask x = predUse->_mask;
          Pipeline_Use_Cycle_Mask y = currUse->_mask;

          for ( y <<= curr_delay; x.overlaps(y); curr_delay++ )
            y <<= 1;
        }
        if (min_delay > curr_delay)
          min_delay = curr_delay;
      }
      if (delay < min_delay)
        delay = min_delay;
    }
    else {
      for (uint j = predUse->_lb; j <= predUse->_ub; j++) {
        const Pipeline_Use_Element *currUse = element(j);
        if (predUse->_used & currUse->_used) {
          Pipeline_Use_Cycle_Mask x = predUse->_mask;
          Pipeline_Use_Cycle_Mask y = currUse->_mask;

          for ( y <<= delay; x.overlaps(y); delay++ )
            y <<= 1;
        }
      }
    }
  }

  return (delay);
}
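In isolation, the overlap search above can be sketched with a plain 32-bit mask and a capped shift like the one proposed as the fix below. Names here are illustrative, not the generated ones; the sketch shows why a starting delay of 100 (from a fixed_latency(100) declaration) must not feed a raw shift:

```cpp
#include <cstdint>

// Capped left shift: keeps the exponent strictly below the 32-bit width,
// so delay values like 100 are defined (a raw `mask << 100` would be UB).
static uint32_t shl_capped(uint32_t mask, int n) {
  return mask << ((n < 32) ? n : 31);
}

// Toy version of the inner loop of Pipeline_Use::full_latency(): slide the
// successor mask left until it no longer collides with the predecessor mask.
static unsigned search_delay(uint32_t pred, uint32_t curr, unsigned start) {
  unsigned delay = start;
  uint32_t y = shl_capped(curr, (int)delay);
  while (pred & y) {   // still overlapping: try one cycle later
    delay++;
    y = shl_capped(y, 1);
  }
  return delay;
}
```

With pred = 0b0110 and curr = 0b0011 the first non-colliding delay is 3; with a start delay of 100 the capped shift simply saturates instead of invoking undefined behavior.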

Fix: cap the shift inside Pipeline_Use_Cycle_Mask::operator<<= so all call sites are safe:

class Pipeline_Use_Cycle_Mask {
protected:
  uint _mask;
  ..
  Pipeline_Use_Cycle_Mask& operator<<=(int n) {
    int max_shift = 8 * sizeof(_mask) - 1;
    _mask <<= (n < max_shift) ? n : max_shift;
    return *this;
  }
};

class Pipeline_Use_Element {
protected:
  ..
  // Mask of specific used cycles
  Pipeline_Use_Cycle_Mask _mask;
  ..
  void step(uint cycles) {
    _used = 0;
    _mask <<= cycles;
  }
};
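A standalone sketch of the capped operator (CycleMask is a hypothetical name, not the generated class) makes the behavior easy to check: shifts below 32 are unchanged, and larger shifts saturate at 31. The cap is deliberately not bit-exact for n >= 32 — bit 31 can remain set where a true wide shift would give 0 — but removing the undefined behavior is the point of the fix.

```cpp
#include <cstdint>

// Illustrative stand-in for the generated Pipeline_Use_Cycle_Mask.
struct CycleMask {
  uint32_t _mask = 0;

  CycleMask& operator<<=(int n) {
    // Cap the exponent at 31: `_mask << n` with n >= 32 is UB for uint32_t.
    _mask <<= (n < 32) ? n : 31;
    return *this;
  }
};
```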

Note: on platforms where PipelineForm::_maxcycleused > 32 (e.g., ARM32), the Pipeline_Use_Cycle_Mask implementation already handles large shifts, so no additional check is needed:

class Pipeline_Use_Cycle_Mask {
protected:
  uint _mask1, _mask2, _mask3;

  Pipeline_Use_Cycle_Mask& operator<<=(int n) {
    if (n >= 32)
      do {
        _mask3 = _mask2; _mask2 = _mask1; _mask1 = 0;
      } while ((n -= 32) >= 32);

    if (n > 0) {
      uint m = 32 - n;
      uint mask = (1 << n) - 1;
      uint temp2 = mask & (_mask1 >> m); _mask1 <<= n;
      uint temp3 = mask & (_mask2 >> m); _mask2 <<= n; _mask2 |= temp2;
      _mask3 <<= n; _mask3 |= temp3;
    }
    return *this;
  }
};
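The word-by-word shift can be verified with a standalone model (Mask96 is a hypothetical name; the operator body mirrors the generated arm32 code above). Each inner shift stays below 32, so no exponent is ever too large, and bits pushed past bit 95 are simply dropped:

```cpp
#include <cstdint>

// Standalone 96-bit mask built from three 32-bit words; _mask1 holds the
// lowest 32 cycle bits. Mirrors the generated arm32 operator<<= above.
struct Mask96 {
  uint32_t _mask1 = 0, _mask2 = 0, _mask3 = 0;

  Mask96& operator<<=(int n) {
    if (n >= 32)
      do {  // shift by whole 32-bit words first
        _mask3 = _mask2; _mask2 = _mask1; _mask1 = 0;
      } while ((n -= 32) >= 32);

    if (n > 0) {  // then carry the remaining 0..31 bits across word boundaries
      uint32_t m = 32 - n;
      uint32_t mask = (1u << n) - 1;
      uint32_t temp2 = mask & (_mask1 >> m); _mask1 <<= n;
      uint32_t temp3 = mask & (_mask2 >> m); _mask2 <<= n; _mask2 |= temp2;
      _mask3 <<= n; _mask3 |= temp3;
    }
    return *this;
  }
};
```

Bit 0 shifted by 65 lands at bit 1 of the third word, and a shift of 100 pushes everything out of the 96-bit window — without any single shift exponent reaching 32.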

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8338197: ubsan: ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int' (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26890/head:pull/26890
$ git checkout pull/26890

Update a local copy of the PR:
$ git checkout pull/26890
$ git pull https://git.openjdk.org/jdk.git pull/26890/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26890

View PR using the GUI difftool:
$ git pr show -t 26890

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26890.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Aug 22, 2025

👋 Welcome back bulasevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Aug 22, 2025

@bulasevich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8338197: ubsan: ad_x86.hpp:6417:11: runtime error: shift exponent 100 is too large for 32-bit type 'unsigned int'

Reviewed-by: kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 31 new commits pushed to the master branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Aug 22, 2025

@bulasevich The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

… is too large for 32-bit type 'unsigned int'
@bulasevich bulasevich marked this pull request as ready for review August 22, 2025 16:29
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 22, 2025
@mlbridge

mlbridge bot commented Aug 22, 2025

Webrevs

Comment on lines 773 to 774
- fprintf(fp_hpp, " _mask <<= n;\n");
+ fprintf(fp_hpp, " int max_shift = 8 * sizeof(_mask) - 1;\n");
+ fprintf(fp_hpp, " _mask <<= (n < max_shift) ? n : max_shift;\n");
Contributor


sizeof(_mask) is known - it is sizeof(uint).
Lines 760-768 should be cleaned up: the <= 32 checks are redundant because of the check at line 758. This is a leftover from the (incomplete) SPARC code removal.

Contributor Author


Good point - I removed the redundant code.

As for sizeof(_mask), shouldn’t it just be max_shift = 31 or _mask <<= (n < 32) ? n : 31;?

Contributor


Yes, if uint is 32 bits on all our platforms.

Hmm, maybe we should use uint32_t for _mask here. Then we can use 32 and 31 without confusion.

Contributor


I mean to use _mask <<= (n < 32) ? n : 31;

Contributor Author


Good! Let me correct both variants then. The resulting code is:

class Pipeline_Use_Cycle_Mask {
protected:
  uint32_t _mask;

public:
  Pipeline_Use_Cycle_Mask() : _mask(0) {}

  Pipeline_Use_Cycle_Mask(uint32_t mask) : _mask(mask) {}

  bool overlaps(const Pipeline_Use_Cycle_Mask &in2) const {
    return ((_mask & in2._mask) != 0);
  }

  Pipeline_Use_Cycle_Mask& operator<<=(int n) {
    _mask <<= (n < 32) ? n : 31;
    return *this;
  }

  void Or(const Pipeline_Use_Cycle_Mask &in2) {
    _mask |= in2._mask;
  }

  friend Pipeline_Use_Cycle_Mask operator&(const Pipeline_Use_Cycle_Mask &, const Pipeline_Use_Cycle_Mask &);
  friend Pipeline_Use_Cycle_Mask operator|(const Pipeline_Use_Cycle_Mask &, const Pipeline_Use_Cycle_Mask &);

  friend class Pipeline_Use;

  friend class Pipeline_Use_Element;

};
The code generated for arm32:

class Pipeline_Use_Cycle_Mask {
protected:
  uint32_t _mask1, _mask2, _mask3;

public:
  Pipeline_Use_Cycle_Mask() : _mask1(0), _mask2(0), _mask3(0) {}

  Pipeline_Use_Cycle_Mask(uint32_t mask1, uint32_t mask2, uint32_t mask3) : _mask1(mask1), _mask2(mask2), _mask3(mask3) {}

  Pipeline_Use_Cycle_Mask intersect(const Pipeline_Use_Cycle_Mask &in2) {
    Pipeline_Use_Cycle_Mask out;
    out._mask1 = _mask1 & in2._mask1;
    out._mask2 = _mask2 & in2._mask2;
    out._mask3 = _mask3 & in2._mask3;
    return out;
  }

  bool overlaps(const Pipeline_Use_Cycle_Mask &in2) const {
    return ((_mask1 & in2._mask1) != 0) || ((_mask2 & in2._mask2) != 0) || ((_mask3 & in2._mask3) != 0);
  }

  Pipeline_Use_Cycle_Mask& operator<<=(int n) {
    if (n >= 32)
      do {
        _mask3 = _mask2; _mask2 = _mask1; _mask1 = 0;
      } while ((n -= 32) >= 32);

    if (n > 0) {
      uint m = 32 - n;
      uint32_t mask = (1 << n) - 1;
      uint32_t temp2 = mask & (_mask1 >> m); _mask1 <<= n;
      uint32_t temp3 = mask & (_mask2 >> m); _mask2 <<= n; _mask2 |= temp2;
      _mask3 <<= n; _mask3 |= temp3;
    }
    return *this;
  }

  void Or(const Pipeline_Use_Cycle_Mask &);

  friend Pipeline_Use_Cycle_Mask operator&(const Pipeline_Use_Cycle_Mask &, const Pipeline_Use_Cycle_Mask &);
  friend Pipeline_Use_Cycle_Mask operator|(const Pipeline_Use_Cycle_Mask &, const Pipeline_Use_Cycle_Mask &);

  friend class Pipeline_Use;

  friend class Pipeline_Use_Element;

};

@dean-long
Member

I didn't realize we already had code to handle masks for large shifts. So I think the main problem is that _maxcycleused is not being set to the max value of 100. There is a secondary problem that we don't really need values that high, if the units are in pipeline stages.

Contributor

@vnkozlov vnkozlov left a comment


Looks good. I will submit testing.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 25, 2025
@dean-long
Member

diff --git a/src/hotspot/share/adlc/adlparse.cpp b/src/hotspot/share/adlc/adlparse.cpp
index 033e8d26ca7..ca6e8b7ed5e 100644
--- a/src/hotspot/share/adlc/adlparse.cpp
+++ b/src/hotspot/share/adlc/adlparse.cpp
@@ -1770,6 +1770,10 @@ void ADLParser::pipe_class_parse(PipelineForm &pipeline) {
         return;
       }
 
+      if (pipeline._maxcycleused < fixed_latency) {
+        pipeline._maxcycleused = fixed_latency;
+      }
+
       pipe_class->setFixedLatency(fixed_latency);
       next_char(); skipws();
       continue;

I think this also solves the problem, because the 100 is coming from a fixed_latency(100) statement.

@vnkozlov
Contributor

I think this also solves the problem, because the 100 is coming from a fixed_latency(100) statement.

Or we can fix pipe_slow() to use a reasonable fixed_latency instead of the arbitrary 100.
It is used mostly for floating-point instructions and, I think, dates from the time when we used the FPU instead of the current SSE/AVX instructions.

But I think the code in output_h.cpp should be fixed, as proposed, regardless of what we do with fixed_latency.

@dean-long
Member

Also note that the min_delay logic in Pipeline_Use::full_latency() initializes min_delay to _maxcycleused+1, so it does seem to expect _maxcycleused to be set to the max value.

@dean-long
Member

Yes, I think we should fix both, output_h.cpp and fixed_latency(100) on all platforms, then we can get rid of the workarounds and arm32-specific logic.

@adinn
Contributor

adinn commented Aug 26, 2025

Yes, I think we should fix both, output_h.cpp and fixed_latency(100) on all platforms, then we can get rid of the workarounds and arm32-specific logic.

When I looked into this earlier I thought the obvious thing needed to fix this was to reassign all the latencies so they represented a realizable pipeline delay. A proper fix would sensibly require each latency to be less than the pipeline length declared in the CPU model -- which for most arches is much less than 32. However, I didn't suggest such a rationalization because I believed (perhaps wrongly) that the latencies were also used to pick a preferred choice when we have alternative instruction/operand rule matches. The selection process involves comparing the cumulative latencies for subgraph nodes against the latency of each node defined by a match rule for the subgraph and picking the lowest latency result. After looking at some of the rules I was not sure that it would be easy to reduce all current latencies so they lie in the range 0-31 and still guarantee the current selection order. It would be even harder when the range was correctly reduced to 0 - lengthof(pipeline).

I don't even think most rule authors understand that the latencies are used by the pipeline model; instead they simply use latency as a weight to enforce orderings. That's certainly how I understood it until I ran into this issue. If so, then perhaps we would be better off sticking with the de facto use and fixing the shift issue with a maximum shift bound. The mask tests which rely on this shift count may help with deriving scheduling delays for some instructions with small latencies, but I don't believe they are very reliable even in cases where the accumulated shifts lie within the 32-bit range. If we are to change anything here, then I think we need a review of the accuracy of the pipeline models and their current or potential value before doing so.

@bulasevich
Contributor Author

+      if (pipeline._maxcycleused < fixed_latency) {
+        pipeline._maxcycleused = fixed_latency;
+      }
+

I think this also solves the problem, because the 100 is coming from a fixed_latency(100) statement.

@dean-long Right! I checked, and it makes UBSan quiet.

Please note that 100 isn’t the only triggering value. With an extra trace on macosx-aarch64 I see:

printf("%i -> %i\n", pipeline._maxcycleused, fixed_latency);
6 -> 8
8 -> 16
16 -> 100

If we resolve it at the parse stage, I think we should do the opposite: limit the user-specified value to _maxcycleused.

diff --git a/src/hotspot/share/adlc/adlparse.cpp b/src/hotspot/share/adlc/adlparse.cpp
index 033e8d26ca7..1060f7b18ab 100644
--- a/src/hotspot/share/adlc/adlparse.cpp
+++ b/src/hotspot/share/adlc/adlparse.cpp
@@ -1770,7 +1770,7 @@ void ADLParser::pipe_class_parse(PipelineForm &pipeline) {
         return;
       }

-      pipe_class->setFixedLatency(fixed_latency);
+      pipe_class->setFixedLatency(fixed_latency <= pipeline._maxcycleused ? fixed_latency : pipeline._maxcycleused);
       next_char(); skipws();
       continue;
     }
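The two parse-stage options being weighed — raising _maxcycleused to cover a declared latency (dean-long's diff) versus clamping the latency to _maxcycleused (the alternative above) — can be sketched side by side. The struct and function names here are illustrative; the real code lives in ADLParser::pipe_class_parse:

```cpp
#include <algorithm>

// Minimal model of the pipeline form: only the field relevant here.
struct PipelineModel {
  int maxcycleused;
};

// Option (a): widen the tracked maximum so the generated mask type
// grows to cover the declared latency.
int raise_max(PipelineModel& p, int fixed_latency) {
  p.maxcycleused = std::max(p.maxcycleused, fixed_latency);
  return fixed_latency;              // latency is kept as declared
}

// Option (b): clamp the declared latency so it never exceeds the
// cycles the existing mask can represent.
int clamp_latency(const PipelineModel& p, int fixed_latency) {
  return std::min(fixed_latency, p.maxcycleused);
}
```

Option (a) preserves the declared latencies and pushes more platforms onto the multi-word mask path; option (b) keeps the mask narrow at the cost of silently shortening latencies like 100.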

@vnkozlov
Contributor

vnkozlov commented Aug 26, 2025

My testing passed for version V02.

@dean-long
Member

If we are to change anything here then I think we need a review of the accuracy of pipeline models and their current or potential value before doing so.

That's a good point. While looking into this, I discovered that the initial masks generated by pipeline_res_mask_initializer() appear wrong. For example, the mask for stage 0 with 1 cycle is computed as 0x80000001, not the 0x1 that I would expect. Stage 2 with 1 cycle is 0x2, not 0x4, etc. I guess if all the masks are wrong in the same way, the problems might mostly cancel out, but it does shed doubt on the usefulness of this code.

We could preserve the large latencies for now, and let them trigger the _maxcycleused > 32 code for more platforms.
