chore: Add memory reservation debug logging and visualization #2521
Open: andygrove wants to merge 44 commits into apache:main from andygrove:debug-mem
+309 −5
Commits (44, all by andygrove)
7252605  Access Spark configs from native code
d084cfa  code cleanup
4837935  revert
ad9c9b8  debug
f3bb412  use df release
13f14d3  cargo update
78f5b4f  [skip ci]
5a39d3b  merge other PR [skip-ci]
dc11515  save [skip-ci]
d2a1ab1  [skip ci]
31cdbc6  save [skip ci]
ffb1f71  Merge remote-tracking branch 'apache/main' into debug-mem
322b4c5  info logging
89e10ac  log task id [skip ci]
3b191fd  println
7c24836  revert lock file
405f5b7  prep for review
522238d  save
36565ca  Update spark/src/main/scala/org/apache/comet/CometExecIterator.scala
21189a6  info logging
dfa2c67  Merge branch 'debug-mem' of github.com:andygrove/datafusion-comet int…
d9817ce  fix
acba7bc  log error on try_grow fail
4051d29  log error on try_grow fail
df69875  revert
ad891a0  add Python script to convert log to csv
8756256  Python script to generate chart
7eb1bc1  scripts
21bd386  new script
ec823c2  show err
a66fa65  save
12db37f  Merge branch 'debug-mem' of github.com:andygrove/datafusion-comet int…
2fb336e  track errors
706f5e7  format
4faf881  ASF header
d91abda  add brief docs
f6128b5  docs
7d40ac2  fix
c495897  cargo fmt
06814b7  upmerge
e51751f  format
75e727f  upmerge
e844287  fix regression
2884ed3  upmerge
Files changed
New file (+73 lines): Python script that converts the memory debug log output into CSV, producing one row per matching pool operation for the selected task.

```python
#!/usr/bin/python
##############################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
##############################################################################

import argparse
import re
import sys


def main(file, task_filter):
    # keep track of running total allocation per consumer
    alloc = {}

    # open file
    with open(file) as f:
        # iterate over lines in file
        print("name,size,label")
        for line in f:
            # print(line, file=sys.stderr)

            # example line: [Task 486] MemoryPool[HashJoinInput[6]].shrink(1000)
            # parse task id, consumer name, method, and size
            re_match = re.search(r'\[Task (.*)\] MemoryPool\[(.*)\]\.(try_grow|grow|shrink)\(([0-9]*)\)', line, re.IGNORECASE)
            if re_match:
                try:
                    task = int(re_match.group(1))
                    if task != task_filter:
                        continue

                    consumer = re_match.group(2)
                    method = re_match.group(3)
                    size = int(re_match.group(4))

                    if alloc.get(consumer) is None:
                        alloc[consumer] = size
                    else:
                        if method == "grow" or method == "try_grow":
                            if "Err" in line:
                                # do not update allocation if try_grow failed;
                                # annotate this entry so it can be shown in the chart
                                print(f"{consumer},{alloc[consumer]},ERR")
                            else:
                                alloc[consumer] = alloc[consumer] + size
                        elif method == "shrink":
                            alloc[consumer] = alloc[consumer] - size

                    print(f"{consumer},{alloc[consumer]}")

                except Exception as e:
                    print("error parsing", line, e, file=sys.stderr)


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description="Generate CSV from memory debug output")
    ap.add_argument("--task", default=None, help="Task ID.")
    ap.add_argument("--file", default=None, help="Spark log containing memory debug output")
    args = ap.parse_args()
    main(args.file, int(args.task))
```
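As a quick sanity check of the parsing logic, the sketch below (illustrative only, not part of the PR) runs the same regular expression over a few hypothetical log lines in the format that the logging pool emits; the task id, consumer name, and sizes are invented. The script itself would be invoked along these lines: `python <path-to-script> --task <task-id> --file <spark-log> > mem.csv`.

```python
# Illustrative sketch only (not part of the PR): exercises the same regex used
# above against a few made-up log lines to show what gets extracted.
import re

PATTERN = r"\[Task (.*)\] MemoryPool\[(.*)\]\.(try_grow|grow|shrink)\(([0-9]*)\)"

SAMPLE_LINES = [
    "[Task 486] MemoryPool[HashJoinInput[6]].try_grow(1000) returning Ok",
    "[Task 486] MemoryPool[HashJoinInput[6]].try_grow(2000) returning Err: resources exhausted",
    "[Task 486] MemoryPool[HashJoinInput[6]].shrink(1000)",
]

for line in SAMPLE_LINES:
    m = re.search(PATTERN, line, re.IGNORECASE)
    if m:
        task, consumer, method, size = m.group(1), m.group(2), m.group(3), int(m.group(4))
        # prints e.g.: 486 HashJoinInput[6] try_grow 1000
        print(task, consumer, method, size)
```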
  
    
New file (+69 lines): Python script that renders the CSV as a stacked area chart of memory usage per consumer, with failed allocations annotated.

```python
#!/usr/bin/python
##############################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
##############################################################################

import pandas as pd
import matplotlib.pyplot as plt
import sys


def plot_memory_usage(csv_file):
    # Read the CSV file
    df = pd.read_csv(csv_file)

    # Create time index based on row order (each row is a sequential time point)
    df['time'] = range(len(df))

    # Pivot the data to have consumers as columns
    pivot_df = df.pivot(index='time', columns='name', values='size')
    pivot_df = pivot_df.ffill().fillna(0)

    # Create stacked area chart
    plt.figure(figsize=(8, 4))
    plt.stackplot(pivot_df.index,
                  [pivot_df[col] for col in pivot_df.columns],
                  labels=pivot_df.columns,
                  alpha=0.8)

    # Add annotations for ERR labels
    if 'label' in df.columns:
        err_points = df[df['label'].str.contains('ERR', na=False)]
        for _, row in err_points.iterrows():
            plt.axvline(x=row['time'], color='red', linestyle='--', alpha=0.7, linewidth=1.5)
            plt.text(row['time'], plt.ylim()[1] * 0.95, 'ERR',
                     ha='center', va='top', color='red', fontweight='bold')

    plt.xlabel('Time')
    plt.ylabel('Memory Usage')
    plt.title('Memory Usage Over Time by Consumer')
    plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1), borderaxespad=0, fontsize='small')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()

    # Save the plot
    output_file = csv_file.replace('.csv', '_chart.png')
    plt.savefig(output_file, dpi=300, bbox_inches='tight')
    print(f"Chart saved to: {output_file}")
    plt.show()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python plot_memory_usage.py <csv_file>")
        sys.exit(1)

    plot_memory_usage(sys.argv[1])
```
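To show what the pivot/forward-fill step produces, here is a small illustrative example (not part of the PR) using a hand-built DataFrame in the same `name,size,label` shape that the CSV script emits; the consumer names and sizes are invented.

```python
# Illustrative sketch only (not part of the PR): shows how the pivot/ffill step
# turns per-row samples into one column per consumer for the stacked area chart.
import pandas as pd

rows = [
    ("HashJoinInput[6]", 1000, None),
    ("SortExec[2]", 4096, None),
    ("HashJoinInput[6]", 3000, None),
    ("HashJoinInput[6]", 3000, "ERR"),  # a failed try_grow annotated by the CSV script
    ("SortExec[2]", 2048, None),
]
df = pd.DataFrame(rows, columns=["name", "size", "label"])
df["time"] = range(len(df))

pivot_df = df.pivot(index="time", columns="name", values="size")
pivot_df = pivot_df.ffill().fillna(0)
print(pivot_df)  # one column per consumer, carried forward between samples
```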
  
    
New file (+102 lines): the logging_pool module. LoggingPool wraps an existing MemoryPool and logs every grow, try_grow, and shrink call, tagged with the Spark task attempt id.

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

use datafusion::execution::memory_pool::{
    MemoryConsumer, MemoryLimit, MemoryPool, MemoryReservation,
};
use log::info;
use std::sync::Arc;

#[derive(Debug)]
pub(crate) struct LoggingPool {
    task_attempt_id: u64,
    pool: Arc<dyn MemoryPool>,
}

impl LoggingPool {
    pub fn new(task_attempt_id: u64, pool: Arc<dyn MemoryPool>) -> Self {
        Self {
            task_attempt_id,
            pool,
        }
    }
}

impl MemoryPool for LoggingPool {
    fn register(&self, consumer: &MemoryConsumer) {
        self.pool.register(consumer)
    }

    fn unregister(&self, consumer: &MemoryConsumer) {
        self.pool.unregister(consumer)
    }

    fn grow(&self, reservation: &MemoryReservation, additional: usize) {
        info!(
            "[Task {}] MemoryPool[{}].grow({})",
            self.task_attempt_id,
            reservation.consumer().name(),
            additional
        );
        self.pool.grow(reservation, additional);
    }

    fn shrink(&self, reservation: &MemoryReservation, shrink: usize) {
        info!(
            "[Task {}] MemoryPool[{}].shrink({})",
            self.task_attempt_id,
            reservation.consumer().name(),
            shrink
        );
        self.pool.shrink(reservation, shrink);
    }

    fn try_grow(
        &self,
        reservation: &MemoryReservation,
        additional: usize,
    ) -> datafusion::common::Result<()> {
        match self.pool.try_grow(reservation, additional) {
            Ok(_) => {
                info!(
                    "[Task {}] MemoryPool[{}].try_grow({}) returning Ok",
                    self.task_attempt_id,
                    reservation.consumer().name(),
                    additional
                );
                Ok(())
            }
            Err(e) => {
                info!(
                    "[Task {}] MemoryPool[{}].try_grow({}) returning Err: {e:?}",
                    self.task_attempt_id,
                    reservation.consumer().name(),
                    additional
                );
                Err(e)
            }
        }
    }

    fn reserved(&self) -> usize {
        self.pool.reserved()
    }

    fn memory_limit(&self) -> MemoryLimit {
        self.pool.memory_limit()
    }
}
```
  
    
Existing module file (+1 line): register the new logging_pool module alongside the existing pools.

```diff
@@ -17,6 +17,7 @@
 mod config;
 mod fair_pool;
+pub mod logging_pool;
 mod task_shared;
 mod unified_pool;
```
  
    
  
    
  
    
Review discussion

Reviewer: Would it be useful to add a debug! log message which has the backtrace of where this was requested from?

Author: Good idea. I updated try_grow to log the Err if it fails. This should contain the backtrace if the backtrace feature is enabled, but I need to test this out locally.

Reviewer: I was thinking that we do this for every call (not just for the error) so we can trace the precise origins of the allocations. It should probably be a trace message (not a debug), though. This is merely a suggestion; I'll leave it to you to decide whether it is useful. Logging the backtrace on error is definitely useful.