Add new self-profiling event to cheaply aggregate query cache hit counts #142978
| https://github.com/search?type=code&q=query-cache-hits looks like no one used this anyway.. 😆 | 
| /// With this approach, we don't know the individual thread IDs and timestamps | ||
| /// of cache hits, but it has very little overhead on top of `-Zself-profile`. | ||
| /// Recording the cache hits as individual events made compilation 3-5x slower. | ||
| query_hits: RwLock<FxHashMap<QueryInvocationId, AtomicU64>>, | 
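For readers who want to see the shape of the map-based aggregation in the field above, here is a minimal sketch. It is not the PR's code: it swaps in the standard library's `HashMap` for `FxHashMap`, and `InvocationId`, `HitCountMap`, `increment`, and `get` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for rustc's `QueryInvocationId` (assumed to wrap a `u32`).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct InvocationId(pub u32);

/// Aggregated cache hit counts, keyed by invocation ID.
#[derive(Default)]
pub struct HitCountMap {
    hits: RwLock<HashMap<InvocationId, AtomicU64>>,
}

impl HitCountMap {
    /// Records one cache hit for `id`.
    pub fn increment(&self, id: InvocationId) {
        // Fast path: the ID was seen before, so a shared read lock is enough
        // and the counter is bumped atomically.
        {
            let map = self.hits.read().unwrap();
            if let Some(counter) = map.get(&id) {
                counter.fetch_add(1, Ordering::Relaxed);
                return;
            }
        }
        // Slow path: first hit for this ID. Another thread may have inserted
        // the entry in the meantime, so use the entry API instead of `insert`.
        let mut map = self.hits.write().unwrap();
        map.entry(id)
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(1, Ordering::Relaxed);
    }

    /// Returns the number of hits recorded for `id` (0 if never seen).
    pub fn get(&self, id: InvocationId) -> u64 {
        self.hits
            .read()
            .unwrap()
            .get(&id)
            .map_or(0, |counter| counter.load(Ordering::Relaxed))
    }
}
```

The point of the design is that no per-hit event is constructed: a repeated hit only costs a read lock, a hash lookup, and one atomic add.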
Could you switch this to using a dense map, e.g. IndexVec? QueryInvocationId should be monotonically assigned I think and so this should end up dense.
Are they allocated monotonically in the order of executed queries though? 🤔 We don't know before the start of rustc how many invocations there will be (I assume, since it includes queries combined with the unique argument combinations), so we can't preallocate it. So the only thing we could do is .push() on demand (if the new ID is one larger than the size of the vec), and lookup by index. Is that what you meant?
It doesn't look like they are strictly monotonic:
ID: 2
ID: 2
ID: 4
ID: 1
ID: 0
ID: 4
ID: 7
ID: 8
ID: 1
ID: 9
ID: 10
ID: 11
ID: 12
ID: 13
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
ID: 1
I guess that it depends on the invocations being cached or not, loaded from disk, etc. I don't think we can count on them actually arriving in order.
That being said, instead of push, I suppose that we could do something like query_hits.resize(new_observed_max_id, 0). Do you want me to do that?
Pushed a change to use a vec, let me know what you think. I didn't use IndexVec, because we index it with QueryInvocationId just once; the rest of the operations (like resize or iterating) work with usize anyway. Also, we would need to implement Idx for it, which requires working with usize, but the invocation ID only stores a u32.
I wonder if we can hit some pathological cases here if we keep calling resize_with to grow the vec by one each time...
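To make the vec-based variant and the growth question concrete, here is a rough sketch under stated assumptions. `DenseHitCounts`, `increment`, `snapshot`, and `count_reallocations` are hypothetical names, a plain `u32` stands in for the invocation ID, and this is not the code that was pushed. Note that the new length has to be `index + 1`, not the observed maximum ID itself. The helper at the end only looks at allocation behavior (it says nothing about lock contention): with the current standard library, `Vec::resize_with` reserves capacity in amortized fashion, so growing by one element at a time should not reallocate on every call.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

/// Dense cache hit counters, indexed directly by a `u32` invocation ID.
#[derive(Default)]
pub struct DenseHitCounts {
    hits: RwLock<Vec<AtomicU64>>,
}

impl DenseHitCounts {
    /// Records one cache hit for `id`, growing the vector on demand.
    pub fn increment(&self, id: u32) {
        let index = id as usize;
        // Fast path: the slot already exists, a read lock is enough.
        {
            let hits = self.hits.read().unwrap();
            if let Some(counter) = hits.get(index) {
                counter.fetch_add(1, Ordering::Relaxed);
                return;
            }
        }
        // Slow path: IDs may arrive out of order, so grow up to `index + 1`.
        // The length is re-checked because another thread may have grown the
        // vector between releasing the read lock and taking the write lock.
        let mut hits = self.hits.write().unwrap();
        if index >= hits.len() {
            hits.resize_with(index + 1, || AtomicU64::new(0));
        }
        hits[index].fetch_add(1, Ordering::Relaxed);
    }

    /// Copies out the current counts, one entry per invocation ID.
    pub fn snapshot(&self) -> Vec<u64> {
        self.hits
            .read()
            .unwrap()
            .iter()
            .map(|counter| counter.load(Ordering::Relaxed))
            .collect()
    }
}

/// Grows a vector by one element at a time via `resize_with` and counts
/// how often its capacity changes, as a proxy for reallocations.
pub fn count_reallocations(n: usize) -> usize {
    let mut v: Vec<AtomicU64> = Vec::new();
    let mut capacity = v.capacity();
    let mut reallocations = 0;
    for len in 1..=n {
        v.resize_with(len, || AtomicU64::new(0));
        if v.capacity() != capacity {
            capacity = v.capacity();
            reallocations += 1;
        }
    }
    reallocations
}
```

Feeding it the first few IDs from the log above (2, 2, 4, 1, 0, 4, 7) leaves zero-filled slots for the IDs that were never hit (3, 5, 6), which is the price of the dense layout.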
| profiler_ref.profiler.as_ref().unwrap().increment_query_cache_hit(query_invocation_id); | ||
| } | ||
|  | ||
| if unlikely(self.event_filter_mask.contains(EventFilter::QUERY_CACHE_HITS)) { | 
Any reason to change the existing event rather than adding a new one that only tracks counts?
I think this is losing the information for the query "tree" that was previously present, right? It used to be possible to generate a flamegraph of queries but now since there's no timing/thread information we can't track the parent relationships.
That doesn't seem consistently useful, but it also doesn't seem useless to me...
Well, I figured that it wasn't really used in practice (I haven't found anything on GitHub code search), and it was quite expensive. A practical reason to avoid adding a new filter event was to avoid having two mask checks in this very hot function. But the cost of that (with -Zself-profile enabled) is probably still minuscule in comparison to what was happening before. Without self-profiling, we could just ask whether QUERY_CACHE_HITS | QUERY_CACHE_HITS_COUNT is enabled, to keep a single check in the fast path, so it would probably be fine.
Happy to add a new filter event though, should be simple enough, and wouldn't break backwards compatibility.
How do you generate such a flamegraph that takes query hits into account, btw?
ooh I need to try this out, it could be very useful to see the timestamp difference between query hit counts within another query to analyze performance changes
| I changed the implementation to use a vec instead of a map (although I think map perf. is also ~fine), and added a new event filter (enabled by default), instead of changing the behavior of the old one. | 
| @bors r+ | 
| @bors rollup=never Just in case there's some perf. effect. | 
Add new self-profiling event to cheaply aggregate query cache hit counts

Self-profile can record various types of events, some of which are not enabled by default, like query cache hits. Rustc currently records cache hits as "instant" measureme events, which record the thread ID and current timestamp and construct an individual event for each cache hit. This is incredibly expensive: in a small hello world benchmark that just depends on serde, it makes compilation with nightly go from ~3s (with `-Zself-profile`) to ~15s (with `-Zself-profile -Zself-profile-events=default,query-cache-hit`).

We'd like to add query cache hits to rustc-perf (rust-lang/rustc-perf#2168), but there we only need the actual cache hit counts, not the timestamp/thread ID metadata associated with them.

This PR adds a new `query-cache-hit-count` event. Instead of generating individual instant events, it simply aggregates cache hit counts per *query invocation* (so a combination of a query and its arguments, if I understand it correctly) using an atomic counter. At the end of the compilation session, these counts are then dumped to the self-profile log using integer events (in a similar fashion to how we record artifact sizes). I suppose that we could dedup the query invocations in rustc directly, but I don't think it's really required. In local experiments with the hello world + serde case, the query invocation records generated ~30 KiB more data in the self-profile, which was a ~10% increase in this case.

With this PR, the overhead of `-Zself-profile` seems to be the same as before, at least on my machine, so I also enabled query cache hit counts by default when self-profiling is enabled.

We should also modify `analyzeme`, specifically [this](https://github.com/rust-lang/measureme/blob/master/analyzeme/src/analysis.rs#L139), and make it load the integer events with query cache hit counts. I can do that as a follow-up; it doesn't need to be done in sync with this PR, and it doesn't require changes in rustc.

CC `@cjgillot`
r? `@oli-obk`
      
        
| 💔 Test failed - checks-actions | 
| Oops, apparently 32-bit platforms are a thing. I decided to use portable  @bors2 try jobs=dist-powerpc-linux | 
| Looks good. @bors r=oli-obk | 
| ☀️ Test successful - checks-actions | 
| What is this? This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.
Comparing f51c987 (parent) -> b94bd12 (this PR)
Test differences: 3 doctest diffs were found. These are ignored, as they are noisy.
Test dashboard: Run
    cargo run --manifest-path src/ci/citool/Cargo.toml -- \
        test-dashboard b94bd12401d26ccf1c3b04ceb4e950b0ff7c8d29 --output-dir test-dashboard
And then open
Job duration changes
How to interpret the job duration changes? Job durations can vary a lot, based on the actual runner instance |
| Finished benchmarking commit (b94bd12): comparison URL.
Overall result: ❌ regressions - please read the text below
Our benchmarks found a performance regression caused by this PR.
Next Steps:
@rustbot label: +perf-regression
Instruction count: Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage): Results (primary -2.5%, secondary -3.1%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Cycles: Results (primary 1.6%, secondary 1.3%). A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: 462.08s -> 462.452s (0.08%) |
| Might be genuine regressions, although the fast path shouldn't have changed much. | 
| perf triage: @Kobzol Do you plan to dig into these regressions or do we accept that? It seems like most of those are in type system heavy benchmarks, which (I assume) use the query system a lot, so it seems like this might be legit overhead of the new counter. | 
| It’s also very possible that the regressions are only seen when using the self profiler, and maybe are not something people would actually encounter in the real world on these crates — whereas rustc-perf uses it unconditionally. | 
| The results from the benchmark suite hopefully shouldn't have the self-profiler enabled, we only use it for a single iteration to gather the self-profile data, but the rest of the iterations are executed without it.
In terms of the regressions, the only change to "normal" code should be this:
    if unlikely(self.event_filter_mask.contains(EventFilter::QUERY_CACHE_HITS)) {
turned into this:
    if unlikely(self.event_filter_mask.intersects(EventFilter::QUERY_CACHE_HIT_COMBINED))
I actually tried to compile this in the playground:
    foo1:                                   # @foo1
    # %bb.0:
    	movl	%edi, %eax
    	andl	$4, %eax
    	shrl	$2, %eax
                                            # kill: def $al killed $al killed $eax
    	retq
                                            # -- End function
    foo2:                                   # @foo2
    # %bb.0:
    	testl	$1028, %edi                     # imm = 0x404
    	setne	%al
    	retq
and it looks like
The only other change is that the cold part of the code got larger, but I don't think we can do anything about it, and if the cold annotation works properly, the code should lie somewhere else in the binary, in order not to affect the fast path (hopefully). |
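For reference, a pair of functions along the following lines would be consistent with the assembly in the comment above. This is a reconstruction under assumptions, not the playground code itself: the bit positions are inferred from the immediates `$4` and `0x404`, plain `u32` masks stand in for the `EventFilter` bitflags type, and `QUERY_CACHE_HITS_COUNT` is the name floated earlier in the thread rather than a confirmed identifier.

```rust
/// Existing per-event filter bit (value 4, inferred from `andl $4`).
pub const QUERY_CACHE_HITS: u32 = 1 << 2;
/// Assumed bit for the new aggregated-count filter (0x400, inferred from 0x404).
pub const QUERY_CACHE_HITS_COUNT: u32 = 1 << 10;
/// Union of both filters, so the hot path needs only one test.
pub const QUERY_CACHE_HIT_COMBINED: u32 = QUERY_CACHE_HITS | QUERY_CACHE_HITS_COUNT;

/// True if every bit in `flags` is set in `mask`.
pub fn contains(mask: u32, flags: u32) -> bool {
    (mask & flags) == flags
}

/// True if at least one bit in `flags` is set in `mask`.
pub fn intersects(mask: u32, flags: u32) -> bool {
    (mask & flags) != 0
}

/// Old check: is the per-event cache hit filter enabled?
pub fn foo1(mask: u32) -> bool {
    contains(mask, QUERY_CACHE_HITS)
}

/// New check: is either cache hit filter enabled?
pub fn foo2(mask: u32) -> bool {
    intersects(mask, QUERY_CACHE_HIT_COMBINED)
}
```

Both checks are a single mask test on the fast path; which filter actually fired only needs to be distinguished inside the cold branch.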
Fix wrong cache event query key I messed this up in rust-lang#142978. It is only an issue if someone enables the event manually, which almost no-one does, so it could take a while before we found it :D Luckily I noticed it while re-reading the PR. r? `@oli-obk`
Rollup merge of #143586 - Kobzol:self-profile-fix, r=oli-obk Fix wrong cache event query key I messed this up in #142978. It is only an issue if someone enables the event manually, which almost no-one does, so it could take a while before we found it :D Luckily I noticed it while re-reading the PR. r? `@oli-obk`