@Martchus
Contributor

@Martchus Martchus commented Nov 21, 2025

When accessing the test results overview page (especially without parameters, e.g. http://localhost:9526/tests/overview) a huge set of jobs is considered. This leads to excessive memory usage, e.g. with current OSD data the RSS rises above 7 GiB and the request takes a very long time. There's already a limit of 2000 jobs which is also effective, but unfortunately it is not passed to `complex_query` and is only applied later on. I tracked the memory usage while running the code with debug printing, and it was very clear that `complex_query` is the culprit.

This change uses the limit from the start. This means additional filtering done in Perl code also needs to happen as part of the now limited initial query. So this change replaces these Perl loops/grep with passing the filter parameters directly to the database. This slightly changes the order of things but should lead to the same end result.
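A minimal plain-Perl sketch of why passing the limit into the initial query bounds memory. The job records, the iterator standing in for a database cursor, and the sub names `filter_then_slice` and `filter_with_limit` are hypothetical and not openQA's code; the actual change passes the filter parameters and the limit to the database query instead.

```perl
use strict;
use warnings;

# Hypothetical job records; in openQA these would be database rows.
# Every third job (ids 3, 6, 9) has failed.
my @all_jobs = map { +{id => $_, result => ($_ % 3 ? 'passed' : 'failed')} } 1 .. 10;

# Old shape (sketch): load everything, filter in Perl, then cut to the limit.
# Memory grows with the total number of jobs, not with the limit.
sub filter_then_slice {
    my ($jobs, $result, $limit) = @_;
    my @matching = grep { $_->{result} eq $result } @$jobs;
    splice @matching, $limit if @matching > $limit;
    return \@matching;
}

# New shape (sketch): filter and stop as soon as $limit matches are found,
# similar to a query with WHERE and LIMIT; at most $limit jobs are kept.
sub filter_with_limit {
    my ($next_job, $result, $limit) = @_;
    my @kept;
    while (@kept < $limit) {
        my $job = $next_job->();
        last unless defined $job;    # source exhausted
        push @kept, $job if $job->{result} eq $result;
    }
    return \@kept;
}

# Usage: an iterator standing in for a database cursor.
my $rows_read = 0;
my $next_job = sub { $rows_read < @all_jobs ? $all_jobs[$rows_read++] : undef };

my $old = filter_then_slice(\@all_jobs, 'failed', 2);
my $new = filter_with_limit($next_job, 'failed', 2);
print join(',', map { $_->{id} } @$old), "\n";    # 3,6
print join(',', map { $_->{id} } @$new), "\n";    # 3,6 (after reading only 6 rows)
```

Both variants return the same jobs here, but the second one stops reading once the limit is reached instead of holding every job in memory.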


Still a draft, as I'll also have to take care of other places where we use `complex_query` in a similar way. I probably also still need to change some details depending on test results. Once I've taken care of the other places, I can probably also remove code that is then no longer required, e.g. `latest_jobs`.


Related ticket: https://progress.opensuse.org/issues/192448

@Martchus
Contributor Author

Martchus commented Nov 24, 2025

It looks like the current version of the test results overview doesn't actually return results across all builds via /tests/overview, despite having problems with huge memory consumption in some cases. At least with my local database I just get two jobs of a single build via /tests/overview, which looks rather random. With this PR I get jobs of many different builds, which is much closer to what one would expect from such an unconstrained query. This behavior is also visible in unit tests, where more results now show up on test result overview pages.

I'll review whether the new behavior makes sense in the different test cases (manual and unit tests). So far the new behavior makes sense in my manual tests, which would mean the current code is already cutting corners somewhere in a way that no longer seems necessary.


EDIT: I've also just tested this with OSD data. The version of this PR keeps memory usage limited (the final RSS of the process is 178.7 MiB) and is notably faster than the current version (which ends up with an RSS of 7.4 GiB). With OSD data the difference in the jobs present on /tests/overview is much smaller; they are almost identical. Considering that results are limited to 2000 jobs with the remark "narrow down your search parameters", this is completely acceptable.

I also get meaningful results for e.g. /tests/overview?result=failed much faster. The set of jobs is again not identical, but it does make sense. The typical overview pages I tested (with the usual constraints, so the limit of 2000 isn't reached) look identical. Less constrained overview pages I tested (e.g. only the BUILD is constrained but the limit still isn't reached) also look identical. All tested with OSD data.

The only regression I've seen so far is a missing @64bit.


EDIT: The missing @64bit is not really a regression. With this PR only the jobs that are actually displayed are considered when computing the "preferred" machine (which is then omitted from the label), and that actually makes more sense. (The current code doesn't consider all jobs to compute the preferred machine either, but also not only the jobs actually displayed. So the current behavior is a complicated middle ground: some filters (e.g. the job group) are considered and some are not (e.g. the state, the result and the "filter" to show only the latest jobs). We probably don't care much about this, so I'll ignore it.)

@Martchus
Contributor Author

Looks like the current code determines the latest build automatically when no build is specified. Since that actually seems to be a wanted feature I made my change behave similarly. With this there is almost no change in behavior anymore. Some tests still fail which I'll have to work on tomorrow.

I also added the following to the commit message because this is one of the remaining differences and I haven't found a good way to avoid it:

With this change the "preferred" machine (basically the most frequent machine per architecture) is only determined based on the jobs that are actually displayed. That means jobs will no longer show as e.g. kde@uefi when filtering for machine=uefi because with this filter the machine uefi becomes the preferred (most frequent) machine (as now all displayed jobs have that machine). I think this change is acceptable. It means that some tests had to be adjusted accordingly.
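A rough sketch of the "preferred" machine behavior described above, assuming it is simply the most frequent machine per architecture among the displayed jobs and that the `@machine` suffix is only shown when a job's machine differs from it (mirroring the `$preferred_machines->{$job->ARCH} ne $job->MACHINE` condition from the code). The sub names and the alphabetical tie-break are hypothetical; this is not openQA's `_calculate_preferred_machines`.

```perl
use strict;
use warnings;

# Hypothetical sketch: the preferred machine of an architecture is the most
# frequent machine among the given (displayed) jobs. Ties are broken
# alphabetically here, which is an assumption made for determinism.
sub preferred_machines {
    my ($jobs) = @_;
    my %count;
    $count{$_->{arch}}{$_->{machine}}++ for @$jobs;
    my %preferred;
    for my $arch (keys %count) {
        my $per_machine = $count{$arch};
        ($preferred{$arch}) = sort { $per_machine->{$b} <=> $per_machine->{$a} or $a cmp $b } keys %$per_machine;
    }
    return \%preferred;
}

# Append "@machine" only if the job's machine is not the preferred one.
sub job_label {
    my ($job, $preferred) = @_;
    my $preferred_machine = $preferred->{$job->{arch}};
    return $preferred_machine && $preferred_machine ne $job->{machine}
      ? "$job->{test}\@$job->{machine}"
      : $job->{test};
}

my @jobs = (
    {test => 'kde',   arch => 'x86_64', machine => '64bit'},
    {test => 'gnome', arch => 'x86_64', machine => '64bit'},
    {test => 'kde',   arch => 'x86_64', machine => 'uefi'},
);

# Unfiltered: 64bit is preferred, so the uefi job is labeled "kde@uefi".
my $label_unfiltered = job_label($jobs[2], preferred_machines(\@jobs));

# Filtered by machine=uefi: only uefi jobs are displayed, uefi becomes the
# preferred machine and the same job is labeled just "kde".
my @uefi_jobs = grep { $_->{machine} eq 'uefi' } @jobs;
my $label_filtered = job_label($uefi_jobs[0], preferred_machines(\@uefi_jobs));
print "$label_unfiltered\n$label_filtered\n";
```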

@Martchus
Contributor Author

I've fixed almost all failures, but the subtest "filtering does not reveal old jobs" reveals a bigger problem with my current approach. I guess it actually made an important difference that the current code applies these filters relatively late (but unfortunately also before limiting takes place). I need to find a way to do the filtering (e.g. on the result) after the "latest" de-duplication/grouping but still before the limit is applied.
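To illustrate why the order matters for the subtest "filtering does not reveal old jobs", here is a plain-Perl sketch: de-duplicate to the latest job per scenario first, then filter on the result, then apply the limit. The data and sub names are hypothetical, a "scenario" is reduced to a single string, and openQA itself does this in SQL with a subquery rather than in Perl.

```perl
use strict;
use warnings;

# Hypothetical jobs, newest first. A "scenario" is reduced to one string here.
my @jobs_newest_first = (
    {id => 4, scenario => 'kde',   result => 'passed'},
    {id => 3, scenario => 'gnome', result => 'failed'},
    {id => 2, scenario => 'kde',   result => 'failed'},    # superseded by job 4
    {id => 1, scenario => 'xfce',  result => 'failed'},
);

# Keep only the newest job of each scenario (relies on newest-first order).
sub latest_per_scenario {
    my ($jobs) = @_;
    my %seen;
    return [grep { !$seen{$_->{scenario}}++ } @$jobs];
}

# Desired order: 1. latest per scenario, 2. filter on result, 3. apply limit.
sub overview {
    my ($jobs, $result, $limit) = @_;
    my @rows = grep { $_->{result} eq $result } @{latest_per_scenario($jobs)};
    splice @rows, $limit if @rows > $limit;
    return \@rows;
}

# Filtering first instead drops the newer passed kde job (4) and thereby
# "reveals" the old failed kde job (2).
sub overview_filter_first {
    my ($jobs, $result, $limit) = @_;
    my @rows = @{latest_per_scenario([grep { $_->{result} eq $result } @$jobs])};
    splice @rows, $limit if @rows > $limit;
    return \@rows;
}

sub ids { join ',', map { $_->{id} } @{$_[0]} }

my $desired      = ids(overview(\@jobs_newest_first, 'failed', 10));                 # 3,1
my $filter_first = ids(overview_filter_first(\@jobs_newest_first, 'failed', 10));    # 3,2,1
print "$desired\n$filter_first\n";
```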

@Martchus Martchus force-pushed the overview-limit branch 2 times, most recently from e78ad16 to 8bcf758 November 25, 2025 16:14
@Martchus
Contributor Author

The case tested via the subtest "filtering does not reveal old jobs" now also works with my latest push.

I now also covered the API route that is the equivalent of the overview page. With this the only remaining unaltered usage of complex_query is in the "list" API route but this one is already limited from the start. It still uses Perl code for filtering the latest jobs. Hence I wasn't able to remove the latest_jobs function.

With all these changes to the PR, the memory usage of /tests/overview is still significantly reduced (still around 200 MiB, not over 7 GiB).

@Martchus Martchus marked this pull request as ready for review November 25, 2025 16:15
@Martchus Martchus changed the title WIP: Fix possibly excessive memory use when computer test result overview Fix possibly excessive memory use when computer test result overview Nov 25, 2025
@codecov

codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.26%. Comparing base (e3b9e48) to head (c018f21).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master    #6850   +/-   ##
=======================================
  Coverage   99.26%   99.26%           
=======================================
  Files         402      402           
  Lines       41493    41522   +29     
=======================================
+ Hits        41187    41218   +31     
+ Misses        306      304    -2     
```


```perl
@jobs = @jobs[0 .. ($limit - 1)] if $limit_exceeded;
my @jobids = map { $_->id } @jobs;
my @jobs = $jobs->all;
my $preferred_machines = _calculate_preferred_machines(\@jobs);
```
Contributor

Just a note: the `$jobs` are used here only to calculate `$preferred_machines`, which is used later in this condition:

```perl
            && $preferred_machines->{$job->ARCH}
            && $preferred_machines->{$job->ARCH} ne $job->MACHINE)
```

So `_prepare_job_results` only needs this and not the whole ResultSet. You could call `_calculate_preferred_machines(\@jobs);` before https://github.com/Martchus/openQA/blob/79f52e8c934da96eaa20e95fca56a8891dfc361a/lib/OpenQA/WebAPI/Controller/Test.pm#L848C5-L848C94 and just pass the result. I think that way the dependencies between the invocations would be simpler and the functions clearer.

But let's not change anything now.

Contributor Author

The `$jobs` are used here only to calculate …

The "only" in this sentence makes it wrong. Just search for @jobs in the subsequent code of this function; you'll find 4 occurrences.

Contributor

@d3flex d3flex left a comment

I'm finished with the review. I would approve, but I'd like to ask you to fix a tiny typo in the 3rd commit: it says "That means a jobs will no longer show as". It's meant to be without the "a", right? Other than that I'm ready to approve it.

When accessing the test results overview page (especially without
parameters, e.g. http://localhost:9526/tests/overview) a huge set of
jobs is considered. This leads to excessive memory usage, e.g. with
current OSD data the RSS rises above 7 GiB and the request takes a very
long time. There's already a limit of 2000 jobs which is also effective
but unfortunately not passed to `complex_query` and only used later on. I
tracked the memory usage while running the code with debug printing, and
it was very clear that `complex_query` is the culprit.

This change uses the limit from the start. This means additional
filtering done in Perl code also needs to happen as part of the now
limited initial query (using a sub query, see the added comment). So this
change replaces these Perl loops/grep with passing the filter parameters
directly to the database. This slightly changes the order of things but
should lead to the same end result.

With this change the "preferred" machine (basically the most frequent
machine per architecture) is only determined based on the jobs that are
actually displayed. That means jobs will no longer show as e.g. `kde@uefi`
when filtering for `machine=uefi` because with this filter the machine
`uefi` becomes the preferred (most frequent) machine (as now all displayed
jobs have that machine). I think this change is acceptable. It means that
some tests had to be adjusted accordingly.

Related ticket: https://progress.opensuse.org/issues/192448
@Martchus
Contributor Author

I fixed the typo in the 3rd commit.

@mergify mergify bot merged commit 131df4d into os-autoinst:master Nov 26, 2025
51 checks passed
@Martchus Martchus deleted the overview-limit branch November 26, 2025 15:11