Contributor

@ydirson ydirson commented Jun 12, 2025

It is very annoying to have a single failed test stop the whole run by default: it forces us to rerun everything (or to spend time crafting a filter that excludes exactly the tests we don't want to rerun), whereas when a test has failed and the user does want to stop the run because of it, Ctrl-C is our friend.

It is also very easy to add -x to a pytest command to stop on the first failure when really needed, and much less obvious to dig up --maxfail=0 when we want a view of all errors. Even in CI it is not obvious that we always want to stop on the first failure anyway, likely only for the very long jobs. Those CI jobs can be modified even before this PR gets merged, to add -x where we want it.

Continuing after a failure is also the standard pytest behaviour, and what any newcomer with previous knowledge of pytest will expect.
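
To illustrate how these flags interact, here is a small model, not pytest itself. It assumes the current stop-on-first-failure default comes from something like `--maxfail=1` in the `addopts` setting (the actual configuration is not shown in this thread). It mirrors the way pytest stores `-x` and `--maxfail` in the same setting, with the last occurrence winning and 0 meaning "no limit". `effective_maxfail` is a hypothetical helper written for this sketch.

```python
import argparse


def effective_maxfail(args):
    """Return the failure limit pytest would end up with for these arguments.

    Models pytest's option definitions: -x/--exitfirst and --maxfail write
    to the same setting, and 0 means "no limit".
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-x", "--exitfirst", action="store_const",
                        const=1, dest="maxfail", default=0)
    parser.add_argument("--maxfail", type=int, dest="maxfail", default=0)
    # Anything else (test paths, other options) is ignored by this model.
    known, _unknown = parser.parse_known_args(args)
    return known.maxfail


# Assumption: the config file contributes "--maxfail=1" through addopts,
# which pytest places before the command-line arguments.
addopts = ["--maxfail=1"]

print(effective_maxfail(addopts))                    # current default: stop at first failure
print(effective_maxfail(addopts + ["--maxfail=0"]))  # override needed today to see all errors
print(effective_maxfail([]))                         # stock pytest: no limit
print(effective_maxfail(["-x"]))                     # opt-in stop at first failure
```

With the stock default, the only flag anyone needs to remember is `-x`.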

Signed-off-by: Yann Dirson <[email protected]>
@ydirson ydirson requested review from glehmann, stormi and gduperrey June 12, 2025 12:26
Member

@glehmann glehmann left a comment

Either way works for me

Member

@gduperrey gduperrey left a comment

I'm not really sure I understand the purpose here because you're making a change for the whole thing.

And for quite a few tests run by Jenkins (apart from the two big ones, which are big because we're testing a lot of VMs), I'm not sure we want the run to continue with SRs or other tested objects in an abnormal state.
So the change here means re-editing all the jobs to add the -x, and this PR doesn't do that.

So just for the two big parts of multi, you're applying an option to everything. At the very least, I would have expected a discussion and for this PR to take into account the addition of the -x that you're proposing to the jobs that should have it.

@ydirson
Contributor Author

ydirson commented Jun 12, 2025

At the very least, I would have expected a discussion

That's what we're having here, a PR is not a push to master, and has the nice property of archiving the discussion for future reference :)

and for this PR to take into account the addition of the -x that you're proposing to the jobs that should have it.

I would expect that to happen somewhere else, or are you thinking about adding them to jobs.py?

@ydirson
Contributor Author

ydirson commented Jun 12, 2025

I'm not really sure I understand the purpose here because you're making a change for the whole thing.

The purpose is to get the pytest behaviour back to the default on this point. I tried to make the reasons understandable in the commit message, what points could I improve on?

@gduperrey
Member

That's really not my definition of a PR ;)

Since you'll also want to be able to use jobs.py commands with your modification, I'd say we should either have jobs.py override the new default and keep stopping on error (no modifications required for Jenkins), with perhaps an option to turn that off for those who want it to work the way you want.

Or conversely, add an option that ignores the new behavior and stops on error as before. But that would require modifying all existing pipelines/jobs, and not forgetting it for future pipelines/jobs.

And no, the commit message seems to be what you want. As explained, the only job where I would have liked this change is the multi-job, when we test all the xva files and not the others. In my opinion, the others want it to fail for analysis. Otherwise, we risk running the tests on unstable and non-compliant environments.

@ydirson
Contributor Author

ydirson commented Jun 12, 2025

Since you'll also want to be able to use jobs.py commands with your modification, I'd say we should either have jobs.py override the new default and keep stopping on error (no modifications required for Jenkins), with perhaps an option to turn that off for those who want it to work the way you want.

jobs.py already forwards options to pytest, so it can already take -x
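
As an illustration of that kind of forwarding (a minimal sketch only, not the actual jobs.py code): a wrapper can parse the arguments it knows and hand everything else to pytest unchanged. The job table, the `example-job` name, and `build_pytest_command` are all hypothetical.

```python
import argparse

# Hypothetical job table; the real jobs.py defines its own jobs and paths.
JOBS = {"example-job": ["tests/example"]}


def build_pytest_command(argv):
    """Build a pytest command line, forwarding unrecognised options as-is."""
    parser = argparse.ArgumentParser(prog="jobs-sketch")
    parser.add_argument("job", choices=sorted(JOBS))
    # parse_known_args() returns the options this parser does not define
    # (such as -x or --maxfail=0) in a separate list instead of failing.
    job_args, pytest_args = parser.parse_known_args(argv)
    return ["pytest", *JOBS[job_args.job], *pytest_args]


print(build_pytest_command(["example-job", "-x"]))
# ['pytest', 'tests/example', '-x']
```

With a wrapper shaped like this, a Jenkins job that should stop early only needs `-x` appended to its existing command.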

@ydirson
Contributor Author

ydirson commented Jun 12, 2025

And no, the commit message seems to be what you want. As explained, the only job where I would have liked this change is the multi-job, when we test all the xva files and not the others. In my opinion, the others want it to fail for analysis. Otherwise, we risk running the tests on unstable and non-compliant environments.

Semi-ideally, we should be able to easily identify the tests that are known to cause such issues, and then mark them for use of -x. (For something "more ideal", I still tend to think of those tests, where teardown does not or cannot leave the tested object in a sane state, as needing work; being able to identify them will be needed for that as well.)

Member

The tests in xcp-ng-tests are integration tests, running on a complex product. Whenever a test fails, we don't know what state it left the host in. Sure, we attempt to write robust teardown, but when you don't know what state the system is in, teardown itself relies on that system being no more broken than it can handle, whether the failure is due to a recoverable issue or not (unless teardown were to completely reinstall and set up the hosts, or restore some kind of snapshot).

So our approach is to play it safe, fail early and fix.

pytest has options to rerun from the last failure; couldn't we use those (there's --lf and --ff) to avoid re-running everything?

Contributor Author

pytest has options to rerun from the last failure; couldn't we use those (there's --lf and --ff) to avoid re-running everything?

We can check if that works. But it does not help in the case of "let's launch the tests and use that time for a break", just to discover that the whole thing stopped a few minutes in - basically that's what happens to me all the time, and why I'm pushing for this.

The tests in xcp-ng-tests are integration tests, running on a complex product. Whenever a test fails, we don't know what state it left the host in. Sure, we attempt to write robust teardown, but when you don't know what state the system is in, teardown itself relies on that system being no more broken than it can handle, whether the failure is due to a recoverable issue or not (unless teardown were to completely reinstall and set up the hosts, or restore some kind of snapshot).

I do understand that this is the case for some tests. Very notably, tests done on nested hosts protected with snapshots (e.g. tests/install/*) do not need that extra safety (and I hope we'll add this feature someday to all tests run in nested hosts). But since this is a global flag, even when running safe tests we have to take explicit action to be efficient, and that just adds a mix of mental burden and time loss.

It would be much better if we could flag the tests that make big-enough changes to the host, and based on that (and on other info, like "is there a snapshot to save us if needed") decide whether to stop or not. That might need to be a pytest plugin.
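
A rough sketch of what such a plugin could look like, as a `conftest.py` fragment. The `alters_host` marker, its `has_snapshot` keyword, and the `should_stop` helper are hypothetical names invented for this illustration; the hooks (`pytest_configure`, `pytest_exception_interact`) and `session.shouldstop` are existing pytest mechanisms. Where the snapshot information would really come from is left open.

```python
# conftest.py (sketch; the "alters_host" marker is hypothetical)


def should_stop(alters_host, has_snapshot):
    """Stop the run only when a failed test may have left the host dirty
    and there is no snapshot to restore it from."""
    return alters_host and not has_snapshot


def pytest_configure(config):
    # Register the marker so that --strict-markers does not reject it.
    config.addinivalue_line(
        "markers",
        "alters_host(has_snapshot=False): test makes big changes to the host",
    )


def pytest_exception_interact(node, call, report):
    # Called by pytest when a test raised an exception (not for skips or xfails).
    marker = node.get_closest_marker("alters_host")
    if marker is None:
        return  # unmarked tests never stop the run
    has_snapshot = marker.kwargs.get("has_snapshot", False)
    if should_stop(True, has_snapshot):
        # A non-empty shouldstop makes pytest end the session after this test.
        node.session.shouldstop = (
            f"{node.nodeid} failed and may have left the host in an unknown state"
        )
```

A test would opt in with `@pytest.mark.alters_host`, or `@pytest.mark.alters_host(has_snapshot=True)` when a snapshot protects it; unmarked tests would let the run continue after a failure.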

Member

That's the behaviour we want at the moment for most test jobs, outside the ones that spin up their own throw-away test pools. Maybe there's a way to enable it per top-level directory? Or to keep a "stop on first failure" default as we have currently and override it for whole test directories.

@stormi
Member

stormi commented Jun 12, 2025

That's what we're having here, a PR is not a push to master, and has the nice property of archiving the discussion for future reference :)

Regarding this, I personally state regularly that a PR in our projects is a proposal to merge a change unless anyone objects, and it can be received as a disturbance when you have to jump in and say "Wait!" before it's too late. That's a bit different from a request for comments: a subtle difference, but an important one for me. Draft PRs are a better fit for controversial changes such as the one that is proposed here.

@ydirson ydirson marked this pull request as draft June 13, 2025 09:35
@ydirson
Contributor Author

ydirson commented Jun 13, 2025

Draft PRs are a better fit for controversial changes such as the one that is proposed here.

good point, changed to draft

4 participants