Skip to content

Implement efficient seeking from non-null trees using tree_pos #2911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 13, 2025

Conversation

duncanMR
Copy link
Contributor

Description

Continuing from #2874, we want to finish moving over the tree-positioning code to use tree_pos efficiently. At the moment, tsk.tree_seek will either call tsk_tree_seek_from_null or tsk_tree_seek_linear depending on whether we are starting from the null tree or not. seek_linear repeatedly calls next or prev until it reaches the given position, with the direction being determined by which would cover the shortest distance.

As a first pass, I've implemented tsk_tree_seek_forward and tsk_tree_seek_backward and I've incorporated them into tsk_tree_seek_linear. We will need to revise some of the test_highlevel.py seek tests, because the direction we choose to seek along is different to the old approach in some cases. For example, we now seek forward to go from the first to the last tree in a sequence.

Curiously, my implementation passes all the C tests with no memory issues detected by Valgrind, and it also passes all the test_highlevel.py and test_tree_positioning tests except for the ones dependent on seeking direction. However, it has caused chaos with other Python tests, causing failures and segfaults in test_stats.py and test_divmat.py among others. The failing/crashing tests seem to be primarily be associated with LD calculations and divergence.

I'm currently trying to determine whether the problems are due to an error in my implemention (most likely) or the subtle problems with the time ordering of inserted edges, discussed in #2792.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@duncanMR duncanMR marked this pull request as draft February 26, 2024 11:58
Copy link

codecov bot commented Feb 26, 2024

Codecov Report

❌ Patch coverage is 93.54839% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.62%. Comparing base (ce09b35) to head (c79fbd8).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
c/tskit/trees.c 92.00% 2 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2911      +/-   ##
==========================================
+ Coverage   89.61%   89.62%   +0.01%     
==========================================
  Files          28       28              
  Lines       31983    32058      +75     
  Branches     5888     5903      +15     
==========================================
+ Hits        28660    28731      +71     
- Misses       1888     1889       +1     
- Partials     1435     1438       +3     
Flag Coverage Δ
c-tests 86.61% <92.00%> (+0.01%) ⬆️
lwt-tests 80.38% <ø> (ø)
python-c-tests 88.19% <100.00%> (+0.04%) ⬆️
python-tests 98.80% <100.00%> (+<0.01%) ⬆️
python-tests-numpy1 52.45% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/_tskitmodule.c 88.19% <100.00%> (+0.04%) ⬆️
python/tskit/trees.py 98.85% <100.00%> (+<0.01%) ⬆️
c/tskit/trees.c 90.61% <92.00%> (+<0.01%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jeromekelleher
Copy link
Member

That is interesting. I can see the LD calculator causing problems as that does a lot of seeking backward and forward (was the original motivation for bidirectional seeking). I don't see anything obvious wrong, but it must be something to do with sample counts.

One thing that we can at least make progress on is that I think we should keep the option of seeking linearly. So, we make a new option to seek, TSK_SEEK_LINEAR which always uses the linear seek algorithm. We'll want to expose this to Python also, so there'll be a bit of plumbing involved.

We should specify this option in the ld_calculator, as that is definitely somewhere that linear seeking makes sense.

@duncanMR
Copy link
Contributor Author

That is interesting. I can see the LD calculator causing problems as that does a lot of seeking backward and forward (was the original motivation for bidirectional seeking). I don't see anything obvious wrong, but it must be something to do with sample counts.

One thing that we can at least make progress on is that I think we should keep the option of seeking linearly. So, we make a new option to seek, TSK_SEEK_LINEAR which always uses the linear seek algorithm. We'll want to expose this to Python also, so there'll be a bit of plumbing involved.

We should specify this option in the ld_calculator, as that is definitely somewhere that linear seeking makes sense.

Okay; that makes sense to me. I'll revert seek_linear to its previous form and move the seek algorithm that uses tree_pos to a new function. I'll make the tree_pos method the default for seek, then see exactly which methods need to be changed back to linear in order for the tests to pass. Shall we call the new tree_pos method tsk_tree_seek_nonlinear?

@jeromekelleher
Copy link
Member

seek_skip might be a bit more descriptive?

@benjeffery benjeffery added this to the Python 0.5.10 milestone Sep 23, 2024
@jeromekelleher
Copy link
Member

Something to bear in mind here is that sample_lists might not be compatible with this when seeking around randomly because of the non-linear order that edges can get inserted. A straightforward approach to getting this merged then could be to simply require that TSK_SEEK_LINEAR is always used with sample lists.

@jeromekelleher
Copy link
Member

Note this is affected by #3245. Perhaps worth picking up to see if the problem is indeed with sample lists?

@jeromekelleher
Copy link
Member

I'm going to take a quick look at this to see what's happening

@jeromekelleher jeromekelleher marked this pull request as ready for review August 6, 2025 13:46
@jeromekelleher
Copy link
Member

I think this is ready for a more general review and merging. I'm reasonably sure that the algorithms are working now and everything is safe, but it's still not clear to me if we should make skip the default (if it's more efficient, why wouldn't we?). I guess a reasonable way forward might be to merge this much and file an issue to track what the performance differences are on a few different things and then decide whether we want to make it the default or not based on that.

@jeromekelleher jeromekelleher force-pushed the seek_revision branch 2 times, most recently from 46cc619 to 7144ab0 Compare August 6, 2025 13:56
@benjeffery
Copy link
Member

I guess a reasonable way forward might be to merge this much and file an issue to track what the performance differences are on a few different things and then decide whether we want to make it the default or not based on that.

Sounds good to me.

@jeromekelleher
Copy link
Member

I think we need your review and approval then @benjeffery to merge

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Couple of small things

@@ -6549,7 +6635,7 @@ tsk_tree_seek_index(tsk_tree_t *self, tsk_id_t tree, tsk_flags_t options)
}

static int TSK_WARN_UNUSED
tsk_tree_seek_linear(tsk_tree_t *self, double x, tsk_flags_t TSK_UNUSED(options))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the unused flags are being removed? Won't ever be part of the public API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes more sense when you look at where we call tsk_tree_seek_skip or tsk_tree_seek_linear below by looking at options. Doesn't really matter ultimately as it's a private function.

Co-authored-by: Ben Jeffery <[email protected]>
@benjeffery benjeffery added this pull request to the merge queue Aug 13, 2025
Merged via the queue into tskit-dev:main with commit 0ea8e64 Aug 13, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants