Fix genetic_relatedness to allow single sample set with self-comparisons #3235

Open: andrewkern wants to merge 2 commits into main

Conversation

andrewkern
Member

@andrewkern andrewkern commented Jun 26, 2025

Description

This PR fixes issue #3055, where genetic_relatedness would fail when computing self-comparisons with a single sample set. As @petrelharp pointed out, there is no reason this shouldn't be allowed for certain statistics, including genetic_relatedness.

The Problem

Previously, calling genetic_relatedness with a single sample set and indexes referring to that set would raise an error:

ts.genetic_relatedness([[0]], indexes=[(0,0)])
# TSK_ERR_INSUFFICIENT_SAMPLE_SETS: Insufficient sample sets provided

This occurred because the underlying C API validates that at least k=2 sample sets are provided for k-way statistics, even when the indexes only reference a subset of those sets.
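
The failure mode can be illustrated with a simplified pure-Python model of the check. This is only a sketch of the validation as characterised above, not the actual C code: the function name check_k_way_inputs and the exception class are hypothetical, and the order of the two checks is an assumption.

```python
class InsufficientSampleSetsError(ValueError):
    """Hypothetical stand-in for TSK_ERR_INSUFFICIENT_SAMPLE_SETS."""


def check_k_way_inputs(num_sample_sets, indexes, k=2):
    """Simplified model of the validation described above.

    This is an assumption for illustration, not the real C implementation.
    """
    # The count check ignores which sample sets the indexes actually use.
    if num_sample_sets < k:
        raise InsufficientSampleSetsError("Insufficient sample sets provided")
    # Each index tuple is then checked against the available sample sets.
    for index_tuple in indexes:
        if len(index_tuple) != k:
            raise ValueError(f"each index must have {k} entries")
        for j in index_tuple:
            if not 0 <= j < num_sample_sets:
                raise ValueError(f"sample set index {j} out of bounds")


# One sample set with a self-comparison is rejected by the count check...
try:
    check_k_way_inputs(1, [(0, 0)])
    rejected = False
except InsufficientSampleSetsError:
    rejected = True

# ...while the same indexes pass once a second, unused set is present.
check_k_way_inputs(2, [(0, 0)])
```

In this model the indexes (0, 0) are identical in both calls; only the number of sample sets differs, which is why padding the list is enough to get past the check.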

This PR

The fix adds an allow_self_comparisons parameter to the internal __k_way_sample_set_stat method, which is only set to True for genetic_relatedness. When enabled:

  1. The method validates that all indexes reference valid sample sets
  2. If fewer than k sample sets are provided, it pads the list with dummy sample sets to satisfy the C API requirement
  3. The dummy sets are never accessed during computation since only the sets referenced by the indexes are used
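
The three steps above can be sketched in pure Python. This is a minimal illustration, not the code in this PR: the helper name pad_sample_sets is hypothetical, and padding with copies of the first sample set is an assumption, since the description does not say what the dummy sets contain.

```python
def pad_sample_sets(sample_sets, indexes, k=2, allow_self_comparisons=False):
    """Return at least k sample sets, padding only when the flag allows it.

    Hypothetical helper: the name and the choice of padding value are
    assumptions for illustration, not the PR's actual code.
    """
    sample_sets = [list(s) for s in sample_sets]
    n = len(sample_sets)
    if n >= k:
        # Enough sets were supplied; nothing to do.
        return sample_sets
    if not allow_self_comparisons or n == 0:
        # Statistics without the flag keep the minimum-sets requirement.
        raise ValueError("Insufficient sample sets provided")
    # Step 1: every index must refer to a sample set the caller supplied.
    for index_tuple in indexes:
        for j in index_tuple:
            if not 0 <= j < n:
                raise ValueError(f"sample set index {j} out of bounds")
    # Step 2: pad with dummy sets (here, copies of the first set) up to k.
    # Step 3: the validated indexes cannot point at these padded entries.
    padding = [list(sample_sets[0]) for _ in range(k - n)]
    return sample_sets + padding


# Self-comparison on a single sample set, with the flag enabled.
padded = pad_sample_sets([[0]], [(0, 0)], allow_self_comparisons=True)
```

Validating the indexes before padding is what keeps the dummy sets out of the computation in this sketch: an index such as (0, 1) is rejected rather than silently resolved to a padded entry.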

This ensures that:

  • Only genetic_relatedness behavior is changed
  • Other statistics (like Fst) continue to enforce the minimum sample set requirement
  • The C API contract is satisfied without modifying the C code

This PR also makes it possible to enable self-comparisons for other appropriate k-way statistics later by setting the same flag.

Testing

Added comprehensive tests for all three computation modes (site, branch, node) to verify:

  • Single sample set with self-comparison works correctly
  • Multiple samples within a single set work correctly
  • The fix doesn't affect other statistics

All existing tests continue to pass. @petrelharp, this would benefit from you looking over what I've done, to be sure you're okay with the way I've implemented allow_self_comparisons.

Fixes #3055

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • [] Changelogs, if there are API changes.

I'm not sure if I should touch the changelogs here?

…sons

Previously, genetic_relatedness would fail with TSK_ERR_INSUFFICIENT_SAMPLE_SETS
when given a single sample set and indexes referring to that set, e.g.:
  ts.genetic_relatedness([[0]], indexes=[(0,0)])

This was because the C API requires at least k=2 sample sets for k-way statistics,
even when indexes only reference a single set for self-comparison.

The fix adds an allow_self_comparisons parameter to __k_way_sample_set_stat,
which is set to True only for genetic_relatedness. When enabled and indexes
only reference existing sample sets, dummy sample sets are added to satisfy
the C API requirement while ensuring they are never accessed during computation.

Tests added for all three modes (site, branch, node) to verify self-comparisons
work correctly with single sample sets.

Fixes tskit-dev#3055

codecov bot commented Jun 26, 2025

Codecov Report

Attention: Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.61%. Comparing base (3c0a99a) to head (cd830f7).

Files with missing lines    Patch %    Lines
python/tskit/trees.py       91.66%     0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3235   +/-   ##
=======================================
  Coverage   89.61%   89.61%           
=======================================
  Files          28       28           
  Lines       31983    31995   +12     
  Branches     5888     5892    +4     
=======================================
+ Hits        28660    28671   +11     
  Misses       1888     1888           
- Partials     1435     1436    +1     
Flag                   Coverage Δ
c-tests                86.59% <ø> (ø)
lwt-tests              80.38% <ø> (ø)
python-c-tests         88.15% <ø> (ø)
python-tests           98.79% <91.66%> (-0.02%) ⬇️
python-tests-numpy1    52.35% <0.00%> (-0.09%) ⬇️

Files with missing lines    Coverage Δ
python/tskit/trees.py       98.82% <91.66%> (-0.04%) ⬇️

@andrewkern andrewkern force-pushed the fix-issue-3055-genetic-relatedness branch from 89397d5 to cd830f7 Compare June 26, 2025 22:12
@jeromekelleher
Member

I think we'll need to wait for @petrelharp to get an informed review on this one.

Development

Successfully merging this pull request may close these issues.

genetic_relatedness thinks it needs 2 samples sets but is valid with only one
2 participants