- 
                Notifications
    
You must be signed in to change notification settings  - Fork 79
 
Fix genetic_relatedness to allow single sample set with self-comparisons #3235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix genetic_relatedness to allow single sample set with self-comparisons #3235
Conversation
          Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@           Coverage Diff           @@
##             main    #3235   +/-   ##
=======================================
  Coverage   89.75%   89.75%           
=======================================
  Files          29       29           
  Lines       32545    32545           
  Branches     5962     5962           
=======================================
  Hits        29211    29211           
  Misses       1890     1890           
  Partials     1444     1444           
 Flags with carried forward coverage won't be shown. Click here to find out more. 
 🚀 New features to boost your workflow:
  | 
    
89397d5    to
    cd830f7      
    Compare
  
    | 
           I think we'll need to wait for @petrelharp to get informed review on this one.  | 
    
| 
           Bumping this one for you @petrelharp  | 
    
| 
           Ignore my previous comment (now deleted); I was being dense.  | 
    
| 
           Let's see. First, it's quite easy to modify the checking code that produces the error. I went ahead and did that (see this commit: 57ad49f). However, one can compute 'self' comparisons for other statistics, for instance, this works: or this These are perfectly valid things to compute, just the thought was they are probably not telling someone what they actually want to know. So, if the intention behind the restriction for "at least k sample sets for a k-way stat" was to keep people from computing things they probably didn't want to compute, then we're not actually doing that. We could require indexes be unique, but I don't think we should actually do that - for instance,  How's that sound, @andrewkern?  | 
    
4b8755f    to
    7f48542      
    Compare
  
    | 
           LMK how this looks @andrewkern; if you like it I'll squash these commits down.  | 
    
931ce7e    to
    34d6826      
    Compare
  
    | 
           Rebased; added to CHANGELOG; tests are (should be) passing. Ready to merge when you approve, @andrewkern .  | 
    
| int ret = 0; | ||
| 
               | 
          ||
| if (num_sample_sets < tuple_size) { | ||
| if (num_sample_sets < 1) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment here to explain the check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems pretty self-explanatory to me? Like, how can you compute a statistic with only one sample set? (The comment would be something like // check there is at least one sample set, but we like to avoid this sort of comment that just says the same thing as the code?)
| 
           looks good to me @petrelharp -- a couple nit picks about adding comments to help explain the code to future us  | 
    
| 
           See my comments & changes. Could you "approve" the code in review if you're good with it now? (Or let me know if not?)  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…sons Previously, genetic_relatedness would fail with TSK_ERR_INSUFFICIENT_SAMPLE_SETS when given a single sample set and indexes referring to that set, e.g.: ts.genetic_relatedness([[0]], indexes=[(0,0)]) This was because the C API requires at least k=2 sample sets for k-way statistics, even when indexes only reference a single set for self-comparison. The fix is more general: it removes the restriction that there be at least k sample sets for k-way stats, because this wasn't a useful or consistently applied restriction. Closes tskit-dev#3055
c98e387    to
    c173b32      
    Compare
  
    
Description
This PR fixes issue #3055 where
genetic_relatednesswould fail when computing self-comparisons with a single sample set. As @petrelharp pointed out there is no reason this shouldn't be allowed for certain statistics, includinggenetic_relatednessThe Problem
Previously, calling
genetic_relatednesswith a single sample set and indexes referring to that set would raise an error:This occurred because the underlying C API validates that at least k=2 sample sets are provided for k-way statistics, even when the indexes only reference a subset of those sets.
This PR
The fix adds an
allow_self_comparisonsparameter to the internal__k_way_sample_set_statmethod, which is only set toTrueforgenetic_relatedness. When enabled:This ensures that:
genetic_relatednessbehavior is changedFurther, this PR sets the stage for allowing for other appropriate k-way stats to be computed with self comparisons by setting a flag.
Testing
Added comprehensive tests for all three computation modes (site, branch, node) to verify:
All existing tests continue to pass, but @petrelharp this would benefit from you specifically looking over what I've done to be sure you're okay with the way I've implemented
allow_self_comparisonsFixes #3055
PR Checklist:
I'm not sure if I should touch the changelogs here?