diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md new file mode 100644 index 0000000..605bb8e --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -0,0 +1,34 @@ +--- +name: Bug Report +about: Create a report about something not working as expected. +title: '' +labels: '' +assignees: '' + +--- + +## **Describe the bug** +A clear and concise description of what the bug is. + +## **To reproduce the bug** +Steps to reproduce the observed issue: +1. Go to '...' +2. Click on '....' +3. Scroll down to '....' +4. See error + +## **Expected behaviour** +A clear and concise description of what you expected to happen. + +## **Screenshots and data** +If applicable, add input/output data (files or cloud location) or screenshots to help explain your problem. + +## **Additional context** +Add any other context about the problem here. + +## **Proposed troubleshooting steps** +Describe how you're going to try to fix it! +- [ ] This can be a task list + +## **Help requests** +Describe any specific aspects of the problem/solution where help from a teammate is required. Tag them so they're aware. diff --git a/.github/ISSUE_TEMPLATE/feature-addition.md b/.github/ISSUE_TEMPLATE/feature-addition.md new file mode 100644 index 0000000..b824490 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature-addition.md @@ -0,0 +1,26 @@ +--- +name: Feature Addition +about: Suggest an idea for this project. +title: '' +labels: '' +assignees: '' + +--- + +## **Is your feature request related to a problem? Please describe.** +A clear and concise description of what the problem is. Ex. Currently we cannot test ___, the way ___ is done is disorganized [...] + +## **Describe the solution you'd like** +A clear and concise description of what you want to happen, in plain english. + +## **Describe the plan** +A clear and concise description of the path forward as you see it. Describe any uncertainties or alternative solutions that need to be compared. + +### **TODO** +- [ ] Break the plan down into specifics + +## **Additional context** +Add any other context or screenshots about the feature request here. + +## **Help requests** +Describe any specific aspects of the problem/solution where help from a teammate is required. Tag them so they're aware. diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md new file mode 100644 index 0000000..6b4941a --- /dev/null +++ b/.github/pull_request_template.md @@ -0,0 +1,25 @@ +# Overview of Pull Request +A brief description of _why_ this change is necessary (couple of sentences/short paragraph). + +## Release notes +A brief description of what this change is in plain english. What does this change add, fix, or address? (couple of sentences/short paragraph) + +## Closes the following issues + - _[Links to any relevant GitHub Issues]_ + +## Pre-review checklist +Make sure all these boxes are checked before tagging a reviewer! + +- [ ] The PR title is a concise, present-tense summary of the change +- [ ] The PR is linked to any relevant GitHub Issues, if they exist +- [ ] The PR is against the intended branch, and is mergeable +- [ ] The code is sufficiently tested (unit tests and real world experiments, where applicable) +- [ ] The code is linted and meets style guidelines +- [ ] The changelog has been updated +- [ ] I reviewed the PR myself before requesting a review from others + +## Pre-release notes +* External dependencies affected + - [ ] _List any dependencies that may be affected by this change or need to be updated to align with this change_ +* Post-release tasks + - [ ] _List any rake tasks or other TODOs that need to be sorted after this change is released_ diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml new file mode 100644 index 0000000..6648884 --- /dev/null +++ b/.github/workflows/main.yaml @@ -0,0 +1,37 @@ +name: ContiniousIntegrationChecks +run-name: ${{ github.actor }} triggered CI run with event ${{ github.event_name }} for branch ${{ github.ref }} +on: + push: + branches: [ main, ci_add] + pull_request: + branches_ignore: [] +permissions: + id-token: write + contents: read # This is required for actions/checkout +jobs: + fullCI: + defaults: + run: + shell: bash -l {0} + name: CI test + runs-on: ubuntu-latest + steps: + - run: echo "The job was automatically triggered by a ${{ github.event_name }} event." + - run: echo "The name of your branch is ${{ github.ref }} and your repository is ${{ github.repository }}." + - run: echo "The job's status is ${{ job.status }}." + - name: Git clone the repository + uses: actions/checkout@v4 + - name: Set up R env + uses: r-lib/actions/setup-r@v2 + - name: Set up conda + uses: conda-incubator/setup-miniconda@v3 + with: + auto-activate-base: false + python-version: 3.12 + channels: bioconda, conda-forge, defaults + - name: Conda create env and install dependencies + shell: bash -l {0} + run: | + conda create -n test_r r-base r-devtools r-testthat + conda activate test_r + Rscript -e "testthat::test_local()" diff --git a/DESCRIPTION b/DESCRIPTION index ac26179..9e59962 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,22 +1,23 @@ -Package: seqbio.variant.scoring -Title: Score evidence of rare variants' association with disease status in a family, based on relationships of individuals +Package: KinformR +Title: Relationship-informed pedigree and variant scoring Version: 0.1.0 Authors@R: person("Cameron", "Nugent", , "cam.nugent@sequencebio.com", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-1135-2605")) Description: - This R package is designed to score rare variants, assigning values based on - the disease status of individuals, the presence or absence of a rare variant - in those individuals, and their pairwise coefficients of relatedness. - The package uses a custom formula to assign value to a variant that gives - more weight to shared variants common to distantly related affected - individuals. The variant status for unaffected individuals can optionally - be considered as well, with the highest scoring values being given to - closely related individuals that do not share a variant of interst. - Since variants can be incompletely penetrant, the scoring can be based - solely on the affected individuals, or the weight of unaffected evidence - can be customized. + The KinformR R package is meant to aid in comparative evaluation of families + and candidate variants in rare-variant association studies. The package can be used for + two methodologically overlapping but distinct purposes. First, the prior to any genetic or genomic + evaluation, evaluation of relative detection power of pedigrees, can direct recruitment + efforts by showing which unsampled individuals would be the most meaningful additions to a study. + Second, after sequencing and analysis, variants based on association with disease status + and familial relationships of individuals, aids in variant prioritization. License: MIT + file LICENSE Encoding: UTF-8 Roxygen: list(markdown = TRUE) RoxygenNote: 7.3.2 +VignetteBuilder: knitr +Suggests: + devtools, + testthat, + knitr diff --git a/seqbio.variant.scoring.Rproj b/KinformR.Rproj similarity index 100% rename from seqbio.variant.scoring.Rproj rename to KinformR.Rproj diff --git a/LICENSE b/LICENSE index 63e3ef0..d30f53e 100644 --- a/LICENSE +++ b/LICENSE @@ -1,2 +1,21 @@ -YEAR: 2025 -COPYRIGHT HOLDER: seqbio.variant.scoring authors +# MIT License + +Copyright (c) 2025 KinformR Sequence Bioinformatics Inc. + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/LICENSE.md b/LICENSE.md deleted file mode 100644 index f491bd0..0000000 --- a/LICENSE.md +++ /dev/null @@ -1,21 +0,0 @@ -# MIT License - -Copyright (c) 2025 seqbio.variant.scoring authors - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. diff --git a/NAMESPACE b/NAMESPACE index c893e29..c1ff121 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,12 +1,17 @@ # Generated by roxygen2: do not edit by hand -S3method(sum,fam.scores) +export(add.fam.scores) export(assign.status) export(build.relation.dict) export(calc.rv.score) export(encode.rows) +export(ibd) +export(penetrance) export(read.indiv) +export(read.pedigree) export(read.relation.mat) export(read.var.table) +export(score) export(score.fam) +export(score.pedigree) export(score.variant.status) diff --git a/R/encoding.R b/R/encoding.R index f8fe84f..cf3f6d2 100644 --- a/R/encoding.R +++ b/R/encoding.R @@ -1,12 +1,12 @@ #' Take a disease status and a genetic variant and determine which category the combo falls in. -#' A_c = Affected individual with ALT variant -#' A_i = Affected individual without ALT variant -#' U_c = Unaffected individual without ALT variant -#' U_i = Unaffected individual with ALT variant +#' A.c = Affected individual with ALT variant +#' A.i = Affected individual without ALT variant +#' U.c = Unaffected individual without ALT variant +#' U.i = Unaffected individual with ALT variant #' If theoretical.max = TRUE the true variant statuses are ignored and all -#' affected/unaffected are assigned A_c and U_c respectively. +#' affected/unaffected are assigned A.c and U.c respectively. #' These encoding can then be used show what a family's max score would be. #' #' @param status Disease status of an individual. A = affected, U = unaffected. @@ -16,41 +16,44 @@ #' Default is FALSE. #' @return a string #' @examples -#' assign.status("A", "0/1") == "A_c" -#' assign.status("A", "0|0") == "A_i" -#' assign.status("U", 1) == "U_i" -#' assign.status("U", "0|0") =="U_c" +#' assign.status("A", "0/1") == "A.c" +#' assign.status("A", "0|0") == "A.i" +#' assign.status("U", 1) == "U.i" +#' assign.status("U", "0|0") =="U.c" #' @export assign.status <- function(status, variant, theoretical.max=FALSE){ + + var.err<-"Incompatible variant value! Supported encodings are: '0' '1' '0/0' '0/1' '0|0' '0|1'" if(status == "A"){ if(theoretical.max){ - return("A_c") + return("A.c") } - else if(variant == "0/1" || variant == "1/1" || variant == "1" || variant == "0|1" || variant == "1|0"|| variant == "1|1" ){ - return("A_c") + #NOTE - Once in a while 1/0 genotypes crop up; also 0/2 etc. if derived from multi-allelics. This edge case not covered at present. + else if(variant == "0/1" || variant == "1/1" || variant == "1" || variant == "0|1" || variant == "1|0" || variant == "1|1" ){ + return("A.c") }else if (variant == "0/0" || variant == "0" || variant == "0|0" ){ - return("A_i") + return("A.i") }else{ - return("unk") + stop(var.err) } }else if (status == "U"){ if(theoretical.max){ - return("U_c") - } else if(variant == "0/1" || variant == "1/1" || variant == "1" || variant == "0|1" || variant == "1|0"|| variant == "1|1" ){ - return("U_i") + return("U.c") + } else if(variant == "0/1" || variant == "1/1" || variant == "1" || variant == "0|1" || variant == "1|0" || variant == "1|1" ){ + return("U.i") }else if (variant == "0/0" || variant == "0" || variant == "0|0" ){ - return("U_c") + return("U.c") }else{ - return("unk") + stop(var.err) } }else{ - return("unk") + stop("Status must be one of: U or A") } } #' Take the dataframe with variants and status and determine which indivudals #' are scored correctly and which are scored incorrectly. -#' Assign an A_c, A_i, U_c, U_i, unk +#' Assign an A.c, A.i, U.c, U.i, unk #' #' Variants can be encoded as binary (0 or 1, genotypes 0/0 or 0/1, or phased genotypes 0|0 0|1). #' Note the program assumes alt is the disease allele. homozygous alts are allowed. @@ -61,7 +64,7 @@ assign.status <- function(status, variant, theoretical.max=FALSE){ #' TODO - switch to numbers 1-4 and -1? #' @param indiv.df A dataframe with the format: #' name status variant -#' MS-4107-1001 A 0/1 +#' MS-5678-1001 A 0/1 #' @param theoretical.max Should the theoretical maxima be returned instead of the observed values? #' When true, the scoring assumes correct variant-status pair for each individual. #' Default is FALSE. @@ -95,15 +98,13 @@ return(indiv.df) #' @param drop.unrelated Should unrelated (-1) relationships be dropped? Default = TRUE. #' #' @return A list with the categorized relationship/variant information. -#' @examples -#' #TODO - add #' @export build.relation.dict <- function( mat.row, name.stat.dict, drop.unrelated=TRUE){ indiv.rels = list( - "A_c" = c(), - "A_i" = c(), - "U_c" = c(), - "U_i" = c() + "A.c" = c(), + "A.i" = c(), + "U.c" = c(), + "U.i" = c() ) for(i in seq_along(mat.row)){ @@ -124,10 +125,9 @@ build.relation.dict <- function( mat.row, name.stat.dict, drop.unrelated=TRUE){ #' For each row, generate the encoded data for scoring. #' @param relation.mat The relationship matrix for all pairwise combinations of individuals. #' @param status.df The ID, status, and genotypes for each individual. +#' @param ... Additional arguments to be passed between methods. #' @return A dictionary with the per-individual relationship lists. #' One value for each row of the matrix. -#' @examples -#' #TODO - add #' @export encode.rows <- function(relation.mat, status.df, ...){ diff --git a/R/io.R b/R/io.R index b29b019..cc7da22 100644 --- a/R/io.R +++ b/R/io.R @@ -4,10 +4,10 @@ #' #' @param fname A file name, expected format of contents is: #' name status variant -#' MS-4107-1001 A 0/1 +#' MS-5678-1001 A 0/1 #' @return A data frame. #' @examples -#' tsv.name1 <-system.file('extdata/7003_notch3.tsv', package = 'seqbio.variant.scoring') +#' tsv.name1 <-system.file('extdata/1234_ex2.tsv', package = 'KinformR') #' id.df1 <- read.indiv(tsv.name1) #' @export read.indiv <- function(fname){ @@ -24,7 +24,7 @@ read.indiv <- function(fname){ #' @param fname The file with the relationship matrix information. #' @return A matrix with the relationships and individual ids as rownames and colnames. #' @examples -#' mat.name1 <-system.file('extdata/7003_notch3.mat', package = 'seqbio.variant.scoring') +#' mat.name1 <-system.file('extdata/1234_ex2.mat', package = 'KinformR') #' mat1 <- read.relation.mat(mat.name1) #' @export read.relation.mat <- function(fname){ @@ -49,15 +49,15 @@ read.relation.mat <- function(fname){ #' #' #' @param fname A file name, expected format of contents is: -#' #CHROM POS REF ALT MS-4107-1001_A MS-4107-1002_U ... +#' #CHROM POS REF ALT MS-5678-1001_A MS-5678-1002_U ... #' chr3 46203838 G A 0/1 0/0 ... #' @return A dataframe. #' Data will be worked into a data frame with format. #' name status variant -#' MS-4107-1001 A 0/1 +#' MS-5678-1001 A 0/1 #' @examples -#' ex.infile <-system.file('extdata/example_vcf_extract_4107.tsv', -#' package = 'seqbio.variant.scoring') +#' ex.infile <-system.file('extdata/example_vcf_extract_5678.tsv', +#' package = 'KinformR') #' read.var.table(ex.infile) #' @export read.var.table <- function(fname){ diff --git a/R/pedigree.r b/R/pedigree.r new file mode 100644 index 0000000..cf80331 --- /dev/null +++ b/R/pedigree.r @@ -0,0 +1,189 @@ + +#' Likelihood function for calculation of Pedigree-based autosomal dominant penetrance value. +#' Formula deployed via optimize so as to determine the optimal value. +#' +#' @param K The range of penetrance values to be explored by the optimization function. +#' @param a Count of affected individuals +#' @param b Count of obligate carriers +#' @param c Count of children of either affecteds or carriers, with no children of their own +#' @param d Count of trees of unaffected individuals - specifically, two sequential generations (i.e. a parent and their offspring) +#' @param n Count of the number of second generation progeny in a given tree. +#' +#' @return K Pedigree-based estimation of autosomal dominant penetrance rate. +#' @export +#' +#' @examples +#' # the penetrance function is a function where values is found through optimization +#' K <- optimize(penetrance, c(0,1), 3, 1, 5, 2, 1, maximum=TRUE)$max +penetrance <- function(K, a, b, c, d, n) { + a*log(K) + b*log(1-K) + c*log(2-K) + sum(d*log(2^n + (1-K)*(2-K)^n) - d*(n+1)*log(2)) +} + + +#' Calculation of Identity by descent (IBD). +#' +#' Use the relationship informationfrom the pedigree to +#' estimate of the amount of the genome they have inherited it from a +#' common ancestor without recombination. +#' +#' Can do this for the total potential relatedness +#' in a pedigree (theoretical=TRUE), +#' or for the actual relatedness across collected samples (theoretical=FALSE). +#' For the theoretical=TRUE case, in the unaffected trees, +#' if we have a sample from the parent, +#' then the offspring do not provide any additional information +#' for a max IBD calculation. +#' This means that K does not scale with n. +#' +#' For theoretical=FALSE, sometimes we don’t have the healthy parent in an unaffected tree, +#' and only have a child. In this case, the IBD contribution from the child is 1/4, +#' and since they’re unaffected and therefore are a counter-filter, +#' they would contribute 1-1/4 = 3/4 to the total relatedness. +#' Either the parent is a non-obligate carrier, or is a non-carrier. +#' The probability of the children depends on which of those is true, +#' so we have the additional set of terms in the theoretical=FALSE logic. +#' +#' +#' @param a Count of affected individuals +#' @param b Count of obligate carriers +#' @param c Count of children of either affecteds or carriers, with no children of their own +#' @param d Count of Trees of unaffected individuals - specifically, two sequential generations (i.e. a parent and their offspring) +#' @param n Count of the number of second generation progeny in a given tree. +#' @param K The estimate of penetrance rate. +#' @param theoretical Boolean indicating if the calculation should be +#' theoretical IBD calculation (using only d and k), or if the calculation +#' should use the provided n value. +#' +#' +#' @return pi-hat value. The proportion of genome shared between individuals. +#' @export +#' +#' @examples +#' ibd <- ibd(3, 1, 5, 2, 1, 0.4576484) +ibd <- function(a, b, c, d, n, K, theoretical=TRUE) { + if (theoretical) { + x <- sum(d) * 2^K + } else { + x <- sum(d*(1 - (2^n-1)/2^(n+1))) * 2^K + } + pihat<-(a + b + c * 2^K + x) - 1 + return(pihat) +} + + + +#' Score the pedigrees using the pihat values. +#' +#' @param pihat Estimated proportion of genome shared between individuals, +#' from function: ibd. +#' +#' @return The score value. +#' @export +#' @examples +#' s.val <- score(12.61) +score <- function(pihat) { + log(2^pihat) +} + + +#' Read in the encoded pedigree data file. +#' +#' @param filename name of the file with the data. +#' +#' @return A data frame containing the encoded pedigree information. +#' @export +#' +#' @examples +#' example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv', +#' package = 'KinformR') +#' example.pedigree.df <- read.pedigree(example.pedigree.file) +read.pedigree <- function(filename){ + h <- read.table(filename, header=TRUE, sep="\t", + check.names=FALSE, colClasses=c("Family"="character")) + return(h) +} + + + +#' Take the encoded information about the pedigrees and calculate penetrance. +#' +#' Determine a value score of families by comparing their relationship structure. +#' More distant relationships between affecteds (e.g. affected cousins) +#' is more valuable that close relationships (e.g. affected siblings) +#' as there is less IBD and therefore a smaller search space. +#' +#'Simplifying assumptions: +#' - Autosomal dominant +#' - No ambiguous statuses +#' - No more than two sequential generations of unknown carrier status +#' (non-obligate carrier vs. non-carrier). +#' Generalized support of arbitrary tree structures gets a lot more +#' complicated, especially for the likelihood function. +#' - Exclude big giant trees of unaffecteds - related to above. +#' Will slightly bias the result toward higher penetrance. +#' - Exclude subjects younger than age of onset +#' +#' @param h A data frame containing the encoded pedigree information +#' @return A data frame containing the theoretical scoring of the power of a +#' family assuming you were able to collect everyone on the simplified pedigree, +#' as well as a current scoreing, examining only those for whom you currently +#' have DNA. +#' @export +#' +#' @examples +#' example.pedigree.file <-system.file('extdata/example_pedigree_encoding.tsv', +#' package = 'KinformR') +#' example.pedigree.df <- read.pedigree(example.pedigree.file) +#' penetrance.df <- score.pedigree(example.pedigree.df) +score.pedigree <- function(h){ + + family.vec <- c() + penetrance.vec <- c() + max.pihat.vec <- c() + max.score.vec <- c() + current.pihat.vec <- c() + current.score.vec <- c() + for (i in seq_len(nrow(h))) { + family <- h[i,"Family"] + max.a <- h[i, "max_a"] + #Yeezy yeezy whats good its ya boy + max.b <- h[i, "max_b"] + max.c <- h[i, "max_c"] + max.d <- h[i, "max_d"] + max.n <- h[i, "max_n"] + a.actual <- h[i, "a"] + b.actual <- h[i, "b"] + c.actual <- h[i, "c"] + d.actual <- h[i, "d"] + n.actual <- h[i, "n"] + + max.d <- as.numeric(strsplit(as.character(max.d), ",")[[1]]) + max.n <- as.numeric(strsplit(as.character(max.n), ",")[[1]]) + + d.actual <- as.numeric(strsplit(as.character(d.actual), ",")[[1]]) + n.actual <- as.numeric(strsplit(as.character(n.actual), ",")[[1]]) + + K <- optimize(penetrance, c(0,1), max.a, max.b, max.c, + max.d, max.n, maximum=TRUE)$max + max.pihat <- ibd(max.a, max.b, max.c, max.d, max.n, K) + max.score <- score(max.pihat) + current.pihat <- ibd(a.actual, b.actual, c.actual, + d.actual, n.actual, K, FALSE) + current.score <- score(current.pihat) + + family.vec <- c(family.vec, family) + penetrance.vec <- c(penetrance.vec, K) + max.pihat.vec <- c(max.pihat.vec, max.pihat) + max.score.vec <- c(max.score.vec, max.score) + current.pihat.vec <- c(current.pihat.vec, current.pihat) + current.score.vec <- c(current.score.vec,current.score) + } + + df <- data.frame(family.vec, penetrance.vec, max.pihat.vec, max.score.vec, + current.pihat.vec, current.score.vec) + colnames(df) <- c("family", "penetrance", "max.pi-hat", "max.score", + "current.pi-hat", "current.score") + df$pct.of.max <- df$current.score / df$max.score * 100 + df[, -1] <- round(df[, -1], 2) + return(df) +} diff --git a/R/relatedness.r b/R/relatedness.r index c7aed86..b12974e 100644 --- a/R/relatedness.r +++ b/R/relatedness.r @@ -4,30 +4,30 @@ #' #' These scores can be used to compare variants of interest within a family. #' -#' For each individual, a relationship-informed weight is applied to their sharing +#' For each encoded relationship, a relationship-informed weight is applied to their sharing #' or not sharing of a variant. #' The score for affected status is: -#' (1 / coefficient_of_relatedness) * status_weight +#' (1 / coefficient.of.relatedness) * status.weight #' For example, an affected cousin (encoded as a 3) would get a score of: #' (1/0.125) * affected weight #' 8 * 1 #' = 8 points in favour of the variant. -#' Whereas for unaffected unaffected individuals, scores decay the further a person is in +#' Whereas for unaffected individuals, scores decay the further a person is in #' relation to the proband based on the formula: -#' ((unaffected.max*2) * coefficient_of_relatedness ) * unaffected_weight +#' ((unaffected.max*2) * coefficient.of.relatedness ) * unaffected.weight #' For example, with the default unaffected.max of 8. The sister that does not have a variant would get a score of -#' ((8*2) 0.5) * unaffected_weight +#' ((8*2) 0.5) * unaffected.weight #' (16 * 0.5) * 0.5 #' = 4 points for the variant. #' If these were the only two relatives considered we could sum the points #' and get a score in favour of the variant of #' 8 + 4 = 12 -#' Ig there is evidence against a variant, this is factored into the score as: -#' total_score = evidence_for - evidence_against +#' If there is evidence against a variant, this is factored into the score as: +#' total.score = evidence.for - evidence.against #' For example, if there were also an affected sibling without the variant we would have the score against of: #' (1/0.5) * 1 = 2 #' The final score for the variant would then be -#'. for - against = total +#' for - against = total #' 12 - 2 = 10 #' Giving a final score of 10 for the variant. Comparing values across variants can be used #' to rank them based on pedigree-informed levels of variant sharing across affected @@ -40,8 +40,8 @@ #' #' Input: #' -#' @param fam_list -#' - A list with the names: ['A_c', 'A_i', 'U_c', 'U_i'] +#' @param fam.list +#' - A list with the names: 'A.c', 'A.i', 'U.c', 'U.i' #' respectively containing the affected correct, affected incorrect, #' unaffected correct and unaffected incorrect. #' - This can be generated with the function: score.variant.status @@ -49,8 +49,8 @@ #' reference's relatives as encoded based on their degree of relatedness to the #' reference (reference = 0, sibling/parent/offspring = 1, uncle/grandparent = 2, #' cousin = 3, etc.) -#' @param affected.weight A coefficient to multiply the calculated A_c and A_i relatedness values by. -#' @param unaffected.weight A coefficient to multiply the U_c and U_i relatedness values by. +#' @param affected.weight A coefficient to multiply the calculated A.c and A.i relatedness values by. +#' @param unaffected.weight A coefficient to multiply the U.c and U.i relatedness values by. #' @param unaffected.max This param controls the score given to a first degree unaffected relatives #' scores decay from this specified maximum by half for each subsequent relationship degree. #' @param max.err A heuristic cap of the number of incorrect assignments allowed when scoring. When the total number @@ -59,10 +59,10 @@ #' @return A list with three components: score, score.for, score.against. #' #' @examples -#' relations<-list("A_c" = c(0, 1, 3, 1),"A_i" = c(3),"U_c" = c(1, 2),"U_i" = c(1)) +#' relations<-list("A.c" = c(0, 1, 3, 1),"A.i" = c(3),"U.c" = c(1, 2),"U.i" = c(1)) #' rv.scores <- calc.rv.score(relations) #' @export -calc.rv.score <- function(fam_list, affected.weight=1, unaffected.weight=0.5, unaffected.max=8, max.err=4){ +calc.rv.score <- function(fam.list, affected.weight=1, unaffected.weight=0.5, unaffected.max=8, max.err=4){ relatedness = list() @@ -70,43 +70,43 @@ calc.rv.score <- function(fam_list, affected.weight=1, unaffected.weight=0.5, un relatedness[i+1] = 1 / (2 ** (i)) } - score_dict = list() + score.dict = list() - score_dict[["A_c"]] - score_dict[["A_i"]] - score_dict[["U_c"]] - score_dict[["U_i"]] + score.dict[["A.c"]] + score.dict[["A.i"]] + score.dict[["U.c"]] + score.dict[["U.i"]] #TODO - change this to apply the different formula for the unaffecteds - for (n in names(fam_list)){ + for (n in names(fam.list)){ scores = c() - if(n == "A_c" || n == "A_i"){ - for (x in fam_list[[n]]){ + if(n == "A.c" || n == "A.i"){ + for (x in fam.list[[n]]){ #for affected, importance score gets higher the further from reference individual you get. scores = c(scores, 1/relatedness[[x+1]]) } - }else if ( n=="U_c" || n == "U_i"){ - for (x in fam_list[[n]]){ + }else if ( n=="U.c" || n == "U.i"){ + for (x in fam.list[[n]]){ #for unaffected, importance score gets lower the further from reference individual you get. scores = c(scores, (unaffected.max*2) * relatedness[[x+1]]) } } - score_dict[[n]] = scores + score.dict[[n]] = scores } - weighted_for = sum(score_dict[["A_c"]]*affected.weight ) + - sum(score_dict[["U_c"]]*unaffected.weight) + weighted.for = sum(score.dict[["A.c"]]*affected.weight ) + + sum(score.dict[["U.c"]]*unaffected.weight) - weighted_against = sum(score_dict[["A_i"]]*affected.weight ) + - sum(score_dict[["U_i"]]*unaffected.weight) + weighted.against = sum(score.dict[["A.i"]]*affected.weight ) + + sum(score.dict[["U.i"]]*unaffected.weight) - out.list <- list("score" = weighted_for - weighted_against, - "score.for" = weighted_for, - "score.against" = weighted_against ) + out.list <- list("score" = weighted.for - weighted.against, + "score.for" = weighted.for, + "score.against" = weighted.against ) - n.incor <- length(score_dict[["A_i"]]) + length(score_dict[["U_i"]]) + n.incor <- length(score.dict[["A.i"]]) + length(score.dict[["U.i"]]) if(n.incor > max.err){ out.list[["score"]] <- 0 } @@ -118,6 +118,9 @@ calc.rv.score <- function(fam_list, affected.weight=1, unaffected.weight=0.5, un #' Take the matrix and subset out only the encoded individuals that are present in the status dataframe. +#' @param mat.df The full matrix file to subset +#' @param status.df The list of sampled individuals, matrix is subset to only these individuals. +#' @return A subset of the input matrix. subset.mat <- function(mat.df, status.df){ sub.ids <- rownames(mat.df)[rownames(mat.df) %in% status.df$name] sub.mat <- mat.df[sub.ids, sub.ids] @@ -146,8 +149,8 @@ subset.mat <- function(mat.df, status.df){ #' #' @param relation.mat A relationship matrix for the family. #' @param status.df A dataframe with the encoded variant/disease status of each individual -#' @param affected.weight A coefficient to multiply the calculated A_c and A_i relatedness values by. -#' @param unaffected.weight A coefficient to multiply the U_c and U_i relatedness values by. +#' @param affected.weight A coefficient to multiply the calculated A.c and A.i relatedness values by. +#' @param unaffected.weight A coefficient to multiply the U.c and U.i relatedness values by. #' @param return.sums Boolean indicating if sum of family variant scores should be returned (default = FALSE). #' @param return.means Boolean indicating if mean of all family variant scores should be returned (default = TRUE). #' @param affected.only Boolean indicating if the family score should be calculated using only the affected individuals? (default = TRUE). @@ -156,17 +159,16 @@ subset.mat <- function(mat.df, status.df){ #' of points for or against. This simplifies scoring and allows for fast filtering of poor quality variants. Default is 4. #' @return A labelled vector with names: score, score.for, score.against #' @examples -#' mat.name1<-system.file('extdata/7003_notch3.mat', package = 'seqbio.variant.scoring') -#' tsv.name1<-system.file('extdata/7003_notch3.tsv', package = 'seqbio.variant.scoring') +#' mat.name1<-system.file('extdata/1234_ex2.mat', package = 'KinformR') +#' tsv.name1<-system.file('extdata/1234_ex2.tsv', package = 'KinformR') #' mat.df <- read.relation.mat(mat.name1) #' ind.df <- read.indiv(tsv.name1) #' ind.df.status <- score.variant.status(ind.df) -#' score_default <- score.fam(mat.df, ind.df.status) +#' score.default <- score.fam(mat.df, ind.df.status) #' @export score.fam <- function(relation.mat, status.df, affected.weight=1, unaffected.weight=0.5, return.sums = FALSE, return.means = TRUE, - affected.only = TRUE, max.err=4, - subset.fam = TRUE){ + affected.only = TRUE, max.err=4){ #The family encoding matrix needs to be subset to include only the individuals in the status dataframe sub.relation.mat <- subset.mat(relation.mat, status.df) @@ -200,10 +202,10 @@ score.fam <- function(relation.mat, status.df, affected.weight=1, unaffected.wei #' @examples #' score.fam1 <- c("score" = 1.0, "score.for" = 2.0, "score.against" = 1.0) #' score.fam2 <- c("score" = 1.0, "score.for" = 3.0, "score.against" = 2.0) -#' # out <- sum.fam.scores(c(score.fam1, score.fam2)) +#' # out <- add.fam.scores(c(score.fam1, score.fam2)) #' #returns: c("score" = 2.0,"score.for" = 5.0, "score.against" = 3.0) #' @export -sum.fam.scores <- function(score.vec){ +add.fam.scores <- function(score.vec){ outvec<-tapply(score.vec, names(score.vec), sum) sorted.out<-c("score" = outvec[["score"]], diff --git a/README.md b/README.md index 4023386..c2310e1 100644 --- a/README.md +++ b/README.md @@ -1,33 +1,92 @@ -# seqbio-variant-scoring -## A method for scoring variants based on relationships of individuals. +# KinformR +## The Kinship Informer: An R package for relationship-informed pedigree and variant scoring -*WORK IN PROGRESS - currently a minimal viable product.* -This R package is designed to score rare variants, assigning values based on the disease status of individuals, -the presence or absence of a rare variant in those individuals, and their pairwise coefficients of relatedness. -The package uses a custom formula to assign value to a variant that gives more weight to shared variants common -to distantly related affected individuals. The variant status for unaffected individuals can optionally be considered -as well, with the highest scoring values being given to closely related individuals that *do not* share a variant of interst. -Since variants can be incompletely penetrant, the scoring can be based solely on the affected individuals, or the weight -of unaffected evidence can be customized. +## Introduction +Family-based genetic studies are effective for identifying rare variants underlying heritable diseases, yet are often challenged by issues such as incomplete penetrance and the difficulty of prioritizing numerous candidate variants. The proportion of the genome with identity-by-descent (IBD) and estimates of penetrance are metrics that can show the value of different pedigrees in family-based studies. Additionally, IBD and the genotypes of individuals are combined by KinformRto to score the value of candidate variants. -## related info +The `KinformR` R package is meant to aid in comparative evaluation of families and candidate variants in rare-variant association studies. The package can be used for two methodologically overlapping but distinct purposes: 1) prior to any genetic/genomic evaluation, evaluation of relative detection power of pedigrees, can direct recruitment efforts by showing which unsampled individuals would be the most meaningful additions to a study, and 2) after sequencing and analysis, variants based on association with disease status and familial relationships of individuals, aids in variant prioritization -- slides on concept: https://docs.google.com/presentation/d/1yWlj400fbsrS1CCd4wpy7ve9Sov56H82lZh7hoW8vyg/edit#slide=id.g34c18bf0cf2_0_112 -- penetrance related code: https://github.com/SequenceBio/seqbio-pedigree-ranking -- penetrance related notes: https://sequencebio.atlassian.net/wiki/spaces/SEQUENCEPE/pages/1443364866/Quantifying+power+of+a+family+for+discovery +## Installation -## Introduction -When looking at shared rare variants across families, not all sets of affected and unaffected individuals are equal. -This package is a method for scoring the evidence of a series of individuals in a way that takes into account the relatedness of the individuals as well as their disease status and genotype. +The development version of `KinformR` can be installed directly from GitHub. You'll need to have the R package `devtools` installed and loaded. Also note if the build_vignettes option is set to true, you will need to have the R package `knitr` installed. + +``` +#install.packages("devtools") +#install.packages("knitr") #required if build_vignettes = TRUE +#library(devtools) +devtools::install_github("SequenceBio/KinformR", build_vignettes = TRUE) +library(KinformR) +``` + +## How it works + +The package's vignette contains detailed explanations of the functions and parameters. + +For a walk through of the `KinformR` functions for scoring the value of *families* based on penetrance and IBD, see the corresponging vignette file: +`vignettes/KinformR-penetrance_and_ibd.Rmd` +or within R, run: +``` +vignette('KinformR-penetrance_and_ibd') +``` + +For a walk through of the `KinformR` functions for scoring the value of *variants* within families, see the corresponging vignette file: +`vignettes/KinformR-variant_scoring.Rmd` + +or within R, run: +``` +vignette('KinformR-variant_scoring') +``` + +## Scoring families + +### The encoding file + +See the included example data, which encodes 14 families. See the accompanying vignette for more information on encoding pedigrees: +``` + example.pedigree.file <-system.file('extdata/example_pedigree_encoding.tsv', package = 'KinformR') +``` +The data can be loaded with the following function: +``` + example.pedigree.df <- read.pedigree(example.pedigree.file) +``` +and scoring then performed: +``` + scoring.df <- score.pedigree(example.pedigree.df) +``` + +## Scoring Variants + +When looking at shared rare variants across families, not all sets of affected and unaffected individuals are equal. This R package is designed to score rare variants, assigning values based on the disease status of individuals, the presence or absence of a rare variant in those individuals, and their pairwise coefficients of relatedness. The package uses a custom formula to assign value to a variant that gives more weight to shared variants common to distantly related affected individuals. The variant status for unaffected individuals can optionally be considered as well, with the highest scoring values being given to closely related individuals that *do not* share a variant of interst. Since variants can be incompletely penetrant, the scoring can be based solely on the affected individuals, or the weight of unaffected evidence can be customized. + + +### The relationship matrix + +This input is a matrix containing all the pairwise relationships of individuals in a family. The row and column names are the individual IDs, and the intersecting value denotes the degree of relationship between the individuals (self = 0, 1st degree relations. = 1, etc. Unrelated individuals are given a value of -1). As of version `0.1.0` the relation matrix is a manually created file, where relationship values are assigned via manual inspection of the family pedigree. + +``` +mat.name<-system.file('extdata/1234_ex2.mat', package = 'KinformR') + +rel.mat <- read.relation.mat(mat.name) +``` + +### The status file + +This file includes the same individual IDs used in the relationship matrix as well as the disease and variant status for all individuals. + +``` +tsv.name<-system.file('extdata/1234_ex2.tsv', package = 'KinformR') +ind.df <- read.indiv(tsv.name) +ind.df.status <- score.variant.status(ex1234.df) -## TODO -- add a description of the formulas and scoring system here -- outline how to deploy the code -- a means of building the relationship matricies easily? -- generate scores for a variety of situations. Do they make sense? Can they be summed across families? +``` +### Scoring variants +The two streams of information can then be combined to score a variant based off the relationships of individuals. +``` +score.example <- score.fam(rel.mat, ind.df.status) +``` \ No newline at end of file diff --git a/inst/extdata/1234_ex2.mat b/inst/extdata/1234_ex2.mat new file mode 100644 index 0000000..f53b7cd --- /dev/null +++ b/inst/extdata/1234_ex2.mat @@ -0,0 +1,9 @@ + MS-1234-1001 MS-1234-1002 MS-1234-1003 MS-1234-1004 MS-1234-1005 MS-1234-1006 MS-1234-5001 MS-1234-6001 +MS-1234-1001 0 1 1 3 3 1 1 2 +MS-1234-1002 1 0 1 3 3 1 2 1 +MS-1234-1003 1 1 0 3 3 1 2 2 +MS-1234-1004 3 3 3 0 -1 3 4 4 +MS-1234-1005 3 3 3 -1 0 3 4 4 +MS-1234-1006 1 1 1 3 3 0 2 2 +MS-1234-5001 2 4 4 2 2 1 0 3 +MS-1234-6001 3 2 4 4 2 1 2 0 diff --git a/inst/extdata/1234_ex2.tsv b/inst/extdata/1234_ex2.tsv new file mode 100644 index 0000000..117e4db --- /dev/null +++ b/inst/extdata/1234_ex2.tsv @@ -0,0 +1,9 @@ +name status variant +MS-1234-1001 A 0/1 +MS-1234-1002 U 0/1 +MS-1234-1003 A 0/1 +MS-1234-1004 A 0/1 +MS-1234-1005 A 0/0 +MS-1234-1006 U 0/0 +MS-1234-5001 A 0/1 +MS-1234-6001 U 0/0 diff --git a/inst/extdata/5432_ex1.mat b/inst/extdata/5432_ex1.mat new file mode 100644 index 0000000..a27bed4 --- /dev/null +++ b/inst/extdata/5432_ex1.mat @@ -0,0 +1,5 @@ + MS-5432-1001 MS-5432-1002 MS-5432-6001 MS-5432-6002 +MS-5432-1001 0 1 2 3 +MS-5432-1002 1 0 1 2 +MS-5432-6001 2 1 0 1 +MS-5432-6002 3 2 1 0 diff --git a/inst/extdata/5432_ex1.tsv b/inst/extdata/5432_ex1.tsv new file mode 100644 index 0000000..cfcce59 --- /dev/null +++ b/inst/extdata/5432_ex1.tsv @@ -0,0 +1,5 @@ +name status variant +MS-5432-1001 A 0/1 +MS-5432-1002 U 0/0 +MS-5432-6001 A 0/0 +MS-5432-6002 U 0/0 diff --git a/inst/extdata/5678_ex1.mat b/inst/extdata/5678_ex1.mat new file mode 100644 index 0000000..632b3b2 --- /dev/null +++ b/inst/extdata/5678_ex1.mat @@ -0,0 +1,6 @@ + MS-5678-1001 MS-5678-1002 MS-5678-1003 MS-5678-1004 MS-5678-6001 +MS-5678-1001 0 1 1 1 2 +MS-5678-1002 1 0 1 1 2 +MS-5678-1003 1 1 0 1 2 +MS-5678-1004 1 1 1 0 1 +MS-5678-6001 2 2 2 1 0 diff --git a/inst/extdata/5678_ex1.tsv b/inst/extdata/5678_ex1.tsv new file mode 100644 index 0000000..f3751b4 --- /dev/null +++ b/inst/extdata/5678_ex1.tsv @@ -0,0 +1,6 @@ +name status variant +MS-5678-1001 A 0/1 +MS-5678-1002 U 0/1 +MS-5678-1003 A 0/1 +MS-5678-1004 U 0/0 +MS-5678-6001 U 0/0 diff --git a/inst/extdata/9876_ex1.mat b/inst/extdata/9876_ex1.mat new file mode 100644 index 0000000..50858bc --- /dev/null +++ b/inst/extdata/9876_ex1.mat @@ -0,0 +1,10 @@ + MS-9876-1001 MS-9876-1002 MS-9876-1003 MS-9876-1004 MS-9876-1005 MS-9876-1006 MS-9876-1007 MS-9876-1008 MS-9876-1009 +MS-9876-1001 0 3 4 3 4 4 4 4 3 +MS-9876-1002 3 0 1 1 2 4 4 4 3 +MS-9876-1003 4 1 0 2 3 5 5 5 4 +MS-9876-1004 3 1 2 0 1 4 4 4 3 +MS-9876-1005 4 2 3 1 0 6 6 6 5 +MS-9876-1006 4 4 5 4 6 0 1 1 2 +MS-9876-1007 4 4 5 4 6 1 0 1 2 +MS-9876-1008 4 4 5 4 6 1 1 0 2 +MS-9876-1009 3 3 4 3 5 2 2 2 0 diff --git a/inst/extdata/9876_ex1.tsv b/inst/extdata/9876_ex1.tsv new file mode 100644 index 0000000..3838a4e --- /dev/null +++ b/inst/extdata/9876_ex1.tsv @@ -0,0 +1,10 @@ +name status variant +MS-9876-1001 A 0/1 +MS-9876-1002 U 0/0 +MS-9876-1003 U 0/0 +MS-9876-1004 A 0/1 +MS-9876-1005 A 0/1 +MS-9876-1006 A 0/0 +MS-9876-1007 U 0/0 +MS-9876-1008 U 0/0 +MS-9876-1009 U 0/0 diff --git a/inst/extdata/example_pedigree_encoding.tsv b/inst/extdata/example_pedigree_encoding.tsv new file mode 100644 index 0000000..c2eef67 --- /dev/null +++ b/inst/extdata/example_pedigree_encoding.tsv @@ -0,0 +1,15 @@ +Family max_a a max_b b max_c c max_d d max_n n +1111 4 4 0 0 0 0 0 0 0 0 +1122 1 1 0 0 0 0 1 1 2 2 +1133 1 1 1 1 0 0 1 1 1 1 +1144 1 1 0 0 1 1 1 1 1 1 +1155 1 1 0 0 1 1 0 0 0 0 +1166 1 1 0 0 0 0 1 1 0 0 +2222 4 4 0 0 0 0 1 1 2 2 +0347 6 2 6 3 17 1 7,3,1,1 1 2,3,1,8 1 +5031 5 2 8 4 2 4 5 2 0 0 +6325 7 2 5 0 4 0 1,2 0 1,2 0 +1234 5 4 4 1 4 0 1,2 1 1,3 1 +9876 7 4 2 0 7 4 1 1 5 1 +4006 2 2 1 0 4 1 0 0 0 0 +3734 3 0 2 0 3 0 1,3,1 1 1,3,5 0 diff --git a/inst/extdata/example_pedigree_encoding2.tsv b/inst/extdata/example_pedigree_encoding2.tsv new file mode 100644 index 0000000..7272d15 --- /dev/null +++ b/inst/extdata/example_pedigree_encoding2.tsv @@ -0,0 +1,5 @@ +Family max_a a max_b b max_c c max_d d max_n n +Fam1 2 2 1 1 5 0 1 1 1 1 +Fam2 3 2 1 0 4 1 0 0 0 0 +Fam3 3 3 3 0 2 1 0 0 0 0 +Fam4 2 2 0 0 3 0 1,1 1,1 2,4 2,4 diff --git a/inst/extdata/example_pedigree_encoding3.tsv b/inst/extdata/example_pedigree_encoding3.tsv new file mode 100644 index 0000000..cee5499 --- /dev/null +++ b/inst/extdata/example_pedigree_encoding3.tsv @@ -0,0 +1,5 @@ +Family max_a a max_b b max_c c max_d d max_n n +Fam1 2 2 1 1 5 0 1 1 1 1 +Fam2 3 2 1 0 4 1 0 0 0 0 +Fam3 3 3 3 0 2 1 0 0 0 0 +Fam4 2 2 0 0 3 0 2 2 0 0 diff --git a/inst/extdata/example_vcf_extract_4107.tsv b/inst/extdata/example_vcf_extract_4107.tsv deleted file mode 100644 index c63da40..0000000 --- a/inst/extdata/example_vcf_extract_4107.tsv +++ /dev/null @@ -1,2 +0,0 @@ -#CHROM POS REF ALT MS-4107-1001_A MS-4107-1002_U MS-4107-1004_U MS-4107-6001_U MS-4107-1003_A -chr3 46203838 G A 0/1 0/0 0/0 0/0 0/1 diff --git a/inst/extdata/example_vcf_extract_4974.tsv b/inst/extdata/example_vcf_extract_4974.tsv deleted file mode 100644 index fd08b2c..0000000 --- a/inst/extdata/example_vcf_extract_4974.tsv +++ /dev/null @@ -1,2 +0,0 @@ -#CHROM POS REF ALT MS-4974-1002_U MS-4974-1009_U MS-4974-1006_A MS-4974-1004_A MS-4974-1007_U MS-4974-1003_U MS-4974-1001_A MS-4974-1008_U MS-4974-1005_U -chr12 120155305 C T 0|0 0|0 0|0 0|1 0|0 0|0 0|1 0|0 0|1 diff --git a/inst/extdata/example_vcf_extract_5678.tsv b/inst/extdata/example_vcf_extract_5678.tsv new file mode 100644 index 0000000..c5894d2 --- /dev/null +++ b/inst/extdata/example_vcf_extract_5678.tsv @@ -0,0 +1,2 @@ +#CHROM POS REF ALT MS-5678-1001_A MS-5678-1002_U MS-5678-1004_U MS-5678-6001_U MS-5678-1003_A +chr3 46203838 G A 0/1 0/0 0/0 0/0 0/1 diff --git a/inst/extdata/example_vcf_extract_9876.tsv b/inst/extdata/example_vcf_extract_9876.tsv new file mode 100644 index 0000000..d3063fe --- /dev/null +++ b/inst/extdata/example_vcf_extract_9876.tsv @@ -0,0 +1,2 @@ +#CHROM POS REF ALT MS-9876-1002_U MS-9876-1009_U MS-9876-1006_A MS-9876-1004_A MS-9876-1007_U MS-9876-1003_U MS-9876-1001_A MS-9876-1008_U MS-9876-1005_U +chr12 120155305 C T 0|0 0|0 0|0 0|1 0|0 0|0 0|1 0|0 0|1 diff --git a/man/sum.fam.scores.Rd b/man/add.fam.scores.Rd similarity index 86% rename from man/sum.fam.scores.Rd rename to man/add.fam.scores.Rd index c50f0df..847583e 100644 --- a/man/sum.fam.scores.Rd +++ b/man/add.fam.scores.Rd @@ -1,11 +1,11 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/relatedness.r -\name{sum.fam.scores} -\alias{sum.fam.scores} +\name{add.fam.scores} +\alias{add.fam.scores} \title{Sum all the given scores and return a single vector with cumulative "score", "for" and "against" vals. For use in instances where one wishes to combine scores from multiple families.} \usage{ -\method{sum}{fam.scores}(score.vec) +add.fam.scores(score.vec) } \arguments{ \item{score.vec}{A vector will all of the per family score outputs.} @@ -20,6 +20,6 @@ For use in instances where one wishes to combine scores from multiple families. \examples{ score.fam1 <- c("score" = 1.0, "score.for" = 2.0, "score.against" = 1.0) score.fam2 <- c("score" = 1.0, "score.for" = 3.0, "score.against" = 2.0) -# out <- sum.fam.scores(c(score.fam1, score.fam2)) +# out <- add.fam.scores(c(score.fam1, score.fam2)) #returns: c("score" = 2.0,"score.for" = 5.0, "score.against" = 3.0) } diff --git a/man/assign.status.Rd b/man/assign.status.Rd index 6c722b9..b53a95a 100644 --- a/man/assign.status.Rd +++ b/man/assign.status.Rd @@ -3,12 +3,12 @@ \name{assign.status} \alias{assign.status} \title{Take a disease status and a genetic variant and determine which category the combo falls in. -A_c = Affected individual with ALT variant -A_i = Affected individual without ALT variant -U_c = Unaffected individual without ALT variant -U_i = Unaffected individual with ALT variant +A.c = Affected individual with ALT variant +A.i = Affected individual without ALT variant +U.c = Unaffected individual without ALT variant +U.i = Unaffected individual with ALT variant If theoretical.max = TRUE the true variant statuses are ignored and all -affected/unaffected are assigned A_c and U_c respectively. +affected/unaffected are assigned A.c and U.c respectively. These encoding can then be used show what a family's max score would be.} \usage{ assign.status(status, variant, theoretical.max = FALSE) @@ -27,17 +27,17 @@ a string } \description{ Take a disease status and a genetic variant and determine which category the combo falls in. -A_c = Affected individual with ALT variant -A_i = Affected individual without ALT variant -U_c = Unaffected individual without ALT variant -U_i = Unaffected individual with ALT variant +A.c = Affected individual with ALT variant +A.i = Affected individual without ALT variant +U.c = Unaffected individual without ALT variant +U.i = Unaffected individual with ALT variant If theoretical.max = TRUE the true variant statuses are ignored and all -affected/unaffected are assigned A_c and U_c respectively. +affected/unaffected are assigned A.c and U.c respectively. These encoding can then be used show what a family's max score would be. } \examples{ -assign.status("A", "0/1") == "A_c" -assign.status("A", "0|0") == "A_i" -assign.status("U", 1) == "U_i" -assign.status("U", "0|0") =="U_c" +assign.status("A", "0/1") == "A.c" +assign.status("A", "0|0") == "A.i" +assign.status("U", 1) == "U.i" +assign.status("U", "0|0") =="U.c" } diff --git a/man/build.relation.dict.Rd b/man/build.relation.dict.Rd index 7aaa3d1..a964e8e 100644 --- a/man/build.relation.dict.Rd +++ b/man/build.relation.dict.Rd @@ -19,6 +19,3 @@ A list with the categorized relationship/variant information. \description{ Build dictionary with the relationships falling in the different categories for the query row. } -\examples{ -#TODO - add -} diff --git a/man/calc.rv.score.Rd b/man/calc.rv.score.Rd index d5f42bc..5352054 100644 --- a/man/calc.rv.score.Rd +++ b/man/calc.rv.score.Rd @@ -5,7 +5,7 @@ \title{Calculate a relatedness-weighted score for a given rare variant.} \usage{ calc.rv.score( - fam_list, + fam.list, affected.weight = 1, unaffected.weight = 0.5, unaffected.max = 8, @@ -13,8 +13,8 @@ calc.rv.score( ) } \arguments{ -\item{fam_list}{\itemize{ -\item A list with the names: \link{'A_c', 'A_i', 'U_c', 'U_i'} +\item{fam.list}{\itemize{ +\item A list with the names: 'A.c', 'A.i', 'U.c', 'U.i' respectively containing the affected correct, affected incorrect, unaffected correct and unaffected incorrect. - This can be generated with the function: score.variant.status @@ -24,9 +24,9 @@ reference (reference = 0, sibling/parent/offspring = 1, uncle/grandparent = 2, cousin = 3, etc.) }} -\item{affected.weight}{A coefficient to multiply the calculated A_c and A_i relatedness values by.} +\item{affected.weight}{A coefficient to multiply the calculated A.c and A.i relatedness values by.} -\item{unaffected.weight}{A coefficient to multiply the U_c and U_i relatedness values by.} +\item{unaffected.weight}{A coefficient to multiply the U.c and U.i relatedness values by.} \item{unaffected.max}{This param controls the score given to a first degree unaffected relatives scores decay from this specified maximum by half for each subsequent relationship degree.} @@ -42,30 +42,30 @@ A list with three components: score, score.for, score.against. These scores can be used to compare variants of interest within a family. } \details{ -For each individual, a relationship-informed weight is applied to their sharing +For each encoded relationship, a relationship-informed weight is applied to their sharing or not sharing of a variant. The score for affected status is: -(1 / coefficient_of_relatedness) * status_weight +(1 / coefficient.of.relatedness) * status.weight For example, an affected cousin (encoded as a 3) would get a score of: (1/0.125) * affected weight 8 * 1 = 8 points in favour of the variant. -Whereas for unaffected unaffected individuals, scores decay the further a person is in +Whereas for unaffected individuals, scores decay the further a person is in relation to the proband based on the formula: -((unaffected.max\emph{2) * coefficient_of_relatedness ) * unaffected_weight +((unaffected.max\emph{2) * coefficient.of.relatedness ) * unaffected.weight For example, with the default unaffected.max of 8. The sister that does not have a variant would get a score of -((8}2) 0.5) * unaffected_weight +((8}2) 0.5) * unaffected.weight (16 * 0.5) * 0.5 = 4 points for the variant. If these were the only two relatives considered we could sum the points and get a score in favour of the variant of 8 + 4 = 12 -Ig there is evidence against a variant, this is factored into the score as: -total_score = evidence_for - evidence_against +If there is evidence against a variant, this is factored into the score as: +total.score = evidence.for - evidence.against For example, if there were also an affected sibling without the variant we would have the score against of: (1/0.5) * 1 = 2 The final score for the variant would then be -. for - against = total +for - against = total 12 - 2 = 10 Giving a final score of 10 for the variant. Comparing values across variants can be used to rank them based on pedigree-informed levels of variant sharing across affected @@ -79,6 +79,6 @@ of unaffected individuals that have a variant rather than affected individuals t Input: } \examples{ -relations<-list("A_c" = c(0, 1, 3, 1),"A_i" = c(3),"U_c" = c(1, 2),"U_i" = c(1)) +relations<-list("A.c" = c(0, 1, 3, 1),"A.i" = c(3),"U.c" = c(1, 2),"U.i" = c(1)) rv.scores <- calc.rv.score(relations) } diff --git a/man/encode.rows.Rd b/man/encode.rows.Rd index 43d78c4..44eac3a 100644 --- a/man/encode.rows.Rd +++ b/man/encode.rows.Rd @@ -11,6 +11,8 @@ encode.rows(relation.mat, status.df, ...) \item{relation.mat}{The relationship matrix for all pairwise combinations of individuals.} \item{status.df}{The ID, status, and genotypes for each individual.} + +\item{...}{Additional arguments to be passed between methods.} } \value{ A dictionary with the per-individual relationship lists. @@ -20,6 +22,3 @@ One value for each row of the matrix. Take the relationship matrix and the encoded statuses of info. For each row, generate the encoded data for scoring. } -\examples{ -#TODO - add -} diff --git a/man/ibd.Rd b/man/ibd.Rd new file mode 100644 index 0000000..16173ed --- /dev/null +++ b/man/ibd.Rd @@ -0,0 +1,54 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/pedigree.r +\name{ibd} +\alias{ibd} +\title{Calculation of Identity by descent (IBD).} +\usage{ +ibd(a, b, c, d, n, K, theoretical = TRUE) +} +\arguments{ +\item{a}{Count of affected individuals} + +\item{b}{Count of obligate carriers} + +\item{c}{Count of children of either affecteds or carriers, with no children of their own} + +\item{d}{Count of Trees of unaffected individuals - specifically, two sequential generations (i.e. a parent and their offspring)} + +\item{n}{Count of the number of second generation progeny in a given tree.} + +\item{K}{The estimate of penetrance rate.} + +\item{theoretical}{Boolean indicating if the calculation should be +theoretical IBD calculation (using only d and k), or if the calculation +should use the provided n value.} +} +\value{ +pi-hat value. The proportion of genome shared between individuals. +} +\description{ +Use the relationship informationfrom the pedigree to +estimate of the amount of the genome they have inherited it from a +common ancestor without recombination. +} +\details{ +Can do this for the total potential relatedness +in a pedigree (theoretical=TRUE), +or for the actual relatedness across collected samples (theoretical=FALSE). +For the theoretical=TRUE case, in the unaffected trees, +if we have a sample from the parent, +then the offspring do not provide any additional information +for a max IBD calculation. +This means that K does not scale with n. + +For theoretical=FALSE, sometimes we don’t have the healthy parent in an unaffected tree, +and only have a child. In this case, the IBD contribution from the child is 1/4, +and since they’re unaffected and therefore are a counter-filter, +they would contribute 1-1/4 = 3/4 to the total relatedness. +Either the parent is a non-obligate carrier, or is a non-carrier. +The probability of the children depends on which of those is true, +so we have the additional set of terms in the theoretical=FALSE logic. +} +\examples{ +ibd <- ibd(3, 1, 5, 2, 1, 0.4576484) +} diff --git a/man/penetrance.Rd b/man/penetrance.Rd new file mode 100644 index 0000000..7a0f98e --- /dev/null +++ b/man/penetrance.Rd @@ -0,0 +1,33 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/pedigree.r +\name{penetrance} +\alias{penetrance} +\title{Likelihood function for calculation of Pedigree-based autosomal dominant penetrance value. +Formula deployed via optimize so as to determine the optimal value.} +\usage{ +penetrance(K, a, b, c, d, n) +} +\arguments{ +\item{K}{The range of penetrance values to be explored by the optimization function.} + +\item{a}{Count of affected individuals} + +\item{b}{Count of obligate carriers} + +\item{c}{Count of children of either affecteds or carriers, with no children of their own} + +\item{d}{Count of trees of unaffected individuals - specifically, two sequential generations (i.e. a parent and their offspring)} + +\item{n}{Count of the number of second generation progeny in a given tree.} +} +\value{ +K Pedigree-based estimation of autosomal dominant penetrance rate. +} +\description{ +Likelihood function for calculation of Pedigree-based autosomal dominant penetrance value. +Formula deployed via optimize so as to determine the optimal value. +} +\examples{ +# the penetrance function is a function where values is found through optimization +K <- optimize(penetrance, c(0,1), 3, 1, 5, 2, 1, maximum=TRUE)$max +} diff --git a/man/read.indiv.Rd b/man/read.indiv.Rd index b963459..d51ac5f 100644 --- a/man/read.indiv.Rd +++ b/man/read.indiv.Rd @@ -9,7 +9,7 @@ read.indiv(fname) \arguments{ \item{fname}{A file name, expected format of contents is: name status variant -MS-4107-1001 A 0/1} +MS-5678-1001 A 0/1} } \value{ A data frame. @@ -18,6 +18,6 @@ A data frame. Read in variant and status info for individuals. } \examples{ -tsv.name1 <-system.file('extdata/7003_notch3.tsv', package = 'seqbio.variant.scoring') +tsv.name1 <-system.file('extdata/1234_ex2.tsv', package = 'KinformR') id.df1 <- read.indiv(tsv.name1) } diff --git a/man/read.pedigree.Rd b/man/read.pedigree.Rd new file mode 100644 index 0000000..b9e4ecf --- /dev/null +++ b/man/read.pedigree.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/pedigree.r +\name{read.pedigree} +\alias{read.pedigree} +\title{Read in the encoded pedigree data file.} +\usage{ +read.pedigree(filename) +} +\arguments{ +\item{filename}{name of the file with the data.} +} +\value{ +A data frame containing the encoded pedigree information. +} +\description{ +Read in the encoded pedigree data file. +} +\examples{ +example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv', +package = 'KinformR') +example.pedigree.df <- read.pedigree(example.pedigree.file) +} diff --git a/man/read.relation.mat.Rd b/man/read.relation.mat.Rd index a611a59..65592bb 100644 --- a/man/read.relation.mat.Rd +++ b/man/read.relation.mat.Rd @@ -18,6 +18,6 @@ Row/column intersections give the degree of relationship for the two individuals. 0 = self, -1 = unrelated. } \examples{ -mat.name1 <-system.file('extdata/7003_notch3.mat', package = 'seqbio.variant.scoring') +mat.name1 <-system.file('extdata/1234_ex2.mat', package = 'KinformR') mat1 <- read.relation.mat(mat.name1) } diff --git a/man/read.var.table.Rd b/man/read.var.table.Rd index 0ed188b..2ad5848 100644 --- a/man/read.var.table.Rd +++ b/man/read.var.table.Rd @@ -10,14 +10,14 @@ read.var.table(fname) } \arguments{ \item{fname}{A file name, expected format of contents is: -#CHROM POS REF ALT MS-4107-1001_A MS-4107-1002_U ... +#CHROM POS REF ALT MS-5678-1001_A MS-5678-1002_U ... chr3 46203838 G A 0/1 0/0 ...} } \value{ A dataframe. Data will be worked into a data frame with format. name status variant -MS-4107-1001 A 0/1 +MS-5678-1001 A 0/1 } \description{ Note - ensure the status in the names match your desired encoding! @@ -25,7 +25,7 @@ There are individuals with ambigious statues, that you may require to be encoded in a specific fashion for you current purposes. } \examples{ -ex.infile <-system.file('extdata/example_vcf_extract_4107.tsv', - package = 'seqbio.variant.scoring') +ex.infile <-system.file('extdata/example_vcf_extract_5678.tsv', + package = 'KinformR') read.var.table(ex.infile) } diff --git a/man/score.Rd b/man/score.Rd new file mode 100644 index 0000000..d5e9ee5 --- /dev/null +++ b/man/score.Rd @@ -0,0 +1,21 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/pedigree.r +\name{score} +\alias{score} +\title{Score the pedigrees using the pihat values.} +\usage{ +score(pihat) +} +\arguments{ +\item{pihat}{Estimated proportion of genome shared between individuals, +from function: ibd.} +} +\value{ +The score value. +} +\description{ +Score the pedigrees using the pihat values. +} +\examples{ +s.val <- score(12.61) +} diff --git a/man/score.fam.Rd b/man/score.fam.Rd index 0042991..3e7d09a 100644 --- a/man/score.fam.Rd +++ b/man/score.fam.Rd @@ -21,9 +21,9 @@ score.fam( \item{status.df}{A dataframe with the encoded variant/disease status of each individual} -\item{affected.weight}{A coefficient to multiply the calculated A_c and A_i relatedness values by.} +\item{affected.weight}{A coefficient to multiply the calculated A.c and A.i relatedness values by.} -\item{unaffected.weight}{A coefficient to multiply the U_c and U_i relatedness values by.} +\item{unaffected.weight}{A coefficient to multiply the U.c and U.i relatedness values by.} \item{return.sums}{Boolean indicating if sum of family variant scores should be returned (default = FALSE).} @@ -42,7 +42,9 @@ A labelled vector with names: score, score.for, score.against By default all individuals are treated as the reference 'proband' and the given variant's score is calculated based on relationships to all other individuals. e.g. for each row in the relationship matrix. calc.rv.score is run, with the row name indiciating the -reference individual that the calcualtion is relative to. +reference individual that the calculation is relative to. +Note that the relation.mat can include more individuals than are present within the status.df, but the matrix will +be subset to include only those individuals that have status information provided. } \details{ There are several return options possible. @@ -55,10 +57,10 @@ NOTE: if affected.only = True, the averages and sums are calculated using only t } } \examples{ -mat.name1<-system.file('extdata/7003_notch3.mat', package = 'seqbio.variant.scoring') -tsv.name1<-system.file('extdata/7003_notch3.tsv', package = 'seqbio.variant.scoring') +mat.name1<-system.file('extdata/1234_ex2.mat', package = 'KinformR') +tsv.name1<-system.file('extdata/1234_ex2.tsv', package = 'KinformR') mat.df <- read.relation.mat(mat.name1) ind.df <- read.indiv(tsv.name1) ind.df.status <- score.variant.status(ind.df) -score_default <- score.fam(mat.df, ind.df.status) +score.default <- score.fam(mat.df, ind.df.status) } diff --git a/man/score.pedigree.Rd b/man/score.pedigree.Rd new file mode 100644 index 0000000..ff03942 --- /dev/null +++ b/man/score.pedigree.Rd @@ -0,0 +1,43 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/pedigree.r +\name{score.pedigree} +\alias{score.pedigree} +\title{Take the encoded information about the pedigrees and calculate penetrance.} +\usage{ +score.pedigree(h) +} +\arguments{ +\item{h}{A data frame containing the encoded pedigree information} +} +\value{ +A data frame containing the theoretical scoring of the power of a +family assuming you were able to collect everyone on the simplified pedigree, +as well as a current scoreing, examining only those for whom you currently +have DNA. +} +\description{ +Determine a value score of families by comparing their relationship structure. +More distant relationships between affecteds (e.g. affected cousins) +is more valuable that close relationships (e.g. affected siblings) +as there is less IBD and therefore a smaller search space. +} +\details{ +Simplifying assumptions: +\itemize{ +\item Autosomal dominant +\item No ambiguous statuses +\item No more than two sequential generations of unknown carrier status +(non-obligate carrier vs. non-carrier). +Generalized support of arbitrary tree structures gets a lot more +complicated, especially for the likelihood function. +\item Exclude big giant trees of unaffecteds - related to above. +Will slightly bias the result toward higher penetrance. +\item Exclude subjects younger than age of onset +} +} +\examples{ +example.pedigree.file <-system.file('extdata/example_pedigree_encoding.tsv', +package = 'KinformR') +example.pedigree.df <- read.pedigree(example.pedigree.file) +penetrance.df <- score.pedigree(example.pedigree.df) +} diff --git a/man/score.variant.status.Rd b/man/score.variant.status.Rd index 526ce52..6d2bf1a 100644 --- a/man/score.variant.status.Rd +++ b/man/score.variant.status.Rd @@ -4,14 +4,14 @@ \alias{score.variant.status} \title{Take the dataframe with variants and status and determine which indivudals are scored correctly and which are scored incorrectly. -Assign an A_c, A_i, U_c, U_i, unk} +Assign an A.c, A.i, U.c, U.i, unk} \usage{ score.variant.status(indiv.df, theoretical.max = FALSE) } \arguments{ \item{indiv.df}{A dataframe with the format: name status variant -MS-4107-1001 A 0/1} +MS-5678-1001 A 0/1} \item{theoretical.max}{Should the theoretical maxima be returned instead of the observed values? When true, the scoring assumes correct variant-status pair for each individual. diff --git a/man/subset.mat.Rd b/man/subset.mat.Rd new file mode 100644 index 0000000..e9ba75c --- /dev/null +++ b/man/subset.mat.Rd @@ -0,0 +1,19 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/relatedness.r +\name{subset.mat} +\alias{subset.mat} +\title{Take the matrix and subset out only the encoded individuals that are present in the status dataframe.} +\usage{ +\method{subset}{mat}(mat.df, status.df) +} +\arguments{ +\item{mat.df}{The full matrix file to subset} + +\item{status.df}{The list of sampled individuals, matrix is subset to only these individuals.} +} +\value{ +A subset of the input matrix. +} +\description{ +Take the matrix and subset out only the encoded individuals that are present in the status dataframe. +} diff --git a/tests/testthat.R b/tests/testthat.R index e2341c4..f824997 100644 --- a/tests/testthat.R +++ b/tests/testthat.R @@ -1,5 +1,5 @@ library(testthat) -library(seqbio.variant.scoring) +library(KinformR) -test_check("seqbio.variant.scoring") +test_check("KinformR") diff --git a/tests/testthat/test_encoding.R b/tests/testthat/test_encoding.R new file mode 100644 index 0000000..3513a9d --- /dev/null +++ b/tests/testthat/test_encoding.R @@ -0,0 +1,46 @@ +test_that("Families are correctly encoded.", { + + # Test status assignments + print("assign.status checks") + expect_equal(assign.status("A", "0/1"), "A.c") + expect_equal(assign.status("A", "0|1"), "A.c") + expect_equal(assign.status("A", "1|0"), "A.c") + expect_equal(assign.status("A", 1), "A.c") + + expect_equal(assign.status("A", 0), "A.i") + expect_equal(assign.status("A", "0/0"), "A.i") + expect_equal(assign.status("A", "0|0"), "A.i") + + expect_equal(assign.status("U", 1), "U.i") + expect_equal(assign.status("U", "0/1"), "U.i") + expect_equal(assign.status("U", "0|1"), "U.i") + expect_equal(assign.status("U", "1|0"), "U.i") + + expect_equal(assign.status("U", 0), "U.c") + expect_equal(assign.status("U", "0/0"), "U.c") + expect_equal(assign.status("U", "0|0"), "U.c") + + # Test score.variant.status assignments + #TODO - add the ability to encode a theoretical max + expect_error(assign.status("BAD", 0), "Status must be one of: U or A") + + expect_error(assign.status("A", "./."), "Incompatible variant value! Supported encodings are: '0' '1' '0/0' '0/1' '0|0' '0|1'") + expect_error(assign.status("U", "0|."), "Incompatible variant value! Supported encodings are: '0' '1' '0/0' '0/1' '0|0' '0|1'") + + #tsv.name1<-"Data/1234_ex2.tsv" + tsv.name1<-system.file('extdata/1234_ex2.tsv', package = 'KinformR') + indiv.df <- read.indiv(tsv.name1) + + print("score values for a real family") + scores<-score.variant.status(indiv.df) + expected.scores<-c("A.c", "U.i", "A.c", "A.c", "A.i", "U.c", "A.c", "U.c") + expect_equal(scores$statvar.cat, expected.scores) + + print("theoretical.max high score values for a family") + ther.scores <- score.variant.status(indiv.df, theoretical.max=TRUE) + + expected.thermax.scores <- c("A.c","U.c","A.c","A.c","A.c" ,"U.c", "A.c", "U.c") + expect_equal(ther.scores$statvar.cat, expected.thermax.scores) + + +}) diff --git a/tests/testthat/test_io.r b/tests/testthat/test_io.r new file mode 100644 index 0000000..561dc8c --- /dev/null +++ b/tests/testthat/test_io.r @@ -0,0 +1,50 @@ +test_that("Data are read from files correctly", { + + ################################## + # test 1 + + #tsv.name1<-"extdata/1234_ex2.tsv" + # TODO - switch all to this style + tsv.name1 <-system.file('extdata/1234_ex2.tsv', package = 'KinformR') + ex1234.df <- read.indiv(tsv.name1) + #mat.name1<-"extdata/1234_ex2.mat" + mat.name1 <-system.file('extdata/1234_ex2.mat', package = 'KinformR') + ex1234.mat <- read.relation.mat(mat.name1) + # TODO - could make an s3 class with the two data structures, have single func to + # wrap this and read in both files. + + #TODO - hard code expected values, run matrix comparisons. + + + ################################## + # test 2 - horizontal tabular vcf-like output + + expected.test2 = data.frame("name" = c("MS-5678-1001", "MS-5678-1002", "MS-5678-1004", + "MS-5678-6001", "MS-5678-1003"), + "status" = c("A", "U", "U", "U", "A"), + "variant" = c("0/1", "0/0", "0/0", "0/0", "0/1")) + infile.test2 <-system.file('extdata/example_vcf_extract_5678.tsv', + package = 'KinformR') + + test2.df <- read.var.table(infile.test2) + expect_equal(test2.df, expected.test2) + + + ################################## + # test 3 - horizontal tabular vcf-like output with phased + expected.test3 = data.frame("name" = c("MS-9876-1002", "MS-9876-1009", "MS-9876-1006", "MS-9876-1004", + "MS-9876-1007", "MS-9876-1003", "MS-9876-1001", "MS-9876-1008", "MS-9876-1005"), + "status" = c("U","U","A","A","U", "U","A","U","U"), + "variant" = c("0|0","0|0","0|0","0|1","0|0","0|0","0|1","0|0","0|1")) + infile.test3 <-system.file('extdata/example_vcf_extract_9876.tsv', + package = 'KinformR') + + test3.df <- read.var.table(infile.test3) + expect_equal(test3.df, expected.test3) + + ################################## + # test 4 - TODO run a test where there are more people in the + # matrix than there are in the status df, make sure that subsetting is performed + # correctly. + +}) diff --git a/tests/testthat/test_multi_fam_full_workflow.r b/tests/testthat/test_multi_fam_full_workflow.r new file mode 100644 index 0000000..b65a81b --- /dev/null +++ b/tests/testthat/test_multi_fam_full_workflow.r @@ -0,0 +1,33 @@ +test_that("Scoring of variant in family multiple families proceeds as expected", { + + ################################## + # test 1 + + mat.name1<-system.file('extdata/5678_ex1.mat', package = 'KinformR') + tsv.name1<-system.file('extdata/5678_ex1.tsv', package = 'KinformR') + + mat.name2<-system.file('extdata/9876_ex1.mat', package = 'KinformR') + tsv.name2<-system.file('extdata/9876_ex1.tsv', package = 'KinformR') + + mat.name3<-system.file('extdata/5432_ex1.mat', package = 'KinformR') + tsv.name3<-system.file('extdata/5432_ex1.tsv', package = 'KinformR') + + mat.fnames <- c(mat.name1, mat.name2, mat.name3) + tsv.names <- c(tsv.name1, tsv.name2, tsv.name3) + + mat.df <- lapply(mat.fnames, read.relation.mat) + indv.df <- lapply(tsv.names, read.indiv) + + status.df <- lapply(indv.df, score.variant.status) + + bulk.scores <- mapply(score.fam, mat.df, status.df, affected.weight=1, unaffected.weight=0.5, return.sums = TRUE) + + #TODO - formalize this for better interface with bulk processing + summed.scores<-add.fam.scores(c(bulk.scores[,1], bulk.scores[,2], bulk.scores[,3 ])) + + expected.summed.scores <- c("score" = 102.75,"score.for" = 212.75, "score.against" = 110.0) + expect_equal(all.equal(expected.summed.scores, summed.scores), TRUE) + + #TODO -possibly refactor above into a "score families" bulk function + +}) diff --git a/tests/testthat/test_pedigree_eval.R b/tests/testthat/test_pedigree_eval.R new file mode 100644 index 0000000..d3f010b --- /dev/null +++ b/tests/testthat/test_pedigree_eval.R @@ -0,0 +1,36 @@ +test_that("Data are read from files correctly", { + + ################################## + # test 1 + example.pedigree.file <-system.file('extdata/example_pedigree_encoding.tsv', package = 'KinformR') + example.pedigree.df <- read.pedigree(example.pedigree.file) + penetrance.df <- score.pedigree(example.pedigree.df) + + ################################## + # test 2 + example.pedigree.file2 <-system.file('extdata/example_pedigree_encoding2.tsv', package = 'KinformR') + #example.pedigree.file2 <- "inst/extdata/example_pedigree_encoding2.tsv" + example.pedigree.df2 <- read.pedigree(example.pedigree.file2) + penetrance.df2 <- score.pedigree(example.pedigree.df2) + + ################################## + # test 3 + example.pedigree.file3 <-system.file('extdata/example_pedigree_encoding3.tsv', package = 'KinformR') + #example.pedigree.file3 <- "inst/extdata/example_pedigree_encoding3.tsv" + example.pedigree.df3 <- read.pedigree(example.pedigree.file3) + penetrance.df3 <- score.pedigree(example.pedigree.df3) + + expected.penetrance.df3 <- data.frame( + "family" = c("Fam1", "Fam2", "Fam3", "Fam4"), + "penetrance" = c(0.37,0.58,0.45,0.57), + "max.pi-hat" = c(9.76,8.97,7.73,8.43), + "max.score" = c(6.76,6.22,5.36,5.84), + "current.pi-hat" = c(2.97,2.49,3.36,3.97), + "current.score" = c(2.06,1.73,2.33,2.75), + "pct.of.max" = c(30.44,27.79,43.53,47.12) + ) + colnames(expected.penetrance.df3) <- c("family", "penetrance", "max.pi-hat", "max.score", "current.pi-hat", "current.score", "pct.of.max") + + expect_equal(all.equal(penetrance.df3, expected.penetrance.df3), TRUE) + +}) diff --git a/tests/testthat/test_relatedness.R b/tests/testthat/test_relatedness.R index a2a1dd7..03a4f5d 100644 --- a/tests/testthat/test_relatedness.R +++ b/tests/testthat/test_relatedness.R @@ -3,10 +3,10 @@ test_that("Scoring proceeds as expected in all cases", { print("test per encoding score calculation for several combos") print("1. simple proband-sib") - encoded.dat1 <- list("A_c" = c(0, 1), - "A_i" = c(), - "U_c" = c(), - "U_i" = c()) + encoded.dat1 <- list("A.c" = c(0, 1), + "A.i" = c(), + "U.c" = c(), + "U.i" = c()) expected.score1<- list("score" = 3, "score.for" = 3, @@ -15,10 +15,10 @@ test_that("Scoring proceeds as expected in all cases", { expect_equal(expected.score1, score1) print("2. correct A proband and cousin, incorrect unaffected sib") - encoded.dat2 <- list("A_c" = c(0, 3), - "A_i" = c(), - "U_c" = c(), - "U_i" = c(1)) + encoded.dat2 <- list("A.c" = c(0, 3), + "A.i" = c(), + "U.c" = c(), + "U.i" = c(1)) expected.score2<- list("score" = 5, "score.for" = 9, @@ -35,10 +35,10 @@ test_that("Scoring proceeds as expected in all cases", { print("3. bad score proband and incorrect unaffected sib") - encoded.dat3 <- list("A_c" = c(0), - "A_i" = c(), - "U_c" = c(), - "U_i" = c(1)) + encoded.dat3 <- list("A.c" = c(0), + "A.i" = c(), + "U.c" = c(), + "U.i" = c(1)) expected.score3<- list("score" = -3, "score.for" = 1, @@ -48,10 +48,10 @@ test_that("Scoring proceeds as expected in all cases", { print("3. complex score, all situations covered in one.") - encoded.dat4 <- list("A_c" = c(0, 1, 3, 1), - "A_i" = c(3), - "U_c" = c(1, 2), - "U_i" = c(1)) + encoded.dat4 <- list("A.c" = c(0, 1, 3, 1), + "A.i" = c(3), + "U.c" = c(1, 2), + "U.i" = c(1)) expected.score4<- list("score" = 7, "score.for" = 19, @@ -73,7 +73,7 @@ test_that("Scoring proceeds as expected in all cases", { b <- c("score"=20, "score.for"=20, "score.against"=0) c <- c("score"=30, "score.for"=40, "score.against"=10) - abc<- sum.fam.scores(c(a,b,c)) + abc<- add.fam.scores(c(a,b,c)) expected.abc <- c("score"=60, "score.for"=70, "score.against"=10) expect_equal(all.equal(names(abc), names(expected.abc)), TRUE) for(i in length(abc)){ diff --git a/tests/testthat/test_single_fam_full_workflow.r b/tests/testthat/test_single_fam_full_workflow.r new file mode 100644 index 0000000..b9627bd --- /dev/null +++ b/tests/testthat/test_single_fam_full_workflow.r @@ -0,0 +1,68 @@ +#library(KinformR) +test_that("Scoring of variant in family single family proceeds as expected", { + + ################################## + # test 1 + + mat.name1<-system.file('extdata/1234_ex2.mat', package = 'KinformR') + ex1234.mat <- read.relation.mat(mat.name1) + tsv.name1<-system.file('extdata/1234_ex2.tsv', package = 'KinformR') + ex1234.df <- read.indiv(tsv.name1) + + ex1234.df.status <- score.variant.status(ex1234.df) + + ex2.1234.score.default <- score.fam(ex1234.mat, ex1234.df.status) + expected.ex2.score.custom <- c("score"=21.3, "score.for"=27.6, "score.against"=6.3) + expect_equal(ex2.1234.score.default, expected.ex2.score.custom) + + #should give a single list with the family totals + ex2.1234.score.custom <- score.fam(ex1234.mat, ex1234.df.status, + affected.weight=1, unaffected.weight=0.5, + return.sums = TRUE, affected.only = FALSE) + + expected.ex2.score.custom <- c("score"=178.5, "score.for"=244, "score.against"=65.5) + expect_equal(ex2.1234.score.custom,expected.ex2.score.custom) + + #should give the row-by-row + ex2.1234.score.indiv.a.only <- score.fam(ex1234.mat, ex1234.df.status, affected.weight=1, unaffected.weight=0.5, + return.means = FALSE) + + expected.ex2.1234.score.indiv.a.only <- data.frame("score" = c(7., 9., 33.5, 31.5, 25.5), + "score.for" = c(19., 21., 34.5, 33.5, 30.), + "score.against" = c(12., 12., 1., 2., 4.5)) + rownames(expected.ex2.1234.score.indiv.a.only) <- c("MS-1234-1001", "MS-1234-1003", "MS-1234-1004", + "MS-1234-1005", "MS-1234-5001") + + expect_equal(all.equal(expected.ex2.1234.score.indiv.a.only, ex2.1234.score.indiv.a.only), TRUE) + + #affected only calculation + ex2.1234.score.affected.only <- score.fam(ex1234.mat, ex1234.df.status, affected.weight=1, unaffected.weight=0.0) + expected.ex2.score.affected.only <- c("score"=19.4, "score.for"=23.6, "score.against"=4.2) + expect_equal(ex2.1234.score.affected.only,expected.ex2.score.affected.only) + + + #affected only -per indiv + ex2.1234.score.indiv.no.u <- score.fam(ex1234.mat, ex1234.df.status, + affected.weight=1, unaffected.weight=0.0, + return.means = FALSE, + affected.only = TRUE) + + expected.ex2.1234.score.indiv.no.u <- data.frame("score" = c(5, 7, 33, 31, 21), + "score.for" = c(13, 15, 33, 32, 25), + "score.against" = c(8, 8, 0, 1, 4)) + rownames(expected.ex2.1234.score.indiv.no.u) <- c("MS-1234-1001", "MS-1234-1003", "MS-1234-1004", + "MS-1234-1005", "MS-1234-5001") + + expect_equal(all.equal(ex2.1234.score.indiv.no.u, expected.ex2.1234.score.indiv.no.u), TRUE) + + + + + # possibly - mean of scores? Account for not inflating pedigrees with lots of people, or is that a feature not a bug. + + # TODO - try dropping individuals from the matrix, see if it still works. (should be OK) + + # TODO - option to wrap the read/status/score and do it all in one step. basically make it a one liner. + + +}) diff --git a/vignettes/KinformR-penetrance_and_ibd.Rmd b/vignettes/KinformR-penetrance_and_ibd.Rmd new file mode 100644 index 0000000..87ae310 --- /dev/null +++ b/vignettes/KinformR-penetrance_and_ibd.Rmd @@ -0,0 +1,142 @@ +--- +title: "KinformR - penetrance and idb informed scoring of families" +author: "Cameron M. Nugent" +date: "`r format(Sys.time(), '%d %B, %Y')`" +data: "`r Sys.Date()`" +output: pdf_document #rmarkdown::html_vignette # +pdf_document: + df_print: kable +vignette: > + %\VignetteIndexEntry{KinformR-penetrance_and_ibd} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + + + +## Introduction + +The `cal.penetrance` and associated `penetrance` and `ibd` functions facilitate comparison of the relative "power" of families in a study. This can be accomplished upstream of sequencing (i.e. in the project planning stage) and is therefore only dependent on relationship structure and reported affected status of individuals in a given family or set of families. + +## Estimating power of families + +The `cal.penetrance` function generates both a theoretical ranking of the power of a family assuming you were able to collect everyone on the simplified pedigree, as well as a current ranking, examining only those for whom you currently have DNA. This allows evaluation of the impact on the ranking if certain other family members are enrolled in the study. + +### Load the library + +```{r loadlib, echo = TRUE, results = 'hide', message=FALSE, warning=FALSE} +library(KinformR) + +show <- function(df){ + knitr::kable(df, format = "markdown", digits = 2) +} +``` + +## The input data + +The family power calculations depend on a single tab-delimited input file, where each row represents a family. The input file is read in using the `read.pedigree` function. + +```{r} +example.pedigree.file <- system.file('extdata/example_pedigree_encoding.tsv', + package = 'KinformR') + +example.pedigree.df <- read.pedigree(example.pedigree.file) +``` +The input file is expected to have the following 11 columns (with a header). +```{r} +colnames(example.pedigree.df) + +``` + +### Simplified summary of pedigrees + +For now this file should be be constructed through careful manual inspection of the predigrees. To encode the rows for each family, you should first prune down pedigrees to informative allele transfers. For +the purposes of this tool, we exclude young generations (non-adults, younger than age of onset) and large (more than two sequential generations) trees of exclusively unaffected family members. Additionally all individuals require a binary A/U status, there should be no ambigious individuals. There will be some judgment calls required here. + + +### Encoding categories of relationships + +From the simplified pedigrees, the individuals are assigned to the following categories. + +|Category|Description| +|---:|---| +|a|Affected individuals| +|b|Obligate carriers| +|c|Children of either affecteds or carriers, with no children of their own| +|d|Trees of unaffected individuals - specifically, two sequential generations (i.e. a parent and their offspring; trees of unaffecteds that are larger than this are omitted.)| + + +The counts of individuals assigned to these categories are then added to the tab-delimited input file: + +```{r} +show(example.pedigree.df) +``` + +All columns with the prefix `max_` are meant to count the total number of each category in the pedigree, while +the columns without this prefix are the number of each category for whom samples have been collected. + +The categories correspond to A, B, and C as defined above. + +Category D is represented by two numbers, d and n. n is the number of offspring in a tree of unaffecteds; d is the number of those types of trees across the pedigree. Multiple types of trees are encoded with commas separating the values. For example, the following represents a family with three total trees of unaffecteds. One tree (d=1) has three offspring (n=3); two trees (d=2) each have one offspring (n=1). + +``` +d n +1,2 3,1 +``` + +## Theoretical vs. current rankings + +With the encoded data loaded, the function `cal.penetrance` will generate both a theoretical ranking of the power of a family assuming you were able to collect everyone on the simplified pedigree, as well as a current ranking, examining only those for whom you currently have DNA. This allows evaluation of the impact on the ranking if certain other family members are enrolled in the study. + +```{r} +penetrance.df <- score.pedigree(example.pedigree.df) + +show(penetrance.df) + +``` + +The output includes seven columns that with the following information: +|Output Column|Description| +|---|---| +|family|The family id.| +|penetrance| Estimated penetrance rate (K) for the family.| +|max_pi-hat| The estimated proportion of the genome that is shared between all individuals *that could be sampled* in the family (IBD).| +|max_score| The theoretical maximum score for the given family *that could* be achieved if all individuals were sampled.| +|current_pi-hat|The estimated proportion of the genome that is shared between all individuals *that have been sampled* in the family (IBD).| +|current_score| The score for the given family *that has* been achieved through the sampled individuals| +|pct_of_max| The percentage of the theoretical maximum score that has been realized in the family sampling.| + + +With the scoring completed, the values can be queried to learn more about the realized and potential value of the families relative to one another. + +Sorting on `current_score` shows which family has the most detection power based on collected samples. + +```{r} +ord.df.current <- penetrance.df[order(penetrance.df$current.score, decreasing = TRUE),] +show(ord.df.current) + +``` +If we had to work with only what we have, then family 5031 gives the most detection power. + +Sorting on `max_score` shows which family could have the most detection power, if all individuals were sampled. This can be useful in targeting future sampling efforts as it shows where more samples would give the most value. + +```{r} +ord.df.max <- penetrance.df[order(penetrance.df$max.score, decreasing = TRUE),] +show(ord.df.max) + +``` +Here we can see that family `0347` has a maximum score of 30.84, but a realized score of only 4.17. Given the high potential detection power for this family, it is an ideal target for future sampling efforts as current samples reveal on 13.5% of the family's potential value. + + +### Note on a few special cases + +In trees of unaffecteds, there are two special cases for current ranking: +1. You have collected the parent (regardless of collection status of the child). In +this case, the child cannot provide any additional information beyond the parent, so +we only count the parent. (d=1, n=0; equivalently, c=1) +2. You have collected one or more children, but not the parent. In this case, +each of the children contribute a portion of what the parent would have contributed +to our understanding. (d=1, n>0) + + + diff --git a/vignettes/KinformR-variant_scoring.Rmd b/vignettes/KinformR-variant_scoring.Rmd new file mode 100644 index 0000000..1042fee --- /dev/null +++ b/vignettes/KinformR-variant_scoring.Rmd @@ -0,0 +1,218 @@ +--- +title: "KinformR - pedigree-informed rare variant association scoring" +author: "Cameron M. Nugent" +date: "`r format(Sys.time(), '%d %B, %Y')`" +data: "`r Sys.Date()`" +output: pdf_document #rmarkdown::html_vignette # +pdf_document: + df_print: kable +vignette: > + %\VignetteIndexEntry{KinformR-variant_scoring} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +## Introduction + +When looking at rare variants within a family, not all affected and unaffected individuals are of equal importance. `KinformR` is an R package meant to aid in comparative evaluation of information in family-based rare-variant association studies; the package considers evidence of rare variant's association with disease status in a family and scores the variant based on relationships of individuals in a pedigree. Comparing scores across candidates can thereby help in assessing their relative merit. + +The program leverages Wright’s coefficient of relatedness to score families based on the relationship of individuals as well as their disease status and genotypes for a given variant. A candidate variant is considered strongest when it is shared by an affected individual and other distantly-related affected family members and not shared by closely related unaffected family members. The theory behind this is that more distantly related individuals have a smaller proportion of their genome that is IBD, so the chance of a false positive (shared variant through chance, not due to association with disease) is lower. Conversely, when a closely related relative is unaffected, and the affected and unaffected individuals have different genotypes for the candidate variant, this provides evidence that the disease is likely not associated with the large IBD portion of the genome shared by the two individuals and thereby gives evidence in favour of the candidate. + +### Load the library + +```{r loadlib, echo = TRUE, results = 'hide', message=FALSE, warning=FALSE} +library(KinformR) + +show <- function(df){ + knitr::kable(df, format = "markdown", digits = 2) +} + +``` + + +## Input data + +Two input are required to score a candidate variant: the family relationship information and the variant status for the individuals. + +### The relation matrix + +This input is a matrix containing all the pairwise relationships of individuals in a family. The row and column names are the individual IDs, and the intersecting value denotes the degree of relationship between the individuals (self = 0, 1st degree relations = 1, etc. Unrelated individuals are given a value of -1). As of version `0.1.0` the relation matrix is a manually created file, where relationship values are assigned via manual inspection of the family pedigree. + +To read in the data, one uses the function `read.relation.mat`. +```{r} +mat.name1<-system.file('extdata/1234_ex2.mat', package = 'KinformR') +rel.mat <- read.relation.mat(mat.name1) +show(rel.mat) +``` + + +### The status file + +This file includes the same individual IDs used in the relationship matrix as well as the disease and variant status for all individuals. + +Note that the order of individuals can be different between the two files. All individual in the status file must be present within the relationship matrix, but you can have individuals in the relationship matrix that are not present in the status file. The program assumes there is no genotype information for individuals not in the status file and they are ignored in the variant scoring. + +To read in the data, one uses the function `read.indiv`. + +```{r} + +tsv.name1<-system.file('extdata/1234_ex2.tsv', package = 'KinformR') +status.df <- read.indiv(tsv.name1) + +show(status.df) +``` + +The disease-genotype scoring can then be encoded using the `score.variant.status` function to produce the status-variant category for all individuals. This creates a df with the new column: `statvar.cat`. + +```{r} + +full.df.status <- score.variant.status(status.df) +show(full.df.status) +``` + + + +## Scoring a family + +For most real-world applications, you will likely want to score family members in conjunction with one another, and take the mean score for an entire family. This can be accomplished with `score.fam`, which takes in the matrix of relationships and the table with encoded `statvar.cat` of all individuals. + +```{r} + +ex.score.default <- score.fam(rel.mat, full.df.status) +show(ex.score.default) +``` + + +By default `score.fam` returns: +- The scores considering only the Affected individuals as the reference individual (skipping the rows for the U individuals: MS-1234-1002, MS-1234-1006, and MS-1234-6001, in the previous example). +- The mean of the calculated score from each reference individuals. + +As previously noted, if an individual is present in the relationship matrix and not in the status file, it is assumed there is no genetic information for this individual and they are ignored when calculating the variant score. + +The scoring can be changed to summing across all combinations as opposed to the mean by passing the following options. Note using the program in this way will return higher scores for more dense pedigrees. +```{r} + +ex.score.sum <- score.fam(rel.mat, full.df.status, + return.sums = TRUE, return.means = FALSE) +show(ex.score.sum) +``` + + +To obtain a long form table with the scores for variants expressed relative to each individual, set both `return.sums` and `return.means` to `FALSE`. This output can aid in identifying which individuals are carrying the most weight in a family's score. +```{r} + +ex.score.table <- score.fam(rel.mat, full.df.status, + return.sums = FALSE, return.means = FALSE) +show(ex.score.table) +``` + +## How scoring works +### A Minimal example, scoring a variant from perspective of a single individual. + +This section is meant to demonstrate how the variant scoring is accomplished on a finer scale. A user does not need to interact with the package on this level of granularity. This section is for explanatory purposes only, demonstrating how the `score.fam` function operated "under the hood". + +The `score.fam` function runs the scoring method once for each affected individual in the status dataframe (or for each individual regardless of status if `affected.only = FALSE`). To do this, for each individual, the program takes corresponding row of the relationship matrix to determine the relations to all other individuals in the pedigree. + +For example, the degrees of relationships of all other members of the example family relative to the reference individual `"MS-1234-1001"` are show in the following subset of the matrix: + +```{r} +rel.mat.proband <- rel.mat["MS-1234-1001",] +show(rel.mat.proband) +``` +This is taken along with a list of encoded statuses of all individuals for the given variant. Above `score.variant.status` used the disease status and a genetic variant of each individual to determine which of the following categories they fall in to: + - A.c = Affected individual with variant + - A.i = Affected individual without variant + - U.c = Unaffected individual without variant + - U.i = Unaffected individual with variant +Within `score.fam`, a labelled list of encodings for all individuals is generated. +```{r} +name.stat.dict <- full.df.status$statvar.cat +names(name.stat.dict) <- full.df.status$name +name.stat.dict +``` + +`build.relation.dict` is used summarize the collate the status and relationship information and evidence affected and unaffected relations giving evidence for and evidence against a variant. +```{r} +rel.dict<-build.relation.dict(rel.mat.proband, name.stat.dict) +rel.dict +``` +In this example, the proband, two first degree relations, and a third degree relations are all affected and share the candidate variant. For the affected correct (`A.c`) category we therefore see the following encoded: + +```{r} +rel.dict$A.c +``` + +Since one first degree unaffected relative has the variant, they are categorized as "unaffected incorrect"(`U.i`) and we see: +```{r} +rel.dict$U.i +``` + +Deriving a relatedness-weighted score for the variant from the perspective of the given individual is then performed by `calc.rv.score` + +For each degree-encoded relationship, the coefficient of relatedness is used to weight the evidence for or against a variant. The coefficients for different degress of relationship are: +```{r} +for(i in 0:7){ + print(paste0("Degree of relatedness: ", i, + " coefficient of relatedness: ", 1 / (2 ** (i)))) +} +``` + +The score for affected status increase as individuals become more distantly related. The formula is: +``` + (1 / coefficient.of.relatedness) * affected.weight +``` +For example, an affected cousin (encoded as a 3 and having a coefficient of relatedness of 0.125) would get a score of: +``` + (1/0.125) * affected.weight + 8 * 1 + = 8 points in favour of the variant. +``` + +For unaffected individuals, scores decay the further a person is in relation to the query individual based on the formula: +``` +((unaffected.max*2) * coefficient.of.relatedness ) * unaffected.weight +``` +For example, with the default `unaffected.weight = 0.5` and unaffected sister that does not have a variant would get a score of +``` + ((8*2) 0.5) * unaffected.weight + (16 * 0.5) * 0.5 + = 4 points for the variant. + +``` + +If these were the only two relatives considered we could sum the points and get a score in favour of the variant of +``` + 8 + 4 = 12 +``` +If there is evidence against a variant, this is factored into the score as: +``` + total.score = evidence.for - evidence.against +``` +For example, if there were also an affected sibling without the variant we would have the score against of: +``` + (1/0.5) * 1 = 2 +``` + +The final score for the variant would then be: +``` + for - against = total + 12 - 2 = 10 +``` +Giving a final score of 10 for the variant. + +This is all accomplished by the function `calc.rv.score`. +```{r} +calc.rv.score(rel.dict) +``` + +The weights of the scoring can be adjusted, for example if we wanted to consider only `affected`-based evidence, we could turn off the unaffected part of the calculation by setting the unaffected weighting to 0. This can be useful for incompletely penetrant variants, where disease status and genotype of unaffected individuals are more likely to have imperfect concordance. + +Additionally, families with low numbers of affected individuals sequenced and high number of unaffected individuals may haved inflated variant scores and potentially be misleading, focusing the scoring algorithm on the affected individuals only can overcome this bias. + +```{r} +calc.rv.score(rel.dict, unaffected.weight=0) +``` + +The `score.fam` function automatically walks through this process from all specified perspectives in the pedigree and by default returns the average score. The use of the averages and different perspectives is meant to eliminate pedigree-associated bias, such as for instances when a proband is distantly related to all other members in a family (considering the relationships from only the perspective of the proband in this case would give an inflated score for the variant's value). + +