-
Notifications
You must be signed in to change notification settings - Fork 0
Translating scores to pathogenicity LLRs
More often than not, the generation of variant effect maps is motivated by the desire to assist in clinical variant interpretation. However, the scores in a given map generally do not directly express whether a variant is pathogenic or not. Instead they provide a proxy measurement of protein function, which cannot be expected to linearly relate to the potential to cause disease. To address this, we can leverage sets of known pathogenic and benign reference variants. We can observing the score distributions of these reference sets. Then, we can calculate for any given score how likely it is to belong to one distribution over the other, that is, how much more likely the score belongs to a pathogenic variant as opposed to a benign one. Mathematically, this is expressed as a log likelihood ratio or "LLR".
The result of the LLR translation can vary greatly depending on a number of parameters. This is because it uses an approach called kernel density estimation (KDE) to estimate the probability densities underlying the distribution of the scores of the reference variants. The main parameters of KDE are (i) the type of kernel function to use, and (ii) the bandwidth. The bandwidth can be thought of as the amount of smoothing applied to the result. If the bandwidth is too low, the density curve output will be jagged and rough, whereas if the bandwidth is too high, the result curve will be too broad and flat. (There is a great figure on Wikipedia that captures these effects in a visual example.) To make the selection of an appropriate bandwidth easier, a number of heuristics exist. It is generally worth exploring the different parameter options and seeing which yields the most believable density curves.
Even though tileseqMave
comes with an integrated wrapper script for LLR calculation, it requires the installation of an additional library.
#Open an interactive R session
R
#install the yogiroc
remotes::install_github("jweile/maveLLR")
#close R
q()
You can use tileseqMave
to automatically generate a reference set from ClinVar and gnomAD controls for you. For example, to generate a reference set for BRCA1 with a minimum gnomAD allele frequency cutoff of
tsm referenceSets BRCA1 --mafMin 1e-5 --starsMin 1
The output will automatically be written to a file called BRCA1_refVars.csv
.
If the target gene is associated with multiple different traits on ClinVar, the script will show a warning message about this. To filter down to a specific trait, you can use the --trait
option, which also accepts regex values in case the same trait is represented by different names.
After the reference set has been generated you can use it to calculate an LLR translation. For this we use the calcLLR
module. For example, if we have a map file called BRCA1_VEmap.csv
and a set of reference variants in the file BRCA1_refVars.csv
. We will likely need to run this program multiple times to find appropriate parameters. To make the executions faster, I recommend temporarily reducing the number of bootstrap iterations (-i
) to a minimum to save time. (This will result in unreliable confidence intervals in the output table, but we don't initially need those anyway).
tsm calcLLR -i 10 BRCA1_VEmap.csv BRCA1_refVars.csv
The program will output the following:
- BRCA_refVars_llr.csv - A table of the LLR translations for each variant in the map.
- BRCA_refVars_llr.pdf - A plot showing the kernel density curves and LLR curve.
- BRCA_refVars_llr_prc.pdf - A PRC curve plot of the map against the reference set.
After generating the output, take a look at the curves in the *_llr.pdf
file to check for problems:
- Check if the density curves appear over- or under-smoothed. If so, try a different bandwidth parameter (see below for options).
- Check if individual outlier variants introduce gaps or gouges in the LLR plot. If so, try to increase the
--outlierSuppression
parameter.
For example:
tsm calcLLR -i 10 --bandwidth nrd --outlierSuppression 1e-3 BRCA1_VEmap.csv BRCA1_refVars.csv
After you have found a set of parameters that works, you can re-run the program with full bootstrap iterations and print the evidence level transitions:
tsm calcLLR --bandwidth nrd --printTransitions BRCA1_VEmap.csv BRCA1_refVars.csv
usage: tsm calcLLR [--] [--help] [--gauss] [--spline]
[--printTransitions] [--noCI] [--outfile OUTFILE]
[--bandwidth BANDWIDTH] [--kernel KERNEL] [--posRange POSRANGE]
[--outlierSuppression OUTLIERSUPPRESSION] [--logfile LOGFILE]
[--iterations ITERATIONS] [--cores CORES] map reference
Automatically assemble reference variant sets via Clinvar and Gnomad
positional arguments:
map VE map file in MaveDB format
reference reference variant set (from referenceSets.R
flags:
-h, --help show this help message and exit
-g, --gauss Use simple gaussians to determine LLR instead
of kernel density estimation
-s, --spline Use apply spline monotonization to gauss LLR
function. (Only used if --gauss is used)
--printTransitions Print evidence code transitions.
-n, --noCI Turns off confidence intervals for LLR table.
optional arguments:
-o, --outfile The desired prefix for the output file name.
-b, --bandwidth Kernel bandwidth. Only used when --gauss isn't
used. Valid options: Any number, or the
following auto-selection methods: 'nrd0'
(Silverman's rule of thumb), 'nrd' (Scott's
rule of thumb), 'bcv' (biased cross
validation), 'SJ' (Sheather and Jones), 'ucv'
(unbiased cross validation) [default: ucv]
-k, --kernel Kernel type. Only used when --gauss isn't used.
Valid options: 'gaussian', 'epanechnikov',
'rectangular', 'triangular', 'biweight',
'cosine'
'optcosine','tricube','triweight','laplace','gamma','gamma_biased'
[default: epanechnikov]
-p, --posRange Positional range within the map to be tested.
Must be two integer numbers separated by dash,
e.g. '1-189'
--outlierSuppression Strength of outlier suppression to use. Between
0 and 1 [default: 1e-04]
-l, --logfile The desired log file location. [default:
llr.log]
-i, --iterations The number of bootstrap iterations for
determining LLR confidence interval [default:
10000]
-c, --cores The number of processes to run in parallel for
multi-core processing. Warning: This also
multiplies the amount of RAM used! [default: 4]