Skip to content

Translating scores to pathogenicity LLRs

Jochen Weile edited this page Jun 5, 2023 · 1 revision

Translating scores to pathogenicity LLRs

More often than not, the generation of variant effect maps is motivated by the desire to assist in clinical variant interpretation. However, the scores in a given map generally do not directly express whether a variant is pathogenic or not. Instead they provide a proxy measurement of protein function, which cannot be expected to linearly relate to the potential to cause disease. To address this, we can leverage sets of known pathogenic and benign reference variants. We can observing the score distributions of these reference sets. Then, we can calculate for any given score how likely it is to belong to one distribution over the other, that is, how much more likely the score belongs to a pathogenic variant as opposed to a benign one. Mathematically, this is expressed as a log likelihood ratio or "LLR".

The result of the LLR translation can vary greatly depending on a number of parameters. This is because it uses an approach called kernel density estimation (KDE) to estimate the probability densities underlying the distribution of the scores of the reference variants. The main parameters of KDE are (i) the type of kernel function to use, and (ii) the bandwidth. The bandwidth can be thought of as the amount of smoothing applied to the result. If the bandwidth is too low, the density curve output will be jagged and rough, whereas if the bandwidth is too high, the result curve will be too broad and flat. (There is a great figure on Wikipedia that captures these effects in a visual example.) To make the selection of an appropriate bandwidth easier, a number of heuristics exist. It is generally worth exploring the different parameter options and seeing which yields the most believable density curves.

Installing maveLLR

Even though tileseqMave comes with an integrated wrapper script for LLR calculation, it requires the installation of an additional library.

#Open an interactive R session
R
#install the yogiroc
remotes::install_github("jweile/maveLLR")
#close R
q()

Generating a set of reference variants

You can use tileseqMave to automatically generate a reference set from ClinVar and gnomAD controls for you. For example, to generate a reference set for BRCA1 with a minimum gnomAD allele frequency cutoff of $10^-5$ and a minimum of 1 Star quality on ClinVar:

tsm referenceSets BRCA1 --mafMin 1e-5 --starsMin 1 

The output will automatically be written to a file called BRCA1_refVars.csv.

If the target gene is associated with multiple different traits on ClinVar, the script will show a warning message about this. To filter down to a specific trait, you can use the --trait option, which also accepts regex values in case the same trait is represented by different names.

Calculating LLRs

After the reference set has been generated you can use it to calculate an LLR translation. For this we use the calcLLR module. For example, if we have a map file called BRCA1_VEmap.csv and a set of reference variants in the file BRCA1_refVars.csv. We will likely need to run this program multiple times to find appropriate parameters. To make the executions faster, I recommend temporarily reducing the number of bootstrap iterations (-i) to a minimum to save time. (This will result in unreliable confidence intervals in the output table, but we don't initially need those anyway).

tsm calcLLR -i 10 BRCA1_VEmap.csv BRCA1_refVars.csv

The program will output the following:

  • BRCA_refVars_llr.csv - A table of the LLR translations for each variant in the map.
  • BRCA_refVars_llr.pdf - A plot showing the kernel density curves and LLR curve.
  • BRCA_refVars_llr_prc.pdf - A PRC curve plot of the map against the reference set.

After generating the output, take a look at the curves in the *_llr.pdf file to check for problems:

  • Check if the density curves appear over- or under-smoothed. If so, try a different bandwidth parameter (see below for options).
  • Check if individual outlier variants introduce gaps or gouges in the LLR plot. If so, try to increase the --outlierSuppression parameter.

For example:

tsm calcLLR -i 10 --bandwidth nrd --outlierSuppression 1e-3 BRCA1_VEmap.csv BRCA1_refVars.csv 

After you have found a set of parameters that works, you can re-run the program with full bootstrap iterations and print the evidence level transitions:

tsm calcLLR --bandwidth nrd --printTransitions BRCA1_VEmap.csv BRCA1_refVars.csv 

Full usage information:

usage: tsm calcLLR [--] [--help] [--gauss] [--spline]
       [--printTransitions] [--noCI] [--outfile OUTFILE]
       [--bandwidth BANDWIDTH] [--kernel KERNEL] [--posRange POSRANGE]
       [--outlierSuppression OUTLIERSUPPRESSION] [--logfile LOGFILE]
       [--iterations ITERATIONS] [--cores CORES] map reference

Automatically assemble reference variant sets via Clinvar and Gnomad

positional arguments:
  map                   VE map file in MaveDB format
  reference             reference variant set (from referenceSets.R

flags:
  -h, --help            show this help message and exit
  -g, --gauss           Use simple gaussians to determine LLR instead
                        of kernel density estimation
  -s, --spline          Use apply spline monotonization to gauss LLR
                        function. (Only used if --gauss is used)
  --printTransitions    Print evidence code transitions.
  -n, --noCI            Turns off confidence intervals for LLR table.

optional arguments:
  -o, --outfile         The desired prefix for the output file name.
  -b, --bandwidth       Kernel bandwidth. Only used when --gauss isn't
                        used. Valid options: Any number, or the
                        following auto-selection methods: 'nrd0'
                        (Silverman's rule of thumb), 'nrd' (Scott's
                        rule of thumb), 'bcv' (biased cross
                        validation), 'SJ' (Sheather and Jones), 'ucv'
                        (unbiased cross validation) [default: ucv]
  -k, --kernel          Kernel type. Only used when --gauss isn't used.
                        Valid options: 'gaussian', 'epanechnikov',
                        'rectangular', 'triangular', 'biweight',
                        'cosine'
                        'optcosine','tricube','triweight','laplace','gamma','gamma_biased'
                        [default: epanechnikov]
  -p, --posRange        Positional range within the map to be tested.
                        Must be two integer numbers separated by dash,
                        e.g. '1-189'
  --outlierSuppression  Strength of outlier suppression to use. Between
                        0 and 1 [default: 1e-04]
  -l, --logfile         The desired log file location. [default:
                        llr.log]
  -i, --iterations      The number of bootstrap iterations for
                        determining LLR confidence interval [default:
                        10000]
  -c, --cores           The number of processes to run in parallel for
                        multi-core processing. Warning: This also
                        multiplies the amount of RAM used! [default: 4]

Clone this wiki locally