Commit c5b7ecb

nalinigans and mlathara authored
Update README to include documentation for specifying filters to genomicsdb_query (#79)
* Add filter expression description to doc Co-authored-by: mlathara <[email protected]>
1 parent db4c88a commit c5b7ecb

1 file changed: +69 -1 lines changed

genomicsdb/scripts/README.md

Lines changed: 69 additions & 1 deletion
@@ -3,6 +3,9 @@
Simple GenomicsDB query tool `genomicsdb_query`, given a workspace and genomic intervals of the form `<CONTIG>:<START>-<END>`. At a minimum, an interval must specify the contig; start and end are optional, e.g. `chr1:100-1000`, `chr1:100` and `chr1` are all valid. Start defaults to 1 if not specified and end defaults to the length of the contig if not specified.
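
The interval rules above can be sketched in a few lines of Python. This is only an illustration of the described defaults, not the parser used by `genomicsdb_query`; the `parse_interval` helper and the `contig_lengths` mapping are hypothetical names, and contig names containing `:` are not handled.

```python
from typing import Dict, Tuple


def parse_interval(interval: str, contig_lengths: Dict[str, int]) -> Tuple[str, int, int]:
    """Parse '<CONTIG>[:<START>[-<END>]]' into (contig, start, end)."""
    contig, _, span = interval.partition(":")
    if not contig:
        raise ValueError(f"missing contig in interval: {interval!r}")
    start_text, _, end_text = span.partition("-")
    # START defaults to 1 and END defaults to the contig length.
    start = int(start_text) if start_text else 1
    end = int(end_text) if end_text else contig_lengths[contig]
    if start > end:
        raise ValueError(f"start is past end in interval: {interval!r}")
    return contig, start, end


# Hypothetical contig length, used only for this illustration.
lengths = {"chr1": 5000}
for text in ("chr1:100-1000", "chr1:100", "chr1"):
    print(text, "->", parse_interval(text, lengths))
```

In the real tool the contig lengths come from the workspace's vid mapping, so no such mapping has to be supplied by the user.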

Assumption: The workspace should have been created with the `vcf2genomicsdb` tool or with `gatk GenomicsDBImport` and should exist.

* [Caching for enhanced performance](#caching)
* [Filters and Attributes](#filters)

```
~/GenomicsDB-Python/examples: ./genomicsdb_query --help
@@ -26,6 +29,7 @@ options:
--list-contigs List contigs configured in vid mapping for the workspace and exit
--list-fields List genomic fields configured in vid mapping for the workspace and exit
--list-partitions List interval partitions(genomicsdb arrays in the workspace) for the given intervals(-i/--interval or -I/--interval-list) or all the intervals for the workspace and exit
--no-cache Do not use cached metadata and files with the genomicsdb query
-i INTERVAL, --interval INTERVAL
genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.
This argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000.
@@ -49,7 +53,7 @@ options:
1. -s/--sample and -S/--sample-list are mutually exclusive
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
-a ATTRIBUTES, --attributes ATTRIBUTES
- Optional - comma separated list of genomic attributes or fields described in the vid mapping for the query, eg. GT,AC,PL,DP... Defaults to GT
+ Optional - comma separated list of genomic attributes(REF, ALT) and fields described in the vid mapping for the query, eg. GT,AC,PL,DP... Defaults to REF,GT
-f FILTER, --filter FILTER
Optional - genomic filter expression for the query, e.g. 'ISHOMREF' or 'ISHET' or 'REF == "G" && resolve(GT, REF, ALT) &= "T/T" && ALT |= "T"'
-n NPROC, --nproc NPROC
@@ -105,6 +109,7 @@ query_output_1-100-100000.csv query_output_1-100001.csv query_output_2.csv
```

<a name="caching"></a>
### Caching for enhanced performance

Locally caching artifacts from cloud URLs is optional for GenomicsDB metadata and helps with performance for metadata/artifacts that are accessed multiple times. There is a separate caching tool, `genomicsdb_cache`, which takes as inputs the workspace, optionally callset/vidmap/loader.json, and also optionally the intervals or interval lists with the `-i/--interval` or `-I/--interval-list` options. Note that the json files are downloaded to the current working directory whereas other metadata are persisted in `$TMPDIR` or in `/tmp`. Caching is intended to be done once, before the first query for the interval. Set the env variable `TILEDB_CACHE` to `1` and explicitly use `-c callset.json -v vidmap.json -l loader.json` with the `genomicsdb_query` command to access locally cached GenomicsDB metadata and json artifacts.
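
As a rough sketch of the last step, the snippet below assembles a `genomicsdb_query` invocation that sets `TILEDB_CACHE=1` and passes the locally cached json files explicitly. The `cached_query_command` helper and the `my_workspace` value are hypothetical; only the flags and the environment variable come from the text above, and the command is built but not executed.

```python
import os


def cached_query_command(workspace, intervals):
    """Return (argv, env) for a genomicsdb_query run that uses the local cache."""
    # TILEDB_CACHE=1 asks the query to use locally cached metadata.
    env = dict(os.environ, TILEDB_CACHE="1")
    # The json artifacts are expected in the current working directory.
    argv = ["genomicsdb_query", "-w", workspace,
            "-c", "callset.json", "-v", "vidmap.json", "-l", "loader.json"]
    for interval in intervals:
        argv += ["-i", interval]
    return argv, env


argv, env = cached_query_command("my_workspace", ["chr1:1-10000", "chr2"])
print("TILEDB_CACHE=" + env["TILEDB_CACHE"], " ".join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```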
@@ -141,4 +146,67 @@ options:
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
```

<a name="filters"></a>
### Filters and Attributes

Filters can be specified via the optional `-f/--filter` argument to `genomicsdb_query`. They are genomic filter expressions for the query and are based on the genomic attributes specified for the query. Genomic attributes are `REF`, `ALT` and all the fields specified during import of the variant files into GenomicsDB. Note that any attribute used in the filter expression must also be passed to the query via the `-a/--attributes` argument unless it is one of the defaults (`REF` and `GT`).

The expressions themselves are enhanced algebraic expressions using the attributes and the values for those attributes at the locus (contig+position) for the sample. The supported operators include the binary algebraic and comparison operators, e.g. `==, !=, >, <, >=, <=...`, and the custom operators `|=`, used with `ALT` to match any of the alternate alleles, and `&=`, used to match a resolved `GT` field with respect to `REF` and `ALT`. The expressions can also contain [predefined aliases](#predefined_aliases) for often used operations. Also see [supported operators](#supported_operators) and try listing the fields (`--list-fields`) to help build the filter expression. See [examples](#examples) for sample filter expressions.

```
~/GenomicsDB-Python/examples: genomicsdb_query -w my_workspace --list-fields
Field  Class   Type     Length  Description
-----  -----   ----     ------  -----------
PASS   FILTER  Integer  1       "All filters passed"
q10    FILTER  Integer  1       "Quality below 10"
s50    FILTER  Integer  1       "Less than 50% of samples have data"
NS     INFO    Integer  1       "Number of Samples With Data"
DP     INFO    Integer  1       "Total Depth"
AF     INFO    Float    A       "Allele Frequency"
AA     INFO    String   var     "Ancestral Allele"
DB     INFO    Flag     1       "dbSNP membership"
H2     INFO    Flag     1       "HapMap2 membership"
GT     FORMAT  Integer  PP      "Genotype"
VAF    FORMAT  Float    1       "Variant Allele Fraction"
VP     FORMAT  Integer  1       "Variant Priority or clinical significance"
--
Abbreviations :
  A: Number of alternate alleles
  R: Number of alleles (including reference allele)
  G: Number of possible genotypes
  PP or P: Ploidy
  VAR or var: variable length
```

<a name="predefined_aliases"></a>
#### Predefined aliases
1. ISCALL : is a variant call, filters out `GT="./."` for example
2. ISHOMREF : homozygous with the reference allele (REF)
3. ISHOMALT : both the alleles are non-REF (ALT)
4. ISHET : heterozygous when the alleles in GT are different
5. resolve : resolves the GT field specified as `0/0` or `1|2` into alleles with respect to REF and ALT. Phase separator is also considered for the comparison.
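
To make the first four aliases concrete, here is a small Python model of their descriptions applied to `GT` strings. It is a literal reading of the list above and not GenomicsDB's evaluator; the function names are hypothetical, and treating any `.` allele as a no-call is an assumption.

```python
import re


def alleles(gt):
    """Split a GT string such as '0/1' or '1|2' into its allele indices."""
    return re.split(r"[/|]", gt)


def is_call(gt):
    # Assumption: a GT with any missing allele ('.') counts as a no-call.
    return all(a != "." for a in alleles(gt))


def is_hom_ref(gt):
    # Every allele is the reference allele (index 0).
    return is_call(gt) and all(a == "0" for a in alleles(gt))


def is_hom_alt(gt):
    # Every allele is non-REF, following the ISHOMALT description.
    return is_call(gt) and all(a != "0" for a in alleles(gt))


def is_het(gt):
    # The alleles in GT differ from each other.
    return is_call(gt) and len(set(alleles(gt))) > 1


for gt in ("./.", "0/0", "0|1", "1/1"):
    print(gt, is_call(gt), is_hom_ref(gt), is_hom_alt(gt), is_het(gt))
```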

<a name="supported_operators"></a>
#### Supported operators

* Standard operators: `+`, `-`, `*`, `/`, `^`
* Assignment operators: `=`, `+=`, `-=`, `*=`, `/=`
* Logical and comparison operators: `&&`, `||`, `==`, `!=`, `>`, `<`, `<=`, `>=`
* Bit manipulation: `&`, `|`, `<<`, `>>`
* String concatenation: `//`
* If-then-else conditionals with lazy evaluation: `?:`
* Type conversions: `(float)`, `(int)`
* Array index operator (for use with arrays of Integer/Float): e.g. `AF[0]`
* Standard functions: `abs`, `sin`, `cos`, `tan`, `sinh`, `cosh`, `tanh`, `ln`, `log`, `log10`, `exp`, `sqrt`
* Functions with an unlimited number of arguments: `min`, `max`, `sum`
* String functions: `str2dbl`, `strlen`, `toupper`
* Array functions: `sizeof` and access by index, e.g. `AF[2]`
* Custom operators: `|=` used with `ALT`, `&=` used with `resolve(GT, REF, ALT)`
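
The two custom operators can be pictured with the Python sketch below. It assumes that `|=` tests membership in the list of alternate alleles and that `&=` compares the resolved genotype string exactly, including the phase separator; the real evaluator may differ, and all function names here are hypothetical.

```python
import re


def resolve(gt, ref, alt):
    """Replace allele indices in GT with alleles, keeping the phase separators."""
    allele_table = [ref] + list(alt)   # index 0 is REF, 1 onwards are ALT alleles
    parts = re.split(r"([/|])", gt)    # the capture group keeps '/' and '|'
    return "".join(
        p if p in ("/", "|", ".") else allele_table[int(p)] for p in parts
    )


def alt_matches(alt, allele):
    # Model of `ALT |= "T"`: true when any alternate allele equals the value.
    return allele in alt


def gt_matches(gt, ref, alt, expected):
    # Model of `resolve(GT, REF, ALT) &= "T/T"`: exact match, separator included.
    return resolve(gt, ref, alt) == expected


print(resolve("1|2", "G", ["T", "C"]))
print(alt_matches(["T", "C"], "T"), gt_matches("1/1", "G", ["T"], "T/T"))
```

Under these assumptions an unphased `T/T` pattern does not match a phased `T|T` genotype, which is one way to read "phase separator is also considered".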

<a name="examples"></a>
#### Example filters:

* `ISCALL && !ISHOMREF`: Filter out no-calls and homozygous reference calls, keeping only calls that are not homozygous reference.
* `ISCALL && (REF == "G" && ALT |= "T" && resolve(GT, REF, ALT) &= "T/T")`: Filter out no-calls and only keep variants where the REF is G, ALT contains T and the genotype is T/T.
* `ISCALL && (DP>0 && resolve(GT, REF, ALT) &= "T/T")`: Filter out no-calls and only keep variants where the genotype is T/T and DP is greater than 0.
* `ISCALL && AF[0] > 0.5`: Filter out no-calls and only keep variants where the first `AF` value is greater than 0.5.
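
Because every attribute used in a filter must also be passed with `-a` unless it is a default, it can help to derive the `-a` value from the expression. The helper below is a minimal sketch of that bookkeeping; `attributes_for_filter` and the `fields` set are hypothetical, and the field names would normally be taken from `--list-fields` plus `REF` and `ALT`.

```python
import re

DEFAULT_ATTRIBUTES = ["REF", "GT"]


def attributes_for_filter(filter_expr, known_fields):
    """Return a -a value covering the defaults plus the fields used in the filter."""
    # Drop quoted string literals so allele values like "T/T" are not scanned.
    without_strings = re.sub(r'"[^"]*"', "", filter_expr)
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_strings)
    attributes = list(DEFAULT_ATTRIBUTES)
    for name in names:
        # Aliases such as ISCALL and resolve are skipped: they are not fields.
        if name in known_fields and name not in attributes:
            attributes.append(name)
    return ",".join(attributes)


fields = {"REF", "ALT", "GT", "DP", "AF", "VAF"}  # illustrative subset
expr = 'ISCALL && (DP>0 && resolve(GT, REF, ALT) &= "T/T")'
attrs = attributes_for_filter(expr, fields)
print(["genomicsdb_query", "-w", "my_workspace", "-i", "chr1", "-a", attrs, "-f", expr])
```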
