genomicsdb/scripts/README.md (+69 -1: 69 additions & 1 deletion)
@@ -3,6 +3,9 @@
`genomicsdb_query` is a simple GenomicsDB query tool that takes a workspace and genomic intervals of the form `<CONTIG>:<START>-<END>`. An interval must specify at least the contig; start and end are optional, e.g. `chr1:100-1000`, `chr1:100` and `chr1` are all valid. Start defaults to 1 and end defaults to the length of the contig when not specified.

Assumption: The workspace should already exist and should have been created with the `vcf2genomicsdb` tool or with `gatk GenomicsDBImport`.
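
The interval rules above can be sketched in a few lines of Python. This is only an illustration of the defaults described here, not the parser used by `genomicsdb_query`; the function name `parse_interval` and the `contig_lengths` mapping are hypothetical.

```python
import re

# <CONTIG>[:<START>[-<END>]] ; the contig is matched lazily so that the
# optional ":START-END" suffix is recognised when present.
_INTERVAL_RE = re.compile(r"^(?P<contig>.+?)(?::(?P<start>\d+)(?:-(?P<end>\d+))?)?$")


def parse_interval(text, contig_lengths):
    """Return (contig, start, end) with the documented defaults applied.

    contig_lengths is a hypothetical dict mapping contig name -> length,
    used only to fill in END when it is omitted.
    """
    match = _INTERVAL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid interval: {text!r}")
    contig = match.group("contig")
    if contig not in contig_lengths:
        raise ValueError(f"unknown contig: {contig!r}")
    # START defaults to 1, END defaults to the contig length.
    start = int(match.group("start")) if match.group("start") else 1
    end = int(match.group("end")) if match.group("end") else contig_lengths[contig]
    return contig, start, end
```

With `contig_lengths = {"chr1": 5000}`, the three forms `chr1:100-1000`, `chr1:100` and `chr1` resolve to the ranges 100-1000, 100-5000 and 1-5000 respectively.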
--list-contigs List contigs configured in vid mapping for the workspace and exit
--list-fields List genomic fields configured in vid mapping for the workspace and exit
--list-partitions List interval partitions(genomicsdb arrays in the workspace) for the given intervals(-i/--interval or -I/--interval-list) or all the intervals for the workspace and exit
--no-cache Do not use cached metadata and files with the genomicsdb query
-i INTERVAL, --interval INTERVAL
genomic intervals over which to operate. The intervals should be specified in the <CONTIG>:<START>-<END> format with START and END optional.
This argument may be specified 0 or more times e.g -i chr1:1-10000 -i chr2 -i chr3:1000.
@@ -49,7 +53,7 @@ options:
1. -s/--sample and -S/--sample-list are mutually exclusive
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
-a ATTRIBUTES, --attributes ATTRIBUTES
- Optional - comma separated list of genomic attributes or fields described in the vid mapping for the query, eg. GT,AC,PL,DP... Defaults to GT
+ Optional - comma separated list of genomic attributes(REF, ALT) and fields described in the vid mapping for the query, eg. GT,AC,PL,DP... Defaults to REF,GT
-f FILTER, --filter FILTER
Optional - genomic filter expression for the query, e.g. 'ISHOMREF' or 'ISHET' or 'REF == "G" && resolve(GT, REF, ALT) &= "T/T" && ALT |= "T"'
Locally caching artifacts from cloud URLs is optional for GenomicsDB metadata and improves performance for metadata/artifacts that are accessed multiple times. A separate caching tool, `genomicsdb_cache`, takes as inputs the workspace, optionally callset/vidmap/loader.json, and optionally the intervals given with the `-i/--interval` or `-I/--interval-list` option. Note that the json files are downloaded to the current working directory whereas other metadata are persisted in `$TMPDIR` or in `/tmp`. Caching is intended to be done once, before the first query for the intervals. Set the env variable `TILEDB_CACHE` to `1` and explicitly pass `-c callset.json -v vidmap.json -l loader.json` to the `genomicsdb_query` command to access locally cached GenomicsDB metadata and json artifacts.
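
A minimal Python sketch of preparing such a cached query run is shown here. It only builds the argument list and environment; the helper name `build_cached_query` is hypothetical, and the `-w` workspace flag is an assumption that does not appear in this section (check `genomicsdb_query --help`). The `-c/-v/-l/-i` flags and `TILEDB_CACHE` are the ones quoted above.

```python
import os


def build_cached_query(workspace, intervals, base_env=None):
    """Return (argv, env) for a genomicsdb_query run that uses cached files."""
    # ASSUMPTION: "-w" is used to pass the workspace; it is not shown above.
    argv = ["genomicsdb_query", "-w", workspace]
    # JSON artifacts previously downloaded to the current working directory.
    argv += ["-c", "callset.json", "-v", "vidmap.json", "-l", "loader.json"]
    for interval in intervals:
        argv += ["-i", interval]
    # Copy the environment and switch on use of the local cache.
    env = dict(os.environ if base_env is None else base_env)
    env["TILEDB_CACHE"] = "1"
    return argv, env
```

The result could then be handed to `subprocess.run(argv, env=env, check=True)` once the cache has been populated with `genomicsdb_cache`.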
@@ -141,4 +146,67 @@ options:
2. either samples and/or intervals using -i/-I/-s/-S options has to be specified
```

<a name="filters"></a>
### Filters and Attributes

Filters can be specified via an optional argument (`-f/--filter`) to `genomicsdb_query`. They are genomic filter expressions for the query and are based on the genomic attributes specified for the query. Genomic attributes are `REF`, `ALT` and all the fields specified during import of the variant files into GenomicsDB. Note that any attribute used in the filter expression must also be specified as an attribute to the query via the `-a/--attributes` argument, unless it is one of the defaults (`REF` and `GT`).

The expressions themselves are enhanced algebraic expressions over the attributes and their values at the locus (contig+position) for the sample. The supported operators are all the binary algebraic operators, e.g. `==, !=, >, <, >=, <=...`, plus two custom operators: `|=`, used with `ALT` to match any of the alternate alleles, and `&=`, used to match a resolved `GT` field with respect to `REF` and `ALT`. The expressions can also contain [predefined aliases](#predefined_aliases) for often used operations. Also see [supported operators](#supported_operators) and try listing the fields (`--list-fields`) to help build the filter expression. See [examples](#examples) for sample filter expressions.
s50   FILTER   Integer   1     "Less than 50% of samples have data"
NS    INFO     Integer   1     "Number of Samples With Data"
DP    INFO     Integer   1     "Total Depth"
AF    INFO     Float     A     "Allele Frequency"
AA    INFO     String    var   "Ancestral Allele"
DB    INFO     Flag      1     "dbSNP membership"
H2    INFO     Flag      1     "HapMap2 membership"
GT    FORMAT   Integer   PP    "Genotype"
VAF   FORMAT   Float     1     "Variant Allele Fraction"
VP    FORMAT   Integer   1     "Variant Priority or clinical significance"
--
Abbreviations :
  A: Number of alternate alleles
  R: Number of alleles (including reference allele)
  G: Number of possible genotypes
  PP or P: Ploidy
  VAR or var: variable length
```
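
When building filters it can help to load rows like the ones above into a small structure. The sketch below assumes each row has exactly five whitespace-separated columns with a double-quoted description; the column names (`name`, `field_class`, `value_type`, `length`, `description`) are illustrative labels chosen here, not names taken from the tool.

```python
import shlex
from typing import NamedTuple


class Field(NamedTuple):
    name: str         # e.g. "AF"
    field_class: str  # e.g. "INFO", "FORMAT" or "FILTER"
    value_type: str   # e.g. "Integer", "Float", "String", "Flag"
    length: str       # e.g. "1", "A", "R", "G", "PP", "var"
    description: str


def parse_field_row(row):
    """Parse one row of the field listing into a Field record."""
    # shlex keeps the quoted description together as a single token.
    parts = shlex.split(row)
    if len(parts) != 5:
        raise ValueError(f"expected 5 columns, got {len(parts)}: {row!r}")
    return Field(*parts)


def numeric_fields(rows):
    """Names of fields whose type allows numeric comparisons in a filter."""
    fields = [parse_field_row(r) for r in rows]
    return [f.name for f in fields if f.value_type in ("Integer", "Float")]
```

For example, `numeric_fields` applied to the `DP`, `AF` and `AA` rows returns only `DP` and `AF`, since `AA` is a String.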

<a name="predefined_aliases"></a>
#### Predefined aliases
1. `ISCALL`: is a variant call; filters out `GT="./."`, for example
2. `ISHOMREF`: homozygous for the reference allele (`REF`)
3. `ISHOMALT`: both alleles are non-REF (`ALT`)
4. `ISHET`: heterozygous, i.e. the alleles in `GT` are different
5. `resolve`: resolves a `GT` field specified as `0/0` or `1|2` into alleles with respect to `REF` and `ALT`. The phase separator is also considered for the comparison.
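
The aliases above can be modelled in Python to make their meaning concrete. This is a rough sketch of the described behaviour, not GenomicsDB's implementation: in particular, treating any missing allele (`.`) as a no-call is an assumption, and all function names are illustrative.

```python
import re


def split_gt(gt):
    """Split '0/1' or '1|2' into (allele indices, separators); '.' -> None."""
    tokens = re.split(r"([/|])", gt)
    alleles = [None if t == "." else int(t) for t in tokens[0::2]]
    separators = tokens[1::2]
    return alleles, separators


def is_call(gt):
    # ASSUMPTION: a genotype with any missing allele counts as a no-call.
    alleles, _ = split_gt(gt)
    return all(a is not None for a in alleles)


def is_homref(gt):
    alleles, _ = split_gt(gt)
    return is_call(gt) and all(a == 0 for a in alleles)


def is_homalt(gt):
    # Follows the wording above: every allele is non-REF.
    alleles, _ = split_gt(gt)
    return is_call(gt) and all(a > 0 for a in alleles)


def is_het(gt):
    alleles, _ = split_gt(gt)
    return is_call(gt) and len(set(alleles)) > 1


def resolve(gt, ref, alt):
    """Map allele indices to bases, keeping the phase separators as-is."""
    alleles, separators = split_gt(gt)
    table = [ref] + list(alt)  # index 0 is REF, 1.. are the ALT alleles
    names = ["." if a is None else table[a] for a in alleles]
    resolved = names[0]
    for sep, name in zip(separators, names[1:]):
        resolved += sep + name
    return resolved
```

Under this model, `resolve("1|2", "G", ["T", "C"])` gives `T|C`, while `resolve("0/0", "G", ["T"])` gives `G/G`.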

<a name="supported_operators"></a>
#### Supported operators

* Standard operators: `+`, `-`, `*`, `/`, `^`
* Assignment operators: `=`, `+=`, `-=`, `*=`, `/=`
* Logical and comparison operators: `&&`, `||`, `==`, `!=`, `>`, `<`, `<=`, `>=`
* Bit manipulation: `&`, `|`, `<<`, `>>`
* String concatenation: `//`
* If-then-else conditional with lazy evaluation: `?:`
* Type conversions: `(float)`, `(int)`
* Array index operator (for use with arrays of Integer/Float), e.g. `AF[0]`
* Standard functions: `abs`, `sin`, `cos`, `tan`, `sinh`, `cosh`, `tanh`, `ln`, `log`, `log10`, `exp`, `sqrt`
* Functions taking an unlimited number of arguments: `min`, `max`, `sum`
* String functions: `str2dbl`, `strlen`, `toupper`
* Array functions: `sizeof`, and access by index, e.g. `AF[2]`
* Custom operators: `|=` used with `ALT`, `&=` used with `resolve(GT, REF, ALT)`
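
The two custom operators are the least familiar entries in this list, so here is a small Python model of how they read. It is a sketch under assumptions: `|=` is modelled as "any alternate allele equals the value" and `&=` as exact string equality on the resolved genotype, separator included; the real operators may be more lenient. The helper names and the sample record are made up for illustration.

```python
def alt_any(alt_alleles, allele):
    """Model of `ALT |= allele`: true if any alternate allele equals it."""
    return any(a == allele for a in alt_alleles)


def gt_match(resolved_gt, pattern):
    """Model of `resolve(GT, REF, ALT) &= pattern`.

    ASSUMPTION: exact comparison, so "T|T" does not match "T/T".
    """
    return resolved_gt == pattern


# Hypothetical record: REF is G, ALT alleles are A and T, genotype is T/T.
record = {"REF": "G", "ALT": ["A", "T"], "RESOLVED_GT": "T/T"}

# Python rendering of: REF == "G" && ALT |= "T" && resolve(GT, REF, ALT) &= "T/T"
keep = (
    record["REF"] == "G"
    and alt_any(record["ALT"], "T")
    and gt_match(record["RESOLVED_GT"], "T/T")
)
```

Here `keep` evaluates to true for the sample record; changing the genotype to the phased `T|T` would make the last clause fail under the exact-match assumption.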

<a name="examples"></a>
#### Example filters:

* `ISCALL && !ISHOMREF`: Filter out no-calls and homozygous reference calls, i.e. keep only calls that are not homozygous reference.
* `ISCALL && (REF == "G" && ALT |= "T" && resolve(GT, REF, ALT) &= "T/T")`: Filter out no-calls and only keep variants where `REF` is G, `ALT` contains T and the genotype is T/T.
* `ISCALL && (DP>0 && resolve(GT, REF, ALT) &= "T/T")`: Filter out no-calls and only keep variants where the genotype is T/T and `DP` is greater than 0.
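
Since every attribute used in a filter also has to be requested with `-a/--attributes` unless it is a default (`REF`, `GT`), a small helper can derive the attribute list from the expression. The sketch below is a hypothetical convenience, not part of `genomicsdb_query`; it assumes you pass in the set of field names known for the workspace (for instance from `--list-fields`) and that listing the defaults explicitly is harmless.

```python
import re

DEFAULT_ATTRIBUTES = ("REF", "GT")


def attributes_for_filter(expression, known_fields):
    """Return a comma separated attribute list covering the filter."""
    # Drop quoted literals so allele values like "T" are not taken as fields.
    without_strings = re.sub(r'"[^"]*"', " ", expression)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_strings)
    attributes = list(DEFAULT_ATTRIBUTES)
    for name in identifiers:
        # Aliases such as ISCALL and functions such as resolve are skipped
        # because they are not in known_fields.
        if name in known_fields and name not in attributes:
            attributes.append(name)
    return ",".join(attributes)
```

For the last example filter above and known fields `REF`, `ALT`, `GT`, `DP`, `AF`, this yields `REF,GT,DP,ALT`, which could be passed as the value of `-a`.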