| 1 | +--- |
| 2 | +title: Tantivy Indexing in OpenObserve |
| 3 | +description: Learn how Tantivy indexing works in OpenObserve, including full-text and secondary indexes, query behaviors with AND and OR operators, and how to verify index usage. |
| 4 | +--- |
| 5 | +This document explains how Tantivy indexing works in OpenObserve: the types of indexes it builds, which query patterns take advantage of them, and how to enable indexing and verify that a query uses it. |
| 6 | + |
| 7 | +> Tantivy indexing is an open-source feature in OpenObserve. |
| 8 | + |
| 9 | +## What is Tantivy? |
| 10 | +Tantivy is the inverted index library used in OpenObserve to accelerate searches. An inverted index keeps a map of values or tokens and the row IDs of the records that contain them. When a user searches for a value, the query can use this index to go directly to the matching rows instead of scanning every log record. |
| 11 | + |
| 12 | +Tantivy builds two kinds of indexes in OpenObserve: |
| 13 | + |
| 14 | +### Full-text index |
| 15 | +A full-text index is built for fields such as `body` or `message` that contain sentences or long text. The field value is split into tokens, and each token is mapped to the records that contain it. |
| 16 | + |
| 17 | +**Example log records** <br> |
| 18 | + |
| 19 | +- Row 1: `body = "POST /api/metrics error"` |
| 20 | +- Row 2: `body = "GET /health ok"` |
| 21 | +- Row 3: `body = "error connecting to database"` |
| 22 | + |
| 23 | +The log body `POST /api/metrics error` is stored as the tokens `POST`, `api`, `metrics`, and `error`. A search for `error` looks up that token in the index and goes straight to the matching records, Row 1 and Row 3. |
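The token-to-rows mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenObserve's or Tantivy's actual implementation: the `tokenize` and `build_full_text_index` helpers are hypothetical, and the tokenizer simply splits on non-alphanumeric characters, which may differ from the real tokenizer.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: split on any run of non-alphanumeric characters.
    # The real tokenizer may normalize or split text differently.
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def build_full_text_index(rows):
    # Map each token to the set of row IDs whose text contains it.
    index = defaultdict(set)
    for row_id, body in rows.items():
        for token in tokenize(body):
            index[token].add(row_id)
    return index


# The three example records from this section.
rows = {
    1: "POST /api/metrics error",
    2: "GET /health ok",
    3: "error connecting to database",
}

full_text_index = build_full_text_index(rows)
print(tokenize(rows[1]))                  # ['POST', 'api', 'metrics', 'error']
print(sorted(full_text_index["error"]))   # [1, 3]
```

A lookup is a single dictionary access on the token, which is why a search for `error` does not need to read Row 2 at all.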
| 24 | + |
| 25 | +### Secondary index |
| 26 | +A secondary index is built for fields that hold a single exact value, such as `k8s_namespace_name`. In this case, the entire field value is treated as one token and indexed. |
| 27 | + |
| 28 | +**Example log records** <br> |
| 29 | + |
| 30 | +- Row 1: `k8s_namespace_name = ingress-nginx` |
| 31 | +- Row 2: `k8s_namespace_name = ziox` |
| 32 | +- Row 3: `k8s_namespace_name = ingress-nginx` |
| 33 | +- Row 4: `k8s_namespace_name = cert-manager` |
| 34 | + |
| 35 | +For `k8s_namespace_name`, the index might look like: |
| 36 | + |
| 37 | +- `ingress-nginx` > [Row 1, Row 3] |
| 38 | +- `ziox` > [Row 2] |
| 39 | +- `cert-manager` > [Row 4] |
| 40 | + |
| 41 | +A query for `k8s_namespace_name = 'ingress-nginx'` retrieves Row 1 and Row 3 directly, without scanning unrelated records. By keeping these indexes, Tantivy avoids full scans across millions or billions of records, so indexed queries can return in milliseconds rather than seconds. |
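The secondary index above can be modeled the same way, with the whole field value acting as the key. The sketch below is illustrative only; `build_secondary_index` is a hypothetical helper, not part of OpenObserve or Tantivy.

```python
from collections import defaultdict


def build_secondary_index(rows):
    # Treat the entire field value as one token and map it to its row IDs.
    index = defaultdict(list)
    for row_id, value in rows.items():
        index[value].append(row_id)
    return dict(index)


# k8s_namespace_name values for the four example records.
namespaces = {
    1: "ingress-nginx",
    2: "ziox",
    3: "ingress-nginx",
    4: "cert-manager",
}

secondary_index = build_secondary_index(namespaces)
print(secondary_index)
# {'ingress-nginx': [1, 3], 'ziox': [2], 'cert-manager': [4]}

# Equivalent of: WHERE k8s_namespace_name = 'ingress-nginx'
print(secondary_index.get("ingress-nginx", []))   # [1, 3]
```

Because the value is never split, an exact-match lookup either finds the key or returns nothing.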
| 42 | + |
| 43 | +## Configure the environment variable |
| 44 | +To enable Tantivy indexing, configure the following environment variable: |
| 45 | +``` |
| 46 | +ZO_ENABLE_INVERTED_INDEX=true |
| 47 | +``` |
| 48 | + |
| 49 | +## Query behavior |
| 50 | +Tantivy optimizes queries differently based on whether the field is full-text or secondary. Using the right operator for each field type ensures the query is served from the index instead of scanning logs. |
| 51 | + |
| 52 | +### Full-text index scenarios |
| 53 | + |
| 54 | +**Correct usage** <br> |
| 55 | + |
| 56 | +- Use `match_all()` for full-text index fields such as `body` or `message`: |
| 57 | +```sql |
| 58 | +-- Return logs whose body contains the token "error" |
| 59 | +WHERE match_all('error'); |
| 60 | +``` |
| 61 | +- Use `NOT` with `match_all()`: |
| 62 | +```sql |
| 63 | +-- Exclude logs whose body contains the token "error" |
| 64 | +WHERE NOT match_all('error'); |
| 65 | +``` |
| 66 | + |
| 67 | +**Inefficient usage** <br> |
| 68 | +```sql |
| 69 | +-- Forces full string equality, bypasses token index |
| 70 | +WHERE body = 'error'; |
| 71 | +``` |
| 72 | + |
| 73 | +### Secondary index scenarios |
| 74 | + |
| 75 | +**Correct usage** |
| 76 | + |
| 77 | +- Use `=` or `IN (...)` for secondary index fields such as `k8s_namespace_name`, `k8s_pod_name`, or `k8s_container_name`: |
| 78 | +```sql |
| 79 | +-- Single value |
| 80 | +WHERE k8s_namespace_name = 'ingress-nginx'; |
| 81 | + |
| 82 | +-- Multiple values |
| 83 | +WHERE k8s_namespace_name IN ('ingress-nginx', 'ziox', 'cert-manager'); |
| 84 | +``` |
| 85 | +- Use `NOT` with `=` or `IN (...)`: |
| 86 | +```sql |
| 87 | +-- Exclude one exact value |
| 88 | +WHERE NOT (k8s_namespace_name = 'ingress-nginx'); |
| 89 | + |
| 90 | +-- Exclude multiple values |
| 91 | +WHERE k8s_namespace_name NOT IN ('ziox', 'cert-manager'); |
| 92 | +``` |
| 93 | + |
| 94 | +**Inefficient usage** |
| 95 | +```sql |
| 96 | +-- Treated as a token search, no advantage over '=' |
| 97 | +WHERE match_all('ingress-nginx'); |
| 98 | +``` |
| 99 | + |
| 100 | +### Mixed scenarios |
| 101 | + |
| 102 | +When a query combines full-text and secondary fields, apply the best operator for each part. |
| 103 | + |
| 104 | +**Correct usage** |
| 105 | + |
| 106 | +```sql |
| 107 | +WHERE match_all('error') |
| 108 | + AND k8s_namespace_name = 'ingress-nginx'; |
| 109 | +``` |
| 110 | + |
| 111 | +- `match_all('error')` uses the full-text index. |
| 112 | +- `k8s_namespace_name = 'ingress-nginx'` uses the secondary index. |
| 113 | + |
| 114 | +**Incorrect usage** |
| 115 | + |
| 116 | +```sql |
| 117 | +-- Both operators used incorrectly |
| 118 | +WHERE body = 'error' |
| 119 | + AND match_all('ingress-nginx'); |
| 120 | +``` |
| 121 | + |
| 122 | +### AND and OR operator behavior |
| 123 | + |
| 124 | +**AND behavior** <br> |
| 125 | + |
| 126 | +- If both sides are indexable, Tantivy intersects the row sets from each index. |
| 127 | +- If one side is not indexable, the indexable side is still accelerated by Tantivy, and the other side is resolved in DataFusion. |
| 128 | + |
| 129 | + |
| 130 | +**Examples** |
| 131 | +```sql |
| 132 | +-- Fast: both sides indexable |
| 133 | +WHERE match_all('error') AND k8s_namespace_name = 'ingress-nginx'; |
| 134 | + |
| 135 | +-- Mixed: one side indexable, one not |
| 136 | +WHERE match_all('error') AND body LIKE '%error%'; |
| 137 | +``` |
| 138 | + |
| 139 | +**OR behavior** |
| 140 | + |
| 141 | +- If all branches of the OR are indexable, Tantivy takes the union of the row sets from each index. |
| 142 | +- If any branch is not indexable, the entire OR is not indexable, and the whole condition is resolved in DataFusion. |
| 143 | + |
| 144 | +**Examples** |
| 145 | +```sql |
| 146 | +-- Fast: both indexable |
| 147 | +WHERE match_all('error') OR k8s_namespace_name = 'ziox'; |
| 148 | + |
| 149 | +-- Slower: one branch (LIKE) is not indexable, so the entire OR is not indexable |
| 150 | +WHERE match_all('error') OR body LIKE '%error%'; |
| 151 | +``` |
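The AND and OR rules above can be expressed as a small decision sketch over sets of row IDs. This is a simplified model under stated assumptions, not the actual planner: `plan_and` and `plan_or` are hypothetical helpers, a value of `None` stands for a condition that cannot be served from the index, and the row sets are made up for illustration.

```python
def plan_and(left_rows, right_rows):
    # Each argument is a set of row IDs from the index,
    # or None if that side is not indexable.
    if left_rows is not None and right_rows is not None:
        # Both sides indexable: intersect the row sets.
        return ("index", left_rows & right_rows)
    indexed = left_rows if left_rows is not None else right_rows
    if indexed is not None:
        # One side indexable: the index narrows the candidate rows,
        # and the other condition is evaluated afterwards.
        return ("index+scan", indexed)
    return ("scan", None)


def plan_or(branches):
    # The index is only usable when every branch is indexable.
    if all(rows is not None for rows in branches):
        return ("index", set().union(*branches))
    return ("scan", None)


# Hypothetical row sets for illustration.
error_rows = {1, 3}      # match_all('error')
ingress_rows = {1, 3}    # k8s_namespace_name = 'ingress-nginx'
ziox_rows = {2}          # k8s_namespace_name = 'ziox'
like_rows = None         # body LIKE '%error%' (not indexable)

print(plan_and(error_rows, ingress_rows))   # ('index', {1, 3})
print(plan_and(error_rows, like_rows))      # ('index+scan', {1, 3})
print(plan_or([error_rows, ziox_rows]))     # ('index', {1, 2, 3})
print(plan_or([error_rows, like_rows]))     # ('scan', None)
```

The asymmetry is the point: with AND, an indexed side can only shrink the result, so it is always safe to use. With OR, a non-indexable branch could match rows the index never returned, so every row has to be checked.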
| 152 | + |
| 153 | +**NOT with grouped conditions** <br> |
| 154 | +```sql |
| 155 | +-- Exclude when either namespace = ziox OR body contains error |
| 156 | +WHERE NOT (k8s_namespace_name = 'ziox' OR match_all('error')); |
| 157 | +``` |
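In set terms, the grouped `NOT` above keeps every row that is in neither branch. The sketch below shows only that set-level meaning, using hypothetical row IDs. It is not a description of how OpenObserve executes the query.

```python
# Hypothetical row sets for illustration.
all_rows = {1, 2, 3, 4}
ziox_rows = {2}       # k8s_namespace_name = 'ziox'
error_rows = {1, 3}   # match_all('error')

# NOT (A OR B): remove the union of both branches from all rows.
remaining = all_rows - (ziox_rows | error_rows)
print(sorted(remaining))   # [4]

# Same result as (NOT A) AND (NOT B), by De Morgan's law.
same = (all_rows - ziox_rows) & (all_rows - error_rows)
print(same == remaining)   # True
```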
| 158 | + |
| 159 | +## Verify if a query is using Tantivy |
| 160 | +To confirm whether a query used the Tantivy inverted index: |
| 161 | + |
| 162 | +1. Open the browser developer tools and go to the **Network** tab. |
| 163 | +2. Run the query and inspect the JSON response of the search request. |
| 164 | +3. Under `took_detail`, check the value of `idx_took`: |
| 165 | + |
| 166 | + - If `idx_took` is greater than `0`, the query used the inverted index. |
| 167 | + - If `idx_took` is `0`, the query did not use the inverted index. |
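The same check can be scripted against a saved response body. The sketch below assumes only what the steps above state, namely that the response JSON has a `took_detail` object containing `idx_took`. The `used_inverted_index` helper and the sample payloads are hypothetical.

```python
import json


def used_inverted_index(response_text):
    # Return True when took_detail.idx_took is greater than 0.
    payload = json.loads(response_text)
    took_detail = payload.get("took_detail") or {}
    return took_detail.get("idx_took", 0) > 0


# Hypothetical response bodies, trimmed to the relevant field.
indexed_response = '{"took_detail": {"idx_took": 7}}'
scanned_response = '{"took_detail": {"idx_took": 0}}'

print(used_inverted_index(indexed_response))   # True
print(used_inverted_index(scanned_response))   # False
```

A missing `took_detail` or `idx_took` is treated as `0` here, which is a choice made for this sketch rather than documented behavior.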