Commit 6ff7969

add Tantivy index guide with query rules and verification (#139)
1 parent 4491f35 commit 6ff7969

3 files changed

+172
-3
lines changed

docs/user-guide/performance/.pages

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ nav:
33
- Download Manager: download-manager.md
44
- Monitor Download Queue Size and Disk Cache Metrics: monitor-download-queue-size-and-disk-cache-metrics.md
55
- Configure Disk Cache Eviction Strategy: disk-cache-strategy.md
6+
- Tantivy Index: tantivy-index.md

docs/user-guide/performance/index.md

Lines changed: 4 additions & 3 deletions
@@ -2,6 +2,7 @@ The Performance section provides tools and configurations to optimize query exec
22

33
**Learn more:**
44

5-
- [Download Manager](download-manager.md)
6-
- [Monitor Download Queue Size and Disk Cache Metrics](monitor-download-queue-size-and-disk-cache-metrics.md)
7-
- [Configure Disk Cache Eviction Strategy](disk-cache-strategy.md)
5+
- [Download Manager](../performance/download-manager/)
6+
- [Monitor Download Queue Size and Disk Cache Metrics](../performance/monitor-download-queue-size-and-disk-cache-metrics/)
7+
- [Configure Disk Cache Eviction Strategy](../performance/disk-cache-strategy/)
8+
- [Tantivy Index](../performance/tantivy-index/)
Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
1+
---
2+
title: Tantivy Indexing in OpenObserve
3+
description: Learn how Tantivy indexing works in OpenObserve, including full-text and secondary indexes, query behaviors with AND and OR operators, and how to verify index usage.
4+
---
5+
This document explains Tantivy indexing in OpenObserve, the types of indexes it builds, how to use the correct query patterns, and how to verify and configure indexing.
6+
7+
> Tantivy indexing is an open-source feature in OpenObserve.
8+
9+
## What is Tantivy?
10+
Tantivy is the inverted index library used in OpenObserve to accelerate searches. An inverted index keeps a map of values or tokens and the row IDs of the records that contain them. When a user searches for a value, the query can use this index to go directly to the matching rows instead of scanning every log record.
11+
12+
Tantivy builds two kinds of indexes in OpenObserve:
13+
14+
## Full-text index
15+
A full-text index is built for fields such as `body` or `message` that contain sentences or long text. The field value is split into tokens, and each token is mapped to the records that contain it.
16+
17+
**Example log records** <br>
18+
19+
- Row 1: `body = "POST /api/metrics error"`
20+
- Row 2: `body = "GET /health ok"`
21+
- Row 3: `body = "error connecting to database"`
22+
23+
The log body `POST /api/metrics error` is stored as tokens `POST`, `api`, `metrics`, `error`. A search for `error` looks up that token in the index and immediately finds the matching records.
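
The token lookup described above can be sketched in a few lines of Python. This is only an illustration of the inverted-index idea, not Tantivy's implementation: the `tokenize` and `build_full_text_index` helpers are hypothetical, and the tokenizer simply splits on non-alphanumeric characters, which may differ from the analyzer OpenObserve actually uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into tokens on any run of non-alphanumeric characters.

    Assumption: a simplified stand-in for the real tokenizer.
    """
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def build_full_text_index(rows):
    """Map each token to the set of row IDs whose text contains it."""
    index = defaultdict(set)
    for row_id, body in rows.items():
        for token in tokenize(body):
            index[token].add(row_id)
    return index


# The three example records from this guide.
rows = {
    1: "POST /api/metrics error",
    2: "GET /health ok",
    3: "error connecting to database",
}

index = build_full_text_index(rows)

# Searching for "error" is a single dictionary lookup, not a scan of every row.
matches = sorted(index["error"])
print(matches)
```

With the three sample rows, the lookup for `error` returns rows 1 and 3 without reading row 2 at all.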
24+
25+
## Secondary index
26+
A secondary index is built for fields that hold a single exact value, such as `k8s_namespace_name`. The entire field value is treated as one token and indexed.
27+
28+
**Example log records** <br>
29+
30+
- Row 1: `k8s_namespace_name = ingress-nginx`
31+
- Row 2: `k8s_namespace_name = ziox`
32+
- Row 3: `k8s_namespace_name = ingress-nginx`
33+
- Row 4: `k8s_namespace_name = cert-manager`
34+
35+
For `k8s_namespace_name`, the index might look like:
36+
37+
- `ingress-nginx` > [Row 1, Row 3]
38+
- `ziox` > [Row 2]
39+
- `cert-manager` > [Row 4]
40+
41+
A query for `k8s_namespace_name = 'ingress-nginx'` retrieves Row 1 and Row 3 directly, without scanning unrelated records. By keeping these indexes, Tantivy avoids full scans across millions or billions of records, so indexed queries can return in milliseconds rather than seconds.
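
As a rough sketch of the same idea in Python, the whole field value can be used as the key of a lookup table. The helper names `build_secondary_index`, `lookup_eq`, and `lookup_in` are hypothetical and exist only for this illustration; they are not part of OpenObserve or Tantivy.

```python
from collections import defaultdict


def build_secondary_index(rows):
    """Map each exact field value to the list of row IDs that carry it."""
    index = defaultdict(list)
    for row_id, value in rows:
        index[value].append(row_id)
    return dict(index)


def lookup_eq(index, value):
    """Rows for `field = value`: one lookup, no scan."""
    return index.get(value, [])


def lookup_in(index, values):
    """Rows for `field IN (...)`: union of the per-value row lists."""
    matched = set()
    for value in values:
        matched.update(index.get(value, []))
    return sorted(matched)


# The four example records from this guide.
rows = [
    (1, "ingress-nginx"),
    (2, "ziox"),
    (3, "ingress-nginx"),
    (4, "cert-manager"),
]

namespace_index = build_secondary_index(rows)
print(lookup_eq(namespace_index, "ingress-nginx"))
print(lookup_in(namespace_index, ["ziox", "cert-manager"]))
```

Because the value is never split, `ingress-nginx` is one key rather than the two tokens `ingress` and `nginx`, which is why exact-match operators suit these fields.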
42+
43+
## Configure the environment variable
44+
To enable Tantivy indexing, configure the following environment variable:
45+
```
46+
ZO_ENABLE_INVERTED_INDEX=true
47+
```
48+
49+
## Query behavior
50+
Tantivy optimizes queries differently based on whether the field is full-text or secondary. Using the right operator for each field type ensures the query is served from the index instead of scanning logs.
51+
52+
### Full-text index scenarios
53+
54+
**Correct usage** <br>
55+
56+
- Use `match_all()` for full-text index fields such as `body` or `message`:
57+
```sql
58+
-- Return logs whose body contains the token "error"
59+
WHERE match_all('error');
60+
```
61+
- Use `NOT` with `match_all()`:
62+
```sql
63+
-- Exclude logs whose body contains the token "error"
64+
WHERE NOT match_all('error');
65+
```
66+
67+
**Inefficient usage** <br>
68+
```sql
69+
-- Forces full string equality, bypasses token index
70+
WHERE body = 'error';
71+
```
72+
73+
### Secondary index scenarios
74+
75+
**Correct usage**
76+
77+
- Use `=` or `IN (...)` for secondary index fields such as `k8s_namespace_name`, `k8s_pod_name`, or `k8s_container_name`.
78+
```sql
79+
-- Single value
80+
WHERE k8s_namespace_name = 'ingress-nginx';
81+
82+
-- Multiple values
83+
WHERE k8s_namespace_name IN ('ingress-nginx', 'ziox', 'cert-manager');
84+
```
85+
- Use `NOT` with `=` or `IN (...)`:
86+
```sql
87+
-- Exclude one exact value
88+
WHERE NOT (k8s_namespace_name = 'ingress-nginx');
89+
90+
-- Exclude multiple values
91+
WHERE k8s_namespace_name NOT IN ('ziox', 'cert-manager');
92+
```
93+
94+
**Inefficient usage**
95+
```sql
96+
-- Treated as a full-text token search instead of an exact-match lookup on the secondary index
97+
WHERE match_all('ingress-nginx');
98+
```
99+
100+
### Mixed scenarios
101+
102+
When a query combines full-text and secondary fields, apply the best operator for each part.
103+
104+
**Correct usage**
105+
106+
```sql
107+
WHERE match_all('error')
108+
AND k8s_namespace_name = 'ingress-nginx';
109+
```
110+
111+
- `match_all('error')` uses full-text index.
112+
- `k8s_namespace_name = 'ingress-nginx'` uses secondary index.
113+
114+
**Incorrect usage**
115+
116+
```sql
117+
-- Both operators used incorrectly: '=' on a full-text field, match_all() for a secondary-index value
118+
WHERE body = 'error'
119+
AND match_all('ingress-nginx');
120+
```
121+
122+
### AND and OR operator behavior
123+
124+
**AND behavior** <br>
125+
126+
- If both sides are indexable, Tantivy intersects the row sets from each index.
127+
- If one side is not indexable, the indexable side is still accelerated by Tantivy, and the other side is resolved in DataFusion.
128+
129+
130+
**Examples**
131+
```sql
132+
-- Fast: both sides indexable
133+
WHERE match_all('error') AND k8s_namespace_name = 'ingress-nginx';
134+
135+
-- Mixed: one side indexable, one not
136+
WHERE match_all('error') AND body LIKE '%error%';
137+
```
138+
139+
**OR behavior**
140+
141+
- If all branches of the OR are indexable, Tantivy takes the union of the row sets from each index.
142+
- If any branch is not indexable, the entire OR is not indexable. The query runs in DataFusion.
143+
144+
**Examples**
145+
```sql
146+
-- Fast: both indexable
147+
WHERE match_all('error') OR k8s_namespace_name = 'ziox';
148+
149+
-- Slower: one branch (LIKE) is not indexable, so the entire OR runs in DataFusion
150+
WHERE match_all('error') OR body LIKE '%error%';
151+
```
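
The AND and OR rules above can be modeled with plain set operations. The following Python sketch is a toy model under stated assumptions, not OpenObserve's query planner: the `Predicate` class, `eval_and`, `eval_or`, and the combined sample rows are all hypothetical. A predicate with `index_rows` set stands for an indexable condition, and one with only `row_filter` stands for a condition such as `LIKE` that must be checked row by row.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple

# Hypothetical sample data: row_id -> (body, k8s_namespace_name).
ROWS: Dict[int, Tuple[str, str]] = {
    1: ("POST /api/metrics error", "ingress-nginx"),
    2: ("GET /health ok", "ziox"),
    3: ("error connecting to database", "ingress-nginx"),
}


@dataclass
class Predicate:
    index_rows: Optional[Set[int]] = None               # set when indexable
    row_filter: Optional[Callable[[int], bool]] = None  # per-row fallback check

    def matches(self, row_id: int) -> bool:
        if self.index_rows is not None:
            return row_id in self.index_rows
        return self.row_filter(row_id)


def eval_and(left: Predicate, right: Predicate):
    """AND: intersect indexable sides, then filter the survivors row by row."""
    candidates = set(ROWS)
    used_index = False
    for pred in (left, right):
        if pred.index_rows is not None:
            candidates &= pred.index_rows
            used_index = True
    for pred in (left, right):
        if pred.index_rows is None:
            candidates = {r for r in candidates if pred.row_filter(r)}
    return candidates, used_index


def eval_or(left: Predicate, right: Predicate):
    """OR: use the index only if every branch is indexable, else scan all rows."""
    if left.index_rows is not None and right.index_rows is not None:
        return left.index_rows | right.index_rows, True
    return {r for r in ROWS if left.matches(r) or right.matches(r)}, False


match_error = Predicate(index_rows={1, 3})    # match_all('error')
ns_ingress = Predicate(index_rows={1, 3})     # k8s_namespace_name = 'ingress-nginx'
ns_ziox = Predicate(index_rows={2})           # k8s_namespace_name = 'ziox'
like_error = Predicate(row_filter=lambda r: "error" in ROWS[r][0])  # body LIKE '%error%'

print(eval_and(match_error, ns_ingress))
print(eval_and(match_error, like_error))
print(eval_or(match_error, ns_ziox))
print(eval_or(match_error, like_error))
```

The second value in each result indicates whether the index was used. Only the last call, an OR with a non-indexable branch, falls back to checking every row.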
152+
153+
**NOT with grouped conditions** <br>
154+
```sql
155+
-- Exclude when either namespace = ziox OR body contains error
156+
WHERE NOT (k8s_namespace_name = 'ziox' OR match_all('error'));
157+
```
158+
159+
## Verify if a query is using Tantivy
160+
To confirm whether a query used the Tantivy inverted index:
161+
162+
1. Open the browser developer tools and go to the **Network** tab.
163+
2. Inspect the query response JSON.
164+
3. Under `took_detail`, check the value of `idx_took`:
165+
166+
- If `idx_took` is greater than `0`, the query used the inverted index.
167+
- If `idx_took` is `0`, the query did not use the inverted index.
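
If you save the response body from the Network tab, the same check can be scripted. The sketch below assumes only what this guide states, namely that the response JSON has a `took_detail` object with an `idx_took` value. The `used_inverted_index` helper and the sample payloads are hypothetical, and a real response contains many more fields.

```python
import json


def used_inverted_index(response_json: str) -> bool:
    """Return True when took_detail.idx_took is greater than 0."""
    response = json.loads(response_json)
    took_detail = response.get("took_detail") or {}
    return took_detail.get("idx_took", 0) > 0


# Hypothetical, trimmed-down payloads for illustration only.
sample_indexed = '{"took_detail": {"idx_took": 7}}'
sample_scanned = '{"took_detail": {"idx_took": 0}}'

print(used_inverted_index(sample_indexed))
print(used_inverted_index(sample_scanned))
```

A missing `took_detail` or `idx_took` is treated as "index not used" here, which is a design choice of this sketch rather than documented behavior.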
