Commit 30da581

vector data type (#2514)
* vector data type
* add version update
* add version update
* fix broken link
1 parent 388730d commit 30da581

8 files changed: +247 -14 lines changed

docs/en/guides/51-ai-functions/02-built-in-functions.md

Lines changed: 7 additions & 2 deletions

@@ -2,6 +2,10 @@
 title: Built-in AI Functions
 ---

+import FunctionDescription from '@site/src/components/FunctionDescription';
+
+<FunctionDescription description="Introduced or updated: v1.2.777"/>
+
 # Built-in AI Functions

 Databend provides built-in AI functions powered by Azure OpenAI Service for seamless integration of AI capabilities into your SQL workflows.
@@ -18,7 +22,7 @@ Databend provides built-in AI functions powered by Azure OpenAI Service for seam

 ## Vector Storage in Databend

-Databend stores embedding vectors using the `ARRAY(FLOAT NOT NULL)` data type, enabling direct similarity calculations with the `cosine_distance` function in SQL.
+Databend stores embedding vectors using the `VECTOR(1536)` data type, enabling direct similarity calculations with the `cosine_distance` function in SQL.

 ## Example: Semantic Search with Embeddings

@@ -28,7 +32,8 @@ CREATE TABLE articles (
     id INT,
     title VARCHAR,
     content VARCHAR,
-    embedding ARRAY(FLOAT NOT NULL)
+    embedding VECTOR(1536),
+    VECTOR INDEX idx_embedding(embedding) distance='cosine'
 );

 -- Store documents with their vector embeddings

docs/en/sql-reference/00-sql-reference/10-data-types/index.md

Lines changed: 1 addition & 0 deletions

@@ -34,6 +34,7 @@ The following is a list of semi-structured data types in Databend:
 | [TUPLE](tuple.md) | N/A | ('2023-02-14','Valentine Day') | An ordered collection of values of different data types, accessed by their index. |
 | [MAP](map.md) | N/A | `{"a":1, "b":2, "c":3}` | A set of key-value pairs where each key is unique and maps to a value. |
 | [VARIANT](variant.md) | JSON | `[1,{"a":1,"b":{"c":2}}]` | Collection of elements of different data types, including `ARRAY` and `OBJECT`. |
+| [VECTOR](vector.md) | N/A | [1.0, 2.1, 3.2] | Multi-dimensional arrays of 32-bit floating-point numbers for machine learning and similarity search operations. |
 | [BITMAP](bitmap.md) | N/A | 0101010101 | A binary data type used to represent a set of values, where each bit represents the presence or absence of a value. |

 ## Data Type Conversions

Lines changed: 142 additions & 0 deletions

@@ -0,0 +1,142 @@
+---
+title: Vector
+---
+
+import FunctionDescription from '@site/src/components/FunctionDescription';
+
+<FunctionDescription description="Introduced or updated: v1.2.777"/>
+
+import EEFeature from '@site/src/components/EEFeature';
+
+<EEFeature featureName='VECTOR INDEX'/>
+
+
+The VECTOR data type stores multi-dimensional arrays of 32-bit floating-point numbers, designed for machine learning, AI applications, and similarity search operations. Each vector has a fixed dimension (length) specified at creation time.
+
+## Syntax
+
+```sql
+column_name VECTOR(<dimension>)
+```
+
+Where:
+- `dimension`: The dimension (length) of the vector. Must be a positive integer with a maximum value of 4096.
+- Elements are 32-bit floating-point numbers.
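
The two constraints listed above (a fixed dimension of at most 4096, and 32-bit elements) can be illustrated with a minimal Python sketch. This is not Databend code: `check_vector` and `MAX_DIMENSION` are hypothetical names introduced only for illustration, and the 4096 limit is simply the figure quoted in the syntax notes. The sketch checks a list against a declared dimension and rounds each element to 32-bit precision with the standard `struct` module.

```python
import struct

MAX_DIMENSION = 4096  # upper bound quoted in the syntax notes (assumption: taken at face value)


def check_vector(values, dimension):
    """Return `values` rounded to float32 if they fit a VECTOR(dimension) column.

    Hypothetical client-side helper; Databend performs its own validation.
    """
    if not isinstance(dimension, int) or not 1 <= dimension <= MAX_DIMENSION:
        raise ValueError(f"dimension must be an integer in 1..{MAX_DIMENSION}")
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    # Pack as little-endian float32 and unpack again to mimic 32-bit storage.
    packed = struct.pack(f"<{dimension}f", *values)
    return list(struct.unpack(f"<{dimension}f", packed))


if __name__ == "__main__":
    print(check_vector([1.0, 2.0, 3.0], 3))   # values exactly representable in float32
    print(check_vector([0.1], 1)[0] == 0.1)   # False: 0.1 is not exact in float32
```

The round trip through `struct` is only a way to show what 32-bit storage does to a value; a client would normally send the literal and let the `::VECTOR(n)` cast handle conversion.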
+
+## Vector Indexing
+
+Databend supports creating vector indexes using the HNSW (Hierarchical Navigable Small World) algorithm for fast approximate nearest neighbor search, delivering **23x faster** query performance.
+
+### Index Syntax
+
+```sql
+VECTOR INDEX index_name(column_name) distance='cosine,l1,l2'
+```
+
+Where:
+- `index_name`: Name of the vector index
+- `column_name`: Name of the VECTOR column to index
+- `distance`: Distance functions to support. Can be `'cosine'`, `'l1'`, `'l2'`, or combinations like `'cosine,l1,l2'`
+
+
+### Supported Distance Functions
+
+| Function | Description | Use Case |
+|----------|-------------|----------|
+| **[cosine_distance](/sql/sql-functions/vector-distance-functions/vector-cosine-distance)** | Calculates cosine distance between vectors | Semantic similarity, text embeddings |
+| **[l1_distance](/sql/sql-functions/vector-distance-functions/vector-l1-distance)** | Calculates L1 distance (Manhattan distance) | Feature comparison, sparse data |
+| **[l2_distance](/sql/sql-functions/vector-distance-functions/vector-l2-distance)** | Calculates L2 distance (Euclidean distance) | Geometric similarity, image features |
+
+## Basic Usage
+
+### Step 1: Create Table with Vector
+
+```sql
+-- Create table with vector index for efficient similarity search
+CREATE OR REPLACE TABLE products (
+    id INT,
+    name VARCHAR,
+    features VECTOR(3),
+    VECTOR INDEX idx_features(features) distance='cosine'
+);
+```
+
+**Note**: The vector index is automatically built when data is inserted into the table.
+
+### Step 2: Insert Vector Data
+
+```sql
+-- Insert product feature vectors
+INSERT INTO products VALUES
+    (1, 'Product A', [1.0, 2.0, 3.0]::VECTOR(3)),
+    (2, 'Product B', [2.0, 1.0, 4.0]::VECTOR(3)),
+    (3, 'Product C', [1.5, 2.5, 2.0]::VECTOR(3)),
+    (4, 'Product D', [3.0, 1.0, 1.0]::VECTOR(3));
+```
+
+### Step 3: Perform Similarity Search
+
+```sql
+-- Find products similar to a query vector [1.2, 2.1, 2.8]
+SELECT
+    id,
+    name,
+    features,
+    cosine_distance(features, [1.2, 2.1, 2.8]::VECTOR(3)) AS distance
+FROM products
+ORDER BY distance ASC
+LIMIT 3;
+```
+
+Result:
+```
+┌─────┬───────────┬───────────────┬──────────────────┐
+│ id  │ name      │ features      │ distance         │
+├─────┼───────────┼───────────────┼──────────────────┤
+│ 2   │ Product B │ [2.0,1.0,4.0] │ 0.5384207        │
+│ 3   │ Product C │ [1.5,2.5,2.0] │ 0.5772848        │
+│ 1   │ Product A │ [1.0,2.0,3.0] │ 0.60447836       │
+└─────┴───────────┴───────────────┴──────────────────┘
+```
+
+**Explanation**: The query finds the 3 most similar products to the search vector `[1.2, 2.1, 2.8]`. Lower cosine distance values indicate higher similarity.
+
+## Unloading and Loading Vector Data
+
+### Unloading Vector Data
+
+```sql
+-- Export vector data to stage
+COPY INTO @mystage/unload/
+FROM (
+    SELECT
+        id,
+        name,
+        features
+    FROM products
+)
+FILE_FORMAT = (TYPE = 'PARQUET');
+```
+
+### Loading Vector Data
+
+```sql
+-- Create target table for import
+CREATE OR REPLACE TABLE products_imported (
+    id INT,
+    name VARCHAR,
+    features VECTOR(3),
+    VECTOR INDEX idx_features(features) distance='cosine'
+);
+
+-- Import vector data
+COPY INTO products_imported (id, name, features)
+FROM (
+    SELECT
+        id,
+        name,
+        features
+    FROM @mystage/unload/
+)
+FILE_FORMAT = (TYPE = 'PARQUET');
+```

docs/en/sql-reference/20-sql-functions/11-ai-functions/02-ai-embedding-vector.md

Lines changed: 6 additions & 2 deletions

@@ -1,8 +1,11 @@
 ---
 title: "AI_EMBEDDING_VECTOR"
-description: "Creating embeddings using the ai_embedding_vector function in Databend"
 ---

+import FunctionDescription from '@site/src/components/FunctionDescription';
+
+<FunctionDescription description="Introduced or updated: v1.2.777"/>
+
 This document provides an overview of the ai_embedding_vector function in Databend and demonstrates how to create document embeddings using this function.

 The main code implementation can be found [here](https://github.com/databendlabs/databend/blob/1e93c5b562bd159ecb0f336bb88fd1b7f9dc4a62/src/common/openai/src/embedding.rs).
@@ -50,7 +53,8 @@ CREATE TABLE documents (
     id INT,
     title VARCHAR,
     content VARCHAR,
-    embedding ARRAY(FLOAT NOT NULL)
+    embedding VECTOR(1536),
+    VECTOR INDEX idx_embedding(embedding) distance='cosine'
 );
 ```

docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/00-vector-cosine-distance.md

Lines changed: 5 additions & 4 deletions

@@ -13,8 +13,8 @@ COSINE_DISTANCE(vector1, vector2)

 ## Arguments

-- `vector1`: First vector (ARRAY(FLOAT NOT NULL))
-- `vector2`: Second vector (ARRAY(FLOAT NOT NULL))
+- `vector1`: First vector (VECTOR Data Type)
+- `vector2`: Second vector (VECTOR Data Type)

 ## Returns

@@ -51,7 +51,8 @@ Create a table with vector data:
 ```sql
 CREATE OR REPLACE TABLE vectors (
     id INT,
-    vec ARRAY(FLOAT NOT NULL)
+    vec VECTOR(3),
+    VECTOR INDEX idx_vec(vec) distance='cosine'
 );

 INSERT INTO vectors VALUES
@@ -65,7 +66,7 @@ Find the vector most similar to [1, 2, 3]:
 ```sql
 SELECT
     vec,
-    COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance
+    COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]::VECTOR(3)) AS distance
 FROM
     vectors
 ORDER BY

docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/01-vector-l2-distance.md

Lines changed: 9 additions & 5 deletions

@@ -1,8 +1,11 @@
 ---
 title: 'L2_DISTANCE'
-description: 'Measuring Euclidean distance between vectors in Databend'
 ---

+import FunctionDescription from '@site/src/components/FunctionDescription';
+
+<FunctionDescription description="Introduced or updated: v1.2.777"/>
+
 Calculates the Euclidean (L2) distance between two vectors, measuring the straight-line distance between them in vector space.

 ## Syntax
@@ -13,8 +16,8 @@ L2_DISTANCE(vector1, vector2)

 ## Arguments

-- `vector1`: First vector (ARRAY(FLOAT NOT NULL))
-- `vector2`: Second vector (ARRAY(FLOAT NOT NULL))
+- `vector1`: First vector (VECTOR Data Type)
+- `vector2`: Second vector (VECTOR Data Type)

 ## Returns

@@ -51,7 +54,8 @@ Create a table with vector data:
 ```sql
 CREATE OR REPLACE TABLE vectors (
     id INT,
-    vec ARRAY(FLOAT NOT NULL)
+    vec VECTOR(3),
+    VECTOR INDEX idx_vec(vec) distance='l2'
 );

 INSERT INTO vectors VALUES
@@ -66,7 +70,7 @@ Find the vector closest to [1, 2, 3] using L2 distance:
 SELECT
     id,
     vec,
-    L2_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance
+    L2_DISTANCE(vec, [1.0000, 2.0000, 3.0000]::VECTOR(3)) AS distance
 FROM
     vectors
 ORDER BY

Lines changed: 74 additions & 0 deletions

@@ -0,0 +1,74 @@
+---
+title: 'L1_DISTANCE'
+---
+
+import FunctionDescription from '@site/src/components/FunctionDescription';
+
+<FunctionDescription description="Introduced or updated: v1.2.777"/>
+
+Calculates the Manhattan (L1) distance between two vectors, measuring the sum of absolute differences between corresponding elements.
+
+## Syntax
+
+```sql
+L1_DISTANCE(vector1, vector2)
+```
+
+## Arguments
+
+- `vector1`: First vector (VECTOR Data Type)
+- `vector2`: Second vector (VECTOR Data Type)
+
+## Returns
+
+Returns a FLOAT value representing the Manhattan (L1) distance between the two vectors. The value is always non-negative:
+- 0: Identical vectors
+- Larger values: Vectors that are farther apart
+
+## Description
+
+The L1 distance, also known as Manhattan distance or taxicab distance, calculates the sum of absolute differences between corresponding elements of two vectors. It's useful for feature comparison and sparse data analysis.
+
+Formula: `L1_DISTANCE(a, b) = |a1 - b1| + |a2 - b2| + ... + |an - bn|`
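
The formula above can be checked by hand with a few lines of Python. This is a plain restatement of the arithmetic, not a description of how Databend evaluates `L1_DISTANCE` internally; the function name `l1_distance` is chosen here only for illustration. The input pair is the one used in the SQL example of this file, `[1.0, 2.0, 3.0]` and `[4.0, 5.0, 6.0]`.

```python
def l1_distance(a, b):
    """Manhattan distance: sum of absolute element-wise differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # |1 - 4| + |2 - 5| + |3 - 6| = 3 + 3 + 3
    print(l1_distance([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 9.0
```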
+
+## Examples
+
+### Basic Usage
+
+```sql
+-- Calculate L1 distance between two vectors
+SELECT L1_DISTANCE([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) AS distance;
+```
+
+Result:
+```
+┌──────────┐
+│ distance │
+├──────────┤
+│ 9.0      │
+└──────────┘
+```
+
+### Using with VECTOR Type
+
+```sql
+-- Create table with VECTOR columns
+CREATE TABLE products (
+    id INT,
+    features VECTOR(3),
+    VECTOR INDEX idx_features(features) distance='l1'
+);
+
+INSERT INTO products VALUES
+    (1, [1.0, 2.0, 3.0]::VECTOR(3)),
+    (2, [2.0, 3.0, 4.0]::VECTOR(3));
+
+-- Find products similar to a query vector using L1 distance
+SELECT
+    id,
+    features,
+    L1_DISTANCE(features, [1.5, 2.5, 3.5]::VECTOR(3)) AS distance
+FROM products
+ORDER BY distance ASC
+LIMIT 5;
+```

docs/en/sql-reference/20-sql-functions/11-vector-distance-functions/index.md

Lines changed: 3 additions & 1 deletion

@@ -10,11 +10,13 @@ This section provides reference information for vector distance functions in Dat
 | Function | Description | Example |
 |----------|-------------|--------|
 | [COSINE_DISTANCE](./00-vector-cosine-distance.md) | Calculates angular distance between vectors (range: 0-1) | `COSINE_DISTANCE([1,2,3], [4,5,6])` |
+| [L1_DISTANCE](./02-vector-l1-distance.md) | Calculates Manhattan (L1) distance between vectors | `L1_DISTANCE([1,2,3], [4,5,6])` |
 | [L2_DISTANCE](./01-vector-l2-distance.md) | Calculates Euclidean (straight-line) distance | `L2_DISTANCE([1,2,3], [4,5,6])` |

 ## Function Comparison

 | Function | Description | Range | Best For | Use Cases |
 |----------|-------------|-------|----------|-----------|
-| [L2_DISTANCE](./01-vector-l2-distance.md) | Euclidean (straight-line) distance | [0, ∞) | When magnitude matters | • Image similarity<br/>• Geographical data<br/>• Anomaly detection<br/>• Feature-based clustering |
 | [COSINE_DISTANCE](./00-vector-cosine-distance.md) | Angular distance between vectors | [0, 1] | When direction matters more than magnitude | • Document similarity<br/>• Semantic search<br/>• Recommendation systems<br/>• Text analysis |
+| [L1_DISTANCE](./02-vector-l1-distance.md) | Manhattan (sum of absolute differences) distance | [0, ∞) | When element-wise absolute differences matter | • Feature comparison<br/>• Sparse data analysis |
+| [L2_DISTANCE](./01-vector-l2-distance.md) | Euclidean (straight-line) distance | [0, ∞) | When magnitude matters | • Image similarity<br/>• Geographical data<br/>• Anomaly detection<br/>• Feature-based clustering |
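
The "magnitude versus direction" distinction in the comparison table can be seen with a small Python sketch. It assumes the usual textbook definitions (cosine distance as one minus cosine similarity, L1 as the sum of absolute differences, L2 as the Euclidean norm of the difference); the function names are illustrative and say nothing about Databend's internal implementation. Two vectors pointing in the same direction but with different lengths are far apart under L1 and L2, yet essentially identical under cosine distance.

```python
import math


def l1_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (straight-line) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude
    print(l1_distance(a, b))      # 6.0
    print(l2_distance(a, b))      # sqrt(14), roughly 3.74
    print(cosine_distance(a, b))  # zero up to floating-point rounding
```

This is why cosine distance suits embeddings whose length carries no meaning, while L1 and L2 suit features where the scale of each component matters.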
