Merged
9 changes: 7 additions & 2 deletions docs/en/guides/51-ai-functions/02-built-in-functions.md
@@ -2,6 +2,10 @@
title: Built-in AI Functions
---

import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.777"/>

# Built-in AI Functions

Databend provides built-in AI functions powered by Azure OpenAI Service for seamless integration of AI capabilities into your SQL workflows.
@@ -18,7 +22,7 @@ Databend provides built-in AI functions powered by Azure OpenAI Service for seam

## Vector Storage in Databend

Databend stores embedding vectors using the `ARRAY(FLOAT NOT NULL)` data type, enabling direct similarity calculations with the `cosine_distance` function in SQL.
Databend stores embedding vectors using the `VECTOR(1536)` data type, enabling direct similarity calculations with the `cosine_distance` function in SQL.

## Example: Semantic Search with Embeddings

@@ -28,7 +32,8 @@ CREATE TABLE articles (
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT NOT NULL)
embedding VECTOR(1536),
VECTOR INDEX idx_embedding(embedding) distance='cosine'
);

-- Store documents with their vector embeddings
Original file line number Diff line number Diff line change
@@ -34,6 +34,7 @@ The following is a list of semi-structured data types in Databend:
| [TUPLE](tuple.md) | N/A | ('2023-02-14','Valentine Day') | An ordered collection of values of different data types, accessed by their index. |
| [MAP](map.md) | N/A | `{"a":1, "b":2, "c":3}` | A set of key-value pairs where each key is unique and maps to a value. |
| [VARIANT](variant.md) | JSON | `[1,{"a":1,"b":{"c":2}}]` | Collection of elements of different data types, including `ARRAY` and `OBJECT`. |
| [VECTOR](vector.md) | N/A | [1.0, 2.1, 3.2] | Fixed-length arrays of 32-bit floating-point numbers for machine learning and similarity search operations. |
| [BITMAP](bitmap.md) | N/A | 0101010101 | A binary data type used to represent a set of values, where each bit represents the presence or absence of a value. |

## Data Type Conversions
142 changes: 142 additions & 0 deletions docs/en/sql-reference/00-sql-reference/10-data-types/vector.md
@@ -0,0 +1,142 @@
---
title: Vector
---

import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.777"/>

import EEFeature from '@site/src/components/EEFeature';

<EEFeature featureName='VECTOR INDEX'/>


The VECTOR data type stores fixed-length arrays of 32-bit floating-point numbers and is designed for machine learning, AI applications, and similarity search. Each vector has a fixed dimension (number of elements), specified when the column is defined.

## Syntax

```sql
column_name VECTOR(<dimension>)
```

Where:
- `dimension`: The dimension (length) of the vector. Must be a positive integer with a maximum value of 4096.
- Elements are 32-bit floating-point numbers.
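
Because each element is a 32-bit float, the raw payload of one vector is 4 bytes per dimension. The Python sketch below works through that arithmetic; the helper name `raw_vector_bytes` is hypothetical, and the figures ignore any storage overhead, compression, or index structures that Databend may add.

```python
import struct

BYTES_PER_ELEMENT = 4  # each element is a 32-bit float
MAX_DIMENSION = 4096   # documented upper bound on <dimension>


def raw_vector_bytes(dimension: int) -> int:
    """Raw payload size of one VECTOR(dimension) value (hypothetical helper)."""
    if not 1 <= dimension <= MAX_DIMENSION:
        raise ValueError(f"dimension must be between 1 and {MAX_DIMENSION}")
    return dimension * BYTES_PER_ELEMENT


# Packing three float32 values confirms the 4-bytes-per-element assumption.
assert len(struct.pack("<3f", 1.0, 2.1, 3.2)) == 3 * BYTES_PER_ELEMENT

print(raw_vector_bytes(3))     # 12
print(raw_vector_bytes(1536))  # 6144
print(raw_vector_bytes(4096))  # 16384
```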

## Vector Indexing

Databend supports creating vector indexes using the HNSW (Hierarchical Navigable Small World) algorithm for fast approximate nearest neighbor search, delivering **23x faster** query performance.

### Index Syntax

```sql
VECTOR INDEX index_name(column_name) distance='cosine,l1,l2'
```

Where:
- `index_name`: Name of the vector index.
- `column_name`: Name of the VECTOR column to index.
- `distance`: Distance functions the index supports. Accepts `'cosine'`, `'l1'`, `'l2'`, or a comma-separated combination such as `'cosine,l1,l2'`.


### Supported Distance Functions

| Function | Description | Use Case |
|----------|-------------|----------|
| **[cosine_distance](/sql/sql-functions/vector-distance-functions/vector-cosine-distance)** | Calculates cosine distance between vectors | Semantic similarity, text embeddings |
| **[l1_distance](/sql/sql-functions/vector-distance-functions/vector-l1-distance)** | Calculates L1 distance (Manhattan distance) | Feature comparison, sparse data |
| **[l2_distance](/sql/sql-functions/vector-distance-functions/vector-l2-distance)** | Calculates L2 distance (Euclidean distance) | Geometric similarity, image features |

## Basic Usage

### Step 1: Create Table with Vector

```sql
-- Create table with vector index for efficient similarity search
CREATE OR REPLACE TABLE products (
id INT,
name VARCHAR,
features VECTOR(3),
VECTOR INDEX idx_features(features) distance='cosine'
);
```

**Note**: The vector index is automatically built when data is inserted into the table.

### Step 2: Insert Vector Data

```sql
-- Insert product feature vectors
INSERT INTO products VALUES
(1, 'Product A', [1.0, 2.0, 3.0]::VECTOR(3)),
(2, 'Product B', [2.0, 1.0, 4.0]::VECTOR(3)),
(3, 'Product C', [1.5, 2.5, 2.0]::VECTOR(3)),
(4, 'Product D', [3.0, 1.0, 1.0]::VECTOR(3));
```

### Step 3: Perform Similarity Search

```sql
-- Find products similar to a query vector [1.2, 2.1, 2.8]
SELECT
id,
name,
features,
cosine_distance(features, [1.2, 2.1, 2.8]::VECTOR(3)) AS distance
FROM products
ORDER BY distance ASC
LIMIT 3;
```

Result:
```
┌─────┬───────────┬───────────────┬──────────────────┐
│ id │ name │ features │ distance │
├─────┼───────────┼───────────────┼──────────────────┤
│ 2 │ Product B │ [2.0,1.0,4.0] │ 0.5384207 │
│ 3 │ Product C │ [1.5,2.5,2.0] │ 0.5772848 │
│ 1 │ Product A │ [1.0,2.0,3.0] │ 0.60447836 │
└─────┴───────────┴───────────────┴──────────────────┘
```

**Explanation**: The query finds the 3 most similar products to the search vector `[1.2, 2.1, 2.8]`. Lower cosine distance values indicate higher similarity.
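
As a cross-check outside Databend, the Python sketch below computes the distances directly from the formula, assuming `cosine_distance` is defined as 1 minus cosine similarity. Under that assumption the ranking comes out as Product A, Product C, Product B, with much smaller distances than the result table above shows. It is worth re-running the query before relying on the table's numbers.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same rows and query vector as the SQL example.
products = {
    "Product A": [1.0, 2.0, 3.0],
    "Product B": [2.0, 1.0, 4.0],
    "Product C": [1.5, 2.5, 2.0],
    "Product D": [3.0, 1.0, 1.0],
}
query = [1.2, 2.1, 2.8]

# Sort ascending by distance, mirroring ORDER BY distance ASC LIMIT 3.
ranked = sorted(
    ((name, cosine_distance(vec, query)) for name, vec in products.items()),
    key=lambda pair: pair[1],
)
for name, distance in ranked[:3]:
    print(f"{name}: {distance:.4f}")
# Product A: 0.0032
# Product C: 0.0330
# Product B: 0.0740
```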

## Unloading and Loading Vector Data

### Unloading Vector Data

```sql
-- Export vector data to stage
COPY INTO @mystage/unload/
FROM (
SELECT
id,
name,
features
FROM products
)
FILE_FORMAT = (TYPE = 'PARQUET');
```

### Loading Vector Data

```sql
-- Create target table for import
CREATE OR REPLACE TABLE products_imported (
id INT,
name VARCHAR,
features VECTOR(3),
VECTOR INDEX idx_features(features) distance='cosine'
);

-- Import vector data
COPY INTO products_imported (id, name, features)
FROM (
SELECT
id,
name,
features
FROM @mystage/unload/
)
FILE_FORMAT = (TYPE = 'PARQUET');
```
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
---
title: "AI_EMBEDDING_VECTOR"
description: "Creating embeddings using the ai_embedding_vector function in Databend"
---

import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.777"/>

This document provides an overview of the ai_embedding_vector function in Databend and demonstrates how to create document embeddings using this function.

The main code implementation can be found [here](https://github.com/databendlabs/databend/blob/1e93c5b562bd159ecb0f336bb88fd1b7f9dc4a62/src/common/openai/src/embedding.rs).
@@ -50,7 +53,8 @@ CREATE TABLE documents (
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT NOT NULL)
embedding VECTOR(1536),
VECTOR INDEX idx_embedding(embedding) distance='cosine'
);
```

Original file line number Diff line number Diff line change
@@ -13,8 +13,8 @@ COSINE_DISTANCE(vector1, vector2)

## Arguments

- `vector1`: First vector (ARRAY(FLOAT NOT NULL))
- `vector2`: Second vector (ARRAY(FLOAT NOT NULL))
- `vector1`: First vector (VECTOR Data Type)
- `vector2`: Second vector (VECTOR Data Type)

## Returns

@@ -51,7 +51,8 @@ Create a table with vector data:
```sql
CREATE OR REPLACE TABLE vectors (
id INT,
vec ARRAY(FLOAT NOT NULL)
vec VECTOR(3),
VECTOR INDEX idx_vec(vec) distance='cosine'
);

INSERT INTO vectors VALUES
@@ -65,7 +66,7 @@ Find the vector most similar to [1, 2, 3]:
```sql
SELECT
vec,
COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance
COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]::VECTOR(3)) AS distance
FROM
vectors
ORDER BY
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
---
title: 'L2_DISTANCE'
description: 'Measuring Euclidean distance between vectors in Databend'
---

import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.777"/>

Calculates the Euclidean (L2) distance between two vectors, measuring the straight-line distance between them in vector space.

## Syntax
@@ -13,8 +16,8 @@ L2_DISTANCE(vector1, vector2)

## Arguments

- `vector1`: First vector (ARRAY(FLOAT NOT NULL))
- `vector2`: Second vector (ARRAY(FLOAT NOT NULL))
- `vector1`: First vector (VECTOR Data Type)
- `vector2`: Second vector (VECTOR Data Type)

## Returns

@@ -51,7 +54,8 @@ Create a table with vector data:
```sql
CREATE OR REPLACE TABLE vectors (
id INT,
vec ARRAY(FLOAT NOT NULL)
vec VECTOR(3),
VECTOR INDEX idx_vec(vec) distance='l2'
);

INSERT INTO vectors VALUES
@@ -66,7 +70,7 @@ Find the vector closest to [1, 2, 3] using L2 distance:
SELECT
id,
vec,
L2_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance
L2_DISTANCE(vec, [1.0000, 2.0000, 3.0000]::VECTOR(3)) AS distance
FROM
vectors
ORDER BY
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
title: 'L1_DISTANCE'
---

import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.777"/>

Calculates the Manhattan (L1) distance between two vectors, measuring the sum of absolute differences between corresponding elements.

## Syntax

```sql
L1_DISTANCE(vector1, vector2)
```

## Arguments

- `vector1`: First vector (VECTOR Data Type)
- `vector2`: Second vector (VECTOR Data Type)

## Returns

Returns a FLOAT value representing the Manhattan (L1) distance between the two vectors. The value is always non-negative:
- 0: Identical vectors
- Larger values: Vectors that are farther apart

## Description

The L1 distance, also known as Manhattan distance or taxicab distance, calculates the sum of absolute differences between corresponding elements of two vectors. It's useful for feature comparison and sparse data analysis.

Formula: `L1_DISTANCE(a, b) = |a1 - b1| + |a2 - b2| + ... + |an - bn|`

## Examples

### Basic Usage

```sql
-- Calculate L1 distance between two vectors
SELECT L1_DISTANCE([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) AS distance;
```

Result:
```
┌──────────┐
│ distance │
├──────────┤
│ 9.0 │
└──────────┘
```

### Using with VECTOR Type

```sql
-- Create table with VECTOR columns
CREATE TABLE products (
id INT,
features VECTOR(3),
VECTOR INDEX idx_features(features) distance='l1'
);

INSERT INTO products VALUES
(1, [1.0, 2.0, 3.0]::VECTOR(3)),
(2, [2.0, 3.0, 4.0]::VECTOR(3));

-- Find products similar to a query vector using L1 distance
SELECT
id,
features,
L1_DISTANCE(features, [1.5, 2.5, 3.5]::VECTOR(3)) AS distance
FROM products
ORDER BY distance ASC
LIMIT 5;
```
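
The L1 formula can be checked by hand. The Python sketch below applies it to the vectors used in the two examples above (the `l1_distance` helper is illustrative, not Databend code). Both product rows are exactly 1.5 away from the query vector, so `ORDER BY distance ASC` alone does not determine their relative order.

```python
def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute element-wise differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b))


# Basic usage example: |1-4| + |2-5| + |3-6|
print(l1_distance([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 9.0

# Rows from the products table and the query vector from the SELECT.
rows = {1: [1.0, 2.0, 3.0], 2: [2.0, 3.0, 4.0]}
query = [1.5, 2.5, 3.5]
for row_id, features in rows.items():
    print(row_id, l1_distance(features, query))
# 1 1.5
# 2 1.5
```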
Original file line number Diff line number Diff line change
@@ -10,11 +10,13 @@ This section provides reference information for vector distance functions in Dat
| Function | Description | Example |
|----------|-------------|--------|
| [COSINE_DISTANCE](./00-vector-cosine-distance.md) | Calculates angular distance between vectors (range: 0-2) | `COSINE_DISTANCE([1,2,3], [4,5,6])` |
| [L1_DISTANCE](./02-vector-l1-distance.md) | Calculates Manhattan (L1) distance between vectors | `L1_DISTANCE([1,2,3], [4,5,6])` |
| [L2_DISTANCE](./01-vector-l2-distance.md) | Calculates Euclidean (straight-line) distance | `L2_DISTANCE([1,2,3], [4,5,6])` |

## Function Comparison

| Function | Description | Range | Best For | Use Cases |
|----------|-------------|-------|----------|-----------|
| [L2_DISTANCE](./01-vector-l2-distance.md) | Euclidean (straight-line) distance | [0, ∞) | When magnitude matters | • Image similarity<br/>• Geographical data<br/>• Anomaly detection<br/>• Feature-based clustering |
| [COSINE_DISTANCE](./00-vector-cosine-distance.md) | Angular distance between vectors | [0, 2] | When direction matters more than magnitude | • Document similarity<br/>• Semantic search<br/>• Recommendation systems<br/>• Text analysis |
| [L1_DISTANCE](./02-vector-l1-distance.md) | Manhattan (L1) distance: sum of absolute element-wise differences | [0, ∞) | When per-element absolute differences matter | • Feature comparison<br/>• Sparse data analysis |
| [L2_DISTANCE](./01-vector-l2-distance.md) | Euclidean (straight-line) distance | [0, ∞) | When magnitude matters | • Image similarity<br/>• Geographical data<br/>• Anomaly detection<br/>• Feature-based clustering |
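
To make the "direction versus magnitude" distinction concrete, the Python sketch below compares a vector with a scaled copy of itself. The helper functions are illustrative re-implementations of the standard formulas (with cosine distance assumed to be 1 minus cosine similarity), not Databend code. Cosine distance stays at zero because the direction is unchanged, while L1 and L2 grow with the difference in magnitude.

```python
import math


def l1_distance(a, b):
    """Sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]  # same direction as v, twice the magnitude

print(abs(cosine_distance(v, scaled)) < 1e-9)  # True: direction is identical
print(l1_distance(v, scaled))                  # 6.0
print(round(l2_distance(v, scaled), 4))        # 3.7417
```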