Skip to content

Latest commit

 

History

History
1647 lines (1264 loc) · 53.4 KB

File metadata and controls

1647 lines (1264 loc) · 53.4 KB

MatrixOne Python SDK

PyPI version Python Support Documentation Status License

A comprehensive Python SDK for MatrixOne that provides SQLAlchemy-like interface for database operations, vector search, fulltext search, snapshot management, PITR, restore operations, table cloning, and mo-ctl integration.

📚 Complete Documentation | 🚀 Quick Start Guide | 🧠 Vector Search & IVF Monitoring | 📖 API Reference

Note: This SDK implementation is generated and optimized through AI-assisted development, leveraging advanced code generation techniques to ensure comprehensive API coverage, type safety, and modern Python best practices.

✨ Features

  • 🚀 High Performance: Optimized for MatrixOne database operations with connection pooling
  • 🔄 Async Support: Full async/await support with AsyncClient for non-blocking operations
  • 🧠 Vector Search: Advanced vector similarity search with HNSW and IVF indexing algorithms
    • Support for f32 and f64 precision vectors
    • Multiple distance metrics (L2, Cosine, Inner Product)
    • IVF Index Health Monitoring with get_ivf_stats() - Critical for production!
    • IVF LIMIT BY RANK - Fine-grained ranking control (pre/post/force modes)
    • Configurable index parameters for performance tuning
  • 🔍 Fulltext Search: Powerful fulltext indexing and search with BM25 and TF-IDF algorithms
    • Natural language and boolean search modes
    • Multi-column fulltext indexes
    • Relevance scoring and ranking
  • 📊 Metadata Analysis: Comprehensive table and column metadata analysis with statistics
    • Table size and row count statistics
    • Column data distribution analysis
    • Index usage metrics
  • 🔍 Secondary Index Verification: Verify consistency of secondary indexes with main table
    • Get all secondary index table names
    • Get specific index table by index name
    • Verify row counts across main table and all indexes in single query
  • 📂 Stage Management: Comprehensive external stage operations for centralized data storage
    • Create and manage stages (file system, S3, cloud storage)
    • Load data from stages with automatic path resolution
    • Stage-scoped load operations for convenience
    • Transaction support for atomic stage operations
  • 📥 Data Loading: Pandas-style bulk data loading with intuitive interface
    • read_csv(), read_json(), read_parquet() methods (pandas-compatible API)
    • Load from local files or external stages (stage:// protocol)
    • Pandas-compatible parameters (sep, quotechar, skiprows, names, encoding)
    • Support for inline data loading and compression
    • Transaction-aware loading for data consistency
  • 📤 Data Export: Pandas-style data export with intuitive interface
    • to_csv() and to_jsonl() methods (pandas-compatible API)
    • Export to local files or external stages (stage:// protocol)
    • Support for raw SQL, SQLAlchemy select(), and MatrixOne queries
    • Customizable CSV options (sep, quotechar, lineterminator)
    • Transaction-aware exports for data consistency
  • 📸 Snapshot Management: Create and manage database snapshots at multiple levels
  • Point-in-Time Recovery: PITR functionality for precise data recovery
  • 🔄 Table Cloning: Clone databases and tables efficiently with data replication
    • SQLAlchemy-style statement builders (clone_table(), clone_database())
    • Snapshot-based point-in-time cloning
    • Cross-tenant cloning with TO ACCOUNT support
    • IF NOT EXISTS for idempotent operations
  • 🌿 Branch Management: Git-style version control for databases and tables
    • Create isolated branches for development and testing
    • Compare differences between branches with count-only optimization
    • Merge branches with conflict resolution strategies
    • Point-in-time branching from snapshots
    • SQLAlchemy-style statement builders (create_table_branch(), diff_table_branch(), etc.)
    • Advanced DIFF output: COUNT, LIMIT, FILE export, snapshot on both sides
    • Cross-tenant branching with TO ACCOUNT support
  • 👥 Account Management: Comprehensive user, role, and permission management
  • 📊 Pub/Sub: Real-time publication and subscription support
  • 🔧 Version Management: Automatic backend version detection and compatibility checking
  • 🛡️ Type Safety: Full type hints support with comprehensive documentation
  • 📚 SQLAlchemy Integration: Seamless SQLAlchemy integration with enhanced ORM features
  • 🔗 Enhanced Query Building: Advanced query building with logical operations (logical_and, logical_or, logical_not)
  • 🪝 Connection Hooks: Pre/post connection hooks for custom initialization logic
  • 🛠️ mo-diag CLI Tool: Interactive diagnostic tool for MatrixOne database maintenance
    • Index health monitoring and verification
    • IVF/HNSW vector index status inspection
    • Table statistics and metadata analysis
    • Interactive shell with Tab completion and command history
    • Non-interactive mode for scripting and automation
    • Batch operations on tables and indexes

⚠️ Important: Column Naming Convention

🚨 CRITICAL: Always use lowercase with underscores (snake_case) for column names!

MatrixOne does not support SQL standard double-quoted identifiers in queries, which causes issues with camelCase column names when using SQLAlchemy ORM.

# ❌ DON'T: CamelCase column names (will fail in SELECT queries)
class User(Base):
    userName = Column(String(50))      # CREATE succeeds, SELECT fails!
    userId = Column(Integer)           # Will cause SQL syntax errors

# ✅ DO: Use lowercase with underscores (snake_case)
class User(Base):
    user_name = Column(String(50))     # Works perfectly
    user_id = Column(Integer)          # All operations succeed

Why this matters:

  • ✅ CREATE TABLE works with both styles (uses backticks)
  • ✅ INSERT works with both styles
  • SELECT fails with camelCase (uses double quotes, not supported by MatrixOne)

Example of the problem:

# CamelCase generates: SELECT "userName" FROM user  ❌ Fails!
# snake_case generates: SELECT user_name FROM user  ✅ Works!

🚀 Installation

From PyPI (Stable Release)

pip install matrixone-python-sdk

From test.pypi (Latest Pre-release)

pip install \
    --index-url https://test.pypi.org/simple/ \
    --extra-index-url https://pypi.org/simple/ \
    matrixone-python-sdk

Note: The --extra-index-url is required to install dependencies (PyMySQL, SQLAlchemy, etc.) from the official PyPI.

Development Installation

Using Virtual Environment (Recommended)

git clone https://github.com/matrixorigin/matrixone.git
cd matrixone/clients/python

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Quick setup with Makefile (includes all dev dependencies)
make dev-setup

# Or manual setup
pip install -e '.[dev]'  # Includes pyarrow, Faker, pytest, etc.

Using Conda Environment

git clone https://github.com/matrixorigin/matrixone.git
cd matrixone/clients/python

# Create conda environment
conda create -n matrixone-dev python=3.10
conda activate matrixone-dev

# Quick setup with Makefile
make dev-setup

# Or manual setup
pip install -e '.[dev]'

Direct Installation (Not Recommended)

git clone https://github.com/matrixorigin/matrixone.git
cd matrixone/clients/python

# Quick setup with Makefile
make dev-setup

# Or manual setup
pip install -e '.[dev]'

Testing Environment Setup

For Testing and Examples

# Install test dependencies (includes pyarrow for Parquet support)
pip install -e '.[test]'

# Or use requirements files
pip install -r requirements-sqlalchemy20.txt  # SQLAlchemy 2.x
pip install -r requirements-sqlalchemy14.txt  # SQLAlchemy 1.4.x

For Load Data Features

The SDK includes comprehensive LOAD DATA functionality with support for:

  • CSV/TSV files: Basic delimited file loading
  • JSONLINE files: JSON Lines format with object/array structures
  • Parquet files: Columnar format loading (requires pyarrow)
  • Inline data: Direct string data loading
  • Stage loading: Loading from named external stages

Dependencies for Load Data:

# Required for Parquet file support
pip install pyarrow>=10.0.0

# For generating test data
pip install Faker>=10.0.0

# Or install everything at once
pip install -e '.[dev]'  # Includes pyarrow, Faker, pytest, etc.

Quick Start

Basic Usage

from matrixone import Client

# Create and connect to MatrixOne
client = Client()
client.connect(
    host='localhost',
    port=6001,
    user='root',
    password='111',
    database='test'
)

# Execute queries
result = client.execute("SELECT 1 as test")
print(result.fetchall())

# Get backend version (auto-detected)
version = client.get_backend_version()
print(f"MatrixOne version: {version}")

client.disconnect()

📝 Connection Parameters

The connect() method requires keyword arguments (not positional):

  • database - Required, no default value
  • host - Default: 'localhost'
  • port - Default: 6001
  • user - Default: 'root'
  • password - Default: '111'

Minimal connection (uses all defaults):

client.connect(database='test')

By default, all features (IVF, HNSW, fulltext) are automatically enabled via on_connect=[ConnectionAction.ENABLE_ALL].

Transaction Management (Recommended)

Use client.session() for atomic transactions. All operations within a session succeed or fail together:

from matrixone import Client
from sqlalchemy import select, insert, update, delete

client = Client()
client.connect(database='test')

# Basic transaction with automatic commit/rollback
with client.session() as session:
    # All operations are atomic
    session.execute(insert(User).values(name='Alice', age=30))
    session.execute(update(User).where(User.age < 18).values(status='minor'))
    
    # Query within transaction
    stmt = select(User).where(User.age > 25)
    result = session.execute(stmt)
    users = result.scalars().all()
    # Commits automatically on success, rolls back on error

client.disconnect()

Key Benefits:

  • ✅ Atomic operations - all succeed or fail together
  • ✅ Automatic rollback on errors
  • ✅ Access to all MatrixOne managers (snapshots, clones, load_data, etc.)
  • ✅ Full SQLAlchemy ORM support

Data Loading (Pandas-Style)

Load bulk data from CSV, JSON, or Parquet files with pandas-compatible API:

from matrixone import Client
from matrixone.orm import declarative_base, Column, Integer, String

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    email = Column(String(255))

client = Client()
client.connect(database='test')

# Create table
client.create_table(User)

# Basic CSV load (pandas-style)
client.load_data.read_csv('users.csv', table=User)

# CSV with header (pandas-style)
client.load_data.read_csv('users.csv', table=User, skiprows=1)

# Custom separator (pandas-style)
client.load_data.read_csv('users.txt', table=User, sep='|')

# Tab-separated (pandas-style)
client.load_data.read_csv('users.tsv', table=User, sep='\t')

# JSON Lines (pandas-style)
client.load_data.read_json('events.jsonl', table='events', lines=True, orient='records')

# Parquet (pandas-style)
client.load_data.read_parquet('data.parquet', table=User)

# Load from stage
client.load_data.read_csv('stage://s3_stage/users.csv', table=User)
client.load_data.read_csv_stage('s3_stage', 'users.csv', table=User)

# Inline data
csv_data = "1,Alice,alice@example.com\\n2,Bob,bob@example.com\\n"
client.load_data.read_csv(csv_data, table=User, inline=True)

client.disconnect()

Key Features:

  • ✅ Pandas-compatible API (read_csv(), read_json(), read_parquet())
  • ✅ Pandas-style parameters (sep, quotechar, skiprows, names, encoding)
  • ✅ Load from local files or external stages (stage:// protocol)
  • ✅ Support for compression, parallel loading, and inline data

Data Export (Pandas-Style)

Export query results to CSV or JSONL with pandas-compatible API:

from matrixone import Client
from sqlalchemy import select

client = Client()
client.connect(database='test')

# Basic CSV export (pandas-style)
client.export.to_csv('/tmp/users.csv', "SELECT * FROM users")

# With custom separator
client.export.to_csv('/tmp/users.tsv', "SELECT * FROM users", sep='\t')

# Export with SQLAlchemy
stmt = select(User).where(User.age > 25)
client.export.to_csv('/tmp/adults.csv', stmt, sep='|')

# Export to JSONL
client.export.to_jsonl('/tmp/users.jsonl', "SELECT * FROM users")

# Export to external stage (using stage:// protocol)
client.export.to_csv('stage://s3_stage/backup.csv', stmt)

# Export to external stage (using convenience method)
client.export.to_csv_stage('s3_stage', 'backup2.csv', stmt)

# Export within transaction
with client.session() as session:
    stmt = select(Product).where(Product.stock > 0)
    session.export.to_csv('/tmp/in_stock.csv', stmt)

client.disconnect()

Key Features:

  • ✅ Pandas-compatible API (to_csv(), to_jsonl())
  • ✅ Convenience methods for stage exports (to_csv_stage(), to_jsonl_stage())
  • ✅ Support for raw SQL, SQLAlchemy, and MatrixOne queries
  • ✅ Export to local files or external stages (stage:// protocol)
  • ✅ Customizable CSV options (sep, quotechar, lineterminator)

Wrapping Existing SQLAlchemy Sessions (For Legacy Projects)

If you have existing SQLAlchemy code, you can wrap your sessions with MatrixOne features without refactoring:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from matrixone import Client
from matrixone.session import Session as MatrixOneSession

# Your existing SQLAlchemy setup
engine = create_engine('mysql+pymysql://root:111@localhost:6001/test')
SessionFactory = sessionmaker(bind=engine)
sqlalchemy_session = SessionFactory()

# Create MatrixOne client
mo_client = Client()
mo_client.connect(host='localhost', port=6001, user='root', password='111', database='test')

# Wrap your existing session with MatrixOne features
mo_session = MatrixOneSession(
    client=mo_client,
    wrap_session=sqlalchemy_session
)

try:
    # Your existing SQLAlchemy operations still work
    result = mo_session.execute("SELECT * FROM users")
    
    # Now you can also use MatrixOne-specific features
    mo_session.stage.create_s3('backup_stage', bucket='my-backups', path='')
    mo_session.snapshots.create('daily_backup', level='database')
    mo_session.load_data.read_csv('/data/users.csv', table='users')
    
    mo_session.commit()
finally:
    mo_session.close()

Use Cases:

  • 🔄 Gradual migration from pure SQLAlchemy to MatrixOne
  • 🏢 Adding MatrixOne features to existing enterprise applications
  • 📦 Legacy code modernization without complete refactoring
  • 🔧 Testing MatrixOne features alongside existing code

SQLAlchemy ORM Integration

The SDK provides seamless SQLAlchemy integration with ORM-style operations:

from matrixone import Client
from matrixone.orm import Base, Column, Integer, String, select, insert, update, delete

# Define ORM models
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    email = Column(String(255))
    age = Column(Integer)

client = Client()
client.connect(database='test')

# Create table from model
client.create_table(User)

# ORM-style INSERT
stmt = insert(User).values(name='John', email='john@example.com', age=30)
client.execute(stmt)

# ORM-style SELECT
stmt = select(User).where(User.age > 25)
result = client.execute(stmt)
for user in result.scalars():
    print(f"User: {user.name}, Age: {user.age}")

# ORM-style UPDATE
stmt = update(User).where(User.id == 1).values(email='newemail@example.com')
client.execute(stmt)

# ORM-style DELETE
stmt = delete(User).where(User.age < 18)
client.execute(stmt)

client.disconnect()

Recommended Practices:

  • ✅ Use session() for multi-statement transactions
  • ✅ Use ORM-style statements (select, insert, update, delete)
  • ✅ Use client.execute() for single-statement operations
  • ✅ Prefer SQLAlchemy statements over raw SQL strings

Async Usage

Full async/await support with AsyncClient and async sessions:

import asyncio
from matrixone import AsyncClient
from sqlalchemy import select, insert, update

async def main():
    client = AsyncClient()
    await client.connect(database='test')
    
    # Basic async query
    result = await client.execute(select(User).where(User.age > 25))
    print(result.fetchall())
    
    # Async transaction
    async with client.session() as session:
        # All operations are atomic
        await session.execute(insert(User).values(name='Alice', age=30))
        await session.execute(update(User).where(User.age < 18).values(status='minor'))
        
        # Concurrent queries with asyncio.gather
        user_result, order_result = await asyncio.gather(
            session.execute(select(User)),
            session.execute(select(Order))
        )
        # Commits automatically
    
    await client.disconnect()

asyncio.run(main())

Async Benefits:

  • ✅ Non-blocking I/O operations
  • ✅ Concurrent execution with asyncio.gather()
  • ✅ Perfect for FastAPI, aiohttp, and async frameworks
  • ✅ Same transaction guarantees as sync version

Snapshot Management

Create and manage database snapshots. Use session() for atomic snapshot + clone operations:

from matrixone import SnapshotLevel

# Single snapshot operation
snapshot = client.snapshots.create(
    name='my_snapshot',
    level=SnapshotLevel.DATABASE,
    database='production'
)

# List snapshots
snapshots = client.snapshots.list()
for snap in snapshots:
    print(f"Snapshot: {snap.name}, Created: {snap.created_at}")

# Atomic snapshot + clone in transaction
with client.session() as session:
    # Create snapshot
    snapshot = session.snapshots.create(
        name='daily_backup',
        level=SnapshotLevel.DATABASE,
        database='production'
    )
    
    # Clone from snapshot atomically
    session.clone.clone_database(
        target_db='production_copy',
        source_db='production',
        snapshot_name='daily_backup'
    )
    # Both operations commit together

Vector Search

from matrixone.sqlalchemy_ext import create_vector_column
from sqlalchemy import Column, BigInteger, String, Text
from matrixone.orm import declarative_base
import numpy as np

# Define vector table model
Base = declarative_base()

class Document(Base):
    __tablename__ = 'documents'
    # IMPORTANT: HNSW index requires BigInteger primary key
    id = Column(BigInteger, primary_key=True, autoincrement=True)
    title = Column(String(200))
    content = Column(Text)
    embedding = create_vector_column(384, precision='f32')  # 384-dim vector

# Create table using client API
client.create_table(Document)

# Enable and create HNSW index
client.vector_ops.enable_hnsw()
client.vector_ops.create_hnsw(
    'documents',  # table name - positional argument
    name='idx_embedding',
    column='embedding',
    m=16,
    ef_construction=200
)

# Insert document with vector using client API
client.insert(Document, {
    'title': 'AI Research Paper',
    'content': 'Advanced machine learning techniques...',
    'embedding': np.random.rand(384).tolist()
})

# Vector similarity search
query_vector = np.random.rand(384).tolist()
results = client.vector_ops.similarity_search(
    'documents',  # table name - positional argument
    vector_column='embedding',
    query_vector=query_vector,
    limit=5,
    distance_type='cosine'
)

# ⭐ CRITICAL: Monitor IVF index health for production
stats = client.vector_ops.get_ivf_stats('documents', 'embedding')
counts = stats['distribution']['centroid_count']
balance_ratio = max(counts) / min(counts) if min(counts) > 0 else float('inf')

print(f"Index Health: {len(counts)} centroids, balance ratio: {balance_ratio:.2f}")
if balance_ratio > 2.5:
    print("⚠️  Index needs rebuilding for optimal performance!")

Fulltext Search

# Enable and create fulltext index using client API
client.fulltext_index.enable_fulltext()
client.fulltext_index.create(
    'documents',  # table name - positional argument
    name='ftidx_content',
    columns=['title', 'content']
)

# Boolean search with encourage (like natural language search)
from matrixone.sqlalchemy_ext.fulltext_search import boolean_match

results = client.query(
    Document.title,
    Document.content,
    boolean_match('title', 'content').encourage('machine learning techniques')
).execute()

# Boolean search with must/should/must_not operators
results = client.query(
    Document.title,
    Document.content,
    boolean_match('title', 'content')
        .must('machine')
        .must('learning')
        .must_not('basic')
).execute()

Data Loading with Stages

Load data from external sources using stages. Use session() for atomic multi-file loading:

from matrixone.orm import Base, Column, Integer, String

# Define ORM model
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    email = Column(String(255))

# Create S3 stage using simple interface
client.stage.create_s3(
    name='data_stage',
    bucket='my-bucket',
    path='data/',
    aws_key_id='your_key',
    aws_secret_key='your_secret',
    region='us-east-1'
)

# Load data from stage using ORM model (pandas-style)
client.load_data.read_csv_stage('data_stage', 'users.csv', table=User)

# Atomic multi-file loading in transaction
with client.session() as session:
    # Create local stage
    session.stage.create_local('import_stage', '/data/imports/')
    
    # Load multiple files atomically (pandas-style)
    session.load_data.read_csv('/data/users.csv', table=User)
    session.load_data.read_csv('/data/orders.csv', table=Order)
    
    # All loads commit together

Stage Benefits:

  • ✅ Centralized data source configuration
  • ✅ Support for S3, local filesystem, and cloud storage
  • ✅ Automatic path resolution
  • ✅ Transaction support for atomic operations

Metadata Analysis

# Get table metadata with statistics
metadata_rows = client.metadata.scan(
    dbname='test',
    tablename='documents',
    columns='*'  # Get structured MetadataRow objects
)

for row in metadata_rows:
    print(f"Column: {row.col_name}")
    print(f"  Nulls: {row.null_cnt}")
    print(f"  Rows: {row.rows_cnt}")
    print(f"  Size: {row.origin_size}")

# Get table statistics
brief_stats = client.metadata.get_table_brief_stats(
    dbname='test',
    tablename='documents'
)
table_stats = brief_stats['documents']
print(f"Total rows: {table_stats['row_cnt']}")
print(f"Original size: {table_stats['original_size']} bytes")
print(f"Compressed size: {table_stats['compress_size']} bytes")

Secondary Index Verification

# Get all secondary index tables
index_tables = client.get_secondary_index_tables('my_table')
print(f"Found {len(index_tables)} secondary indexes")

# Get specific index by name
physical_table = client.get_secondary_index_table_by_name('my_table', 'idx_name')
print(f"Index 'idx_name' -> {physical_table}")

# Verify all indexes have same count as main table
try:
    count = client.verify_table_index_counts('my_table')
    print(f"✓ Verification passed! Row count: {count}")
except ValueError as e:
    print(f"✗ Index mismatch detected:")
    print(e)
    # Error includes all count details for debugging

Version Management

# Check if feature is available
if client.is_feature_available('snapshot_creation'):
    snapshot = client.snapshots.create('my_snapshot', 'cluster')
else:
    hint = client.get_version_hint('snapshot_creation')
    print(f"Feature not available: {hint}")

# Check version compatibility
if client.check_version_compatibility('3.0.0', '>='):
    print("Backend supports 3.0.0+ features")

MatrixOne Version Support

The SDK automatically detects MatrixOne backend versions and handles compatibility:

  • Development Version: 8.0.30-MatrixOne-v999.0.0 (highest priority)
  • Release Version: 8.0.30-MatrixOne-v3.0.03.0.0
  • Legacy Format: MatrixOne 3.0.13.0.1
# Check if running development version
if client.is_development_version():
    print("Running development version - all features available")
else:
    print(f"Running release version: {client.get_backend_version()}")

🛠️ mo-diag - Interactive Diagnostic Tool

The mo-diag command-line tool provides an interactive shell for diagnosing and maintaining MatrixOne databases, with a special focus on vector indexes, secondary indexes, and table statistics.

Installation

After installing the SDK, mo-diag is automatically available as a command:

pip install matrixone-python-sdk
mo-diag --help

Quick Start

Interactive Mode

Launch the interactive shell to execute multiple diagnostic commands:

# Connect to default localhost
mo-diag --database test

# Connect to remote database
mo-diag --host 192.168.1.100 --port 6001 --user admin --password secret --database production

Interactive Features:

  • 🔍 Tab Completion: Press Tab to auto-complete commands, table names, and database names
  • ⬆️⬇️ Command History: Use arrow keys to browse command history
  • 🔎 History Search: Press Ctrl+R to search command history
  • 🎨 Colored Output: Clear visual feedback with syntax highlighting
  • 💾 Persistent History: Command history saved to ~/.mo_diag_history

Non-Interactive Mode

Execute single commands directly for scripting and automation:

# Check IVF index status
mo-diag -d test -c "show_ivf_status"

# Get detailed table statistics
mo-diag -d test -c "show_table_stats my_table -a -d"

# Execute SQL query
mo-diag -d test -c "sql SELECT COUNT(*) FROM my_table"

# Flush table and all indexes
mo-diag -d test -c "flush_table my_table"

CDC Task Shortcuts

Manage Change Data Capture directly from the shell or scripts. The interactive commands (cdc_tasks, cdc_task, cdc_create, cdc_drop) now have matching non-interactive helpers via mo-diag cdc.

# Summaries (same as running cdc_tasks inside the shell)
mo-diag cdc show

# Inspect a task with watermark threshold control
mo-diag cdc show nightly_sync --details --threshold=5m

# Launch the guided creator with a forced mode
mo-diag cdc create --table-level

# Drop a task with explicit confirmation override
mo-diag cdc drop nightly_sync --force

Available Commands

Index Management

show_indexes <table> [database]

  • Display all indexes for a table including IVF, HNSW, Fulltext, and regular indexes
  • Shows physical table names, index types, and statistics
  • Includes object counts, row counts, and sizes for vector/fulltext indexes
MO-DIAG[test]> show_indexes ivf_health_demo_docs

📊 Secondary Indexes for 'test.ivf_health_demo_docs'

*************************** 1. row ***************************
      Index Name: idx_embedding_ivf_v2
       Algorithm: ivfflat
      Table Type: metadata
  Physical Table: __mo_index_secondary_0199e725-0a7a-77b8-b689-ccdd0a33f581
         Columns: embedding
      Statistics:
                   - Objects: 1
                   - Rows: 7
                   - Compressed Size: 940 B
                   - Original Size: 1.98 KB

*************************** 2. row ***************************
      Index Name: idx_embedding_ivf_v2
       Algorithm: ivfflat
      Table Type: centroids
  Physical Table: __mo_index_secondary_0199e725-0a7b-706e-8f0a-a50edc3621a1
         Columns: embedding
      Statistics:
                   - Objects: 1
                   - Rows: 17
                   - Compressed Size: 3.09 KB
                   - Original Size: 6.83 KB

*************************** 3. row ***************************
      Index Name: idx_embedding_ivf_v2
       Algorithm: ivfflat
      Table Type: entries
  Physical Table: __mo_index_secondary_0199e725-0a7c-77f4-8d0b-48fd8258098a
         Columns: embedding
      Statistics:
                   - Objects: 1
                   - Rows: 1,000
                   - Compressed Size: 156.34 KB
                   - Original Size: 176.07 KB

Total: 3 index tables (1 ivfflat with 3 physical tables)

Note: IVF indexes have 3 physical tables (metadata, centroids, entries)

show_all_indexes [database]

  • Health report for all tables with secondary indexes in the database
  • Row count consistency checks
  • IVF/HNSW/Fulltext index status
MO-DIAG[test]> show_all_indexes

📊 Index Health Report for Database 'test':
========================================================================================================================

✓ HEALTHY (3 tables)
------------------------------------------------------------------------------------------------------------------------
Table Name                          | Indexes  | Row Count            | Notes
------------------------------------------------------------------------------------------------------------------------
cms_all_content_chunk_info          | 3        | ✓ 32,712 rows        | -
demo_mixed_indexes                  | 4        | ✓ 20 rows            | -
ivf_health_demo_docs                | 1        | 300 rows             | IVF: 17 centroids, 300 vectors

========================================================================================================================
Summary:
  ✓ 3 healthy tables
  Total: 3 tables with indexes

💡 Tip: Use 'verify_counts <table>' or 'show_ivf_status' for detailed diagnostics

verify_counts <table> [database]

  • Verify row count consistency between main table and all its secondary indexes
  • Highlights any mismatches
MO-DIAG[test]> verify_counts my_table

📊 Row Count Verification for 'test.my_table'
════════════════════════════════════════════════════════════════════════════
Main table: 10,000 rows
────────────────────────────────────────────────────────────────────────────
✓ __mo_index_secondary_xxx: 10,000 rows
✓ __mo_index_unique_yyy: 10,000 rows
════════════════════════════════════════════════════════════════════════════
✅ PASSED: All index tables match (10,000 rows)

Vector Index Monitoring

show_ivf_status [database] [-v] [-t table]

  • Display IVF index building status and centroid distribution
  • -v: Verbose mode with detailed centroid information
  • -t <table>: Filter by specific table
MO-DIAG[test]> show_ivf_status

📊 IVF Index Status in 'test':
════════════════════════════════════════════════════════════════════════════
Table                          | Index                     | Column               | Centroids  | Vectors      | Balance    | Status
────────────────────────────────────────────────────────────────────────────
ivf_health_demo_docs           | idx_embedding_ivf_v2      | embedding            | 17         | 1,000        | 2.35       | ✓ active

Table Statistics

show_table_stats <table> [database] [-t] [-a] [-d]

  • Display table metadata and statistics
  • -t: Include tombstone statistics
  • -a: Include all indexes (hierarchical view with -d)
  • -d: Show detailed object lists
MO-DIAG[test]> show_table_stats ivf_health_demo_docs -a -d

📊 Detailed Table Statistics for 'test.ivf_health_demo_docs':
======================================================================================================================================================

Table: ivf_health_demo_docs
  Objects: 1 | Rows: 1,000 | Null: 0 | Original: 176.03 KB | Compressed: 156.24 KB
  
  Objects:
  Object Name                                        | Rows         | Null Cnt   | Original Size   | Compressed Size
  ----------------------------------------------------------------------------------------------------------------------------------------------------
  0199e729-642e-71e0-b338-67c4980ee294_00000         | 1000         | 0          | 176.03 KB       | 156.24 KB

Index: idx_embedding_ivf_v2
  └─ (metadata): __mo_index_secondary_0199e725-0a7a-77b8-b689-ccdd0a33f581:272851
     Objects: 1 | Rows: 7 | Null: 0 | Original: 1.98 KB | Compressed: 940 B
     
     Objects:
     Object Name                                        | Rows         | Null Cnt   | Original Size   | Compressed Size
     ----------------------------------------------------------------------------------------------------------------------------------------------------
     0199e729-642a-7d36-ac37-0ae17325f7ec_00000         | 7            | 0          | 1.98 KB         | 940 B
  
  └─ (centroids): __mo_index_secondary_0199e725-0a7b-706e-8f0a-a50edc3621a1:272852
     Objects: 1 | Rows: 17 | Null: 0 | Original: 6.83 KB | Compressed: 3.09 KB
     
     Objects:
     Object Name                                        | Rows         | Null Cnt   | Original Size   | Compressed Size
     ----------------------------------------------------------------------------------------------------------------------------------------------------
     0199e729-6431-794d-9c13-d10f2e4a59e7_00000         | 17           | 0          | 6.83 KB         | 3.09 KB
  
  └─ (entries): __mo_index_secondary_0199e725-0a7c-77f4-8d0b-48fd8258098a:272853
     Objects: 1 | Rows: 1,000 | Null: 0 | Original: 696.11 KB | Compressed: 626.37 KB
     
     Objects:
     Object Name                                        | Rows         | Null Cnt   | Original Size   | Compressed Size
     ----------------------------------------------------------------------------------------------------------------------------------------------------
     0199e729-6438-7db1-bcb6-7b74c59b0aee_00000         | 1000         | 0          | 696.11 KB       | 626.37 KB

Note:

  • Hierarchical structure: Table → Index → Physical Tables → Objects
  • IVF index has 3 physical tables: metadata (7 rows), centroids (17 rows), entries (1,000 rows)
  • Physical table name format: table_name:table_id
  • Use -t flag to include tombstone statistics for deleted/updated data

Database Operations

flush_table <table> [database]

  • Flush main table and all its secondary index physical tables
  • Includes IVF metadata/centroids/entries, HNSW, Fulltext, and regular indexes
  • Requires sys user privileges
MO-DIAG[test]> flush_table ivf_health_demo_docs

🔄 Flushing table: test.ivf_health_demo_docs
✓ Main table flushed: ivf_health_demo_docs
📋 Found 3 index physical tables

Index: idx_embedding_ivf_v2
  ✓ metadata: __mo_index_secondary_0199e725-0a7a-77b8-b689-ccdd0a33f581
  ✓ centroids: __mo_index_secondary_0199e725-0a7b-706e-8f0a-a50edc3621a1
  ✓ entries: __mo_index_secondary_0199e725-0a7c-77f4-8d0b-48fd8258098a

📊 Summary:
  Main table: ✓ flushed
  Index tables: 3/3 flushed successfully

Note:

  • Automatically discovers and flushes all index physical tables
  • For IVF indexes: flushes metadata, centroids, and entries tables
  • Requires sys user privileges
  • Recommended before backups or after bulk operations

tables [database]

  • List all tables in current or specified database

databases

  • List all databases (highlights current database)

use <database>

  • Switch to a different database

sql <SQL statement>

  • Execute arbitrary SQL query
MO-DIAG[test]> sql SELECT COUNT(*) FROM my_table

col0
----
10000

1 row(s) returned

Utility Commands

history [n | -c]

  • Show last n commands (default: 20)
  • -c: Clear command history

help [command]

  • Show help for all commands or specific command

exit / quit

  • Exit the interactive shell

Command-Line Options

usage: mo-diag [-h] [--host HOST] [--port PORT] [--user USER]
               [--password PASSWORD] [--database DATABASE]
               [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
               [--command COMMAND]

options:
  --host HOST           Database host (default: localhost)
  --port PORT           Database port (default: 6001)
  --user USER           Database user (default: root)
  --password PASSWORD   Database password (default: 111)
  --database DATABASE   Database name (optional)
  -d DATABASE           Short form of --database
  --log-level LEVEL     Logging level (default: ERROR)
  --command COMMAND     Execute single command and exit
  -c COMMAND            Short form of --command

Use Cases

1. Monitor IVF Index Health in Production

# Quick check on index status
mo-diag -d production -c "show_ivf_status -v"

# Detailed index inspection with physical table stats
mo-diag -d production -c "show_indexes my_vector_table"

# Verify centroid distribution balance
mo-diag -d production -c "show_ivf_status -t my_vector_table -v"

2. Debug Index Count Mismatches

# Check row consistency
mo-diag -d test -c "verify_counts my_table"

# Get health report for all indexes
mo-diag -d test -c "show_all_indexes"

# Flush to sync if needed
mo-diag -d test -c "flush_table my_table"

3. Analyze Table Storage

# Get table statistics with all indexes
mo-diag -d test -c "show_table_stats my_table -a"

# Detailed object-level analysis
mo-diag -d test -c "show_table_stats my_table -a -d"

# Include tombstone statistics
mo-diag -d test -c "show_table_stats my_table -a -t -d"

4. Automated Maintenance Scripts

#!/bin/bash
# daily_index_check.sh

DATABASES=("prod_db1" "prod_db2" "prod_db3")

for db in "${DATABASES[@]}"; do
    echo "Checking $db..."
    mo-diag -d "$db" -c "show_all_indexes" >> /var/log/mo_diag_daily.log
    mo-diag -d "$db" -c "show_ivf_status" >> /var/log/mo_diag_daily.log
done

Tips and Best Practices

  1. Regular Health Checks: Run show_all_indexes daily to catch index issues early
  2. Monitor IVF Balance: Use show_ivf_status -v to ensure even centroid distribution
  3. Before Major Operations: Always verify_counts before bulk updates or migrations
  4. Production Debugging: Use non-interactive mode for logging and monitoring
  5. Tab Completion: Leverage Tab completion to avoid typos in table/database names
  6. Command History: Use Ctrl+R to quickly find and re-execute previous diagnostic commands
  7. Flush Regularly: If you notice index count mismatches, flush_table can help sync

Troubleshooting

Issue: Tab completion not working

  • Solution: Ensure prompt_toolkit>=3.0.0 is installed: pip install prompt_toolkit

Issue: "Unknown database" error

  • Solution: Create the database first: mo-diag -d test -c "sql CREATE DATABASE IF NOT EXISTS test"

Issue: Permission denied for flush operations

  • Solution: Connect as sys user or a user with sufficient privileges

Issue: IVF index shows 0 centroids

  • Solution: The index might be empty or still building. Check with show_table_stats <table> -a

pitr = client.pitr.create_cluster_pitr( name='cluster_pitr', range_value=7, range_unit='d' )

Restore to specific time

restore_point = datetime(2024, 1, 15, 10, 30, 0) client.pitr.restore_to_time(restore_point)


### Account Management

```python
# Create user
user = client.account.create_user(
    username='newuser',
    password='password123',
    description='New user account'
)

# Create role
role = client.account.create_role(
    role_name='analyst',
    description='Data analyst role'
)

# Grant privileges
client.account.grant_privilege(
    user='newuser',
    role='analyst',
    privileges=['SELECT', 'INSERT']
)

Pub/Sub Operations

# Create publication
publication = client.pubsub.create_publication(
    name='data_changes',
    tables=['users', 'orders'],
    description='User and order changes'
)

# Create subscription
subscription = client.pubsub.create_subscription(
    name='data_sync',
    publication_name='data_changes',
    target_tables=['users_backup', 'orders_backup']
)

Stage Management

# Create a local file system stage (simplified)
stage = client.stage.create_local('my_data_stage', './data/', comment='Main data import stage')

# Or use the full method with explicit URL
stage = client.stage.create('my_data_stage', 'file:///path/to/data/', comment='Main data import stage')

# Create an S3 stage with simplified credentials
s3_stage = client.stage.create_s3(
    'production_stage',
    'my-bucket/data/',
    aws_key='AKIAIOSFODNN7EXAMPLE',
    aws_secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    aws_region='us-east-1',
    comment='Production data stage'
)

# Or use environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
s3_stage = client.stage.create_s3('production_stage', 'my-bucket/data/')

# List all stages
stages = client.stage.list()
for stage in stages:
    print(f"{stage.name}: {stage.url}")

# Load data from stage (Method 1: via client.load_data, pandas-style)
client.load_data.read_csv_stage('my_data_stage', 'users.csv', table=User)

# Load data from stage (Method 2: via stage object - simpler!)
stage = client.stage.get('my_data_stage')
stage.load_csv('users.csv', User)
stage.load_json('events.jsonl', Event)
stage.load_parquet('orders.parq', Order)

# Batch load multiple files
results = stage.load_files({
    'users.csv': User,
    'events.jsonl': Event,
    'orders.parq': Order
})

# Modify stage
client.stage.alter('my_data_stage', comment='Updated comment')

# Drop stage
client.stage.drop('my_data_stage')

Advanced Features

PITR (Point-in-Time Recovery)

# Create PITR for cluster


## Configuration

### Connection Parameters

```python
client = Client(
    connection_timeout=30,
    query_timeout=300,
    auto_commit=True,
    charset='utf8mb4',
    sql_log_mode='auto',  # 'off', 'simple', 'auto', 'full'
    slow_query_threshold=1.0
)

Logging Configuration

from matrixone import MatrixOneLogger

logger = MatrixOneLogger(
    level=logging.INFO,
    sql_log_mode='auto',  # 'off', 'simple', 'auto', 'full'
    slow_query_threshold=1.0,
    max_sql_display_length=500
)

client = Client(logger=logger)

Error Handling

The SDK provides comprehensive error handling with helpful messages:

from matrixone.exceptions import (
    ConnectionError,
    QueryError,
    VersionError,
    SnapshotError
)

try:
    snapshot = client.snapshots.create('test', 'cluster')
except VersionError as e:
    print(f"Version compatibility error: {e}")
except SnapshotError as e:
    print(f"Snapshot operation failed: {e}")

Examples

Check out the examples/ directory for comprehensive usage examples:

Basic Examples:

  • example_01_basic_connection.py - Basic database operations and connectivity
  • example_02_account_management.py - User and role management
  • example_03_async_operations.py - Async/await operations with AsyncClient
  • example_04_transaction_management.py - Transaction handling and rollback
  • example_05_snapshot_restore.py - Snapshot creation and restore operations
  • example_06_sqlalchemy_integration.py - SQLAlchemy ORM integration
  • example_07_advanced_features.py - Advanced SDK features
  • example_08_pubsub_operations.py - Pub/Sub operations
  • example_09_logger_integration.py - Logging configuration and integration
  • example_10_version_management.py - Version compatibility management
  • example_11_matrixone_version_demo.py - MatrixOne version detection

Vector Search Examples:

  • example_12_vector_basics.py - Vector data types and basic distance functions
  • example_13_vector_indexes.py - IVF and HNSW index management and performance
  • example_14_vector_search.py - Vector similarity search operations
  • example_15_vector_advanced.py - Advanced vector operations and optimization

ORM and Advanced Examples:

  • example_18_snapshot_orm.py - Snapshot operations with ORM
  • example_19_sqlalchemy_style_orm.py - SQLAlchemy-style ORM patterns
  • example_20_sqlalchemy_engine_integration.py - SQLAlchemy engine integration
  • example_21_advanced_orm_features.py - Advanced ORM features
  • example_22_unified_sql_builder.py - Unified SQL query builder
  • example_24_query_update.py - Query and update operations
  • example_25_metadata_operations.py - Table metadata analysis and statistics

Load Data Examples:

  • example_23_load_data_operations.py - Comprehensive LOAD DATA operations (CSV, TSV, JSONLINE, Parquet, Inline, Stage)

Stage Management Examples:

  • example_26_stage_operations.py - External stage management and data loading from stages

Branch Builder Examples:

  • example_branch_builder.py - SQLAlchemy-style branch statement builders (create, diff, merge)

Clone Builder Examples:

  • example_clone_builder.py - SQLAlchemy-style clone statement builders (table, database)

Specialized Examples:

  • example_connection_hooks.py - Connection hooks for custom initialization
  • example_dynamic_logging.py - Dynamic logging configuration
  • example_ivf_stats_complete.py - IVF index statistics and analysis

Running Examples

Run all examples:

make examples

Run specific examples:

make example-basic    # Basic connection and database operations
make example-async    # Async/await operations
make example-vector   # Vector search and similarity

Run individual example files:

# Basic examples
python examples/example_01_basic_connection.py
python examples/example_03_async_operations.py

# Vector examples
python examples/example_12_vector_basics.py
python examples/example_13_vector_indexes.py
python examples/example_14_vector_search.py

# Load Data examples
python examples/example_23_load_data_operations.py

# Stage management examples
python examples/example_26_stage_operations.py

# Metadata and advanced examples
python examples/example_25_metadata_operations.py
python examples/example_22_unified_sql_builder.py

Customizing Connection Parameters

You can override the default connection parameters using environment variables:

# Set custom connection parameters
export MATRIXONE_HOST=localhost
export MATRIXONE_PORT=6001
export MATRIXONE_USER=root
export MATRIXONE_PASSWORD=111
export MATRIXONE_DATABASE=test

# Run examples with custom parameters
make examples

Or set them inline:

MATRIXONE_HOST=localhost MATRIXONE_USER=admin make example-basic

Available environment variables:

  • MATRIXONE_HOST: Database host (default: 127.0.0.1)
  • MATRIXONE_PORT: Database port (default: 6001)
  • MATRIXONE_USER: Database username (default: root)
  • MATRIXONE_PASSWORD: Database password (default: 111)
  • MATRIXONE_DATABASE: Database name (default: test)
  • MATRIXONE_CHARSET: Character set (default: utf8mb4)
  • MATRIXONE_CONNECT_TIMEOUT: Connection timeout (default: 30)
  • MATRIXONE_AUTOCOMMIT: Auto-commit mode (default: true)

🧪 Testing & Development

Testing Overview

The MatrixOne Python SDK includes comprehensive testing with support for multiple Python and SQLAlchemy versions.

📖 For detailed testing documentation, see TESTING.md

Quick Test Commands

# Check environment and dependencies
make check-env

# Install test dependencies
make install-sqlalchemy14  # For SQLAlchemy 1.4.x
make install-sqlalchemy20  # For SQLAlchemy 2.0.x

# Run tests
make test-offline          # Fast mock-based tests
make test-online           # Database integration tests (requires MatrixOne)
make test-sqlalchemy14     # Full test suite with SQLAlchemy 1.4.x
make test-sqlalchemy20     # Full test suite with SQLAlchemy 2.0.x
make test-matrix           # Test both SQLAlchemy versions

Test Types

  • Offline Tests: Mock-based unit tests (450+ tests, ~3-5 seconds)
  • Online Tests: Database integration tests (100+ tests, ~20-30 seconds)
  • Matrix Tests: Cross-version compatibility testing (Python 3.10/3.11 + SQLAlchemy 1.4/2.0)

Development Setup

# Check environment requirements
make check-env

# Setup development environment (includes all dev dependencies)
make dev-setup

# Install specific dependency sets
make install-deps-dev      # Install development dependencies
make install-deps-test     # Install test dependencies  
make install-sqlalchemy14  # Install SQLAlchemy 1.4.x dependencies
make install-sqlalchemy20  # Install SQLAlchemy 2.0.x dependencies

# Run all tests
make test

# Run examples
make examples

Dependency Management:

  • make install-deps-dev: Installs all development dependencies including pyarrow, Faker, pytest, etc.
  • make install-deps-test: Installs testing dependencies including pyarrow (required for Parquet)
  • make install-sqlalchemy14/20: Installs version-specific SQLAlchemy dependencies

Important: pyarrow>=10.0.0 is required for LOAD DATA Parquet functionality. Install via make dev-setup, make install-deps-dev, or make install-deps-test. For detailed testing instructions, environment setup, and troubleshooting, see TESTING.md.

📚 Documentation

📖 Online Documentation

Full Documentation on ReadTheDocs

Complete documentation with examples, API reference, and production best practices.

Local Documentation Guides

Getting Started:

Core Features:

Advanced Features:

Production Guide:

Reference:

Generate HTML documentation:

make docs          # Generate HTML documentation
make docs-serv     # Start documentation server at http://localhost:8000

🤝 Contributing

We welcome contributions!

Development Workflow

  1. Fork the repository
  2. Set up development environment
    git clone <your-fork-url>
    cd matrixone/clients/python
    
    # Create virtual environment
    python -m venv venv
    source venv/bin/activate  # On macOS/Linux
    
    # Install development dependencies
    make dev-setup
  3. Create a feature branch
    git checkout -b feature/your-feature-name
  4. Make your changes
  5. Add tests for new functionality
  6. Run pre-commit checks
    make pre-commit  # format + lint + test
  7. Submit a pull request

Virtual Environment for Contributors

# Always work in virtual environment
source venv/bin/activate

# Before making changes
make test  # Ensure all tests pass

# After making changes
make format    # Format code
make lint      # Check code quality
make test      # Run tests
make examples  # Test examples

# Before submitting PR
make release-check  # Complete pre-release checks

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support

Changelog

See CHANGELOG.md for version history and changes.


MatrixOne Python SDK - Making MatrixOne database operations simple and powerful in Python.