Google Drive backend for Ragas Datasets #2091

Closed
197 changes: 197 additions & 0 deletions experimental/docs/gdrive_backend_setup.md
@@ -0,0 +1,197 @@
# Google Drive Backend Setup Guide

This guide explains how to set up and use the Google Drive backend for Ragas datasets.

## Prerequisites

### 1. Install Dependencies

```bash
pip install google-api-python-client google-auth google-auth-oauthlib
```

### 2. Set up Google Cloud Project

1. Go to the [Google Cloud Console](https://console.cloud.google.com/)
2. Create a new project or select an existing one
3. Enable the following APIs:
   - Google Drive API
   - Google Sheets API

### 3. Create Credentials

You have two options for authentication:

#### Option A: OAuth 2.0 (Recommended for development)

1. In Google Cloud Console, go to "Credentials"
2. Click "Create Credentials" → "OAuth client ID"
3. Choose "Desktop application"
4. Download the JSON file
5. Save it securely (e.g., as `credentials.json`)

#### Option B: Service Account (Recommended for production)

1. In Google Cloud Console, go to "Credentials"
2. Click "Create Credentials" → "Service account"
3. Fill in the details and create the account
4. Generate a key (JSON format)
5. Download and save the JSON file securely
6. Share your Google Drive folder with the service account email

## Setup Instructions

### 1. Create a Google Drive Folder

1. Create a folder in Google Drive where you want to store your datasets
2. Get the folder ID from the URL:
   ```
   https://drive.google.com/drive/folders/FOLDER_ID_HERE
   ```
3. If using a service account, share this folder with the service account email
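
If you would rather copy the whole URL than pick the ID out by hand, the extraction in step 2 can be sketched as below. The helper name `extract_folder_id` is hypothetical and not part of the backend; it only assumes the ID is the path segment that follows `/folders/`.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_folder_id(url: str) -> Optional[str]:
    """Return the path segment after 'folders' in a Drive folder URL, or None."""
    # Split the URL path into non-empty segments; query strings are ignored.
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "folders" in parts:
        idx = parts.index("folders")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None


print(extract_folder_id("https://drive.google.com/drive/folders/FOLDER_ID_HERE?usp=sharing"))
```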

### 2. Set Environment Variables (Optional)

```bash
export GDRIVE_FOLDER_ID="your_folder_id_here"
export GDRIVE_CREDENTIALS_PATH="path/to/credentials.json"
# OR for service account:
export GDRIVE_SERVICE_ACCOUNT_PATH="path/to/service_account.json"
```
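
If you prefer to read these variables yourself and pass the values explicitly, a minimal sketch could look like the following. The helper `gdrive_kwargs_from_env` is hypothetical, and preferring the service account over OAuth when both are set is an assumption of this sketch, not documented backend behavior.

```python
import os
from typing import Dict, Mapping, Optional


def gdrive_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for Project.create from the GDRIVE_* variables."""
    env = os.environ if env is None else env

    folder_id = env.get("GDRIVE_FOLDER_ID")
    if not folder_id:
        raise ValueError("GDRIVE_FOLDER_ID is not set")
    kwargs = {"gdrive_folder_id": folder_id}

    # Assumption: a service account wins if both credential variables are set.
    service_account_path = env.get("GDRIVE_SERVICE_ACCOUNT_PATH")
    oauth_path = env.get("GDRIVE_CREDENTIALS_PATH")
    if service_account_path:
        kwargs["gdrive_service_account_path"] = service_account_path
    elif oauth_path:
        kwargs["gdrive_credentials_path"] = oauth_path
    else:
        raise ValueError(
            "Set GDRIVE_CREDENTIALS_PATH or GDRIVE_SERVICE_ACCOUNT_PATH"
        )
    return kwargs
```

The result could then be unpacked into the call, for example `Project.create(name="my_project", backend="gdrive", **gdrive_kwargs_from_env())`.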

### 3. Basic Usage

```python
from ragas_experimental.project.core import Project
from pydantic import BaseModel

# Define your data model
class EvaluationEntry(BaseModel):
    question: str
    answer: str
    score: float

# Create project with Google Drive backend
project = Project.create(
    name="my_project",
    backend="gdrive",
    gdrive_folder_id="your_folder_id_here",
    gdrive_credentials_path="path/to/credentials.json",  # OAuth
    # OR
    # gdrive_service_account_path="path/to/service_account.json"  # Service Account
)

# Create a dataset
dataset = project.create_dataset(
    model=EvaluationEntry,
    name="my_dataset"
)

# Add data
entry = EvaluationEntry(
    question="What is AI?",
    answer="Artificial Intelligence",
    score=0.95
)
dataset.append(entry)

# Load and access data
dataset.load()
print(f"Dataset has {len(dataset)} entries")
for entry in dataset:
    print(f"{entry.question} -> {entry.answer}")
```

## File Structure

When you use the Google Drive backend, it creates the following structure:

```
Your Google Drive Folder/
├── project_name/
│   ├── datasets/
│   │   ├── dataset1.gsheet
│   │   └── dataset2.gsheet
│   └── experiments/
│       └── experiment1.gsheet
```

Each dataset is stored as a Google Sheet with:
- Column headers matching your model fields
- An additional `_row_id` column for internal tracking
- Automatic type conversion when loading data
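
To make the type-conversion point concrete: sheet cells come back as strings, so each one has to be cast to the type of the matching model field. The sketch below illustrates that idea with a plain dataclass rather than Pydantic so it runs without extra dependencies. The function `rows_to_entries` is hypothetical and is not the backend's actual implementation.

```python
from dataclasses import dataclass, fields
from typing import Any, List, Sequence


@dataclass
class EvaluationEntry:
    question: str
    answer: str
    score: float


def rows_to_entries(
    header: Sequence[str], rows: Sequence[Sequence[str]], model: Any
) -> List[Any]:
    """Cast string cells to the model's field types, ignoring `_row_id`."""
    # Each dataclass field type (str, float, ...) doubles as its converter.
    converters = {f.name: f.type for f in fields(model)}
    entries = []
    for row in rows:
        cells = dict(zip(header, row))
        cells.pop("_row_id", None)  # internal tracking column, not a model field
        entries.append(
            model(**{name: cast(cells[name]) for name, cast in converters.items()})
        )
    return entries


header = ["question", "answer", "score", "_row_id"]
rows = [["What is AI?", "Artificial Intelligence", "0.95", "row-1"]]
entry = rows_to_entries(header, rows, EvaluationEntry)[0]
print(type(entry.score).__name__, entry.score)
```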

## Authentication Flow

### OAuth (First Time)
1. When you first run your code, a browser window will open
2. Sign in to your Google account
3. Grant permissions to access Google Drive
4. A `token.json` file will be created automatically
5. Subsequent runs will use this token (no browser needed)
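
The steps above follow the usual installed-app pattern from `google-auth-oauthlib`. A minimal sketch of that pattern is shown below; it is not necessarily how the backend implements it. The function name `load_oauth_credentials` is hypothetical, and the scopes are an assumption (full Drive and Sheets access).

```python
import os

# Assumed scopes; the backend may request different ones.
SCOPES = [
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/spreadsheets",
]


def load_oauth_credentials(credentials_path: str, token_path: str = "token.json"):
    """Reuse a cached token if possible, otherwise run the browser flow."""
    # Imported lazily so this module loads even without the Google libraries.
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials
    from google_auth_oauthlib.flow import InstalledAppFlow

    creds = None
    if os.path.exists(token_path):
        creds = Credentials.from_authorized_user_file(token_path, SCOPES)

    if creds is None or not creds.valid:
        if creds is not None and creds.expired and creds.refresh_token:
            # Silent refresh: no browser window needed.
            creds.refresh(Request())
        else:
            # First run: opens a browser window for sign-in and consent.
            flow = InstalledAppFlow.from_client_secrets_file(credentials_path, SCOPES)
            creds = flow.run_local_server(port=0)
        # Cache the token so later runs can skip the browser step.
        with open(token_path, "w") as token_file:
            token_file.write(creds.to_json())
    return creds
```

Deleting the token file forces the browser flow to run again, which is why that step appears under Troubleshooting.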

### Service Account
1. No interactive authentication required
2. Make sure the service account has access to your folder
3. The JSON key file is used directly

## Troubleshooting

### Common Issues

1. **"Folder not found" error**
   - Check that the folder ID is correct
   - Ensure the folder is shared with your service account (if using one)

2. **Authentication errors**
   - Verify your credentials file path
   - Check that the required APIs are enabled
   - For OAuth: Delete `token.json` and re-authenticate

3. **Permission errors**
   - Make sure your account has edit access to the folder
   - For service accounts: share the folder with the service account email

4. **Import errors**
   - Install required dependencies: `pip install google-api-python-client google-auth google-auth-oauthlib`
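
For import errors in particular, a quick way to see which packages are missing is to probe for their import names. The sketch below uses only the standard library; the helper `missing_modules` is hypothetical, and the three module names are the import names of the packages installed above.

```python
import importlib.util
from typing import Iterable, List

# Import names for google-api-python-client, google-auth, google-auth-oauthlib.
REQUIRED_MODULES = ("googleapiclient", "google.auth", "google_auth_oauthlib")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing:", missing_modules() or "none")
```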

### Getting Help

If you encounter issues:
1. Check the error message carefully
2. Verify your Google Cloud setup
3. Test authentication with a simple Google Drive API call
4. Check that all dependencies are installed
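
For step 3, a simple Drive API call might look like the sketch below. It fetches the folder's metadata with a service account, which fails quickly if the key, the enabled APIs, or the folder sharing is wrong. The function names are hypothetical and the scope is an assumption. The `google.oauth2.service_account` and `googleapiclient.discovery.build` calls are the standard client-library entry points.

```python
from typing import Any, Dict

SCOPES = ["https://www.googleapis.com/auth/drive"]  # assumed scope
FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"


def fetch_folder_metadata(service_account_path: str, folder_id: str) -> Dict[str, Any]:
    """Fetch id, name and mimeType for the given Drive file ID."""
    # Imported lazily so this module loads even without the Google libraries.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        service_account_path, scopes=SCOPES
    )
    drive = build("drive", "v3", credentials=creds)
    return (
        drive.files()
        .get(fileId=folder_id, fields="id, name, mimeType", supportsAllDrives=True)
        .execute()
    )


def is_folder(metadata: Dict[str, Any]) -> bool:
    """True if the metadata describes a Drive folder rather than a file."""
    return metadata.get("mimeType") == FOLDER_MIME_TYPE
```

If the request fails with a 404 error, the folder most likely has not been shared with the service account email.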

## Security Best Practices

1. **Never commit credentials to version control**
2. **Use environment variables for sensitive information**
3. **Limit service account permissions to minimum required**
4. **Regularly rotate service account keys**
5. **Use OAuth for development, service accounts for production**

## Advanced Configuration

### Custom Authentication Paths

```python
project = Project.create(
    name="my_project",
    backend="gdrive",
    gdrive_folder_id="folder_id",
    gdrive_credentials_path="/custom/path/to/credentials.json",
    gdrive_token_path="/custom/path/to/token.json"
)
```

### Multiple Projects

You can have multiple projects in the same Google Drive folder:

```python
project1 = Project.create(name="project1", backend="gdrive", ...)
project2 = Project.create(name="project2", backend="gdrive", ...)
```

Each will create its own subfolder structure.
121 changes: 121 additions & 0 deletions experimental/examples/gdrive_backend_example.py
@@ -0,0 +1,121 @@
"""
Example usage of the Google Drive backend for Ragas.

This example shows how to:
1. Set up authentication for Google Drive
2. Create a project with Google Drive backend
3. Create and manage datasets stored in Google Sheets

Prerequisites:
1. Install required dependencies:
   pip install google-api-python-client google-auth google-auth-oauthlib

2. Set up Google Drive API credentials:
   - Go to Google Cloud Console
   - Enable Google Drive API and Google Sheets API
   - Create credentials (either OAuth or Service Account)
   - Download the JSON file

3. Set environment variables or provide paths directly
"""

import os
from pydantic import BaseModel
from ragas_experimental.project.core import Project
from ragas_experimental.metric import MetricResult


# Example model for our dataset
class EvaluationEntry(BaseModel):
    question: str
    answer: str
    context: str
    score: float
    feedback: str


def example_oauth_setup():
    """Example using OAuth authentication."""

    # Set up environment variables (or pass directly to Project.create)
    # os.environ["GDRIVE_FOLDER_ID"] = "your_google_drive_folder_id_here"
    # os.environ["GDRIVE_CREDENTIALS_PATH"] = "path/to/your/credentials.json"

    # Create project with Google Drive backend
    project = Project.create(
        name="my_ragas_project",
        description="A project using Google Drive for storage",
        backend="gdrive",
        gdrive_folder_id="1HLvvtKLnwGWKTely0YDlJ397XPTQ77Yg",
        gdrive_credentials_path="/Users/derekanderson/Downloads/credentials.json",
        gdrive_token_path="token.json"  # Will be created automatically
    )

    return project


def example_usage():
    """Example of using the Google Drive backend."""

    # Create a project (choose one of the authentication methods above)
    project = example_oauth_setup()  # or example_service_account_setup()

    # Create a dataset
    dataset = project.create_dataset(
        model=EvaluationEntry,
        name="evaluation_results"
    )

    # Add some entries
    entry1 = EvaluationEntry(
        question="What is the capital of France?",
        answer="Paris",
        context="France is a country in Europe.",
        score=0.95,
        feedback="Correct answer"
    )

    entry2 = EvaluationEntry(
        question="What is 2+2?",
        answer="4",
        context="Basic arithmetic question.",
        score=1.0,
        feedback="Perfect answer"
    )

    # Append entries to the dataset
    dataset.append(entry1)
    dataset.append(entry2)

    # Load all entries
    dataset.load()
    print(f"Dataset contains {len(dataset)} entries")

    # Access entries
    for i, entry in enumerate(dataset):
        print(f"Entry {i}: {entry.question} -> {entry.answer} (Score: {entry.score})")

    # Update an entry
    dataset[0].score = 0.98
    dataset[0].feedback = "Updated feedback"
    dataset[0] = dataset[0]  # Trigger update

    # Search for entries
    entry = dataset._backend.get_entry_by_field("question", "What is 2+2?", EvaluationEntry)
    if entry:
        print(f"Found entry: {entry.answer}")

    return dataset


if __name__ == "__main__":
    # Run the example
    try:
        dataset = example_usage()
        print("Google Drive backend example completed successfully!")
    except Exception as e:
        print(f"Error: {e}")
        print("\nMake sure to:")
        print("1. Install required dependencies")
        print("2. Set up Google Drive API credentials")
        print("3. Update the folder ID and credential paths in this example")
19 changes: 19 additions & 0 deletions experimental/ragas_experimental/backends/__init__.py
@@ -0,0 +1,19 @@
# Optional imports for backends that require additional dependencies

# Always available backends
from .ragas_api_client import RagasApiClient
from .factory import RagasApiClientFactory

# Conditionally import Google Drive backend
try:
    from .gdrive_backend import GDriveBackend
    __all__ = ["RagasApiClient", "RagasApiClientFactory", "GDriveBackend"]
except ImportError:
    __all__ = ["RagasApiClient", "RagasApiClientFactory"]

# Conditionally import Notion backend if available
try:
    from .notion_backend import NotionBackend
    __all__.append("NotionBackend")
except ImportError:
    pass
46 changes: 46 additions & 0 deletions experimental/ragas_experimental/backends/base.py
@@ -0,0 +1,46 @@
"""Base classes for dataset backends."""

from abc import ABC, abstractmethod
import typing as t


class DatasetBackend(ABC):
    """Abstract base class for dataset backends.

    All dataset storage backends must implement these methods.
    """

    @abstractmethod
    def initialize(self, dataset):
        """Initialize the backend with dataset information"""
        pass

    @abstractmethod
    def get_column_mapping(self, model):
        """Get mapping between model fields and backend columns"""
        pass

    @abstractmethod
    def load_entries(self, model_class):
        """Load all entries from storage"""
        pass

    @abstractmethod
    def append_entry(self, entry):
        """Add a new entry to storage and return its ID"""
        pass

    @abstractmethod
    def update_entry(self, entry):
        """Update an existing entry in storage"""
        pass

    @abstractmethod
    def delete_entry(self, entry_id):
        """Delete an entry from storage"""
        pass

    @abstractmethod
    def get_entry_by_field(self, field_name: str, field_value: t.Any, model_class):
        """Get an entry by field value"""
        pass
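
To illustrate the contract that `DatasetBackend` defines, here is a minimal in-memory sketch with the same seven methods. It is not part of this PR. `InMemoryBackend` and the `Entry` dataclass are hypothetical, and keeping the ID on the entry as `_row_id` is an assumption borrowed from the column name in the setup guide. In the real package the class would subclass `DatasetBackend`; the sketch is duck-typed so it runs on its own.

```python
import typing as t
import uuid
from dataclasses import dataclass, fields


@dataclass
class Entry:
    question: str
    answer: str
    _row_id: t.Optional[str] = None  # assumed ID attribute


class InMemoryBackend:
    """Stores rows in a dict keyed by row ID; mirrors the abstract methods."""

    def __init__(self):
        self.dataset = None
        self._rows: t.Dict[str, dict] = {}

    def initialize(self, dataset):
        self.dataset = dataset

    def get_column_mapping(self, model):
        # Identity mapping over public fields; `_row_id` is internal.
        return {f.name: f.name for f in fields(model) if not f.name.startswith("_")}

    def load_entries(self, model_class):
        return [model_class(**row) for row in self._rows.values()]

    def append_entry(self, entry):
        row_id = uuid.uuid4().hex
        entry._row_id = row_id
        self._rows[row_id] = dict(vars(entry))
        return row_id

    def update_entry(self, entry):
        if entry._row_id not in self._rows:
            raise KeyError(f"unknown row id: {entry._row_id!r}")
        self._rows[entry._row_id] = dict(vars(entry))

    def delete_entry(self, entry_id):
        self._rows.pop(entry_id, None)

    def get_entry_by_field(self, field_name: str, field_value: t.Any, model_class):
        for row in self._rows.values():
            if row.get(field_name) == field_value:
                return model_class(**row)
        return None
```

A backend like this can be handy in unit tests that should not touch Google Drive.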