Feature/ENG-31: Google sheets connector#1158

Closed
coutocass wants to merge 6 commits into main from
feture/eng-31-google-sheets
Conversation

@coutocass
Contributor

@coutocass coutocass commented Dec 9, 2025

Summary by cubic

Adds a Google Sheets connector (ENG-31) with OAuth and full/incremental sync via Drive Changes, plus docs and UI icons. Includes E2E tests and local worker QoL fixes for safer shutdowns.

  • New Features

    • Google Sheets source: lists spreadsheets, fetches sheet metadata/values, and emits spreadsheet/sheet entities with headers and formatted cell data.
    • Incremental sync via Drive Changes using start_page_token; full sync persists the next token.
    • Config and schemas: GoogleSheetsConfig (include_trashed, include_shared, max_rows_per_sheet), cursor, and entities.
    • OAuth for dev/prd/self-hosted; Composio maps to Google Docs scopes; connector uses read-only scopes; BYOC and Pipedream supported.
    • Monke E2E: bongo to create/update/delete spreadsheets, LLM-generated rows with verification tokens, force_full_sync deletion checks, and orphaned test cleanup.
    • Docs and UI: added connector docs page and nav entry, app icons, and README icon.
  • Bug Fixes

    • Worker hot-reload: optional debugpy (AIRWEAVE_WORKER_DEBUG) and optional wait-for-client (AIRWEAVE_DEBUGPY_WAIT_FOR_CLIENT); fixed watch paths.
    • Graceful shutdown even if the metrics server isn’t fully initialized.
    • Monke: troubleshooting guide for entity ID collisions/deletion order and minor logging improvements.

Written for commit 19f76e2. Summary will update on new commits.

- Make debugpy optional in worker hot-reload (set AIRWEAVE_WORKER_DEBUG=true to enable)
- Make --wait-for-client optional (set AIRWEAVE_DEBUGPY_WAIT_FOR_CLIENT=true to enable)
- Fix graceful shutdown when metrics server not fully initialized
- Fix watch paths for hot-reload script
- Add Monke troubleshooting guide for entity ID collisions and deletion order
- Minor monke logging improvements
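
A minimal sketch of how the two opt-in flags above could gate the debugger, assuming the hot-reload script builds the worker command itself. The helper names, the `worker_main` module, and port 5678 are illustrative assumptions, not taken from this PR; `--listen` and `--wait-for-client` are debugpy's own CLI options.

```python
def _flag(name: str, env: dict) -> bool:
    """Treat only the literal string 'true' (any case) as enabled."""
    return env.get(name, "").strip().lower() == "true"


def build_worker_command(env: dict, module: str = "worker_main") -> list:
    """Build the worker argv, wrapping it in debugpy only when requested."""
    cmd = ["python"]
    if _flag("AIRWEAVE_WORKER_DEBUG", env):
        # Port 5678 is an assumed example value, not confirmed by the PR.
        cmd += ["-m", "debugpy", "--listen", "0.0.0.0:5678"]
        if _flag("AIRWEAVE_DEBUGPY_WAIT_FOR_CLIENT", env):
            # Block startup until an IDE attaches; off unless explicitly enabled.
            cmd.append("--wait-for-client")
    cmd += ["-m", module]
    return cmd
```

With neither variable set, the worker starts without debugpy, so a missing debugger can no longer stall local startup.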
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 19 files

Prompt for AI agents (all 1 issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="monke/bongos/google_sheets.py">

<violation number="1" location="monke/bongos/google_sheets.py:202">
P1: Passing `self._test_spreadsheets` directly causes list modification during iteration. Inside `delete_specific_entities`, items are removed from `_test_spreadsheets` while iterating over the same list, which can cause elements to be skipped. Pass a copy instead, similar to how `cleanup` does it.</violation>
</file>
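
The violation above can be reproduced in isolation. This is a hedged sketch: `FakeBongo` is a hypothetical stand-in for the real bongo class, and only the list handling mirrors the issue described.

```python
class FakeBongo:
    """Hypothetical stand-in for the bongo; only the list handling matters."""

    def __init__(self, names):
        self._test_spreadsheets = list(names)

    def delete_specific_entities(self, entities):
        deleted = []
        for entity in entities:
            # When the caller passed self._test_spreadsheets itself, this
            # removal shifts the list under the running iterator.
            self._test_spreadsheets.remove(entity)
            deleted.append(entity)
        return deleted


# Buggy call: iterate and mutate the same list object.
buggy = FakeBongo(["a", "b", "c", "d"])
skipped = buggy.delete_specific_entities(buggy._test_spreadsheets)

# Fixed call: iterate over a shallow copy, mutate the original.
fixed = FakeBongo(["a", "b", "c", "d"])
complete = fixed.delete_specific_entities(list(fixed._test_spreadsheets))
```

In the buggy call every second element is skipped, so `"b"` and `"d"` are never deleted; passing `list(...)` processes all four.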

"google_drive": "googledrive",
"google_calendar": "googlecalendar",
"google_docs": "googledocs",
"google_sheets": "googledocs", # Reuse Google Docs OAuth (compatible scopes)
Collaborator

Does this require users to create a "googledocs" Composio connection and use that to create a "google_sheets" Airweave connection?

Contributor Author

@coutocass coutocass Dec 9, 2025

I'm using googledocs just temporarily, because I don't have access to composio.

class GoogleSheetsConfig(SourceConfig):
    """Google Sheets configuration schema."""

    include_trashed: bool = Field(
Collaborator

Not sure if we should include this option. Normally we do not sync deleted data.

Contributor Author

This follows the existing pattern for Google connectors - GoogleDocsConfig and GoogleSlidesConfig already have this same option on main. It defaults to False (trashed items excluded), so the default behavior matches our principle of not syncing deleted data. The opt-in exists for edge cases like compliance audits or data recovery scenarios.
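
As an illustration of the default-off behavior described above, here is a rough sketch of how such a flag might translate into a Drive `files.list` query string. `build_drive_query` is a hypothetical helper, not the connector's actual code; the `mimeType` and `trashed` query terms are standard Drive search syntax.

```python
SPREADSHEET_MIME = "application/vnd.google-apps.spreadsheet"


def build_drive_query(include_trashed: bool = False) -> str:
    """Build a Drive search query that lists spreadsheets only."""
    clauses = [f"mimeType = '{SPREADSHEET_MIME}'"]
    if not include_trashed:
        # Default path: trashed spreadsheets never reach the sync.
        clauses.append("trashed = false")
    return " and ".join(clauses)
```

Only an explicit `include_trashed=True` drops the `trashed = false` clause, so the default keeps deleted data out of the index.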

@marc-rutzou
Collaborator

we do have a xlsx converter. Did you consider processing the data as a file? or do you think this is a better way?

@coutocass
Contributor Author

coutocass commented Dec 9, 2025

we do have a xlsx converter. Did you consider processing the data as a file? or do you think this is a better way?

@marc-rutzou Good question! I considered the XLSX export approach but went with the Sheets API directly for a few reasons:

  1. Incremental sync efficiency - The Drive Changes API tells us which sheets changed. With XLSX export, we'd need to re-download the entire file on any change.
  2. No export overhead - Large spreadsheets can be slow/fail to export. The Sheets API lets us paginate and limit rows (max_rows_per_sheet) without downloading everything.
  3. Live data - Sheets API returns formula results as users see them. XLSX export may not preserve all Google Sheets features correctly.
  4. Metadata - We get owners, sharing status, and web links directly from Drive API.
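
The incremental-sync point can be sketched as a paging loop over a Drive-Changes-style feed. This is a simplified illustration under stated assumptions: `collect_changes` and the in-memory `FAKE_PAGES` are hypothetical, while `changes`, `nextPageToken`, and `newStartPageToken` are the field names the Drive Changes API uses in its responses.

```python
from typing import Callable, Dict, List, Optional, Tuple


def collect_changes(
    fetch_page: Callable[[str], Dict], start_page_token: str
) -> Tuple[List[Dict], Optional[str]]:
    """Drain the change feed and return (changes, cursor for the next sync)."""
    changes: List[Dict] = []
    page_token: Optional[str] = start_page_token
    new_start_token: Optional[str] = None
    while page_token:
        page = fetch_page(page_token)
        changes.extend(page.get("changes", []))
        # Only the final page carries newStartPageToken; keep it as the cursor.
        new_start_token = page.get("newStartPageToken", new_start_token)
        page_token = page.get("nextPageToken")
    return changes, new_start_token


# In-memory stand-in for the API, keyed by page token (assumed test data).
FAKE_PAGES = {
    "100": {"changes": [{"fileId": "s1", "removed": False}], "nextPageToken": "101"},
    "101": {"changes": [{"fileId": "s2", "removed": True}], "newStartPageToken": "205"},
}

changes, cursor = collect_changes(FAKE_PAGES.__getitem__, "100")
```

Only the spreadsheets named in `changes` need to be re-fetched through the Sheets API, and `cursor` is what the next run would start from.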

@coutocass
Contributor Author

@marc-rutzou btw the test that is failing is unrelated to the changes here. It's likely a bug :/

Collaborator

@marc-rutzou marc-rutzou left a comment

Once the Composio name mapping is fixed, it's good to go!

@coutocass
Contributor Author

@marc-rutzou could you please re-review? Composio is set

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 11 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/airweave/platform/sources/google_sheets.py">

<violation number="1" location="backend/airweave/platform/sources/google_sheets.py:279">
P1: Dead code: `_latest_new_start_page_token` is never assigned in this class</violation>
</file>

<file name="backend/airweave/platform/auth/yaml/dev.integrations.yaml">

<violation number="1" location="backend/airweave/platform/auth/yaml/dev.integrations.yaml:161">
P2: Using placeholder OAuth credentials will break Google Sheets authentication in the dev integration config.</violation>
</file>

@viralpraxis viralpraxis requested a review from orhanrauf March 5, 2026 14:40
Comment on lines +141 to +144
from .google_sheets import (
    GoogleSheetsSheetEntity,
    GoogleSheetsSpreadsheetEntity,
)
Member

We don't have a Deletion Entity, is that expected? _process_changes filters for removed == False but never yields a deletion entity for removed == True. So this will work for the monke since we set force_full_sync to true, but regular users will accumulate stale data over time.
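
One possible shape for the missing deletion handling, as a rough sketch: instead of dropping `removed == True` changes, route them to a separate deletion path. `split_changes` is a hypothetical helper, and how the deletions are then emitted (for example as a deletion entity) is left to the platform's conventions, which this thread does not show.

```python
from typing import Dict, Iterable, List, Tuple


def split_changes(changes: Iterable[Dict]) -> Tuple[List[str], List[str]]:
    """Return (file IDs to re-sync, file IDs to delete from the index)."""
    to_upsert: List[str] = []
    to_delete: List[str] = []
    for change in changes:
        file_id = change.get("fileId")
        if not file_id:
            continue  # Nothing to act on without an ID.
        if change.get("removed"):
            # Removed upstream: must be deleted downstream, not ignored.
            to_delete.append(file_id)
        else:
            to_upsert.append(file_id)
    return to_upsert, to_delete
```

With this split, an incremental sync can clean up removed spreadsheets without relying on `force_full_sync`.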

spreadsheet_title = file_data.get("name", "Untitled Spreadsheet")

# Parse timestamps
created_time = self._parse_datetime(file_data.get("createdTime")) or datetime.utcnow()
Member

We're in the process of bumping to Python 3.14. AFAIK datetime.utcnow() is deprecated for 3.12 and higher. Can you use datetime.now(timezone.utc) instead?
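
A minimal sketch of the suggested replacement. `parse_datetime` here is a simplified stand-in for the connector's `_parse_datetime`, whose real implementation is not shown in this thread.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_datetime(value: Optional[str]) -> Optional[datetime]:
    """Parse an RFC 3339 timestamp such as '2025-12-09T10:00:00.000Z'."""
    if not value:
        return None
    try:
        # Older fromisoformat versions reject a trailing 'Z', so normalize it.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def created_time_or_now(raw: Optional[str]) -> datetime:
    """Fall back to an aware UTC 'now' instead of the naive utcnow()."""
    return parse_datetime(raw) or datetime.now(timezone.utc)
```

Unlike `datetime.utcnow()`, `datetime.now(timezone.utc)` returns a timezone-aware value, so the fallback can be compared directly with the parsed, offset-carrying timestamps.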

@felixschmetz
Member

Thanks!! I believe we should park this connector for now. We are unable to parse sheets in a meaningful way yet that

a) doesn't crash our infra with mem-usage
b) provides meaningful results during retrieval.

Let's revisit once our platform is ready.

Labels

connector:file-storage File and document storage connectors (GDrive, etc.) enhancement New feature or request

6 participants