feat(sync): SyncMultiplexer for destination migrations #1194
orhanrauf wants to merge 1 commit into feat/raw-data-capture
- Add SyncMultiplexer for managing multiple destinations per sync
- Implement fork/switch/resync operations for blue-green deployments
- Add ARFReplaySource for replaying entities from raw data store
- Refactor SyncFactory into modular builders (_source, _destination, _context, _pipeline)
- Add DestinationRole enum (ACTIVE, SHADOW, DEPRECATED) to SyncConnection
- Add feature flag SYNC_MULTIPLEXER for gating access
- Add CRUD layer for SyncConnection with role-based filtering
- Add API endpoints for multiplex operations
8 issues found across 21 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/airweave/platform/sync/factory/__init__.py">
<violation number="1" location="backend/airweave/platform/sync/factory/__init__.py:15">
P2: Rule violated: **Check for Cursor Rules Drift**
The sync-architecture cursor rule needs updating to reflect these architectural changes. The rule at `.cursor/rules/sync-architecture.mdc` describes `SyncFactory` as a monolithic factory but the PR refactors it into modular builders (`_factory.py`, `_source.py`, `_destination.py`, `_context.py`, `_pipeline.py`). Additionally, the new `SyncMultiplexer` for blue-green destination migrations (with `ACTIVE`/`SHADOW`/`DEPRECATED` roles) is not documented.
Consider updating the cursor rule to:
1. Document the new modular factory structure under `platform/sync/factory/`
2. Add a section for internal builders and their purposes
3. Document the SyncMultiplexer component and destination migration workflow</violation>
</file>
<file name="backend/airweave/platform/sync/multiplex/replay.py">
<violation number="1" location="backend/airweave/platform/sync/multiplex/replay.py:25">
P2: Rule violated: **Check for Cursor Rules Drift**
Cursor rules drift detected: The sync architecture documentation needs updating to reflect the factory refactor and ARF replay capabilities.
**Affected rules:**
- `.cursor/rules/sync-architecture.mdc` - Documents `SyncFactory` as monolithic, but PR refactors into modular builders (`DestinationBuilder`, `ReplayContextBuilder`, `PipelineBuilder`)
- `.cursor/rules/arf.mdc` - Only documents ARF capture (write), missing replay capabilities (`ARFReplaySource`, `iter_entities_for_replay`, `get_replay_stats`)
**Missing documentation:**
- New `sync/factory/` module structure with builders
- `SyncMultiplexer` for blue-green destination migrations
- ARF replay workflow and `ARFReplaySource` pseudo-source
Consider updating these rules to prevent AI assistants from generating outdated patterns.</violation>
<violation number="2" location="backend/airweave/platform/sync/multiplex/replay.py:141">
P2: Raising `ValueError` for a missing ARF store may surface as a 500. Prefer `NotFoundException` (or a domain-specific exception that your exception handlers map) so the API returns a predictable status code.</violation>
</file>
<file name="backend/airweave/platform/sync/factory/_context.py">
<violation number="1" location="backend/airweave/platform/sync/factory/_context.py:105">
P1: `SyncContext.connection` is set to `None`, but the orchestrator later dereferences `sync_context.connection.*`. Pass the real source connection schema here (e.g., from `source_connection_data`).</violation>
</file>
<file name="backend/airweave/platform/sync/factory/_pipeline.py">
<violation number="1" location="backend/airweave/platform/sync/factory/_pipeline.py:92">
P0: RAW_ENTITIES destinations are collected but never wired into a handler, so self-processing destinations (e.g., Vespa) won’t receive any inserts/updates/deletes.</violation>
<violation number="2" location="backend/airweave/platform/sync/factory/_pipeline.py:116">
P3: Dead code: `if not handlers and logger:` is unreachable because `PostgresMetadataHandler()` is always appended.</violation>
</file>
<file name="backend/airweave/crud/crud_sync_connection.py">
<violation number="1" location="backend/airweave/crud/crud_sync_connection.py:21">
P1: This CRUD bypasses the project’s standard ApiContext/org access validation and relies on callers to enforce authorization. That’s a security footgun for cross-org reads/updates/deletes. Prefer requiring `ctx: ApiContext` and validating access (e.g., via joining `Sync` to check `organization_id`) inside these methods.</violation>
</file>
<file name="backend/airweave/platform/sync/multiplex/multiplexer.py">
<violation number="1" location="backend/airweave/platform/sync/multiplex/multiplexer.py:116">
P1: `sync.destination_connection_ids` appears to not exist in the current Sync model, so `fork()` will crash with AttributeError. Guard this update (or remove it) if the field isn’t present.</violation>
</file>
Reply to cubic to teach it or ask questions. Tag @cubic-dev-ai to re-run a review.
from airweave.platform.sync.factory._pipeline import PipelineBuilder
"""

from airweave.platform.sync.factory._factory import SyncFactory
P2: Rule violated: **Check for Cursor Rules Drift**
The sync-architecture cursor rule needs updating to reflect these architectural changes. The rule at `.cursor/rules/sync-architecture.mdc` describes `SyncFactory` as a monolithic factory but the PR refactors it into modular builders (`_factory.py`, `_source.py`, `_destination.py`, `_context.py`, `_pipeline.py`). Additionally, the new `SyncMultiplexer` for blue-green destination migrations (with `ACTIVE`/`SHADOW`/`DEPRECATED` roles) is not documented.
Consider updating the cursor rule to:
1. Document the new modular factory structure under `platform/sync/factory/`
2. Add a section for internal builders and their purposes
3. Document the SyncMultiplexer component and destination migration workflow
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/sync/factory/__init__.py, line 15:
<comment>The sync-architecture cursor rule needs updating to reflect these architectural changes. The rule at `.cursor/rules/sync-architecture.mdc` describes `SyncFactory` as a monolithic factory but the PR refactors it into modular builders (`_factory.py`, `_source.py`, `_destination.py`, `_context.py`, `_pipeline.py`). Additionally, the new `SyncMultiplexer` for blue-green destination migrations (with `ACTIVE`/`SHADOW`/`DEPRECATED` roles) is not documented.
Consider updating the cursor rule to:
1. Document the new modular factory structure under `platform/sync/factory/`
2. Add a section for internal builders and their purposes
3. Document the SyncMultiplexer component and destination migration workflow</comment>
<file context>
@@ -0,0 +1,17 @@
+ from airweave.platform.sync.factory._pipeline import PipelineBuilder
+"""
+
+from airweave.platform.sync.factory._factory import SyncFactory
+
+__all__ = ["SyncFactory"]
</file context>
@@ -0,0 +1,246 @@
"""Replay service - populates destinations from ARF storage.
P2: Rule violated: **Check for Cursor Rules Drift**
Cursor rules drift detected: The sync architecture documentation needs updating to reflect the factory refactor and ARF replay capabilities.
**Affected rules:**
- `.cursor/rules/sync-architecture.mdc` - Documents `SyncFactory` as monolithic, but PR refactors into modular builders (`DestinationBuilder`, `ReplayContextBuilder`, `PipelineBuilder`)
- `.cursor/rules/arf.mdc` - Only documents ARF capture (write), missing replay capabilities (`ARFReplaySource`, `iter_entities_for_replay`, `get_replay_stats`)
**Missing documentation:**
- New `sync/factory/` module structure with builders
- `SyncMultiplexer` for blue-green destination migrations
- ARF replay workflow and `ARFReplaySource` pseudo-source
Consider updating these rules to prevent AI assistants from generating outdated patterns.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/sync/multiplex/replay.py, line 25:
<comment>Cursor rules drift detected: The sync architecture documentation needs updating to reflect the factory refactor and ARF replay capabilities.
**Affected rules:**
- `.cursor/rules/sync-architecture.mdc` - Documents `SyncFactory` as monolithic, but PR refactors into modular builders (`DestinationBuilder`, `ReplayContextBuilder`, `PipelineBuilder`)
- `.cursor/rules/arf.mdc` - Only documents ARF capture (write), missing replay capabilities (`ARFReplaySource`, `iter_entities_for_replay`, `get_replay_stats`)
**Missing documentation:**
- New `sync/factory/` module structure with builders
- `SyncMultiplexer` for blue-green destination migrations
- ARF replay workflow and `ARFReplaySource` pseudo-source
Consider updating these rules to prevent AI assistants from generating outdated patterns.</comment>
<file context>
@@ -0,0 +1,246 @@
+from airweave.platform.entities._base import BaseEntity
+from airweave.platform.sources._base import BaseSource
+from airweave.platform.sync.factory import SyncFactory
+from airweave.platform.sync.factory._context import ReplayContextBuilder
+from airweave.platform.sync.factory._destination import DestinationBuilder
+from airweave.platform.sync.factory._pipeline import PipelineBuilder
</file context>
sync=sync,
sync_job=sync_job,
collection=collection,
connection=None,
P1: SyncContext.connection is set to None, but the orchestrator later dereferences sync_context.connection.*. Pass the real source connection schema here (e.g., from source_connection_data).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/sync/factory/_context.py, line 105:
<comment>`SyncContext.connection` is set to `None`, but the orchestrator later dereferences `sync_context.connection.*`. Pass the real source connection schema here (e.g., from `source_connection_data`).</comment>
<file context>
@@ -0,0 +1,243 @@
+ sync=sync,
+ sync_job=sync_job,
+ collection=collection,
+ connection=None,
+ entity_tracker=entity_tracker,
+ state_publisher=state_publisher,
</file context>
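One way to address the comment above is to fail fast when no source connection is available instead of constructing a context with `connection=None`. The following is a minimal sketch, not the PR's actual code: `SyncContext` is reduced to two fields, `build_replay_context` is a hypothetical helper, and the `"connection"` key on `source_connection_data` is an assumption about that dict's shape.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SyncContext:
    """Reduced stand-in for the real SyncContext (only two fields shown)."""

    sync: Any
    connection: Any


def build_replay_context(sync: Any, source_connection_data: Dict[str, Any]) -> SyncContext:
    """Build a context, refusing to continue without a source connection.

    The "connection" key is an assumed field name, used for illustration only.
    """
    connection = source_connection_data.get("connection")
    if connection is None:
        # Failing here is easier to debug than an AttributeError raised later
        # when the orchestrator dereferences sync_context.connection.
        raise ValueError("source_connection_data has no connection")
    return SyncContext(sync=sync, connection=connection)
```

Raising at build time keeps the failure next to its cause; whether a replay context can legitimately run without a connection is a design question for the PR author.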
handlers.append(PostgresMetadataHandler())

if not handlers and logger:
P3: Dead code: if not handlers and logger: is unreachable because PostgresMetadataHandler() is always appended.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/sync/factory/_pipeline.py, line 116:
<comment>Dead code: `if not handlers and logger:` is unreachable because `PostgresMetadataHandler()` is always appended.</comment>
<file context>
@@ -0,0 +1,119 @@
+
+ handlers.append(PostgresMetadataHandler())
+
+ if not handlers and logger:
+ logger.warning("No destination handlers created - sync has no valid destinations")
+
</file context>
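If the warning is meant to flag a sync with no destination handlers, one possible fix is to run the check before the metadata handler is appended. This is a sketch under that assumption: `build_handlers` is a hypothetical function name and `PostgresMetadataHandler` is an empty stand-in for the real class.

```python
import logging
from typing import List, Optional


class PostgresMetadataHandler:
    """Empty stand-in for the real metadata handler."""


def build_handlers(
    destination_handlers: List[object],
    logger: Optional[logging.Logger] = None,
) -> List[object]:
    """Return destination handlers followed by the metadata handler."""
    handlers = list(destination_handlers)

    # Check while the list holds only destination handlers. After the append
    # below the list is never empty, so the check would be unreachable.
    if not handlers and logger:
        logger.warning("No destination handlers created - sync has no valid destinations")

    handlers.append(PostgresMetadataHandler())
    return handlers
```

The alternative is to delete the branch outright, which is equally valid if a metadata-only sync is an expected configuration.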
requirement = dest.processing_requirement
if requirement == ProcessingRequirement.CHUNKS_AND_EMBEDDINGS:
vector_db_destinations.append(dest)
elif requirement == ProcessingRequirement.RAW_ENTITIES:
P0: RAW_ENTITIES destinations are collected but never wired into a handler, so self-processing destinations (e.g., Vespa) won’t receive any inserts/updates/deletes.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/sync/factory/_pipeline.py, line 92:
<comment>RAW_ENTITIES destinations are collected but never wired into a handler, so self-processing destinations (e.g., Vespa) won’t receive any inserts/updates/deletes.</comment>
<file context>
@@ -0,0 +1,119 @@
+ requirement = dest.processing_requirement
+ if requirement == ProcessingRequirement.CHUNKS_AND_EMBEDDINGS:
+ vector_db_destinations.append(dest)
+ elif requirement == ProcessingRequirement.RAW_ENTITIES:
+ self_processing_destinations.append(dest)
+ else:
</file context>
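The wiring the comment asks for can be sketched as follows: partition destinations by requirement, then create one handler per non-empty group so that neither group is silently dropped. Everything here is a stand-in: the enum values, the `Destination` dataclass, and the `VectorDBHandler` and `RawEntityHandler` names are hypothetical, and raising in the `else` branch is an assumption because the original branch body is not shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ProcessingRequirement(Enum):
    # Member names come from the diff; the string values are placeholders.
    CHUNKS_AND_EMBEDDINGS = "chunks_and_embeddings"
    RAW_ENTITIES = "raw_entities"


@dataclass
class Destination:
    name: str
    processing_requirement: ProcessingRequirement


@dataclass
class VectorDBHandler:
    """Hypothetical handler for destinations that need chunks and embeddings."""

    destinations: List[Destination]


@dataclass
class RawEntityHandler:
    """Hypothetical handler for self-processing destinations."""

    destinations: List[Destination]


def build_destination_handlers(destinations: List[Destination]) -> List[object]:
    vector_db: List[Destination] = []
    self_processing: List[Destination] = []
    for dest in destinations:
        requirement = dest.processing_requirement
        if requirement == ProcessingRequirement.CHUNKS_AND_EMBEDDINGS:
            vector_db.append(dest)
        elif requirement == ProcessingRequirement.RAW_ENTITIES:
            self_processing.append(dest)
        else:
            raise ValueError(f"Unsupported requirement: {requirement}")

    handlers: List[object] = []
    if vector_db:
        handlers.append(VectorDBHandler(vector_db))
    # The step the review flags as missing: the collected RAW_ENTITIES
    # destinations must also end up in a handler.
    if self_processing:
        handlers.append(RawEntityHandler(self_processing))
    return handlers
```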
"""CRUD operations for sync connections.

Note: SyncConnection doesn't have organization_id directly.
Access control should be enforced at the Sync level before calling these methods.
P1: This CRUD bypasses the project’s standard ApiContext/org access validation and relies on callers to enforce authorization. That’s a security footgun for cross-org reads/updates/deletes. Prefer requiring ctx: ApiContext and validating access (e.g., via joining Sync to check organization_id) inside these methods.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/crud/crud_sync_connection.py, line 21:
<comment>This CRUD bypasses the project’s standard ApiContext/org access validation and relies on callers to enforce authorization. That’s a security footgun for cross-org reads/updates/deletes. Prefer requiring `ctx: ApiContext` and validating access (e.g., via joining `Sync` to check `organization_id`) inside these methods.</comment>
<file context>
@@ -0,0 +1,285 @@
+ """CRUD operations for sync connections.
+
+ Note: SyncConnection doesn't have organization_id directly.
+ Access control should be enforced at the Sync level before calling these methods.
+ """
+
</file context>
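The suggestion above, validating organization access inside the CRUD method through the parent `Sync`, might look like the sketch below. It uses in-memory dicts in place of the database join, and `ApiContext`, `Sync`, and `SyncConnection` are reduced stand-ins rather than the project's real classes. Raising `PermissionError` is also an assumption; the project may use its own exception type.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from uuid import UUID


@dataclass(frozen=True)
class ApiContext:
    organization_id: UUID


@dataclass
class Sync:
    id: UUID
    organization_id: UUID


@dataclass
class SyncConnection:
    id: UUID
    sync_id: UUID


class CRUDSyncConnection:
    """In-memory stand-in; a real implementation would join Sync in SQL."""

    def __init__(self, syncs: Dict[UUID, Sync], connections: Dict[UUID, SyncConnection]):
        self._syncs = syncs
        self._connections = connections

    def get(self, connection_id: UUID, ctx: ApiContext) -> Optional[SyncConnection]:
        conn = self._connections.get(connection_id)
        if conn is None:
            return None
        parent = self._syncs.get(conn.sync_id)
        # SyncConnection has no organization_id of its own, so ownership is
        # resolved through the parent Sync before anything is returned.
        if parent is None or parent.organization_id != ctx.organization_id:
            raise PermissionError("sync connection belongs to another organization")
        return conn
```

Putting the check inside `get` means a caller cannot forget it, which is the point of the review comment.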
# 1. Validate ARF store exists
arf_stats = await raw_data_service.get_replay_stats(str(sync_id))
if not arf_stats.get("exists"):
raise ValueError(f"No ARF store found for sync {sync_id}")
P2: Raising ValueError for a missing ARF store may surface as a 500. Prefer NotFoundException (or a domain-specific exception that your exception handlers map) so the API returns a predictable status code.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/sync/multiplex/replay.py, line 141:
<comment>Raising `ValueError` for a missing ARF store may surface as a 500. Prefer `NotFoundException` (or a domain-specific exception that your exception handlers map) so the API returns a predictable status code.</comment>
<file context>
@@ -0,0 +1,246 @@
+ # 1. Validate ARF store exists
+ arf_stats = await raw_data_service.get_replay_stats(str(sync_id))
+ if not arf_stats.get("exists"):
+ raise ValueError(f"No ARF store found for sync {sync_id}")
+
+ entity_count = arf_stats.get("entity_count", 0)
</file context>
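A sketch of the suggested change, assuming a `NotFoundException` that the API's exception handlers map to a 404. The exception class and `FakeRawDataService` below are local stand-ins so the example runs on its own, and `validate_arf_store` is a hypothetical helper name wrapping the lines from the diff.

```python
import asyncio
from typing import Any, Dict


class NotFoundException(Exception):
    """Stand-in for the project's exception; assumed to map to HTTP 404."""


class FakeRawDataService:
    """Test double that exposes only get_replay_stats."""

    def __init__(self, stats_by_sync: Dict[str, Dict[str, Any]]):
        self._stats_by_sync = stats_by_sync

    async def get_replay_stats(self, sync_id: str) -> Dict[str, Any]:
        return self._stats_by_sync.get(sync_id, {"exists": False})


async def validate_arf_store(raw_data_service: Any, sync_id: Any) -> int:
    """Return the entity count, or raise NotFoundException if no store exists."""
    arf_stats = await raw_data_service.get_replay_stats(str(sync_id))
    if not arf_stats.get("exists"):
        # A domain exception lets the API layer choose the status code,
        # whereas a bare ValueError may surface as a 500.
        raise NotFoundException(f"No ARF store found for sync {sync_id}")
    return arf_stats.get("entity_count", 0)
```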
# Also update sync.destination_connection_ids to include the new destination
# This ensures backward compatibility with existing sync flow
current_dest_ids = list(sync.destination_connection_ids or [])
P1: sync.destination_connection_ids appears to not exist in the current Sync model, so fork() will crash with AttributeError. Guard this update (or remove it) if the field isn’t present.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/sync/multiplex/multiplexer.py, line 116:
<comment>`sync.destination_connection_ids` appears to not exist in the current Sync model, so `fork()` will crash with AttributeError. Guard this update (or remove it) if the field isn’t present.</comment>
<file context>
@@ -0,0 +1,403 @@
+
+ # Also update sync.destination_connection_ids to include the new destination
+ # This ensures backward compatibility with existing sync flow
+ current_dest_ids = list(sync.destination_connection_ids or [])
+ if destination_connection_id not in current_dest_ids:
+ current_dest_ids.append(destination_connection_id)
</file context>
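If the field may be absent from the `Sync` model, the update could be guarded as in this sketch. `add_destination_id` is a hypothetical helper, and the test objects would be plain namespaces rather than real `Sync` instances.

```python
from typing import Any
from uuid import UUID


def add_destination_id(sync: Any, destination_connection_id: UUID) -> bool:
    """Append the destination ID if the sync object carries the legacy field.

    Returns False when the attribute does not exist, so the caller can skip
    the backward-compatibility update instead of raising AttributeError.
    """
    if not hasattr(sync, "destination_connection_ids"):
        return False

    current_dest_ids = list(sync.destination_connection_ids or [])
    if destination_connection_id not in current_dest_ids:
        current_dest_ids.append(destination_connection_id)
    sync.destination_connection_ids = current_dest_ids
    return True
```

If the field has been removed from the model for good, deleting the update entirely is simpler than guarding it.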
Summary
Adds SyncMultiplexer for managing destination migrations (blue-green deployments) and refactors `SyncFactory` into modular builders.

🔀 SyncMultiplexer
Enables blue-green vector DB migrations (Qdrant → Vespa, config v0 → v1, etc.)
- Operations: `fork`, `switch`, `resync`, `list`
- Destination roles: `ACTIVE` / `SHADOW` / `DEPRECATED`
- API: `/sync-multiplex/{sync_id}/destinations/...`
- Gating: Requires `SYNC_MULTIPLEXER` feature flag

🏗️ Factory Refactor
Split monolithic `SyncFactory` into focused builders:

Changes
New:
- `platform/sync/multiplex/` - Multiplexer + ARF replay
- `platform/sync/factory/` - Modular builders
- `api/v1/endpoints/sync_multiplex.py` - API endpoints
- `crud/crud_sync_connection.py` - Role-based filtering
- `schemas/sync_connection.py` - Request/response models
- `add_role_to_sync_connection.py`

Modified:
- `models/sync_connection.py` - Added `DestinationRole` enum + `role` column
- `core/shared_models.py` - Added `SYNC_MULTIPLEXER` feature flag

Summary by cubic
Introduces SyncMultiplexer for blue-green destination migrations with ARF replay, enabling safe vector DB switches without downtime. Refactors SyncFactory into modular builders to simplify orchestration and reuse.
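To make the role model concrete, here is a minimal sketch of how a `switch` could move roles in a blue-green migration. The PR does not spell out the transition rules, so the ones below are assumptions: the shadow destination is promoted to `ACTIVE` and the previous active one is demoted to `DEPRECATED`. The enum string values and the reduced `SyncConnection` dataclass are also illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DestinationRole(str, Enum):
    # Member names come from the PR; the string values are placeholders.
    ACTIVE = "active"
    SHADOW = "shadow"
    DEPRECATED = "deprecated"


@dataclass
class SyncConnection:
    destination: str
    role: DestinationRole


def switch(connections: List[SyncConnection], new_active: str) -> None:
    """Promote a shadow destination and demote the current active one."""
    target = next((c for c in connections if c.destination == new_active), None)
    if target is None:
        raise LookupError(f"Unknown destination: {new_active}")
    if target.role is not DestinationRole.SHADOW:
        raise ValueError("Only a SHADOW destination can be promoted")

    for conn in connections:
        if conn.role is DestinationRole.ACTIVE:
            conn.role = DestinationRole.DEPRECATED
    target.role = DestinationRole.ACTIVE
```

Under these assumed rules, a Qdrant to Vespa migration would fork Vespa as `SHADOW`, populate it via ARF replay, and then call `switch` so that queries move over while Qdrant stays available as `DEPRECATED` for rollback.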
New Features
Refactors
Written for commit 6db4ca7. Summary will update automatically on new commits.