Description
Easy Data Migrations for DocumentDB and MongoDB databases is a feature designed to simplify small-scale data migrations by leveraging a user-friendly copy-and-paste experience. This feature is ideal for smaller datasets where all data can be moved through the user's local machine.
Feature Overview
This feature provides an intuitive way to migrate collections between databases, servers, or clusters by mimicking the familiar “copy-and-paste” paradigm. It offers flexibility for handling conflicts, tracking progress, and ensuring user control throughout the migration process.
Core Features
- Copy-and-Paste Collection Workflow
- Use the context menu to "Copy" a collection.
- "Paste" the copied collection into:
- A different database in the same cluster.
- A database on a different server or cluster.
- An existing collection, with options for conflict resolution.
- Conflict Resolution Options
- Provide choices when conflicts arise during migration:
- Document Conflict (`_id` exists): Overwrite, skip, or rename conflicting documents.
- Schema Validation Issues: Skip invalid documents and log errors, or pause for user intervention.
- Index Configuration: Option to copy index definitions from the source to the target collection.
- Handle naming conflicts by prompting the user to confirm a new collection name when a collection with the same name exists.
- Monitoring and Abort Options
- Display real-time progress during the migration:
- Total documents copied.
- Data transferred (in MB).
- Time elapsed.
- Number of errors encountered.
- Allow the user to abort the operation at any point.
- Error Logging and Review
- Log all errors encountered during the migration for user review.
- Provide actionable insights to help address issues, such as invalid schemas or conflicts.
Discussion Areas
- Conflict Resolution Behavior
- What default behavior should we adopt for document conflicts (`_id` clashes, schema rejections)?
- Should the feature include automated retries for failed documents?
➡️ Allow retrying up to X (configurable) times; for now, keep the configuration hidden from the user.
- Let's ensure that our API accepts a `retry-count` value, and we'll do something with it later (see the sketch below).
- Investigate what the driver already offers.
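A minimal sketch of accepting a retry count, assuming a hypothetical `MigrationConfig` type and a generic retry helper; the names and defaults are placeholders until we know what the driver already covers.

```typescript
// Hypothetical shape; the retry count is accepted by the API but not yet user-configurable.
interface MigrationConfig {
    retryCount: number; // how many times a failed document write may be retried
}

// Generic helper: retries an async operation up to `retryCount` times before giving up.
async function withRetries<T>(operation: () => Promise<T>, retryCount: number): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= retryCount; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
        }
    }
    throw lastError;
}
```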
- Migration Targets
- Should we prioritize support for specific migration targets, such as same-cluster migrations versus cross-cluster migrations?
- Would users benefit from features like pre-validation to identify potential issues before starting the migration?
➡️ This iteration will not care; we'll just write to a selected target. Platform-specific optimizations will be discussed and added later.
- Progress Tracking and UX
- Are there additional metrics or progress indicators you’d like to see?
- Should we include estimates for time remaining or throughput (documents/second)?
➡️ A separate `source count` operation should start, and once it comes back, we should be able to show the total and then compute the progress.
➡️ Consideration: we're copying entire collections (not filtered), so in theory the count should be quick (if an `_id` index exists...). However, in the future we might support copies of query results. Also, this depends on the database engine; other database platforms might handle this differently.
➡️ This is when we'll re-touch the code base. For now, the initialization phase will run a count on the source database before moving forward (see the sketch below).
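A sketch of that initialization step, assuming the MongoDB Node.js driver; the helper name is hypothetical.

```typescript
import { MongoClient } from 'mongodb';

// Hypothetical initialization step: kick off the source count so progress can show a total later.
async function startSourceCount(client: MongoClient, database: string, collection: string): Promise<number> {
    const source = client.db(database).collection(collection);
    // estimatedDocumentCount() reads collection metadata and avoids a full scan;
    // countDocuments() is exact but can be slower on large collections.
    return source.estimatedDocumentCount();
}
```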
- Data Size Warnings
- Should we warn users about potential delays for very large datasets?
- How can we provide meaningful warnings without overloading the user with information?
➡️ Not at this point. We'll add more user feedback in the next iteration: this will take the
source count
and the throughput into account and suggest different approaches. We plan to support Migration Plugins, so this would be their entry point.
- Counting Document Considerations
- Should we inform users that counting the number of documents in a collection can be expensive if no appropriate index exists?
- Would pre-validation of dataset size be helpful, or should we rely on real-time progress metrics instead?
➡️ Not in this iteration: this would make the ticket too large. No pre-validation.
- Advanced Options
- Would users benefit from batch-based migrations for larger datasets, even though the feature is intended for smaller datasets?
- Should we support conditional migrations, where only documents matching a query are copied?
➡️ Not in this iteration.
How It Will Work
- Initiating the Copy-and-Paste Workflow
- The user selects a collection and chooses Copy from the context menu.
- Internally, the collection is marked for migration, without affecting the source.
- Pasting to a Target
- The user navigates to the desired database, server, or collection and chooses Paste.
- If a collection with the same name exists, the user is prompted to provide a new name or confirm overwriting.
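One possible wiring for the copy/paste flow above, using VS Code command registration; the command IDs, tree-item shape, and the in-memory "clipboard" are illustrative assumptions, not the final design.

```typescript
import * as vscode from 'vscode';

// Illustrative shape of the tree item the context menu acts on.
interface CollectionNode {
    connectionId: string;
    database: string;
    collection: string;
}

let copiedCollection: CollectionNode | undefined; // in-memory marker; the source is not touched

export function registerCopyPasteCommands(context: vscode.ExtensionContext): void {
    context.subscriptions.push(
        // "Copy" only records which collection was selected.
        vscode.commands.registerCommand('documentdb.copyCollection', (node: CollectionNode) => {
            copiedCollection = node;
        }),
        // "Paste" starts the migration against the selected target.
        vscode.commands.registerCommand('documentdb.pasteCollection', (target: CollectionNode) => {
            if (!copiedCollection) {
                void vscode.window.showWarningMessage('No collection has been copied yet.');
                return;
            }
            // Hand off copiedCollection → target to the (future) Copy-and-Paste task.
        }),
    );
}
```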
- Configuring Conflict Handling
- After Pasting to a Target, the user selects preferences for resolving potential conflicts:
- Overwrite, skip, or rename documents with duplicate `_id`s.
- Skip or abort for documents that violate schema validation.
- Why not `pause`? There is not much we can do at this point. The user could go and modify the configuration of the target collection and then request that the task continue, but the user can achieve the same result by restarting the task or choosing to skip.
- Copy indexes: The user will be asked whether index configuration should be copied as well.
- Options to create indexes before data copy (ensures validation, potentially slower initial inserts).
- Or after data copy (faster initial inserts, but potential validation issues during import).
- The key question here is: should the user be asked for each conflict?
- Pro: full control.
- Con: more complex architecture.
- ➡️ This iteration will not support this more complex approach. A task is configured and then executed, with no option to 'ask for each conflict.'
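A possible shape for the conflict-handling preferences collected in this step; the type and field names are assumptions for discussion, not a settled API.

```typescript
// Assumed configuration captured before the task runs (no per-conflict prompts in this iteration).
type DocumentConflictBehavior = 'overwrite' | 'skip' | 'rename';
type SchemaViolationBehavior = 'skip' | 'abort';

interface ConflictResolutionConfig {
    onDuplicateId: DocumentConflictBehavior;     // duplicate _id in the target collection
    onSchemaViolation: SchemaViolationBehavior;  // document rejected by target validation rules
    copyIndexes: boolean;                        // copy index definitions from the source
    createIndexesBeforeData: boolean;            // true: validate early; false: faster initial inserts
}
```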
- Executing the Migration
- The migration begins, with real-time progress tracking displayed:
- Total documents copied.
- Data transferred (in MB).
- Time elapsed and estimated time remaining.
- Number of errors encountered.
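These metrics could be carried in a small status object that the UI polls; the field names below are illustrative.

```typescript
// Illustrative progress snapshot reported while the migration runs.
interface MigrationProgress {
    documentsCopied: number;
    bytesTransferred: number;      // rendered as MB in the UI
    elapsedMs: number;
    estimatedRemainingMs?: number; // only available once the source count has resolved
    errorCount: number;
}
```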
- Error Management and Review
- Any errors are logged with details, including the document `_id`, error type, and suggested resolutions.
- The user can export the error log for further analysis.
- Aborting the Migration
- The user can abort the migration at any time.
- All completed operations up to that point remain intact, with errors logged for review.
Development Plan
1. Introduce a "Task Engine"
- We plan to support more long-running operations in the future; a core Task Engine is needed.
- We'll keep it simple in this iteration: a map, and a `Task` interface with commands like `start`, `stop`, `delete`, `getStatus`, etc. (see the sketch below).
- Tasks will receive unique IDs to support concurrent operations.
- Note: Tasks will not persist across VS Code restarts in this iteration.
- Should we support the registration of an event handler for status updates? ❓ `getStatus` in a pull approach should be sufficient; let's discuss.
- ⚡ Note: This implementation should be database-type-agnostic and task-type-agnostic.
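A minimal sketch of the Task Engine described above, assuming a pull-based `getStatus` and an in-memory map keyed by task ID; nothing here persists across VS Code restarts.

```typescript
type TaskStatus = 'pending' | 'running' | 'stopped' | 'completed' | 'failed';

// Database-type- and task-type-agnostic contract; concrete tasks implement this.
interface Task {
    readonly id: string;
    start(): Promise<void>;
    stop(): Promise<void>;
    delete(): Promise<void>;   // release any resources held by the task
    getStatus(): TaskStatus;   // pull-based status check
}

// Simple in-memory registry; unique IDs allow concurrent operations.
class TaskEngine {
    private readonly tasks = new Map<string, Task>();

    register(task: Task): void {
        this.tasks.set(task.id, task);
    }

    get(id: string): Task | undefined {
        return this.tasks.get(id);
    }

    async remove(id: string): Promise<void> {
        await this.tasks.get(id)?.delete();
        this.tasks.delete(id);
    }
}
```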
2. Basic "Copy-and-Paste" Task
- Accept two `connectionId`s, `database`, `collection`, and `config` with conflict resolution configuration (a basic 'abort on every error' config in this step); see the sketch below.
- We're working with a `connectionId` here because this will support all connected servers, even those with more complex authentication methods that go beyond a "simple" connection string.
- Note: Network resilience is handled by the existing Connection class and is outside the scope of this task. The task will abort with an error if the connection cannot be recovered.
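Assuming the hypothetical `ConflictResolutionConfig` sketched earlier, the task's inputs could look like this; splitting database/collection into source and target fields is an assumption, and only the 'abort on every error' configuration would be honored in this step.

```typescript
// Inputs for the basic Copy-and-Paste task; connection IDs resolve to already-connected clients,
// so authentication details never pass through the task itself.
interface CopyPasteTaskOptions {
    sourceConnectionId: string;
    targetConnectionId: string;
    sourceDatabase: string;
    sourceCollection: string;
    targetDatabase: string;
    targetCollection: string;
    config: ConflictResolutionConfig; // 'abort' on every error in this step
}
```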
- Memory Management: Implement a buffer-based streaming approach (sketched below) where:
- One async operation reads from the source into the buffer and pauses when the buffer is full.
- Another async operation reads from the buffer and writes to the target using MongoDB bulk operations.
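A rough sketch of the streaming idea, assuming the MongoDB Node.js driver on both ends; for brevity this is a single-loop variant in which the cursor acts as the reader and each full buffer is flushed with one bulk insert, whereas the real task would run the read and write sides concurrently. Batch size is a placeholder.

```typescript
import { MongoClient, Document } from 'mongodb';

const BATCH_SIZE = 500; // placeholder buffer size

// Simplified copy loop: read from the source cursor into a bounded buffer,
// flush the buffer to the target with a bulk insert, repeat until the cursor is exhausted.
async function copyCollection(
    source: MongoClient,
    target: MongoClient,
    db: string,
    collection: string,
): Promise<void> {
    const cursor = source.db(db).collection(collection).find({});
    const targetCollection = target.db(db).collection(collection);
    let buffer: Document[] = [];

    for await (const doc of cursor) {
        buffer.push(doc);
        if (buffer.length >= BATCH_SIZE) {
            await targetCollection.insertMany(buffer, { ordered: false });
            buffer = [];
        }
    }
    if (buffer.length > 0) {
        await targetCollection.insertMany(buffer, { ordered: false });
    }
}
```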
- Implement basic tests to be run on demand using Jest.
- ⚡ Note: The architecture should be database-type-agnostic (see the interface sketch below).
- This means we should have a Copy-and-Paste task that, when constructed, receives a database-specific reader and a database-specific writer to work with.
- This is how we'll ensure DocumentDB compatibility while making the architecture reusable for other projects.
- ⭐ If we added a `Converter`, we could, in theory, make it even better, as we could move data between databases. It's not relevant for this extension (our converter would just be `undefined`), but API users would benefit from this in the future.
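One way to express the database-agnostic split, with the optional `Converter` in the middle; all names are placeholders for discussion.

```typescript
// Reads documents from a source in chunks; implemented per database type.
interface CollectionReader<TDoc> {
    readBatch(maxDocuments: number): Promise<TDoc[]>; // empty array signals end of data
}

// Writes documents to a target in bulk; implemented per database type.
interface CollectionWriter<TDoc> {
    writeBatch(documents: TDoc[]): Promise<void>;
}

// Optional translation step between source and target document shapes.
type Converter<TSource, TTarget> = (doc: TSource) => TTarget;

// The task only depends on the interfaces, never on a concrete driver.
class CopyPasteTask<TSource, TTarget = TSource> {
    constructor(
        private readonly reader: CollectionReader<TSource>,
        private readonly writer: CollectionWriter<TTarget>,
        // For this extension the converter stays undefined; future API users could supply one.
        private readonly convert?: Converter<TSource, TTarget>,
    ) {}

    async run(batchSize: number): Promise<void> {
        for (;;) {
            const batch = await this.reader.readBatch(batchSize);
            if (batch.length === 0) {
                break;
            }
            // Without a converter, TTarget defaults to TSource, so the cast is a no-op.
            const converted = this.convert ? batch.map(this.convert) : (batch as unknown as TTarget[]);
            await this.writer.writeBatch(converted);
        }
    }
}
```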
3. Copy-and-Paste Workflow UX Logic
- Implement backend logic to support marking collections for copying and initiating pasting operations.
- Develop UI for selecting targets and confirming actions.
- In this step, show a dialog confirming the source and the target selection.
4. Conflict Resolution Options (UI)
- Add user-configurable settings for default conflict behaviors.
5. Expand "Copy-and-Paste" Task with supported conflict resolution configuration options
- Support configuration options defined and implemented in step 4.
- Update tests.
6. Progress Monitoring and Metrics
- Design a real-time progress dashboard, displaying key metrics like documents processed, data transferred, and errors encountered.
- Use VS Code's `window.withProgress` API for status reporting in this iteration (see the usage sketch below).
- Include options for aborting tasks.
- Discuss: is it a shared 'Tasks' dashboard? Or is it one for Copy-and-Paste only?
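A sketch of how a running task could surface its metrics through `window.withProgress`, including the cancellation token for aborting; the polling interval, message format, and the `RunningMigration` view of the task are assumptions.

```typescript
import * as vscode from 'vscode';

// Hypothetical view of the running task that the UI can poll.
interface RunningMigration {
    getProgressMessage(): string; // e.g. "1,234 documents, 56 MB, 2 errors"
    isDone(): boolean;
    abort(): Promise<void>;
}

export async function showMigrationProgress(migration: RunningMigration): Promise<void> {
    await vscode.window.withProgress(
        {
            location: vscode.ProgressLocation.Notification,
            title: 'Copying collection…',
            cancellable: true, // the user can abort at any point
        },
        async (progress, token) => {
            token.onCancellationRequested(() => void migration.abort());
            while (!migration.isDone() && !token.isCancellationRequested) {
                progress.report({ message: migration.getProgressMessage() });
                await new Promise((resolve) => setTimeout(resolve, 1000)); // poll once per second
            }
        },
    );
}
```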
7. Error Logging and Review Tools
- Implement detailed error logging with export capabilities.
- We need a machine-readable document stating what has failed and why.
- 🎁 Bonus: provide a filter query that'd select the failed documents? What would be the size limits here?
- Provide actionable suggestions for resolving common issues.
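For the machine-readable error document, a per-document record along these lines could be exported as JSON; the exact fields are up for discussion.

```typescript
// Illustrative machine-readable record for one failed document.
interface MigrationErrorRecord {
    documentId: string;            // stringified _id of the failed document
    errorType: 'duplicateId' | 'schemaValidation' | 'writeError' | 'other';
    message: string;               // raw error message from the driver
    suggestedResolution?: string;  // e.g. "re-run with 'overwrite' for duplicate _id conflicts"
    timestamp: string;             // ISO 8601
}
```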
8. Optional: Data Size and Performance Considerations
- Include warnings for potential delays when working with large datasets.
- Inform users about the performance impact of counting documents in collections without indexes.
9. Testing and Validation
- Test with diverse datasets and scenarios, including cross-cluster migrations, document conflicts, and schema validation errors.
- Validate performance for smaller datasets and ensure smooth handling of edge cases.
10. Documentation and User Guide
- Provide clear instructions for using the copy-and-paste workflow.
- Include best practices for conflict resolution, troubleshooting, and performance optimization.