
Easy Lightweight Data Migrations 🚀 #63


Easy Data Migrations for DocumentDB and MongoDB databases is a feature designed to simplify small-scale data migrations by leveraging a user-friendly copy-and-paste experience. This feature is ideal for smaller datasets where all data can be moved through the user's local machine.


Feature Overview

This feature provides an intuitive way to migrate collections between databases, servers, or clusters by mimicking the familiar “copy-and-paste” paradigm. It offers flexibility for handling conflicts, tracking progress, and ensuring user control throughout the migration process.

Core Features

  1. Copy-and-Paste Collection Workflow

    • Use the context menu to "Copy" a collection.
    • "Paste" the copied collection into:
      • A different database in the same cluster.
      • A database on a different server or cluster.
      • An existing collection, with options for conflict resolution.
  2. Conflict Resolution Options

    • Provide choices when conflicts arise during migration:
      • Document Conflict (_id exists): Overwrite, skip, or rename conflicting documents.
      • Schema Validation Issues: Skip invalid documents and log errors, or pause for user intervention.
      • Index Configuration: Option to copy index definitions from the source to the target collection.
    • Handle naming conflicts by prompting the user to confirm a new collection name when a collection with the same name exists.
  3. Monitoring and Abort Options

    • Display real-time progress during the migration:
      • Total documents copied.
      • Data transferred (in MB).
      • Time elapsed.
      • Number of errors encountered.
    • Allow the user to abort the operation at any point.
  4. Error Logging and Review

    • Log all errors encountered during the migration for user review.
    • Provide actionable insights to help address issues, such as invalid schemas or conflicts.

Discussion Areas

  1. Conflict Resolution Behavior
    • What default behavior should we adopt for document conflicts (_id clashes, schema rejections)?
    • Should the feature include automated retries for failed documents?

➡️ Allow up to X retries (configurable); for now, keep the setting hidden from the user.

  • Let's ensure that our API accepts a retry-count value; we'll do something with it later (a possible helper is sketched below).
  • Investigate what the driver already offers.
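A minimal sketch of what accepting a retry count could look like, assuming a generic async helper; the names (`RetryOptions`, `withRetries`) are illustrative and not part of the existing code base:

```typescript
// Hypothetical helper: the API accepts a retry count, but the value is not
// yet surfaced to users.
interface RetryOptions {
    maxRetries: number; // configurable internally; hidden from the user for now
}

async function withRetries<T>(
    operation: () => Promise<T>,
    options: RetryOptions = { maxRetries: 3 },
): Promise<T> {
    let lastError: unknown;
    // One initial attempt plus up to maxRetries retries.
    for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
        }
    }
    throw lastError; // all attempts exhausted
}
```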
  2. Migration Targets
    • Should we prioritize support for specific migration targets, such as same-cluster migrations versus cross-cluster migrations?
    • Would users benefit from features like pre-validation to identify potential issues before starting the migration?

➡️ This iteration won't differentiate between targets; we'll simply write to the selected one. Platform-specific optimizations will be discussed and added later.

  3. Progress Tracking and UX
    • Are there additional metrics or progress indicators you’d like to see?
    • Should we include estimates for time remaining or throughput (documents/second)?

➡️ A separate source count operation should start, and once it comes back, we should be able to show the total and then compute the progress.
➡️ Consideration: we're copying entire collections (not filtered), so in theory, the count should be quick (if an _id index exists...). However, in the future, we might support copies of query results. Also, this depends on the database engine; other database platforms might handle this differently.
➡️ This is when we'll re-touch the code base. For now, the initialization phase will run a count on the source database before moving forward (see the sketch below).
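A sketch of how the count could run alongside the copy, using the MongoDB driver's estimatedDocumentCount() (which reads collection metadata rather than scanning); progressState is a hypothetical object owned by the migration task:

```typescript
import { MongoClient } from 'mongodb';

// Hypothetical progress state owned by the migration task.
const progressState: { total?: number; copied: number } = { copied: 0 };

async function startSourceCount(client: MongoClient, dbName: string, collectionName: string): Promise<void> {
    // estimatedDocumentCount() reads collection metadata and is cheap; a full
    // countDocuments() would be needed once filtered (query-result) copies exist.
    progressState.total = await client.db(dbName).collection(collectionName).estimatedDocumentCount();
    // From here on, percentage = copied / total becomes computable.
}
```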

  4. Data Size Warnings
    • Should we warn users about potential delays for very large datasets?
    • How can we provide meaningful warnings without overloading the user with information?

➡️ Not at this point. We'll add more user feedback in the next iteration: this will take the source count and the throughput into account and suggest different approaches. We plan to support Migration Plugins, so this would be their entry point.

  5. Document-Counting Considerations
    • Should we inform users that counting the number of documents in a collection can be expensive if no appropriate index exists?
    • Would pre-validation of dataset size be helpful, or should we rely on real-time progress metrics instead?

➡️ Not in this iteration: it would make this ticket too large. No pre-validation.

  6. Advanced Options
    • Would users benefit from batch-based migrations for larger datasets, even though the feature is intended for smaller datasets?
    • Should we support conditional migrations, where only documents matching a query are copied?

➡️ Not in this iteration.


How It Will Work

  1. Initiating the Copy-and-Paste Workflow

    • The user selects a collection and chooses Copy from the context menu.
    • Internally, the collection is marked for migration, without affecting the source.
  2. Pasting to a Target

    • The user navigates to the desired database, server, or collection and chooses Paste.
    • If a collection with the same name exists, the user is prompted to provide a new name or confirm overwriting.
  3. Configuring Conflict Handling

    • After pasting to a target (step 2), the user selects preferences for resolving potential conflicts:
      • Overwrite, skip, or rename documents with duplicate _ids.
      • Skip or abort for documents that violate schema validation.
        • Why not pause? There is little we could do at that point. The user could modify the target collection's configuration and then ask the task to continue, but the same result can be achieved by restarting the task or choosing to skip.
      • Copy indexes: The user will be asked whether index configuration should be copied as well.
        • Options to create indexes before data copy (ensures validation, potentially slower initial inserts).
        • Or after data copy (faster initial inserts, but potential validation issues during import).
      • The key question here is: should the user be asked for each conflict?
        • Pro: full control.
        • Con: more complex architecture.
      • ➡️ This iteration will not support this more complex approach. A task is configured up front and then executed, with no option to 'ask for each conflict' (a possible configuration shape is sketched below).
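A possible shape for the up-front conflict-resolution configuration, assuming the options listed above; all names are illustrative:

```typescript
// Illustrative configuration assembled before the task starts, so no
// mid-run prompts are needed.
type DocumentConflictAction = 'overwrite' | 'skip' | 'rename';
type ValidationFailureAction = 'skip' | 'abort';

interface ConflictResolutionConfig {
    onDuplicateId: DocumentConflictAction;        // _id already exists in the target
    onValidationFailure: ValidationFailureAction; // target schema rejects the document
    copyIndexes: boolean;                         // copy index definitions as well
    createIndexes: 'beforeData' | 'afterData';    // validation-first vs. faster inserts
}
```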
  4. Executing the Migration

    • The migration begins, with real-time progress tracking displayed:
      • Total documents copied.
      • Data transferred (in MB).
      • Time elapsed and estimated time remaining.
      • Number of errors encountered.
  5. Error Management and Review

    • Any errors are logged with details, including document _id, error type, and suggested resolutions.
    • The user can export the error log for further analysis.
  6. Aborting the Migration

    • The user can abort the migration at any time.
    • All completed operations up to that point remain intact, with errors logged for review.

Development Plan

  1. Introduce a "Task Engine"

    • We plan to support more long-running operations in the future; a core Task Engine is needed.
    • We'll keep it simple in this iteration: a map, a Task interface with commands like start, stop, delete, getStatus, etc.
    • Tasks will receive unique IDs to support concurrent operations.
    • Note: Tasks will not persist across VS Code restarts in this iteration.
    • Should we support the registration of an event handler for status updates ❓
      • getStatus in a pull approach should be sufficient; let's discuss.
    • Note: This implementation should be database-type-agnostic and task-type-agnostic (a minimal sketch follows below).
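A minimal sketch of such a Task Engine, assuming an in-memory map and pull-based status checks; all names are illustrative:

```typescript
type TaskStatus = 'pending' | 'running' | 'stopped' | 'failed' | 'completed';

interface Task {
    readonly id: string; // unique, e.g. crypto.randomUUID(), to allow concurrent tasks
    start(): Promise<void>;
    stop(): Promise<void>;
    getStatus(): TaskStatus;
}

class TaskEngine {
    // Plain map; nothing is persisted across VS Code restarts in this iteration.
    private readonly tasks = new Map<string, Task>();

    register(task: Task): void {
        this.tasks.set(task.id, task);
    }

    get(taskId: string): Task | undefined {
        return this.tasks.get(taskId);
    }

    delete(taskId: string): void {
        this.tasks.delete(taskId);
    }
}
```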
  2. Basic "Copy-and-Paste" Task

    • Accept two connectionIds, database, collection, and a config object with conflict-resolution settings (a basic 'abort on every error' config in this step).

      • We're working with a connectionId here because this will support all connected servers, even the ones with more complex authentication methods going beyond a "simple" connection string.
      • Note: Network resilience is handled by the existing Connection class and is outside the scope of this task. The task will abort with an error if the connection cannot be recovered.
    • Memory Management: Implement a buffer-based streaming approach where:

      • One async operation reads from the source into the buffer and pauses when the buffer is full.
      • Another async operation reads from the buffer and writes to the target using MongoDB bulk operations.
    • Implement basic tests to be run on demand using Jest.

    • Note: The architecture should be database-type-agnostic.

      • This means we should have a Copy-and-Paste task that, when constructed, receives a database-specific reader and a database-specific writer to work with.
      • This is how we'll ensure DocumentDB compatibility while making the architecture reusable for other projects.
      • ⭐ If we added a Converter, we could, in theory, go even further and move data between different database types. It's not relevant for this extension (our converter would simply be undefined), but future API users would benefit from it (see the sketch below).
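A sketch of the database-agnostic task shape, assuming hypothetical DocumentReader/DocumentWriter interfaces injected at construction time. For brevity, the two concurrent buffer loops described above are collapsed into one sequential loop; a real implementation would overlap reads and writes with a bounded buffer:

```typescript
// Hypothetical database-specific adapters injected into the task.
interface DocumentReader {
    readBatch(maxDocs: number): Promise<unknown[]>; // empty array => source drained
}

interface DocumentWriter {
    writeBatch(docs: unknown[]): Promise<void>; // e.g. a MongoDB bulk write internally
}

// Optional cross-database mapping; undefined for same-engine copies.
type Converter = (doc: unknown) => unknown;

class CopyPasteTask {
    constructor(
        private readonly reader: DocumentReader,
        private readonly writer: DocumentWriter,
        private readonly convert?: Converter,
        private readonly batchSize = 500,
    ) {}

    async run(): Promise<void> {
        for (;;) {
            const batch = await this.reader.readBatch(this.batchSize);
            if (batch.length === 0) {
                return; // source exhausted
            }
            const docs = this.convert ? batch.map(this.convert) : batch;
            await this.writer.writeBatch(docs);
        }
    }
}
```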
  3. Copy-and-Paste Workflow UX Logic

    • Implement backend logic to support marking collections for copying and initiating pasting operations.
    • Develop UI for selecting targets and confirming actions.
    • In this step, show a dialog confirming the source and the target selection.
  4. Conflict Resolution Options (UI)

    • Add user-configurable settings for default conflict behaviors.
  5. Expand "Copy-and-Paste" Task with supported conflict resolution configuration options

    • Support configuration options defined and implemented in step 4.
    • Update tests.
  6. Progress Monitoring and Metrics

    • Design a real-time progress dashboard, displaying key metrics like documents processed, data transferred, and errors encountered.
    • Use VS Code's window.withProgress API for status reporting in this iteration (see the sketch below).
    • Include options for aborting tasks.
    • Discuss: is it a shared 'Tasks' dashboard? Or is it one for Copy-and-Paste only?
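A sketch of wiring the task's metrics and abort into VS Code's window.withProgress; the task shape is the hypothetical one from step 2:

```typescript
import * as vscode from 'vscode';

interface MigrationMetrics { copied: number; megabytes: number; errors: number }

async function runWithProgress(task: {
    run(): Promise<void>;
    stop(): Promise<void>;
    getMetrics(): MigrationMetrics;
}): Promise<void> {
    await vscode.window.withProgress(
        {
            location: vscode.ProgressLocation.Notification,
            title: 'Copying collection…',
            cancellable: true, // gives the user an abort button
        },
        async (progress, token) => {
            // Abort maps onto the task's stop() command.
            token.onCancellationRequested(() => void task.stop());
            const timer = setInterval(() => {
                const m = task.getMetrics();
                progress.report({
                    message: `${m.copied} documents, ${m.megabytes.toFixed(1)} MB, ${m.errors} errors`,
                });
            }, 1000);
            try {
                await task.run();
            } finally {
                clearInterval(timer);
            }
        },
    );
}
```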
  7. Error Logging and Review Tools

    • Implement detailed error logging with export capabilities.
      • We need a machine-readable document stating what has failed and why (a possible record shape is sketched below).
      • 🎁 Bonus: provide a filter query that would select the failed documents? What would the size limits be here?
    • Provide actionable suggestions for resolving common issues.
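A possible machine-readable error record, exported as NDJSON so logs of any size can be streamed and filtered line by line; field names are illustrative:

```typescript
interface MigrationError {
    documentId: string;            // stringified _id of the failed document
    errorType: 'duplicateId' | 'validation' | 'write' | 'other';
    message: string;               // raw driver/server error message
    suggestion?: string;           // actionable hint, when one can be inferred
}

// One JSON object per line keeps the export streamable and grep-friendly.
function toNdjson(errors: MigrationError[]): string {
    return errors.map((e) => JSON.stringify(e)).join('\n');
}
```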
  8. Optional: Data Size and Performance Considerations

    • Include warnings for potential delays when working with large datasets.
    • Inform users about the performance impact of counting documents in collections without indexes.
  9. Testing and Validation

    • Test with diverse datasets and scenarios, including cross-cluster migrations, document conflicts, and schema validation errors.
    • Validate performance for smaller datasets and ensure smooth handling of edge cases.
  10. Documentation and User Guide

  • Provide clear instructions for using the copy-and-paste workflow.
  • Include best practices for conflict resolution, troubleshooting, and performance optimization.
