
Database dumps do not work on large databases #59

@Brayden

Description


Describe the bug
If you use any of the database dump endpoints (SQL, CSV, or JSON), the entire dataset is loaded into memory and then written out as a dump file. To support databases of any size, we should investigate enhancements that allow exporting regardless of size. The current size limitation for Durable Objects is 1GB, with 10GB coming in the future, so operate under the assumption that we might be attempting to dump a 10GB database into a .sql file.

Another consideration is that Durable Objects execute operations synchronously, so we may need to allow for "breathing intervals". For example, the export operation could run for 5 seconds, then pause for 5 seconds if other requests are queued, and then pick up again. The goal here is to prevent locking the database for long periods of time.
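A minimal sketch of such a loop, assuming hypothetical exportNextChunk() and hasPendingRequests() helpers (neither exists in the codebase today) and the 5-second windows from the example above:

const WORK_WINDOW_MS = 5_000;
const BREATHING_PAUSE_MS = 5_000;

// Runs the export in 5-second work slices, pausing between slices when other
// requests are waiting so the database is not locked for long stretches.
async function dumpWithBreathingIntervals(
  exportNextChunk: () => Promise<boolean>, // hypothetical; returns false once the dump is finished
  hasPendingRequests: () => boolean        // hypothetical; true if other requests are queued
): Promise<void> {
  let windowStart = Date.now();
  while (await exportNextChunk()) {
    if (Date.now() - windowStart >= WORK_WINDOW_MS) {
      if (hasPendingRequests()) {
        // Yield for 5 seconds so queued requests can be served first.
        await new Promise((resolve) => setTimeout(resolve, BREATHING_PAUSE_MS));
      }
      windowStart = Date.now();
    }
  }
}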

But this then poses a few questions:

  1. How do we continue operations that need more than 30 seconds to work?
  2. Where is the data stored as it's being created? (R2, S3, something else)?
  3. How do we deliver that dump information to the user after it's completed?

To Reproduce
Steps to reproduce the behavior:

  1. Hit the /export/dump endpoint on a large database
  2. The request will eventually fail when the 30-second response window closes

Run the following command in Terminal (replace the URL with yours); if the operation exceeds 30 seconds, you should see a failed network response instead of a dump file.

curl --location 'https://starbasedb.YOUR-ID-HERE.workers.dev/export/dump' \
--header 'Authorization: Bearer ABC123' \
--output database_dump.sql

If you can't create a large enough test database, feel free to add code that sleeps for 29 seconds before the /export/dump functional code runs; you should then see the same failure.
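For example, a rough sketch of that delay (handleDump is a hypothetical stand-in for the existing /export/dump handler, not its real name):

async function simulateSlowDump(
  request: Request,
  handleDump: (req: Request) => Promise<Response> // hypothetical stand-in for the /export/dump handler
): Promise<Response> {
  // Sleep for 29 seconds so the total handling time crosses the ~30 second window.
  await new Promise((resolve) => setTimeout(resolve, 29_000));
  return handleDump(request);
}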

Expected behavior
As a user, I would expect all of the specified data to be dumped without an error and without partial results. Where the dump ends up for the user to access when the operation takes more than 30 seconds is up for discussion. Ideally, if it takes less than 30 seconds, it could be returned directly, as the cURL above works today (downloading the file from the response of the origin request); after the timeout, perhaps the export continues and uploads the file to a destination the user can access afterwards?

Proposed Solution:

  1. For backups, require an R2 binding
  2. Create a .sql file in R2 with a filename like dump_20240101-170000.sql, where the timestamp represents 2024-01-01 17:00:00
  3. Create the file and continuously append new chunks to it until reaching the end of the database
  4. We may need to utilize a DO alarm to continue the work after X time if a timeout occurs, and record where the process currently is in internal storage so it can pick up and continue (see the sketch after this list)
  5. Provide a callback URL that is invoked when the operation finally completes so users can create custom logic to notify them (e.g. Email, Slack, etc.)
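A rough sketch of steps 2-4 under a few assumptions: an R2 bucket binding named DUMP_BUCKET, a StarbaseDumper Durable Object class, a readRows() helper that returns the next batch of rows as SQL text, and an arbitrary 20-second work budget per alarm. None of these names exist in the codebase today. Chunks are appended with R2's multipart upload API, and a DO alarm resumes the dump from the last recorded offset:

interface DumpState {
  key: string;             // e.g. dump_20240101-170000.sql
  uploadId: string;        // R2 multipart upload id
  parts: R2UploadedPart[]; // parts uploaded so far
  nextOffset: number;      // where to resume reading rows
}

export class StarbaseDumper {
  constructor(private ctx: DurableObjectState, private env: { DUMP_BUCKET: R2Bucket }) {}

  async startDump(): Promise<string> {
    // dump_YYYYMMDD-HHMMSS.sql, e.g. dump_20240101-170000.sql for 2024-01-01 17:00:00
    const key = `dump_${new Date().toISOString().replace(/[-:]/g, '').replace('T', '-').slice(0, 15)}.sql`;
    const upload = await this.env.DUMP_BUCKET.createMultipartUpload(key);
    await this.ctx.storage.put<DumpState>('dumpState', { key, uploadId: upload.uploadId, parts: [], nextOffset: 0 });
    await this.ctx.storage.setAlarm(Date.now()); // kick off the first work slice
    return key;
  }

  async alarm(): Promise<void> {
    const state = await this.ctx.storage.get<DumpState>('dumpState');
    if (!state) return;

    const upload = this.env.DUMP_BUCKET.resumeMultipartUpload(state.key, state.uploadId);
    const deadline = Date.now() + 20_000; // assumed budget; stays under the invocation limit

    while (Date.now() < deadline) {
      const chunk = await this.readRows(state.nextOffset);
      if (chunk === null) {
        await upload.complete(state.parts);
        await this.ctx.storage.delete('dumpState');
        // Step 5: invoke the user-provided callback URL here to signal completion.
        return;
      }
      // Note: R2 multipart parts (other than the last) have a minimum size, so in
      // practice chunks would need to be batched up before being uploaded as a part.
      const part = await upload.uploadPart(state.parts.length + 1, chunk.sql);
      state.parts.push(part);
      state.nextOffset = chunk.nextOffset;
      await this.ctx.storage.put('dumpState', state);
    }

    // Out of time: schedule another alarm and pick up from nextOffset later.
    await this.ctx.storage.setAlarm(Date.now() + 5_000);
  }

  // Hypothetical: reads the next batch of rows starting at `offset` and returns
  // the SQL text plus the next offset, or null when the database is exhausted.
  private async readRows(offset: number): Promise<{ sql: string; nextOffset: number } | null> {
    throw new Error('not implemented in this sketch');
  }
}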
