Implement file uploads #713

@ja573

We need to implement a secure and scalable file upload architecture for books and chapters.

The system must support:

  1. A GraphQL mutation to request a presigned upload URL, returning S3 upload details to the client.
  2. The client uploads directly to a private S3 prefix (uploads/...) that is not exposed publicly by CloudFront.
  3. A second GraphQL mutation to complete the upload, where the backend:
    • Validates the uploaded file (size, type, checksum)
    • Moves it to its final, public CDN-served path
    • Builds the canonical filename pattern (all lowercase): /{doi_prefix}/{doi_suffix}.{file_extension}
    • Creates the database record containing file metadata (checksum, mime, bytes, etc.)
    • On file updates: invalidates CloudFront cache for the final path

Database Schema

NB. Add errors to thoth-errors/src/database_errors.rs

1. Storage configuration per imprint

Only superusers may read or modify it:

ALTER TABLE imprint
  ADD COLUMN s3_bucket          TEXT,
  ADD COLUMN s3_region          TEXT,
  ADD COLUMN cdn_domain         TEXT,
  ADD COLUMN cloudfront_dist_id TEXT;

-- all or nothing
ALTER TABLE imprint
  ADD CONSTRAINT imprint_storage_cfg_all_or_none
  CHECK (
    (
      s3_bucket          IS NULL AND
      s3_region          IS NULL AND
      cdn_domain         IS NULL AND
      cloudfront_dist_id IS NULL
    )
    OR
    (
      s3_bucket          IS NOT NULL AND
      s3_region          IS NOT NULL AND
      cdn_domain         IS NOT NULL AND
      cloudfront_dist_id IS NOT NULL
    )
  );

AWS credentials (user/key) are passed at runtime via CLI/env and are not stored in the database.
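
On the server side, the all-or-nothing rule could be mirrored by converting the four nullable columns into a single optional configuration value. The following is a minimal sketch, not the intended implementation: the type names ImprintStorageColumns, StorageConfig and StorageConfigError are hypothetical and only the column names come from the migration above.

```rust
/// The four nullable storage columns as read from an `imprint` row.
/// (Hypothetical struct; only the field names come from the migration.)
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ImprintStorageColumns {
    pub s3_bucket: Option<String>,
    pub s3_region: Option<String>,
    pub cdn_domain: Option<String>,
    pub cloudfront_dist_id: Option<String>,
}

/// A fully populated storage configuration (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct StorageConfig {
    pub s3_bucket: String,
    pub s3_region: String,
    pub cdn_domain: String,
    pub cloudfront_dist_id: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum StorageConfigError {
    /// No column is set: the imprint is not configured for file hosting.
    NotConfigured,
    /// Some but not all columns are set; the CHECK constraint should prevent this.
    PartiallyConfigured,
}

impl ImprintStorageColumns {
    /// Mirror of the all-or-nothing CHECK constraint.
    pub fn into_config(self) -> Result<StorageConfig, StorageConfigError> {
        match (self.s3_bucket, self.s3_region, self.cdn_domain, self.cloudfront_dist_id) {
            (Some(s3_bucket), Some(s3_region), Some(cdn_domain), Some(cloudfront_dist_id)) => {
                Ok(StorageConfig { s3_bucket, s3_region, cdn_domain, cloudfront_dist_id })
            }
            (None, None, None, None) => Err(StorageConfigError::NotConfigured),
            _ => Err(StorageConfigError::PartiallyConfigured),
        }
    }
}
```

The NotConfigured case would map naturally to the "Imprint is not configured for file hosting" error described in the server logic.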

2. File types and URL structure

Define a FileType enum with the two types of uploads we're supporting now. We'll eventually expand this with additional resources, backcovers, etc.

CREATE TYPE file_type AS ENUM ('publication', 'frontcover');

URL structure:

  • Publication file:
    /{doi_prefix}/{doi_suffix}.{extension}
  • Front cover:
    /{doi_prefix}/{doi_suffix}_frontcover.{extension}

Both are lowercased before being written to object_key.
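
To illustrate the URL structure, here is a small sketch of a key builder. The function name canonical_key and the Rust FileType enum are assumptions made for this example. It also assumes the DOI has already been split into prefix and suffix, since DOI parsing is not specified here.

```rust
/// Rust-side counterpart of the `file_type` SQL enum (name assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Publication,
    Frontcover,
}

/// Build the canonical object key for a file.
/// The whole key is lowercased, as required for `object_key`.
/// A leading dot on the extension (".pdf") is tolerated and removed.
pub fn canonical_key(
    doi_prefix: &str,
    doi_suffix: &str,
    file_type: FileType,
    extension: &str,
) -> String {
    let extension = extension.trim_start_matches('.');
    // Front covers get a fixed marker between the suffix and the extension.
    let marker = match file_type {
        FileType::Publication => "",
        FileType::Frontcover => "_frontcover",
    };
    format!("/{doi_prefix}/{doi_suffix}{marker}.{extension}").to_lowercase()
}
```

For example, canonical_key("10.12345", "Book.0001", FileType::Frontcover, "JPG") yields /10.12345/book.0001_frontcover.jpg (the DOI here is made up).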

3. File table (final stored files)

CREATE TABLE file (
  file_id        UUID PRIMARY KEY DEFAULT uuid_generate_v4(),

  file_type      file_type NOT NULL,   -- 'publication' | 'frontcover'

  work_id        UUID REFERENCES work (work_id),
  publication_id UUID REFERENCES publication (publication_id),

  object_key     TEXT NOT NULL,        -- lowercase DOI-based canonical path
  cdn_url        TEXT NOT NULL,        -- full public URL

  mime_type      TEXT NOT NULL,
  bytes          BIGINT NOT NULL,
  sha256         TEXT NOT NULL,

  created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Enforce types:
ALTER TABLE file
  ADD CONSTRAINT file_type_check
  CHECK (
    (file_type = 'frontcover' AND work_id IS NOT NULL AND publication_id IS NULL) OR
    (file_type = 'publication' AND publication_id IS NOT NULL AND work_id IS NULL)
  );

-- One frontcover per work
CREATE UNIQUE INDEX file_frontcover_work_unique_idx
  ON file (work_id)
  WHERE file_type = 'frontcover';

-- One publication file per publication
CREATE UNIQUE INDEX file_publication_unique_idx
  ON file (publication_id)
  WHERE file_type = 'publication';

-- Never reuse the same object key
CREATE UNIQUE INDEX file_object_key_unique_idx
  ON file (object_key);

SELECT diesel_manage_updated_at('file');
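
The file_type_check constraint could also be expressed in application code, so that invalid combinations are rejected with a clear error before reaching the database. This is a sketch only: FileOwner and file_owner are hypothetical names, and IDs are plain strings here where real code would presumably use a UUID type.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Publication,
    Frontcover,
}

/// The single parent a file row may point at (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FileOwner {
    Work(String),
    Publication(String),
}

/// Mirror of the CHECK constraint: a frontcover belongs to a work only,
/// a publication file belongs to a publication only.
pub fn file_owner(
    file_type: FileType,
    work_id: Option<String>,
    publication_id: Option<String>,
) -> Result<FileOwner, &'static str> {
    match (file_type, work_id, publication_id) {
        (FileType::Frontcover, Some(work_id), None) => Ok(FileOwner::Work(work_id)),
        (FileType::Publication, None, Some(publication_id)) => {
            Ok(FileOwner::Publication(publication_id))
        }
        (FileType::Frontcover, _, _) => {
            Err("frontcover requires work_id and no publication_id")
        }
        (FileType::Publication, _, _) => {
            Err("publication requires publication_id and no work_id")
        }
    }
}
```

Modelling the owner as an enum means the rest of the code never has to handle a row with both or neither ID set.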

4. file_upload table (temporary uploads)

CREATE TABLE file_upload (
  file_upload_id     UUID PRIMARY KEY DEFAULT uuid_generate_v4(),

  file_type          file_type NOT NULL,   -- same enum as final file table

  work_id            UUID REFERENCES work (work_id),
  publication_id     UUID REFERENCES publication (publication_id),

  declared_mime_type TEXT NOT NULL,
  declared_extension TEXT NOT NULL,
  declared_sha256    TEXT NOT NULL,

  created_at         TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at         TIMESTAMPTZ NOT NULL DEFAULT now()
);

ALTER TABLE file_upload
  ADD CONSTRAINT file_upload_type_check
  CHECK (
    (file_type = 'frontcover' AND work_id IS NOT NULL AND publication_id IS NULL) OR
    (file_type = 'publication' AND publication_id IS NOT NULL AND work_id IS NULL)
  );

CREATE INDEX file_upload_work_idx
  ON file_upload (work_id)
  WHERE file_type = 'frontcover';

CREATE INDEX file_upload_publication_idx
  ON file_upload (publication_id)
  WHERE file_type = 'publication';

SELECT diesel_manage_updated_at('file_upload');

GraphQL Schema

"""Type of file being uploaded/stored."""
enum FileType {
  """Publication-level file (e.g. PDF, EPUB, XML)."""
  PUBLICATION

  """Front cover image for a work."""
  FRONTCOVER
}

"""Represents an initialised upload session plus its presigned URL."""
type FileUpload {
  """Thoth ID of the upload session."""
  fileUploadId: Uuid!

  """Presigned S3 URL to which the client should upload using HTTP PUT."""
  uploadUrl: String!

  """When the presigned URL expires."""
  expiresAt: DateTime!
}

"""Represents a file that has been fully uploaded."""
type File {
  """Thoth ID of the stored file."""
  fileId: Uuid!

  """Type of file."""
  fileType: FileType!

  """Canonical S3 key, e.g. /{doi_prefix}/{doi_suffix}.pdf or /{doi_prefix}/{doi_suffix}_frontcover.jpg."""
  objectKey: String!

  """Public CDN URL, e.g. https://books.thoth.pub/{doi_prefix}/{doi_suffix}.pdf."""
  cdnUrl: String!

  """MIME type used when serving the file."""
  mimeType: String!

  """Size of the file in bytes."""
  bytes: Int!

  """SHA-256 checksum of the stored file."""
  sha256: String!
}

"""Input for starting a publication file upload (PDF, EPUB, XML, etc.)."""
input NewPublicationFileUpload {
  """Thoth ID of the publication linked to this file."""
  publicationId: Uuid!

  """MIME type declared by the client (used for validation and in the presigned URL)."""
  declaredMimeType: String!

  """File extension to use in the final canonical key, e.g. 'pdf', 'epub', 'xml'."""
  declaredExtension: String!

  """SHA-256 checksum of the file, hex-encoded."""
  declaredSha256: String!
}

"""Input for starting a front cover upload for a work."""
input NewFrontcoverFileUpload {
  """Thoth ID of the work this front cover belongs to."""
  workId: Uuid!

  """MIME type declared by the client (e.g. 'image/jpeg')."""
  declaredMimeType: String!

  """File extension to use in the final canonical key, e.g. 'jpg', 'png', 'webp'."""
  declaredExtension: String!

  """SHA-256 checksum of the file, hex-encoded."""
  declaredSha256: String!
}

"""Input for completing a file upload and promoting it to its final DOI-based location."""
input CompleteFileUpload {
  """ID of the upload session to complete."""
  fileUploadId: Uuid!
}

extend type Mutation {
  """
  Start uploading a publication file (e.g. PDF, EPUB, XML) for a given publication.
  Returns an upload session ID and a presigned S3 PUT URL.
  """
  initPublicationFileUpload(
    data: NewPublicationFileUpload!
  ): FileUpload!

  """
  Start uploading a front cover image for a given work.
  Returns an upload session ID and a presigned S3 PUT URL.
  """
  initFrontcoverFileUpload(
    data: NewFrontcoverFileUpload!
  ): FileUpload!

  """
  Complete a file upload by validating the uploaded object, moving it to its canonical DOI-based key,
  and updating/creating the file record.
  """
  completeFileUpload(
    data: CompleteFileUpload!
  ): File!
}

Server logic

Overview

  • All temporary uploads go to:

    • uploads/{file_upload_id} in the imprint's S3 bucket.
  • Final public objects use the canonical DOI-based key (always lowercased):

    • Publication file:
      • /{doi_prefix}/{doi_suffix}.{extension}
    • Front cover:
      • /{doi_prefix}/{doi_suffix}_frontcover.{extension}
  • CloudFront is configured to serve from the imprint’s bucket for all keys except /uploads/* (uploads must not be exposed via CDN).

  • Imprint storage configuration (s3_bucket, s3_region, cdn_domain, cloudfront_dist_id) is only readable/writable by superusers.

There should be a storage abstraction roughly like:

  • storage::presign_put_for_upload
  • storage::copy_temp_object_to_final
  • storage::delete_temp_object
  • storage::invalidate_cloudfront
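
One possible shape for this abstraction is a trait, which would let the mutation logic be tested without AWS. The sketch below is an assumption throughout: the four function names come from the list above, but the trait, its signatures, the String error type and the InMemoryStorage test double are all invented for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical storage abstraction; signatures are assumptions.
pub trait Storage {
    /// Return a presigned PUT URL for the temporary key.
    fn presign_put_for_upload(
        &self,
        temp_key: &str,
        content_type: &str,
        sha256_base64: &str,
    ) -> Result<String, String>;
    /// Copy the temporary object to its canonical key.
    fn copy_temp_object_to_final(&mut self, temp_key: &str, final_key: &str) -> Result<(), String>;
    /// Remove the temporary object.
    fn delete_temp_object(&mut self, temp_key: &str) -> Result<(), String>;
    /// Invalidate the CDN cache for one path.
    fn invalidate_cloudfront(&mut self, path: &str) -> Result<(), String>;
}

/// In-memory stand-in for tests; it makes no network calls.
#[derive(Default)]
pub struct InMemoryStorage {
    pub objects: HashMap<String, Vec<u8>>,
    pub invalidated: Vec<String>,
}

impl Storage for InMemoryStorage {
    fn presign_put_for_upload(
        &self,
        temp_key: &str,
        _content_type: &str,
        _sha256_base64: &str,
    ) -> Result<String, String> {
        // Not a real signature: just a recognisable placeholder URL.
        Ok(format!("https://example.invalid/{temp_key}?signature=fake"))
    }

    fn copy_temp_object_to_final(&mut self, temp_key: &str, final_key: &str) -> Result<(), String> {
        let body = self
            .objects
            .get(temp_key)
            .cloned()
            .ok_or_else(|| format!("missing temporary object: {temp_key}"))?;
        self.objects.insert(final_key.to_string(), body);
        Ok(())
    }

    fn delete_temp_object(&mut self, temp_key: &str) -> Result<(), String> {
        self.objects.remove(temp_key);
        Ok(())
    }

    fn invalidate_cloudfront(&mut self, path: &str) -> Result<(), String> {
        self.invalidated.push(path.to_string());
        Ok(())
    }
}
```

A real implementation would wrap the AWS SDK behind the same trait, keeping the bucket, region and distribution ID from the imprint's storage configuration.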

1. initPublicationFileUpload

Goal: create a file_upload row and return a presigned S3 PUT URL for the client to upload the file.

Steps:

  1. Auth

    • Check that the caller has permission to modify the given publicationId / its imprint.
    • Check the linked work has a DOI, otherwise return an error
  2. Check imprint storage configuration

    • Ensure imprint.s3_bucket, s3_region, cdn_domain, cloudfront_dist_id are all non-null.
    • If not configured, return a clear error (e.g. "Imprint is not configured for file hosting").
  3. Insert file_upload row

    • file_type = 'publication'
    • publication_id = input.publicationId
    • work_id = NULL
    • declared_mime_type = input.declaredMimeType
    • declared_extension = input.declaredExtension (stored lowercased)
    • declared_sha256 = input.declaredSha256
  4. Compute temporary S3 key

    • temp_key = "uploads/{file_upload_id}"
  5. Generate presigned PUT URL

    • Using imprint's s3_bucket + s3_region.
    • Resource: temp_key.
    • HTTP method: PUT.
    • Pre-sign two headers:
      • Content-Type = declaredMimeType
      • x-amz-checksum-sha256 = <base64_of_sha256>
    • Presigned URL should expire after a short period (e.g. 10–60 mins).
  6. Return

    • fileUploadId (the DB file_upload_id)
    • uploadUrl (the presigned PUT URL)
    • expiresAt (URL expiry time)

The client then performs a direct PUT from the browser, sending the same Content-Type and x-amz-checksum-sha256 header values that were signed.
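
Note that declaredSha256 is hex-encoded, while the x-amz-checksum-sha256 header in step 5 carries the base64 encoding of the raw 32-byte digest. A conversion along the following lines would be needed. This is a dependency-free sketch with hypothetical function names; in practice an existing hex/base64 crate would likely be used instead.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Decode a hex string into bytes; None on odd length or non-hex characters.
pub fn hex_decode(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// Standard base64 (RFC 4648 alphabet) with '=' padding.
pub fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Convert a hex SHA-256 digest into the base64 value for x-amz-checksum-sha256.
/// Returns None unless the input decodes to exactly 32 bytes.
pub fn sha256_hex_to_checksum_header(hex: &str) -> Option<String> {
    let bytes = hex_decode(hex)?;
    if bytes.len() != 32 {
        return None;
    }
    Some(base64_encode(&bytes))
}
```

Rejecting anything that is not exactly 32 bytes also gives initPublicationFileUpload a cheap way to validate declaredSha256 before inserting the file_upload row.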

2. initFrontcoverFileUpload

Same pattern as initPublicationFileUpload, but operating at the work level instead of the publication level, i.e.:

  • file_type = 'frontcover'
  • publication_id = NULL
  • work_id = input.workId

3. completeFileUpload

Goal: Validate the uploaded file, move it to the canonical DOI path, update/create the file row, update Work.cover_url for frontcovers, invalidate CloudFront if an old file was replaced, and clean up.

Steps:

  1. Auth

    • Check that the caller has permission to modify the underlying data.
  2. Load file_upload

    • Lookup by fileUploadId.
    • If missing: "Upload session not found".
  3. HeadObject on temporary upload

    • temp_key = "uploads/{file_upload_id}"
    • Expect:
      • ContentLength: bytes
      • ContentType: mime
    • Error if object is missing (means the client never uploaded it!).
  4. Validation

    • Validate declared extension matches allowed formats for the file’s type.
    • For publications:
      • Match against PublicationType (e.g. PDF: .pdf, XML: .xml or .zip, EPUB: .epub, etc.).
    • For frontcovers:
      • Allow only real image types (jpg, jpeg, png, webp).
  5. Compute final canonical key

    • Lowercase prefix, suffix, and extension.
    • For publications: /{doi_prefix}/{doi_suffix}.{extension}
    • For frontcover: /{doi_prefix}/{doi_suffix}_frontcover.{extension}
  6. Copy object into final location

    • CopyObject from temp key to canonical key.
  7. Insert/update into file table

    • object_key = canonical_key
    • cdn_url = "https://{cdn_domain}{canonical_key}"
    • mime_type = S3 ContentType
    • bytes = ContentLength
    • sha256 = file_upload.declared_sha256 (the completeFileUpload input only carries fileUploadId)
  8. Update Work.cover_url for frontcovers

    • Only if file_type = 'frontcover'
    • This keeps the work’s cover URL in sync with the newly uploaded image.
  9. Create/update canonical Location for publication files

    • If file_type = 'publication':
      • We want a Location row with:
        • publication_id = publication_id
        • location_platform = 'Thoth'
        • canonical = true
        • full_text_url = file.cdn_url
        • landing_page = work.landing_page
  10. Invalidate CloudFront cache

    • Only when replacing an existing file.
    • Call CloudFront CreateInvalidation using imprint.cloudfront_dist_id.
  11. Cleanup

    • Delete the file_upload row.
    • Delete the temp key (uploads/{file_upload_id}) with DeleteObject.
  12. Return the File object, e.g.:
{
  "fileId": "...",
  "fileType": "...",
  "objectKey": "...",
  "cdnUrl": "...",
  "mimeType": "...",
  "bytes": 123456,
  "sha256": "..."
}
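
The extension validation in step 4 and the cdn_url construction in step 7 could look roughly like this. It is a sketch under stated assumptions: UploadKind, PublicationType and the function names are hypothetical, and only the publication types explicitly named in step 4 are covered, since the full mapping is left open there ("etc.").

```rust
/// Subset of publication types named in step 4; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PublicationType {
    Pdf,
    Xml,
    Epub,
}

/// What is being uploaded, with enough context to pick the allowed formats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UploadKind {
    Frontcover,
    Publication(PublicationType),
}

/// Extensions accepted for each kind of upload, per step 4.
pub fn allowed_extensions(kind: &UploadKind) -> &'static [&'static str] {
    match kind {
        UploadKind::Frontcover => &["jpg", "jpeg", "png", "webp"],
        UploadKind::Publication(PublicationType::Pdf) => &["pdf"],
        UploadKind::Publication(PublicationType::Xml) => &["xml", "zip"],
        UploadKind::Publication(PublicationType::Epub) => &["epub"],
    }
}

/// Case-insensitive check of the declared extension against the allowed set.
pub fn is_extension_allowed(kind: &UploadKind, declared_extension: &str) -> bool {
    let ext = declared_extension.to_lowercase();
    allowed_extensions(kind)
        .iter()
        .any(|allowed| *allowed == ext.as_str())
}

/// Build the public URL; `canonical_key` is expected to start with "/".
pub fn cdn_url(cdn_domain: &str, canonical_key: &str) -> String {
    format!("https://{cdn_domain}{canonical_key}")
}
```

Since declared_extension is stored lowercased at init time, the lowercasing here is only a defensive measure.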
