Description
We need to implement a secure and scalable file upload architecture for books and chapters.
The system must support:
- A GraphQL mutation to request a presigned upload URL, returning S3 upload details to the client.
- Client uploads directly to a private S3 prefix (/uploads/...), not exposed publicly by CloudFront.
- A second GraphQL mutation to complete the upload, where the backend:
  - Validates the uploaded file (size, type, checksum)
  - Moves it to its final, public CDN-served path
  - Builds the canonical filename pattern (all lowercase): /{doi_prefix}/{doi_suffix}.{file_extension}
  - Creates the database record containing file metadata (checksum, mime, bytes, etc.)
  - On file updates: invalidates CloudFront cache for the final path
Database Schema
NB. Add errors to thoth-errors/src/database_errors.rs
1. Storage configuration per imprint
Only superusers may read or modify it:
ALTER TABLE imprint
ADD COLUMN s3_bucket TEXT,
ADD COLUMN s3_region TEXT,
ADD COLUMN cdn_domain TEXT,
ADD COLUMN cloudfront_dist_id TEXT;
-- all or nothing
ALTER TABLE imprint
ADD CONSTRAINT imprint_storage_cfg_all_or_none
CHECK (
(
s3_bucket IS NULL AND
s3_region IS NULL AND
cdn_domain IS NULL AND
cloudfront_dist_id IS NULL
)
OR
(
s3_bucket IS NOT NULL AND
s3_region IS NOT NULL AND
cdn_domain IS NOT NULL AND
cloudfront_dist_id IS NOT NULL
)
);
AWS user/key is passed at runtime via CLI/env, not stored.
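The all-or-nothing rule can also be mirrored in application code, so that a partially configured imprint is rejected before any S3 call is attempted. The following is a minimal sketch, not the project's implementation: `StorageConfig`, `StorageConfigError` and `storage_config_from_columns` are hypothetical names introduced here for illustration.

```rust
/// Hypothetical in-memory view of the four storage columns on `imprint`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StorageConfig {
    pub s3_bucket: String,
    pub s3_region: String,
    pub cdn_domain: String,
    pub cloudfront_dist_id: String,
}

/// Hypothetical error type; a real one would live alongside the other errors.
#[derive(Debug, PartialEq, Eq)]
pub enum StorageConfigError {
    /// Some, but not all, of the four columns are set.
    PartiallyConfigured,
}

/// Mirrors the `imprint_storage_cfg_all_or_none` CHECK constraint:
/// `Ok(None)` means hosting is not configured, `Ok(Some(_))` means all four
/// values are present, and `Err(_)` means only some of them are set.
pub fn storage_config_from_columns(
    s3_bucket: Option<String>,
    s3_region: Option<String>,
    cdn_domain: Option<String>,
    cloudfront_dist_id: Option<String>,
) -> Result<Option<StorageConfig>, StorageConfigError> {
    match (s3_bucket, s3_region, cdn_domain, cloudfront_dist_id) {
        (None, None, None, None) => Ok(None),
        (Some(s3_bucket), Some(s3_region), Some(cdn_domain), Some(cloudfront_dist_id)) => {
            Ok(Some(StorageConfig {
                s3_bucket,
                s3_region,
                cdn_domain,
                cloudfront_dist_id,
            }))
        }
        _ => Err(StorageConfigError::PartiallyConfigured),
    }
}
```

With the database constraint in place the `Err` branch should be unreachable for rows read from `imprint`, but it lets input be validated before an `UPDATE` is attempted.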
2. File types and URL structure
Define a FileType enum with the two types of uploads we're supporting now. We'll eventually expand this with additional resources, backcovers, etc.
CREATE TYPE file_type AS ENUM ('publication', 'frontcover');
URL structure:
- Publication file: /{doi_prefix}/{doi_suffix}.{extension}
- Front cover: /{doi_prefix}/{doi_suffix}_frontcover.{extension}
Both are lowercased before being written to object_key.
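As an illustration of the URL structure, the sketch below builds the canonical key from a DOI. It assumes the DOI is available in bare `prefix/suffix` form (e.g. `10.12345/Book.0001`); if DOIs are stored as full URLs, the resolver part would have to be stripped first. The `FileType` enum and `canonical_object_key` function are hypothetical names for this example.

```rust
/// Rust-side counterpart of the `file_type` SQL enum (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Publication,
    Frontcover,
}

/// Build the canonical, lowercased object key for a file.
///
/// Returns `None` when the DOI has no `/` separating prefix from suffix,
/// or when the prefix, suffix, or extension is empty.
/// Assumption: the DOI is split at the first `/`, so a suffix containing
/// further slashes would produce a nested key.
pub fn canonical_object_key(doi: &str, file_type: FileType, extension: &str) -> Option<String> {
    let (prefix, suffix) = doi.split_once('/')?;
    // Tolerate a leading dot in the declared extension (".pdf" and "pdf").
    let extension = extension.trim_start_matches('.');
    if prefix.is_empty() || suffix.is_empty() || extension.is_empty() {
        return None;
    }
    let stem = match file_type {
        FileType::Publication => suffix.to_string(),
        FileType::Frontcover => format!("{}_frontcover", suffix),
    };
    Some(format!("/{}/{}.{}", prefix, stem, extension).to_lowercase())
}
```

For example, `canonical_object_key("10.12345/Book.0001", FileType::Frontcover, "jpg")` yields `/10.12345/book.0001_frontcover.jpg`.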
3. File table (final stored files)
CREATE TABLE file (
file_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
file_type file_type NOT NULL, -- 'publication' | 'frontcover'
work_id UUID REFERENCES work (work_id),
publication_id UUID REFERENCES publication (publication_id),
object_key TEXT NOT NULL, -- lowercase DOI-based canonical path
cdn_url TEXT NOT NULL, -- full public URL
mime_type TEXT NOT NULL,
bytes BIGINT NOT NULL,
sha256 TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Enforce types:
ALTER TABLE file
ADD CONSTRAINT file_type_check
CHECK (
(file_type = 'frontcover' AND work_id IS NOT NULL AND publication_id IS NULL) OR
(file_type = 'publication' AND publication_id IS NOT NULL AND work_id IS NULL)
);
-- One frontcover per work
CREATE UNIQUE INDEX file_frontcover_work_unique_idx
ON file (work_id)
WHERE file_type = 'frontcover';
-- One publication file per publication
CREATE UNIQUE INDEX file_publication_unique_idx
ON file (publication_id)
WHERE file_type = 'publication';
-- Never reuse the same object key
CREATE UNIQUE INDEX file_object_key_unique_idx
ON file (object_key);
SELECT diesel_manage_updated_at('file');
4. file_upload table (temporary uploads)
CREATE TABLE file_upload (
file_upload_id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
file_type file_type NOT NULL, -- same enum as final file table
work_id UUID REFERENCES work (work_id),
publication_id UUID REFERENCES publication (publication_id),
declared_mime_type TEXT NOT NULL,
declared_extension TEXT NOT NULL,
declared_sha256 TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
ALTER TABLE file_upload
ADD CONSTRAINT file_upload_type_check
CHECK (
(file_type = 'frontcover' AND work_id IS NOT NULL AND publication_id IS NULL) OR
(file_type = 'publication' AND publication_id IS NOT NULL AND work_id IS NULL)
);
CREATE INDEX file_upload_work_idx
ON file_upload (work_id)
WHERE file_type = 'frontcover';
CREATE INDEX file_upload_publication_idx
ON file_upload (publication_id)
WHERE file_type = 'publication';
SELECT diesel_manage_updated_at('file_upload');
GraphQL Schema
"""Type of file being uploaded/stored."""
enum FileType {
"""Publication-level file (e.g. PDF, EPUB, XML)."""
PUBLICATION
"""Front cover image for a work."""
FRONTCOVER
}
"""Represents an initialised upload session plus its presigned URL."""
type FileUpload {
"""Thoth ID of the upload session."""
fileUploadId: Uuid!
"""Presigned S3 URL to which the client should upload using HTTP PUT."""
uploadUrl: String!
"""When the presigned URL expires."""
expiresAt: DateTime!
}
"""Represents a file that has been fully uploaded."""
type File {
"""Thoth ID of the stored file."""
fileId: Uuid!
"""Type of file."""
fileType: FileType!
"""Canonical S3 key, e.g. /{doi_prefix}/{doi_suffix}.pdf or /{doi_prefix}/{doi_suffix}_frontcover.jpg."""
objectKey: String!
"""Public CDN URL, e.g. https://books.thoth.pub/{doi_prefix}/{doi_suffix}.pdf."""
cdnUrl: String!
"""MIME type used when serving the file."""
mimeType: String!
"""Size of the file in bytes."""
bytes: Int!
"""SHA-256 checksum of the stored file."""
sha256: String!
}
"""Input for starting a publication file upload (PDF, EPUB, XML, etc.)."""
input NewPublicationFileUpload {
"""Thoth ID of the publication linked to this file."""
publicationId: Uuid!
"""MIME type declared by the client (used for validation and in the presigned URL)."""
declaredMimeType: String!
"""File extension to use in the final canonical key, e.g. 'pdf', 'epub', 'xml'."""
declaredExtension: String!
"""SHA-256 checksum of the file, hex-encoded."""
declaredSha256: String!
}
"""Input for starting a front cover upload for a work."""
input NewFrontcoverFileUpload {
"""Thoth ID of the work this front cover belongs to."""
workId: Uuid!
"""MIME type declared by the client (e.g. 'image/jpeg')."""
declaredMimeType: String!
"""File extension to use in the final canonical key, e.g. 'jpg', 'png', 'webp'."""
declaredExtension: String!
"""SHA-256 checksum of the file, hex-encoded."""
declaredSha256: String!
}
"""Input for completing a file upload and promoting it to its final DOI-based location."""
input CompleteFileUpload {
"""ID of the upload session to complete."""
fileUploadId: Uuid!
}
extend type Mutation {
"""
Start uploading a publication file (e.g. PDF, EPUB, XML) for a given publication.
Returns an upload session ID and a presigned S3 PUT URL.
"""
initPublicationFileUpload(
data: NewPublicationFileUpload!
): FileUpload!
"""
Start uploading a front cover image for a given work.
Returns an upload session ID and a presigned S3 PUT URL.
"""
initFrontcoverFileUpload(
data: NewFrontcoverFileUpload!
): FileUpload!
"""
Complete a file upload by validating the uploaded object, moving it to its canonical DOI-based key,
and creating/updating the file record.
"""
completeFileUpload(
data: CompleteFileUpload!
): File!
}
Server logic
Overview
- All temporary uploads go to uploads/{file_upload_id} in the imprint's S3 bucket.
- Final public objects use the canonical DOI-based key (always lowercased):
  - Publication file: /{doi_prefix}/{doi_suffix}.{extension}
  - Front cover: /{doi_prefix}/{doi_suffix}_frontcover.{extension}
- CloudFront is configured to serve from the imprint’s bucket for all keys except /uploads/* (uploads must not be exposed via CDN).
- Imprint storage configuration (s3_bucket, s3_region, cdn_domain, cloudfront_dist_id) is only readable/writable by superusers.
There should be a storage abstraction roughly like:
- storage::presign_put_for_upload
- storage::copy_temp_object_to_final
- storage::delete_temp_object
- storage::invalidate_cloudfront
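One possible shape for that abstraction is a trait with the four operations, which also allows an in-memory fake in unit tests. This is only a sketch under assumptions: the four method names come from the list above, but every signature, the `PresignedPut` struct, the `String` error type and `InMemoryStorage` are invented here. A real implementation backed by the AWS SDK would most likely be async.

```rust
use std::collections::HashMap;

/// Hypothetical return value of presigning: the URL plus its lifetime.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PresignedPut {
    pub url: String,
    pub expires_in_secs: u64,
}

/// Sketch of the storage abstraction; signatures are assumptions.
pub trait Storage {
    fn presign_put_for_upload(
        &self,
        temp_key: &str,
        mime_type: &str,
        sha256_base64: &str,
        expires_in_secs: u64,
    ) -> Result<PresignedPut, String>;
    fn copy_temp_object_to_final(&mut self, temp_key: &str, final_key: &str) -> Result<(), String>;
    fn delete_temp_object(&mut self, temp_key: &str) -> Result<(), String>;
    fn invalidate_cloudfront(&mut self, path: &str) -> Result<(), String>;
}

/// In-memory fake for tests: objects live in a map, invalidations are recorded.
#[derive(Default)]
pub struct InMemoryStorage {
    pub objects: HashMap<String, Vec<u8>>,
    pub invalidated: Vec<String>,
}

impl Storage for InMemoryStorage {
    fn presign_put_for_upload(
        &self,
        temp_key: &str,
        _mime_type: &str,
        _sha256_base64: &str,
        expires_in_secs: u64,
    ) -> Result<PresignedPut, String> {
        // Placeholder URL; nothing is actually signed in the fake.
        Ok(PresignedPut {
            url: format!("https://example.invalid/{}", temp_key),
            expires_in_secs,
        })
    }

    fn copy_temp_object_to_final(&mut self, temp_key: &str, final_key: &str) -> Result<(), String> {
        let bytes = self
            .objects
            .get(temp_key)
            .cloned()
            .ok_or_else(|| format!("missing object: {}", temp_key))?;
        self.objects.insert(final_key.to_string(), bytes);
        Ok(())
    }

    fn delete_temp_object(&mut self, temp_key: &str) -> Result<(), String> {
        self.objects.remove(temp_key);
        Ok(())
    }

    fn invalidate_cloudfront(&mut self, path: &str) -> Result<(), String> {
        self.invalidated.push(path.to_string());
        Ok(())
    }
}
```

Keeping the mutations behind a trait means the `completeFileUpload` logic can be exercised in tests without touching S3 or CloudFront.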
1. initPublicationFileUpload
Goal: create a file_upload row and return a presigned S3 PUT URL for the client to upload the file.
Steps:
- Auth
  - Check that the caller has permission to modify the given publicationId / its imprint.
  - Check the linked work has a DOI, otherwise return an error.
- Check imprint storage configuration
  - Ensure imprint.s3_bucket, s3_region, cdn_domain, cloudfront_dist_id are all non-null.
  - If not configured, return a clear error (e.g. "Imprint is not configured for file hosting").
- Insert file_upload row
  - file_type = 'publication'
  - publication_id = input.publicationId
  - work_id = NULL
  - declared_mime_type = input.declaredMimeType
  - declared_extension = input.declaredExtension (stored lowercased)
  - declared_sha256 = input.declaredSha256
- Compute temporary S3 key
  - temp_key = "uploads/{file_upload_id}"
- Generate presigned PUT URL
  - Using imprint's s3_bucket + s3_region.
  - Resource: temp_key.
  - HTTP method: PUT.
  - Pre-sign two headers:
    - Content-Type = declaredMimeType
    - x-amz-checksum-sha256 = <base64_of_sha256>
  - Presigned URL should expire after a short period (e.g. 10–60 mins).
- Return
  - fileUploadId (the DB file_upload_id)
  - uploadUrl (the presigned PUT URL)
  - expiresAt (URL expiry time)
The client then performs a direct PUT from the browser.
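Note that declaredSha256 is hex-encoded while the x-amz-checksum-sha256 header carries the base64 encoding of the raw digest bytes, so the value has to be converted before presigning. The sketch below does this with the standard library only; the function names are hypothetical, and in practice a crate such as `hex` or `base64` could replace the hand-written helpers.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Decode a 64-character hex SHA-256 digest into its 32 raw bytes.
/// Returns `None` for the wrong length or any non-hex character.
pub fn decode_sha256_hex(hex: &str) -> Option<[u8; 32]> {
    if hex.len() != 64 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(out)
}

/// Standard (RFC 4648) base64 encoding with `=` padding.
pub fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = chunk.get(1).copied().unwrap_or(0) as u32;
        let b2 = chunk.get(2).copied().unwrap_or(0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[((n >> 18) & 63) as usize] as char);
        out.push(B64[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 { B64[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Convert the client's hex-encoded digest into the header value.
pub fn checksum_header_value(declared_sha256_hex: &str) -> Option<String> {
    decode_sha256_hex(declared_sha256_hex).map(|digest| base64_encode(&digest))
}
```

Rejecting malformed digests at this point means an invalid declaredSha256 fails in the init mutation rather than surfacing later as an S3 checksum mismatch during the client's PUT.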
2. initFrontcoverFileUpload
Same pattern, but operating at the work level instead of the publication level, i.e.:
- file_type = 'frontcover'
- publication_id = NULL
- work_id = input.workId
3. completeFileUpload
Goal: Validate the uploaded file, move it to the canonical DOI path, update/create the file row, update Work.cover_url for frontcovers, invalidate CloudFront if an old file was replaced, and clean up.
Steps:
- Auth
  - Check that the caller has permission to modify the underlying data.
- Load file_upload
  - Lookup by fileUploadId.
  - If missing: "Upload session not found".
- HeadObject on temporary upload
  - temp_key = "uploads/{file_upload_id}"
  - Expect:
    - ContentLength: bytes
    - ContentType: mime
  - Error if object is missing (means the client never uploaded it!).
- Validation
  - Validate declared extension matches allowed formats for the file’s type.
  - For publications:
    - Match against PublicationType (e.g. PDF: .pdf, XML: .xml or .zip, EPUB: .epub, etc.).
  - For frontcovers:
    - Allow only real image types (jpg, jpeg, png, webp).
- Compute final canonical key
  - Lowercase prefix, suffix, and extension.
  - For publications: /{doi_prefix}/{doi_suffix}.{extension}
  - For frontcover: /{doi_prefix}/{doi_suffix}_frontcover.{extension}
- Copy object into final location
  - CopyObject from temp key to canonical key.
- Insert/update into file table
  - object_key = canonical_key
  - cdn_url = "https://{cdn_domain}{canonical_key}"
  - mime_type = S3 ContentType
  - bytes = ContentLength
  - sha256 = file_upload.declared_sha256 (CompleteFileUpload only carries fileUploadId)
- Update Work.cover_url for frontcovers
  - Only if file_type = 'frontcover'
  - This keeps the work’s cover URL in sync with the newly uploaded image.
- Create/update canonical Location for publication files
  - If file_type = 'publication':
    - We want a Location row with:
      - publication_id = publication_id
      - location_platform = 'Thoth'
      - canonical = true
      - full_text_url = file.cdn_url
      - landing_page = work.landing_page
- Invalidate CloudFront cache
  - Only when replacing an existing file.
  - Call CloudFront CreateInvalidation using imprint.cloudfront_dist_id
- Cleanup
  - Delete the file_upload row
  - Delete the temp key (uploads/{file_upload_id}): DeleteObject
- Return
{
"fileId": "...",
"fileType": "...",
"objectKey": "...",
"cdnUrl": "...",
"mimeType": "...",
"bytes": 123456,
"sha256": "..."
}
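The extension check in the validation step could look roughly like the sketch below. The mappings shown are only the examples given above (PDF, XML, EPUB, and the four image extensions); the `PublicationType` enum here is a hypothetical three-variant stand-in for the real type, and `UploadTarget`, `allowed_extensions` and `is_extension_allowed` are names invented for this illustration.

```rust
/// Illustrative subset of publication types; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PublicationType {
    Pdf,
    Xml,
    Epub,
}

/// What the upload session is for (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UploadTarget {
    Publication(PublicationType),
    Frontcover,
}

/// Extensions accepted for each target, all lowercase and without a dot.
pub fn allowed_extensions(target: UploadTarget) -> &'static [&'static str] {
    match target {
        UploadTarget::Publication(PublicationType::Pdf) => &["pdf"],
        UploadTarget::Publication(PublicationType::Xml) => &["xml", "zip"],
        UploadTarget::Publication(PublicationType::Epub) => &["epub"],
        UploadTarget::Frontcover => &["jpg", "jpeg", "png", "webp"],
    }
}

/// Case-insensitive check of the declared extension; a leading dot is ignored.
pub fn is_extension_allowed(target: UploadTarget, declared_extension: &str) -> bool {
    let ext = declared_extension
        .trim_start_matches('.')
        .to_ascii_lowercase();
    allowed_extensions(target)
        .iter()
        .any(|allowed| *allowed == ext.as_str())
}
```

A check like this only looks at the declared extension. The ContentType and ContentLength returned by HeadObject still need to be compared against the declared values separately.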