UCS Redux - Lynx Boreal (GSI-1685) #148

Open
wants to merge 8 commits into
base: main
Binary file added 76-lynx-boreal/images/ucs.png
Binary file added 76-lynx-boreal/images/upload_context.png
188 changes: 188 additions & 0 deletions 76-lynx-boreal/technical_specification.md
@@ -0,0 +1,188 @@
# Upload Service Redux (Lynx Boreal)
**Epic Type:** Implementation Epic

Epic planning and implementation follow the
[Epic Planning and Marathon SOP](https://docs.ghga-dev.de/main/sops/sop001_epic_planning.html).

## Scope
### Outline:
The goal of this epic is to overhaul the Upload Controller Service (UCS) as part of the
new [File Upload concept](https://ghga.pages.hzdr.de/internal.ghga.de/feature_archconcept-file-upload/developer/architecture_concepts/ac007_file_upload/).

![UCS Diagram](./images/ucs.png)

#### Domain Objects
The UCS owns two domain objects, which it broadcasts as outbox events via Kafka. The
first domain object is the `UploadContext`, which broadly serves to delineate
in-progress and finalized file submissions for a given study. The second domain
object is the `FileUpload`. As its name suggests, the `FileUpload` object reflects
the upload status of a single file within an `UploadContext`. Thus, there is a
hierarchical, one-to-many relationship between `UploadContext` and `FileUpload`.

We will define the Pydantic models for these two classes in `ghga-event-schemas`,
along with one stateful config class for each.

#### Inputs
The UCS receives user input only in the form of HTTP requests. It doesn't subscribe to
any Kafka events. However, we will define a slim CLI for the service that exposes the
commands `run-rest` and `publish-events`. These commands are common across our
services at this time.
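
As a rough illustration of such a slim CLI, the sketch below wires up the two commands with the standard library's `argparse`. The program name `ucs`, the help texts, and the choice of `argparse` are assumptions made for this example; the actual service may use a different CLI framework.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two commands named in this spec."""
    parser = argparse.ArgumentParser(prog="ucs")  # program name is hypothetical
    subcommands = parser.add_subparsers(dest="command", required=True)
    subcommands.add_parser("run-rest", help="Run the HTTP API")
    # Assumption: this command publishes the events stored in the outbox.
    subcommands.add_parser("publish-events", help="Publish stored events")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> str:
    """Parse the arguments and return the name of the selected command."""
    args = build_parser().parse_args(argv)
    # A real entry point would start the REST app or the publisher here.
    return args.command
```

Calling `main(["run-rest"])` returns the selected command name, which a real entry point would use to dispatch to the corresponding start-up routine.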

#### Outputs
There are three categories of output in the UCS: HTTP responses, published events, and
data stored in the database. HTTP responses are described below in the API Definitions
section. The published events and database storage are driven simultaneously by
Hexkit's `MongoKafkaDaoPublisher`, which the UCS uses to store `UploadContext` and
`FileUpload` instances. Whenever an `UploadContext` or `FileUpload` is created, modified,
or deleted, the UCS publishes a Kafka event containing the latest state. This is done
according to the Outbox Pattern (not described in further detail here).
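
To make the coupling between storage and publishing concrete, here is a minimal in-memory sketch of the idea. `InMemoryOutboxDao` and its methods are hypothetical stand-ins written for this example and do not reproduce Hexkit's API. A real outbox implementation persists the change and the pending event together and publishes separately, a step this sketch collapses into a list append.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class InMemoryOutboxDao:
    """Hypothetical stand-in: every stored change also records one event."""

    topic: str
    documents: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    events: List[Dict[str, Any]] = field(default_factory=list)

    def upsert(self, key: str, document: Dict[str, Any]) -> None:
        """Store the latest state and record an event carrying that state."""
        self.documents[key] = dict(document)
        self._record(key, "upserted", dict(document))

    def delete(self, key: str) -> None:
        """Remove the document and record a deletion event without payload."""
        del self.documents[key]  # raises KeyError for unknown keys
        self._record(key, "deleted", None)

    def _record(
        self, key: str, change: str, payload: Optional[Dict[str, Any]]
    ) -> None:
        """Append the event to the in-memory list standing in for Kafka."""
        self.events.append(
            {"topic": self.topic, "key": key, "type": change, "payload": payload}
        )
```

With this shape, the core never publishes events explicitly: it only calls `upsert` or `delete`, and the event stream follows from the stored state.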

#### Auth
Users will not access the UCS's HTTP API directly, but rather through the
`ghga-connector` or the Data Portal.
Successful access to the HTTP endpoints will require the encrypted access token
that the user obtained from the Data Portal when creating the Upload Context.
The HTTP request that creates the Upload Context does not come directly
from the user, but rather from the Study Repository Service.
For more information on the HTTP API, see the endpoint definitions below.

### Included/Required:
- Remove existing core logic
- Create new core class w/ outbox publisher
- Write Unit and Integration Tests

### Not included:
Archive Test Bed integration, Study Repository Service development, and front-end work.

## User Journeys

### UploadContext Creation
Using the Data Portal, the user initiates a file upload for a study. The request flows
from the Data Portal to the Study Repository Service (where it passes through
validation and other checks) and ultimately to the UCS's HTTP endpoint
`POST /contexts`. The UCS creates a new
`UploadContext` with the state set to `OPEN` and returns the `UploadContext` to the
Study Repository Service. The Study Repository Service returns authentication info
to the user via the Data Portal.

### UploadContext Update
The user makes a request to the `PATCH /contexts` endpoint via the Study Repository
Service. If a valid encrypted access token is supplied with the request, the UCS
updates the state of the `UploadContext` to `LOCKED`, `CLOSED`, or `OPEN`, as
specified by the request. If the `UploadContext` is already in the requested state,
nothing happens and the UCS returns a successful response.

The initial state of the `UploadContext` is `OPEN`. When the user has finished uploading
files, they can use the Data Portal to set the Context to a semi-finalized state,
`LOCKED`. If the user decides they need to make changes, such as uploading or removing
a file, they can revert the Context to `OPEN`. If no changes are needed, the user can
fully finalize the Context by setting it to `CLOSED`, after which no changes can be
made without opening a new `UploadContext`.

If the user tries to change the state of an `UploadContext` that is already `CLOSED`,
they receive an error. Otherwise, once the update operation is complete, the UCS
publishes a Kafka event reflecting the latest state of the `UploadContext` and returns
an HTTP response indicating the update was successful.

![UploadContext State Diagram](./images/upload_context.png)

An `UploadContext` may only be moved from `LOCKED` to `CLOSED` if all its linked
`FileUpload`s are set to `COMPLETED`. External logic in the Study Repository Service
is responsible for further validation, like ensuring interrogation was successful.
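
The transition rules described in this section can be sketched as a small lookup table plus a guard function. The names `ContextState`, `ALLOWED`, `TransitionError`, and `next_state` are hypothetical. The sketch also assumes that a direct `OPEN` to `CLOSED` transition is not permitted and that requesting the current state is always a no-op; the state diagram remains the authoritative source for both points.

```python
from enum import Enum


class ContextState(str, Enum):
    """Mirror of the three UploadContext states for this sketch."""

    OPEN = "open"
    LOCKED = "locked"
    CLOSED = "closed"


# Assumed transition table; OPEN -> CLOSED is treated as disallowed here.
ALLOWED = {
    ContextState.OPEN: {ContextState.LOCKED},
    ContextState.LOCKED: {ContextState.OPEN, ContextState.CLOSED},
    ContextState.CLOSED: set(),
}


class TransitionError(Exception):
    """Raised when a requested state change is not permitted."""


def next_state(
    current: ContextState, requested: ContextState, all_uploads_completed: bool
) -> ContextState:
    """Return the new state or raise TransitionError."""
    if requested == current:
        return current  # already in the requested state: nothing happens
    if requested not in ALLOWED[current]:
        raise TransitionError(f"{current.value} -> {requested.value} is not allowed")
    if requested is ContextState.CLOSED and not all_uploads_completed:
        raise TransitionError("all FileUploads must be COMPLETED before closing")
    return requested
```

Keeping the table separate from the guard makes it easy to compare against the state diagram and to translate `TransitionError` into an HTTP error at the API layer.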

### File Upload Init
The user initiates the upload process for a given single file by making a request to
the `POST /uploads` endpoint. The request body includes the unencrypted checksum, the
access token, and the alias (or whichever naming element is used to match the file
with the metadata content).

If a valid encrypted access token is supplied with the
request, the UCS ensures it doesn't already have a completed `FileUpload` for the same
file, then adds the `FileUpload` to the associated `UploadContext`.

The UCS publishes upsertion events to Kafka for both the `FileUpload` and
`UploadContext` objects, and finally returns an HTTP response to the user indicating
that the file upload was successfully initiated.
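
The core check in this journey might look roughly like the following sketch. It uses plain dataclasses instead of the Pydantic models, matches files by `alias`, and requires the context to be `open`; all three choices are assumptions for illustration. Token validation and event publishing are left out, and `UploadRejected` and `init_upload` are hypothetical names.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileUpload:
    upload_id: str
    alias: str
    checksum: str
    state: str = "init"


@dataclass
class UploadContext:
    upload_context_id: str
    state: str = "open"
    file_uploads: List[FileUpload] = field(default_factory=list)


class UploadRejected(Exception):
    """Raised when a new upload cannot be initiated."""


def init_upload(context: UploadContext, alias: str, checksum: str) -> FileUpload:
    """Add a new FileUpload unless a completed one exists for the same alias."""
    if context.state != "open":  # assumption: only OPEN contexts accept uploads
        raise UploadRejected("the UploadContext is not open")
    for existing in context.file_uploads:
        if existing.alias == alias and existing.state == "completed":
            raise UploadRejected(f"a completed upload already exists for {alias}")
    upload = FileUpload(upload_id=str(uuid.uuid4()), alias=alias, checksum=checksum)
    context.file_uploads.append(upload)
    return upload
```

Note that only a completed upload blocks a new one in this sketch; how to treat a second request while an earlier upload for the same file is still in `init` is not specified here.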

The `ghga-connector` uploads a given file in chunks, and for each chunk it requests
a pre-signed upload URL. If the request includes a valid access token, the UCS
returns an HTTP response containing the pre-signed upload URL.
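
On the client side, the chunking can be pictured with the sketch below, which splits a file of known size into numbered parts and pairs each part with a URL. `iter_parts` and `plan_upload` are hypothetical helpers, and `request_url` stands in for whatever call the connector makes to obtain a pre-signed URL. Part numbers start at 1, as in S3 multipart uploads.

```python
from typing import Callable, Iterator, List, Tuple


def iter_parts(total_size: int, part_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (part_number, offset, length) for each chunk of a file."""
    if total_size < 0 or part_size <= 0:
        raise ValueError("total_size must be >= 0 and part_size must be > 0")
    part_number = 1  # S3 multipart part numbers start at 1
    for offset in range(0, total_size, part_size):
        # The final part may be shorter than part_size.
        yield part_number, offset, min(part_size, total_size - offset)
        part_number += 1


def plan_upload(
    total_size: int, part_size: int, request_url: Callable[[int], str]
) -> List[Tuple[int, int, int, str]]:
    """Pair every part with the URL returned by the hypothetical URL request."""
    return [
        (number, offset, length, request_url(number))
        for number, offset, length in iter_parts(total_size, part_size)
    ]
```

For example, `list(iter_parts(10, 4))` yields `[(1, 0, 4), (2, 4, 4), (3, 8, 2)]`, so a 10-byte file with 4-byte parts needs three URL requests.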

### File Upload Termination (Upload Completion)
The user initiates a file upload using the `ghga-connector`. When the upload is
complete, the connector automatically makes a request to `PATCH /uploads`. This call
instructs the UCS to communicate with the S3 instance and terminate (complete) the
multipart upload. The UCS will update the `FileUpload` instance to `COMPLETED` and
publish a Kafka event reflecting the new state. Finally, the UCS will return an HTTP
response indicating the operation was successful and that the file is completely
uploaded.

### File Upload Deletion
The user makes a request to the `DELETE /uploads` endpoint, indicating they wish to
delete a file from the associated Upload Context. If a valid encrypted access token
is supplied with the request, the UCS cancels the ongoing upload if it exists and
deletes the `FileUpload` object from the database. It removes the reference
from the `file_uploads` field in the `UploadContext` and publishes Kafka events
reflecting the deletion of the upload and the new state of the Upload Context.
Finally, the UCS returns an HTTP response to the user indicating the deletion was
successful.
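
The deletion steps above can be sketched as follows. The dictionaries stand in for the database, `abort_multipart` is a hypothetical callback representing the S3 abort call, and the event tuples are placeholders for the real Kafka events. The sketch rejects only `closed` contexts, in line with the testing requirements; whether deletion from a `locked` context is allowed is not decided here.

```python
from typing import Any, Callable, Dict, List, Tuple


def delete_upload(
    context: Dict[str, Any],
    uploads: Dict[str, Dict[str, Any]],
    upload_id: str,
    abort_multipart: Callable[[str], None],
) -> List[Tuple[str, Any]]:
    """Delete a FileUpload and return the events that would be published."""
    if context["state"] == "closed":
        raise PermissionError("cannot remove a file from a CLOSED UploadContext")
    upload = uploads[upload_id]  # raises KeyError for unknown uploads
    if upload["state"] == "init":
        abort_multipart(upload_id)  # cancel the ongoing multipart upload first
    del uploads[upload_id]
    context["file_uploads"].remove(upload_id)  # drop the reference
    snapshot = {**context, "file_uploads": list(context["file_uploads"])}
    return [("file_upload_deleted", upload_id), ("upload_context_upserted", snapshot)]
```

Aborting the multipart upload before touching the stored state means a failed abort leaves the database unchanged, so the request can simply be retried.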

## API Definitions:

### RESTful/Synchronous:

- `POST /contexts`: Create a new `UploadContext`
- `PATCH /contexts`: Update an `UploadContext` to change its state
- `POST /uploads`: Initiate a multipart file upload
- `PATCH /uploads`: Signal that a multipart file upload has been completed
- `DELETE /uploads`: Remove a file upload from the `UploadContext`

### Payload Schemas for Events:

```python
from enum import StrEnum  # requires Python 3.11+

from pydantic import UUID4, BaseModel, Field


class FileUploadState(StrEnum):
    """The allowed states for a FileUpload instance"""

    INIT = "init"
    COMPLETED = "completed"


class FileUpload(BaseModel):
    """A File Upload"""

    upload_id: UUID4
    state: FileUploadState  # one of INIT, COMPLETED
    original_path: str
    checksum: str


class UploadContextState(StrEnum):
    """The allowed states for an UploadContext instance"""

    OPEN = "open"
    LOCKED = "locked"
    CLOSED = "closed"


class UploadContext(BaseModel):
    """A class representing an Upload Context"""

    upload_context_id: UUID4  # unique identifier for the instance
    state: UploadContextState  # one of OPEN, LOCKED, CLOSED
    # FileUpload is defined above so that this annotation resolves
    file_uploads: list[FileUpload] = Field(default_factory=list)
```


## Additional Implementation Details:


### Testing
Tests need to cover at least the following items (not exhaustive):
- Standard endpoint authentication battery
- Happy path for each endpoint
- Core error translation for HTTP API for each endpoint
- Disallow changing the state of a `CLOSED` `UploadContext`
- Disallow removing a file from a `CLOSED` `UploadContext`


## Human Resource/Time Estimation:

Number of sprints required: 1

Number of developers required: 1