This repository is associated with the HTR2HPC research project sponsored by the Center for Digital Humanities at Princeton. The project goal is to integrate the eScriptorium handwritten text recognition (HTR) software with high performance computing (HPC) clusters and task management.
> [!WARNING]
> This is experimental code for local use and assessment.
This package can be installed directly from GitHub using pip:
```sh
pip install git+https://github.com/Princeton-CDH/htr2hpc.git@main#egg=htr2hpc
```

`pucas` is a dependency of this package and will be included when you install this package.
Import the `htr2hpc` settings into the deployed eScriptorium local settings. They must be imported after the `escriptorium` settings so that the overrides take precedence.
```python
from escriptorium.settings import *
from htr2hpc.settings import *
```

This adjusts the settings as follows:
- Adds to `INSTALLED_APPS` and `AUTHENTICATION_BACKENDS` and provides a basic `PUCAS_LDAP` configuration to enable Princeton CAS authentication; configures `CAS_REDIRECT_URL` to use the escriptorium `LOGIN_REDIRECT_URL` configuration (currently the projects list page) and sets `CAS_IGNORE_REFERER = True` to avoid behavior where successful CAS login takes you back to the login page
- Sets `ROOT_URLCONF` to use `htr2hpc.urls`, which adds `pucas` url paths to the urls defined in `escriptorium.urls`
- Adds the `htr2hpc/templates` directory first in the list of template directories, so that any templates in this application take precedence over eScriptorium templates; currently used for customizing the login page to add Princeton CAS login (see the sketch after this list)
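To make these adjustments concrete, the fragment below sketches roughly what such overrides can look like in a Django settings module; the specific app, backend, and path values are illustrative assumptions, and `htr2hpc/settings.py` in this repository is the authoritative version.

```python
# Illustrative sketch only -- names and values are assumptions, not the
# actual contents of htr2hpc.settings; assumes it runs after
# `from escriptorium.settings import *` so the base settings are in scope.
INSTALLED_APPS += ["pucas"]  # Princeton CAS/LDAP integration app
AUTHENTICATION_BACKENDS += ["django_cas_ng.backends.CASBackend"]  # assumed CAS backend

CAS_REDIRECT_URL = LOGIN_REDIRECT_URL  # reuse escriptorium's post-login destination
CAS_IGNORE_REFERER = True  # avoid returning to the login page after a successful CAS login

ROOT_URLCONF = "htr2hpc.urls"  # adds pucas url paths to those in escriptorium.urls

# put htr2hpc templates first so they override eScriptorium templates
# (actual path handling in the real module may differ)
TEMPLATES[0]["DIRS"].insert(0, "htr2hpc/templates")
```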
To fully enable CAS, you must fill in the CAS server URL and the PUCAS LDAP settings in the local settings of your deployed application.
```python
from escriptorium.settings import *
from htr2hpc.settings import *

# CAS login configuration
CAS_SERVER_URL = "https://example.com/cas/"
PUCAS_LDAP.update(
    {
        "SERVERS": [
            "ldap2.example.com",
        ],
        "SEARCH_BASE": "",
        "SEARCH_FILTER": "(uid=%(user)s)",
        # other ldap attributes as needed
    }
)
```

This architecture diagram shows how the eScriptorium instance was deployed on Princeton hardware during the testing phase.
```mermaid
flowchart TB
    subgraph hpc["HPC"]
        remote[["remote task"]]
    end
    subgraph htrvm["eScriptorium VM"]
        Django["Django"]
        nginx["NGINX"]
        redis[("redis")]
        supervisord["supervisord"]
        celery["celery"]
        local[["local task"]]
    end
    subgraph pul["PUL infrastructure"]
        db[("PostgreSQL")]
        nfs[/"NFS"\]
        htrvm
    end
    nginx -- serves --> Django
    Django --> db
    Django -- queues tasks --> redis
    supervisord -- manages --> celery
    celery -- monitors --> redis
    celery -- runs --> local & remote
    htrvm --> nfs
```
For simplicity, we omit the second VM and load balancer; the two VMs are provisioned and deployed in the same way, and use shared PUL and HPC resources.
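As a rough illustration of the "remote task" path in the diagram, the sketch below shows a Celery task on the eScriptorium VM dispatching work to the HPC cluster over ssh; the host name, command name, and arguments are hypothetical placeholders rather than the actual htr2hpc interface.

```python
# Hypothetical sketch only: the real htr2hpc task signature and CLI differ.
import subprocess
from typing import Optional

from celery import shared_task

HPC_HOST = "hpc.example.edu"  # placeholder for the cluster login node


@shared_task
def train_on_hpc(document_pk: int, model_pk: Optional[int] = None) -> int:
    """Trigger a training run on the HPC system over ssh; return the ssh exit code."""
    command = ["htr2hpc-train", "--document", str(document_pk)]  # placeholder command
    if model_pk is not None:
        command += ["--model", str(model_pk)]
    # run the command on the cluster over ssh; the remote side talks to Slurm
    result = subprocess.run(["ssh", HPC_HOST, *command], capture_output=True, text=True)
    return result.returncode
```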
This sequence diagram shows the flow of operations between the eScriptorium instance, the htr2hpc installation on the HPC system, and Slurm.
The task is triggered via ssh, then training data and optionally a model are retrieved via REST API. The htr2hpc training task uses a two-job workflow: a preliminary calibration job runs first, and a second training job is then requested with resources and time based on the results of the calibration job (a simplified sketch of this workflow follows the diagram).
```mermaid
sequenceDiagram
    participant htrvm as htrvm
    participant htr2hpc as htr2hpc
    participant slurm as slurm
    autonumber
    htrvm ->> htr2hpc: Start training task
    activate htr2hpc
    htr2hpc -) htrvm: request data
    htr2hpc --) htrvm: request model
    htr2hpc ->> slurm: start calibration job
    activate slurm
    htr2hpc --x slurm: monitor job
    slurm ->> htr2hpc: calibration output
    deactivate slurm
    htr2hpc ->> slurm: start training job
    activate slurm
    htr2hpc --x slurm: monitor job
    deactivate slurm
    htr2hpc -) htrvm: upload model
    deactivate htr2hpc
```
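The snippet below sketches that two-job pattern with plain `sbatch --parsable` and `squeue` calls; the script names, resource values, and calibration output format are assumptions made for illustration, not the project's actual implementation.

```python
# Simplified sketch of the calibration-then-train workflow; file names,
# resource values, and the calibration output format are illustrative.
import json
import subprocess
import time


def submit(script: str, *sbatch_args: str) -> str:
    """Submit a batch script with sbatch and return the Slurm job id."""
    out = subprocess.run(
        ["sbatch", "--parsable", *sbatch_args, script],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip().split(";")[0]


def wait_for(job_id: str, poll: int = 30) -> None:
    """Poll squeue until the job is no longer listed (i.e. it has finished)."""
    while subprocess.run(
        ["squeue", "-h", "-j", job_id], capture_output=True, text=True
    ).stdout.strip():
        time.sleep(poll)


# 1. short calibration job to estimate per-epoch time and memory needs
calibration_id = submit("calibrate.sbatch", "--time=00:30:00")
wait_for(calibration_id)

# 2. read the calibration results (output format assumed to be JSON here)
with open(f"calibration-{calibration_id}.json") as f:
    calibration = json.load(f)

# 3. request the full training job with resources sized from the calibration run
training_id = submit(
    "train.sbatch",
    f"--time={calibration['estimated_time']}",
    f"--mem={calibration['estimated_mem']}",
    "--gres=gpu:1",
)
wait_for(training_id)
# the trained model is then uploaded back to eScriptorium via its REST API
```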
htr2hpc is distributed under the terms of the Apache 2 license.