Commit ead98fb

Merge pull request #1465 from MITLibraries/etd-669-apt-integration

Wire preservation workflow to Archival Packaging Tool (APT)

2 parents: 3e33fd2 + 04047b7

14 files changed: +388 additions, -137 deletions

Gemfile

Lines changed: 1 addition & 0 deletions

@@ -36,6 +36,7 @@ gem 'sentry-ruby'
 gem 'simple_form'
 gem 'skylight'
 gem 'terser'
+gem 'webmock'
 gem 'zip_tricks'

 group :production do

Gemfile.lock

Lines changed: 9 additions & 0 deletions

@@ -154,6 +154,9 @@ GEM
     cocoon (1.2.15)
     concurrent-ruby (1.3.5)
     connection_pool (2.5.3)
+    crack (1.0.0)
+      bigdecimal
+      rexml
     crass (1.0.6)
     date (3.4.1)
     delayed_job (4.1.13)
@@ -182,6 +185,7 @@ GEM
       terminal-table (>= 1.8)
     globalid (1.2.1)
       activesupport (>= 6.1)
+    hashdiff (1.2.0)
     hashie (5.0.0)
     i18n (1.14.7)
       concurrent-ruby (~> 1.0)
@@ -445,6 +449,10 @@ GEM
       activemodel (>= 6.0.0)
       bindex (>= 0.4.0)
       railties (>= 6.0.0)
+    webmock (3.25.1)
+      addressable (>= 2.8.0)
+      crack (>= 0.3.2)
+      hashdiff (>= 0.4.0, < 2.0.0)
     websocket (1.2.11)
     websocket-driver (0.7.7)
       base64
@@ -509,6 +517,7 @@ DEPENDENCIES
   terser
   timecop
   web-console
+  webmock
   zip_tricks

 RUBY VERSION

README.md

Lines changed: 20 additions & 7 deletions

@@ -159,6 +159,19 @@ polling by specifying a longer queue wait time. Defaults to 10 if unset.
 `SQS_RESULT_IDLE_TIMEOUT` - Configures the :idle_timeout arg of the AWS poll method, which specifies the maximum time
 in seconds to wait for a new message before the polling loop exits. Defaults to 0 if unset.

+### Archival Packaging Tool (APT) configuration
+
+The following environment variables are needed to communicate with [APT](https://github.com/MITLibraries/archival-packaging-tool), which is used in the
+[preservation workflow](#preservation-workflow).
+
+`APT_CHALLENGE_SECRET` - Secret value used to authenticate requests to the APT Lambda endpoint.
+`APT_VERBOSE` - If set to `true`, enables verbose logging for APT requests.
+`APT_CHECKSUMS_TO_GENERATE` - Array of checksum algorithms to generate for files (default: ['md5']).
+`APT_COMPRESS_ZIP` - Boolean value to indicate whether the output bag should be compressed as a zip
+file (default: true).
+`APT_S3_BUCKET` - S3 bucket URI where APT output bags are stored.
+`APT_LAMBDA_URL` - The URL of the APT Lambda endpoint for preservation requests.
+
 ### Email configuration

 `SMTP_ADDRESS`, `SMTP_PASSWORD`, `SMTP_PORT`, `SMTP_USER` - all required to send mail.
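Since ENV values are always strings, the application casts `APT_VERBOSE` and `APT_COMPRESS_ZIP` to real booleans (via `ActiveModel::Type::Boolean`) before building the APT payload. A plain-Ruby sketch of that casting, using a hypothetical `cast_env_bool` helper:

```ruby
# Hypothetical helper sketching how a string ENV var can be cast to a boolean.
# The app itself uses ActiveModel::Type::Boolean; this simplified stand-in
# only recognizes a few common truthy strings.
def cast_env_bool(name, default: true)
  raw = ENV.fetch(name, default.to_s)
  %w[true t 1 yes].include?(raw.to_s.downcase)
end

ENV['APT_COMPRESS_ZIP'] = 'true'
cast_env_bool('APT_COMPRESS_ZIP') # => true
```

This matters because APT strictly requires a JSON boolean for `compress_zip`, so passing the raw ENV string through would be rejected.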
@@ -397,15 +410,15 @@ Note: `Pending publication` is allowed here, but not expected to be a normal occurrence.
 ## Preservation workflow

 The publishing workflow will automatically trigger preservation for all of the published theses in the results queue.
-At this point a submission information package is generated for each thesis, then a bag is constructed, zipped, and
-streamed to an S3 bucket. (See the SubmissionInformationPackage and SubmissionInformationPackageZipper classes for more
-details on this part of the process.)

-Once they are in the S3 bucket, the bags are automatically replicated to the Digital Preservation S3 bucket, where they
-can be ingested into Archivematica.
+At this point, the preservation job will generate an Archivematica payload for each thesis, which
+is then POSTed to [APT](https://github.com/MITLibraries/archival-packaging-tool) for further processing. Each payload
+includes a metadata CSV and a JSON object containing structural information about the thesis files.
+
+Once the payloads are sent to APT, each thesis is structured as a BagIt bag and saved to an S3
+bucket, where it can be ingested into Archivematica.

-A thesis can be sent to preservation more than once. In order to track provenance across multiple preservation events,
-we persist certain data about the SIP and audit the model using `paper_trail`.
+A thesis can be sent to preservation more than once. In order to track provenance across multiple
+preservation events, we persist certain data about the Archivematica payload and audit the model
+using `paper_trail`.

 ### Preserving a single thesis

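Putting the README's description together with the model code in this commit, the Archivematica payload POSTed to APT is a single JSON object. The sketch below shows its general shape; all bucket names, keys, and checksums are placeholder values, not real data:

```ruby
require 'json'

# Illustrative sketch of the JSON payload POSTed to APT.
# Every value here is a placeholder; the real values come from ENV and the thesis.
payload = {
  action: 'create-bagit-zip',
  challenge_secret: 'fake-challenge-secret',
  verbose: false,
  input_files: [
    {
      uri: 's3://fake-etd-bucket/some-blob-key',
      filepath: 'thesis.pdf',
      checksums: { md5: '9e107d9d372bb6826bd81d3542a419d6' }
    },
    {
      uri: 's3://fake-etd-bucket/another-blob-key',
      filepath: 'metadata/metadata.csv',
      checksums: { md5: 'e4d909c290d0fb1ca068ffaddf22cbd0' }
    }
  ],
  checksums_to_generate: ['md5'],
  output_zip_s3_uri: 's3://fake-apt-bucket/etdsip/2024/2024-06-12345/1721.1_999999-thesis-1.zip',
  compress_zip: true
}

json = payload.to_json
parsed = JSON.parse(json)
parsed['action'] # => "create-bagit-zip"
```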
Lines changed: 34 additions & 7 deletions

@@ -1,13 +1,16 @@
 class PreservationSubmissionJob < ActiveJob::Base
+  require 'net/http'
+  require 'uri'
+
   queue_as :default

   def perform(theses)
     Rails.logger.info("Preparing to send #{theses.count} theses to preservation")
     results = { total: theses.count, processed: 0, errors: [] }
     theses.each do |thesis|
       Rails.logger.info("Thesis #{thesis.id} is now being prepared for preservation")
-      sip = thesis.submission_information_packages.create!
-      preserve_sip(sip)
+      payload = thesis.archivematica_payloads.create!
+      preserve_payload(payload)
       Rails.logger.info("Thesis #{thesis.id} has been sent to preservation")
       results[:processed] += 1
     rescue StandardError, Aws::Errors => e
@@ -20,10 +23,34 @@ def perform(theses)

   private

-  def preserve_sip(sip)
-    SubmissionInformationPackageZipper.new(sip)
-    sip.preservation_status = 'preserved'
-    sip.preserved_at = DateTime.now
-    sip.save
+  def preserve_payload(payload)
+    post_payload(payload)
+    payload.preservation_status = 'preserved'
+    payload.preserved_at = DateTime.now
+    payload.save!
+  end
+
+  def post_payload(payload)
+    s3_url = ENV.fetch('APT_LAMBDA_URL', nil)
+    uri = URI.parse(s3_url)
+    request = Net::HTTP::Post.new(uri, { 'Content-Type' => 'application/json' })
+    request.body = payload.payload_json
+
+    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
+      http.request(request)
+    end
+
+    validate_response(response)
+  end
+
+  def validate_response(response)
+    unless response.is_a?(Net::HTTPSuccess)
+      raise "Failed to post Archivematica payload to APT: #{response.code} #{response.body}"
+    end

+    result = JSON.parse(response.body)
+    unless result['success'] == true
+      raise "APT failed to create a bag: #{response.body}"
+    end
   end
 end
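The job's response validation is two-stage: an HTTP-level status check, then an application-level check of APT's JSON body. That logic can be exercised without a live endpoint; in the sketch below, `FakeResponse` is a test stand-in (the job itself checks `Net::HTTPSuccess` on a real `Net::HTTP` response):

```ruby
require 'json'

# FakeResponse stands in for Net::HTTPResponse so the validation logic can be
# exercised offline; the real job checks response.is_a?(Net::HTTPSuccess).
FakeResponse = Struct.new(:code, :body) do
  def success?
    code.to_i.between?(200, 299)
  end
end

# Mirrors the job's validate_response: raise on HTTP failure, then raise if
# APT's JSON body does not report success.
def validate_apt_response(response)
  raise "Failed to post Archivematica payload to APT: #{response.code} #{response.body}" unless response.success?

  result = JSON.parse(response.body)
  raise "APT failed to create a bag: #{response.body}" unless result['success'] == true

  result
end

validate_apt_response(FakeResponse.new('200', '{"success": true}')) # => {"success"=>true}
```

Note that the `webmock` gem added in this commit's Gemfile is the likely tool for stubbing the real `Net::HTTP` call in the test suite.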
Lines changed: 101 additions & 0 deletions

@@ -0,0 +1,101 @@
+# == Schema Information
+#
+# Table name: archivematica_payloads
+#
+#  id                  :integer   not null, primary key
+#  preservation_status :integer   default("unpreserved"), not null
+#  payload_json        :text
+#  preserved_at        :datetime
+#  thesis_id           :integer   not null
+#  created_at          :datetime  not null
+#  updated_at          :datetime  not null
+#
+# This class assembles a payload to send to the Archival Packaging Tool (APT), which then creates a bag for
+# preservation. It includes the thesis files, metadata, and checksums. The payload is serialized to JSON
+# for transmission.
+#
+# Instances of this class are invalid without an associated thesis that has a DSpace handle, a copyright, and
+# at least one attached file with no duplicate filenames.
+#
+# There is some intentional duplication between this and the SubmissionInformationPackage model.
+# SubmissionInformationPackage is the legacy model that was used to create the bag, but it is not
+# used in the current APT workflow. We are retaining it for historical purposes.
+class ArchivematicaPayload < ApplicationRecord
+  include Checksums
+  include Baggable
+
+  has_paper_trail
+  belongs_to :thesis
+  has_one_attached :metadata_csv
+
+  validates :baggable?, presence: true
+
+  before_create :set_metadata_csv, :set_payload_json
+
+  enum preservation_status: %i[unpreserved preserved]
+
+  private
+
+  # compress_zip is cast to a boolean to override the string value from ENV. APT strictly requires
+  # a boolean for this field.
+  def build_payload
+    {
+      action: 'create-bagit-zip',
+      challenge_secret: ENV.fetch('APT_CHALLENGE_SECRET', nil),
+      verbose: ActiveModel::Type::Boolean.new.cast(ENV.fetch('APT_VERBOSE', false)),
+      input_files: build_input_files,
+      checksums_to_generate: ENV.fetch('APT_CHECKSUMS_TO_GENERATE', ['md5']),
+      output_zip_s3_uri: bag_output_uri,
+      compress_zip: ActiveModel::Type::Boolean.new.cast(ENV.fetch('APT_COMPRESS_ZIP', true))
+    }
+  end
+
+  # Build the input_files array from the thesis files and the attached metadata CSV.
+  def build_input_files
+    files = thesis.files.map { |file| build_file_entry(file) }
+    files << build_file_entry(metadata_csv) # The metadata CSV is the only file generated in this model.
+    files
+  end
+
+  # Build a file entry for each file, including the metadata CSV.
+  def build_file_entry(file)
+    {
+      uri: ["s3://#{ENV.fetch('AWS_S3_BUCKET')}", file.blob.key].join('/'),
+      filepath: set_filepath(file),
+      checksums: {
+        md5: base64_to_hex(file.blob.checksum)
+      }
+    }
+  end
+
+  def set_filepath(file)
+    file == metadata_csv ? 'metadata/metadata.csv' : file.filename.to_s
+  end
+
+  # The bag_name has to be unique because we use it as the basis of an ActiveStorage key. A UUID was
+  # not preferred: the target system adds its own UUID to each bag on arrival, which left filenames
+  # unwieldy with two embedded UUIDs, so we simply increment integers instead.
+  def bag_name
+    safe_handle = thesis.dspace_handle.gsub('/', '_')
+    "#{safe_handle}-thesis-#{thesis.submission_information_packages.count + 1}"
+  end
+
+  # The bag_output_uri key is constructed to match the expected format for Archivematica.
+  def bag_output_uri
+    key = "etdsip/#{thesis.graduation_year}/#{thesis.graduation_month}-#{thesis.accession_number}/#{bag_name}.zip"
+    [ENV.fetch('APT_S3_BUCKET'), key].join('/')
+  end
+
+  def baggable?
+    baggable_thesis?(thesis)
+  end
+
+  def set_metadata_csv
+    csv_data = ArchivematicaMetadata.new(thesis).to_csv
+    metadata_csv.attach(io: StringIO.new(csv_data), filename: 'metadata.csv', content_type: 'text/csv')
+  end
+
+  def set_payload_json
+    self.payload_json = build_payload.to_json
+  end
+end
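One detail worth noting in `build_file_entry`: ActiveStorage stores blob checksums as base64-encoded MD5 digests, while the APT payload uses hex. The `base64_to_hex` call comes from the `Checksums` concern; a sketch of the conversion it presumably performs (an assumption, not the concern's actual code):

```ruby
require 'base64'
require 'digest'

# Assumed implementation: decode the base64 MD5 digest ActiveStorage stores,
# then re-encode the raw bytes as a lowercase hex string for the APT payload.
def base64_to_hex(base64_checksum)
  Base64.decode64(base64_checksum).unpack1('H*')
end

b64 = Base64.strict_encode64(Digest::MD5.digest('hello world'))
base64_to_hex(b64) # => "5eb63bbbe01eeed093cb22bb8f5acdc3"
```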

app/models/submission_information_package.rb

Lines changed: 4 additions & 0 deletions

@@ -14,6 +14,10 @@
 # updated_at :datetime not null
 #

+# This model is no longer used, but it is retained for historical purposes and to preserve existing
+# data. Its functionality has been replaced by the ArchivematicaPayload model, which is used in the
+# current preservation workflow.
+#
 # Creates the structure for an individual thesis to be preserved in Archivematica according to the BagIt spec:
 # https://datatracker.ietf.org/doc/html/rfc8493.
 #

app/models/submission_information_package_zipper.rb

Lines changed: 0 additions & 54 deletions
This file was deleted.

app/models/thesis.rb

Lines changed: 1 addition & 0 deletions

@@ -48,6 +48,7 @@ class Thesis < ApplicationRecord
   has_many :users, through: :authors

   has_many :submission_information_packages, dependent: :destroy
+  has_many :archivematica_payloads, dependent: :destroy

   has_many_attached :files
   has_one_attached :dspace_metadata

config/environments/test.rb

Lines changed: 5 additions & 0 deletions

@@ -40,9 +40,14 @@
   ENV['SQS_RESULT_WAIT_TIME_SECONDS'] = '10'
   ENV['SQS_RESULT_IDLE_TIMEOUT'] = '0'
   ENV['AWS_REGION'] = 'us-east-1'
+  ENV['AWS_S3_BUCKET'] = 'fake-etd-bucket'
   ENV['DSPACE_DOCTORAL_HANDLE'] = '1721.1/999999'
   ENV['DSPACE_GRADUATE_HANDLE'] = '1721.1/888888'
   ENV['DSPACE_UNDERGRADUATE_HANDLE'] = '1721.1/777777'
+  ENV['APT_CHALLENGE_SECRET'] = 'fake-challenge-secret'
+  ENV['APT_S3_BUCKET'] = 's3://fake-apt-bucket'
+  ENV['APT_LAMBDA_URL'] = 'https://fake-lambda.example.com/'
+  ENV['APT_COMPRESS_ZIP'] = 'true'

   # While tests run files are not watched, reloading is not necessary.
   config.enable_reloading = false
Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+class CreateArchivematicaPayloads < ActiveRecord::Migration[7.1]
+  def change
+    create_table :archivematica_payloads do |t|
+      t.integer :preservation_status, null: false, default: 0
+      t.text :payload_json
+      t.datetime :preserved_at
+
+      t.references :thesis, null: false, foreign_key: true
+
+      t.timestamps
+    end
+  end
+end
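The `preservation_status` integer column backs the model's positional Rails enum (`enum preservation_status: %i[unpreserved preserved]`), so the database default of 0 maps to `unpreserved`. A plain-Ruby sketch of that index-to-name mapping:

```ruby
# Plain-Ruby sketch of how a positional Rails enum assigns integers to status
# names; the migration's default of 0 therefore means "unpreserved".
STATUSES = %i[unpreserved preserved].each_with_index.to_h
# => {:unpreserved=>0, :preserved=>1}

STATUSES[:unpreserved] # => 0
STATUSES.key(1)        # => :preserved
```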
