USE 259 - parse HTML for mitlibwebsite source #266

ghukill · 2025-12-10T20:57:11Z

Purpose and background context

This PR updates the transformer for mitlibwebsite to utilize the full, rendered HTML that is now coming out of browsertrix-harvester for the source records.

Most significantly, we can now populate the new fulltext field with text extracted from the HTML in a way tailored to this source, rather than relying on browsertrix's generic full-text extraction, which is good but not great.
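As a rough illustration of the idea, the sketch below decodes a base64-encoded HTML payload (the commit message names the field `html_base64`) and collects the visible text. It uses only the standard library. The `extract_fulltext` helper and the `_TextExtractor` class are hypothetical names for this example and are not the code in this PR, which may use a different parser and a more source-specific selection of page content.

```python
import base64
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript content."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_fulltext(html_base64: str) -> str:
    """Decode base64 HTML (assumed UTF-8) and return its visible text."""
    html = base64.b64decode(html_base64).decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A real implementation would likely also drop navigation, header, and footer elements that repeat on every page; which elements those are for this site is not stated here, so the sketch leaves them in.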

More mechanically, this moves the minimal metadata parsing that was previously performed in browsertrix-harvester into Transmogrifier. At this time, that is really only the OpenGraph og:description element, which we map to summary. However, we're set up nicely if we want to extract more metadata from the original page HTML in the future.
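The og:description mapping can be sketched as follows, again with only the standard library. The names `_OpenGraphParser` and `extract_summary` are hypothetical, and returning a plain string (or `None` when the tag is absent) is an assumption made for the example; the actual transformer may shape the summary value differently.

```python
from html.parser import HTMLParser
from typing import Optional


class _OpenGraphParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> values."""

    def __init__(self) -> None:
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property")
        content = attr_map.get("content")
        if prop and prop.startswith("og:") and content is not None:
            # Keep the first occurrence of each OpenGraph property.
            self.properties.setdefault(prop, content)


def extract_summary(html: str) -> Optional[str]:
    """Return the og:description content, or None if it is missing."""
    parser = _OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties.get("og:description")
```

Collecting every `og:*` property, not just the description, keeps the door open for mapping further fields later without changing the parser.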

How can a reviewer manually see the effects of these changes?

1- Set AWS Dev1 credentials

2- Start ipython shell

pipenv run ipython

3- Load the MITLibWebsite transformer with example records from S3:

from transmogrifier.sources.transformer import Transformer


transformer = Transformer.load(
    "mitlibwebsite",
    "s3://timdex-extract-dev-222053980223/scratch/mitlibwebsite-2025-12-10-full-extracted-records-to-index.jsonl",
)

These records were prepared with the updated browsertrix-harvester, which includes the full HTML in each record.

4- Transform a single record to inspect:

import json

record = next(transformer)
transformed_record = json.loads(record.transformed_record.decode())

# note that summary is still set, parsed from HTML in Transmog
print(transformed_record['summary'])

# note that we are now setting the fulltext field as well
print(transformed_record['fulltext'])
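To check more than one record, a small helper along these lines could count how many transformed records have each field populated. `field_coverage` is a hypothetical helper for this walkthrough, not part of Transmogrifier; it works on already-decoded dictionaries like `transformed_record` above.

```python
def field_coverage(records, fields=("summary", "fulltext")):
    """Return (total, counts) where counts[f] is the number of records with a non-empty f."""
    counts = {field: 0 for field in fields}
    total = 0
    for record in records:
        total += 1
        for field in fields:
            if record.get(field):
                counts[field] += 1
    return total, counts
```

In the ipython session, it could be fed a handful of records decoded the same way as in step 4, for example by applying `json.loads(r.transformed_record.decode())` to the next few items from `transformer`.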

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES:

Full-text is now available in the TIMDEX record for the mitlibwebsite source.
If needed, this HTML parsing could be used to extract more granular, source-specific metadata in the future.

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-259

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

Now that browsertrix-harvester is including full HTML + response headers in the source record available to Transmogrifier, we can do two things:

1. Parse metadata for mitlibwebsite TIMDEX records from the original, full HTML in a more opinionated fashion than we could in browsertrix-harvester.
2. Extract good, meaningful full-text from the full HTML to use for the new `fulltext` field.

How this addresses that need:

Expects a new `html_base64` field in the browsertrix-harvester source records. Uses this to extract metadata and full-text for the record.

Side effects of this change:

* Full-text is now available in the TIMDEX record for the mitlibwebsite source.
* If needed, this HTML parsing could be used to extract more granular, source-specific metadata in the future.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-259

jonavellecuerdo

Exciting! 🤓

ghukill added 2 commits December 10, 2025 15:49

Default exclusion_list_path to None

2e88aa5

ghukill marked this pull request as ready for review December 10, 2025 21:26

ghukill requested a review from a team December 11, 2025 14:18

jonavellecuerdo self-assigned this Dec 11, 2025

jonavellecuerdo approved these changes Dec 11, 2025

Update dependencies, get TDA v3.7.1

356d18b

ghukill merged commit 090f1a6 into main Dec 11, 2025
3 checks passed
