@ghukill ghukill commented Dec 10, 2025

Purpose and background context

This PR updates the mitlibwebsite transformer to use the full, rendered HTML that browsertrix-harvester now includes in the source records.

Most importantly, we can now set the new fulltext field with full text extracted from the HTML in a way that makes sense for this source specifically, rather than relying on the full-text extraction from browsertrix (which is good, but not great).

More mechanically, this recreates in Transmogrifier the minimal metadata parsing that was previously performed in browsertrix-harvester (at this time, really only an OpenGraph og:description element that we map to summary). We're also set up nicely if we want to extract more metadata from the original page HTML in the future.
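As a rough illustration of the og:description to summary mapping described above, here is a minimal sketch using only the Python standard library. The class and function names (`OpenGraphDescriptionParser`, `extract_summary`) are hypothetical and not taken from this PR; the actual transformer may well use a different HTML parser and different structure.

```python
from html.parser import HTMLParser
from typing import Optional


class OpenGraphDescriptionParser(HTMLParser):
    """Hypothetical helper: capture the first og:description meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.description: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first non-empty og:description is kept.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:description":
            content = attributes.get("content")
            if content and content.strip():
                self.description = content.strip()


def extract_summary(html: str) -> Optional[str]:
    """Return the og:description content, or None if the page has none."""
    parser = OpenGraphDescriptionParser()
    parser.feed(html)
    parser.close()
    return parser.description
```

Returning `None` when the tag is absent leaves it to the caller to decide whether to omit the summary field entirely; that choice is an assumption of this sketch rather than documented behavior.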

How can a reviewer manually see the effects of these changes?

1- Set AWS Dev1 credentials

2- Start ipython shell

pipenv run ipython

3- Load the MITLibWebsite transformer with example records from S3:

from transmogrifier.sources.transformer import Transformer


transformer = Transformer.load(
    "mitlibwebsite",
    "s3://timdex-extract-dev-222053980223/scratch/mitlibwebsite-2025-12-10-full-extracted-records-to-index.jsonl",
)
  • These records were prepared with the updated browsertrix-harvester that includes the full HTML in the records

4- Transform a single record to inspect:

import json

record = next(transformer)
transformed_record = json.loads(record.transformed_record.decode())

# note that summary is still set, parsed from HTML in Transmog
print(transformed_record['summary'])

# note that we are now setting the fulltext field as well
print(transformed_record['fulltext'])

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES:

  • Full-text is now available in the TIMDEX record for the mitlibwebsite source.
  • If needed, this HTML parsing could be used to extract more granular, source-specific metadata in the future.

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-259

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

Now that browsertrix-harvester is including full HTML + response headers in the
source record available to Transmogrifier, we can do two things:

1. Parse metadata for mitlibwebsite TIMDEX records from the original, full HTML
in a more opinionated fashion than we could in browsertrix-harvester.

2. Extract good, meaningful full-text from the full HTML to use for the new
`fulltext` field.

How this addresses that need:

Expects a new `html_base64` field in the browsertrix-harvester source records.
Uses this to extract metadata and full-text for the record.
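
The general shape of this step can be sketched as follows: decode the `html_base64` field, then collect the visible text while skipping script and style content. This is an assumption-laden sketch, not the code in this commit; `VisibleTextParser` and `extract_fulltext` are hypothetical names, the record is assumed to be a plain dict, the HTML is assumed to be UTF-8, and the standard-library parser is used only to keep the example self-contained.

```python
import base64
from html.parser import HTMLParser
from typing import Optional


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: gather text nodes outside non-visible elements."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        # Join chunks and collapse any runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_fulltext(source_record: dict) -> Optional[str]:
    """Decode html_base64 from a source record and return its visible text."""
    encoded = source_record.get("html_base64")
    if not encoded:
        return None
    # Assumption: the harvested HTML is UTF-8; undecodable bytes are replaced.
    html = base64.b64decode(encoded).decode("utf-8", errors="replace")
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text() or None
```

A source-specific implementation would likely go further, for example dropping navigation and footer elements, which is the kind of opinionated extraction the full HTML makes possible.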

Side effects of this change:
* Full-text is now available in the TIMDEX record for the mitlibwebsite
source.
* If needed, this HTML parsing could be used to extract more granular,
source-specific metadata in the future.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-259
@ghukill ghukill marked this pull request as ready for review December 10, 2025 21:26
@ghukill ghukill requested a review from a team December 11, 2025 14:18
@jonavellecuerdo jonavellecuerdo self-assigned this Dec 11, 2025
@jonavellecuerdo jonavellecuerdo left a comment

Exciting! 🤓

@ghukill ghukill merged commit 090f1a6 into main Dec 11, 2025
3 checks passed