USE 259 - parse HTML for mitlibwebsite source #266
Merged
Purpose and background context
This PR updates the transformer for `mitlibwebsite` to use the full, rendered HTML that now comes out of browsertrix-harvester for the source records.
Most impactfully, we can now set the new `fulltext` field with full text extracted from the HTML in a way that makes sense for this source specifically, rather than relying on the full-text extraction from browsertrix, which is good but not great.
More mechanically, this recreates in Transmogrifier the minimal metadata parsing that browsertrix-harvester previously performed (at this time, really only an OpenGraph `og:description` element that we map to `summary`). However, we're set up nicely if we want to extract more metadata from the original page HTML in the future.
How can a reviewer manually see the effects of these changes?
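Before loading real records, the kind of parsing described above can be tried on a toy page. The following is a minimal standalone sketch, not the code in this PR: it uses only Python's standard-library `html.parser`, and the `PageExtractor` class, the `extract_fields` function, the sample HTML, and the choice to skip `script` and `style` content are all assumptions made for illustration. The actual transformer may use a different HTML library and different extraction rules.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Hypothetical helper: collects og:description and visible page text."""

    # Assumption: text inside these tags is not useful as full text.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.og_description = None
        self._text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:description":
            # OpenGraph description, which the PR maps to the summary field.
            self.og_description = attributes.get("content")
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._text_parts.append(data.strip())

    @property
    def fulltext(self):
        return " ".join(self._text_parts)


def extract_fields(html):
    """Return illustrative summary and fulltext values for one HTML page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {"summary": parser.og_description, "fulltext": parser.fulltext}


# Made-up sample page, not a real harvested record.
SAMPLE_HTML = """
<html>
  <head>
    <meta property="og:description" content="Library hours and locations.">
    <style>body { color: black; }</style>
  </head>
  <body>
    <h1>Hours</h1>
    <p>Open daily.</p>
    <script>console.log("ignored");</script>
  </body>
</html>
"""

fields = extract_fields(SAMPLE_HTML)
print(fields["summary"])
print(fields["fulltext"])
```

In this sketch a page without an `og:description` element yields `None` for the summary, which a caller could treat as "no summary available".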
1- Set AWS Dev1 credentials
2- Start ipython shell
3- Load the `MITLibWebsite` transformer with example records from S3:
4- Transform a single record to inspect:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES:
What are the relevant tickets?
Code review