feat: introduce custom tags extractor #662

queukat · 2025-03-06T21:38:03Z

Issue Link: n/a

Changes Overview:

Added a new TagsExtractor class that finds tags from <div class="tags-links"> or <div id="articleTag"> containers
Injected TagsExtractor into ContentExtractor to unify custom tag extraction
Modified Article.parse() to merge extracted tags into article.tags
Updated docstrings and comments in English
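
As a rough illustration of the first item, the sketch below shows what a container-based tag extractor could look like. It is not the code from this PR: the class body, the `parse` method name, and the `TAG_LINK_PATHS` list are assumptions, the patterns are the examples mentioned in this thread, and it uses the standard library's `xml.etree.ElementTree` on well-formed markup instead of newspaper4k's lxml helpers so that it runs without dependencies.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Hypothetical patterns, taken from the examples mentioned in this thread.
# Note: [@class='...'] is an exact match, so class="tags-links foo" would not match.
TAG_LINK_PATHS = [
    ".//div[@class='tags-links']//a",
    ".//div[@id='articleTag']//a",
    ".//a[@class='lnk']",
    ".//a[@rel='tag']",
]


class TagsExtractor:
    """Illustrative stand-in, not the class added by the PR."""

    def parse(self, doc: ET.Element) -> Set[str]:
        tags: Set[str] = set()
        for path in TAG_LINK_PATHS:
            for link in doc.findall(path):
                # Join all text nodes inside the link and trim whitespace.
                text = "".join(link.itertext()).strip()
                if text:
                    tags.add(text)
        return tags


html = """
<html><body>
  <div class="tags-links"><a href="/t/python">Python</a> <a href="/t/lxml">lxml</a></div>
  <p><a rel="tag" href="/t/scraping">Scraping</a> <a href="/about">About</a></p>
</body></html>
"""
doc = ET.fromstring(html)
found = TagsExtractor().parse(doc)
print(sorted(found))  # ['Python', 'Scraping', 'lxml']
```

The plain "About" link is ignored because it matches none of the patterns, and a page without any of these containers simply yields an empty set.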

Limitations:

This approach currently only matches <a class="lnk"> or <a rel="tag"> links. Other patterns may need to be added later.
No localized or language-specific logic for tags yet.

Breaking Changes:

None. This PR only adds new functionality; existing usage should be unaffected.

Testing Approach:

Manually tested with sample HTML containing <div class="tags-links"> and <div id="articleTag">
Verified it does not break existing extraction if these containers are not present.

AndyTheFactory · 2025-03-09T20:32:25Z

Hi @queukat !
Thanks for your contribution!

What I wonder is - why create a new class and not extend the functionality here:

newspaper4k/newspaper/extractors/metadata_extractor.py

Lines 164 to 174 in c5e4170

    def _get_tags(self, doc: lxml.html.Element) -> Set[str]:
        """Extracts tags from the article's HTML"""
        elements = doc.xpath(A_HREF_TAG_SELECTOR)
        elements += doc.xpath(A_REL_TAG_SELECTOR)
        if not elements:
            return set()
        tags = [parsers.get_text(el) for el in elements if parsers.get_text(el)]
        return set(tags)

queukat · 2025-03-21T22:45:14Z

hey @AndyTheFactory

Short Answer

We introduced a dedicated TagsExtractor to keep “custom tag” logic separate from the standard metadata extraction that already happens in MetadataExtractor. This ensures we don’t complicate the existing logic for recognized meta/OG fields (title, description, keywords, canonical links, etc.), while still allowing us to parse specialized structures (e.g., <div class="tags-links">, <a class="lnk">, or <div id="articleTag">) that aren’t strictly part of typical metadata.

Detailed Comparison

Purpose and Scope

MetadataExtractor:

Focuses on collecting standard fields such as og:title, og:image, meta keywords, canonical links, and so on.
It has built-in definitions for recognized tags and attributes, like <meta name="description">, <meta property="og:type">, etc.

TagsExtractor:

Targets custom “tags” or site-specific containers (e.g., <div class="tags-links">), or links with rel="tag" or class="lnk".
These patterns often vary from site to site and do not necessarily appear in typical <meta> tags.
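
To make the distinction concrete, here is a small sketch of the two kinds of lookup on one page. The function names `standard_metadata` and `custom_tags` are made up for this example, and it uses the standard library's `xml.etree.ElementTree` on well-formed markup, not the library's own lxml-based code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set

page = ET.fromstring("""
<html>
  <head>
    <meta name="keywords" content="news, politics"/>
    <meta property="og:type" content="article"/>
  </head>
  <body>
    <div class="tags-links"><a href="/tag/elections">Elections</a></div>
  </body>
</html>
""")


def standard_metadata(doc: ET.Element) -> Dict[str, str]:
    """Collect name/property -> content pairs from <meta> elements."""
    meta: Dict[str, str] = {}
    for el in doc.iter("meta"):
        key = el.get("property") or el.get("name")
        content = el.get("content")
        if key and content:
            meta[key] = content
    return meta


def custom_tags(doc: ET.Element) -> Set[str]:
    """Collect link text from a site-specific container in the page body."""
    links = doc.findall(".//div[@class='tags-links']//a")
    return {"".join(a.itertext()).strip() for a in links}


print(standard_metadata(page))  # {'keywords': 'news, politics', 'og:type': 'article'}
print(custom_tags(page))        # {'Elections'}
```

The first function only ever reads `<meta>` attributes, while the second depends on markup that a particular site chose for its tag list.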

Reduced Complexity and Risk

Without a Separate Class:

Extending MetadataExtractor for special tags would couple two different responsibilities in one place (standard metadata vs. custom tag parsing).
That could introduce regressions or make future maintenance more confusing.

With TagsExtractor:

We keep the original metadata logic untouched.
If we want to add or remove patterns for custom tags later, we can do so in TagsExtractor alone, greatly reducing the chance of breaking any existing standard metadata extraction.

Single Responsibility Principle

MetadataExtractor: Responsible for standard, widely recognized metadata fields.
TagsExtractor: Focused solely on scanning containers and extracting text links that represent article “tags.”

This separation respects the Single Responsibility Principle, making each class easier to read, maintain, and test.
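
A minimal sketch of that separation, assuming a constructor-injection design: the class bodies, the `parse` method names, and the dictionary standing in for a parsed document are all hypothetical and do not mirror the real newspaper4k signatures.

```python
from typing import Dict, Optional, Set


class MetadataExtractor:
    """Stand-in for standard meta/OG extraction (details omitted)."""

    def parse(self, doc: dict) -> Dict[str, str]:
        return dict(doc.get("meta", {}))


class TagsExtractor:
    """Stand-in for custom tag extraction (details omitted)."""

    def parse(self, doc: dict) -> Set[str]:
        return set(doc.get("tag_links", []))


class UpperCaseTags(TagsExtractor):
    """A replacement tags extractor, used to show that swapping it is isolated."""

    def parse(self, doc: dict) -> Set[str]:
        return {tag.upper() for tag in super().parse(doc)}


class ContentExtractor:
    """Delegates to one extractor per responsibility."""

    def __init__(
        self,
        metadata_extractor: Optional[MetadataExtractor] = None,
        tags_extractor: Optional[TagsExtractor] = None,
    ) -> None:
        self.metadata_extractor = metadata_extractor or MetadataExtractor()
        self.tags_extractor = tags_extractor or TagsExtractor()

    def get_metadata(self, doc: dict) -> Dict[str, str]:
        return self.metadata_extractor.parse(doc)

    def get_tags(self, doc: dict) -> Set[str]:
        return self.tags_extractor.parse(doc)


doc = {"meta": {"og:title": "Example"}, "tag_links": ["python", "lxml"]}
default = ContentExtractor()
custom = ContentExtractor(tags_extractor=UpperCaseTags())

# Swapping the tags extractor changes tags only; metadata is identical.
print(default.get_tags(doc), custom.get_tags(doc))
print(default.get_metadata(doc) == custom.get_metadata(doc))  # True
```

Each extractor can also be unit-tested on its own, without constructing the other.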

Flexibility and Future Proofing

MetadataExtractor: If the underlying library or standards for meta tags evolve, changes can stay isolated here.
TagsExtractor: If we need to accommodate new container styles (e.g., a new theme with <div class="post-tags">), or gather different “tag-like” items, we can evolve it independently without affecting broader metadata logic.

No Impact on Existing Usage

All the standard extraction features remain exactly as they were.
The new tags logic is optional and only runs if you call get_tags() (or however it’s integrated).
This ensures no breaking changes for existing consumers of the library.
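
One way to picture why existing results stay intact is a union-style merge, sketched below. The `merge_tags` helper is hypothetical and is not taken from the PR. It only illustrates that adding extracted tags to a set never removes or alters the tags that were already there.

```python
from typing import Iterable, Set


def merge_tags(existing: Iterable[str], extracted: Iterable[str]) -> Set[str]:
    """Return existing tags plus any non-blank extracted tags."""
    merged = set(existing)  # existing tags are kept exactly as they are
    merged |= {tag.strip() for tag in extracted if tag and tag.strip()}
    return merged


# No custom tags found: the result equals the existing tags.
unchanged = merge_tags({"politics", "economy"}, set())

# Custom tags found: they are added, duplicates and blanks are dropped.
extended = merge_tags({"politics"}, ["Elections", "politics", "  "])

print(unchanged == {"politics", "economy"})  # True
print(sorted(extended))                      # ['Elections', 'politics']
```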

Conclusion

Creating a standalone TagsExtractor is a good architectural choice when dealing with site- or project-specific “custom tags.” It cleanly separates concerns from MetadataExtractor, which focuses on recognized meta/OG fields.

This approach follows best practices (like the Single Responsibility Principle), keeps the code base more modular and maintainable, and avoids risk of regressions in the existing metadata extraction.

AndyTheFactory · 2025-03-23T13:41:49Z

Hi @queukat

Thanks for your compelling arguments. You are right, it would make sense to have the tags in their own extractor.

Would it not make sense to move the whole functionality of _get_tags into TagsExtractor?

queukat · 2025-03-24T10:15:43Z

@AndyTheFactory

You're right — conceptually, it would make sense to move _get_tags entirely into TagsExtractor. That said, I’d like to clarify the intent behind the current setup.

The TagsExtractor wasn’t designed as a general-purpose or centralized solution for tag extraction. It was created specifically to handle a few niche cases where tag elements weren't being captured — in particular, two smaller sites I came across where the existing metadata logic didn’t work. So rather than bloating MetadataExtractor, this was a way to isolate site-specific logic without introducing risk or complexity to the core system.

If we start fully migrating everything tag-related into TagsExtractor, we risk repeating the same issue — just in a different place — and eventually end up with a monolithic TagsExtractor.py, defeating the purpose of separation and modularity.

So yes, moving _get_tags is structurally reasonable, but it assumes TagsExtractor is meant to be a general solution — which it currently isn’t. If the goal is to evolve it into something more universal, that could be considered, but probably needs a broader discussion on scope and ownership.
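
For the broader discussion, one possible middle ground is to keep patterns as data, with generic and site-specific rules in separate lists that share one extraction function. This is only a sketch of the idea: `TagRule`, `GENERIC_RULES`, `SITE_RULES`, and `extract_tags` are invented names, and the code uses the standard library's `xml.etree.ElementTree` in place of the library's lxml selectors.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class TagRule:
    """One named pattern for locating tag links (hypothetical structure)."""

    name: str
    path: str  # ElementPath expression understood by Element.findall


# Broadly applicable patterns, such as rel="tag" links.
GENERIC_RULES = (TagRule("rel-tag", ".//a[@rel='tag']"),)

# Patterns for individual sites, kept in their own list.
SITE_RULES = (
    TagRule("tags-links", ".//div[@class='tags-links']//a"),
    TagRule("lnk", ".//a[@class='lnk']"),
)


def extract_tags(doc: ET.Element, rules: Iterable[TagRule]) -> Set[str]:
    """Apply each rule and collect the non-empty link texts."""
    tags: Set[str] = set()
    for rule in rules:
        for link in doc.findall(rule.path):
            text = "".join(link.itertext()).strip()
            if text:
                tags.add(text)
    return tags


doc = ET.fromstring(
    "<body>"
    "<a rel='tag' href='/t/a'>Alpha</a>"
    "<div class='tags-links'><a href='/t/b'>Beta</a></div>"
    "</body>"
)
generic_only = extract_tags(doc, GENERIC_RULES)
everything = extract_tags(doc, GENERIC_RULES + SITE_RULES)
print(sorted(generic_only))  # ['Alpha']
print(sorted(everything))    # ['Alpha', 'Beta']
```

With this layout, supporting another site means adding one entry to a list, which may limit how much the extractor class itself grows.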

feat(parse): add custom tag extraction logic

4d56706

queukat changed the title ~~feat(tags): introduce custom tags extractor~~ feat: introduce custom tags extractor Mar 6, 2025
