-
Notifications
You must be signed in to change notification settings - Fork 95
feat: introduce custom tags extractor #662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Hi @queukat ! What I wonder is - why create a new class and not extend the functionality here: newspaper4k/newspaper/extractors/metadata_extractor.py Lines 164 to 174 in c5e4170
|
|
hey @AndyTheFactory Short AnswerWe introduced a dedicated Detailed ComparisonPurpose and ScopeMetadataExtractor:
TagsExtractor:
Reduced Complexity and RiskWithout a Separate Class:
With TagsExtractor:
Single Responsibility Principle
Flexibility and Future Proofing
No Impact on Existing Usage
ConclusionCreating a standalone
|
|
Hi @queukat Thanks for your compelling arguments. You are right, it would make sense to have the tags in their own extractor. Would it not make sense to move the whole functionality of def _get_tags into the TagsExtractor ? |
|
You're right — conceptually, it would make sense to move _get_tags entirely into TagsExtractor. That said, I’d like to clarify the intent behind the current setup. The TagsExtractor wasn’t designed as a general-purpose or centralized solution for tag extraction. It was created specifically to handle a few niche cases where tag elements weren't being captured — in particular, two smaller sites I came across where the existing metadata logic didn’t work. So rather than bloating MetadataExtractor, this was a way to isolate site-specific logic without introducing risk or complexity to the core system. If we start fully migrating everything tag-related into TagsExtractor, we risk repeating the same issue — just in a different place — and eventually end up with a monolithic TagsExtractor.py, defeating the purpose of separation and modularity. So yes, moving _get_tags is structurally reasonable, but it assumes TagsExtractor is meant to be a general solution — which it currently isn’t. If the goal is to evolve it into something more universal, that could be considered, but probably needs a broader discussion on scope and ownership. |
Issue Link: n/a
Changes Overview:
TagsExtractorclass that finds tags fromTagsExtractorintoContentExtractorto unify custom tag extractionArticle.parse()to merge extracted tags intoarticle.tagsLimitations:
<a class="lnk" or rel="tag">. Might need expansions for other patterns.Breaking Changes:
Testing Approach: