NLP Assignment -- Parts 2 & 3

Overview

This project covers Questions 8-12 of the NLP assignment, implementing core Natural Language Processing techniques including stemming, lemmatization, named entity recognition, POS tagging, web text processing, and keyword extraction.

Files

| File | Description |
| --- | --- |
| `nlp_assignment.ipynb` | Main Jupyter Notebook with all questions (Q8-Q12) |
| `nlp_assignment.py` | Standalone Python script (same content, runnable without Jupyter) |
| `test_nlp_assignment.py` | Test suite with 66 tests covering all NLP operations |
| `output/` | Auto-generated visualization charts (PNG) |

Questions Covered

Q8 -- Stemming and Lemmatization

Applies Porter Stemmer and WordNet Lemmatizer to: ["played", "drown", "buses", "students", "agreed", "going", "sad"]
Includes comparison table (Porter Stemmer vs WordNet verb/noun vs spaCy)
Bar chart visualization comparing word lengths
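
As a rough illustration of the stemming step, the sketch below runs NLTK's `PorterStemmer` over the Q8 word list. The helper name `stem_words` is hypothetical and is not taken from `nlp_assignment.py`; the actual script may structure this differently.

```python
from nltk.stem import PorterStemmer

# Word list from Q8
WORDS = ["played", "drown", "buses", "students", "agreed", "going", "sad"]


def stem_words(words):
    """Return (word, stem) pairs using NLTK's Porter stemmer.

    Hypothetical helper for illustration only.
    """
    stemmer = PorterStemmer()
    return [(word, stemmer.stem(word)) for word in words]


if __name__ == "__main__":
    for word, stem in stem_words(WORDS):
        print(f"{word:10s} -> {stem}")
```

The lemmatization side follows the same pattern with `WordNetLemmatizer().lemmatize(word, pos="v")` (or `pos="n"`), but it needs the WordNet corpus, which can be fetched once with `nltk.download("wordnet")`.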

Q9 -- Named Entity Recognition (NER)

Five custom sentences analyzed with spaCy NER
Entities identified: PERSON, ORG, GPE, LOC, DATE, MONEY, CARDINAL, ORDINAL
Entity type distribution chart

Q10 -- English URL Web Page Analysis

URL: Wikipedia - Natural Language Processing
10a) HTML cleaning pipeline (requests + BeautifulSoup) with ~83% size reduction
10b) POS tag extraction: nouns, adjectives, verbs with Top 15 frequency charts
10c) Sentence-level NER using NLTK sentence tokenizer + spaCy
10d) Noun co-occurrence analysis -- top pair: "language" + "processing"
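
The noun co-occurrence step in 10d can be sketched as counting adjacent noun pairs over POS-tagged tokens. This is a minimal sketch under assumptions: the function name `adjacent_noun_pairs` is hypothetical, and the sample tokens are hand-tagged for illustration rather than produced by spaCy or taken from the Wikipedia page.

```python
from collections import Counter


def adjacent_noun_pairs(tagged_tokens):
    """Count pairs of nouns that appear directly next to each other.

    tagged_tokens: list of (token, pos) tuples using universal POS tags.
    Hypothetical helper for illustration only.
    """
    pairs = Counter()
    for (tok_a, pos_a), (tok_b, pos_b) in zip(tagged_tokens, tagged_tokens[1:]):
        if pos_a == "NOUN" and pos_b == "NOUN":
            pairs[(tok_a.lower(), tok_b.lower())] += 1
    return pairs


# Hand-tagged sample input (illustrative, not real pipeline output)
sample = [
    ("Natural", "ADJ"), ("language", "NOUN"), ("processing", "NOUN"),
    ("uses", "VERB"), ("language", "NOUN"), ("processing", "NOUN"),
    ("and", "CCONJ"), ("machine", "NOUN"), ("learning", "NOUN"),
]
pairs = adjacent_noun_pairs(sample)
print(pairs.most_common(1))  # [(('language', 'processing'), 2)]
```

In the real pipeline the `(token, pos)` tuples would presumably come from a spaCy `Doc`, for example `[(t.text, t.pos_) for t in doc]`.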

Q11 -- Keyword Extraction (H-rank and D-rank)

H-rank: TF-IDF-based keyword ranking on 4 English URLs
D-rank: RAKE (Rapid Automatic Keyword Extraction) on 4 English URLs + 1 French URL
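
To show the idea behind the TF-IDF ranking, here is a plain-Python sketch that scores each document's terms against a small corpus. It is a stand-in, not the project's code: the project lists scikit-learn in its requirements, and scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its scores will differ from these. The function names and the three sample strings are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_n=3):
    """Rank each document's terms by TF-IDF across the given corpus.

    Uses raw term frequency and unsmoothed idf = ln(N / df).
    Hypothetical helper for illustration only.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    ranked = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically
        ranked.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n])
    return ranked


# Made-up sample corpus
docs = [
    "language models process language",
    "deep learning models learn features",
    "data science uses data",
]
for keywords in tfidf_keywords(docs):
    print([term for term, _ in keywords])
```

With this formulation a term that occurs in every document gets a score of zero, which is why common words drop out of the ranking even without an explicit stop word list.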

Q12 -- Full Keyword Analysis on 5+ URLs

URL component parsing (scheme, host, path, params, query, fragment)
Additional extracted features: word count, sentence count, named entities, POS distribution, lexical diversity
10 keywords via H-rank and D-rank for each URL
Summary comparison table and cross-URL bar charts
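
The URL component parsing listed above maps directly onto the standard library's `urllib.parse.urlparse`, which returns exactly these six parts. The sketch below also includes a simple lexical diversity measure, assumed here to be the type-token ratio (unique words divided by total words); the project may define it differently. Both function names are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_components(url):
    """Split a URL into scheme, host, path, params, query, and fragment."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "params": parts.params,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def lexical_diversity(text):
    """Unique words divided by total words (assumed definition)."""
    words = re.findall(r"\w+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


info = url_components("https://en.wikipedia.org/wiki/Natural_language_processing")
print(info["host"], info["path"])  # en.wikipedia.org /wiki/Natural_language_processing
print(lexical_diversity("the cat saw the dog"))  # 0.8
```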

URLs Analyzed

| # | URL | Language |
| --- | --- | --- |
| 1 | https://en.wikipedia.org/wiki/Natural_language_processing | English |
| 2 | https://en.wikipedia.org/wiki/Artificial_intelligence | English |
| 3 | https://en.wikipedia.org/wiki/Deep_learning | English |
| 4 | https://en.wikipedia.org/wiki/Data_science | English |
| 5 | https://fr.wikipedia.org/wiki/Intelligence_artificielle | French |

Requirements

python >= 3.10
nltk
spacy (+ en_core_web_sm model)
scikit-learn
beautifulsoup4
requests
rake-nltk
matplotlib
numpy
pandas

How to Run

Jupyter Notebook (recommended)

jupyter notebook nlp_assignment.ipynb

Then run all cells with Kernel > Restart & Run All.

Python Script

python nlp_assignment.py

Tests

pytest test_nlp_assignment.py -v

Test Coverage

66 tests across 10 test classes:

| Test Class | Tests | What It Validates |
| --- | --- | --- |
| TestStemming | 8 | Porter stemmer outputs, determinism, type checks |
| TestLemmatization | 10 | Verb/noun POS lemmatization, spaCy vs NLTK |
| TestNER | 13 | Entity detection per sentence, edge cases |
| TestTextCleaning | 8 | HTML tag removal, unicode handling, empty input |
| TestPOSTagging | 4 | Noun/verb/adj detection, universal POS tags |
| TestNounPairs | 2 | Adjacent noun co-occurrence counting |
| TestHRankKeywords | 7 | TF-IDF output format, scores, stop word exclusion |
| TestDRankKeywords | 6 | RAKE output validation, score ordering |
| TestSentenceTokenization | 4 | Edge cases, abbreviation handling |
| TestIntegration | 4 | End-to-end pipelines combining multiple steps |
