This project covers Questions 8-12 of the NLP assignment, implementing core Natural Language Processing techniques including stemming, lemmatization, named entity recognition, POS tagging, web text processing, and keyword extraction.
| File | Description |
|---|---|
| nlp_assignment.ipynb | Main Jupyter Notebook with all questions (Q8-Q12) |
| nlp_assignment.py | Standalone Python script (same content, runnable without Jupyter) |
| test_nlp_assignment.py | Test suite with 66 rigorous tests covering all NLP operations |
| output/ | Auto-generated visualization charts (PNG) |
- Applies Porter Stemmer and WordNet Lemmatizer to: `["played", "drown", "buses", "students", "agreed", "going", "sad"]`
- Includes comparison table (Porter Stemmer vs WordNet verb/noun vs spaCy)
- Bar chart visualization comparing word lengths
- Five custom sentences analyzed with spaCy NER
- Entities identified: PERSON, ORG, GPE, LOC, DATE, MONEY, CARDINAL, ORDINAL
- Entity type distribution chart
- URL: Wikipedia - Natural Language Processing
- 10a) HTML cleaning pipeline (requests + BeautifulSoup) with ~83% size reduction
- 10b) POS tag extraction: nouns, adjectives, verbs with Top 15 frequency charts
- 10c) Sentence-level NER using NLTK sentence tokenizer + spaCy
- 10d) Noun co-occurrence analysis -- top pair: "language" + "processing"
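
The adjacent noun co-occurrence count from 10d can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual implementation: `count_adjacent_noun_pairs` is a hypothetical helper, and the hand-tagged tokens stand in for the (word, tag) pairs that a POS tagger such as spaCy would produce.

```python
from collections import Counter


def count_adjacent_noun_pairs(tagged_tokens):
    """Count pairs of nouns that sit directly next to each other.

    tagged_tokens: list of (word, universal POS tag) tuples.
    Hypothetical helper, written only for this illustration.
    """
    pairs = Counter()
    # Walk over consecutive token pairs and keep those where both are nouns.
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "NOUN" and t2 == "NOUN":
            pairs[(w1.lower(), w2.lower())] += 1
    return pairs


# Hand-tagged sample (an assumption for illustration, not real tagger output).
sample = [
    ("Language", "NOUN"), ("processing", "NOUN"), ("is", "AUX"),
    ("hard", "ADJ"), ("but", "CCONJ"), ("language", "NOUN"),
    ("processing", "NOUN"), ("tools", "NOUN"), ("help", "VERB"),
]

pair_counts = count_adjacent_noun_pairs(sample)
print(pair_counts.most_common(2))
```

Lowercasing both words before counting lets "Language processing" at the start of a sentence and "language processing" mid-sentence contribute to the same pair.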
- H-rank: TF-IDF based keyword ranking on 4 English URLs
- D-rank: RAKE (Rapid Automatic Keyword Extraction) on 4 English + 1 French URL
- URL component parsing (scheme, host, path, params, query, fragment)
- Additional extracted features: word count, sentence count, named entities, POS distribution, lexical diversity
- 10 keywords via H-rank and D-rank for each URL
- Summary comparison table and cross-URL bar charts
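
To make the idea behind TF-IDF keyword ranking concrete, here is a pure-Python sketch. It is deliberately not the scikit-learn pipeline listed in the requirements, and its weighting (raw term frequency times `log(N / df)`) differs from scikit-learn's smoothed, normalized variant. The function names, the tiny stop word set, and the three sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "on", "from"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_keywords(docs, doc_index, top_n=10):
    """Rank the terms of docs[doc_index] by a basic TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    if total == 0:
        return []

    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Sample documents standing in for cleaned page text (illustrative only).
docs = [
    "Language models process natural language text.",
    "Deep learning models learn from data.",
    "Data science is the study of data.",
]
keywords = tfidf_keywords(docs, doc_index=0, top_n=3)
print(keywords)
```

A term that appears often in one document but rarely in the others scores highest, which is why "language" outranks "models" for the first sample document.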
| # | URL | Language |
|---|---|---|
| 1 | https://en.wikipedia.org/wiki/Natural_language_processing | English |
| 2 | https://en.wikipedia.org/wiki/Artificial_intelligence | English |
| 3 | https://en.wikipedia.org/wiki/Deep_learning | English |
| 4 | https://en.wikipedia.org/wiki/Data_science | English |
| 5 | https://fr.wikipedia.org/wiki/Intelligence_artificielle | French |
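
The URL component parsing step (scheme, host, path, params, query, fragment) can be sketched with Python's standard `urllib.parse` module. The helper name `url_components` is hypothetical, and this may differ from how the notebook does it.

```python
from urllib.parse import urlparse


def url_components(url):
    """Split a URL into the six components named above (hypothetical helper)."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "params": parts.params,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = url_components(
    "https://en.wikipedia.org/wiki/Natural_language_processing"
)
for name, value in components.items():
    print(f"{name:>8}: {value!r}")
```

For the Wikipedia URLs in the table, only the scheme, host, and path are populated; the params, query, and fragment come back as empty strings.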
- python >= 3.10
- nltk
- spacy (+ en_core_web_sm model)
- scikit-learn
- beautifulsoup4
- requests
- rake-nltk
- matplotlib
- numpy
- pandas
To open the notebook, run `jupyter notebook nlp_assignment.ipynb`, then run all cells with Kernel > Restart & Run All.

To run the standalone script, use `python nlp_assignment.py`.

To run the test suite, use `pytest test_nlp_assignment.py -v`.

66 tests across 10 test classes:

| Test Class | Tests | What It Validates |
|---|---|---|
| TestStemming | 8 | Porter stemmer outputs, determinism, type checks |
| TestLemmatization | 10 | Verb/noun POS lemmatization, spaCy vs NLTK |
| TestNER | 13 | Entity detection per sentence, edge cases |
| TestTextCleaning | 8 | HTML tag removal, unicode handling, empty input |
| TestPOSTagging | 4 | Noun/verb/adj detection, universal POS tags |
| TestNounPairs | 2 | Adjacent noun co-occurrence counting |
| TestHRankKeywords | 7 | TF-IDF output format, scores, stop word exclusion |
| TestDRankKeywords | 6 | RAKE output validation, score ordering |
| TestSentenceTokenization | 4 | Edge cases, abbreviation handling |
| TestIntegration | 4 | End-to-end pipelines combining multiple steps |