alexanderthenth/NLP

NLP Assignment -- Parts 2 & 3

Overview

This project covers Questions 8-12 of the NLP assignment, implementing core Natural Language Processing techniques including stemming, lemmatization, named entity recognition, POS tagging, web text processing, and keyword extraction.

Files

| File | Description |
|------|-------------|
| nlp_assignment.ipynb | Main Jupyter Notebook with all questions (Q8-Q12) |
| nlp_assignment.py | Standalone Python script (same content, runnable without Jupyter) |
| test_nlp_assignment.py | Test suite with 66 tests covering all NLP operations |
| output/ | Auto-generated visualization charts (PNG) |

Questions Covered

Q8 -- Stemming and Lemmatization

  • Applies Porter Stemmer and WordNet Lemmatizer to: ["played", "drown", "buses", "students", "agreed", "going", "sad"]
  • Includes comparison table (Porter Stemmer vs WordNet verb/noun vs spaCy)
  • Bar chart visualization comparing word lengths
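
As a rough illustration of the stemming half of Q8, the sketch below runs NLTK's `PorterStemmer` over the word list above. It is not taken from the notebook; the variable names are assumptions. The lemmatization half is left out because `WordNetLemmatizer` also needs the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer

# Word list from Q8.
words = ["played", "drown", "buses", "students", "agreed", "going", "sad"]

# The Porter stemmer is rule-based and needs no corpus download.
stemmer = PorterStemmer()

# Map each word to its stem (hypothetical variable name, not from the notebook).
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:10} -> {stem}")
```

Stems are not guaranteed to be dictionary words, which is one reason the comparison table sets them next to lemmatizer output.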

Q9 -- Named Entity Recognition (NER)

  • Five custom sentences analyzed with spaCy NER
  • Entities identified: PERSON, ORG, GPE, LOC, DATE, MONEY, CARDINAL, ORDINAL
  • Entity type distribution chart

Q10 -- English URL Web Page Analysis

  • URL: Wikipedia - Natural Language Processing
  • 10a) HTML cleaning pipeline (requests + BeautifulSoup) with ~83% size reduction
  • 10b) POS tag extraction: nouns, adjectives, verbs with Top 15 frequency charts
  • 10c) Sentence-level NER using NLTK sentence tokenizer + spaCy
  • 10d) Noun co-occurrence analysis -- top pair: "language" + "processing"
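
The noun co-occurrence step in 10d can be sketched as counting adjacent noun pairs in a POS-tagged token stream. This is a minimal sketch, not the project's code: it assumes tokens arrive as `(word, tag)` tuples with the universal tag `"NOUN"`, and the function name `adjacent_noun_pairs` is hypothetical.

```python
from collections import Counter

def adjacent_noun_pairs(tagged_tokens):
    """Count pairs of directly adjacent nouns in a list of (word, tag) tuples."""
    pairs = Counter()
    # Walk over consecutive token pairs and keep those where both are nouns.
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "NOUN" and t2 == "NOUN":
            pairs[(w1.lower(), w2.lower())] += 1
    return pairs

# Hand-tagged toy input (assumed tags, for illustration only).
tagged = [
    ("Natural", "ADJ"), ("language", "NOUN"), ("processing", "NOUN"),
    ("uses", "VERB"), ("language", "NOUN"), ("processing", "NOUN"),
    ("models", "NOUN"),
]
pair_counts = adjacent_noun_pairs(tagged)
print(pair_counts.most_common(2))
```

In the real pipeline the tags would come from the POS tagger used in 10b rather than being written by hand.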

Q11 -- Keyword Extraction (H-rank and D-rank)

  • H-rank: TF-IDF based keyword ranking on 4 English URLs
  • D-rank: RAKE (Rapid Automatic Keyword Extraction) on 4 English + 1 French URL
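
To make the TF-IDF idea behind H-rank concrete, here is a pure-Python sketch that ranks the terms of one document against a small corpus. It is an assumption-laden stand-in, not the project's implementation (which lists scikit-learn as a requirement): the stop-word list, `tokenize`, and `tfidf_keywords` are all hypothetical, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is chosen to mirror scikit-learn's default.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real run would use a full list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "on", "with"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tfidf_keywords(docs, doc_index, top_n=10):
    """Return the top_n (term, score) pairs for docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values()) or 1
    scores = {
        term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

docs = [
    "Natural language processing studies language and text",
    "Deep learning trains neural networks",
    "Data science studies data",
]
top_terms = tfidf_keywords(docs, 0, top_n=3)
print(top_terms)
```

Terms that are frequent in one page but rare across the others score highest, which is why each URL is scored against the other pages rather than in isolation.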

Q12 -- Full Keyword Analysis on 5+ URLs

  • URL component parsing (scheme, host, path, params, query, fragment)
  • Additional extracted features: word count, sentence count, named entities, POS distribution, lexical diversity
  • 10 keywords via H-rank and D-rank for each URL
  • Summary comparison table and cross-URL bar charts
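
The URL component parsing and the lexical diversity feature listed above can be sketched with the standard library alone. This is an illustrative sketch under assumptions: the helper names `url_components` and `lexical_diversity` are hypothetical, and lexical diversity is taken here to be the type-token ratio (unique tokens divided by total tokens), which may differ from the notebook's exact definition.

```python
import re
from urllib.parse import urlparse

def url_components(url):
    """Split a URL into the six components listed for Q12."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "path": parts.path,
        "params": parts.params,
        "query": parts.query,
        "fragment": parts.fragment,
    }

def lexical_diversity(text):
    """Type-token ratio: unique tokens / total tokens (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

info = url_components("https://en.wikipedia.org/wiki/Natural_language_processing")
for name, value in info.items():
    print(f"{name:9} {value!r}")

print(lexical_diversity("the cat saw the dog"))
```

For the Wikipedia URLs analyzed here, `params`, `query`, and `fragment` come back as empty strings, since the URLs carry none of those parts.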

URLs Analyzed

| # | URL | Language |
|---|-----|----------|
| 1 | https://en.wikipedia.org/wiki/Natural_language_processing | English |
| 2 | https://en.wikipedia.org/wiki/Artificial_intelligence | English |
| 3 | https://en.wikipedia.org/wiki/Deep_learning | English |
| 4 | https://en.wikipedia.org/wiki/Data_science | English |
| 5 | https://fr.wikipedia.org/wiki/Intelligence_artificielle | French |

Requirements

python >= 3.10
nltk
spacy (+ en_core_web_sm model)
scikit-learn
beautifulsoup4
requests
rake-nltk
matplotlib
numpy
pandas

How to Run

Jupyter Notebook (recommended)

jupyter notebook nlp_assignment.ipynb

Then run all cells with Kernel > Restart & Run All.

Python Script

python nlp_assignment.py

Tests

pytest test_nlp_assignment.py -v

Test Coverage

66 tests across 10 test classes:

| Test Class | Tests | What It Validates |
|------------|-------|-------------------|
| TestStemming | 8 | Porter stemmer outputs, determinism, type checks |
| TestLemmatization | 10 | Verb/noun POS lemmatization, spaCy vs NLTK |
| TestNER | 13 | Entity detection per sentence, edge cases |
| TestTextCleaning | 8 | HTML tag removal, unicode handling, empty input |
| TestPOSTagging | 4 | Noun/verb/adj detection, universal POS tags |
| TestNounPairs | 2 | Adjacent noun co-occurrence counting |
| TestHRankKeywords | 7 | TF-IDF output format, scores, stop word exclusion |
| TestDRankKeywords | 6 | RAKE output validation, score ordering |
| TestSentenceTokenization | 4 | Edge cases, abbreviation handling |
| TestIntegration | 4 | End-to-end pipelines combining multiple steps |
