A complete information retrieval system implemented in Python that includes web crawling, document indexing, and search functionality. The system processes documents, builds an inverted index with TF-IDF weighting, and provides ranked search results using cosine similarity.
This information retrieval system consists of three main components:
- Web Crawler (webcrawler.py) - Downloads and processes web pages
- Indexer (indexer.py) - Builds an inverted index from the document corpus
- Search Engine (search_engine.py) - Performs ranked search queries

Supporting files:
- porter_stemmer.py (Porter stemming algorithm)
- levenshtein_distance.py (spell correction utility, see below)

Install dependencies:
pip install beautifulsoup4

The CACM corpus contains 570 HTML documents from the Communications of the ACM journal and is well suited for testing the information retrieval system. Extract it from the zip file before use.
Edit indexer.py and set the corpus folder:
corpus_folder = "./cacm" # Use CACM corpus

To index crawled pages instead, edit webcrawler.py and set the parameters:
crawl_limit = 10 # Number of pages to crawl
start_url = "https://example.com" # Starting URL
output_dir = "crawled_corpus" # Output directory

Then crawl web pages:
python webcrawler.py

Edit indexer.py and set the corpus folder:
corpus_folder = "./crawled_corpus" # Use crawled corpus

Then build the index and run the search engine:
python indexer.py
python search_engine.py

levenshtein_distance.py is a standalone utility that demonstrates how to calculate the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. It computes this Levenshtein distance with dynamic programming.
Install dependencies:
pip install numpy
pip install pandas

While this utility is separate from the main information retrieval system, it can be integrated for:
- Query Correction: Correct misspelled search terms before processing
- Document Preprocessing: Fix typos in documents before indexing
- Interactive Search: Suggest corrections for user queries
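
As a rough illustration of the query-correction use case above, the sketch below computes Levenshtein distance with a row-by-row dynamic programming table and picks the closest vocabulary term. It is not the code in levenshtein_distance.py: it uses plain Python lists rather than numpy or pandas, and the `suggest` helper and its `max_distance` parameter are hypothetical names introduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    # prev[j] is the distance between the previous prefix of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def suggest(term, vocabulary, max_distance=2):
    """Hypothetical helper: closest vocabulary term within max_distance, else None."""
    best, best_dist = None, max_distance + 1
    for candidate in vocabulary:
        dist = levenshtein(term, candidate)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best
```

A query term such as "retreival" could be passed through `suggest` with the indexed vocabulary before the search runs. Scanning the whole vocabulary is linear in its size, which is likely acceptable for a small corpus but would need pruning for a large one.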
The system creates a SQLite database with three main tables:

Document table:
- DocumentName (text): Path to the document
- docID (int): Unique document identifier

Term table:
- Term (text): Stemmed term
- termID (int): Unique term identifier

Posting table:
- termID (int): Term identifier
- docID (int): Document identifier
- tfidf (real): TF-IDF weight
- docfreq (int): Document frequency
- termfreq (int): Term frequency in document
- colfreq (int): Collection frequency
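
The three tables described above could be created with Python's built-in sqlite3 module roughly as follows. The column names and types come from the description; the table names (`documents`, `terms`, `postings`), the primary keys, and the helper functions are assumptions made for this sketch and may differ from what indexer.py actually creates.

```python
import sqlite3

# Table names and primary keys are placeholders; only the columns come from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    DocumentName TEXT,
    docID        INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS terms (
    Term   TEXT,
    termID INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS postings (
    termID   INTEGER,
    docID    INTEGER,
    tfidf    REAL,
    docfreq  INTEGER,
    termfreq INTEGER,
    colfreq  INTEGER
);
"""


def create_index_db(path=":memory:"):
    """Open (or create) the database and make sure the three tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def postings_for(conn, term):
    """Return (DocumentName, tfidf) rows for one stemmed term, highest weight first."""
    return conn.execute(
        "SELECT d.DocumentName, p.tfidf FROM terms t "
        "JOIN postings p ON p.termID = t.termID "
        "JOIN documents d ON d.docID = p.docID "
        "WHERE t.Term = ? ORDER BY p.tfidf DESC",
        (term,),
    ).fetchall()
```

Keeping terms and documents in their own tables means each posting row stores only two integer identifiers plus its weights, and a lookup for a term is a join from the term table through the postings to the document paths.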
The system tracks and displays various statistics.

During indexing:
- Total documents processed
- Total tokens processed
- Unique terms after filtering
- Stop words filtered out
- Processing time

During search:
- Number of candidate documents
- Top results returned
- Processing time
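
To show how the indexing statistics listed above might be gathered while the index is built, here is a minimal sketch. It assumes a simple regular-expression tokenizer and a tiny illustrative stop list, omits stemming, and counts stop words per occurrence; the function name `build_index` and the keys of the returned dictionary are hypothetical, not taken from indexer.py.

```python
import re
import time
from collections import Counter, defaultdict

# Tiny illustrative stop list; the project's actual list is not shown in this README.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def build_index(docs):
    """docs maps a document name to its text. Returns (inverted index, stats)."""
    start = time.perf_counter()
    index = defaultdict(Counter)  # term -> {document name: term frequency}
    total_tokens = 0
    stop_words_filtered = 0
    for name, text in docs.items():
        # Normalize to lowercase and split on anything that is not a letter or digit.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            total_tokens += 1
            if token in STOP_WORDS:
                stop_words_filtered += 1  # counted per occurrence (an assumption)
                continue
            # A stemmer such as the one in porter_stemmer.py would be applied here.
            index[token][name] += 1
    stats = {
        "documents": len(docs),
        "tokens": total_tokens,
        "unique_terms": len(index),
        "stop_words_filtered": stop_words_filtered,
        "seconds": time.perf_counter() - start,
    }
    return index, stats
```

Each entry of the index maps a term to the documents containing it together with the term frequency, which is the raw material for the TF-IDF weights.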
The repository implements core information retrieval concepts, including:
Tokenization, Normalization, Filtering, Porter Stemming, Inverted Index, Query Processing, TF-IDF Weighting, Vector Space Model, Cosine Similarity, Conjunctive Search, Ranking.
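
Several of these concepts (TF-IDF weighting, the vector space model, cosine similarity, conjunctive search, and ranking) fit together as in the sketch below. It assumes one common weighting variant, raw term frequency multiplied by ln(N / document frequency); the project may use a different formula, and the function names and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a document name to a list of already normalized terms."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    docfreq = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n / df) for term, df in docfreq.items()}
    vectors = {name: {t: tf * idf[t] for t, tf in Counter(terms).items()}
               for name, terms in docs.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_terms, vectors, idf, top_k=10):
    """Conjunctive search: a candidate document must contain every query term."""
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_terms).items()}
    candidates = [name for name, vec in vectors.items()
                  if all(t in vec for t in query_terms)]
    ranked = sorted(((cosine(query, vectors[name]), name) for name in candidates),
                    reverse=True)
    return [(name, score) for score, name in ranked[:top_k]]
```

For example, with three short documents, `search(["index", "search"], vectors, idf)` returns only the documents that contain both terms, ordered by their cosine similarity to the query vector.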
This project is part of coursework and is intended for educational purposes.