Information Retrieval System

A complete information retrieval system implemented in Python that includes web crawling, document indexing, and search functionality. The system processes documents, builds an inverted index with TF-IDF weighting, and provides ranked search results using cosine similarity.

System Overview

This information retrieval system consists of three main components:

Web Crawler (webcrawler.py) - Downloads and processes web pages
Indexer (indexer.py) - Builds an inverted index from document corpus
Search Engine (search_engine.py) - Performs ranked search queries

Prerequisites

porter_stemmer.py (Porter stemming algorithm)
levinshtein_distance.py (spell correction utility - see below)
Install dependencies:

pip install beautifulsoup4

Usage Instructions

Step 1: Choose Your Corpus

Option A: CACM Corpus (Recommended for Testing)

The CACM corpus contains 570 HTML documents from the Communications of the ACM journal and is well suited to testing the system. Extract it from the zip file (cacm.zip) before use.

Edit indexer.py and set the corpus folder:

corpus_folder = "./cacm"  # Use CACM corpus

Option B: Web Crawled Corpus

Edit webcrawler.py and set parameters:

crawl_limit = 10                     # Number of pages to crawl
start_url = "https://example.com"    # Starting URL
output_dir = "crawled_corpus"        # Output directory

Then crawl web pages:

python webcrawler.py

Edit indexer.py and set the corpus folder:

corpus_folder = "./crawled_corpus"  # Use crawled corpus
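
The crawl in Option B can be pictured as a breadth-first traversal that stops once crawl_limit pages have been collected. The following is a minimal sketch under stated assumptions, not the code in webcrawler.py: the names LinkExtractor, fetch, and crawl are hypothetical, and the standard library html.parser stands in for BeautifulSoup so the example runs without extra installs. Writing pages to output_dir is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, crawl_limit, fetch_page=fetch):
    """Breadth-first crawl; returns {url: html} for at most crawl_limit pages."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < crawl_limit:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Passing fetch_page as a parameter is only a convenience for trying the traversal on an in-memory set of pages without network access.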

Step 2: Build the Inverted Index

python indexer.py
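
As a rough illustration of what this step produces, the sketch below builds an inverted index with TF-IDF weights from a small in-memory corpus. It is written under assumptions and is not the code in indexer.py: the stop-word list is a tiny illustrative subset, the weight is taken to be tf * log(N / df), and stemming is omitted because the interface of porter_stemmer.py is not documented here. The names tokenize and build_index are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # illustrative subset


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(documents):
    """Map each term to {doc_id: tf-idf weight} for a {doc_id: text} corpus."""
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())

    n_docs = len(documents)
    index = defaultdict(dict)
    for doc_id, counts in term_freqs.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])  # assumed idf variant
            index[term][doc_id] = tf * idf
    return dict(index)
```

With this weighting, a term that occurs in every document gets weight 0, which is one reason stop words are filtered out before indexing.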

Step 3: Run Search Queries

python search_engine.py
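
The ranking performed at this step can be sketched as a conjunctive (AND) filter followed by cosine similarity between the query and each candidate document. This is a minimal sketch under assumptions, not the code in search_engine.py: the index is assumed to be a dictionary of the form {term: {doc_id: weight}}, the query vector uses raw term counts, and the names cosine and search are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query_terms, index, top_k=10):
    """Rank documents containing every query term by cosine similarity."""
    postings = [index.get(term, {}) for term in query_terms]
    if not postings:
        return []

    # Conjunctive step: keep documents that appear in every posting list.
    candidates = set(postings[0]).intersection(*postings[1:])

    # Rebuild each candidate's full vector from the index.
    doc_vectors = {doc_id: {} for doc_id in candidates}
    for term, docs in index.items():
        for doc_id, weight in docs.items():
            if doc_id in doc_vectors:
                doc_vectors[doc_id][term] = weight

    query_vector = dict(Counter(query_terms))  # assumed query weighting
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, reverse=True)[:top_k]
```

The function returns (score, doc_id) pairs, best first. A query term that is missing from the index yields an empty result, since no document can contain every term.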

Levenshtein Distance Spell Correction Utility

levinshtein_distance.py - A standalone utility that computes the Levenshtein distance between two strings, that is, the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other, using dynamic programming.

Install dependencies:

pip install numpy
pip install pandas
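
For reference, the dynamic-programming computation can be sketched in a few lines. This is an illustrative version that uses plain Python lists instead of numpy or pandas, and the function name levenshtein is an assumption rather than the utility's actual interface.

```python
def levenshtein(source, target):
    """Return the minimum number of single-character edits turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1

    # dist[i][j] holds the edit distance between source[:i] and target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[-1][-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The table has (len(source) + 1) by (len(target) + 1) cells, so both time and memory grow with the product of the two string lengths.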

Integration with Information Retrieval System

While this utility is separate from the main information retrieval system, it can be integrated for:

Query Correction: Correct misspelled search terms before processing
Document Preprocessing: Fix typos in documents before indexing
Interactive Search: Suggest corrections for user queries
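
One way the query-correction idea above could look in practice is to replace an unknown query term with the closest term in the index vocabulary. The sketch below is hypothetical and not part of the repository: the names edit_distance and suggest and the max_distance threshold of 2 are assumptions, and the distance is computed with a two-row variant of the dynamic-programming table to keep memory low.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rolling rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(term, vocabulary, max_distance=2):
    """Return the closest vocabulary term within max_distance, or the term itself."""
    if term in vocabulary:
        return term
    # Ties on distance are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (edit_distance(term, v), v), default=term)
    return best if edit_distance(term, best) <= max_distance else term


vocabulary = {"retrieval", "index", "search", "ranking"}
print(suggest("serch", vocabulary))  # search
```

Scanning the whole vocabulary is linear in the number of terms, which is acceptable for a small corpus; a larger vocabulary would likely need candidate filtering first.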

Database Schema

The system creates a SQLite database with three main tables:

DocumentDictionary

DocumentName (text): Path to the document
docID (int): Unique document identifier

TermDictionary

Term (text): Stemmed term
termID (int): Unique term identifier

Postings

termID (int): Term identifier
docID (int): Document identifier
tfidf (real): TF-IDF weight
docfreq (int): Document frequency
termfreq (int): Term frequency in document
colfreq (int): Collection frequency
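
The three tables above could be created and queried with the standard sqlite3 module roughly as follows. This is a sketch based only on the column names and types listed here; keys, constraints, and indexes used by the actual system are not documented, so none are declared, and the function names create_database and postings_for are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS DocumentDictionary (
    DocumentName TEXT,
    docID        INTEGER
);
CREATE TABLE IF NOT EXISTS TermDictionary (
    Term   TEXT,
    termID INTEGER
);
CREATE TABLE IF NOT EXISTS Postings (
    termID   INTEGER,
    docID    INTEGER,
    tfidf    REAL,
    docfreq  INTEGER,
    termfreq INTEGER,
    colfreq  INTEGER
);
"""


def create_database(path=":memory:"):
    """Open a SQLite database and create the three tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def postings_for(conn, term):
    """Return (DocumentName, tfidf) rows for a term, highest weight first."""
    query = """
        SELECT d.DocumentName, p.tfidf
        FROM Postings p
        JOIN TermDictionary t ON t.termID = p.termID
        JOIN DocumentDictionary d ON d.docID = p.docID
        WHERE t.Term = ?
        ORDER BY p.tfidf DESC
    """
    return conn.execute(query, (term,)).fetchall()
```

Joining Postings to the two dictionaries by termID and docID is what turns a stemmed query term into a ranked list of document paths.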

Performance Statistics

The system tracks and displays various statistics:

Indexing Statistics

Total documents processed
Total tokens processed
Unique terms after filtering
Stop words filtered out
Processing time

Search Statistics

Number of candidate documents
Top results returned
Processing time

Academic Context

The repository implements core information retrieval concepts, including:

Tokenization, Normalization, Filtering, Porter Stemming, Inverted Index, Query Processing, TF-IDF Weighting, Vector Space Model, Cosine Similarity, Conjunctive Search, Ranking.

License

This project is part of coursework and is intended for educational purposes.
