Information Retrieval System

A complete information retrieval system implemented in Python that includes web crawling, document indexing, and search functionality. The system processes documents, builds an inverted index with TF-IDF weighting, and provides ranked search results using cosine similarity.

System Overview

This information retrieval system consists of three main components:

  1. Web Crawler (webcrawler.py) - Downloads and processes web pages
  2. Indexer (indexer.py) - Builds an inverted index from document corpus
  3. Search Engine (search_engine.py) - Performs ranked search queries

Prerequisites

  • porter_stemmer.py (Porter stemming algorithm)
  • levenshtein_distance.py (spell correction utility - see below)
  • Install dependencies:
pip install beautifulsoup4

Usage Instructions

Step 1: Choose Your Corpus

Option A: CACM Corpus (Recommended for Testing)

The CACM corpus contains 570 HTML documents from the Communications of the ACM journal and is well suited for testing the information retrieval system. Extract it from its zip file before use.

Edit indexer.py and set the corpus folder:

corpus_folder = "./cacm"  # Use CACM corpus

Option B: Web Crawled Corpus

Edit webcrawler.py and set parameters:

crawl_limit = 10                     # Number of pages to crawl
start_url = "https://example.com"    # Starting URL
output_dir = "crawled_corpus"        # Output directory
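Conceptually, a crawl driven by these parameters is a breadth-first loop: fetch a page, extract its links, and enqueue them until the limit is reached. A minimal sketch of that idea, assuming beautifulsoup4 is installed (this is illustrative, not the actual webcrawler.py, which also writes pages into output_dir and does further processing; the function names are hypothetical):

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_links(html, base_url):
    """Collect absolute URLs from every <a href> on a page."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(start_url, crawl_limit,
          fetch=lambda url: urlopen(url).read().decode(errors="replace")):
    """Breadth-first crawl of up to crawl_limit pages; returns {url: html}."""
    pages, frontier = {}, deque([start_url])
    while frontier and len(pages) < crawl_limit:
        url = frontier.popleft()
        if url in pages:        # skip pages we already fetched
            continue
        html = fetch(url)
        pages[url] = html
        frontier.extend(extract_links(html, url))
    return pages
```

The injectable `fetch` argument is a sketch convenience: it lets the loop be exercised without network access.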

Then crawl web pages:

python webcrawler.py

Edit indexer.py and set the corpus folder:

corpus_folder = "./crawled_corpus"  # Use crawled corpus

Step 2: Build the Inverted Index

python indexer.py
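At its core, this step turns the corpus into a term → postings mapping with TF-IDF weights. A simplified sketch of that computation (the real indexer.py additionally applies stop-word filtering, Porter stemming, and SQLite storage; `tokenize` and `build_index` are illustrative names, not the script's actual functions):

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index term -> {doc_id: tfidf}."""
    tf = {d: Counter(tokenize(text)) for d, text in docs.items()}
    df = Counter()                      # document frequency per term
    for counts in tf.values():
        df.update(counts.keys())
    n_docs = len(docs)
    index = defaultdict(dict)
    for d, counts in tf.items():
        for term, freq in counts.items():
            # log-scaled term frequency times inverse document frequency
            index[term][d] = (1 + math.log10(freq)) * math.log10(n_docs / df[term])
    return index
```

Note that with a plain log IDF, a term appearing in every document gets weight 0, which is one common convention among several TF-IDF variants.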

Step 3: Run Search Queries

python search_engine.py
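Ranked retrieval as described above scores candidate documents by cosine similarity between the query vector and each document vector. A hedged sketch of that scoring loop over an in-memory inverted index (`cosine_rank` is a hypothetical name; the actual search_engine.py reads its postings from SQLite):

```python
import math
from collections import Counter


def cosine_rank(query_terms, index, k=5):
    """index: term -> {doc_id: tfidf}. Returns the top-k (doc_id, score) pairs."""
    q = Counter(query_terms)
    scores = Counter()
    for term, q_freq in q.items():                  # accumulate dot products
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += q_freq * weight
    norms = Counter()                               # document vector lengths
    for postings in index.values():
        for doc_id, weight in postings.items():
            norms[doc_id] += weight * weight
    ranked = [(d, s / math.sqrt(norms[d])) for d, s in scores.items()]
    return sorted(ranked, key=lambda pair: -pair[1])[:k]
```

Dividing by the query vector's length is omitted because it is the same for every document and does not change the ranking.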

Levenshtein Distance Spell Correction Utility

levenshtein_distance.py - A standalone utility demonstrating how to calculate the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another using Levenshtein distance and dynamic programming.

Install dependencies:

pip install numpy
pip install pandas

Integration with Information Retrieval System

While this utility is separate from the main information retrieval system, it can be integrated for:

  1. Query Correction: Correct misspelled search terms before processing
  2. Document Preprocessing: Fix typos in documents before indexing
  3. Interactive Search: Suggest corrections for user queries
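As an illustration of item 1, query correction could snap each out-of-vocabulary query term to its nearest indexed term. The sketch below is an assumption about how such an integration might look; `correct_query` and the distance threshold are hypothetical and not part of the repository:

```python
def levenshtein(a, b):
    """Row-by-row edit-distance DP (same recurrence as the standalone utility)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def correct_query(terms, vocabulary, max_distance=2):
    """Replace each unknown query term with its nearest vocabulary term, if close enough."""
    corrected = []
    for term in terms:
        if term in vocabulary:
            corrected.append(term)          # already indexed, leave as-is
            continue
        best = min(vocabulary, key=lambda v: levenshtein(term, v))
        corrected.append(best if levenshtein(term, best) <= max_distance else term)
    return corrected
```

The threshold guards against "correcting" a term to a vocabulary word that is not actually related; scanning the whole vocabulary per term is fine for small indexes but would need pruning at scale.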

Database Schema

The system creates a SQLite database with three main tables:

DocumentDictionary

  • DocumentName (text): Path to the document
  • docID (int): Unique document identifier

TermDictionary

  • Term (text): Stemmed term
  • termID (int): Unique term identifier

Postings

  • termID (int): Term identifier
  • docID (int): Document identifier
  • tfidf (real): TF-IDF weight
  • docfreq (int): Document frequency
  • termfreq (int): Term frequency in document
  • colfreq (int): Collection frequency
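The schema above maps directly onto SQLite DDL. A sketch of creating it with Python's built-in sqlite3 module (column constraints such as the primary keys are assumptions; the actual CREATE TABLE statements live in indexer.py):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS DocumentDictionary (
    DocumentName TEXT,
    docID        INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS TermDictionary (
    Term   TEXT,
    termID INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS Postings (
    termID   INTEGER,
    docID    INTEGER,
    tfidf    REAL,
    docfreq  INTEGER,
    termfreq INTEGER,
    colfreq  INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # the real system writes to a database file
conn.executescript(SCHEMA)
```

`executescript` runs all three statements in one call, which is convenient for multi-statement DDL.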

Performance Statistics

The system tracks and displays various statistics:

Indexing Statistics

  • Total documents processed
  • Total tokens processed
  • Unique terms after filtering
  • Stop words filtered out
  • Processing time

Search Statistics

  • Number of candidate documents
  • Top results returned
  • Processing time

Academic Context

The repository implements core information retrieval concepts, including:

  • Tokenization, normalization, and filtering
  • Porter stemming
  • Inverted index construction
  • Query processing and conjunctive search
  • TF-IDF weighting and the vector space model
  • Cosine similarity ranking

License

This project is part of coursework and is intended for educational purposes.
