Builds a positional inverted index from text files in a given folder and supports phrase search queries. The index maps terms to document positions for efficient searching and retrieval.

IsmaelMousa/positional-indexing-wiki

Positional Inverted Index

C++ program to build and query a positional inverted index from text files.

Objective

Build a positional inverted index from the text files in a given folder and support phrase search queries. The index maps each term to the documents and positions where it occurs, which makes search and retrieval fast. The code is written in C++, skips common stop words while processing the text files, and supports exact phrase matching.

Methodology

The program normalizes each word by converting it to lowercase and removing non-alphanumeric characters. It then processes the documents one by one, records the position of every term occurrence in the index, and calculates the document frequency of each term (the number of documents that contain it).
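
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.cpp: the names `PositionalIndex`, `normalize`, `add_document`, and `document_frequency` are hypothetical, tokens are assumed to be separated by whitespace, and positions are assumed to count only the terms that survive stop-word removal.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical layout: term -> (document id -> ascending positions in that document).
using PositionalIndex = std::map<std::string, std::map<int, std::vector<int>>>;

// Lowercase a token and drop every non-alphanumeric character.
std::string normalize(const std::string& token) {
    std::string out;
    for (unsigned char ch : token) {
        if (std::isalnum(ch)) {
            out.push_back(static_cast<char>(std::tolower(ch)));
        }
    }
    return out;
}

// Tokenize on whitespace, normalize, skip stop words, and record term positions.
// Assumption: skipped tokens do not advance the position counter.
void add_document(PositionalIndex& index, int doc_id, const std::string& text,
                  const std::set<std::string>& stop_words) {
    std::istringstream stream(text);
    std::string token;
    int position = 0;
    while (stream >> token) {
        const std::string term = normalize(token);
        if (term.empty() || stop_words.count(term) > 0) {
            continue;
        }
        index[term][doc_id].push_back(position);
        ++position;
    }
}

// Document frequency: the number of distinct documents that contain the term.
std::size_t document_frequency(const PositionalIndex& index, const std::string& term) {
    const auto it = index.find(term);
    return it == index.end() ? 0 : it->second.size();
}
```

Under this layout the document frequency falls out of the size of a term's inner map, so no separate counter has to be kept in sync while indexing.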

Performance

Indexing about 5000 documents takes around 5 minutes on a standard CPU setup. Note that this implementation builds the positional index directly and does not use index-construction algorithms such as SPIMI or BSBI.

Documents

About 5000 raw Wikipedia articles are used as input for indexing and preprocessing. These documents comprise various forms of unprocessed data, including articles, metadata, and user interactions, which gives a diverse dataset for testing and research. The corpus is well suited to practicing data cleaning, preprocessing, and natural language processing (NLP) experiments.

URL: https://www.kaggle.com/datasets/ismaeldwikat/raw-wikipedia-8000-articles

Usage

Clone the repo, open a terminal, and run these commands:

  1. open:
     cd positional-indexing-wiki
  2. compile:
     g++ main.cpp -o main
  3. run:
     ./main
  4. documents (when prompted, enter the folder that holds the text files):
     Enter the path to the documents folder: documents
  5. query (when prompted, enter the phrase to search for):
     Enter phrase query: what is artificial intelligence?
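
To show how a phrase query like the one above might be resolved against a positional index, here is a hedged sketch. It is not taken from main.cpp: `PositionalIndex` and `phrase_search` are hypothetical names, the query is assumed to be already normalized and stripped of stop words, and each position list is assumed to be sorted in ascending order.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout: term -> (document id -> ascending positions in that document).
using PositionalIndex = std::map<std::string, std::map<int, std::vector<int>>>;

// Return the ids of documents where the terms occur at consecutive positions,
// i.e. terms[k] appears at position start + k for some start of terms[0].
std::vector<int> phrase_search(const PositionalIndex& index,
                               const std::vector<std::string>& terms) {
    std::vector<int> hits;
    if (terms.empty()) return hits;

    const auto first = index.find(terms[0]);
    if (first == index.end()) return hits;

    for (const auto& posting : first->second) {
        const int doc_id = posting.first;
        bool doc_matches = false;
        for (int start : posting.second) {
            bool all_follow = true;
            for (std::size_t k = 1; k < terms.size() && all_follow; ++k) {
                const auto term_it = index.find(terms[k]);
                if (term_it == index.end()) { all_follow = false; break; }
                const auto doc_it = term_it->second.find(doc_id);
                // Binary search relies on the position list being sorted.
                all_follow = doc_it != term_it->second.end() &&
                             std::binary_search(doc_it->second.begin(),
                                                doc_it->second.end(),
                                                start + static_cast<int>(k));
            }
            if (all_follow) { doc_matches = true; break; }
        }
        if (doc_matches) hits.push_back(doc_id);
    }
    return hits;
}
```

With stop words removed, the example query would reduce to the terms "artificial" and "intelligence", and a document matches only if "intelligence" sits exactly one position after some occurrence of "artificial".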
