C++ program to build and query a positional inverted index from text files.
This program builds a positional inverted index from the text files in a given folder and supports phrase search queries. The index maps each term to the documents and positions where it occurs, which makes search and retrieval fast. The code is written in C++. It processes text files, excludes common stop words, and supports exact phrase matching.
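As a rough illustration of exact phrase matching over such an index, here is a minimal sketch. The type aliases and the `phraseSearch` function are hypothetical names chosen for this example, not taken from `main.cpp`. It assumes the query has already been normalized and stripped of stop words, and that each position list is sorted in ascending order.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout: term -> (document id -> sorted token positions).
using Postings = std::map<int, std::vector<int>>;
using PositionalIndex = std::map<std::string, Postings>;

// Return the ids of documents in which the terms occur at consecutive
// positions, i.e. terms[k] is found at position start + k for some start.
std::vector<int> phraseSearch(const PositionalIndex& index,
                              const std::vector<std::string>& terms) {
    std::vector<int> hits;
    if (terms.empty()) return hits;

    auto first = index.find(terms[0]);
    if (first == index.end()) return hits;

    for (const auto& entry : first->second) {
        const int docId = entry.first;
        for (int start : entry.second) {
            bool match = true;
            for (std::size_t k = 1; k < terms.size() && match; ++k) {
                auto term = index.find(terms[k]);
                if (term == index.end()) { match = false; break; }
                auto doc = term->second.find(docId);
                // binary_search relies on the position list being sorted.
                match = doc != term->second.end() &&
                        std::binary_search(doc->second.begin(), doc->second.end(),
                                           start + static_cast<int>(k));
            }
            if (match) { hits.push_back(docId); break; }  // one hit per document
        }
    }
    return hits;
}
```

Documents are visited in ascending id order because `std::map` keeps its keys sorted, so the result list is sorted as well. A single-term query simply returns every document that contains the term.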
The program normalizes words by converting them to lowercase and removing non-alphanumeric characters. It processes each document, updates the index with term positions, and calculates the document frequency of each term.
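The normalization and indexing steps described above could look roughly like the following sketch. The names `normalize`, `addDocument`, and `documentFrequency` are hypothetical and not taken from `main.cpp`. The sketch also assumes that stop words are dropped before positions are assigned, so positions count only the terms that are kept; the actual program may number positions differently.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical layout: term -> (document id -> token positions).
using Positions = std::vector<int>;
using Postings = std::map<int, Positions>;
using PositionalIndex = std::map<std::string, Postings>;

// Lowercase a token and drop every non-alphanumeric character.
std::string normalize(const std::string& word) {
    std::string out;
    for (unsigned char c : word) {
        if (std::isalnum(c)) out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Add one document's text to the index.
// Assumption: stop words and empty tokens do not consume a position.
void addDocument(PositionalIndex& index, int docId, const std::string& text,
                 const std::unordered_set<std::string>& stopWords) {
    std::istringstream in(text);
    std::string raw;
    int position = 0;
    while (in >> raw) {
        const std::string term = normalize(raw);
        if (term.empty() || stopWords.count(term) > 0) continue;
        index[term][docId].push_back(position++);
    }
}

// Document frequency: the number of distinct documents containing the term.
std::size_t documentFrequency(const PositionalIndex& index, const std::string& term) {
    auto it = index.find(term);
    return it == index.end() ? 0 : it->second.size();
}
```

Because positions are appended while scanning left to right, each position list ends up sorted, and the document frequency falls out of the postings map size without a separate counter.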
Indexing about 5000 documents takes around 5 minutes on a standard CPU setup. Note that this implementation builds the positional index directly and does not use block-based index construction algorithms such as SPIMI (single-pass in-memory indexing) or BSBI (blocked sort-based indexing).
About 5000 raw Wikipedia articles are used as input for indexing and preprocessing. These documents comprise various forms of unprocessed data, including articles, metadata, and user interactions, providing a diverse dataset for testing and research. The corpus is well suited to practicing data cleaning, preprocessing, and natural language processing (NLP) experiments.
URL: https://www.kaggle.com/datasets/ismaeldwikat/raw-wikipedia-8000-articles
Clone the repo, open a terminal, and run these commands:
- open:
cd positional-indexing-wiki
- compile:
g++ main.cpp -o main
- run:
./main
- documents:
Enter the path to the documents folder: documents
- query:
Enter phrase query: what is artificial intelligence?