Positional Inverted Index

C++ program to build and query a positional inverted index from text files.

Objective

This project builds a positional inverted index from the text files in a given folder and supports phrase search queries. The index maps each term to the documents and positions where it occurs, which allows fast search and retrieval. The code is written in C++. It processes text files, excludes common stop words, and supports exact phrase matching.

Methodology

The program normalizes words by converting them to lowercase and removing non-alphanumeric characters. It then processes each document, records every term's positions in the index, and calculates the document frequency of each term.
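
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.cpp: the names PositionalIndex, normalize, indexDocument, and documentFrequency are hypothetical, tokens are assumed to be split on whitespace, and stop words are assumed to be skipped without consuming a position.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical layout: term -> (document id -> positions of the term).
using PositionalIndex = std::map<std::string, std::map<int, std::vector<int>>>;

// Lowercase a token and drop every non-alphanumeric character.
std::string normalize(const std::string& token) {
    std::string out;
    for (unsigned char ch : token) {
        if (std::isalnum(ch)) out.push_back(static_cast<char>(std::tolower(ch)));
    }
    return out;
}

// Add one document's terms to the index.
// Assumption: empty tokens and stop words are skipped and do not advance
// the position counter, so positions count kept terms only.
void indexDocument(PositionalIndex& index, int docId, const std::string& text,
                   const std::set<std::string>& stopWords) {
    std::istringstream stream(text);
    std::string token;
    int position = 0;
    while (stream >> token) {
        const std::string term = normalize(token);
        if (term.empty() || stopWords.count(term) > 0) continue;
        index[term][docId].push_back(position++);
    }
}

// Document frequency: the number of documents that contain the term.
std::size_t documentFrequency(const PositionalIndex& index, const std::string& term) {
    const auto it = index.find(term);
    return it == index.end() ? 0 : it->second.size();
}
```

With this layout the document frequency is simply the size of a term's inner map, so it does not need a separate counter.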

Performance

Indexing about 5000 documents takes around 5 minutes on a standard CPU setup. Note that this implementation builds the positional index directly, without block-based index construction algorithms such as SPIMI or BSBI.

Documents

About 5000 raw Wikipedia articles are used as input for indexing and preprocessing. These documents comprise various forms of unprocessed data, including articles, metadata, and user interactions, providing a diverse dataset for testing and research. This corpus is well suited to practicing data cleaning, preprocessing, and natural language processing (NLP) experiments.

URL: https://www.kaggle.com/datasets/ismaeldwikat/raw-wikipedia-8000-articles

Usage

Clone the repo, open a terminal, and run these commands:

open:

cd positional-indexing-wiki

compile:

g++ main.cpp -o main

run:

./main

documents:

Enter the path to the documents folder: documents

query:

Enter phrase query: what is artificial intelligence?
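
A phrase query like the one above could be evaluated against a positional index along these lines. This is a sketch under stated assumptions rather than the repository's implementation: phraseSearch and PositionalIndex are hypothetical names, the query terms are assumed to be already normalized with stop words removed, and each position list is assumed to be sorted in ascending order.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout: term -> (document id -> sorted positions of the term).
using PositionalIndex = std::map<std::string, std::map<int, std::vector<int>>>;

// Return the ids of documents in which `terms` occur at consecutive positions.
std::vector<int> phraseSearch(const PositionalIndex& index,
                              const std::vector<std::string>& terms) {
    std::vector<int> hits;
    if (terms.empty()) return hits;

    const auto first = index.find(terms[0]);
    if (first == index.end()) return hits;

    // Try every occurrence of the first term as a possible phrase start.
    for (const auto& posting : first->second) {
        const int docId = posting.first;
        for (int start : posting.second) {
            bool matched = true;
            for (std::size_t k = 1; k < terms.size() && matched; ++k) {
                const auto termIt = index.find(terms[k]);
                if (termIt == index.end()) { matched = false; break; }
                const auto docIt = termIt->second.find(docId);
                if (docIt == termIt->second.end()) { matched = false; break; }
                // The k-th term must sit exactly k positions after the start.
                const std::vector<int>& positions = docIt->second;
                matched = std::binary_search(positions.begin(), positions.end(),
                                             start + static_cast<int>(k));
            }
            if (matched) { hits.push_back(docId); break; }  // one hit per document
        }
    }
    return hits;
}
```

The binary search relies on the sorted-positions assumption; if positions were stored in another order, a linear scan with std::find would be the safer choice.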

Repository: IsmaelMousa/positional-indexing-wiki