Run (all args in one):
python bsbi.py --debug --spimi
Args:
--debug(default:False): Compare all postings encodings and print index size + indexing time.--spimi(default:False): Use SPIMI indexing instead of BSBI.
Build mode (all build args in one):
python lsi_faiss.py build --collection collection --output-dir lsi_index --n-components 256 --min-df 2 --max-df 0.9 --index-type ivf --nlist 256 --hnsw-m 32 --ef-construction 200
Query mode (all query args in one):
python lsi_faiss.py query --output-dir lsi_index --text "lipid metabolism in pregnancy" --topk 10
Default command behavior:
- If no subcommand is provided, command defaults to
build.
Build args:
--collection(default:collection): Collection root directory.--output-dir(default:lsi_index): Output directory for saved LSI/FAISS artifacts.--n-components(default:256): Number of latent SVD dimensions.--min-df(default:2): Minimum document frequency threshold.--max-df(default:0.9): Maximum document frequency ratio.--index-type(default:ivf, choices:flat | ivf | hnsw): FAISS index type.--nlist(default:256): IVF cluster count upper bound.--hnsw-m(default:32): HNSW neighbor count.--ef-construction(default:200): HNSW construction effort.
Query args:
--output-dir(default:lsi_index): Directory containing built LSI/FAISS artifacts.--text(required): Query text.--topk(default:10): Number of top results to return.
Run (all args in one):
python search.py --spimi --patricia --adaptive-retrieval --adaptive-index-dir pt_index --lsi-faiss --lsi-output-dir lsi_index --topk 10
Disable toggles:
python search.py --no-adaptive-retrieval --no-lsi-faiss
Args:
--spimi(default:False): Use SPIMI index class instead of BSBI index class.--patricia(default:False): Enable Patricia Tree lookup for query terms.--adaptive-retrieval(default:True): Enable adaptive retrieval output.--no-adaptive-retrieval(default: not set): Disable adaptive retrieval output.--adaptive-index-dir(default:pt_index): Adaptive retrieval index artifact directory.--lsi-faiss(default:True): Enable LSI+FAISS retrieval output.--no-lsi-faiss(default: not set): Disable LSI+FAISS retrieval output.--lsi-output-dir(default:lsi_index): LSI+FAISS artifact directory.--topk(default:10): Top-k retrieval depth.
Run (all args in one):
python evaluation.py --qrels qrels.txt --queries queries.txt --k 10 --use-patricia --lsi-output-dir lsi_index --use-adaptive-retrieval --adaptive-index-dir pt_index
Args:
--qrels(default:qrels.txt): Path to qrels relevance file.--queries(default:queries.txt): Path to queries file.--k(default:10): Top-k documents per query.--use-patricia(default:False): Use Patricia Tree lookup in retrieval.--lsi-output-dir(default:lsi_index): Directory containing LSI+FAISS artifacts. Set empty to disable LSI evaluation.--use-adaptive-retrieval(default:False): Enable adaptive retrieval evaluation.--adaptive-index-dir(default:pt_index): Directory containing adaptive retrieval index artifacts.
Comparing 5 types of compression:
-
StandardPostings: Stores data as fixed-width 32-bit integers, offering the fastest access speeds but consuming the most storage space.
-
VBEPostings: Uses a variable number of bytes per integer with a continuation bit (no compression).
-
EliasGammaPostings: Employs a bit-level universal code to represent integers.
-
VBEPostingsEliasGammaTF: A hybrid approach that uses Variable Byte for large document IDs and Elias Gamma for small term frequencies.
-
EliasGammaPostingsVBETF: An inverse hybrid that applies bit-level encoding to document IDs and byte-level encoding to frequencies.
Codec Size (KB) Time (s)
--------------------------------------------------------------
StandardPostings 1557.372 2.281
VBEPostings 1024.439 2.543
EliasGammaPostings 1022.873 3.377
VBEPostingsEliasGammaTF 999.328 3.100
EliasGammaPostingsVBETF 1048.614 3.104
Conclusion:
- We picked VBEPostingsEliasGammaTF from this forth since it has smallest index file size (effective compression)
Comparing TF-IDF with BM25 over 30 queries
TF-IDF RBP = 0.5980
BM25 RBP = 0.6317
Conclusion:
- BM25 has bigger RBP score than TF-IDF
Evaluation results over 30 queries:
TF-IDF RBP = 0.6052
TF-IDF DCG = 5.4224
TF-IDF NDCG = 0.7904
TF-IDF AP = 0.5180
BM25 RBP = 0.6467
BM25 DCG = 5.6135
BM25 NDCG = 0.8145
BM25 AP = 0.5559
BM25 brute-force avg/query (s) = 0.0541
BM25 WAND avg/query (s) = 0.0523
Conclusion:
- BM25 using WAND top-K retrieval achieve faster average retrieval time taken than without WAND
Visualized [html/eval-comparison.html]:
- RBP (Rank-Biased Precision)
- DCG (Discounted Cumulative Gain)
- NDCG (Normalized DCG)
- AP (Average Precision)
Conclusions:
- Best Quality: LSI+FAISS is the most effective method across all ranking metrics (RBP, DCG, NDCG, and AP).
- Efficiency Leader: LSI+FAISS is significantly faster than the other methods, representing a ~9.3x speedup over traditional brute-force BM25.
python bsbi.py --spimi
BSBI VS SPIMI:
-
BSBI (Block-Sort-Based Indexing)
- Processes each block (directory) separately
- Creates one inverted index per block
- Fixed memory usage per block
- Merges all block indices at the end
-
SPIMI (Single-Pass In-Memory Indexing)
- Processes all documents sequentially
- Maintains a single in-memory dictionary
- Dynamically flushes to disk when memory threshold is exceeded
- More memory-efficient for varying document sizes
- Merges all flushed blocks at the end
python .\evaluation.py --use-patricia
Patricia Tree visualizations of the whole vocabulary
- Normal [html/patricia-tree.html]
- Mindmap [html/patricia-tree-mindmap.html]
- Folders [html/patricia-tree-folder.html]
Build latent semantic index (LSI) and FAISS vector index. Efficient SVD strategy for very large term-document matrices:
- Use sparse TF-IDF matrix (do not densify term-document matrix).
- Use randomized truncated SVD (not full SVD), with cost close to linear in non-zero entries.
- Keep latent dimension moderate (e.g., 128-512).
- Use FAISS approximate indexes (IVF or HNSW) for scalable top-k retrieval.
- For IVF, train on a sampled subset of vectors when corpus is very large.
This project now includes a Flask app in flask_search.py to provide a web frontend for interactive search.
What it provides:
- Web page at
/with query input, top-k input, and toggles for Patricia lookup, LSI+FAISS, and adaptive retrieval. - API endpoint at
/search(POST) that returns JSON search results. - Integrated retrieval backends:
- TF-IDF (BSBI/SPIMI)
- BM25 (BSBI/SPIMI)
- LSI+FAISS (optional)
- Adaptive retrieval (optional, with graceful warning if unavailable)
Install dependency:
pip install -r requirements.txt
Run Flask frontend:
python flask_search.py
Open in browser:
http://127.0.0.1:5000
Run with all arguments in one command:
python flask_search.py --spimi --adaptive-index-dir pt_index --lsi-output-dir lsi_index --host 127.0.0.1 --port 5000 --debug
Flask CLI arguments:
--spimi(default:False): Use SPIMI index class instead of BSBI.--adaptive-index-dir(default:pt_index): Directory for adaptive retrieval artifacts.--lsi-output-dir(default:lsi_index): Directory for LSI+FAISS artifacts.--host(default:127.0.0.1): Flask host.--port(default:5000): Flask port.--debug(default:False): Enable Flask debug mode.







