- Python 3.10.14
- Install the following libraries:

      numpy==1.26.3
      scipy==1.10.1
      sentence-transformers==2.7.0
      torch==2.4.1+cu124
      torchvision==0.19.1+cu124
      gensim==4.3.3
      scikit-learn==1.5.1
      tqdm==4.66.5

- Install Java
- Download this Java jar to `./evaluations` and rename it to `palmetto.jar`
- Download and extract this processed Wikipedia corpus to `./evaluations/wiki_data` as an external reference corpus.

Here is the folder structure:

    |- evaluations
    | - wiki_data
    | - wikipedia_bd/
    | - wikipedia_bd.histogram
    |- ...
    |- palmetto.jar

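
Before running the evaluation, it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of this repository: the helper names `missing_evaluation_files` and `java_available` are hypothetical, and the nesting of `wikipedia_bd/` and `wikipedia_bd.histogram` under `wiki_data` is an assumption read from the folder structure.

```python
from pathlib import Path
import shutil

# Paths relative to the evaluations folder.
# Assumption: wikipedia_bd/ and wikipedia_bd.histogram sit directly inside wiki_data.
EXPECTED = (
    "palmetto.jar",
    "wiki_data/wikipedia_bd",
    "wiki_data/wikipedia_bd.histogram",
)


def missing_evaluation_files(root="./evaluations"):
    """Return the expected relative paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    for rel in missing_evaluation_files():
        print(f"missing: evaluations/{rel}")
    if not java_available():
        print("java not found on PATH")
```

Running the script from the repository root prints one line per missing item and nothing when everything is found.
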
To run and evaluate our model, run the following command:

    python run.py --model GloCOM --num_topics 50 --data_dir data/SearchSnippets

You can also specify additional arguments when running the model:

    --aug_coef <float>         # Default: 0.5 - Coefficient for augmentation
    --prior_var <float>        # Default: 0.1 - Prior variance
    --weight_loss_ECR <float>  # Default: 60.0 - Weight for ECR loss

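
To try several settings, the command line can be assembled programmatically. This is a minimal sketch that uses only the flags documented above; `build_command` is a hypothetical helper, not part of this repository, and the sweep values are purely illustrative.

```python
import sys


def build_command(model="GloCOM", num_topics=50,
                  data_dir="data/SearchSnippets", **extra):
    """Build the argument list for run.py.

    Keyword arguments in `extra` are turned into `--name value` pairs,
    e.g. aug_coef=0.5 becomes ["--aug_coef", "0.5"].
    """
    cmd = [sys.executable, "run.py",
           "--model", model,
           "--num_topics", str(num_topics),
           "--data_dir", data_dir]
    for flag, value in extra.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    # Illustrative sweep over --aug_coef; the commands are printed, not executed.
    for aug_coef in (0.25, 0.5, 1.0):
        cmd = build_command(aug_coef=aug_coef, prior_var=0.1, weight_loss_ECR=60.0)
        print(" ".join(cmd))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run directly.
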
We provide the KNNTM OT distances and code at this link. Unzip the file; the folder structure should look like this:
```
|- data
| - SearchSnippets
| - KNNTM/
| - M_coo.npz
| - M_cos.npz
```
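
An `.npz` file is a zip archive of `.npy` arrays, so the contents of the downloaded files can be listed without assuming how they were saved. The following is a minimal sketch using only the standard library; `npz_members` is a hypothetical helper, and the path in the example assumes the `KNNTM` folder sits inside `data/SearchSnippets`.

```python
import zipfile
from pathlib import Path


def npz_members(path):
    """Return the sorted names of the arrays stored in an .npz archive."""
    with zipfile.ZipFile(path) as archive:
        # Each member of an .npz archive is named "<array name>.npy".
        return sorted(Path(name).stem for name in archive.namelist())


if __name__ == "__main__":
    # Assumed location of the unzipped files.
    knntm_dir = Path("data/SearchSnippets/KNNTM")
    for filename in ("M_coo.npz", "M_cos.npz"):
        path = knntm_dir / filename
        if path.exists():
            print(filename, npz_members(path))
        else:
            print(f"missing: {path}")
```
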
To run the KNNTM model, use the following command:

    python run.py --model KNNTM --num_topics 50 --data_dir data/SearchSnippets

You can also specify additional arguments when running the model:

    --alpha <float>    # Default: 1.0
    --num_k <int>      # Default: 30
    --eta <float>      # Default: 0.2
    --rho <float>      # Default: 0.6
    --p_epochs <int>   # Default: 20

Parts of this implementation are based on TopMost. We also use Palmetto to evaluate topic coherence.