topicmodelCV

An R package that conducts 10-fold cross-validation on a document-term matrix (dtm) to choose the optimal number of topics, k. The package allows the user to select a range of k values to test and returns a data frame with the three most common metrics used to quantitatively evaluate an LDA model: perplexity, likelihood, and semantic coherence. The package also lets users plot each of these metrics.

Introduction

The topicmodelCV package allows a user to conduct 10-fold cross-validation to determine the optimal number of topics, k, for a Latent Dirichlet Allocation (LDA) topic model. All the package requires is a pre-processed document-term matrix (dtm), along with a few control parameters, fed to the tenFoldCV() function. The function splits the dtm into 10 folds, runs multiple LDA topic models using the LDA() function from the topicmodels package, and calculates each model's perplexity, held-out likelihood, and semantic coherence (via the topicdoc package) for every k tested. The results for each k are then organized into a data frame and returned to the user for plotting and comparison. The package also includes plotting functions for ease of use (i.e., plot_perplexity(), plot_likelihood(), and plot_coherence()), though the returned data can just as easily be used to make one's own plots.
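
For readers curious what this looks like under the hood, here is a minimal sketch of a 10-fold cross-validation loop of this kind. The function name cv_sketch and its internals are illustrative assumptions rather than the package's actual implementation, but the recipe is the one described above: split the dtm into 10 folds, fit LDA() from topicmodels on each training split, and score each held-out fold.

library(topicmodels) # provides LDA() and perplexity()
library(topicdoc)    # provides topic_coherence()

cv_sketch <- function(dtm, range_of_k, burnin = 1000, iter = 1000,
                      keep = 50, seed = 123) {
  set.seed(seed)
  # randomly assign each document to one of 10 folds
  folds <- sample(rep(1:10, length.out = nrow(dtm)))
  results <- data.frame()
  for (k in range_of_k) {
    for (fold in 1:10) {
      train <- dtm[folds != fold, ]
      test  <- dtm[folds == fold, ]
      model <- LDA(train, k = k, method = "Gibbs",
                   control = list(burnin = burnin, iter = iter, keep = keep))
      perp <- perplexity(model, newdata = test) # held-out perplexity
      # perplexity = exp(-log-likelihood / token count), so the held-out
      # log-likelihood can be recovered from the perplexity
      loglik <- -log(perp) * sum(test)
      results <- rbind(results, data.frame(
        k = k, fold = fold, perplexity = perp, likelihood = loglik,
        coherence = mean(topic_coherence(model, train)) # average over topics
      ))
    }
  }
  # average across folds so there is one row of metrics per k
  aggregate(cbind(perplexity, likelihood, coherence) ~ k,
            data = results, FUN = mean)
}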

Code Snippets

The following examples show how to use each function in the package.

If a user wants to calculate the optimal k for their topic model:

results <- tenFoldCV(dtm, # document-term matrix obtained by pre-processing data
                     range_of_k, # range of topic numbers the user wishes to test
                     burnin_value, iter_value, keep_value, # control parameters that feed into LDA function
                     set_seed) # seed value used for reproducibility
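
For example, a call might look like the following. The parameter values here are purely illustrative, not recommendations, and my_dtm stands in for whatever pre-processed document-term matrix you have built:

results <- tenFoldCV(my_dtm,             # a pre-processed document-term matrix
                     seq(2, 20, by = 2), # test k = 2, 4, ..., 20
                     1000,               # burnin_value
                     1000,               # iter_value
                     50,                 # keep_value
                     123)                # set_seed

head(results) # inspect the metrics for each k tested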

Once the results are obtained, the user can either plot the data on their own or use the following built-in functions:

plot_perplexity(results)

plot_likelihood(results)

plot_coherence(results)
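
If you would rather build your own plots, the results data frame can be passed straight to ggplot2. A small sketch, assuming the returned data frame contains columns named k and perplexity:

library(ggplot2)

# plot held-out perplexity against the number of topics; lower is better
ggplot(results, aes(x = k, y = perplexity)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of topics (k)", y = "Held-out perplexity")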

Sources

The following are the sources I used to construct the package:

  1. Barberá, P. (2021). USC POIR 613: Computational Social Science course. University of Southern California, Political Science and International Relations.
  2. Brownlee, J. (2023). A Gentle Introduction to k-fold Cross-Validation. https://machinelearningmastery.com/k-fold-cross-validation/
  3. Friedman, D. (2022). topicdoc: R package documentation. https://github.com/doug-friedman/topicdoc
  4. Grün, B., & Hornik, K. (2024). topicmodels: Topic Models. R package version 0.2-16. https://CRAN.R-project.org/package=topicmodels
  5. Peter's Stats Stuff - R. (2017). Cross-Validation of Topic Modelling. https://www.r-bloggers.com/2017/01/cross-validation-of-topic-modelling/
  6. Zhang, Z. (2018). Text Mining for Social and Behavioral Research Using R: A Case Study on Teaching Evaluation. https://books.psychstat.org/textmining
