Commit 44cd92e

Added Gensim Section
1 parent 2749c7c commit 44cd92e

1 file changed: +13 -2 lines changed

README.md

Lines changed: 13 additions & 2 deletions
@@ -367,15 +367,26 @@ Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 in

Distance is computed as 1 - similarity.
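The relationship between similarity and distance can be sketched in a few lines. The example below is only an illustration in Python, not this library's API: it assumes the truncated formula above is the Sørensen-Dice coefficient, 2 * |V1 ∩ V2| / (|V1| + |V2|), and it assumes the sets V1 and V2 are character bigrams. The function names `shingles`, `dice_similarity` and `dice_distance` are hypothetical.

```python
def shingles(text, k=2):
    """Return the set of k-character substrings (shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def dice_similarity(s1, s2, k=2):
    """Compute 2 * |V1 ∩ V2| / (|V1| + |V2|) over the two shingle sets."""
    v1, v2 = shingles(s1, k), shingles(s2, k)
    if not v1 and not v2:
        # Assumption: two strings too short to have shingles count as identical.
        return 1.0
    return 2 * len(v1 & v2) / (len(v1) + len(v2))


def dice_distance(s1, s2, k=2):
    """Distance is 1 - similarity, as stated above."""
    return 1.0 - dice_similarity(s1, s2, k)


# "night" and "nacht" share one bigram ("ht") out of four each.
similarity = dice_similarity("night", "nacht")
distance = dice_distance("night", "nacht")
print(similarity, distance)  # 0.25 0.75
```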

+### Gensim
+Gensim is billed as a natural language processing package that does ‘Topic Modeling for Humans’, but in practice it is much more than that.
+
+Topic modeling is a technique for extracting the underlying topics from large volumes of text. Gensim provides algorithms such as LDA (Latent Dirichlet Allocation) and LSI (Latent Semantic Indexing), along with the tooling needed to build high-quality topic models.
+
+Topic models and word embeddings are also available elsewhere, for example in scikit-learn or in R packages. Gensim, however, offers an unusually broad set of tools for building and evaluating topic models, plus many convenient facilities for text processing.
+
+It is a great package for processing text, working with word vector models (such as Word2Vec and FastText), and building topic models.
+
+Another significant advantage of gensim is that it lets you process large text files without loading the entire file into memory.
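The idea behind this kind of memory-friendly processing can be sketched in plain Python: instead of reading the whole file into a list, expose it as an iterable that yields one tokenized document at a time. The class name `StreamedSentences`, the one-document-per-line layout, and the whitespace tokenization are assumptions made for this sketch. An object like this is the kind of iterable that gensim components such as `corpora.Dictionary` are designed to consume, but check the gensim documentation for the exact input each model expects.

```python
import os
import tempfile


class StreamedSentences:
    """Iterate over a text file, yielding one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the object can be iterated
        # several times while only one line is held in memory at once.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip empty lines
                    yield tokens


# Demo file (hypothetical content), written to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Gensim streams documents\n\nOne line at a time\n")
    demo_path = tmp.name

sentences = StreamedSentences(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a second pass works because __iter__ reopens the file
os.remove(demo_path)

print(first_pass)
```

A class with `__iter__` is used rather than a bare generator because a generator is exhausted after one pass, while training algorithms often need to iterate over the data more than once.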
+
+Gensim Tutorial – A Complete Beginners Guide: https://www.machinelearningplus.com/nlp/gensim-tutorial/
+
## Experimental

### SIFT4
SIFT4 is a general-purpose string distance algorithm inspired by Jaro-Winkler and Longest Common Subsequence. It was developed to produce a distance measure that matches human perception of string distance as closely as possible. Hence it takes into account elements like character substitution, character distance, and longest common subsequence. It was developed through experimental testing, without a theoretical background.

**Not implemented yet**

-
-
## Users
* [StringSimilarity.NET](https://github.com/feature23/StringSimilarity.NET) a .NET port of java-string-similarity
