N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance", String Processing and Information Retrieval, Lecture Notes in Computer Science Volume 3772, 2005, pp 115-126.
The algorithm uses affixing with the special character '\n' to increase the weight of the first characters. Normalization is achieved by dividing the total similarity score by the original length of the longer word.
The distance between two strings is defined as the L1 norm of the difference of their profiles (the number of occurrences of each n-gram, also called k-shingle). Divided by 2n, Q-gram distance is a lower bound on Levenshtein distance (a single edit operation changes at most n n-grams in each string), but it can be computed in O(|A| + |B|), where Levenshtein requires O(|A| · |B|).
```java
import info.debatty.java.stringsimilarity.*;
@@ -182,27 +204,13 @@ public class MyApp {
}
```
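
The profile-based definition of Q-gram distance given above can be sketched in plain Java as follows. This is a minimal illustration, not the library's implementation: the class `QGramDistanceSketch` and its methods are hypothetical names, and the sketch assumes the strings are not padded before n-grams are extracted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of info.debatty.java.stringsimilarity.
class QGramDistanceSketch {

    // Profile: how many times each n-gram (substring of length n) occurs in s.
    // Assumption: no padding characters are added to s.
    static Map<String, Integer> profile(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // L1 norm of the difference between the two profiles.
    static int distance(String a, String b, int n) {
        Map<String, Integer> pa = profile(a, n);
        Map<String, Integer> pb = profile(b, n);
        Set<String> keys = new HashSet<>(pa.keySet());
        keys.addAll(pb.keySet());
        int sum = 0;
        for (String k : keys) {
            sum += Math.abs(pa.getOrDefault(k, 0) - pb.getOrDefault(k, 0));
        }
        return sum;
    }

    public static void main(String[] args) {
        // Profiles {AB, BC, CD} and {AB, BC, CE} differ in two n-grams: prints 2
        System.out.println(distance("ABCD", "ABCE", 2));
    }
}
```

Building each profile takes one pass over its string, which is where the O(|A| + |B|) cost comes from.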
## N-Gram similarity (Kondrak)
N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance", String Processing and Information Retrieval, Lecture Notes in Computer Science Volume 3772, 2005, pp 115-126.
The algorithm uses affixing with the special character '\n' to increase the weight of the first characters. Normalization is achieved by dividing the total similarity score by the original length of the longer word.
```java
import info.debatty.java.stringsimilarity.*;
```
## Cosine similarity
Like Q-Gram distance, the profile of each input string is first computed (the number of occurrences of each n-gram). The two input strings are thus considered as vectors in the space of n-grams. The similarity between the two strings is the cosine of the angle between these two vectors, and is computed as V1 . V2 / (|V1| * |V2|).
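
As a rough illustration of the formula above, the following self-contained sketch builds the two n-gram profiles and computes the cosine of the angle between them. The class `CosineSketch` and its methods are hypothetical names chosen for this example rather than the library's API, and returning 0 when a string has no n-grams is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of info.debatty.java.stringsimilarity.
class CosineSketch {

    // Profile: number of occurrences of each n-gram in s (no padding assumed).
    static Map<String, Integer> profile(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean norm |V| of a profile vector.
    static double norm(Map<String, Integer> p) {
        double sum = 0;
        for (int c : p.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: V1 . V2 / (|V1| * |V2|).
    static double similarity(String a, String b, int n) {
        Map<String, Integer> pa = profile(a, n);
        Map<String, Integer> pb = profile(b, n);
        double dot = 0;
        for (Map.Entry<String, Integer> e : pa.entrySet()) {
            dot += (double) e.getValue() * pb.getOrDefault(e.getKey(), 0);
        }
        double na = norm(pa);
        double nb = norm(pb);
        // Assumption: a string shorter than n has an empty profile; return 0.
        if (na == 0 || nb == 0) {
            return 0.0;
        }
        return dot / (na * nb);
    }

    public static void main(String[] args) {
        // Two shared 2-grams out of three in each string: roughly 2/3.
        System.out.println(similarity("ABCD", "ABCE", 2));
    }
}
```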
## Jaccard index

Like Q-Gram distance, the input strings are first converted into sets of n-grams (sequences of n characters, also called k-shingles), but this time the cardinality of each n-gram is not taken into account. Each input string is simply a set of n-grams. The Jaccard index is then computed as |A ∩ B| / |A ∪ B|.
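
The set-based computation can be sketched as below. This is only an illustration under stated assumptions: `JaccardSketch` and its methods are hypothetical names rather than the library's API, no padding is applied, and two strings that both yield an empty n-gram set are treated as identical (index 1).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of info.debatty.java.stringsimilarity.
class JaccardSketch {

    // Set of distinct n-grams in s; occurrence counts are deliberately ignored.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard index: size of the intersection divided by size of the union.
    static double index(String a, String b, int n) {
        Set<String> sa = ngrams(a, n);
        Set<String> sb = ngrams(b, n);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        // Assumption: two empty sets are considered identical.
        if (union.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Intersection {AB, BC}, union {AB, BC, CD, CE}: prints 0.5
        System.out.println(index("ABCD", "ABCE", 2));
    }
}
```

Because repeated n-grams collapse into a single set element, "AAAA" and "AA" have the same 2-gram set and an index of 1, which a profile-based measure would not give.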
## Sorensen-Dice coefficient
Similar to the Jaccard index, but this time the similarity is computed as 2 * |A ∩ B| / (|A| + |B|).
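
A minimal sketch of this coefficient over sets of n-grams might look as follows. The class `DiceSketch` and its methods are hypothetical names, not the library's API. The sketch assumes no padding and returns 1 when both n-gram sets are empty.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of info.debatty.java.stringsimilarity.
class DiceSketch {

    // Set of distinct n-grams in s (no padding assumed).
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Sorensen-Dice coefficient: 2 * |A inter B| / (|A| + |B|).
    static double coefficient(String a, String b, int n) {
        Set<String> sa = ngrams(a, n);
        Set<String> sb = ngrams(b, n);
        int total = sa.size() + sb.size();
        // Assumption: two empty sets are considered identical.
        if (total == 0) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        return 2.0 * inter.size() / total;
    }

    public static void main(String[] args) {
        // 2 * 2 / (3 + 3), roughly 0.667
        System.out.println(coefficient("ABCD", "ABCE", 2));
    }
}
```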