-
Notifications
You must be signed in to change notification settings - Fork 194
Open
Description
When the number of distinct items is larger than the max itemset size this library gives incorrect results.
I found this in two different ways: 1.) comparing against the brute-force method of computing all the subsets and 2.) checking that the apriori principle is satisfied by the results (i.e. the support of a superset should never be larger than the support of any of its subsets).
Here's a test:
#all combinations of length-4 from 20 different items
database = [frozenset(t) for t in itertools.combinations(range(20), r=4)]
fitemlists_w_sup = fp_growth.find_frequent_itemsets(database, minimum_support=2,
include_support=True)
fitemlists_w_sup = list(fitemlists_w_sup) # I will reuse the generator
#brute force way calculate support
counts = []
for transaction in database:
psets = powerset(transaction)
psets = [pset for pset in psets if len(pset) > 0] # remove empty set
psets = map(frozenset, psets)
counts.extend(psets)
counts = collections.Counter(counts)
#check that the brute-force method agrees with the fp-growth library
fpd = dict([(frozenset(x[0]), x[1]) for x in fitemlists_w_sup])
for fitemlist_w_sup in fitemlists_w_sup:
if fitemlist_w_sup[1] != counts[frozenset(fitemlist_w_sup[0])]:
print 'fp support is wrong. fp support: {0}, true support: {1}'\
.format(fitemlist_w_sup, counts[frozenset(fitemlist_w_sup[0])])
#check that the apriori principle is satisfied
for subpset in powerset(fitemlist_w_sup[0]):
if len(subpset) > 0:
if fpd[frozenset(subpset)] < fpd[frozenset(fitemlist_w_sup[0])]:
print 'apriori?, subset and support: {0} {1}, superset and support: {2}'\
.format(subpset, fpd[frozenset(subpset)],fitemlist_w_sup)
For reference, here's how I calculated powerset:
def powerset(iterable):
xs = list(iterable)
return itertools.chain.from_iterable(
itertools.combinations(xs, n) for n in range(max_size + 1))
Metadata
Metadata
Assignees
Labels
No labels