Skip to content

incorrect results when the number of distinct items is large #6

@rcompton

Description

@rcompton

When the number of distinct items is larger than the max itemset size this library gives incorrect results.

I found this in two different ways: 1.) comparing against the brute-force method of computing all the subsets and 2.) checking that the apriori principle is satisfied by the results (i.e. the support of a superset should never be larger than the support of any of its subsets).

Here's a test:

#all combinations of length-4 from 20 different items
database = [frozenset(t) for t in itertools.combinations(range(20), r=4)]

fitemlists_w_sup = fp_growth.find_frequent_itemsets(database, minimum_support=2,
                                                    include_support=True)
fitemlists_w_sup = list(fitemlists_w_sup)  # I will reuse the generator

#brute force way calculate support
counts = []
for transaction in database:
    psets = powerset(transaction)
    psets = [pset for pset in psets if len(pset) > 0]  # remove empty set
    psets = map(frozenset, psets)
    counts.extend(psets)
counts = collections.Counter(counts)

#check that the brute-force method agrees with the fp-growth library
fpd = dict([(frozenset(x[0]), x[1]) for x in fitemlists_w_sup])
for fitemlist_w_sup in fitemlists_w_sup:
    if fitemlist_w_sup[1] != counts[frozenset(fitemlist_w_sup[0])]:
        print 'fp support is wrong. fp support: {0}, true support: {1}'\
            .format(fitemlist_w_sup, counts[frozenset(fitemlist_w_sup[0])])

        #check that the apriori principle is satisfied
        for subpset in powerset(fitemlist_w_sup[0]):
            if len(subpset) > 0:
                if fpd[frozenset(subpset)] < fpd[frozenset(fitemlist_w_sup[0])]:
                    print 'apriori?, subset and support: {0} {1}, superset and support: {2}'\
                        .format(subpset, fpd[frozenset(subpset)],fitemlist_w_sup)

For reference, here's how I calculated powerset:

def powerset(iterable):
    xs = list(iterable)
    return itertools.chain.from_iterable(
        itertools.combinations(xs, n) for n in range(max_size + 1))

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions