Skip to content

Minor optimization of hash checking #89

@Huite

Description

@Huite

The current hashing approach concatenates three SHA-1 hashes into a single string:

def _hash_csr_matrix(self, matrix):
    return (hashlib.sha1(matrix.indices).hexdigest() +
            hashlib.sha1(matrix.indptr).hexdigest() +
            hashlib.sha1(matrix.data).hexdigest())

def _is_already_factorized(self, A):
    if type(self.factorized_A) == str:
        return self._hash_csr_matrix(A) == self.factorized_A
    else:
        return self._csr_matrix_equal(A, self.factorized_A)

This always computes all three hashes and compares the full concatenated string.

We could store hashes as a tuple and use short-circuit evaluation as a minor optimization:

def _hash_array(self, array):
    return hashlib.sha1(array).hexdigest()

def _hash_csr_matrix(self, matrix):
    return (self._hash_array(matrix.indptr),
            self._hash_array(matrix.indices),
            self._hash_array(matrix.data))

def _is_already_factorized(self, A):
    if type(self.factorized_A) == tuple:
        hash_indptr, hash_indices, hash_data = self.factorized_A
        return (
            (self._hash_array(A.indptr) == hash_indptr) and
            (self._hash_array(A.indices) == hash_indices) and
            (self._hash_array(A.data) == hash_data)
        )
    else:
        return self._csr_matrix_equal(A, self.factorized_A)

Like the all check in _csr_matrix_equal, Python's and operator short-circuits; if indptr hashes don't match, it skips computing the other two hashes entirely. Comparing three strings should be cheaper than generating SHA-1 hashes.

I'll post benchmarks to verify the performance improvement.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions