Add convenience wrapper for quick similarity computation by miracvbasaran · Pull Request #1 · vitali-fedulov/imagehash

miracvbasaran · 2023-01-19T21:43:39Z

Thanks for working on this, it works like a charm!

As I was using your libraries, I thought it would be low-hanging fruit to add a simple convenience wrapper around imagehash and images4 to simplify the user experience. It seems to me that the new Image.Similar() is the only thing users need from the imagehash library, so we can hide the complexity of centralHash and hashSet.

Additionally, users could directly use the imagehash.Image struct to create a hashtable for quick lookup, e.g.:

// The map must be initialized before writing to it; paths is a []string.
var images = make(map[string]imagehash.Image)
for _, p := range paths {
    images[p] = imagehash.Open(p)
}

func lookup(imgToLookup imagehash.Image) string {
    for p, img := range images {
        if img.Similar(imgToLookup) {
            return p
        }
    }
    return ""
}

LMK if you think this is a good idea and what tests you'd like me to add for them.

vitali-fedulov · 2023-01-21T02:58:29Z

Hello Miraç,

I am very excited you have found the package useful! And I have looked at your pull request.

But I am confused, because the addition contradicts the purpose of the imagehash package:

(1) The main purpose of the imagehash package is to avoid each-to-each image comparison, which happens when using images4.Similar or any (!) Similar function. Hashes in this package are not meant for each-to-each image comparison.

func Similar implies each-to-each comparison.

You propose a func Similar, which merges images4.Similar with imagehash similarity. But it is faster to use images4.Similar directly. That is, imagehash is not needed at all for this scenario. And it should not impact the precision of comparison.

(2) You propose the Image struct, which is a very large object to keep in memory for every image.

The purpose of hashes and icons is to avoid keeping images in memory. An additional purpose of the icon is generalization for comparison, because pixel-by-pixel comparison of full images does not generalize well.

Actually, hashes are an additional memory-saving step to avoid keeping icons in memory.

RECOMMENDATION

Try using images4 directly, unless you work with billions of images.

If you do work with billions of images, then imagehash hashes are meant to be compared by the hashes only, not through any Similar func. One huge hash table is necessary, unlike the small example in the package README.
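A minimal sketch of what "one huge hash table" could look like. The record type, buildIndex, and lookupCandidates are hypothetical names, and the hash values are placeholders rather than output of the imagehash package. Which side stores the central hash and which side queries with a hash set is also an assumption of this sketch.

```go
package main

import "fmt"

// record is a hypothetical stored entry: one central hash per image.
// Real hashes would come from the imagehash package, not be hand-written.
type record struct {
	path        string
	centralHash uint64
}

// buildIndex puts every stored image into one table keyed by its hash.
func buildIndex(records []record) map[uint64][]string {
	index := make(map[uint64][]string)
	for _, r := range records {
		index[r.centralHash] = append(index[r.centralHash], r.path)
	}
	return index
}

// lookupCandidates probes the table once per hash in the query's hash set.
// The cost depends on the size of the hash set, not on the number of
// stored images, so no each-to-each comparison takes place.
func lookupCandidates(index map[uint64][]string, queryHashSet []uint64) []string {
	var found []string
	for _, h := range queryHashSet {
		found = append(found, index[h]...)
	}
	return found
}

func main() {
	index := buildIndex([]record{
		{path: "a.jpg", centralHash: 10},
		{path: "b.jpg", centralHash: 50},
	})
	fmt.Println(lookupCandidates(index, []uint64{9, 10, 11})) // [a.jpg]
}
```

The lookup only tests keys for equality; no distance between hashes is computed at any point.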

This is probably a confusion of terms: I use "hash" as a precise term for an address in a hash table, whereas many people use it for, say, Hamming distance calculations. In imagehash, no distances are calculated between hashes. For icons, though, distances are calculated.
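To illustrate the two meanings of "hash", here is a small sketch with made-up 4-bit values. A hash-table lookup succeeds only on an exact key match, while a Hamming-distance comparison treats two values that differ in one bit as close. The values and the hammingDistance helper are illustrative and not part of imagehash.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bit positions in which a and b differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	table := map[uint64]string{0b1011: "a.jpg"}

	// Hash as a table address: only the exact key finds the record.
	_, exact := table[0b1011]
	_, near := table[0b1010]
	fmt.Println(exact, near) // true false

	// Hash as a distance signature: the same two values are 1 bit apart.
	fmt.Println(hammingDistance(0b1011, 0b1010)) // 1
}
```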

Am I missing something? Let me know what you think.

Thank you for the contribution! It helps clarify package descriptions.

miracvbasaran · 2023-01-24T01:12:42Z

Hello Vitali,

Thanks for getting back to me! IIUC, you have two concerns:

1. Performance

Perhaps my PR got a bit confusing with all the images and similarities: imagehash.Image, image.Image, Image.Similar(), images4.Similar()... :)

My PR doesn't actually run images4.Similar() unless the imagehash comparison (comparing CentralHash with HashSet) yields a match. See the following snippet:

return foundSimilarImage && images4.Similar(img.Icon, img2.Icon)

When foundSimilarImage is false, i.e. when the imagehash comparison doesn't yield a match (in other words, when the images are dissimilar), images4.Similar() is not called, because the first operand of the expression (foundSimilarImage) evaluates to false and && short-circuits.
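The short-circuit behaviour can be checked in isolation. In this sketch, expensiveSimilar is a hypothetical stand-in for a costly check such as images4.Similar(), and a counter records how often it actually runs.

```go
package main

import "fmt"

// expensiveCalls counts how often the costly comparison actually runs.
var expensiveCalls int

// expensiveSimilar stands in for a costly check; it is not the real API.
func expensiveSimilar(a, b int) bool {
	expensiveCalls++
	return a == b
}

// similar evaluates the costly check only if the cheap pre-filter passed,
// because Go's && does not evaluate its right operand when the left is false.
func similar(prefilterMatch bool, a, b int) bool {
	return prefilterMatch && expensiveSimilar(a, b)
}

func main() {
	r1 := similar(false, 1, 1)
	fmt.Println(r1, expensiveCalls) // false 0

	r2 := similar(true, 1, 1)
	fmt.Println(r2, expensiveCalls) // true 1
}
```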

This PR does improve performance compared with using images4.Similar() directly, even before going into the "billions of images" territory.

I ran some informal experiments:
Setup: ~45000 images with ~200 matching image pairs.
Comparing them directly with images4.Similar(): 99s
Comparing them with imagehash.Image.Similar() (This PR): 46s

2. Memory

When I wrote this snippet for my own project before creating this PR, I didn't have image.Image in the imagehash.Image struct because I didn't need it. So it looked like the following:

type Image struct {
	Icon         images4.IconT
	CentralHash  uint64
	HashSet      []uint64
}

I added the field Image image.Image when creating the PR because I thought it might be useful for others, e.g. if they want to use this function in images4.

I didn't consider the memory impact. You are right, when working with billions of images that would be prohibitive. I removed it now.

Just as an example, after this change an average imagehash.Image object in my experiments is about 0.18 KB (~180 bytes), so 1 million imagehash.Image objects would take ~180 MB and 1 billion would take ~180 GB. A user could easily hold a hashmap with millions of images in memory; billions would need correspondingly more.
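As a sanity check on the units, here is a back-of-envelope sketch. It assumes the ~0.18 KB average quoted above, decimal units (1 KB = 1000 bytes), and no map or allocator overhead; totalBytes is a hypothetical helper.

```go
package main

import "fmt"

// totalBytes returns the memory needed for n objects of the given size.
func totalBytes(n, bytesPerObject uint64) uint64 {
	return n * bytesPerObject
}

func main() {
	const perObject = 180 // bytes, i.e. ~0.18 KB as quoted above

	for _, n := range []uint64{1_000_000, 1_000_000_000} {
		gb := float64(totalBytes(n, perObject)) / 1e9
		fmt.Printf("%d objects: %.2f GB\n", n, gb)
	}
}
```

With these assumptions, one million objects need 0.18 GB (180 MB) and one billion objects need 180 GB.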

I hope that makes sense!

Cheers,
Miraç

vitali-fedulov · 2023-01-29T00:36:25Z

Miraç, got it, thanks!

At this point I am thinking:

Maybe your proposal for a faster Similar func would fit better in the images4 package, or even in a future images5 version.
Indeed, looking at how people use images4, sometimes they compare only 2 images. They do not even loop over one-to-each comparisons.
Would you mind providing a sample of how you use it in one of your projects, to give me an idea? Have you tried, or do you need, a hash table as a primary step? It would be interesting to see the speed gain alongside your 2 other benchmarks.

Hopefully your benchmarks reflect purely the comparison step, with all icon/hash structs ready and without including image read time.
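A minimal sketch of timing the comparison step alone, with all preparation done before the clock starts. The placeholder uint64 items and the equality check stand in for prepared icon/hash structs and a real similarity function; countSimilarPairs is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// countSimilarPairs compares every pair once with the supplied function and
// returns the number of comparisons made and the number of matches found.
func countSimilarPairs(items []uint64, similar func(a, b uint64) bool) (compared, matched int) {
	for i := 0; i < len(items); i++ {
		for j := i + 1; j < len(items); j++ {
			compared++
			if similar(items[i], items[j]) {
				matched++
			}
		}
	}
	return compared, matched
}

func main() {
	// Preparation (reading images, building icons and hashes) belongs here,
	// before the timer starts. Placeholder values are used instead.
	items := make([]uint64, 2000)
	for i := range items {
		items[i] = uint64(i % 1000)
	}

	start := time.Now() // only the comparison loop is timed
	compared, matched := countSimilarPairs(items, func(a, b uint64) bool { return a == b })
	elapsed := time.Since(start)

	fmt.Println(compared, matched, elapsed)
}
```

An all-pairs loop over n items makes n(n-1)/2 comparisons, which is about 1.01 billion for 45,000 items if every pair is compared.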

I am going slowly on this to make sure we are getting an elegant solution.

Best,
Vitali

Add convenience wrapper for quick similarity computation

aacead5

Remove image.Image field from imagehash.Image

e01d51c

vitali-fedulov added the "enhancement" (New feature or request) label on Jan 3, 2024

miracvbasaran wants to merge 2 commits into vitali-fedulov:master from miracvbasaran:master