Videohash – Perceptual video hashing python package

(pypi.org)

48 points · akamhy · 4y ago · 19 comments

willcodeforfoo4y ago· 6 in thread

Tangentially-related: What's the state of the art for storing a bunch[1] of hashes like the OP (or pHash, etc.) in PostgreSQL and querying by hamming distance in a reasonable time?[2]

pg_similarity? pg_trgm? cube?

[1]: 10–50 Million

[2]: < 200ms

Twixes4y ago

AFAIK there isn't any great way. pg_similarity etc. offer useful functions, but that doesn't help with big-O here. And the indexing capabilities that exist are only useful for geometry/geography, not abstract metric spaces (the mathematical sense of "metric space", examples of metrics being Hamming distance or Levenshtein distance). I haven't found a DBMS optimized for metric spaces at all actually. The second best solution I've got is just using a columnar DBMS like ClickHouse, which still needs to scan all values, but at least that reads values from a blob which _only stores that specific column_ – hugely faster than parsing whole rows. The lack of the ideal solution is why I'm building Emdrive, an RDBMS with first class support for similarity search, based on indexing with an M-tree variation. Still very early stages ;) https://github.com/Twixes/emdrive
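For reference, the full scan that any index would have to beat can be sketched in a few lines. This is only an illustration: it assumes hashes are stored as plain Python integers, and the names `hamming` and `scan` are made up for the example.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def scan(hashes, query, max_distance):
    """Linear scan: return (index, distance) for every hash within max_distance."""
    matches = []
    for i, h in enumerate(hashes):
        d = hamming(h, query)
        if d <= max_distance:
            matches.append((i, d))
    return matches


stored = [0b10110000, 0b10110001, 0b01001111]
print(scan(stored, 0b10110011, max_distance=2))
```

Every query touches every stored hash, which is the O(n) per lookup cost the comment above is describing.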

willcodeforfoo4y ago

This looks like a really interesting project! I'm going to check it out.

innagadadavida4y ago

One option when your DB doesn't have those primitives is to convert the bits to words, e.g. 011 -> "unsetbit2 setbit1 setbit0", and then put a text index on that column - ranking by matching words is equivalent to doing a hamming distance search. I did this with MySQL for 20M gifs for near duplicate detection and it worked very well.
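A rough sketch of that encoding, to show why it lines up with Hamming distance: every bit position contributes exactly one word, so the number of shared words is the hash length minus the Hamming distance. The function names are hypothetical, and how a particular full-text index actually scores matches is not covered here.

```python
def bits_to_tokens(value, nbits):
    """Encode each bit position as a word, most significant bit first."""
    tokens = []
    for pos in range(nbits - 1, -1, -1):
        state = "setbit" if (value >> pos) & 1 else "unsetbit"
        tokens.append(f"{state}{pos}")
    return tokens


def shared_tokens(a, b, nbits):
    """Count the words two hashes have in common."""
    return len(set(bits_to_tokens(a, nbits)) & set(bits_to_tokens(b, nbits)))


print(" ".join(bits_to_tokens(0b011, 3)))
# 0b011 and 0b010 differ in one bit, so they share 3 - 1 = 2 words.
print(shared_tokens(0b011, 0b010, 3))
```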

willcodeforfoo4y ago

Interesting idea, I'll have to give this a shot!

varelaz4y ago

I don't know if it makes sense to query by hamming distance for hashes. Closest hashes don't guarantee closest images at all. You can instead count how many parts match with a query like: select video_id from video_hashes where hash in (...) group by video_id order by count(distinct hash) desc limit 10

Technically it can be fast since the selection on hash could be very narrow. You only need an index on (hash, video_id).
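To make the query above concrete, here is a small sketch using SQLite from the Python standard library. The table layout, the sample rows and the helper name `rank_by_shared_hashes` are assumptions for illustration; only the shape of the query comes from the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE video_hashes (video_id INTEGER, hash TEXT)")
# Composite index so the IN (...) lookup never has to touch the table rows.
conn.execute("CREATE INDEX idx_hash_video ON video_hashes (hash, video_id)")
conn.executemany(
    "INSERT INTO video_hashes VALUES (?, ?)",
    [(1, "a1"), (1, "b2"), (1, "c3"), (2, "a1"), (2, "zz"), (3, "qq")],
)


def rank_by_shared_hashes(conn, query_hashes, limit=10):
    """Rank videos by how many distinct per-frame hashes they share with the query."""
    placeholders = ",".join("?" for _ in query_hashes)
    sql = (
        "SELECT video_id, COUNT(DISTINCT hash) AS matches "
        "FROM video_hashes "
        f"WHERE hash IN ({placeholders}) "
        "GROUP BY video_id "
        "ORDER BY matches DESC, video_id "
        "LIMIT ?"
    )
    return conn.execute(sql, (*query_hashes, limit)).fetchall()


print(rank_by_shared_hashes(conn, ["a1", "b2", "c3"]))
```

Note that this only counts exact hash matches per frame; it does not tolerate a few flipped bits the way a Hamming distance threshold does.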

CaveTech4y ago

OP is referring to phashes aka perceptual hashes, where closest hashes should indeed indicate similarity.

phenkdo4y ago· 4 in thread

can you explain how you are hashing the video? i took a quick look at the github and don't see details...

giantrobot4y ago

They're extracting video frames at some interval (default of 1 second) as 144x144px stills and then turning them into a square collage. That collage then has a perceptual hash performed on it.

The major problem here is two videos with the exact same content but slightly different times (say one with a couple second intro) will rarely if ever have a positive match.

The only case where I see this particular scheme being helpful is where you've got videos with the same contents but different encodings. The length will be the same but quality between two encodings (and names) might be different. This would help you find them in a sea of files.

A simple improvement would be to only check the frame from the middle of each video first. If the frames at the same timestamp match, you've got a non-zero probability of a match. Then you can check more frames radiating out from the center point. Negative matches will fail fast and save you work. It also matches when the lengths are dissimilar because of trims or splices at the beginning and end of the videos.
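A minimal sketch of that middle-out check, assuming each video has already been reduced to a list of per-frame integer hashes sampled at the same interval. The names, the distance threshold and the `max_misses` knob are hypothetical choices, not something videohash provides.

```python
def middle_out(n):
    """Yield indices 0..n-1 starting at the middle and radiating outwards."""
    mid = n // 2
    if n > 0:
        yield mid
    for step in range(1, n):
        if mid + step < n:
            yield mid + step
        if mid - step >= 0:
            yield mid - step


def probably_same(frames_a, frames_b, max_distance=4, max_misses=0):
    """Compare frames at equal indices, middle first, and bail out on early mismatches."""
    n = min(len(frames_a), len(frames_b))
    misses = 0
    for i in middle_out(n):
        if bin(frames_a[i] ^ frames_b[i]).count("1") > max_distance:
            misses += 1
            if misses > max_misses:
                return False  # fail fast: no need to look at the remaining frames
    return n > 0


print(list(middle_out(5)))
```

With `max_misses=0` a differing middle frame rejects the pair after a single comparison, which is where the saving for negative matches comes from.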

A second improvement would be to pick a frame from the A video and scan through the B video (or segment of each) to find a high probability match. Then check other segments of the video for matches in the same way.
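The scanning idea can be sketched the same way: take one frame hash from A, find its closest frame in B, and use the resulting offset to compare the rest. Again this assumes lists of per-frame integer hashes at a common sampling interval, and all names and thresholds are made up for the example.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def find_offset(hashes_a, hashes_b, probe_index, max_distance=4):
    """Scan B for the frame closest to A[probe_index]; return the index offset or None."""
    probe = hashes_a[probe_index]
    best_j, best_d = None, max_distance + 1
    for j, h in enumerate(hashes_b):
        d = hamming(probe, h)
        if d < best_d:
            best_j, best_d = j, d
    return None if best_j is None else best_j - probe_index


def matching_fraction(hashes_a, hashes_b, offset, max_distance=4):
    """Fraction of overlapping frames that match once B is shifted by offset."""
    pairs = [(i, i + offset) for i in range(len(hashes_a))
             if 0 <= i + offset < len(hashes_b)]
    if not pairs:
        return 0.0
    hits = sum(hamming(hashes_a[i], hashes_b[j]) <= max_distance for i, j in pairs)
    return hits / len(pairs)


a = [0x0F, 0xF0, 0x3C, 0xC3]
b = [0xAA, 0x55, 0x0F, 0xF0, 0x3C, 0xC3]  # same content after a two-frame intro
offset = find_offset(a, b, probe_index=1)
print(offset, matching_fraction(a, b, offset))
```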

Trying to turn a video into a single static representation and comparing it is not the best.

jsdwarf4y ago

Wouldn't it make more sense to convert the video to greyscale and e.g. detect significant changes of brightness during frames and store them as vector coordinates (% of the playtime, brightness delta)?
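One way that idea could look, as a sketch only: it assumes the per-frame mean brightness (0 to 255) has already been extracted at evenly spaced timestamps, and the threshold and function name are arbitrary.

```python
def brightness_signature(brightness, threshold=20.0):
    """Return (% of playtime, brightness delta) for every significant jump between frames."""
    n = len(brightness)
    signature = []
    for i in range(1, n):
        delta = brightness[i] - brightness[i - 1]
        if abs(delta) >= threshold:
            position = 100.0 * i / (n - 1)  # where in the video the jump happens
            signature.append((round(position, 1), delta))
    return signature


print(brightness_signature([10, 12, 80, 82, 30]))
```

Because positions are stored as a percentage of playtime, the signature stays comparable across frame rates, but a trimmed intro would still shift every coordinate.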

hetspookjee4y ago

I think it creates a collage of the video frames: https://github.com/akamhy/videohash/blob/8759b6ad7fdabcdf4dd...

and passes that on to the videohash.py module to generate a hash: https://github.com/akamhy/videohash/blob/main/videohash/vide...

by using the library imagehash: https://pypi.org/project/ImageHash/

fxtentacle4y ago

It's open source ;)

Apparently, they extract video frames using FFMPEG, create a collage out of those frames, then use the whash method of the python imagehash package.

So it's basically reducing video hashing to image hashing, which was previously solved.

varelaz4y ago· 3 in thread

I used a similar approach for video hashing. Instead of a fixed interval I used key frames with ffmpeg, so you don't depend on the codec. I also didn't rescale but took a hash of every frame. For youtube I found that it still produces different hashes sometimes.

edit: to get only keyframes use select=eq(pict_type,I)

Farmadupe4y ago

I get that decoding only keyframes will be much faster, but how can codec independence be maintained when different codecs will insert keyframes at very different points?

Could such an algorithm ever find a duplicate between say a GIF (every frame is a keyframe) vs any modern codec with very few keyframes?

(or is this optimization specifically for videos known to be encoded with the exact same codec, and specifically with a static keyframe interval?)

varelaz4y ago

Codecs are algorithms for generating B and P frames. I frames are just jpegs. Yes, codecs can split a video differently, but given the same split the same frame will be encoded the same way. In most cases key frame frequency is just a number. Some formats like HLS cannot work with variable key frame frequency at all. Why it matters: different versions of the same codec can replay the same video differently for B and P frames (no guarantee), but not for I frames. So I frames are the most stable.

mzs4y ago

faster: -skip_frame nokey
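For anyone wanting to try both variants from this subthread, here is a sketch that only builds the two ffmpeg command lines. The output pattern, the helper name and the use of `-vsync vfr` to avoid duplicated output frames are assumptions (newer ffmpeg builds prefer `-fps_mode vfr`); check them against your ffmpeg version before relying on them.

```python
import shlex


def keyframe_commands(src, out_pattern="key_%04d.png"):
    """Build two ffmpeg invocations that write only key frames as images."""
    # Variant 1: decode everything, then keep I-frames with the select filter.
    via_filter = [
        "ffmpeg", "-i", src,
        "-vf", "select='eq(pict_type,I)'",
        "-vsync", "vfr", out_pattern,
    ]
    # Variant 2: tell the decoder to skip non-key frames entirely.
    # -skip_frame is a decoder option, so it has to come before -i.
    via_decoder = [
        "ffmpeg", "-skip_frame", "nokey", "-i", src,
        "-vsync", "vfr", out_pattern,
    ]
    return via_filter, via_decoder


for cmd in keyframe_commands("input.mp4"):
    print(shlex.join(cmd))
```

Either list can be handed to `subprocess.run(cmd, check=True)` once ffmpeg is installed.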

Farmadupe4y ago

As other posters have commented, hashing schemes like this are not really robust to a few very simple transformations, such as slight offsets in time, and cannot detect if one video is a clip from another.

Practically you can only expect such a tool to match up "reencodings" from one format to another.

Because of this IMO it's worth including the video length as a separate field within the hash. That way when you are doing a search you can sort the videos by length and then only calculate the hamming distance for videos of similar length (a short video will never be a duplicate of a long one even if the perceptual hashes are close).

Unfortunately when all videos are the same length this doesn't change the O(n^2) time complexity, but assuming that they are not, this optimization should give a significant search time benefit and false positive reduction for large datasets.
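A small sketch of the length prefilter described above, assuming each video is a `(length_seconds, hash_int, video_id)` tuple. The tolerance, the distance threshold and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(videos):
    """Sort (length_seconds, hash_int, video_id) tuples by length for range lookups."""
    ordered = sorted(videos, key=lambda v: v[0])
    lengths = [v[0] for v in ordered]
    return lengths, ordered


def find_similar(index, length, hash_int, length_tolerance=2.0, max_distance=8):
    """Only compute Hamming distances for videos whose length is close to the query."""
    lengths, ordered = index
    lo = bisect_left(lengths, length - length_tolerance)
    hi = bisect_right(lengths, length + length_tolerance)
    return [
        video_id
        for _, h, video_id in ordered[lo:hi]
        if bin(h ^ hash_int).count("1") <= max_distance
    ]


index = build_index([(300.0, 0b1111, "c"), (60.0, 0b1111, "a"), (61.0, 0b1110, "b")])
print(find_similar(index, 60.5, 0b1111))
```

In the example, video "c" has an identical hash but is never compared because its length falls outside the window, which is the false positive reduction mentioned above.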

I haven't actually checked what the author is doing, but for the sake of not having to decode entire videos (very CPU intensive) it's worth limiting the hash generation process to a small portion from the start of the video (somewhere between 30 seconds and 5 minutes has worked for me depending on the content). As well as saving time this helps you detect reencodings where the time base is slightly altered (the frames will diverge more and more as time goes on).

(just adding some of my own experience from my own similar video hashing project!)

edit: inb4 someone complains about O(n^2) and mentions BK trees: for numbers up to ~500k hashes in my own test set, a BK tree has always been slower than doing the naive search over a neatly aligned stripe of sorted-by-video-length hashes in memory. (Maybe I need to learn how to do memory arenas for cache locality) (or maybe I need to make my own video hashes be not 500 bits long)
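For readers who haven't met BK trees, a minimal sketch of the structure being compared against. It is a generic textbook-style version, not the commenter's implementation: each child hangs off its parent under the distance between them, and the triangle inequality lets a search skip subtrees whose edge distance is outside `[d - radius, d + radius]`.

```python
class BKTree:
    """Minimal BK tree over an arbitrary metric (here used with Hamming distance)."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is (value, {edge_distance: child_node})

    def add(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (value, {})
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, value) pairs within radius of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= radius:
                results.append((d, value))
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


def hamming(a, b):
    return bin(a ^ b).count("1")


tree = BKTree(hamming)
for h in [0b0000, 0b0001, 0b0011, 0b1111, 0b1000]:
    tree.add(h)
print(tree.search(0b0001, radius=1))
```

The nodes are scattered Python objects linked by dictionaries, which hints at why a flat, sorted array can win in practice despite doing more comparisons.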

helsinki4y ago

You may want to support either bit interleaving or a CNN that emits a vector that preserves visual locality to other videos, allowing for small changes to be ignored between hashes.
