pg_similarity? pg_trgm? cube?
[1]: 10–50 Million
[2]: < 200ms
technically it can be fast since selection on hash could be very narrow. You need only index by hash, video_id.
The major problem here is two videos with the exact same content but slightly different times (say one with a couple second intro) will rarely if ever have a positive match.
The only cases I see where this particular scheme helpful is where you've got videos with the same contents but different encodings. The length will be the same but quality between two encodings (and names) might be different. This would help you find them in a sea of files.
A simple improvement would be to only check the frame from the middle of each video first. If the frame at the same time stamp are the same in one part you've got a non-zero probability of a match. Then you can attempt to check more frames radiating out from the center point. Negative matches will fail fast and save you work. It also matches when the lengths are dissimilar because of trims or splices at the beginning and end of the videos.
A second improvement would be to pick a frame from the A video and scan through the B video (or segment of each) to find a high probability match. Then check other segments of the video for matches in the same way.
Trying to turn a video into a single static representation and comparing it is not the best.
and passes that on to the videohash.py module to generate a hash: https://github.com/akamhy/videohash/blob/main/videohash/vide...
by using the library imagehash: https://pypi.org/project/ImageHash/
Apparently, they extract video frames using FFMPEG, create a collage out of those frames, then use the whash method of the python imagehash package.
So it's basically reducing video hashing to image hashing, which was previously solved.
edit: to get only keyframes use select=eq(pict_type,I)
Could such an algorithm ever find a duplicate between say a GIF (every frame is a keyframe) vs any modern codec with very few keyframes?
(or is this optimization specifically for videos known the be encoded with the exact same codec, and specifically with a static keyframe interval?)
Practically you can only expect such a tool to match up "reencodings" from one format to another.
Because of this IMO it's worth including the video length as a separate field within the hash. That way when you are doing a search you can sort the videos by length then only calculating the hamming distance for videos of similar length (a short video will never be a duplicate of a long one even if the perceptual hashes are close)
Unfortunately when all videos are the same length this doesn't change the O(n*2) time complexity, but assuming that they are not, this optimization should give a significant search time benefit and false positive reduction for large datasets.
I haven't actually checked what the author is doing, but for the sake of not having to decode entire videos (very CPU intensive) it's worth limiting the hash generation process to a small portion from the start of the video (somewhere between 30 seconds and 5 minutes has worked for me depending on the content). As well as being saving time this helps you detect reencodings where the time base is slightly altered (the frames will diverge more and more as time goes on)
(just adding some of my own experience from my own similar video hashing project!)
edit: inb4 someone complains about O(n*2) and mentions BK trees, for numbers up to ~500k hashes in my own testset a BK tree has always been slower than doing the naive search over a neatly aligned stripe of sorted-by-video-length hashes in memory. (Maybe I need to learn how to do memory arenas for cache locality) (or maybe I need to make my own video hashes be not 500 bits long)