The comparison method you are using is commonly vectorized with SIMD and used for string matching.
Note also that Intel/AMD CPUs from the SSE4 generation onward (SSE4.2 on Intel, SSE4a/ABM on AMD) have a popcnt instruction in 32-bit and 64-bit forms, with 3-cycle latency and a throughput of one per cycle for both widths, so it is faster for counting bits/matches than any of the methods you are using :)
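
To make the popcnt suggestion concrete, here is a minimal sketch in C. It assumes GCC or Clang, since `__builtin_popcountll` is a compiler builtin and not standard C. The helper names `count_bits64` and `count_matching_bits` are made up for illustration, and "matches" is assumed to mean bit positions where two words agree, which may not be exactly what your comparison counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 64-bit word.
 * Assumption: GCC/Clang. The builtin becomes a single popcnt instruction
 * only when the target allows it (e.g. -mpopcnt or -msse4.2); otherwise
 * the compiler emits a software fallback with the same result. */
static inline unsigned count_bits64(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}

/* Hypothetical helper: number of bit positions at which a and b agree,
 * summed over n 64-bit words. XOR marks the differing bits, so the
 * complement of the XOR marks the matching ones. */
static size_t count_matching_bits(const uint64_t *a, const uint64_t *b, size_t n)
{
    size_t matches = 0;
    for (size_t i = 0; i < n; i++)
        matches += count_bits64(~(a[i] ^ b[i]));
    return matches;
}
```

If you want the instruction explicitly rather than via the builtin, `_mm_popcnt_u32` and `_mm_popcnt_u64` from `<nmmintrin.h>` should do the same job, but they need the target flag at compile time and the CPU feature at run time.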