Open Source at Google: Introducing CityHash (opens in new tab)

(google-opensource.blogspot.com)

78 pointsturbodog15y ago16 comments

16 comments

15 comments · 5 top-level

turbodogOP15y ago· 5 in thread

From http://code.google.com/p/cityhash/source/browse/trunk/src/ci...

// WARNING: This code has not been tested on big-endian platforms!

// It is known to work well on little-endian platforms that have a small penalty

// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs.

// By the way, for some hash functions, given strings a and b, the hash

// of a+b is easily derived from the hashes of a and b. This property

// doesn't hold for any hash functions in this file.

jws15y ago

City64 at least fails to produce the same values on an x86 and big endian PPC (Debian squeeze on both):

  x86
  ========================
  641f1cd0505d1ff9 : I
  e394c6831e9d6e71 : do
  651c1a4b0b3f2d88 : not
  9fc671af2d051786 : know
  1fa385757f0016ec : about
  f2407bd9ce678f7f : making
  73bd225c6b6b8163 : awkward
  ea78562e35777eb1 : rhopalic
  3c854a925e6e4a9e : sentences
  84a04e9aa8dae9f5 : .
  
  PPC
  ========================
  641f1cd0505d1ff9 : I
  e394c6831e9d6e71 : do
  651c1a4b0b3f2d88 : not
  147d90cc620fc6b2 : know
  23af6044b2394703 : about
  4f80227202845190 : making
  4b8b4b3cd98a1de9 : awkward
  61b38def2ee9465f : rhopalic
  29016ee1c7c61e93 : sentences
  84a04e9aa8dae9f5 : .

For anyone that cares, the test program was…

  #include <stdio.h>
  #include <string.h>
  #include "city.h"

  const char *word[] = { "I", "do", "not", "know", "about",
                         "making", "awkward", "rhopalic", 
                         "sentences", ".", 0 };

  int main(int argc, char **argv)
  {
      for ( const char **p = word ; *p != 0; p++) {
        uint64 v = CityHash64( *p, strlen(*p));
        printf("%Lx : %s\n", v, *p);
      }
  }

DarkShikari15y ago

    #define UNALIGNED_LOAD64(p) (*(const uint64*)(p))

    #define UNALIGNED_LOAD32(p) (*(const uint32*)(p))

Any function that uses both of these violates strict aliasing rules and might miscompile on recent gcc versions. Only one of them isn't enough, because char* is allowed to alias anything in C/C++. But if you use both, then you have a uint32_t* and a uint64_t* pointing to the same memory, violating the language spec.

ffmpeg has a header of macros to avoid this problem. They have names like AV_RN32A (aligned, native-endian 32-bit read), AV_RL32 (unaligned, little-endian 32-bit read) and so forth.

haberman15y ago

> Any function that uses both of these violates strict aliasing rules

More specifically, any function that uses both of these to access the same data violates strict aliasing rules. But this would imply that the data is being loaded redundantly, which seems unlikely in an implementation where speed is a top priority.

For example, I do not believe the following function violates strict aliasing rules:

    void foo(char *p) {
      bar(UNALIAGNED_LOAD64(p), UNALIGNED_LOAD32(p+8));
    }

cpeterso15y ago

I am surprised at how half-baked this code sounds. Why would Google release a new hash library that does not support big endian platforms, uses unaligned memory access on little endian platforms, is strict-aliasing unsafe, and is implemented in C++?

sparky15y ago

For what it's worth, Snappy (the compression library Google released a couple weeks ago) has many of these same limitations. I wanted to port both of these to the RISCy, big-endian architecture I work on as part of my research, so finding out how unportable they were was kind of a bummer for me, but honestly it's a pretty reasonable tradeoff for Google; x86 is the name of the game in cheap (sorry Power 7/SPARC) commodity servers. If it were me, I might have put a bit of thought/work into portability before open-sourcing it, but I'd rather the code be out there than not, despite its limitations.

nbpoole15y ago· 3 in thread

I'm not seeing any benchmarks: since they tout speed as a major selling point, that surprises me.

Edit: For instance, the blog post mentions being inspirsed by MurmurHash, which touts its performance on its site along with benchmarks (http://sites.google.com/site/murmurhash/):

    OneAtATime - 354.163715 mb/sec
    FNV - 443.668038 mb/sec
    SuperFastHash - 985.335173 mb/sec
    lookup3 - 988.080652 mb/sec
    MurmurHash 1.0 - 1363.293480 mb/sec
    MurmurHash 2.0 - 2056.885653 mb/sec

In fact, it looks like the author of MurmurHash also developed a test suite for hash functions which includes performance testing: http://code.google.com/p/smhasher/wiki/SMHasher

elq15y ago

looking at the code, they've optimized for speed. And they also said -

> We decided to optimize for speed rather than simplicity and even included special cases for short inputs.

nbpoole15y ago

Right, I understand that. I'm just surprised that they aren't showing off benchmarks or statistics to quantify the performance gains they're seeing.

andrewcooke15y ago

and i'm not seeing any definition of what the hash is, or why it should be any good, apart from the code...

andrewcooke15y ago· 2 in thread

what does this do: #define LIKELY(x) (__builtin_expect(!!(x), 1))

oh, to answer myself, is to help predict the branch direction - it indicates which is likely the common case in the test. http://kerneltrap.org/node/4705

edge1715y ago

haven't seen this stuff in a while... given the complexity and large history-tracking tables of modern branch predictors, is this stuff still useful?

or is the code written in this manner just to make it more platform independent (i.e. cycles still matter for embedded applications, etc)?

pmr_15y ago

There is really no connection between the likely macro and branch prediction here.

The gcc built-in simply will organize the code in a manner that makes the likely statement faster to execute (e.g. branching the unlikely block).

Branch predictors are a runtime mechanism so the processor guesses which outcome of the "if" is more likely.

One is a static optimization, the other is dynamic; if you like to think about it that way.

1 more reply

ecaron15y ago

I just finished created a PHP PECL extension for CityHash. Email me if you'd like to beta-test it before I submit it to the PECL dev list.

mikelward15y ago

Middle click not working as expected? http://code.google.com/p/chromium/issues/detail?id=67844

j / k navigate · click thread line to collapse