How to implement a hash table in C (2021) (opens in new tab)

(benhoyt.com)

169 pointsbenhoyt1y ago41 comments

41 comments

30 comments · 13 top-level

timo-e-aus-e1y ago· 6 in thread

I played around with C++ when I was at university. Then never touched it again. So, with a grin I stumble over things like

"void* ht_get(...)"

Wait. What? A void pointer? Interesting... I have no clue.

I like articles like these. For someone not familiar with C it's a perfect level. In terms of explanation and the code itself.

floam1y ago

It has no type, so you can cast it to whatever it actually is. Even a function you will call.

The main use case is generic functions. And in data structures like this.

shawn_w1y ago

Casting between void pointers and function pointers is, strictly speaking, undefined behavior. Though it does tend to work, it's not something to rely on.

(And you don't need casts when converting between void pointers and data pointers)

3 more replies

maccard1y ago

Remember that this is for c, not c++. Void pointers are entirely unnecessary in c++ unless you’re interoperating with existing libraries and can’t change the interface. You’ll get faster and safer code with templates and function overloading

masfoobar1y ago

From my experience, learning a language like C goes beyond what you cover in University. Either that or the one I went to was **.

I remember back in the late 90s to early 00s, I had fun countering OOP guys claiming C "cannot do XYZ" - but it obviously can.. the C way.

For method override, my mind was blown when I discovered function pointers!

It was my "Wait. What?" moment.. along with various other things!

Below is sample code (not tested)

  // generic monster "class"
  struct monster_t {
    int health;
    struct weapon_t *weapon;
    // etc..

    void (*attack)(struct monster_t *self);   // our function pointer
    // more function pointers for update, die, hit, etc...
  };

  // impl
  void monster_notassigned_attach(struct monster_t *monster) {  // safety net (todo)
    printf("Attack not assigned to monster...\n");
  }

  void monster_dragon_attack(struct monster_t *monster) {
    // attack for dragon
    printf("Dragon is attacking you...\n");
  }

  void monster_wizard_attack(struct monster_t *monster) {
    // attack for wizard
    printf("Wizard is attacking you...\n");
  }


  // creation
  void monster_dragon_create(struct monster_t *monster) {
    // allocate, then assign.. and do...
    monster->attack = monster_dragon_attack;
  }

  void monster_wizard_create(struct monster_t *monster) {
    // allocate, then assign.. and do...
    monster->attack = monster_wizard_attack;
  }
  
  void main() {
    // testing...
    struct monster_t monsters[10];       // can store various monsters!
  
    monster_dragon_create(&monsters[0]);
    monster_wizard_create(&monsters[1]);  

    for(int i = 0; i < 2; ++i) {
      monsters[i].attack(&monsters[i]);  // yeah you have to pass it in...
    }
  }

Not suggesting I support or follow the above especially for games. It is just a proof of concept! I always tell people that coding in C (or any non-OOP language) should not be done the way you would in Java, C#, etc. As mentioned, the above is a demonstration.

unwind1y ago

This is a bit of a pet peeve with anonymous Internet writing, but: are we supposed to know when you were at university? Do you know when I was? Why not state something like the number of decades or whatever?

And yeah, a void * is how you express pointer to data of unknown type in C.

kchr1y ago

> This is a bit of a pet peeve with anonymous Internet writing, but: are we supposed to know when you were at university? Do you know when I was? Why not state something like the number of decades or whatever?

Does it matter? I tend to read it as "haven't touched it/had any use for it since university".

suprjami1y ago· 5 in thread

It's the ihih guy! Thanks for being one of the few C bloggers out there.

brabel1y ago

I enjoy reading posts about C just to see how much boilerplate and footguns higher languages let you avoid. I would like to write C, but I just have no trust I would remember or even know how to do things right (use `calloc` vs `malloc`, free the table when allocation of `entries` fails etc).

mauvehaus1y ago

Counterpoint: when you're writing C, you end up only implementing as much complexity as you actually need. I rolled a hash table for a project that only need to contain integer keys and (I think) string values. It didn't need to carry around all the complexity to handle hashing arbitrary structured data for keys, etc. Multi-threading was supported but a simple lock over the whole table was sufficient for correctness and met the required performance.

The whole thing was perhaps 200 lines of straightforward code. If you don't need to carry around support for all the third-prime-numbered-Tuesday-of-the-month scenarios that general purpose libraries have to, you end up with code you can understand fully and don't have to spend time reasoning about complexity that doesn't exist.

1 more reply

0823498723498721y ago

The thing about boilerplate is that you were always rewriting it, so avoiding the footguns became habit.

(So ... which takes longer, writing the data structure you want, as we all did then, or searching for the perfect library, as some people do now?)

EDIT: awk(1) had built-in hash tables, and dates from 1977. Note that generic hashing becomes much more useful after D-space gets larger than 16-64k! (14-16 bits of address)

suprjami1y ago

I find that stuff enjoyable. I almost never use malloc, I learnt to use calloc so it's my go-to allocator.

Whenever I write an allocation I try to write a free after it, then fill in the code in between.

Use of static analysers and sanitisers helps too.

Like some people enjoy making all their hand tools from scratch in their workshop, there's a craft and discipline to follow and a satisfaction in doing things well, I enjoy the craft of and discipline C for the same reason.

rlinge1y ago

I started to make some new toys in C too:

https://github.com/webd90kb/webd/tree/master/codes/c_project...

https://github.com/webd90kb/webd/tree/master/codes/scripts/e...

commandersaki1y ago· 2 in thread

There was a fantastic benchmark of C and C++ hash tables doing the rounds a few weeks ago, it's pretty fun reading: https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/.

Unless I really didn't want to introduce dependencies, or reduce code size, I think I'd use an off the shelf hash table implementation these days. It's still a fun exercise building your own though.

fenesiistvan1y ago

One important consideration is missing from this benchmark: how they behave with multi-threading. There can be big differences between diffent hash table implementation when you have to use them from different threads (built-in smart thread safety vs external dumb locks, etc)

CyberDildonics1y ago

That has nothing to do with this. Granular lock free tables will outperform global locking by orders of magnitude, but it has to be designed differently from the ground up.

akira25011y ago· 2 in thread

I really like the API design of libjudy for general hash table like interfaces. It prevents you from having to hash a key twice just to do "check if key is not present, then set value, otherwise leave original value in hash." The pattern also meshes better with the way iterators naturally work.

Also, in terms of this table, if you add the computed key hash value to the stored hash entry, you can check that the value of the hashes match before you do the full strcmp. If you have a weak hash you might also get a benefit from checking that the first characters match before calling the full strcmp.

It would also make rehashing easier since you already have the full key available to you and don't have to use your internal set function to move entries into the new table. In the posted implementation the big-O semantics of a rehash are worst case.

Anyways.. "man hsearch.3" if you want something cheap and easy.

shawn_w1y ago

POSIX hsearch tables are terrible. There's a reason why nobody uses them.

One at a time per program, having to know how big the table needs to be at initialization time with no automatic resizing, an API that's very limited (no way to remove entries, iterate over entries, tell how many entries are in the table, ...)

akira25011y ago

> One at a time per program

Reentrant versions have existed for quite some time.

> having to know how big the table needs to be at initialization time with no automatic resizing

The posted implementation resizes by creating an entirely new table and then moving all entries. The exact same mechanism is available with hsearch.

> no way to remove entries, iterate over entries

Strong downside but the posted implementation has no removal function either. It also leaks key memory on resize and free. It's iterator would need modification to allow delete during iteration.

> tell how many entries are in the table

somewhat_reasonable_point++;

1 more reply

Joker_vD1y ago· 2 in thread

> but it is non-ideal that I’m only allowing half the range of size_t.

I am fairly certain that in C it's actually impossible to have an object whose size is larger than half of the range of size_t unless ptrdiff_t is wider than size_t, which normally isn't. Unless, of course, C standard decided to make subtracting two valid pointers into the same array (or one past the end) a potential UB, just because.

tedunangst1y ago

Actually, that is UB.

> If the result is not representable in an object of that type, the behavior is undefined. In other words, if the expressions P and Q point to, respectively, the i-th and j-th elements of an array object, the expression (P)-(Q) has the value i - j provided the value fits in an object of type ptrdiff_t.

rerdavies1y ago

Although, I can't help thinking you're going to run into bigger problems than that if the size of your buffer has grown to 9,223,372,036,854,775,807 bytes.

1 more reply

SloopJon1y ago

I was noodling in this area recently, trying to speed up some code similar to the tr utility:

    $ echo abcdef |tr abc ghi
    ghidef

For an eight-bit character set, I found that building an array to map every character improved on linear search, even for short replacement strings and relatively short input strings.

There isn't as easy a win for Unicode, so I played with some simple hash tables. Although the conventional wisdom is to use the high-order bits of a hash function's output, FNV-1a is not so good for short inputs of one or two bytes. (I used the 32-bit variant.) It was better just to use the low-order bits.

iamevn1y ago

if (size + size < size) {

I know size_t wraps around on overflow so this would actually work, but this still makes me do a double take.

MrVandemar1y ago

Enjoyable article and thorough run through. I didn't read all of it, but it really took me back to learning 'C' in comp sci in the early 90s. Great times.

dang1y ago

Discussed at the time:

How to implement a hash table in C - https://news.ycombinator.com/item?id=26590234 - March 2021 (156 comments)

foresto1y ago

In case anyone here is not already familiar with it, gperf is a perfect hash function generator.

https://www.gnu.org/software/gperf/manual/gperf.html

glouwbug1y ago

Open hash tables are typically the way to go in C. They drop the added complexity of managing a forward singly list by turning getters and setters typically into what is an array operation

aronhegedus1y ago

This came up on an interview test where I had to implement this from scratch! Fun to read

a-dub1y ago

might be neat if there were acceleration wins to be had with instructions for computing hash ints from machine ints and strings.

or even better, a full fledged linear probing hash lookup instruction.

j / k navigate · click thread line to collapse

41 comments

30 comments · 13 top-level

timo-e-aus-e1y ago· 6 in thread

I played around with C++ when I was at university. Then never touched it again. So, with a grin I stumble over things like

"void* ht_get(...)"

Wait. What? A void pointer? Interesting... I have no clue.

I like articles like these. For someone not familiar with C it's a perfect level. In terms of explanation and the code itself.

floam1y ago

It has no type, so you can cast it to whatever it actually is. Even a function you will call.

The main use case is generic functions. And in data structures like this.

shawn_w1y ago

Casting between void pointers and function pointers is, strictly speaking, undefined behavior. Though it does tend to work, it's not something to rely on.

(And you don't need casts when converting between void pointers and data pointers)

3 more replies

maccard1y ago

masfoobar1y ago

From my experience, learning a language like C goes beyond what you cover in University. Either that or the one I went to was **.

I remember back in the late 90s to early 00s, I had fun countering OOP guys claiming C "cannot do XYZ" - but it obviously can.. the C way.

For method override, my mind was blown when I discovered function pointers!

It was my "Wait. What?" moment.. along with various other things!

Below is sample code (not tested)

  // generic monster "class"
  struct monster_t {
    int health;
    struct weapon_t *weapon;
    // etc..

    void (*attack)(struct monster_t *self);   // our function pointer
    // more function pointers for update, die, hit, etc...
  };

  // impl
  void monster_notassigned_attach(struct monster_t *monster) {  // safety net (todo)
    printf("Attack not assigned to monster...\n");
  }

  void monster_dragon_attack(struct monster_t *monster) {
    // attack for dragon
    printf("Dragon is attacking you...\n");
  }

  void monster_wizard_attack(struct monster_t *monster) {
    // attack for wizard
    printf("Wizard is attacking you...\n");
  }


  // creation
  void monster_dragon_create(struct monster_t *monster) {
    // allocate, then assign.. and do...
    monster->attack = monster_dragon_attack;
  }

  void monster_wizard_create(struct monster_t *monster) {
    // allocate, then assign.. and do...
    monster->attack = monster_wizard_attack;
  }
  
  void main() {
    // testing...
    struct monster_t monsters[10];       // can store various monsters!
  
    monster_dragon_create(&monsters[0]);
    monster_wizard_create(&monsters[1]);  

    for(int i = 0; i < 2; ++i) {
      monsters[i].attack(&monsters[i]);  // yeah you have to pass it in...
    }
  }

unwind1y ago

And yeah, a void * is how you express pointer to data of unknown type in C.

kchr1y ago

Does it matter? I tend to read it as "haven't touched it/had any use for it since university".

suprjami1y ago· 5 in thread

It's the ihih guy! Thanks for being one of the few C bloggers out there.

brabel1y ago

mauvehaus1y ago

1 more reply

0823498723498721y ago

The thing about boilerplate is that you were always rewriting it, so avoiding the footguns became habit.

(So ... which takes longer, writing the data structure you want, as we all did then, or searching for the perfect library, as some people do now?)

EDIT: awk(1) had built-in hash tables, and dates from 1977. Note that generic hashing becomes much more useful after D-space gets larger than 16-64k! (14-16 bits of address)

suprjami1y ago

I find that stuff enjoyable. I almost never use malloc, I learnt to use calloc so it's my go-to allocator.

Whenever I write an allocation I try to write a free after it, then fill in the code in between.

Use of static analysers and sanitisers helps too.

rlinge1y ago

I started to make some new toys in C too:

https://github.com/webd90kb/webd/tree/master/codes/c_project...

https://github.com/webd90kb/webd/tree/master/codes/scripts/e...

commandersaki1y ago· 2 in thread

There was a fantastic benchmark of C and C++ hash tables doing the rounds a few weeks ago, it's pretty fun reading: https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/.

Unless I really didn't want to introduce dependencies, or reduce code size, I think I'd use an off the shelf hash table implementation these days. It's still a fun exercise building your own though.

fenesiistvan1y ago

CyberDildonics1y ago

That has nothing to do with this. Granular lock free tables will outperform global locking by orders of magnitude, but it has to be designed differently from the ground up.

akira25011y ago· 2 in thread

Anyways.. "man hsearch.3" if you want something cheap and easy.

shawn_w1y ago

POSIX hsearch tables are terrible. There's a reason why nobody uses them.

akira25011y ago

> One at a time per program

Reentrant versions have existed for quite some time.

> having to know how big the table needs to be at initialization time with no automatic resizing

The posted implementation resizes by creating an entirely new table and then moving all entries. The exact same mechanism is available with hsearch.

> no way to remove entries, iterate over entries

Strong downside but the posted implementation has no removal function either. It also leaks key memory on resize and free. It's iterator would need modification to allow delete during iteration.

> tell how many entries are in the table

somewhat_reasonable_point++;

1 more reply

Joker_vD1y ago· 2 in thread

> but it is non-ideal that I’m only allowing half the range of size_t.

tedunangst1y ago

Actually, that is UB.

rerdavies1y ago

Although, I can't help thinking you're going to run into bigger problems than that if the size of your buffer has grown to 9,223,372,036,854,775,807 bytes.

1 more reply

SloopJon1y ago

I was noodling in this area recently, trying to speed up some code similar to the tr utility:

    $ echo abcdef |tr abc ghi
    ghidef

For an eight-bit character set, I found that building an array to map every character improved on linear search, even for short replacement strings and relatively short input strings.

iamevn1y ago

if (size + size < size) {

I know size_t wraps around on overflow so this would actually work, but this still makes me do a double take.

MrVandemar1y ago

Enjoyable article and thorough run through. I didn't read all of it, but it really took me back to learning 'C' in comp sci in the early 90s. Great times.

dang1y ago

Discussed at the time:

How to implement a hash table in C - https://news.ycombinator.com/item?id=26590234 - March 2021 (156 comments)

foresto1y ago

In case anyone here is not already familiar with it, gperf is a perfect hash function generator.

https://www.gnu.org/software/gperf/manual/gperf.html

glouwbug1y ago

Open hash tables are typically the way to go in C. They drop the added complexity of managing a forward singly list by turning getters and setters typically into what is an array operation

aronhegedus1y ago

This came up on an interview test where I had to implement this from scratch! Fun to read

a-dub1y ago

might be neat if there were acceleration wins to be had with instructions for computing hash ints from machine ints and strings.

or even better, a full fledged linear probing hash lookup instruction.

j / k navigate · click thread line to collapse