Making a StringBuffer in C, and questioning my sanity (opens in new tab)

(briandouglas.ie)

46 pointsconeonthefloor11mo ago55 comments

55 comments

24 comments · 9 top-level

o11c11mo ago· 6 in thread

Hm, this implementation seems allergic to passing types by value, which eliminates half of the allocations. It also makes the mistake of being mutable-first, and provides some fundamentally-inefficient operations.

The main mistake that this makes in common with most string implementations make is to only provide a single type, rather than a series of mostly-compatible types that can be used generically in common contexts, but which differ in ways that sometimes matter. Ownership, lifetime, representation, etc.

zahlman11mo ago

> It also makes the mistake of being mutable-first

Is mutability not part of the point of having a string buffer? Wouldn't the corresponding immutable type just be a string?

o11c11mo ago

"Buffer" just means it is used between input and output. It does not imply mutability, and many buffers indeed only take their state at construction time and are not mutable.

In my experience, the only functions a mutable string buffer needs to provide are "append string (or to-string-able)" and "undo that append" (which mostly comes up in list-like contexts, e.g. to remove a final comma); for everything else you can convert to an immutable string first.

(theoretically there might be a "split and clobber" function like `strtok`, but in my experience it isn't that useful once your APIs actually take a buffer class).

Considering the functions from this implementation, they can be divided as follows:

Lifetime methods:

  init
  free
  clear

Immutable methods:

  print
  index_of
  match_all
  split

Mutable methods:

  append
  prepend (inefficient!)
  remove
  replace

I've already mentioned `append`, and I suppose I can grant `prepend` for symmetry (though note that immutable strings do provide some sort of `concatenate`, though beware efficiency concerns). Immutable strings ubiquitously provide `replace` (and `remove` is just `replace` with an empty string), which are much safer/easier to use.

There are also a lot of common operations not provided here. And the ones that are provided fail to work with `StringBuffer` input.

remexre11mo ago

How would you recommend doing that sort of "subtyping"? _Generic and macros?

o11c11mo ago

Yup. It's a lot saner in C++, but people who refuse to use C++ for political reasons can do it the ugly way using C11 or GNU C.

2 more replies

amelius11mo ago

I wonder how an LLM would rate this code.

lelanthran11mo ago

> I wonder how an LLM would rate this code.

I dunno how it rates it now, but if this link: https://www.reddit.com/r/programming/comments/1m3dg0l/making... gets used for future training, future LLMs might make good suggestions for cleaning it up.

WalterBright11mo ago· 5 in thread

    new_capacity *= 2;

A better value is to increase size by 1.5:

https://stackoverflow.com/questions/1100311/what-is-the-idea...

tialaramex11mo ago

This seems clever if we forget that it runs on a real machine. Once we remember there's a real machine we run into discontinuities and suddenly "Naive doubling" works best for a lot of cases.

At the small end, on the real machine tiny allocations are inefficient, so rather than 1, 2, 3, 5, 8, 12, 27, 41 turns out we should probably start with 16 or even 64 bytes.

Then at the large end the "clever" reuse is a nonsense because of virtual memory, so that 1598, 2397, 3596 doesn't work out as cleanly as 1024, 2048, 4096

Folly has a growable array type which talks a big game on this 1.5 factor, but as I indicated above they quietly disable this "optimisation" for both small and large sizes in the code itself.

Of course YMMV, it is entirely possible that some particular application gets a noticeable speedup from a hand rolled 1.8x growth factor, or from starting at size 18 or whatever.

sparkie11mo ago

Also depends what we're trying to optimize. If we're trying to optimize for space then using a constant is a bad idea: Consider if we have 2Gi elements in the array: We have to grow it to 3Gi, but we may only need to add a few additional elements. That's pretty much a whole Gi of wasted space.

Clearly we don't just want a blanket constant, but the growth factor should be a function of the current length - decreasing as the size of the array grows.

For space optimization an ideal growth factor is 1+1/√(length). In the above example, where we have 2 Gi array, we would grow it only by 64ki elements. Obviously this results in many more allocations, and would only use this technique where we're optimizing for space rather than time.

We don't want to be messing around with square roots, and ideally, we want arrays to always be a multiple of some power of 2, so the trick is to approximate the square root:

    inline int64_t approx_sqrt(int64_t length) {
        return 1 << (64 - __builtin_clzll(length)) / 2;
        // if C23, use stdc_first_leading_one_ull() from <stdbit.h>
    }

    inline int64_t new_length(int64_t length) {
        if (length == 0) return 1;
        return length + approx_sqrt(length);
    }

Some examples - for all powers of 2 between UINT16_MAX and UINT32_MAX, the old and new lengths:

    old length: 2^16 -> new length: 0x00010100 (growth: 2^8)
    old length: 2^17 -> new length: 0x00020200 (growth: 2^9)
    old length: 2^18 -> new length: 0x00040200 (growth: 2^9)
    old length: 2^19 -> new length: 0x00080400 (growth: 2^10)
    old length: 2^20 -> new length: 0x00100400 (growth: 2^10)
    old length: 2^21 -> new length: 0x00200800 (growth: 2^11)
    old length: 2^22 -> new length: 0x00400800 (growth: 2^11)
    old length: 2^23 -> new length: 0x00801000 (growth: 2^12)
    old length: 2^24 -> new length: 0x01001000 (growth: 2^12)
    old length: 2^25 -> new length: 0x02002000 (growth: 2^13)
    old length: 2^26 -> new length: 0x04002000 (growth: 2^13)
    old length: 2^27 -> new length: 0x08004000 (growth: 2^14)
    old length: 2^28 -> new length: 0x10004000 (growth: 2^14)
    old length: 2^29 -> new length: 0x20008000 (growth: 2^15)
    old length: 2^30 -> new length: 0x40008000 (growth: 2^15)
    old length: 2^31 -> new length: 0x80010000 (growth: 2^16)

This is the growth rate used in Resizeable Arrays in Optimal Time and Space[1], but they don't use a single array with reallocation - instead they have an array of arrays, where growing the array appends an element to an index block which points to a data block of `approx_sqrt(length)` size, and the existing data blocks are all reused. The index block may require reallocation.

[1]:https://cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf

1 more reply

emmelaich11mo ago

I remember reading (decades ago) an extensive article in Software Practice and Experience reaching the same conclusion.

manwe15011mo ago

Or like Python shows there, 1.25+k which can be better (faster growth and less memory wasted) than both

1 more reply

burnt-resistor11mo ago

Yep. And probably use tcmalloc or jemalloc (deprecated?) too. Most OS sbrk/libc malloc implementations are better than they used to be, but certain profiled programs can increased performance by tuning one of the nonstandard allocators. YMMV. Test, profile, and experiment.

ranger_danger11mo ago· 2 in thread

You might be interested in https://github.com/antirez/sds

fsckboy11mo ago

neat, i like it, has some of the same ideas i've used in my string packages

but i did see a place to shave a byte in the sds data struct. The null terminator is a wasted field, that byte (or int) should be used to store the amount of free space left in the buffer (as a proxy for strlen). When there is no space left in the buffer, the free space value will be.... a very convenient 0 heheh

hey, OP said he wants to be a better C programmer!

ranger_danger11mo ago

> The null terminator is a wasted field

I think that would break its "Compatible with normal C string functions" feature.

1 more reply

p0w3n3d11mo ago· 2 in thread

By the way, I have no power here, but you people are ubiquitous, at least one of you is in C committee.

Hereby I propose an addition to C:

  namespace something { }

Which is preprocessing to something_ before every function

  class someclass { }

Which is preprocessing all the functions inside to

  someclass_fn1(someclass\* asfirstparameter, ...)

And of course the final syntax sugar

  cl\* c;
  c->fn1(a,b)

I mean this would make C much easier, as we already code object oriented in it, but the amount of preprocessing, unreadability that needs to be done in headers is simply brain exhausting

masfoobar11mo ago

????

Unfortunately, you will need to cover more ground with this proposal than what you have presented. It sounds simple but what you are suggesting is very complicated. Of course, to the OOP minds out there, they think their suggestion is a be-all-to-end-all, when in reality there are many ways to solve a problem.

I understand what you are trying to do. I am not suggesting it is wrong but C is not an OOP language. I dont see why we should PRETEND that it has OOP-class features when its all you advocating is smoke and mirrors.

Sure, C can do "OOP" but I must remind all that these techniques have been in C for years, before the OOP name was popularised and evolved into the form it is now. You are just forcing C programmers to program in such a way OOPLOVERS like when it is not required - and that is the modern OOP than has been forced on everything since the early-to-mid 90's.

You are just HIDING whats really going on

This proposal you are suggesting -- I am going to call this Typescript-C.

The problem is that this namespace/class overlay makes much ASSUMPTIONS what the programmer wants. Lets dig into this further. Here is how (I believe) you see it implemented :-

  // In my TypeScript-C.
  namespace Foo {

    class Bar {

      int baz;

      void init() {
        this->baz = 10;
      }

      void free() {
        // whatever
      }

      int process(int withval) {
        return withval + this->baz;
      }
    }
  }

  // calling example :-
  Foo.Bar sample;
  sample->init();
  printf("%d\n", sample->process(100));
  sample->free();

This would translate to :-

  // Translated C...
  typedef struct Foo_Bar {
    int baz;
  }

  void Foo_Bar_init(Foo_Bar *self) {
    self->baz = 10;
  }

  void Foo_Bar_free(Foo_Bar *self) {
    // whatever
  }

  int Foo_bar_process(Foo_Bar \*self, int withval) {
    return withval + self->baz;
  }

  // calling example :-
  Foo_Bar sample;
  Foo_Bar_init(&sample);
  printf("%d\n", Foo_Bar_process(&sample, 100));
  Foo_Bar_free();

My argument is, for starters, how is writing all this namespace and class wrapper actually better than the final code? Anyone that starts using C with namespaces and classes are going to assume we have inheritence, overrides, etc. Why?

If anyone is THAT bad they cannot write proper C code, then I suggest they move to C++, Java, C#, Dlang.. or whatever.

Its just adding extra fluff -- and C guys like to SEE what is going on. (Thats why it takes ages to truly grasp C++ internals)

p0w3n3d11mo ago

I'm thinking golang style OOP - basically structs with syntactic sugar of dot-calling. Nothing more. We wouldn't want full blown vtable C++ style inheritance with polymorphism, operator overloading etc. because this would create another C++ (D++ maybe? ;)

However while we're looking at any library, GTK for example, those structs are made like that. But you're right - this only looks simple, as there are many pitfalls we could hit when implementing it.

1 more reply

kazinator11mo ago

There is a way for this not to terminate:

  while (new_capacity < required) {
    new_capacity *= 2;
  }

1. All variables are unsigned (due to being size_t). So we don't worry about overflow UB.

2. new_capacity * 2 always produces an even number, whether truncating or not.

3. Supppose required is SIZE_MAX, the highest value of size_t; note that this is an odd number.

4. Therefore new_capacity * 2 is always < required; loop does not terminate.

burnt-resistor11mo ago

It can be refactored into creating a buffer primitive of void* buf, size_t capacity, size_t refcount. Then, the string can implement using CoW logic on a buffer and size_t length. Read-only references to substrings become cheap and copying is done whenever there's a modification or realloc can't grow the underlying buffer.

gblargg11mo ago

It's odd how it has error reporting in some areas (alloc, split can return NULL if allocation fails), but not others (append, prepend have a void return type but might require allocation internally).

teo_zero11mo ago

What I don't like is that some functions take as arguments a mix of StringBuffer's and regular C strings. This is confusing. For example why this:

  void StringBuffer_replace(StringBuffer *buf,
    const char *original,
    const char *update,
    size_t from);

instead of this:

  void StringBuffer_replace(StringBuffer *buf,
    const StringBuffer *original,
    const StringBuffer *update,
    size_t from);

improgrammer00711mo ago

I would rather focus on solving the main problem than reinvent the wheel. Just use C++ if perf is critical which gives you all these things for free. In this day and age the reasons for using C as your main language should be almost zero.

j / k navigate · click thread line to collapse

55 comments

24 comments · 9 top-level

o11c11mo ago· 6 in thread

zahlman11mo ago

> It also makes the mistake of being mutable-first

Is mutability not part of the point of having a string buffer? Wouldn't the corresponding immutable type just be a string?

o11c11mo ago

"Buffer" just means it is used between input and output. It does not imply mutability, and many buffers indeed only take their state at construction time and are not mutable.

(theoretically there might be a "split and clobber" function like `strtok`, but in my experience it isn't that useful once your APIs actually take a buffer class).

Considering the functions from this implementation, they can be divided as follows:

Lifetime methods:

  init
  free
  clear

Immutable methods:

  print
  index_of
  match_all
  split

Mutable methods:

  append
  prepend (inefficient!)
  remove
  replace

There are also a lot of common operations not provided here. And the ones that are provided fail to work with `StringBuffer` input.

remexre11mo ago

How would you recommend doing that sort of "subtyping"? _Generic and macros?

o11c11mo ago

Yup. It's a lot saner in C++, but people who refuse to use C++ for political reasons can do it the ugly way using C11 or GNU C.

2 more replies

amelius11mo ago

I wonder how an LLM would rate this code.

lelanthran11mo ago

> I wonder how an LLM would rate this code.

WalterBright11mo ago· 5 in thread

    new_capacity *= 2;

A better value is to increase size by 1.5:

https://stackoverflow.com/questions/1100311/what-is-the-idea...

tialaramex11mo ago

This seems clever if we forget that it runs on a real machine. Once we remember there's a real machine we run into discontinuities and suddenly "Naive doubling" works best for a lot of cases.

At the small end, on the real machine tiny allocations are inefficient, so rather than 1, 2, 3, 5, 8, 12, 27, 41 turns out we should probably start with 16 or even 64 bytes.

Then at the large end the "clever" reuse is a nonsense because of virtual memory, so that 1598, 2397, 3596 doesn't work out as cleanly as 1024, 2048, 4096

Folly has a growable array type which talks a big game on this 1.5 factor, but as I indicated above they quietly disable this "optimisation" for both small and large sizes in the code itself.

Of course YMMV, it is entirely possible that some particular application gets a noticeable speedup from a hand rolled 1.8x growth factor, or from starting at size 18 or whatever.

sparkie11mo ago

Clearly we don't just want a blanket constant, but the growth factor should be a function of the current length - decreasing as the size of the array grows.

We don't want to be messing around with square roots, and ideally, we want arrays to always be a multiple of some power of 2, so the trick is to approximate the square root:

    inline int64_t approx_sqrt(int64_t length) {
        return 1 << (64 - __builtin_clzll(length)) / 2;
        // if C23, use stdc_first_leading_one_ull() from <stdbit.h>
    }

    inline int64_t new_length(int64_t length) {
        if (length == 0) return 1;
        return length + approx_sqrt(length);
    }

Some examples - for all powers of 2 between UINT16_MAX and UINT32_MAX, the old and new lengths:

    old length: 2^16 -> new length: 0x00010100 (growth: 2^8)
    old length: 2^17 -> new length: 0x00020200 (growth: 2^9)
    old length: 2^18 -> new length: 0x00040200 (growth: 2^9)
    old length: 2^19 -> new length: 0x00080400 (growth: 2^10)
    old length: 2^20 -> new length: 0x00100400 (growth: 2^10)
    old length: 2^21 -> new length: 0x00200800 (growth: 2^11)
    old length: 2^22 -> new length: 0x00400800 (growth: 2^11)
    old length: 2^23 -> new length: 0x00801000 (growth: 2^12)
    old length: 2^24 -> new length: 0x01001000 (growth: 2^12)
    old length: 2^25 -> new length: 0x02002000 (growth: 2^13)
    old length: 2^26 -> new length: 0x04002000 (growth: 2^13)
    old length: 2^27 -> new length: 0x08004000 (growth: 2^14)
    old length: 2^28 -> new length: 0x10004000 (growth: 2^14)
    old length: 2^29 -> new length: 0x20008000 (growth: 2^15)
    old length: 2^30 -> new length: 0x40008000 (growth: 2^15)
    old length: 2^31 -> new length: 0x80010000 (growth: 2^16)

[1]:https://cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf

1 more reply

emmelaich11mo ago

I remember reading (decades ago) an extensive article in Software Practice and Experience reaching the same conclusion.

manwe15011mo ago

Or like Python shows there, 1.25+k which can be better (faster growth and less memory wasted) than both

1 more reply

burnt-resistor11mo ago

ranger_danger11mo ago· 2 in thread

You might be interested in https://github.com/antirez/sds

fsckboy11mo ago

neat, i like it, has some of the same ideas i've used in my string packages

hey, OP said he wants to be a better C programmer!

ranger_danger11mo ago

> The null terminator is a wasted field

I think that would break its "Compatible with normal C string functions" feature.

1 more reply

p0w3n3d11mo ago· 2 in thread

By the way, I have no power here, but you people are ubiquitous, at least one of you is in C committee.

Hereby I propose an addition to C:

  namespace something { }

Which is preprocessing to something_ before every function

  class someclass { }

Which is preprocessing all the functions inside to

  someclass_fn1(someclass\* asfirstparameter, ...)

And of course the final syntax sugar

  cl\* c;
  c->fn1(a,b)

I mean this would make C much easier, as we already code object oriented in it, but the amount of preprocessing, unreadability that needs to be done in headers is simply brain exhausting

masfoobar11mo ago

????

You are just HIDING whats really going on

This proposal you are suggesting -- I am going to call this Typescript-C.

The problem is that this namespace/class overlay makes much ASSUMPTIONS what the programmer wants. Lets dig into this further. Here is how (I believe) you see it implemented :-

  // In my TypeScript-C.
  namespace Foo {

    class Bar {

      int baz;

      void init() {
        this->baz = 10;
      }

      void free() {
        // whatever
      }

      int process(int withval) {
        return withval + this->baz;
      }
    }
  }

  // calling example :-
  Foo.Bar sample;
  sample->init();
  printf("%d\n", sample->process(100));
  sample->free();

This would translate to :-

  // Translated C...
  typedef struct Foo_Bar {
    int baz;
  }

  void Foo_Bar_init(Foo_Bar *self) {
    self->baz = 10;
  }

  void Foo_Bar_free(Foo_Bar *self) {
    // whatever
  }

  int Foo_bar_process(Foo_Bar \*self, int withval) {
    return withval + self->baz;
  }

  // calling example :-
  Foo_Bar sample;
  Foo_Bar_init(&sample);
  printf("%d\n", Foo_Bar_process(&sample, 100));
  Foo_Bar_free();

If anyone is THAT bad they cannot write proper C code, then I suggest they move to C++, Java, C#, Dlang.. or whatever.

Its just adding extra fluff -- and C guys like to SEE what is going on. (Thats why it takes ages to truly grasp C++ internals)

p0w3n3d11mo ago

However while we're looking at any library, GTK for example, those structs are made like that. But you're right - this only looks simple, as there are many pitfalls we could hit when implementing it.

1 more reply

kazinator11mo ago

There is a way for this not to terminate:

  while (new_capacity < required) {
    new_capacity *= 2;
  }

1. All variables are unsigned (due to being size_t). So we don't worry about overflow UB.

2. new_capacity * 2 always produces an even number, whether truncating or not.

3. Supppose required is SIZE_MAX, the highest value of size_t; note that this is an odd number.

4. Therefore new_capacity * 2 is always < required; loop does not terminate.

burnt-resistor11mo ago

gblargg11mo ago

It's odd how it has error reporting in some areas (alloc, split can return NULL if allocation fails), but not others (append, prepend have a void return type but might require allocation internally).

teo_zero11mo ago

What I don't like is that some functions take as arguments a mix of StringBuffer's and regular C strings. This is confusing. For example why this:

  void StringBuffer_replace(StringBuffer *buf,
    const char *original,
    const char *update,
    size_t from);

instead of this:

  void StringBuffer_replace(StringBuffer *buf,
    const StringBuffer *original,
    const StringBuffer *update,
    size_t from);

improgrammer00711mo ago

j / k navigate · click thread line to collapse