Love C, hate C: Web framework memory problems (opens in new tab)

(alew.is)

165 pointsOneLessThing8mo ago216 comments

216 comments

83 comments · 15 top-level

bluetomcat8mo ago· 25 in thread

Good C code will try to avoid allocations as much as possible in the first place. You absolutely don’t need to copy strings around when handling a request. You can read data from the socket in a fixed-size buffer, do all the processing in-place, and then process the next chunk in-place too. You get predictable performance and the thing will work like precise clockwork. Reading the entire thing just to copy the body of the request in another location makes no sense. Most of the “nice” javaesque XXXParser, XXXBuilder, XXXManager abstractions seen in “easier” languages make little sense in C. They obfuscate what really needs to happen in memory to solve a problem efficiently.

lelanthran8mo ago

> Good C code will try to avoid allocations as much as possible in the first place.

I've upvoted you, but I'm not so sure I agree though.

Sure, each allocation imposes a new obligation to track that allocation, but on the downside, passing around already-allocated blocks imposes a new burden for each call to ensure that the callees have the correct permissions (modify it, reallocate it, free it, etc).

If you're doing any sort of concurrency this can be hard to track - sometimes it's easier to simply allocate a new block and give it to the callee, and then the caller can forget all about it (callee then has the obligation to free it).

obviouslynotme8mo ago

The most important pattern to learn in C is to allocate a giant arena upfront and reuse it over and over in a loop. Ideally, there is only one allocation and deallocation in the entire program. As with all things multi-threaded, this becomes trickier. Luckily, web servers are embarrassingly parallel, so you can just have an arena for each worker thread. Unluckily, web servers do a large amount of string processing, so you have to be careful in how you build them to prevent the memory requirements from exploding. As always, tradeoffs can and will be made depending on what you are actually doing.

Short-run programs are even easier. You just never deallocate and then exit(0).

4 more replies

17186274408mo ago

To reduce the amount of allocation instead of:

    struct parsed_data * = parse (...);
    struct process_data * = process (..., parsed_data);
    struct foo_data * = do_foo (..., process_data);

you can do

    parse (...) {
        ...
        process (...);
        ...
    }

    process (...) {
        ...
        do_foo (...);
        ...
    }

It sounds like violating separation of concerns at first, but it has the benefit, that you can easily do procession and parsing in parallel, and all the data can become readonly. Also I was impressed when I looked at a call graph of this, since this essentially becomes the documentation of the whole program.

1 more reply

throwawaymaths8mo ago

is there any system where doing the basics of http (everything up to framework handoff of structured data) are done outside of a single concurrency unit?

1 more reply

lock18mo ago

Why does "good" C have to be zero alloc? Why should "nice" javaesque make little sense in C? Why do you implicitly assume performance is "efficient problem solving"?

Not sure why many people seem fixated on the idea that using a programming language must follow a particular approach. You can do minimal alloc Java, you can simulate OOP-like in C, etc.

Unconventional, but why do we need to restrict certain optimizations (space/time perf, "readability", conciseness, etc) to only a particular language?

bluetomcat8mo ago

Because in C, every allocation incurs a responsibility to track its lifetime and to know who will eventually free it. Copying and moving buffers is also prone to overflows, off-by-one errors, etc. The generic memory allocator is a smart but unpredictable complex beast that lives in your address space and can mess your CPU cache, can introduce undesired memory fragmentation, etc.

In Java, you don't care because the GC cleans after you and you don't usually care about millisecond-grade performance.

1 more reply

cogman108mo ago

> Why should "nice" javaesque make little sense in C?

Very importantly, because Java is tracking the memory.

In java, you could create an item, send it into a queue to be processed concurrently, but then also deal with that item where you created it. That creates a huge problem in C because the question becomes "who frees that item"?

In java, you don't care. The freeing is done automatically when nobody references the item.

In C, it's a big headache. The concurrent consumer can't free the memory because the producer might not be done with it. And the producer can't free the memory because the consumer might not have ran yet. In idiomatic java, you just have to make sure your queue is safe to use concurrently. The right thing to do in C would be to restructure things to ensure the item isn't used before it's handed off to the queue or that you send a copy of the item into the queue so the question of "who frees this" is straight forward. You can do both approaches in java, but why would you? If the item is immutable there's no harm in simply sharing the reference with 100 things and moving forward.

In C++ and Rust, you'd likely wrap that item in some sort of atomic reference counted structure.

lelanthran8mo ago

> Why does "good" C have to be zero alloc?

GP didn't say "zero-alloc", but "minimal alloc"

> Why should "nice" javaesque make little sense in C?

There's little to no indirection in idiomatic C compared with idiomatic Java.

Of course, in both languages you can write unidiomatically, but that is a great way to ensure that bugs get in and never get out.

2 more replies

estimator72928mo ago

Good C has minimal allocations because you, the human, are the memory allocator. It's up to your own meat brain to correctly track memory allocation and deallocation. Over the last century, C programmers have converged on some best practices to manage this more effectively. We statically allocate, kick allocations up the call chain as far as possible. Anything to get that bit of tracked state out of your head.

But we use different approaches for different languages because those languages are designed for that approach. You can do OOP in C and you can do manual memory management in C#. Most people don't because it's unnecessarily difficult to use languages in a way they aren't designed for. Plus when you re-invent a wheel like "classes" you will inevitably introduce a bug you wouldn't have if you'd used a language with proper support for that construct. You can use a hammer to pull out a screw, but you'd do a much better job if you used a screwdriver instead.

Programming languages are not all created equal and are absolutely not interchangeable. A language is much, much more than the text and grammar. The entire reason we have different languages is because we needed different ways to express certain classes of problems and constructs that go way beyond textual representation.

For example, in a strictly typed OOP language like C#, classes are hideously complex under the hood. Miles and miles of code to handle vtables, inheritance, polymorphism, virtual, abstract functions and fields. To implement this in C would require effort far beyond what any single programmer can produce in a reasonable time. Similarly, I'm sure one could force JavaScript to use a very strict typing and generics system like C#, but again the effort would be enormous and guaranteed to have many bugs.

We use different languages in different ways because they're different and work differently. You're asking why everyone twists their screwdrivers into screws instead of using the back end to pound a nail. Different tools, different uses.

riedel8mo ago

A long time ago I was involved in building compilers. It was common that we solved this problem with obstacks, which are basically stacked heaps. I wonder one could not build more things like this, where freeing is a bit more best effort but you have some checkpoints. (I guess one would rather need tree like stacks) Just have to disallow pointers going the wrong way. Allocation remains ugly in C and I think explicit data structures are are definitely a better way of handling it.

fulafel8mo ago

This shared memory and pointer shuffling is of course fraught with requiring correct logic to avoid memory safety bugs. Good C code doesn't get you pwned, I'd argue.

jenadine8mo ago

> Good C code doesn't get you pwned, I'd argue.

This is not a serious argument because you don't really define good C code and how easy or practical it is to do. The sentence works for every language. "Good <whatever language> code doesn't get you pwned"

But the question is whether "Average" or "Normal" C code gets you pwned? And the answer is yes, as told in the article.

1 more reply

wfn8mo ago

Agree re: no need for heap allocation - for others: I recommend reading thru whole masscan source (https://github.com/robertdavidgraham/masscan), it's a pleasure btw - iirc rather few/sparse malloc()s which are part of regular I/O processing flow (there will be malloc()s which depending on config etc. set up additional data structs but as part of setup).

fsckboy8mo ago

>Good C code will try to avoid allocations as much as possible in the first place.

there's a genius to this: if you're going to optimize prematurely, do it right out of the gate!

01HNNWZ0MV43FF8mo ago

Can you do parsing of JSON and XML without allocating?

veqq8mo ago

Of course. You can do it in a single pass/just parse the token stream. There are various implementations like: https://zserge.com/jsmn/

1 more reply

bluetomcat8mo ago

Yes, you can do it with minimal allocations - provided that the source buffer is read-only or is mutable but is unused later directly by the caller. If the buffer is mutable, any un-escaping can be done in-place because the un-escaped string will always be shorter. All the substrings you want are already in the source buffer. You just need a growable array of pointer/length pairs to know where tokens start.

acidx8mo ago

Yes! The JSON library I wrote for the Zephyr RTOS does this. Say, for instance, you have the following struct:

    struct SomeStruct {
        char *some_string;
        int some_number;
    };

You would need to declare a descriptor, linking each field to how it's spelled in the JSON (e.g. the some_string member could be "some-string" in the JSON), the byte offset from the beginning of the struct where the field is (using the offsetof() macro), and the type.

The parser is then able to go through the JSON, and initialize the struct directly, as if you had reflection in the language. It'll validate the types as well. All this without having to allocate a node type, perform copies, or things like that.

This approach has its limitations, but it's pretty efficient -- and safe!

Someone wrote a nice blog post about (and even a video) it a while back: https://blog.golioth.io/how-to-parse-json-data-in-zephyr/

The opposite is true, too -- you can use the same descriptor to serialize a struct back to JSON.

I've been maintaining it outside Zephyr for a while, although with different constraints (I'm not using it for an embedded system where memory is golden): https://github.com/lpereira/lwan/blob/master/src/samples/tec...

zzo38computer8mo ago

It depends what you intend to do with the parsed data, and where the input comes from and where the output will be going to. There are situations that allocations can be reduced or avoided, but that is not all of them. (In some cases, you do not need full parsing, e.g. to split an array, you can check if it is a string or not and the nesting level, and then find the commas outside of any arrays other than the first one, to be split.) (If the input is in memory, then you can also consider if you can modify that memory for parsing, which is sometimes suitable but sometimes not.)

However, for many applications, it will be better to use a binary format (or in some cases, a different text format) rather than JSON or XML.

(For the PostScript binary format, there is no escaping, and the structure does not need to be parsed and converted ahead of time; items in an array are consecutive and fixed size, and data it references (strings and other arrays) is given by an offset, so you can avoid most of the parsing. However, note that key/value lists in PostScript binary format is nonstandard (even though PostScript does have that type, it does not have a standard representation in the binary object format), and that PostScript has a better string type than JavaScript but a worse numeric type than JavaScript.)

megous8mo ago

Yes, you can first validate the buffer, to know it contains valid JSON, and then you can work with pointers to beginings of individual syntactic parts of JSON, and have functions that decide what type of the current element is, or move to the next element, etc. Even string work (comparisons with other escaped or unescaped strings, etc.) can be done on escaped strings directly without unescaping them to a buffer first.

Ergonomically, it's pretty much the same as parsing the JSON into some AST first, and then working on the AST. And it can be much faster than dumb parsers that use malloc for individual AST elements.

You can even do JSON path queries on top of this, without allocations.

Eg. https://xff.cz/git/megatools/tree/lib/sjson.c

gritzko8mo ago

Yep, no problem. In place parsing only requires a stack. Stack length is the maximum JSON nesting allowed. I have a C dialect exactly like that.

Ygg28mo ago

Theoretically yes. Practically there is character escaping.

That kills any non-allocation dreams. Moment you have "Hi \uxxxx isn't the UTF nice?" you will probably have to allocate. If source is read-only you have to allocate. If source is mutable you have to waste CPU to rewrite the string.

4 more replies

lelanthran8mo ago

> Can you do parsing of JSON and XML without allocating?

If the source JSON/XML is in a writeable buffer, with some helper functions you can do it. I've done it for a few small-memory systems.

self_awareness8mo ago

That mythical "Good C Code", which is known only to some people who I never met.

pjmlp8mo ago

These abstractions were already common in enterprise C code decades before Java came to be, thanks to stuff like Yourdon Structured Method.

Using fixed size buffers doesn't fix out of bounds errors, and stack corruption caused by such bugs.

Naturally we all know good C programmers never make them. /s

jacquesm8mo ago· 12 in thread

There are many, many more such issues with that code. The person that posted it is new to C and had an AI help them to write the code. That's a recipe for disaster, it means the OP does not actually understand what they wrote. It looks nice but it is full of footguns and even though it is a useful learning exercise it also is a great example of why it is better run battle tested frame works than to inexpertly roll your own.

As a learning exercise it is useful, but it should never see production use. What is interesting is that the apparent cleanliness of the code (it reads very well) is obscuring the fact that the quality is actually quite low.

If anything I think the conclusion should be that AI+novice does not create anything that is useable without expert review and that that probably adds up to a net negative other than that the novice will (hopefully) learn something. It would be great if someone could put in the time to do a full review of the code, I have just read through it casually and already picked up a couple of problems, I'm pretty sure that if you did a thorough job of it there would be many more.

drnick18mo ago

> What is interesting is that the apparent cleanliness of the code (it reads very well) is obscuring the fact that the quality is actually quite low.

I think this is a general feature and one of the greatest advantages of C. It's simple, and it reads well. Modern C++ and Rust are just horrible to look at.

messe8mo ago

I slightly unironically believe that one of the biggest hindrances to rust's growth is that it adopted the :: syntax from C++ rather than just using a single . for namespacing.

1 more reply

citbl8mo ago

The safer the C code, the more horrible it starts looking though... e.g.

    my_func(char msg[static 1])

4 more replies

nurettin8mo ago

> should never see production use.

I have an issue with high strung opinions like this. I wrote plenty of crappy delphi code while learning the language that saw production use and made a living from it.

Sure, it wasn't the best experience for users, it took years to iron out all the bugs and there was plenty of frustration during the support phase (mostly null pointer exceptions and db locks in gui).

But nobody would be better off now if that code never saw production use. A lot of business was built around it.

zdragnar8mo ago

Buggy code that just crashes or produces incorrect results are a whole different category. In C a bug can compromise a server and your users. See the openssl heart bleed vulnerability as a prime example.

Once upon a time, you could put up a relatively vulnerable server, and unless you got a ton of traffic, there weren't too many things that would attack it. Nowadays, pretty much anything Internet facing will get a constant stream of probes. Putting up a server requires a stricter mindset than it used to.

jacquesm8mo ago

There are minimum standards for deployment to the open web. I think - and you're of course entirely free to have a different opinion - that those are not met with this code.

1 more reply

lifthrasiir8mo ago

Yeah, I recently wrote a moderate amount of C code [1] entirely with Gemini and while it was much better than what I initially expected I needed a constant steering to avoid inefficient or less safe code. It needed an extensive fuzzing to get the minimal amount of confidence, which caught at least two serious problems---seriously, it's much better than most C programmers, but still.

[1] https://github.com/lifthrasiir/wah/blob/main/wah.h

jacquesm8mo ago

I've been doing this the better part of a lifetime and I still need to be careful so don't feel bad about it. Just like rust has an 'unsafe' keyword I realize all of my code is potentially unsafe. Guarding against UB, use-after-free, array overruns and so on is a lot of extra work and you only need to slip up once to have a bug, and if you're unlucky something exploitable. You get better at this over the years. But if I know something needs to be bullet proof the C compiler would not be my first tool of choice.

One good defense is to reduce your scope continuously. The smaller you make your scope the smaller the chances of something escaping your attention. Stay away from globals and global data structures. Make it impossible to inspect the contents of a box without going through a well defined interface. Use assertions liberally. Avoid fault propagation, abort immediately when something is out of the expected range.

1 more reply

OneLessThingOP8mo ago

This is exactly my problem with LLM C code, lack of confidence. On the other hand, when my projects get big enough to the point where I cannot keep the code base generally loaded into my brains cache they eventually get to the point where my confidence comes from extensive testing regardless. So maybe it's not such a bad approach.

I do think that LLM C code if made with great testing tooling in concert has great promise.

1 more reply

lelanthran8mo ago

> It needed an extensive fuzzing to get the minimal amount of confidence, which caught at least two serious problems---seriously, it's much better than most C programmers, but still.

How are you doing your fuzzing? You need either valgrind (or compiler sanitiser flags) in the loop for a decent level of confidence.

1 more reply

OneLessThingOP8mo ago

I agree that it reads really well which is why I was also surprised the quality is not high when I looked deeper. The author claims to have only used AI for the json code, so your conclusion may be off, it could just be a novice doing novice things.

I suppose I was just surprised to find this code promoted in my feed when it's not up to snuff. And I'm not hating, I do in fact love the project idea.

citbl8mo ago

The irony is also that AI could have been used to audit the code and find these issues. All the author had to do was to question.

messe8mo ago· 11 in thread

> Another interesting choice in this project is to make lengths signed:

There are good reasons for this choice in C (and C++) due to broken integer promotion and casting rules.

See: "Subscripts and sizes should be signed" (Bjarne Stroustrup) https://open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0...

As a nice bonus, it means that ubsan traps on overflow (unsigned overflows just wrap).

uecker8mo ago

I do not agree that the integer promotion or casting (?) rules are broken in C. That some people make mistakes because they do not know them is a different problem.

The reason you should make length signed is that you can use the sanitizer to find or mitigate overflow as you correctly observe, while unsigned wraparound leads to bugs which are basically impossible to find. But this has nothing to do with integer promotion and wraparound bugs can also create bugs in - say - Rust.

messe8mo ago

I meant implicit casting, but I guess that really falls under promotion in most cases where it's relevant here (I'm on a train from Aarhus to Copenhagen right now to catch a flight, and I've slept considerably less than usual, so apologies if I'm making some slight mistakes).

The issues really arise when you mix signed/unsigned arithmetic and end up promoting everything to signed unexpectedly. That's usually "okay", as long as you're not doing arithmetic on anything smaller than an int.

As an aside, if you like C enough to have opinions on promotion rules then you might enjoy the programming language Zig. It's around the same level as C, but with much nicer ergonomics, and overflow traps by default in Debug/ReleaseSafe optimization modes. If you want explicit two's complement overflow it has +%, *% and -% variants of the usual arithmetic operations, as well as saturating +|, *|, -| variants that clamp to [minInt(T), maxInt(T)].

EDIT to the aside: it's also true if you hate C enough to have opinions on promotion rules.

2 more replies

OneLessThingOP8mo ago

It's interesting to hear these takes. I've never had problems catching unsigned wrap bugs with plain old memory sanitizers, though I must admit to not having a lot of experience with ubsan in particular. Maybe I should use it more.

2 more replies

01HNNWZ0MV43FF8mo ago

> That some people make mistakes because they do not know them is a different problem.

We can argue til we're blue in the face that people should just not make any mistakes, but history is against us - People will always make mistakes.

That's why surgeons are supposed to follow checklists and count their sponges in and out

Sukera8mo ago

Could you expand on how these wraparound bugs happen in Rust? As far as I know, integer overflow panics (i.e. aborts) your code when compiled in debug mode, which I think is often used for testing.

bringbart8mo ago

>while unsigned wraparound leads to bugs which are basically impossible to find.

What?

unsigned sizes are way easier to check, you just need one invariant:

if(x < capacity) // good to go

Always works, regardless how x is calculated and you never have to worry about undefined behavior when computing x. And the same invariant is used for forward and backward loops - some people bring up i >= 0 as a problem with unsigned, but that's because you should use i < n for backward loops as well, The One True Invariant.

kstenerud8mo ago

Yup, unsigned math is just nasty.

Actually, unchecked math on an integer is going to be bad regardless of whether it's signed or unsigned. The difference is that with signed integers, your sanity check is simple and always the same and requires no thought for edge cases: `if(index < 0 || index > max)`. Plus ubsan, as mentioned above.

My policy is: Always use signed, unless you have a specific reason to use unsigned (such as memory addresses).

bringbart8mo ago

unsigned is easier: 'if(index >= max)' and has fewer edge cases because you don't need to worry about undefined behavior when computing index.

1 more reply

lelanthran8mo ago

> The difference is that with signed integers, your sanity check is simple and always the same and requires no thought for edge cases: `if(index < 0 || index > max)`

Wait, what? How is that easier than `if (index > max)`?

1 more reply

accelbred8mo ago

If using C23, _BitInt allows for integer types without promotion.

user____name8mo ago

I just put assertions to check the ranges of all sizes and indices upon function entry, doubles as documentation, and I mostly don't have to worry about signedness as a result.

yipikaya8mo ago· 7 in thread

As an aside, it's amusing that it took 25 years for C coders to embrace the C99 named struct designator feature:

    HttpParser parser = {
        .isValid = true,
        .requestBuffer = strdup(request),
        .requestLength = strlen(request),
        .position = 0,
    };

All the kids are doing it now!

rurban8mo ago

It's only Microsoft's fault to have not implemented it for decades in MSVC. They stayed at C89 forever.

flykespice8mo ago

I never understand the reason why Microsoft lagged so much behind on newer c standards adoption. Did their compiler infrastructure made it difficult to adopt newer standards flexible? Or they simply did not care?

1 more reply

17186274408mo ago

This is nice for constant data, but strdup can return NULL here, which is again never checked.

> it took 25 years for C coders to embrace the C99 named struct designator feature

Not sure if this actually true, but this is kind of the feature of C, 20 years old code or compiler is supposed to work just fine, so you just wait for some time to settle things. For fast and shiny, there is Javascript.

davemp8mo ago

I’m still regularly getting on projects and moving C89 variable declarations from the start of functions to where they’re initialized, but I guess it’s not the kids doing it.

17186274408mo ago

I only declare variables at the begin of a block, not because I would need C89 compatibility, but because I find it clearer to establish the working set of variables upfront. This doesn't restrict me in anyway, because I just start a new block, when I feel the need. I also try to keep the scope of a variable as small as possible.

mkfs8mo ago

> C89 variable declarations from the start of functions

Technically it's the start of a block.

1 more reply

varjag8mo ago

Some of the most famous C codebases (e.g. the Linux kernel) been using them for some time.

lelanthran8mo ago· 4 in thread

I can't completely blame the language here: anyone "coding" in a language new to them using an LLM is going to have real problems.

OneLessThingOP8mo ago

It's funny the author says this was 90% written without AI, and that AI was mostly used for the json code. I think they're just new to C.

Trust me I love C. Probably over 90% of my lifetime code has been written in C. But python newbies don't get their web frameworks stack smashed. That's kind of nice.

lelanthran8mo ago

> But python newbies don't get their web frameworks stack smashed. That's kind of nice.

Hah! True :-)

The thing is, smashed stacks are difficult to exploit deterministically or automatically. Even heartbleed, as widespread as it was, was not a guaranteed RCE.

OTOH, an exploit in a language like Python is almost certainly going to be easier to exploit deterministically. Log4j, for example, was a guaranteed exploit and the skill level required was basically "Create a Java object".

This is because of the ease with which even very junior programmers can create something that appears to run and work and not crash.

1 more reply

uyzstvqs8mo ago

It's a double-sided coin. LLMs are probably the best way to learn programming languages right now. But if you vibecode in a programming language that you don't understand, it's going to be a disaster sooner or later.

This is also the reason why AI will not replace any actual jobs with merit.

AdieuToLogic8mo ago

> LLMs are probably the best way to learn programming languages right now.

Books still exist, be they in print or electronic form.

2 more replies

kazinator8mo ago· 3 in thread

I definitely don't love C that does atoi on a Content-Length value that came from the network and passes that to malloc.

Even before we get to how a malicious would interact with malloc, there is this:

> The functions atof, atoi, atol, and atoll are not required to affect the value of the integer expression errno on an error. If the value of the result cannot be represented, the behavior is undefined. [ISO C N3220 draft]

That includes not only out-of-range values by garbage that cannot be converted to a number at all. atoi("foo") can behave in any manner whatsoever and return anything.

Those functions are okay to use on something that has been validated in a way that it cannot cause a problem. If you know you have a nonempty sequence of nothing but digits, possibly with a minus sign, and the number digits is small enough that the value will fit into int, you are okay.

> A malicious user can pass Content-Length of 4294967295

But why would they when it's fewer keystrokes to use -1, which will go to 4294967295 on a 32 bit malloc, while scaling to 18446744073709551615 on 64 bit?

noobermin8mo ago

This seems more like a coding problem than a C problem. Most people would at least validate input before allocating anything.

kazinator8mo ago

A HTTP request processing layer would typically have some configuration of the maximum content size that can be received in one request and reject anything larger.

That's why, e.g., you have to mess with a "php.ini" file to get your self-hosted webmail to handle large attachments (which are uploads from the browser UI to the back-end).

17186274408mo ago

> But why would they when it's fewer keystrokes to use -1, which will go to 4294967295 on a 32 bit malloc, while scaling to 18446744073709551615 on 64 bit?

If that user wants to exploit your application it's better not to pass such a high value, since malloc typically detects size > SIZE_MAX/2. But then this code also doesn't check for malloc to return NULL, so this might also what leads to an exploit.

acidx8mo ago· 2 in thread

One thing to note, too, is that `atoi()` should be avoided as much as possible. On error (parse error, overflow, etc), it has an unspecified return value (!), although most libcs will return 0, which can be just as bad in some scenarios.

Also not mentioned, is that atoi() can return a negative number -- which is then passed to malloc(), that takes a size_t, which is unsigned... which will make it become a very large number if a negative number is passed as its argument.

It's better to use strtol(), but even that is a bit tricky to use, because it doesn't touch errno when there's no error but you need to check errno to know if things like overflow happened, so you need to set errno to 0 before calling the function. The man page explains how to use it properly.

I think it would be a very interesting exercise for that web framework author to make its HTTP request parser go through a fuzz-tester; clang comes with one that's quite good and easy to use (https://llvm.org/docs/LibFuzzer.html), especially if used alongside address sanitizer or the undefined behavior sanitizer. Errors like the one I mentioned will most likely be found by a fuzzer really quickly. :)

MathMonkeyMan8mo ago

Unspecified, really? cppreference's [C documentation][1] says that it returns zero. The [OpenGroup][2] documentation doesn't specify a return value when the conversion can't be performed. This recent [draft][3] of the ISO standard for C says that if the value cannot be represented (does that mean over/underflow, bad parse, both, neither?), then it's undefined behavior.

So three references give three different answers.

You could always use sscanf instead, which tells you how many values were scanned (e.g. zero or one).

[1]: https://en.cppreference.com/w/c/string/byte/atoi.html

[2]: https://pubs.opengroup.org/onlinepubs/9799919799/functions/a...

[3]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf

acidx8mo ago

The Linux man page (https://man7.org/linux/man-pages/man3/atoi.3.html#VERSIONS) says that POSIX.1 leaves it unspecified. As you found out, it's really something that should be avoided as much as possible, because pretty much everywhere disagrees how it should behave, especially if you value portability.

sscanf() is not a good replacement either! It's better to use strtol() instead. Either do what Lwan does (https://github.com/lpereira/lwan/blob/master/src/lib/lwan-co...), or look (https://cvsweb.openbsd.org/src/lib/libc/stdlib/strtonum.c?re...) at how OpenBSD implemented strtonum(3).

For instance, if you try to parse a number that's preceded by a lot of spaces, sscanf() will take a long time going through it. I've been hit by that when fuzzing Lwan.

Even cURL is avoiding sscanf(): https://daniel.haxx.se/blog/2025/04/07/writing-c-for-curl/

1 more reply

AdieuToLogic8mo ago· 2 in thread

While the classic "Parse, don’t validate"[0] paper uses Haskell instead of C as its illustrative programming language, the approach detailed is very much applicable in these scenarios.

0 - https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

lelanthran8mo ago

> While the classic "Parse, don’t validate"[0] paper uses Haskell instead of C as its illustrative programming language, the approach detailed is very much applicable in these scenarios.

Good thing someone (i.e. me) took the time to demonstrate PdV in C: https://www.lelanthran.com/chap13/content.html

nxobject8mo ago

I appreciate that link - now I see the parallels between “consolidate allocation in C to the extent that the rest of your code doesn’t have to worry”, and “consolidate validation in C” to the extent that…”.

erichocean8mo ago· 2 in thread

Give Fil-C a try, the speed hit is pretty minimal and you get full memory safety.

https://fil-c.org/

Karrot_Kream8mo ago

Wow this is really cool, I'd never seen this before. Thanks!

adhamsalama8mo ago

Why isn't this used more?

dang8mo ago

Recent and related:

Show HN: I built a web framework in C - https://news.ycombinator.com/item?id=45526890 - Oct 2025 (208 comments)

jqpabc1238mo ago

Reads like an indictment of vibe coding.

LLMs are fundamentally probabilistic --- not deterministic.

This basically means that anything produced this way is highly suspect. And this framework is an example.

qalmakka8mo ago

Integer operations, the one thing in computers where basically there's no non-annoying way to do them right except being over pedantic with checks

qalmakka8mo ago

OT: using the `strcasecmp` family of functions is basically asking for trouble - unless you've previously set the locale to "C", which is basically the only locale with a defined behaviour. Otherwise you're basically bound to run onto very funny internationalisation issues you'd rather know nothing about (and fail the Turkey Test)

pizlonator8mo ago

Just compile it with Fil-C

ge968mo ago

Long as you allocate me, it's alright

j / k navigate · click thread line to collapse

216 comments

83 comments · 15 top-level

bluetomcat8mo ago· 25 in thread

lelanthran8mo ago

> Good C code will try to avoid allocations as much as possible in the first place.

I've upvoted you, but I'm not so sure I agree though.

obviouslynotme8mo ago

Short-run programs are even easier. You just never deallocate and then exit(0).

4 more replies

17186274408mo ago

To reduce the amount of allocation instead of:

    struct parsed_data * = parse (...);
    struct process_data * = process (..., parsed_data);
    struct foo_data * = do_foo (..., process_data);

you can do

    parse (...) {
        ...
        process (...);
        ...
    }

    process (...) {
        ...
        do_foo (...);
        ...
    }

1 more reply

throwawaymaths8mo ago

is there any system where doing the basics of http (everything up to framework handoff of structured data) are done outside of a single concurrency unit?

1 more reply

lock18mo ago

Why does "good" C have to be zero alloc? Why should "nice" javaesque make little sense in C? Why do you implicitly assume performance is "efficient problem solving"?

Not sure why many people seem fixated on the idea that using a programming language must follow a particular approach. You can do minimal alloc Java, you can simulate OOP-like in C, etc.

Unconventional, but why do we need to restrict certain optimizations (space/time perf, "readability", conciseness, etc) to only a particular language?

bluetomcat8mo ago

In Java, you don't care because the GC cleans after you and you don't usually care about millisecond-grade performance.

1 more reply

cogman108mo ago

> Why should "nice" javaesque make little sense in C?

Very importantly, because Java is tracking the memory.

In java, you don't care. The freeing is done automatically when nobody references the item.

In C++ and Rust, you'd likely wrap that item in some sort of atomic reference counted structure.

lelanthran8mo ago

> Why does "good" C have to be zero alloc?

GP didn't say "zero-alloc", but "minimal alloc"

> Why should "nice" javaesque make little sense in C?

There's little to no indirection in idiomatic C compared with idiomatic Java.

Of course, in both languages you can write unidiomatically, but that is a great way to ensure that bugs get in and never get out.

2 more replies

estimator72928mo ago

riedel8mo ago

fulafel8mo ago

This shared memory and pointer shuffling is of course fraught with requiring correct logic to avoid memory safety bugs. Good C code doesn't get you pwned, I'd argue.

jenadine8mo ago

> Good C code doesn't get you pwned, I'd argue.

But the question is whether "Average" or "Normal" C code gets you pwned? And the answer is yes, as told in the article.

1 more reply

wfn8mo ago

fsckboy8mo ago

>Good C code will try to avoid allocations as much as possible in the first place.

there's a genius to this: if you're going to optimize prematurely, do it right out of the gate!

01HNNWZ0MV43FF8mo ago

Can you do parsing of JSON and XML without allocating?

veqq8mo ago

Of course. You can do it in a single pass/just parse the token stream. There are various implementations like: https://zserge.com/jsmn/

1 more reply

bluetomcat8mo ago

acidx8mo ago

Yes! The JSON library I wrote for the Zephyr RTOS does this. Say, for instance, you have the following struct:

    struct SomeStruct {
        char *some_string;
        int some_number;
    };

This approach has its limitations, but it's pretty efficient -- and safe!

Someone wrote a nice blog post about (and even a video) it a while back: https://blog.golioth.io/how-to-parse-json-data-in-zephyr/

The opposite is true, too -- you can use the same descriptor to serialize a struct back to JSON.

zzo38computer8mo ago

However, for many applications, it will be better to use a binary format (or in some cases, a different text format) rather than JSON or XML.

megous8mo ago

Ergonomically, it's pretty much the same as parsing the JSON into some AST first, and then working on the AST. And it can be much faster than dumb parsers that use malloc for individual AST elements.

You can even do JSON path queries on top of this, without allocations.

Eg. https://xff.cz/git/megatools/tree/lib/sjson.c

gritzko8mo ago

Yep, no problem. In place parsing only requires a stack. Stack length is the maximum JSON nesting allowed. I have a C dialect exactly like that.

Ygg28mo ago

Theoretically yes. Practically there is character escaping.

4 more replies

lelanthran8mo ago

> Can you do parsing of JSON and XML without allocating?

If the source JSON/XML is in a writeable buffer, with some helper functions you can do it. I've done it for a few small-memory systems.

self_awareness8mo ago

That mythical "Good C Code", which is known only to some people who I never met.

pjmlp8mo ago

These abstractions were already common in enterprise C code decades before Java came to be, thanks to stuff like Yourdon Structured Method.

Using fixed size buffers doesn't fix out of bounds errors, and stack corruption caused by such bugs.

Naturally we all know good C programmers never make them. /s

jacquesm8mo ago· 12 in thread

drnick18mo ago

> What is interesting is that the apparent cleanliness of the code (it reads very well) is obscuring the fact that the quality is actually quite low.

I think this is a general feature and one of the greatest advantages of C. It's simple, and it reads well. Modern C++ and Rust are just horrible to look at.

messe8mo ago

I slightly unironically believe that one of the biggest hindrances to rust's growth is that it adopted the :: syntax from C++ rather than just using a single . for namespacing.

1 more reply

citbl8mo ago

The safer the C code, the more horrible it starts looking though... e.g.

    my_func(char msg[static 1])

4 more replies

nurettin8mo ago

> should never see production use.

I have an issue with high strung opinions like this. I wrote plenty of crappy delphi code while learning the language that saw production use and made a living from it.

But nobody would be better off now if that code never saw production use. A lot of business was built around it.

zdragnar8mo ago

jacquesm8mo ago

There are minimum standards for deployment to the open web. I think - and you're of course entirely free to have a different opinion - that those are not met with this code.

1 more reply

lifthrasiir8mo ago

[1] https://github.com/lifthrasiir/wah/blob/main/wah.h

jacquesm8mo ago

1 more reply

OneLessThingOP8mo ago

I do think that LLM C code if made with great testing tooling in concert has great promise.

1 more reply

lelanthran8mo ago

> It needed an extensive fuzzing to get the minimal amount of confidence, which caught at least two serious problems---seriously, it's much better than most C programmers, but still.

How are you doing your fuzzing? You need either valgrind (or compiler sanitiser flags) in the loop for a decent level of confidence.

1 more reply

OneLessThingOP8mo ago

I suppose I was just surprised to find this code promoted in my feed when it's not up to snuff. And I'm not hating, I do in fact love the project idea.

citbl8mo ago

The irony is also that AI could have been used to audit the code and find these issues. All the author had to do was to question.

messe8mo ago· 11 in thread

> Another interesting choice in this project is to make lengths signed:

There are good reasons for this choice in C (and C++) due to broken integer promotion and casting rules.

See: "Subscripts and sizes should be signed" (Bjarne Stroustrup) https://open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0...

As a nice bonus, it means that ubsan traps on overflow (unsigned overflows just wrap).

uecker8mo ago

I do not agree that the integer promotion or casting (?) rules are broken in C. That some people make mistakes because they do not know them is a different problem.

messe8mo ago

EDIT to the aside: it's also true if you hate C enough to have opinions on promotion rules.

2 more replies

OneLessThingOP8mo ago

2 more replies

01HNNWZ0MV43FF8mo ago

> That some people make mistakes because they do not know them is a different problem.

We can argue til we're blue in the face that people should just not make any mistakes, but history is against us - People will always make mistakes.

That's why surgeons are supposed to follow checklists and count their sponges in and out

Sukera8mo ago

Could you expand on how these wraparound bugs happen in Rust? As far as I know, integer overflow panics (i.e. aborts) your code when compiled in debug mode, which I think is often used for testing.

bringbart8mo ago

>while unsigned wraparound leads to bugs which are basically impossible to find.

What?

unsigned sizes are way easier to check, you just need one invariant:

if(x < capacity) // good to go

kstenerud8mo ago

Yup, unsigned math is just nasty.

My policy is: Always use signed, unless you have a specific reason to use unsigned (such as memory addresses).

bringbart8mo ago

unsigned is easier: 'if(index >= max)' and has fewer edge cases because you don't need to worry about undefined behavior when computing index.

1 more reply

lelanthran8mo ago

> The difference is that with signed integers, your sanity check is simple and always the same and requires no thought for edge cases: `if(index < 0 || index > max)`

Wait, what? How is that easier than `if (index > max)`?

1 more reply

accelbred8mo ago

If using C23, _BitInt allows for integer types without promotion.

user____name8mo ago

I just put assertions to check the ranges of all sizes and indices upon function entry, doubles as documentation, and I mostly don't have to worry about signedness as a result.

yipikaya8mo ago· 7 in thread

As an aside, it's amusing that it took 25 years for C coders to embrace the C99 named struct designator feature:

    HttpParser parser = {
        .isValid = true,
        .requestBuffer = strdup(request),
        .requestLength = strlen(request),
        .position = 0,
    };

All the kids are doing it now!

rurban8mo ago

It's only Microsoft's fault to have not implemented it for decades in MSVC. They stayed at C89 forever.

flykespice8mo ago

1 more reply

17186274408mo ago

This is nice for constant data, but strdup can return NULL here, which is again never checked.

> it took 25 years for C coders to embrace the C99 named struct designator feature

davemp8mo ago

I’m still regularly getting on projects and moving C89 variable declarations from the start of functions to where they’re initialized, but I guess it’s not the kids doing it.

17186274408mo ago

mkfs8mo ago

> C89 variable declarations from the start of functions

Technically it's the start of a block.

1 more reply

varjag8mo ago

Some of the most famous C codebases (e.g. the Linux kernel) been using them for some time.

lelanthran8mo ago· 4 in thread

I can't completely blame the language here: anyone "coding" in a language new to them using an LLM is going to have real problems.

OneLessThingOP8mo ago

It's funny the author says this was 90% written without AI, and that AI was mostly used for the json code. I think they're just new to C.

Trust me I love C. Probably over 90% of my lifetime code has been written in C. But python newbies don't get their web frameworks stack smashed. That's kind of nice.

lelanthran8mo ago

> But python newbies don't get their web frameworks stack smashed. That's kind of nice.

Hah! True :-)

The thing is, smashed stacks are difficult to exploit deterministically or automatically. Even heartbleed, as widespread as it was, was not a guaranteed RCE.

This is because of the ease with which even very junior programmers can create something that appears to run and work and not crash.

1 more reply

uyzstvqs8mo ago

This is also the reason why AI will not replace any actual jobs with merit.

AdieuToLogic8mo ago

> LLMs are probably the best way to learn programming languages right now.

Books still exist, be they in print or electronic form.

2 more replies

kazinator8mo ago· 3 in thread

I definitely don't love C that does atoi on a Content-Length value that came from the network and passes that to malloc.

Even before we get to how a malicious would interact with malloc, there is this:

That includes not only out-of-range values by garbage that cannot be converted to a number at all. atoi("foo") can behave in any manner whatsoever and return anything.

> A malicious user can pass Content-Length of 4294967295

But why would they when it's fewer keystrokes to use -1, which will go to 4294967295 on a 32 bit malloc, while scaling to 18446744073709551615 on 64 bit?

noobermin8mo ago

This seems more like a coding problem than a C problem. Most people would at least validate input before allocating anything.

kazinator8mo ago

A HTTP request processing layer would typically have some configuration of the maximum content size that can be received in one request and reject anything larger.

That's why, e.g., you have to mess with a "php.ini" file to get your self-hosted webmail to handle large attachments (which are uploads from the browser UI to the back-end).

17186274408mo ago

> But why would they when it's fewer keystrokes to use -1, which will go to 4294967295 on a 32 bit malloc, while scaling to 18446744073709551615 on 64 bit?

acidx8mo ago· 2 in thread

MathMonkeyMan8mo ago

So three references give three different answers.

You could always use sscanf instead, which tells you how many values were scanned (e.g. zero or one).

[1]: https://en.cppreference.com/w/c/string/byte/atoi.html

[2]: https://pubs.opengroup.org/onlinepubs/9799919799/functions/a...

[3]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf

acidx8mo ago

For instance, if you try to parse a number that's preceded by a lot of spaces, sscanf() will take a long time going through it. I've been hit by that when fuzzing Lwan.

Even cURL is avoiding sscanf(): https://daniel.haxx.se/blog/2025/04/07/writing-c-for-curl/

1 more reply

AdieuToLogic8mo ago· 2 in thread

While the classic "Parse, don’t validate"[0] paper uses Haskell instead of C as its illustrative programming language, the approach detailed is very much applicable in these scenarios.

0 - https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

lelanthran8mo ago

> While the classic "Parse, don’t validate"[0] paper uses Haskell instead of C as its illustrative programming language, the approach detailed is very much applicable in these scenarios.

Good thing someone (i.e. me) took the time to demonstrate PdV in C: https://www.lelanthran.com/chap13/content.html

nxobject8mo ago

erichocean8mo ago· 2 in thread

Give Fil-C a try, the speed hit is pretty minimal and you get full memory safety.

https://fil-c.org/

Karrot_Kream8mo ago

Wow this is really cool, I'd never seen this before. Thanks!

adhamsalama8mo ago

Why isn't this used more?

dang8mo ago

Recent and related:

Show HN: I built a web framework in C - https://news.ycombinator.com/item?id=45526890 - Oct 2025 (208 comments)

jqpabc1238mo ago

Reads like an indictment of vibe coding.

LLMs are fundamentally probabilistic --- not deterministic.

This basically means that anything produced this way is highly suspect. And this framework is an example.

qalmakka8mo ago

Integer operations, the one thing in computers where basically there's no non-annoying way to do them right except being over pedantic with checks

qalmakka8mo ago

pizlonator8mo ago

Just compile it with Fil-C

ge968mo ago

Long as you allocate me, it's alright

j / k navigate · click thread line to collapse