Show HN: A 6KB, fast and mostly-correct XML parser in C (opens in new tab)

(dev.yorhel.nl)

3 pointsyorhel12y ago5 comments

5 comments

This looks interesting. Is it brand new? I've seen parsers with similar names, but, of course, almost all short letter combinations containing xml have been used, so that's expected.

It doesn't allocate or buffer, so when it says that it verifies proper nesting, I assume that doesn't mean verifying that start tags and end tags have matching names. I think that would be impossible.

I've been looking for a decent C parser similar to this to try some experiments with. Dealing with XML can be dreary, but I like to try to think of how you might make an API that doesn't make your code suck. It would be at a higher level, so I might be able to use this.

It looks like you pass the parse function a character at a time. To be fast, you'd want to somehow encourage that to get inlined. I'm not sure how easy or possible it is when you put the function in a separate .c file. There would probably be ways to stuff everything into the header.

I'm interested particularly in parsers that don't buffer. It seems like a fundamentally more flexible and potentially better way to parse XML is by memory mapping and letting the VM do what it does best.

yorhelOP12y ago

It's new, yes. I wrote it about two months ago because I needed a parser and found the existing solutions too bloated.

> It doesn't allocate or buffer, so when it says that it verifies proper nesting, I assume that doesn't mean verifying that start tags and end tags have matching names. I think that would be impossible.

It does match start tags and end tags, and in order to do so it does, indeed, need a small buffer. However, the only thing that needs to be stored in this buffer is a stack of element names, and that rarely needs much space - a simple fixed-size buffer is sufficient. It can be safely allocated on the stack if the application wants to. See the documentation for the yxml_init() for more details. The "mostly-bufferless" part of the homepage refers to the lack of a complicated input buffer and the avoidance of complex buffer management for output where possible.

> It looks like you pass the parse function a character at a time. To be fast, you'd want to somehow encourage that to get inlined.

The function is quite large. You can probably get it inlined when put inside a header file and when it's only called in a single place, and you may indeed see a performance boost. I haven't measured how significant such an improvement may be.

However, if performance is the primary goal (it wasn't for yxml), a completely different design would probably be better. Yxml needs to manage an explicit state object for each input character. For each new character it has to look at the state to know how it's handled, handle it, and then update the state for the next character. I was honestly surprised to see that yxml performed quite well in my benchmarks.

A more efficient design is probably the goto-based state machine approach that, e.g. ragel[1] can output. The reason I didn't go for that approach is that it complicates buffering. You either need to have the full file mapped into memory (not possible when reading from a socket or from a compressed stream), or you need to implement your own buffering and provide hooks so that the application can fill the buffer when it is depleted. The latter approach is used by most other XML parsers.

[1] http://www.complang.org/ragel/

mtdewcmu12y ago

I was interested in yours specifically because it's the only one I've seen that doesn't buffer (almost). You might as well drop the last little buffer and be completely bufferless. Then it will never cause an error or block, and you'll have no dependencies. If you want to support validation, make a higher-level interface to this and do it there. It will be just as easy.

1 more reply

j / k navigate · click thread line to collapse

5 comments

mtdewcmu12y ago

This looks interesting. Is it brand new? I've seen parsers with similar names, but, of course, almost all short letter combinations containing xml have been used, so that's expected.

yorhelOP12y ago

It's new, yes. I wrote it about two months ago because I needed a parser and found the existing solutions too bloated.

> It looks like you pass the parse function a character at a time. To be fast, you'd want to somehow encourage that to get inlined.

[1] http://www.complang.org/ragel/

mtdewcmu12y ago

1 more reply

j / k navigate · click thread line to collapse