The Hunt for Error -22 (opens in new tab)

(tweedegolf.nl)

66 pointsmpalme1y ago25 comments

25 comments

19 comments · 9 top-level

macgyverismo1y ago· 6 in thread

I can see how the author came to the conclusion that libmodem written in rust would have prevented the issue, but isn't it simply pushing the problem further down the stack?

The author needed to use unsafe in order to pass his pointer to libmodem, but libmodem is going to require a pointer with static lifetime itself. Which would have prevented the issue in the first place had the author done this.

I can see why you wouldn't want to use static, it hinders testability, but that means you need to ensure that the pointer you supply libmodem outlives libmodem. I would use RAII to do that in C++ and I am sure in rust you could/would do the same.

I guess I am asking, is there anything here that a libmodem written in rust would have magically solved? It feels like wishful thinking, but I am open to learn where I am mistaken.

In any case, kudos for finding this bug. Having worked with Zephyr/NRF connect SDK and this exact chip myself I can definitely relate to the pain they (can) bring.

swatcoder1y ago

The existing C interface doesn't have means to describe the lifetime of the data being passed in. It just takes a pointer. An experienced C programmer would often understand what's happening by convention and not encounter the problem.

But the custom Rust wrapper was composed as a game of telephone (ugh), with the author blindly mimicking "Jonathan" who seemed to have been blindly mimicking a sloppy (and later repaired) example from Nordic.

The argument is that if the library and its internals were originally written in Rust, which has richer semantics for object lifetimes, Rust would have been able to formally convey that the input data needed to outlive the individual function call, throwing an error at compile time.

The wrapper could have enforced this constraint itself, as it probably does now, but the handoff between Rust and C needs somebody to account for and understand the by convention stuff in C so that it can be expressed formally in Rust, and that human process failed to happen here.

rcxdude1y ago

Part of the issue is that there's not really a convention in C. If it's not documented, you should probably read the source code to find out. (C programmers often think there's a convention, but that's because there's one option that's obvious to them but then other programmers will have a different 'obvious' option, which is why this is so often not documented at all)

wrs1y ago

If I read the article correctly, Nordic changed the rules on this function without saying anything. It used to work with a stack-allocated config and now it doesn’t. The only way for the caller to know about that in C is a comment.

1 more reply

orf1y ago

> I guess I am asking, is there anything here that a libmodem written in rust would have magically solved?

I'm not following your comment, but I think the point is simply "the lifetime of the config is in the function signature, rather than hopefully (sometimes) being in the documentation, and hopefully (sometimes) correct".

HALtheWise1y ago

It sounds like one function in libmodem accepts a pointer to a configuration struct, then stores that pointer (or an interior pointer from within it), which is then later used by another libmodem function later. If all of libmodem were written in Rust, this could be done without any use of unsafe, but it would require the lifetime on the original "reference" to provably outlive the second function getting called, probably by being static.

consp1y ago

The author mentioned in the first chapter that everything works fine in rust, since it solves all problems. So I guess they throw "better in rust" against every problem.

The assumption nobody ever makes mistakes is mistake one.

fch421y ago· 4 in thread

Caution: I'm a C programmer "by history" and therefore I see "C Interfaces" as that and don't always / immediately view them through the rust lens of ownership...

Reading the article (nice troubleshooting story!), my summary, as a C programmer, is that the "C Interface" here "takes ownership". Given C cannot express this properly, a pointer is passed - and the called function "simply" makes the assumption that from hence-on, what was given to it will remain.

As "semantics" this (the need to pass an "owned" piece of data to a function) isn't unusual irrespective of the programming language. Just in case of Rust, this is explicit in the interface (if the func takes a non-ref arg, or a shared smart ref of sorts), while in C ... this can lead to errors of the observed kind. I haven't looked whether any of the sources or docs of libmodem say "this pointer must be either global/static or malloc'ed (and the caller shall not free it)".

A rust wrapper for this could / should possibly "leak a reference" here; Something that prevents the initialisation object from being dropped. yea, accepted, needs "nasty hacks" whether static lifetime, Pin, manual drops, explicit Arc leaks, ... possible though.

It'd be nice if libmodem were stricter about such ownership, agreed, and then a rust wrapper could take advantage. Takes time to evolve; is there a bug report / enhancement request out there for this in libmodem ?

steveklabnik1y ago

> Takes time to evolve; is there a bug report / enhancement request out there for this in libmodem ?

The end of the post says

> This would have been so simple to put in the docs. I've opened a ticket on their DevZone forum. As of writing they've still not updated the docs of the init function.

And they've replied

> Thank you for reporting this, it will be fixed in the next `libmodem` release by the end of the month.

kevin_thibedeau1y ago

C would benefit from a new keyword that allows a function to require a pointer argument be a non-stack object when passed from a function that returns. That would cover the most common cases where you accidentally pass in a temporary object that must persist.

bennettnate51y ago

The problem with this is that a pointer value may be passed through multiple functions or nontrivial control flow before reaching a function with this keyword. To effectively check if the pointer is in the stack or .rodata, it would need to do some kind of reverse data dependency analysis of all the possible places that pointer could come from to ensure none of the control flows could cause that pointer to be changed to a stack pointer.

1 more reply

cozzyd1y ago

It seems like it wouldn't be hard to support a vendor-specific function parameters attribute like __attribute__((ptr_in_section(".data",".rodata"))__ or __attribute__((ptr_not_in_section(".stack"))

1 more reply

Neywiny1y ago

I'll admit like others I'm more C biased. This whole debugging story is not very flattering. Tbh my team would not be happy if I told them I didn't 3-4 weeks on something because I used an unsupported language, library, debugger, changed compilers often (did I see him say nightly builds?? On production??), etc etc.

I'm still not sure I understand why he couldn't just diff between versions. And the black box thing seems like a fool's errand. If changing the order of random things makes the issue go away, you can't change anything. The only thing you can do is use the binary you already have. Especially because even if you have 2 not working versions, fixing one doesn't necessarily fix the other. This debug effort felt very sloppy.

It's also weird looking at a lot of this code. The first assembly function pushes 4 values on the stack and only needed to push 2. I've had my fair share of bugs that make me go to dissembly but that also felt very time wastey here. The author evidently did not have enough of a grasp of what to expect for it to help at all.

While true that nRF should've put something in a log, the author admits they don't support this development flow. It's like the old addage about APIs. Any change no matter how small will break someone's usage.

wrs1y ago

Two general debugging lessons to note here: (1) Data watchpoints are a lifesaver that should be in your toolbox. (2) When something funky is happening to a data structure in a C program, look at its address to see if it’s in the stack or the heap, in case the answer is a surprise.

cozzyd1y ago

Even if libmodem were written in rust, since they ship it as a blob, it would have to use the C ABI and you could easily have the same problem.

sagacity1y ago

This reminds of a similar issue I had recently when working with Rust and dx12 using the excellent windows-rs crates. Dx12 is designed around the concept of command buffers that only get submitted and evaluated at (usually) the end of the frame. This means that you need to take care to keep references around to the data you're submitting because otherwise dx12 will be looking at broken pointers once the command buffers actually get executed.

The tooling around this is much easier to deal with it, of course, since it's all just Windows and there's a bunch of sane debug layers that you can use.

Massive respect to be able to debug the issue on an embedded system!

diffuse_l1y ago

I'm not sure about the exact hardware setup used here, but is it possible that something like valgrind would have helped here?

I guess not, as the seems to run on the microcontroller, but I remember getting at least some warning from valgrind in similar situations

antithesis-nl1y ago

Yeah, overly complicated firmware like this just isn't great. It's always hard to debug and fix these issues, and there is no "if everything were just written in X" miracle-world where that goes away.

For example, I've encountered hardware that would occasionally write unexpected-error details to a memory location that was completely undocumented. And if you expect more than a shrug from a vendor after pointing out such things, well...

heywire1y ago

I love reading debugging stories like these. Thanks for sharing!

j / k navigate · click thread line to collapse

25 comments

19 comments · 9 top-level

macgyverismo1y ago· 6 in thread

I can see how the author came to the conclusion that libmodem written in rust would have prevented the issue, but isn't it simply pushing the problem further down the stack?

I guess I am asking, is there anything here that a libmodem written in rust would have magically solved? It feels like wishful thinking, but I am open to learn where I am mistaken.

In any case, kudos for finding this bug. Having worked with Zephyr/NRF connect SDK and this exact chip myself I can definitely relate to the pain they (can) bring.

swatcoder1y ago

rcxdude1y ago

wrs1y ago

1 more reply

orf1y ago

> I guess I am asking, is there anything here that a libmodem written in rust would have magically solved?

HALtheWise1y ago

consp1y ago

The author mentioned in the first chapter that everything works fine in rust, since it solves all problems. So I guess they throw "better in rust" against every problem.

The assumption nobody ever makes mistakes is mistake one.

fch421y ago· 4 in thread

Caution: I'm a C programmer "by history" and therefore I see "C Interfaces" as that and don't always / immediately view them through the rust lens of ownership...

steveklabnik1y ago

> Takes time to evolve; is there a bug report / enhancement request out there for this in libmodem ?

The end of the post says

> This would have been so simple to put in the docs. I've opened a ticket on their DevZone forum. As of writing they've still not updated the docs of the init function.

And they've replied

> Thank you for reporting this, it will be fixed in the next `libmodem` release by the end of the month.

kevin_thibedeau1y ago

bennettnate51y ago

1 more reply

cozzyd1y ago

It seems like it wouldn't be hard to support a vendor-specific function parameters attribute like __attribute__((ptr_in_section(".data",".rodata"))__ or __attribute__((ptr_not_in_section(".stack"))

1 more reply

Neywiny1y ago

wrs1y ago

cozzyd1y ago

Even if libmodem were written in rust, since they ship it as a blob, it would have to use the C ABI and you could easily have the same problem.

sagacity1y ago

The tooling around this is much easier to deal with it, of course, since it's all just Windows and there's a bunch of sane debug layers that you can use.

Massive respect to be able to debug the issue on an embedded system!

diffuse_l1y ago

I'm not sure about the exact hardware setup used here, but is it possible that something like valgrind would have helped here?

I guess not, as the seems to run on the microcontroller, but I remember getting at least some warning from valgrind in similar situations

antithesis-nl1y ago

heywire1y ago

I love reading debugging stories like these. Thanks for sharing!

j / k navigate · click thread line to collapse