Inline assembly in Linux (opens in new tab)

(github.com)

158 points0xAX10y ago33 comments

33 comments

I'm trying to add call/cc to node, or to lua.

Recap: call/cc is the ability to save the current state of a running thread, then revert to that state at a later point in time. In other words, at any point in your program, you can say "Save the current stack." It's saved as a function. Later, whenever you call that function, the current stack is thrown out, and replaced with the saved stack.

This is very useful for a number of reasons. It's also a very rare feature to have in your language.

Neither node nor lua has support for this. The closest I've found is a Lua extension which adds coroutine.clone(). In principle, this is the solution. In practice, it has a number of limitations, such as restrictions on when you're allowed to call coroutine.clone(). (For example, if your stack looks like Lua -> C -> Lua, then it won't work.)

I tried to channel my inner Mike Pall and solve this problem once and for all, but I'm not Mike Pall, and this is really hard. I was hoping you might know of any possible solution which is (a) practical, (b) cross-platform, and (c) works in all cases.

Why post this here? Because this post happens to attract exactly the kind of people that might know a way forward. There must be a way. Apologies for the off-topic comment.

kenOfYugen10y ago

James Long is currently playing with call/cc in JavaScript [1].

He mentioned a paper titled "Exceptional Continuations in JavaScript" [2], that describes the method he used in his implementation.

Maybe you should get in touch?

[1] https://twitter.com/jlongster/status/726259468405751808

[2] http://www.schemeworkshop.org/2007/procPaper4.pdf

sitkack10y ago

I assume you have read http://okmij.org/ftp/continuations/against-callcc.html

haberman10y ago

> if your stack looks like Lua -> C -> Lua, then it won't work.

I don't think you can safely solve this in the general case. There is a key problem I don't think you can work around.

Say your stack looks like C(1) -> Lua -> C -> Lua. The outermost C frames might not know anything about Lua (they just use some library that uses Lua as a library). Say you try to take a snapshot of this stack to create a continuation. You probably just want to snapshot the Lua -> C -> Lua part, since that is the portion of the stack representing the execution of the Lua program.

Now say all these frames return. Then the main program calls Lua again, through through a slightly different code-path, and now you have C(2) -> Lua. Say the embedded Lua program decides to resume the continuation.

Now keep in mind that the C stack is not position-independent. The C stack can contain pointers to the C stack, so when you resume, you need your resumed stack to live at exactly the same address as last time it ran. But what if C(1) and C(2) are not exactly the same size? What if we called one extra function before calling Lua the second time? It is impossible to copy the continuation's C stack back into its original position. So it's impossible to resume the continuation.

You could try to snapshot the entire C stack to get around this, including the outermost C frames. But this would be most unexpected for the C program that is using the Lua interpreter. Lua is supposed to just be a regular C library: you call a function, it does things, and then returns. It wouldn't be acceptable that calling a Lua interpreter function like lua_call() backs your entire C program to a previous state just because the embedded Lua program used a fancy feature called continuations!

There are many other things that would make this tricky at best to get working, but I think the problem above really tanks the idea completely.

sillysaurus310y ago

But what if C(1) and C(2) are not exactly the same size?

Say you want to resume continuation K, which has a stack of some size N.

The current thread has a stack of size M. If M >= N, everything is fine: you can safely overwrite the current stack with K's stack.

If M < N, recurse until M >= N.

You could try to snapshot the entire C stack to get around this, including the outermost C frames.

Indeed! This is a solution.

It wouldn't be acceptable...

I like doing unacceptable things in my programs. It's the best part of programming, really.

There are a lot of solid arguments against call/cc. I think the most persuasive argument in favor of call/cc is that you become more powerful. Whatever metric you use to measure power, call/cc will improve it: Smaller code, less time spent writing code, and you can even write algorithms that you otherwise would not be able to.

Personally, I want call/cc in order to be able to use choose and fail. It's the ability to write programs that are guaranteed to never call fail(). pg explains it well:

"For example, this is a perfectly legitimate nondeterministic algorithm for discovering whether you have a known ancestor called Igor:

  Function Ig(n)
    if name(n) = ‘Igor’
      return n
    if parents(n)
      return Ig(choose(parents(n)))
    fail

The fail operator is used to influence the value returned by choose. If we ever encounter a fail, choose would have chosen incorrectly. By definition choose guesses correctly."

Call/cc makes this possible. There are a lot of fun things to do. The last few chapters of On Lisp show some particularly interesting sketches.

2 more replies

aktau10y ago

I'm sure you aware you can implement call/1cc with Lua's coroutines: http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf. But maybe you want the full power of call/cc :).

to3m10y ago

If you want to actually clone the stack, you're probably stuck.

If you want to use this to do something that takes multiple invocations of lua_resume to complete without the calling Lua code being aware, you might be able to use lua_yieldk.

loeg10y ago

getcontext(3) / setcontext(3). Not standard, but exists in both Linux and FreeBSD.

nkurz10y ago

I've been using a lot of inline assembly lately, and while the Stockholm syndrome might be in effect, I'm coming to like the GCC syntax. For me, main thing that has helped has been to adopt a consistent syntax. Here's some examples of what I'm currently using for an AVX2 popcnt optimization, with some explanation.

  #define ASM_VEC_BYTE_COUNT_SET(vec, sum, mask, shuf)                  \
    __asm volatile ("vpsrld $4, %[VEC], %[SUM]\n"                       \
                    "vpand %[MASK], %[VEC], %[VEC]\n"                   \
                    "vpand %[MASK], %[SUM], %[SUM]\n"                   \
                    "vpshufb %[VEC], %[SHUF], %[VEC]\n"                 \
                    "vpshufb %[SUM], %[SHUF], %[SUM]\n"                 \
                    "vpaddb %[VEC], %[SUM], %[SUM]\n" :                 \
                    /* rd/wr ymm */ [VEC] "+&x" (vec),                  \
                    /* write ymm */ [SUM] "=&x" (sum) :                  \
        	    /* read ymm  */ [MASK] "x" (mask),                  \
                    /* read ymm  */ [SHUF] "x" (shuf))

1) Try to use the %[symbolic] syntax rather than %[n] numeric. It's slightly longer to write, but usually clearer to read. Use upper case for the symbolic name. Put your inputs one per line, with a preceding comment.

2) If you are using the same assembly more than once in your program, declare your assembly within a #define macro, then use the macro in your code.

3) Use "__asm volatile". Declaring "volatile" is not required, but once you are writing inline assembly you usually know more than the compiler about where the block should go.

5) If you have multiple lines of assembly and output registers, you are almost always safer to use "+&" and "=&" for your constraint rather than just "+" or "=". Search for "early clobber" for details.

6) Strongly prefer single type constraints. The more flexibility you give the compiler, the more likely it will defeat your efforts at optimization. Use explicit memory addressing modes rather than "m". The modifier "c" is needed for the offset.

  #define ASM_VEC_LOAD_OFFSET_MEM(off, mem, vec)                    \
    __asm volatile ("vmovdqu %c[OFF](%[MEM]), %[VEC]\n" :           \
                    /* destination */ [VEC] "=x" (vec) :            \
                    /* byte offset */ [OFF] "i" (off),              \
                    /* mem address */ [MEM] "r" (mem))

7) The register constraints for vectors are tricky, because the "x" constraint is used for both XMM and YMM vectors. There is no way to specify that one wants only one or the other. This sort of makes sense, since in hardware they share the same register. You can use the "q" modifier when you need to specify XMM syntax in the output when you need both forms of the same vector.

brigade10y ago

3 - using volatile for asm that doesn't have otherwise inexpressible side effects has the same askance that using it for thread safety has. If you think you need it, maybe you needed to add a "memory" clobber instead.

5 - I can't think of any meaning early clobber has on an input+output constraint ("+")?

6 - there are many cases where you really do want to give the compiler flexibility in addressing modes. Unfortunately clang tends to ignore that and generate (reg) regardless.

7 - not really different than GPRs; you use "r" as the constraint then a modifier like "k" for the size.

I guess the lesson is that yeah gcc inline asm is powerful, but they try to leave it undocumented for a reason. Also, who stole number 4?

nkurz10y ago

re 3: If it were for correctness, I'd agree. But I don't need volatile to make it work, I need it to produce the assembly I want. If one instruction can execute only on Port 1 (popcnt) and the other can execute on Ports 0, 1, 5, or 6, there's sometimes a 50% performance difference based on the order two seemingly independent instructions are executed. Volatile also prevents the compiler from hoisting loads ahead of my inline assembly, which sometimes makes a difference. Clobbering "mem" might force other reloads that I don't want to happen.

re 5: Barring compiler bugs, I think you'd be right if correctness was the only issue. But I'm pretty sure I've sometimes solved problems by adding it, although this may have been when working around the POPCNT bug that added a false dependency on the output. It also might have been when reading and writing a variable multiple times?

re 6: In theory, yes. But usually in these cases you should be writing intrinsics or straight C instead of inline assembly. The place where this comes up most for me is when I have two variables that use the same index, and I want to ensure "DEC/JNZ" fusion at the end of the loop. If I let the compiler choose, it will find a way to defeat me by incrementing both array addresses. The other case is when you explicitly want a store to use Port 7 for address generation, which only happens without an index register.

re 7: Yes, I just personally find it more confusing because "x" fits so well with "XMM", and thus it feels odd to use it when you want only a "YMM". Also, see here for problems with a Clang and %q[VEC]: http://stackoverflow.com/questions/34459803/in-gnu-c-inline-...

re 4: Oops, I forgot to renumber. I had another comment suggesting that one always use the "V" VEX prefix on vector commands and the explicit output register, but deleted it because it seemed off topic.

klodolph10y ago

Cannot disagree more about #3. You almost never want asm volatile. The compiler is mostly doing data flow analysis, and I've seen so many programmers who don't understand that. So, if the compiler's data flow analysis doesn't put your asm block where you want, you just give up and put "volatile" on it. NO! Just let the compiler figure it out. You may be smarter about generating the assembly in this case, but the compiler is still very good at putting the assembly in the right place in your code. Usually, if I see "asm volatile" in someone's code, I step back and think "there's probably something wrong with the assembly" and I go back and read the manual on asm operand constraints, and then I find something wrong with the constraints. With the correct constraints / clobbers in place, my experience is that removing "volatile" only improves things.

Of course this is not true for synchronization primitives and the like.

nkurz10y ago

I understand that most others share your position, and mostly agree when it comes to volatile variables. I'd also agree with you if removing "volatile" caused the code to break. But I think that it can be necessary for performance, and don't think that there are true downsides. I believe if you are using assembly it is because you don't want the compiler to attempt any further optimizations. For the cases when I want to drop to assembly, it's because I've already decided the register allocation and instruction ordering I want, and will verify the assembly that is generated.

My goal is to "lock in" an established level of performance once I've achieved it, so that compiler upgrades or changes don't result in performance drops. I often compare the output of multiple compilers with a matrix of optimization flags, choose the best blocks from each, and then hand-optimize from there while cross-referencing Agner's handbooks with Likwid's performance reports. If I've chosen to use inline assembly, the chances that the compiler will succeed in further optimizing my code is very low.

I realize it's not a popular view, but I think that using volatile with __asm is usually the correct approach. If you don't need "volatile", you probably should be using an intrinsic instead. I think the alternative (which may in fact be the better solution) is dropping to straight assembly for the entire function or distributing binary code.

1 more reply

nafest10y ago

I'd recommend to use intrinsics for SIMD vectorization, which is portable to platforms that don't support the GCC syntax (e.g. Windows with MSVC). You can use Intel's Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide...) to find the intrinsics that corresponds to the instructions you are using.

nkurz10y ago

Yes, that's a great link, and I agree that if you can get the performance you want with Intrinsics they are usually a better choice. But if you need compiler-portable high performance, I find that it can be really hard to get good performance on GCC, ICC, and Clang simultaneously with intrinsics.

Another approach that's not quite there yet but is becoming more possible is to use https://www.cilkplus.org to annotate your C code to force automatic vectorization. It's native to ICC, built-in to GCC 5.0+, and available as an extension to Clang: https://news.ycombinator.com/item?id=11550250

pjmlp10y ago

I never liked them.

The asm {} blocks of PC compilers are so much developer friendly.

I rather use an external Assembler than GCC's inline syntax.

ndesaulniers10y ago

Is it ever the case that inline assembly is required over separate object sources just in assembly? I would have thought it would be preferred to not use inline assembly, and simply link in object files of what you need. It would seem simpler syntax-wise, too. Why prefer inline assembly?

cyphar10y ago

There's several macros in the kernel which contain inline assembly, and you can't use code written in assembly because it would require using the stack to call the function (the case I'm thinking of is the switch_to macro which switches between tasks in the kernel).

rwmj10y ago

One reason is that you just want to call a single instruction, and the overhead of making a function call to another file would be too much. Picking an example from the Linux kernel pretty much at random:

    #define mb()    asm volatile("mfence":::"memory")
    #define rmb()   asm volatile("lfence":::"memory")
    #define wmb()   asm volatile("sfence" ::: "memory")

Those are memory barriers, so they essentially must be inline (they'd be too slow and possibly even change their meaning if they were located in a separate source file and you had to call them).

TwoBit10y ago

GCC inline assembly is one of the most terrible things I've ever had to work with. Somebody seriously need to redesign it or replace it. Aside from that,a decent reference manual for the existing version would be welcome.

AndyKelley10y ago

I'm working on a new language that has inline assembly and I pretty much just copied GCC's syntax[1]. Do you have any specific suggestions on how to do better?

[1]: https://github.com/andrewrk/zig/blob/master/std/linux_x86_64...

Ameo10y ago

Having lived my development life so far removed from the actual physical CPU/memory, thinking about implementing this kind of low-level stuff into actual code is mind-boggling to me.

j / k navigate · click thread line to collapse

33 comments

sillysaurus310y ago

I'm trying to add call/cc to node, or to lua.

This is very useful for a number of reasons. It's also a very rare feature to have in your language.

Why post this here? Because this post happens to attract exactly the kind of people that might know a way forward. There must be a way. Apologies for the off-topic comment.

kenOfYugen10y ago

James Long is currently playing with call/cc in JavaScript [1].

He mentioned a paper titled "Exceptional Continuations in JavaScript" [2], that describes the method he used in his implementation.

Maybe you should get in touch?

[1] https://twitter.com/jlongster/status/726259468405751808

[2] http://www.schemeworkshop.org/2007/procPaper4.pdf

sitkack10y ago

I assume you have read http://okmij.org/ftp/continuations/against-callcc.html

haberman10y ago

> if your stack looks like Lua -> C -> Lua, then it won't work.

I don't think you can safely solve this in the general case. There is a key problem I don't think you can work around.

There are many other things that would make this tricky at best to get working, but I think the problem above really tanks the idea completely.

sillysaurus310y ago

But what if C(1) and C(2) are not exactly the same size?

Say you want to resume continuation K, which has a stack of some size N.

The current thread has a stack of size M. If M >= N, everything is fine: you can safely overwrite the current stack with K's stack.

If M < N, recurse until M >= N.

You could try to snapshot the entire C stack to get around this, including the outermost C frames.

Indeed! This is a solution.

It wouldn't be acceptable...

I like doing unacceptable things in my programs. It's the best part of programming, really.

Personally, I want call/cc in order to be able to use choose and fail. It's the ability to write programs that are guaranteed to never call fail(). pg explains it well:

"For example, this is a perfectly legitimate nondeterministic algorithm for discovering whether you have a known ancestor called Igor:

  Function Ig(n)
    if name(n) = ‘Igor’
      return n
    if parents(n)
      return Ig(choose(parents(n)))
    fail

The fail operator is used to influence the value returned by choose. If we ever encounter a fail, choose would have chosen incorrectly. By definition choose guesses correctly."

Call/cc makes this possible. There are a lot of fun things to do. The last few chapters of On Lisp show some particularly interesting sketches.

2 more replies

aktau10y ago

I'm sure you aware you can implement call/1cc with Lua's coroutines: http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf. But maybe you want the full power of call/cc :).

to3m10y ago

If you want to actually clone the stack, you're probably stuck.

If you want to use this to do something that takes multiple invocations of lua_resume to complete without the calling Lua code being aware, you might be able to use lua_yieldk.

loeg10y ago

getcontext(3) / setcontext(3). Not standard, but exists in both Linux and FreeBSD.

nkurz10y ago

  #define ASM_VEC_BYTE_COUNT_SET(vec, sum, mask, shuf)                  \
    __asm volatile ("vpsrld $4, %[VEC], %[SUM]\n"                       \
                    "vpand %[MASK], %[VEC], %[VEC]\n"                   \
                    "vpand %[MASK], %[SUM], %[SUM]\n"                   \
                    "vpshufb %[VEC], %[SHUF], %[VEC]\n"                 \
                    "vpshufb %[SUM], %[SHUF], %[SUM]\n"                 \
                    "vpaddb %[VEC], %[SUM], %[SUM]\n" :                 \
                    /* rd/wr ymm */ [VEC] "+&x" (vec),                  \
                    /* write ymm */ [SUM] "=&x" (sum) :                  \
        	    /* read ymm  */ [MASK] "x" (mask),                  \
                    /* read ymm  */ [SHUF] "x" (shuf))

2) If you are using the same assembly more than once in your program, declare your assembly within a #define macro, then use the macro in your code.

3) Use "__asm volatile". Declaring "volatile" is not required, but once you are writing inline assembly you usually know more than the compiler about where the block should go.

  #define ASM_VEC_LOAD_OFFSET_MEM(off, mem, vec)                    \
    __asm volatile ("vmovdqu %c[OFF](%[MEM]), %[VEC]\n" :           \
                    /* destination */ [VEC] "=x" (vec) :            \
                    /* byte offset */ [OFF] "i" (off),              \
                    /* mem address */ [MEM] "r" (mem))

brigade10y ago

5 - I can't think of any meaning early clobber has on an input+output constraint ("+")?

6 - there are many cases where you really do want to give the compiler flexibility in addressing modes. Unfortunately clang tends to ignore that and generate (reg) regardless.

7 - not really different than GPRs; you use "r" as the constraint then a modifier like "k" for the size.

I guess the lesson is that yeah gcc inline asm is powerful, but they try to leave it undocumented for a reason. Also, who stole number 4?

nkurz10y ago

klodolph10y ago

Of course this is not true for synchronization primitives and the like.

nkurz10y ago

1 more reply

nafest10y ago

nkurz10y ago

pjmlp10y ago

I never liked them.

The asm {} blocks of PC compilers are so much developer friendly.

I rather use an external Assembler than GCC's inline syntax.

ndesaulniers10y ago

cyphar10y ago

rwmj10y ago

    #define mb()    asm volatile("mfence":::"memory")
    #define rmb()   asm volatile("lfence":::"memory")
    #define wmb()   asm volatile("sfence" ::: "memory")

Those are memory barriers, so they essentially must be inline (they'd be too slow and possibly even change their meaning if they were located in a separate source file and you had to call them).

TwoBit10y ago

AndyKelley10y ago

I'm working on a new language that has inline assembly and I pretty much just copied GCC's syntax[1]. Do you have any specific suggestions on how to do better?

[1]: https://github.com/andrewrk/zig/blob/master/std/linux_x86_64...

Ameo10y ago

Having lived my development life so far removed from the actual physical CPU/memory, thinking about implementing this kind of low-level stuff into actual code is mind-boggling to me.

j / k navigate · click thread line to collapse