I only used traits to more easily implement the scenes; a Scene needs to implement a new(), a start() and an update(), so that I can put them in an array and call them like scenes[current_scene_idx].update() from the main loop.
Also, I used some short and simple closures to avoid repeating the same code in many places (like a scope-local write() closure for the menus that wraps drawtext() with some default parameters).
The vast majority of the time is spent in the triangle filling code, where probably some autovectorization is going on when mixing colors. I tried some SIMD there on x86 and didn't see visible improvements.
Apart from obvious and low-hanging fruit (keeping structs simple, keeping the cache happy, don't pass data around needlessly) I didn't do anything interesting. And TBH profiling it shows a lot of cache misses, but I didn't bother further.