A big-hammer approach is to set thread affinity for your process to one hyperthread/processor. But that loses the opportunity for lovely parallelism.
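For what it's worth, here is a minimal sketch of the big hammer, assuming Linux with glibc (other systems expose affinity through different calls). The helper name `pin_to_cpu` is made up for illustration; `sched_setaffinity` with a pid of 0 applies to the calling thread, so to confine a whole process you would call it before spawning any threads and let them inherit the mask.

```c
#define _GNU_SOURCE          /* needed for cpu_set_t macros and sched_setaffinity */
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to one logical CPU.
 * Returns 0 on success, -1 on failure (errno is set by the kernel call). */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);          /* start with an empty CPU mask */
    CPU_SET(cpu, &set);      /* allow exactly one logical CPU */

    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Threads created after this call inherit the one-CPU mask, which is exactly why the parallelism goes away.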
A finer-grained approach is to have a flag bit that prevents preemption, perhaps even just preemption by other threads of the same process. This is weaker than CLI because it doesn't stop I/O callbacks etc. from preempting; ideally those would be suspended for the process as well.
This assumes a non-privileged flag word, i.e. user-mode code owns the "process flags", not the kernel.
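To make the flag-word idea concrete, here is a rough user-space simulation. Everything in it is hypothetical: the bit name `PF_NO_PREEMPT`, the `process_flags` word, and the function names are invented for illustration, and no existing kernel is claimed to honor such a flag. `preemption_allowed` stands in for the check a scheduler would make before switching in another thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit in a user-mode-owned flag word. */
#define PF_NO_PREEMPT 0x1u

/* Stand-in for the per-process flag word that user code owns. */
static atomic_uint process_flags;

/* User code sets the bit before a short critical region... */
void no_preempt_begin(void)
{
    atomic_fetch_or(&process_flags, PF_NO_PREEMPT);
}

/* ...and clears it afterwards. */
void no_preempt_end(void)
{
    atomic_fetch_and(&process_flags, ~PF_NO_PREEMPT);
}

/* Simulated scheduler-side check: may this process be preempted now? */
bool preemption_allowed(void)
{
    return (atomic_load(&process_flags) & PF_NO_PREEMPT) == 0;
}
```

Setting and clearing the bit is a single atomic instruction in user mode, which is the whole appeal: no system call on either side of the critical region.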
My favorite solution is a "process signal register" in hardware. It's a wide register full of test-and-set bits, shared by the threads of a process. The bits can be used to implement critical sections, semaphores, events, even waiting on a timer. All without a trip thru the kernel - essentially zero-latency kernel primitives.
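No such register exists to program against, so the following is only a software stand-in using C11 atomics: a 64-bit word plays the register, and the `psr_*` names and the 64-bit width are my assumptions. It shows how a critical section and an event fall out of test-and-set bits. The waits here are busy loops; how a hardware version would park a waiting thread is left open.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Software stand-in for the hypothetical register: 64 test-and-set bits
 * shared by every thread of the process. */
static _Atomic uint64_t signal_register;

/* Atomically set bit n and return whether it was already set. */
bool psr_test_and_set(unsigned n)
{
    uint64_t mask = UINT64_C(1) << n;
    return (atomic_fetch_or(&signal_register, mask) & mask) != 0;
}

/* Atomically clear bit n. */
void psr_clear(unsigned n)
{
    atomic_fetch_and(&signal_register, ~(UINT64_C(1) << n));
}

/* Read bit n without modifying it. */
bool psr_test(unsigned n)
{
    return (atomic_load(&signal_register) >> n) & 1u;
}

/* Critical section: spin until this thread is the one that set the bit. */
void psr_lock(unsigned n)   { while (psr_test_and_set(n)) { /* spin */ } }
void psr_unlock(unsigned n) { psr_clear(n); }

/* Event: one thread raises the bit, others wait until they see it. */
void psr_event_raise(unsigned n) { (void)psr_test_and_set(n); }
void psr_event_wait(unsigned n)  { while (!psr_test(n)) { /* spin */ } }
void psr_event_reset(unsigned n) { psr_clear(n); }
```

A thread would wrap shared-state updates in `psr_lock(0)` / `psr_unlock(0)`, with each bit number acting as one named primitive.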