Actually, Duff's device is very likely to be a performance hit most of the time. It tends to behave badly with the branch predictor, and it creates unstructured code, which makes it harder for the compiler to reason about and optimize.
It should be noted, though, that Tom Duff himself admitted he had tried everything else to optimize that code and only this worked, and he never advocated it as a general optimization technique.
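For anyone who hasn't seen it, here is a minimal sketch of the idea in C. It's the memory-to-memory copy variant (both pointers advance) rather than Duff's original, which wrote to a single output register; the name `duff_copy` and the zero-count guard are my own additions for illustration. The `switch` jumps into the middle of the unrolled `do`/`while` body to handle the leftover `count % 8` elements, which is exactly the unstructured control flow mentioned above.

```c
#include <stddef.h>

/* Illustrative sketch only: copy `count` shorts using Duff's device.
 * `duff_copy` is a hypothetical name, not from any library. */
void duff_copy(short *to, const short *from, size_t count)
{
    if (count == 0)
        return;                     /* the device itself assumes count > 0 */

    size_t n = (count + 7) / 8;     /* number of passes through the loop */

    switch (count % 8) {            /* jump into the middle of the loop body */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

In practice I'd expect a plain `for` loop or `memcpy` to do at least as well with a modern compiler, but that's something to measure rather than assume.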
That's been my experience too, not just of Duff's device but in general with trying to unroll loops. There may have been a point where it was practical (although early in my career I didn't do the profile-first-optimise-later thing very well, I confess). But if you think you can manually unroll a loop in a way that beats the branch predictor and a good compiler on most modern hardware, you're probably fooling yourself.
I suspect I'm just not in the group of people for whom it is relevant any more. Way back I was optimising for console hardware when that hardware was quite rudimentary. Now that the hardware is much more sophisticated, I don't need it.
Can you say what kind of stuff you unroll loops for, and in what kinds of situations? Just out of curiosity.
It's worth noting that there are definitely exceptions where loop unrolling is a good practice for optimization. On obscure old architectures, and even some common ones, the CPU doesn't have things like branch prediction and caches, and your compiler is likely quite rudimentary, so the per-iteration compare and branch is a real cost that nothing else will remove for you.
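To make that concrete, here's a rough sketch in C of what a manual unroll looks like next to the plain loop. The function names `sum_plain` and `sum_unrolled4` and the unroll factor of 4 are just assumptions for illustration. The unrolled version does one loop test per four elements, and a short tail loop picks up whatever is left over.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one compare-and-branch per element. */
uint32_t sum_plain(const uint8_t *data, size_t len)
{
    uint32_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];
    return total;
}

/* Hand-unrolled by 4 (hypothetical example): one compare-and-branch
 * per four elements, then a tail loop for the remaining 0..3. */
uint32_t sum_unrolled4(const uint8_t *data, size_t len)
{
    uint32_t total = 0;
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    for (; i < len; i++)            /* leftover elements */
        total += data[i];

    return total;
}
```

On a simple in-order CPU with a basic compiler the second version can plausibly win. With a modern optimizing compiler I'd expect the two to end up much the same, so profile before keeping the uglier one.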