MLIR is going to be orthogonal to C++ performance. You're talking about the efficiency of an intermediate representation, but that IR still lowers to TF opcodes, which then have to execute natively anyway.
You can achieve the same efficiency in C++ via a custom TF op. You end up with native instructions either way. And you have access to the entire memmapped model in the op, allowing you to do as you please.
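To make the custom-op route concrete, here is a minimal sketch of what a C++ TF op looks like. The op name `MemmapLookup` and its trivial kernel body are purely illustrative, not a real op; the skeleton requires the TensorFlow headers and build setup, so it is a shape-of-the-API sketch rather than a drop-in implementation:

```cpp
// Illustrative custom TensorFlow op ("MemmapLookup" is a hypothetical
// name). Requires TensorFlow's C++ headers and build tooling.
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

REGISTER_OP("MemmapLookup")
    .Input("indices: int32")
    .Output("values: float")
    .SetShapeFn(shape_inference::UnchangedShape);

class MemmapLookupOp : public OpKernel {
 public:
  explicit MemmapLookupOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    const Tensor& indices = ctx->input(0);
    Tensor* out = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, indices.shape(), &out));

    // This loop compiles to native instructions, and the kernel can
    // read from any memory-mapped model data you hold (e.g. a pointer
    // set up at op construction time) -- placeholder work shown here.
    auto in_flat = indices.flat<int32>();
    auto out_flat = out->flat<float>();
    for (int i = 0; i < in_flat.size(); ++i) {
      out_flat(i) = static_cast<float>(in_flat(i));
    }
  }
};

REGISTER_KERNEL_BUILDER(Name("MemmapLookup").Device(DEVICE_CPU),
                        MemmapLookupOp);
```

Once compiled into a shared library, the same op loads with `tf.load_op_library` in a training notebook and links into the inference binary on device, which is what makes the single-debug-surface argument below work.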
You're only debugging in one place, because the same C++ code that runs during training in your notebook also runs during inference on your device.
Swift gives you ease of implementation and use. But your code will be less portable: you won't be able to run the same model on Android or on a server. And there's no inherent performance benefit over doing the same thing in C++, certainly not if your kernel is simple.