Regarding:
When we produce first and the database update fails (because of incorrect state) it means in the worst case we enter a loop of continuously sending out duplicate messages until the issue is resolved
I don't understand where either 1) the incorrect state or 2) the need to continuously send duplicate messages comes from.
Regarding:
The Job might still fail during execution, in which case it’s retried with exponential backoff, but at least no updates are lost. While the issue persists, further state change messages will be queued up also as Jobs (with same group value). Once the (transient) issue resolves, and we can again produce messages to Kafka, the updates would go out in logical order for the rest of the system and eventually everyone would be in sync.
This part is equivalent to Kafka-first, except with the extra steps of a job scheduling, grouping, tracking, and execution framework layered on top of it.
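To make the comparison concrete, here is a toy sketch of what I mean. Everything in it is hypothetical: `FlakyBroker` is an in-memory stand-in for Kafka, and `produce_first` and `job_queue` are simplified models of the two approaches, not the actual code under discussion. It only looks at ordering and retries during a transient outage; it does not model process crashes or where pending updates are persisted.

```python
from collections import deque


class FlakyBroker:
    """Hypothetical stand-in for Kafka: rejects the first `failures` sends."""

    def __init__(self, failures):
        self.failures = failures
        self.log = []

    def send(self, msg):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("broker unavailable")
        self.log.append(msg)


def produce_first(broker, updates):
    """Produce directly, retrying each update until the broker accepts it."""
    attempts = 0
    for msg in updates:
        while True:
            attempts += 1
            try:
                broker.send(msg)
                break
            except ConnectionError:
                continue  # real code would back off before retrying
    return attempts


def job_queue(broker, updates):
    """Queue every update as a job in one group; retry the head job until it succeeds."""
    group = deque(updates)
    attempts = 0
    while group:
        attempts += 1
        try:
            broker.send(group[0])
        except ConnectionError:
            continue  # the job stays at the head; later jobs wait behind it
        group.popleft()
    return attempts


updates = ["created", "paid", "shipped"]
direct, queued = FlakyBroker(failures=3), FlakyBroker(failures=3)
direct_attempts = produce_first(direct, updates)
queued_attempts = job_queue(queued, updates)
```

Under these assumptions both brokers end up with the same messages in the same order after the same number of send attempts. As far as I can tell, the only thing the job framework changes is where the waiting updates sit while the outage lasts.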