1. Leader replicates to followers
2. Followers get the messages and send an ACK to the leader
3. Leader gets the confirmation from the followers, increments the high watermark (HW) and replies to the client that the message is committed.
4. NOW: Leader fails before it can piggyback to the followers that the HW has been incremented.
5. The question is: since the potential leader is not aware that the HW has been incremented, it becomes the new leader and truncates its log to the old HW, which means we have lost a log entry that was confirmed by the previous leader.
As a result, the client has been told that the message was written successfully, but it has been lost.
In step 5 of your example, the new leader does not truncate its log; only the new followers do.
Basically, every message that was committed is guaranteed to be in every in-sync replica's log. (Some of these may be after the replica's HW.) But the converse is not true: some replicas' logs will contain messages that may not have been committed.
So the new leader has to keep its entire log -- which includes all acknowledged messages, plus some unacknowledged ones that become committed as a result of the leadership transition. But the new followers have to discard messages above their HW, because those messages might not be present on the new leader.
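To make that concrete, here is a toy model of the failover in your example. It is only a sketch of the behavior described above, not Kafka's actual code: the `Replica` class, the `fail_over` function and the message names are all made up for illustration, and the HW is modeled as a simple count of entries known to be committed.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; not a Kafka class."""
    name: str
    log: list = field(default_factory=list)
    hw: int = 0  # number of entries this replica knows to be committed

    @property
    def leo(self):
        # Log end offset: position just past the last entry in the log.
        return len(self.log)


def fail_over(new_leader, followers):
    """Apply the rules from the comment above after the old leader dies."""
    # The new leader keeps its entire log and takes its LEO as the new HW.
    new_leader.hw = new_leader.leo
    for f in followers:
        # A new follower discards everything above its own HW ...
        del f.log[f.hw:]
        # ... then copies whatever it is missing from the new leader.
        f.log.extend(new_leader.log[f.leo:])
        f.hw = new_leader.hw


# The old leader had [m1, m2], advanced its HW to 2, acked m2 to the client
# and died before the followers learned about the new HW.
b = Replica("B", ["m1", "m2"], hw=1)
c = Replica("C", ["m1", "m2"], hw=1)

fail_over(new_leader=b, followers=[c])
print(b.log, b.hw)
print(c.log, c.hw)
```

In this model B never truncates, so m2 survives even though B's HW was still 1 at the moment of the failure. C drops m2 briefly and then gets it back from B.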
> The replica that registers first becomes the new leader. The new leader chooses its LEO as the new HW.
Isn't it a better idea to elect the process that has the largest offset? Basically, during leader election, a process should get confirmation from a majority of correct processes. If another process observes that it has a larger offset in its log, it rejects the election and starts another round. Eventually a correct process with the largest offset will become leader. In terms of correctness, this sounds like a safer election to me.
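A rough sketch of the two selection rules side by side, just to show what I mean. The candidate list, the offsets and both function names are hypothetical, and this skips the voting rounds entirely; it only compares who ends up as leader.

```python
def elect_first_registered(candidates):
    # Rule from the quote: whoever registers first wins.
    return candidates[0]


def elect_largest_offset(candidates):
    # Proposed rule: the candidate with the largest log end offset wins.
    # On a tie, max() keeps the earliest-registered candidate.
    return max(candidates, key=lambda c: c[1])


# (name, log end offset), listed in registration order; values are made up.
candidates = [("B", 5), ("C", 7), ("D", 6)]

print(elect_first_registered(candidates))
print(elect_largest_offset(candidates))
```

With the first rule B wins and C and D have to throw away their extra entries; with the second rule C wins and nothing that any surviving replica has is discarded.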
Kafka is an at-least-once system, not transactional.
It's been a while since I've read the code though, so it might have changed.
Imagine Kafka but I am able to run my processors within the same JVM. You might even call them servlets for Kafka.
If you want to work on streaming data (streaming logs), you can use Spark and others...