One weak point I see: this tool only measures how much an individual input token would have to change to decrease the loss. A token might have large gradients because of how often it appears in the training set, or how consistent the training set is with the evaluation set, not just because of how much the prediction disagrees with the target label.
So it jointly measures data coverage, consistency, and data-to-target fit. Just my intuition, might be wrong.
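
To make the intuition a bit more concrete, here is a toy sketch. It assumes a plain logistic regression rather than whatever model the tool actually uses, and the weights are made-up numbers. For that model with cross-entropy loss, the gradient of the loss with respect to the input is `(p - y) * w`: the prediction error scaled by the learned weight. Since the weights come from training, two features can show very different gradient magnitudes at exactly the same prediction error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input x for logistic regression."""
    p = sigmoid(np.dot(w, x) + b)   # predicted probability
    return (p - y) * w              # prediction error times learned weights

# Hypothetical weights (assumption for illustration only): feature 0 ended up
# with a large weight after training, feature 1 with a small one.
w = np.array([3.0, 0.1])
b = 0.0
x = np.array([0.0, 0.0])  # gives p = 0.5
y = 1.0                   # so the prediction error p - y is -0.5 for both features

grad = input_gradient(w, b, x, y)
print(np.abs(grad))  # magnitudes 1.5 and 0.05: same error, 30x different gradient
```

So in this toy case the gradient magnitude alone cannot tell you whether the prediction is far off or whether training simply gave that feature a large weight.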