I've tried this for WordNet hierarchical classification with a standard cross-entropy loss:
https://github.com/glassroom/heinsen_tree#sample-usage-with-...
It worked for me, but I had to modify the code to use all hypernym paths, which gave me 147,200 classes, one per path (English only). For synsets with more than one hypernym path, I split the target probability mass over their paths. For prediction, I summed the predicted probabilities of the hypernym paths ending at each synset.
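
Roughly, the target-splitting and prediction steps look like the sketch below. This is not the heinsen_tree code or my actual training code, just a plain-Python illustration with a flat softmax over path classes. The toy `path_to_synset` list, the synset names in it, the helper names, and the even split of target mass across paths are all assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical toy setup: one class per hypernym path.
# path_to_synset[i] is the synset at which path i ends.
# "dog.n.01" is given two paths here purely for illustration.
path_to_synset = ["dog.n.01", "dog.n.01", "cat.n.01", "animal.n.01"]


def build_synset_to_paths(path_to_synset):
    """Map each synset to the indices of all paths that end at it."""
    synset_to_paths = defaultdict(list)
    for path_idx, synset in enumerate(path_to_synset):
        synset_to_paths[synset].append(path_idx)
    return dict(synset_to_paths)


def soft_target(synset, synset_to_paths, num_paths):
    """Split the target probability mass over the synset's paths.

    Assumption: the mass is split evenly across the paths.
    """
    paths = synset_to_paths[synset]
    target = [0.0] * num_paths
    for p in paths:
        target[p] = 1.0 / len(paths)
    return target


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy between a soft target and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def synset_probs(logits, synset_to_paths):
    """Sum the predicted path probabilities that end at each synset."""
    probs = softmax(logits)
    return {s: sum(probs[p] for p in paths)
            for s, paths in synset_to_paths.items()}


synset_to_paths = build_synset_to_paths(path_to_synset)
target = soft_target("dog.n.01", synset_to_paths, len(path_to_synset))
logits = [0.0, 0.0, 0.0, 0.0]  # stand-in for model outputs over paths
loss = cross_entropy(logits, target)
pred = synset_probs(logits, synset_to_paths)
best_synset = max(pred, key=pred.get)
```

In a real setup the logits would come from the model, and the same split targets would be fed to whatever framework's cross-entropy accepts class-probability targets.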