I see this claim everywhere online, and it's often taught this way, so I don't blame folks for repeating it, but I suspect it's promulgated by folks who don't train LSTMs on long contexts.
LSTMs do add something like a "skip-connection" (before that term was a thing), the additive cell-state update, which helps deal with the catastrophic vanishing gradients you get right from the jump with e.g. Jordan RNNs.
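To make that concrete, here's a minimal sketch of one step of the standard LSTM cell update (not any particular library's kernel; the shapes and weight layout are just illustrative), showing where that additive, skip-like path lives:

    # Minimal sketch of one LSTM cell step; weights/shapes are illustrative.
    import torch

    def lstm_cell(x, h_prev, c_prev, W, U, b):
        # gates: input (i), forget (f), candidate (g), output (o)
        i, f, g, o = (x @ W + h_prev @ U + b).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        # The additive cell-state update: the gradient flowing through c_prev
        # is only scaled by the forget gate f, not squashed through a tanh at
        # every step. That's the "skip"-like path.
        c = f * c_prev + i * g
        h = o * torch.tanh(c)
        return h, c

The catch is that the forget gate still multiplies the cell state at every step, so over enough timesteps the gradient along that path can still shrink.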
However (!), while this stops us from seeing vanishing gradients after e.g. 10s or 100s of time-steps, once you get into multiple 1000s of tokens the wheels start falling off. I saw this in my own research: training on amino acid sequences of length ~3,000 led to a huge amount of instability. It was only after tokenizing the amino acid sequences (which was uncommon at the time), which got us down to ~1,500 timesteps on average, that we started seeing stable losses during training. Check out the ablation at [0].
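If you want to see the effect for yourself, a rough diagnostic along these lines (the sizes here are made up, not the ones from the paper) shows the gradient reaching the first timestep shrinking as sequences get longer:

    # Feed random sequences of increasing length through an LSTM and check
    # how much gradient reaches the first timestep. Sizes are illustrative.
    import torch

    lstm = torch.nn.LSTM(input_size=64, hidden_size=256, batch_first=True)
    for T in (100, 500, 1500, 3000):
        x = torch.randn(1, T, 64, requires_grad=True)
        out, _ = lstm(x)
        out[:, -1].sum().backward()           # "loss" at the final timestep
        print(T, x.grad[:, 0].norm().item())  # gradient reaching timestep 0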
You can think of ResNets by analogy. ResNets didn't "solve" vanishing gradients either; there's still a practical limit on network depth. But they went a long way towards dealing with the problem.
EDIT: I wanted to add that while I was trying to troubleshoot this for myself, it was super hard to find evidence online of why I was seeing instability. Everything I could find pertaining to "vanishing gradients" and LSTMs was blog posts and pre-prints that just merrily repeated "LSTMs solve the problem of vanishing gradients". That made it hard for me, a junior PhD student at the time, to suss out the fact that LSTMs do demonstrably and reliably suffer from vanishing gradients at longer contexts.
[0] https://academic.oup.com/bioinformatics/article/38/16/3958/6...