Here's an analogy. In some churches, the faithful will kneel in order to pray. Now I come along and say that praying 200 times per day will lead to better health. We put this to a test, and find it's true. But there are confounding factors - is it the praying, or the physical exercise through kneeling, or doing it in a church, or all of the above, which lead to the outcome?
A Kneelologist could stop, be happy that it works, and promote Kneelism as a healthy activity. But a non-kneelologist could point out that it's similar to calisthenics, which was already known to give similar positive results, is simpler to understand because it doesn't require the prayer component, and can be done by people who are against prayer or don't have ready access to a church.
(Or for a real world example, the asanas from hatha yoga are used as exercise, and called 'yoga' even though yoga is a much broader topic.)
The scientific approach would address some important questions: 1) is the effect real and reproducible? 2) when should it be used instead of other forms of treatment? and 3) what are the possible confounding factors, and can we disentangle them to improve 1) and 2)?
Applying that to NLP (and I'm making this up, because I don't know the details): what if NLP is an incorrect synthesis of real-world observations that were already known at the time NLP was developed? In that case, the ability of NLP to predict similar effects is not surprising. Other psychology models developed since Bandura's work in the 1960s also need to "predict" that behavior rehearsal can be an effective treatment.
Instead, what new predictions does NLP make which are different from other behavioral models? Can those predictions be tested? Or, failing the ability to make new predictions, is it a simpler model which is at least as effective as other models in describing behavior?
That's where the science comes in.
NLP might work. But so might cognitive psychology, and with seemingly fewer worries about charlatans.