I'd be pretty surprised if it did help in novel positions. Which would make this an interesting LLM benchmark, honestly: beating Stockfish from random (but equal) middlegame positions. Or, to mix it up, from random Chess960 starting positions.
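For the Chess960 part, sampling a legal starting array is simple enough to sketch with just the stdlib (an illustration only — a real benchmark harness would still need an actual engine like Stockfish driving the games). The rules are: bishops on opposite-colored squares, king somewhere between the two rooks.

```python
import itertools
import random

def chess960_back_ranks():
    """Enumerate every valid Chess960 starting back rank:
    bishops on opposite-colored squares, king between the rooks."""
    ranks = set()
    for perm in set(itertools.permutations("RNBQKBNR")):
        b1, b2 = [i for i, p in enumerate(perm) if p == "B"]
        if b1 % 2 == b2 % 2:
            continue  # bishops must sit on opposite colors
        rook_king = [i for i, p in enumerate(perm) if p in "RK"]
        if perm[rook_king[1]] != "K":
            continue  # the middle of the three R/K squares must be the king
        ranks.add("".join(perm))
    return sorted(ranks)

ranks = chess960_back_ranks()
print(len(ranks))            # 960 distinct starting arrays, as the name promises
print(random.choice(ranks))  # a random start, e.g. "BNRKQBNR"
```

Brute-forcing the ~5000 distinct permutations is lazy but fine at this scale; the count coming out to exactly 960 is a nice sanity check on the filter.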
Of course, whatever chess logic the LLM picked up would ultimately derive from the engine that produced the original evals. So beating Stockfish with a model trained on a dataset of Stockfish evals would prove very little.