My understanding is that you're now instead saying the goal is to see how well a machine can imitate a human male imitating a woman. This still seems inconsistent with the rest of the paper: Turing's mention of "the part of B being taken by a man", the fact that the example interrogator questions appear to probe intelligence as opposed to gender in the machine vs human game, and the point that a hypothetical flipped test would involve a man imitating a machine and failing due to slowness in arithmetic.
I believe both of your proposed interpretations miss the central idea of Turing's proposal and alter it in a way that would make the significance of the test and its results questionable. Specifically, the test is intended to be a sufficiently general arena in which any relevant task can be introduced to determine intelligence (like feeding in chess positions, as in Turing's example question) - not just one oddly specific task.
While it's likely possible to make ad-hoc changes to your interpretation until it's consistent with the paper, the standard reading (machine imitating human) seems infinitely more plausible to me. There's plenty in the paper justifying the significance of determining whether a machine can answer questions indistinguishably from a human, yet no reasoning at all for why the test would be narrowed to the very specific case of having the machine imitate one gender imitating the other.