In fact the LLMs that everyone uses today typically have access to specialized task specific tooling. Obviously specialized tools aren't appropriate for a test that measures the ability to generalize but generic tools are par for the course. Writing a bot to play a game for you would certainly serve to demonstrate an understanding of the task.