Just look at any paper that puts models through mathematical benchmarks. The models aren't memorizing these problems. For example, I just generated two random 64-bit integers and asked ChatGPT to add them:
"6769545085823578960 + 16027170449476717488"
ChatGPT said the answer is 22796715535300296448. That's correct, even though this exact problem almost certainly wasn't in its training data.
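
If you want to check the sum yourself, here's a quick Python sketch. Python ints are arbitrary precision, so nothing overflows. The variable names are just mine.

```python
# The two operands from the prompt above.
a = 6769545085823578960
b = 16027170449476717488

# Sanity check: both operands fit in an unsigned 64-bit integer.
assert a < 2**64 and b < 2**64

total = a + b
print(total)

# Compare against the answer ChatGPT gave.
chatgpt_answer = 22796715535300296448
print(total == chatgpt_answer)
```

Note the sum itself is bigger than 2**64, so a plain 64-bit add in something like C would wrap around instead of giving this result.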