In the non XDP case (ebpf on TC) you have to allocate a sk buff and initialize it. This is very expensive, there's tons of accounting in the struct itself, and components that track every sk buff. Then there are the various CPU bound routing layers.
Overall the network core of Linux is very efficient. The actual page pool buffer isn't copied until the user reads data. But there's a million features the stack needs to support, and all of these cost efficiency.