It doesn't use the built-in ONVIF motion detection. Instead, it decodes only the keyframes and runs motion detection between consecutive keyframes. So not every frame is decoded, but enough are to detect motion.
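To make the keyframe comparison concrete, here is a minimal sketch of frame differencing between consecutive keyframes. It assumes each keyframe has already been decoded into a grayscale grid (a list of rows of 0-255 values). The function names and both thresholds are made up for illustration and are not taken from the project itself.

```python
def changed_fraction(prev, curr, pixel_threshold=25):
    """Fraction of pixels whose brightness changed by more than pixel_threshold."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def detect_motion(keyframes, pixel_threshold=25, area_threshold=0.01):
    """Return indices of keyframes that differ noticeably from the previous one."""
    hits = []
    for i in range(1, len(keyframes)):
        fraction = changed_fraction(keyframes[i - 1], keyframes[i], pixel_threshold)
        if fraction > area_threshold:
            hits.append(i)
    return hits
```

The two thresholds are the knobs here: `pixel_threshold` filters out sensor noise, and `area_threshold` decides how much of the image has to change before it counts as motion.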
There's also some interesting work on motion detection that doesn't actually require decoding the packets; it looks at the motion vectors in the encoded stream to estimate motion. That would save a lot of resources.
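As a rough sketch of the motion-vector idea: assuming the per-block vectors for a frame have already been pulled out of the stream as `(dx, dy)` pairs (how to extract them is outside this sketch), motion can be estimated from the share of blocks that moved. The function names and thresholds below are hypothetical.

```python
import math


def motion_score(vectors, min_magnitude=1.0):
    """Fraction of blocks whose motion vector is at least min_magnitude long."""
    if not vectors:
        return 0.0
    moving = sum(1 for dx, dy in vectors if math.hypot(dx, dy) >= min_magnitude)
    return moving / len(vectors)


def has_motion(vectors, min_magnitude=1.0, min_fraction=0.05):
    """True when enough blocks are moving to call it motion."""
    return motion_score(vectors, min_magnitude) >= min_fraction
```

Keyframes are intra-coded and carry no motion vectors, so an approach like this would have to read the vectors from the predicted frames in between.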