Radar and ultrasound both produce drastically less data than even a simple 720p webcam. After postprocessing, their output bandwidth is closer to that of a 9-axis IMU than to that of a camera.
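To put rough numbers on that comparison, here is a back-of-envelope sketch. The frame rate, bit depth, and IMU sample rate are illustrative assumptions rather than figures for any particular device, and the camera figure is for uncompressed frames, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-envelope raw data rates.
# Every rate and bit depth here is an assumed, illustrative value.

def camera_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps

def imu_bytes_per_second(axes, bytes_per_sample, sample_rate_hz):
    """IMU data rate in bytes per second, ignoring packet overhead."""
    return axes * bytes_per_sample * sample_rate_hz

# 720p, assumed 24-bit RGB at 30 frames per second
webcam = camera_bytes_per_second(1280, 720, 3, 30)

# 9 axes, assumed 16-bit samples at 100 Hz
imu = imu_bytes_per_second(9, 2, 100)

print(f"webcam: {webcam} B/s, IMU: {imu} B/s, ratio: {webcam // imu}x")
```

Under these assumptions the raw camera stream is several orders of magnitude larger than the IMU stream. A compressed webcam stream would narrow the gap but not close it.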
Yes, and each time you process the incoming camera data in the neural network, you incur extra calculations from fusing it with another data source, regardless of that source's bitrate. Have you worked with deep learning much?
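A minimal sketch of that per-frame overhead, assuming the fusion is done by concatenating the second sensor's feature vector onto the camera features before a single dense layer. The fusion scheme and every size below are hypothetical; the point is only that the added work is set by the feature dimensions and is paid on every inference, whatever the sensor's bitrate.

```python
# Multiply-accumulate (MAC) count for one dense layer, with and without
# a second sensor's features concatenated onto the input.
# All sizes are hypothetical and chosen only for illustration.

def dense_macs(in_features, out_features):
    """MACs for one forward pass through a fully connected layer."""
    return in_features * out_features

camera_features = 512  # assumed size of the camera feature vector
extra_features = 16    # assumed size of the second sensor's feature vector
hidden_units = 256     # assumed width of the fusion layer

camera_only = dense_macs(camera_features, hidden_units)
fused = dense_macs(camera_features + extra_features, hidden_units)
extra_per_frame = fused - camera_only

print(f"camera only: {camera_only} MACs, fused: {fused} MACs, "
      f"extra per frame: {extra_per_frame} MACs")
```

This counts only the fusion layer itself. Any separate encoder for the second sensor would add its own cost on top.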