If you're looking at two screens with identical images, then that might be a valid argument, but VR headsets provide stereoscopic viewing by presenting each eye with a different image of the virtual scene, rendered from that eye's viewpoint. The difference between the two images is known as binocular disparity. It's the same principle used in 3D TVs and anything else that requires you to wear special glasses.
The "head translations" you're talking about give the visual system depth cues via motion parallax, where objects in the foreground appear to move faster than those in the background when the head moves from side to side.
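To put rough numbers on motion parallax, here is a minimal sketch of the geometry. It assumes a point that starts straight ahead of the viewer and a purely sideways head movement; the function name `angular_shift_deg` and the example distances are made up for illustration.

```python
import math

def angular_shift_deg(head_shift_m: float, depth_m: float) -> float:
    """Apparent angular shift (degrees) of a point that starts straight ahead,
    after the head translates sideways by head_shift_m metres."""
    return math.degrees(math.atan2(head_shift_m, depth_m))

head_shift = 0.10  # assumed 10 cm sideways head movement

near = angular_shift_deg(head_shift, 0.5)   # object 0.5 m away: about 11.3 degrees
far = angular_shift_deg(head_shift, 10.0)   # object 10 m away: about 0.57 degrees

print(f"near object shifts {near:.2f} deg, far object shifts {far:.2f} deg")
```

The same head movement sweeps the near object across a much larger visual angle than the far one, which is the cue the visual system reads as depth.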
These two cues together (binocular disparity and motion parallax) yield a very strong sense of "3D depth"; the part that comes from binocular disparity is called stereopsis. Having controllers with six degrees of freedom (6DOF: translation along and rotation about the x-, y-, and z-axes) to manipulate and interact with 3D data should be superior, since it is no longer necessary to map 2D mouse inputs to 3D operations. In theory, that should also decrease cognitive load.
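As a rough illustration of the mapping problem, the sketch below contrasts a 2D mouse delta, which needs a mode to decide which two of the six degrees of freedom it drives, with a 6DOF controller delta that updates all six at once. Everything here is hypothetical: the `Pose6DOF` class, the mode names, and the mode table are not taken from any real VR or 3D package, and adding rotation angles component-wise is a simplification that only approximates small rotations (a real implementation would more likely compose quaternions or matrices).

```python
from dataclasses import dataclass, astuple, replace

@dataclass(frozen=True)
class Pose6DOF:
    # Translation along the x-, y-, and z-axes
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    # Rotation about the x-, y-, and z-axes (radians)
    rx: float = 0.0
    ry: float = 0.0
    rz: float = 0.0

# Hypothetical mode table: each mouse mode picks the two DOFs driven by (dx, dy).
MOUSE_MODES = {
    "pan": ("x", "y"),
    "orbit": ("ry", "rx"),
    "roll_dolly": ("rz", "z"),
}

def apply_mouse_delta(pose: Pose6DOF, dx: float, dy: float, mode: str) -> Pose6DOF:
    """A 2D input can change only two DOFs per mode; the user must switch modes."""
    a, b = MOUSE_MODES[mode]
    return replace(pose, **{a: getattr(pose, a) + dx, b: getattr(pose, b) + dy})

def apply_controller_delta(pose: Pose6DOF, delta: Pose6DOF) -> Pose6DOF:
    """A 6DOF input changes all six DOFs in one step (small-angle approximation)."""
    return Pose6DOF(*(p + d for p, d in zip(astuple(pose), astuple(delta))))

start = Pose6DOF()
after_mouse = apply_mouse_delta(start, 1.0, 2.0, "pan")
after_controller = apply_controller_delta(start, Pose6DOF(1.0, 2.0, 3.0, 0.1, 0.2, 0.3))
```

With the mouse, reaching an arbitrary pose takes several mode switches; with the controller, one hand movement maps directly onto the pose change.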