It takes only one mistaken assumption to introduce non-determinism. Say you wait for the page to load and then click a button. Is that valid? Sometimes it is, but some UI frameworks render the button after the page reports that it has loaded. Your test will usually pass anyway, but it will mysteriously fail every so often.
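One common way to drop the "loaded means rendered" assumption is to poll for the element itself instead of trusting the load event. The sketch below is only an illustration: `wait_until`, `FakePage`, and `find_button` are hypothetical names made up for this example, and `FakePage` merely simulates a button that shows up some time after load, so it is not any real browser API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], Optional[T]],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a browser page whose button renders late."""

    def __init__(self, render_delay: float) -> None:
        # The page counts as "loaded" immediately, but the button only
        # becomes findable once render_delay seconds have elapsed.
        self._ready_at = time.monotonic() + render_delay

    def find_button(self) -> Optional[str]:
        return "button" if time.monotonic() >= self._ready_at else None


if __name__ == "__main__":
    page = FakePage(render_delay=0.3)
    # Clicking right away would fail here; waiting for the button does not.
    button = wait_until(page.find_button, timeout=2.0)
    print("found:", button)
```

Browser automation libraries typically ship an equivalent of this helper (Selenium's `WebDriverWait`, for instance), and that is usually the better choice in a real suite. The point of the sketch is only that the test waits on the thing it is about to use rather than on a proxy signal.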
You also tend never to reach full reliability, because the environments themselves aren't entirely stable. Almost everyone sees more "random" failures on old IE versions. A poor, over-stressed IE9 running in a VM will have a bad day, and your test will fail.