A more charitable reading of the PyTorch forums thread is that Ed's explanation, "torch.jit, at this point in time, is not designed to take pointwise loops as you've written here, and compile them into machine code directly.", amounts to saying "this is the feature you're implicitly asking about, and it isn't implemented in PyTorch yet". What would have been a better answer in your view?
As an aside, it seems that 1.5 years later, people are working on implementing generated kernels for reductions, even if that is still somewhat experimental.
The speed comparison, which is about 9 months old, is only marginally related. The issue here is optimization of pointwise operations, which would be handled by the JIT fuser, except that it is disabled by default on the CPU. The latest version of the benchmark code seems not to run on recent PyTorch versions as given. It is still a fair comparison in the sense that it reflects what a user will get by default. I won't be the first PyTorch developer to say that Julia and its libraries do a great job at JITed optimizations. Nonetheless, I'm relatively certain that a determined PyTorch user would find ways to get better optimization of that ODE step using some of the disabled-by-default features.
The other truth, of course, is that PyTorch still puts more emphasis on the GPU when it comes to implementing optimizations.
(Disclaimer: I'm one of the people on that thread.)