MIDI has to fire notes one at a time. Because it is a serial protocol that sends one message at a time, it cannot even represent everything a score can. A score can ask for 16 sharp, clicky attacks on the same instant, but MIDI cannot fire them all at once, so the attack blurs.
However, if you want to control attack, reverb, tone, volume, i.e., all the stuff that makes it sound like a real instrument and not a computer-generated tone, then you would need to dedicate some of the 16 channels on each port to those controls. To handle 16 xylophones, you would need a MIDI interface with enough ports to carry the note data plus however many control channels you want to use. (Note that from the interface's perspective, each port is a single connection carrying 16 channels, so an interface sold as "8-channel", meaning 8 ports, is actually handling 8 × 16 = 128 channels of MIDI data.)
All these messages have to take turns. It's not a tracker or a modular synth, where attacks can fire electrically in parallel, and it's not DIN sync, where a chain of voltage pulses synchronizes individual sequencers and the pulses aren't themselves notes.
About 1000 messages a second (at three bytes for each note-on) seems like a lot, but it really isn't. With your 16 instruments (any drum, the xylophones, whatever) you can fire all of them together only about 62 times per second. That seems like a lot too, but it's a hard limit, and it means your sixteen instruments have to 'blur' across roughly 16 milliseconds to each fire off a note.
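The arithmetic behind those round numbers can be sketched as follows. This is a rough check, assuming the classic MIDI 1.0 DIN link (31,250 baud, 10 bits on the wire per byte) and full three-byte note-ons with no running status; the function names are made up for illustration.

```python
# Back-of-the-envelope MIDI 1.0 bandwidth check.
# Assumptions: 31,250 baud serial link, each byte framed as 10 bits
# (start bit + 8 data bits + stop bit), 3-byte note-on, no running status.
BAUD = 31_250
BITS_PER_BYTE = 10
NOTE_ON_BYTES = 3


def note_on_time_ms(n_bytes=NOTE_ON_BYTES):
    """Time one message occupies the wire, in milliseconds."""
    return n_bytes * BITS_PER_BYTE / BAUD * 1000


def messages_per_second(n_bytes=NOTE_ON_BYTES):
    """Upper bound on messages per second over a single port."""
    return BAUD / (BITS_PER_BYTE * n_bytes)


def chord_spread_ms(n_notes):
    """Time for n_notes 'simultaneous' note-ons to finish arriving."""
    return n_notes * note_on_time_ms()


if __name__ == "__main__":
    print(f"one note-on: {note_on_time_ms():.2f} ms")
    print(f"note-ons per second: {messages_per_second():.0f}")
    print(f"16-note hit spread: {chord_spread_ms(16):.2f} ms")
    print(f"16-note hits per second: {messages_per_second() / 16:.1f}")
```

Under these assumptions the exact figures land slightly above the rounded ones in the post (a bit over 1000 messages per second, a bit under 16 ms for 16 notes). Running status would shave a byte off repeated note-ons, but the messages would still go out one after another.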
That means every time you fire all the instruments as one click, the attacks instead arrive about a millisecond apart, as a short burst at roughly 1 kHz. That's NOT one attack. Humans on percussion instruments can do better than that. A 16 ms spread is the equivalent of about 5.5 meters (18 feet) of space between speakers playing back: if you're time-aligning a tweeter and a midrange to produce a unified click, and you misalign one of the drivers by putting it five and a half meters back from where it should be, you'll notice the misalignment. MIDI's timing issues are noticeable in the same way.
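The delay-to-distance comparison is easy to check. A minimal sketch, assuming sound travels at about 343 m/s (dry air near 20 °C) and taking the 16 ms spread at face value; the helper name is hypothetical.

```python
# Convert a timing offset into the equivalent speaker setback.
# Assumption: speed of sound is about 343 m/s (dry air, roughly 20 C).
SPEED_OF_SOUND_M_S = 343.0
METERS_PER_FOOT = 0.3048


def delay_to_distance_m(delay_ms):
    """Distance sound travels in delay_ms milliseconds, in meters."""
    return SPEED_OF_SOUND_M_S * delay_ms / 1000.0


if __name__ == "__main__":
    d = delay_to_distance_m(16.0)
    print(f"16 ms is about {d:.1f} m ({d / METERS_PER_FOOT:.0f} ft)")
    # One 3-byte note-on takes about 0.96 ms on a 31,250 baud link,
    # so back-to-back attacks repeat at roughly 1 / 0.96 ms.
    print(f"attack rate is about {1000.0 / 0.96:.0f} Hz")
```

So a 16 ms spread matches a setback of roughly five and a half meters, and back-to-back note-ons repeat at a little over 1 kHz.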
When that paper was written, it took several weeks and many millions of dollars of equipment to render primitive, monochrome 3D graphics. Desktop computers had 512 kilobytes of RAM, and the highest-end desktops had 32 MB of hard drive storage. Computer screens had two colors: black and green. Audio cards capable of making beeps and clicks were cutting-edge. Wi-Fi was still a decade away.
"Intention" (as a tentative term)
The question becomes: what has impeded the creation of a MIDI file that could be mistaken for an actual concert by Arturo Benedetti Michelangeli?
The current version of MIDI is capable of replicating any of his performances, even down to the randomness.
Note that if you want to replicate the audio quality of his performances, you will need a high-quality MIDI instrument; the ones that ship with Windows will not suffice. These MIDI instruments can range from a few dollars to thousands of dollars. (See, e.g., Native Instruments)
In that case, we have a theoretical suggestion that «nothing is preventing this», but no actual proof from a "Turing test"-like scenario in which specialists are fooled, to corroborate that the new MIDI 2.0 would suffice.