All these messages have to take turns. It's not a tracker or a modular synth, where the clicks can all rise in parallel, electronically. And it's not DIN sync, where a chain of voltage pulses synchronizes individual sequencers and the pulses aren't themselves notes.
About 1000 messages a second (at three bytes for each note-on) seems like a lot, but it really isn't. With your 16 instruments (any drum, the xylophones, whatever) you can fire a note on all of them about 62 times per second. That seems like a lot too, but it's a hard limit, and it means your sixteen instruments have to 'blur' across roughly 16 milliseconds to all fire off a note.
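Here's a rough sketch of that arithmetic, in case you want to plug in your own numbers. It assumes the standard MIDI serial link (31,250 baud, each byte framed with a start and stop bit, so 10 bits on the wire) and a full three-byte note-on every time, with no running status. The function names are made up for this example.

```python
# Back-of-the-envelope MIDI bandwidth figures.
# Assumptions: classic 5-pin DIN MIDI at 31,250 baud, 10 bits on the
# wire per byte (start + 8 data + stop), 3 bytes per note-on,
# no running status.
MIDI_BAUD = 31_250
BITS_PER_BYTE = 10
NOTE_ON_BYTES = 3


def note_on_seconds() -> float:
    """Time one note-on message occupies the wire, in seconds."""
    return NOTE_ON_BYTES * BITS_PER_BYTE / MIDI_BAUD


def messages_per_second() -> float:
    """How many back-to-back note-ons fit in one second."""
    return 1.0 / note_on_seconds()


def chord_spread_ms(instruments: int) -> float:
    """Time from the first note-on to the end of the last, in ms."""
    return instruments * note_on_seconds() * 1000.0


def full_hits_per_second(instruments: int) -> float:
    """How often every instrument can fire once, per second."""
    return messages_per_second() / instruments


if __name__ == "__main__":
    print(f"one note-on:        {note_on_seconds() * 1000:.2f} ms")
    print(f"note-ons per sec:   {messages_per_second():.0f}")
    print(f"16-note spread:     {chord_spread_ms(16):.1f} ms")
    print(f"16-note hits / sec: {full_hits_per_second(16):.1f}")
```

Under those assumptions the exact figures land a hair better than the round numbers above (a bit over 1000 messages and a bit under 16 ms), but not enough to change the argument.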
That means every time you fire all the instruments as one click, the attacks instead land about a millisecond apart, which makes a burst at roughly 1 kHz. That's NOT one attack. Humans on percussion instruments can do better than that. It's the equivalent of 5.5 feet of space between speakers on playback: if you're time-aligning a tweeter and a midrange to produce a unified click, and you misalign one of the drivers by putting it five and a half feet back from where it should be, you'll notice the misalignment. MIDI's timing issues are noticeable too.
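If you want to check the tone and the speaker-distance comparison yourself, here's a small sketch. It assumes sound travels at about 343 m/s (dry air, around 20 °C) and that each note-on takes 0.96 ms on the wire (3 bytes, 10 bits each, at 31,250 baud). The helper names are hypothetical, not from any library.

```python
# Convert between MIDI message spacing, attack rate, and the
# "speaker moved back by N feet" picture.
# Assumptions: speed of sound ~343 m/s, one note-on = 0.96 ms on the wire.
SPEED_OF_SOUND_M_S = 343.0
METERS_PER_FOOT = 0.3048
NOTE_ON_SECONDS = 0.00096


def attack_rate_hz(spacing_s: float) -> float:
    """Rate at which attacks arrive when spaced spacing_s apart."""
    return 1.0 / spacing_s


def delay_to_feet(delay_s: float) -> float:
    """Distance sound covers in delay_s, in feet."""
    return delay_s * SPEED_OF_SOUND_M_S / METERS_PER_FOOT


def feet_to_delay_ms(feet: float) -> float:
    """Delay caused by moving a driver back by this many feet, in ms."""
    return feet * METERS_PER_FOOT / SPEED_OF_SOUND_M_S * 1000.0


if __name__ == "__main__":
    print(f"attack rate:         {attack_rate_hz(NOTE_ON_SECONDS):.0f} Hz")
    print(f"one note-on in feet: {delay_to_feet(NOTE_ON_SECONDS):.2f} ft")
    print(f"5.5 ft as a delay:   {feet_to_delay_ms(5.5):.2f} ms")
```

With those numbers, every note-on stuck in the queue is like sliding a speaker back about a foot, and 5.5 feet works out to a delay of just under 5 ms.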