That's not the main point even though it probably helps. As OkayPhysicist said above, without a nonlinearity, you could collapse all the weight matrices into a single matrix. If you have 2 layers (same size, for simplicity) described by weight matrices A and B, you could multiply them and get C, which you could use for inference.
Now, you can do this same trick not only with 2 layers but 100 million, all collapsing into a single matrix after multiplication. If the nonlinearities weren't there, the effective information content of the whole NN would collapse into that of a single-layer NN.