But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.
> nothing but linearity
No, if you have non linearities, the NN itself is not linear. The non linearities are not there primarily to keep the outputs in a given range, though that's important, too.
So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial for a linear NN to solve.
Architectures like Mamba have a linear recurrent state space system as their core, so even though you need a nonlinearity somewhere, it doesn't need to be pervasive. And linear recurrent networks are surprisingly powerful (https://arxiv.org/abs/2303.06349, https://arxiv.org/abs/1802.03308).
Precisely what the `Activation Function` does is to squash an output into a range (normally below one, like tanh). That's the only non-linearity I'm aware of. What other non-linearities are there?
All the training does is adjust linear weights tho, like I said. All the training is doing is adjusting the slopes of lines.
"only" is doing a lot work here because that non-linearity is enough to vastly expand the landscape of functions that an NN can approximate. If the NN was linear, you could greatly simplify the computational needs of the whole thing (as was implied by another commenter above) but you'd also not get a GPT out of it.
ReLU enables this by being nonlinear in a simple way, specifically by outputting zero for negative inputs, so each linear unit can then limit its contribution to a portion of the output curve.
(This is a lot easier to see on a whiteboard!)
This isn't the primary purpose of the activation function, and in fact it's not even necessary. For example see ReLU (probably the most common activation function), leaky ReLU, or for a sillier example: https://youtu.be/Ae9EKCyI1xU?si=KgjhMrOsFEVo2yCe