For the particular problem discussed in the original article, there are at least two ways the multiplication A'A could potentially be made faster:
1) The BLAS _GEMM matrix multiplication routine lets you specify whether each input matrix should be transposed. This gets rid of the explicit transposition, and in this problem it lets you compute each element as the dot product of two unit-stride vectors, instead of the dot product of a unit-stride vector with a non-unit-stride vector. For SSE, which works best on contiguous data, this makes a huge difference.
2) For the particular case A'A, there is the even more specialized _SYRK routine, which computes only one triangle of the symmetric result and at the very least should be cache-friendlier than a naive _GEMM. (_GEMM could also detect that it can use _SYRK for this problem, and presumably does so in some implementations.)
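
To make the two options concrete, here is a minimal sketch using SciPy's low-level BLAS wrappers. It assumes double precision (so _GEMM and _SYRK become dgemm and dsyrk) and a column-major matrix; the 500 x 40 size and the variable names are made up for illustration and do not come from the original article.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
# Column-major (Fortran) storage, so every column of A is contiguous.
# The 500 x 40 shape is an arbitrary illustrative choice.
A = np.asfortranarray(rng.standard_normal((500, 40)))

# Option 1: GEMM with the transpose flag set on the first operand.
# BLAS reads A as A' internally; no transposed copy is built.
C_gemm = blas.dgemm(alpha=1.0, a=A, b=A, trans_a=True)

# Option 2: SYRK with trans=1 computes A'A but only writes one
# triangle of the result (the upper one by default).
C_syrk = blas.dsyrk(alpha=1.0, a=A, trans=1)

# Mirror the upper triangle to get the full symmetric matrix.
# np.triu is applied first so that whatever sits in the untouched
# lower triangle is ignored.
upper = np.triu(C_syrk)
C_full = upper + np.triu(upper, 1).T

# Plain NumPy reference to compare against.
reference = A.T @ A
print(np.allclose(C_gemm, reference), np.allclose(C_full, reference))
```

Whether dsyrk is actually faster than dgemm here depends on the BLAS implementation SciPy is linked against and on the matrix shape, so it is worth timing both on the real problem before picking one.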