>> Transformers are RASP programs, which includes sorting programs. See the Weiss paper (
https://arxiv.org/pdf/2106.06981.pdf).
That's one preprint on arXiv that makes a wild claim about a new concept its authors acronymise as "RASP". It's not any kind of established terminology, and it's nothing more than a claim.
What is certainly established is that a function and an algorithm are different objects. To clarify: a function is a mapping between the elements of two sets, whereas an algorithm is a sequence of operations that computes the result of a function and is guaranteed to terminate. Algorithms are also typically understood to be provably correct and to have some provable asymptotic complexity (as opposed, for example, to heuristics), but that's not a requirement.
So for example, if you have a function ƒ between sets X and Y, and an algorithm P that calculates the result of ƒ, then you can give any element of X to P and it will return (in fact, construct) an element of Y. Crucially, ƒ is not P, and P is not ƒ.
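To make the distinction concrete, here is a minimal Python sketch (the names ƒ → `f` and P → `P` follow the text above; the specific pairs and the choice of insertion sort are illustrative): the function is just a mapping with no steps, while the algorithm is a terminating procedure that computes it.

```python
# The function f: a mapping between two sets. Mathematically it is just a
# collection of (input, output) pairs; it has no steps and does not "run".
f = {(3, 1, 2): (1, 2, 3), (2, 1): (1, 2)}  # a finite fragment of the mapping

# The algorithm P: a concrete, terminating procedure (insertion sort) that,
# given any element of X, constructs the element of Y that f maps it to.
def P(xs):
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

print(P([3, 1, 2]))  # [1, 2, 3] — P constructs f's output; P is not f
```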
Now, when you train a machine learning model, you are typically training a function ƒ̂ (with a little hat) to approximate ƒ. That means your trained ƒ̂ is a function that maps some elements of X to the same elements of Y that ƒ does, but not all of them. It's an approximation, so you get some amount of error, as in your experiment.
So what you've done in your experiment is train a model to approximate a mapping from the set of lists to itself (where the input is any of the lists in your training set and the output is the same list, sorted). Your model is not an algorithm, and you cannot train an algorithm with a language model.
I appreciate that learning an algorithm is what you wanted to achieve, but in science we don't choose the answer that pleases us; we choose the answer that makes the most sense, and a good heuristic is that the answer that makes more sense is the simplest one. Here, in order to convince yourself that you have trained a language model to learn an algorithm rather than an approximator, you have chosen to rely on a preprint with a completely novel and untested concept that someone put on the internet, rather than on the well-understood abstractions of elementary computer science. That is not the simplest explanation, and it is not a good idea: you will not understand what is going on if you rely on that kind of explanation. I assume you are trying to understand?
Edit: incidentally, you don't need a transformer to train an approximator to a sorting function. You can do that with a multi-layer perceptron, or a logistic regression, and certainly with an LSTM. Ceteris paribus, you'll get the same results.
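For instance, here is a sketch of the MLP version, assuming scikit-learn and NumPy are available (the list length, hidden-layer sizes, and sample counts are all arbitrary choices for illustration). Note what comes out: an approximator ƒ̂ with small but nonzero error, not a sorting algorithm.

```python
# Sketch: training an MLP to *approximate* sorting of fixed-length lists.
# This yields an approximator f-hat, not an algorithm — it makes errors,
# especially off the training distribution.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5000, 5))  # random length-5 lists
Y = np.sort(X, axis=1)                 # their sorted versions

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(X, Y)

test = rng.uniform(0, 1, size=(100, 5))
err = np.abs(model.predict(test) - np.sort(test, axis=1)).mean()
print(f"mean absolute error: {err:.4f}")  # small but nonzero: an approximation
```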
>> The probability of a test list existing in the training set is less than 10^-70.
But if you held the test set out, that probability would be exactly 0, so why not do that? It's not hard to do.
Is there a good reason not to do that?
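Holding out is a few lines of code. Here is a sketch, with an illustrative data generator standing in for whatever pipeline you actually use: generate the training set, then reject any candidate test list that already appears in it, so the leakage probability is 0 by construction rather than merely small.

```python
# Sketch: guaranteeing zero train/test leakage by explicit held-out filtering.
# random_list() is a stand-in for your own data generation.
import random

random.seed(0)

def random_list():
    return tuple(random.randrange(100) for _ in range(8))

train = {random_list() for _ in range(10_000)}

test = []
while len(test) < 1_000:
    xs = random_list()
    if xs not in train:  # hold out: reject anything seen in training
        test.append(xs)

assert all(xs not in train for xs in test)  # leakage probability: exactly 0
```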
Btw, lists are composite objects. How much overlap is there between your training and test lists? Do you know?
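To see why this matters: even when no test list appears verbatim in training, the two sets can share nearly all of their sub-structure. A sketch of one illustrative measure (the bigram choice and all parameters are arbitrary, for demonstration only): the fraction of length-2 contiguous sublists of test lists that also occur somewhere in training.

```python
# Sketch: lists are composite objects, so whole-list uniqueness says little
# about piecewise overlap. Measure contiguous-pair (bigram) overlap between
# train and test lists drawn from the same distribution.
import random

random.seed(0)
make = lambda: [random.randrange(20) for _ in range(6)]
train = [make() for _ in range(2000)]
test = [make() for _ in range(200)]

def bigrams(xs):
    return {tuple(xs[i:i + 2]) for i in range(len(xs) - 1)}

train_bigrams = set().union(*(bigrams(xs) for xs in train))
test_bigrams = set().union(*(bigrams(xs) for xs in test))

overlap = len(test_bigrams & train_bigrams) / len(test_bigrams)
print(f"bigram overlap: {overlap:.2%}")  # typically near 100% here
```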
Edit: meh. HN messes up my nice f-with-hook-and-combining-circumflex-accent. DAAAAANG!!!!