I suspect there are surprises in store for me as I root around in this space. I have a .caffemodel that appears to label correctly at a better-than-chance rate, so my next step is to create a test dataset that's independent of the original data and see what happens. I also find it unsettling that GPT-3 now contains 175 billion parameters. Why not go in the opposite direction and test the lower bound of what is possible?
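
To make "better than chance" concrete on an independent test set, here is a minimal sketch of one way to check it. It assumes the predictions have already been collected from the model (running the .caffemodel itself is not shown), that "chance" means guessing uniformly among the classes, and that the helper names `accuracy` and `p_value_vs_chance` are purely illustrative.

```python
from math import comb


def accuracy(predicted, actual):
    """Fraction of test examples whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("the test set is empty")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def p_value_vs_chance(n_correct, n_total, n_classes):
    """One-sided binomial tail: probability of getting at least n_correct
    right out of n_total by guessing uniformly among n_classes.

    Assumption: uniform guessing is the chance baseline. With unbalanced
    classes, always predicting the majority class is a stronger baseline.
    """
    p = 1.0 / n_classes
    return sum(
        comb(n_total, k) * p**k * (1.0 - p) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


if __name__ == "__main__":
    # Hypothetical labels, for illustration only (not real results).
    actual = [0, 1, 1, 0, 1, 0, 0, 1]
    predicted = [0, 1, 0, 0, 1, 0, 1, 1]
    n_correct = sum(p == a for p, a in zip(predicted, actual))
    print("accuracy:", accuracy(predicted, actual))
    print("p-value vs. chance:", p_value_vs_chance(n_correct, len(actual), 2))
```

A small p-value on a test set the model has never seen would suggest the labeling is not luck. A large one would suggest the earlier result came from overlap with the original data or from too few examples.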