In general, GPT-3 is not SotA on (any?) classification task. Did you just not have enough data to fine-tune a discriminative transformer model? Inference should be cheaper with a smaller transformer, and there's also less lock-in.
I can't go into too much detail here about why we couldn't do that, but one aspect we found VERY useful is that GPT-3 could draw on real-world knowledge not present in the dataset to enhance the results.