This is a little confusing. You turned the text into indices, so numbers, and then compressed that? Or is the text as numbers, without any extra compression, only 1kb?
The tokenizer the models use (SentencePiece) is more or less based on one way to do compression (BPE). It's not really clear what you're testing.
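
To show what I mean by BPE being a compression scheme, here is a rough toy sketch. It is not SentencePiece's actual implementation, and the function names `bpe_compress` and `bpe_decompress` are just made up for illustration. It starts from raw bytes and keeps replacing the most frequent adjacent pair with a new id, so the sequence of "indices" is already shorter than the text before any other compressor touches it.

```python
from collections import Counter


def bpe_compress(text, num_merges):
    """Toy byte-pair encoding: merge the most frequent adjacent pair, repeatedly."""
    seq = list(text.encode("utf-8"))  # start from raw byte values 0..255
    merges = {}                       # (left, right) -> new token id
    next_id = 256                     # first id not used by a raw byte
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not shrink anything
        merges[pair] = next_id
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_id)  # replace the pair with the new token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return seq, merges


def bpe_decompress(seq, merges):
    """Undo the merges by expanding each token back into bytes."""
    inverse = {tok: pair for pair, tok in merges.items()}

    def expand(tok):
        if tok < 256:
            return [tok]
        left, right = inverse[tok]
        return expand(left) + expand(right)

    raw = []
    for tok in seq:
        raw.extend(expand(tok))
    return bytes(raw).decode("utf-8")


if __name__ == "__main__":
    sample = "the cat sat on the mat, the cat sat on the hat"
    tokens, table = bpe_compress(sample, num_merges=20)
    print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
    print(bpe_decompress(tokens, table) == sample)
```

So if the test compresses token indices, part of the redundancy has already been squeezed out by the tokenizer, which is why it matters exactly which representation the 1kb figure refers to.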