I previously evaluated a special-purpose tool that identifies the best n-grams and uses dynamic programming during encoding. That gets about 70% compression on SMILES strings. I also tried the off-the-shelf femtozip, which got about 60% compression but had more decompression overhead than I like.
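To make the dynamic programming idea concrete, here is a minimal sketch. It is not the tool's actual code. It assumes a hypothetical, hand-picked n-gram dictionary and assumes every token (a dictionary n-gram or a single character) costs the same to emit, so the goal is simply the fewest tokens. The name `encode_min_tokens` is made up for this illustration.

```python
def encode_min_tokens(text, ngrams):
    """Split text into the fewest tokens, where each token is either
    a dictionary n-gram or a single character (assumed equal cost)."""
    max_len = max((len(g) for g in ngrams), default=1)
    n = len(text)
    # best[i] = fewest tokens needed to encode text[:i]
    best = [0] + [n + 1] * n
    # back[i] = start offset of the last token in the best split of text[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            piece = text[i - k:i]
            # A single character is always allowed as a fallback token.
            if k == 1 or piece in ngrams:
                if best[i - k] + 1 < best[i]:
                    best[i] = best[i - k] + 1
                    back[i] = i - k
    # Walk the back pointers to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    tokens.reverse()
    return tokens


# Hypothetical dictionary, chosen only to show why greedy matching can lose.
example_ngrams = {"CC", "C(=O)", "N1"}
example_tokens = encode_min_tokens("CC(=O)N1", example_ngrams)
print(example_tokens)
```

A greedy left-to-right encoder would grab "CC" first and then fall back to single characters for "(=O)", giving six tokens. The dynamic programming version can skip the early "CC" match in favor of "C(=O)" and needs only three.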
Shoco, trained on 1,455,763 SMILES strings (average of 56 letters each), and tested with 100,000 strings from the training set, reports "average compression ratio: 47%".
CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1
CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N
O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1
On the raw data set (one record per line), wc reports: 1455763 1455763 82882385
while | gzip -c | wc -c reports 18773892.
50%
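As a rough check on those wc numbers, the arithmetic below works out the average line length and how large the gzip output is relative to the raw file. The variable names are mine. Note that the byte count from wc includes one newline per record, and that a "compression" percentage can mean either space saved or size remaining, so the sketch prints both.

```python
# Figures copied from the wc output above.
num_lines = 1455763
raw_bytes = 82882385
gzip_bytes = 18773892

# Average bytes per line, including the trailing newline.
avg_line_bytes = raw_bytes / num_lines

# Size of the gzip output relative to the raw file, and the space saved.
gzip_fraction = gzip_bytes / raw_bytes
gzip_saved = 1 - gzip_fraction

print(f"average line: {avg_line_bytes:.1f} bytes (newline included)")
print(f"gzip output is {gzip_fraction:.1%} of the raw size")
print(f"gzip saves {gzip_saved:.1%}")
```

Keep in mind that gzip here compresses the whole file as one stream, so it can exploit repetition across records. That is a different setting from compressing each short string on its own.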
Look how well it can compress "ababababababababababab".
0%
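For contrast, a general-purpose LZ77-style compressor handles this kind of repetition easily. The sketch below uses Python's standard zlib module on the same 22-byte string. It says nothing about shoco's internals. It only assumes that the 0% above comes from a model trained on SMILES, which has no reason to know that "a" tends to follow "b".

```python
import zlib

# The same test string as above: "ab" repeated 11 times.
data = b"ab" * 11

# DEFLATE can encode this as two literals plus one back-reference.
packed = zlib.compress(data, 9)
saved = 1 - len(packed) / len(data)

print(len(data), "bytes in,", len(packed), "bytes out")
print(f"space saved: {saved:.0%}")
```

The zlib header and checksum add a fixed overhead of a few bytes, which is a large share of such a short input, so the savings would look better on a longer repetitive string.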