Shoco: a fast compressor for short strings

(ed-von-schleck.github.io)

31 points · multipass · 10y ago · 19 comments

12 comments · 6 top-level

dalke · 10y ago · 2 in thread

I have a background project of exploring how to compress SMILES strings, which is a notation for storing chemical information. For example, "C" is methane, "CC" is ethane, "C=C" is ethene, "CCO" is ethyl alcohol, "C1CCCCC1" is cyclohexane, and "c1ccccc1", which contains aromatic carbons, is benzene. The average length of a SMILES string for real-world molecules is about 50 characters.

I previously evaluated a special-purpose tool which identifies the best n-grams and uses dynamic programming during encoding. That gets about 70% compression on SMILES strings. I also tried the off-the-shelf femtozip, which got about 60% compression but had more decompression overhead than I'd like.

Shoco, trained on 1,455,763 SMILES strings (average of 56 letters each), and tested with 100,000 strings from the training set, reports "average compression ratio: 47%".

bmh100 · 10y ago

Could you provide more information about your SMILES test? How many unique symbols were there? How does gzip do? This is an interesting use case.

dalke · 10y ago

Sure. I'm switching this conversation to email though, using the gmail account in your profile. Short version is, I trained it on the RDKit-generated SMILES strings from ChEMBL-20. Three of the strings look like this:

    CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1
    CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N
    O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1

On the raw data set (one record per line), wc reports:

     1455763 1455763 82882385

while piping the same file through gzip -c | wc -c reports 18773892.
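
To put the wc and gzip figures above on a common footing, here is a rough sketch in Python. It assumes each record ends in exactly one newline byte, and the variable names are illustrative only. Note that gzip here sees the whole file at once and can exploit redundancy across records, so its result is not directly comparable to a per-string compressor like Shoco.

```python
# Figures quoted above from `wc` and `gzip -c | wc -c`.
records = 1_455_763
raw_bytes = 82_882_385
gzip_bytes = 18_773_892

# Assumption: one trailing newline per record, so subtract it to get the string length.
avg_len = (raw_bytes - records) / records

gzip_ratio = gzip_bytes / raw_bytes   # compressed size / original size
gzip_saving = 1 - gzip_ratio          # fraction of bytes removed

print(f"average SMILES length: {avg_len:.1f} bytes")
print(f"gzip output is {gzip_ratio:.1%} of the input ({gzip_saving:.1%} saved)")
```

The average works out to roughly 56 bytes per string, in line with the training-set figure quoted earlier, and whole-file gzip output is a bit under a quarter of the input size.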

Khao · 10y ago · 2 in thread

I get negative compression percentage when I put words with "é" in the test box.

jozan · 10y ago

By default it doesn't work well with non-ASCII characters.

https://ed-von-schleck.github.io/shoco/#how-it-works
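
Counting bytes makes the negative percentage easier to see. The Python sketch below assumes the input is UTF-8 and, as an unverified reading of the linked section, that each non-ASCII byte costs one extra marker byte in the output. `estimate_worst_case` is a hypothetical helper, not part of Shoco's API, and it ignores any savings on the ASCII portion.

```python
def non_ascii_bytes(text: str) -> int:
    """Count UTF-8 bytes with the high bit set (outside 7-bit ASCII)."""
    return sum(1 for b in text.encode("utf-8") if b >= 0x80)


def estimate_worst_case(text: str) -> int:
    """Hypothetical estimate: input size plus one ASSUMED marker byte per non-ASCII byte."""
    return len(text.encode("utf-8")) + non_ascii_bytes(text)


word = "été"
print(len(word.encode("utf-8")), non_ascii_bytes(word), estimate_worst_case(word))
```

Each "é" is two bytes in UTF-8, both with the high bit set, so under this assumption a word that is mostly accented characters can come out larger than it went in.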

Semiapies · 10y ago

Between this (an ASCII-only compressor in 2015?) and the other aspects brought up here, it seems downright toylike.

techwizrd · 10y ago · 1 in thread

I wonder what'd happen if you used this on base64 strings.

bmh100 · 10y ago

I would love to see a blog post about that test, if you're willing.
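
As a baseline for the base64 question, a small Python sketch of the ceiling on any gain: base64 spends one 8-bit character on every 6 bits of payload, so simply decoding recovers 25% of the size. It says nothing about what Shoco itself would achieve; the random payload is a stand-in for arbitrary binary data.

```python
import base64
import os

payload = os.urandom(30)              # random bytes stand in for arbitrary binary data
encoded = base64.b64encode(payload)   # 4 output characters per 3 input bytes

# For random payloads, undoing the encoding is on average the best a
# lossless scheme can do: each base64 character carries only 6 bits.
best_case_saving = 1 - len(payload) / len(encoded)
print(len(payload), len(encoded), f"{best_case_saving:.0%}")
```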

thrownaway2424 · 10y ago · 1 in thread

I can't tell you how many times I've said to myself "if only these very short ASCII strings were even shorter!"

BrandonSmith · 10y ago

At scale, and if you are paying for transmission costs, it can have a massive impact.

knodi123 · 10y ago

Look how well it can compress "fofofofofofofofofofofo".

50%

Look how well it can compress "ababababababababababab".

rurban · 10y ago

Will test against smaz for our internal JSON compressed protocol. smaz compressed fine but was too slow. The ability to train the model sounds convincing.
