CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1
CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N
O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1
On the raw data set (on record per line), wc reports: 1455763 1455763 82882385
while | gzip -c | wc -c reports 18773892.I wish you wouldn't do that. That defeats the entire point of a website such as this. Just because you don't think that this is interesting to random people doesn't mean that random people don't think this is interesting.
Assuming that bmh100 is both interested in working on this and doesn't have the domain knowledge, I gave a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar. (Each method requiring less domain knowledge and more CS experience.)
This this is a big chunk to chew on, and this is the weekend, I figure it will take a few days to digest and be able to response. Since HN doesn't have notifications, how long should I actively check this thread for replies?
By sending email, I also invite a response after a couple of months, should that be the case. (I yesterday got a followup on a topic that was 4 years old.) So no, supporting these long-term research exchanges is not one of the main goals of HN.
You'll note that I also answered what bmh100 asked for here. If you find it interesting, then feel feel to ask interesting questions.
Oops! I forgot information about the unique symbols. Here's each unique symbol and its count.
'c' 17253540 'S' 432387 '8' 5960
'C' 14942972 'l' 373194 'i' 3307
')' 7687638 '/' 342041 '9' 2354
'(' 7687638 's' 164740 '0' 1457
'O' 5116146 'o' 161862 'e' 1421
'1' 4388585 '+' 137743 'K' 1280
'2' 3327339 '5' 119946 'L' 442
'=' 3311807 '.' 98163 'A' 287
'N' 2949791 '#' 85482 'Z' 147
'@' 2241799 'B' 82464 'b' 122
'[' 1952216 'r' 79603 'g' 87
']' 1952216 '\\' 78388 'M' 64
'n' 1707728 'P' 42036 'T' 48
'3' 1597703 '6' 29852 'p' 37
'H' 1514805 'I' 15180 't' 21
'-' 521529 '7' 11771 'R' 12
'F' 497963 'a' 11311 'V' 8
'4' 485537 '%' 6450 'X' 3
and the first 80 most common bigrams and last 19 of the 833 total (using "\n" to indicate "end of word") are: 'cc' 7981192 ')N' 998239 '2C' 458454 'N1' 245581 ... many lines omitted ...
'CC' 3965060 '](' 943201 ')O' 456391 'n1' 228523 '91' 1
'O)' 3141345 '1\n' 892805 'F)' 456060 'O=' 228415 '#S' 1
'C(' 2841639 'NC' 850938 'C2' 426474 '3C' 223729 'lS' 1
'c1' 2792900 '2)' 793573 'N(' 413951 'C=' 217307 '03' 1
'(C' 2596059 '(O' 779762 '3)' 410549 'O\n' 211774 'P\\' 1
'=O' 2496004 'Cc' 764483 '1)' 406001 '/C' 207957 'PN' 1
')c' 2489724 'OC' 764313 'Cl' 373140 '[n' 205910 'PO' 1
'c(' 2444133 'CN' 744679 '-c' 362109 ']2' 203805 'K]' 1
'(=' 2296603 '@@' 732946 'Oc' 358157 'C3' 202290 'p3' 1
'c2' 2105356 ')[' 705999 ')n' 353823 ')=' 196635 'I1' 1
')C' 1692206 'nc' 699014 'cn' 345084 'S(' 196491 'i@' 1
'[C' 1513171 'CO' 675894 '(F' 344658 '2n' 188360 'II' 1
'H]' 1512910 '1C' 660311 '(c' 344068 'n2' 180310 'i-' 1
'C@' 1507657 '=C' 611606 'N)' 332280 'nH' 177385 ']I' 1
'C)' 1489856 '3c' 598751 ']1' 308170 '1n' 174940 'Bc' 1
'1c' 1463358 '(N' 595762 'Nc' 298653 '@]' 174511 'Bi' 1
'@H' 1333765 'C1' 588362 'c4' 291931 '4c' 173949 'B.' 1
'2c' 1204003 ')(' 504595 '(-' 290801 '1O' 173802 'H7' 1
'c3' 1031134 'C[' 478529 'l)' 289924 'N[' 170445 '.n' 1
Not something I would think is particularly interesting for an HN post.Among other things, "a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar" is something I would be interested in. And, considering the upvotes I got for my grandparent comment, something that other people would be interested in as well.
Also: http://hnnotify.com/