undefined | Better HN

0 pointsdalke10y ago0 comments

Sure. I'm switching this conversation to email though, using the gmail account in your profile. Short version is, I trained it on the RDKit-generated SMILES strings from ChEMBL-20. Three of the strings look like this:

    CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1
    CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N
    O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1

On the raw data set (on record per line), wc reports:

     1455763 1455763 82882385

while | gzip -c | wc -c reports 18773892.

CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1 CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1

0 comments

4 comments · 1 top-level

TheLoneWolfling10y ago· 3 in thread

> I'm switching this conversation to email though

I wish you wouldn't do that. That defeats the entire point of a website such as this. Just because you don't think that this is interesting to random people doesn't mean that random people don't think this is interesting.

dalkeOP10y ago

The email I sent was 178K long, with 1000 real-world examples (to get an idea of the character distribution), and the .h file model generated by shoco on the entire data set.

Assuming that bmh100 is both interested in working on this and doesn't have the domain knowledge, I gave a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar. (Each method requiring less domain knowledge and more CS experience.)

This this is a big chunk to chew on, and this is the weekend, I figure it will take a few days to digest and be able to response. Since HN doesn't have notifications, how long should I actively check this thread for replies?

By sending email, I also invite a response after a couple of months, should that be the case. (I yesterday got a followup on a topic that was 4 years old.) So no, supporting these long-term research exchanges is not one of the main goals of HN.

You'll note that I also answered what bmh100 asked for here. If you find it interesting, then feel feel to ask interesting questions.

dalkeOP10y ago

> "You'll note that I also answered what bmh100 asked for here."

Oops! I forgot information about the unique symbols. Here's each unique symbol and its count.

   'c' 17253540     'S'   432387     '8'     5960
   'C' 14942972     'l'   373194     'i'     3307
   ')'  7687638     '/'   342041     '9'     2354
   '('  7687638     's'   164740     '0'     1457
   'O'  5116146     'o'   161862     'e'     1421
   '1'  4388585     '+'   137743     'K'     1280
   '2'  3327339     '5'   119946     'L'      442
   '='  3311807     '.'    98163     'A'      287
   'N'  2949791     '#'    85482     'Z'      147
   '@'  2241799     'B'    82464     'b'      122
   '['  1952216     'r'    79603     'g'       87
   ']'  1952216    '\\'    78388     'M'       64
   'n'  1707728     'P'    42036     'T'       48
   '3'  1597703     '6'    29852     'p'       37
   'H'  1514805     'I'    15180     't'       21
   '-'   521529     '7'    11771     'R'       12
   'F'   497963     'a'    11311     'V'        8
   '4'   485537     '%'     6450     'X'        3

and the first 80 most common bigrams and last 19 of the 833 total (using "\n" to indicate "end of word") are:

  'cc' 7981192   ')N'  998239   '2C'  458454   'N1'  245581   ... many lines omitted ...
  'CC' 3965060   ']('  943201   ')O'  456391   'n1'  228523   '91'       1
  'O)' 3141345  '1\n'  892805   'F)'  456060   'O='  228415   '#S'       1
  'C(' 2841639   'NC'  850938   'C2'  426474   '3C'  223729   'lS'       1
  'c1' 2792900   '2)'  793573   'N('  413951   'C='  217307   '03'       1
  '(C' 2596059   '(O'  779762   '3)'  410549  'O\n'  211774  'P\\'       1
  '=O' 2496004   'Cc'  764483   '1)'  406001   '/C'  207957   'PN'       1
  ')c' 2489724   'OC'  764313   'Cl'  373140   '[n'  205910   'PO'       1
  'c(' 2444133   'CN'  744679   '-c'  362109   ']2'  203805   'K]'       1
  '(=' 2296603   '@@'  732946   'Oc'  358157   'C3'  202290   'p3'       1
  'c2' 2105356   ')['  705999   ')n'  353823   ')='  196635   'I1'       1
  ')C' 1692206   'nc'  699014   'cn'  345084   'S('  196491   'i@'       1
  '[C' 1513171   'CO'  675894   '(F'  344658   '2n'  188360   'II'       1
  'H]' 1512910   '1C'  660311   '(c'  344068   'n2'  180310   'i-'       1
  'C@' 1507657   '=C'  611606   'N)'  332280   'nH'  177385   ']I'       1
  'C)' 1489856   '3c'  598751   ']1'  308170   '1n'  174940   'Bc'       1
  '1c' 1463358   '(N'  595762   'Nc'  298653   '@]'  174511   'Bi'       1
  '@H' 1333765   'C1'  588362   'c4'  291931   '4c'  173949   'B.'       1
  '2c' 1204003   ')('  504595   '(-'  290801   '1O'  173802   'H7'       1
  'c3' 1031134   'C['  478529   'l)'  289924   'N['  170445   '.n'       1

Not something I would think is particularly interesting for an HN post.

TheLoneWolfling10y ago

If it's not confidential (and I am assuming it isn't), why not just link it in a gist or something? That way other people can also take a crack at it.

Among other things, "a synopsis of the SMILES notation, its use as a molecule identifier, a way to reproduce my data set, and a couple of possible alternatives for getting something similar" is something I would be interested in. And, considering the upvotes I got for my grandparent comment, something that other people would be interested in as well.

Also: http://hnnotify.com/

1 more reply

j / k navigate · click thread line to collapse

0 comments

4 comments · 1 top-level

TheLoneWolfling10y ago· 3 in thread

> I'm switching this conversation to email though

dalkeOP10y ago

The email I sent was 178K long, with 1000 real-world examples (to get an idea of the character distribution), and the .h file model generated by shoco on the entire data set.

You'll note that I also answered what bmh100 asked for here. If you find it interesting, then feel feel to ask interesting questions.

dalkeOP10y ago

> "You'll note that I also answered what bmh100 asked for here."

Oops! I forgot information about the unique symbols. Here's each unique symbol and its count.

   'c' 17253540     'S'   432387     '8'     5960
   'C' 14942972     'l'   373194     'i'     3307
   ')'  7687638     '/'   342041     '9'     2354
   '('  7687638     's'   164740     '0'     1457
   'O'  5116146     'o'   161862     'e'     1421
   '1'  4388585     '+'   137743     'K'     1280
   '2'  3327339     '5'   119946     'L'      442
   '='  3311807     '.'    98163     'A'      287
   'N'  2949791     '#'    85482     'Z'      147
   '@'  2241799     'B'    82464     'b'      122
   '['  1952216     'r'    79603     'g'       87
   ']'  1952216    '\\'    78388     'M'       64
   'n'  1707728     'P'    42036     'T'       48
   '3'  1597703     '6'    29852     'p'       37
   'H'  1514805     'I'    15180     't'       21
   '-'   521529     '7'    11771     'R'       12
   'F'   497963     'a'    11311     'V'        8
   '4'   485537     '%'     6450     'X'        3

and the first 80 most common bigrams and last 19 of the 833 total (using "\n" to indicate "end of word") are:

  'cc' 7981192   ')N'  998239   '2C'  458454   'N1'  245581   ... many lines omitted ...
  'CC' 3965060   ']('  943201   ')O'  456391   'n1'  228523   '91'       1
  'O)' 3141345  '1\n'  892805   'F)'  456060   'O='  228415   '#S'       1
  'C(' 2841639   'NC'  850938   'C2'  426474   '3C'  223729   'lS'       1
  'c1' 2792900   '2)'  793573   'N('  413951   'C='  217307   '03'       1
  '(C' 2596059   '(O'  779762   '3)'  410549  'O\n'  211774  'P\\'       1
  '=O' 2496004   'Cc'  764483   '1)'  406001   '/C'  207957   'PN'       1
  ')c' 2489724   'OC'  764313   'Cl'  373140   '[n'  205910   'PO'       1
  'c(' 2444133   'CN'  744679   '-c'  362109   ']2'  203805   'K]'       1
  '(=' 2296603   '@@'  732946   'Oc'  358157   'C3'  202290   'p3'       1
  'c2' 2105356   ')['  705999   ')n'  353823   ')='  196635   'I1'       1
  ')C' 1692206   'nc'  699014   'cn'  345084   'S('  196491   'i@'       1
  '[C' 1513171   'CO'  675894   '(F'  344658   '2n'  188360   'II'       1
  'H]' 1512910   '1C'  660311   '(c'  344068   'n2'  180310   'i-'       1
  'C@' 1507657   '=C'  611606   'N)'  332280   'nH'  177385   ']I'       1
  'C)' 1489856   '3c'  598751   ']1'  308170   '1n'  174940   'Bc'       1
  '1c' 1463358   '(N'  595762   'Nc'  298653   '@]'  174511   'Bi'       1
  '@H' 1333765   'C1'  588362   'c4'  291931   '4c'  173949   'B.'       1
  '2c' 1204003   ')('  504595   '(-'  290801   '1O'  173802   'H7'       1
  'c3' 1031134   'C['  478529   'l)'  289924   'N['  170445   '.n'       1

Not something I would think is particularly interesting for an HN post.

TheLoneWolfling10y ago

If it's not confidential (and I am assuming it isn't), why not just link it in a gist or something? That way other people can also take a crack at it.

Also: http://hnnotify.com/

1 more reply

j / k navigate · click thread line to collapse