Skip to content

Molecule curation

Karen Yook edited this page Jun 8, 2013 · 9 revisions

Table of Contents

Requested changes

for WS24?

Add two fields/tables to the Molecule OA:

  • Field name "GO target" ; Table name "mop_gotarget"
  • Field name "Gene target"; Table name "mop_genetarget"
Both are multiple entry. Cell Component target should use the GO obo used in the other OA's (i.e., GO, Expresssion Pattern). Gene target should use the Gene list used in other OA's as well. Add both fields between "Molecule Use" and "MeSH/CTD of Default" fields.

Do not dump these fields, no model change requested yet- accruing data before requesting the changes.

for WS235

  • Assign WBMoleculeIDs to all WB molecules.
    • Add a WBMolID field to the OA between WBPaper and Public name
    • Assign WBMolIDs to all molecules based on the pgid.
    • Change the dumper on tazendra acedb/karen/Molecule/dump_molecule_ace.pl to dump the WBMolID as the Molecule name as shown below.
 Molecule : "WBMolecule:00000001"
 Public_name "beta-selinene"
 Database "NLM_MeSH" "UID" "C087920"
 Database "CTD"  "ChemicalID" "C087920"
 Database "ChemIDplus"  "17066-67-0"
 Database "ChEBI" "CHEBI_ID" "10443"
 Database "KEGG COMPOUND" "ACCESSION_NUMBER" "C09723"
  • Change ...{ print OUT "Database\t\"ChemIDplus\" \"$hash{chemi}{$joinkey}\"\n"; } to
  ...{ print OUT "Database\t\"ChemIDplus\" \"RN\" \"$hash{chemi}{$joinkey}\"\n"; }
  • Fix SMMID print out line to read SMID-DB
  • Dump "Molecule_use" as "Molecule_use" (this is already in the model)
  • Dump "Paper" as "Reference" (this requires model change approval)

WS234 (6/12)

  • automate dump on Wed morning for pick up by spica at 8am.
 0 4 * * wed /home/acedb/karen/Molecule/dump_molecule_ace.pl 
 writes to /home/acedb/karen/Molecule/Molecule.ace
  • spica pick up will also grab database.ace flat file in same directory: acedb/karen/Molecule/

Jobs

1. Line 42 'ChEMI' needs to be changed to 'ChEBI'

 if ($hash{chebi}{$joinkey}) { print OUT "Database\t\"ChEMI\" \"CHEBI_ID\" \"$hash{chebi}...

2. Lines 40 and 41 need to be suppressed when $molecule = "WBMol:"

  print OUT "Database\t\"NLM_MeSH\" \"UID\" \"$molecule\"\n";
  print OUT "Database\t\"CTD\" \"ChemicalID\" \"$molecule\"\n";

3. After changes to OA - for the .ace dump create storage and print out for SMMID DB values

  if ($hash{smmid}{$joinkey}) { print OUT "Database\t\"SMMID DB\" \"$hash{smmid}{$joinkey}\"\n"; }

Mol dumper overview

on tazendra acedb/karen/Molecule/dump_molecule_ace.pl (also /home/acedb/public_html/karen/)
dumps molecule OA postgres tables starting WS219
dumps to:
/home/acedb/public_html/karen/Molecule.ace
/home/acedb/karen/Molecule/dump_molecule.ace.pl
which is symlinked to : http://tazendra.caltech.edu/~acedb/karen/Molecule.ace (a publicly accessible list of the molecules)
(if you want to keep copies, make a copy before the dump; the dump script overwrites the file)
the original source file is now Molecule.ace.20100728

NOTE: Variation (strain, transgene) phenotype-molecule associations are dumped through usepackage.pl.

mop_ tables

(tazendra or mangolassi) ~postgres/public_html/cgi-bin/oa/wormOA.pm

use &initWormFields to load a specific OA, each of them has a 3 letter code that corresponds to the postgres table name. 'mop' is for the molecule oa and the 'mop_' tables in postgres
mop calls &initWormMopFields currently around line 1200

mop hash set :

  $fields{mop}{id}{type}                             = 'text';
  $fields{mop}{id}{label}                            = 'pgid';
  $fields{mop}{id}{tab}                              = 'all';
  $fields{mop}{paper}{type}                          = 'multiontology';
  $fields{mop}{paper}{label}                         = 'WBPaper';
  $fields{mop}{paper}{tab}                           = 'all';
  $fields{mop}{paper}{ontology_type}                 = 'WBPaper';
  $fields{mop}{name}{type}                           = 'text';
  $fields{mop}{name}{label}                          = 'Name';
  $fields{mop}{name}{tab}                            = 'all';

most correspond to a postgres table (except mop {id}, which is the pgid. So in the example above, there are 2 postgres tables, mop_paper and mop_name

  • {type} is the type of data the table holds text/bigtext/ontology/dropdown/toggle/multiontology/multidropdown
  • {label} is the field name in the OA
  • {tab} is which tab number it should show in. 'all' shows up in all tabs, used in OAs with no numbered tabs
  • {ontology_type} refers to the type of ontology/multiontology values used for autocomplete and validation.
  • {ontology_table} is used if the ontology_type is 'obo', and the value refers to the obo_table that is updated via cronjob or populated just once. e.g.>br/>
  $fields{mop}{chebi}{type}                          = 'ontology';
  $fields{mop}{chebi}{label}                         = 'ChEBI_ID';
  $fields{mop}{chebi}{tab}                           = 'all';
  $fields{mop}{chebi}{ontology_type}                 = 'obo';
  $fields{mop}{chebi}{ontology_table}                = 'chebi';

in this case

  $fields{mop}{chebi}{ontology_type}                 = 'obo';
  $fields{mop}{chebi}{ontology_table}                = 'chebi';
 maps to obo_<name|data|syn>_chebi

if the {type} is dropdown, there's a {dropdown_type} which refers to the type of data like the {ontology_type}, but the values are hardcoded further down in the code instead of stored in postgres for querying.

For the molecule OA the tables are:
mop_cellcomp
mop_genetarget

dump_molecule_ace.pl

 # dump mop_ data into Molecule.ace for citace upload  2010 07 31
 
 use strict;
 use diagnostics;
 use DBI;
 
 my $dbh = DBI->connect ( "dbi:Pg:dbname=testdb", "", "") or die "Cannot connect to database!\n"; 
 
 my @tables = qw( molecule publicname synonym chemi chebi kegg );
 my %hash;
 my $result;
 
 
 my $directory = '/home/acedb/karen/Molecule';
 chdir ($directory) or die "Cannot chdir to $directory : $!";
 
 my $outfile = 'Molecule.ace';
 open (OUT, ">$outfile") or die "Cannot create $outfile : $!";
 
 foreach my $table (@tables) {
  my $pgtable = 'mop_' . $table;
  $result = $dbh->prepare( "SELECT * FROM $pgtable" );
  $result->execute() or die "Cannot prepare statement: $DBI::errstr\n"; 
  while (my @row = $result->fetchrow) {
    if ($row[0]) { $hash{$table}{$row[0]} = $row[1]; } }
 }
 
 foreach my $joinkey (sort {$a<=>$b} keys %{ $hash{molecule} }) {
  my ($molecule, $publicname, $syns, $chemi, $chebi, $kegg);
  if ($hash{molecule}{$joinkey}) { $molecule = $hash{molecule}{$joinkey}; }
  next unless $molecule;
 #   next if ($molecule =~ m/WBMol/);		# always print objects for Karen 2011 07 05
  print OUT "Molecule : \"$molecule\"\n";
  if ($hash{publicname}{$joinkey}) { print OUT "Public_name\t\"$hash{publicname}{$joinkey}\"\n"; }
  if ($hash{synonym}{$joinkey}) { 
    my (@syns) = split/ \| /, $hash{synonym}{$joinkey};
    foreach my $syn (@syns) { print OUT "Synonym\t\"$syn\"\n"; } }
  print OUT "Database\t\"NLM_MeSH\" \"UID\" \"$molecule\"\n";
  print OUT "Database\t\"CTD\" \"ChemicalID\" \"$molecule\"\n";
  if ($hash{chebi}{$joinkey}) { print OUT "Database\t\"ChEMI\" \"CHEBI_ID\" \"$hash{chebi}{$joinkey}\"\n"; }
  if ($hash{chemi}{$joinkey}) { print OUT "Database\t\"ChemIDplus\" \"$hash{chemi}{$joinkey}\"\n"; }
  if ($hash{kegg}{$joinkey}) { print OUT "Database\t\"KEGG COMPOUND\" \"ACCESSION_NUMBER\" \"$hash{kegg} {$joinkey}\"\n"; }
  print OUT "\n";
 } # foreach my $joinkey (sort {$a<=>$b} keys %{ $hash{molecule} })
 
 close (OUT) or die "Cannot close $outfile : $!";
 
  __END__
  
 Molecule : "C087920"
 Public_name "beta-selinene"
 Database "NLM_MeSH" "UID" "C087920"
 Database "CTD"  "ChemicalID" "C087920"
 Database "ChemIDplus"  "17066-67-0"
 Database "ChEBI" "CHEBI_ID" "10443"
 Database "KEGG COMPOUND" "ACCESSION_NUMBER" "C09723"
 
 
 my $result = $dbh->prepare( "SELECT * FROM two_comment WHERE two_comment ~ ?" );
 $result->execute() or die "Cannot prepare statement: $DBI::errstr\n"; 
 while (my @row = $result->fetchrow) {
  if ($row[0]) { 
    $row[0] =~ s/
 //g;
    $row[1] =~ s/
 //g;
    $row[2] =~ s/
 //g;
    print "$row[0]\t$row[1]\t$row[2]\n";
  } # if ($row[0])
 } # while (@row = $result->fetchrow)
  
 __END__
  
 my $result = $dbh->prepare( 'SELECT * FROM two_comment WHERE two_comment ~ ?' );
 $result->execute('elegans') or die "Cannot prepare statement: $DBI::errstr\n"; 
 
 $result->execute("doesn't") or die "Cannot prepare statement: $DBI::errstr\n"; 
 my $var = "doesn't";
 $result->execute($var) or die "Cannot prepare statement: $DBI::errstr\n"; 
 
 my $data = 'data';
 unless (is_utf8($data)) { from_to($data, "iso-8859-1", "utf8"); }
 
 my $result = $dbh->do( "DELETE FROM friend WHERE firstname = 'bl\"ah'" );
 (also do for INSERT and UPDATE if don't have a variable to interpolate with ? )
 
 can cache prepared SELECTs with $dbh->prepare_cached( &c. );
  
 if ($result->rows == 0) { print "No names matched.\n\n"; }	# test if no return
  
 $result->finish;	# allow reinitializing of statement handle (done with query)
 $dbh->disconnect;	# disconnect from DB
  
 http://209.85.173.132/search?q=cache:5CFTbTlhBGMJ:www.perl.com/pub/1999/10/DBI.html+dbi+prepare+execute&cd=4&hl=en&ct=clnk&gl=us
  
 interval stuff : 
 SELECT * FROM afp_passwd WHERE joinkey NOT IN (SELECT joinkey FROM afp_lasttouched) AND joinkey NOT IN  (SELECT joinkey FROM cfp_curator) AND afp_timestamp < CURRENT_TIMESTAMP - interval '21 days' AND afp_timestamp > CURRENT_TIMESTAMP - interval '28 days';
 
 casting stuff to substring on other types :
 SELECT * FROM afp_passwd WHERE CAST (afp_timestamp AS TEXT) ~ '2009-05-14';
 
 to concatenate string to query result :
  SELECT 'WBPaper' || joinkey FROM pap_identifier WHERE pap_identifier ~ 'pmid';
 to get :
  SELECT DISTINCT(gop_paper_evidence) FROM gop_paper_evidence WHERE gop_paper_evidence NOT IN (SELECT  'WBPaper' || joinkey FROM pap_identifier WHERE pap_identifier ~ 'pmid') AND gop_paper_evidence != '';

Clone this wiki locally