BioSQL, bioperl-db and UniGene


BioSQL, bioperl-db and UniGene

Marc Saric
Hi all,

I've got some questions regarding BioSQL I would like to ask here:

I am currently writing an app which should map microarray probe
sequences to target sequences. It should do so in a generalized manner
(i.e. any microarray against an arbitrary sequence-database). Currently
I need UniGene for Zebrafish (Dr.*) and several Oligonucleotide libs,
among them an Affymetrix array.

Because UniGene is a moving target (especially for unfinished genomes),
it would be good to do the mapping in a fully automated way.

I am thinking about doing sequence-based mapping of probe sequences with
BLAT or GMAP (as ProbeLynx does for Ensembl/TIGR-based data, but
unfortunately that tool is quite hard to port/extend for other databases).

In addition, I would like to have annotation-based mapping (i.e. take the
accession from the vendor-provided mapping and look up which UniGene
cluster it maps to) as a fallback/second option for microarrays where
probe sequences are not published.

I have installed/set up Bioperl 1.5.1 and the CVS versions of BioSQL and
bioperl-db with MySQL 4.1.12 on Mac OS X and was able to load taxon and
UniGene data from flatfiles, at least the cluster IDs and accessions as
available from the *.data file.

I was also able to rewrite microarray probes from various tab-delimited
formats or FASTA to GenBank, which worked OK for loading (albeit slowly).

(I hope you are still with me after this lengthy intro... :-) )

1st question:

Since the loader does not like raw FASTA files, what would be the most
elegant/efficient way of loading all the sequence files for the UniGene
build as well (normally provided in a FASTA file called *.seq.all,
Dr.seq.all in my case)? And how would I associate them with the cluster
data (i.e. there are already entries in bioentry for all sequences, but
they are missing the sequence data and most of their detailed annotation,
so this would be some kind of update)?

2nd question:

What would be the best way of integrating BLAT/GMAP (same output format as
BLAT) results? I'm thinking about parsing the file and writing the mapping
results as annotation into the database, linked to each probe sequence.
The data would include the hit(s) found for each probe, whether it hits
more than one cluster, and possibly some additional notes.

From there I would write out a report or a custom sequence file for use in
other tools.

If possible I would also like to accumulate annotations (like mapping
against different UniGene builds over time).

3rd question:

Because UniGene changes frequently, I would like to have some kind of
versioning, so that I can keep old versions of UniGene as a backup and add
new ones (i.e. not only keep the mapping results but also keep all the
source sequences).

If I understand it right, the load_seqdatabase script does not support
this and has no (command-line) option for overriding the "database" name
(i.e. for UniGene it will always be set to "UniGene" in biodatabase and
thus overwrite old versions)?

Do you see any fundamental problems here for versioning the data (except
storage space)?

Thanks in advance.

Links:

ProbeLynx http://koch.pathogenomics.ca/probelynx/
D.rerio UniGene: http://www.ncbi.nlm.nih.gov/UniGene/UGOrg.cgi?TAXID=7955


--
Bye,

Marc Saric


Re: BioSQL, bioperl-db and UniGene

Sean Davis
Hi Marc.  

I currently do something similar for all our arrays (about 7 different
platforms, three species, depending on need) at NHGRI/NIMH/NINDS.  I use
blat to map oligo sequences to refseq, ensembl, unigene_unique (the single
"best" sequence for a unigene cluster), UCSC known genes, and Human
Invitational as well as to several genome builds for each species.  I run
blat locally and then load all the blat results into one large database
table (about 5 million rows in the current build).  I also have an
annotation database that includes Entrez Gene, refseq, ensembl, unigene,
Human Invitational, UCSC knownGene, gene ontology, homologene, and a few
other things.  After doing the blats, I then choose the best hit for each
transcript database and map that to an associated gene model using the
annotation database.  I end up with oligos mapped to zero to many
transcripts for all large transcript databases, oligos mapped to zero to
many genes (and local storage of all the gene objects and associated
information for easy access), as well as mappings to multiple sources of
metadata.  Doing the blats for all these is quite fast (but DO NOT plan on
using bioperl to parse the 5M blat results.  Doing so will take DAYS).
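
For what it's worth, PSL output is just tab-delimited text, so a few lines
of plain Perl go a long way. A rough, untested sketch that keeps the
best-scoring target per probe (the file name and the matches-minus-mismatches
score are only placeholders):

use strict;
use warnings;

# probe name => [score, target name]
my %best;

open my $psl, '<', 'probes_vs_targets.psl' or die "open failed: $!";
while (my $line = <$psl>) {
    next unless $line =~ /^\d/;   # skip the PSL header lines, if any
    chomp $line;
    my @f = split /\t/, $line;
    # PSL columns: 0 = matches, 1 = mismatches, 9 = query name, 13 = target name
    my ($matches, $mismatches, $qname, $tname) = @f[0, 1, 9, 13];
    my $score = $matches - $mismatches;
    $best{$qname} = [$score, $tname]
        if !exists $best{$qname} || $score > $best{$qname}[0];
}
close $psl;

# one line per probe: probe, best target, score
for my $probe (sort keys %best) {
    printf "%s\t%s\t%d\n", $probe, $best{$probe}[1], $best{$probe}[0];
}

From there it is straightforward to bulk-load the resulting table into
whatever database you choose.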

Note that the process does not include storing all the sequences in the
database--there isn't a need for doing so if you are just blatting.  Also, I
do not use biosql in this situation because I found it rather slow for
mapping between different entities.  It did require building a database of
my own, but doing so makes it fairly easy to add tables as needed to support
another public database or to support a website, for example.  If you don't
want to build your own annotation database (the largest part of doing what I
have been doing), you can use one of several available including GeneKeyDB
(by our own Stefan Kirov) or Dragon DB.

Let me know if I can be of more help.

Sean



On 1/5/06 8:26 AM, "Marc Saric" <[hidden email]> wrote:

> [...]



Re: BioSQL, bioperl-db and UniGene

Hilmar Lapp

On Jan 5, 2006, at 5:26 AM, Marc Saric wrote:

> I am currently writing an app which should map microarray probe
> sequences to target sequences. It should do so in a generalized manner
> (i.e. any microarray against an arbitrary sequence-database). Currently
> I need UniGene for Zebrafish (Dr.*) and several Oligonucleotide libs,
> among them an Affymetrix array.

First off, you have seen the TIGR RESOURCERER application
(http://www.tigr.org/tigr-scripts/magic/r1.pl), right?

> [...]
> 1st question:
>
> Since the loader does not like raw FASTA files,

The loader likes all formats that Bio::SeqIO likes, so it doesn't
harbor any disdain for FASTA format. The only problem is that FASTA
format doesn't designate fields for accession, version, and name but
rather leaves it up to the file producer. This can be easily solved by
writing a custom SeqProcessor as pointed out several times before, for
instance:

http://portal.open-bio.org/pipermail/bioperl-l/2004-June/016204.html
http://portal.open-bio.org/pipermail/bioperl-l/2005-August/019579.html
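
To give you an idea, here is a minimal, untested sketch of such a processor
for the UniGene *.seq.all FASTA headers. The module name, the /gb= and /ug=
header tags, and the 'unigene_cluster' tag are assumptions - check what your
Dr.seq.all headers actually look like and adjust the regular expressions
accordingly.

package MyUniGeneSeqProcessor;

# Untested sketch: derive accession (and version) from UniGene-style FASTA
# descriptions such as
#   >gnl|UG|Dr#S16073805 ... /gb=BC045281.1 /ug=Dr.20 /len=1883
# The header tags assumed here may differ in your build.
use strict;
use Bio::Annotation::SimpleValue;
use base qw(Bio::Seq::BaseSeqProcessor);

sub process_seq {
    my ($self, $seq) = @_;
    my $desc = $seq->desc || '';
    # GenBank accession (and version, if present) from the /gb= tag
    if ($desc =~ m{/gb=(\w+)(?:\.(\d+))?}) {
        $seq->accession_number($1);
        $seq->version($2) if defined $2;
    }
    # remember the cluster membership as a simple tag/value annotation
    if ($desc =~ m{/ug=(\S+)}) {
        $seq->annotation->add_Annotation('unigene_cluster',
            Bio::Annotation::SimpleValue->new(-value => $1));
    }
    return ($seq);
}

1;

You would then name this module in load_seqdatabase.pl's --pipeline option
when loading Dr.seq.all with --format fasta (see the script's POD for the
exact option names).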

> what would be the most elegant/efficient way of loading all the sequence
> files for the UniGene build as well (normally provided in a FASTA file
> called *.seq.all, Dr.seq.all in my case)? And how would I associate them
> with the cluster data (i.e. there are already entries in bioentry for all
> sequences, but they are missing the sequence data and most of their
> detailed annotation, so this would be some kind of update)?

See above for the format issue. As for automatically updating your
sequences, use --lookup and possibly other update-related options for
load_seqdatabase.pl (see its POD).

>
> 2nd question:
>
> What would be the best way of integrating BLAT/GMAP (same output format
> as BLAT) results? I'm thinking about parsing the file and writing the
> mapping results as annotation into the database, linked to each probe
> sequence. The data would include the hit(s) found for each probe, whether
> it hits more than one cluster, and possibly some additional notes.
>
> From there I would write out a report or a custom sequence file for use
> in other tools.
>
> If possible I would also like to accumulate annotations (like mapping
> against different UniGene builds over time).

I'm not sure exactly what your question is. Note that you can attach
anything you like to sequences in the database, e.g., features and
annotations.

You can do so using Bioperl pretty easily. The sequence of steps is
basically: 1) retrieve the sequence object, 2) add annotations and/or
features, 3) call $pseq->store(), and commit with $pseq->commit().

There are some pertinent code fragments in
http://www.open-bio.org/bosc2003/slides/Persistent_Bioperl_BOSC03.pdf
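
In concrete terms, something along these lines should do it (untested
sketch; the connection parameters, the 'MyChip' namespace, and the
'unigene_hit' tag/value are placeholders for whatever you actually use):

use strict;
use Bio::DB::BioDB;
use Bio::Seq;
use Bio::Annotation::SimpleValue;

# connect to the BioSQL database (adjust parameters to your setup)
my $db = Bio::DB::BioDB->new(-database => 'biosql',
                             -driver   => 'mysql',
                             -dbname   => 'biosql',
                             -host     => 'localhost',
                             -user     => 'root',
                             -pass     => '');

# 1) retrieve the persistent sequence object for a probe
my $adaptor = $db->get_object_adaptor('Bio::SeqI');
my $probe = $adaptor->find_by_unique_key(
    Bio::Seq->new(-accession_number => 'PROBE_0001',
                  -namespace        => 'MyChip'));
die "probe not found" unless $probe;

# 2) add the mapping result as a tag/value annotation
$probe->annotation->add_Annotation('unigene_hit',
    Bio::Annotation::SimpleValue->new(-value => 'Dr.20'));

# 3) store and commit
$probe->store();
$probe->commit();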

Let me know if this doesn't answer your question.

>
> 3rd question:
>
> Because UniGene changes frequently, I would like to have some kind of
> versioning, so that I can keep old versions of UniGene as a backup and
> add new ones (i.e. not only keep the mapping results but also keep all
> the source sequences).
>
> If I understand it right, the load_seqdatabase script does not support
> this and has no (command-line) option for overriding the "database" name
> (i.e. for UniGene it will always be set to "UniGene" in biodatabase and
> thus overwrite old versions)?

Yes - the reason is that an instance of Bio::Cluster::UniGene will
default its namespace to 'UniGene' if none is provided by the caller, and
the UniGene parser doesn't provide one. load_seqdatabase itself doesn't
touch the namespace of the object if it has been set already.

I'm not quite happy with this myself, as it basically takes away control
from the user. I do think load_seqdatabase.pl's policy is correct, but
maybe the right thing to do for Bio::Cluster::UniGene is not to default to
a non-mandatory value if none is provided. How about I just propose to
make that change?

What you can do regardless of this is, before you load a new UniGene
version, rename the existing namespace to something that includes the
version. Then all entries will be created fresh under the then-new
namespace 'UniGene'.
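
E.g., with plain DBI against the biodatabase table (untested; the versioned
name is just an example):

use strict;
use DBI;

# archive the current namespace so the next load creates a fresh 'UniGene'
my $dbh = DBI->connect('dbi:mysql:database=biosql;host=localhost',
                       'root', '', { RaiseError => 1 });

$dbh->do(q{UPDATE biodatabase SET name = 'UniGene-2006-01'
           WHERE name = 'UniGene'});

$dbh->disconnect;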

Note that source sequences do not change because UniGene changes - there
will be new cluster members and other member sequences will be retired
from the cluster, but their sequences only change if the respective
GenBank sequence changes, which will not only increment the version but
also lead to a new GI number, which basically means a new cluster member
(as they are referenced by GI number).

>
> Do you see any fundamental problems here for versioning the data
> (except
> storage space)?

No, not at all.

Let me know if I didn't address your questions.

        -hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



Re: BioSQL, bioperl-db and UniGene

Sean Davis



On 1/5/06 2:02 PM, "Hilmar Lapp" <[hidden email]> wrote:

>
> On Jan 5, 2006, at 5:26 AM, Marc Saric wrote:
>
>> I am currently writing an app which should map microarray probe
>> sequences to target sequences. It should do so in a generalized manner
>> (i.e. any microarray against an arbitrary sequence-database). Currently
>> I need UniGene for Zebrafish (Dr.*) and several Oligonucleotide libs,
>> among them an Affymetrix array.
>
> First off, you have seen the TIGR RESOURCERER application
> (http://www.tigr.org/tigr-scripts/magic/r1.pl), right?

And, since we started out talking about microarrays, are you aware of the
BioConductor project and their annotation efforts, as well as a connection
to Resourcerer?

Sean



Re: BioSQL, bioperl-db and UniGene

Marc Saric
First of all, thanks for reading my lengthy mail and thanks for answering.

Sean Davis wrote:

>>First off, you have seen the TIGR RESOURCERER application
>>(http://www.tigr.org/tigr-scripts/magic/r1.pl), right?
>
> And, since we started out talking about microarrays, are you aware of the
> BioConductor project and their annotation efforts, as well as a connection
> to Resourcerer?

Or

http://source.stanford.edu/cgi-bin/source/sourceSearch

if you are interested in human, mouse or rat (I am currently not).

Yes, I have seen all of that before, and some more tools, although I
think I haven't had the time to really try out everything or dig deeply
into all the tools available.

The main point is that there are some microarray oligo sets which are not
covered by an existing source or which would need integration and
cross-checking of results. Because I had some experience with a set
(non-zebrafish) that was not covered by any source I could find on the
net, I prefer to do my own annotation, preferably based on the probe
sequence.

I will have a closer look at the stuff you pointed out to me and write
again to bioperl-l with some more questions in the near future.

Thanks again.

--
Bye,

Marc Saric
