Writing and retrieving Genbank files from BioSQL

Rik Rademaker
Dear all,

I am a biologist trying to write GenBank files to BioSQL. I am comfortable
writing Python scripts, but there is a problem with BioPython: the
molecule type in the LOCUS line is lost (e.g. 'circular DNA' becomes just
'DNA'). I am now trying to figure out how BioPerl handles this and how
BioPerl writes this information to BioSQL.

I have a BioSQL database (MySQL) and I can commit to BioSQL, e.g. via this
program:
#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::BioDB;
use Bio::DB::GenBank;

# Fetch the GenBank record from NCBI
my $genbank_id = 'L08752';

my $genDB    = Bio::DB::GenBank->new;
my $sequence = $genDB->get_Seq_by_id($genbank_id);

# Connect to the BioSQL database
my $db = Bio::DB::BioDB->new(-database => 'BioSQL',
                             -user     => 'root',
                             -dbname   => 'bioseqdb',
                             -host     => 'localhost',
                             -driver   => 'mysql');

# Wrap the sequence in a persistent object, insert it, and commit
my $pobj = $db->create_persistent($sequence);
$pobj->create();
$pobj->commit();

This works and I see the data appearing in BioSQL.

I am having a hard time figuring out how to retrieve the sequence object
from BioSQL and write it back to a GenBank file on disk.
Could someone give me some suggestions?

I would like to know whether BioPerl is capable of maintaining the 'circular'
property in the LOCUS line after the data has been exported to and
retrieved from BioSQL. So far, I have not identified any tables/fields that
store the circular property. The next step would be to implement this
behavior in BioPython, for which I have already contacted Peter Cock.
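
For readers unsure which part of the LOCUS line is meant, here is a
minimal Python sketch (standard library only) that pulls the molecule
type and topology out of a LOCUS line. The record shown is made up for
illustration and is not taken from this thread. The function name
parse_locus is hypothetical, and the parser assumes the modern
whitespace-separated layout in which the last two fields are the
division and the date.

```python
# Illustrative only: a made-up LOCUS line, not a real GenBank record.
EXAMPLE_LOCUS = (
    "LOCUS       EXAMPLE01               2686 bp    DNA     circular SYN 01-JAN-2000"
)

TOPOLOGIES = {"linear", "circular"}


def parse_locus(line):
    """Split a whitespace-separated LOCUS line into named fields.

    Assumes: LOCUS <name> <length> <bp|aa> [molecule] [topology] <division> <date>
    """
    tokens = line.split()
    if len(tokens) < 6 or tokens[0] != "LOCUS" or tokens[3] not in ("bp", "aa"):
        raise ValueError("not a recognisable LOCUS line")
    fields = {
        "name": tokens[1],
        "length": int(tokens[2]),
        "molecule": None,
        "topology": None,
    }
    # Everything between the unit and the trailing division/date pair
    # is either the topology keyword or the molecule type.
    for token in tokens[4:-2]:
        if token in TOPOLOGIES:
            fields["topology"] = token
        else:
            fields["molecule"] = token
    return fields


if __name__ == "__main__":
    print(parse_locus(EXAMPLE_LOCUS))
```

The 'circular' keyword is a separate token from 'DNA', which is why it
can go missing independently of the molecule type on a round trip.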

Kind regards, Rik.
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Writing and retrieving Genbank files from BioSQL

Roy Chaudhuri-3
Just noticed that my replies to Rik only went to the Google group; I'm not
sure whether those e-mails eventually filter through to bioperl-l. In case
they don't, here are my combined messages:

Hi Rik,

See this discussion between myself and Peter:
http://bioperl.org/pipermail/bioperl-l/2011-July/035435.html

There isn't a "circular" column in the BioSQL schema (although there
probably should be). However, you can hack around this by adding an
annotation tag called "is_circular" to your sequence before storing it
in BioSQL. The tag would have the value of $seq->is_circular (either 1
or undef). The code which extracts the sequence from the database would
need to be aware of this and convert the annotation tag back into the
BioPerl is_circular value (or the BioPython equivalent). I think Peter
was working on this in BioPython at the time so there may already be
code to do this.
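
To make the idea concrete, here is a toy Python sketch using only the
standard sqlite3 module. The tables toy_entry and toy_annotation are
deliberately simplified stand-ins and are NOT the real BioSQL schema;
the accessions and function names are hypothetical as well. It only
illustrates the round trip: write an "is_circular" tag when storing,
and turn it back into a flag when loading.

```python
import sqlite3


def store_entry(conn, accession, is_circular):
    """Store an entry; record circularity as a tag/value row (toy schema)."""
    conn.execute("INSERT INTO toy_entry (accession) VALUES (?)", (accession,))
    if is_circular:
        # A missing row plays the role of undef: only circular entries get a tag.
        conn.execute(
            "INSERT INTO toy_annotation (accession, tag, value) VALUES (?, ?, ?)",
            (accession, "is_circular", "1"),
        )


def load_is_circular(conn, accession):
    """Convert the stored tag back into a boolean flag."""
    row = conn.execute(
        "SELECT value FROM toy_annotation WHERE accession = ? AND tag = ?",
        (accession, "is_circular"),
    ).fetchone()
    return row is not None and row[0] == "1"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_entry (accession TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE toy_annotation (accession TEXT, tag TEXT, value TEXT)")

store_entry(conn, "CIRC01", True)   # hypothetical circular record
store_entry(conn, "LIN01", False)   # hypothetical linear record

print(load_is_circular(conn, "CIRC01"))
print(load_is_circular(conn, "LIN01"))
```

The point of the sketch is that no schema change is needed, but both the
writer and the reader have to know about the tag name.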

Here is some BioPerl code to retrieve a sequence from BioSQL and print
it to STDOUT:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
use Bio::DB::Query::BioQuery;
use Bio::DB::BioDB;

my $accession = 'L08752';

# Connect to the BioSQL database
my $dbadap = Bio::DB::BioDB->new(-database => 'biosql',
                                 -dbname   => 'bioseqdb',
                                 -user     => 'root',
                                 -pass     => 'pass',
                                 -driver   => 'mysql');

# Build a query for the entry with this accession
my $query = Bio::DB::Query::BioQuery->new(
    -datacollections => ["Bio::SeqI entry"],
    -where           => ["entry.accession_number='$accession'"],
);

my $objadap = $dbadap->get_object_adaptor('Bio::SeqI');
my $seq     = $objadap->find_by_query($query)->next_object;
die "Accession $accession not found\n" if not defined $seq;

# Move the "is_circular" annotation tag (if present) back onto the sequence
my ($circular) = $seq->annotation->remove_Annotations('is_circular');
$seq->is_circular($circular);

# Print the record in GenBank format to STDOUT
Bio::SeqIO->new(-format => 'genbank')->write_seq($seq);

Cheers,
Roy.

On 21/04/2014 15:55, Rik Rademaker wrote:

> [snip]

Re: Writing and retrieving Genbank files from BioSQL

Roy Chaudhuri-3
Hi Peter,

Just found your message on the Google group.

I think the correct way to deal with this would be to modify the BioSQL
schema to include columns for is_circular and molecule type in the
bioentry table.

However, if the BioSQL schema is considered immutable, a workaround
would be for BioPerl, BioPython etc. to agree on a standard way of
storing the information in the existing BioSQL schema.

I'd suggest we do this with two annotation tags:
"is_circular", with a value of BioPerl $seq->is_circular (1 or NULL)
"molecule", with a value of BioPerl $seq->molecule (DNA, RNA etc.)

Once a sequence is removed from the database, these annotation tags
could be removed and put in the correct place in the BioPerl/BioPython
object model.
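
As a rough sketch of what such a shared convention could look like on
the Python side (plain dictionaries only, no Biopython calls; the
function names to_tags and from_tags are hypothetical), the two tags can
be written before storing and stripped again after loading:

```python
def to_tags(is_circular, molecule):
    """Encode the two LOCUS-line properties as annotation tag/value pairs."""
    tags = {}
    if is_circular:
        tags["is_circular"] = "1"  # an absent tag stands in for undef/NULL
    if molecule:
        tags["molecule"] = molecule  # e.g. "DNA", "RNA"
    return tags


def from_tags(annotations):
    """Remove the two tags from an annotation dict.

    Returns (is_circular, molecule); the dict is modified in place so the
    tags do not linger as ordinary annotations after loading.
    """
    is_circular = annotations.pop("is_circular", None) == "1"
    molecule = annotations.pop("molecule", None)
    return is_circular, molecule


# Round trip on a stand-in annotation dictionary
annotations = {"comment": "example record"}
annotations.update(to_tags(True, "DNA"))
restored = from_tags(annotations)
print(restored, annotations)
```

Whether the tag values should be "1"/absent or something else is exactly
the kind of detail the projects would need to agree on.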

Cheers,
Roy.

 > Thank you for chasing this issue Rik :)
 >
 > From the Biopython point of view, all I really need to know
 > is where the linear/circular and molecule type information
 > from the GenBank LOCUS line end up in the BioSQL tables
 > (to make Biopython put it in the same place).
 >
 > https://redmine.open-bio.org/issues/2578
 >
 > Sadly I don't currently have a working BioSQL + BioPerl test
 > setup (it would be great if BioPerl could add SQLite support -
 > which would make it easy to do cross project testing).
 >
 > Thanks,
 >
 > Peter

Re: Writing and retrieving Genbank files from BioSQL

Fields, Christopher J
I do think Roy is correct in saying this could be set as an annotation.  In my opinion, though, I would like Seq-pertinent information (alphabet, circular, mol type) at the level of the sequence.  At least, to me that makes more sense, as it describes information directly relevant to the sequence itself, whereas to me annotations are more about the sequence record (pubs, taxonomy, etc).  But I’m probably splitting hairs.

BTW I agree re: SQLite support in bioperl-db.  However, bioperl-db uses a home-grown ORM, so this would entail creating a SQLite-specific DB loader.  The intent has been to move bioperl-db over to a consistent ORM (DBIx::Class) that would easily allow this, but that GSoC project didn’t have takers :P

chris

On Apr 23, 2014, at 11:58 AM, Roy Chaudhuri <[hidden email]> wrote:

> [snip]

Re: Writing and retrieving Genbank files from BioSQL

Peter Cock
CC'ing Hilmar and the BioSQL list.

Yes, it might be nice to extend the BioSQL schema for these
fields (circular, molecule type, etc).

Right now, whatever bioperl-db does is the effective standard,
so any assistance specifying what exactly it does would be
enough to make Biopython do the same.

Given we didn't get a GSoC project student to work on
BioSQL and SQLite, perhaps a plan B is needed?

Thanks,

Peter.

On Wednesday, April 23, 2014, Fields, Christopher J <[hidden email]>
wrote:

> [snip]