How is is_circular recorded in BioSQL (by BioPerl)?

Peter Cock
Hi all,

I'm trying to check how (currently) BioSQL should be used to record
if a sequence is circular or linear. I know this property is exposed in
BioPerl as the boolean is_circular() method from Bio::PrimarySeq,
and based on this old thread the value gets stored in BioSQL as a
sequence level annotation:

http://www.bioperl.org/pipermail/biosql-l/2005-June/000843.html
http://www.bioperl.org/pipermail/biosql-l/2005-June/000846.html
http://www.bioperl.org/pipermail/biosql-l/2005-June/000849.html
http://www.bioperl.org/pipermail/biosql-l/2005-June/000859.html

The term is_circular also matches nicely with GFF3, other than a
possible difference in capitalisation:

"Is_circular A flag to indicate whether a feature is circular."

and:

"For a circular genome, the landmark feature should include
Is_circular=true in column 9."

http://www.sequenceontology.org/gff3.shtml.

Can anyone confirm how exactly the is_circular (or Is_circular?)
annotation is used in BioSQL by BioPerl? I am guessing that
it is in the standard bioentry_qualifier_value table, with the
term_id referencing "is_circular" (check case), rank 0, and
value of "true" or "false" (check case).

I want to make Biopython's BioSQL usage consistent, see:
https://redmine.open-bio.org/issues/2578

Thanks,

Peter
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Roy Chaudhuri
Hi Peter,

As far as I understand, is_circular is not stored in BioSQL by default
when using bp_load_seqdatabase.pl. As indicated in the thread you
quoted, you can optionally store it as an annotation using a
SequenceProcessor - I use a copy of Bio::Seq::BaseSeqProcessor modified
with the following subroutine:

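use Bio::Annotation::SimpleValue;
# Override of process_seq in a copy of Bio::Seq::BaseSeqProcessor:
# attach is_circular to the sequence as a SimpleValue annotation.
sub process_seq {
    my ($self, $seq) = @_;
    # The tag name becomes the term name looked up in the SQL query below.
    my $value = Bio::Annotation::SimpleValue->new(
        -tagname => 'is_circular',
        -value   => $seq->is_circular,
    );
    $seq->annotation->add_Annotation($value);
    return ($seq);
}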

This SequenceProcessor module can be specified to bp_load_seqdatabase.pl
using the --pipeline flag, and results in the value of is_circular being
stored in the table bioentry_qualifier_value. Its value can be
determined using SQL like:

select q.value from bioentry e
join bioentry_qualifier_value q using(bioentry_id)
join term t using(term_id)
where e.accession='U00096' and t.name='is_circular'
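
For anyone wanting to try this query from Python, here is a minimal sketch using the standard sqlite3 module. The three tables are a deliberately cut-down subset of the BioSQL schema (an assumption for illustration; the real tables have more columns and constraints), the row contents are toy data, and qualifier_values is a hypothetical helper name.

```python
import sqlite3

# Toy in-memory database with a simplified subset of the BioSQL tables.
# Assumption: only the columns the query touches are modelled here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bioentry_qualifier_value (
    bioentry_id INTEGER, term_id INTEGER, value TEXT, rank INTEGER);
INSERT INTO bioentry VALUES (1, 'U00096');
INSERT INTO term VALUES (10, 'is_circular');
INSERT INTO bioentry_qualifier_value VALUES (1, 10, '1', 0);
""")


def qualifier_values(conn, accession, term_name):
    """Return all bioentry-level qualifier values for one accession and term."""
    sql = """select q.value from bioentry e
             join bioentry_qualifier_value q using(bioentry_id)
             join term t using(term_id)
             where e.accession = ? and t.name = ?"""
    return [row[0] for row in conn.execute(sql, (accession, term_name))]


values = qualifier_values(conn, "U00096", "is_circular")
```

Passing the accession and term name as bound parameters rather than pasting them into the SQL string avoids quoting problems; a term that was never stored simply yields an empty list.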

In the thread you mentioned, it was suggested that a specific
is_circular column be added to the BioSQL schema (in the biosequence
table), but I don't think this has been implemented yet.

Cheers,
Roy.

On 25/07/2011 09:29, Peter Cock wrote:

> Hi all,
>
> I'm trying to check how (currently) BioSQL should be used to record
> if a sequence is circular or linear. I know this property is exposed in
> BioPerl as the boolean is_circular() method from Bio::PrimarySeq,
> and based on this old thread the value gets stored in BioSQL as a
> sequence level annotation:
>
> http://www.bioperl.org/pipermail/biosql-l/2005-June/000843.html
> http://www.bioperl.org/pipermail/biosql-l/2005-June/000846.html
> http://www.bioperl.org/pipermail/biosql-l/2005-June/000849.html
> http://www.bioperl.org/pipermail/biosql-l/2005-June/000859.html
>
> The term is_circular also matches nicely with GFF3, other than a
> possible difference in capitalisation:
>
> "Is_circular A flag to indicate whether a feature is circular."
>
> and:
>
> "For a circular genome, the landmark feature should include
> Is_circular=true in column 9."
>
> http://www.sequenceontology.org/gff3.shtml.
>
> Can anyone confirm how exactly the is_circular (or Is_circular?)
> annotation is used in BioSQL by BioPerl? I am guessing that
> it is in the standard bioentry_qualifier_value table, with the
> term_id referencing "is_circular" (check case), rank 0, and
> value of "true" or "false" (check case).
>
> I want to make Biopython's BioSQL usage consistent, see:
> https://redmine.open-bio.org/issues/2578
>
> Thanks,
>
> Peter
> _______________________________________________
> BioSQL-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/biosql-l


Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Peter Cock
On Mon, Jul 25, 2011 at 12:14 PM, Roy Chaudhuri <[hidden email]> wrote:

> Hi Peter,
>
> As far as I understand, is_circular is not stored in BioSQL by default when
> using bp_load_seqdatabase.pl. As indicated in the thread you quoted, you can
> optionally store it as an annotation using a SequenceProcessor - I use a
> copy of Bio::Seq::BaseSeqProcessor modified with the following subroutine:
>
> sub process_seq {
>    my ($self,$seq) = @_;
>    my $value=Bio::Annotation::SimpleValue->new(-tagname=>'is_circular',
>                                           -value=>$seq->is_circular);
>    $seq->annotation->add_Annotation($value);
>    return ($seq);
> }
>
> This SequenceProcessor module can be specified to bp_load_seqdatabase.pl
> using the --pipeline flag, and results in the value of is_circular being
> stored in the table bioentry_qualifier_value. Its value can be determined
> using SQL like:
>
> select q.value from bioentry e
> join bioentry_qualifier_value q using(bioentry_id)
> join term t using(term_id)
> where e.accession='U00096' and t.name='is_circular'
>

That's interesting - I hadn't realised this was optional in BioPerl.

Can you tell what value is actually put in the database? Presumably
whatever Perl defaults to as the string representation of a boolean?

> In the thread you mentioned, it was suggested that a specific is_circular
> column be added to the BioSQL schema (in the biosequence table), but I don't
> think this has been implemented yet.
>
> Cheers,
> Roy.

Confirmed, there is no such column in the biosequence table (yet).

Thank you,

Peter


Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Roy Chaudhuri
> Can you tell what value is actually put in the database? Presumably
> whatever Perl defaults to as the string representation of a boolean?

The database value is either 1 or NULL (equivalent to 1 or undef in Perl).
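
A client reading this value back could map it to a boolean along the following lines. This is only a sketch under the convention reported above (1 for circular, NULL otherwise); both function names are hypothetical, and it is not taken from BioPerl or Biopython.

```python
def is_circular_from_db(value):
    """Map a stored qualifier value to a bool.

    Assumes the convention from this thread: 1 means circular,
    NULL (None in Python) means not circular or not recorded.
    """
    if value is None:
        return False
    # The value column holds text, so compare on the string form.
    return str(value).strip() == "1"


def is_circular_to_db(flag):
    """Inverse mapping: store '1' for circular, NULL otherwise."""
    return "1" if flag else None
```

Note that with this convention a NULL cannot distinguish "known to be linear" from "never recorded", so a reader of the database has to treat both the same way.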

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Peter Cock
On Mon, Jul 25, 2011 at 12:27 PM, Roy Chaudhuri <[hidden email]> wrote:
>> Can you tell what value is actually put in the database? Presumably
>> whatever Perl defaults to as the string representation of a boolean?
>
> The database value is either 1 or NULL (equivalent to 1 or undef in Perl).
>

Excellent - I can do the same in Biopython then.

I don't suppose you happen to know where the molecule type goes
(also in the GenBank/EMBL LOCUS/ID line, e.g. genomic DNA)?

Thank you,

Peter

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Roy Chaudhuri
I don't think there's any specific handling, but (in GenBank files at
least) mol_type is recorded as a tag in the source feature, so it will
be stored in BioSQL like any other feature tag (in
seqfeature_qualifier_value).
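
As a rough illustration of where such a feature tag would be found, here is a Python sqlite3 sketch against a simplified subset of the BioSQL feature tables. The table subset, the accession DEMO0001, the row contents, and the helper name source_mol_types are all assumptions made for the example.

```python
import sqlite3

# Toy database: simplified subset of the BioSQL feature tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY, bioentry_id INTEGER, type_term_id INTEGER);
CREATE TABLE seqfeature_qualifier_value (
    seqfeature_id INTEGER, term_id INTEGER, rank INTEGER, value TEXT);
INSERT INTO bioentry VALUES (1, 'DEMO0001');
INSERT INTO term VALUES (1, 'source');
INSERT INTO term VALUES (2, 'mol_type');
INSERT INTO seqfeature VALUES (100, 1, 1);
INSERT INTO seqfeature_qualifier_value VALUES (100, 2, 1, 'genomic DNA');
""")


def source_mol_types(conn, accession):
    """Return mol_type qualifier values from the source feature(s) of an entry."""
    sql = """SELECT q.value
             FROM bioentry e
             JOIN seqfeature f ON f.bioentry_id = e.bioentry_id
             JOIN term ft ON ft.term_id = f.type_term_id
             JOIN seqfeature_qualifier_value q ON q.seqfeature_id = f.seqfeature_id
             JOIN term qt ON qt.term_id = q.term_id
             WHERE e.accession = ? AND ft.name = 'source' AND qt.name = 'mol_type'"""
    return [row[0] for row in conn.execute(sql, (accession,))]


mol_types = source_mol_types(conn, "DEMO0001")
```

The function returns a list because a record may carry more than one source feature, or none with a mol_type tag at all.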

On 25/07/2011 12:30, Peter Cock wrote:

> On Mon, Jul 25, 2011 at 12:27 PM, Roy Chaudhuri<[hidden email]>  wrote:
>>> Can you tell what value is actually put in the database? Presumably
>>> whatever Perl defaults to as the string representation of a boolean?
>>
>> The database value is either 1 or NULL (equivalent to 1 or undef in Perl).
>>
>
> Excellent - I can do the same in Biopython then.
>
> I don't suppose you happen to know where the molecule type goes
> (also in the GenBank/EMBL LOCUS/ID line, e.g. genomic DNA)?
>
> Thank you,
>
> Peter


Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Peter Cock
On Mon, Jul 25, 2011 at 1:03 PM, Roy Chaudhuri <[hidden email]> wrote:
> I don't think there's any specific handling, but (in GenBank files at least)
> mol_type is recorded as a tag in the source feature, so it will be stored in
> BioSQL like any other feature tag (in seqfeature_qualifier_value).

I'd forgotten in my question this potential slight redundancy in the
GenBank format!

Consider this example, the molecule type is only in the LOCUS
line (DNA), and incidentally there are two source features:

http://biopython.org/SRC/biopython/Tests/GenBank/NT_019265.gb

Likewise in the current version of the sample record on the NCBI
website, the molecule type is only in the LOCUS line (in this case
again just as DNA, but other values are mentioned), and not in the
source feature:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.htm

However in this third example, the molecule type is in the LOCUS
line (as DNA) and in the source feature (as genomic DNA):

http://biopython.org/SRC/biopython/Tests/GenBank/NC_000932.gb

The GenBank/EMBL feature annotation is quite straightforward
to map to BioSQL (and I'm pretty sure Biopython and BioPerl
are consistent here). It's all the header information that
isn't as pinned down.

Let me clarify that I'm interested in if and where BioPerl stores the
molecule type from the GenBank LOCUS line in BioSQL (and I'm
expecting this to go in bioentry_qualifier_value table under some
tag name).

Thanks again,

Peter

P.S.

As has been discussed before, the BioSQL documentation would
benefit from at least one worked example of a (small) GenBank
file showing where each field ends up in the database. It would be
a reasonable amount of work though - but could then be used for
a basic compliance unit test by all the Bio* interfaces to BioSQL.

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Roy Chaudhuri
>> I don't think there's any specific handling, but (in GenBank files
>> at least) mol_type is recorded as a tag in the source feature, so
>> it will be stored in BioSQL like any other feature tag (in
>> seqfeature_qualifier_value).
>
> I'd forgotten in my question this potential slight redundancy in the
>  GenBank format!

No problem, I forgot in my answer that for some obscure reason people
may be interested in looking at GenBank files that aren't bacterial
genome sequences.

> Let me clarify that I'm interested in if and where BioPerl stores
> the molecule type from the GenBank LOCUS line in BioSQL (and I'm
> expecting this to go in bioentry_qualifier_value table under some tag
> name).

As far as I can tell, the only fields stored by default in
bioentry_qualifier_value are keyword, date_changed and
secondary_accession (although my database only contains GenBank
bacterial genomes). As with the is_circular hack, you could store the
molecule type by adding it as an annotation in the SequenceProcessor
(it's stored as $seq->molecule by BioPerl).

Actually, when round-tripping a GenBank file through BioSQL, the LOCUS
line molecule type ends up in lower case, which makes me wonder if it is
coming from alphabet in the biosequence table.

> P.S.
>
> As has been discussed before, the BioSQL documentation would benefit
> from at least one worked example of a (small) GenBank file showing
> where each field ends up in the database. It would be a reasonable
> amount of work though - but could then be used for a basic compliance
> unit test by all the Bio* interfaces to BioSQL.

I agree that this would be very useful - the SearchIO HOWTO has a
similar treatment of a BLAST report that I often refer to.

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Peter Cock
On Mon, Jul 25, 2011 at 3:12 PM, Roy Chaudhuri <[hidden email]> wrote:

>>> I don't think there's any specific handling, but (in GenBank files
>>> at least) mol_type is recorded as a tag in the source feature, so
>>> it will be stored in BioSQL like any other feature tag (in
>>> seqfeature_qualifier_value).
>>
>> I'd forgotten in my question this potential slight redundancy in the
>>  GenBank format!
>
> No problem, I forgot in my answer that for some obscure reason people
> may be interested in looking at GenBank files that aren't bacterial genome
> sequences.

Sampling bias ;)

>> Let me clarify that I'm interested in if and where BioPerl stores
>> the molecule type from the GenBank LOCUS line in BioSQL (and I'm
>> expecting this to go in bioentry_qualifier_value table under some tag
>> name).
>
> As far as I can tell, the only fields stored by default in
> bioentry_qualifier_value are keyword, date_changed and secondary_accession
> (although my database only contains GenBank bacterial genomes). As with the
> is_circular hack, you could store the molecule type by adding it as an
> annotation in the SequenceProcessor (it's stored as $seq->molecule by
> BioPerl).

OK, that makes sense.

> Actually, when round-tripping a GenBank file through BioSQL, the LOCUS line
> molecule type ends up in lower case, which makes me wonder if it is coming
> from alphabet in the biosequence table.

If so, that may break for viral GenBank files where the LOCUS line may say
RNA, but the sequence is given using acgt (i.e. the DNA alphabet).

>> P.S.
>>
>> As has been discussed before, the BioSQL documentation would benefit
>> from at least one worked example of a (small) GenBank file showing
>> where each field ends up in the database. It would be a reasonable
>> amount of work though - but could then be used for a basic compliance
>> unit test by all the Bio* interfaces to BioSQL.
>
> I agree that this would be very useful - the SearchIO HOWTO has a similar
> treatment of a BLAST report that I often refer to.

If only we could clone/fork bioinformaticians ;)

Peter


Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Roy Chaudhuri
>> Actually, when round-tripping a GenBank file through BioSQL, the LOCUS line
>> molecule type ends up in lower case, which makes me wonder if it is coming
>> from alphabet in the biosequence table.
>
> If so, that may break for viral GenBank files where the LOCUS line may say
> RNA, but the sequence is given using acgt (i.e. the DNA alphabet).

Just tried it and that does seem to be the case. It's not the only thing
that breaks on round tripping, for example circular genomes become linear.

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Peter Cock
On Mon, Jul 25, 2011 at 5:25 PM, Roy Chaudhuri <[hidden email]> wrote:

>>> Actually, when round-tripping a GenBank file through BioSQL, the LOCUS
>>> line molecule type ends up in lower case, which makes me wonder if it is
>>> coming from alphabet in the biosequence table.
>>
>> If so, that may break for viral GenBank files where the LOCUS line may say
>> RNA, but the sequence is given using acgt (i.e. the DNA alphabet).
>
> Just tried it and that does seem to be the case. It's not the only thing
> that breaks on round tripping, for example circular genomes become linear.
>

Sounds like one or two bug reports are needed,
http://redmine.open-bio.org/projects/bioperl

We already have one open on Biopython for this:
https://redmine.open-bio.org/issues/2578

Peter

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Fields, Christopher J
On Jul 25, 2011, at 10:39 AM, Peter Cock wrote:

> On Mon, Jul 25, 2011 at 3:12 PM, Roy Chaudhuri <[hidden email]> wrote:
>>>> I don't think there's any specific handling, but (in GenBank files
>>>> at least) mol_type is recorded as a tag in the source feature, so
>>>> it will be stored in BioSQL like any other feature tag (in
>>>> seqfeature_qualifier_value).
>>>
>>> I'd forgotten in my question this potential slight redundancy in the
>>>  GenBank format!
>>
>> No problem, I forgot in my answer that for some obscure reason people
>> may be interested in looking at GenBank files that aren't bacterial genome
>> sequences.
>
> Sampling bias ;)
>
>>> Let me clarify that I'm interested in if and where BioPerl stores
>>> the molecule type from the GenBank LOCUS line in BioSQL (and I'm
>>> expecting this to go in bioentry_qualifier_value table under some tag
>>> name).
>>
>> As far as I can tell, the only fields stored by default in
>> bioentry_qualifier_value are keyword, date_changed and secondary_accession
>> (although my database only contains GenBank bacterial genomes). As with the
>> is_circular hack, you could store the molecule type by adding it as an
>> annotation in the SequenceProcessor (it's stored as $seq->molecule by
>> BioPerl).
>
> OK, that makes sense.
>
>> Actually, when round-tripping a GenBank file through BioSQL, the LOCUS line
>> molecule type ends up in lower case, which makes me wonder if it is coming
>> from alphabet in the biosequence table.
>
> If so, that may break for viral GenBank files where the LOCUS line may say
> RNA, but the sequence is given using acgt (i.e. the DNA alphabet).

Not sure, but that's worth checking on.  Truthfully, our interest has typically leaned more towards parsing data into the proper classes for downstream analysis than towards round-tripping sequence formats.  Not that the latter isn't important, but there is frankly more interest in doing something beyond rote sequence format conversion.

>>> P.S.
>>>
>>> As has been discussed before, the BioSQL documentation would benefit
>>> from at least one worked example of a (small) GenBank file showing
>>> where each field ends up in the database. It would be a reasonable
>>> amount of work though - but could then be used for a basic compliance
>>> unit test by all the Bio* interfaces to BioSQL.
>>
>> I agree that this would be very useful - the SearchIO HOWTO has a similar
>> treatment of a BLAST report that I often refer to.
>
> If only we could clone/fork bioinformaticians ;)
>
> Peter

:)

chris



Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Fields, Christopher J
On Jul 25, 2011, at 11:29 AM, Peter Cock wrote:

> On Mon, Jul 25, 2011 at 5:25 PM, Roy Chaudhuri <[hidden email]> wrote:
>>>> Actually, when round-tripping a GenBank file through BioSQL, the LOCUS
>>>> line molecule type ends up in lower case, which makes me wonder if it is
>>>> coming from alphabet in the biosequence table.
>>>
>>> If so, that may break for viral GenBank files where the LOCUS line may say
>>> RNA, but the sequence is given using acgt (i.e. the DNA alphabet).
>>
>> Just tried it and that does seem to be the case. It's not the only thing
>> that breaks on round tripping, for example circular genomes become linear.
>>
>
> Sounds like one or two bug reports are needed,
> http://redmine.open-bio.org/projects/bioperl
>
> We already have one open on Biopython for this:
> https://redmine.open-bio.org/issues/2578
>
> Peter

If these reports have examples we can work on a fix.

chris

Re: [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Hilmar Lapp
I realize I'm chiming in here late, but the below sums it up quite well. In fact, the biosequence.alphabet column was originally (pre-2002) called molecule, and the BioPerl GenBank writer defaults to alphabet() if molecule() is not defined.
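
The fallback described here can be sketched in a few lines of Python. This only mirrors the logic as stated in this thread and is not the BioPerl writer's actual code; the function name locus_molecule is hypothetical.

```python
def locus_molecule(molecule, alphabet):
    """Pick the LOCUS-line molecule text.

    Uses the explicit molecule type when it is defined and falls back
    to the (lower-case) alphabet otherwise.
    """
    return molecule if molecule is not None else alphabet


# A viral record whose LOCUS line says RNA but whose sequence is written in acgt.
before = locus_molecule("RNA", "dna")  # molecule known from the parsed file
after = locus_molecule(None, "dna")    # molecule not stored, only the alphabet survives
```

Under these assumptions the record goes in as RNA and comes back out as dna, which matches the lower-casing and the viral RNA breakage reported earlier in the thread.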

-hilmar

Sent with a tap.

On Jul 25, 2011, at 9:12 AM, Roy Chaudhuri <[hidden email]> wrote:

> As with the is_circular hack, you could store the molecule type by adding it as an annotation in the SequenceProcessor (it's stored as $seq->molecule by BioPerl).
>
> Actually, when round-tripping a GenBank file through BioSQL, the LOCUS line molecule type ends up in lower case, which makes me wonder if it is coming from alphabet in the biosequence table.
