Odd problem with get_tag_values

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Odd problem with get_tag_values

Adlai Burman-2
I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:

#!/usr/bin/perl                                                                                                    
use strict;
use warnings;
use IO::String;
use Bio::Perl;
use Bio::SeqIO;
use IO::String;

my @files = </Users/adlai/Dropbox/atrsh/*>;
foreach my $file(@files){


my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
.
.
.
#do nifty stuff
}

For some files this approach works just fine.
For others the script dies immediately with the error message:

------------- EXCEPTION -------------
MSG: asking for tag value that does not exist gene
STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
STACK toplevel tosend.pl:16
-------------------------------------

The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
Does anyone know why this is a problem and what can be done to circumvent it?

Thanks,
Adlai


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Brian Osborne-2
Adlai,

You are absolutely sure that every single CDS feature has a "gene" tag inside it?

If this is not the case then you have to use the "if ($cds_feature->has_tag("gene")) …" type of logic.

Brian O.

On Feb 24, 2012, at 3:43 PM, Adlai Burman wrote:

> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>
> #!/usr/bin/perl                                                                                                    
> use strict;
> use warnings;
> use IO::String;
> use Bio::Perl;
> use Bio::SeqIO;
> use IO::String;
>
> my @files = </Users/adlai/Dropbox/atrsh/*>;
> foreach my $file(@files){
>
>
> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
> .
> .
> .
> #do nifty stuff
> }
>
> For some files this approach works just fine.
> For others the script dies immediately with the error message:
>
> ------------- EXCEPTION -------------
> MSG: asking for tag value that does not exist gene
> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
> STACK toplevel tosend.pl:16
> -------------------------------------
>
> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
> Does anyone know why this is a problem and what can be done to circumvent it?
>
> Thanks,
> Adlai
>
>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Jason Stajich-5
In reply to this post by Adlai Burman-2
not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.

You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.

my %strands;
for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
 if( $cds->has_tag('gene') ) {
        my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
        $strands{$gene} = $cds->strand;
 } else { # look in alternative places for a name, e.g. locus,
  ...
 }
}

An alternative is to loop through your list of tags in order of preference

my %strands;
for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
 for my $tag ( qw(gene locus name product accession note) ) {  
   if( $cds->has_tag($tag) ) {
        my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
        $strands{$name} = $cds->strand;
        $seen = 1;
        last;
 }
 if( ! $seen ) {
        warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
 }
}

On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:

> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>
> #!/usr/bin/perl                                                                                                    
> use strict;
> use warnings;
> use IO::String;
> use Bio::Perl;
> use Bio::SeqIO;
> use IO::String;
>
> my @files = </Users/adlai/Dropbox/atrsh/*>;
> foreach my $file(@files){
>
>
> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
> .
> .
> .
> #do nifty stuff
> }
>
> For some files this approach works just fine.
> For others the script dies immediately with the error message:
>
> ------------- EXCEPTION -------------
> MSG: asking for tag value that does not exist gene
> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
> STACK toplevel tosend.pl:16
> -------------------------------------
>
> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
> Does anyone know why this is a problem and what can be done to circumvent it?
>
> Thanks,
> Adlai
>
>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
[hidden email]
[hidden email]


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Adlai Burman-2
In reply to this post by Brian Osborne-2
Hey, Brian.
No, I am not absolutely sure about that but I am checking now. In the process of checking this I found out that one of the files that successfully parsed (NC_015139) had NO Features tags (other that "source"). No "CDS", no "gene" etc...
ok, now I am sure. I checked one record that didn't parse and and all the CDS's have "gene" tags. Go figure.
Regarding your suggestion: I agree, checking for the existence of a tag is an important thing to do and everything parses great in a script I wrote which uses that. This, however, might be problematic for two reasons:
(1) The fact that the aforementioned featureless record parses and one of the crashers does have a full complement of properly placed "genes" suggest that this might not address the problem, and
(2) On a more humbling note, I don't know how to embed such a check into the one line hash generator, my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; , which wold be perfect for what I am coding now.

Thanks for your response,
Adlai
On Feb 24, 2012, at 10:03 PM, Brian Osborne wrote:

> Adlai,
>
> You are absolutely sure that every single CDS feature has a "gene" tag inside it?
>
> If this is not the case then you have to use the "if ($cds_feature->has_tag("gene")) …" type of logic.
>
> Brian O.
>
> On Feb 24, 2012, at 3:43 PM, Adlai Burman wrote:
>
>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>
>> #!/usr/bin/perl                                                                                                    
>> use strict;
>> use warnings;
>> use IO::String;
>> use Bio::Perl;
>> use Bio::SeqIO;
>> use IO::String;
>>
>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>> foreach my $file(@files){
>>
>>
>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>> .
>> .
>> .
>> #do nifty stuff
>> }
>>
>> For some files this approach works just fine.
>> For others the script dies immediately with the error message:
>>
>> ------------- EXCEPTION -------------
>> MSG: asking for tag value that does not exist gene
>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>> STACK toplevel tosend.pl:16
>> -------------------------------------
>>
>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>> Does anyone know why this is a problem and what can be done to circumvent it?
>>
>> Thanks,
>> Adlai
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> [hidden email]
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Adlai Burman-2
In reply to this post by Brian Osborne-2
P.S. In case anyone is interested (and I hope they are), here is an example of one record that fails here and one that doesn't:

NC_015820 parses and NC_012927 fails. Unlike what I said earlier, some of the good ones have exons and some of the crashers don't.

On Feb 24, 2012, at 10:03 PM, Brian Osborne wrote:

> Adlai,
>
> You are absolutely sure that every single CDS feature has a "gene" tag inside it?
>
> If this is not the case then you have to use the "if ($cds_feature->has_tag("gene")) …" type of logic.
>
> Brian O.
>
> On Feb 24, 2012, at 3:43 PM, Adlai Burman wrote:
>
>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>
>> #!/usr/bin/perl                                                                                                    
>> use strict;
>> use warnings;
>> use IO::String;
>> use Bio::Perl;
>> use Bio::SeqIO;
>> use IO::String;
>>
>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>> foreach my $file(@files){
>>
>>
>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>> .
>> .
>> .
>> #do nifty stuff
>> }
>>
>> For some files this approach works just fine.
>> For others the script dies immediately with the error message:
>>
>> ------------- EXCEPTION -------------
>> MSG: asking for tag value that does not exist gene
>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>> STACK toplevel tosend.pl:16
>> -------------------------------------
>>
>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>> Does anyone know why this is a problem and what can be done to circumvent it?
>>
>> Thanks,
>> Adlai
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> [hidden email]
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Fields, Christopher J
In reply to this post by Adlai Burman-2
On 02/24/2012 02:43 PM, Adlai Burman wrote:

> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use IO::String;
> use Bio::Perl;
> use Bio::SeqIO;
> use IO::String;
>
> my @files =</Users/adlai/Dropbox/atrsh/*>;
> foreach my $file(@files){
>
>
> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file =>  $file)->next_seq->get_SeqFeatures;
> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
> .
> .
> .
> #do nifty stuff
> }
>
> For some files this approach works just fine.
> For others the script dies immediately with the error message:
>
> ------------- EXCEPTION -------------
> MSG: asking for tag value that does not exist gene
> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
> STACK toplevel tosend.pl:16
> -------------------------------------

There are two possibilities:

1) There is at least one feature w/o a 'gene' tag for those files.
2) This is a bug.

Either way it's hard to tell b/c we don't have the example data you are checking.

I would note this is *not* the way to screen for features with specific tags, though, at least with the current API.  You have to actually check for the presence of the tag first with has_tag('gene').  You could do that within the grep:

{ $_->primary_tag eq 'CDS' && $_->has_tag('gene') }

> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
> Does anyone know why this is a problem and what can be done to circumvent it?
>
> Thanks,
> Adlai

chris


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Adlai Burman-2
In reply to this post by Jason Stajich-5
Thanks so much, Jason.
I will give that a try in after I get a few hours of much needed sleep :-)


On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:

> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>
> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>
> my %strands;
> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
> if( $cds->has_tag('gene') ) {
> my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
> $strands{$gene} = $cds->strand;
> } else { # look in alternative places for a name, e.g. locus,
>  ...
> }
> }
>
> An alternative is to loop through your list of tags in order of preference
>
> my %strands;
> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
> for my $tag ( qw(gene locus name product accession note) ) {  
>   if( $cds->has_tag($tag) ) {
> my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
> $strands{$name} = $cds->strand;
>        $seen = 1;
>        last;
> }
> if( ! $seen ) {
> warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
> }
> }
>
> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>
>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>
>> #!/usr/bin/perl                                                                                                    
>> use strict;
>> use warnings;
>> use IO::String;
>> use Bio::Perl;
>> use Bio::SeqIO;
>> use IO::String;
>>
>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>> foreach my $file(@files){
>>
>>
>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>> .
>> .
>> .
>> #do nifty stuff
>> }
>>
>> For some files this approach works just fine.
>> For others the script dies immediately with the error message:
>>
>> ------------- EXCEPTION -------------
>> MSG: asking for tag value that does not exist gene
>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>> STACK toplevel tosend.pl:16
>> -------------------------------------
>>
>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>> Does anyone know why this is a problem and what can be done to circumvent it?
>>
>> Thanks,
>> Adlai
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> [hidden email]
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> [hidden email]
> [hidden email]
>
>


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Fields, Christopher J
Using has_tag('gene') as a pre-screen works for me for both example seqs.

chris

On Feb 24, 2012, at 3:33 PM, Adlai Burman wrote:

> Thanks so much, Jason.
> I will give that a try in after I get a few hours of much needed sleep :-)
>
>
> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>
>> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>>
>> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>>
>> my %strands;
>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>> if( $cds->has_tag('gene') ) {
>> my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
>> $strands{$gene} = $cds->strand;
>> } else { # look in alternative places for a name, e.g. locus,
>> ...
>> }
>> }
>>
>> An alternative is to loop through your list of tags in order of preference
>>
>> my %strands;
>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>> for my $tag ( qw(gene locus name product accession note) ) {  
>>  if( $cds->has_tag($tag) ) {
>> my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
>> $strands{$name} = $cds->strand;
>>       $seen = 1;
>>       last;
>> }
>> if( ! $seen ) {
>> warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
>> }
>> }
>>
>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>
>>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>>
>>> #!/usr/bin/perl                                                                                                    
>>> use strict;
>>> use warnings;
>>> use IO::String;
>>> use Bio::Perl;
>>> use Bio::SeqIO;
>>> use IO::String;
>>>
>>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>>> foreach my $file(@files){
>>>
>>>
>>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>>> .
>>> .
>>> .
>>> #do nifty stuff
>>> }
>>>
>>> For some files this approach works just fine.
>>> For others the script dies immediately with the error message:
>>>
>>> ------------- EXCEPTION -------------
>>> MSG: asking for tag value that does not exist gene
>>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>> STACK toplevel tosend.pl:16
>>> -------------------------------------
>>>
>>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>>> Does anyone know why this is a problem and what can be done to circumvent it?
>>>
>>> Thanks,
>>> Adlai
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> [hidden email]
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Jason Stajich
>> [hidden email]
>> [hidden email]
>>
>>
>
>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Adlai Burman-2
In reply to this post by Jason Stajich-5
Jason,
Your first solution, indeed, did the trick (though I'm not sure why). There was no need to for checking "else." I'm not sure why some records with a full set of "gene" tags would not parse without the check, but everything parsed with it.

Brian, you were right.

Thanks again,

Adlai
On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:

> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>
> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>
> my %strands;
> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
> if( $cds->has_tag('gene') ) {
> my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
> $strands{$gene} = $cds->strand;
> } else { # look in alternative places for a name, e.g. locus,
>  ...
> }
> }
>
> An alternative is to loop through your list of tags in order of preference
>
> my %strands;
> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
> for my $tag ( qw(gene locus name product accession note) ) {  
>   if( $cds->has_tag($tag) ) {
> my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
> $strands{$name} = $cds->strand;
>        $seen = 1;
>        last;
> }
> if( ! $seen ) {
> warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
> }
> }
>
> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>
>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>
>> #!/usr/bin/perl                                                                                                    
>> use strict;
>> use warnings;
>> use IO::String;
>> use Bio::Perl;
>> use Bio::SeqIO;
>> use IO::String;
>>
>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>> foreach my $file(@files){
>>
>>
>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>> .
>> .
>> .
>> #do nifty stuff
>> }
>>
>> For some files this approach works just fine.
>> For others the script dies immediately with the error message:
>>
>> ------------- EXCEPTION -------------
>> MSG: asking for tag value that does not exist gene
>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>> STACK toplevel tosend.pl:16
>> -------------------------------------
>>
>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>> Does anyone know why this is a problem and what can be done to circumvent it?
>>
>> Thanks,
>> Adlai
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> [hidden email]
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> [hidden email]
> [hidden email]
>
>


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Adlai Burman-2
In reply to this post by Fields, Christopher J

On Feb 24, 2012, at 10:46 PM, Fields, Christopher J wrote:

> Using has_tag('gene') as a pre-screen works for me for both example seqs.
>

Me too :-)

Dobrou noc and cheers,

Adlai

> chris
>
> On Feb 24, 2012, at 3:33 PM, Adlai Burman wrote:
>
>> Thanks so much, Jason.
>> I will give that a try in after I get a few hours of much needed sleep :-)
>>
>>
>> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>>
>>> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>>>
>>> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>>>
>>> my %strands;
>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>> if( $cds->has_tag('gene') ) {
>>> my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
>>> $strands{$gene} = $cds->strand;
>>> } else { # look in alternative places for a name, e.g. locus,
>>> ...
>>> }
>>> }
>>>
>>> An alternative is to loop through your list of tags in order of preference
>>>
>>> my %strands;
>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>> for my $tag ( qw(gene locus name product accession note) ) {  
>>> if( $cds->has_tag($tag) ) {
>>> my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
>>> $strands{$name} = $cds->strand;
>>>      $seen = 1;
>>>      last;
>>> }
>>> if( ! $seen ) {
>>> warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
>>> }
>>> }
>>>
>>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>>
>>>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>>>
>>>> #!/usr/bin/perl                                                                                                    
>>>> use strict;
>>>> use warnings;
>>>> use IO::String;
>>>> use Bio::Perl;
>>>> use Bio::SeqIO;
>>>> use IO::String;
>>>>
>>>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>>>> foreach my $file(@files){
>>>>
>>>>
>>>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>>>> .
>>>> .
>>>> .
>>>> #do nifty stuff
>>>> }
>>>>
>>>> For some files this approach works just fine.
>>>> For others the script dies immediately with the error message:
>>>>
>>>> ------------- EXCEPTION -------------
>>>> MSG: asking for tag value that does not exist gene
>>>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>>> STACK toplevel tosend.pl:16
>>>> -------------------------------------
>>>>
>>>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>>>> Does anyone know why this is a problem and what can be done to circumvent it?
>>>>
>>>> Thanks,
>>>> Adlai
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> [hidden email]
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>> Jason Stajich
>>> [hidden email]
>>> [hidden email]
>>>
>>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> [hidden email]
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Fields, Christopher J
In reply to this post by Adlai Burman-2
There is possibly a slight disconnect here.  The primary tag you are looking for is 'CDS'.  Here is an example of both the primary tag for 'CDS' and 'gene' from NC_012927:

     gene            complement(join(91716..92516,69640..69753))
                     /locus_tag="BaolC_p001"
                     /trans_splicing
                     /db_xref="GeneID:8223103"
     CDS             complement(join(91716..91744,92285..92516,69640..69753))
                     /locus_tag="BaolC_p001"
                     /trans_splicing
                     /codon_start=1
                     /transl_table=11
                     /product="ribosomal protein S12"
                     /protein_id="YP_003029720.1"
                     /db_xref="GI:253729537"
                     /db_xref="GeneID:8223103"
                     /translation="MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKK
                     PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTL
                     DAVAVKNRQQGRSKYGVKKPKK"

This 'CDS' example has no 'gene' regular tag, in fact none that I looked at did.  But there is another feature (above it) that does have the *primary* tag 'gene' and same locus_tag and dbxref GeneID (it does have tags such as 'locus_tag', 'trans_splicing', etc).  Is that what you mean?

The way that BioPerl deals with this is to return two different independent features, one with the 'CDS' primary tag and one with the 'gene' primary tag.  Past attempts to somehow combine these (and then disambiguate them again later if needed for output) can be problematic, or at least they once were.

chris


On Feb 24, 2012, at 3:55 PM, Adlai Burman wrote:

> Jason,
> Your first solution, indeed, did the trick (though I'm not sure why). There was no need to for checking "else." I'm not sure why some records with a full set of "gene" tags would not parse without the check, but everything parsed with it.
>
> Brian, you were right.
>
> Thanks again,
>
> Adlai
> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>
>> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>>
>> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>>
>> my %strands;
>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>> if( $cds->has_tag('gene') ) {
>> my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
>> $strands{$gene} = $cds->strand;
>> } else { # look in alternative places for a name, e.g. locus,
>> ...
>> }
>> }
>>
>> An alternative is to loop through your list of tags in order of preference
>>
>> my %strands;
>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>> for my $tag ( qw(gene locus name product accession note) ) {  
>>  if( $cds->has_tag($tag) ) {
>> my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
>> $strands{$name} = $cds->strand;
>>       $seen = 1;
>>       last;
>> }
>> if( ! $seen ) {
>> warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
>> }
>> }
>>
>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>
>>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>>
>>> #!/usr/bin/perl                                                                                                    
>>> use strict;
>>> use warnings;
>>> use IO::String;
>>> use Bio::Perl;
>>> use Bio::SeqIO;
>>> use IO::String;
>>>
>>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>>> foreach my $file(@files){
>>>
>>>
>>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>>> .
>>> .
>>> .
>>> #do nifty stuff
>>> }
>>>
>>> For some files this approach works just fine.
>>> For others the script dies immediately with the error message:
>>>
>>> ------------- EXCEPTION -------------
>>> MSG: asking for tag value that does not exist gene
>>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>> STACK toplevel tosend.pl:16
>>> -------------------------------------
>>>
>>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>>> Does anyone know why this is a problem and what can be done to circumvent it?
>>>
>>> Thanks,
>>> Adlai
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> [hidden email]
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Jason Stajich
>> [hidden email]
>> [hidden email]
>>
>>
>
>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Adlai Burman-2
I apologize, Chris. That was a terrible example I sent. I accidentally sent the only record for which there actually are no regular 'gene' tags. Other gb records that failed before Jason's tag check include NC_005086 and they all have regular 'gene' tags. I should have referenced that one.
I appreciate your checking into that and sorry about the red herring. Boy is my face blushed.

A.


On Feb 24, 2012, at 11:38 PM, Fields, Christopher J wrote:

> There is possibly a slight disconnect here.  The primary tag you are looking for is 'CDS'.  Here is an example of both the primary tag for 'CDS' and 'gene' from NC_012927:
>
>     gene            complement(join(91716..92516,69640..69753))
>                     /locus_tag="BaolC_p001"
>                     /trans_splicing
>                     /db_xref="GeneID:8223103"
>     CDS             complement(join(91716..91744,92285..92516,69640..69753))
>                     /locus_tag="BaolC_p001"
>                     /trans_splicing
>                     /codon_start=1
>                     /transl_table=11
>                     /product="ribosomal protein S12"
>                     /protein_id="YP_003029720.1"
>                     /db_xref="GI:253729537"
>                     /db_xref="GeneID:8223103"
>                     /translation="MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKK
>                     PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTL
>                     DAVAVKNRQQGRSKYGVKKPKK"
>
> This 'CDS' example has no 'gene' regular tag, in fact none that I looked at did.  But there is another feature (above it) that does have the *primary* tag 'gene' and same locus_tag and dbxref GeneID (it does have tags such as 'locus_tag', 'trans_splicing', etc).  Is that what you mean?
>
> The way that BioPerl deals with this is to return two different independent features, one with the 'CDS' primary tag and one with the 'gene' primary tag.  Past attempts to somehow combine these (and then disambiguate them again later if needed for output) can be problematic, or at least they once were.
>
> chris
>
>
> On Feb 24, 2012, at 3:55 PM, Adlai Burman wrote:
>
>> Jason,
>> Your first solution, indeed, did the trick (though I'm not sure why). There was no need to for checking "else." I'm not sure why some records with a full set of "gene" tags would not parse without the check, but everything parsed with it.
>>
>> Brian, you were right.
>>
>> Thanks again,
>>
>> Adlai
>> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>>
>>> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>>>
>>> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>>>
>>> my %strands;
>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>> if( $cds->has_tag('gene') ) {
>>> my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
>>> $strands{$gene} = $cds->strand;
>>> } else { # look in alternative places for a name, e.g. locus,
>>> ...
>>> }
>>> }
>>>
>>> An alternative is to loop through your list of tags in order of preference
>>>
>>> my %strands;
>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>> for my $tag ( qw(gene locus name product accession note) ) {  
>>> if( $cds->has_tag($tag) ) {
>>> my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
>>> $strands{$name} = $cds->strand;
>>>      $seen = 1;
>>>      last;
>>> }
>>> if( ! $seen ) {
>>> warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
>>> }
>>> }
>>>
>>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>>
>>>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>>>
>>>> #!/usr/bin/perl                                                                                                    
>>>> use strict;
>>>> use warnings;
>>>> use IO::String;
>>>> use Bio::Perl;
>>>> use Bio::SeqIO;
>>>> use IO::String;
>>>>
>>>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>>>> foreach my $file(@files){
>>>>
>>>>
>>>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>>>> .
>>>> .
>>>> .
>>>> #do nifty stuff
>>>> }
>>>>
>>>> For some files this approach works just fine.
>>>> For others the script dies immediately with the error message:
>>>>
>>>> ------------- EXCEPTION -------------
>>>> MSG: asking for tag value that does not exist gene
>>>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>>> STACK toplevel tosend.pl:16
>>>> -------------------------------------
>>>>
>>>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>>>> Does anyone know why this is a problem and what can be done to circumvent it?
>>>>
>>>> Thanks,
>>>> Adlai
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> [hidden email]
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>> Jason Stajich
>>> [hidden email]
>>> [hidden email]
>>>
>>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> [hidden email]
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Fields, Christopher J
No problem. This highlights the #1 problem with genbank files (and the reasons we keep things as simple as possible), namely lack of consistency.  Not a problem with NCBI per se, but a problem nonetheless.

Chris

On Feb 24, 2012, at 5:07 PM, "Adlai Burman" <[hidden email]> wrote:

> I apologize, Chris. That was a terrible example I sent. I accidentally sent the only record for which there actually are no regular 'gene' tags. Other gb records that failed before Jason's tag check include NC_005086 and they all have regular 'gene' tags. I should have referenced that one.
> I appreciate your checking into that and sorry about the red herring. Boy is my face blushed.
>
> A.
>
>
> On Feb 24, 2012, at 11:38 PM, Fields, Christopher J wrote:
>
>> There is possibly a slight disconnect here.  The primary tag you are looking for is 'CDS'.  Here is an example of both the primary tag for 'CDS' and 'gene' from NC_012927:
>>
>>    gene            complement(join(91716..92516,69640..69753))
>>                    /locus_tag="BaolC_p001"
>>                    /trans_splicing
>>                    /db_xref="GeneID:8223103"
>>    CDS             complement(join(91716..91744,92285..92516,69640..69753))
>>                    /locus_tag="BaolC_p001"
>>                    /trans_splicing
>>                    /codon_start=1
>>                    /transl_table=11
>>                    /product="ribosomal protein S12"
>>                    /protein_id="YP_003029720.1"
>>                    /db_xref="GI:253729537"
>>                    /db_xref="GeneID:8223103"
>>                    /translation="MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKK
>>                    PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTL
>>                    DAVAVKNRQQGRSKYGVKKPKK"
>>
>> This 'CDS' example has no 'gene' regular tag, in fact none that I looked at did.  But there is another feature (above it) that does have the *primary* tag 'gene' and same locus_tag and dbxref GeneID (it does have tags such as 'locus_tag', 'trans_splicing', etc).  Is that what you mean?
>>
>> The way that BioPerl deals with this is to return two different independent features, one with the 'CDS' primary tag and one with the 'gene' primary tag.  Past attempts to somehow combine these (and then disambiguate them again later if needed for output) can be problematic, or at least they once were.
>>
>> chris
>>
>>
>> On Feb 24, 2012, at 3:55 PM, Adlai Burman wrote:
>>
>>> Jason,
>>> Your first solution, indeed, did the trick (though I'm not sure why). There was no need to for checking "else." I'm not sure why some records with a full set of "gene" tags would not parse without the check, but everything parsed with it.
>>>
>>> Brian, you were right.
>>>
>>> Thanks again,
>>>
>>> Adlai
>>> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>>>
>>>> not all CDS will be annotated with a 'gene' tag, this is due to variation in how annotation is done and that there is not a requirement that there be a gene tag for all CDS features.
>>>>
>>>> You can protect your query - we often do this when dealing with data from the wild by testing for has_tag first.
>>>>
>>>> my %strands;
>>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>>> if( $cds->has_tag('gene') ) {
>>>>    my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this returns a list
>>>>    $strands{$gene} = $cds->strand;
>>>> } else { # look in alternative places for a name, e.g. locus,
>>>> ...
>>>> }
>>>> }
>>>>
>>>> An alternative is to loop through your list of tags in order of preference
>>>>
>>>> my %strands;
>>>> for my $cds ( grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) {
>>>> for my $tag ( qw(gene locus name product accession note) ) {  
>>>> if( $cds->has_tag($tag) ) {
>>>>    my ($name) = $cds->get_tag_values($tag); # get the 1st one, this returns a list
>>>>    $strands{$name} = $cds->strand;
>>>>     $seen = 1;
>>>>     last;
>>>> }
>>>> if( ! $seen ) {
>>>>    warn("not tag found for feature at ", $cds->location->to_FTstring, "\n");
>>>> }
>>>> }
>>>>
>>>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>>>
>>>>> I have come across a perplexing problem with trying to parse sequence features into hashes from gb records. This is the minimal code which shows my problem:
>>>>>
>>>>> #!/usr/bin/perl                                                                                                    
>>>>> use strict;
>>>>> use warnings;
>>>>> use IO::String;
>>>>> use Bio::Perl;
>>>>> use Bio::SeqIO;
>>>>> use IO::String;
>>>>>
>>>>> my @files = </Users/adlai/Dropbox/atrsh/*>;
>>>>> foreach my $file(@files){
>>>>>
>>>>>
>>>>> my @cds_features = grep {$_->primary_tag eq 'CDS' } Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>>>> my %strands = map {$_->get_tag_values('gene'), $_->strand} @cds_features; ##This Is The Culprit.
>>>>> .
>>>>> .
>>>>> .
>>>>> #do nifty stuff
>>>>> }
>>>>>
>>>>> For some files this approach works just fine.
>>>>> For others the script dies immediately with the error message:
>>>>>
>>>>> ------------- EXCEPTION -------------
>>>>> MSG: asking for tag value that does not exist gene
>>>>> STACK Bio::SeqFeature::Generic::get_tag_values /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>>>> STACK toplevel tosend.pl:16
>>>>> -------------------------------------
>>>>>
>>>>> The difference in the files that parse and those that don't seems to be that the files that crash have "intron" and "exon" tags. They ALL have "gene" tags.
>>>>> Does anyone know why this is a problem and what can be done to circumvent it?
>>>>>
>>>>> Thanks,
>>>>> Adlai
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> [hidden email]
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> Jason Stajich
>>>> [hidden email]
>>>> [hidden email]
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> [hidden email]
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Francisco J. Ossandón
I have been parsing Genbank files with Bioperl for a long time now, so for
many types of data I wrote a code that always checks  if the data exists
before asking for it, to avoid scripts crashes/warnings (I remember I made
customs GBKs deleting specific lines to manually check for crashes). I use
that in all my programs and they work fine. I use Perl ternaries for most
one-liners:

my $version    = $seq_obj->version || '';
my $definition = $seq_obj->desc    || '';
my $dna_shape  = $seq_obj->is_circular ? 'circular' : 'linear';

my $prot_id = $feat->has_tag('protein_id') ?
($feat->get_tag_values('protein_id'))[0] : '';
my $product = $feat->has_tag('product')    ?
($feat->get_tag_values('product'))[0]    : '';

I usually try to code defensively when parsing Genbanks, expecting every
step to go wrong. For example, try to parse “Escherichia coli str. K-12
substr. MG1655 chromosome” (NC_000913.gbk), and you will see many things go
wrong (like CDS without /protein_id or /product tags). ;)

Francisco J. Ossandon


-----Mensaje original-----
De: [hidden email]
[mailto:[hidden email]] En nombre de Fields,
Christopher J
Enviado el: viernes, 24 de febrero de 2012 20:29
Para: Adlai Burman
CC: <[hidden email]>; Jason Stajich
Asunto: Re: [Bioperl-l] Odd problem with get_tag_values

No problem. This highlights the #1 problem with genbank files (and the
reasons we keep things as simple as possible), namely lack of consistency.
Not a problem with NCBI per se, but a problem nonetheless.

Chris

On Feb 24, 2012, at 5:07 PM, "Adlai Burman" <[hidden email]>
wrote:

> I apologize, Chris. That was a terrible example I sent. I accidentally
sent the only record for which there actually are no regular 'gene' tags.
Other gb records that failed before Jason's tag check include NC_005086 and
they all have regular 'gene' tags. I should have referenced that one.
> I appreciate your checking into that and sorry about the red herring. Boy
is my face blushed.
>
> A.
>
>
> On Feb 24, 2012, at 11:38 PM, Fields, Christopher J wrote:
>
>> There is possibly a slight disconnect here.  The primary tag you are
looking for is 'CDS'.  Here is an example of both the primary tag for 'CDS'
and 'gene' from NC_012927:
>>
>>    gene            complement(join(91716..92516,69640..69753))
>>                    /locus_tag="BaolC_p001"
>>                    /trans_splicing
>>                    /db_xref="GeneID:8223103"
>>    CDS
complement(join(91716..91744,92285..92516,69640..69753))
>>                    /locus_tag="BaolC_p001"
>>                    /trans_splicing
>>                    /codon_start=1
>>                    /transl_table=11
>>                    /product="ribosomal protein S12"
>>                    /protein_id="YP_003029720.1"
>>                    /db_xref="GI:253729537"
>>                    /db_xref="GeneID:8223103"
>>
/translation="MPTVKQLIRNARQPIRNARKSAALKGCPQRRGTCARVYTINPKK
>>
PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYRIIRGTL
>>                    DAVAVKNRQQGRSKYGVKKPKK"
>>
>> This 'CDS' example has no 'gene' regular tag, in fact none that I looked
at did.  But there is another feature (above it) that does have the
*primary* tag 'gene' and same locus_tag and dbxref GeneID (it does have tags
such as 'locus_tag', 'trans_splicing', etc).  Is that what you mean?
>>
>> The way that BioPerl deals with this is to return two different
independent features, one with the 'CDS' primary tag and one with the 'gene'
primary tag.  Past attempts to somehow combine these (and then disambiguate
them again later if needed for output) can be problematic, or at least they
once were.
>>
>> chris
>>
>>
>> On Feb 24, 2012, at 3:55 PM, Adlai Burman wrote:
>>
>>> Jason,
>>> Your first solution, indeed, did the trick (though I'm not sure why).
There was no need to for checking "else." I'm not sure why some records with
a full set of "gene" tags would not parse without the check, but everything
parsed with it.
>>>
>>> Brian, you were right.
>>>
>>> Thanks again,
>>>
>>> Adlai
>>> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>>>
>>>> not all CDS will be annotated with a 'gene' tag, this is due to
variation in how annotation is done and that there is not a requirement that
there be a gene tag for all CDS features.
>>>>
>>>> You can protect your query - we often do this when dealing with data
from the wild by testing for has_tag first.
>>>>
>>>> my %strands;
>>>> for my $cds ( grep {$_->primary_tag eq 'CDS' }
>>>> Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) { if(
$cds->has_tag('gene') ) {
>>>>    my ($gene) = $cds->get_tag_values('gene'); # get the 1st one, this
returns a list

>>>>    $strands{$gene} = $cds->strand; } else { # look in alternative
>>>> places for a name, e.g. locus, ...
>>>> }
>>>> }
>>>>
>>>> An alternative is to loop through your list of tags in order of
>>>> preference
>>>>
>>>> my %strands;
>>>> for my $cds ( grep {$_->primary_tag eq 'CDS' }
>>>> Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures ) { for
>>>> my $tag ( qw(gene locus name product accession note) ) { if(
$cds->has_tag($tag) ) {
>>>>    my ($name) = $cds->get_tag_values($tag); # get the 1st one, this
returns a list

>>>>    $strands{$name} = $cds->strand;
>>>>     $seen = 1;
>>>>     last;
>>>> }
>>>> if( ! $seen ) {
>>>>    warn("not tag found for feature at ",
>>>> $cds->location->to_FTstring, "\n"); } }
>>>>
>>>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>>>
>>>>> I have come across a perplexing problem with trying to parse sequence
features into hashes from gb records. This is the minimal code which shows
my problem:
>>>>>
>>>>> #!/usr/bin/perl

>>>>> use strict;
>>>>> use warnings;
>>>>> use IO::String;
>>>>> use Bio::Perl;
>>>>> use Bio::SeqIO;
>>>>> use IO::String;
>>>>>
>>>>> my @files = </Users/adlai/Dropbox/atrsh/*>; foreach my
>>>>> $file(@files){
>>>>>
>>>>>
>>>>> my @cds_features = grep {$_->primary_tag eq 'CDS' }
>>>>> Bio::SeqIO->new(-file => $file)->next_seq->get_SeqFeatures;
>>>>> my %strands = map {$_->get_tag_values('gene'), $_->strand}
@cds_features; ##This Is The Culprit.

>>>>> .
>>>>> .
>>>>> .
>>>>> #do nifty stuff
>>>>> }
>>>>>
>>>>> For some files this approach works just fine.
>>>>> For others the script dies immediately with the error message:
>>>>>
>>>>> ------------- EXCEPTION -------------
>>>>> MSG: asking for tag value that does not exist gene STACK
>>>>> Bio::SeqFeature::Generic::get_tag_values
>>>>> /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>>>> STACK toplevel tosend.pl:16
>>>>> -------------------------------------
>>>>>
>>>>> The difference in the files that parse and those that don't seems to
be that the files that crash have "intron" and "exon" tags. They ALL have
"gene" tags.
>>>>> Does anyone know why this is a problem and what can be done to
circumvent it?

>>>>>
>>>>> Thanks,
>>>>> Adlai
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> [hidden email]
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> Jason Stajich
>>>> [hidden email]
>>>> [hidden email]
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> [hidden email]
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: Odd problem with get_tag_values

Roy Chaudhuri-3
In reply to this post by Adlai Burman-2
Just to chip in on this, you can use get_tagset_values instead of
get_tag_values - the former has (to me) the more Perl-ish behaviour of
returning an empty list if there are none of the requested tags present,
meaning that you can skip the has_tag step.

Cheers,
Roy.

On 24/02/2012 21:57, Adlai Burman wrote:

>
> On Feb 24, 2012, at 10:46 PM, Fields, Christopher J wrote:
>
>> Using has_tag('gene') as a pre-screen works for me for both example
>> seqs.
>>
>
> Me too :-)
>
> Dobrou noc and cheers,
>
> Adlai
>> chris
>>
>> On Feb 24, 2012, at 3:33 PM, Adlai Burman wrote:
>>
>>> Thanks so much, Jason. I will give that a try in after I get a
>>> few hours of much needed sleep :-)
>>>
>>>
>>> On Feb 24, 2012, at 10:21 PM, Jason Stajich wrote:
>>>
>>>> not all CDS will be annotated with a 'gene' tag, this is due to
>>>> variation in how annotation is done and that there is not a
>>>> requirement that there be a gene tag for all CDS features.
>>>>
>>>> You can protect your query - we often do this when dealing with
>>>> data from the wild by testing for has_tag first.
>>>>
>>>> my %strands; for my $cds ( grep {$_->primary_tag eq 'CDS' }
>>>> Bio::SeqIO->new(-file =>  $file)->next_seq->get_SeqFeatures )
>>>> { if( $cds->has_tag('gene') ) { my ($gene) =
>>>> $cds->get_tag_values('gene'); # get the 1st one, this returns a
>>>> list $strands{$gene} = $cds->strand; } else { # look in
>>>> alternative places for a name, e.g. locus, ... } }
>>>>
>>>> An alternative is to loop through your list of tags in order of
>>>> preference
>>>>
>>>> my %strands; for my $cds ( grep {$_->primary_tag eq 'CDS' }
>>>> Bio::SeqIO->new(-file =>  $file)->next_seq->get_SeqFeatures )
>>>> { for my $tag ( qw(gene locus name product accession note) ) {
>>>> if( $cds->has_tag($tag) ) { my ($name) =
>>>> $cds->get_tag_values($tag); # get the 1st one, this returns a
>>>> list $strands{$name} = $cds->strand; $seen = 1; last; } if( !
>>>> $seen ) { warn("not tag found for feature at ",
>>>> $cds->location->to_FTstring, "\n"); } }
>>>>
>>>> On Feb 24, 2012, at 12:43 PM, Adlai Burman wrote:
>>>>
>>>>> I have come across a perplexing problem with trying to parse
>>>>> sequence features into hashes from gb records. This is the
>>>>> minimal code which shows my problem:
>>>>>
>>>>> #!/usr/bin/perl use strict; use warnings; use IO::String; use
>>>>> Bio::Perl; use Bio::SeqIO; use IO::String;
>>>>>
>>>>> my @files =</Users/adlai/Dropbox/atrsh/*>; foreach my
>>>>> $file(@files){
>>>>>
>>>>>
>>>>> my @cds_features = grep {$_->primary_tag eq 'CDS' }
>>>>> Bio::SeqIO->new(-file =>  $file)->next_seq->get_SeqFeatures;
>>>>> my %strands = map {$_->get_tag_values('gene'), $_->strand}
>>>>> @cds_features; ##This Is The Culprit. . . . #do nifty stuff
>>>>> }
>>>>>
>>>>> For some files this approach works just fine. For others the
>>>>> script dies immediately with the error message:
>>>>>
>>>>> ------------- EXCEPTION ------------- MSG: asking for tag
>>>>> value that does not exist gene STACK
>>>>> Bio::SeqFeature::Generic::get_tag_values
>>>>> /Users/adlai/Downloads/BioPerl-1.6.1/Bio/SeqFeature/Generic.pm:517
>>>>>
>>>>>
STACK toplevel tosend.pl:16

>>>>> -------------------------------------
>>>>>
>>>>> The difference in the files that parse and those that don't
>>>>> seems to be that the files that crash have "intron" and
>>>>> "exon" tags. They ALL have "gene" tags. Does anyone know why
>>>>> this is a problem and what can be done to circumvent it?
>>>>>
>>>>> Thanks, Adlai
>>>>>
>>>>>
>>>>> _______________________________________________ Bioperl-l
>>>>> mailing list [hidden email]
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> Jason Stajich [hidden email] [hidden email]
>>>>
>>>>
>>>
>>>
>>> _______________________________________________ Bioperl-l mailing
>>> list [hidden email]
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
>
> _______________________________________________ Bioperl-l mailing
> list [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l