Bio::DB::Fasta problem: unable to fetch all sequences via get_PrimarySeq_stream


Helene RIMBERT
Dear BioPerl developers,

I have a question about get_PrimarySeq_stream.

I am using the Bio::DB::Fasta module to access my FASTA sequences, and I am having a problem with get_PrimarySeq_stream().
When I inspect the db object, all the sequences are indexed (that is, I can see all the sequence IDs in the offsets hash).

I then use get_PrimarySeq_stream to loop over all my sequences, but only 1 sequence is retrieved from the stream object.
I looked for an explanation, and the only thing I could find is that my seq_ids seem to be treated as undef in the while($dbstream->next_seq()) statement, on reaching IndexedBase.pm line 1116.

I also tried looping over all sequence IDs using my @seq_ids = $self->{fastaObj}->get_all_primary_ids; and that works very well.

I don't understand why the stream object does not retrieve all the sequences whereas get_all_primary_ids does.
Is there something wrong with my input FASTA (my IDs are very long...), or am I missing something?

I would really like to find out why I am not able to use get_PrimarySeq_stream.

Many thanks in advance :)

Regards,

Helene

#----------------------------------
# Here is the part of the code that causes the problem:

# initialize db::fasta object
$self->{fastaObj} =  Bio::DB::Fasta->new("test2.fna", -reindex => 1);

# create stream object
my $seq_stream = $self->{fastaObj}->get_PrimarySeq_stream();
$self->{nbSeqFetchedInStream}=0;

# loop over all seq in BioDBFasta obj using stream obj.
while ($self->{seq} = $seq_stream->next_seq()){
#foreach my $seq_id (@seq_ids){
    #$self->{seq} = $self->{fastaObj}->get_Seq_by_id($seq_id); # to use with foreach loop

    print (" New sequence: ", Dumper $self->{seq});
    $self->{nbSeqFetchedInStream}++;
}
print (" Fetched sequences in _PrimarySeq_stream: $self->{nbSeqFetchedInStream}");
#----------------------------------





--

--> New e-mail address: [hidden email] <--

Hélène RIMBERT
Bioinformatic Engineer
[hidden email]
UMR 1095 INRA/UBP – Site de Crouel
Tel.: +33 (0)4 73 62 43 49
5 chemin de beaulieu
63039 Clermont-Ferrand Cedex 2
France
https://www6.ara.inra.fr/umr1095_eng/

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l
Re: Bio::DB::Fasta problem: unable to fetch all sequences via get_PrimarySeq_stream

Fields, Christopher J

We would probably need a list of the IDs, but this has come up a few times before.  In some cases it’s a line-ending mismatch, which can be normalized with a tool like dos2unix.  However, if you have IDs that could evaluate as false, the issue is trickier and not so easy to fix, primarily because the returned value is stringified to the display ID (which is one reason I hate object stringification).
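
To check for the line-ending case before reaching for dos2unix, a minimal core-Perl sketch could count the lines that end in CR LF. The helper name count_crlf_lines is hypothetical and not part of BioPerl; the file path is taken from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): count the lines in a file
# that end in CR LF, i.e. DOS/Windows line endings.
sub count_crlf_lines {
    my ($path) = @_;
    # ':raw' keeps the bytes untouched so that "\r" is still visible.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /\r\n\z/;
    }
    close $fh;
    return $count;
}

# Usage: perl check_crlf.pl my_sequences.fna
my $file = shift @ARGV;
if (defined $file) {
    printf "%d CRLF line(s) in %s\n", count_crlf_lines($file), $file;
}
```

A non-zero count suggests the file was written with DOS line endings and may be worth normalizing before re-indexing.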

 

For example, the following would likely short-circuit without showing any sequence IDs: a seq ID of ‘0’ (note this does not include the description, which is separate) evaluates as false and kills the while loop:

>0 desc1
ATATATGTGC
>1 desc2
CGCGCCGCGC

The issue, the problems with a fix, and a workaround are described here: https://github.com/bioperl/bioperl-live/issues/170
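
The mechanism can be reproduced without BioPerl. The sketch below uses two hypothetical stand-in classes, MockSeq and MockStream, which are not BioPerl classes: MockSeq stringifies to its ID through overload, as described above for the returned sequence objects. It compares a plain truth test in the while condition with a definedness test. Whether the definedness test is the same workaround as the one in the linked issue is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence object whose string value is its ID.
package MockSeq {
    use overload
        '""'     => sub { $_[0]{id} },   # stringify to the ID
        fallback => 1;                   # boolean tests fall back to the string value
    sub new { my ($class, $id) = @_; return bless { id => $id }, $class }
}

# Hypothetical stand-in for the stream: one object per call, then undef.
package MockStream {
    sub new { my ($class, @ids) = @_; return bless { ids => [@ids] }, $class }
    sub next_seq {
        my $self = shift;
        return unless @{ $self->{ids} };
        return MockSeq->new(shift @{ $self->{ids} });
    }
}

package main;

# Truth test: the first object stringifies to "0", which is false,
# so the loop body never runs.
my $truthy = 0;
my $stream1 = MockStream->new('0', '1');
while (my $seq = $stream1->next_seq) { $truthy++ }

# Definedness test: the loop ends only when the stream returns undef.
my $defined = 0;
my $stream2 = MockStream->new('0', '1');
while (defined(my $seq = $stream2->next_seq)) { $defined++ }

print "truth test:   $truthy sequence(s)\n";
print "defined test: $defined sequence(s)\n";
```

With the two IDs above, the truth-test loop counts 0 sequences and the defined-test loop counts 2. Applied to the original code, the equivalent change would be to wrap the assignment in the while condition in defined(...).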

 

chris

 

From: Bioperl-l <bioperl-l-bounces+cjfields=[hidden email]> on behalf of Helene RIMBERT <[hidden email]>
Date: Monday, November 14, 2016 at 10:16 AM
To: "[hidden email]" <[hidden email]>
Subject: [Bioperl-l] Bio::DB::Fasta problem: unable to fetch all sequences via get_PrimarySeq_stream

 


Re: Bio::DB::Fasta problem: unable to fetch all sequences via get_PrimarySeq_stream

Helene RIMBERT
Dear Chris,
thank you for your quick reply :)

I am looking at the link you mentioned right now.

I have attached a script and the FASTA example.

For information:
perl --version:
This is perl 5, version 22, subversion 1 (v5.22.1)

and

BioPerl: 1.6.924-3

Thanks again for your answer!

Best regards,

Helene



On 14/11/2016 at 18:31, Fields, Christopher J wrote:


Attachments: test.fna (1K), bugged.pl (1K)