GenBank files CONTIG line


GenBank files CONTIG line

Matthew Laird
Good afternoon,

I wanted to report what I think is an issue, though I'm not positive yet.  I
found this old mailing list posting from May
(http://lists.open-bio.org/pipermail/bioperl-l/2014-May/071583.html)
about the changes to NCBI's GenBank files, and I just grabbed the latest
bioperl-live with August's patch to hopefully solve it.  That part
worked great: instead of spewing a few GB of warnings and the whole
sequence multiple times, it read the GenBank file and wrote out an EMBL
file perfectly fine.

However, the current bioperl-live introduced a new issue.  I have a mirror
of NCBI's bacterial genomes directory (yes, I know, I need to move to
the new directory structure in the next 6 months), and this pipeline
takes the GenBank file and makes the EMBL, ptt, faa, and fna files as
needed.  This usually takes seconds.  Whatever changed in bioperl-live
compared to BioPerl 1.6.922 causes the script to spin on something very
intensive for tens of minutes, slowly writing out the ptt file.

Simply copying genbank.pm from bioperl-live into my install directory
both solved the CONTIG issue and kept the whole conversion process
speedy.  So I'm happy for now, but I wanted to mention this in case it
rings a bell with anyone as to what could have changed to make parsing a
gbk into a ptt so much less efficient.
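For context, the EMBL half of the pipeline is just the stock Bio::SeqIO round trip (filenames here are placeholders, not the actual pipeline paths); it's the ptt-writing half, which works feature by feature, that got slow:

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Placeholder filenames; any bacterial GenBank record will do.
my $in  = Bio::SeqIO->new(-file => 'genome.gbk',   -format => 'genbank');
my $out = Bio::SeqIO->new(-file => '>genome.embl', -format => 'embl');

# This conversion stayed fast with the August patch; the slowdown
# showed up in the separate ptt step.
while (my $seq = $in->next_seq) {
    $out->write_seq($seq);
}
```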

Thanks.

--
Matthew Laird
Lead Software Developer, Bioinformatics
Brinkman Laboratory
Simon Fraser University, Burnaby, BC, Canada
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l

Re: GenBank files CONTIG line

Fields, Christopher J
Matthew,

That’s particularly odd; the problem pops up after that commit was added?  

Would it be possible for you to post a bug report on GitHub issues about this?

    https://github.com/bioperl/bioperl-live/issues

If this involves anything other than a simple next_seq/write_seq loop (e.g. if you are doing any additional data manipulation), it would also help to see example code so we can replicate this; once that is in place we can work on bisecting down to the problematic commit.

chris

On Sep 16, 2014, at 4:16 PM, Matthew Laird <[hidden email]> wrote:


Re: GenBank files CONTIG line

Brian Osborne
Matthew,

What's an easy way for me to reproduce this performance problem making the *.ptt file?

Brian O.

On Sep 16, 2014, at 5:16 PM, Matthew Laird <[hidden email]> wrote:


Re: GenBank files CONTIG line

Matthew Laird
Hi all,

So I finally found time to dig into this issue from last year,
reflected in GitHub issue #83, which was causing a significant slowdown
in parsing GenBank files. It turns out a change made to Bio::PrimarySeqI
within the last year has some unintended consequences. Functions like
trunc() and revcom() now use $self->clone to make the new sequence object
they return, except for sequence objects of type Bio::Seq::LargePrimarySeq
or Bio::Seq::LargeSeq; I think at the least Bio::Seq::RichSeq should be
added to that list.

The case I'm seeing (simple driver attached to illustrate) is looping
through a GenBank file and doing a trunc() on the sequence based on the
coordinates of an individual CDS feature. By calling clone() on the
sequence first, you're cloning the entire original sequence object,
including all the features (all the CDS, gene, etc. records), and then
simply writing the new sub-sequence into this cloned object. I don't
think that's what the user would expect to happen. It works fine for
simple sequence objects, but for a rich sequence this is probably the
wrong behaviour.  To see this, uncomment the Dumper line in the driver
below, and you'll see that all the returned truncated sub-sequence objects
still have all the original features, including those outside of the
target range. This issue is then further compounded if you do further
manipulations on the returned sequence, assuming it's a simple sequence
as it was in previous releases: suddenly you have a large sequence
object with a lot of extra features to manipulate that you might not
have been expecting (the source of my original issue).
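Since the attachment gets scrubbed in the archive, here is a minimal sketch of what the driver does (placeholder filename; not the original script):

```perl
use strict;
use warnings;
use Bio::SeqIO;
# use Data::Dumper;   # uncomment to inspect the returned objects

my $in = Bio::SeqIO->new(-file => 'genome.gbk', -format => 'genbank');

while (my $seq = $in->next_seq) {
    for my $feat (grep { $_->primary_tag eq 'CDS' } $seq->get_SeqFeatures) {
        # With the clone()-based trunc(), $sub is a full copy of the
        # rich sequence: it still carries every feature of the record,
        # not just those inside the requested range.
        my $sub = $seq->trunc($feat->start, $feat->end);
        # print Dumper($sub);
    }
}
```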

I'm creating a pull request adding Bio::Seq::RichSeq to the list of
sequence types for which trunc() does not use clone(), but perhaps the
change should also be made to all the other functions in Bio::PrimarySeqI
that now use clone(), since using clone on a rich sequence type in this
situation has a very noticeable effect on performance. Alternatively,
trunc() could process a rich sequence type to ensure that all features
in the returned object really are limited to the requested range;
however, that would have significant performance consequences.
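The pull request amounts to widening the existing type check in Bio::PrimarySeqI. Roughly, the shape of the change is (a sketch, not the literal patch; $trunc_str stands in for the already-computed sub-sequence string):

```perl
# Fall back to a plain object for types where cloning the whole
# record is wrong (RichSeq) or too expensive (the Large* classes).
my $new_seq;
if (   $self->isa('Bio::Seq::LargePrimarySeq')
    || $self->isa('Bio::Seq::LargeSeq')
    || $self->isa('Bio::Seq::RichSeq') )    # the proposed addition
{
    $new_seq = Bio::PrimarySeq->new(
        -seq      => $trunc_str,
        -alphabet => $self->alphabet,
        -id       => $self->display_id,
    );
}
else {
    $new_seq = $self->clone;      # cheap for simple sequence objects
    $new_seq->seq($trunc_str);
}
```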

Thoughts? Thanks gang!

On 14-09-16 02:16 PM, Matthew Laird wrote:


[Attachment: test_bioperl.pl (774 bytes)]

Re: [Attachment Removed] GenBank files CONTIG line

Fields, Christopher J
Matthew,

The attachment appears to have been scrubbed; can you send it directly or post it as a gist?  We just need the code and maybe the NCBI ID of the test case.

The fix seems sane enough; for large files it might be easier to have a back-end store (e.g. Bio::DB::SeqFeature::Store) that can house the original data persistently, with sections of the data output as needed.  Not sure if you want to take that on :)
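For reference, the back-end store is queried by region, so only the relevant slice of the data is ever materialized; a rough sketch (the accession and coordinates are made up):

```perl
use Bio::DB::SeqFeature::Store;

# SQLite-backed store; the dsn file persists between runs.
my $db = Bio::DB::SeqFeature::Store->new(
    -adaptor => 'DBI::SQLite',
    -dsn     => 'genome.sqlite',
    -create  => 1,
);

# Pull out only the CDS features overlapping a region of interest.
my @cds = $db->features(
    -seq_id => 'NC_000913',   # hypothetical accession
    -type   => 'CDS',
    -start  => 1000,
    -end    => 5000,
);
```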

chris

> On Mar 19, 2015, at 1:09 AM, Matthew Laird <[hidden email]> wrote:

Re: [Attachment Removed] GenBank files CONTIG line

Matthew Laird
Thanks Chris,

I'll send it to you directly for simplicity.

As for a larger fix: indeed, it would be nice to keep the features that
fall within the bounds of the trunc range in the returned RichSeq object,
rather than losing all that information. But doing that efficiently...
I'll take your suggestion under advisement and try to find time to
implement that bigger change. :)  For now, though, simply ensuring you
don't get a blob of features in the returned object when you've only
asked for 10bp seems like a quicker initial fix.
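The bigger change might look something like the following (a hypothetical helper, not code from the pull request; a real implementation would copy the features rather than shifting the originals in place):

```perl
# Keep only the features entirely inside the trunc window, with
# coordinates remapped to be relative to the sub-sequence.
sub features_in_range {
    my ($seq, $start, $end) = @_;
    my @kept;
    for my $feat ($seq->get_SeqFeatures) {
        next if $feat->start < $start || $feat->end > $end;
        $feat->start( $feat->start - $start + 1 );
        $feat->end(   $feat->end   - $start + 1 );
        push @kept, $feat;
    }
    return @kept;
}
```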

On 15-03-19 11:14 AM, Fields, Christopher J wrote:


--
Matthew Laird
Lead Software Developer, Bioinformatics
Brinkman Laboratory
Simon Fraser University, Burnaby, BC, Canada