randomizing fastq sequences

randomizing fastq sequences

shalu sharma
Hi,
   I am trying to test a program, for which I need to change the order of
sequences in a FASTQ file.
My FASTQ file contains about 50,000 sequences.
Is there a script that can do this?

Thanks
Shalu
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: randomizing fastq sequences

simon andrews (BI)

On 7 Feb 2011, at 22:07, shalu sharma wrote:

> Hi,
>   i am trying to test one program for which i need to change order of
> sequences in a fastq file.
> My fastq file contains about 50,000 sequences.
> Is there any script that can do this task?

Since FASTQ is supported by Bio::SeqIO, you could do something like this (untested):

#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'shuffle';
use Bio::SeqIO;

# Read every record into memory.
my @seqs;

my $in = Bio::SeqIO->new(-file   => 'your_input.fastq',
                         -format => 'Fastq');

while (my $seq = $in->next_seq()) {
    push @seqs, $seq;
}

# Put the records into a random order.
@seqs = shuffle(@seqs);

# Write the records back out in their new order.
my $out = Bio::SeqIO->new(-file   => '>your_output.fastq',
                          -format => 'Fastq');

foreach my $seq (@seqs) {
    $out->write_seq($seq);
}

## End

This has the disadvantage that it will hold all of the sequences in memory whilst shuffling, but I don't think there's an easy way around that.

Simon.

Re: randomizing fastq sequences

Frank Schwach
If memory is an issue then I guess you could create a file of just the
sequence IDs (one per line), then shuffle those (using List::Util, as
Simon demonstrated). At the end you would substitute the whole FASTQ
entry for each ID again, which you can do without reading the entire file
into memory (it might be a bit slow, but that probably doesn't
matter).
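
A rough sketch of this idea, offered as an illustration only. Instead of a file of IDs it keeps the byte offset of each record, shuffles the offsets with List::Util, and then seeks back into the input to copy each record. It assumes exactly 4 lines per record, and `record_offsets` and `shuffle_fastq` are made-up names, not BioPerl functions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Return a reference to the list of byte offsets at which records start.
# Assumption: every record is exactly 4 lines long.
sub record_offsets {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @offsets;
    my ($pos, $lines) = (0, 0);
    while (defined(my $line = <$fh>)) {
        push @offsets, $pos if $lines % 4 == 0;   # first line of a record
        $lines++;
        $pos = tell $fh;                          # start of the next line
    }
    close $fh;
    return \@offsets;
}

# Copy the records of $in_path to $out_path in random order.
# Only the offsets are held in memory, never the records themselves.
sub shuffle_fastq {
    my ($in_path, $out_path) = @_;
    my $offsets = record_offsets($in_path);
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    for my $offset (shuffle @$offsets) {
        seek($in, $offset, 0) or die "Cannot seek in $in_path: $!";
        for (1 .. 4) {
            my $line = <$in>;
            die "Truncated record at byte $offset\n" unless defined $line;
            chomp $line;              # the last line may lack a newline
            print {$out} "$line\n";
        }
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return scalar @$offsets;          # number of records written
}

# Usage: perl shuffle_offsets.pl input.fastq output.fastq
if (@ARGV == 2) {
    my $n = shuffle_fastq(@ARGV);
    print STDERR "Wrote $n records\n";
}
```

This holds one number per record in memory, at the cost of one seek per record when writing.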
Frank




--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

Re: randomizing fastq sequences

Roy Chaudhuri-3
TMTOWTDI, maybe also use the Tie::File module?

Something like:

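#!/usr/bin/perl
use warnings FATAL=>qw(all);
use Modern::Perl;
use Tie::File;
use Fcntl qw(O_RDONLY);
use List::Util qw(shuffle);
# Tie the file to an array of lines; lines are fetched from disk on demand.
my @fastq;
tie @fastq, 'Tie::File', $ARGV[0], mode=>O_RDONLY or die $!;
# Print the records (slices of 4 consecutive lines) in shuffled order.
say join "\n", @fastq[4*$_..4*$_+3] for shuffle 0..$#fastq/4;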

Cheers,
Roy.

On 08/02/2011 09:08, Frank Schwach wrote:

> If memory is an issue then I guess you could create a file of just the
> sequence IDs (one per line), then shuffle those (using List::Util like
> Simon demonstrated). In the end you would substitute the IDs for the
> whole fastq entry again, which you can do without reading an entire file
> into memory (might be a bit slow but that probably doesn't
> matter)
> Frank

Re: randomizing fastq sequences

Frank Schwach
nice one - but if I understand it correctly it relies on there being
exactly 4 lines for each record. This is probably the case but it would
be a good idea to double-check the fastq file in question, just to make
sure.
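
One way to double-check, sketched here under the assumption that a valid record is an '@' header line, one sequence line, a '+' line, and one quality line of the same length as the sequence. `count_four_line_records` is a hypothetical helper, not part of BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the number of records if $path looks like strict 4-line FASTQ;
# die with the offending line number otherwise.
sub count_four_line_records {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my ($lines, $records, $seq_len) = (0, 0, 0);
    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;        # strip Unix or DOS line ending
        my $field = $lines % 4;      # 0 header, 1 sequence, 2 '+', 3 quality
        $lines++;
        if ($field == 0) {
            die "Line $lines: expected a header starting with '\@'\n"
                unless $line =~ /^\@/;
        }
        elsif ($field == 1) {
            $seq_len = length $line;
        }
        elsif ($field == 2) {
            die "Line $lines: expected a '+' separator line\n"
                unless $line =~ /^\+/;
        }
        else {
            die "Line $lines: quality and sequence lengths differ\n"
                unless length($line) == $seq_len;
            $records++;
        }
    }
    close $fh;
    die "File ends in the middle of a record\n" if $lines % 4;
    return $records;
}

# Usage: perl check_fastq.pl reads.fastq
if (@ARGV == 1) {
    my $n = count_four_line_records($ARGV[0]);
    print "$n records, 4 lines each\n";
}
```

A file whose sequences are wrapped over several lines would normally fail the '+' check on the third line of the first wrapped record.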

Frank



Re: randomizing fastq sequences

Roy Chaudhuri-3
Sorry, I should have included that caveat.

On 08/02/2011 11:57, Frank Schwach wrote:

> nice one - but if I understand it correctly it relies on there being
> exactly 4 lines for each record. This is probably the case but it would
> be a good idea to double-check the fastq file in question, just to make
> sure.
>
> Frank

Re: randomizing fastq sequences

Cook, Malcolm-2
In reply to this post by shalu sharma
Gotta chime in....

If you're working with FASTQ files, are working in Unix, and have the `shuf` command available,

I recommend installing cdbfasta/cdbyank (http://sourceforge.net/projects/cdbfasta/), which provides indexing of FASTA and FASTQ files and random access to them.

Index the FASTQ file, then list the IDs with cdbyank, pipe them through `shuf`, and then through cdbyank again to pull out the sequences.

Like this example, which uses a test FASTQ file from my local install of BioPerl:

> cd ~/local/src/bioperl-live/t/data/fastq/
> cdbfasta -Q example.fastq
3 entries from file example.fastq were indexed in file example.fastq.cidx
> cdbyank -l example.fastq.cidx | shuf | cdbyank example.fastq.cidx > shuf_example.fastq

There would be issues if your IDs are not unique.

Malcolm Cook
Stowers Institute for Medical Research -  Bioinformatics
Kansas City, Missouri  USA
 
 


Re: randomizing fastq sequences

Fields, Christopher J
Just to note, I have been thinking about wrapping this for fast indexing and retrieval of FASTQ for bioperl (this came up in a prior thread, with the same suggestion from Malcolm IIRC).

chris

On Feb 8, 2011, at 9:12 AM, Cook, Malcolm wrote:

> Gotta chime in....
>
> If
> you're working with fastq files
> are working in unix and have the `shuf` command available
>
> I recommend you install cdbyank http://sourceforge.net/projects/cdbfasta/ which provides for indexing fasta and fastq files and providing random access to them
>
> Index the fastq, then extract the IDs with cdbyank, pipe them through `shuf` and then through cdbyank again to pull out the sequences.
>
> Like this example, which uses a test fastq from my local install of bioperl:
>
>> cd ~/local/src/bioperl-live/t/data/fastq/
>> cdbfasta -Q example.fastq
> 3 entries from file example.fastq were indexed in file example.fastq.cidx
>> cdbyank -l example.fastq.cidx | shuf | cdbyank example.fastq.cidx > shuf_example.fastq
>
> There would be issues if your IDs are not unique.
>
> Malcolm Cook
> Stowers Institute for Medical Research -  Bioinformatics
> Kansas City, Missouri  USA

Re: randomizing fastq sequences

shalu sharma
Hi All,
    Thanks for all the suggestions.
@Simon Andrews and Roy:
   Your methods worked perfectly, but now memory is the issue.
I now have to randomly select 50K FASTQ sequences from an Illumina data set
(around 70 million reads), so is there a module that can select random
sequences from a FASTQ file?

I can still use the same methods on 50K sequences, but getting 50K from a huge
data set is a problem.
Also, at some point I need to shuffle the FASTQ reads themselves (the order of
nucleotides).

I am really sorry for asking so many things; I know I am really bad at
handling FASTQ sequences.
I would really appreciate your suggestions.

Thanks
Shalu

On Tue, Feb 8, 2011 at 10:53 AM, Chris Fields <[hidden email]> wrote:

> Just to note, I have been thinking about wrapping this for fast indexing
> and retrieval of FASTQ for bioperl (this came up in a prior thread, with the
> same suggestion from Malcolm IIRC).
>
> chris

Re: randomizing fastq sequences

Fields, Christopher J
Shalu,

(Note: this isn't a Perl solution): I do think this problem has been solved somewhat in R/BioC if you have it installed, in the ShortRead package (see 'Sampler-class' in the ShortRead docs).

I think using perl and the current BioPerl Bio::Index::Fastq indexing scheme for FASTQ will be problematic/slow for very large files with millions of sequences (i.e. pretty much anything that is rolling out of modern day sequencing pipelines), as the current indexing implementation uses a very simple indexing scheme using DB_File, originally designed years ago for much smaller sequencing samples.  Think: Sanger sequencing.  

Of course, this is with the caveat that I haven't tested this out personally, but I recall some complaints about this in the past (Jason?).  There was an effort to deal with this at one point with AnyDBM_File (which allows SQLite now) but I don't think it progressed very far, primarily b/c there simply hasn't been enough demand.  Most users seem to sample randomly from BAM files instead, which are conveniently accessible via samtools/Picard/bamtools/etc (bamtools has a 'random' option for this purpose).

chris

On Feb 8, 2011, at 10:48 AM, shalu sharma wrote:

> Hi All,
>     Thanks for all the suggestions.
> @Simon Andrew and Roy:
>    Your method worked perfect but now memory is the issue.
> Now i have to select 50K fastq sequences from a illumina data (around 70 mil reads) randomly , so is there again any module that can select random sequences from fastq file?
>
> I can still use same methods on 50k sequences but getting 50k from huge data set is a problem.
> Also at some point i need to shuffle the  fastq reads (order of nucleotides).
>
> I am really sorry for asking lot of things , i know i am really bad in handling fastq sequences.
> i would really appreciate your suggestions.
>
> Thanks
> Shalu

Re: randomizing fastq sequences

simon andrews (BI)
In reply to this post by shalu sharma

On 8 Feb 2011, at 16:48, shalu sharma wrote:

> Hi All,
>    Thanks for all the suggestions.
> @Simon Andrew and Roy:
>   Your method worked perfect but now memory is the issue.
> Now i have to select 50K fastq sequences from a illumina data (around 70 mil
> reads) randomly , so is there again any module that can select random
> sequences from fastq file?

The simple approach to this is to do it in two passes.  

In the first pass you simply find out how many fastq entries you have in your file.  You then randomly select 50K numbers from 1..[number of fastq seqs in file].

In the second pass you pull out any sequences at an index position you randomly selected.  If you don't mind them being in the same order then you can just write them out immediately and use virtually no memory, or you could put them in an array and shuffle them before writing (using the same memory as the 50K experiment).
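
The two passes could look roughly like this. It is only a sketch: it assumes strict 4-line records, numbers the records from 0 instead of 1, and writes the sampled records in file order. The sub names are illustrative and not a BioPerl API.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pass 1: count the records. Assumption: exactly 4 lines per record.
sub count_records {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;
    die "$path: line count is not a multiple of 4\n" if $lines % 4;
    return $lines / 4;
}

# Pick $k distinct record indices from 0 .. $n-1.
# Memory use grows with $k, not with $n.
sub pick_indices {
    my ($n, $k) = @_;
    die "Cannot pick $k records from $n\n" if $k > $n;
    my %picked;
    $picked{ int(rand($n)) } = 1 while keys(%picked) < $k;
    return \%picked;
}

# Pass 2: write the picked records as they stream past.
sub sample_fastq {
    my ($in_path, $out_path, $k) = @_;
    my $picked = pick_indices(count_records($in_path), $k);
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    my $line_no = 0;
    while (defined(my $line = <$in>)) {
        # int($line_no / 4) is the index of the record this line belongs to
        print {$out} $line if $picked->{ int($line_no / 4) };
        $line_no++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $k;
}

# Usage: perl sample_fastq.pl input.fastq output.fastq 50000
if (@ARGV == 3) {
    sample_fastq(@ARGV);
}
```

Drawing random numbers until enough distinct ones have been collected is quick when the sample is much smaller than the file, as with 50K out of 70 million.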

If you're going to be doing a lot of shuffling on the same dataset then it would be worth looking into doing a proper indexing of your file, as others have suggested, but if you're only going to do this once per dataset then it might not save you any time.


> Also at some point i need to shuffle the  fastq reads (order of
> nucleotides).

Same basic process - extract the sequence, split it into an array, shuffle the array, reassign the sequence.  If you want to keep the original quality scores associated with the same bases you'll need to shuffle indices and then reassemble both the shuffled sequences and qualities at the same time.
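
A small sketch of the index-shuffling variant, working on plain strings instead of BioPerl sequence objects (`shuffle_read` is a hypothetical name):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle the bases of one read, moving each quality character with its base.
sub shuffle_read {
    my ($seq, $qual) = @_;
    die "Sequence and quality lengths differ\n"
        unless length($seq) == length($qual);
    # One random permutation of the positions, applied to both strings.
    my @order    = shuffle(0 .. length($seq) - 1);
    my $new_seq  = join('', map { substr($seq,  $_, 1) } @order);
    my $new_qual = join('', map { substr($qual, $_, 1) } @order);
    return ($new_seq, $new_qual);
}

# Example: the output differs from run to run.
my ($s, $q) = shuffle_read('ACGTACGT', 'IIIIHHHH');
print "$s\n$q\n";
```

Because the same permutation is applied to both strings, the quality at position i of the output still belongs to the base at position i.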


Simon.




