match patterns to retrieve seqs

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

match patterns to retrieve seqs

Philipp Schiffer-2
Hi!

I am trying to write a small script to retrieve such sequences
(including headers) from a fasta file that match the headers I listed in
another file.
The fasta file looks like:
 >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
 >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
 >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
MLHHR.......

while the list file is:
maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1

My perl so far would be:

#! /usr/bin/perl -w

use Bio::SeqIO;

my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
my $file   = shift or die $usage;
my $format = shift or die $usage;
my $headerlist = shift or die $usage;

open LIST, "<$headerlist" or die $!;
my $queries = <LIST>; #might be a hash as well, don't know what's better

my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
while (my $fastaseq = $fastafile->next_seq){

     if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after
much trying it seems actually to look for matches, but does not find
any, still there must be

     {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
   ;
};

So I am wondering, is this the right way to tackle the problem. Will it
work? Should I use another bioPerl module? My thinking is, that the
pattern matching operation after the 'if' statement is wrong.

Any help highly appreciated!

Thanks a lot.

Philipp

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: match patterns to retrieve seqs

Philipp Schiffer-2
Sorry,

at least my $queries should read my queries@.


Am 18.03.12 17:49, schrieb Philipp Schiffer:

> Hi!
>
> I am trying to write a small script to retrieve such sequences
> (including headers) from a fasta file that match the headers I listed in
> another file.
> The fasta file looks like:
>  >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>  >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>  >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
> MLHHR.......
>
> while the list file is:
> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>
> My perl so far would be:
>
> #! /usr/bin/perl -w
>
> use Bio::SeqIO;
>
> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
> my $file = shift or die $usage;
> my $format = shift or die $usage;
> my $headerlist = shift or die $usage;
>
> open LIST, "<$headerlist" or die $!;
> my $queries = <LIST>; #might be a hash as well, don't know what's better
>
> my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
> while (my $fastaseq = $fastafile->next_seq){
>
> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
> trying it seems actually to look for matches, but does not find any,
> still there must be
>
> {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
> ;
> };
>
> So I am wondering, is this the right way to tackle the problem. Will it
> work? Should I use another bioPerl module? My thinking is, that the
> pattern matching operation after the 'if' statement is wrong.
>
> Any help highly appreciated!
>
> Thanks a lot.
>
> Philipp
>
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: match patterns to retrieve seqs

Frank Schwach
In reply to this post by Philipp Schiffer-2
as you correctly guessed, your regex
/(@queries)/
isn't doing what you were hoping it would do, i.e. match against all
elements of the array. You could compile a regex with all the
alternative patterns from your array but I think that's not very
efficient, so you could try something like this instead:

my $does_match = 0;
$fastaseq ->id =~ /$_/ && $does_match = 1 for @queries;
if ($does_match){
...
}

Frank




On 18/03/12 16:49, Philipp Schiffer wrote:

> Hi!
>
> I am trying to write a small script to retrieve such sequences
> (including headers) from a fasta file that match the headers I listed in
> another file.
> The fasta file looks like:
>  >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>  >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>  >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
> MLHHR.......
>
> while the list file is:
> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>
> My perl so far would be:
>
> #! /usr/bin/perl -w
>
> use Bio::SeqIO;
>
> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
> my $file = shift or die $usage;
> my $format = shift or die $usage;
> my $headerlist = shift or die $usage;
>
> open LIST, "<$headerlist" or die $!;
> my $queries = <LIST>; #might be a hash as well, don't know what's better
>
> my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
> while (my $fastaseq = $fastafile->next_seq){
>
> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
> trying it seems actually to look for matches, but does not find any,
> still there must be
>
> {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
> ;
> };
>
> So I am wondering, is this the right way to tackle the problem. Will it
> work? Should I use another bioPerl module? My thinking is, that the
> pattern matching operation after the 'if' statement is wrong.
>
> Any help highly appreciated!
>
> Thanks a lot.
>
> Philipp
>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: match patterns to retrieve seqs

Roy Chaudhuri-3
In reply to this post by Philipp Schiffer-2
Hi Philipp,

There are several problems with your code. Firstly, you will want to
remove trailing newlines from your headers using chomp. Also, Perl
patterns interpolate as if they were double quoted strings, which means
that the elements of your array are all concatenated (separated by
spaces) and you are trying to match them all in one go. Finally, it
would be better to output using Bio::SeqIO rather than print.

Since you are just looking for exact matches, it would be much more
efficient to read your headers into a hash, and then use exists rather
than pattern matching:
my %query;
while (<LIST>) {
   chomp;
   $query{$_}=1;
}
my $fastafile = Bio::SeqIO->new(-file =>$file, -format =>$format);
my $out=Bio::SeqIO->new(-format=>$format);
while (my $fastaseq = $fastafile->next_seq){
   if (exists $query{$fastaseq->id}) {
      $out->write_seq($fastaseq);
   }
}

You should also look at the Bio::DB::Fasta module, which indexes your
fasta file so is particularly useful for long files that you will access
more than once:
http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/DB/Fasta.pm

Hope this helps.
Cheers,
Roy.

On 18/03/2012 20:36, Philipp Schiffer wrote:

> Sorry,
>
> at least my $queries should read my queries@.
>
>
> Am 18.03.12 17:49, schrieb Philipp Schiffer:
>> Hi!
>>
>> I am trying to write a small script to retrieve such sequences
>> (including headers) from a fasta file that match the headers I listed in
>> another file.
>> The fasta file looks like:
>>   >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
>> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
>> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>>   >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
>> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
>> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>>   >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
>> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
>> MLHHR.......
>>
>> while the list file is:
>> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
>> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
>> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
>> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>>
>> My perl so far would be:
>>
>> #! /usr/bin/perl -w
>>
>> use Bio::SeqIO;
>>
>> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
>> my $file = shift or die $usage;
>> my $format = shift or die $usage;
>> my $headerlist = shift or die $usage;
>>
>> open LIST, "<$headerlist" or die $!;
>> my $queries =<LIST>; #might be a hash as well, don't know what's better
>>
>> my $fastafile = Bio::SeqIO->new(-file =>  "<$file", -format =>  $format);
>> while (my $fastaseq = $fastafile->next_seq){
>>
>> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
>> trying it seems actually to look for matches, but does not find any,
>> still there must be
>>
>> {print $fastaseq->  id,"\n" and print $fastaseq->seq,"\n";}
>> ;
>> };
>>
>> So I am wondering, is this the right way to tackle the problem. Will it
>> work? Should I use another bioPerl module? My thinking is, that the
>> pattern matching operation after the 'if' statement is wrong.
>>
>> Any help highly appreciated!
>>
>> Thanks a lot.
>>
>> Philipp
>>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: match patterns to retrieve seqs

Jason Stajich-5
In reply to this post by Philipp Schiffer-2
And take a look at the script in the bioperl distribution

index/bp_seqret.pl

which will do exactly this.

Jason
On Mar 18, 2012, at 9:49 AM, Philipp Schiffer wrote:

> Hi!
>
> I am trying to write a small script to retrieve such sequences (including headers) from a fasta file that match the headers I listed in another file.
> The fasta file looks like:
> >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
> >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
> >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
> MLHHR.......
>
> while the list file is:
> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>
> My perl so far would be:
>
> #! /usr/bin/perl -w
>
> use Bio::SeqIO;
>
> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
> my $file   = shift or die $usage;
> my $format = shift or die $usage;
> my $headerlist = shift or die $usage;
>
> open LIST, "<$headerlist" or die $!;
> my $queries = <LIST>; #might be a hash as well, don't know what's better
>
> my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
> while (my $fastaseq = $fastafile->next_seq){
>
>    if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much trying it seems actually to look for matches, but does not find any, still there must be
>
>    {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
>  ;
> };
>
> So I am wondering, is this the right way to tackle the problem. Will it work? Should I use another bioPerl module? My thinking is, that the pattern matching operation after the 'if' statement is wrong.
>
> Any help highly appreciated!
>
> Thanks a lot.
>
> Philipp
>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
[hidden email]
[hidden email]


_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: match patterns to retrieve seqs

Philipp Schiffer-2
Thanks Jason! I guess it's better to try around (and fail in my case) a
bit before getting to know this. ;-)

Am 19.03.12 17:40, schrieb Jason Stajich:

> And take a look at the script in the bioperl distribution
>
> index/bp_seqret.pl
>
> which will do exactly this.
>
> Jason
> On Mar 18, 2012, at 9:49 AM, Philipp Schiffer wrote:
>
>> Hi!
>>
>> I am trying to write a small script to retrieve such sequences
>> (including headers) from a fasta file that match the headers I listed
>> in another file.
>> The fasta file looks like:
>> >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
>> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
>> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>> >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
>> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
>> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>> >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
>> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
>> MLHHR.......
>>
>> while the list file is:
>> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
>> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
>> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
>> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>>
>> My perl so far would be:
>>
>> #! /usr/bin/perl -w
>>
>> use Bio::SeqIO;
>>
>> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
>> my $file = shift or die $usage;
>> my $format = shift or die $usage;
>> my $headerlist = shift or die $usage;
>>
>> open LIST, "<$headerlist" or die $!;
>> my $queries = <LIST>; #might be a hash as well, don't know what's better
>>
>> my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
>> while (my $fastaseq = $fastafile->next_seq){
>>
>> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
>> trying it seems actually to look for matches, but does not find any,
>> still there must be
>>
>> {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
>> ;
>> };
>>
>> So I am wondering, is this the right way to tackle the problem. Will
>> it work? Should I use another bioPerl module? My thinking is, that the
>> pattern matching operation after the 'if' statement is wrong.
>>
>> Any help highly appreciated!
>>
>> Thanks a lot.
>>
>> Philipp
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> [hidden email] <mailto:[hidden email]>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> [hidden email] <mailto:[hidden email]>
> [hidden email] <mailto:[hidden email]>
>
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: match patterns to retrieve seqs

EvoPHS
In reply to this post by Frank Schwach
Thanks for help everybody

On Monday, March 19, 2012 3:36:37 PM UTC+1, Frank Schwach wrote:

>
> as you correctly guessed, your regex
> /(@queries)/
> isn't doing what you were hoping it would do, i.e. match against all
> elements of the array. You could compile a regex with all the
> alternative patterns from your array but I think that's not very
> efficient, so you could try something like this instead:
>
> my $does_match = 0;
> $fastaseq ->id =~ /$_/ && $does_match = 1 for @queries;
> if ($does_match){
> ...
> }
>
> Frank
>
>
> On 18/03/12 16:49, Philipp Schiffer wrote:
> > Hi!
> >
> > I am trying to write a small script to retrieve such sequences
> > (including headers) from a fasta file that match the headers I listed in
> > another file.
> > The fasta file looks like:
> >  >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
> > AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
> > MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
> >  >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
> > AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
> > MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
> >  >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
> > AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
> > MLHHR.......
> >
> > while the list file is:
> > maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
> > snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
> > maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
> > maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
> >
> > My perl so far would be:
> >
> > #! /usr/bin/perl -w
> >
> > use Bio::SeqIO;
> >
> > my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
> > my $file = shift or die $usage;
> > my $format = shift or die $usage;
> > my $headerlist = shift or die $usage;
> >
> > open LIST, "<$headerlist" or die $!;
> > my $queries = <LIST>; #might be a hash as well, don't know what's better
> >
> > my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
> > while (my $fastaseq = $fastafile->next_seq){
> >
> > if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
> > trying it seems actually to look for matches, but does not find any,
> > still there must be
> >
> > {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
> > ;
> > };
> >
> > So I am wondering, is this the right way to tackle the problem. Will it
> > work? Should I use another bioPerl module? My thinking is, that the
> > pattern matching operation after the 'if' statement is wrong.
> >
> > Any help highly appreciated!
> >
> > Thanks a lot.
> >
> > Philipp
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > [hidden email]
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Reply | Threaded
Open this post in threaded view
|

Re: match patterns to retrieve seqs

Florent Angly
In reply to this post by Philipp Schiffer-2
Sure, this is a valid way to approach the problem. However, the line
that gives you grief is most likely this one:
> if ($fastaseq ->id =~ /(@queries)/)
You cannot use an array like this in a regexp and hope that it will
match any of its elements. Most likely, it is evaluated as a scalar,
i.e. the number of elements in the array, just like if you were doing:
my $val = @queries;

My solution would be to build a regular expression containing all the
queries:
> my $re;
> for my $query (@queries) {
>    $re .= "$query|"; # query 1, OR query 2, ...
> }
> $re =~ s/\|$//;    # remove trailing |
> $re = '^'.$re.'$'; # anchor the regexp for full-matches only (no
> partial matches)
> $re = qr/$re/;     # compile regexp for added speed

Then you can use this regular expression to match the sequence IDs:
> if ($fastaseq->id =~ /$re/) {

Regards,
Florent


On 19/03/12 06:36, Philipp Schiffer wrote:

> Sorry,
>
> at least my $queries should read my queries@.
>
>
> Am 18.03.12 17:49, schrieb Philipp Schiffer:
>> Hi!
>>
>> I am trying to write a small script to retrieve such sequences
>> (including headers) from a fasta file that match the headers I listed in
>> another file.
>> The fasta file looks like:
>> >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
>> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
>> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>>
>> >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
>> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
>> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>> >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
>> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
>> MLHHR.......
>>
>> while the list file is:
>> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
>> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
>> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
>> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>>
>> My perl so far would be:
>>
>> #! /usr/bin/perl -w
>>
>> use Bio::SeqIO;
>>
>> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
>> my $file = shift or die $usage;
>> my $format = shift or die $usage;
>> my $headerlist = shift or die $usage;
>>
>> open LIST, "<$headerlist" or die $!;
>> my $queries = <LIST>; #might be a hash as well, don't know what's better
>>
>> my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
>> while (my $fastaseq = $fastafile->next_seq){
>>
>> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
>> trying it seems actually to look for matches, but does not find any,
>> still there must be
>>
>> {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
>> ;
>> };
>>
>> So I am wondering, is this the right way to tackle the problem. Will it
>> work? Should I use another bioPerl module? My thinking is, that the
>> pattern matching operation after the 'if' statement is wrong.
>>
>> Any help highly appreciated!
>>
>> Thanks a lot.
>>
>> Philipp
>>
> _______________________________________________
> Bioperl-l mailing list
> [hidden email]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l