Re: Enquiry on gi_taxid_nucl.dmp.gz

Jason Stajich-3
hi - please keep questions on list.


I think one of your problems is that your first use of $gi2taxidfile is wrong.
When you call tie you want to specify a DB file to store the index in,
not the dump file itself.
So call it "/tmp/gi2taxid.idx" or something like that.

In my code here
http://github.com/bioperl/bioperl-live/blob/master/scripts/taxa/classify_hits_kingdom.PLS
you will see on line 97 that we construct the name of the index file
from the folder, plus 'idx', plus the name gi2taxid.
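
A minimal sketch of that idea, not the code from the script above: the
index lives in its own file and the dump is only read. The scratch
folder, the sample file name and the two sample rows are hypothetical
stand-ins for the real gi_taxid_nucl.dmp, and it stores every GI/taxid
pair unconditionally, on the assumption that this is what was intended.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical locations: a scratch folder holding a tiny stand-in for
# the unzipped gi_taxid_nucl.dmp and, separately, the index file.
my $dir      = tempdir(CLEANUP => 1);
my $dumpfile = File::Spec->catfile($dir, 'gi_taxid_sample.dmp');
my $idxfile  = File::Spec->catfile($dir, 'gi2taxid.idx');

# Illustrative rows only (GI <tab> taxid), not real NCBI data.
open(my $out, '>', $dumpfile) or die "Cannot write $dumpfile: $!";
print $out "101\t4932\n102\t5476\n";
close($out);

# Tie the hash to the index file, never to the dump file itself.
my %taxid4gi;
my $dbh = tie(%taxid4gi, 'DB_File', $idxfile, O_RDWR|O_CREAT, 0644, $DB_HASH)
    or die "Cannot tie $idxfile: $!";

# Read the dump and store each pair in the tied hash.
open(my $in, '<', $dumpfile) or die "Cannot read $dumpfile: $!";
while (my $line = <$in>) {
    my ($gi, $taxid) = split(' ', $line);
    next unless defined $taxid;      # skip blank or malformed lines
    $taxid4gi{$gi} = $taxid;
}
close($in);
$dbh->sync;                          # flush the index to disk

my $found = $taxid4gi{102};
print "taxid for GI 102: $found\n";
```

On a later run you would tie to the same index file again and skip the
loading loop if the index already exists.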

Also, it would be safer for the split to match on whitespace and to take
just the first two columns from the file.  Doing this would
eliminate the need for the chomp on the line above.

  my ($gi, $taxid) = split(/\s+/, $_);

instead of

  chomp;
  my ($gi, $taxid) = split(" ", $_,2);

There may be other problems but these should be fixed first -- and
please send queries to the mailing list rather than to me directly so
that others can answer questions.

-jason
Amali Thrimawithana wrote, On 8/24/10 8:13 PM:

> Dear Jason
>
> Thank you very much for the information. I managed to get the information on
> different taxonomic levels with the help of one of your example scripts,
> "local_taxonomydb_query". However, I am having trouble creating a local
> index file of gi_taxid_nucl.dmp so that I can get the taxonomic
> ID given an NCBI GI number. At the moment I am using the tie() function
> with DB_File and then storing the details in a hash. However, when I try to
> retrieve a taxonomic ID given the GI number, it returns nothing
> but an error. Below is part of the code (borrowed from the example script
> classify_hits_kingdom); can you please let me know where I am going wrong?
> ...
> my $dbh2 = tie(%taxid4gi, 'DB_File', $gi2taxidfile);
>
> if( ! $done ) {
>      my $fh;
>     open(GI2TAXID, "$gi2taxidfile") or die $!; #here passing the unzipped gi_taxid_nucl.dmp
>     my$i=0;
>      while (<GI2TAXID>) {
>        chomp;
>         my ($gi, $taxid) = split(" ", $_, 2);
>         $taxid4gi{$gi} = $taxid
>         if exists $taxid4gi{$gi};
>         $i++;
>       unless( $DEBUG&&  $i % 100000  ) {
>          warn "$i\n";
>      }
>      }
>      $dbh2->sync;
> }
> my $gi2='183397240';
> my $taxd2=$taxid4gi{$gi2};
>   print $taxd2, " \n";
>
> Any help would be much appreciated
>
> Thanking you
> Amali
>
> On 23 August 2010 06:29, Jason Stajich<[hidden email]>  wrote:
>
>    
>> Hi Amali -
>>
>> This is how I'd print out the full classification by using the Tree methods
>> (with probably a different way of initializing the $db object to your
>> flatfiles location).
>>
>> #!/usr/bin/perl -w
>> use strict;
>> use Bio::DB::Taxonomy;
>>
>> my $db= Bio::DB::Taxonomy->new(-source =>  'flatfile',
>>                    -nodesfile =>  'taxonomy/nodes.dmp',
>>                    -namesfile =>  'taxonomy/names.dmp');
>>
>> my $taxonid = $db->get_taxonid('Homo sapiens');
>> my $taxon = $db->get_taxon(-taxonid =>  $taxonid);
>> my $tree = Bio::Tree::Tree->new(-node =>  $taxon);
>> my @taxa = $tree->get_nodes;
>> print join(",", map { $_->scientific_name } @taxa), "\n";
>>
>> -jason
>>
>> Amali Thrimawithana wrote, On 8/18/10 3:56 PM:
>>
>>> Dear Dr Stajich,
>>>
>>> I am a Masters student at Auckland University and my research is on
>>> identifying yeast species present in wine by the use of 454 sequencing. In
>>> order to carry out this research, a pipeline is being built in which, at
>>> the final step, each representative OTU needs to be classified at different
>>> taxonomic levels (i.e. phylum, class, family, genus and species) using
>>> the results from BLAST. To identify the sequences at each taxonomic level,
>>> I have been trying out the Bio::DB::Taxonomy module in BioPerl. Using this
>>> module, I am able to get the genus and species level by splitting the
>>> scientific name returned by the Bio::Taxon object. Unfortunately, I am
>>> uncertain how to get the information for the other ranks. I have tried
>>> several commands, including "my @class = $node->classification;",
>>> but it does not work. Hence, could you please let me know how I might be
>>> able to get the higher levels of taxonomy, such as class and phylum, using
>>> BioPerl?
>>>
>>> Look forward to hearing from you soon
>>>
>>> Thanking You
>>>
>>> Amali
>>>
>>>
>>>        
_______________________________________________
Bioperl-l mailing list
[hidden email]
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Roy Chaudhuri-3
> Also, it would be safer for the split to match on whitespace and to take
> just the first two columns from the file.  Doing this would
> eliminate the need for the chomp on the line above.
>
>    my ($gi, $taxid) = split(/\s+/, $_);
>
> instead of
>
>    chomp;
>    my ($gi, $taxid) = split(" ", $_,2);

Sorry to be pedantic, but according to perldoc -f split: "As a special
case, specifying a PATTERN of space (' ') will split on white space just
as "split" with no arguments does"

The only difference between patterns of " " and /\s+/ is that the latter
will return an initial null field if there is leading white space, which
may or may not be what you want.

$ perl -e 'print join("-", split(" ", " 1\t2  3")), "\n"'
1-2-3
$ perl -e 'print join("-", split(/\s+/, " 1\t2  3")), "\n"'
-1-2-3
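
On the chomp question, a small sketch that may help: as far as I can
tell it is the LIMIT of 2, rather than the choice of pattern, that
leaves the newline on the second field, because with a limit the
remainder of the line is returned untouched. The GI is the one from the
original post; the taxid value is a made-up placeholder.

```perl
use strict;
use warnings;

# One line as it would be read from the dump file, newline included.
# The taxid 99999 is a placeholder, not a real mapping.
my $line = "183397240\t99999\n";

my @limited   = split(' ', $line, 2);   # original call, with LIMIT 2
my @unlimited = split(' ', $line);      # same pattern, no LIMIT
my @regex     = split(/\s+/, $line);    # regex pattern, no LIMIT

# Report whether the last field of each result still ends in a newline.
for my $case (['split(" ", $_, 2)', \@limited],
              ['split(" ", $_)',    \@unlimited],
              ['split(/\s+/, $_)',  \@regex]) {
    my ($label, $fields) = @$case;
    my $kept = $fields->[-1] =~ /\n\z/ ? 'yes' : 'no';
    print "$label: trailing newline kept? $kept\n";
}
```

So dropping the limit is enough to make the chomp unnecessary with
either pattern, since trailing empty fields are discarded.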

Cheers.
Roy.