Problems downloading and parsing GenBank records

Moller, Abraham
Hi all,

I have been using a script to parse GenBank files to find taxonomic information corresponding to bacterial genomes. After several tries, my script has failed with the following error:

...
Bacteria_Actinobacteria_Streptomycetales_Streptomycetaceae_Streptomyces_Streptomyces_sp._4F
Bacteria_Actinobacteria_Streptomycetales_Streptomycetaceae_Streptomyces_Streptomyces_glaucescens
--------------------- WARNING ---------------------
MSG: Unbalanced quote in:
/locus_tag="M271_25565"
/inference="COORDINATES: ab initio prediction:GeneMarkS+"
/note="Derived by automated computational analysis using
gene prediction method: GeneMarkS+."
/codon_start=1
/transl_table=11
/product="membrane protein"
/protein_id="YP_008791527.1"
/db_xref="GeneID:17596261"
/translation="MPSPTSLAPAGPTATPTRTTATARRLMAICGTLLAALLCALSVG
ANSASAHAALTSTDPADGSVVKTAPREVTLNFSEGVLLSGDSVRVLDPKGKRVDTGKT
AHVDGKSSTAAAGLHSGLPDG Error: External viewer error: Empty Response. Bytes read: 0 Status: TimeoutNo further qualifiers will be added for this feature
---------------------------------------------------

After this, the script seems to halt for hours at least, if not indefinitely...
Is this a BioPerl or GenBank issue? Any help would be appreciated.

Thanks,
Jon Moller

--
Abraham (Jon) Moller
Microbiology and Chemistry | 2016
Cell, Molecular, and Structural Biology (CMSB) BS/MS | Liang Bioinfo Lab
Microbiology Club President 



_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l

Re: Problems downloading and parsing GenBank records

Fields, Christopher J

Hi Jon,

It looks like the script is attempting to parse a bad GenBank record, one that was truncated by an external error from NCBI, and failing (which is probably a good thing if the record is faulty).
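
A check along these lines could catch such a record before it reaches the parser. This is a minimal sketch in plain Python (standard library only, not BioPerl). The function name `genbank_record_problems` is hypothetical, and the error marker is copied from the warning quoted in this thread. It assumes the text holds a single flat-file record, and the quote count is only a heuristic.

```python
# Marker text copied from the truncated record quoted in this thread.
ERROR_MARKER = "Error: External viewer error"


def genbank_record_problems(text):
    """Return reasons a single GenBank flat-file record looks truncated.

    An empty list means none of the checks fired; it does not prove
    the record is valid.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return ["empty response"]

    problems = []
    # A flat-file record begins with a LOCUS line ...
    if not lines[0].startswith("LOCUS"):
        problems.append("does not start with a LOCUS line")
    # ... and ends with a line holding only "//".
    if lines[-1].strip() != "//":
        problems.append("missing the terminating // line")
    # The download error was written into the record body itself.
    if ERROR_MARKER in text:
        problems.append("contains an embedded download error message")
    # Heuristic: qualifier values are quoted, so an odd count of double
    # quotes suggests a value that was cut off mid-string.
    if text.count('"') % 2:
        problems.append("odd number of double quotes")
    return problems
```

A script could call this on each downloaded record and skip or re-fetch any record for which the returned list is not empty, rather than handing it to the parser.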

I noticed the record for that protein is no longer valid (it’s discontinued); the genome was replaced with this one:

https://www.ncbi.nlm.nih.gov/genome/?term=txid1343740[Organism:noexp]

Was this an older cached record?

chris

From: Bioperl-l <bioperl-l-bounces+cjfields=[hidden email]> on behalf of "Moller, Abraham" <[hidden email]>
Date: Tuesday, June 20, 2017 at 7:24 PM
To: "[hidden email]" <[hidden email]>
Subject: [Bioperl-l] Problems downloading and parsing GenBank records

 


Re: Problems downloading and parsing GenBank records

Moller, Abraham
Hi Chris,

Thanks for explaining what is going on. The protein sequence (YP_008791527.1) does indeed come from a GenBank record that has been removed (NC_022785). It seems that the FASTA file I am using, which carries a sequence accession in each header, includes accessions for truncated or removed GenBank records.

I wonder if I should simply curate my FASTA file manually every time I come upon such an error (replacing NC_022785 with the newer CP006567, the Streptomyces rapamycinicus NRRL 5491 genome). This error came up about a quarter of the way through parsing the FASTA file.
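
That curation could also be scripted instead of done by hand. The following is a minimal sketch in plain Python (standard library only, not BioPerl). The names `REPLACEMENTS`, `update_header`, and `update_fasta` are hypothetical, the mapping holds only the one pair named above, and the header layout is an assumption: the accession is taken to appear in the header as a token of its own, optionally followed by a version suffix such as `.1`.

```python
import re

# Hypothetical mapping from removed accessions to their replacements.
# Only the pair named in this thread is filled in.
REPLACEMENTS = {"NC_022785": "CP006567"}


def update_header(header, replacements=REPLACEMENTS):
    """Swap removed accessions in one FASTA header line for their replacements."""
    for old, new in replacements.items():
        # \b stops NC_022785 from matching inside a longer identifier.
        # An optional version suffix (".1") is dropped on purpose, because
        # the version of the replacement record is not known here.
        pattern = re.compile(r"\b" + re.escape(old) + r"(\.\d+)?\b")
        header = pattern.sub(new, header)
    return header


def update_fasta(lines, replacements=REPLACEMENTS):
    """Yield FASTA lines with header lines rewritten; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield update_header(line, replacements)
        else:
            yield line
```

Each time a removed record turns up, its accession and replacement can be added to the mapping and the FASTA file regenerated, which keeps a record of every substitution made.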

Regards,
Jon



On Tue, Jun 20, 2017 at 10:53 PM, Fields, Christopher J <[hidden email]> wrote:
