Starting to use Bioperl

Starting to use Bioperl

Gordon Haverland
I believe I mentioned in my note about that Native-Plants dbase, that I
came into this because I am researching a deer problem.

I have about 1000 entries: most are species, some are genera, and I think a
couple are families (or tribes).  Ostensibly, these are all entries for
plants that deer may not prefer to eat.  There are many reasons why a
deer (which I mean generically, so white tail deer, mule deer, moose,
elk, ...) may not eat something: tough foliage, spiny, strong odour,
toxins, and some others.  Some plants share many of these
characteristics, some only have one of them.

What I have is an array of hash references.  Most of the keys point to
string values; a few point to undef or to things like array references.
CommonNames often holds several strings.  It would be nice to
put it into a more formal format, so that I can try to find missing
information to fill in.
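
One way to make the array-of-hashes layout more formal is to split it into
two SQLite tables, one row per plant and one row per common name.  The
following is only a sketch in Python: the table names, column names and the
two sample records are hypothetical, not taken from the real data set.

```python
import sqlite3

# Hypothetical sample records shaped like the array of hashes described above.
records = [
    {"Genus": "Abelia", "Species": "x grandiflora",
     "CommonNames": ["glossy abelia"], "Reason": None},
    {"Genus": "Digitalis", "Species": "purpurea",
     "CommonNames": ["foxglove", "common foxglove"], "Reason": "toxins"},
]

def load(conn, records):
    """Create the two tables and copy every record into them."""
    conn.executescript("""
        CREATE TABLE plant (
            plant_id INTEGER PRIMARY KEY,
            genus    TEXT NOT NULL,
            species  TEXT,
            reason   TEXT
        );
        CREATE TABLE common_name (
            plant_id INTEGER NOT NULL REFERENCES plant(plant_id),
            name     TEXT NOT NULL
        );
    """)
    for rec in records:
        cur = conn.execute(
            "INSERT INTO plant (genus, species, reason) VALUES (?, ?, ?)",
            (rec["Genus"], rec.get("Species"), rec.get("Reason")),
        )
        # A missing or undefined CommonNames entry becomes zero rows.
        names = rec.get("CommonNames") or []
        conn.executemany(
            "INSERT INTO common_name (plant_id, name) VALUES (?, ?)",
            [(cur.lastrowid, name) for name in names],
        )
    conn.commit()

def missing(conn, column):
    """List the plants whose given column is still empty."""
    # The column name is interpolated, so only known names are accepted.
    if column not in {"species", "reason"}:
        raise ValueError(column)
    rows = conn.execute(
        f"SELECT genus, species FROM plant WHERE {column} IS NULL "
        "ORDER BY plant_id"
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
load(conn, records)
print(missing(conn, "reason"))
```

With the data in this shape, finding gaps to fill in is one query per
column rather than a walk over the whole Perl structure.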

Any recommendations as to how to proceed with this?

Have a great day!
Gord

_______________________________________________
Bioperl-l mailing list
[hidden email]
http://mailman.open-bio.org/mailman/listinfo/bioperl-l

Re: Starting to use Bioperl

Gordon Haverland
On Wed, 9 May 2018 09:54:18 -0700
Gordon Haverland <[hidden email]> wrote:

> I believe I mentioned in my note about that Native-Plants dbase, that
> I came into this because I am researching a deer problem.
>
> I have about 1000 entries, ....

I started writing this note, and now I see a reply from Henry Liu.  Thank
you Henry.

Yes, I agree that having 17k lines of array of hash in a Perl source
file is not a good plan.  :-)  I agree that SQLite3 is likely a good
option.  I was hoping that BioPerl might be a way to move this in that
direction.

Installing BioPerl on Debian/Devuan takes 135 MB of compressed packages,
which expand to 750 MB installed.

I have some biology knowledge, but it is more like biophysics, medical
physics, health physics and biochemistry.  And a bunch of Perl.  But at
heart, I am a materials science and engineering person who is good at
numerical methods.

To go from lists of common names and usually binomial taxonomic names,
sometimes with other information, to something technically correct has
been fun.  Mistakes in common names, mistakes in Genus, mistakes in
species and sometimes just dumb mistakes (sorry, deer really like
eating aspen and willow).  I've been using Wikipedia to flesh out more
information, but Wikipedia is not that reliable.

As I understand things, it is possible to have the same binomial name
(Genus species) in different order, family, or tribe.  But in reading
at Wikipedia or other places tracking down some points (like what is
the toxic component that is in the precursor to carrots), it is
apparent that there are many conventions in describing things.  Which
is not like the periodic table and chart of the nuclides I am familiar
with.

In Bio::DB::Taxonomy, I can search for a binomial name, and it will give
me a list of hits, which could be empty.  If I get multiple hits, one
source of this is synonyms for the binomial name.
Another could be a similarly named Genus species
in some other division/order/family/tribe.
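
How one name can yield zero, one or several hits can be illustrated with a
small Python sketch.  It does not use Bio::DB::Taxonomy; it assumes rows
of (tax_id, name, name_class) like those in the NCBI names.dmp file, and
the taxon names and ID numbers below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (tax_id, name_txt, name_class).  IDs and names are invented.
NAMES = [
    (900001, "Examplea alba", "scientific name"),
    (900001, "white examplea", "common name"),
    (900002, "Otherus albus", "scientific name"),
    (900002, "Examplea alba", "synonym"),
]

def build_index(rows):
    """Map each lower-cased name to every (tax_id, name_class) that uses it."""
    index = defaultdict(list)
    for tax_id, name, name_class in rows:
        index[name.lower()].append((tax_id, name_class))
    return index

def lookup(index, name):
    """Return all hits for a name; the list is empty when nothing matches."""
    return sorted(set(index.get(name.lower(), [])))

index = build_index(NAMES)
print(lookup(index, "Examplea alba"))
print(lookup(index, "Nosuchus plantus"))
```

Keeping the name class with each hit makes it possible to tell a synonym
from the accepted scientific name before deciding which ID to keep.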

There are two different ways to add metadata.  One kind of metadata is
related to genetics and is capable of having a "location or position"
in the genome of that entity.  The other is data that is without
position.  Which is where most (all?) information related to why deer
might not want to eat this stuff would be held.

Some of the various plant entities I have run across information on are
well defined.  Other entities have genetics such that there is
significant variance in plant properties grown in slightly different
microclimate, let alone different soils, growth zones or whatever.  I
guess this is normal in biology, but it is strange to someone who is more
of a physicist.

I think that at some point I am going to need position dependent
metadata.  Mostly in terms of toxicity.  I'll take spinach and rhubarb
for examples.  Both plants contain a fair amount of oxalic
acid/oxalates.  One has edible leaves, and the other doesn't.  It is
apparently possible to make a DIY insecticide from rhubarb leaves.
Oxalic acid concentrations in the leaves don't seem to explain the
ability to make an insecticide from them, so there must be other
toxic components involved here.

A common toxin "modality" in plants is something attached to a sugar.
Cyanogenic glycosides are an example.  Do I want to try and relate this
to RNA (as they involve sugars from the work I've done with
pharmaceutical chemists), to DNA or proteins (those seem to be the
three metrics common to many of these databases)?  Something else?

I will grind on this a while.  Having autism makes me want to classify
things.

Have a great day!
Gord


Re: Starting to use Bioperl

Gordon Haverland
On Wed, 9 May 2018 09:54:18 -0700
Gordon Haverland <[hidden email]> wrote:

>           ... I am researching a deer problem.

There are BioPerl and Bio-LITE routines which can work with taxonomy
information.  Finding something which can write a SQLite3 dbase took a
little digging, but something does exist.

I've never played with BioPerl before, and I am still trying to clean
and expand my deer plant data, so I ran my latest effort with a call to
BioPerl to look up a taxonid and then a taxon.  It just happened the
first element in my list was a hybrid species (Abelia x grandiflora).
Anyway, following some BioPerl documentation I connected to -entrez
(excuse any spelling mistakes) and it came up with a hit.  A species
hit, which is what I was hoping for.

From that returned object, I can get an ancestor object (which is a
genus), and from that I can get an ancestor object which is a family,
and from that I can get an ancestor object which is an order.  Further
iterations on ancestor return non_ranked clade stuff, which I am
not sure how to handle.  I haven't tried iterating to the limit; I was
hoping that at some point an attempt to return an ancestor would return
undef.  But I really don't know what to do with this non_ranked clade
stuff.

I suspect I need to iterate this ancestor stuff until I get to kingdom
Plantae?  This gives me a "root".  I now have a species (usually) with
N ancestors up to a common root (kingdom Plantae).  That constitutes a
tree as I understand things, but it is all one sided.
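
The walk up the ancestors, including one way of dealing with the unranked
clades, can be sketched in Python.  This is not the BioPerl API: it
assumes a plain dictionary from taxon ID to (parent ID, rank), as could be
built from the NCBI nodes.dmp file, where unranked nodes carry the rank
"no rank" and the root lists itself as its own parent.  The ID numbers are
invented.

```python
# Hypothetical parent table: tax_id -> (parent_tax_id, rank).
NODES = {
    1: (1, "no rank"),     # root: its parent is itself
    10: (1, "kingdom"),
    20: (10, "no rank"),   # an unranked clade between kingdom and order
    30: (20, "order"),
    40: (30, "family"),
    50: (40, "genus"),
    60: (50, "species"),
}

def lineage(nodes, tax_id, stop_rank="kingdom", skip_ranks=("no rank",)):
    """Walk from tax_id towards the root, dropping ranks we do not want.

    Stops at stop_rank, or at the root if stop_rank is never reached.
    """
    path = []
    seen = set()  # guards against a malformed table with a cycle
    while tax_id not in seen:
        seen.add(tax_id)
        parent, rank = nodes[tax_id]
        if rank not in skip_ranks:
            path.append((rank, tax_id))
        if rank == stop_rank or parent == tax_id:
            break
        tax_id = parent
    return path

print(lineage(NODES, 60))
```

Skipping the unranked nodes keeps every lineage to the same small set of
ranks, which makes lineages from different species easier to compare.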

If I go to the next entry in my deer resistant plants data, I may have
M ancestors up to kingdom plantae.   And do this for 1000 or so other
entries.

For each set of ancestor lookups, I need to make a tree.

All of these trees have the same root (kingdom plantae).  So I should
be able to add all these trees together.  And then I think I found the
utilities to save this mess as SQLite.
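
Adding the one-sided trees together amounts to merging root-first paths
into a single parent-to-children map.  A minimal Python sketch, with
placeholder taxon names (real data would more safely be keyed on NCBI ID
numbers, since names can collide):

```python
# Hypothetical lineages, each listed from the shared root down to a species.
LINEAGES = [
    ["Plantae", "OrderA", "FamilyA", "GenusA", "GenusA alba"],
    ["Plantae", "OrderA", "FamilyA", "GenusB", "GenusB rubra"],
    ["Plantae", "OrderB", "FamilyC", "GenusC", "GenusC nigra"],
]

def merge_lineages(lineages):
    """Merge root-first paths into one map of node -> set of child nodes."""
    children = {}
    for path in lineages:
        for parent, child in zip(path, path[1:]):
            children.setdefault(parent, set()).add(child)
            children.setdefault(child, set())  # leaves get an empty set
    return children

def render(children, node, depth=0):
    """Return the subtree below node as indented lines of text."""
    lines = ["  " * depth + node]
    for child in sorted(children.get(node, ())):
        lines.extend(render(children, child, depth + 1))
    return lines

tree = merge_lineages(LINEAGES)
print("\n".join(render(tree, "Plantae")))
```

Shared ancestors are stored once, so the merged map branches wherever two
lineages part ways.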

As I understand things, I probably want to be working with NCBI ID
numbers on the species entered?  And what you call annotation, I would
save in one or more separate SQLite3 dbases keyed on the NCBI ID number?

Let's assume one of the fields of annotation is the USDA growing zone.
A person thinks they want to do a query on USDA Zone 3, so the program
changes this to a query for USDA Zones 2-4, which picks off all the
NCBI ID numbers, and then a person can use BioPerl to make a picture of
all the deer resistant taxonomy known.
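
The zone lookup can be sketched with Python's sqlite3 module.  The
assumptions here are that each plant's hardiness is stored as a minimum
and maximum zone keyed on its NCBI ID number, and that the widening is one
zone either side; the table layout and the ID numbers are hypothetical.

```python
import sqlite3

def make_db():
    """Build an in-memory annotation table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE zone (
            tax_id   INTEGER PRIMARY KEY,
            zone_min INTEGER NOT NULL,
            zone_max INTEGER NOT NULL
        )
    """)
    conn.executemany(
        "INSERT INTO zone VALUES (?, ?, ?)",
        [(900001, 2, 7), (900002, 5, 9), (900003, 4, 8), (900004, 1, 1)],
    )
    return conn

def taxids_for_zone(conn, zone, slack=1):
    """Return IDs whose zone range overlaps zone-slack .. zone+slack."""
    lo, hi = zone - slack, zone + slack
    rows = conn.execute(
        "SELECT tax_id FROM zone "
        "WHERE zone_min <= ? AND zone_max >= ? ORDER BY tax_id",
        (hi, lo),
    )
    return [row[0] for row in rows]

conn = make_db()
print(taxids_for_zone(conn, 3))           # query widened to zones 2-4
print(taxids_for_zone(conn, 3, slack=0))  # exact zone only
```

The list of ID numbers that comes back is what would then be handed to the
taxonomy side to draw the tree.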

One of the data sources feeding into this has the colour of the flowers.  So
someone could conceivably be looking for pink flowered, deer resistant
plants.  That's why I suggested there might be more than 1 SQLite dbase
of annotation to go with this stuff.

I'll stop writing, and go back to reading code.  I downloaded the
Bio-LITE modules (not packaged for Debian/Devuan), and I think there were
suggestions of other code to download.  And read.

Have a great day!
Gord



Re: Starting to use Bioperl

Peter Cock
Hi Gordon,

A couple of bits of background reading for you.

First, there is a database schema called BioSQL which might be
of interest in that it includes taxon tables - based primarily on the
NCBI taxonomy tree but it could be used for another taxonomy.
There is an SQLite version of this (in use by Biopython) but that
has not, as far as I know, been integrated into BioPerl yet.


I think given your taxonomy focus, you can ignore BioSQL which
is more suited to working with NCBI/EMBL annotated sequences.

Now, I mentioned the NCBI taxonomy, which is a de facto world
standard but will not always reflect the latest expert opinion in
all branches of life. Nevertheless, I would start there.

You can query the NCBI taxonomy via Entrez (and by hand on
the website) to see how to walk up the tree, ignoring the boring
ranks, until you reach the root.

Or, you can download the NCBI taxonomy as a set of text files,
for which you should have no trouble finding example scripts
to load and work with:


This year the NCBI started offering this data in a slightly newer
format:

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/

Most of these files are plain text tables using the rather
unusual field separator of "\t|\t" (tab, pipe, tab), but the
README files are very comprehensive.

This is in Python, but my most recent occasion to process
this data was to make a cut-down version of the NCBI
taxonomy as part of constructing a small test dataset:

https://github.com/abaizan/kodoja/blob/master/test/taxonomy/filter_taxonomy.py

Peter



Re: Starting to use Bioperl

Gordon Haverland
On Fri, 11 May 2018 10:12:04 +0100
Peter Cock <[hidden email]> wrote:

> This year the NCBI started offering this data in a slightly newer
> format:
>
> https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/
>
> Most of these files are plain text tables using the rather
> unusual field separator of "\t|\t" (tab, pipe, tab), but the
> README files are very comprehensive.

I found this, and got the tarball version.  I thought the README said
it was \t|\n?  Doesn't matter, it's an unusual separator.

There are Perl scripts in the tarball.  I think I read there, that if
the NCBI dump files are older than 180 days, it downloads newer
versions?  Or maybe I was reading something else.

In any event, the BioSQL site at Github doesn't see much updating.  It
looks to me like all the activity is in biopython, so I downloaded that
for my Devuan machine.

> This is in Python, but my most recent occasion to process
> this data was to make a cut-down version of the NCBI
> taxonomy as part of constructing a small test dataset:
>
> https://github.com/abaizan/kodoja/blob/master/test/taxonomy/filter_taxonomy.py

I saw this on Google; you labelled something a bug.

In looking for the new_taxdump thing (via Google), another Perl script
about findingSpeciesFromGenus (or something like that) popped up.  So,
I have a few things of source to look through.

Thanks.

Gord


Re: Starting to use Bioperl

Peter Cock


On Sun, May 13, 2018 at 12:26 AM, Gordon Haverland <[hidden email]> wrote:
> On Fri, 11 May 2018 10:12:04 +0100
> Peter Cock <[hidden email]> wrote:
>
> > This year the NCBI started offering this data in a slightly newer
> > format:
> >
> > https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/
> >
> > Most of these files are plain text tables using the rather
> > unusual field separator of "\t|\t" (tab, pipe, tab), but the
> > README files are very comprehensive.
>
> I found this, and got the tarball version.  I thought the README said
> it was \t|\n?  Doesn't matter, it's an unusual separator.

From memory, yes, the record separator is tab pipe newline,
but the field separator is tab pipe tab.
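
A minimal Python sketch of that distinction: strip the record terminator
first, then split on the field separator.  The function name is made up,
and the sample line is only shaped like a row of names.dmp.

```python
def parse_dmp_line(line):
    """Split one taxdump record into its fields.

    Records end with tab, pipe, newline; fields are separated by
    tab, pipe, tab.
    """
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the tab and pipe that close the record
    return line.split("\t|\t")

# Sample record with an empty third field, in the style of names.dmp.
sample = "1\t|\troot\t|\t\t|\tscientific name\t|\n"
print(parse_dmp_line(sample))
```

Handling the terminator separately keeps the last field clean and lets
empty fields in the middle of a record survive the split.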
 
> There are Perl scripts in the tarball.  I think I read there, that if
> the NCBI dump files are older than 180 days, it downloads newer
> versions?  Or maybe I was reading something else.
>
> In any event, the BioSQL site at Github doesn't see much updating.  It
> looks to me like all the activity is in biopython, so I downloaded that
> for my Devuan machine.

As a mature database schema, we'd not expect much change.
The only substantial change in BioSQL in recent years was
extending the schema to work on SQLite.

 
> > This is in Python, but my most recent occasion to process
> > this data was to make a cut-down version of the NCBI
> > taxonomy as part of constructing a small test dataset:
> >
> > https://github.com/abaizan/kodoja/blob/master/test/taxonomy/filter_taxonomy.py
>
> I saw this on Google; you labelled something a bug.

Possibly you meant this recent work - something I had been
meaning to fix, but this conversation prompted me to do it:


 
> In looking for the new_taxdump thing (via Google), another Perl script
> about findingSpeciesFromGenus (or something like that) popped up.  So,
> I have a few things of source to look through.
>
> Thanks.
>
> Gord


Yes, the NCBI taxonomy has existed in this format for over
a decade I think - there should be lots of scripts out there
for use/guidance.

Peter 

