Identify the maximum level of taxonomic detail in a categorisation - lvphj/epydemiology GitHub Wiki

phjMaxLevelOfTaxonomicDetail()

import numpy as np
import pandas as pd
import epydemiology as epy

df = epy.phjMaxLevelOfTaxonomicDetail(phjDF,
                                      phjFirstCol,
                                      phjLastCol,
                                      phjNewColName = 'newColumn',
                                      phjDropPreExisting = False,
                                      phjCleanup = False,
                                      phjPrintResults = False)

Description

This function takes a Pandas dataframe containing a taxonomic classification of various descriptors and returns a column containing the maximum level of taxonomic detail that each description represents. This will, almost certainly, make more sense with an example.

An example dataframe is given below:

import numpy as np
import pandas as pd
import collections

myOrderedDict = collections.OrderedDict()
myOrderedDict['Descriptor'] = ['dog','ferret','cat','rabbit','horse','primate','rodent','gerbil','guinea pig','rat','mammal','lizard','snake','common basilisk','turtle','tortoise','spur-thighed tortoise']
myOrderedDict['Phylum'] = ['Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata']
myOrderedDict['Class'] = ['Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia']
myOrderedDict['Order'] = ['Carnivora','Carnivora','Carnivora','Lagomorpha','Perissodactyla','Primates','Rodentia','Rodentia','Rodentia','Rodentia','','Squamata','Squamata','Squamata','Testudines','Testudines','Testudines']
myOrderedDict['Suborder'] = ['','','Feliformia','','','','','','','','','Lacertilia','Serpentes','Iguania','','Cryptodira','Cryptodira']
myOrderedDict['Superfamily'] = ['','','','','','','','','','','','','','','','','']
myOrderedDict['Family'] = ['Canidae','Mustelidae','Felidae','Leporidae','Equidae','','','Muridae','Caviidae','Muridae','','','','Corytophanidae','','Testudinidae','Testudinidae']
myOrderedDict['Subfamily'] = ['','','','','','','','Gerbillinae','','Murinae','','','','','','','']
myOrderedDict['Genus'] = ['Canis','Mustela','Felis','Oryctolagus','Equus','','','','Cavia','Rattus','','','','Basiliscus','','','Testudo']
myOrderedDict['Species'] = ['lupus','putorius','silvestris','cuniculus','ferus','','','','porcellus','norvegicus','','','','basiliscus','','','graeca']
myOrderedDict['Subspecies'] = ['familiaris','furo','catus','','caballus','','','','','domestica','','','','','','','']

df = pd.DataFrame(myOrderedDict)

The dataframe contains several descriptions of animals and a corresponding taxonomic representation.

               Descriptor    Phylum     Class           Order    Suborder  \
0                     dog  Chordata  Mammalia       Carnivora               
1                  ferret  Chordata  Mammalia       Carnivora               
2                     cat  Chordata  Mammalia       Carnivora  Feliformia   
3                  rabbit  Chordata  Mammalia      Lagomorpha               
4                   horse  Chordata  Mammalia  Perissodactyla               
5                 primate  Chordata  Mammalia        Primates               
6                  rodent  Chordata  Mammalia        Rodentia               
7                  gerbil  Chordata  Mammalia        Rodentia               
8              guinea pig  Chordata  Mammalia        Rodentia               
9                     rat  Chordata  Mammalia        Rodentia               
10                 mammal  Chordata  Mammalia                               
11                 lizard  Chordata  Reptilia        Squamata  Lacertilia   
12                  snake  Chordata  Reptilia        Squamata   Serpentes   
13        common basilisk  Chordata  Reptilia        Squamata     Iguania   
14                 turtle  Chordata  Reptilia      Testudines               
15               tortoise  Chordata  Reptilia      Testudines  Cryptodira   
16  spur-thighed tortoise  Chordata  Reptilia      Testudines  Cryptodira   

   Superfamily          Family    Subfamily        Genus     Species  \
0                      Canidae                     Canis       lupus   
1                   Mustelidae                   Mustela    putorius   
2                      Felidae                     Felis  silvestris   
3                    Leporidae               Oryctolagus   cuniculus   
4                      Equidae                     Equus       ferus   
5                                                                      
6                                                                      
7                      Muridae  Gerbillinae                            
8                     Caviidae                     Cavia   porcellus   
9                      Muridae      Murinae       Rattus  norvegicus   
10                                                                     
11                                                                     
12                                                                     
13              Corytophanidae                Basiliscus  basiliscus   
14                                                                     
15                Testudinidae                                         
16                Testudinidae                   Testudo      graeca   

    Subspecies  
0   familiaris  
1         furo  
2        catus  
3               
4     caballus  
5               
6               
7               
8               
9    domestica  
10              
11              
12              
13              
14              
15              
16              

Not all categories have an entry. For example, in the above dataframe, 'dog' is categorised down to the level of 'subspecies' but it does not contain information for 'suborder', 'superfamily' or 'subfamily'.

The function works by converting each row of data to a binary representation based on whether each cell contains text (e.g. text - blank - blank - text - blank would be represented as 10010). The position of the rightmost set bit is then determined using a method based on two's complement. In the preceding example, the rightmost set bit is at position 2 (counting from the right, starting at 1).
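As a minimal sketch of this encoding (not the library's own code), a single row can be processed in plain Python. The helper name `row_to_bits` is hypothetical and used only for illustration.

```python
def row_to_bits(cells):
    """Encode a row as a bit string: '1' where a cell holds text, '0' where it is blank."""
    return ''.join('1' if str(c).strip() else '0' for c in cells)

bits = row_to_bits(['text', '', '', 'text', ''])   # '10010'
n = int(bits, 2)                                    # binary 10010 is decimal 18
lowest = n & -n                                     # keeps only the rightmost set bit
pos_from_right = lowest.bit_length()                # 1-based position from the right

print(bits, pos_from_right)
```

Here `int.bit_length()` is used instead of a logarithm to read off the position; for a power of two both approaches give the same answer.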

A column that contains the name of the maximum taxonomic descriptor based on the entries in the dataframe is produced. For example, the maximum taxonomic descriptor for 'dog' is 'subspecies' (i.e. 'familiaris') whilst the maximum taxonomic descriptor for 'tortoise' is 'family' (i.e. 'Testudinidae').
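The step from a bit position to a column name can be sketched as follows. This is an illustration under stated assumptions rather than the function's actual implementation: `max_level` and `tax_cols` are hypothetical names, and the two rows are copied from the example dataframe.

```python
tax_cols = ['Phylum', 'Class', 'Order', 'Suborder', 'Superfamily',
            'Family', 'Subfamily', 'Genus', 'Species', 'Subspecies']

def max_level(cells, cols=tax_cols):
    """Return the name of the rightmost column that holds text, or None if all are blank."""
    n = int(''.join('1' if c else '0' for c in cells), 2)
    if n == 0:
        return None
    pos_from_right = (n & -n).bit_length()   # 1-based, counted from the right
    return cols[len(cols) - pos_from_right]

dog = ['Chordata', 'Mammalia', 'Carnivora', '', '', 'Canidae', '',
       'Canis', 'lupus', 'familiaris']
tortoise = ['Chordata', 'Reptilia', 'Testudines', 'Cryptodira', '',
            'Testudinidae', '', '', '', '']

print(max_level(dog))
print(max_level(tortoise))
```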

Function parameters

  1. phjDF

    The dataframe containing taxonomic information.

  2. phjFirstCol

    The name for the first column containing taxonomic classification (e.g. 'Phylum').

  3. phjLastCol

    The name of the last column containing taxonomic classification (e.g. 'Subspecies').

  4. phjNewColName (default = 'newColumn')

    The name of the column that will be created to contain the maximum taxonomic category.

  5. phjDropPreExisting (default = False)

    If set to True, the function will delete any pre-existing columns that have the same name as those that will be created when the function is run.

  6. phjCleanup (default = False)

    If set to True, the function will delete temporary columns created during the process of running the function.

  7. phjPrintResults (default = False)

    If set to True, the function will print information to screen as it proceeds.

Exceptions raised

None.

Returns

By default, the function returns the original dataframe with an added column containing the maximum taxonomic category, together with two temporary columns: bin (a binary representation of the columns) and posFromR (the position, counted from the right, of the rightmost column containing taxonomic information). These temporary columns can be removed automatically by setting phjCleanup = True.

Other notes

It is assumed that all the columns to be considered are consecutive within the dataframe and that the order of columns as they occur in the dataframe (e.g. Phylum, Class, Order, Family, Genus, Species) is meaningful.
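To illustrate what "consecutive" means here, the sketch below picks out every column name lying between a first and last column, inclusive. The helper `columns_between` is hypothetical and works on a plain list of names; it is not necessarily how the library selects columns.

```python
def columns_between(all_cols, first_col, last_col):
    """Return the names from first_col to last_col inclusive, in their existing order."""
    start = all_cols.index(first_col)
    end = all_cols.index(last_col)
    if start > end:
        raise ValueError('first_col must come before last_col')
    return all_cols[start:end + 1]

all_cols = ['Descriptor', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species', 'Notes']
selected = columns_between(all_cols, 'Phylum', 'Species')
print(selected)
```

With a Pandas dataframe, the same list could be obtained from `list(df.columns)`; any unrelated column sitting between the first and last names would be treated as a taxonomic level, which is why the ordering assumption matters.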

This method makes use of the idea of two's-complement (see https://en.wikipedia.org/wiki/Two%27s_complement#From_the_ones'_complement). The algorithm to find the position of the rightmost set bit (i.e. the position on the right that is set to '1') was described at: https://www.geeksforgeeks.org/position-of-rightmost-set-bit/ but was a little confusing. The following has been rewritten to make it clearer.

Using the example of 00010010 (decimal 18)

  1. Take two's complement of binary number. This can be found by flipping each binary digit (e.g. in an 8-bit system, the decimal number 18 is represented as 00010010 which would become 11101101, the one's complement) and then adding 1 (so 11101101 would become 11101110).

  2. Do a bit-wise AND with the original number (i.e. the result is 1 only where both bits are 1, which is equivalent to multiplying the two bits at each position, e.g. 0 x 0 = 0, 1 x 0 = 0, 1 x 1 = 1). This produces a number with a '1' at the required position, in this case 00000010.

  3. Take the log2 of the binary number to give the position minus one (i.e. log2(00000010) = 1).

  4. Add one to produce the final answer (i.e. 1 + 1 = 2).
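The four steps above can be traced in Python for the 8-bit example. This is only a worked illustration; the mask `0xFF` is an assumption used to mimic an 8-bit system, because Python integers are not fixed-width.

```python
import math

n = 0b00010010                        # decimal 18

ones = ~n & 0xFF                      # step 1a: flip every bit (one's complement)
twos = (ones + 1) & 0xFF              # step 1b: add 1 (two's complement)
isolated = n & twos                   # step 2: bit-wise AND with the original
position = int(math.log2(isolated)) + 1   # steps 3 and 4: log2, then add one

print(format(ones, '08b'))            # 11101101
print(format(twos, '08b'))            # 11101110
print(format(isolated, '08b'))        # 00000010
print(position)                       # 2
```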

The site also gave the following Python code:

# Python Code for Position
# of rightmost set bit

import math

def getFirstSetBitPos(n):

    return math.log2(n&-n)+1

# driver code

n = 12
print(int(getFirstSetBitPos(n)))

# This code is contributed
# by Anant Agarwal.

This was adapted to use array arithmetic in a Pandas dataframe:

df['pos'] = (np.log2(df['bin']&-df['bin'])+1).astype(int)

# Position of rightmost set bit
phjTempDF['posFromR'] = (np.log2(phjTempDF['bin'].astype(int) & -phjTempDF['bin'].astype(int)) + 1).astype(int)

If all cells in a row were empty, the binary representation would be 000...000. This causes problems when calculating the position of the rightmost set bit because n & -n is then 0 and log2(0) is undefined (NumPy returns -inf with a warning). To overcome this problem, a '1' is added to the start of each binary string; this does not affect the position of the rightmost set bit except in cases where all cells are empty, in which case the rightmost set bit lies outside the range of columns being considered.
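A small sketch of the sentinel idea, assuming a toy three-column dataframe (the column names A, B and C and the variable names are illustrative and not taken from the library):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': ['x', '', 'x'],
                    'B': ['', '', 'y'],
                    'C': ['', '', '']})
cols = ['A', 'B', 'C']

# One bit per column: '1' where the cell holds text, '0' where it is blank
bits = toy[cols].apply(lambda row: ''.join('1' if v != '' else '0' for v in row), axis=1)

# Prepend the sentinel '1' so that an all-blank row never becomes zero
n = ('1' + bits).apply(lambda s: int(s, 2))

# Position of the rightmost set bit, counted from the right
pos = (np.log2(n & -n) + 1).astype(int)

# A position beyond the number of columns means the row was entirely blank
level = [cols[len(cols) - p] if p <= len(cols) else None for p in pos]
print(level)
```

The second row is entirely blank, so its only set bit is the sentinel at position 4, one place beyond the three columns considered.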

Example

import numpy as np
import pandas as pd
import collections
import epydemiology as epy

myOrderedDict = collections.OrderedDict()
myOrderedDict['Descriptor'] = ['dog','ferret','cat','rabbit','horse','primate','rodent','gerbil','guinea pig','rat','mammal','lizard','snake','common basilisk','turtle','tortoise','spur-thighed tortoise']
myOrderedDict['Phylum'] = ['Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata']
myOrderedDict['Class'] = ['Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia']
myOrderedDict['Order'] = ['Carnivora','Carnivora','Carnivora','Lagomorpha','Perissodactyla','Primates','Rodentia','Rodentia','Rodentia','Rodentia','','Squamata','Squamata','Squamata','Testudines','Testudines','Testudines']
myOrderedDict['Suborder'] = ['','','Feliformia','','','','','','','','','Lacertilia','Serpentes','Iguania','','Cryptodira','Cryptodira']
myOrderedDict['Superfamily'] = ['','','','','','','','','','','','','','','','','']
myOrderedDict['Family'] = ['Canidae','Mustelidae','Felidae','Leporidae','Equidae','','','Muridae','Caviidae','Muridae','','','','Corytophanidae','','Testudinidae','Testudinidae']
myOrderedDict['Subfamily'] = ['','','','','','','','Gerbillinae','','Murinae','','','','','','','']
myOrderedDict['Genus'] = ['Canis','Mustela','Felis','Oryctolagus','Equus','','','','Cavia','Rattus','','','','Basiliscus','','','Testudo']
myOrderedDict['Species'] = ['lupus','putorius','silvestris','cuniculus','ferus','','','','porcellus','norvegicus','','','','basiliscus','','','graeca']
myOrderedDict['Subspecies'] = ['familiaris','furo','catus','','caballus','','','','','domestica','','','','','','','']

df = pd.DataFrame(myOrderedDict)

df = epy.phjMaxLevelOfTaxonomicDetail(phjDF = df,
                                      phjFirstCol = 'Phylum',
                                      phjLastCol = 'Subspecies',
                                      phjNewColName = 'max_tax_detail',
                                      phjDropPreExisting = False,
                                      phjCleanup = True,
                                      phjPrintResults = False)

This function adds a column which contains the name of the rightmost column that contains an entry.

               Descriptor    Phylum     Class           Order    Suborder  \
0                     dog  Chordata  Mammalia       Carnivora               
1                  ferret  Chordata  Mammalia       Carnivora               
2                     cat  Chordata  Mammalia       Carnivora  Feliformia   
3                  rabbit  Chordata  Mammalia      Lagomorpha               
4                   horse  Chordata  Mammalia  Perissodactyla               
5                 primate  Chordata  Mammalia        Primates               
6                  rodent  Chordata  Mammalia        Rodentia               
7                  gerbil  Chordata  Mammalia        Rodentia               
8              guinea pig  Chordata  Mammalia        Rodentia               
9                     rat  Chordata  Mammalia        Rodentia               
10                 mammal  Chordata  Mammalia                               
11                 lizard  Chordata  Reptilia        Squamata  Lacertilia   
12                  snake  Chordata  Reptilia        Squamata   Serpentes   
13        common basilisk  Chordata  Reptilia        Squamata     Iguania   
14                 turtle  Chordata  Reptilia      Testudines               
15               tortoise  Chordata  Reptilia      Testudines  Cryptodira   
16  spur-thighed tortoise  Chordata  Reptilia      Testudines  Cryptodira   

   Superfamily          Family    Subfamily        Genus     Species  \
0                      Canidae                     Canis       lupus   
1                   Mustelidae                   Mustela    putorius   
2                      Felidae                     Felis  silvestris   
3                    Leporidae               Oryctolagus   cuniculus   
4                      Equidae                     Equus       ferus   
5                                                                      
6                                                                      
7                      Muridae  Gerbillinae                            
8                     Caviidae                     Cavia   porcellus   
9                      Muridae      Murinae       Rattus  norvegicus   
10                                                                     
11                                                                     
12                                                                     
13              Corytophanidae                Basiliscus  basiliscus   
14                                                                     
15                Testudinidae                                         
16                Testudinidae                   Testudo      graeca   

    Subspecies max_tax_detail  
0   familiaris     Subspecies  
1         furo     Subspecies  
2        catus     Subspecies  
3                     Species  
4     caballus     Subspecies  
5                       Order  
6                       Order  
7                   Subfamily  
8                     Species  
9    domestica     Subspecies  
10                      Class  
11                   Suborder  
12                   Suborder  
13                    Species  
14                      Order  
15                     Family  
16                    Species