Identify the maximum level of taxonomic detail in a categorisation - lvphj/epydemiology GitHub Wiki
phjMaxLevelOfTaxonomicDetail()
import numpy as np
import pandas as pd
import epydemiology as epy
df = epy.phjMaxLevelOfTaxonomicDetail(phjDF,
                                      phjFirstCol,
                                      phjLastCol,
                                      phjNewColName = 'newColumn',
                                      phjDropPreExisting = False,
                                      phjCleanup = False,
                                      phjPrintResults = False)
Description
This function takes a Pandas dataframe containing a taxonomic classification of various descriptors and adds a column giving the maximum level of taxonomic detail recorded for each descriptor. This is easiest to understand with an example.
An example dataframe is given below:
import numpy as np
import pandas as pd
import collections
myOrderedDict = collections.OrderedDict()
myOrderedDict['Descriptor'] = ['dog','ferret','cat','rabbit','horse','primate','rodent','gerbil','guinea pig','rat','mammal','lizard','snake','common basilisk','turtle','tortoise','spur-thighed tortoise']
myOrderedDict['Phylum'] = ['Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata']
myOrderedDict['Class'] = ['Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia']
myOrderedDict['Order'] = ['Carnivora','Carnivora','Carnivora','Lagomorpha','Perissodactyla','Primates','Rodentia','Rodentia','Rodentia','Rodentia','','Squamata','Squamata','Squamata','Testudines','Testudines','Testudines']
myOrderedDict['Suborder'] = ['','','Feliformia','','','','','','','','','Lacertilia','Serpentes','Iguania','','Cryptodira','Cryptodira']
myOrderedDict['Superfamily'] = ['','','','','','','','','','','','','','','','','']
myOrderedDict['Family'] = ['Canidae','Mustelidae','Felidae','Leporidae','Equidae','','','Muridae','Caviidae','Muridae','','','','Corytophanidae','','Testudinidae','Testudinidae']
myOrderedDict['Subfamily'] = ['','','','','','','','Gerbillinae','','Murinae','','','','','','','']
myOrderedDict['Genus'] = ['Canis','Mustela','Felis','Oryctolagus','Equus','','','','Cavia','Rattus','','','','Basiliscus','','','Testudo']
myOrderedDict['Species'] = ['lupus','putorius','silvestris','cuniculus','ferus','','','','porcellus','norvegicus','','','','basiliscus','','','graeca']
myOrderedDict['Subspecies'] = ['familiaris','furo','catus','','caballus','','','','','domestica','','','','','','','']
df = pd.DataFrame(myOrderedDict)
The dataframe contains several descriptions of animals and a corresponding taxonomic representation.
    Descriptor             Phylum    Class     Order           Suborder  \
0   dog                    Chordata  Mammalia  Carnivora
1   ferret                 Chordata  Mammalia  Carnivora
2   cat                    Chordata  Mammalia  Carnivora       Feliformia
3   rabbit                 Chordata  Mammalia  Lagomorpha
4   horse                  Chordata  Mammalia  Perissodactyla
5   primate                Chordata  Mammalia  Primates
6   rodent                 Chordata  Mammalia  Rodentia
7   gerbil                 Chordata  Mammalia  Rodentia
8   guinea pig             Chordata  Mammalia  Rodentia
9   rat                    Chordata  Mammalia  Rodentia
10  mammal                 Chordata  Mammalia
11  lizard                 Chordata  Reptilia  Squamata        Lacertilia
12  snake                  Chordata  Reptilia  Squamata        Serpentes
13  common basilisk        Chordata  Reptilia  Squamata        Iguania
14  turtle                 Chordata  Reptilia  Testudines
15  tortoise               Chordata  Reptilia  Testudines      Cryptodira
16  spur-thighed tortoise  Chordata  Reptilia  Testudines      Cryptodira

    Superfamily  Family          Subfamily    Genus        Species  \
0                Canidae                      Canis        lupus
1                Mustelidae                   Mustela      putorius
2                Felidae                      Felis        silvestris
3                Leporidae                    Oryctolagus  cuniculus
4                Equidae                      Equus        ferus
5
6
7                Muridae         Gerbillinae
8                Caviidae                     Cavia        porcellus
9                Muridae         Murinae      Rattus       norvegicus
10
11
12
13               Corytophanidae               Basiliscus   basiliscus
14
15               Testudinidae
16               Testudinidae                 Testudo      graeca

    Subspecies
0   familiaris
1   furo
2   catus
3
4   caballus
5
6
7
8
9   domestica
10
11
12
13
14
15
16
Not all categories have an entry. For example, in the above dataframe, 'dog' is categorised down to the level of 'subspecies' but it does not contain information for 'suborder', 'superfamily' or 'subfamily'.
The function works by converting each row of data to a binary representation based on whether each cell contains text (e.g. text - blank - blank - text - blank would be represented as 10010). The position of the rightmost set bit is then determined using a method based on two's complement. In the preceding example, the rightmost set bit is at position 2 (counting from the right).
A column that contains the name of the maximum taxonomic descriptor based on the entries in the dataframe is produced. For example, the maximum taxonomic descriptor for 'dog' is 'subspecies' (i.e. 'familiaris') whilst the maximum taxonomic descriptor for 'tortoise' is 'family' (i.e. 'Testudinidae').
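The encoding described above can be sketched on a small dataframe. This is only an illustration, not the library's own code: the names demo, rank_cols, bin_str, max_rank and deepest_rank are hypothetical, and the rightmost filled column is found here with a simple string search rather than the two's complement method that the function uses.

```python
import pandas as pd

# Hypothetical miniature of the example dataframe with only three ranks.
demo = pd.DataFrame({
    'Descriptor': ['dog', 'rodent', 'mammal'],
    'Class': ['Mammalia', 'Mammalia', 'Mammalia'],
    'Order': ['Carnivora', 'Rodentia', ''],
    'Genus': ['Canis', '', ''],
})
rank_cols = ['Class', 'Order', 'Genus']

# 1 where a cell contains text, 0 where it is blank, joined left to right.
filled = demo[rank_cols].ne('').astype(int).astype(str)
demo['bin_str'] = filled.apply(lambda row: ''.join(row), axis=1)

def deepest_rank(bits):
    """Name of the rightmost filled column, or None if the row is empty."""
    last = bits.rfind('1')
    return rank_cols[last] if last >= 0 else None

demo['max_rank'] = demo['bin_str'].apply(deepest_rank)
print(demo[['Descriptor', 'bin_str', 'max_rank']])
```

In this sketch 'dog' is encoded as 111 and maps to 'Genus', whereas 'mammal' is encoded as 100 and maps to 'Class'.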
Function parameters
- phjDF
  The dataframe containing taxonomic information.
- phjFirstCol
  The name for the first column containing taxonomic classification (e.g. 'Phylum').
- phjLastCol
  The name of the last column containing taxonomic classification (e.g. 'Subspecies').
- phjNewColName (default = 'newColumn')
  The name of the column that will be created to contain the maximum taxonomic category.
- phjDropPreExisting (default = False)
  If set to True, the function will delete any pre-existing columns that have the same name as those that will be created when the function is run.
- phjCleanup (default = False)
  If set to True, the function will delete temporary columns created during the process of running the function.
- phjPrintResults (default = False)
  If set to True, the function will print information to screen as it proceeds.
Exceptions raised
None.
Returns
By default, the function returns the original dataframe with an added column containing the maximum taxonomic category, together with two temporary columns, bin (a binary representation of the columns) and posFromR (the position of the rightmost column containing taxonomic information). These temporary columns can be automatically removed by setting phjCleanup = True.
Other notes
It is assumed that all the columns to be considered are consecutive within the dataframe and that the order of columns as they occur in the dataframe (e.g. Phylum, Class, Order, Family, Genus, Species) is meaningful.
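Because the columns are assumed to be consecutive, the block of columns between the first and last names can be picked out with a label slice. The following is a minimal sketch of that idea. How the function selects the columns internally is not documented here, so this is an assumption, and the names demo, first_col, last_col and rank_cols are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with non-taxonomic columns at either end.
demo = pd.DataFrame({
    'Descriptor': ['dog'],
    'Phylum': ['Chordata'],
    'Class': ['Mammalia'],
    'Order': ['Carnivora'],
    'Notes': ['example row'],
})

first_col, last_col = 'Phylum', 'Order'

# Label slices with .loc include both end points and follow column order.
rank_cols = demo.loc[:, first_col:last_col].columns.tolist()
print(rank_cols)
```

Any unrelated column placed between the first and last columns would be included in the slice, which is why the column order matters.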
This method makes use of the idea of two's-complement (see https://en.wikipedia.org/wiki/Two%27s_complement#From_the_ones'_complement). The algorithm to find the position of the rightmost set bit (i.e. the position on the right that is set to '1') was described at: https://www.geeksforgeeks.org/position-of-rightmost-set-bit/ but was a little confusing. The following has been rewritten to make it clearer.
Using the example of 00010010 (decimal 18):
- Take the two's complement of the binary number. This can be found by flipping each binary digit (e.g. in an 8-bit system, the decimal number 18 is represented as 00010010 which would become 11101101, the one's complement) and then adding 1 (so 11101101 would become 11101110).
- Do a bit-wise AND with the original number (i.e. the result equals 1 only where both bits are equal to 1; this can be achieved simply by multiplying the two bits at each position e.g. 0 x 0 = 0, 1 x 0 = 0, 1 x 1 = 1). This produces a number with a single '1' at the required position, in this case 00000010.
- Take the log2 of the result to give the position minus one (i.e. log2(00000010) = 1).
- Add one to produce the final answer (i.e. 1 + 1 = 2).
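The steps above can be followed one at a time in plain Python. This is a sketch for illustration only: the function name rightmost_set_bit_position and the fixed 8-bit width are assumptions made here to mirror the worked example.

```python
import math

def rightmost_set_bit_position(n, width=8):
    """1-based position, counted from the right, of the lowest set bit of n."""
    if not 0 < n < (1 << width):
        raise ValueError("n must be positive and fit within `width` bits")
    mask = (1 << width) - 1
    ones_complement = ~n & mask                     # flip every bit
    twos_complement = (ones_complement + 1) & mask  # then add one
    isolated = n & twos_complement                  # only the lowest set bit survives
    return int(math.log2(isolated)) + 1

n = 18
print(format(n, '08b'))               # original number
print(format(~n & 0xFF, '08b'))       # one's complement
print(format((~n + 1) & 0xFF, '08b')) # two's complement
print(rightmost_set_bit_position(n))
```

Python integers are not limited to a fixed width, so n & -n gives the same isolated bit directly; the explicit mask is used only to make the 8-bit intermediate values visible.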
The site also gave the following Python code:
# Python Code for Position
# of rightmost set bit
import math

def getFirstSetBitPos(n):
    return math.log2(n & -n) + 1

# driver code
n = 12
print(int(getFirstSetBitPos(n)))

# This code is contributed
# by Anant Agarwal.
This was adapted to use array arithmetic in a Pandas dataframe:
df['pos'] = (np.log2(df['bin']&-df['bin'])+1).astype(int)
# Position of rightmost set bit
phjTempDF['posFromR'] = (np.log2(phjTempDF['bin'].astype(int) & -phjTempDF['bin'].astype(int)) + 1).astype(int)
If all cells in a row were empty, the binary representation would be 000...000. This causes problems when finding the rightmost set bit because the bit-wise AND is then 0 and log2(0) is undefined (NumPy returns -inf with a warning). To overcome this problem, a '1' is added to the start of each binary string; this does not affect the position of the rightmost set bit except in cases where all cells are empty, in which case the rightmost set bit lies outside the number of columns being considered.
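The effect of the leading '1' can be seen in a short vectorised sketch. The series bits and the names as_int, pos_from_right, n_cols and is_empty are hypothetical, and the code only illustrates the idea rather than reproducing the function's internals.

```python
import numpy as np
import pandas as pd

# Hypothetical binary strings for three rows of five columns each;
# the last row represents a row in which every cell is empty.
bits = pd.Series(['10010', '11111', '00000'])
n_cols = 5

# Prefix a sentinel '1' so that no value is zero, then convert to integers.
as_int = bits.map(lambda s: int('1' + s, 2))

# Same array arithmetic as above: isolate the lowest set bit, then log2 + 1.
pos_from_right = (np.log2(as_int & -as_int) + 1).astype(int)

# A position beyond the number of columns marks an entirely empty row.
is_empty = pos_from_right > n_cols
print(pd.DataFrame({'bits': bits, 'posFromR': pos_from_right, 'empty': is_empty}))
```

For the first two rows the position is unchanged by the sentinel (2 and 1 respectively), while the empty row yields position 6, which is one more than the number of columns.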
Example
import numpy as np
import pandas as pd
import collections
import epydemiology as epy
myOrderedDict = collections.OrderedDict()
myOrderedDict['Descriptor'] = ['dog','ferret','cat','rabbit','horse','primate','rodent','gerbil','guinea pig','rat','mammal','lizard','snake','common basilisk','turtle','tortoise','spur-thighed tortoise']
myOrderedDict['Phylum'] = ['Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata','Chordata']
myOrderedDict['Class'] = ['Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Mammalia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia','Reptilia']
myOrderedDict['Order'] = ['Carnivora','Carnivora','Carnivora','Lagomorpha','Perissodactyla','Primates','Rodentia','Rodentia','Rodentia','Rodentia','','Squamata','Squamata','Squamata','Testudines','Testudines','Testudines']
myOrderedDict['Suborder'] = ['','','Feliformia','','','','','','','','','Lacertilia','Serpentes','Iguania','','Cryptodira','Cryptodira']
myOrderedDict['Superfamily'] = ['','','','','','','','','','','','','','','','','']
myOrderedDict['Family'] = ['Canidae','Mustelidae','Felidae','Leporidae','Equidae','','','Muridae','Caviidae','Muridae','','','','Corytophanidae','','Testudinidae','Testudinidae']
myOrderedDict['Subfamily'] = ['','','','','','','','Gerbillinae','','Murinae','','','','','','','']
myOrderedDict['Genus'] = ['Canis','Mustela','Felis','Oryctolagus','Equus','','','','Cavia','Rattus','','','','Basiliscus','','','Testudo']
myOrderedDict['Species'] = ['lupus','putorius','silvestris','cuniculus','ferus','','','','porcellus','norvegicus','','','','basiliscus','','','graeca']
myOrderedDict['Subspecies'] = ['familiaris','furo','catus','','caballus','','','','','domestica','','','','','','','']
df = pd.DataFrame(myOrderedDict)
df = epy.phjMaxLevelOfTaxonomicDetail(phjDF = df,
                                      phjFirstCol = 'Phylum',
                                      phjLastCol = 'Subspecies',
                                      phjNewColName = 'max_tax_detail',
                                      phjDropPreExisting = False,
                                      phjCleanup = True,
                                      phjPrintResults = False)
This function adds a column which contains the name of the rightmost column that contains an entry.
    Descriptor             Phylum    Class     Order           Suborder  \
0   dog                    Chordata  Mammalia  Carnivora
1   ferret                 Chordata  Mammalia  Carnivora
2   cat                    Chordata  Mammalia  Carnivora       Feliformia
3   rabbit                 Chordata  Mammalia  Lagomorpha
4   horse                  Chordata  Mammalia  Perissodactyla
5   primate                Chordata  Mammalia  Primates
6   rodent                 Chordata  Mammalia  Rodentia
7   gerbil                 Chordata  Mammalia  Rodentia
8   guinea pig             Chordata  Mammalia  Rodentia
9   rat                    Chordata  Mammalia  Rodentia
10  mammal                 Chordata  Mammalia
11  lizard                 Chordata  Reptilia  Squamata        Lacertilia
12  snake                  Chordata  Reptilia  Squamata        Serpentes
13  common basilisk        Chordata  Reptilia  Squamata        Iguania
14  turtle                 Chordata  Reptilia  Testudines
15  tortoise               Chordata  Reptilia  Testudines      Cryptodira
16  spur-thighed tortoise  Chordata  Reptilia  Testudines      Cryptodira

    Superfamily  Family          Subfamily    Genus        Species  \
0                Canidae                      Canis        lupus
1                Mustelidae                   Mustela      putorius
2                Felidae                      Felis        silvestris
3                Leporidae                    Oryctolagus  cuniculus
4                Equidae                      Equus        ferus
5
6
7                Muridae         Gerbillinae
8                Caviidae                     Cavia        porcellus
9                Muridae         Murinae      Rattus       norvegicus
10
11
12
13               Corytophanidae               Basiliscus   basiliscus
14
15               Testudinidae
16               Testudinidae                 Testudo      graeca

    Subspecies  max_tax_detail
0   familiaris  Subspecies
1   furo        Subspecies
2   catus       Subspecies
3               Species
4   caballus    Subspecies
5               Order
6               Order
7               Subfamily
8               Species
9   domestica   Subspecies
10              Class
11              Suborder
12              Suborder
13              Species
14              Order
15              Family
16              Species