Chapter 8 problem set 1 - UCD-pbio-rclub/python_problems GitHub Wiki

Chapter 8 problem set 1

John

  1. Read these 3 files into python and make them data frames: flights.xlsx, weather.tsv, airlines.csv

  2. Remove the 'hour' column from the weather data frame

  3. Inner join these three data frames. You will have to modify certain columns and be explicit about which columns to merge on. (should have 29 columns in the end)

  4. Set year,month,day,and origin to indexes

Min-Yao

This question is related to Hierarchical Indexing.

1. Using the same data from last week. (Import my RNA-Seq CPM data from 'Expression Browser_CPM_practice.xlsx' file. Expression Browser_CPM_practice.xlsx) In column labels, the first number means plant genotype and the second part is one letter and a number which means the treatment conditions and sample numbers. please change them to multilevel hierarchical columns labels with column name in front of the whole data. For example, '6_c1' change it to genotype = '6', treatments = 'c', sample_number = '1'.

Answer

xlsx = pd.ExcelFile('HW7/Expression Browser_CPM_practice.xlsx')
RNASeq = pd.read_excel(xlsx, 'Expression Browser_CPM')
RNASeq.head()
RNASeq = RNASeq.set_index('Name')
RNASeq.head()
RNASeq_col_list = RNASeq.columns.tolist()
RNASeq_col_list
RNASeq_column = pd.DataFrame(RNASeq_col_list)
RNASeq_column
RNASeq_column2 = RNASeq_column[0].apply(lambda x: pd.Series(list(x.replace('_',''))))
RNASeq_column2
RNASeq_column3 = RNASeq_column2.drop(columns=[3])
RNASeq_column3
multiindex_column = pd.MultiIndex.from_frame(RNASeq_column3)
multiindex_column
RNASeq.columns = multiindex_column
RNASeq.head()
RNASeq.columns.names = ['genotype', 'treatments', 'sample_number']
RNASeq.head()

2. In row labels, these numbers are Solyc ID for tomatoes. "Solyc" is the 5 letter abbreviation of Solanum lycopersicum, the 2 digit number following the 'Solyc' denotes the chromosome, 'g' denotes that the sequence is a gene, and the 6 digit number following the 'g' identifies the gene on the chromosome. The '.1' denotes the annotation version number of the locus. Please use hierarchical indexing to label another level of index to show which chromosome each gene on and give the index name 'chromosome'.

Answer

index_list = pd.DataFrame(RNASeq.index.tolist())
index_list
chromosome = index_list[0].apply(lambda x: x[5:7])
chromosome
solycID = pd.Series(RNASeq.index)
solycID
newRNASeq = RNASeq.set_index([solycID, chromosome])
newRNASeq
newRNASeq.index.names = ['SolycID','chromosome']
newRNASeq

3. In order to compare genotypes and treatments, you want to know the summary statistics of different genotypes and treatments. Please calculate the average expression level of each gene in different genotypes and treatments.

Answer

newRNASeq.mean(level='genotype', axis=1)

newRNASeq.mean(level='treatments', axis=1)

newRNASeq.mean(level=['genotype','treatments'], axis=1)

Joel

These datasets contain information on US population by state/region over a period of time, as well as the state's area.

population = pd.read_csv("https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv")
areas = pd.read_csv("https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv")
stateabrevs = pd.read_csv("https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv")

Calculate the population density for the year 2012 (using the total population)

Answer

# Merge the state abbreviations with the areas dataset; this will help doing a second merge of the areas to the population table.

mergedPop = pd.merge(population, pd.merge(areas,stateabrevs,on="state"),left_on='state/region', right_on='abbreviation',how="outer")

# an outer merge on the second set helps mantain the data that doesn't have a match:
population.shape[0] == mergedPop.shape[0] #the original population and the merged have the same number of rows.

# Next, filter by total population (not just under 18) and year (2012)
total2012 = mergedPop[(mergedPop["ages"] == "total") & (mergedPop["year"] == 2012)]

# Calculate the population density:
density = total2012["population"] / total2012["area (sq. mi)"]
density

Kae

Using these baby name datasets from 1996 and 1998, make dataframes using the columns for name and the number of babies given that name. Merge the lists and see if there were any names used in one year but not the other.

Answer

import pandas as pd
bb6=pd.read_csv('/Users/klombardo/Garbage/babynames1996.txt', names=['name','sex','count'])
bb8=pd.read_csv('/Users/klombardo/Garbage/babynames1998.txt',names=['name','sex','count'])

new=pd.merge(bb6, bb8, on=['name','sex'],how='outer',indicator=True)
new
new[new['_merge'].str.contains("right_only")] 
new[new['_merge'].str.contains("left_only")]
#check to make sure everything adds up
new[new['_merge'].str.contains("right_only|left_only")] 

Rie

  1. Read csv file, named "Rie_Chap8part1Data.021319.csv"(https://github.com/UCD-pbio-rclub/python-data-analysis_RieU/blob/master/Rie_Chap8part1Data.021319.csv). Set two indexes for the columns 'miR' and 'category'
Answer

data = pd.read_csv('Rie_Chap8part1Data.021319.csv')
data2 = data.set_index(['miR','category'])
  1. Select the data based on the category (known and novel). Hint: I reshape the data first.
Answer

stackData = data2.stack()
known = stackData.loc[:, 'known']
novel = stackData.loc[:, 'novel']
⚠️ **GitHub.com Fallback** ⚠️