Chapter 7 problem set 1 - UCD-pbio-rclub/python_problems GitHub Wiki

Chapter 7 Problem Set 1

John

Practice with missing data

  1. Create a DataFrame using
import pandas as pd
import numpy as np
from numpy import nan as NA  # the NAN alias was removed in NumPy 2.0; nan works in all versions
df = pd.DataFrame(np.random.randn(10, 10), columns= ['a','b','c','d','e','f','g','h','i','j'])
df.iloc[:7, 1:3] = NA
df.iloc[4:8, 5:8] = NA
df.iloc[:,9] = NA
df
  2. Remove all rows where at least half the columns have missing data
  3. Remove all columns where at least half the rows have missing data
  4. Combine problems 2 and 3. Does the order matter?
  5. Remove all rows that have NA in either column 'b' or 'f'
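
No answers are given for this set, so here is a minimal sketch of one possible approach to the row and column filters above. The helper names `drop_sparse_rows` and `drop_sparse_cols` are hypothetical, and "at least half" is read as an NA fraction of 0.5 or more.

```python
import numpy as np
import pandas as pd

# Same NA layout as the problem statement; the random values themselves do not matter.
df = pd.DataFrame(np.random.randn(10, 10), columns=list("abcdefghij"))
df.iloc[:7, 1:3] = np.nan
df.iloc[4:8, 5:8] = np.nan
df.iloc[:, 9] = np.nan


def drop_sparse_rows(frame):
    """Hypothetical helper: keep rows where fewer than half the columns are NA."""
    return frame[frame.isna().mean(axis=1) < 0.5]


def drop_sparse_cols(frame):
    """Hypothetical helper: keep columns where fewer than half the rows are NA."""
    return frame.loc[:, frame.isna().mean(axis=0) < 0.5]


# Apply the two filters in both orders to see whether the order matters.
rows_then_cols = drop_sparse_cols(drop_sparse_rows(df))
cols_then_rows = drop_sparse_rows(drop_sparse_cols(df))

# Drop rows that have NA in either column 'b' or column 'f'.
no_na_b_f = df.dropna(subset=["b", "f"])

print(rows_then_cols.shape, cols_then_rows.shape, no_na_b_f.shape)
```

With this NA layout the two orders give different shapes, so the order does matter: dropping the mostly-empty columns first lowers the NA fraction of every remaining row.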

Min-Yao

  1. Import my RNA-Seq CPM data from the file 'Expression Browser_CPM_practice.xlsx'. Make the Itag number the row index. How many genes are in this data set?
Answer

```python
xlsx = pd.ExcelFile('HW7/Expression Browser_CPM_practice.xlsx')
RNASeq = pd.read_excel(xlsx, 'Expression Browser_CPM')
RNASeq
RNASeq = RNASeq.set_index('Name')
RNASeq
```

  2. Replace all 0 values with NA.
Answer

```python
replaced_RNASeq = RNASeq.replace(0, NA)
replaced_RNASeq
```

  3. We want to remove the genes that have no expression in any sample. How many genes are left after removing them?
Answer

```python
reduced_RNASeq = replaced_RNASeq.dropna(how='all')
reduced_RNASeq
```
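
The answers above display the tables but do not count the genes explicitly. A minimal sketch of the counting step on a toy table follows; the gene and sample names are hypothetical stand-ins, since the Excel file is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CPM table: rows are genes, columns are samples.
toy = pd.DataFrame(
    {"sample1": [0.0, 3.2, 0.0], "sample2": [0.0, 0.0, 1.5]},
    index=["geneA", "geneB", "geneC"],
)

# Number of genes before filtering (one row per gene).
n_genes = len(toy)

# Turn zeros into NA, then drop genes that are NA in every sample.
expressed = toy.replace(0, np.nan).dropna(how="all")
n_expressed = len(expressed)

print(n_genes, n_expressed)
```

Here `len()` on a DataFrame returns the row count, which is the gene count once the gene identifier is the index.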

Julin

In Chapter 7.1 we learned how to replace a missing value in a Series with the mean value of that series. Now we work with data frames. Create a data frame like so:

import pandas as pd
import numpy as np
from numpy import nan as NA

data = pd.DataFrame(np.random.randn(5,5))

data.iloc[3] = NA

data.iloc[2:, 3] = NA

data
  1. For each column of the data frame that has missing data, replace the missing values with the mean of that column.
Answer

```python
# it turns out the same method works, column-wise, on DataFrames
data.fillna(data.mean())
```

  2. Do the same thing, but only if there are fewer than 3 missing values in the column; otherwise leave the NAs in place.
Answer

# first find out which columns have fewer than 3 NAs
datafill = data.isna().sum() < 3  # isna().sum() gives the number of NAs per column

# next compute column means, then set the mean to NA for columns that should not be filled
colmeans = data.mean()
colmeans[~datafill] = NA

# now use these as the replacement values (fillna returns a new DataFrame)
data.fillna(colmeans)
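
As a cross-check on the answer above, here is a minimal sketch of a variant that passes `fillna` only the means of the columns that qualify, instead of overwriting the other means with NA. The names `na_counts`, `fill_values`, and `filled` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same NA layout as the problem statement.
data = pd.DataFrame(np.random.randn(5, 5))
data.iloc[3] = np.nan
data.iloc[2:, 3] = np.nan

# Number of NAs in each column.
na_counts = data.isna().sum()

# Keep the column means only for columns with fewer than 3 NAs.
fill_values = data.mean()[na_counts < 3]

# fillna matches the Series labels to column labels; columns without an entry are left untouched.
filled = data.fillna(fill_values)
print(filled.isna().sum())
```

With this layout, column 3 has three NAs and is left alone, while the single NA in every other column is replaced by that column's mean.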

Rie

  1. Read the CSV file named "soybean_miR.phythonClub.012519.csv" in my directory (https://github.com/UCD-pbio-rclub/python-data-analysis_RieU/blob/master/soybean_miR.phythonClub.012519.csv). "num" is the number of miRNAs in the category that I am interested in. I want to see the ratio of each miRNA relative to the total.
Answer

mir = pd.read_csv('soybean_miR.phythonClub.012519.csv')

  2. Eliminate the rows containing 0, because I am not interested in those miRNAs!
Answer

mir2 = mir.replace(0, NA)
dropped = mir2.dropna()

  3. Compute the ratio of each miRNA relative to the total. Append the ratio as a new column.
Answer

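# the selected columns are already Series, so no pd.Series(...) wrapper is needed
num = dropped['num']
total = dropped['total']

# element-wise division, aligned on the row index
ratio = num / total

dropped['ratio'] = ratio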
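
The same steps can also be sketched without the detour through NA, by filtering with a boolean mask. The table below is a hypothetical stand-in for the CSV file; only the column names `num` and `total` come from the answer above, and the miRNA names and counts are made up.

```python
import pandas as pd

# Hypothetical stand-in for the miRNA table, indexed by miRNA name.
mir_toy = pd.DataFrame(
    {"num": [4, 0, 6], "total": [10, 10, 10]},
    index=["miR-a", "miR-b", "miR-c"],
)

# Keep rows with no zero in any column; .copy() makes the result an independent DataFrame.
nonzero = mir_toy[(mir_toy != 0).all(axis=1)].copy()

# Ratio of each miRNA count to the total, stored as a new column.
nonzero["ratio"] = nonzero["num"] / nonzero["total"]
print(nonzero)
```

Filtering this way keeps integer columns as integers, whereas replacing 0 with NA first converts them to floats.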


Joel

Create a DataFrame with random numbers

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(20),columns=["S1"])
  1. Bin the data into quartiles (for a bonus, label each value according to its bin)
Answer

# qcut cuts the data into quantile-based bins
# 4 requests 4 bins with (roughly) equal numbers of values, i.e. quartiles
# labels assigns a name to each bin
quarts = pd.qcut(data["S1"], 4, labels=["25", "50", "75", "100"])
  2. Assign the corresponding bin label to the index of the data
Answer

data.index = quarts  # use the quartile labels as the index
data = data.sort_values(by="S1", ascending=True)  # sort_values returns a sorted copy, so assign it back
  3. Get the values for the first quartile in ascending order
Answer

data.loc["25"].sort_values(by="S1",ascending=True)
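
An alternative to putting the labels in the index is to keep them in a column and filter on it. The following is a minimal sketch of that variant; the seed and the column name `quartile` are assumptions added here so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption for reproducibility; the original uses unseeded random numbers).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal(20), columns=["S1"])

# Store the quartile label in a column rather than in the index.
data["quartile"] = pd.qcut(data["S1"], 4, labels=["25", "50", "75", "100"])

# Values in the first quartile, in ascending order.
first_quartile = data.loc[data["quartile"] == "25", "S1"].sort_values()

# Number of values that fall in each quartile.
counts = data["quartile"].value_counts()
print(first_quartile)
print(counts)
```

Keeping the original integer index means each row stays uniquely addressable, which a repeated label index does not allow.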