Chapter 7 Problem Set 1

John

Practice with missing data

1. Create a DataFrame using

```python
import pandas as pd
import numpy as np
from numpy import nan as NA  # np.NAN was removed in NumPy 2.0
df = pd.DataFrame(np.random.randn(10, 10), columns=['a','b','c','d','e','f','g','h','i','j'])
df.iloc[:7, 1:3] = NA
df.iloc[4:8, 5:8] = NA
df.iloc[:, 9] = NA
df
```

2. Remove all rows where at least half the columns have missing data
3. Remove all columns where at least half the rows have missing data
4. Combine problems 2 and 3. Does the order matter?
5. Remove all rows which have NA in either column 'b' or 'f'
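
No answer is recorded for this set, so here is one possible sketch of the row and column filters above. It assumes "at least half" means a missing fraction of 0.5 or more, and it seeds the random numbers only so the run is reproducible; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the problem (seeded only for reproducibility)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 10)), columns=list("abcdefghij"))
df.iloc[:7, 1:3] = np.nan
df.iloc[4:8, 5:8] = np.nan
df.iloc[:, 9] = np.nan

# Keep rows where fewer than half of the columns are missing
rows_kept = df[df.isna().mean(axis=1) < 0.5]

# Keep columns where fewer than half of the rows are missing
cols_kept = df.loc[:, df.isna().mean(axis=0) < 0.5]

# Apply both filters, in each order, and compare the resulting shapes
rows_then_cols = rows_kept.loc[:, rows_kept.isna().mean(axis=0) < 0.5]
cols_then_rows = cols_kept[cols_kept.isna().mean(axis=1) < 0.5]

# Drop rows that have NA in column 'b' or 'f'
no_na_b_f = df.dropna(subset=["b", "f"])
```

With this NA pattern the order does matter: filtering rows first leaves a (7, 7) frame, while filtering columns first leaves (10, 7), because once columns 'b', 'c' and 'j' are gone no row is missing half of its remaining values.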

Min-Yao

Import my RNA-Seq CPM data from the 'Expression Browser_CPM_practice.xlsx' file. Please make the Itag number the row index. How many genes are in this data set?

Answer

```python
xlsx = pd.ExcelFile('HW7/Expression Browser_CPM_practice.xlsx')
RNASeq = pd.read_excel(xlsx, sheet_name='Expression Browser_CPM')
RNASeq
RNASeq = RNASeq.set_index('Name')
RNASeq
len(RNASeq)  # number of genes, assuming one row per gene
```

Please replace all 0 with NA.

Answer

```python
replaced_RNASeq = RNASeq.replace(0, NA)
replaced_RNASeq
```

We want to remove the genes that have no expression in any sample. How many genes are left after we remove them?

Answer

```python
reduced_RNASeq = replaced_RNASeq.dropna(how='all')
reduced_RNASeq
len(reduced_RNASeq)  # number of genes left
```
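
Because the Excel file is not included here, the same replace-then-drop pattern can be tried on a small stand-in table. This is only a sketch: the gene names, sample columns and values below are invented for illustration and are not taken from the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the RNA-Seq table; names and values are invented
toy = pd.DataFrame(
    {
        "Name": ["gene1", "gene2", "gene3"],
        "sampleA": [0.0, 1.5, 0.0],
        "sampleB": [0.0, 0.0, 2.0],
    }
).set_index("Name")

toy_na = toy.replace(0, np.nan)      # zeros become missing values
toy_kept = toy_na.dropna(how="all")  # drop genes that are missing in every sample

n_before = len(toy)      # genes before filtering
n_after = len(toy_kept)  # genes with expression in at least one sample
```

Only gene1 is zero in every sample, so it is the only row removed; genes with a zero in just some samples are kept, with NA in those cells.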

Julin

In Chapter 7.1 we learned how to replace a missing value in a Series with the mean value of that series. Now we work with data frames. Create a data frame like so:

```python
import pandas as pd
import numpy as np
from numpy import nan as NA

data = pd.DataFrame(np.random.randn(5,5))
data.iloc[3] = NA
data.iloc[2:, 3] = NA
data
```

For each column of the data frame that has missing data, replace the missing values with the mean of the data in that column.

Answer

```python
# it turns out the same method works, column-wise, on DataFrames
data.fillna(data.mean())
```

Do the same thing, but only if there are fewer than 3 missing values in the column; otherwise leave the NAs there.

Answer

```python
# first find out which columns have fewer than 3 NAs
datafill = len(data) - data.count() < 3  # the lhs gives the number of NAs per column

# next compute column means and replace with NA where appropriate
colmeans = data.mean()
colmeans[~datafill] = NA

# now use these as the replacement
data.fillna(colmeans)
```
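
As a cross-check, the same idea can be written with `isna().sum()` and `Series.where`, then verified by counting the NAs that remain. This is a sketch of an alternative, not the answer above; the seed and the names `n_missing`, `fill_values` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Same NA layout as the problem, seeded only for reproducibility
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((5, 5)))
data.iloc[3] = np.nan
data.iloc[2:, 3] = np.nan

n_missing = data.isna().sum()  # number of NAs per column

# Column mean where fewer than 3 values are missing, NaN otherwise
fill_values = data.mean().where(n_missing < 3)

# Filling with NaN is a no-op, so columns with 3 or more NAs stay as they are
filled = data.fillna(fill_values)
```

Here column 3 has three missing values and is left alone, while the single NA in each of the other columns is replaced by that column's mean.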

Rie

Read the CSV file named "soybean_miR.phythonClub.012519.csv" in my directory (https://github.com/UCD-pbio-rclub/python-data-analysis_RieU/blob/master/soybean_miR.phythonClub.012519.csv). "num" is the number of miRNAs in the category that I am interested in. I want to see the ratio of each miRNA relative to the total.

Answer

```python
mir = pd.read_csv('soybean_miR.phythonClub.012519.csv')
```

Eliminate the rows containing 0, because I am not interested in those miRNAs!

Answer

```python
mir2 = mir.replace(0, NA)
dropped = mir2.dropna()
```

Compute miRNA ratio relative to the total. Append the ratio information in a new column.

Answer

```python
# the columns are already Series, so no pd.Series() wrapper is needed
num = dropped['num']
total = dropped['total']
ratio = num / total
dropped['ratio'] = ratio
```
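
Since the CSV is not reproduced here, the drop-and-ratio steps can be sketched on a small invented table. Only the column names `num` and `total` come from the answer above; the `miRNA` column, the values and the names `mir_toy` and `kept` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; values are invented
mir_toy = pd.DataFrame(
    {
        "miRNA": ["miR-a", "miR-b", "miR-c"],
        "num": [2, 0, 6],
        "total": [8, 8, 8],
    }
)

# Keep rows with a nonzero count; .copy() gives an independent frame to modify
kept = mir_toy[mir_toy["num"] != 0].copy()

# Element-wise division aligns on the index, so each row gets its own ratio
kept["ratio"] = kept["num"] / kept["total"]
```

Note that this variant filters on `num` only, whereas `replace(0, NA)` followed by `dropna()` removes a row if any of its columns contains 0.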

Joel

Create a DataFrame with random numbers

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(20), columns=["S1"])
```

Bin data into quartiles (for bonus, label them according to their bin)

Answer

```python
# qcut cuts the data into quantiles:
# 4 bins the data into 4 equal-sized parts
# labels assigns a label to each bin
quarts = pd.qcut(data["S1"], 4, labels=["25", "50", "75", "100"])
```

Assign the corresponding bin label to the index of the data

Answer

```python
data.index = quarts  # assigns the bin labels as the index
data.sort_values(by="S1", ascending=True)  # displays the rows in order; data itself is not modified
```

Get the values for the first quartile in ascending order

Answer

```python
data.loc["25"].sort_values(by="S1", ascending=True)
```
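
The quartile steps can also be checked end to end without replacing the index, by selecting with a boolean mask on the bin labels. This is a sketch of an alternative under a fixed seed; the names `counts` and `first_q` are illustrative.

```python
import numpy as np
import pandas as pd

# Seeded only so the run is reproducible
rng = np.random.default_rng(2)
data = pd.DataFrame(rng.standard_normal(20), columns=["S1"])

# Bin into quartiles and label each bin by its upper percentile
quarts = pd.qcut(data["S1"], 4, labels=["25", "50", "75", "100"])

# Each quartile of 20 distinct values should hold 5 of them
counts = quarts.value_counts()

# Select the first quartile with a boolean mask, then sort ascending
first_q = data[quarts == "25"].sort_values(by="S1")
```

Keeping the original integer index makes it easy to trace each value back to its row, at the cost of carrying `quarts` around as a separate Series.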

Chapter 7 problem set 1 - UCD-pbio-rclub/python_problems GitHub Wiki
