Chapter 7 problem set 1 - UCD-pbio-rclub/python_problems GitHub Wiki
Practice with missing data
- Create a DataFrame using
import pandas as pd
import numpy as np
from numpy import nan as NA  # the NAN alias was removed in NumPy 2.0
df = pd.DataFrame(np.random.randn(10, 10), columns= ['a','b','c','d','e','f','g','h','i','j'])
df.iloc[:7, 1:3] = NA
df.iloc[4:8, 5:8] = NA
df.iloc[:,9] = NA
df
- Remove all rows where at least half the columns have missing data
- Remove all columns where at least half the rows have missing data
- Combine problems 2 and 3. Does the order matter?
- Remove all rows which have NA in either column 'b' or 'f'
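One possible approach to the four removal problems above, sketched on a seeded copy of the same DataFrame. The seed and the helper names `drop_sparse_rows` and `drop_sparse_cols` are assumptions made for illustration; they are not part of the original problem set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily (assumption)
df = pd.DataFrame(rng.standard_normal((10, 10)), columns=list("abcdefghij"))
df.iloc[:7, 1:3] = np.nan
df.iloc[4:8, 5:8] = np.nan
df.iloc[:, 9] = np.nan


def drop_sparse_rows(frame):
    """Drop rows in which at least half of the columns are missing."""
    n_missing = frame.isna().sum(axis=1)
    return frame[n_missing < frame.shape[1] / 2]


def drop_sparse_cols(frame):
    """Drop columns in which at least half of the rows are missing."""
    n_missing = frame.isna().sum(axis=0)
    return frame.loc[:, n_missing < frame.shape[0] / 2]


rows_kept = drop_sparse_rows(df)
cols_kept = drop_sparse_cols(df)

# Apply both filters in each order to see whether the order matters
rows_then_cols = drop_sparse_cols(drop_sparse_rows(df))
cols_then_rows = drop_sparse_rows(drop_sparse_cols(df))

# Remove rows with NA in either column 'b' or column 'f'
no_na_in_b_or_f = df.dropna(subset=["b", "f"])

print(rows_then_cols.shape, cols_then_rows.shape)
```

On this data the order does matter: filtering rows first leaves a 7 x 7 frame, while filtering columns first leaves all 10 rows, because once the sparse columns are gone no row is missing half of what remains.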
- Import my RNA-Seq CPM data from the 'Expression Browser_CPM_practice.xlsx' file. Make the Itag number the row index. How many genes are in this data set?
Answer
```python
xlsx = pd.ExcelFile('HW7/Expression Browser_CPM_practice.xlsx')
RNASeq = pd.read_excel(xlsx, 'Expression Browser_CPM')
RNASeq
RNASeq = RNASeq.set_index('Name')
RNASeq
```
- Please replace all 0 with NA.
Answer
```python
replaced_RNASeq = RNASeq.replace(0, NA)
replaced_RNASeq
```
- We want to remove the genes that have no expression in any sample. How many genes are left after we remove them?
Answer
```python
reduced_RNASeq = replaced_RNASeq.dropna(how='all')
reduced_RNASeq
```
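Since the Excel file is not bundled with this page, the same three steps can be tried on a small stand-in table. This is only a sketch: the gene IDs, sample names, and values below are made up, and only the `Name` column follows the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CPM table; every value here is invented
toy = pd.DataFrame(
    {
        "Name": ["gene1", "gene2", "gene3", "gene4"],
        "sampleA": [0.0, 1.5, 0.0, 3.2],
        "sampleB": [0.0, 0.0, 0.0, 4.1],
    }
).set_index("Name")

n_genes = len(toy)  # number of genes = number of rows

# Treat zero counts as missing, then drop genes missing in every sample
replaced = toy.replace(0, np.nan)
expressed = replaced.dropna(how="all")
n_expressed = len(expressed)

print(n_genes, n_expressed)
```

Using `len()` on the frame before and after `dropna(how="all")` answers both "how many genes" questions without reading the row count off the printed table.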
In Chapter 7.1 we learned how to replace a missing value in a Series with the mean value of that series. Now we work with data frames. Create a data frame like so:
import pandas as pd
import numpy as np
from numpy import nan as NA
data = pd.DataFrame(np.random.randn(5,5))
data.iloc[3] = NA
data.iloc[2:, 3] = NA
data
- For each column of the data frame that has missing data, replace the missing values with the mean of the data in that column.
Answer
```python
# it turns out the same method works, column-wise, on DataFrames
data.fillna(data.mean())
```
- Do the same thing, but only if there are fewer than 3 missing values in the column; otherwise leave the NAs there.
Answer
# first find out which columns have fewer than 3 NAs
n_missing = len(data) - data.count()  # number of NAs per column
datafill = n_missing < 3
# next compute the column means and set them to NA for columns that should stay unfilled
colmeans = data.mean()
colmeans[~datafill] = NA
# now use these as the replacement values
data.fillna(colmeans)
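An equivalent variant, sketched with `Series.where`, builds the fill values in one step instead of overwriting entries of the means. The seed and the names `n_missing`, `fill_values`, and `filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seed chosen arbitrarily (assumption)
data = pd.DataFrame(rng.standard_normal((5, 5)))
data.iloc[3] = np.nan
data.iloc[2:, 3] = np.nan

# Count the missing values in each column
n_missing = data.isna().sum()

# Keep a column's mean only where it has fewer than 3 NAs; elsewhere the
# fill value is NaN, which leaves that column's NAs untouched
fill_values = data.mean().where(n_missing < 3)
filled = data.fillna(fill_values)

print(filled)
```

Here column 3 has three missing values, so it is left as it is, while the single NA in each of the other columns is replaced by that column's mean.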
- Read the CSV file named "soybean_miR.phythonClub.012519.csv" from my directory (https://github.com/UCD-pbio-rclub/python-data-analysis_RieU/blob/master/soybean_miR.phythonClub.012519.csv). "num" is the number of miRNAs in the category that I am interested in. I want to see the ratio of each miRNA relative to the total.
Answer
mir = pd.read_csv('soybean_miR.phythonClub.012519.csv')
- Eliminate the rows containing 0, because I am not interested in those miRNAs!
Answer
mir2 = mir.replace(0, NA)
dropped = mir2.dropna()
- Compute miRNA ratio relative to the total. Append the ratio information in a new column.
Answer
# the selected columns are already Series, so no pd.Series() wrapping is needed
ratio = dropped['num'] / dropped['total']
dropped['ratio'] = ratio
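The filtering and ratio steps can also be checked on a small made-up table. In this sketch the column names `num` and `total` follow the answer above, while the `miRNA` column, its labels, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the soybean miRNA table; all values are invented
mir = pd.DataFrame(
    {
        "miRNA": ["miR-a", "miR-b", "miR-c"],
        "num": [5, 0, 15],
        "total": [20, 20, 20],
    }
)

# Keep only the rows that contain no 0 in any column
has_zero = (mir == 0).any(axis=1)
kept = mir[~has_zero].copy()  # .copy() so the new column is added to an independent frame

# Append the ratio of each miRNA relative to the total as a new column
kept["ratio"] = kept["num"] / kept["total"]

print(kept)
```

Filtering with a boolean mask avoids the round trip through NA, which also keeps the integer columns from being converted to floats.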
Create a DataFrame with random numbers
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(20),columns=["S1"])
- Bin data into quartiles (for bonus, label them according to their bin)
Answer
# qcut cuts the data into quantiles;
# 4 bins the data into 4 equal-sized parts;
# labels assigns a label to each bin
quarts = pd.qcut(data["S1"], 4, labels=["25", "50", "75", "100"])
- Assign the corresponding bin label to the index of the data
Answer
data.index = quarts  # assigns the bin labels as the index
data.sort_values(by="S1", ascending=True)  # displays a sorted copy; data itself is not modified
- Get the values for the first quartile in ascending order
Answer
data.loc["25"].sort_values(by="S1",ascending=True)
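As a cross-check, the first quartile can also be selected with a boolean mask on the bin labels, which leaves the original index in place. This is a sketch with an arbitrary seed; the names `counts` and `first_quartile` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # seed chosen arbitrarily (assumption)
data = pd.DataFrame(rng.standard_normal(20), columns=["S1"])

# Bin into quartiles with the same labels as in the answer above
quarts = pd.qcut(data["S1"], 4, labels=["25", "50", "75", "100"])

# With 20 distinct values, each quartile should hold 5 of them
counts = quarts.value_counts()

# Select the first quartile by label and sort it in ascending order
first_quartile = data[quarts == "25"].sort_values(by="S1")

print(counts)
print(first_quartile)
```

Comparing `quarts` against a label gives the same rows as `data.loc["25"]` after the index assignment, but the original row numbers remain visible.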