2. Data input and output in Python - Lin-Brain-Lab/machine_learning GitHub Wiki
Procedures to read fMRI data in Python
Example data are at:
-
Perspective-taking fMRI: /space_lin2/fhlin/perspective
-
Human connectome project: /space_lin1/hcp
0. Python environment
The path of our current available Anaconda folder is:
/space_lin2/kaihua/anaconda3
Activating our python environment is simple.
First, "source" the relevant profile, after which you can check if the "conda" command is available by outputting its version, as shown below:
source /space_lin2/kaihua/.bashrc
conda --version
Later, choose which virtual environment you want to utilise. Currently, there are mainly two environments: (base) and (py36), and you can activate one of them by entering "conda activate" command, as shown below:
conda activate base
conda activate py36
Different environments would contain different packages of different versions, which will meet the requirements of different tasks. Check the details of these packages by entering:
conda list
1. Read in-house fMRI data
Here are examples to read perspective-taking fMRI data in Python.
1.1 NII format
Take the NII file '/space_lin2/fhlin/perspective/subj_02/epi_data/unpack/bold/007/sfmcstc.nii'
for example.
import nibabel as nb
nii_img = nb.load("/space_lin2/fhlin/perspective/subj_02/epi_data/unpack/bold/007/sfmcstc.nii")
nii_data = nii_img.get_fdata() # shape: (72, 72, 42, 101)
affine = nii_img.affine # affine matrix, shape: (4, 4)
hdr = nii_img.header # header
1.2 STC format
import mne
stc = mne.read_source_estimate("/space_lin2/fhlin/perspective/subj_02/epi_data/unpack/f1/subj_02_2_fsaverage_sfmcstc-lh.stc") # <class 'mne.source_estimate.SourceEstimate'>
stc_data = stc.data # shape: (20484, 100), <class 'numpy.ndarray'>
stc_vertices = stc.vertices # shape: (2, 10242)
stc_tmin = stc.tmin # i.e., epoch_begin_latency=0.0
stc_tstep = stc.tstep # i.e., sample_period=2.0
stc_times = stc.times # shape: (87,)
2. Read Human Connectome Project data
The resulting STC files can be read by this python script file(not fully completed)
3. Basic data processing in Python
3.1 After reading STC files
...
Assume we have got the fMRI data and its target in the form of numpy array, of which the shapes are:
print(fmri_data.shape) # shape: (36, 17, 150); (samples, scenes, rois)
print(behavior_targets.shape) # shape: (36,)
3.2 Standardization
Before applying standardization, we should first convert the original fMRI data from 3 dimensions to 2 dimensions:
fmri_data_shape1 = [fmri_data.shape[0], fmri_data.shape[1], fmri_data.shape[2]] # [36, 17, 150]
fmri_data = fmri_data.reshape(fmri_data_shape1[0]*fmri_data_shape1[1], fmri_data_shape1[2]) # shape: (36, 17, 150) -> (612, 150)
Then we can apply Standardization by utilizing the function "StandardScaler" in scikit-learn:
standardization = True
if standardization:
ss = StandardScaler()
fmri_data = ss.fit_transform(fmri_data) # fit and transform
3.3 Principle component analysis (PCA)
Similarly, we can apply PCA by utilizing the function "PCA" in scikit-learn:
use_pca = True
if use_pca:
pca = PCA(n_components=0.85) # remain 85% variance explanation
fmri_data = pca.fit_transform(fmri_data) # shape: (612, 150) -> (612, 8)
3.4 Restore data shape
After applying standardization or PCA, we need to restore the data shape from 2 dimensions to 3 dimensions:
fmri_data = fmri_data.reshape(fmri_data_shape1[0], fmri_data_shape1[1], -1) # shape: (612, 8) -> (36, 17, 8)
For temporal models like LSTM, shape (36, 17, 8) can directly serve as the input. But for other conventional machine learning regression models, we still need to further flatten the temporal (i.e., scenes) and spatial (i.e., rois after PCA) dimension:
fmri_data_shape2 = [fmri_data.shape[0], fmri_data.shape[1], fmri_data.shape[2]] # [36, 17, 8]
fmri_data = fmri_data.reshape(fmri_data_shape2[0], fmri_data_shape2[1]*fmri_data_shape2[2]) # shape: (36, 17, 8) -> (36, 136)
3.5 Split dataset into training and testing sets
Splitting dataset can be achieved by applying the function "train_test_split" in scikit-learn:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(fmri_data, behavior_targets, test_size=0.2, random_state=42) # selecting 20% samples as the testing set
4. Example 1: conventional machine learning regression algorithms in Python
4.1 Define the conventional machine learning models:
For example, we can choose SVR, RIDGE, LASSO, ELASTC, Random Forest as our models, which can be easily defined by scikit-learn:
from sklearn.svm import SVR
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
svr_reg = SVR(kernel='linear')
ridge_reg = Ridge(alpha=1000, solver="cholesky")
lasso_reg = Lasso(alpha=1000)
elastic_reg = ElasticNet(alpha=0.1,l1_ratio=0.5)
rf_reg = RandomForestRegressor(max_depth=3, n_estimators=10000)
4.2 Train the models
In "3. processing data in Python", we have got the X_train, X_test, y_train, y_test. Therefore, we can use "fit" our models by using these data:
svr = svr_reg.fit(X_train, y_train)
ridge = ridge_reg.fit(X_train, y_train)
lasso = lasso_reg.fit(X_train, y_train)
elastic = elastic_reg.fit(X_train, y_train)
rf = rf_reg.fit(X_train, y_train)
4.3 Showing the results
Taking ELASTIC for example, we express our results mainly through RMSE and R2 score:
from sklearn.metrics import mean_squared_error
print("(train) elastic score: ", elastic.score(X_train, y_train))
print("(train) elastic RMSE: ", np.sqrt(mean_squared_error(elastic.predict(X_train),y_train)))
print("(test) elastic score: ", elastic.score(X_test, y_test))
print("(test) elastic RMSE: ", np.sqrt(mean_squared_error(elastic.predict(X_test),y_test)))
5. Example 2: conventional machine learning regression algorithms in Python
5.1 Define the basic LSTM model
We use Pytorch to define LSTM:
import torch
import torch.nn as nn
class LSTMNet(nn.Module):
def __init__(self, in_size, out_size):
super(LSTMNet, self).__init__()
self.lstm = nn.LSTM(in_size, 64, 1, batch_first=True)
self.drop = nn.Dropout(0.6)
self.fc1 = nn.Linear(64, 32)
self.fc2 = nn.Linear(32, out_size)
self.sigmoid = nn.Sigmoid()
self.relu = nn.ReLU()
self.tanh = nn.Tanh()
def forward(self, x):
batch_size = x.size(0)
h_0 = torch.zeros(1, batch_size, 64)
c_0 = torch.zeros(1, batch_size, 64)
out, (final_hidden_state, final_cell_state) = self.lstm(x, (h_0, c_0))
out = self.drop(out)
out = self.tanh(out)
out = self.fc1(out)
out = self.drop(out)
out = self.tanh(out)
out = self.fc2(out)
out = out[:, -1, :]
out = out.squeeze(-1)
return out
5.2 Train LSTM
from sklearn.metrics import mean_squared_error, r2_score
model = LSTMNet(fmri_data.shape[2], 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=5e-2)
criterion = nn.MSELoss()
epochs = 200
train_loss = []
test_loss = []
train_r2_score = []
test_r2_score = []
for epoch in range(epochs):
print("--------------------------------")
print("epoch: ", epoch)
inputs, labels = torch.tensor(X_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32)
outputs = model(inputs)
optimizer.zero_grad()
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
train_loss.append(np.sqrt(loss.data))
print("train_loss: ", train_loss[-1].detach().numpy())
train_r2_score.append(r2_score(labels.detach().numpy(), outputs.detach().numpy()))
model.eval()
with torch.no_grad():
inputs, labels = torch.tensor(X_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32)
outputs = model(inputs)
loss = criterion(outputs, labels)
test_loss.append(np.sqrt(loss.data))
print("test_loss: ", test_loss[-1].detach().numpy())
test_r2_score.append(r2_score(labels.detach().numpy(), outputs.detach().numpy()))
if test_loss[-1] <= min(test_loss):
torch.save(model.state_dict(), 'model/best_model.pth') # save the model's parameters in the epoch that has the min RMSE on testing set
torch.save(model.state_dict(), 'model/last_model.pth') # save the model's parameters in the last epoch
5.3 Load pre-trained LSTM to predict values
Take the 'best_model.pth' for example:
model = LSTMNet(fmri_data.shape[2], 1).eval()
model.load_state_dict(torch.load("model/best_model.pth"))
train_predict = model(torch.tensor(X_train, dtype=torch.float32)).detach().numpy()
test_predict = model(torch.tensor(X_test, dtype=torch.float32)).detach().numpy()