2. Data input and output in Python - Lin-Brain-Lab/machine_learning GitHub Wiki

Procedures to read fMRI data in Python

Example data are at:

  1. Perspective-taking fMRI: /space_lin2/fhlin/perspective

  2. Human connectome project: /space_lin1/hcp

0. Python environment

The path of our current available Anaconda folder is: /space_lin2/kaihua/anaconda3

Activating our python environment is simple.

First, "source" the relevant profile, after which you can check if the "conda" command is available by outputting its version, as shown below:

source /space_lin2/kaihua/.bashrc
conda --version

Later, choose which virtual environment you want to utilise. Currently, there are mainly two environments: (base) and (py36), and you can activate one of them by entering "conda activate" command, as shown below:

conda activate base

conda activate py36

Different environments would contain different packages of different versions, which will meet the requirements of different tasks. Check the details of these packages by entering:

conda list

1. Read in-house fMRI data

Here are examples to read perspective-taking fMRI data in Python.

1.1 NII format

Take the NII file '/space_lin2/fhlin/perspective/subj_02/epi_data/unpack/bold/007/sfmcstc.nii' for example.

import nibabel as nb
nii_img = nb.load("/space_lin2/fhlin/perspective/subj_02/epi_data/unpack/bold/007/sfmcstc.nii")

nii_data = nii_img.get_fdata()  # shape: (72, 72, 42, 101)
affine = nii_img.affine  # affine matrix, shape: (4, 4)
hdr = nii_img.header  # header

1.2 STC format

import mne
stc = mne.read_source_estimate("/space_lin2/fhlin/perspective/subj_02/epi_data/unpack/f1/subj_02_2_fsaverage_sfmcstc-lh.stc") # <class 'mne.source_estimate.SourceEstimate'>
stc_data = stc.data # shape: (20484, 100), <class 'numpy.ndarray'>
stc_vertices = stc.vertices # shape: (2, 10242)
stc_tmin = stc.tmin # i.e., epoch_begin_latency=0.0
stc_tstep = stc.tstep # i.e., sample_period=2.0
stc_times = stc.times # shape: (87,)

2. Read Human Connectome Project data

The resulting STC files can be read by this python script file(not fully completed)

3. Basic data processing in Python

3.1 After reading STC files

...

Assume we have got the fMRI data and its target in the form of numpy array, of which the shapes are:

print(fmri_data.shape) # shape: (36, 17, 150); (samples, scenes, rois)
print(behavior_targets.shape) # shape: (36,)

3.2 Standardization

Before applying standardization, we should first convert the original fMRI data from 3 dimensions to 2 dimensions:

fmri_data_shape1 = [fmri_data.shape[0], fmri_data.shape[1], fmri_data.shape[2]] # [36, 17, 150]
fmri_data = fmri_data.reshape(fmri_data_shape1[0]*fmri_data_shape1[1], fmri_data_shape1[2]) # shape: (36, 17, 150) -> (612, 150)

Then we can apply Standardization by utilizing the function "StandardScaler" in scikit-learn:

standardization = True
if standardization:
    ss = StandardScaler()
    fmri_data = ss.fit_transform(fmri_data) # fit and transform

3.3 Principle component analysis (PCA)

Similarly, we can apply PCA by utilizing the function "PCA" in scikit-learn:

use_pca = True
if use_pca:
    pca = PCA(n_components=0.85) # remain 85% variance explanation
    fmri_data = pca.fit_transform(fmri_data) # shape: (612, 150) -> (612, 8)

3.4 Restore data shape

After applying standardization or PCA, we need to restore the data shape from 2 dimensions to 3 dimensions:

fmri_data = fmri_data.reshape(fmri_data_shape1[0], fmri_data_shape1[1], -1) # shape: (612, 8) -> (36, 17, 8)

For temporal models like LSTM, shape (36, 17, 8) can directly serve as the input. But for other conventional machine learning regression models, we still need to further flatten the temporal (i.e., scenes) and spatial (i.e., rois after PCA) dimension:

fmri_data_shape2 = [fmri_data.shape[0], fmri_data.shape[1], fmri_data.shape[2]] # [36, 17, 8]
fmri_data = fmri_data.reshape(fmri_data_shape2[0], fmri_data_shape2[1]*fmri_data_shape2[2]) # shape: (36, 17, 8) -> (36, 136)

3.5 Split dataset into training and testing sets

Splitting dataset can be achieved by applying the function "train_test_split" in scikit-learn:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(fmri_data, behavior_targets, test_size=0.2, random_state=42) # selecting 20% samples as the testing set

4. Example 1: conventional machine learning regression algorithms in Python

4.1 Define the conventional machine learning models:

For example, we can choose SVR, RIDGE, LASSO, ELASTC, Random Forest as our models, which can be easily defined by scikit-learn:

from sklearn.svm import SVR
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor

svr_reg = SVR(kernel='linear')
ridge_reg = Ridge(alpha=1000, solver="cholesky")
lasso_reg = Lasso(alpha=1000)
elastic_reg = ElasticNet(alpha=0.1,l1_ratio=0.5)
rf_reg = RandomForestRegressor(max_depth=3, n_estimators=10000)

4.2 Train the models

In "3. processing data in Python", we have got the X_train, X_test, y_train, y_test. Therefore, we can use "fit" our models by using these data:

svr = svr_reg.fit(X_train, y_train)
ridge = ridge_reg.fit(X_train, y_train)
lasso = lasso_reg.fit(X_train, y_train)
elastic = elastic_reg.fit(X_train, y_train)
rf = rf_reg.fit(X_train, y_train)

4.3 Showing the results

Taking ELASTIC for example, we express our results mainly through RMSE and R2 score:

from sklearn.metrics import mean_squared_error
print("(train) elastic score: ", elastic.score(X_train, y_train))
print("(train) elastic RMSE: ", np.sqrt(mean_squared_error(elastic.predict(X_train),y_train)))
print("(test) elastic score: ", elastic.score(X_test, y_test))
print("(test) elastic RMSE: ", np.sqrt(mean_squared_error(elastic.predict(X_test),y_test)))

5. Example 2: conventional machine learning regression algorithms in Python

5.1 Define the basic LSTM model

We use Pytorch to define LSTM:

import torch
import torch.nn as nn
class LSTMNet(nn.Module):
    def __init__(self, in_size, out_size):
        super(LSTMNet, self).__init__()

        self.lstm = nn.LSTM(in_size, 64, 1, batch_first=True)
        self.drop = nn.Dropout(0.6)
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, out_size)
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()

    def forward(self, x):
        batch_size = x.size(0)
        h_0 = torch.zeros(1, batch_size, 64)
        c_0 = torch.zeros(1, batch_size, 64)
        out, (final_hidden_state, final_cell_state) = self.lstm(x, (h_0, c_0))
        out = self.drop(out)
        out = self.tanh(out)
        out = self.fc1(out)
        out = self.drop(out)
        out = self.tanh(out)
        out = self.fc2(out)
        out = out[:, -1, :]
        out = out.squeeze(-1)
        return out

5.2 Train LSTM

from sklearn.metrics import mean_squared_error, r2_score

model = LSTMNet(fmri_data.shape[2], 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=5e-2)
criterion = nn.MSELoss()
epochs = 200

train_loss = []
test_loss = []
train_r2_score = []
test_r2_score = []

for epoch in range(epochs):
    print("--------------------------------")
    print("epoch: ", epoch)

    inputs, labels = torch.tensor(X_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32)
    outputs = model(inputs)
    optimizer.zero_grad()
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

    train_loss.append(np.sqrt(loss.data))
    print("train_loss: ", train_loss[-1].detach().numpy())
    train_r2_score.append(r2_score(labels.detach().numpy(), outputs.detach().numpy()))

    model.eval()
    with torch.no_grad():
        inputs, labels = torch.tensor(X_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        test_loss.append(np.sqrt(loss.data))
        print("test_loss: ", test_loss[-1].detach().numpy())
        test_r2_score.append(r2_score(labels.detach().numpy(), outputs.detach().numpy()))

        if test_loss[-1] <= min(test_loss):
            torch.save(model.state_dict(), 'model/best_model.pth') # save the model's parameters in the epoch that has the min RMSE on testing set

torch.save(model.state_dict(), 'model/last_model.pth') # save the model's parameters in the last epoch

5.3 Load pre-trained LSTM to predict values

Take the 'best_model.pth' for example:

model = LSTMNet(fmri_data.shape[2], 1).eval()
model.load_state_dict(torch.load("model/best_model.pth"))

train_predict = model(torch.tensor(X_train, dtype=torch.float32)).detach().numpy()
test_predict = model(torch.tensor(X_test, dtype=torch.float32)).detach().numpy()