Analysis of scRNA seq and scMethylation data from scNMT seq technology by DCCA model - cmzuo11/DCCA GitHub Wiki

Analysis pipeline for the mouse_gastrulation of scNMT-seq technology

After the successful installation of the DCCA model on your server, you can use the following steps to analyze the mouse_gastrulation data.

Preprocessing steps (R script)

1. Before running the DCCA model, you can download the raw count files for two-omics data from the GEO database under accession number: GSE109262

2. Feature selection for scRNA-seq (R script)

library('Seurat')

source(./DCCA/Processing_data.R) # load our defined functions into R environment

# Select the highly variable genes (HVGs) by 'vst' based on scRNA-seq data 
Seurat_obj = Create_Seurat_from_scRNA(scRNA_data, nDim = 10) 
HVGs       = Select_HVGs_from_scRNA(Seurat_obj, selection.method = 'vst', nfeatures = 1000) # select 1000 HVGs by 'vst'

3. Feature selection for scMethylation in promoter region

library(reticulate)
source_python('./DCCA/utilities.py')
scMethy_score = utilies$Feature_selection_by_Laplace( scMethy_file, scMethy_order )
scMethy_hv    = row.names(scMethy_data)[order(scMethy_score, decreasing = T)]
scMethy_scale = Scale_data(scMethy_data[match(scMethy_hv, row.names(scMethy_data)),], na_rm = T)

# Save scRNA-seq (HVGs * cells) and scATAC-seq (nearby_loci * cells) data 
write.table(scRNA_data[match(HVGs, row.names(scRNA_data)),], file = './Example_test/scRNA_seq_SNARE.tsv', sep="\t", quote=F)
write.table(scMethy_scale, file = './Example_test/scMethylation_SNARE.txt', sep="\t", quote=F)

Run DCCA model (Python script):

python Main_scNMT_seq.py

1. create a neural network structure based on input two-omics data:

model =  DCCA_missing( layer_e_1 = [Nfeature1, 128], hidden1_1 = 128, Zdim_1 = 10, layer_d_1 = [10, 128],
                       hidden2_1 = 128, layer_e_2 = [Nfeature2, 128], hidden1_2 = 128, Zdim_2 = 10,
                       layer_d_2 = [10, 128], hidden2_2 = 128, args = args, ground_truth = label_ground_truth,
                       ground_truth1 = label_ground_truth, Type_1 = "NB", Type_2 = "Gaussian", cycle = 1,
                       attention_loss = "Eucli" )

Note:

(1). the encoder for the VAE of scRNA-seq data is [Nfeature1, 128, 10], and the decoder is [10, 128, Nfeature1];

(2). the encoder for the VAE of scATAC-seq data is [Nfeature2, 128, 10], and the decoder is [10, 128， Nfeature2];

(3). 'attention_loss' indicates that Euclidean distance was used to transfer attention between two-omics data. You can replace "Eucli" with 'L1' when using L1-norm as a distance to transfer information;

(4). 'Type_1' and 'Type_2' indicate the likelihood function of scRNA-seq and scMethylation data.

2. Model fitting

NMI_score1, ARI_score1, NMI_score2, ARI_score2  =  model.fit_model(train_loader, test_loader, total_loader, "RNA" )

3. Save model

save_checkpoint(model, model_file )

4. Load model to reproduce the result

model_new = load_checkpoint( model_file , model, args.use_cuda )

5. Predict cell clusters based on latent features by K-means

cluster_rna, cluster_epi = model.predict_cluster_by_kmeans(total_loader)

6. Infer scMethylation from scRNA-seq data

recon_ys  = model.inference_other_from_rna( total_loader_r )

Results (R script)

Visualization

Generating UMAP visualization based on latent features of two-omics data:

Plot_umap_embeddings('./Example_test/scRNA-latent.csv', './Example_test/scATAC-latent.csv', 
                     './Example_test/cell_metadata.txt','./Example_test/Latent_umap.pdf' )

The 'Latent_umap.pdf' included two pictures of two latent features as follows: