Analysis of the cellMix dataset from SNARE-seq technology by the DCCA model - cmzuo11/DCCA GitHub Wiki
Analysis pipeline for the cell line mixture dataset of SNARE-seq
After the successful installation of the DCCA model on your server, you can use the following steps to analyze the cellMix data.
Preprocessing steps (R script)
1. Download the raw count files for the two-omics data from the GEO database under accession number GSE126074.
2. Feature selection:
Load library and function into R environment
library('Seurat')
source('./DCCA/Processing_data.R') # load our defined functions into R environment
Select the highly variable genes (HVGs) by 'vst' based on scRNA-seq data
Seurat_obj = Create_Seurat_from_scRNA(scRNA_data, nDim = 4, remove_HK = T)
HVGs = Select_HVGs_from_scRNA(Seurat_obj, selection.method = 'vst', nfeatures = 500) # select 500 HVGs by 'vst'
Select all peaks located within 100 kb upstream of, or within the gene body of, each HVG (human)
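# R cannot unpack two return values in a single assignment; this assumes the function returns a list ordered as (nearby_loci, df_genes_loci)
loci_result = Select_Loci_by_vargenes(HVGs, scATAC_data, width = 100000, species = "human")
nearby_loci = loci_result[[1]]
df_genes_loci = loci_result[[2]]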
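The selection rule described above (peaks within 100 kb upstream of, or inside, the gene body of each HVG) can be illustrated with a minimal Python sketch. This is not the code of `Select_Loci_by_vargenes`. The `Gene` record, the `parse_peak` and `select_nearby_peaks` helpers, the peak-name format `chr:start-end`, the strand-aware definition of "upstream", and the example coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are made up for the example."""
    name: str
    chrom: str
    start: int   # smaller genomic coordinate of the gene body
    end: int     # larger genomic coordinate of the gene body
    strand: str  # "+" or "-"


def parse_peak(peak):
    """Parse a peak name of the assumed form 'chr1:1000-1500'."""
    chrom, span = peak.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def select_nearby_peaks(genes, peaks, width=100_000):
    """Keep peaks overlapping the gene body or the `width` bp upstream of it."""
    selected = []
    for peak in peaks:
        chrom, p_start, p_end = parse_peak(peak)
        for gene in genes:
            if gene.chrom != chrom:
                continue
            # Upstream lies before `start` on the + strand and after `end` on the - strand.
            if gene.strand == "+":
                w_start, w_end = gene.start - width, gene.end
            else:
                w_start, w_end = gene.start, gene.end + width
            # Two closed intervals overlap when each starts before the other ends.
            if p_start <= w_end and p_end >= w_start:
                selected.append(peak)
                break  # one matching gene is enough to keep the peak
    return selected


genes = [Gene("GENE_A", "chr1", 200_000, 210_000, "+")]
peaks = [
    "chr1:150000-150500",  # 50 kb upstream: kept
    "chr1:50000-50500",    # 150 kb upstream: dropped
    "chr1:205000-205500",  # inside the gene body: kept
    "chr1:250000-250500",  # downstream: dropped
    "chr2:150000-150500",  # other chromosome: dropped
]
nearby = select_nearby_peaks(genes, peaks)
print(nearby)
```

The nested loop is quadratic in the number of genes and peaks, which is acceptable for a few hundred HVGs. A sorted-interval or interval-tree lookup would be the usual choice for larger inputs.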
Save scRNA-seq (HVGs * cells) and scATAC-seq (nearby_loci * cells) data
write.table(scRNA_data[match(HVGs, row.names(scRNA_data)),], file = './Example_test/scRNA_seq_SNARE.tsv', sep="\t", quote=F)
scATAC_used = scATAC_data[match(nearby_loci, row.names(scATAC_data)),]
scATAC_used[which(scATAC_used>0)] = 1
write.table(scATAC_used, file = './Example_test/scATAC_seq_SNARE.txt', sep="\t", quote=F)
Run DCCA model (Python script):
python Main_SNARE_seq.py
1. Create the neural network structure from the two-omics input data:
model = DCCA( layer_e_1 = [Nfeature1, 128], hidden1_1 = 128, Zdim_1 = 4, layer_d_1 = [4, 128],
hidden2_1 = 128, layer_e_2 = [Nfeature2, 1500, 128], hidden1_2 = 128, Zdim_2 = 4,
layer_d_2 = [4], hidden2_2 = 4, args = args, ground_truth = label_ground_truth,
ground_truth1 = label_ground_truth, Type_1 = "NB", Type_2 = "Bernoulli", cycle = 1,
attention_loss = "Eucli" )
Note:
(1). the encoder for the VAE of scRNA-seq data is [Nfeature1, 128, 4], and the decoder is [4, 128, Nfeature1];
(2). the encoder for the VAE of scATAC-seq data is [Nfeature2, 1500, 128, 4], and the decoder is [4, Nfeature2];
(3). 'attention_loss' = "Eucli" means that Euclidean distance is used to transfer attention between the two omics. Replace "Eucli" with "L1" to use the L1-norm as the distance instead;
(4). 'Type_1' and 'Type_2' set the likelihood functions for the scRNA-seq and scATAC-seq data, respectively: here "NB" (negative binomial) for scRNA-seq and "Bernoulli" for the binarized scATAC-seq data.
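To make notes (1) and (2) concrete, the following minimal sketch derives the full encoder and decoder layer sizes from the constructor arguments and lists the weight-matrix shapes they imply. `full_layer_sizes` and `weight_shapes` are hypothetical helpers written for this page, not part of the DCCA code. `n_feature1 = 500` follows the 500 HVGs selected above, while `n_feature2 = 3000` is a made-up placeholder for the number of selected peaks.

```python
def full_layer_sizes(layer_e, z_dim, layer_d, n_features):
    """Append the latent size to the encoder list and the output size to the decoder list."""
    encoder = list(layer_e) + [z_dim]
    decoder = list(layer_d) + [n_features]
    return encoder, decoder


def weight_shapes(sizes):
    """Return (in, out) shapes of the fully connected layers linking consecutive sizes."""
    return list(zip(sizes[:-1], sizes[1:]))


n_feature1 = 500    # number of HVGs selected in the preprocessing step
n_feature2 = 3000   # placeholder: the real value is the number of selected peaks

enc_rna, dec_rna = full_layer_sizes([n_feature1, 128], 4, [4, 128], n_feature1)
enc_atac, dec_atac = full_layer_sizes([n_feature2, 1500, 128], 4, [4], n_feature2)

print("scRNA-seq encoder:", enc_rna, "decoder:", dec_rna)
print("scATAC-seq encoder:", enc_atac, "decoder:", dec_atac)
print("scATAC-seq encoder weights:", weight_shapes(enc_atac))
```

The asymmetric scATAC-seq branch (a three-layer encoder but a single-layer decoder) is visible directly in the lists, which is why `layer_d_2` contains only the latent size.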
2. Model fitting
# 90% of the dataset is used for training and 10% for testing
NMI_score1, ARI_score1, NMI_score2, ARI_score2 = model.fit_model(train_loader, test_loader, total_loader, "RNA" )
# NMI_score1 and ARI_score1 are the clustering scores for scRNA-seq; NMI_score2 and ARI_score2 are the clustering scores for scATAC-seq. All four metrics compare the cell clusters predicted by K-means on the latent features with the true cell types.
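For readers unfamiliar with the two scores, here is a minimal pure-Python sketch of the adjusted Rand index (ARI) and normalized mutual information (NMI). It illustrates the standard definitions and is not the implementation used inside `fit_model`. The NMI normalization by the arithmetic mean of the two entropies is an assumption, and the toy labels only reuse the four cell line names of this dataset.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from the contingency table of two labelings."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def normalized_mutual_info(true_labels, pred_labels):
    """NMI normalized by the arithmetic mean of the two entropies (assumed convention)."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))
    a = Counter(true_labels)
    b = Counter(pred_labels)
    mi = sum(c / n * log(c * n / (a[t] * b[p])) for (t, p), c in pairs.items())
    h_true = -sum(c / n * log(c / n) for c in a.values())
    h_pred = -sum(c / n * log(c / n) for c in b.values())
    denom = 0.5 * (h_true + h_pred)
    return mi / denom if denom > 0 else 1.0


# Toy example: eight cells, two per cell line.
truth = ["BJ", "BJ", "GM", "GM", "H1", "H1", "K562", "K562"]
perfect = [0, 0, 1, 1, 2, 2, 3, 3]      # same partition under different names
one_error = [0, 0, 1, 1, 2, 2, 3, 2]    # one K562 cell placed in the H1 cluster

print(adjusted_rand_index(truth, perfect), normalized_mutual_info(truth, perfect))
print(adjusted_rand_index(truth, one_error), normalized_mutual_info(truth, one_error))
```

Both scores depend only on the partition, not on the cluster names, so K-means cluster indices can be compared directly with cell line labels.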
3. Save model
save_checkpoint(model, model_file )
4. Load model to reproduce the result
model_new = load_checkpoint( model_file , model, args.use_cuda )
5. Predict cell clusters based on latent features by K-means
cluster_rna, cluster_epi = model.predict_cluster_by_kmeans(total_loader)
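As an illustration of what K-means on latent features involves, the sketch below implements Lloyd's algorithm in plain Python on toy 2-D points. It is not the code behind `predict_cluster_by_kmeans`. The `kmeans` function, its random initialization, and the toy coordinates are assumptions, and the real latent space here has `Zdim = 4` dimensions per modality.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and center update."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # k distinct points as initial centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy "latent features": two well-separated groups of three cells each.
latent = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
clusters = kmeans(latent, k=2)
print(clusters)
```

The cluster indices themselves are arbitrary. Only the grouping is meaningful, which is why scores such as NMI and ARI are used to compare it with the true cell types.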
6. Generate scATAC-seq data from scRNA-seq data
recon_atac = model.inference_other_from_rna( test_loader )
Results (R script)
Visualization
Generating UMAP visualization based on latent features of two-omics data:
colo_cellLine = c("BJ" = "chartreuse3", "GM" = "coral2", "H1" = "dodgerblue", "K562"= "cyan3")
Plot_umap_embeddings('./Example_test/scRNA-latent.csv', './Example_test/scATAC-latent.csv',
'./Example_test/cell_metadata.txt','./Example_test/Latent_umap.pdf',
colo_cellLine )
The resulting 'Latent_umap.pdf' contains two panels, one for each set of latent features (scRNA-seq and scATAC-seq).
Calculate TF motif scores:
Calculate_TF_score('./Example_test/scATAC-norm.csv', './Example_test/cell_metadata.txt', out_file, species = "human")