Analysis of mouse_skin dataset from SHARE seq technology by DCCA model - cmzuo11/DCCA GitHub Wiki
Analysis pipeline for the mouse skin dataset of SHARE-seq
After the successful installation of the DCCA model on your server, you can use the following steps to analyze the mouse_skin data.
Preprocessing steps (R script)
GSE140203
1. Download the raw count files for two-omics data from the GEO database under accession number:2. Feature selection:
Load library and function into R environment
library('Seurat')
source(./DCCA/Processing_data.R) # load our defined functions into R environment
Select the highly variable genes (HVGs) by 'vst' based on scRNA-seq data
Seurat_obj = Create_Seurat_from_scRNA(scRNA_data, nDim = 20)
HVGs = Select_HVGs_from_scRNA(Seurat_obj, selection.method = 'vst', nfeatures = 2000) # select 2000 HVGs by 'vst'
Select all peaks within 100kbp upstream and gene body of HVGs for human
nearby_loci, df_genes_loci = Select_Loci_by_vargenes(HVGs, scATAC_data, width = 50000, species = "mouse")
Save scRNA-seq (HVGs * cells) and scATAC-seq (nearby_loci * cells) data
write.table(scRNA_data[match(HVGs, row.names(scRNA_data)),], file = './Example_test/scRNA_seq_SNARE.tsv', sep="\t", quote=F)
write.table(scATAC_data[match(nearby_loci, row.names(scATAC_data)),], file = './Example_test/scATAC_seq_SNARE.txt', sep="\t", quote=F)
Run DCCA model (Python script):
python Main_Share_seq.py
1. create a neural network structure based on input two-omics data:
model = DCCA( layer_e_1 = [Nfeature1, 1000, 128], hidden1_1 = 128, Zdim_1 = 20, layer_d_1 = [20, 128, 1000],
hidden2_1 = 1000, layer_e_2 = [Nfeature2, 10000, 5000, 1000, 128], hidden1_2 = 128, Zdim_2 = 20,
layer_d_2 = [20], hidden2_2 = 20, args = args, ground_truth = None,
ground_truth1 = None, Type_1 = "NB", Type_2 = "Bernoulli", cycle = 3,
attention_loss = "Eucli" )
Note:
(1). the encoder for the VAE of scRNA-seq data is [Nfeature1, 1000, 128, 20], and the decoder is [20, 128, 1000, Nfeature1];
(2). the encoder for the VAE of scATAC-seq data is [Nfeature2, 10000, 5000, 1000, 128, 20], and the decoder is [20, Nfeature2];
(3). 'attention_loss' indicates that Euclidean distance was used to transfer attention between two-omics data. You can replace "Eucli" with 'L1' when using L1-norm as a distance to transfer information;
(4). 'Type_1' and 'Type_2' indicate the likelihood function of scRNA-seq and scATAC-seq data.
2. Model fitting
NMI_score1, ARI_score1, NMI_score2, ARI_score2 = model.fit_model(train_loader, test_loader, total_loader, "RNA" )
3. Save model
save_checkpoint(model, model_file )
4. Load model to reproduce the result
model_new = load_checkpoint( model_file , model, args.use_cuda )
5. Predict cell clusters based on latent features by K-means
cluster_rna, cluster_epi = model.predict_cluster_by_kmeans(total_loader)
Results (R script)
Visualization
Generating UMAP visualization based on latent features of two-omics data:
Plot_umap_embeddings('./Example_test/scRNA-latent.csv', './Example_test/scATAC-latent.csv',
'./Example_test/cell_metadata.txt','./Example_test/Latent_umap.pdf')
The 'Latent_umap.pdf' included two pictures of two latent features as follows:
Calculate TF motif scores.
Calculate_TF_score('./Example_test/scATAC-norm.csv', './Example_test/cell_metadata.txt', out_file, species = "mouse").