Analysis of cellMix dataset from SNARE seq technology by DCCA model - cmzuo11/DCCA GitHub Wiki

Analysis pipeline for the cell line mixture dataset of SNARE-seq

After the successful installation of the DCCA model on your server, you can use the following steps to analyze the cellMix data.

Preprocessing steps (R script)

1. Download the raw count files for two-omics data from the GEO database under accession number: GSE126074

2. Feature selection:

Load library and function into R environment

library('Seurat')
source(./DCCA/Processing_data.R) # load our defined functions into R environment

Select the highly variable genes (HVGs) by 'vst' based on scRNA-seq data

Seurat_obj = Create_Seurat_from_scRNA(scRNA_data, nDim = 4, remove_HK = T) 
HVGs       = Select_HVGs_from_scRNA(Seurat_obj, selection.method = 'vst', nfeatures = 500) # select 500 HVGs by 'vst'

Select all peaks within 100kbp upstream and gene body of HVGs for human

nearby_loci, df_genes_loci = Select_Loci_by_vargenes(HVGs, scATAC_data, width = 100000, species = "human")

Save scRNA-seq (HVGs * cells) and scATAC-seq (nearby_loci * cells) data

write.table(scRNA_data[match(HVGs, row.names(scRNA_data)),], file = './Example_test/scRNA_seq_SNARE.tsv', sep="\t", quote=F)
scATAC_used                       = scATAC_data[match(nearby_loci, row.names(scATAC_data)),]
scATAC_used[which(scATAC_used>0)] = 1
write.table(scATAC_used, file = './Example_test/scATAC_seq_SNARE.txt', sep="\t", quote=F)

Run DCCA model (Python script):

python Main_SNARE_seq.py

1. create a neural network structure based on input two-omics data:

model      =  DCCA( layer_e_1 = [Nfeature1, 128], hidden1_1 = 128, Zdim_1 = 4, layer_d_1 = [4, 128],
                    hidden2_1 = 128, layer_e_2 = [Nfeature2, 1500, 128], hidden1_2 = 128, Zdim_2 = 4,
                    layer_d_2 = [4], hidden2_2 = 4, args = args, ground_truth = label_ground_truth,
                    ground_truth1 = label_ground_truth, Type_1 = "NB", Type_2 = "Bernoulli", cycle = 1,
                    attention_loss = "Eucli" )

Note:

(1). the encoder for the VAE of scRNA-seq data is [Nfeature1, 128, 4], and the decoder is [4, 128, Nfeature1];

(2). the encoder for the VAE of scATAC-seq data is [Nfeature2, 1500, 128, 4], and the decoder is [4, Nfeature2];

(3). 'attention_loss' indicates that Euclidean distance was used to transfer attention between two-omics data. You can replace "Eucli" with 'L1' when using L1-norm as a distance to transfer information;

(4). 'Type_1' and 'Type_2' indicate the likelihood function of scRNA-seq and scATAC-seq data.

2. Model fitting

# 90% of dataset was used for training, and 10% of dataset was used for testing
NMI_score1, ARI_score1, NMI_score2, ARI_score2  =  model.fit_model(train_loader, test_loader, total_loader, "RNA" ) 
# NMI_score1 and ARI_score1 indicate the clustering score of scRNA-seq, and NMI_score2 and ARI_score2 indicate the clustering score of scATAC-seq data. Four metrics are calculated bsaed on predicted cell clusters based on latent features by K-means and true cell types.

3. Save model

save_checkpoint(model, model_file )

4. Load model to reproduce the result

model_new = load_checkpoint( model_file , model, args.use_cuda )

5. Predict cell clusters based on latent features by K-means

cluster_rna, cluster_epi = model.predict_cluster_by_kmeans(total_loader)

6. Generate scATAC-seq data from scRNA-seq data

recon_atac   = model.inference_other_from_rna( test_loader )

Results (R script)

Visualization

Generating UMAP visualization based on latent features of two-omics data:

colo_cellLine = c("BJ" = "chartreuse3", "GM" = "coral2", "H1" = "dodgerblue", "K562"= "cyan3")
Plot_umap_embeddings('./Example_test/scRNA-latent.csv', './Example_test/scATAC-latent.csv', 
                     './Example_test/cell_metadata.txt','./Example_test/Latent_umap.pdf',
                     colo_cellLine )

The 'Latent_umap.pdf' included two pictures of two latent features as follows:

Calculate TF motif scores.

Calculate_TF_score('./Example_test/scATAC-norm.csv', './Example_test/cell_metadata.txt', out_file,  species = "human").