Analysis of PBMC_3K dataset from 10X Genomics by DCCA model - cmzuo11/DCCA GitHub Wiki

Analysis pipeline for the PBMC_3K from 10X genomics

After the successful installation of the DCCA model on your server, you can use the following steps to analyze the PBMC_3K data.

Preprocessing steps (R script)

1. Download the raw count files for two-omics data from the 10X Genomics with the link

2. Feature selection:

Load library and functions

library('Seurat')
source(./DCCA/Processing_data.R)

Select the highly variable genes (HVGs) by 'vst' based on scRNA-seq data

Seurat_obj = Create_Seurat_from_scRNA(scRNA_data, nDim = 10) 
HVGs       = Select_HVGs_from_scRNA(Seurat_obj, selection.method = 'vst', nfeatures = 1500) # select 1500 HVGs by 'vst' 

Select all peaks within 100kbp upstream and gene body of HVGs

nearby_loci, df_genes_loci = Select_Loci_by_vargenes(HVGs, scATAC_data, width = 100000, species = "human")  

Save scRNA-seq (HVGs * cells) and scATAC-seq (nearby_loci * cells) data

used_rna  = scRNA_data[match(HVGs, row.names(scRNA_data)),]
write.table(used_rna, file  = './Example_test/scRNA_seq_10X.tsv', sep="\t", quote=F)

used_atac = scATAC_data[match(nearby_loci, row.names(scATAC_data)),]
used_atac[which(used_atac>0)] = 1
write.table(used_atac, file = './Example_test/scATAC_seq_10X.txt', sep="\t", quote=F)

Run DCCA model (Python script):

python Main_PBMC_3k.py

1. create a neural network structure based on input two-omics data:

model     = DCCA( layer_e_1 = [Nfeature1, 128], hidden1_1 = 128, Zdim_1 = 10, layer_d_1 = [10, 128],
                  hidden2_1 = 128, layer_e_2 = [Nfeature2, 5000, 2500, 1000, 128], hidden1_2 = 128, Zdim_2 = 10,
                  layer_d_2 = [10], hidden2_2 = 10, args = args, ground_truth = label_ground_truth,
                  ground_truth1 = label_ground_truth, Type_1 = "NB", Type_2 = "Bernoulli", cycle = 3, 
                  attention_loss = "Eucli" )

Note:

(1). the encoder for the VAE of scRNA-seq data is [Nfeature1, 128, 10], and the decoder is [10, 128, Nfeature1];

(2). the encoder for the VAE of scATAC-seq data is [Nfeature2, 5000, 2500, 1000, 128, 10], and the decoder is [10, Nfeature2];

(3). 'attention_loss' indicates that Euclidean distance was used to transfer attention between two-omics data. You can replace "Eucli" with 'L1' when using L1-norm as a distance to transfer information;

(4). 'Type_1' and 'Type_2' indicate the likelihood function of scRNA-seq and scATAC-seq data.

2. Model fitting

NMI_score1, ARI_score1, NMI_score2, ARI_score2  =  model.fit_model(train_loader, test_loader, total_loader, "RNA" )

3. Save model

save_checkpoint(model, model_file ) 

4. Load model to reproduce the result

model_new                = load_checkpoint( model_file , model, args.use_cuda )

5. Predict cell clusters based on latent features by K-means

cluster_rna, cluster_epi = model.predict_cluster_by_kmeans(total_loader)

6. Generate scATAC-seq data from scRNA-seq data

recon_atac               = model.inference_other_from_rna( test_loader )

Results (R script)

Visualization

Generating UMAP visualization based on latent features of two-omics data:

color_PBMC    = c("CD14 Mono"="#B97E4C", "CD16 Mono"="#A061E2", "B intermediate"="#E5E02F", "B memory"="#5DB8EA", 
                  "B naive"= "#33A02C" ,"CD4 Naive"="#D0021B", "CD4 TCM"="#F5A623", "CD4 TEM"="#517E1E", "CD8 Naive"="#686BEC",
                  "CD8 TEM" = "dodgerblue", "cDC2" = "tomato2", "gdT" = "#BF43D9", "MAIT" = "palegreen", "NK"="cyan2", 
                  "pDC" = "chocolate1", "Treg" ="plum2" )

Plot_umap_embeddings('./Example_test/scRNA-latent.csv', './Example_test/scATAC-latent.csv', 
                     './Example_test/cell_metadata.txt','./Example_test/Latent_umap.pdf',
                     color_PBMC )

The 'Latent_umap.pdf' included two pictures of two latent features as follows:

Calculate TF motif scores.

Calculate_TF_score('./Example_test/scATAC-norm.csv', './Example_test/cell_metadata.txt', out_file,  species = "human").

Infer network

match_data = TF_loci_mapping(used_atac, species = "human")
Infer_net  = Infer_network(used_rna, used_atac, df_genes_loci, Seurat_obj["Cell_type"](/cmzuo11/DCCA/wiki/"Cell_type"), match_data)

Extract cell-type-specific network

cell_speci_net = Generate_cell_type_regulon(Infer_net)

Activity of regulon

regulon_activity_cells = Regulon_activity(cell_speci_net, rna_data, Seurat_obj["Cell_type"](/cmzuo11/DCCA/wiki/"Cell_type"))
Plot_regulon_activity(regulon_activity_cells, Seurat_obj["Cell_type"](/cmzuo11/DCCA/wiki/"Cell_type"))