PECA pS - PECAplus/Perseus-PluginPECA GitHub Wiki

Description

PECA-pS (pulsed SILAC) is a model that estimates synthesis and degradation rates separately from pulsed proteomics data, producing one synthesis rate and one degradation rate per gene per measurement interval.

NOTE: For correct estimation of rates, the model requires the time course to be monotone decreasing in the channel representing degradation of pre-existing proteins and monotone increasing in the channel representing synthesis of new proteins. The model cannot verify that this assumption holds; the user must check the quality of the data.
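A quick pre-check of the monotonicity assumption can be scripted outside the plugin. The sketch below is not part of PECA; the function name `is_monotone`, the `tol` tolerance for measurement noise, and the example values are all hypothetical.

```python
def is_monotone(values, direction, tol=0.0):
    """Return True if the series never moves against `direction` by more than `tol`."""
    if direction not in ("decreasing", "increasing"):
        raise ValueError("direction must be 'decreasing' or 'increasing'")
    pairs = list(zip(values, values[1:]))
    if direction == "decreasing":
        return all(b <= a + tol for a, b in pairs)
    return all(b >= a - tol for a, b in pairs)


# Hypothetical time courses for one gene (one value per time point).
pre_channel = [1.00, 0.80, 0.65, 0.50, 0.42, 0.30]  # PRE/REF, expected to decrease
new_channel = [0.05, 0.20, 0.33, 0.50, 0.61, 0.70]  # NEW/REF, expected to increase

print(is_monotone(pre_channel, "decreasing"))
print(is_monotone(new_channel, "increasing"))
```

Genes that fail such a check are candidates for manual inspection before running the analysis; what tolerance is reasonable depends on the noise level of the experiment.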

NOTE: We explain the tool using an example with three labels: M for the PRE-existing proteins, H for the NEWly synthesized proteins, and L as the REFerence label. PECA-pS can also be applied to a two-label SILAC experiment. In that case, the REF label should be the sum of PRE and NEW.

NOTE: The degradation rates for each interval can easily be averaged over the entire time course and converted to a half-life using the following formula:

$$t_{1/2} = \frac{\ln 2}{\bar{D}}$$

where $\bar{D}$ is the average degradation rate.
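As a worked illustration of this conversion, the sketch below averages per-interval degradation rates and applies the standard first-order relation, half-life = ln(2) / average rate. The function name `half_life` and the example rates are hypothetical, and the unweighted mean assumes equally spaced time points; the result is in the same time unit as the time points.

```python
import math


def half_life(degradation_rates):
    """Average per-interval degradation rates and convert to a half-life.

    Assumes first-order decay, so half-life = ln(2) / mean rate.
    """
    if not degradation_rates:
        raise ValueError("at least one degradation rate is required")
    mean_rate = sum(degradation_rates) / len(degradation_rates)
    if mean_rate <= 0:
        raise ValueError("mean degradation rate must be positive")
    return math.log(2) / mean_rate


# Hypothetical degradation rates for one gene, one per time interval.
rates = [0.10, 0.12, 0.08, 0.11, 0.09]
print(f"half-life: {half_life(rates):.2f} time units")
```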

Parameters

Working Directory

Specifies the directory where input files, output matrices and plots produced by PECA will be saved. The path can be typed in manually or chosen with the "Select" button.

Friendly Reminder: DO NOT SELECT DESKTOP as the tool produces MANY files

About Data

Time Points

Specification of the time points in the datasets (default: 0 1 2 3 4 5). Time points can be in any unit, such as minutes or hours, but the unit needs to be the same across the entire list. All values should be numeric and separated by whitespace.
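The expected format can be validated before it is pasted into the dialog. The following is a minimal sketch, not the plugin's own parser; `parse_time_points` is a hypothetical helper, and the requirement that time points be strictly increasing is an assumption not stated above.

```python
def parse_time_points(text):
    """Parse a whitespace-separated list of numeric time points."""
    points = [float(token) for token in text.split()]  # ValueError on non-numeric input
    if len(points) < 2:
        raise ValueError("at least two time points are needed to form an interval")
    # Assumption: time points are listed in strictly increasing order.
    if any(b <= a for a, b in zip(points, points[1:])):
        raise ValueError("time points must be strictly increasing")
    return points


print(parse_time_points("0 1 2 3 4 5"))
```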

Number of Replicates

Specification of the number of replicates in the datasets for mRNA data and label-free proteomics data (default: 1)

Smoothing

If checked, Gaussian Process (GP) smoothing will be applied to the datasets (default: unchecked).

Gaussian Process Variance Parameter

Determines the variation of values from the mean (default 2.0). A small value will result in a function that stays close to the mean value.

Gaussian Process Scale Parameter

Scaling factor that determines the smoothness of the curve (default 1.0). A small value will result in the function values changing quickly.

Gene Set Analysis (GSA)

If checked, a time-dependent functional enrichment analysis will be performed on the output matrix of PECA, specifically on the change point scores based on synthesis and degradation rates for PECA-pS and R. The result will be displayed as two additional output matrices: the first for synthesis rates and the second for degradation rates (default: unchecked). The resulting matrices report the biological functions whose members are up- or down-regulated at specific time points.

Biological Function Annotation Files

Specifies the file path of the function annotation file that should be used for the time-dependent functional enrichment analysis.

File Format:

  1. First column named 'Pathwayid', specifying the pathway IDs, e.g. from Gene Ontology and Consensus Pathway DataBase
  2. Second column with the same name as the gene name column provided in the parameters, specifying the genes involved
  3. Third column named 'pathway', specifying the pathways involved

Enrichment Analysis FDR Cutoff

Defines the FDR cutoff that the enrichment analysis uses when analyzing biological functions at specific time points (default 0.05, i.e. 5%). The value of this parameter should lie between 0 and 1 (e.g., 0.05, 0.1, 0.2).

Minimum % of Genes to Consider a Pathway to Be Tested

Specifies the minimum percentage of genes needed in the experimental data for a pathway to be analyzed (default 0). For instance, if 20% is specified, then at least 20 genes need to be present in the experimental data for a pathway of 100 genes. The value of this parameter should lie between 0 and 100.
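The coverage rule can be written out as a one-line check. This is an illustrative sketch rather than the plugin's code; `passes_coverage` is a hypothetical name, and the inclusive comparison follows the "at least 20 genes" wording above.

```python
def passes_coverage(pathway_size, genes_in_data, min_percent):
    """Return True if enough of the pathway's genes appear in the experimental data."""
    if pathway_size <= 0:
        raise ValueError("pathway_size must be positive")
    if not 0 <= min_percent <= 100:
        raise ValueError("min_percent must lie between 0 and 100")
    coverage = 100.0 * genes_in_data / pathway_size
    return coverage >= min_percent


print(passes_coverage(100, 20, 20))  # pathway of 100 genes, 20 present, cutoff 20%
print(passes_coverage(100, 19, 20))  # one gene short of the cutoff
```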

Minimum Number of Genes For Hypothesis Testing

Specifies the minimum number of significant genes (within the FDR cutoff) from the experimental data for a particular pathway to be reported (default 1). Anything below this number will be assigned a p-value of 1. The value of this parameter should be a positive integer.

Select Data

Gene Name Column

The selected text column will be used as the gene identifier in PECA analysis (default: first text column).

mRNA Concentration Data

The selected expression/numerical columns that should be used as mRNA data (default: first third of expression/numeric columns).

The columns should be grouped by replicate, with the time points in order within each replicate.

Order:

  • time point 1 replicate 1
  • ...
  • time point N replicate 1
  • time point 1 replicate 2
  • ...
  • time point N replicate 2
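The ordering listed above can be generated programmatically, which helps when arranging columns before import. The sketch assumes the replicate-major layout that the list suggests (all time points of replicate 1, then all time points of replicate 2); `column_order` is a hypothetical helper.

```python
def column_order(n_time_points, n_replicates):
    """Return (time_point, replicate) pairs in the expected column order."""
    return [
        (time_point, replicate)
        for replicate in range(1, n_replicates + 1)
        for time_point in range(1, n_time_points + 1)
    ]


for time_point, replicate in column_order(3, 2):
    print(f"time point {time_point} replicate {replicate}")
```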

Data Input Form 1

Specification for the data input form of mRNA data, i.e. what data transformation has been applied already (default: Raw).

  • Raw: unprocessed, untransformed data.

  • ln: log_e (natural log) transformed data.

  • log_2: log_2 transformed data.

  • log_10: log_10 transformed data.

  • log_custom: log_X transformed data, where X is a specified positive real value.
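To make the options concrete, the sketch below shows how a value in each input form relates to its natural-log equivalent, using the change-of-base identity ln(x) = log_b(x) * ln(b). Whether the plugin performs the conversion in exactly this way is an assumption; `to_natural_log` is a hypothetical helper.

```python
import math


def to_natural_log(value, form, base=None):
    """Convert a value given in one of the input forms to the natural-log scale."""
    if form == "Raw":
        return math.log(value)  # raw values must be positive
    if form == "ln":
        return value
    if form == "log_2":
        return value * math.log(2)
    if form == "log_10":
        return value * math.log(10)
    if form == "log_custom":
        if base is None or base <= 0 or base == 1:
            raise ValueError("log_custom needs a positive base other than 1")
        return value * math.log(base)
    raise ValueError(f"unknown data input form: {form}")


# The same underlying measurement (8.0) expressed in three different forms.
print(to_natural_log(8.0, "Raw"))
print(to_natural_log(3.0, "log_2"))
print(to_natural_log(1.5, "log_custom", base=4))
```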

PRE/REF SILAC Data

The selected expression/numerical columns that should be used as the channel representing degradation of pre-existing proteins (PRE), e.g. SILAC ratios of PRE/REF (default: second third of expression/numeric columns).

Same order as mRNA Data. The number of columns should also match mRNA Data.

Data Input Form 2

Specification for the data input form of PRE/REF SILAC Data (default: Raw).

Same as Data Input Form 1

NEW/REF SILAC Data

The selected expression/numerical columns that should be used as the channel representing synthesis of new proteins, e.g. SILAC ratios of NEW/REF (default: last third of expression/numeric columns).

Same order as mRNA Data. The number of columns should also match mRNA Data.

Data Input Form 3

Specification for the data input form of NEW/REF SILAC Data (default: Raw).

Same as Data Input Form 1

MCMC Parameters

PECA model parameters are estimated using a sampling-based algorithm called MCMC (Markov chain Monte Carlo), which requires the parameters below. All values should be positive integers.

MCMC Burn-In

Defines the number of iterations discarded at the beginning of the MCMC run, i.e. the burn-in period (default: 1000).

MCMC Thinning

Defines the interval at which MCMC iterations are recorded (default: 10).

MCMC Samples

Defines the total number of post-burn-in samples to be recorded from MCMC (default: 1000).
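These three settings together determine the length of the run. The sketch below assumes the common convention that, after burn-in, one sample is recorded every "thinning" iterations until the requested number of samples is reached; the plugin's exact bookkeeping may differ, and `total_iterations` is a hypothetical helper.

```python
def total_iterations(burn_in=1000, thinning=10, samples=1000):
    """Estimate the number of MCMC iterations implied by the three settings."""
    for name, value in (("burn_in", burn_in), ("thinning", thinning), ("samples", samples)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    return burn_in + thinning * samples


print(total_iterations())  # defaults: 1000 + 10 * 1000 = 11000
```

Under this assumption, doubling the thinning interval roughly doubles the run time while keeping the number of recorded samples fixed.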

Output

General Output

The text column is the gene name column provided from Gene Name Column

The main/expression columns are the log_e (natural log) transformed mRNA and SILAC data sets.

The numeric columns contain: RY, DY, signedCPSX, signedCPDX, FDR_SX, FDR_DX, where X indexes time points starting from 0 (i.e. X=1 refers to the second time point) and Y indexes time point intervals starting from 0 (i.e. Y=0 refers to the interval between the first and second time points).

  • RY is the synthesis rate for time interval Y (e.g. if Y = 1, then the interval is between time point indices 1 and 2)

  • DY is the degradation rate for time interval Y

  • signedCPSX is the change point score for synthesis rates

  • signedCPDX is the change point score for degradation rates

  • FDR_SX is the False Discovery Rate for synthesis rate

  • FDR_DX is the False Discovery Rate for degradation rate
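The interval index Y can be mapped back to the actual time points with a small lookup. This is a sketch for reading the output, not part of the plugin; `interval_bounds` is a hypothetical helper and assumes the zero-based interval indexing described above.

```python
def interval_bounds(time_points):
    """Map each interval index Y to its (start, end) time points."""
    return {
        y: (time_points[y], time_points[y + 1])
        for y in range(len(time_points) - 1)
    }


bounds = interval_bounds([0, 1, 2, 3, 4, 5])
print(bounds[0])  # interval between the first and second time points
print(bounds[1])  # interval between time point indices 1 and 2
```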

GSA Output (if GSA is checked)

GSA Output produces two additional matrices. The first additional matrix is GSA analysis on synthesis rates and the second one on degradation rates. The two additional matrices follow the same format as described below.

The text column contains: the gene name column provided from Gene Name Column, GO_name, GO_id

  • GO_name is the name of the Gene Ontology

  • GO_id is the ID of the Gene Ontology

The numeric columns contain: MaxSig(Up), MaxSig(Down), Max(Both), GO_size, GO_size_background, Up(X), Down(X), Sig(X), where X indexes time points corresponding to signedCPSX

  • MaxSig(Up) is the maximum value of -log10(Up(X)) for all X

  • MaxSig(Down) is the maximum value of -log10(Down(X)) for all X

  • Max(Both) is the maximum value of -log10(Sig(X)) for all X

  • GO_size is the number of genes in the pathway

  • GO_size_background is the number of genes in the pathway that appear in the experimental data

  • Up(X) is the p-value calculated from the number of up-regulated genes

  • Down(X) is the p-value calculated from the number of down-regulated genes

  • Sig(X) is the p-value calculated from the number of up and down-regulated genes
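The three summary columns follow directly from the per-time-point p-values. As an illustration, the sketch below computes the maximum of -log10(p) over all time points; the function name `max_sig`, the example p-values, and the `floor` guard against p-values of exactly 0 are assumptions, not documented plugin behavior.

```python
import math


def max_sig(p_values, floor=1e-300):
    """Return the maximum of -log10(p) over all time points."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    # The floor avoids log10(0); it is a guard added for this sketch only.
    return max(-math.log10(max(p, floor)) for p in p_values)


# Hypothetical Up(X) p-values for one pathway across three time points.
up_p_values = [0.5, 0.01, 0.2]
print(max_sig(up_p_values))  # driven by the smallest p-value, 0.01
```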
