InstallingRunningUsage - Oshlack/Corset GitHub Wiki
Corset requires samtools to be installed. It uses the samtools library and header files, so these must be in a location that Corset can access.
We also suggest using gcc version 4.3 or greater, because some newer container classes are available in c++0x. This is not a requirement, but code compiled with older versions of gcc will run slower.
To install, please download the tarball from here, then unzip and untar it.
The corset program is already precompiled, so you can run it as is (e.g. by typing <path_to_corset>/corset). You can also copy corset to your bin directory, or edit your PATH environment variable to include <path_to_corset>, so that it is found automatically without specifying the path.
To compile from source instead, run configure:
./configure
Note that you may have to specify the directory location of sam.h and libbam with
--with-bam_inc=<directory_containing_sam.h>
and
--with-bam_lib=<directory_containing_bam_library>
if they are not in the usual paths. Note that you need to give the absolute path and not a relative path.
Then run:
make
make install
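The absolute-path requirement for the two configure flags is easy to get wrong when samtools sits in a neighbouring directory. The following is a minimal Python sketch, not part of Corset, that resolves the directories and assembles the configure invocation. The helper name `configure_command` and the example path `../samtools` are assumptions made for illustration only.

```python
import os


def configure_command(bam_inc_dir, bam_lib_dir):
    """Return the ./configure argument list with both directories made absolute."""
    return [
        "./configure",
        "--with-bam_inc=" + os.path.abspath(bam_inc_dir),
        "--with-bam_lib=" + os.path.abspath(bam_lib_dir),
    ]


# Hypothetical layout: a samtools build in a sibling directory.
cmd = configure_command("../samtools", "../samtools")
print(" ".join(cmd))
```

The printed line can be pasted into the shell from the Corset source directory, or the list can be passed to `subprocess.run`.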
In the simplest case, Corset can be run in the directory containing your bam files by typing:
corset *.bam
The input should be one bam file per sample. The bam files should have been produced by multi-mapping the reads to the transcriptome. For example, with bowtie/bowtie2 you should use the parameter --all (or -k with a large number).
The usage information provided by corset (version 1.07) is:
Usage: corset [options] <input bam files>
Input bam files:
The input files should be multi-mapped bam files. They can be single, paired-end or mixed
and do not need to be indexed. A space separated list should be given.
e.g. corset sample1.bam sample2.bam sample3.bam
or just: corset sample*.bam
If you want to combine the results from different transcriptomes. i.e. the same reads have
been mapped twice or more, you can used a comma separated list like below:
corset sample1_Trinity.bam,sample1_Oases.bam sample2_Trinity.bam,sample2_Oases.bam ...
Options are:
-d <double list> A comma separated list of distance thresholds. The range must be
between 0 and 1. e.g -d 0.4,0.5. If more than one distance threshold
is supplied, the output filenames will be of the form:
counts-<threshold>.txt and clusters-<threshold>.txt
Default: 0.3
-D <double> The value used for thresholding the log likelihood ratio. The default
value will depend on the number of degrees of freedom (which is the
number of groups -1). By default D = 17.5 + 2.5 * ndf, which corresponds
approximately to a p-value threshold of 10^-5, when there are fewer than
10 groups.
-m <int> Filter out any transcripts with fewer than this many reads aligning.
Default: 10
-g <list> Specifies the grouping. i.e. which samples belong to which experimental
groups. The parameter must be a comma separated list (no spaces), with the
groupings given in the same order as the bam filename. For example:
-g Group1,Group1,Group2,Group2 etc. If this option is not used, each sample
is treated as an independent experimental group.
-p <string> Prefix for the output filenames. The output files will be of the form
<prefix>-counts.txt and <prefix>-clusters.txt. Default filenames are:
counts.txt and clusters.txt
-f <true/false> Specifies whether the output files should be overwritten if they already exist.
Default: false
-n <string list> Specifies the sample names to be used in the header of the output count file.
This should be a comma separated list without spaces.
e.g. -n Group1-ReplicateA,Group1-ReplicateB,Group2-ReplicateA etc.
Default: the input filenames will be used.
-r <true/true-stop/false>
Output a file summarising the read alignments. This may be used if you
would like to read the bam files and run the clustering in seperate runs
of corset. e.g. to read input bam files in parallel. The output will be the
bam filename appended with .corset-reads.
Default: false
-i <bam/corset/salmon_eq_classes> The input file type. Use -i corset, if you previously ran
corset with the -r option and would like to restart using those
read summary files. Use salmon_eq_classes, if you aligned with salmon with
the flag --dumpEq and are passing corset the equivalent class files.
Running with either -i corset or salmon_eq_classes will switch off the -r option.
Default: bam
-l <int> If running with -i corset or salmon_eq_classes, this will filter out a link between contigs
if the link is supported by less than this many reads. Default: 1 (no filtering)
-x <int> If running with -i corset or salmon_eq_classes, this option will filter out reads that
align to more than x contigs. Default: no filtering
Citation: Nadia M. Davidson and Alicia Oshlack, Corset: enabling differential gene expression
analysis for de novo assembled transcriptomes, Genome Biology 2014, 15:410
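Two parts of the usage text lend themselves to a small worked example: the default value of -D, and the requirement that the -g and -n lists follow the same order as the input bam files. The Python sketch below is not part of Corset. The function names, the sample table and the file names are assumptions made for illustration, and only options listed in the usage text above are used.

```python
def default_llr_threshold(n_groups):
    """Default -D value from the usage text: D = 17.5 + 2.5 * ndf, ndf = groups - 1."""
    ndf = n_groups - 1
    return 17.5 + 2.5 * ndf


def build_corset_command(samples, prefix=None):
    """Assemble a corset argument list.

    samples is a list of (sample_name, group, bam_paths) tuples. Several bam
    paths for one sample (the same reads mapped to different transcriptomes)
    are joined with commas, as described in the usage text.
    """
    for name, group, _ in samples:
        for label in (name, group):
            if "," in label or " " in label:
                raise ValueError("names and groups must not contain commas or spaces: %r" % label)
    cmd = ["corset"]
    # -g and -n are built from the same list, so their order matches the bam files.
    cmd += ["-g", ",".join(group for _, group, _ in samples)]
    cmd += ["-n", ",".join(name for name, _, _ in samples)]
    if prefix is not None:
        cmd += ["-p", prefix]
    cmd += [",".join(bams) for _, _, bams in samples]
    return cmd


# Hypothetical experiment: two groups with two replicates each.
samples = [
    ("ctrl_rep1", "ctrl", ["ctrl1.bam"]),
    ("ctrl_rep2", "ctrl", ["ctrl2.bam"]),
    ("treat_rep1", "treat", ["treat1.bam"]),
    ("treat_rep2", "treat", ["treat2.bam"]),
]
n_groups = len({group for _, group, _ in samples})
threshold = default_llr_threshold(n_groups)
cmd = build_corset_command(samples, prefix="run1")
print(threshold)
print(" ".join(cmd))
```

With two groups there is one degree of freedom, so the default threshold works out to 17.5 + 2.5 = 20. Building both lists from a single table avoids a mismatch between group labels and bam files.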
By default corset will output two files: clusters.txt and counts.txt. If you have specified multiple distance thresholds, then the output will be of the form clusters-<threshold>.txt and counts-<threshold>.txt.
clusters.txt is a tab-delimited table with one line for each transcript. The first column contains the transcript ID and the second column contains the ID of the cluster it has been assigned to.
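Because the clusters file is a plain two-column table, it can be loaded with a few lines of Python. The sketch below is not part of Corset and assumes only the layout described above (transcript ID, then cluster ID, separated by a tab). The function name `read_clusters` and the example IDs are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each cluster ID to the list of transcript IDs assigned to it."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        transcript_id, cluster_id = row[0], row[1]
        clusters[cluster_id].append(transcript_id)
    return dict(clusters)


# Made-up content standing in for a real clusters.txt file.
example = io.StringIO(
    "contig_1\tClusters-0.0\n"
    "contig_2\tClusters-0.0\n"
    "contig_3\tClusters-1.0\n"
)
clusters = read_clusters(example)
for cluster_id, transcripts in clusters.items():
    print(cluster_id, len(transcripts))
```

For a real run, pass an open file instead, e.g. `with open("clusters.txt") as handle: clusters = read_clusters(handle)`.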
counts.txt is also a tab-delimited table. It lists the number of reads assigned to each cluster, with one cluster per row and one column for each sample.
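A counts table of this shape can be read with the Python standard library. The sketch below is not part of Corset. It assumes that the first line is a header holding the sample names (as the -n option implies), that the first field of every later line is the cluster ID, and that counts are integers. Whether the header has a leading empty field is not stated here, so the code accepts both layouts. The function name `read_counts` and the example values are made up.

```python
import csv
import io


def read_counts(handle):
    """Return (sample_names, {cluster_id: [count per sample]})."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if len(rows) < 2:
        raise ValueError("expected a header line and at least one cluster row")
    header, data = rows[0], rows[1:]
    n_samples = len(data[0]) - 1
    # Take the last n_samples header fields, so a leading empty cell is ignored.
    samples = header[-n_samples:]
    counts = {row[0]: [int(value) for value in row[1:]] for row in data}
    return samples, counts


# Made-up content standing in for a real counts.txt file.
example = io.StringIO(
    "\tsampleA\tsampleB\n"
    "Clusters-0.0\t12\t7\n"
    "Clusters-1.0\t0\t3\n"
)
samples, counts = read_counts(example)
# Total reads assigned to clusters in each sample.
totals = [sum(column) for column in zip(*counts.values())]
print(samples, totals)
```

The per-sample totals are a quick sanity check before passing the table to a differential expression tool.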
The cluster naming is of the form Clusters-X.Y. The X is the super-cluster ID. Any transcript which shares even a single read with another transcript will have the same super-cluster ID as that transcript. The Y indicates the cluster number within the super-cluster (i.e. those which resulted from the hierarchical clustering and expression testing). If you run Corset with the option -m 0 (no filtering on the number of reads), you might also find clusters with the prefix "NoReadsCluster". These clusters have no reads which map to them, and are therefore excluded from the counts file.
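The X.Y naming scheme makes it possible to recover super-clusters from the cluster IDs alone. The Python sketch below is not part of Corset. It assumes that X and Y are non-negative integers and that the ID ends in `-X.Y`, and it does not depend on the exact prefix. The function names and example IDs are made up for illustration.

```python
import re
from collections import defaultdict

# <prefix>-<super-cluster ID>.<cluster number>, anchored at the end of the ID.
CLUSTER_ID = re.compile(r"^(?P<prefix>.+)-(?P<super>\d+)\.(?P<sub>\d+)$")


def split_cluster_id(cluster_id):
    """Return (prefix, super_cluster_id, cluster_number) for an ID like prefix-X.Y."""
    match = CLUSTER_ID.match(cluster_id)
    if match is None:
        raise ValueError("not of the form <prefix>-X.Y: %r" % cluster_id)
    return match.group("prefix"), int(match.group("super")), int(match.group("sub"))


def group_by_super_cluster(cluster_ids):
    """Map each super-cluster ID to the cluster IDs that belong to it."""
    groups = defaultdict(list)
    for cluster_id in cluster_ids:
        _, super_id, _ = split_cluster_id(cluster_id)
        groups[super_id].append(cluster_id)
    return dict(groups)


# Made-up IDs following the form described above.
ids = ["Clusters-0.0", "Clusters-0.1", "Clusters-1.0"]
parts = split_cluster_id("Clusters-12.3")
groups = group_by_super_cluster(ids)
print(parts)
print(groups)
```

Clusters that share a super-cluster ID were linked by at least one shared read but were separated by the hierarchical clustering and expression testing, so grouping them this way shows which clusters are related.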