File Format - labgem/CAESAR GitHub Wiki
Description of the files generated by the different steps of CAESAR
Blastp output - matches.tsv
Diamond is used to run the blastp search. It returns a TSV file named matches_<db>.tsv, formatted with the option --outfmt 6 qseqid qlen sseqid slen length pident qcovhsp positive mismatch gaps evalue.
Ref_id 317 Target_id 349 319 40.4 96.2 187 162 28 1.88e-71
Ref_id 317 Target_id 355 313 39.3 96.5 183 176 14 1.26e-70
Ref_id 317 Target_id 348 317 40.7 96.2 183 163 25 8.17e-70
Ref_id 317 Target_id 349 319 40.1 96.2 184 163 28 3.35e-69
Ref_id 317 Target_id 349 323 39.6 98.1 183 171 24 1.22e-68
Ref_id 317 Target_id 350 325 39.1 98.1 185 170 28 7.68e-68
Ref_id 317 Target_id 339 327 39.8 98.1 182 165 32 2.23e-67
Ref_id 317 Target_id 321 315 40.3 96.5 183 165 23 5.27e-67
Ref_id 317 Target_id 308 305 39.3 95.9 175 177 8 1.43e-66
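As an illustration, the eleven columns can be read into named fields with a few lines of Python. This is a minimal sketch, not part of CAESAR: the function name read_matches is hypothetical, the column names come from the --outfmt field list quoted above, and the file is assumed to be tab-separated without a header line.

```python
import csv
import io

# Column names taken from the --outfmt 6 field list quoted above.
COLUMNS = ["qseqid", "qlen", "sseqid", "slen", "length", "pident",
           "qcovhsp", "positive", "mismatch", "gaps", "evalue"]
INT_COLUMNS = {"qlen", "slen", "length", "positive", "mismatch", "gaps"}
FLOAT_COLUMNS = {"pident", "qcovhsp", "evalue"}


def read_matches(handle):
    """Yield one dict per alignment line (hypothetical helper, not a CAESAR API)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, row))
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(record[name])
        yield record


# First example line from above, written with explicit tabs.
sample = "Ref_id\t317\tTarget_id\t349\t319\t40.4\t96.2\t187\t162\t28\t1.88e-71\n"
records = list(read_matches(io.StringIO(sample)))
```

With a real file, the same function can be called on an open file handle, e.g. open("matches_<db>.tsv", newline="") with the actual database name substituted.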
Hmmsearch output - hits.domtbl
The search is run with hmmsearch from HMMER, which writes a .domtbl file through the --domtblout option.
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target
# ------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Target_id - 646 Query_name - 534 1.5e-57 200.3 0.0 1 1 4.1e-60 2e-57 199.9 0.0 8 529 72 615 55 619 0.80 Description
Target_id - 676 Query_name - 534 7.5e-39 138.6 0.0 2 2 4.5e-20 2.3e-17 67.7 0.0 298 455 405 561 391 585 0.80 Description
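The table is whitespace-delimited, with 22 fixed fields followed by a free-text description, so it can be parsed by splitting each non-comment line at most 22 times. The sketch below is only an illustration; parse_domtbl_line and read_domtbl are hypothetical names, and the dictionary keys are chosen here for readability rather than taken from CAESAR.

```python
def parse_domtbl_line(line):
    """Parse one data line of a --domtblout table (hypothetical helper)."""
    # 22 fixed fields, then the description, which may itself contain spaces.
    fields = line.rstrip("\n").split(None, 22)
    return {
        "target": fields[0],
        "tlen": int(fields[2]),
        "query": fields[3],
        "qlen": int(fields[5]),
        "full_evalue": float(fields[6]),
        "full_score": float(fields[7]),
        "dom_index": int(fields[9]),
        "dom_count": int(fields[10]),
        "c_evalue": float(fields[11]),
        "i_evalue": float(fields[12]),
        "dom_score": float(fields[13]),
        "hmm_from": int(fields[15]),
        "hmm_to": int(fields[16]),
        "ali_from": int(fields[17]),
        "ali_to": int(fields[18]),
        "env_from": int(fields[19]),
        "env_to": int(fields[20]),
        "acc": float(fields[21]),
        "description": fields[22] if len(fields) > 22 else "",
    }


def read_domtbl(lines):
    """Return parsed rows, skipping blank lines and '#' comment lines."""
    return [parse_domtbl_line(l) for l in lines
            if l.strip() and not l.startswith("#")]


sample_lines = [
    "# target name accession tlen query name accession qlen ...",
    "Target_id - 646 Query_name - 534 1.5e-57 200.3 0.0 1 1 4.1e-60 2e-57 "
    "199.9 0.0 8 529 72 615 55 619 0.80 Description",
    "Target_id - 676 Query_name - 534 7.5e-39 138.6 0.0 2 2 4.5e-20 2.3e-17 "
    "67.7 0.0 298 455 405 561 391 585 0.80 Description",
]
rows = read_domtbl(sample_lines)
```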
Filter step outputs
filtered_sequences.fasta
This is simply a fasta file containing all the sequences that have passed the filter.
filtered_data.tsv
The format and content of this file depend on the previous step.
after blastp step
The file has the same format as matches.tsv. It contains only the lines corresponding to the sequences in filtered_sequences.fasta.
after hmmsearch step
The format is: Target_id Target_len Query_name full_sequence_score full_sequence_evalue coverage, e.g:
Target_id 510 Query_name 134.8 1.10e-37 96.0
Target_id 485 Query_name 301.0 4.80e-88 86.0
Target_id 485 Query_name 128.0 1.20e-35 85.0
Target_id 556 Query_name 447.1 2.50e-132 97.0
Target_id 615 Query_name 363.4 5.70e-107 97.0
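A line in this six-column layout can be mapped to typed fields as sketched below. The FilteredHit type and parse_filtered_line function are hypothetical names introduced for illustration, and the sketch assumes tab-separated values with no header line.

```python
from typing import NamedTuple


class FilteredHit(NamedTuple):
    """One line of filtered_data.tsv after the hmmsearch step (illustrative type)."""
    target_id: str
    target_len: int
    query_name: str
    score: float      # full_sequence_score
    evalue: float     # full_sequence_evalue
    coverage: float


def parse_filtered_line(line):
    """Split one tab-separated line into a FilteredHit (hypothetical helper)."""
    target, tlen, query, score, evalue, coverage = line.rstrip("\n").split("\t")
    return FilteredHit(target, int(tlen), query,
                       float(score), float(evalue), float(coverage))


# First example line from above, written with explicit tabs.
sample = "Target_id\t510\tQuery_name\t134.8\t1.10e-37\t96.0\n"
hit = parse_filtered_line(sample)
```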
sources.txt
This file indicates the source databases of each filtered sequence. The db_name is extracted from the configuration file, formatted as shown in the README.
Target_id db_name
Target_id db_name
Target_id db_name
Clustering step outputs
The clustering is done with Diamond. It returns a TSV file with two columns: the first is the representative sequence of a cluster, and the second is a member of that cluster. For each cluster, there is therefore one line in which the two columns are identical, e.g:
representative_sequence member
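Since each line pairs a representative with one member, grouping the lines by their first column recovers the clusters. The following is a minimal sketch under the assumption that the file is tab-separated with no header line; read_clusters and the sequence ids are hypothetical.

```python
from collections import defaultdict


def read_clusters(lines):
    """Group members by representative sequence (hypothetical helper)."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


# Made-up ids for illustration: one cluster of two sequences, one singleton.
sample = ["seqA\tseqA\n", "seqA\tseqB\n", "seqC\tseqC\n"]
clusters = read_clusters(sample)
sizes = {rep: len(members) for rep, members in clusters.items()}
```

Here the size of a cluster is simply the number of lines sharing the same representative, including the line where the representative is listed as its own member.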
Candidate Selection outputs
The candidate selection step creates a directory for each candidate category. The categories are extracted from the configuration file with the candidate_selection key. In the README, the categories are:
- strain_library
- order
- other
Therefore, the directories strain_library, order and other are created by the candidate selection step. Each directory contains the same files.
all_candidates.faa & all_candidates.fna
The protein sequences and the nucleic sequences (if the database was indicated in the configuration file for the nucleic sequences).
all_candidates.tsv
File containing information on selected candidates.
- Candidate: Id of the candidate, e.g: E3PY99
- Cluster: Cluster's representative sequence, e.g: GUT_GENOME000603_02840
- Cluster_size: Number of sequences in the cluster, e.g: 4
- Sources: The databases from which the cluster sequences come, e.g: uniprot gut_genome
- Organism: The organism, e.g: Acetoanaerobium sticklandii (strain ATCC 12662 / DSM 519 / JCM 1433 / CCUG 9281 / NCIMB 10654 / HF)
- Tax_id: The taxonomic id, e.g: 499177
- EMBL-GenBank-DDBJ_CDS: DNA sequence id, e.g: CBH21414.1
- GC: GC percentage, e.g: 36.7
- Query: The Query name
- id: Identity percentage with the query, e.g: 43.9
- cov: Coverage percentage with the query, e.g: 98.8
- positives: Fraction of residues for which the alignment scores have positive values, e.g: 63.5
- mismatch: Fraction of mismatches, e.g: 47.0
- gaps: Fraction of gaps, e.g: 9.1
- e-value: e-value, e.g: 2.03e-94
- Selection_type: selection criteria in regard to the strain_library, can be tax_id, species, home_strain, order, e.g: home_strain
- Strain_library_organism: The strain_library organism which has matched the sequence, e.g: Clostridium sticklandii
- Strain_library_tax_id: The taxonomic id of the organism which has matched the sequence, e.g: 1511
- Collection: Microorganism Collection, e.g: DSMZ
- Collection_id: The id in the Collection, e.g: 519
all_candidates_ids.txt
This file is written in the same location as the category directories. It lists the ids of all candidates. This list can be used with the --update option of a run on the same references or hmm profile.
candidate1
candidate2
candidate3
review_reference_sequences.tsv
This file is written in the same location as the category directories. It lists the reference sequences and indicates, for each one, the number of candidates it retrieved. Only candidates that have passed the filter are counted, and a candidate is counted only for the reference with which it has its lowest e-value. The file also indicates how many of these candidates have been selected.
Reference Total Selected
RefA 1567 32
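The three columns make it easy to see which references contribute most of the selected candidates. The sketch below is an illustration only: read_review and the selected_fraction field are hypothetical additions, and the file is assumed to be tab-separated with the header line shown above.

```python
import csv
import io


def read_review(handle):
    """Read review_reference_sequences.tsv into a list of dicts (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        total = int(row["Total"])
        selected = int(row["Selected"])
        rows.append({
            "reference": row["Reference"],
            "total": total,
            "selected": selected,
            # Share of retrieved candidates that were selected (derived here,
            # not a column of the file).
            "selected_fraction": selected / total if total else 0.0,
        })
    return rows


# Example content from above, written with explicit tabs.
sample = "Reference\tTotal\tSelected\nRefA\t1567\t32\n"
review = read_review(io.StringIO(sample))
```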
Phylogeny step outputs
representatives_seqs.fasta & representatives_seqs_msa.fasta
The representatives_seqs.fasta file contains the representative sequence of each cluster. This file is given to mafft in order to build a multiple sequence alignment named representatives_seqs_msa.fasta.
NB: If the --reduce option is set to 0, all the filtered sequences are used to build the MSA, which is then named filtered_sequences_msa.fasta
representative_seqs.tree
Phylogenetic tree in Newick format, built with FastTree.
NB: If the --reduce option is set to 0, all the filtered sequences are used to build the MSA, and the tree built from this MSA is named filtered_seqs.tree