File Format - labgem/CAESAR GitHub Wiki

Description of the files generated by the different steps of CAESAR

Blastp output - matches.tsv

Diamond is used to run the blastp search. It returns a TSV file named matches_<db>.tsv. The file is formatted with the option --outfmt 6 qseqid qlen sseqid slen length pident qcovhsp positive mismatch gaps evalue, so each row holds these 11 tab-separated columns in that order.

Ref_id	317	Target_id	349	319	40.4	96.2	187	162	28	1.88e-71
Ref_id	317	Target_id	355	313	39.3	96.5	183	176	14	1.26e-70
Ref_id	317	Target_id	348	317	40.7	96.2	183	163	25	8.17e-70
Ref_id	317	Target_id	349	319	40.1	96.2	184	163	28	3.35e-69
Ref_id	317	Target_id	349	323	39.6	98.1	183	171	24	1.22e-68
Ref_id	317	Target_id	350	325	39.1	98.1	185	170	28	7.68e-68
Ref_id	317	Target_id	339	327	39.8	98.1	182	165	32	2.23e-67
Ref_id	317	Target_id	321	315	40.3	96.5	183	165	23	5.27e-67
Ref_id	317	Target_id	308	305	39.3	95.9	175	177	8	1.43e-66
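As an illustration, the rows above could be loaded into typed records as follows. This is a minimal sketch, not part of CAESAR: the function name parse_matches and the int/float typing of each column are assumptions, and the column names are copied from the --outfmt string described above.

```python
import csv
import io

# Column order copied from the --outfmt 6 option string described above.
COLUMNS = ["qseqid", "qlen", "sseqid", "slen", "length", "pident",
           "qcovhsp", "positive", "mismatch", "gaps", "evalue"]
# Assumed typing: counts and lengths as int, percentages and e-value as float.
INT_FIELDS = ("qlen", "slen", "length", "positive", "mismatch", "gaps")
FLOAT_FIELDS = ("pident", "qcovhsp", "evalue")


def parse_matches(handle):
    """Yield one dict per alignment row of a matches_<db>.tsv file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, row))
        for key in INT_FIELDS:
            record[key] = int(record[key])
        for key in FLOAT_FIELDS:
            record[key] = float(record[key])
        yield record


# In-memory sample taken from the first example row above.
sample = "Ref_id\t317\tTarget_id\t349\t319\t40.4\t96.2\t187\t162\t28\t1.88e-71\n"
rows = list(parse_matches(io.StringIO(sample)))
print(rows[0]["sseqid"], rows[0]["pident"], rows[0]["evalue"])
```

With a real file, the same function can be given an open file handle instead of the io.StringIO sample.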

Hmmsearch output - hits.domtbl

The search is run with HMMER, which writes a .domtbl file through the --domtblout option.

#                                                                                 --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name             accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#     ------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Target_id                 -            646 Query_name           -            534   1.5e-57  200.3   0.0   1   1   4.1e-60     2e-57  199.9   0.0     8   529    72   615    55   619 0.80 Description
Target_id                 -            676 Query_name           -            534   7.5e-39  138.6   0.0   2   2   4.5e-20   2.3e-17   67.7   0.0   298   455   405   561   391   585 0.80 Description
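The table above is whitespace-delimited: 22 fixed fields followed by a free-text description. A minimal sketch of reading one line is given below. The function name parse_domtbl_line and the dictionary keys are hypothetical, chosen only for this example, and the field positions follow the header shown above.

```python
def parse_domtbl_line(line):
    """Parse one line of a --domtblout table; return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # 22 whitespace-separated fields, then the description (may contain spaces).
    f = line.split(maxsplit=22)
    return {
        "target": f[0],
        "tlen": int(f[2]),
        "query": f[3],
        "qlen": int(f[5]),
        "full_evalue": float(f[6]),   # full sequence E-value
        "full_score": float(f[7]),    # full sequence score
        "domain_index": int(f[9]),    # the "#" column
        "domain_count": int(f[10]),   # the "of" column
        "i_evalue": float(f[12]),
        "domain_score": float(f[13]),
        "hmm_from": int(f[15]),
        "hmm_to": int(f[16]),
        "ali_from": int(f[17]),
        "ali_to": int(f[18]),
        "env_from": int(f[19]),
        "env_to": int(f[20]),
        "acc": float(f[21]),
        "description": f[22] if len(f) > 22 else "",
    }


# Sample taken from the first data row above.
example = ("Target_id - 646 Query_name - 534 1.5e-57 200.3 0.0 1 1 "
           "4.1e-60 2e-57 199.9 0.0 8 529 72 615 55 619 0.80 Description")
hit = parse_domtbl_line(example)
print(hit["target"], hit["full_score"], hit["hmm_from"], hit["hmm_to"])
```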

Filter step outputs

filtered_sequences.fasta

This is simply a fasta file containing all the sequences that have passed the filter.

filtered_data.tsv

The format and content of this file depend on the previous step.

after blastp step

The file has the same format as matches.tsv. It contains only the lines corresponding to the sequences in filtered_sequences.fasta.

after hmmsearch step

The format is: Target_id Target_len Query_name full_sequence_score full_sequence_evalue coverage, e.g:

Target_id	510	Query_name	134.8	1.10e-37	96.0
Target_id	485	Query_name	301.0	4.80e-88	86.0
Target_id	485	Query_name	128.0	1.20e-35	85.0
Target_id	556	Query_name	447.1	2.50e-132	97.0
Target_id	615	Query_name	363.4	5.70e-107	97.0
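For downstream scripts, the six columns can be mapped onto a small record type. The sketch below assumes tab-separated fields with no header line, as in the example above. HmmHit and parse_hit are hypothetical names used only here.

```python
from typing import NamedTuple


class HmmHit(NamedTuple):
    target_id: str
    target_len: int
    query_name: str
    score: float      # full sequence score
    evalue: float     # full sequence e-value
    coverage: float


def parse_hit(line):
    """Convert one tab-separated line of filtered_data.tsv into an HmmHit."""
    f = line.rstrip("\n").split("\t")
    return HmmHit(f[0], int(f[1]), f[2], float(f[3]), float(f[4]), float(f[5]))


# Two sample lines taken from the example above.
lines = [
    "Target_id\t510\tQuery_name\t134.8\t1.10e-37\t96.0\n",
    "Target_id\t485\tQuery_name\t301.0\t4.80e-88\t86.0\n",
]
hits = [parse_hit(line) for line in lines]
# Sort from most to least significant hit (smallest e-value first).
hits.sort(key=lambda h: h.evalue)
print(hits[0].score, hits[0].coverage)
```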

sources.txt

This file indicates the source database of each filtered sequence. The db_name is taken from the configuration file, formatted as shown in the README.

Target_id db_name
Target_id db_name
Target_id db_name
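A lookup from sequence id to database can be built from this file as sketched below. The example assumes two whitespace-separated fields per line and allows an id to appear on several lines with different databases. Neither point is confirmed by the description above, and read_sources is a hypothetical helper name.

```python
from collections import defaultdict


def read_sources(lines):
    """Map each sequence id to the list of database names it is listed with."""
    sources = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        seq_id, db_name = parts[0], parts[1]
        if db_name not in sources[seq_id]:
            sources[seq_id].append(db_name)
    return dict(sources)


# Hypothetical ids and database names, for illustration only.
example_lines = ["seqA uniprot\n", "seqB gut_genome\n", "seqA gut_genome\n"]
print(read_sources(example_lines))
```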

Clustering step outputs

Clustering is done with Diamond, which returns a TSV file with two columns. The first column is the representative sequence of a cluster and the second is a member of that cluster. Each cluster therefore has one line in which the two columns are identical (the representative listed as its own member), e.g:

representative_sequence member
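The two-column layout can be regrouped into one entry per cluster, as in the sketch below. It assumes whitespace-separated (tab or space) fields, and group_clusters is a hypothetical name. Because the representative also appears as a member on its own line, the length of each list is the cluster size.

```python
from collections import defaultdict


def group_clusters(lines):
    """Group member ids under their cluster representative."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        representative, member = parts
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical ids, for illustration only.
example_lines = ["rep1\trep1\n", "rep1\tseq2\n", "rep3\trep3\n"]
clusters = group_clusters(example_lines)
sizes = {rep: len(members) for rep, members in clusters.items()}
print(sizes)
```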

Candidate Selection outputs

The candidate selection step creates a directory for each candidate category. The categories are extracted from the configuration file with the candidate_selection key. In the README, the categories are:

  • strain_library
  • order
  • other

Therefore, the directories strain_library, order and other are created by the candidate selection step. Each directory contains the same files.

all_candidates.faa & all_candidates.fna

The protein sequences (.faa) and the nucleic sequences (.fna) of the candidates. The nucleic sequences are included only if a database for nucleic sequences was indicated in the configuration file.

all_candidates.tsv

File containing information on selected candidates.

  • Candidate: Id of the candidate, e.g: E3PY99
  • Cluster: Cluster's representative sequence, e.g: GUT_GENOME000603_02840
  • Cluster_size: Number of sequences in the cluster, e.g: 4
  • Sources: The databases from which the cluster sequences come, e.g: uniprot gut_genome
  • Organism: The organism, e.g: Acetoanaerobium sticklandii (strain ATCC 12662 / DSM 519 / JCM 1433 / CCUG 9281 / NCIMB 10654 / HF)
  • Tax_id: The taxonomic id, e.g: 499177
  • EMBL-GenBank-DDBJ_CDS: DNA sequence id, e.g: CBH21414.1
  • GC: GC percentage, e.g: 36.7
  • Query: The Query name
  • id: Identity percentage with the query, e.g: 43.9
  • cov: Coverage percentage with the query, e.g: 98.8
  • positives: Fraction of residues for which the alignment scores have positive values, e.g: 63.5
  • mismatch: Fraction of mismatches, e.g: 47.0
  • gaps: Fraction of gaps, e.g: 9.1
  • e-value: e-value, e.g: 2.03e-94
  • Selection_type: selection criteria in regard to the strain_library, can be tax_id, species, home_strain, order, e.g: home_strain
  • Strain_library_organism: The strain_library organism which has matched the sequence, e.g: Clostridium sticklandii
  • Strain_library_tax_id: The taxonomic id of the organism which has matched the sequence, e.g: 1511
  • Collection: Microorganism Collection, e.g: DSMZ
  • Collection_id: The id in the Collection, e.g: 519

all_candidates_ids.txt

This file is written in the same location as the category directories. It lists the ids of all candidates. This list can be used with the --update option in a later run on the same references or HMM profile.

candidate1
candidate2
candidate3

review_reference_sequences.tsv

This file is written in the same location as the category directories. It lists the reference sequences and, for each one, the number of candidates it retrieved. Only candidates that passed the filter are counted, and a candidate is counted only for the reference with which it has its lowest e-value. The file also indicates how many of these candidates have been selected.

Reference  Total  Selected
RefA       1567   32
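One possible use of this table is to compute, for each reference, the share of retrieved candidates that ended up selected. The sketch below assumes whitespace-separated columns and a header line starting with "Reference", as displayed above. read_review is a hypothetical helper, not a CAESAR function.

```python
def read_review(lines):
    """Return (reference, total, selected, selected_share) tuples."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] == "Reference":
            continue  # skip the header, blank lines and malformed lines
        reference, total, selected = parts[0], int(parts[1]), int(parts[2])
        share = selected / total if total else 0.0
        rows.append((reference, total, selected, share))
    return rows


# Sample taken from the example table above.
example_lines = ["Reference  Total  Selected", "RefA       1567   32"]
for reference, total, selected, share in read_review(example_lines):
    print(f"{reference}: {selected}/{total} selected ({share:.1%})")
```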

Phylogeny step outputs

representatives_seqs.fasta & representatives_seqs_msa.fasta

The representatives_seqs.fasta file contains the representative sequence of each cluster. This file is given to MAFFT to build a multiple sequence alignment named representatives_seqs_msa.fasta.

NB: If the --reduce option is set to 0, all the filtered sequences are used to build the MSA, which is then named filtered_sequences_msa.fasta

representative_seqs.tree

Phylogenetic tree in Newick format, built with FastTree.

NB: If the --reduce option is set to 0, all the filtered sequences are used to build the MSA, and the tree built from this MSA is named filtered_seqs.tree