COG_UK - isaric4c/wiki GitHub Wiki

Covid-19 Genomics UK Consortium

Spike protein structures showing locations of amino acid residues that are mutated in each variant of concern (VOC). The spike protein protrudes from the surface of the SARS-CoV-2 virus, is responsible for initiating binding to and entry into host cells, and is also the primary target for antibodies that recognise the virus.

Consult our own Metadata Catalogue for the following files:

  • COG UK metadata (e.g. cog_metadata.csv), covering the same patients as the alignment
  • COG UK naive variants (e.g. naive_variants.csv)
  • Consensus sequence (e.g. cog_all.fasta) in FASTA format
  • Full alignment (e.g. cog_alignment.fasta) in FASTA format
  • Unmasked alignment (e.g. cog_unmasked_alignment.fasta) in FASTA format, same set of patients as cog_all

A note about identifiers:

  • the main component of the patient identifier is the ISARIC id (in the form ABCD-0123)
  • the ISARIC id will be given a numerical suffix (.1, .2, .3 etc) because there may be multiple samples per patient
  • the FASTA identifier has that ISARIC id surrounded by /England/ and /year/ (e.g. /England/ABCD-0123.1/2020)
  • some patients have multiple ISARIC ids
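The identifier rules above can be sketched as a small parser. This is a minimal illustration, not the project's own tooling: the regular expression is inferred only from the example forms on this page (ABCD-0123 and /England/ABCD-0123.1/2020), and the names `FASTA_ID`, `SampleId` and `parse_fasta_id` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the examples on this page only:
# four upper-case letters, a hyphen, four digits, a numeric sample suffix,
# all sitting between "/England/" and "/<year>".
FASTA_ID = re.compile(
    r"/England/(?P<isaric_id>[A-Z]{4}-\d{4})\.(?P<sample>\d+)/(?P<year>\d{4})"
)


class SampleId(NamedTuple):
    isaric_id: str  # e.g. "ABCD-0123"
    sample: int     # numeric suffix distinguishing samples from one patient
    year: int       # year component of the FASTA identifier


def parse_fasta_id(header: str) -> Optional[SampleId]:
    """Return the parts of a FASTA identifier, or None if it does not match."""
    match = FASTA_ID.search(header)
    if match is None:
        return None
    return SampleId(match["isaric_id"], int(match["sample"]), int(match["year"]))


if __name__ == "__main__":
    print(parse_fasta_id("/England/ABCD-0123.1/2020"))
```

Because some patients have multiple ISARIC ids, grouping by the parsed `isaric_id` alone will not always collect every sample from one patient.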

A note about multiple samples per patient: each physical sample should be assigned exactly one COG ID, and any re-sequencing of the same sample should be submitted under that same COG ID. In practice, this is not enforced (and would be hard to enforce anyway), and submitting organisations (including PHE/UKHSA) have often issued new COG IDs for the same sample, so multiple COG IDs can refer to the same swab. De-duplicating on sample date is not entirely reliable: dates are frequently off by a day or two between reports, and it is also perfectly legitimate for one patient to have more than one genuinely distinct sample (such as a throat and a nose swab) on the same day.
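Given those caveats, one cautious approach is to flag candidate duplicates for manual review rather than drop them automatically. The sketch below is only an illustration under stated assumptions: the record layout `(patient_id, cog_id, sample_date)`, the function name `candidate_duplicates` and the COG ID values are hypothetical, and the two-day window simply mirrors the "day or two" discrepancy described above.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


def candidate_duplicates(records, max_days=2):
    """Flag pairs of COG IDs from one patient whose sample dates are close.

    `records` is an iterable of (patient_id, cog_id, sample_date) tuples;
    this layout is an assumption, not the layout of cog_metadata.csv.
    Flagged pairs are candidates only: two distinct swabs taken on the
    same day would be flagged too.
    """
    by_patient = defaultdict(list)
    for patient_id, cog_id, sample_date in records:
        by_patient[patient_id].append((cog_id, sample_date))

    pairs = []
    for patient_id, samples in by_patient.items():
        for (cog_a, day_a), (cog_b, day_b) in combinations(samples, 2):
            if abs((day_a - day_b).days) <= max_days:
                pairs.append((patient_id, cog_a, cog_b))
    return pairs


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        ("ABCD-0123", "COG-A", date(2020, 4, 1)),
        ("ABCD-0123", "COG-B", date(2020, 4, 2)),
        ("ABCD-0123", "COG-C", date(2020, 5, 1)),
        ("EFGH-0456", "COG-D", date(2020, 4, 1)),
    ]
    print(candidate_duplicates(example))
```

Comparing the consensus sequences of a flagged pair would be a natural next check, although identical sequences still do not prove that two COG IDs refer to the same swab.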

A note about the FASTA format: some of the letters are lower-case because they are masked.
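To see how much of a sequence is masked, the lower-case runs can be located directly. This is a minimal sketch that assumes masking is indicated purely by lower-case letters, as the note above says; the function names `masked_runs` and `masked_fraction` are hypothetical.

```python
import re


def masked_runs(sequence: str) -> list:
    """Return (start, end) half-open index pairs for each lower-case run."""
    return [m.span() for m in re.finditer(r"[a-z]+", sequence)]


def masked_fraction(sequence: str) -> float:
    """Return the share of letters in the sequence that are lower-case."""
    letters = [c for c in sequence if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)


if __name__ == "__main__":
    seq = "ACGTacgtNNAC"  # made-up sequence for illustration
    print(masked_runs(seq))
    print(masked_fraction(seq))
```

Tools that upper-case sequences on load will silently discard this information, so it is worth checking for masking before any such normalisation.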

Data dictionary:

Web references: