2. Methods - davidlord/Biomarkers-immuno-lungcancer Wiki


Library preparation

Samples derive from NSCLC patients treated with immunotherapy. Two samples were collected from each patient: one normal sample (blood or plasma) and one tumor sample (FFPE; confirm sampling details with Anna). The samples were sent to Eurofins for sequencing: Illumina paired-end (check version) targeted sequencing using an oncogene panel (look up name) (cite?). Tumor samples = FFPE samples? Check with Anna.

Quality control

Quality control (QC) of the sequence reads (in fastq format) was performed with FastQC (cite FastQC). The reads displayed high quality in terms of Phred scores. Almost all reads exhibited high adapter content and varying degrees of sequence duplication. Low scores in these specific parameters were, to a certain extent, expected, since the reads derive from targeted sequencing, in which sequencing is preceded by PCR amplification (source). Furthermore, a large subset of reads displayed questionable per-sequence GC content. The FastQC reports from the normal and tumor samples, respectively, were aggregated in MultiQC, a software tool that combines results from various bioinformatics analyses (cite MultiQC). The normal sample reads displayed slightly higher quality than the tumor sample reads with regard to all parameters but one, adapter content, where the relationship was reversed.

Reference preparation and read mapping


HG38 (cite) was used as the reference genome in the processes described in the following sections. The reference genome was indexed with BWA (cite BWA) and Samtools (cite Samtools), generating the following index files: .amb, .ann, .bwt, .pac, and .sa (BWA), and .fai (Samtools). A sequence dictionary (.dict file) was created from the reference genome using the CreateSequenceDictionary tool in GATK (cite CreateSequenceDictionary). The reads were mapped against the reference genome using BWA, more specifically the mem algorithm (cite BWA mem). The output of the mapping step was piped to the Samtools sort function (cite Samtools sort?), emitting a .bam file for each pair of fastq input files.
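The indexing and mapping steps above could be run along the following lines. This is a sketch, not the project's actual commands; file names, thread count, and sample identifiers are placeholders:

```shell
# Index the reference for BWA (.amb, .ann, .bwt, .pac, .sa) and Samtools (.fai)
bwa index hg38.fa
samtools faidx hg38.fa

# Create the sequence dictionary (.dict) with GATK
gatk CreateSequenceDictionary -R hg38.fa

# Map reads with BWA-MEM and pipe directly into coordinate sorting,
# emitting one sorted .bam per fastq pair
bwa mem -t 8 hg38.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -o sample.sorted.bam -
```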

BAM-file processing

The previously generated .bam files were indexed using the Samtools index function (cite Samtools index), generating a .bai file for each .bam file. The .bam files were then input to the MarkDuplicates function in GATK (cite MarkDuplicates GATK/Picard), which tags reads deriving from the same DNA fragment (a result of PCR amplification during library preparation). The output .bam files were then indexed with Samtools index. Read groups were added to the .bam files using the AddOrReplaceReadGroups function in GATK (cite AddOrReplaceReadGroups GATK/Picard) with the sort order set to "coordinate". The output was subsequently indexed using the Samtools index function.
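The duplicate-marking and read-group steps could look roughly as follows. The read-group values (ID, library, platform unit, sample) are placeholders, not the project's actual metadata:

```shell
samtools index sample.sorted.bam

# Tag reads deriving from the same original DNA fragment (PCR duplicates)
gatk MarkDuplicates \
    -I sample.sorted.bam \
    -O sample.md.bam \
    -M sample.md_metrics.txt
samtools index sample.md.bam

# Add a read group; -SO coordinate sets the sort order to "coordinate"
gatk AddOrReplaceReadGroups \
    -I sample.md.bam \
    -O sample.rg.bam \
    -SO coordinate \
    -RGID sample1 -RGLB lib1 -RGPL ILLUMINA -RGPU unit1 -RGSM sample1
samtools index sample.rg.bam
```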

In preparation for the downstream base quality score recalibration process, two resource files were downloaded from the GATK resources package (cite GATK resources?) (https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle). Firstly, a .vcf file containing known SNPs in the human genome (cite Homo_sapiens_assembly38.dbsnp138.vcf) and secondly, a .vcf file containing known indels (cite Mills_and_1000G_gold_standard.indels.hg38.vcf.gz). Both reference files were indexed using the IndexFeatureFile tool in GATK (cite IndexFeatureFile GATK), generating a .tbi file for each .vcf file.

Base quality scores in the .bam files were then recalibrated in a two-step process. Firstly, the .bam files were input to the Base Quality Score Recalibrator (BaseRecalibrator, BQSR) (cite BQSR GATK), which detects and models systematic errors (artificial errors introduced by the sequencing machine) in the .bam files; one model was generated per .bam file. Secondly, the ApplyBQSR tool (cite ApplyBQSR GATK) was used to apply the model previously generated by BaseRecalibrator to each .bam file, adjusting the base quality scores. The output of this process was used as input for the downstream somatic variant calling (described in a later section).
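The resource indexing and the two BQSR steps could be sketched as below, assuming the GATK4-style flags; file names are placeholders apart from the two resource files named in the text:

```shell
# Index the known-sites resources (one .tbi per .vcf)
gatk IndexFeatureFile -I Homo_sapiens_assembly38.dbsnp138.vcf
gatk IndexFeatureFile -I Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

# Step 1: build a model of systematic base-quality errors per .bam file
gatk BaseRecalibrator \
    -I sample.rg.bam \
    -R hg38.fa \
    --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
    --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    -O sample.recal.table

# Step 2: apply the model, adjusting the base quality scores
gatk ApplyBQSR \
    -I sample.rg.bam \
    -R hg38.fa \
    --bqsr-recal-file sample.recal.table \
    -O sample.bqsr.bam
```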

The .bam files were further processed in preparation for the downstream MSI analysis. Indels were realigned, in order to reduce false positives, in a two-step process. Firstly, the RealignerTargetCreator tool (GATK) (cite RealignerTargetCreator) was used to identify target intervals in the .bam files. Secondly, IndelRealigner (GATK) (cite IndelRealigner) was used to perform local realignment within the target intervals (specified by the output of the previous step). Note that GATK version 3.8-10 was used for this two-step indel realignment, whereas GATK 4.2.20 was used for all other GATK processes.
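Since this step uses GATK 3.x, the tools are invoked from a single jar with the -T flag rather than via the gatk wrapper. A sketch, with placeholder file names:

```shell
# GATK 3.x: identify intervals that need realignment around indels
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R hg38.fa -I sample.bqsr.bam -o sample.intervals

# GATK 3.x: perform local realignment within those intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R hg38.fa -I sample.bqsr.bam \
    -targetIntervals sample.intervals -o sample.realigned.bam
```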

MSI assessment

For the MSI analysis described in this section, the NGS-based MSI software MSIsensor-pro (cite MSIsensor-pro) was used. Firstly, microsatellites were identified in the reference genome using the 'scan' function, generating a .sites file listing the microsatellite sites. To assess the %MSI in each patient tumor sample, the 'msi' function was run on paired normal-tumor processed .bam files (indels realigned), emitting a summary file of MSI statistics for each .bam file pair.
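The two MSIsensor-pro steps could be run roughly as follows (output prefix and .bam names are placeholders):

```shell
# Scan the reference once for microsatellite sites (list-format .sites file)
msisensor-pro scan -d hg38.fa -o hg38.sites

# Paired normal-tumor MSI scoring; writes a summary of MSI statistics
# (including %MSI) under the given output prefix
msisensor-pro msi -d hg38.sites \
    -n normal.realigned.bam -t tumor.realigned.bam \
    -o patient1_msi
```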

Somatic variant calling

In preparation for the somatic variant calling, a number of resource files were downloaded from the GATK resources bundle. Firstly, an intervals list for HG38 was downloaded; this file is used to restrict the downstream analyses to smaller intervals at a time, increasing the efficiency of these processes. Secondly, a .vcf file containing population germline allele frequencies for HG38 was downloaded, along with its index (.tbi) file. This germline population resource was used to weigh the evidence for alleles being somatic or germline in downstream analyses.

Panel of Normals

A panel of normals (PoN) was generated through a series of steps; the PoN was later included as a parameter in multiple downstream processes, with the purpose of filtering out artifacts deriving from sequencing and mapping errors. Before generating the PoN, a germline population resource (a simplified version of a gnomAD resource, containing allele frequencies only) was downloaded in .vcf format from the GATK somatic variant calling best practices resource bundle (https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38;tab=objects). All normal (germline) processed .bam files were included when generating the PoN. Firstly, the somatic variant caller Mutect2 (cite Mutect2) was run on the normal samples, with the same sensitivity as that used in somatic variant calling, generating a .vcf file for each .bam file. Secondly, a datastore containing all variants from the generated .vcf files was created using the GenomicsDBImport function (cite GenomicsDBImport GATK).

Finally, the datastore and the previously downloaded germline population resource were used to create a PoN in .vcf format, using the CreateSomaticPanelOfNormals function (cite CreateSomaticPanelOfNormals GATK). The germline gnomAD resource allowed germline variants to be distinguished from somatic variants in the PoN. The PoN retains sites that were called in two or more samples, deeming these sites prone to technical error.
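The three PoN steps could be sketched as below, assuming the GATK4 flags commonly used for this workflow (the gnomAD and intervals file names are placeholders matching the resources described above):

```shell
# 1) Call each normal in tumor-only mode; --max-mnp-distance 0 keeps the
#    output compatible with GenomicsDBImport
gatk Mutect2 -R hg38.fa -I normal1.bam \
    --max-mnp-distance 0 -O normal1.vcf.gz

# 2) Consolidate the per-normal calls into a GenomicsDB datastore
gatk GenomicsDBImport -R hg38.fa \
    --genomicsdb-workspace-path pon_db \
    -L targets.interval_list \
    -V normal1.vcf.gz -V normal2.vcf.gz

# 3) Build the PoN; the gnomAD allele frequencies separate germline
#    variation from recurrent technical artifacts
gatk CreateSomaticPanelOfNormals -R hg38.fa \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    -V gendb://pon_db -O pon.vcf.gz
```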

Variant calling with Mutect2

Paired normal-tumor processed .bam files were input to the somatic variant caller Mutect2 (cite Mutect2) (GATK), emitting an unfiltered .vcf file for each .bam file pair. Mutect2 identifies candidate somatic variants through a series of steps. Initially, active regions are identified and de novo assembly is performed on these, emitting candidate haplotypes. The likelihoods of these haplotypes are then determined with a pair-HMM model. As in the previous CreateSomaticPanelOfNormals process, the gnomAD population germline resource was passed as a parameter to Mutect2, in order to prevent germline variants from being identified as somatic. Additionally, germline variants are subtracted from the somatic variants through the paired normal-tumor input, thus excluding rare germline variants not captured by the germline resource. The previously generated PoN was passed as an additional parameter, to exclude the technical artifacts and mapping errors captured by the PoN. Finally, an extra parameter was passed to Mutect2, making it also emit an f1r2 file alongside each output .vcf file. The f1r2 files contain raw forward- and reverse-read data and were used as input for the downstream LearnReadOrientationModel process (described in the next section).
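A single paired call combining all of the parameters above could look roughly like this (sample and file names are placeholders; the normal sample name must match the SM tag in the normal's read group):

```shell
gatk Mutect2 \
    -R hg38.fa \
    -I tumor.bam -I normal.bam \
    -normal normal_sample_name \
    --germline-resource af-only-gnomad.hg38.vcf.gz \
    --panel-of-normals pon.vcf.gz \
    --f1r2-tar-gz patient1.f1r2.tar.gz \
    -O patient1.unfiltered.vcf.gz
```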

Variant filtering

Variant filtering of the previously generated .vcf files was conducted through a series of steps: first a set of preparatory steps, followed by the actual variant filtering with the FilterMutectCalls function (cite FilterMutectCalls) (GATK). Variants were filtered with respect to technical artifacts, cross-sample contamination, and orientation bias (common in FFPE samples).

Read orientation artifact filter

In order to account for the base substitutions frequent in FFPE samples, a read orientation artifact filter was applied. The aforementioned f1r2 files (additional output of Mutect2) were input to the LearnReadOrientationModel function (cite LearnReadOrientationModel) (GATK), generating an orientation model for each input file. An orientation model captures whether or not alternate alleles are biased toward the forward or the reverse read. Each orientation model was later passed as an additional parameter to the downstream FilterMutectCalls, allowing filtering of base substitutions that likely arose from FFPE preservation of the tumor tissue.
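This step is a single command per patient; a sketch with placeholder file names:

```shell
# Learn the read orientation bias model from the Mutect2 f1r2 output
gatk LearnReadOrientationModel \
    -I patient1.f1r2.tar.gz \
    -O patient1.read-orientation-model.tar.gz
```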

Cross-sample contamination assessment

Cross-sample contamination in the tumor samples was estimated in a two-step process. In preparation for the first step, GetPileupSummaries, a population resource was downloaded from the GATK somatic variant calling best practices resource bundle (https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38;tab=objects). This resource consists of a .vcf file containing common biallelic germline variants along with their respective allele frequencies. Subsequently, GetPileupSummaries (cite) (GATK) was run on the normal and tumor .bam files, emitting for each .bam input a table file containing, for a set of variant sites, counts of reads supporting the reference and alternate alleles.

Next, the CalculateContamination tool (cite) (GATK) was run, in matched normal mode, on the table files emitted in the previous step. CalculateContamination estimates the degree of cross-sample contamination by comparing, across samples, homozygous sites for alleles with low allele frequency (given by the population resource). This estimation is based on the assumption that rare alleles observed in separate samples are remnants of cross-sample contamination. For each input, CalculateContamination emitted a file containing the estimated contamination fraction, along with a tumor segmentation file.
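The two contamination steps could be run along these lines. The biallelic resource name is an assumption (a commonly used file from the same best practices bucket); other names are placeholders:

```shell
# Pileups at common biallelic germline sites, for tumor and normal alike
gatk GetPileupSummaries -I tumor.bam \
    -V small_exac_common_3.hg38.vcf.gz \
    -L small_exac_common_3.hg38.vcf.gz \
    -O tumor.pileups.table
gatk GetPileupSummaries -I normal.bam \
    -V small_exac_common_3.hg38.vcf.gz \
    -L small_exac_common_3.hg38.vcf.gz \
    -O normal.pileups.table

# Matched normal mode: the normal's table is passed via -matched
gatk CalculateContamination \
    -I tumor.pileups.table \
    -matched normal.pileups.table \
    --tumor-segmentation patient1.segments.table \
    -O patient1.contamination.table
```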


Finally, FilterMutectCalls was applied to the unfiltered .vcf output from Mutect2, filtering variants based on, among other criteria, sequence context artifacts.
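A sketch of the final filtering call, passing in the contamination, segmentation, and orientation model files from the preceding steps (placeholder file names):

```shell
gatk FilterMutectCalls \
    -R hg38.fa \
    -V patient1.unfiltered.vcf.gz \
    --contamination-table patient1.contamination.table \
    --tumor-segmentation patient1.segments.table \
    --ob-priors patient1.read-orientation-model.tar.gz \
    -O patient1.filtered.vcf.gz
```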

Variant annotation

Annotations were added to the detected variants using the GATK tool Funcotator (cite). Funcotator annotates variants with regard to their function; this information is retrieved from a set of data sources provided as parameters when running Funcotator.
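A sketch of the annotation call, assuming a locally downloaded Funcotator data sources directory (path and file names are placeholders):

```shell
gatk Funcotator \
    -R hg38.fa \
    -V patient1.filtered.vcf.gz \
    --ref-version hg38 \
    --data-sources-path funcotator_dataSources \
    --output-file-format VCF \
    -O patient1.annotated.vcf
```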

Mutational Signatures

Which software to use? Different tools seem to produce different output formats; we want to choose one whose output is suitable for the downstream analysis, i.e. modelling in a machine learning model.

mutSignatures vs MuSiCa: mutSignatures seems harder to run; good visualization options on output.

Tumor mutation burden (TMB)

TMB was calculated as: number of non-synonymous somatic mutations (SNVs and indels) in coding regions, divided by the number of megabases (Mb) of coding sequence covered.
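The formula is simple enough to state as code; a minimal sketch (the function name and the 1.5 Mb panel size in the example are illustrative, not the project's actual panel size):

```python
def tumor_mutation_burden(n_nonsynonymous: int, panel_size_bp: int) -> float:
    """TMB = non-synonymous somatic mutations (SNVs + indels) in coding
    regions, divided by the sequenced coding territory in megabases."""
    panel_size_mb = panel_size_bp / 1_000_000
    return n_nonsynonymous / panel_size_mb

# e.g. 45 non-synonymous calls over a hypothetical 1.5 Mb panel
print(tumor_mutation_burden(45, 1_500_000))  # 30.0 mutations/Mb
```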

Principal Component Analysis of variants