Final Project - sansuach/Sanskritiacharya GitHub Wiki
In this wiki, I will document the steps I took to analyze genomic data for the Morning Glory project using various bioinformatics tools on the ISAAC server. I began by creating a directory structure on the ISAAC server to organize my files. I navigated to the analysis directory:
cd /lustre/isaac/proj/UTK0318/final_project/analysis
Next, I created and entered the necessary subdirectories:
mkdir sachar10
cd sachar10
mkdir morning_glory
cd morning_glory
mkdir fastq_files
cd fastq_files
I uploaded my sequencing data from my local computer to the server. I used scp to transfer the files from my local desktop environment to the appropriate directory on ISAAC:
scp -r /home/mobaxterm/Desktop/Fastq_Sankriti/fastq_files/ [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/fastq_files
On the ISAAC server, I used FastQC to assess the quality of my FASTQ files:
module load gcc/10.2.0
mkdir quality_reports
fastqc *.fastq.gz -o quality_reports
I downloaded the quality reports to my local desktop for review:
scp -r [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/fastq_files/quality_reports/*fastqc.html /home/mobaxterm/Desktop/Fastq_Sankriti/quality_reports
Upon reviewing the quality reports, I noticed that the sequence quality was low, which could lead to errors in downstream analyses. Therefore, I proceeded with trimming the low-quality reads to improve the overall quality of the data. To save space, I gzipped the FASTQ files:
gzip *.fastq
After gzipping the files, I moved up two directories, created a directory for Trimmomatic, and navigated into it:
cd ../../
mkdir trimmomatic
cd trimmomatic
I linked the gzipped FASTQ files to the Trimmomatic directory:
ln -s /lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/fastq_files/*.fastq.gz .
I also copied the adapter file and the Trimmomatic JAR file from my local computer:
scp -r /home/mobaxterm/Desktop/TruSeq3-SE.fa [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic
scp -r /home/mobaxterm/Desktop/trimmomatic-0.39.jar [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic
I created a directory for the trimmed output:
mkdir trimmed_output
Then I wrote a SLURM script (nano trimmed.qsh) to use Trimmomatic for trimming low-quality reads:
#!/bin/bash
#SBATCH -J trimmed_fastq
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 02:00:00
#SBATCH --mem=16G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the module for Trimmomatic
module load trimmomatic/0.39
# Input and Output Directories
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic/trimmed_output"
# Loop through all fastq.gz files and process with Trimmomatic
for FILE in "$INPUT_DIR"/*.fastq.gz; do
BASENAME=$(basename "$FILE" .fastq.gz)
echo "Processing $FILE..." >> "$OUTPUT_DIR/processing_log.txt"
# Run Trimmomatic (options such as -trimlog must come before the input/output files;
# the adapter file TruSeq3-SE.fa was copied into $INPUT_DIR above)
if java -jar "$INPUT_DIR"/trimmomatic-0.39.jar SE -phred33 \
-trimlog "$OUTPUT_DIR/${BASENAME}_trimlog.txt" \
"$FILE" "$OUTPUT_DIR/${BASENAME}_trimmed.fastq.gz" \
ILLUMINACLIP:"$INPUT_DIR"/TruSeq3-SE.fa:2:30:10 \
LEADING:5 TRAILING:5 SLIDINGWINDOW:4:20 MINLEN:50; then
echo "Successfully processed $FILE" >> "$OUTPUT_DIR/processing_log.txt"
else
echo "Failed to process $FILE" >> "$OUTPUT_DIR/processing_log.txt"
fi
done
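Each SLURM script in this workflow was submitted to the scheduler with sbatch (the submission commands themselves are not repeated in this wiki); for example, for the script above:
sbatch trimmed.qsh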
After trimming, I assessed the quality of the trimmed reads:
cd trimmed_output
mkdir fastqc_results
module load gcc/10.2.0
fastqc *_trimmed.fastq.gz -o fastqc_results
I then downloaded the FastQC results for review:
scp -r [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic/trimmed_output/fastqc_results/*_fastqc.html /home/mobaxterm/Desktop/Fastq_Sankriti/quality_reports
After trimming the reads, the next step was to prepare the reference genome for alignment. I started by creating a directory for BWA, which is the tool I used for aligning the sequences:
cd ../../../
mkdir bwa
cd bwa
Next, I copied the reference genome from my local computer to the server using scp:
scp -r /home/mobaxterm/Desktop/Fastq_Sankriti/Reference_genome/NSP306_trifida_chr_v3.fa [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa
After copying the reference genome, I indexed it using BWA. Indexing is an important step because it prepares the reference genome for efficient alignment:
module load bwa/0.7.17
module load gcc/10.2.0
bwa index NSP306_trifida_chr_v3.fa
Indexing allows BWA to quickly find the relevant sections of the reference genome during the alignment of sequencing reads. This step is important for the performance of the alignment process.
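As a quick check (based on BWA's standard behavior), indexing should leave several companion files next to the FASTA, which can be confirmed with:
ls NSP306_trifida_chr_v3.fa.*
# Expected output: the .amb, .ann, .bwt, .pac, and .sa index files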
Next, I wrote a SLURM script (nano link.qsh) to create symbolic links for the trimmed FASTQ files in the bwa directory, in preparation for alignment:
#!/bin/bash
#SBATCH -J create_symlinks
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -q short
#SBATCH -p short
#SBATCH -t 00:10:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Source and target directories
SRC_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic/trimmed_output/"
LINK_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa"
# Loop through each file in the source directory and create a symbolic link
for FILE in $SRC_DIR/*.fastq.gz; do
BASENAME=$(basename $FILE)
ln -s $FILE $LINK_DIR/$BASENAME
done
To align the reads to the reference genome, I wrote a SLURM script (nano bwa_mapping.qsh) to run BWA-MEM:
#!/bin/bash
#SBATCH -J bwa
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 02:00:00
#SBATCH --mem=16G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the necessary modules
module load bwa/0.7.17
module load samtools/1.16.1-gcc
# Set the input paths and reference genome
REFERENCE="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/NSP306_trifida_chr_v3.fa"
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs"
# Create output directory if it doesn't exist
mkdir -p $OUTPUT_DIR
# Loop through all trimmed FASTQ files and align them using BWA
for READ in ${INPUT_DIR}/*_trimmed.fastq.gz; do
# Check if the input file exists
if [ ! -e "$READ" ]; then
echo "No input files found matching the pattern: ${INPUT_DIR}/*_trimmed.fastq.gz"
exit 1
fi
# Get the basename without the suffix
BASENAME="${READ##*/}"
BASENAME="${BASENAME%_trimmed.fastq.gz}"
OUTPUT_FILE="${OUTPUT_DIR}/${BASENAME}_aligned.sam"
# Run BWA MEM for single-end reads
echo "Running BWA MEM for $READ..."
bwa mem -t 4 $REFERENCE $READ > $OUTPUT_FILE
# Convert SAM to BAM, sort, and index
BAM_FILE="${OUTPUT_DIR}/${BASENAME}_aligned.bam"
SORTED_BAM_FILE="${OUTPUT_DIR}/${BASENAME}_aligned_sorted.bam"
echo "Converting SAM to BAM for $OUTPUT_FILE..."
samtools view -Sb $OUTPUT_FILE > $BAM_FILE
echo "Sorting BAM file $BAM_FILE..."
samtools sort $BAM_FILE -o $SORTED_BAM_FILE
echo "Indexing BAM file $SORTED_BAM_FILE..."
samtools index $SORTED_BAM_FILE
# Remove intermediate files
echo "Removing intermediate files for $BASENAME..."
rm $OUTPUT_FILE $BAM_FILE
done
After aligning the reads, I obtained sorted BAM files and their corresponding index files (sorted.bam and sorted.bam.bai). To generate alignment statistics, I moved into the aligned_outputs directory:
cd aligned_outputs
I loaded the necessary Samtools module and created a directory for storing the statistics results:
module load samtools/1.16.1-gcc
mkdir stats_results
Next, I wrote a SLURM script (nano stat_results.qsh) to generate flagstat and stats reports for the BAM files:
#!/bin/bash
#SBATCH -J samtools_stats
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 01:00:00
#SBATCH --mem=8G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the necessary module
module load samtools/1.16.1-gcc
# Set the input and output paths
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs/stats_results"
# Loop through all BAM files and generate flagstat and stats
for SORTED_BAM_FILE in ${INPUT_DIR}/*_aligned_sorted.bam; do
# Get the basename without the suffix
BASENAME="${SORTED_BAM_FILE##*/}"
BASENAME="${BASENAME%_aligned_sorted.bam}"
# Generate flagstat statistics using samtools
echo "Generating flagstat for $SORTED_BAM_FILE..."
samtools flagstat $SORTED_BAM_FILE > ${OUTPUT_DIR}/${BASENAME}_flagstat.txt
# Generate detailed statistics using samtools
echo "Generating statistics for $SORTED_BAM_FILE..."
samtools stats $SORTED_BAM_FILE > ${OUTPUT_DIR}/${BASENAME}_stats.txt
done
The flagstat.txt and stats.txt files summarize key metrics such as the total number of reads, the number of primary mapped and supplementary reads, and the percentage of properly paired reads. I previewed these files with:
head *flagstat.txt
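To pull the overall mapping rate out of every report at once, a small loop over the flagstat files works; this is a sketch that relies only on the standard "mapped (...)" line that samtools flagstat prints:
for f in *_flagstat.txt; do
echo "$f: $(grep 'mapped (' "$f" | head -n 1)"
done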
With my sorted.bam and sorted.bam.bai files ready for variant calling, I created a directory named readgroups:
mkdir readgroups
Then I wrote a SLURM batch script (nano readgroups.qsh) to add read groups to the sorted BAM files using GATK, with the following script:
#!/bin/bash
#SBATCH -J add_readgroups
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 02:00:00
#SBATCH --mem=16G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the necessary modules
module load gatk/4.2.2.0
module load samtools/1.16.1-gcc
# Set the input and output paths
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs/readgroups"
# Loop through all sorted BAM files and add read groups
for INPUT_BAM in ${INPUT_DIR}/*_aligned_sorted.bam; do
BASENAME="${INPUT_BAM##*/}"
BASENAME="${BASENAME%_aligned_sorted.bam}"
OUTPUT_BAM="${OUTPUT_DIR}/${BASENAME}_sorted.RG.bam"
# Run GATK AddOrReplaceReadGroups
echo "Adding read groups to $INPUT_BAM..."
gatk AddOrReplaceReadGroups \
-I $INPUT_BAM \
-O $OUTPUT_BAM \
-RGID 1 \
-RGLB lib1 \
-RGPL illumina \
-RGPU unit1 \
-RGSM $BASENAME
# Index the output BAM file
echo "Indexing $OUTPUT_BAM..."
samtools index $OUTPUT_BAM
done
This script adds read group information to each BAM file and indexes the output BAM files using samtools index to make them ready for subsequent steps. After running this script, I obtained _sorted.RG.bam and _sorted.RG.bam.bai files.
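To confirm that the read groups were actually added, the headers of the new BAM files can be inspected; the @RG line should show the ID, library, platform, and sample values set in the script above:
for BAM in readgroups/*_sorted.RG.bam; do
echo "== $BAM"
samtools view -H "$BAM" | grep '^@RG'
done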
Next, I moved back up to the morning_glory directory and created a new directory named gatk:
cd ../../../
mkdir gatk
Then I wrote a SLURM script (nano link.qsh) to create symbolic links for the BAM files with added read groups. I defined the source and target directories in the script and used the following commands:
#!/bin/bash
#SBATCH -J create_symlinks
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -q short
#SBATCH -p short
#SBATCH -t 00:10:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Source and target directories
SRC_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs/readgroups"
LINK_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk"
for FILE in $SRC_DIR/*_sorted.RG.*; do
BASENAME=$(basename $FILE)
ln -s $FILE $LINK_DIR/$BASENAME
done
To further organize the reference files, I wrote a SLURM script (nano referencelink.qsh) to create symbolic links for the reference genome files. The script contained the following commands:
#!/bin/bash
#SBATCH -J create_symlinks
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -q short
#SBATCH -p short
#SBATCH -t 00:10:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Source and target directories
SRC_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/"
LINK_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk"
for FILE in $SRC_DIR/NSP306_trifida_chr_v3.*; do
BASENAME=$(basename $FILE)
ln -s $FILE $LINK_DIR/$BASENAME
done
To prepare the reference files, I used Samtools to create a .fai file for the reference genome (NSP306_trifida_chr_v3.fa) by running:
samtools faidx NSP306_trifida_chr_v3.fa
I also created a dictionary file (.dict) for the reference genome using GATK's CreateSequenceDictionary command:
/spack/spack-0.17.1/apps/linux-rhel8-cascadelake/gcc-10.2.0/gatk-4.2.2.0-au34zvt6dnayvobrgrn5bmkqh4o3du7o/bin/gatk CreateSequenceDictionary -R NSP306_trifida_chr_v3.fa -O NSP306_trifida_chr_v3.dict
These files are essential for GATK to understand the reference genome structure during variant calling.
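The variant-calling script itself is not reproduced in this wiki. As a rough sketch of that step (assuming GATK HaplotypeCaller was run per sample on the read-group BAMs linked into the gatk directory, with output names matching the *gatk.vcf files counted below), it would look something like:
module load gatk/4.2.2.0
REFERENCE="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/NSP306_trifida_chr_v3.fa"
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/results1"
mkdir -p "$OUTPUT_DIR"
# Call variants for each read-group BAM; output names follow the *gatk.vcf pattern used below
for BAM in "$INPUT_DIR"/*_sorted.RG.bam; do
SAMPLE=$(basename "$BAM" _sorted.RG.bam)
gatk HaplotypeCaller -R "$REFERENCE" -I "$BAM" -O "$OUTPUT_DIR/${SAMPLE}gatk.vcf"
done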
I then proceeded to run GATK across the entire genome to generate the VCF files; this step took around 1 hour and 19 minutes. Once I had the VCF files in the results1 directory, I moved on to identifying SNPs and indels. I wrote a script (nano snps_indels.qsh) to analyze the VCF files for SNPs and indels using bcftools. In the script, I counted SNPs and indels as well as the total number of variants for each VCF file:
#!/bin/bash
#SBATCH -J snps_indels
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mem=32G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Note: bcftools must be available on the PATH for the counts below (e.g., via a module load)
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/results1"
for vcf_file in ${OUTPUT_DIR}/*gatk.vcf; do
base=$(basename "$vcf_file" "gatk.vcf")
# Count SNPs
snp_count=$(bcftools view -v snps $vcf_file | grep -v "^#" | wc -l)
# Count Indels
indel_count=$(bcftools view -v indels $vcf_file | grep -v "^#" | wc -l)
# Total Variants (SNPs + Indels)
total_count=$(bcftools view $vcf_file | grep -v "^#" | wc -l)
echo "Number of SNPs in ${base}gatk.vcf: $snp_count"
echo "Number of indels in ${base}gatk.vcf: $indel_count"
echo "Total number of variants in ${base}gatk.vcf: $total_count"
done
The counts were printed to the job's SLURM output file, which I reviewed using:
cat R-snps_indels.2716705.out
After obtaining the VCF files, I transferred them to my local system using scp and performed segregation analysis in R.
scp [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/results1/*.vcf* /home/mobaxterm/Desktop/segregation
I wrote an R script to perform segregation analysis on the VCF files of two parents and multiple progeny.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("VariantAnnotation", quietly = TRUE)) BiocManager::install("VariantAnnotation")
if (!requireNamespace("GenomicRanges", quietly = TRUE)) BiocManager::install("GenomicRanges")
library(VariantAnnotation)
library(GenomicRanges)
These commands installed and loaded the required libraries for working with genomic data. Next, I set the paths for parent and progeny VCF files and ensured the output directory existed.
vcf_dir <- "C:/Users/Sanskriti/OneDrive/Desktop/segregation"
parent1_vcf <- file.path(vcf_dir, "M9gatk.vcf")
parent2_vcf <- file.path(vcf_dir, "M19gatk.vcf")
progeny_files <- list.files(vcf_dir, pattern = "^PT.*\\.vcf$", full.names = TRUE)
output_dir <- file.path(vcf_dir, "segregation_results")
# Make sure the output directory exists before writing results
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
Then I verified that the parent VCF files were present, loaded them for analysis, normalized the variant names, and extracted the relevant variant information for each parent and progeny.
if (!file.exists(parent1_vcf)) stop("Parent 1 VCF not found: ", parent1_vcf)
if (!file.exists(parent2_vcf)) stop("Parent 2 VCF not found: ", parent2_vcf)
# Load parent VCF files
cat("Loading Parent 1 VCF...\n")
parent1 <- readVcf(parent1_vcf)
cat("Loading Parent 2 VCF...\n")
parent2 <- readVcf(parent2_vcf)
# Extract variant ranges from parents
parent1_variants <- rowRanges(parent1)
parent2_variants <- rowRanges(parent2)
# Normalize variant names (remove "chr" if present)
names(parent1_variants) <- sub("^chr", "", names(parent1_variants))
names(parent2_variants) <- sub("^chr", "", names(parent2_variants))
# Process each progeny file
for (progeny_file in progeny_files) {
cat("Processing progeny file:", progeny_file, "\n")
# Load progeny VCF
progeny <- readVcf(progeny_file)
progeny_variants <- rowRanges(progeny)
names(progeny_variants) <- sub("^chr", "", names(progeny_variants)) # Normalize variant names
# Find shared and unique variants
shared_with_parent1 <- intersect(names(progeny_variants), names(parent1_variants))
shared_with_parent2 <- intersect(names(progeny_variants), names(parent2_variants))
unique_to_progeny <- setdiff(names(progeny_variants), union(names(parent1_variants), names(parent2_variants)))
# Debugging information
cat("Total progeny variants:", length(progeny_variants), "\n")
cat("Shared with Parent 1:", length(shared_with_parent1), "\n")
cat("Shared with Parent 2:", length(shared_with_parent2), "\n")
cat("Unique to progeny:", length(unique_to_progeny), "\n")
# Create result table
results <- data.frame(
Variant_Position = names(progeny_variants),
Inherited_From_Parent1 = names(progeny_variants) %in% shared_with_parent1,
Inherited_From_Parent2 = names(progeny_variants) %in% shared_with_parent2,
Unique_To_Progeny = names(progeny_variants) %in% unique_to_progeny
)
# Save results to file
progeny_basename <- tools::file_path_sans_ext(basename(progeny_file))
output_file <- file.path(output_dir, paste0(progeny_basename, "_segregation_results.csv"))
write.csv(results, output_file, row.names = FALSE)
cat("Results saved to:", output_file, "\n")
}
cat("Segregation analysis complete for all progeny files.\n")
This loop processed each progeny, found variants shared with parents, and saved the results to CSV files.
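As a quick sanity check, one of the result files can be read back in and the inheritance columns summed; the file name below is a hypothetical example of the *_segregation_results.csv naming used above:
# Hypothetical example file; substitute an actual progeny result
res <- read.csv(file.path(output_dir, "PT1gatk_segregation_results.csv"))
# Number of variants in each inheritance category
colSums(res[, c("Inherited_From_Parent1", "Inherited_From_Parent2", "Unique_To_Progeny")])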
Then, I created a Circos plot to visualize the inheritance patterns among progeny using R's circlize package.
First, I loaded the necessary libraries and read the combined CSV files from the segregation analysis. Then, I extracted the chromosome and position information for each variant and removed any unnecessary columns to prepare the data for visualization.
Using circlize, I plotted the chromosome information and visualized each variant with a different color based on its inheritance (red for Parent 1, blue for Parent 2, and green for unique progeny variants). This helped in understanding the inheritance patterns visually.
The Circos plot script:
# Load necessary libraries
library(circlize)
library(dplyr)
# Define the directory containing your progeny files
input_directory <- "C:/Users/Sanskriti/OneDrive/Desktop/segregation/segregation_results" # Update with your actual directory path
# Read all CSV files in the directory
file_list <- list.files(input_directory, pattern = "\\.csv$", full.names = TRUE)
# Initialize an empty dataframe to store combined data
combined_data <- data.frame()
# Loop through each file and combine data
for (file in file_list) {
# Read each file, assuming they are in CSV format
data <- read.csv(file, stringsAsFactors = FALSE)
# Print the column names to ensure they are correct
print("Original Column Names:")
print(names(data))
# Remove any leading/trailing spaces in column names
names(data) <- trimws(names(data))
# Print updated column names for verification
print("Updated Column Names After Cleaning:")
print(names(data))
# Split the Variant_Position column to extract Chromosome and Position
data <- data %>%
filter(grepl(":", Variant_Position)) %>% # Ensure the format is correct with ':'
mutate(
Chromosome = sapply(strsplit(Variant_Position, ":"), `[`, 1),
Position_Info = sapply(strsplit(Variant_Position, ":"), `[`, 2),
Position = as.numeric(sapply(strsplit(Position_Info, "_"), `[`, 1)) # Extract numeric position before allele info
) %>%
select(-Position_Info) # Remove intermediate column
# Print data to confirm the creation of Chromosome and Position columns
print("Data After Extracting Chromosome and Position:")
print(head(data))
# Append to combined dataframe
combined_data <- bind_rows(combined_data, data)
}
# Check combined data structure
print("Combined Data Structure:")
str(combined_data)
# Check column names in combined data
print("Column Names in Combined Data:")
print(names(combined_data))
# Remove rows with missing values in critical columns
combined_data <- combined_data %>%
filter(!is.na(Chromosome) & !is.na(Position))
# Prepare data for Circos plot
# Grouping by Chromosome for plotting
chromosome_lengths <- combined_data %>%
group_by(Chromosome) %>%
summarize(Length = max(Position, na.rm = TRUE)) %>%
filter(Length > 0) # Ensure valid chromosome lengths
# Print chromosome_lengths to confirm correctness before plotting
print("Chromosome Lengths Data:")
print(chromosome_lengths)
# Initialize Circos plot
circos.clear()
circos.initialize(factors = chromosome_lengths$Chromosome,
xlim = cbind(rep(0, nrow(chromosome_lengths)), chromosome_lengths$Length))
# Add chromosome track
circos.trackPlotRegion(factors = chromosome_lengths$Chromosome, ylim = c(0, 1),
panel.fun = function(x, y) {
chr <- CELL_META$sector.index
circos.text(CELL_META$xcenter, CELL_META$ycenter + 0.5, chr, cex = 0.6, facing = "clockwise", niceFacing = TRUE)
circos.axis(labels.cex = 0.5)
})
# Plot inheritance tracks
for (i in 1:nrow(combined_data)) {
# Skip any rows that have NA in Position or Chromosome (though we already filtered, this is a safety check)
if (is.na(combined_data$Position[i]) || is.na(combined_data$Chromosome[i])) next
color <- ifelse(combined_data$Inherited_From_Parent1[i] == "TRUE", "#FF0000",
ifelse(combined_data$Inherited_From_Parent2[i] == "TRUE", "#0000FF",
ifelse(combined_data$Unique_To_Progeny[i] == "TRUE", "#00FF00", "#FFFFFF")))
circos.rect(sector.index = combined_data$Chromosome[i],
xleft = combined_data$Position[i],
xright = combined_data$Position[i] + 1,
ybottom = 0,
ytop = 1,
col = color,
border = color)
}
# Add a legend
legend("topleft", legend = c("Inherited from Parent 1", "Inherited from Parent 2", "Unique to Progeny"),
col = c("#FF0000", "#0000FF", "#00FF00"), pch = 15, bty = "n", cex = 0.8)
# Clear Circos plot after drawing
circos.clear()
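To keep a copy of the figure rather than only drawing it in the active graphics device, the plotting code can be wrapped in a PNG device; a minimal sketch (the file name is an arbitrary choice):
# Open a PNG device, rerun the Circos plotting code above, then close the device
png("circos_inheritance.png", width = 2000, height = 2000, res = 300)
# ... circos.initialize / track / rect / legend code from above ...
dev.off()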