Final Project - sansuach/Sanskritiacharya GitHub Wiki
In this wiki, I will document the steps I took to analyze genomic data for the Morning Glory project using various bioinformatics tools on the ISAAC server. I began by creating a directory structure on the ISAAC server to organize my files. I navigated to the analysis directory:
cd /lustre/isaac/proj/UTK0318/final_project/analysis
Next, I created and entered the necessary subdirectories:
mkdir sachar10
cd sachar10
mkdir morning_glory
cd morning_glory
mkdir fastq_files
cd fastq_files
I uploaded my sequencing data from my local computer to the server. I used scp to transfer the files from my local desktop environment to the appropriate directory on ISAAC:
scp -r /home/mobaxterm/Desktop/Fastq_Sankriti/fastq_files/ [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/fastq_files
On the ISAAC server, I used FastQC to assess the quality of my FASTQ files:
module load gcc/10.2.0
mkdir quality_reports
fastqc *.fastq.gz -o quality_reports
I downloaded the quality reports to my local desktop for review:
scp -r [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/fastq_files/quality_reports/*fastqc.html /home/mobaxterm/Desktop/Fastq_Sankriti/quality_reports
Upon reviewing the quality reports, I noticed that the sequence quality was low, which could lead to errors in downstream analyses. Therefore, I proceeded with trimming the low-quality reads to improve the overall quality of the data. To save space, I gzipped the FASTQ files:
gzip *.fastq
After gzipping the files, I moved up two directories, created a directory for Trimmomatic, and navigated into it:
cd ../../
mkdir trimmomatic
cd trimmomatic
I linked the gzipped FASTQ files to the Trimmomatic directory:
ln -s /lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/fastq_files/*.fastq.gz .
I also copied the adapter file and the Trimmomatic JAR file from my local computer:
scp -r /home/mobaxterm/Desktop/TruSeq3-SE.fa [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic
scp -r /home/mobaxterm/Desktop/trimmomatic-0.39.jar [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic
I created a directory for the trimmed output:
mkdir trimmed_output
Then I wrote a SLURM script (nano trimmed.qsh) to use Trimmomatic for trimming low-quality reads:
#!/bin/bash
#SBATCH -J trimmed_fastq
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 02:00:00
#SBATCH --mem=16G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the module for Trimmomatic
module load trimmomatic/0.39
# Input and Output Directories
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic/trimmed_output"
# Loop through all fastq.gz files and process with Trimmomatic
for FILE in "$INPUT_DIR"/*.fastq.gz; do
BASENAME=$(basename "$FILE" .fastq.gz)
echo "Processing $FILE..." >> "$OUTPUT_DIR/processing_log.txt"
# Run Trimmomatic (options such as -trimlog must come before the input/output files;
# the adapter file TruSeq3-SE.fa was copied into $INPUT_DIR above)
if java -jar "$INPUT_DIR"/trimmomatic-0.39.jar SE -phred33 \
-trimlog "$OUTPUT_DIR/${BASENAME}_trimlog.txt" \
"$FILE" "$OUTPUT_DIR/${BASENAME}_trimmed.fastq.gz" \
ILLUMINACLIP:"$INPUT_DIR"/TruSeq3-SE.fa:2:30:10 \
LEADING:5 TRAILING:5 SLIDINGWINDOW:4:20 MINLEN:50; then
echo "Successfully processed $FILE" >> "$OUTPUT_DIR/processing_log.txt"
else
echo "Failed to process $FILE" >> "$OUTPUT_DIR/processing_log.txt"
fi
done
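Each SLURM script in this workflow was submitted to the scheduler with sbatch (the submission commands themselves are not repeated in this wiki); for example, for the script above:
sbatch trimmed.qsh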
After trimming, I assessed the quality of the trimmed reads:
cd trimmed_output
mkdir fastqc_results
module load gcc/10.2.0
fastqc *_trimmed.fastq.gz -o fastqc_results
I then downloaded the FastQC results for review:
scp -r [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic/trimmed_output/fastqc_results/*_fastqc.html /home/mobaxterm/Desktop/Fastq_Sankriti/quality_reports
After trimming the reads, the next step was to prepare the reference genome for alignment. I started by creating a directory for BWA, which is the tool I used for aligning the sequences:
cd ../../../
mkdir bwa
cd bwa
Next, I copied the reference genome from my local computer to the server using scp:
scp -r /home/mobaxterm/Desktop/Fastq_Sankriti/Reference_genome/NSP306_trifida_chr_v3.fa [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa
After copying the reference genome, I indexed it using BWA. Indexing is an important step because it prepares the reference genome for efficient alignment:
module load bwa/0.7.17
module load gcc/10.2.0
bwa index NSP306_trifida_chr_v3.fa
Indexing allows BWA to quickly find the relevant sections of the reference genome during the alignment of sequencing reads. This step is important for the performance of the alignment process.
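As a quick check (based on BWA's standard behavior), indexing should leave several companion files next to the FASTA, which can be confirmed with:
ls NSP306_trifida_chr_v3.fa.*
# Expected output: the .amb, .ann, .bwt, .pac, and .sa index files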
Next, I wrote a SLURM script (nano link.qsh) to create symbolic links for the trimmed FASTQ files in the bwa directory, in preparation for alignment:
#!/bin/bash
#SBATCH -J create_symlinks
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -q short
#SBATCH -p short
#SBATCH -t 00:10:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Source and target directories
SRC_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/trimmomatic/trimmed_output/"
LINK_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa"
# Loop through each file in the source directory and create a symbolic link
for FILE in $SRC_DIR/*.fastq.gz; do
BASENAME=$(basename $FILE)
ln -s $FILE $LINK_DIR/$BASENAME
done
To align the reads to the reference genome, I wrote a SLURM script (nano bwa_mapping.qsh) to run BWA-MEM:
#!/bin/bash
#SBATCH -J bwa
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 02:00:00
#SBATCH --mem=16G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the necessary modules
module load bwa/0.7.17
module load samtools/1.16.1-gcc
# Set the input paths and reference genome
REFERENCE="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/NSP306_trifida_chr_v3.fa"
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs"
# Create output directory if it doesn't exist
mkdir -p $OUTPUT_DIR
# Loop through all trimmed FASTQ files and align them using BWA
for READ in ${INPUT_DIR}/*_trimmed.fastq.gz; do
# Check if the input file exists
if [ ! -e "$READ" ]; then
echo "No input files found matching the pattern: ${INPUT_DIR}/*_trimmed.fastq.gz"
exit 1
fi
# Get the basename without the suffix
BASENAME="${READ##*/}"
BASENAME="${BASENAME%_trimmed.fastq.gz}"
OUTPUT_FILE="${OUTPUT_DIR}/${BASENAME}_aligned.sam"
# Run BWA MEM for single-end reads
echo "Running BWA MEM for $READ..."
bwa mem -t 4 $REFERENCE $READ > $OUTPUT_FILE
# Convert SAM to BAM, sort, and index
BAM_FILE="${OUTPUT_DIR}/${BASENAME}_aligned.bam"
SORTED_BAM_FILE="${OUTPUT_DIR}/${BASENAME}_aligned_sorted.bam"
echo "Converting SAM to BAM for $OUTPUT_FILE..."
samtools view -Sb $OUTPUT_FILE > $BAM_FILE
echo "Sorting BAM file $BAM_FILE..."
samtools sort $BAM_FILE -o $SORTED_BAM_FILE
echo "Indexing BAM file $SORTED_BAM_FILE..."
samtools index $SORTED_BAM_FILE
# Remove intermediate files
echo "Removing intermediate files for $BASENAME..."
rm $OUTPUT_FILE $BAM_FILE
done
After aligning the reads, I obtained sorted BAM files and their corresponding index files (sorted.bam and sorted.bam.bai). To generate alignment statistics, I moved into the aligned_outputs directory:
cd aligned_outputs
I loaded the necessary Samtools module and created a directory for storing the statistics results:
module load samtools/1.16.1-gcc
mkdir stats_results
Next, I wrote a SLURM script (nano stat_results.qsh) to generate flagstat and stats reports for the BAM files:
#!/bin/bash
#SBATCH -J samtools_stats
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 01:00:00
#SBATCH --mem=8G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the necessary module
module load samtools/1.16.1-gcc
# Set the input and output paths
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs/stats_results"
# Loop through all BAM files and generate flagstat and stats
for SORTED_BAM_FILE in ${INPUT_DIR}/*_aligned_sorted.bam; do
# Get the basename without the suffix
BASENAME="${SORTED_BAM_FILE##*/}"
BASENAME="${BASENAME%_aligned_sorted.bam}"
# Generate flagstat statistics using samtools
echo "Generating flagstat for $SORTED_BAM_FILE..."
samtools flagstat $SORTED_BAM_FILE > ${OUTPUT_DIR}/${BASENAME}_flagstat.txt
# Generate detailed statistics using samtools
echo "Generating statistics for $SORTED_BAM_FILE..."
samtools stats $SORTED_BAM_FILE > ${OUTPUT_DIR}/${BASENAME}_stats.txt
done
The flagstat.txt and stats.txt files summarize key metrics such as the total number of reads, the number of primary mapped and supplementary reads, and the percentage of properly paired reads. I previewed these files with:
head *flagstat.txt
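To pull the overall mapping rate out of every report at once, a small loop over the flagstat files works; this is a sketch that relies only on the standard "mapped (...)" line that samtools flagstat prints:
for f in *_flagstat.txt; do
echo "$f: $(grep 'mapped (' "$f" | head -n 1)"
done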
With my sorted.bam and sorted.bam.bai files ready for variant calling, I created a directory named readgroups:
mkdir readgroups
Then I wrote a SLURM batch script (nano readgroups.qsh) to add read groups to the sorted BAM files using GATK, with the following script:
#!/bin/bash
#SBATCH -J add_readgroups
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 02:00:00
#SBATCH --mem=16G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load the necessary modules
module load gatk/4.2.2.0
module load samtools/1.16.1-gcc
# Set the input and output paths
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs/readgroups"
# Loop through all sorted BAM files and add read groups
for INPUT_BAM in ${INPUT_DIR}/*_aligned_sorted.bam; do
BASENAME="${INPUT_BAM##*/}"
BASENAME="${BASENAME%_aligned_sorted.bam}"
OUTPUT_BAM="${OUTPUT_DIR}/${BASENAME}_sorted.RG.bam"
# Run GATK AddOrReplaceReadGroups
echo "Adding read groups to $INPUT_BAM..."
gatk AddOrReplaceReadGroups \
-I $INPUT_BAM \
-O $OUTPUT_BAM \
-RGID 1 \
-RGLB lib1 \
-RGPL illumina \
-RGPU unit1 \
-RGSM $BASENAME
# Index the output BAM file
echo "Indexing $OUTPUT_BAM..."
samtools index $OUTPUT_BAM
done
This script adds read group information to each BAM file and indexes the output BAM files using samtools index to make them ready for subsequent steps. After running this script, I obtained _sorted.RG.bam and _sorted.RG.bam.bai files.
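To confirm that the read groups were actually added, the headers of the new BAM files can be inspected; the @RG line should show the ID, library, platform, and sample values set in the script above:
for BAM in readgroups/*_sorted.RG.bam; do
echo "== $BAM"
samtools view -H "$BAM" | grep '^@RG'
done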
Next, I moved back up to the morning_glory directory and created a new directory named gatk:
cd ../../../
mkdir gatk
Then I wrote a SLURM script (nano link.qsh) to create symbolic links for the BAM files with added read groups. I defined the source and target directories in the script and used the following commands:
#!/bin/bash
#SBATCH -J create_symlinks
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -q short
#SBATCH -p short
#SBATCH -t 00:10:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Source and target directories
SRC_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/aligned_outputs/readgroups"
LINK_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk"
for FILE in $SRC_DIR/*_sorted.RG.*; do
BASENAME=$(basename $FILE)
ln -s $FILE $LINK_DIR/$BASENAME
done
To further organize the reference files, I wrote a SLURM script (nano referencelink.qsh) to create symbolic links for the reference genome files. The script contained the following commands:
#!/bin/bash
#SBATCH -J create_symlinks
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -q short
#SBATCH -p short
#SBATCH -t 00:10:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Source and target directories
SRC_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/bwa/"
LINK_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk"
for FILE in $SRC_DIR/NSP306_trifida_chr_v3.*; do
BASENAME=$(basename $FILE)
ln -s $FILE $LINK_DIR/$BASENAME
done
To prepare the reference files, I used Samtools to create a .fai file for the reference genome (NSP306_trifida_chr_v3.fa) by running:
samtools faidx NSP306_trifida_chr_v3.fa
I also created a dictionary file (.dict) for the reference genome using GATK's CreateSequenceDictionary command:
/spack/spack-0.17.1/apps/linux-rhel8-cascadelake/gcc-10.2.0/gatk-4.2.2.0-au34zvt6dnayvobrgrn5bmkqh4o3du7o/bin/gatk CreateSequenceDictionary -R NSP306_trifida_chr_v3.fa -O NSP306_trifida_chr_v3.dict
These files are essential for GATK to understand the reference genome structure during variant calling.
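The variant-calling script itself is not reproduced in this wiki. As a rough sketch of that step (assuming GATK HaplotypeCaller was run per sample on the read-group BAMs linked into the gatk directory, with output names matching the *gatk.vcf files counted below), it would look something like:
module load gatk/4.2.2.0
REFERENCE="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/NSP306_trifida_chr_v3.fa"
INPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk"
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/results1"
mkdir -p "$OUTPUT_DIR"
# Call variants for each read-group BAM; output names follow the *gatk.vcf pattern used below
for BAM in "$INPUT_DIR"/*_sorted.RG.bam; do
SAMPLE=$(basename "$BAM" _sorted.RG.bam)
gatk HaplotypeCaller -R "$REFERENCE" -I "$BAM" -O "$OUTPUT_DIR/${SAMPLE}gatk.vcf"
done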
I then proceeded to run GATK across the entire genome to generate the VCF files; this step took around 1 hour and 19 minutes. Once I had the VCF files in the results1 directory, I moved on to identifying SNPs and indels. I wrote a script (nano snps_indels.qsh) to analyze the VCF files for SNPs and indels using bcftools. In the script, I counted SNPs and indels as well as the total number of variants for each VCF file:
#!/bin/bash
#SBATCH -J snps_indels
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mem=32G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Note: bcftools must be available on the PATH for the counts below (e.g., via a module load)
OUTPUT_DIR="/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/results1"
for vcf_file in ${OUTPUT_DIR}/*gatk.vcf; do
base=$(basename "$vcf_file" "gatk.vcf")
# Count SNPs
snp_count=$(bcftools view -v snps $vcf_file | grep -v "^#" | wc -l)
# Count Indels
indel_count=$(bcftools view -v indels $vcf_file | grep -v "^#" | wc -l)
# Total Variants (SNPs + Indels)
total_count=$(bcftools view $vcf_file | grep -v "^#" | wc -l)
echo "Number of SNPs in ${base}gatk.vcf: $snp_count"
echo "Number of indels in ${base}gatk.vcf: $indel_count"
echo "Total number of variants in ${base}gatk.vcf: $total_count"
done
The counts were printed to the job's SLURM output file, which I reviewed using:
cat R-snps_indels.2716705.out
After obtaining the VCF files, I transferred them to my local system using scp and performed segregation analysis in R.
scp [email protected]:/lustre/isaac/proj/UTK0318/final_project/analysis/sachar10/morning_glory/gatk/results1/*.vcf* /home/mobaxterm/Desktop/segregation
I wrote an R script to perform segregation analysis on the VCF files of two parents and multiple progeny.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("VariantAnnotation", quietly = TRUE)) BiocManager::install("VariantAnnotation")
if (!requireNamespace("GenomicRanges", quietly = TRUE)) BiocManager::install("GenomicRanges")
library(VariantAnnotation)
library(GenomicRanges)
These commands installed and loaded the required libraries for working with genomic data. Next, I set the paths for parent and progeny VCF files and ensured the output directory existed.
vcf_dir <- "C:/Users/Sanskriti/OneDrive/Desktop/segregation"
parent1_vcf <- file.path(vcf_dir, "M9gatk.vcf")
parent2_vcf <- file.path(vcf_dir, "M19gatk.vcf")
progeny_files <- list.files(vcf_dir, pattern = "^PT.*\\.vcf$", full.names = TRUE)
output_dir <- file.path(vcf_dir, "segregation_results")
# Make sure the output directory exists before writing results
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
Then I verified that the parent VCF files were present, loaded them for analysis, normalized the variant names, and extracted the relevant variant information for each parent and progeny.
if (!file.exists(parent1_vcf)) stop("Parent 1 VCF not found: ", parent1_vcf)
if (!file.exists(parent2_vcf)) stop("Parent 2 VCF not found: ", parent2_vcf)
# Load parent VCF files
cat("Loading Parent 1 VCF...\n")
parent1 <- readVcf(parent1_vcf)
cat("Loading Parent 2 VCF...\n")
parent2 <- readVcf(parent2_vcf)
# Extract variant ranges from parents
parent1_variants <- rowRanges(parent1)
parent2_variants <- rowRanges(parent2)
# Normalize variant names (remove "chr" if present)
names(parent1_variants) <- sub("^chr", "", names(parent1_variants))
names(parent2_variants) <- sub("^chr", "", names(parent2_variants))
# Process each progeny file
for (progeny_file in progeny_files) {
cat("Processing progeny file:", progeny_file, "\n")
# Load progeny VCF
progeny <- readVcf(progeny_file)
progeny_variants <- rowRanges(progeny)
names(progeny_variants) <- sub("^chr", "", names(progeny_variants)) # Normalize variant names
# Find shared and unique variants
shared_with_parent1 <- intersect(names(progeny_variants), names(parent1_variants))
shared_with_parent2 <- intersect(names(progeny_variants), names(parent2_variants))
unique_to_progeny <- setdiff(names(progeny_variants), union(names(parent1_variants), names(parent2_variants)))
# Debugging information
cat("Total progeny variants:", length(progeny_variants), "\n")
cat("Shared with Parent 1:", length(shared_with_parent1), "\n")
cat("Shared with Parent 2:", length(shared_with_parent2), "\n")
cat("Unique to progeny:", length(unique_to_progeny), "\n")
# Create result table
results <- data.frame(
Variant_Position = names(progeny_variants),
Inherited_From_Parent1 = names(progeny_variants) %in% shared_with_parent1,
Inherited_From_Parent2 = names(progeny_variants) %in% shared_with_parent2,
Unique_To_Progeny = names(progeny_variants) %in% unique_to_progeny
)
# Save results to file
progeny_basename <- tools::file_path_sans_ext(basename(progeny_file))
output_file <- file.path(output_dir, paste0(progeny_basename, "_segregation_results.csv"))
write.csv(results, output_file, row.names = FALSE)
cat("Results saved to:", output_file, "\n")
}
cat("Segregation analysis complete for all progeny files.\n")
This loop processed each progeny, found variants shared with parents, and saved the results to CSV files.
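As a quick sanity check, one of the result files can be read back in and the inheritance columns summed; the file name below is a hypothetical example of the *_segregation_results.csv naming used above:
# Hypothetical example file; substitute an actual progeny result
res <- read.csv(file.path(output_dir, "PT1gatk_segregation_results.csv"))
# Number of variants in each inheritance category
colSums(res[, c("Inherited_From_Parent1", "Inherited_From_Parent2", "Unique_To_Progeny")])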
Then, I created a Circos plot to visualize the inheritance patterns among progeny using R's circlize package.
First, I loaded the necessary libraries and read the combined CSV files from the segregation analysis. Then, I extracted the chromosome and position information for each variant and removed any unnecessary columns to prepare the data for visualization.
Using circlize, I plotted the chromosome information and visualized each variant with a different color based on its inheritance (red for Parent 1, blue for Parent 2, and green for unique progeny variants). This helped in understanding the inheritance patterns visually.
The Circos plot script:
# Load necessary libraries
library(circlize)
library(dplyr)
# Define the directory containing your progeny files
input_directory <- "C:/Users/Sanskriti/OneDrive/Desktop/segregation/segregation_results" # Update with your actual directory path
# Read all CSV files in the directory
file_list <- list.files(input_directory, pattern = "\\.csv$", full.names = TRUE)
# Initialize an empty dataframe to store combined data
combined_data <- data.frame()
# Loop through each file and combine data
for (file in file_list) {
# Read each file, assuming they are in CSV format
data <- read.csv(file, stringsAsFactors = FALSE)
# Print the column names to ensure they are correct
print("Original Column Names:")
print(names(data))
# Remove any leading/trailing spaces in column names
names(data) <- trimws(names(data))
# Print updated column names for verification
print("Updated Column Names After Cleaning:")
print(names(data))
# Split the Variant_Position column to extract Chromosome and Position
data <- data %>%
filter(grepl(":", Variant_Position)) %>% # Ensure the format is correct with ':'
mutate(
Chromosome = sapply(strsplit(Variant_Position, ":"), `[`, 1),
Position_Info = sapply(strsplit(Variant_Position, ":"), `[`, 2),
Position = as.numeric(sapply(strsplit(Position_Info, "_"), `[`, 1)) # Extract numeric position before allele info
) %>%
select(-Position_Info) # Remove intermediate column
# Print data to confirm the creation of Chromosome and Position columns
print("Data After Extracting Chromosome and Position:")
print(head(data))
# Append to combined dataframe
combined_data <- bind_rows(combined_data, data)
}
# Check combined data structure
print("Combined Data Structure:")
str(combined_data)
# Check column names in combined data
print("Column Names in Combined Data:")
print(names(combined_data))
# Remove rows with missing values in critical columns
combined_data <- combined_data %>%
filter(!is.na(Chromosome) & !is.na(Position))
# Prepare data for Circos plot
# Grouping by Chromosome for plotting
chromosome_lengths <- combined_data %>%
group_by(Chromosome) %>%
summarize(Length = max(Position, na.rm = TRUE)) %>%
filter(Length > 0) # Ensure valid chromosome lengths
# Print chromosome_lengths to confirm correctness before plotting
print("Chromosome Lengths Data:")
print(chromosome_lengths)
# Initialize Circos plot
circos.clear()
circos.initialize(factors = chromosome_lengths$Chromosome,
xlim = cbind(rep(0, nrow(chromosome_lengths)), chromosome_lengths$Length))
# Add chromosome track
circos.trackPlotRegion(factors = chromosome_lengths$Chromosome, ylim = c(0, 1),
panel.fun = function(x, y) {
chr <- CELL_META$sector.index
circos.text(CELL_META$xcenter, CELL_META$ycenter + 0.5, chr, cex = 0.6, facing = "clockwise", niceFacing = TRUE)
circos.axis(labels.cex = 0.5)
})
# Plot inheritance tracks
for (i in 1:nrow(combined_data)) {
# Skip any rows that have NA in Position or Chromosome (though we already filtered, this is a safety check)
if (is.na(combined_data$Position[i]) || is.na(combined_data$Chromosome[i])) next
color <- ifelse(combined_data$Inherited_From_Parent1[i] == "TRUE", "#FF0000",
ifelse(combined_data$Inherited_From_Parent2[i] == "TRUE", "#0000FF",
ifelse(combined_data$Unique_To_Progeny[i] == "TRUE", "#00FF00", "#FFFFFF")))
circos.rect(sector.index = combined_data$Chromosome[i],
xleft = combined_data$Position[i],
xright = combined_data$Position[i] + 1,
ybottom = 0,
ytop = 1,
col = color,
border = color)
}
# Add a legend
legend("topleft", legend = c("Inherited from Parent 1", "Inherited from Parent 2", "Unique to Progeny"),
col = c("#FF0000", "#0000FF", "#00FF00"), pch = 15, bty = "n", cex = 0.8)
# Clear Circos plot after drawing
circos.clear()
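To keep a copy of the figure rather than only drawing it in the active graphics device, the plotting code can be wrapped in a PNG device; a minimal sketch (the file name is an arbitrary choice):
# Open a PNG device, rerun the Circos plotting code above, then close the device
png("circos_inheritance.png", width = 2000, height = 2000, res = 300)
# ... circos.initialize / track / rect / legend code from above ...
dev.off()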