SeqkitTask - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Seqkit Task

1. Download the virus genomes from NCBI Datasets

# Create a new directory named seqkit_task
mkdir seqkit_task

# Change the current directory to seqkit_task
cd seqkit_task

# Download viral genome data for accession numbers NC_045512.2 and KY672931.1 using the datasets tool
datasets download virus genome accession NC_045512.2 KY672931.1

# Extract the contents of the ncbi_dataset.zip file
unzip ncbi_dataset.zip

# Change the current directory to ncbi_dataset/data
cd ncbi_dataset/data

# View the contents of the genomic.fna file using the less pager
less genomic.fna

# Rename the genomic.fna file to virus.fna
mv genomic.fna virus.fna

2. Seqkit Task

Task 1: Validate the FASTA format of the viral genome files

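# Validate the FASTA format of the virus.fna file; seqkit exits with an error if a record cannot be parsed or contains letters outside the detected alphabet
seqkit seq --validate-seq virus.fna > /dev/null

Task 2: Extract sequences by ID

# Extract sequences with specific IDs from the virus.fna file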
seqkit grep -p "NC_045512.2" virus.fna > NC_045512.2.fna
seqkit grep -p "KY672931.1" virus.fna > KY672931.1.fna
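
By default, seqkit grep compares the pattern against the sequence ID, which is the header text up to the first whitespace, and requires the whole ID to match. The sketch below emulates that rule with awk on a toy file, so the behaviour can be checked without the downloaded genomes. The accessions and sequences in it are hypothetical.

```shell
# Toy FASTA with two hypothetical accessions; one ID is a prefix of the other
workdir=$(mktemp -d)
cat > "$workdir/toy.fna" <<'EOF'
>NC_000001.1 hypothetical record one
ACGT
>NC_000001.10 hypothetical record two
GGCC
EOF

# Keep a record only when its ID (header up to the first whitespace) equals the query
awk -v id="NC_000001.1" '
  /^>/ { keep = (substr($1, 2) == id) }
  keep
' "$workdir/toy.fna" > "$workdir/picked.fna"

# Exactly one record should survive; NC_000001.10 must not be picked up as a partial match
n_picked=$(grep -c '^>' "$workdir/picked.fna")
echo "records extracted: $n_picked"
```

Counting the ">" lines of NC_045512.2.fna and KY672931.1.fna in the same way is a quick check that each file holds a single record.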

Task 3: Calculate sequence statistics

# Calculate sequence statistics for the virus.fna file
seqkit stats virus.fna
# Report extended statistics (for example N50 and sequence length quartiles) for the virus.fna file
seqkit stats --all virus.fna
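
The core columns printed by seqkit stats (num_seqs, sum_len, min_len, max_len) can be cross-checked by hand. A minimal sketch with awk on a toy file follows; the two records are hypothetical, and the seqkit call runs only if the tool is installed.

```shell
# Toy two-record FASTA (hypothetical sequences, not the downloaded genomes)
workdir=$(mktemp -d)
cat > "$workdir/toy.fna" <<'EOF'
>seqA toy record
ACGTACGTAC
GTAC
>seqB toy record
GGGCCC
EOF

# Count records and sum sequence lengths, allowing for wrapped sequence lines
toy_stats=$(awk '
  /^>/ { n++; next }
  { len[n] += length($0) }
  END {
    min = len[1]; max = len[1]
    for (i = 1; i <= n; i++) {
      sum += len[i]
      if (len[i] < min) min = len[i]
      if (len[i] > max) max = len[i]
    }
    printf "%d %d %d %d", n, sum, min, max
  }' "$workdir/toy.fna")
echo "num_seqs sum_len min_len max_len: $toy_stats"

# If seqkit is installed, its table should agree with the numbers above
if command -v seqkit >/dev/null 2>&1; then
  seqkit stats "$workdir/toy.fna"
fi
```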

Task 4: Convert FASTA to tabular format

# Convert the virus.fna file to tabular format
seqkit fx2tab virus.fna > virus.tab

Task 5: Split sequences into smaller files

# Split the virus.fna file into smaller files, each containing one sequence (output goes to the virus.fna.split directory)
seqkit split2 -s 1 virus.fna

Task 6: Reverse complement sequences

# Generate the reverse complement of the sequences in the virus.fna file
seqkit seq --reverse --complement virus.fna > virus_rc.fna
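
Reverse complementing means reading the strand backwards and swapping each base for its partner (A with T, C with G). As a rough illustration on a single hypothetical sequence, using the standard rev and tr utilities rather than seqkit:

```shell
dna="ACGTTG"   # hypothetical example sequence
# Reverse the string, then complement each base (upper and lower case)
rc=$(printf '%s\n' "$dna" | rev | tr 'ACGTacgt' 'TGCAtgca')
echo "$dna -> $rc"   # prints: ACGTTG -> CAACGT
```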

Task 7: Filter sequences by length

# Filter sequences in the virus.fna file by length (e.g., keep sequences of at least 1000 bp)
seqkit seq -m 1000 virus.fna > virus_filtered.fna

Task 8: Extract the first 100 bases of each sequence

# Extract the first 100 bases of each sequence in the virus.fna file
seqkit subseq -r 1:100 virus.fna > virus_first_100_bases.fna

Task 9: Remove duplicate sequences

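# Build a test file that contains a duplicate record by concatenating NC_045512.2.fna and virus.fna
cat NC_045512.2.fna virus.fna > virus_dump.fna
# Remove duplicate sequences (compared by sequence ID by default) from the virus_dump.fna file
seqkit rmdup virus_dump.fna > virus_no_duplicates.fna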
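
seqkit rmdup compares records by sequence ID unless told otherwise (the -s option compares the sequences themselves). The sketch below mimics the default behaviour with awk, keeping the first record seen for each ID, on a toy file whose records are hypothetical.

```shell
# Toy FASTA in which the hypothetical record seqA appears twice
workdir=$(mktemp -d)
cat > "$workdir/dup.fna" <<'EOF'
>seqA
ACGT
>seqB
GGCC
>seqA
ACGT
EOF

# Print a record only the first time its ID is encountered
awk '
  /^>/ { id = substr($1, 2); keep = !(id in seen); seen[id] = 1 }
  keep
' "$workdir/dup.fna" > "$workdir/dedup.fna"

n_before=$(grep -c '^>' "$workdir/dup.fna")
n_after=$(grep -c '^>' "$workdir/dedup.fna")
echo "records before: $n_before, after: $n_after"
```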

Task 10: Sort sequences by length

# Sort sequences in the virus.fna file by length
seqkit sort -l virus.fna > virus_sorted_by_length.fna

Task 11: Calculate GC content

# Calculate the GC content of sequences in the virus.fna file
seqkit fx2tab --gc virus.fna > virus_gc_content.tab
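
GC content is the share of G and C bases among all bases of a sequence, expressed as a percentage. A small awk sketch on a toy file (hypothetical records) shows the arithmetic:

```shell
# Toy FASTA; seqA is wrapped over two lines to show that lines must be summed per record
workdir=$(mktemp -d)
cat > "$workdir/toy.fna" <<'EOF'
>seqA
ACGTACGTAC
GTAC
>seqB
GGGCCC
EOF

# For each record: GC% = 100 * (number of G and C) / (sequence length)
gc_table=$(awk '
  function report() { if (id != "") printf "%s\t%.2f\n", id, 100 * gc / len }
  /^>/ { report(); id = substr($1, 2); gc = 0; len = 0; next }
  { len += length($0); gc += gsub(/[GCgc]/, "") }
  END { report() }
' "$workdir/toy.fna")
printf '%s\n' "$gc_table"
```

Here seqA has 7 G or C bases out of 14 (50.00) and seqB has 6 out of 6 (100.00).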

Task 12: Translate nucleotide sequences to protein sequences

# Translate nucleotide sequences in the virus.fna file to protein sequences
seqkit translate virus.fna > virus_protein.faa

Task 13: Extract sequences with specific motifs

# Extract sequences containing a specific motif (e.g., "ATG") from the virus.fna file
seqkit grep -s -i -p "ATG" virus.fna > virus_with_motif.fna

Task 14: Generate sequence statistics in tabular format

# Generate sequence statistics in tabular format for the virus.fna file
seqkit stats --tabular virus.fna > virus_stats.tab

Task 15: Convert FASTA to FASTQ format

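# Convert the virus.fna file to FASTQ format, assigning every base a placeholder quality score of 40 ("I" in Phred+33)
# seqkit fq2fa converts in the opposite direction (FASTQ to FASTA), so the records are built from seqkit's tabular output with awk
seqkit fx2tab virus.fna | awk -F'\t' '{ q = $2; gsub(/./, "I", q); print "@" $1 "\n" $2 "\n+\n" q }' > virus.fastq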
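
In the common Phred+33 encoding, a FASTQ quality character is the ASCII character whose code is the quality score plus 33. A quick worked calculation for the score of 40 mentioned above:

```shell
q=40
code=$((q + 33))                               # 40 + 33 = 73
qchar=$(printf "\\$(printf '%03o' "$code")")   # 73 is octal 111, the ASCII letter I
echo "Phred $q -> ASCII $code -> '$qchar'"     # prints: Phred 40 -> ASCII 73 -> 'I'
```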

Task 16: Mask low complexity regions

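# Mask low complexity regions in the virus.fna file
# seqkit seq has no documented --mask-lc option; one alternative (an assumption, it requires NCBI BLAST+) is dustmasker, which writes low complexity regions in lowercase
dustmasker -in virus.fna -outfmt fasta -out virus_masked.fna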