SeqkitTask - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
Seqkit Task
1. Download the virus genomes from NCBI Datasets
# Create a new directory named seqkit_task
mkdir seqkit_task
# Change the current directory to seqkit_task
cd seqkit_task
# Download viral genome data for accession numbers NC_045512.2 and KY672931.1 using the datasets tool
datasets download virus genome accession NC_045512.2 KY672931.1
# Extract the contents of the ncbi_dataset.zip file
unzip ncbi_dataset.zip
# Change the current directory to ncbi_dataset/data
cd ncbi_dataset/data
# View the contents of the genomic.fna file using the less pager
less genomic.fna
# Rename the genomic.fna file to virus.fna
mv genomic.fna virus.fna
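Before moving on, it can help to confirm that both requested records actually arrived. The sketch below uses only grep, cut and tr; the toy file and its contents are hypothetical stand-ins so the check can be tried offline, and on the real data the same two commands would be pointed at virus.fna instead.

```shell
# Toy stand-in for virus.fna (hypothetical content, not the real genomes).
tmpdir=$(mktemp -d)
printf '>NC_045512.2 toy record\nATTAAAGGTT\n>KY672931.1 toy record\nGGCCAATT\n' > "$tmpdir/virus.fna"

# Count the FASTA header lines; two accessions were requested, so two are expected.
n_records=$(grep -c '^>' "$tmpdir/virus.fna")
echo "records: $n_records"

# List the record IDs (first word of each header, without the leading ">").
ids=$(grep '^>' "$tmpdir/virus.fna" | cut -d' ' -f1 | tr -d '>')
echo "$ids"
```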
2. Seqkit Task
Task 1: Validate the FASTA format of the viral genome files
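# Check the FASTA format of the virus.fna file: fx2tab parses every record and prints it as a table,
# so a malformed file should stop with a parsing error instead of producing output
seqkit fx2tab virus.fna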
Task 2: Extract sequences by ID
# Extract sequences with specific IDs from the virus.fna file
seqkit grep -p "NC_045512.2" virus.fna > NC_045512.2.fna
seqkit grep -p "KY672931.1" virus.fna > KY672931.1.fna
Task 3: Calculate sequence statistics
# Calculate sequence statistics for the virus.fna file
seqkit stats virus.fna
# Report extended statistics (for example N50 and length quartiles) for the virus.fna file
seqkit stats --all virus.fna
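The per-record lengths behind the seqkit stats summary can be cross-checked with a short awk sketch. It assumes a plain FASTA file whose sequence lines may be wrapped; the toy file and record names are hypothetical.

```shell
# Toy FASTA with one wrapped record (hypothetical content).
tmpdir=$(mktemp -d)
printf '>seqA\nATGC\nATGC\n>seqB\nGGGCCC\n' > "$tmpdir/toy.fna"

# Sum the residue count per record; sequence lines may wrap, so add up every non-header line.
lengths=$(awk '/^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
               { len += length($0) }
               END { if (id != "") print id "\t" len }' "$tmpdir/toy.fna")
printf '%s\n' "$lengths"
```

The number of lines printed should match num_seqs, and the smallest and largest values should match min_len and max_len in the seqkit stats output for the same file.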
Task 4: Convert FASTA to tabular format
# Convert the virus.fna file to tabular format
seqkit fx2tab virus.fna > virus.tab
Task 5: Split sequences into smaller files
# Split the virus.fna file into smaller files, each containing one sequence (-s sets the number of sequences per file)
seqkit split2 -s 1 virus.fna
Task 6: Reverse complement sequences
# Generate the reverse complement of the sequences in the virus.fna file
seqkit seq --reverse --complement virus.fna > virus_rc.fna
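To sanity-check what a reverse complement looks like, the same transformation can be reproduced on a toy sequence with rev and tr. This is only an illustration with a hypothetical input; it assumes rev is installed and handles the four standard bases, not IUPAC ambiguity codes.

```shell
# Toy sequence (hypothetical).
seq_in="ATGCCG"

# Reverse the string, then swap each base for its complement.
rc=$(printf '%s\n' "$seq_in" | rev | tr 'ACGTacgt' 'TGCAtgca')
echo "$rc"   # prints CGGCAT
```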
Task 7: Filter sequences by length
# Filter sequences in the virus.fna file by length (e.g., keep sequences of at least 1000 bp; -m is an inclusive minimum)
seqkit seq -m 1000 virus.fna > virus_filtered.fna
Task 8: Extract the first 100 bases of each sequence
# Extract the first 100 bases of each sequence in the virus.fna file
seqkit subseq -r 1:100 virus.fna > virus_first_100_bases.fna
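The region 1:100 is 1-based and inclusive at both ends, so it returns exactly 100 bases. The sketch below shows the same coordinate convention on a hypothetical toy string with cut, taking the first and last four characters; seqkit subseq also documents negative coordinates (for example -r -100:-1 for the last 100 bases), which is worth confirming with seqkit subseq --help.

```shell
# Toy sequence (hypothetical), 10 bases long.
seq_in="ATGCATGCAT"

# Characters 1 to 4, inclusive, mirroring a 1:4 region.
first4=$(printf '%s\n' "$seq_in" | cut -c1-4)

# Last 4 characters: reverse, take 1 to 4, reverse back (assumes rev is installed).
last4=$(printf '%s\n' "$seq_in" | rev | cut -c1-4 | rev)

echo "$first4 $last4"   # prints ATGC GCAT
```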
Task 9: Remove duplicate sequences
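# Create a test file, virus_dump.fna, in which the NC_045512.2 record appears twice
cat NC_045512.2.fna virus.fna > virus_dump.fna
# Remove duplicate sequences from the virus_dump.fna file (rmdup compares sequence IDs by default; add -s to compare sequences)
seqkit rmdup virus_dump.fna > virus_no_duplicates.fna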
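Removing duplicates by ID amounts to keeping the first record seen for each identifier. A minimal awk sketch of that idea on a hypothetical toy file follows; it illustrates the concept and is not how seqkit implements rmdup.

```shell
# Toy FASTA in which record id1 appears twice (hypothetical content).
tmpdir=$(mktemp -d)
printf '>id1\nATGC\n>id2\nGGCC\n>id1\nATGC\n' > "$tmpdir/dup.fna"

# On each header, decide whether this ID is new; print lines only while keep is true.
awk '/^>/ { keep = !seen[$1]++ } keep' "$tmpdir/dup.fna" > "$tmpdir/dedup.fna"

n_before=$(grep -c '^>' "$tmpdir/dup.fna")
n_after=$(grep -c '^>' "$tmpdir/dedup.fna")
echo "records before: $n_before, after: $n_after"
```

Comparing the header counts before and after in the same way is a quick check that seqkit rmdup removed the expected number of records.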
Task 10: Sort sequences by length
# Sort sequences in the virus.fna file by length
seqkit sort -l virus.fna > virus_sorted_by_length.fna
Task 11: Calculate GC content
# Calculate the GC content of sequences in the virus.fna file
seqkit fx2tab --gc virus.fna > virus_gc_content.tab
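GC content is the share of G and C bases in a sequence, which fx2tab should report as a percentage. The awk sketch below computes the same quantity on a hypothetical toy file as a cross-check; it assumes non-empty records and counts only G and C (upper or lower case), so ambiguity codes such as S are ignored.

```shell
# Toy FASTA (hypothetical content); seqA is wrapped over two lines.
tmpdir=$(mktemp -d)
printf '>seqA\nATGC\nATGC\n>seqB\nGGGCCA\n' > "$tmpdir/toy.fna"

# For each record: total length, number of G/C bases, then GC as a percentage.
# gsub returns the number of replacements, which serves as the G/C count for the line.
gc_table=$(awk '/^>/ { if (id != "") printf "%s\t%.2f\n", id, 100 * gc / len
                       id = substr($1, 2); gc = 0; len = 0; next }
                { len += length($0); gc += gsub(/[GCgc]/, "") }
                END { if (id != "") printf "%s\t%.2f\n", id, 100 * gc / len }' "$tmpdir/toy.fna")
printf '%s\n' "$gc_table"
```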
Task 12: Translate nucleotide sequences to protein sequences
# Translate nucleotide sequences in the virus.fna file to protein sequences
seqkit translate virus.fna > virus_protein.faa
Task 13: Extract sequences with specific motifs
# Extract sequences containing a specific motif (e.g., "ATG") from the virus.fna file
seqkit grep -s -i -p "ATG" virus.fna > virus_with_motif.fna
Task 14: Generate sequence statistics in tabular format
# Generate sequence statistics in tabular format for the virus.fna file
seqkit stats --tabular virus.fna > virus_stats.tab
Task 15: Convert FASTA to FASTQ format
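# Convert the virus.fna file to FASTQ format with a placeholder quality of 40 for every base ("I" in Phred+33)
# seqkit fq2fa only converts FASTQ to FASTA, so the FASTQ records are built from seqkit fx2tab output with awk
seqkit fx2tab virus.fna | awk -F'\t' '{ q = $2; gsub(/./, "I", q); print "@" $1 "\n" $2 "\n+\n" q }' > virus.fastq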
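In the common Phred+33 encoding, a quality score is stored as the character whose ASCII code is the score plus 33, so a score of 40 corresponds to the character I (ASCII 73). The short sketch below checks that arithmetic; the variable names are illustrative.

```shell
# Quality character to decode (Phred+33 encoding assumed).
qual_char="I"

# printf with a leading quote yields the character's ASCII code.
ascii=$(printf '%d' "'$qual_char")
phred=$((ascii - 33))
echo "$qual_char -> Q$phred"   # prints I -> Q40
```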
Task 16: Mask low complexity regions
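# Mask low complexity regions in the virus.fna file
# Caution: --mask-lc is not among the documented seqkit seq options (check seqkit seq --help);
# if it is rejected, dustmasker from BLAST+ can soft-mask instead: dustmasker -in virus.fna -outfmt fasta -out virus_masked.fna
seqkit seq --mask-lc virus.fna > virus_masked.fna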