Ribo seq 3bp periodicity preference - wdlingit/cop GitHub Wiki
This document describes steps from extracting sequences of coding gene representative models, mapping Riboseq reads to these representative sequences, and computing 3bp periodicity preference, by taking Arabidopsis thaliana as an example. All programs and scripts are available by following the environment setting.
The TAIR database provides a file of representative gene models. This file is a FASTA so we extracted and transformed it into a list of gene-model pairs.
ubuntu@ubuntu20:~$ wget https://www.arabidopsis.org/download_files/Genes/Araport11_genome_release/Araport11_blastsets/Araport11_cdna_20220914_representative_gene_model.gz
ubuntu@ubuntu20:~$ gzip -d Araport11_cdna_20220914_representative_gene_model.gz
ubuntu@ubuntu20:~$ cat Araport11_cdna_20220914_representative_gene_model | grep ">" | perl -ne '/>(.+?)\.(\S+)/; print "$1\t$1.$2\n"' > Araport11.rep1
ubuntu@ubuntu20:~$ head Araport11.rep1
AT1G06190 AT1G06190.5
AT1G06620 AT1G06620.1
AT1G67670 AT1G67670.1
AT1G12100 AT1G12100.1
AT1G49040 AT1G49040.1
AT1G08710 AT1G08710.2
AT1G78300 AT1G78300.1
AT1G09210 AT1G09210.1
AT1G58430 AT1G58430.1
AT1G72290 AT1G72290.1
NOTE: TAIR recently (2024) updated their website. So the first wget
command may not work with the given URL. Please go to TAIR website for downloading the file directly.
On the other hand, we extracted coding region coordinates of coding gene models of Araport11 and identified models with longest coding regions for each gene.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.201606.gff -GRE gene mRNA CDS -O Araport.mRNA_CDS
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, CDS]]
intron preserving merged model (-IP)): false
representative list (-rep): null
output prefix (-O): Araport.mRNA_CDS
ubuntu@ubuntu20:~$ head Araport.mRNA_CDS.model
>AT1G01010.1 Chr1 3631 5899 + AT1G01010
3760 3913
3996 4276
4486 4605
4706 5095
5174 5326
5439 5630
>AT1G01020.2 Chr1 6788 8737 - AT1G01020
7315 7450
7564 7649
ubuntu@ubuntu20:~$ cat Araport.mRNA_CDS.model | perl -ne 'chomp; @t=split; if(/^>(\S+)/){ $mrna=$1; $gene=$t[-1]; $map{$gene}{$mrna}=1 }else{ $hash{$mrna}+=($t[1]-$t[0]+1) } if(eof){ for $g (sort keys %map){ $max=0; $ans=""; for $m (keys %{$map{$g}}){ if($hash{$m}>$max){ $ans=$m; $max=$hash{$m}; } } print "$g\t$ans\n" } }' > Araport11.rep2
ubuntu@ubuntu20:~$ head Araport11.rep2
AT1G01010 AT1G01010.1
AT1G01020 AT1G01020.1
AT1G01030 AT1G01030.1
AT1G01040 AT1G01040.2
AT1G01050 AT1G01050.1
AT1G01060 AT1G01060.8
AT1G01070 AT1G01070.1
AT1G01080 AT1G01080.2
AT1G01090 AT1G01090.1
AT1G01100 AT1G01100.4
Now, we have representative gene model info downloaded from TAIR in Araport11.rep1
and representative gene model info of coding genes inferred from the Araport11 GFF3 file in Araport11.rep2
. To make sure the final list contains only coding genes, we used the following perl one-liner to read Araport11.rep2
and do necessary updates according to Araport11.rep1
.
ubuntu@ubuntu20:~$ cat Araport11.rep2 | perl -ne 'if($.==1){ open(FILE,"<Araport11.rep1"); while($line=<FILE>){ chomp $line; $line=~/(\S+)\s(\S+)/; $hash{$1}=$2; } close FILE; } chomp; @t=split; if(exists $hash{$t[0]}){ $t[1]=$hash{$t[0]} } print "$t[0]\t$t[1]\n"' > Araport11.CDS.rep
ubuntu@ubuntu20:~$ wc -l Araport11.rep? Araport11.CDS.rep
27562 Araport11.rep1
27655 Araport11.rep2
27655 Araport11.CDS.rep
82872 total
To extract representative coding sequences (including UTRs), we extracted exon regions of all mRNAs and then extracted the coding representative part using Araport11.CDS.rep
.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport.mrna.model -GE mRNA five_prime_UTR:CDS:three_prime_UTR
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport.mrna.model
gene-exon feature pairs (-GE): [[mRNA, five_prime_UTR:CDS:three_prime_UTR]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false
ubuntu@ubuntu20:~$ cat araport.mrna.model | perl -ne 'if($.==1){ open(FILE,"<Araport11.CDS.rep"); while(my $line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $hash{$s[1]}=1 } close FILE } if(/^>(\S+)/){ $flag=0; $flag=1 if exists $hash{$1} } print if $flag' > araport.coding.rep.model
ubuntu@ubuntu20:~$ grep ">" araport.mrna.model | wc -l
48359
ubuntu@ubuntu20:~$ grep ">" araport.coding.rep.model | wc -l
27655
Afterward, applied SeqGen.pl
with TAIR10 genome FASTA and araport.coding.rep.model
(coordinates of exon regions of representative coding models) to extract representative coding sequences.
ubuntu@ubuntu20:~$ SeqGen.pl araport.coding.rep.fasta TAIR10_chr_all.fas araport.coding.rep.model
We used the following three commands to extract coordinates of CDS, 5UTR, and 3UTR in the genome. In so doing, we can then infer lengths of these regions in the representative coding sequences.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.mRNA.CDS -GE mRNA CDS
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.mRNA.CDS
gene-exon feature pairs (-GE): [[mRNA, CDS]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.mRNA.UTR5 -GE mRNA five_prime_UTR
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.mRNA.UTR5
gene-exon feature pairs (-GE): [[mRNA, five_prime_UTR]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.mRNA.UTR3 -GE mRNA three_prime_UTR
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.mRNA.UTR3
gene-exon feature pairs (-GE): [[mRNA, three_prime_UTR]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false
The following perl one-liner was used to add up lengths of 5UTR/CDS/3UTR regions in the genome for all mRNA. These numbers were saved in file araport11.mRNA.region.lengths
.
ubuntu@ubuntu20:~$ ls araport11.mRNA.CDS araport11.mRNA.UTR5 araport11.mRNA.UTR3 | perl -ne 'chomp; /\.([^\.]+)$/; print "REGION: $1\n"; system "cat $_"' | perl -ne 'chomp; if(/^REGION: (\S+)/){ $region=$1; next } if(/^>(\S+)/){ $id=$1; }else{ @t=split; $lenHash{$id}{$region}+=($t[1]-$t[0]+1) } if(eof){ for $id (sort keys %lenHash){ print "$id"; for $region ("UTR5","CDS","UTR3"){ $lenHash{$id}{$region}=0 if not exists $lenHash{$id}{$region}; print "\t$lenHash{$id}{$region}" } print "\n" } }' > araport11.mRNA.region.lengths
ubuntu@ubuntu20:~$ head araport11.mRNA.region.lengths
AT1G01010.1 129 1290 269
AT1G01020.1 464 738 127
AT1G01020.2 71 576 440
AT1G01020.3 582 711 127
AT1G01020.4 559 711 127
AT1G01020.5 582 597 127
AT1G01020.6 189 315 440
AT1G01030.1 613 1077 215
AT1G01030.2 613 1008 215
AT1G01040.1 398 5730 148
preindex.pl
was used to build necessary index files for samtools and some BioPerl modules. bowtie2-build
was used to build bowtie2 index for mapping Riboseq reads using bowtie2.
ubuntu@ubuntu20:~$ preindex.pl araport.coding.rep.fasta
ubuntu@ubuntu20:~$ bowtie2-build araport.coding.rep.fasta araport.coding.rep
In this case, we have four FASTA files of Riboseq reads: root-ctr-0h-r1.fasta
, root-ctr-0h-r2.fasta
, shoot-ctr-0h-r1.fasta
, and shoot-ctr-0h-r2.fasta
. They were all adapter removed and filtered by removing reads mapping to undesired non-coding genes. The following perl one-liner was applied to generate Mapping.pl
commands that utilize bowtie2 for mapping. Remember to remove the sharp mark to actual execute the commands.
ubuntu@ubuntu20:~$ ls root*.fasta shoot*.fasta | perl -ne 'chomp; /(.+?)\./; $cmd="Mapping.pl x $_ $1.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95"; print "\nCMD: $cmd\n"; #system $cmd'
CMD: Mapping.pl x root-ctr-0h-r1.fasta root-ctr-0h-r1.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95
CMD: Mapping.pl x root-ctr-0h-r2.fasta root-ctr-0h-r2.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95
CMD: Mapping.pl x shoot-ctr-0h-r1.fasta shoot-ctr-0h-r1.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95
CMD: Mapping.pl x shoot-ctr-0h-r2.fasta shoot-ctr-0h-r2.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95
Important options in these commands:
-
-p 8
: a bowtie2 option so that 8 threads will be used (as we performed this entire example in a VM with 8 vCPUs). -
-k 30
: a bowtie2 option so that at most 30 alignments will be reported for one read. Note that bowtie2 will tend to report only one alignment per reads if without this setting. -
--gbar 51
: a bowtie2 option for disallowing creating any gaps in first 51bps of reads. Our maximum input read length is less than this number so there should be no gaps in all alignments. -
--sensitive
: a preset bowtie2 option for sensitively making alignments -
-ID 0.95
: an output controlling option ofMapping.pl
. This is to remove output alignments with identity (#matches/read length) less than 0.95.
To further make sure that all counted reads were unambiguously mapped to representative coding models. We used java program misc.AlignmentFilter2
to pick best alignments for all reads (-filter top true
) and filter out reads with multiple best alignments (-filter uniq true
). Remember to remove the sharp mark to actual execute the commands.
ubuntu@ubuntu20:~$ ls *.repCoding.bam | perl -ne 'chomp; /(.+?)\./; $cmd="java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo $1.topUniq.bam -T araport.coding.rep.fasta /dev/stdin"; print "\nCMD: $cmd\n"; #system $cmd'
CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM root-ctr-0h-r1.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo root-ctr-0h-r1.topUniq.bam -T araport.coding.rep.fasta /dev/stdin
CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM root-ctr-0h-r2.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo root-ctr-0h-r2.topUniq.bam -T araport.coding.rep.fasta /dev/stdin
CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM shoot-ctr-0h-r1.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo shoot-ctr-0h-r1.topUniq.bam -T araport.coding.rep.fasta /dev/stdin
CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM shoot-ctr-0h-r2.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo shoot-ctr-0h-r2.topUniq.bam -T araport.coding.rep.fasta /dev/stdin
To compute Riboseq footprints in 1bp resolution, a common practice is to count exactly one specific point of all reads. For example, 15th base-pairs of all reads were adopted for computing read mapping distribution in UTR and CDS regions in PMID:24179124. In this example, we took the 13th base-pairs. The process is a pipeline of BAM => SAM => SamPointize.pl
=> BAM.
ubuntu@ubuntu20:~$ ls *.topUniq.bam | perl -ne 'chomp; /(.+?)\./; $cmd="samtools view $_ | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo $1.point.bam -T araport.coding.rep.fasta /dev/stdin"; print "\nCMD: $cmd\n"; #system $cmd'
CMD: samtools view root-ctr-0h-r1.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo root-ctr-0h-r1.point.bam -T araport.coding.rep.fasta /dev/stdin
CMD: samtools view root-ctr-0h-r2.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo root-ctr-0h-r2.point.bam -T araport.coding.rep.fasta /dev/stdin
CMD: samtools view shoot-ctr-0h-r1.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo shoot-ctr-0h-r1.point.bam -T araport.coding.rep.fasta /dev/stdin
CMD: samtools view shoot-ctr-0h-r2.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo shoot-ctr-0h-r2.point.bam -T araport.coding.rep.fasta /dev/stdin
The option -s 13
is to take the 13th base-pair in the reads (following read orientation) as the only mapping base-pair in the output alignments.
ubuntu@ubuntu20:~$ samtools view root-ctr-0h-r1.topUniq.bam | head -1
HWI-ST615:762:H9PFHADXX:1:1101:1189:92557_1:N:0:TTAGGC 0 AT5G49720.1 1188 255 29M * 0 0 TNGTGCTAAGGTGGTGTATCAGTTTGGAA IIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:1G27 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-1 YT:Z:UU
ubuntu@ubuntu20:~$ samtools view root-ctr-0h-r1.point.bam | head -1
HWI-ST615:762:H9PFHADXX:1:1101:1189:92557_1:N:0:TTAGGC 0 AT5G49720.1 1200 255 1M * 0 0 G I
In so doing, the mapping start positions in the SAM records would be exactly the positions of 13th base-pairs of mapped reads.
With pointized Riboseq alignments in *.point.bam
and 5UTR/CDS/3UTR lengths in araport11.mRNA.region.lengths
, we can compute read accumulation at every position in the beginning of CDS regions.
ubuntu@ubuntu20:~$ ls *.point.bam | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "samtools view $_"' | perl -ne 'if($.==1){ open(FILE,"<araport11.mRNA.region.lengths"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $id=shift @s; push @{$lenHash{$id}}, @s; } close FILE; } if(/^!!!FILE: (\S+)/){ $f=$1 }else{ @t=split; @len=@{$lenHash{$t[2]}}; if($len[0]<=$t[3] && $t[3]<=$len[0]+51){ $hash{$f}{$t[3]-$len[0]}++ } } if(eof STDIN){ for $f (sort keys %hash){ print "\t$f" } print "\n"; for $x (0..51){ print "$x"; for $f (sort keys %hash){ print "\t$hash{$f}{$x}" } print "\n"; } }' > CDS.acc.xls
ubuntu@ubuntu20:~$ head CDS.acc.xls
root-ctr-0h-r1 root-ctr-0h-r2 shoot-ctr-0h-r1 shoot-ctr-0h-r2
0 4189 2667 13976 7524
1 14724 7762 21429 7762
2 5925 1964 2751 1665
3 1585 918 4364 2536
4 2217 1320 4455 1869
5 1061 543 1219 946
6 807 528 2856 1813
7 1986 1247 3964 1680
8 975 500 1508 1432
Some Excel operation can give us the following plot, which shows a 3bp periodicity along the CDS region.
Explanation of the perl one-liner:
ls *.point.bam | # for every pointized BAM file
perl -ne '
chomp;
/(.+?)\./;
print "!!!FILE: $1\n"; # print a special mark (!!!FILE:) and file label
system "samtools view $_" # print pointized SAM records
' | perl -ne '
if($.==1){ # when read first line from the input, this block works only once
open(FILE,"<araport11.mRNA.region.lengths");
while($line=<FILE>){
chomp $line;
@s=split(/\s+/,$line);
$id=shift @s;
push @{$lenHash{$id}}, @s; # save 5UTR/CDS/3UTR region lengths into hash %lenHash
}
close FILE;
}
if(/^!!!FILE: (\S+)/){
$f=$1 # store label information if see the special mark
}else{
@t=split; # $t[2] is representative model ID $t[3] for the mapping position
@len=@{$lenHash{$t[2]}}; # retrieve region lengths from %lenHash
if($len[0]<=$t[3] && $t[3]<=$len[0]+51){ # the point is in 1+first 51bps of CDS
$hash{$f}{$t[3]-$len[0]}++ # $t[3]-$len[0]==0 => the base-pair in front of CDS
}
}
if(eof STDIN){ # when read last line from the input, this block works only once
for $f (sort keys %hash){ print "\t$f" } print "\n"; # output header
for $x (0..51){ # for positions 0, 1, 2, ... 51
print "$x";
for $f (sort keys %hash){
print "\t$hash{$f}{$x}" # output read accumulation for each sample
}
print "\n";
}
}
' > CDS.acc.xls # this file can be opened by Excel as it is actually a tsv file
Again, with pointized Riboseq alignments in *.point.bam
and 5UTR/CDS/3UTR lengths in araport11.mRNA.region.lengths
, we can compute read accumulation at 0th, 1th, and 2th bps of every 3 base-pair for 5UTR, CDS and 3UTR.
ubuntu@ubuntu20:~$ ls *.point.bam | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "samtools view $_"' | perl -ne 'if($.==1){ open(FILE,"<araport11.mRNA.region.lengths"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $id=shift @s; push @{$lenHash{$id}}, @s; } close FILE; } if(/^!!!FILE: (\S+)/){ $f=$1 }else{ @t=split; @len=@{$lenHash{$t[2]}}; $class=""; if($t[3]<=$len[0]){ $class="5utr"; $mod=$t[3]%3 }elsif($len[0]+1<=$t[3] && $t[3]<=$len[0]+51){ $class="cds"; $mod=($t[3]-$len[0]-1)%3 }elsif($len[0]+$len[1]+1<=$t[3] && $t[3]<=$len[0]+$len[1]+51){ $class="3utr"; $mod=($t[3]-$len[0]-$len[1]-1)%3 } $hash{"$f\t$class"}{$mod}++ if $class ne "" } if(eof STDIN){ for $k1 (sort keys %hash){ print "$k1\t$hash{$k1}{0}\t$hash{$k1}{1}\t$hash{$k1}{2}\n" } }'
root-ctr-0h-r1 3utr 581 645 458
root-ctr-0h-r1 5utr 14039 10580 11301
root-ctr-0h-r1 cds 48342 25104 16084
root-ctr-0h-r2 3utr 450 493 276
root-ctr-0h-r2 5utr 17086 8531 11170
root-ctr-0h-r2 cds 28842 11148 11963
shoot-ctr-0h-r1 3utr 732 632 448
shoot-ctr-0h-r1 5utr 21116 18949 21828
shoot-ctr-0h-r1 cds 72020 22276 43475
shoot-ctr-0h-r2 3utr 521 380 362
shoot-ctr-0h-r2 5utr 18044 14234 17692
shoot-ctr-0h-r2 cds 32442 19089 30278
It is observed that for CDS region, 0th bps of all 3 base-pair has significantly more reads than 1th and 2th bps, compared to other regions.
Explanation of the perl one-liner:
ls *.point.bam | # for every pointized BAM file
perl -ne '
chomp;
/(.+?)\./;
print "!!!FILE: $1\n"; # print a special mark (!!!FILE:) and file label
system "samtools view $_" # print pointized SAM records
' | perl -ne '
if($.==1){ # when read first line from the input, this block works only once
open(FILE,"<araport11.mRNA.region.lengths");
while($line=<FILE>){
chomp $line;
@s=split(/\s+/,$line);
$id=shift @s;
push @{$lenHash{$id}}, @s; # save 5UTR/CDS/3UTR region lengths into hash %lenHash
}
close FILE;
}
if(/^!!!FILE: (\S+)/){
$f=$1 # store label information if see the special mark
}else{
@t=split; # $t[2] is representative model ID $t[3] for the mapping position
@len=@{$lenHash{$t[2]}}; # retrieve region lengths from %lenHash
$class="";
if($t[3]<=$len[0]){ # the point is in 5UTR
$class="5utr"; $mod=$t[3]%3
}elsif($len[0]+1<=$t[3] && $t[3]<=$len[0]+51){ # the point is in first 51bps of CDS
$class="cds"; $mod=($t[3]-$len[0]-1)%3
}elsif($len[0]+$len[1]+1<=$t[3] && $t[3]<=$len[0]+$len[1]+51){ # the point is in first 51bps of 3UTR
$class="3utr"; $mod=($t[3]-$len[0]-$len[1]-1)%3
} # $mod for position of every 3 base-pair
$hash{"$f\t$class"}{$mod}++ if $class ne "" # add count for corresponding file and region
}
if(eof STDIN){ # when read last line from the input, this block works only once
for $k1 (sort keys %hash){
print "$k1\t$hash{$k1}{0}\t$hash{$k1}{1}\t$hash{$k1}{2}\n"
}
}
'
Supposed that cleaned Riboseq reads were saved in FASTQ format and compressed in *.gz
files. The following perl one-liner can used to compute read length distribution of all files.
ubuntu@ubuntu20:~$ ls *.gz | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "gzip -dc $_"' | perl -ne 'chomp; if(/^!!!FILE: (\S+)/){ $cnt=0; $f=$1; $files{$f}=1 }else{ $cnt++; if(($cnt%4)==2){ s/^\s+|\s+$//g; $hash{length($_)}{$f}++ } } if(eof){ for $f (sort keys %files){ print "\t$f" } print "\n"; for $len (sort {$a<=>$b} keys %hash){ print "$len"; for $f (sort keys %files){ if(exists $hash{$len}{$f}){ print "\t$hash{$len}{$f}" }else{ print "\t0" } } print "\n" } }' | less
root-ctr-0h-r1 root-ctr-0h-r2 shoot-ctr-0h-r1 shoot-ctr-0h-r2
15 0 1509 0 802
16 0 3511 0 2092
17 0 4815 0 4725
18 0 5887 0 7212
19 0 6926 0 7511
20 23789 13131 18681 9551
21 57060 38401 42113 24438
22 81405 75339 64279 37983
23 112457 131009 113117 123383
24 322571 273205 208608 127738
25 622358 618208 446363 185290
26 2402146 1087920 785581 269089
27 4216948 3042870 1258728 512239
28 5425997 4826210 3080591 1356950
29 4802262 5300209 4904951 3475938
30 9117234 9185777 8025603 5354937
31 12669476 12426514 12204871 9254776
32 10271297 12536049 8844096 6899578
33 4059393 8401201 9179076 7758248
34 1950254 5177039 5176473 7126098
35 943215 2270547 2768890 5014734
36 561691 1325082 1409064 3149693
37 313184 562394 641187 1397641
38 92463 232682 329648 561905
39 23505 52484 155155 473897
40 7802 17128 49005 104359
41 3284 4154 22703 23051
42 1588 1905 5235 6097
43 737 875 2059 4596
44 372 330 893 2380
45 218 124 600 945
46 150 52 491 317
47 66 19 347 183
48 16 8 59 46