Ribo seq 3bp periodicity preference - wdlingit/cop GitHub Wiki

This document describes steps from extracting sequences of coding gene representative models, mapping Riboseq reads to these representative sequences, and computing 3bp periodicity preference, by taking Arabidopsis thaliana as an example. All programs and scripts are available by following the environment setting.

Extract representative coding sequences

The TAIR database provides a file of representative gene models. This file is a FASTA so we extracted and transformed it into a list of gene-model pairs.

ubuntu@ubuntu20:~$ wget https://www.arabidopsis.org/download_files/Genes/Araport11_genome_release/Araport11_blastsets/Araport11_cdna_20220914_representative_gene_model.gz

ubuntu@ubuntu20:~$ gzip -d Araport11_cdna_20220914_representative_gene_model.gz

ubuntu@ubuntu20:~$ cat Araport11_cdna_20220914_representative_gene_model | grep ">" | perl -ne '/>(.+?)\.(\S+)/; print "$1\t$1.$2\n"' > Araport11.rep1

ubuntu@ubuntu20:~$ head Araport11.rep1
AT1G06190       AT1G06190.5
AT1G06620       AT1G06620.1
AT1G67670       AT1G67670.1
AT1G12100       AT1G12100.1
AT1G49040       AT1G49040.1
AT1G08710       AT1G08710.2
AT1G78300       AT1G78300.1
AT1G09210       AT1G09210.1
AT1G58430       AT1G58430.1
AT1G72290       AT1G72290.1

NOTE: TAIR recently (2024) updated their website. So the first wget command may not work with the given URL. Please go to TAIR website for downloading the file directly.

On the other hand, we extracted coding region coordinates of coding gene models of Araport11 and identified models with longest coding regions for each gene.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.201606.gff -GRE gene mRNA CDS -O Araport.mRNA_CDS
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, CDS]]
intron preserving merged model (-IP)): false
representative list (-rep): null
output prefix (-O): Araport.mRNA_CDS

ubuntu@ubuntu20:~$ head Araport.mRNA_CDS.model
>AT1G01010.1    Chr1    3631    5899    +       AT1G01010
3760    3913
3996    4276
4486    4605
4706    5095
5174    5326
5439    5630
>AT1G01020.2    Chr1    6788    8737    -       AT1G01020
7315    7450
7564    7649

ubuntu@ubuntu20:~$ cat Araport.mRNA_CDS.model | perl -ne 'chomp; @t=split; if(/^>(\S+)/){ $mrna=$1; $gene=$t[-1]; $map{$gene}{$mrna}=1 }else{ $hash{$mrna}+=($t[1]-$t[0]+1) } if(eof){ for $g (sort keys %map){ $max=0; $ans=""; for $m (keys %{$map{$g}}){ if($hash{$m}>$max){ $ans=$m; $max=$hash{$m}; } } print "$g\t$ans\n" } }' > Araport11.rep2

ubuntu@ubuntu20:~$ head Araport11.rep2
AT1G01010       AT1G01010.1
AT1G01020       AT1G01020.1
AT1G01030       AT1G01030.1
AT1G01040       AT1G01040.2
AT1G01050       AT1G01050.1
AT1G01060       AT1G01060.8
AT1G01070       AT1G01070.1
AT1G01080       AT1G01080.2
AT1G01090       AT1G01090.1
AT1G01100       AT1G01100.4

Now, we have representative gene model info downloaded from TAIR in Araport11.rep1 and representative gene model info of coding genes inferred from the Araport11 GFF3 file in Araport11.rep2. To make sure the final list contains only coding genes, we used the following perl one-liner to read Araport11.rep2 and do necessary updates according to Araport11.rep1.

ubuntu@ubuntu20:~$ cat Araport11.rep2 | perl -ne 'if($.==1){ open(FILE,"<Araport11.rep1"); while($line=<FILE>){ chomp $line; $line=~/(\S+)\s(\S+)/; $hash{$1}=$2; } close FILE; } chomp; @t=split; if(exists $hash{$t[0]}){ $t[1]=$hash{$t[0]} } print "$t[0]\t$t[1]\n"' > Araport11.CDS.rep

ubuntu@ubuntu20:~$ wc -l Araport11.rep? Araport11.CDS.rep
  27562 Araport11.rep1
  27655 Araport11.rep2
  27655 Araport11.CDS.rep
  82872 total

To extract representative coding sequences (including UTRs), we extracted exon regions of all mRNAs and then extracted the coding representative part using Araport11.CDS.rep.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport.mrna.model -GE mRNA five_prime_UTR:CDS:three_prime_UTR
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport.mrna.model
gene-exon feature pairs (-GE): [[mRNA, five_prime_UTR:CDS:three_prime_UTR]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false

ubuntu@ubuntu20:~$ cat araport.mrna.model | perl -ne 'if($.==1){ open(FILE,"<Araport11.CDS.rep"); while(my $line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $hash{$s[1]}=1 } close FILE } if(/^>(\S+)/){ $flag=0; $flag=1 if exists $hash{$1} } print if $flag' > araport.coding.rep.model

ubuntu@ubuntu20:~$ grep ">" araport.mrna.model | wc -l
48359
ubuntu@ubuntu20:~$ grep ">" araport.coding.rep.model | wc -l
27655

Afterward, applied SeqGen.pl with TAIR10 genome FASTA and araport.coding.rep.model (coordinates of exon regions of representative coding models) to extract representative coding sequences.

ubuntu@ubuntu20:~$ SeqGen.pl araport.coding.rep.fasta TAIR10_chr_all.fas araport.coding.rep.model

Locate CDS regions inside representative coding sequences

We used the following three commands to extract coordinates of CDS, 5UTR, and 3UTR in the genome. In so doing, we can then infer lengths of these regions in the representative coding sequences.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.mRNA.CDS -GE mRNA CDS
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.mRNA.CDS
gene-exon feature pairs (-GE): [[mRNA, CDS]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.mRNA.UTR5 -GE mRNA five_prime_UTR
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.mRNA.UTR5
gene-exon feature pairs (-GE): [[mRNA, five_prime_UTR]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.mRNA.UTR3 -GE mRNA three_prime_UTR
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.mRNA.UTR3
gene-exon feature pairs (-GE): [[mRNA, three_prime_UTR]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false

The following perl one-liner was used to add up lengths of 5UTR/CDS/3UTR regions in the genome for all mRNA. These numbers were saved in file araport11.mRNA.region.lengths.

ubuntu@ubuntu20:~$ ls araport11.mRNA.CDS araport11.mRNA.UTR5 araport11.mRNA.UTR3 | perl -ne 'chomp; /\.([^\.]+)$/; print "REGION: $1\n"; system "cat $_"' | perl -ne 'chomp; if(/^REGION: (\S+)/){ $region=$1; next } if(/^>(\S+)/){ $id=$1; }else{ @t=split; $lenHash{$id}{$region}+=($t[1]-$t[0]+1) } if(eof){ for $id (sort keys %lenHash){ print "$id"; for $region ("UTR5","CDS","UTR3"){ $lenHash{$id}{$region}=0 if not exists $lenHash{$id}{$region}; print "\t$lenHash{$id}{$region}" } print "\n" } }' > araport11.mRNA.region.lengths

ubuntu@ubuntu20:~$ head araport11.mRNA.region.lengths
AT1G01010.1     129     1290    269
AT1G01020.1     464     738     127
AT1G01020.2     71      576     440
AT1G01020.3     582     711     127
AT1G01020.4     559     711     127
AT1G01020.5     582     597     127
AT1G01020.6     189     315     440
AT1G01030.1     613     1077    215
AT1G01030.2     613     1008    215
AT1G01040.1     398     5730    148

Map Riboseq reads to representative coding sequences and pointize alignments

preindex.pl was used to build necessary index files for samtools and some BioPerl modules. bowtie2-build was used to build bowtie2 index for mapping Riboseq reads using bowtie2.

ubuntu@ubuntu20:~$ preindex.pl araport.coding.rep.fasta

ubuntu@ubuntu20:~$ bowtie2-build araport.coding.rep.fasta araport.coding.rep

In this case, we have four FASTA files of Riboseq reads: root-ctr-0h-r1.fasta, root-ctr-0h-r2.fasta, shoot-ctr-0h-r1.fasta, and shoot-ctr-0h-r2.fasta. They were all adapter removed and filtered by removing reads mapping to undesired non-coding genes. The following perl one-liner was applied to generate Mapping.pl commands that utilize bowtie2 for mapping. Remember to remove the sharp mark to actual execute the commands.

ubuntu@ubuntu20:~$ ls root*.fasta shoot*.fasta | perl -ne 'chomp; /(.+?)\./; $cmd="Mapping.pl x $_ $1.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95"; print "\nCMD: $cmd\n"; #system $cmd'

CMD: Mapping.pl x root-ctr-0h-r1.fasta root-ctr-0h-r1.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95

CMD: Mapping.pl x root-ctr-0h-r2.fasta root-ctr-0h-r2.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95

CMD: Mapping.pl x shoot-ctr-0h-r1.fasta shoot-ctr-0h-r1.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95

CMD: Mapping.pl x shoot-ctr-0h-r2.fasta shoot-ctr-0h-r2.repCoding.bam MappingBowtie.pl -target araport.coding.rep -p 8 -k 30 --gbar 51 --sensitive -ID 0.95

Important options in these commands:

  1. -p 8: a bowtie2 option so that 8 threads will be used (as we performed this entire example in a VM with 8 vCPUs).
  2. -k 30: a bowtie2 option so that at most 30 alignments will be reported for one read. Note that bowtie2 will tend to report only one alignment per reads if without this setting.
  3. --gbar 51: a bowtie2 option for disallowing creating any gaps in first 51bps of reads. Our maximum input read length is less than this number so there should be no gaps in all alignments.
  4. --sensitive: a preset bowtie2 option for sensitively making alignments
  5. -ID 0.95: an output controlling option of Mapping.pl. This is to remove output alignments with identity (#matches/read length) less than 0.95.

To further make sure that all counted reads were unambiguously mapped to representative coding models. We used java program misc.AlignmentFilter2 to pick best alignments for all reads (-filter top true) and filter out reads with multiple best alignments (-filter uniq true). Remember to remove the sharp mark to actual execute the commands.

ubuntu@ubuntu20:~$ ls *.repCoding.bam | perl -ne 'chomp; /(.+?)\./; $cmd="java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo $1.topUniq.bam -T araport.coding.rep.fasta /dev/stdin"; print "\nCMD: $cmd\n"; #system $cmd'

CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM root-ctr-0h-r1.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo root-ctr-0h-r1.topUniq.bam -T araport.coding.rep.fasta /dev/stdin

CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM root-ctr-0h-r2.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo root-ctr-0h-r2.topUniq.bam -T araport.coding.rep.fasta /dev/stdin

CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM shoot-ctr-0h-r1.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo shoot-ctr-0h-r1.topUniq.bam -T araport.coding.rep.fasta /dev/stdin

CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAM shoot-ctr-0h-r2.repCoding.bam -O /dev/stdout -quiet true -filter top true -filter uniq true | samtools view -Sbo shoot-ctr-0h-r2.topUniq.bam -T araport.coding.rep.fasta /dev/stdin

To compute Riboseq footprints in 1bp resolution, a common practice is to count exactly one specific point of all reads. For example, 15th base-pairs of all reads were adopted for computing read mapping distribution in UTR and CDS regions in PMID:24179124. In this example, we took the 13th base-pairs. The process is a pipeline of BAM => SAM => SamPointize.pl => BAM.

ubuntu@ubuntu20:~$ ls *.topUniq.bam | perl -ne 'chomp; /(.+?)\./; $cmd="samtools view $_ | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo $1.point.bam -T araport.coding.rep.fasta /dev/stdin"; print "\nCMD: $cmd\n"; #system $cmd'

CMD: samtools view root-ctr-0h-r1.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo root-ctr-0h-r1.point.bam -T araport.coding.rep.fasta /dev/stdin

CMD: samtools view root-ctr-0h-r2.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo root-ctr-0h-r2.point.bam -T araport.coding.rep.fasta /dev/stdin

CMD: samtools view shoot-ctr-0h-r1.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo shoot-ctr-0h-r1.point.bam -T araport.coding.rep.fasta /dev/stdin

CMD: samtools view shoot-ctr-0h-r2.topUniq.bam | SamPointize.pl -s 13 /dev/stdin | samtools view -Sbo shoot-ctr-0h-r2.point.bam -T araport.coding.rep.fasta /dev/stdin

The option -s 13 is to take the 13th base-pair in the reads (following read orientation) as the only mapping base-pair in the output alignments.

ubuntu@ubuntu20:~$ samtools view root-ctr-0h-r1.topUniq.bam | head -1
HWI-ST615:762:H9PFHADXX:1:1101:1189:92557_1:N:0:TTAGGC  0       AT5G49720.1     1188    255     29M     *       0       0       TNGTGCTAAGGTGGTGTATCAGTTTGGAA   IIIIIIIIIIIIIIIIIIIIIIIIIIIII        MD:Z:1G27       XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-1 YT:Z:UU

ubuntu@ubuntu20:~$ samtools view root-ctr-0h-r1.point.bam | head -1
HWI-ST615:762:H9PFHADXX:1:1101:1189:92557_1:N:0:TTAGGC  0       AT5G49720.1     1200    255     1M      *       0       0       G       I

In so doing, the mapping start positions in the SAM records would be exactly the positions of 13th base-pairs of mapped reads.

3bp periodicity plots

With pointized Riboseq alignments in *.point.bam and 5UTR/CDS/3UTR lengths in araport11.mRNA.region.lengths, we can compute read accumulation at every position in the beginning of CDS regions.

ubuntu@ubuntu20:~$ ls *.point.bam | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "samtools view $_"' | perl -ne 'if($.==1){ open(FILE,"<araport11.mRNA.region.lengths"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $id=shift @s; push @{$lenHash{$id}}, @s; } close FILE; } if(/^!!!FILE: (\S+)/){ $f=$1 }else{ @t=split; @len=@{$lenHash{$t[2]}}; if($len[0]<=$t[3] && $t[3]<=$len[0]+51){ $hash{$f}{$t[3]-$len[0]}++ } } if(eof STDIN){ for $f (sort keys %hash){ print "\t$f" } print "\n"; for $x (0..51){ print "$x"; for $f (sort keys %hash){ print "\t$hash{$f}{$x}" } print "\n"; } }' > CDS.acc.xls

ubuntu@ubuntu20:~$ head CDS.acc.xls
        root-ctr-0h-r1  root-ctr-0h-r2  shoot-ctr-0h-r1 shoot-ctr-0h-r2
0       4189    2667    13976   7524
1       14724   7762    21429   7762
2       5925    1964    2751    1665
3       1585    918     4364    2536
4       2217    1320    4455    1869
5       1061    543     1219    946
6       807     528     2856    1813
7       1986    1247    3964    1680
8       975     500     1508    1432

Some Excel operation can give us the following plot, which shows a 3bp periodicity along the CDS region. Riboseq 3bp periodicity

Explanation of the perl one-liner:

ls *.point.bam |                      # for every pointized BAM file
perl -ne '
	chomp; 
	/(.+?)\./; 
	print "!!!FILE: $1\n";        # print a special mark (!!!FILE:) and file label
	system "samtools view $_"     # print pointized SAM records
' | perl -ne '
	if($.==1){                    # when read first line from the input, this block works only once
		open(FILE,"<araport11.mRNA.region.lengths"); 
		while($line=<FILE>){ 
			chomp $line; 
			@s=split(/\s+/,$line); 
			$id=shift @s; 
			push @{$lenHash{$id}}, @s;  # save 5UTR/CDS/3UTR region lengths into hash %lenHash
		} 
		close FILE; 
	} 
	if(/^!!!FILE: (\S+)/){ 
		$f=$1                               # store label information if see the special mark
	}else{ 
		@t=split;                           # $t[2] is representative model ID $t[3] for the mapping position
		@len=@{$lenHash{$t[2]}};            # retrieve region lengths from %lenHash
		if($len[0]<=$t[3] && $t[3]<=$len[0]+51){  # the point is in 1+first 51bps of CDS
			$hash{$f}{$t[3]-$len[0]}++        # $t[3]-$len[0]==0 => the base-pair in front of CDS
		} 
	} 
	if(eof STDIN){                # when read last line from the input, this block works only once
		for $f (sort keys %hash){ print "\t$f" } print "\n";  # output header
		for $x (0..51){                                       # for positions 0, 1, 2, ... 51
			print "$x"; 
			for $f (sort keys %hash){ 
				print "\t$hash{$f}{$x}"    # output read accumulation for each sample
			} 
			print "\n"; 
		} 
	}
' > CDS.acc.xls                       # this file can be opened by Excel as it is actually a tsv file

Positional preference in 3bp periodicity

Again, with pointized Riboseq alignments in *.point.bam and 5UTR/CDS/3UTR lengths in araport11.mRNA.region.lengths, we can compute read accumulation at 0th, 1th, and 2th bps of every 3 base-pair for 5UTR, CDS and 3UTR.

ubuntu@ubuntu20:~$ ls *.point.bam | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "samtools view $_"' | perl -ne 'if($.==1){ open(FILE,"<araport11.mRNA.region.lengths"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $id=shift @s; push @{$lenHash{$id}}, @s; } close FILE; } if(/^!!!FILE: (\S+)/){ $f=$1 }else{ @t=split; @len=@{$lenHash{$t[2]}}; $class=""; if($t[3]<=$len[0]){ $class="5utr"; $mod=$t[3]%3 }elsif($len[0]+1<=$t[3] && $t[3]<=$len[0]+51){ $class="cds"; $mod=($t[3]-$len[0]-1)%3 }elsif($len[0]+$len[1]+1<=$t[3] && $t[3]<=$len[0]+$len[1]+51){ $class="3utr"; $mod=($t[3]-$len[0]-$len[1]-1)%3 } $hash{"$f\t$class"}{$mod}++ if $class ne "" } if(eof STDIN){ for $k1 (sort keys %hash){ print "$k1\t$hash{$k1}{0}\t$hash{$k1}{1}\t$hash{$k1}{2}\n" } }'
root-ctr-0h-r1  3utr    581     645     458
root-ctr-0h-r1  5utr    14039   10580   11301
root-ctr-0h-r1  cds     48342   25104   16084
root-ctr-0h-r2  3utr    450     493     276
root-ctr-0h-r2  5utr    17086   8531    11170
root-ctr-0h-r2  cds     28842   11148   11963
shoot-ctr-0h-r1 3utr    732     632     448
shoot-ctr-0h-r1 5utr    21116   18949   21828
shoot-ctr-0h-r1 cds     72020   22276   43475
shoot-ctr-0h-r2 3utr    521     380     362
shoot-ctr-0h-r2 5utr    18044   14234   17692
shoot-ctr-0h-r2 cds     32442   19089   30278

It is observed that for CDS region, 0th bps of all 3 base-pair has significantly more reads than 1th and 2th bps, compared to other regions.

Explanation of the perl one-liner:

ls *.point.bam |                      # for every pointized BAM file
perl -ne '
	chomp; 
	/(.+?)\./; 
	print "!!!FILE: $1\n";        # print a special mark (!!!FILE:) and file label
	system "samtools view $_"     # print pointized SAM records
' | perl -ne '
	if($.==1){                    # when read first line from the input, this block works only once
		open(FILE,"<araport11.mRNA.region.lengths"); 
		while($line=<FILE>){ 
			chomp $line; 
			@s=split(/\s+/,$line); 
			$id=shift @s; 
			push @{$lenHash{$id}}, @s;  # save 5UTR/CDS/3UTR region lengths into hash %lenHash
		} 
		close FILE; 
	} 
	if(/^!!!FILE: (\S+)/){ 
		$f=$1                               # store label information if see the special mark
	}else{ 
		@t=split;                           # $t[2] is representative model ID $t[3] for the mapping position
		@len=@{$lenHash{$t[2]}};            # retrieve region lengths from %lenHash
		$class=""; 
		if($t[3]<=$len[0]){                 # the point is in 5UTR
			$class="5utr"; $mod=$t[3]%3 
		}elsif($len[0]+1<=$t[3] && $t[3]<=$len[0]+51){  # the point is in first 51bps of CDS
			$class="cds"; $mod=($t[3]-$len[0]-1)%3 
		}elsif($len[0]+$len[1]+1<=$t[3] && $t[3]<=$len[0]+$len[1]+51){  # the point is in first 51bps of 3UTR
			$class="3utr"; $mod=($t[3]-$len[0]-$len[1]-1)%3 
		}                                               # $mod for position of every 3 base-pair
		$hash{"$f\t$class"}{$mod}++ if $class ne ""     # add count for corresponding file and region
	} 
	if(eof STDIN){                # when read last line from the input, this block works only once
		for $k1 (sort keys %hash){ 
			print "$k1\t$hash{$k1}{0}\t$hash{$k1}{1}\t$hash{$k1}{2}\n" 
		} 
	}
'

Appendix: a perl one-liner for read length distribution

Supposed that cleaned Riboseq reads were saved in FASTQ format and compressed in *.gz files. The following perl one-liner can used to compute read length distribution of all files.

ubuntu@ubuntu20:~$ ls *.gz | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "gzip -dc $_"' | perl -ne 'chomp; if(/^!!!FILE: (\S+)/){ $cnt=0; $f=$1; $files{$f}=1 }else{ $cnt++; if(($cnt%4)==2){ s/^\s+|\s+$//g; $hash{length($_)}{$f}++ } } if(eof){ for $f (sort keys %files){ print "\t$f" } print "\n"; for $len (sort {$a<=>$b} keys %hash){ print "$len"; for $f (sort keys %files){ if(exists $hash{$len}{$f}){ print "\t$hash{$len}{$f}" }else{ print "\t0" } } print "\n" } }' | less
        root-ctr-0h-r1  root-ctr-0h-r2  shoot-ctr-0h-r1 shoot-ctr-0h-r2
15      0       1509    0       802
16      0       3511    0       2092
17      0       4815    0       4725
18      0       5887    0       7212
19      0       6926    0       7511
20      23789   13131   18681   9551
21      57060   38401   42113   24438
22      81405   75339   64279   37983
23      112457  131009  113117  123383
24      322571  273205  208608  127738
25      622358  618208  446363  185290
26      2402146 1087920 785581  269089
27      4216948 3042870 1258728 512239
28      5425997 4826210 3080591 1356950
29      4802262 5300209 4904951 3475938
30      9117234 9185777 8025603 5354937
31      12669476        12426514        12204871        9254776
32      10271297        12536049        8844096 6899578
33      4059393 8401201 9179076 7758248
34      1950254 5177039 5176473 7126098
35      943215  2270547 2768890 5014734
36      561691  1325082 1409064 3149693
37      313184  562394  641187  1397641
38      92463   232682  329648  561905
39      23505   52484   155155  473897
40      7802    17128   49005   104359
41      3284    4154    22703   23051
42      1588    1905    5235    6097
43      737     875     2059    4596
44      372     330     893     2380
45      218     124     600     945
46      150     52      491     317
47      66      19      347     183
48      16      8       59      46
⚠️ **GitHub.com Fallback** ⚠️