Perl Tutorial Regular Expressions, File IO & Text Processing - JohnHau/mis GitHub Wiki

Perl is famous for processing text files via regular expressions.

1. Regular Expressions in Perl

A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that matches the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.

I shall assume that you are familiar with Regex syntax. Otherwise, you could read:

- "Regex Syntax Summary" for a summary of regex syntax and examples.
- "Regular Expressions" for full coverage.

Perl makes extensive use of regular expressions with many built-in syntaxes and operators. In Perl (and JavaScript), a regex is delimited by a pair of forward slashes (default), in the form of /regex/. You can use built-in operators:

- m/regex/modifier: Match against the regex.
- s/regex/replacement/modifier: Substitute matched substring(s) by the replacement.

1.1 Matching Operator m//

You can use the matching operator m// to check if a regex pattern exists in a string. The syntax is:

    m/regex/
    m/regex/modifiers   # Optional modifiers
    /regex/             # Operator m can be omitted if forward-slashes are used as delimiter
    /regex/modifiers

Delimiter: Instead of using forward-slashes (/) as delimiter, you could use other non-alphanumeric characters such as !, @ and % in the form of m!regex!modifiers, m@regex@modifiers or m%regex%modifiers. However, if forward-slash (/) is used as the delimiter, the operator m can be omitted in the form of /regex/modifiers. Changing the default delimiter is confusing, and not recommended.

m//, by default, operates on the default variable $_. It returns true if $_ matches regex, and false otherwise.

Example 1: Regex [0-9]+

    #!/usr/bin/env perl
    # try_m_1.pl
    use strict;
    use warnings;
    while (<>) {   # Read each input line (STDIN, or files named on the command line) into $_
       print m/[0-9]+/ ? "Accept\n" : "Reject\n";   # one or more digits?
    }

    $ ./try_m_1.pl
    123
    Accept
    00000
    Accept
    abc
    Reject
    abc123
    Accept

Example 2: Extracting the Matched Substrings

The built-in array variables @- and @+ keep the start and end positions of the matched substring: $-[0] and $+[0] for the full match, and $-[n] and $+[n] for back references $1, $2, ..., $n.

    #!/usr/bin/env perl
    # try_m_2.pl
    use strict;
    use warnings;
    while (<>) {   # Read each input line (STDIN, or files named on the command line) into $_
       if (m/[0-9]+/) {
          print 'Accept substring: ' . substr($_, $-[0], $+[0] - $-[0]) . "\n";
       } else {
          print "Reject\n";
       }
    }

    $ ./try_m_2.pl
    123
    Accept substring: 123
    00000
    Accept substring: 00000
    abc
    Reject
    abc123xyz
    Accept substring: 123
    abc123xyz456
    Accept substring: 123

Example 3: Modifier 'g' (global)

By default, m// finds only the first match. To find all matches, include the 'g' (global) modifier.

    #!/usr/bin/env perl
    # try_m_3.pl
    use strict;
    use warnings;

    my $regex = '[0-9]+';   # Define regex pattern in non-interpolating string

    while (<>) {   # Read each input line (STDIN, or files named on the command line) into $_
       # Do m//g and save matched substrings into an array
       my @matches = /$regex/g;
       print "Matched substrings (in array): @matches\n";   # print array

       # Do m//g in a loop
       print 'Matched substrings (in loop) : ';
       while (/$regex/g) {
          print substr($_, $-[0], $+[0] - $-[0]), ',';
       }
       print "\n";
    }

    $ ./try_m_3.pl
    abc123xyz456_0_789
    Matched substrings (in array): 123 456 0 789
    Matched substrings (in loop) : 123,456,0,789,
    abc
    Matched substrings (in array):
    Matched substrings (in loop) :
    123
    Matched substrings (in array): 123
    Matched substrings (in loop) : 123,

1.2 Operators =~ and !~

By default, the matching operators operate on the default variable $_. To operate on another variable instead of $_, you could use the =~ and !~ operators as follows:

    str =~ m/regex/modifiers   # Return true if str matches regex.
    str !~ m/regex/modifiers   # Return true if str does NOT match regex.

When used with m//, =~ behaves like comparison (== or eq).

Example 4: =~ Operator

    #!/usr/bin/env perl
    # try_m_4.pl
    use strict;
    use warnings;

    print 'yes or no? ';
    my $reply;
    chomp($reply = <>);   # Remove newline
    print $reply =~ /^y/i ? "positive!\n" : "negative!\n";
          # Begins with 'y', case-insensitive
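The negated operator !~ can be sketched in the same way. The sample lines and the blank-or-comment pattern below are made up for illustration; only the operator itself comes from this section.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: a few configuration-style lines.
my @lines = ('# settings', '', 'name = perl', '   ', 'level = 5');

# Keep only the lines that do NOT match "blank, or starting with #".
my @kept;
for my $line (@lines) {
    push @kept, $line if $line !~ /^\s*(#|$)/;
}

print "$_\n" for @kept;
```

Writing `$line !~ /regex/` is equivalent to `!($line =~ /regex/)`, which some readers find easier to scan in longer conditions.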

1.3 Substitution Operator s///

You can substitute a string (or a portion of a string) with another string using the s/// substitution operator. The syntax is:

    s/regex/replacement/
    s/regex/replacement/modifiers   # Optional modifiers

Similar to m//, s/// operates on the default variable $_ by default. To operate on another variable, use the =~ operator. When used with s///, =~ behaves like assignment (=): the variable on the left-hand side is modified in place.
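As a small sketch of s/// applied to a named variable through =~, the snippet below also captures the return value of s///, which is the number of substitutions made. The sample strings are invented for the illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $text = 'red apple, red car, red door';

# s///g edits $text in place and returns how many substitutions were made.
my $count = ($text =~ s/red/blue/g);
print "$count replacements: $text\n";

# Without /g, only the first occurrence is replaced.
my $first_only = 'aaa';
$first_only =~ s/a/b/;
print "$first_only\n";
```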

Example 5: s///

    #!/usr/bin/env perl
    # try_s_1.pl
    use strict;
    use warnings;

    while (<>) {   # Read each input line (STDIN, or files named on the command line) into $_
       s/\w+/***/g;   # Replace each word with ***
       print "$_";
    }

    $ ./try_s_1.pl
    this is an apple.
    *** *** *** ***.

1.4 Modifiers

Modifiers (such as /g, /i, /e, /o, /s and /x) can be used to control the behavior of m// and s///.

- g (global): By default, only the first occurrence of the matching string of each line is processed. You can use modifier /g to specify global operation.
- i (case-insensitive): By default, matching is case-sensitive. You can use the modifier /i to enable case-insensitive matching.
- m (multiline): treats the string as multiple lines, so that the anchors ^ and $ match at the start and end of every line. \A and \Z are not affected and still refer to the whole string.
- s: permits metacharacter . (dot) to match the newline.

1.5 Parenthesized Back-References & Matched Variables $1, ..., $9

Parentheses ( ) serve two purposes in regex:

- Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, /(a|e|i|o|u){3,5}/ matches three to five consecutive vowels in any combination (such as aei or ouu), the same strings as /[aeiou]{3,5}/. It is not the same as /a{3,5}|e{3,5}|i{3,5}|o{3,5}|u{3,5}/, which requires a single vowel to be repeated.
- Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched sub-string. For example, the regex /(\S+)/ creates one back-reference (\S+), which contains the first word (consecutive non-spaces) in the input string; the regex /(\S+)\s+(\S+)/ creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.

The back-references are stored in special variables $1, $2, …, $9 (and $10 onwards if the regex has more than nine pairs of parentheses), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, /(\S+)\s+(\S+)/ creates two back-references which match the first two words. The matched words are stored in $1 and $2, respectively.

For example, the following expression swaps the first and second words:

    s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space

Back-references can also be referenced in your program.

For example,

    (my $word) = ($str =~ /(\S+)/);

The parentheses create one back-reference, which matches the first word of $str if there is one, and the match is placed in the scalar variable $word. If there is no match, $word is undef.

Another example,

    (my $word1, my $word2) = ($str =~ /(\S+)\s+(\S+)/);

The 2 pairs of parentheses place the first two words (separated by one or more white-spaces) of $str into variables $word1 and $word2 if there are at least two words; otherwise, both $word1 and $word2 are undef. Note that the whole regex must match: if only part of the pattern can be matched, nothing is captured.

\1, \2, \3 have the same meaning as $1, $2, $3, but are valid only inside the regex part of s/// or m//. For example, /(\S+)\s\1/ matches a pair of repeated words, separated by a white-space.
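The three uses of back-references described in this section (list-context capture, $1 and $2 in a replacement, and \1 inside the pattern) can be sketched together as follows. The sample strings and variable names are made up for the illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $str = 'alpha beta gamma';

# In list context, a successful match returns the captured substrings.
my ($first, $second) = ($str =~ /(\S+)\s+(\S+)/);

# $1 and $2 in the replacement refer to the captures of this substitution.
my $swapped = $str;
$swapped =~ s/(\S+)\s+(\S+)/$2 $1/;

# \1 inside the pattern must match the same text that group 1 matched.
my $repeated = '';
if ('this is is a test' =~ /\b(\S+)\s+\1\b/) {
    $repeated = $1;
}

print "first=$first second=$second\n";
print "swapped=$swapped\n";
print "repeated word=$repeated\n";
```

The \b anchors in the last pattern are an addition for this sketch: they stop the pattern from pairing the tail of "this" with the following "is".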

1.6 Character Translation Operator tr///

You can use the translation operator tr/// to translate a character into another character. The syntax is:

    tr/fromchars/tochars/modifiers

It replaces (translates) fromchars to tochars in $_, and returns the number of characters replaced.

For examples,

    tr/a-z/A-Z/           # converts $_ to uppercase.
    tr/dog/cat/           # translates d to c, o to a, g to t.
    $str =~ tr/0-9/a-j/   # replace 0 by a, etc in $str.
    tr/A-CG/KX-Z/         # replace A by K, B by X, C by Y, G by Z.

Instead of forward slash (/), you can use parentheses (), brackets [], curly bracket {} as delimiter, e.g.,

    tr[0-9][##########]   # replace numbers by #.
    tr{!.}(.!)            # swap ! and ., one pass.

If tochars is shorter than fromchars, the last character of tochars is used repeatedly.

    tr/a-z/A-E/   # f to z is replaced by E.

tr/// returns the number of replaced characters. You can use it to count the occurrence of certain characters. For examples,

    my $numLetters = ($string =~ tr/a-zA-Z/a-zA-Z/);
    my $numDigits  = ($string =~ tr/0-9/0-9/);
    my $numSpaces  = ($string =~ tr/ / /);

Modifiers /c, /d and /s for tr///

- /c: complements (inverses) fromchars.
- /d: deletes any matched but un-replaced characters.
- /s: squashes duplicate characters into just one.

For examples,

    tr/A-Za-z/ /c   # replaces all non-alphabets with space
    tr/A-Z//d       # deletes all uppercase (matched with no replacement).
    tr/A-Za-z//dc   # deletes all non-alphabets
    tr/!//s         # squashes duplicate !

1.7 String Functions: split and join

split(regex, str, [numItems]): Splits the given str using the regex, and returns the items in an array. The optional third parameter specifies the maximum number of items to return; the last item then holds the rest of the string.

join(joinStr, strList): Joins the items in strList with the given joinStr (possibly empty).

For examples,

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature 'say';   # say is not enabled by default

    my $msg = 'Hello, world again!';
    my @words = split(/ /, $msg);   # ('Hello,', 'world', 'again!')
    for (@words) { say; }           # Use default scalar variable

    say join('--', @words);         # 'Hello,--world--again!'
    my $newMsg = join '', @words;   # 'Hello,worldagain!'
    say $newMsg;

1.8 Functions grep, map

- grep(expr, array): selects those elements of the array for which expr is true. The expression is often a regex, tested against each element in $_; for example, grep(/^a/, @words) keeps the words beginning with a.
- map(expr, array): returns a new array constructed by evaluating expr for each element of the array (set in $_); for example, map(uc, @words) returns the words in uppercase.

2. File Input/Output

2.1 Filehandle

Filehandles are data structures which your program can use to manipulate files. A filehandle acts as a gate between your program and the files, directories, or other programs. Your program first opens a gate, then sends or receives data through the gate, and finally closes the gate. There are many types of gates: one-way vs. two-way, slow vs. fast, wide vs. narrow.

Naming Convention: use uppercase for the name of the filehandle, e.g., FILE, DIR, FILEIN, FILEOUT, etc.

Once a filehandle is created and connected to a file (or a directory, or a program), you can read from or write to the underlying file through the filehandle. Reading uses angle brackets, e.g., <FILE>.

Example: Read and print the content of a text file via a filehandle.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    # FileRead.pl: Read & print the content of a text file.

    my $filename = shift;   # Get the filename from command line.

    # Create a filehandle called FILE and connect to the file.
    open(FILE, $filename) or die "Can't open $filename: $!";

    while (<FILE>) {   # Set $_ to each line of the file
       print;          # Print $_
    }

Example: Search and print lines containing a particular search word.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    # FileSearch.pl: Search for lines containing a search word.

    (my $filename, my $word) = @ARGV;   # Get filename & search word.

    # Create a filehandle called FILE and connect to the file.
    open(FILE, $filename) or die "Can't open $filename: $!";

    while (<FILE>) {              # Set $_ to each line of the file
       print if /\b$word\b/i;     # Match $_ with word, case insensitive
    }

Example: Print the content of a directory via a directory handle.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    # DirPrint.pl: Print the content of a directory.

    my $dirname = shift;   # Get directory name from command-line
    opendir(DIR, $dirname) or die "Can't open directory $dirname: $!";
    my @files = readdir(DIR);
    foreach my $file (@files) {
       # Display files not beginning with dot.
       print "$file\n" if ($file !~ /^\./);
    }

You can use C-style printf for formatted output to a file.
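A minimal sketch of printf writing to a filehandle follows. It uses a lexical filehandle and a temporary file from the core File::Temp module rather than the bareword handles used elsewhere in this tutorial, and the item names and numbers are made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; removed when the program exits.
my ($fh, $path) = tempfile(UNLINK => 1);

# Hypothetical rows: name, quantity, unit price.
my @items = (['apple', 3, 0.5], ['watermelon', 12, 3.25]);

for my $item (@items) {
    # As with print, there is no comma after the filehandle.
    printf $fh "%-12s %4d %8.2f\n", @$item;
}
close($fh) or die "Can't close $path: $!";

# Read the report back and show it.
open(my $in, '<', $path) or die "Can't open $path: $!";
my @report = <$in>;
close($in);
print @report;
```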

2.2 File Handling Functions

Function open: open(filehandle, string) opens the filename given by string and associates it with the filehandle. It returns true on success and undef otherwise.

- If string begins with < (or nothing), it is opened for reading.
- If string begins with >, it is opened for writing.
- If string begins with >>, it is opened for appending.
- If string begins with +<, +>, +>>, it is opened for both reading and writing.
- If string is -, STDIN is opened.
- If string is >-, STDOUT is opened.
- If string is -| or |-, your process will fork() and the filehandle is connected to the child process through a pipe.

Function close: close(filehandle) closes the file associated with the filehandle. When the program exits, Perl closes all opened filehandles. Closing a file flushes the output buffer to the file. You only have to explicitly close the file in case the user aborts the program, to ensure data integrity.
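The read, write and append modes listed above can be sketched as follows. This sketch uses the three-argument form of open with lexical filehandles, where the mode is passed as a separate argument, instead of the two-argument form described in this section; the file name demo.txt and the temporary directory are assumptions made for the illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Work in a scratch directory that is removed when the program exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'demo.txt');   # hypothetical file name

# '>' creates or truncates the file for writing.
open(my $out, '>', $path) or die "Can't write to $path: $!";
print $out "line 1\n";
close($out) or die "Can't close $path: $!";

# '>>' appends to the end of the file.
open(my $app, '>>', $path) or die "Can't append to $path: $!";
print $app "line 2\n";
close($app) or die "Can't close $path: $!";

# '<' opens the file for reading.
open(my $in, '<', $path) or die "Can't open $path: $!";
my @lines = <$in>;
close($in);

print @lines;
```

Keeping the mode separate from the file name means a name that happens to begin with > or | is not mistaken for a mode.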

A common procedure for modifying a file is to:

1. Read in the entire file with open(FILE, $filename) and @lines = <FILE>.
2. Close the filehandle.
3. Operate upon @lines (which is in the fast RAM) rather than FILE (which is on the slow disk).
4. Write the new file contents using open(FILE, ">$filename") and print FILE @lines.
5. Close the filehandle.

Example: Read the contents of the entire file into memory; modify and write back to disk.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    # FileChange.pl

    my $filename = shift;   # Get the filename from command line.

    # Create a filehandle called FILE and connect to the file.
    open(FILE, $filename) or die "Can't open $filename: $!";

    # Read the entire file into an array in memory.
    my @lines = <FILE>;
    close(FILE);

    open(FILE, ">$filename") or die "Can't write to $filename: $!";
    foreach my $line (@lines) {
       print FILE uc($line);   # Change to uppercase
    }
    close(FILE);

Example: Reading from a file

    #!/usr/bin/env perl
    use strict;
    use warnings;

    open(FILEIN, "test.txt") or die "Can't open file: $!";
    while (<FILEIN>) {   # set $_ to each line of the file.
       print;            # print $_
    }

Example: Writing to a file

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $filename = shift;   # Get the file from command line.
    open(FILE, ">$filename") or die "Can't write to $filename: $!";
    print FILE "This is line 1\n";   # no comma after FILE.
    print FILE "This is line 2\n";
    print FILE "This is line 3\n";

Example: Appending to a file

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $filename = shift;   # Get the file from command line.
    open(FILE, ">>$filename") or die "Can't append to $filename: $!";
    print FILE "This is line 4\n";   # no comma after FILE.
    print FILE "This is line 5\n";

2.3 In-Place Editing

Instead of reading in one file and writing to another file, you could do in-place editing by specifying the -i flag or using the special variable $^I.

- The -ibackupExtension flag tells Perl to edit files in-place. If a backupExtension is provided, a backup file will be created with the backupExtension.
- The special variable $^I=backupExtension does the same thing.

Example: In-place editing using -i flag

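    #!/usr/bin/perl -i.old
    # In-place edit, backup as '.old'.
    # The interpreter path is written out here (adjust it to your system) because
    # many systems pass "perl -i.old" to env as a single argument.
    use strict;
    use warnings;

    while (<>) {
       s/line/TEST/g;
       print;   # Print to the file, not STDOUT.
    }

Example: In-place editing using $^I special variable.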

    #!/usr/bin/env perl
    use strict;
    use warnings;

    $^I = '.bak';   # Enable in-place editing, backup in '.bak'.
    while (<>) {
       s/TEST/line/g;
       print;   # Print to the file, not STDOUT.
    }

2.4 Functions seek, tell, truncate

seek(filehandle, position, whence): moves the file pointer of the filehandle to position, as measured from whence. seek() returns 1 upon success and 0 otherwise. File position is measured in bytes. A whence of 0 measures from the beginning of the file; 1 from the current position; and 2 from the end. For example:

    seek(FILE, 0, 2);     # 0 byte from end-of-file; tell(FILE) then gives the file size.
    seek(FILE, -2, 2);    # 2 bytes before end-of-file.
    seek(FILE, -10, 1);   # Move file pointer 10 bytes backward.
    seek(FILE, 20, 0);    # 20 bytes from the beginning of the file.

tell(filehandle): returns the current file position of filehandle.

truncate(FILE, length): truncates FILE to length bytes. FILE can be either a filehandle or a file name.
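The three functions can be exercised together on a scratch file. The sketch below is an illustration under assumptions: it uses a temporary file from the core File::Temp module (which is opened for both reading and writing) and ten bytes of made-up content.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file opened read/write; removed when the program exits.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);
print $fh '0123456789';   # ten bytes of sample data

# Seek to the end: tell() now reports the file size.
seek($fh, 0, 2) or die "seek failed: $!";
my $size = tell($fh);

# Seek to byte 3 from the beginning and read 4 bytes.
seek($fh, 3, 0) or die "seek failed: $!";
my $chunk;
read($fh, $chunk, 4);
my $pos_after_read = tell($fh);   # reading advances the file pointer

# Cut the file down to its first 5 bytes and measure again.
truncate($fh, 5) or die "truncate failed: $!";
seek($fh, 0, 2) or die "seek failed: $!";
my $new_size = tell($fh);
close($fh);

print "size=$size chunk=$chunk pos=$pos_after_read new size=$new_size\n";
```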

To find the length of a file, you could:

    seek(FILE, 0, 2);   # Move file pointer to end of file.
    print tell(FILE);   # Print the file size.

Example: Truncate the last 2 bytes if they begin with \x0D.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $filename = shift;   # Get the file from command line.
    open(FILE, "+<$filename") or die "Can't open $filename: $!";
    seek(FILE, -2, 2);         # 2 bytes before end-of-file.
    my $pos = tell FILE;
    my $data = <FILE>;         # read moves the file pointer.
    if ($data =~ /^\x0D/) {    # begin with 0D
       truncate FILE, $pos;    # truncate last 2 bytes.
    }

2.5 Function eof

eof(filehandle) returns 1 if the file pointer is positioned at the end of the file or if the filehandle is not opened.

2.6 Reading Bytes Instead of Lines

The function read(filehandle, var, length, offset) reads length bytes from filehandle starting from the current file pointer, and saves them into variable var starting from offset (if omitted, default is 0). The bytes include \x0A, \x0D etc.

Example

    #!/usr/bin/env perl
    use strict;
    use warnings;

    (my $numbytes, my $filename) = @ARGV;
    open(FILE, $filename) or die "Can't open $filename: $!";

    my $data;
    read(FILE, $data, $numbytes);
    print $data, "\n----\n";

    read(FILE, $data, $numbytes);      # continue from current file ptr
    print $data, "\n----\n";

    read(FILE, $data, $numbytes, 2);   # save in $data offset 2
    print $data, "\n----\n";

2.7 Piping Data To and From a Process

If you wish your program to receive data from a process or want your program to send data to a process, you could open a pipe to an external program.

open(handle, "command|") lets you read from the output of command. open(handle, "|command") lets you write to the input of command. Both of these statements return the Process ID (PID) of the command.

Example: The dir command lists the current directory. By opening a pipe from dir, you can access its output.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    open(PIPEFROM, "dir|") or die "Pipe failed: $!";
    while (<PIPEFROM>) {
       print;
    }
    close PIPEFROM;

Example: This example shows how you can pipe input into the sendmail program.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $my_login = 'test';
    open(MAIL, "| sendmail -t -f$my_login") or die "Pipe failed: $!";
    # The addresses are redacted in the source; \@ stops Perl from
    # interpolating an array inside the double-quoted strings.
    print MAIL "From: [email\@protected]\n";   # no comma after MAIL.
    print MAIL "To: [email\@protected]\n";
    print MAIL "Subject: test\n";
    print MAIL "\n";
    print MAIL "Testing line 1\n";
    print MAIL "Testing line 2\n";
    close MAIL;

You cannot pipe data both to and from a command. If you want to read the output of a command that you have opened with |command, send the output to a file. For example,

    open (PIPETO, "|command > /output.txt");

2.8 Deleting file: Function unlink

unlink(FILES) deletes the FILES, returning the number of files deleted. Do not use unlink() to delete a directory; use rmdir() instead. For example,

    unlink $filename;
    unlink "/var/adm/message";
    unlink "message";

2.9 Inspecting Files

You can inspect a file using the (-test FILE) condition. The condition returns true if FILE satisfies test. FILE can be a filehandle or filename. The available tests are:

- -e: exists.
- -f: plain file.
- -d: directory.
- -T: seems to be a text file (data from 0 to 127).
- -B: seems to be a binary file (data from 0 to 255).
- -r: readable.
- -w: writable.
- -x: executable.
- -s: returns the size of the file in bytes.
- -z: empty (zero byte).

Example

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $dir = shift;
    opendir(DIR, $dir) or die "Can't open directory: $!";
    my @files = readdir(DIR);
    closedir(DIR);

    foreach my $file (@files) {
       if (-f "$dir/$file") {
          print "$file is a file\n";
          print "$file seems to be a text file\n" if (-T "$dir/$file");
          print "$file seems to be a binary file\n" if (-B "$dir/$file");
          my $size = -s "$dir/$file";
          print "$file size is $size\n";
          print "$file is empty\n" if (-z "$dir/$file");
       } elsif (-d "$dir/$file") {
          print "$file is a directory\n";
       }
       print "$file is readable\n" if (-r "$dir/$file");
       print "$file is writable\n" if (-w "$dir/$file");
       print "$file is executable\n" if (-x "$dir/$file");
    }


2.11 Accessing the Directories

- opendir(DIRHANDLE, dirname) opens the directory dirname.
- closedir(DIRHANDLE) closes the directory handle.
- readdir(DIRHANDLE) returns the next file from DIRHANDLE in scalar context, or the rest of the files in list context.
- glob(string) returns an array of filenames matching the wildcard in string, e.g., glob('*.dat') and glob('test??.txt').
- mkdir(dirname, mode) creates the directory dirname with the protection specified by mode.
- rmdir(dirname) deletes the directory dirname, only if it is empty.
- chdir(dirname) changes the working directory to dirname.
- chroot(dirname) makes dirname the root directory "/" for the current process; it can be used by the superuser only.

Example: Print the contents of a given directory.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $dirname = shift;   # first command-line argument.
    opendir(DIR, $dirname) or die "can't open $dirname: $!\n";
    my @files = readdir(DIR);
    closedir(DIR);
    foreach my $file (@files) {
       print "$file\n";
    }

Example: Removing empty files in a given directory

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $dirname = shift;
    opendir(DIR, $dirname) or die "Can't open directory: $!";
    my @files = readdir(DIR);
    foreach my $file (@files) {
       if ((-f "$dirname/$file") && (-z "$dirname/$file")) {
          print "deleting $dirname/$file\n";
          unlink "$dirname/$file";
       }
    }
    closedir(DIR);

Example: Display files matching "*.txt"

    my @files = glob('*.txt');
    foreach (@files) { print; print "\n" }

Example: Display files matching the command-line pattern.

    my $file = shift;
    my @files = glob($file);
    foreach (@files) { print; print "\n" }

2.12 Standard Filehandles

Perl defines the following standard filehandles:

- STDIN: Standard Input, usually refers to the keyboard.
- STDOUT: Standard Output, usually refers to the console.
- STDERR: Standard Error, usually refers to the console.
- ARGV: the files named on the command line (the names themselves are held in the array @ARGV).

For example:

    my $line  = <STDIN>;   # Set $line to the next line of user input
    my $item  = <ARGV>;    # Set $item to the next line of the files named on the command line
    my @items = <ARGV>;    # Put all the remaining lines of those files into the array

When you use empty angle brackets <>, Perl reads from the files named on the command line through the ARGV filehandle; if no files are named, it reads from STDIN. Perl fills in STDIN or ARGV for you automatically. Whenever you use the print() function without a filehandle, it uses the STDOUT filehandle by default.

<> behaves like <ARGV> when there is still data to be read from the command-line files, and behaves like <STDIN> otherwise.
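The file-reading side of <> can be sketched without typing anything at the keyboard by filling @ARGV from inside the script. This is an illustration under assumptions: the temporary file and its two lines are made up, and local @ARGV stands in for running the script with that file name on the command line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that plays the role of a file named on the command line.
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh "alpha\nbeta\n";
close($fh) or die "Can't close $path: $!";

my @seen;
{
    local @ARGV = ($path);   # as if the script were run as: script.pl <path>
    while (my $line = <>) {  # <> opens each file listed in @ARGV in turn
        chomp $line;
        push @seen, "$ARGV: $line";   # $ARGV holds the name of the current file
    }
}

print "$_\n" for @seen;
```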

3. Text Formatting

3.1 Function write

write(filehandle): prints formatted text to filehandle, using the format associated with filehandle. If filehandle is omitted, STDOUT is used by default.

3.2 Declaring format

    format name =
    text1
    text2
    .

3.3 Picture Field @<, @|, @>

- @<: left-flushes the string on the next line of formatting texts.
- @>: right-flushes the string on the next line of formatting texts.
- @|: centers the string on the next line of the formatting texts.
- @<, @>, @| can be repeated to control the number of characters to be formatted. The number of characters to be formatted is the same as the length of the picture field.
- @###.## formats numbers by lining up the decimal points under ".".

For examples,

[TODO]
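Until the author's examples are filled in, here is a minimal sketch of a format used with write, under stated assumptions: the format is attached to STDOUT, the variables $name, $qty and $price are package variables invented for this sketch, and the rows are made-up data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Package variables, so the format below can see them under "use strict".
our ($name, $qty, $price);

# One picture line followed by the line of values that fill it:
# a 10-character left-flushed field, a 4-character right-flushed field,
# and a number with two decimal places.
format STDOUT =
@<<<<<<<<< @>>> @##.##
$name,     $qty, $price
.

# Hypothetical rows: name, quantity, unit price.
my @rows = (['apple', 3, 0.5], ['melon', 12, 3.25]);

for my $row (@rows) {
    ($name, $qty, $price) = @$row;
    write;   # no filehandle given, so the STDOUT format is used
}
```

Each call to write prints one formatted line, so the names line up in the first column and the decimal points line up in the last.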
