Lesson 3 : Intro to Pipes and Filters - joslynnlee/CHEM-454 GitHub Wiki
Lesson adapted from "The Carpentries"
Introduction
Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways.
- How can I combine existing commands to do new things?
- Redirect a command’s output to a file.
- Process a file instead of keyboard input using redirection.
- Construct command pipelines with two or more stages.
We’ll start with the directory called shell-lesson-data/molecules, which contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
Let's see the files in the molecules folder:
ls molecules
Let’s go into that directory with cd and then run an example command, wc cubane.pdb:
cd molecules
wc cubane.pdb
wc is the ‘word count’ command: it counts the number of lines, words, and characters in files (from left to right, in that order).
If we run the command wc *.pdb, the * in *.pdb matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files in the current directory:
wc *.pdb
Note that wc *.pdb also shows the total across all files in the last line of the output.
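To see the wildcard expansion on its own, here is a minimal sketch that works in a scratch directory. The file names (demo1.pdb, demo2.pdb, notes.txt) are made up for illustration and are not part of the lesson data.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Three hypothetical files; only two of them end in .pdb.
printf 'ATOM 1\nATOM 2\n' > demo1.pdb
printf 'ATOM 1\n' > demo2.pdb
printf 'not a molecule\n' > notes.txt

# echo prints the arguments it receives, so this shows what the shell
# substituted for the pattern before the command ran.
echo *.pdb
# prints: demo1.pdb demo2.pdb

# wc receives the same expanded list: one row per file, then a final row
# for the combined counts.
wc *.pdb
```

notes.txt does not appear in either output because its name does not match the pattern.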
If we run wc -l instead of just wc, the output shows only the number of lines per file:
wc -l *.pdb
The -m and -w options can also be used with the wc command to show only the number of characters or only the number of words in the files.
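As a quick check of the three options, the sketch below runs each one on a small file. The file sample.txt and its contents are hypothetical, chosen so the counts are easy to verify by hand.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical file with 2 lines, 5 words, and 24 characters
# (the newline at the end of each line counts as a character).
printf 'one two three\nfour five\n' > sample.txt

wc -l sample.txt   # lines only: reports 2
wc -w sample.txt   # words only: reports 5
wc -m sample.txt   # characters only: reports 24
```

The amount of spacing before each number can differ between systems, but the numbers themselves should be the same.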
If you run a command and nothing seems to happen, the command is usually waiting for input typed at the keyboard. To see this, type wc -l with no filename:
wc -l
Because we gave no filename, wc assumes it should read from the keyboard, so it just sits there and waits for us to give it some data interactively. From the outside, though, all we see is it sitting there: the command doesn’t appear to do anything.
If you make this kind of mistake, you can escape out of this state by holding down the control key (Ctrl), typing the letter C once, and then letting go of the Ctrl key.
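One of the objectives above is to process a file instead of keyboard input using redirection. A minimal sketch of that idea, assuming a made-up file called notes.txt, uses the less than symbol, <, to connect a file to the command's input:

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical three-line file.
printf 'first\nsecond\nthird\n' > notes.txt

# With a filename argument, wc opens the file itself and prints its name.
wc -l notes.txt

# With "<", the shell opens the file and hands it to wc as its input,
# so wc does not wait for the keyboard and has no filename to print.
wc -l < notes.txt
```

Both commands report 3 lines; only the first one shows the name notes.txt next to the count.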
Redirect
Which of these files contains the fewest lines? It’s an easy question to answer when there are only six files, but what if there were 6000? Let's see what is in our directory:
ls
Our first step toward a solution is to run the command:
wc -l *.pdb > lengths.txt
The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution.
- Avoid Redirecting to the same file
It’s a very bad idea to try redirecting the output of a command that operates on a file to the same file. The shell empties the output file before the command starts reading, so doing something like this may give you incorrect results and/or delete the contents of the file.
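The sketch below shows what can go wrong, using a throwaway file so nothing important is lost. The names data.txt and counts.txt are hypothetical, and cat (covered in this lesson) is used only to display the result.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical file with three lines.
printf 'a\nb\nc\n' > data.txt

# Risky: the shell empties data.txt before wc gets a chance to read it.
wc -l data.txt > data.txt
cat data.txt       # shows a count of 0: the three original lines are gone

# Safer: recreate the file and send the output to a different file.
printf 'a\nb\nc\n' > data.txt
wc -l data.txt > counts.txt
cat counts.txt     # shows a count of 3, and data.txt is untouched
```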
Running ls lengths.txt confirms that the file exists:
ls lengths.txt
We can now send the content of lengths.txt to the screen using cat lengths.txt. The cat command gets its name from ‘concatenate’, i.e. join together, and it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:
cat lengths.txt
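To see the "one after another" behavior, here is a small sketch with two made-up files, part1.txt and part2.txt; combined.txt is likewise a hypothetical name.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

printf 'alpha\n' > part1.txt
printf 'beta\n' > part2.txt

# One file: cat simply prints it.
cat part1.txt

# Several files: cat prints them one after another, in the order given.
cat part1.txt part2.txt
# alpha
# beta

# Combined with ">", the joined output can be saved to a new file.
cat part1.txt part2.txt > combined.txt
```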
Filtering output
Next we’ll use the sort command to sort the contents of the lengths.txt file numerically with the -n flag. This does not change the file itself; the sorted result is only printed to the screen.
sort -n lengths.txt
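Why the -n flag matters is easiest to see on a tiny example. The sketch below uses a hypothetical file, values.txt, holding three numbers.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

printf '10\n9\n100\n' > values.txt

# Without -n, lines are compared character by character,
# so "10" and "100" come before "9".
sort values.txt
# 10
# 100
# 9

# With -n, lines are compared as numbers.
sort -n values.txt
# 9
# 10
# 100
```

In both cases values.txt keeps its original order, since sort only prints the sorted lines.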
Viewing files in Unix
Sometimes you need to check that you have the correct file without opening the whole thing. This is when the head and tail commands become helpful!
Let's use the head command on the cubane.pdb file:
head cubane.pdb
- What did this do?
- How does it differ from cat?
- What do you think tail does for viewing a file?
tail cubane.pdb
If you need only the first 3 lines of a file, you can use the -n flag with head (with tail, the same flag gives the last 3 lines instead):
head -n 3 cubane.pdb > 3-cubane.txt
cat 3-cubane.txt
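The sketch below compares head and tail on a file where the line numbers are obvious. The file numbers.txt (the numbers 1 to 12, one per line) and the output name first-three.txt are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical file with the numbers 1 to 12, one per line.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > numbers.txt

head numbers.txt        # default is the first 10 lines: 1 through 10
tail numbers.txt        # default is the last 10 lines: 3 through 12
head -n 3 numbers.txt   # first 3 lines: 1, 2, 3
tail -n 3 numbers.txt   # last 3 lines: 10, 11, 12

# As with any command, the output can be redirected to a file.
head -n 3 numbers.txt > first-three.txt
```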
Passing output to another command
In our example of finding the file with the fewest lines, we are using an intermediate file, lengths.txt, to store output. This is a confusing way to work because even once you understand what wc, sort, and head do, intermediate files make it hard to follow what’s going on. We can make it easier to understand by running sort and head together:
sort -n lengths.txt | head -n 1
The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.
This removes the need for a temporary file to hold the sorted output.
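To make the comparison concrete, the sketch below does the same job twice: once with a temporary file and once with a pipe. The file counts.txt and its contents are invented to resemble wc -l output, and sorted-counts.txt is a hypothetical name for the temporary file.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical line counts, in the same shape as wc -l output.
printf '30 c.pdb\n12 a.pdb\n20 b.pdb\n' > counts.txt

# Two steps, with a temporary file holding the sorted lines:
sort -n counts.txt > sorted-counts.txt
head -n 1 sorted-counts.txt
# 12 a.pdb

# One step, with a pipe: sort's output goes straight into head.
sort -n counts.txt | head -n 1
# 12 a.pdb
```

Both versions print the same line, but the piped version leaves no extra file behind.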
Combining multiple commands
Nothing prevents us from chaining pipes consecutively. We can, for example, send the output of wc directly to sort, and then the resulting output to head. This removes the need for any intermediate files. What do you think this would do?
wc -l *.pdb | sort -n | head -n 1
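If you want to try the three-stage pipeline where the answer is known in advance, here is a minimal sketch. The files long.pdb, medium.pdb, and short.pdb are made up and contain placeholder text, not real molecule data.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Three hypothetical files with 3, 2, and 1 lines.
printf 'a\nb\nc\n' > long.pdb
printf 'a\nb\n' > medium.pdb
printf 'a\n' > short.pdb

# Stage 1: wc -l counts the lines in each file.
# Stage 2: sort -n orders those counts from smallest to largest.
# Stage 3: head -n 1 keeps only the first row, the smallest count.
wc -l *.pdb | sort -n | head -n 1
# prints the row for short.pdb, which has 1 line
```

The summary row that wc adds for multiple files holds the largest number, so sort -n places it last and it does not interfere with picking the smallest file.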