Lesson 3: Intro to Pipes and Filters

Lesson adapted from "The Carpentries"

Introduction

Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways.

  • How can I combine existing commands to do new things?

  • Redirect a command’s output to a file.

  • Process a file instead of keyboard input using redirection.

  • Construct command pipelines with two or more stages.

We’ll start with the directory called shell-lesson-data/molecules that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.

Let's see the files in the molecules folder:

ls molecules

Let’s go into that directory with cd and then run an example command wc cubane.pdb:

cd molecules
wc cubane.pdb

wc is the ‘word count’ command: it counts the number of lines, words, and characters in files (from left to right, in that order).

If we run the command wc *.pdb, the * in *.pdb matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files in the current directory:

wc *.pdb

Note that wc *.pdb also prints a total across all the files on the last line of its output.
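To see what the shell actually hands to wc, a small sketch may help. It works in a scratch directory with two made-up files (a.pdb and b.pdb are hypothetical stand-ins, not part of the lesson data) and uses echo, which simply prints the arguments it receives.

```shell
# Work in a throwaway directory so nothing in molecules is touched.
start=$PWD
scratch=$(mktemp -d)
cd "$scratch"

# Two hypothetical stand-in files, not the lesson's real data.
printf 'ATOM 1\nATOM 2\n' > a.pdb
printf 'ATOM 1\n' > b.pdb

# The shell expands *.pdb before the command runs; echo shows the result.
echo *.pdb

# wc therefore receives two filenames and adds a "total" line at the end.
wc -l *.pdb

# Return to the directory we started in.
cd "$start"
```

The echo line should list both filenames, which is exactly the argument list wc sees in the next command.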

If we run wc -l instead of just wc, the output shows only the number of lines per file:

wc -l *.pdb

The -m and -w options can also be used with the wc command, to show only the number of characters or the number of words in the files.
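As a quick check of what each option reports, the sketch below runs all three on a tiny file. The file sample.txt and its contents are assumptions made up for illustration; they are not part of the lesson data.

```shell
# Create a hypothetical two-line file in a throwaway directory.
scratch=$(mktemp -d)
printf 'one two three\nfour five\n' > "$scratch/sample.txt"

wc -l "$scratch/sample.txt"   # lines: 2
wc -w "$scratch/sample.txt"   # words: 5
wc -m "$scratch/sample.txt"   # characters: 24 (newlines are counted too)
```

Each command prints its single count followed by the filename.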

If you run a command and nothing seems to happen, it is usually waiting for input from the keyboard. Try typing wc -l with no filename:

wc -l

Because we gave it no filenames, wc assumes it should read input typed at the keyboard, so it waits for us to enter some data. From the outside, though, all we see is it sitting there: the command doesn’t appear to do anything.

If you make this kind of mistake, you can escape out of this state by holding down the control key (Ctrl), pressing the letter C once, and then releasing Ctrl.
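The same behaviour can be put to use: when wc has no filename, it reads whatever arrives on its input, and the < symbol tells the shell to supply a file there instead of the keyboard. The sketch below assumes a made-up file called notes.txt.

```shell
# Create a hypothetical three-line file in a throwaway directory.
scratch=$(mktemp -d)
printf 'first\nsecond\nthird\n' > "$scratch/notes.txt"

# With a filename argument, wc opens the file itself and reports its name.
wc -l "$scratch/notes.txt"

# With <, the shell feeds the file to wc as its input.
# wc never sees a filename, so it prints only the count.
wc -l < "$scratch/notes.txt"
```

Both commands count 3 lines; only the first one can tell you which file the count came from.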

Redirect

Which of these files contains the fewest lines? It’s an easy question to answer when there are only six files, but what if there were 6000? Let's see what is in our directory:

ls 

Our first step toward a solution is to run the command:

wc -l *.pdb > lengths.txt

The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution.

  • Avoid redirecting to the same file

It’s a very bad idea to try redirecting the output of a command that operates on a file to the same file. Doing something like this may give you incorrect results and/or delete the contents of the file.
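A minimal sketch of why this goes wrong, using hypothetical files in a scratch directory: the shell empties the output file before the command starts, so the command ends up reading a file that is already empty.

```shell
# Create a hypothetical three-line file in a throwaway directory.
scratch=$(mktemp -d)
printf 'a\nb\nc\n' > "$scratch/data.txt"

# Risky: the shell empties data.txt first, then wc counts what is left.
wc -l "$scratch/data.txt" > "$scratch/data.txt"
cat "$scratch/data.txt"          # reports 0 lines; the original text is gone

# Safer: send the output to a different file.
printf 'a\nb\nc\n' > "$scratch/safe.txt"
wc -l "$scratch/safe.txt" > "$scratch/safe-count.txt"
cat "$scratch/safe-count.txt"    # reports 3 lines; safe.txt is untouched
```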

Running ls lengths.txt confirms that the file exists:

ls lengths.txt

We can now send the content of lengths.txt to the screen using cat lengths.txt. The cat command gets its name from ‘concatenate’, i.e. join together, and it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:

cat lengths.txt
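To see the ‘join together’ behaviour, cat can be given more than one file. This is a sketch with two made-up files (first.txt and second.txt are assumptions for illustration):

```shell
# Create two hypothetical one-line files in a throwaway directory.
scratch=$(mktemp -d)
printf 'alpha\n' > "$scratch/first.txt"
printf 'beta\n' > "$scratch/second.txt"

# cat prints the files one after another, in the order given.
cat "$scratch/first.txt" "$scratch/second.txt"

# Combined with >, the joined text can be saved as a new file.
cat "$scratch/first.txt" "$scratch/second.txt" > "$scratch/both.txt"
cat "$scratch/both.txt"
```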

Filtering output

Next we’ll use the sort command to sort the contents of the lengths.txt file numerically with the -n flag.

sort -n lengths.txt
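The -n flag matters because sort compares lines as text by default. A small sketch with a hypothetical file of three numbers shows the difference:

```shell
# Create a hypothetical file of numbers in a throwaway directory.
scratch=$(mktemp -d)
printf '10\n9\n2\n' > "$scratch/numbers.txt"

# Without -n, lines are compared character by character: 10, 2, 9.
sort "$scratch/numbers.txt"

# With -n, lines are compared as numbers: 2, 9, 10.
sort -n "$scratch/numbers.txt"

# sort only prints the sorted lines; numbers.txt itself is unchanged.
cat "$scratch/numbers.txt"
```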

Viewing files in Unix

Sometimes you need to check that you have the correct file without opening the whole thing. This is when the head and tail commands become helpful!

Let's use the head command on the cubane.pdb file:

head cubane.pdb

  • What did this do?

  • How does it differ from cat?

  • What do you think tail does for viewing a file?

tail cubane.pdb

If you need only the first 3 lines of a file, use the -n flag with head; the same flag with tail gives the last 3 lines:

head -n 3 cubane.pdb > 3-cubane.txt
cat 3-cubane.txt
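To compare the two commands side by side, here is a sketch on a made-up five-line file (five.txt is an assumption for illustration, not lesson data):

```shell
# Create a hypothetical five-line file in a throwaway directory.
scratch=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$scratch/five.txt"

# head -n 2 keeps the first two lines: line1 and line2.
head -n 2 "$scratch/five.txt"

# tail -n 2 keeps the last two lines: line4 and line5.
tail -n 2 "$scratch/five.txt"
```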

Passing output to another command

In our example of finding the file with the fewest lines, we are using an intermediate file, lengths.txt, to store output. This is a confusing way to work because even once you understand what wc, sort, and head do, intermediate files make it hard to follow what’s going on. We can make it easier to understand by running sort and head together:

sort -n lengths.txt | head -n 1

The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.

Because head reads the output of sort directly, we no longer need a temporary file to hold the sorted lines.
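For comparison, the sketch below does the same job twice: once with an extra file holding the sorted lines, and once with a pipe. The file contents and the name sorted-lengths.txt are made up for illustration; the layout simply mimics the "count name" lines that wc -l prints.

```shell
# Create a hypothetical lengths file in a throwaway directory.
scratch=$(mktemp -d)
printf '20 large.pdb\n9 small.pdb\n12 medium.pdb\n' > "$scratch/lengths.txt"

# Two steps: a temporary file holds the sorted lines for head to read.
sort -n "$scratch/lengths.txt" > "$scratch/sorted-lengths.txt"
head -n 1 "$scratch/sorted-lengths.txt"

# One step: the pipe hands the output of sort straight to head.
sort -n "$scratch/lengths.txt" | head -n 1
```

Both versions print the line for small.pdb, but only the first leaves an extra file behind.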

Combining multiple commands

Nothing prevents us from chaining pipes consecutively. We can, for example, send the output of wc directly to sort, and then the resulting output to head. This removes the need for any intermediate files. What do you think this would do?

wc -l *.pdb | sort -n | head -n 1
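If you would like to try the pipeline somewhere safe before running it on real data, the sketch below builds three made-up files in a scratch directory (small.pdb, medium.pdb, and large.pdb are hypothetical names with 1, 2, and 3 lines) and runs it there.

```shell
# Work in a throwaway directory so nothing in molecules is touched.
start=$PWD
scratch=$(mktemp -d)
cd "$scratch"

# Three hypothetical files with 3, 1, and 2 lines.
printf 'a\nb\nc\n' > large.pdb
printf 'a\n' > small.pdb
printf 'a\nb\n' > medium.pdb

# Stage 1 counts lines, stage 2 sorts the counts, stage 3 keeps the smallest.
wc -l *.pdb | sort -n | head -n 1

# The largest count is the "total" line, so the longest file is second to last.
wc -l *.pdb | sort -n | tail -n 2 | head -n 1

# Return to the directory we started in.
cd "$start"
```

The first pipeline should report small.pdb and the second large.pdb. Adding one more stage to pick the longest file shows how easily a pipeline can be extended.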