Basic Unix Commands 1 - stechtmann/BL4300-5300 GitHub Wiki

Pre-class reading

Required Reading
Sections 1 - 3 from the Software Carpentry Unix shell introduction

Class notes

Remotely logging onto a server.
- MTU remote login instructions
Paths.
Creating files and directories.
Text editors.
Navigating through the directory structure.

In class assignment

Remotely logging on to colossus

If using a Windows computer, use PuTTY to log on.
The server name is colossus.it.mtu.edu
Your username is your MTU ID
Your password is your MTU password
If using a Mac or Linux computer, open Terminal and enter the following.

ssh USERNAME@colossus.it.mtu.edu

Replace USERNAME with your MTU username, then enter your MTU password when prompted.
Note: the cursor may not move while you type your password.

Basic Unix Commands

Confirm the username under which you are logged in.

whoami

Determine your current working directory

pwd

Determine the files that are in your working directory

ls

Change directory to your Desktop

cd Desktop

Change directory back up to your home directory (`~`)

cd ..

Paths

Overview of paths and working directories.

Files exist on the computer in specific directories. Directories are organized in a tree that starts at the root directory (/). The address of a file or directory is known as its path. The path can be used to identify a specific file on the computer.

Paths can be either relative or absolute.

Relative Path - The location of a file relative to your working directory.

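If your working directory is your home directory (~), the relative path to your Desktop directory is `Desktop/`. The shorthand `~/Desktop/` also works from any working directory, because the shell expands `~` to the absolute path of your home directory.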

Absolute Path - The location of a file relative to the root directory (/).

The location of your Desktop directory relative to the root is something like /home/campus14/smtechtm/Desktop. Absolute paths always start at the root (/).
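
The difference between the two kinds of path can be tried out safely with a minimal sketch like the one below. It assumes a throwaway directory made with mktemp, and the directory names `project` and `data` are hypothetical stand-ins, not directories on colossus.

```shell
# Work in a temporary directory (hypothetical layout, for illustration only)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/project/data"

# Relative path: interpreted starting from the current working directory
cd "$demo_dir/project"
cd data
relative_result=$(pwd -P)

# Absolute path: starts at the root (/), so it works from any working directory
cd /
cd "$demo_dir/project/data"
absolute_result=$(pwd -P)

# Both routes end in the same directory
echo "$relative_result"
echo "$absolute_result"
```

The relative form `cd data` only works while the working directory is `project`; the absolute form works from anywhere.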

Symbolic links

A symbolic link is a shortcut that points to a file or directory located somewhere else. Links make frequently used directories easier to reach.

Scratch Directory

Many of the files that we will use in this class can be very large. Since the storage of our home directories is limited, we can use a directory on the server that is not limited by storage amount. This directory is known as the scratch directory. The path for the scratch directory is

/scratch_30_day_tmp/USERNAME/

Your home directory path is something like

/home/campusXX/USERNAME

Since these two directories are on different parts of the server, it is easier to create a symbolic link (shortcut) in your home directory that points to your scratch directory. In the command below, replace smtechtm with your own username.

cd ~
ln -s /scratch_30_day_tmp/smtechtm scratch

You can check to see if this worked by running

ls -l
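
To see what a working link looks like without touching your real home directory, here is a small sketch under stated assumptions: the directories `home` and `scratch_storage` are hypothetical stand-ins created inside a temporary directory.

```shell
# Throwaway stand-ins for the home and scratch directories (hypothetical names)
demo_dir=$(mktemp -d)
mkdir "$demo_dir/home" "$demo_dir/scratch_storage"

# Create the link inside the stand-in home directory
cd "$demo_dir/home"
ln -s "$demo_dir/scratch_storage" scratch

# In the long listing, a link entry starts with 'l' and shows 'scratch -> target'
ls -l

# readlink prints the path that the link points to
link_target=$(readlink scratch)
echo "$link_target"
```

If the listing shows `scratch -> ...` followed by the scratch path, the link was created correctly.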

Further activities

Change directories into the scratch directory

cd scratch

Create a directory

Make a directory called InClass

mkdir InClass

Change directories into the new directory

cd InClass

Make a directory for today's work

mkdir Sequences

Change into the `Sequences` directory

cd Sequences
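
The steps above build the nested directories one level at a time. As an alternative sketch, `mkdir -p` creates a directory together with any missing parent directories in a single command. The example runs in a temporary directory that stands in for your scratch directory.

```shell
# Use a temporary directory as a stand-in for the scratch directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

# -p creates InClass and InClass/Sequences together, and does not
# report an error if they already exist
mkdir -p InClass/Sequences

# Move into the new directory and confirm the location
cd InClass/Sequences
pwd
```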

Downloading files

When handling sequencing data, there are a few key data types that are stored in plain text files.

FASTA files (.fasta) - contain a header line that starts with > and gives the sequence name, followed by the sequence on the next line or lines.
GenBank files (.gb) - contain the sequence and metadata about the sequence.
FASTQ files (.fastq) - contain four lines per sequence: a header line that starts with @ and gives the sequence name, the sequence, a separator line that starts with +, and the quality scores on the fourth line.
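
To make the layouts concrete, the sketch below writes one FASTA record and one FASTQ record to small files. The record names, sequences, and quality characters are made up for illustration and do not come from the course data.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A FASTA record: a '>' header line, then the sequence
cat > example.fasta <<'EOF'
>seq1 made-up example record
ACGTACGT
EOF

# A FASTQ record: '@' header, sequence, '+' separator, quality string
# (one quality character per base)
cat > example.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
EOF

cat example.fasta
cat example.fastq
```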

Downloading files with `curl`

To download files from a webpage, you can use the command curl. The basic structure of a curl command is

curl url > filename

Let's download a fasta file to work with for today's activities.

curl https://raw.githubusercontent.com/stechtmann/BL4300-5300/master/data/Weekly_Assignment_data/WA1.fasta > WA1.fasta

Check to see that your file is now in your directory

ls

Looking at your file

To see the contents of your file on the command line, you can look at the top lines using the head command, look at the last lines using the tail command, or scroll through the file using the less command.

Look at the top few lines of the file with the `head` command

head WA1.fasta

Look at your whole file with the `less` command.

less WA1.fasta

To exit less, press the q key.
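
Both head and tail print 10 lines unless told otherwise. A short sketch of the `-n` option, which sets the number of lines, using a generated file of numbers (the file name `numbers.txt` is hypothetical):

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Build a 20-line file to look at (seq prints the numbers 1 to 20)
seq 1 20 > numbers.txt

# -n sets how many lines head and tail print
first_three=$(head -n 3 numbers.txt)
last_two=$(tail -n 2 numbers.txt)

echo "$first_three"
echo "$last_two"
```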

Editing text files

Edit your file with the text editor `nano`

nano WA1.fasta

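Ctrl-X exits nano. If you have made changes, nano asks whether to save them: press Y, then Enter to confirm the file name. Ctrl-O saves the file without exiting.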

Unix Filters

Filters are commands that are helpful for processing text files. These filters take a text input and will print an output as text.

The output from a filter can be saved to a file using the > operator, which overwrites the file if it already exists. You can append the output from a filter to an existing file using the >> operator.

Filters can be strung together into pipelines using the pipe (|).
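
A minimal sketch of the three operators, run in a temporary directory so that no real file is overwritten (the file name `notes.txt` is hypothetical):

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# > creates the file, or overwrites it if it already exists
echo "first line" > notes.txt

# >> appends to the end of the existing file
echo "second line" >> notes.txt

# | sends the output of cat to wc -l, which counts the lines
line_count=$(cat notes.txt | wc -l)
echo "$line_count"
```

Running the first command a second time would reset `notes.txt` to a single line, which is why >> is the safer choice when a file should be kept.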

The cat command

The cat command prints the contents of a file to the terminal screen, line by line.

cat WA1.fasta

tr

The tr (translate) command replaces individual characters: each character in the first set is swapped for the character at the same position in the second set. It works on single characters, not on whole strings.

Let's create a pipeline to change all of the characters in our file from lower case to upper case.

cat WA1.fasta | tr 'a-z' 'A-Z'

The cat command will print the lines of the file. The | will send the printed file to the next command.
The tr command will translate every lower case letter (a-z) to the matching upper case letter (A-Z). The result is printed to the terminal; WA1.fasta itself is not changed.
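
A few small cases show how tr treats its input one character at a time. This is a sketch using made-up input strings passed in with printf rather than the course file.

```shell
# Translate each lower-case letter to its upper-case counterpart
upper=$(printf 'acgt\n' | tr 'a-z' 'A-Z')
echo "$upper"

# tr maps single characters, not words: every a becomes 1 and every
# c becomes 2, wherever they occur in the input
swapped=$(printf 'acgt\n' | tr 'ac' '12')
echo "$swapped"

# -d deletes the listed characters instead of replacing them
no_n=$(printf 'ACNGTN\n' | tr -d 'N')
echo "$no_n"
```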

grep

The grep command finds patterns. It prints the lines of a file that contain a specific string, and it can also count how many lines contain that string.

Let's pull just the sequence names out of the file. Since every sequence name in a fasta file starts with the > character, we can use grep to print every line that contains >. The > must be quoted; otherwise the shell treats it as output redirection and overwrites WA1.fasta.

grep '>' WA1.fasta

This should print all of the sequence names to the terminal window.

Now let's count how many sequence names we have in this file.

grep -c '>' WA1.fasta

The flag -c will count the matching lines rather than print them to the terminal.
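
The same two commands can be tried on a tiny file where the answer is easy to check by eye. The sketch below assumes a made-up two-record file named `demo.fasta`; the record names and sequences are hypothetical.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A small made-up FASTA file with two records
cat > demo.fasta <<'EOF'
>geneA ACC001 Escherichia coli
ATGCATGC
>geneB ACC002 Bacillus subtilis
GGCCTTAA
EOF

# Print only the header lines; the quotes stop the shell from
# treating > as output redirection
grep '>' demo.fasta

# Count the header lines
header_count=$(grep -c '>' demo.fasta)
echo "$header_count"

# -v inverts the match and prints the lines that do NOT contain >
sequences=$(grep -v '>' demo.fasta)
echo "$sequences"
```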

cut

The cut filter divides each line into fields based on a delimiter and extracts the fields you ask for. Some file types use a particular character to separate pieces of data. An example is a .csv (comma-separated values) file, in which commas separate the values in each line. With cut, you name the delimiter with -d and choose the fields to extract with -f.

Using the WA1.fasta file let's separate the lines of the names based on spaces as the delimiter and extract the first column. In this file, the sequence names are divided into three parts separated by spaces. The first part is the protein name. The second part is the accession number, and the third part is the taxonomy of the organism.

Let's extract the first field (protein name)

grep '>' WA1.fasta | cut -d ' ' -f 1

Let's extract everything from the third field to the end of the line (taxonomy)

grep '>' WA1.fasta | cut -d ' ' -f 3-
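
The field numbering is easier to follow on a single line. This sketch uses a made-up header in the three-part layout described above, plus a hypothetical comma-separated line to show a different delimiter.

```shell
# A made-up header line: protein name, accession number, taxonomy
header='>geneA ACC001 Escherichia coli K-12'

# Fields are numbered from 1; the first field still includes the >
first_field=$(echo "$header" | cut -d ' ' -f 1)
second_field=$(echo "$header" | cut -d ' ' -f 2)

# 3- means "field 3 through the end of the line"
third_onward=$(echo "$header" | cut -d ' ' -f 3-)

# tr -d can strip the leading > from the name
name_only=$(echo "$header" | cut -d ' ' -f 1 | tr -d '>')

# The same idea with a comma as the delimiter (hypothetical CSV line)
csv_second=$(echo 'sample1,42,soil' | cut -d ',' -f 2)

echo "$first_field"
echo "$second_field"
echo "$third_onward"
echo "$name_only"
echo "$csv_second"
```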

sort

The sort command will order the lines in a file alphabetically or numerically.

Let's sort the output from the previous command to see if there are any sequences that came from the same species.

grep '>' WA1.fasta | cut -d ' ' -f 3- | sort

uniq

The uniq command removes repeated lines: if a line is identical to the line directly above it, uniq drops the duplicate. Because only adjacent lines are compared, the input is usually sorted first. With the option flag -c, uniq also reports how many times each line occurred.

Let's count repetitive entries after sorting.

grep '>' WA1.fasta | cut -d ' ' -f 3- | sort | uniq -c
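
The sketch below shows why the sort step matters, using three made-up taxonomy lines in which one name appears twice but not on neighbouring lines. The final `sort -rn` step is an optional addition, not part of the pipeline above, that puts the most frequent entry first.

```shell
# Made-up taxonomy lines; "Escherichia coli" appears twice, but not adjacently
taxa=$(printf 'Escherichia coli\nBacillus subtilis\nEscherichia coli\n')

# Without sorting, uniq only compares neighbouring lines, so nothing is removed
unsorted_lines=$(echo "$taxa" | uniq | wc -l)
echo "$unsorted_lines"

# After sorting, the duplicates are adjacent and uniq -c can count them
counts=$(echo "$taxa" | sort | uniq -c)
echo "$counts"

# Sorting the counts numerically in reverse lists the most frequent entry first
most_common=$(echo "$taxa" | sort | uniq -c | sort -rn | head -n 1)
echo "$most_common"
```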