Basic Unix Commands 1 - stechtmann/BL4300-5300 GitHub Wiki
Pre-class reading
Required Reading
Sections 1-3 of the Software Carpentry Unix shell introduction
Class notes
- Remotely logging onto a server.
- Paths.
- Creating files and directories.
- Text editors.
- Navigating through the directory structure.
In class assignment
Important: unix cheat sheet
Remotely logging on to colossus
- If using a Windows computer, use PuTTY to log on.
- The server name is colossus.it.mtu.edu
- Your username is your MTU ID.
- Your password is your MTU password.
- If using a Mac or Linux computer, open Terminal and enter the following, replacing USERNAME with your MTU ID.
ssh USERNAME@colossus.it.mtu.edu
Then enter your MTU password.
Note: the cursor may not move while you type your password. This is normal.
Basic Unix Commands
Confirm the username under which you are logged in.
whoami
Determine your current working directory
pwd
Determine the files that are in your working directory
ls
Change directory to your Desktop
cd Desktop
Move back up one level, which returns you to your home directory (~)
cd ..
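To see what cd .. does without touching your real files, here is a minimal sketch that uses a temporary directory as a stand-in for a home directory. The variable names base and after_up are made up for this illustration.

```shell
# Hypothetical stand-in for a home directory that contains a Desktop folder.
base=$(mktemp -d)
mkdir "$base/Desktop"

# Move into Desktop, then back up one level.
cd "$base/Desktop"
cd ..                 # .. always means "the directory one level up"
after_up=$(pwd)

# We should be back in the stand-in home directory.
echo "Now in: $after_up"
```

Note that cd .. only moves up one level. Running cd with no argument returns you to your home directory from anywhere.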
Paths
Overview of paths and working directories.
Files exist on the computer in specific directories. These directories are in relation to the root directory. The address of a file or directory is known as the path. The path can be used to identify a specific file on the computer.
Paths can be either relative or absolute.
Relative Path - The location of a file relative to your working directory.
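The location of your Desktop directory relative to your home directory (~) is Desktop/. The shell also accepts ~/Desktop/ from any working directory, because ~ expands to the path of your home directory.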
Absolute Path - The location of a file relative to the root directory (/).
The location of your Desktop directory relative to the root is something like /home/campus14/smtechtm/Desktop. Absolute paths always start at the root (/).
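As a minimal sketch of the difference, the following builds a throwaway directory tree and reads the same file through a relative path and an absolute path. The names demo_root, project, and notes.txt are made up for illustration.

```shell
# Build a small practice tree (all names are hypothetical).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/project"
echo "hello" > "$demo_root/project/notes.txt"

# Relative path: interpreted from the current working directory.
cd "$demo_root"
relative_read=$(cat project/notes.txt)

# Absolute path: starts at / and works from any working directory.
cd /
absolute_read=$(cat "$demo_root/project/notes.txt")

echo "relative: $relative_read"
echo "absolute: $absolute_read"
```

The relative path project/notes.txt only works while the working directory is demo_root. The absolute path works from anywhere.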
Symbolic links
A symbolic link is a shortcut that points to a file or directory at a different path. Links make frequently used directories simpler to reach.
Scratch Directory
Many of the files that we will use in this class can be very large. Since the storage of our home directories is limited, we can use a directory on the server that is not limited by storage amount. This directory is known as the scratch directory. The path for the scratch directory is
/scratch_30_day_tmp/USERNAME/
Your home directory path is something like
/home/campusXX/USERNAME
Since these two directories are on different parts of the server, it is convenient to create a symbolic link (shortcut) in your home directory that points to your scratch directory. Replace USERNAME with your own username.
cd ~
ln -s /scratch_30_day_tmp/USERNAME scratch
You can check that this worked by running the command below. The link is listed as scratch -> followed by the path of your scratch directory.
ls -l
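The same idea can be tried safely in a temporary directory. In this sketch big_storage stands in for the scratch directory and fake_home for your home directory. Both names are hypothetical.

```shell
# Stand-ins for the real directories (hypothetical names).
sandbox=$(mktemp -d)
mkdir "$sandbox/big_storage" "$sandbox/fake_home"

# Create the link inside the stand-in home directory.
cd "$sandbox/fake_home"
ln -s "$sandbox/big_storage" scratch

# ls -l marks a symbolic link with a leading "l" and shows "name -> target".
ls -l

# A file written through the link lands in the target directory.
echo "test" > scratch/example.txt
ls "$sandbox/big_storage"
```

Files created under scratch are stored in the linked directory, so they do not count against the space of the directory that holds the link.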
Further activities
Change into the scratch directory
cd scratch
Create a directory
- Make a directory called InClass
mkdir InClass
Change directories into the new directory
cd InClass
Make a directory for today's work
mkdir Sequences
Change into the Sequences directory
cd Sequences
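The steps above create one directory at a time. As an optional shortcut, mkdir -p can create a nested path in one step. This sketch runs in a temporary directory (the variable name workdir is made up) so it does not change your real scratch space.

```shell
# Work in a temporary directory so nothing in your real scratch space changes.
workdir=$(mktemp -d)
cd "$workdir"

# -p creates any missing parent directories and does not complain
# if the directory already exists.
mkdir -p InClass/Sequences
cd InClass/Sequences

# Confirm where we ended up.
pwd
```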
Downloading files
When handling sequencing data there are a few key data types that are contained within plain text files.
- Fasta files (.fasta) - contain a line with the sequence name that starts with > followed by the sequence on the following line or lines.
- Genbank files (.gb) - contain the sequence and metadata about the sequence.
- Fastq files (.fastq) - contain four lines per sequence: a line with the sequence name that starts with @, the sequence on the second line, a separator line that starts with +, and the quality scores on the fourth line.
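To make the layouts concrete, this sketch writes a tiny fasta file and a tiny fastq file and counts the records in each. The file names, sequence names, and sequences are invented for illustration and are not part of the class data.

```shell
cd "$(mktemp -d)"

# A two-record FASTA file: a ">" header line followed by the sequence.
cat > example.fasta <<'EOF'
>seq1 hypothetical example
ATGGCGTACGTT
>seq2 hypothetical example
ATGCCGTAAGTC
EOF

# A one-record FASTQ file: "@" header, sequence, "+" separator, quality string.
cat > example.fastq <<'EOF'
@read1
ATGGCGTACGTT
+
IIIIIIIIIIII
EOF

# FASTA records: count header lines. FASTQ records: total lines divided by 4.
fasta_records=$(grep -c '^>' example.fasta)
fastq_records=$(( $(wc -l < example.fastq) / 4 ))
echo "FASTA records: $fasta_records, FASTQ records: $fastq_records"
```

The quality string has one character per base, so it is always the same length as the sequence line.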
Downloading files with curl
To download files from a webpage, you can use the command curl. The basic structure of a curl command is
curl url > filename
Let's download a fasta file to work with for today's activities.
curl https://raw.githubusercontent.com/stechtmann/BL4300-5300/master/data/Weekly_Assignment_data/WA1.fasta > WA1.fasta
Check to see that your file is now in your directory
ls
Looking at your file
To see the contents of your file on the command line, you can look at the top lines using the head command, look at the last lines using the tail command, or scroll through the file using the less command.
Look at the top few lines of the file with the head command.
head WA1.fasta
Look at your whole file with the less command.
less WA1.fasta
To exit the less command press the q key.
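Both head and tail show 10 lines unless told otherwise. As a small sketch, the -n option sets the number of lines. The practice file numbered.txt is made up here so the example does not depend on the class data.

```shell
cd "$(mktemp -d)"

# Make a 20-line practice file (hypothetical content: "line 1" ... "line 20").
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > numbered.txt

# head and tail show 10 lines by default; -n changes how many.
first_three=$(head -n 3 numbered.txt)
last_two=$(tail -n 2 numbered.txt)

echo "$first_three"
echo "$last_two"
```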
Editing text files
Edit your file with the text editor nano
nano WA1.fasta
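Ctrl-X closes nano. If you have made changes, nano asks whether to save them: press Y, then Enter to confirm the file name. Ctrl-O saves without closing.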
Unix Filters
Filters are commands that are helpful for processing text files. These filters take a text input and will print an output as text.
The output from a filter can be saved to a file using the > operator, which overwrites the file if it already exists. You can append the output from a filter to an existing file using the >> operator.
Filters can be strung together into pipelines using the pipe (|).
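Here is a minimal sketch of the three operators on a practice file. The file name log.txt and its contents are made up for illustration.

```shell
cd "$(mktemp -d)"

# > creates the file, or overwrites it if it already exists.
echo "first" > log.txt
echo "second" > log.txt     # log.txt now holds only "second"

# >> appends to the end of the file instead of overwriting.
echo "third" >> log.txt

# | feeds the output of one command into the next one.
line_count=$(cat log.txt | wc -l)
echo "log.txt has $line_count lines"
```

Because > overwrites without asking, it is worth double-checking the file name before redirecting into a file you want to keep.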
The cat command
The cat filter displays the lines of a file line by line. This is the equivalent of printing the contents of a file on to the terminal screen.
cat WA1.fasta
tr
The tr command translates characters: it replaces each character in one set with the corresponding character in a second set. It works on single characters, not on whole strings.
Let's create a pipeline to change all of the characters in our file from lower to upper case
cat WA1.fasta | tr 'a-z' 'A-Z'
The cat command will print the lines of the file.
The | will send the printed file to the next command.
The tr command will replace each lower case letter (a-z) with the matching upper case letter (A-Z).
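A few more uses of tr can be sketched on inline text, so no input file is needed. The variable names and the sample sequence are made up for illustration.

```shell
# Sample text supplied inline so the example needs no input file.
dna="atgc ttga"

# Translate each lower-case letter to its upper-case counterpart.
upper=$(echo "$dna" | tr 'a-z' 'A-Z')

# Translate single characters position by position: a->t, t->a, g->c, c->g.
# (This is the complement only; it does not reverse the sequence.)
complement=$(echo "atgc" | tr 'atgc' 'tacg')

# -d deletes the listed characters instead of replacing them.
no_spaces=$(echo "$dna" | tr -d ' ')

echo "$upper" "$complement" "$no_spaces"
```

The complement example shows why tr is a character translator rather than a string replacer: each character in the first set maps to the character at the same position in the second set.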
grep
The grep command is a pattern finding command. It finds a specific string and prints the lines from the file that contain that string. It can also count how many lines contain the string.
Let's pull just the sequence names out of the file. Since we know that in a fasta file all of the sequence names start with the > character, we can use grep to print any line that contains >. The quotes around '>' are required. Without them the shell would treat > as output redirection.
grep '>' WA1.fasta
This should print all of the sequence names to the terminal window.
Now let's count how many sequence names we have in this file.
grep -c '>' WA1.fasta
The flag -c will count the lines rather than print them to the terminal.
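As a small sketch of two further options, ^ anchors a pattern to the start of a line and -v prints the lines that do not match. The practice file mini.fasta and its records are invented for illustration.

```shell
cd "$(mktemp -d)"

# Small practice FASTA file (hypothetical records).
cat > mini.fasta <<'EOF'
>protA ACC001 Escherichia coli
ATGGCG
>protB ACC002 Bacillus subtilis
ATGCCG
EOF

# ^ anchors the pattern to the start of the line, so only header lines match.
headers=$(grep '^>' mini.fasta)

# -c counts matching lines; -v inverts the match (here: sequence lines only).
header_count=$(grep -c '^>' mini.fasta)
sequence_lines=$(grep -v '^>' mini.fasta)

echo "$header_count headers"
echo "$sequence_lines"
```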
cut
The cut filter allows you to divide a line into segments based on a delimiter and extract specific segments.
Some files use certain characters to mark divisions between pieces of data. An example of such a file type is a .csv (comma-separated values) file, in which commas separate the values or data entries in a line. You can use the cut command to split each line at the delimiter and extract a specific part of the line.
Using the WA1.fasta file let's separate the lines of the names based on spaces as the delimiter and extract the first column. In this file, the sequence names are divided into three parts separated by spaces. The first part is the protein name. The second part is the accession number, and the third part is the taxonomy of the organism.
Let's extract the first field (protein name)
grep '>' WA1.fasta | cut -d ' ' -f 1
Let's extract everything from the third field to the end (taxonomy)
grep '>' WA1.fasta | cut -d ' ' -f 3-
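The same options work on the .csv case described above. In this sketch the file samples.csv and its contents are made up for illustration.

```shell
cd "$(mktemp -d)"

# Hypothetical CSV file: sample name, organism, read count.
cat > samples.csv <<'EOF'
sample1,Escherichia coli,1200
sample2,Bacillus subtilis,950
EOF

# -d sets the delimiter, -f picks the field(s) to keep.
organisms=$(cut -d ',' -f 2 samples.csv)

# A comma-separated list of field numbers keeps several fields at once.
names_and_counts=$(cut -d ',' -f 1,3 samples.csv)

echo "$organisms"
echo "$names_and_counts"
```

When several fields are kept, cut joins them with the same delimiter it used to split the line.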
sort
The sort command will order the lines in a file alphabetically or numerically.
Let's sort the output from the previous command to see if there are any sequences that came from the same species.
grep '>' WA1.fasta | cut -d ' ' -f 3- | sort
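The difference between alphabetical and numerical order matters when the lines are numbers. A minimal sketch on three made-up values:

```shell
# Alphabetical (default) order compares character by character,
# so "10" sorts before "9".
alpha=$(printf '10\n9\n100\n' | sort)

# -n compares the values as numbers.
numeric=$(printf '10\n9\n100\n' | sort -n)

# -r reverses the order (largest first when combined with -n).
reverse_numeric=$(printf '10\n9\n100\n' | sort -n -r)

echo "$alpha"
echo "$numeric"
echo "$reverse_numeric"
```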
uniq
The uniq command de-replicates its input: if a line is the same as the one directly above it, uniq removes the duplicate. Because only adjacent lines are compared, the input is usually sorted first. With the option flag -c, uniq counts how many times each line was repeated.
Let's count repeated entries after sorting.
grep '>' WA1.fasta | cut -d ' ' -f 3- | sort | uniq -c
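To see why sorting comes before uniq, here is a minimal sketch on a made-up species list in which the repeated entry is not on adjacent lines.

```shell
# Hypothetical species list with a repeated entry that is not adjacent.
species=$(printf 'E. coli\nB. subtilis\nE. coli\n')

# Without sorting, the two "E. coli" lines are not neighbours, so uniq keeps both.
unsorted_unique=$(echo "$species" | uniq | wc -l)

# Sorting first places duplicates next to each other so uniq can collapse them.
sorted_unique=$(echo "$species" | sort | uniq | wc -l)

# uniq -c prefixes each line with its count; sort -n -r ranks the most common first.
ranked=$(echo "$species" | sort | uniq -c | sort -n -r)
echo "$ranked"
```

Adding sort -n -r to the end of the class pipeline in the same way would list the most frequent taxonomy entries first.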