01. Logging In to HPCC and an Intro to Bash and Linux Navigation - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

A UNIX BOOTCAMP

This is your introduction to the Unix environment. You are required to use the computers in Biology 405 for this exercise. Do not use your own computer.

Several of these tasks are pointless and really do nothing. Their purpose is to familiarize you with some of the basic commands/file manipulations you will use throughout the course. Keep a cheat sheet for yourself that you can quickly refer to in the future. You can find these on Google, but some of the commands you will use frequently (e.g., squeue) are more specific to our system. The following is heavily modified from http://korflab.ucdavis.edu/bootcamp.html.

All of the work below assumes that you have requested and obtained an account on HPCC (exercise 00). If you haven't, you won't be able to participate in this class.

A warning. The better you understand what's happening with this tutorial, the less confused you will be throughout the rest of this class. Learning how to run commands and navigate in a Linux environment is critical for not getting lost. Really take some time to concentrate on this tutorial, please.

A note on formatting

Throughout this tutorial, you will notice various text formats. If there is a command you need to type, it will typically be formatted as:

type this command

If I'm trying to represent output you should be seeing on your screen, it will be formatted as:

This is stuff you should see
on your screen.

Sometimes I'll combine the two. You'll get used to it.

At various points in these tutorials, you may see commands with placeholders which will always be enclosed within the < and > or [ and ] symbols. These placeholders need to be replaced with information specific to you and your computer. For example: later in this tutorial you will see [eraider] which you simply replace with your eraider user name and discard the [] symbols. Another common example is <path> which is instructing you to provide the full path (location) of the file or directory.

One more note before you get started. The tutorial also assumes you work through it all in one sitting. If you log out and log back in, you will need to recapitulate all of the actions to get back to where you were when you stopped.

THE TERMINAL

A terminal is the common name for the program that does two main things. It allows you to type input to the computer (i.e. run programs, move/view files etc.) and it allows you to see output from those programs. All Unix machines will have a terminal program available.

How you get to your terminal will depend on whether you use a Mac or Windows PC. You should be completing this using the PCs in Biology 405. So, use MobaXterm. Open the software and then open a session to HPCC as follows.

  1. Click the 'Session' button at the upper left of the screen.
  2. Choose 'SSH' from the options.
  3. Enter 'login.hpcc.ttu.edu' in the Remote host section.
  4. Check the "Specify username" box and enter the user ID provided to you by HPCC.
  5. Click OK.

If all went well, you should now see something that looks like this. The only major difference would be that your user ID is in place of mine.

(image)

To complete your login, you may need to enter your password. While you type, nothing will show up on the screen. That's ok. Just type in your password and hit enter. If you entered it correctly and you have your account set up, you should see something like this.

image

Again, some things may be different because these screenshots were specific to me and not you.

There is some valuable information printed out here but we're not going to worry about most of that now. The important text for us at this moment is the part that says, "DO NOT RUN CPU-INTENSIVE JOBS ON THE LOGIN NODES!" That takes us to the next section.

MOVING FROM THE LOGIN NODE

Now that we are logged on, you should notice something like this. The actual numbers may be slightly different but not by much.

login-20-26:$

The login-20-26:$ text that you see is the Unix command prompt. In this case, it contains the name of the login node you're working on (‘login-20-26’) and the name of the current directory (There's nothing there now but more on that later). Note that the command prompt might not look the same on different Unix systems. In this case, the $ sign marks the end of the prompt. For most of the rest of the course, I'm just going to leave the '$' out because it's a pain to include it with every command you will be executing. For this page, though, I'll be using it a lot.

Be aware that you're working on a system that has thousands of processors (~16,000 to be more precise). A single processing node controls how jobs are distributed to all of those other processors. That node is called the 'login node' (you will also see it called the 'head node' below). DO NOT perform any analyses on the login node. It is a crime punishable by death. It slows everyone down on the entire system. One of the ways to move off of the login node is to request an interactive session (via the 'interactive' command):

interactive -p nocona -c 1

Notice the change in your command prompt. This tells you that you are working from a compute node. The change will be something like this:

login-20-26:$

to

cpu-25-16:$

This indicates the specific node on which you are now working.

If no nodes are available to use for an interactive session, you will get a message saying your job failed. Sometimes, the processors are just exceptionally busy and you will not be able to get a node to work on. In those cases, generating a submission script is a way to go. You create a text file that says what you want to do and submit it to the 'queue' (the list of pending jobs that's being managed by the head node). Once submitted, the job will run when the resources are available. I will be describing how to do that later in the course.

You'll notice below that much of what we're doing is on the head node. The tasks are so simple, that it's really not a problem but it's still bad practice. Forgive me.

You can exit your interactive session by closing your terminal. But that is a jerk thing to do. Your session will stay in the queue for a minimum of 48 hours, locking out others from using that set of processors. You have to specifically tell the system you want to end your session by typing exit. Don't type it now but please do before you leave class or stop working on this assignment.

Any time you are performing computationally intensive work (aka, analyses of any kind) you will want to generate an interactive session.

UNIX COMMANDS AND THE SHELL

A shell is a basic program that allows the user to interact with the system.

For this class, we will primarily be using a shell called "bash" and this exercise familiarizes you with its usage. Every command you use for the remainder of this exercise is a bash command.

It’s important to note that you will always be inside a single directory when using the terminal. The default behavior is that when you open a new terminal you start in your own home directory (containing files and directories that only you can modify). To see what files and directories are in our home directory, we need to use the ls command. This command stands for 'list' and it lists the contents of a directory. If we run the ls command we should see something like:

$ ls

<some files and things if you've worked on HPCC previously>
$

or this if you haven't worked on HPCC before.

$ ls

$

We can resolve the second situation by just copying some things from a directory I created to your home directory. The following command will copy a directory and all its contents to your home directory.

cp -r /home/daray/gge /home/[eraider]

Note that you will need to replace [eraider] with your eraider id. Mine is 'daray'. We will use this same system to indicate your eraider id through the rest of the course.
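If you want to see what cp -r does before running it against real files, here is a minimal sketch that works entirely inside a throwaway directory. The names sandbox, source_dir, and fake_home are made up for illustration (the sandbox is created with mktemp) and are not part of HPCC.

```shell
# Build a scratch area so nothing real is touched (all names here are hypothetical).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/source_dir/gge/exampledirectory1"
touch "$sandbox/source_dir/gge/examplefile1.txt"
mkdir "$sandbox/fake_home"

# -r copies the directory and everything beneath it.
cp -r "$sandbox/source_dir/gge" "$sandbox/fake_home"

# The copy now sits inside fake_home; the original is still in place.
ls "$sandbox/fake_home/gge"
```

The same pattern applies to the real command: the first path is what gets copied and the second path is where the copy lands.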

Now, redo your ls command. You should see:

$ ls
gge
$

Again, if you've already worked on HPCC, you may see more.

The output of the ls command lists the contents of the directory. In this case, it's a single directory, but it could also include files. We’ll learn how to tell them apart later on. This directory was created just for this course. You will therefore probably see something very different on your own computer.

After the ls command finishes it produces a new command prompt, ready for you to type your next command.

The ls command is used to list the contents of any directory, not necessarily the one that you are currently in. Try the following:

ls gge

You should see:

exampledirectory1  exampledirectory2  examplefile1.txt

If you're using a quality terminal, these may be in pretty colors. These are the contents of the gge directory, listed even though you aren't actually in that directory. This has to do with the 'path'. More on that later.

DIRECTORIES, AKA FOLDERS

Looking at directories from within a Unix terminal can often seem confusing, especially if you've grown up using graphical user interfaces (GUIs) like on Windows computers and Macs. But bear in mind that these directories are exactly the same type of folders that you can see if you use any graphical file browser. You might notice some of these names appearing in different colors. Many Unix systems will display files and directories differently by default. Other colors may be used for special types of files.

To see this in a form that you might be more familiar with, we'll use Bitvise. Open Bitvise on your computer and you will see the 'Default profile' window appear. At the top and to the left is the 'login' tab. Enter the same information you did for MobaXterm in the Host and username sections. If needed, change the 'Initial method' box to 'password'. Afterward, go to the bottom of the screen and hit the 'Log in' button. You'll be prompted for your password, which you should enter. After closing the warning screen and accepting the security cookies, you should see several lines of information pop up with the last line indicating 'Authentication completed.'

Bitvise has its own terminal but it's not as nice as MobaXterm. We're going to use Bitvise to transfer files via SFTP (Secure File Transfer Protocol). To open an SFTP window, click that button on the left side of the window.

image

A new window should appear that is divided into two panels. On the left is your local desktop. On the right is your home directory on HPCC.

image

Note that my home directory is full of stuff. Yours is probably pretty empty and may only include one folder, the 'gge' folder you just copied. Find that directory and double-click it. You should see the same three entries you saw using ls gge earlier in this exercise.

SETTING UP A DIRECTORY STRUCTURE FOR THIS CLASS

As described in a previous tutorial, a coherent directory structure is important to keep you organized. Thus, I want to set one up for this class. Please do the following.

Log in to HPCC and get an interactive session with one processor.

interactive -p nocona -c 1

We haven't discussed this part of the structure of HPCC yet: you have three working areas (and their associated paths), 'home' (/home/[eraider]), 'work' (/lustre/work/[eraider]), and 'scratch' (/lustre/scratch/[eraider]). There are important differences among the three.

Home has the smallest storage capacity but it's backed up.

Work has limited but larger storage capacity but it's not backed up.

Scratch has effectively unlimited storage but is also not backed up and is purged periodically.

We will be working in scratch because when I worked through all of these exercises, I needed the storage.

Migrate to your scratch directory.

cd /lustre/scratch/[eraider]

Make a directory for this class and for this exercise.

mkdir -p gge2024/data

Note what this command does. It creates two directories simultaneously: the directory for this class, 'gge2024', and the subdirectory, 'data'. That is possible using the -p option, which tells Unix to create any necessary intermediate directories required to create the last directory in the path.
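To make the effect of -p concrete, here is a small sketch you can run in a throwaway directory. The sandbox location comes from mktemp and the variable plain_result is only there for illustration; neither is part of the class setup.

```shell
# Work in a throwaway directory (hypothetical; created only for this demonstration).
sandbox=$(mktemp -d)
cd "$sandbox"

# Without -p, mkdir refuses because the parent 'gge2024' does not exist yet.
if mkdir gge2024/data 2>/dev/null; then
    plain_result="succeeded"
else
    plain_result="failed"
fi
echo "mkdir without -p: $plain_result"

# With -p, the missing parent is created along the way.
mkdir -p gge2024/data
ls -d gge2024/data

# Running it again is harmless: -p does not complain if the directories exist.
mkdir -p gge2024/data
```

Because a repeated mkdir -p is harmless, it is a convenient choice whenever you are not sure whether a directory has already been created.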

You will be storing all of your data and results from any exercises from here on out in your 'gge2024' directory and you will be generating new subdirectories as we go.

Before moving on to the next section, go back to your home directory by typing

cd /home/[eraider]

PATHS

Each file on the filesystem can be uniquely identified by a combination of a filename and a path. You can reference any file on the system by giving its full name, which begins with a / indicating the root directory, continues through a list of subdirectories (the components of the path) and ends with the filename. The absolute path describes the relationship of the file to the root directory, /. Each name in the path represents a subdirectory of the prior directory, and '/' characters separate the directory names. The full name, or absolute path, of a file in someone's home directory might look like this:

/home/daray/gge/exampledirectory1/file1.txt

This means that there is a subdirectory of 'root' (aka /) called 'home'. Within 'home', there is a subdirectory called 'daray'. Within 'daray', there is a subdirectory called 'gge'. Within 'gge' there is a subdirectory called 'exampledirectory1'. Within 'exampledirectory1' there is a file called 'file1.txt'.

You can think of these as ways to direct someone to a specific location as follows. If you wanted to tell some aliens how to find you right now, you could tell them that you're in the universe. Within the universe you're in a galaxy called the Milky Way. In the Milky Way, you're in something we call the Solar System. Within the Solar System, you're on one of the planets, the one called Earth. On Earth, you're on the North American continent and on that continent, you're in the country, the USA. Within the USA, you're in Texas. Within Texas, you're in Lubbock County, in the city of Lubbock, on the campus of Texas Tech University, in the Biology Building, in room 405.

To make that into a path similar to the ones we'll be using we'd write it like this. We'll symbolize the universe with an initial '/' and all of the other parts of the path will be subdivisions of that.

/MilkyWay/SolarSystem/Earth/NorthAmerica/USA/Texas/LubbockCounty/Lubbock/TexasTechUniversity/BiologyBuilding/Room405 is the path to you that anyone in the universe could use to locate your position. It's your absolute path.

Suppose instead of being here, you were at the top of Olympus Mons on Mars in a spacecraft. The entire absolute path would not change, only the relevant parts.

In other words, /MilkyWay/SolarSystem/Mars/OlympusMons/YourSpaceCraft

Every file or directory on the system can be named by its absolute path, but it can also be named by a relative path that describes its relationship to the current working directory. Files in the directory you are in can be uniquely identified just by giving the filename they have in the current working directory. Files in subdirectories of your current directory can be named in relation to the subdirectory they are part of. From daray's home directory (/home/daray/), he can uniquely identify the file file1.txt as gge/exampledirectory1/file1.txt. The absence of a preceding / means that the path is defined relative to the current directory rather than relative to the root directory.

If our aliens were already in the Solar System, they would only need to refer to the needed part of the path.

Earth/NorthAmerica/USA/Texas/LubbockCounty/Lubbock/TexasTechUniversity/BiologyBuilding/Room405 or Mars/OlympusMons/YourSpaceCraft, depending on your location.

If you want to name a directory that is on the same level as or above the current working directory, there is a shorthand for doing so. Each directory on the system contains two links, ./ and ../, which refer to the current directory and its parent directory (the directory it's a subdirectory of), respectively. If user daray is working in the directory /home/daray/gge/exampledirectory1, he can refer to the directory /home/daray/gge/exampledirectory2 as ../exampledirectory2. The '../' backs one level out of exampledirectory1 into /home/daray/gge/ and then 'exampledirectory2' directs attention to that folder.

Another shorthand naming convention applies to the home directory itself. It can be designated simply by ~. For example, if you wanted to identify the path to file1.txt, you could simply type ~/gge/exampledirectory1/file1.txt.
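The different ways of naming one file can be tried out safely with a sketch like the following. It mocks up the gge layout inside a throwaway directory, so the sandbox stands in for a home directory; that stand-in, and the variable names, are assumptions made only for illustration. The cat command, which prints a file's contents, is used here just to prove that each path reaches the same file.

```shell
# Mock up the layout from the text inside a throwaway directory
# (the sandbox stands in for /home/daray; it is an assumption for illustration).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/gge/exampledirectory1" "$sandbox/gge/exampledirectory2"
echo "hello" > "$sandbox/gge/exampledirectory1/file1.txt"

# 1. Absolute path: works no matter where you currently are.
abs_read=$(cat "$sandbox/gge/exampledirectory1/file1.txt")

# 2. Relative path: interpreted from the current working directory.
cd "$sandbox"
rel_read=$(cat gge/exampledirectory1/file1.txt)

# 3. Using .. to step up one level and back down into a sibling directory.
cd "$sandbox/gge/exampledirectory2"
dotdot_read=$(cat ../exampledirectory1/file1.txt)

echo "$abs_read $rel_read $dotdot_read"
```

All three reads return the same contents, because all three paths resolve to the same file. Only the starting point differs.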

YOU ARE HERE: 'PWD'

pwd stands for "print working directory," and that's exactly what it does. pwd sends the full pathname of the directory you are currently in, the current working directory, to standard output - it prints to the screen. You can think of being "in" a directory in this way: if the directory tree is a map of the filesystem, the current working directory is the "you are here" pointer on the map.

When you log in to the system, your "you are here" pointer is automatically placed in your home directory. Your home directory is a unique place. It contains the files you use almost every time you log into your system, as well as the directories that you create to store other files. What if you want to find out where your home directory is in relation to the rest of the system? Typing pwd at the command prompt in your home directory should give output something like:

login-20-26:$ pwd
/home/[eraider]

This means that your particular home directory is a subdirectory of the system 'home' directory (but designated by your user ID). The system 'home' directory is, in turn, a subdirectory of the root (/) directory.
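As a quick illustration of the "you are here" idea, the sketch below moves around and asks pwd where it is after each move. It assumes a /tmp directory exists, which is true on most Unix systems but is still an assumption; the variable names are only there to hold the answers.

```shell
# Move into /tmp (assumed to exist) and ask where we are.
cd /tmp
first_location=$(pwd)
echo "$first_location"

# Step up to the parent directory and ask again; pwd follows you.
cd ..
second_location=$(pwd)
echo "$second_location"
```

Running pwd after every cd is a cheap habit that prevents a lot of "file not found" confusion.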

MAKING NEW DIRECTORIES

If we want to make a new directory (e.g. to store some work related data), we can use the mkdir command:

mkdir making_a_directory

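Note that the underscores '_' are important. Unix does not like blank spaces: the shell treats a space as a separator between arguments, so a name containing spaces would be read as several separate names.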

ls

Assuming you're still in your home directory and that you have not worked on HPCC before, you should see:

gge making_a_directory
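Here is a small sketch of what actually happens when a name contains spaces, run in a throwaway directory created with mktemp (the sandbox and the directory names are for illustration only).

```shell
# Work in a throwaway directory so the extra directories are easy to ignore.
sandbox=$(mktemp -d)
cd "$sandbox"

# Unquoted spaces split the name into three separate arguments,
# so this creates three directories: 'making', 'a', and 'directory'.
mkdir making a directory
ls

# Underscores keep the name as a single argument.
mkdir making_a_directory

# Quoting also works, but then every later command must quote the name too.
mkdir "quoted name"
ls
```

Sticking with underscores (or dashes) avoids having to remember the quotes every time you refer to the name.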

A DIGRESSION ON FILE HIERARCHY

Filesystems can be deep and narrow or broad and shallow. It's best to follow an intuitive scheme for organizing your files. Each level of hierarchy should be related to a step in the process you've used to carry out the project. A filesystem is probably too shallow if the output from numerous processing steps in one large project is all shoved together in one directory. However, a project directory that involves several analyses of just one data object might not need to be broken down into subdirectories. The filesystem is too deep if versions of output of a process are nested beneath each other or if analyses that require the same level of processing are nested in subdirectories. It's much easier for you to remember and for others to understand the paths to your data if they clearly symbolize steps in the process you used to do the work.

As you'll see in the upcoming example, your home directory will probably contain a number of directories, each containing data and documentation for a particular project. Each of these project directories should be organized in a way that reflects the outline of the project. Each directory should contain documentation that relates to the data within it. That documentation typically takes the form of a README file that has text to describe the contents and, possibly, how they were generated.

ESTABLISHING FILE-NAMING CONVENTIONS

Unix allows an almost unlimited variability in file naming. Filenames can contain any character other than / or the null character (the character whose binary representation is all zeros). However, it's important to remember that some characters, such as a space, a backslash, or an ampersand, have special meaning on the command line and may cause problems when naming files. Filenames can be up to 255 characters in length on most systems. However, it's wise to aim for uniformity rather than uniqueness in file naming. Most humans are much better at remembering frequently used patterns than they are at remembering unique 255-character strings, after all.

A common convention in file naming is to name the file with a unique name followed by a dot (.) and then an extension that uniquely indicates the file type.

As you begin working with computers in your research and structuring your data environment, you need to develop your own file-naming conventions, or preferably, find out what naming conventions already exist and use them consistently throughout your project. There's nothing so frustrating as looking through old data sets and finding that the same type of file has been named in several different ways. Have you found all the data or results that belong together? Can the file you are looking for be named something else entirely? In the absence of conventions, there's no way to know this except to open every unidentifiable file and check its format by eye. The next section provides a detailed example of how to set up a filesystem that won't have you tearing out your hair looking for a file you know you put there.

Here are some good rules of thumb to follow for file-naming conventions:

  • Files of the same type should have the same extension.
  • Files derived from the same source data should have a common element in their unique names.
  • The unique name should contain as much information as possible about the experiment.
  • Filenames should be as short as is possible without compromising uniqueness.

You'll probably encounter preestablished conventions for file naming in your work. For instance, if you begin working with protein sequence and structure datafiles, you will find that families of files with the same format have common extensions. You may find that others in your group have established local conventions for certain kinds of data files and results. You should attempt to follow any known conventions. Some typical file naming conventions we'll use are:

  • .fa - fasta formatted sequence files
  • .fq - fastq files that have sequence data and quality scores
  • .txt - plain text files
  • .sh - shell scripts
  • .gz - compressed files
  • .bam - binary versions of mapped read files

STRUCTURING A PROJECT

In a typical genome sequencing and assembly project you will encounter several file types. For example, you may want to keep a record of the sample origination information in a spreadsheet (samples.xlsx). That could be kept in an 'info' directory along with a readme file that describes the project. Then, after getting the initial sequencing reads, you would want to store those in a 'raw_reads' directory. The reads will eventually be assembled into an assembly but you may use multiple assemblers and/or perform multiple assemblies using any one assembler. Thus, you should have an 'assemblies' directory and any subdirectories might reflect the different assembly methods you used within them. Once you decide on an assembly to use for downstream analyses, you will want to keep those files in a relevant directory called 'data_analysis'. Finally, you will likely be writing several scripts to use for your assemblies and analyses. Thus, you will want to store them in a 'scripts' directory. Overall, it may look something like this:

image

Assuming you're working on the TTU HPCC and keeping your files on the /lustre/work/ system, your file hierarchy would look something like this:

/lustre/work/your username/species_x_assembly/info
/lustre/work/your username/species_x_assembly/info/readme.txt
/lustre/work/your username/species_x_assembly/raw_reads
/lustre/work/your username/species_x_assembly/raw_reads/illumina
/lustre/work/your username/species_x_assembly/raw_reads/illumina/file1_R1.fastq.gz
/lustre/work/your username/species_x_assembly/raw_reads/illumina/file1_R2.fastq.gz
/lustre/work/your username/species_x_assembly/raw_reads/illumina/file2_R1.fastq.gz
/lustre/work/your username/species_x_assembly/raw_reads/illumina/file2_R2.fastq.gz

and so on....

/lustre/work/your username/species_x_assembly/raw_reads/pacbio
/lustre/work/your username/species_x_assembly/raw_reads/pacbio/file1.fastq.gz
/lustre/work/your username/species_x_assembly/raw_reads/pacbio/file2.fastq.gz

and so on....

/lustre/work/your username/species_x_assembly/raw_reads/nanopore
/lustre/work/your username/species_x_assembly/raw_reads/nanopore/file1.fastq.gz
/lustre/work/your username/species_x_assembly/raw_reads/nanopore/file2.fastq.gz

and so on....
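A skeleton like the one described above can be laid down in one go. The following is only a sketch: it builds the tree inside a throwaway directory made with mktemp rather than under /lustre/work, and the variable names sandbox and project are hypothetical. It uses mkdir -p to create nested directories, touch to create an empty file, and ls -R to list a directory tree recursively.

```shell
# Hypothetical project root inside a throwaway directory; on HPCC this
# would live under /lustre/work/[eraider] instead.
sandbox=$(mktemp -d)
project="$sandbox/species_x_assembly"

# One mkdir -p call can take several paths and creates any missing parents.
mkdir -p \
    "$project/info" \
    "$project/raw_reads/illumina" \
    "$project/raw_reads/pacbio" \
    "$project/raw_reads/nanopore" \
    "$project/assemblies" \
    "$project/data_analysis" \
    "$project/scripts"

# An empty readme as a reminder to document the project.
touch "$project/info/readme.txt"

# Recursively list the result.
ls -R "$project"
```

Creating the whole structure up front, before any data arrives, makes it much more likely that files end up where they belong.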

GETTING FROM A TO B

We are in the home directory on the computer but we want to work in the gge directory. To change directories in Unix, we use the cd command:

cd gge

login-20-26:/gge$

Notice that — on this system — the command prompt has expanded to include our current directory. This doesn’t happen by default on all Unix systems, but you should know that you can configure what information appears as part of the command prompt.

Let’s make two new subdirectories and navigate into them:

mkdir outer_directory

cd outer_directory

login-20-26:/gge/outer_directory$

Now try:

mkdir inner_directory

cd inner_directory

login-20-26:/gge/outer_directory/inner_directory$

Now our command prompt is getting quite long, but it reveals that we are three levels beneath the home directory. We created the two directories in separate steps, but it is possible to use the mkdir command in a way that does this all in one step.

Like most Unix commands, mkdir supports command-line options which let you alter its behavior and functionality. Command-line options are, as the name suggests, optional arguments that are placed after the command name. They often take the form of single letters (following a dash). If we had used the -p option of the mkdir command we could have done this in one step. E.g.

mkdir -p outer_directory/inner_directory

The -p option means, 'create all intermediate subdirectories in making the ultimate directory in the path.'

MAKING THE 'LS' COMMAND MORE USEFUL

The .. operator that we saw earlier can also be used with the ls command, e.g. you can list directories that are ‘above’ you:

ls ../../

exampledirectory1  exampledirectory2  examplefile1.txt  outer_directory

Time to learn another useful command-line option. If you add the letter ‘l’ to the ls command it will give you a longer output compared to the default:

ls -l ../../

total 12
drwxr-xr-x 2 <eraider> bio 4096 Nov  1  2019 exampledirectory1
drwxr-xr-x 2 <eraider> bio 4096 Nov  1  2019 exampledirectory2
-rw-r--r-- 1 <eraider> bio    0 Jul 17 11:32 examplefile1.txt
drwxr-xr-x 3 <eraider> bio 4096 Jul 17 12:46 outer_directory

Note that if you were to enter the following:

ls -l ~/gge

You get exactly the same thing. Why?

For each file or directory we now see more information (including file ownership and modification times). A ‘d’ at the start of a line indicates a directory, while a ‘-’ indicates a regular file. There are many, many different options for the ls command. Try out the following (against any directory of your choice) to see how the output changes.

ls -R ../../

ls -l -t -S ../../

ls -l -t -S -r ../../

ls -ltSr ../../

ls -lh ../../

Note that the last two examples combine multiple options but use only one dash. This is a very common way of specifying multiple command-line options. You may be wondering what some of these options are doing. It’s time to learn about Unix documentation.

I'm a big fan of ls -lhrt. Gives you just about everything you could ask for.
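If you want to watch a few of these options at work on files whose sizes you know, here is a sketch that runs in a throwaway directory. The file names and the sandbox are made up for illustration; printf is used only as a convenient way to write a known number of characters into each file.

```shell
# Work in a throwaway directory with three files of different sizes.
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'a' > small.txt
printf 'aaaaaaaaaa' > medium.txt
printf 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' > large.txt

# -S sorts by size, largest first.
largest_first=$(ls -S)
echo "$largest_first"

# Adding -r reverses the order, so the smallest comes first.
smallest_first=$(ls -Sr)
echo "$smallest_first"

# -l adds the long listing and -h prints sizes in human-readable units.
ls -lhSr
```

Swapping -S for -t sorts by modification time instead, which is why ls -lhrt puts the most recently changed file at the bottom of the list, right above your next prompt.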

'MAN' PAGES

If every Unix command has so many options, you might be wondering how you find out what they are and what they do. Well, thankfully every Unix command has an associated ‘manual’ that you can access by using the man command. E.g.

man ls

man cd

man man # yes even the man command has a manual page

When you are using the man command, press space to scroll down a page, b to go back a page, or q to quit. You can also use the up and down arrows to scroll a line at a time. The man command is actually using another Unix program, a text viewer called less, which we’ll come to later on.

REMOVING DIRECTORIES

We now have a few (empty) directories that we should remove. To do this, use the rmdir command. It will only remove empty directories, so it is quite safe to use. If you want to know more about this command (or any Unix command), then remember that you can just look at its man page.

cd ~/gge

rmdir outer_directory/inner_directory

Using ls correctly will show you that the outer_directory is still present but the inner directory is gone.

rmdir outer_directory
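The safety net that rmdir provides can be demonstrated with a short sketch in a throwaway directory (the sandbox and the variable outer_first_try exist only for this illustration).

```shell
# Recreate the nested directories in a throwaway location.
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p outer_directory/inner_directory

# outer_directory still contains inner_directory, so rmdir refuses to remove it.
if rmdir outer_directory 2>/dev/null; then
    outer_first_try="removed"
else
    outer_first_try="refused"
fi
echo "first attempt: $outer_first_try"

# Remove from the inside out and both calls succeed.
rmdir outer_directory/inner_directory
rmdir outer_directory
```

The refusal on the first attempt is the behavior you want: rmdir will not quietly take a directory's contents with it.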

USING TAB COMPLETION

Saving keystrokes may not seem important, but the longer that you spend typing in a terminal window, the happier you will be if you can reduce the time you spend at the keyboard, especially as prolonged typing is not good for your body. So the best Unix tip to learn early on is that you can tab complete the names of files and programs on most Unix systems. Type enough letters that uniquely identify the name of a file, directory or program and press tab…Unix will do the rest. E.g. if you type ‘tou’ and then press tab, Unix should autocomplete the word to ‘touch’ (this is a command which we will learn more about in a minute). In this case, tab completion will occur because there are no other Unix commands that start with ‘tou’. If pressing tab doesn’t do anything, then you have not typed enough unique characters. In this case pressing tab twice will show you all possible completions.

Navigate to your home directory, make a 'Learning_unix' directory with the mkdir command, and then use the cd command to change to the Learning_unix directory. Use tab completion to complete the directory name. If there are no other directories starting with ‘L’ in your home directory, then you should only need to type ‘cd’ + ‘L’ + ‘tab’.

Tab completion will make your life easier and make you more productive! This trick can save you a LOT of typing! It can also save you many, many instances of trying to figure out what simple typos you may have made when entering a long path.

Another great time-saver is that Unix stores a list of all the commands that you have typed in each login session. You can access this list by using the history command or more simply by using the up and down arrows to access anything from your history. So if you type a long command but make a mistake, press the up arrow and then you can use the left and right arrows to move the cursor in order to make a change.

CREATING EMPTY FILES WITH 'TOUCH'

The following sections will deal with Unix commands that help us to work with files, i.e. copy files to/from places, move files, rename files, remove files, and most importantly, look at files. First, we need to have some files to play with. The Unix command touch will let us create a new, empty file. The touch command does other things too, but for now we just want a couple of files to work with.

cd ~/gge

touch heaven.txt

touch earth.txt

ls

earth.txt  exampledirectory1  exampledirectory2  examplefile1.txt  heaven.txt

A QUICK DIGRESSION.

Go back to your Bitvise SFTP window and use your mouse to navigate to the gge directory. Look around inside of it and you should see all of the work you've done to this point but in a graphical user format rather than on the command line.

Quickly drag and drop the file earth.txt to your desktop. This is how we will transfer files back and forth from HPCC to the local system.

MOVING FILES (MOVING HEAVEN AND EARTH)

Now, let’s assume that we want to move these files to a new directory (‘temp’). We will do this using the Unix mv (move) command. Remember to use tab completion:

mkdir temp

mv heaven.txt temp/

mv earth.txt temp/

ls

exampledirectory1  exampledirectory2  examplefile1.txt  temp

ls temp/

earth.txt  heaven.txt

For the mv command, we always have to specify a source file (or directory) that we want to move, and then specify a target location. If we had wanted to we could have moved both files in one go by typing any of the following commands:

mv *.txt temp/

mv *ea* temp/

The asterisk * acts as a wild-card character, essentially meaning ‘match anything’. The second example works because only those two files contain the letters ‘ea’ in their names. Be aware that the first example would also move examplefile1.txt, because its name ends in .txt too. Using wild-card characters can save you a lot of typing but be careful with them, especially with commands like rmdir. Once a file or directory is gone, it's gone forever.

The ‘?’ character is also a wild-card but with a slightly different meaning. See if you can work out what it does.

RENAMING FILES

In the earlier example, the destination for the mv command was a directory name (temp). So we moved a file from its source location to a target location, but note that the target could have also been a (different) file name, rather than a directory. E.g. let’s make a new file and move it whilst renaming it at the same time:

touch rags

ls

exampledirectory1  exampledirectory2  examplefile1.txt  rags  temp

mv rags temp/riches

ls temp/

earth.txt  heaven.txt  riches

In this example we create a new file (‘rags’), move it to a new location, and in the process change its name (to ‘riches’). So mv can rename a file as well as move it. The logical extension of this is using mv to rename a file without moving it (mv is the standard way to do this; some Linux systems also provide a separate ‘rename’ utility, but you cannot count on it being installed):

mv temp/riches temp/rags

Use ls to see what happened.

MOVING DIRECTORIES: 'MV'

It is important to understand that as long as you have specified a ‘source’ and a ‘target’ location when you are moving a file, then it doesn’t matter what your current directory is. You can move or copy things within the same directory or between different directories regardless of whether you are in any of those directories. Moving directories is just like moving files:

mkdir temp2

mv temp2 temp

ls temp/

earth.txt  heaven.txt  rags  temp2
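To see that your current directory really doesn't matter, here is a small sketch that moves and renames a file while sitting somewhere completely unrelated. It assumes a scratch directory from mktemp standing in for your home directory; the gge, rags and riches names are borrowed from the examples above.

```shell
# Build a small stand-in for the tutorial layout inside a scratch directory.
scratch=$(mktemp -d)
mkdir -p "$scratch/gge/temp"
touch "$scratch/gge/rags"

# Sit somewhere unrelated on purpose.
cd /

# Source and target are both given as full paths, so the move still works.
mv "$scratch/gge/rags" "$scratch/gge/temp/riches"
ls "$scratch/gge/temp"
```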

REMOVING FILES: 'RM'

You’ve seen how to remove a directory with the rmdir command, but rmdir won’t remove directories if they contain any files. So how can we remove the files we have created (inside gge/temp)? In order to do this, we will have to use the rm (remove) command.

Please read the next section VERY carefully. Misuse of the rm command can lead to needless death & destruction.

Potentially, rm is a very dangerous command; if you delete something with rm, you will not get it back! It is possible to delete everything in your home directory (all directories and subdirectories) with rm, which is why it is such a dangerous command. Never, NEVER use rm *

Let me repeat that last part again. It is possible to delete EVERY file you have ever created with the rm command. Are you scared yet? You should be. Luckily there is a way of making rm a little bit safer. We can use it with the -i command-line option which will ask for confirmation before deleting anything (remember to use tab-completion):

cd temp

ls

earth.txt  heaven.txt  rags  temp2

rm -i earth.txt heaven.txt rags

You'll need to respond with 'y' to the following prompts.

rm: remove regular empty file ‘earth.txt’? y
rm: remove regular empty file ‘heaven.txt’? y
rm: remove regular empty file ‘rags’? y

ls

All you're left with is

temp2

We could have simplified this step by using a wild-card (e.g. rm -i *.txt) or we could have made things more complex by removing each file with a separate rm command. Let’s finish cleaning up:

rmdir temp2

cd ..

rmdir temp

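You could have gotten rid of both at once by using rm with the -rf options on the top directory (-r removes a directory along with everything inside it, and -f does so without asking for confirmation):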

rm -rf temp

But again, this is dangerous territory.
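If you do need to remove a whole directory tree, one cautious habit is to let rmdir try first and to list the contents before reaching for rm -r. This is only a sketch, run inside a scratch directory from mktemp so it cannot touch your real files; the directory and file names are just the ones used above.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p temp/temp2
touch temp/earth.txt

# rmdir only removes empty directories, so this attempt fails harmlessly.
if rmdir temp 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi
echo "rmdir temp: $rmdir_result"

# Look at exactly what is about to be deleted, then delete it.
ls -R temp
rm -r temp
```

Leaving off -f means rm will still complain if something looks wrong, which is usually what you want.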

COPYING FILES: 'CP'

Copying files with the cp (copy) command is very similar to moving them. Remember to always specify a source and a target location. Let’s create a new file and make a copy of it:

mkdir copy

cd copy

touch file1

cp file1 file2

ls

file1  file2

What if we wanted to copy files from a different directory to our current directory? Let’s put a file in our home directory (specified by ~ remember) and copy it to the current directory (copy):

touch ~/file3

ls ~

gge file3 <and possibly a bunch of other things>

cp ~/file3 .

ls

file1  file2  file3

This last step introduces another new concept. In Unix, the current directory can be represented by a . (dot) character. You will mostly use this only for copying files to the current directory that you are in. Compare the following:

ls

ls .

ls ./

In this case, using the dot is somewhat pointless because ls will already list the contents of the current directory by default. Also note how the trailing slash is optional. When you are completely finished with a directory like 'copy', you can use rm to remove its files and rmdir to remove the directory itself.

For now, just remove all three files using rm and stay where you are; the next examples are run from inside 'copy'.

COPYING DIRECTORIES

The cp command also allows us (with the use of a command-line option) to copy entire directories. Use man cp to see how the -R or -r options let you copy a directory recursively.
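Once you have read the man page, a recursive copy might look something like the sketch below. The project and project_backup names and the files inside them are invented for illustration, and everything happens in a scratch directory from mktemp.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A small directory tree to copy.
mkdir -p project/data
touch project/readme.txt project/data/reads.txt

# -R copies the directory and everything beneath it.
cp -R project project_backup

# The original is untouched and the copy has the same layout.
ls -R project_backup
```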

VIEWING FILES WITH 'LESS'

So far we have covered listing the contents of directories and moving, copying, and deleting files and directories. Now we will quickly cover how you can look at files. The less command lets you view (but not edit) text files. We will use the echo command to put some text in a file and then view it:

echo "Call me Ishmael."

Call me Ishmael.

echo "Call me Ishmael." > opening_lines.txt

ls

opening_lines.txt

less opening_lines.txt

On its own, echo isn’t a very exciting Unix command. It just echoes text back to the screen. But we can send that text into an output file instead by using the > symbol. This is called file redirection.

Be careful when using file redirection (>): it will overwrite any existing file of the same name without warning.

When you are using less, you can bring up a page of help commands by pressing 'h', scroll forward a page by pressing space, or go forward or backwards one line at a time by pressing 'j' or 'k'. To exit less, press 'q' (for quit). The less program also does about a million other useful things (including text searching).

VIEWING FILES WITH 'CAT'

Let’s add another line to the file:

echo "The primroses were over." >> opening_lines.txt

cat opening_lines.txt

Call me Ishmael.
The primroses were over.

Notice that we use >> and not just >. The >> operator appends to a file. If we had used >, we would have overwritten the file. The cat command displays the contents of the file (or files) and then returns you to the command line. Unlike less, you have no control over how you view that text (or what you do with it). It is a very simple, but sometimes useful, command. You can use cat to quickly combine multiple files or, if you wanted to, make a copy of an existing file:

cat opening_lines.txt > file_copy.txt

Use ls to view the result and then remove the new file.
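Here is a short sketch of cat combining two files, with > and >> side by side so you can see the difference. It assumes a scratch directory from mktemp; first.txt, second.txt and combined.txt are made-up file names.

```shell
scratch=$(mktemp -d)
cd "$scratch"
echo "Call me Ishmael." > first.txt
echo "The primroses were over." > second.txt

# cat with two inputs writes them out one after the other.
cat first.txt second.txt > combined.txt

# >> adds to the end of combined.txt ...
echo "Marley was dead, to begin with." >> combined.txt
combined_lines=$(wc -l < combined.txt | tr -d ' ')

# ... whereas > throws away what first.txt used to contain.
echo "It was a pleasure to burn." > first.txt
first_now=$(cat first.txt)

echo "combined.txt has $combined_lines lines"
echo "first.txt now says: $first_now"
```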

COUNTING CHARACTERS IN A FILE

ls

opening_lines.txt

ls -l

total 4
-rw-rw-r-- 1 [eraider] bio 42 Jun 15 04:13 opening_lines.txt

wc opening_lines.txt

 2  7 42 opening_lines.txt

wc -l opening_lines.txt

2 opening_lines.txt

The ls -l command shows us a long listing, which includes the size of the file in bytes (in this case ‘42’). Another way of finding this out is by using Unix’s wc command (word count). By default this tells you how many lines, words, and bytes are in a specified file (or files); for plain text like ours, each character is one byte. You can use command-line options to get just one of those statistics (in this case we count lines with wc -l).
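If you want each statistic on its own, something like the following sketch works. It rebuilds the two-line file in a scratch directory from mktemp, and it feeds the file to wc with <, which is an input redirection we have not covered; treat that part as an optional extra.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' "Call me Ishmael." "The primroses were over." > opening_lines.txt

# Reading the file through < keeps the file name out of the output;
# tr strips the padding that some versions of wc add.
lines=$(wc -l < opening_lines.txt | tr -d ' ')
words=$(wc -w < opening_lines.txt | tr -d ' ')
bytes=$(wc -c < opening_lines.txt | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"
```

The three numbers should agree with the 2, 7 and 42 reported by plain wc above.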

EDITING SMALL FILES WITH 'NANO'

Nano is a lightweight editor installed on most Unix systems. There are many more powerful editors (such as ‘emacs’ and ‘vi’), but these have steep learning curves. Nano is very simple. You can edit (or create) files by typing:

nano opening_lines.txt

You should see the following appear in your terminal:

[Screenshot: nano with opening_lines.txt open in the terminal]

The bottom of the nano window shows you a list of simple commands which are all accessible by typing ‘Control’ plus a letter. E.g. Control + X exits the program.

THE $PATH ENVIRONMENT VARIABLE

One other use of the echo command is for displaying the contents of something known as environment variables. These contain user-specific or system-wide values that either reflect simple pieces of information (your username), or lists of useful locations on the file system. Some examples:

echo $USER
[eraider]
echo $HOME
/home/[eraider]
echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games #Yours will likely look different.

The last one shows the content of the $PATH environment variable, which holds a colon-separated list of directories that are expected to contain programs that you can run. This includes all of the Unix commands that you have seen so far. Each command is simply an executable file that lives in one of these directories (e.g. ls is just a special type of file in the /bin directory).

Knowing how to change your $PATH to include custom directories can be necessary sometimes (e.g. if you install some new bioinformatics software in a non-standard location). We will do this at times throughout the class.
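As a rough sketch of what changing $PATH looks like, the example below adds a directory to the front of the list and then asks the shell where it finds a program. The directory comes from mktemp and hello_gge is a made-up program, both standing in for wherever real software might be installed. The chmod line just marks the file as runnable; chmod is covered later in this tutorial.

```shell
# Stand-in for a non-standard install location (hypothetical).
tools_dir=$(mktemp -d)

# A tiny made-up program so there is something to find.
printf '#!/bin/sh\necho "hello from a custom directory"\n' > "$tools_dir/hello_gge"
chmod u+x "$tools_dir/hello_gge"

# Put the new directory at the front of the search list for this session.
export PATH="$tools_dir:$PATH"

# command -v reports which file the shell would run for a given name.
found=$(command -v hello_gge)
echo "$found"
```

A change made this way only lasts until you log out. Keeping $PATH at the end of the new value matters: leave it off and the shell forgets where ls and everything else lives.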

MATCHING LINES IN FILES WITH 'GREP'

Use nano to add the following lines to opening_lines.txt:

Now is the winter of our discontent.  
All children, except one, grow up.  
The Galactic Empire was dying.  
In a hole in the ground there lived a hobbit.  
It was a pleasure to burn.  
It was a bright, cold day in April, and the clocks were striking thirteen.  
It was love at first sight.  
I am an invisible man.  
It was the day my grandmother exploded.  
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.  
Marley was dead, to begin with.

You will often want to search files to find lines that match a certain pattern. The Unix command grep does this (and much more). The following examples show how you can use grep’s command-line options to:

  • show lines that match a specified pattern
  • ignore case when matching (-i)
  • only match whole words (-w)
  • show lines that don’t match a pattern (-v)
  • use wildcard characters and other patterns to allow for alternatives (*, ., and [])

grep was opening_lines.txt

The Galactic Empire was dying.
It was a pleasure to burn.
It was a bright, cold day in April, and the clocks were striking thirteen.
It was love at first sight.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
Marley was dead, to begin with.

The -v option inverts the search.

grep -v was opening_lines.txt

Call me Ishmael.
The primroses were over.
Now is the winter of our discontent.
All children, except one, grow up.
In a hole in the ground there lived a hobbit.
I am an invisible man.

grep all opening_lines.txt

Call me Ishmael.

The -i option allows you to ignore the case of the searched string.

grep -i all opening_lines.txt

Call me Ishmael.
All children, except one, grow up.

grep in opening_lines.txt

Now is the winter of our discontent.
The Galactic Empire was dying.
In a hole in the ground there lived a hobbit.
It was a bright, cold day in April, and the clocks were striking thirteen.
I am an invisible man.
Marley was dead, to begin with.

The -w option only yields whole matches to the searched string.

grep -w in opening_lines.txt

In a hole in the ground there lived a hobbit.
It was a bright, cold day in April, and the clocks were striking thirteen.

See if you can figure out what the following searches accomplish.

grep '[aeiou]t' opening_lines.txt

In a hole in the ground there lived a hobbit.
It was love at first sight.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
Marley was dead, to begin with.

grep -w -i '[aeiou]t' opening_lines.txt

It was a pleasure to burn.
It was a bright, cold day in April, and the clocks were striking thirteen.
It was love at first sight.
It was the day my grandmother exploded.
When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
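One thing to watch for with patterns containing [ ], * or ?: the shell treats those characters as wild-cards too, and it gets to them before grep does. Putting the pattern in quotes avoids surprises. Here is a minimal sketch, assuming a scratch directory from mktemp that happens to contain a file named at (a made-up name chosen to trigger the problem).

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' \
  "It was love at first sight." \
  "In a hole in the ground there lived a hobbit." > opening_lines.txt

# A file whose name the shell can match against the unquoted pattern.
touch at

# Unquoted: the shell rewrites the pattern to the file name 'at'
# before grep ever sees it, so grep searches for the literal text at.
unquoted_count=$(grep -c [aeiou]t opening_lines.txt)

# Quoted: grep receives the pattern exactly as typed.
quoted_count=$(grep -c '[aeiou]t' opening_lines.txt)

echo "unquoted: $unquoted_count line(s), quoted: $quoted_count line(s)"
```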

COMBINING UNIX COMMANDS WITH PIPES

One of the most powerful features of Unix is that you can send the output from one command or program to any other command (as long as the second command accepts input of some sort). We do this by using what is known as a pipe. This is implemented using the ‘|’ character (which always seems to be on a different key depending on the keyboard that you are using). Think of the pipe as simply connecting two Unix programs. Here’s an example which introduces some new Unix commands:

grep was opening_lines.txt | wc -c

316

The above command searches the specified file for lines matching ‘was’ and sends the lines that match through a pipe to the wc program. We use the -c option to count the total number of characters (strictly, bytes) in the matching lines (316).

The following uses some built in commands that we haven't discussed yet.

grep was opening_lines.txt | sort | head -n 3 | wc -c

130

The second example first sends the output of grep to the Unix sort command. This sorts a file alphanumerically by default. The sorted output is sent to the head command which by default shows the first 10 lines of a file. We use the -n option of this command to only show 3 lines. These 3 lines are then sent to the wc command as before.

Whenever making a long pipe, test each step as you build it!
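Building a pipe one stage at a time might look like the sketch below. It uses a four-line stand-in for opening_lines.txt inside a scratch directory from mktemp, so the numbers will not match the full file.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' \
  "Marley was dead, to begin with." \
  "I am an invisible man." \
  "It was love at first sight." \
  "It was a pleasure to burn." > opening_lines.txt

# Step 1: does grep pick out the lines I expect?
grep was opening_lines.txt

# Step 2: add sort and check the order.
grep was opening_lines.txt | sort

# Step 3: only now add the last stage and keep the result.
first_sorted=$(grep was opening_lines.txt | sort | head -n 1)
echo "$first_sorted"
```

If a stage prints something unexpected, you know exactly which command to fix.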

MISCELLANEOUS UNIX POWER COMMANDS

The following examples introduce some other Unix commands, and show how they could be used to work on a fictional file called file.txt. Remember, you can always learn more about these Unix commands from their respective man pages with the man command. These are not all real world cases I'm asking you to perform, but rather show the potential diversity of Unix command-line tools.

View the penultimate 10 lines of a file (using head and tail commands):

tail -n 20 file.txt | head

Show lines of a file that begin with a start codon (ATG) (the ^ matches patterns at the start of a line):

grep "^ATG" file.txt

Cut out the 3rd column of a tab-delimited text file and sort it to only show unique lines (i.e. remove duplicates):

cut -f 3 file.txt | sort -u

Count how many lines in a file contain ‘cat’ or ‘bat’ (the -c option of grep counts matching lines; add -w if you only want whole words):

grep -c '[bc]at' file.txt

Turn lower-case text into upper-case (using tr command to ‘transliterate’):

cat file.txt | tr 'a-z' 'A-Z'

Change all occurrences of ‘Chr1’ to ‘Chromosome 1’ and write changed output to a new file (using sed command; the trailing g makes sed replace every match on a line rather than just the first):

cat file.txt | sed 's/Chr1/Chromosome 1/g' > file2.txt
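Since file.txt is fictional, here is a small sketch that makes one up so you can try a couple of these commands for real. The three tab-delimited columns (chromosome, gene, codon) are invented, and everything happens in a scratch directory from mktemp.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Three tab-delimited columns: chromosome, gene, codon (all invented).
printf 'Chr1\tgeneA\tATG\nChr1\tgeneB\tATG\nChr2\tgeneC\tTAA\n' > file.txt

# Column 3 holds ATG, ATG, TAA, so two unique values remain.
unique_codons=$(cut -f 3 file.txt | sort -u | wc -l | tr -d ' ')

# Without g, sed changes only the first match on each line.
first_only=$(echo "Chr1 pairs with Chr1" | sed 's/Chr1/Chromosome 1/')

# With g, every match on the line is changed.
every_match=$(echo "Chr1 pairs with Chr1" | sed 's/Chr1/Chromosome 1/g')

echo "$unique_codons"
echo "$first_only"
echo "$every_match"
```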

CHANGING FILE/DIRECTORIES PERMISSIONS WITH 'CHMOD' (modified from https://www.guru99.com/file-permissions.html)

Say you do not want your colleague to see your research files. This can be achieved by changing file permissions.

We can use the chmod command, which stands for 'change mode'. Using the command, we can set permissions (read, write, execute) on a file/directory for the owner, group and the world.

Usage: chmod permissions filename

There are 2 ways to use the command: Absolute mode and Symbolic mode

ABSOLUTE (NUMERIC) MODE

In this mode, file permissions are not represented as characters but as a three-digit octal number. The table below gives the numbers for all permission types.

| Number | Permission Type | Symbol |
|---|---|---|
| 0 | No Permission | --- |
| 1 | Execute | --x |
| 2 | Write | -w- |
| 3 | Execute + Write | -wx |
| 4 | Read | r-- |
| 5 | Read + Execute | r-x |
| 6 | Read + Write | rw- |
| 7 | Read + Write + Execute | rwx |

Perhaps you have a file, text.txt.

chmod 764 text.txt

The above command will change permissions as follows:

  • Owner can read, write and execute
  • Usergroup can read and write
  • World can only read

This is shown as '-rwxrw-r--'.
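Each digit is just the sum of the values in the table: 7 = 4 + 2 + 1 (read, write, execute), 6 = 4 + 2 (read, write) and 4 = 4 (read only). You can check the result yourself with something like the sketch below, which assumes a scratch directory from mktemp and uses cut to keep only the first ten characters of the ls -l line.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch text.txt

# Owner 7 (rwx), group 6 (rw-), world 4 (r--).
chmod 764 text.txt

# The permission string is the first 10 characters of the long listing.
perms=$(ls -l text.txt | cut -c 1-10)
echo "$perms"
```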

SYMBOLIC MODE

In the Absolute mode, you change permissions for all 3 owners. In the symbolic mode, you can modify permissions of a specific owner. It makes use of mathematical symbols to modify the file permissions.

| Operator | Description |
|---|---|
| + | Adds a permission to a file or directory |
| - | Removes the permission |
| = | Sets the permission and overrides the permissions set earlier |

The various owners are represented as

| User | Denotations |
|---|---|
| u | user/owner |
| g | group |
| o | other |
| a | all |

We will not be using permissions in numbers like 755 but characters like rwx.

chmod o=rwx text.txt allows the other users to read, write, and execute the file.

chmod u-r text.txt removes read permission from the user (owner).
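Coming back to the colleague you did not want peeking at your research files, a sketch of how symbolic mode could handle that is below. It assumes a scratch directory from mktemp and starts from the 764 permissions used earlier so the before and after are easy to compare.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch text.txt
chmod 764 text.txt            # start from -rwxrw-r--

# Take read, write and execute away from group and other in one step.
chmod go-rwx text.txt
private=$(ls -l text.txt | cut -c 1-10)

# Give the group read access back, leaving everything else alone.
chmod g+r text.txt
shared=$(ls -l text.txt | cut -c 1-10)

echo "$private"
echo "$shared"
```

Notice that + and - only touch the permissions you name, which is why the owner's rwx never changes here.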

MOVING FILES TO AND FROM HPCC

To move files to and from HPCC, you will need to use an SFTP client. In class, this is easy to do using either MobaXterm or Bitvise.

It's easiest with Bitvise but you can do it with MobaXterm as well. I'll let you figure out how.

In Bitvise, the SFTP window that opened when you logged in has two large panes. The left one shows what's on your local computer and the right one shows what's on HPCC. You can simply drag and drop from one to the other.

FileZilla, a package that will also work on Macs, works the same way as Bitvise. Left is local, right is HPCC.

FOR YOU TO DO

If you've followed this tutorial all the way through, this will be easy.

history | tail -200 > history_tutorial_01.txt

Download this file to your local computer using any of the file transfer methods described above and then upload it to Blackboard under Assignment 1 - Linux. That's it. I'll check to see if you did everything.

For additional reading, read through this document (https://kb.iu.edu/d/afsk) to use as a cheat sheet.
