Section 5: More complex but useful commands - Green-Biome-Institute/AWS GitHub Wiki

Back to Section 4: Making, Moving, Copying, Editing, and Removing Files!

Go back to tutorial overview

Review

Before we jump straight into Module 2, let’s quickly review a couple of the concepts and commands from the previous sections that you should now be familiar with.

Concepts:

  • The Command Line Interface (CLI) is the window where you type commands and information and enter it to interact with your computer.
  • A directory is like a folder: it can contain files or other directories.
  • When you enter an EC2 instance, you are using an entirely different computer (that is in the cloud) than your own!
  • In cases where a regular user doesn’t have permission to do something, you can gain administrator privileges (also known as superuser or root privileges).
  • Using the ls command is helpful when you feel like you have gotten “lost” in the CLI
  • Commands have options that allow you to use those commands in different ways. These have the syntax: [command] -[option].
  • Most commands have manuals that you should check if you need help.
  • A variable is a container of information.
  • An environment variable is a variable that is accessible anywhere within your CLI session (no matter where you change directory to, you can still use or read it)
  • The path of a directory or file is the sequence of directories that lead to it. It’s like a map for the computer to use when searching for a file or directory.
  • The PATH (uppercase) is an environment variable with the series of paths (lowercase) that the computer searches when it looks for the instructions for the commands that you enter.
  • It’s always a good idea to organize your computer with descriptive directories and filenames!
  • The command line isn’t hard! It just takes some practice to gain familiarity with.

Commands:

  • echo [words] - print [words] onto the screen
  • man [command] - read the manual for [command]
  • pwd - print your current working directory
  • ls - list the contents of your current working directory
  • cd [directory] - change directory into [directory]
  • cat [file] - read [file] onto the screen
  • mkdir [dir-name] - make a directory named [dir-name]
  • rm [file] - remove [file]
  • mv [file] [destination] - move [file] to [destination]
  • cp [file1] [new-filename] - make a copy of [file1] with the new name [new-filename]
  • nano [file] - enter the nano editor and edit [file]
  • exit - exit the current EC2 instance, nano editor, or terminal session
  • sudo [command] - execute [command] with administrator privileges
  • clear - clear the current text from the CLI window

With that review done, we can move onto the content of this module.

Section 5: More complex but useful commands

Learning Points

The aim of this module will be to build on our current commands with some more complex ones. These commands will be used to:

  • Tell us information about the computer, like how much storage is left (so we know there is enough room to download a file) or how much memory is being used
  • Tell us information about files or directories
  • Search for names and information within files and directories, and replace information within those files
    • For example, imagine I have a file with 1,000,000 data entries! If each entry is labelled with a number (entry 1 has a label 0000001 followed by its data, entry 2 has a label 0000002 followed by its data, and so on), how do we read only the data of entry number 235,683 without reading the whole file?
  • Downloading files from the internet or github
  • Unzipping files with common file extensions
  • Creating new virtual terminal sessions to run programs on
  • A bunch of great new commands including:
    • lsblk: List block devices on a Linux system
    • df -h: Display amount of free space in your system
    • top: Real-time view of what software your computer system is currently running
    • wc: Newline, word, byte counter
    • grep: Search for patterns within files or directories
    • |: Pipe command allows you to take the output of one command and use it as the input for another
    • sed: Search for patterns and replace them with something else
    • apt : Download software packages.
    • wget: Download files from web servers
    • curl: Download files from web servers
    • tar: Compress/uncompress files with the .tar or .tar.gz file extension
    • gunzip: Uncompress files with the .gz file extension
    • unzip: Uncompress files with the .zip file extension
    • pip and conda: Download Python packages
    • git clone: Download Github repositories
    • *: wildcard operator that lets you refer to all the files and directories in your current working directory, or to all files that match a pattern (for example, all files that end with .txt or begin with file)
    • >: Redirection operator that directs the output of a command into a new file (or overwrites the existing one)
    • >>: Redirection operator that is the same as >, except if a file already exists, it appends the output of the command to the left of it to the end of the file on the right of it

Finding out more about our EC2 Instance or computer

df -h, lsblk, top

First, we’ll look at commands that tell us information about the computer system itself.

Use:

$ df -h

This command stands for “disk free” and is used to display the amount of free space available in your system! Think of your phone: when you have too many pictures or videos on it, it starts to run out of storage, and you can go into the settings and look at the amount of storage left. This is just like that.

In this output you’ll see several columns. The important ones to note are:

  • Filesystem refers to the name of the filesystem with storage capacity
  • Size refers to the size of those filesystems in either megabytes (M) or gigabytes (G).
  • Used refers to the amount of the space within Size that is currently already being used. This might be because of softwares you've installed or data you've uploaded to the computer.
  • Avail is the amount of storage left on that filesystem
  • Use% is the percentage of the total storage on that filesystem that is currently being used

This command is commonly used, as it is useful for figuring out if you have enough space left on your EC2 instance (or computer) to download more data. Sometimes programs will save lots of files while they run as well, and if you don’t have enough storage capacity for those intermediary files, your program will not run (or will run incorrectly!).
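You can also point df at a specific path to show only the filesystem that holds it; "." means your current directory:

```shell
# Show only the filesystem that contains the current directory (".").
# The output is one header row plus one row for that filesystem.
df -h .
```

This is handy when you just want to know whether the disk you are actually working on has room, without reading the whole table.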

Next use:

$ lsblk

This is a command for listing the block devices on a Linux system. In this case, a block device just refers to the “blocks” of storage on your device. You can think of these as external hard drives that are attached to your computer. This will become more relevant when we start interacting more with AWS itself.

The important things to note in this output are the columns NAME and SIZE. The NAME column shows the names of these storage blocks and the SIZE column shows the size of those storage blocks in either megabytes (M) or gigabytes (G).

When you add more storage onto an EC2 instance, you will use this command. The procedure for doing that is on the EC2 instance wiki page.

Before we use the next command, note that it will take over your CLI window:

$ top

The top program creates a real-time view of everything your computer system is currently doing. Every program or command you use requires your system to run a set of instructions. Each row in the top program shows:

  • an identification number for the process on the left (PID)
  • what percent of the central processing units is being used (%CPU)
  • what percent of the RAM memory is being used (%MEM)
  • how long the process has been running (TIME+)
  • the command that is being run (COMMAND)

If your EC2 instance is running slowly you can check this to see if there is another program running behind the scenes. Or you can use it to log in to confirm that the program you wanted to run is still running.

In order to exit out of the top program, press:

q

These three commands can give us quite a bit of information about our computer! We can now figure out how much storage space our EC2 instance has, how much is available or being used, what the names of the storage blocks are, and what programs are running on the instance (even if we can’t physically see them running on the screen).

Finding out more about our files or directories

wc, grep, pipe operator |, sed

Next, let’s find out some more information about the files and directories on our system.

First, we’ll navigate into the directory ex3-dir and list its contents:

$ cd ex3-dir
$ ls

You can see that there are 2 data files in here. Let’s read the first one:

$ cat data.csv

Let’s scroll to the top to see what’s happening. You can see at the top there is a header with the three labels: “ID, Date, and Data”. We’ll imagine that each row is one entry of a specific type of data with an ID number, the date it was taken, and the numbers in the 3rd column are important data to us. Let’s look at the second data file now:

$ cat expanded-data.csv

Wow! That’s a lot of lines of data! This has the same layout as the last file, except instead of 10 entries, it has 10,000! If we want to get information about this file or find specific components of it, it would take too much time to move around with our mouse and by scrolling.

Let’s instead start looking through this second file (expanded-data.csv) with some tools. We’ll begin with the command wc, which stands for “word count”. As you can imagine, it has a similar purpose to the “word count” function in Microsoft Word. If we just use the wc command with no options:

$ wc expanded-data.csv

We get the output

10001 80003 448903 expanded-data.csv

Looking at this output:

  • The first number, 10001, is the number of newlines (the number of rows: 10,000 data entries plus the header).
  • The second number is the number of words (blocks of text separated by spacing; this includes numbers!). Since there are 3 words in the header (ID, Date, and Data) and 8 words per data entry x 10,000 entries, that’s 3 + 80000 = 80003.
  • The third number is the number of bytes in the file (you don’t need to memorize this, but a byte is a unit of information; each character or number is stored with a certain number of bytes!)
  • After these three numbers, it outputs the name of the file.

We can actually use this for multiple files at the same time (this is true for many commands):

$ wc data.csv expanded-data.csv

Now, you get the same information for each file and for them added up in total!
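If you only care about one of the three numbers, wc has an option for each. A quick sketch, using a small hypothetical file (sample.csv, not the tutorial's data files):

```shell
# Create a small three-line sample file (hypothetical, just for illustration).
printf 'ID0000001,Oct-01,42\nID0000002,Oct-02,17\nID0000003,Oct-03,99\n' > sample.csv

wc -l sample.csv   # count newlines (rows) only
wc -w sample.csv   # count words only
wc -c sample.csv   # count bytes only
```

The -l option in particular comes up constantly, since "how many rows does this data file have?" is usually the question you are really asking.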

Now that we have a larger overview of this data file, it would be great to look for specific bits of data within it. This is where the grep command is used. grep is a powerful tool that can be used to search through both files and directories, we will only go over the basics of it to give you an example.

First, let’s look for a piece of information within this large data file. Let’s say we want to look at data entry “ID2000” specifically and nothing else. We can use grep to search for that number (note this will return everything with the “ID2000” in it, in our case there is only one entry with that number as an ID). The syntax for this is

grep [options] [text-you-want-to-search-for] [file-to-search-through]

So for “ID2000”, use:

$ grep ID2000 expanded-data.csv

You can see that the whole line that has a match is printed and that “ID2000” is printed in red! Let’s imagine that each number in these data files’ 3rd column represents something in real life. What if we want to search for which data entries have specific patterns? For example, which of the 10,000 data entries have the number “1780”? Let’s use grep again, but replace “ID2000” with “1780”:

$ grep 1780 expanded-data.csv

You can see that all the lines with 1780 were printed! But wait, if we scroll up, entry “ID1780” is also highlighted! Remember, the command line does what you tell it, and that is not necessarily what you want it to do. In the above command, you told it to search for every line with “1780” in it, and whenever it found that series of characters, it printed that line.

Next, this example is a good opportunity to teach something called a “pipe”. To “pipe” something in the Linux CLI means to take the output of one thing and use it as the input for something else. For example, it’s nice that we were able to print all the data entries with “1780”... but how many of those entries were there? We certainly don’t want to count them, do we? Well, we just learned how to search for patterns AND we learned how to do a word count that includes the number of rows! Let’s put them together.

The character | is the pipe character. It is not a capital “i” but instead the straight line that is usually on the same key as the backslash “\”. So what we are going to do is search through our data file for everything with the pattern “1780” and then use that printed output as the input for the command wc! The syntax for this looks like:

[command 1] | [command 2]

For our example enter the following:

$ grep 1780 expanded-data.csv | wc

You can see we are doing the exact same grep command as we first did, but now there is no “input” per se for the wc command. This is because the output of the first command grep 1780 expanded-data.csv is being used as the input for the second command wc! And now, the final output we get is just the output of the second command wc: the word count of all the lines of text that had the pattern “1780” in them. So there are 41 entries in our data file with that pattern! Or are there? Remember, one of those data entries was chosen by grep because it had “1780” in its ID number. I wonder where this could be useful in another context? Searching for patterns of characters (say the letters A, C, T, and G, for instance) in very large data files! Sounds familiar, eh?

(As you know, there are actual programs for finding nucleotide sequences, but I thought this was a fun parallel.)
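grep also has options that help with exactly the problems we just ran into. This is a sketch with a small hypothetical file (mini.csv): -c counts matching lines directly, with no pipe to wc needed, and -w matches the pattern only as a whole “word”, so “ID1780” no longer counts as a match for “1780”:

```shell
# A tiny hypothetical data file, just for illustration.
printf 'ID1780,Oct-01,5\nID2000,Oct-02,1780\nID3000,Oct-03,17805\n' > mini.csv

grep -c 1780 mini.csv   # count matching lines instead of printing them
grep -w 1780 mini.csv   # match 1780 only as a whole word, not inside ID1780 or 17805
```

Here grep -c reports 3 (every line contains the characters 1780 somewhere), while grep -w prints only the one line where 1780 stands alone as its own field.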

One other cool thing we can do is search and change files without having to go in and edit them by hand. This is done using the sed command. Like grep, sed is a fairly powerful command and we will only be showing one of its more common features for your awareness. We will use it to search for and change something within a file.

For example, looking at our original data file in this example, let’s imagine that instead of having it read “Oct” in the date, we need it to use the full month for some sort of documentation purpose. Instead of going into our data.csv file in the nano editor and editing every “Oct” (I’m fairly confident you don’t want to do this on a file with 10,000 data entries, right?), we can use the sed command to just look for the word “Oct” and replace it with “October”! The syntax for this command looks like:

sed 's/[text-to-search-for]/[text-to-replace-with]/' [file-to-edit] > [new-file-with-edits_your-initials]

So using my initials, FM, for this example we will use:

$ sed 's/Oct/October/' data.csv > edited_FM.csv
$ cat edited_FM.csv

Now when we read the newly created file edited_FM.csv, it is an exact copy of data.csv, but all of the locations with “Oct” have been changed to “October".

Note: This exact command will actually only change the first “Oct” to “October” on each line. There are more options for sed to use for changing every occurrence of the pattern you want to replace, inserting or deleting text, etc. Read the man page if you’re more interested!
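One of those options is worth a quick look: adding the g (“global”) flag at the end of the sed expression replaces every occurrence on each line instead of just the first. A minimal sketch with a hypothetical file (dates.txt):

```shell
# A line with "Oct" appearing twice (hypothetical example file).
echo 'Oct data, more Oct data' > dates.txt

sed 's/Oct/October/' dates.txt    # replaces only the first Oct on the line
sed 's/Oct/October/g' dates.txt   # the g flag replaces every Oct on the line
```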

Downloading software and information

apt, pip, conda, wget, curl

With those tools in our toolbox, let’s move on to downloading things from the internet! We will use the following commands for downloading new software and downloading data (both from public resources and from our own repositories).

apt

A simple way to download new software (also commonly called “packages” or “software packages”) is to use Linux’s “Advanced Package Tool”: apt. apt is useful for both downloading new packages and for updating existing ones. There are lots of different software packages installed on our EC2 instances and many of them are constantly undergoing edits by developers. In order to update our software we can use the commands:

$ sudo apt update

Then

$ sudo apt upgrade

These two commands together will collect all the software packages that have updates (meaning versions that are newer online than they are on our computer) and then upgrade our packages with those updates.

You can see that in order for apt to be allowed to interact with certain behind-the-scenes packages and directories of our EC2 instance, we need to use the command apt with administrator privileges using the sudo command.

Instead of updating softwares we already have, we can also use the apt command for downloading new softwares as well. The syntax for this looks like:

apt install [package-name]

For example, to download a package called “pip” enter:

$ sudo apt install pip

Since pip is already installed on this EC2 instance, it will either tell you it is updated or that there is an update available. You can just press “n” to cancel the download. In the future if you need the package you are downloading, you would instead press “Y”.

pip / conda

apt will be your most commonly used package installer. Sometimes, though, apt will not be able to find your package. Sometimes that is because the package is simply not available to apt and must be downloaded and installed in a different way that the developer will describe in the manual, and sometimes it is because the package is specific to a different programming language.

For example, there is a programming language called “Python”. A programming language is nothing more than a set of words (a language) that are interpreted by the computer as instructions to do things. Just like in both Spanish and English you can say the word “hello”, but the letters that are put together to do that are different. In Bash, if you use the command echo hello!, it prints “hello!”. In Python, if you use the command print('hello!'), it also prints “hello!”.

Many times, software packages that must be installed are written in different languages, and those packages need different package-manager applications to install them. Continuing on with the Python example: because Python is probably one of the most used languages, especially in biology, the commands for downloading Python packages are pip and conda. conda is the package manager that comes with a distribution of Python called “Anaconda”. The syntax for pip install is

pip install [package-name]

For example, one of the most commonly used Python packages is called “SciPy”, which is a large toolbox of softwares for math, science, and engineering. To download this package you would use:

$ pip install scipy

Both of these commands, apt and pip, download packages into directories that you don’t normally work with, so you won’t “see” the things that are downloaded, though they will become available to you. Other commands for this purpose include yum and dpkg.

wget

With those two package installers, we’ll move onto server-based downloading. If you need to download something from a web server you can use wget. The syntax for wget is:

wget [options] [url]

Now, since we are downloading something directly from a web server, we will use the url in the command. We used an example with this in the first module that we can use again now with a better understanding:

$ wget https://raw.githubusercontent.com/Green-Biome-Institute/AWS/master/hello.txt

Here, we use the wget command to download a text file named “hello.txt” from the AWS GBI Github page. You can also use wget with other types of servers, like File Transfer Protocol (“FTP”) servers, which are incredibly common for storing and downloading files.

Another command for this purpose is curl.

Unzipping and Uncompressing Files

tar, gzip, zip

Many times, when you download a file, there will actually be many files within it that have been compressed into that one file. You may have seen this before with “zipped” files. There are a variety of different types of these compressed files, denoted at the end of the file with .filetype. One of the most common is .tar. To open / uncompress a .tar file, you use the tar command! For example, navigate back into your home directory. You will see the next example in ex4-dir. Before we go into that, let’s each make a copy of this directory for our own use, so as not to mess with that of the other student on this EC2 instance.

Use the following command, but replacing [your-initials] with your name or initials:

cp -r ex4-dir ex4-dir-[your-initials]

For example for me, I will use

cp -r ex4-dir ex4-dir-FM

Now enter the directory ex4-dir-[your-initials] that you just made (use the cd command!), and then list the contents. You will see there are some files in here, one with the .tar file extension, one with the .zip, and one with tar.gz. Let’s open all of them.

First using tar:

$ tar -xvf mytarfile.tar

Then

$ ls

You can see that the tar file is still there, but there is also a new directory! If you look inside of that new directory “f0-9” you’ll see 10 files. There’s nothing in them, this is just an example of multiple files being stored within one tar file. Let’s move onto the zipped file and use the unzip command:

$ unzip myzipfile.zip

The last file now looks to have a combination of file extensions .tar.gz. This file will also be unzipped with tar:

$ tar -xvf mytargzfile.tar.gz

If you see a file that only has .gz, you can use the command gunzip just like you did unzip.

All of these files were examples of compressed files. They all contained a variety of files within them, but you could only see and interact with those files once you had uncompressed and opened them.
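You can also go the other direction and create these compressed files yourself. Here is a quick round trip with hypothetical file and directory names, using tar’s -c (create), -z (gzip compression), -x (extract), and -f (filename) options:

```shell
# Make a directory with one file in it.
mkdir -p demo-dir
echo 'hello from inside the archive' > demo-dir/greeting.txt

# Compress the directory into a .tar.gz file, then delete the original.
tar -czf demo.tar.gz demo-dir
rm -r demo-dir

# Uncompress the archive: demo-dir and its file come back.
tar -xzf demo.tar.gz
cat demo-dir/greeting.txt
```

Being able to make your own .tar.gz files is useful for bundling up results before moving them off an EC2 instance.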

Git and Github

git clone

One last method of downloading files that you will inevitably encounter is git. This opens up a very large topic. Github is an accessible online collection of software projects called repositories. A Github repository is a collection of work (usually code) done by individuals or teams. The git command is used by these individuals or teams to coordinate, upload, edit, and download material from these repositories. All we need to cover is the command git clone [url]. What this does is “clone” (make a copy of) one of these repositories from the online Github server onto your computer.

As an example, let’s download the Green Biome Institute’s Amazon Web Services (AWS) Github repository onto our EC2 instance! First navigate to your home directory:

$ cd

Then use the command:

$ git clone https://github.com/Green-Biome-Institute/AWS

Then

$ ls

You can see some text was printed, and when we list the contents of our home directory, a new directory called AWS has been generated. Navigate into it:

$ cd AWS/

Then

$ ls

Here, you’ll find all of the code that is currently in the GBI’s AWS Github repository. And look at that, you now have another copy of the hello.txt file that we downloaded earlier using wget!

Wildcards and redirects

*, >, >>

Using these files that we have just unzipped, let’s learn 2 new operators.

The first redirect operator is >. This operator redirects an output into something else. For instance, imagine you want to put all of the information from the files in the directory f0-9 into a single file called f0-9summary.txt. Well, you could read each file using cat, copy the output, go into the nano editor for f0-9summary.txt, and paste it in! But as we experienced with the previous expanded-data.csv file, there could be tons of files with tons of information, and doing this would be exhausting. Let’s instead use the redirect operator, >, to put the output of cat file0.txt into a new file f0-9summary.txt:

$ cat file0.txt > f0-9summary.txt

Then

$ cat f0-9summary.txt

The next redirect operator is >>. The single greater than sign, >, will actually overwrite the f0-9summary.txt file every time you use it. The double greater than sign, >> will append to an existing file. So now we can use:

$ cat file1.txt >> f0-9summary.txt

Then

$ cat f0-9summary.txt

Now, you can see the data from both file0.txt and file1.txt has been added to the file f0-9summary.txt.
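The difference between > and >> is easy to see with echo and a hypothetical notes.txt file:

```shell
echo 'first line' > notes.txt     # > creates (or overwrites) notes.txt
echo 'second line' >> notes.txt   # >> appends to the end of notes.txt
echo 'third line' > notes.txt     # > again: the whole file is overwritten!
cat notes.txt                     # only "third line" remains
```

This is worth remembering before pointing > at a file you care about: one stray > can wipe out hours of accumulated output that >> would have preserved.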

Once again, we realize that this would take a long time for all of the files we have! This is where the wildcard operator comes into play.

The wildcard operator, *, means “including everything”. This can be interpreted a couple different ways. First, if you are in a directory and you want to do something with every file (read them, move them, delete them, copy them etc.), the wildcard operator alone can signify “all files within a directory” (including other directories!). For example, let’s read out all of the files in this folder:

$ cat *

You can see that the CLI has used the command cat to read each file, one after another. Well, this is exactly what we want… almost! Now that we have created a new file named f0-9summary.txt, it will also be included if you use * alone. This is where the wildcard operator really shines: you can actually use it to start or finish a word. For example, we only want to append the text from files 0-9 into the summary file, but not the summary file itself. The following command:

$ cat file*

Reads out all of the files in our current directory that start with “file”! We can also use this at the beginning of a word:

$ cat *9.txt

This only reads out files that end in “9.txt”, which we only have one of.

Let’s use * and > together now to create one summary file that has all the content of file0.txt through file9.txt in it:

$ cat file* > f0-9summary.txt

then

$ cat f0-9summary.txt

There we go! By using the > redirect, we overwrote the file f0-9summary.txt with the output of cat file*, which is the readout of all of the text files file0.txt through file9.txt! Great job!
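The steps above can be sketched in miniature with hypothetical file names that mirror the f0-9 example:

```shell
# Create three small files named like the tutorial's file0.txt-file9.txt.
for i in 0 1 2; do echo "contents of file$i" > "file$i.txt"; done

# Combine every file starting with "file" into one summary file.
cat file* > mini-summary.txt
wc -l mini-summary.txt   # 3 lines: one from each input file
```

Because the shell expands file* in alphabetical order, the summary file’s lines appear in the order file0.txt, file1.txt, file2.txt.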

I hope you can appreciate now how much all of this content builds on itself. So many of these commands can be used together to make your life easier as a scientist.

Review Questions:

What command displays the amount of free storage space in our EC2 instance?

  • df -h!

In the context of an EC2 instance, what is a block device similar to on your own computer?

  • It is like an external hard drive, that when mounted (attached) onto the computer, adds storage space that you can save files to.

What command lists the block devices?

  • lsblk

What command shows us a real-time view of the softwares that our EC2 instance is currently running?

  • top

What command shows us the number of newlines, words, and bytes in a specific file (or multiple files!)?

  • wc

What command searches for patterns within files or directories?

  • grep

What command takes the output of one command and allows you to use it as the input of a second command?

  • The pipe command |.

What command can search for and change a pattern inside of a file?

  • sed

What are commands that you can use for downloading software packages?

  • apt, yum, dpkg. For using apt the syntax is: sudo apt install [package-name]

Which commands download files from web servers?

  • wget and curl

What command would you use to uncompress a .tar or .tar.gz file?

  • tar

What command would you use to uncompress a .gz file?

  • gunzip

What command would you use to uncompress a .zip file?

  • unzip

What commands can you use to download python packages?

  • pip and conda (but you can only use conda if the program Anaconda is installed; on our EC2 instances, you can use conda instead of pip!)

How do you download a Github repository to your current working directory?

  • git clone [desired-repository-url]

Which character is symbolic of "all files" (for instance, "list all files in the current directory", or "list all files that end with ".txt")?

  • *

Which character redirects the output of a command into a new file and overwrites any other file that might already exist with the name you gave it?

  • >

Which character redirects the output of a command into a file but, instead of overwriting an existing file, appends that output to the end of it?

  • >>

Move on to Section 6: Interacting with Other Computers & Virtual Terminal Sessions

Go back to tutorial overview