Section 7: Software Dependencies & More Complex Software Manuals - Green-Biome-Institute/AWS GitHub Wiki
Learning Points for Section 7:
From this section you should take away:
- A full perspective on how to find, download, install, and operate third-party software
- An understanding of why you should want to read software manuals
Third-Party Software
At this point you have a fairly substantial foundation for working with the command line interface. With this, you should understand the CLI well enough to start working with new software. Let’s go through the basic steps of using a new piece of software from the beginning:
Identifying new software you would like to use.
There are more tools available online than you will ever realize. If you can think of a tool that would be valuable in your own experiments or in analyzing them, there is a good chance someone has already created or attempted it. It might have been built by another scientist for their own purpose, and therefore may not be the most robust or customizable, but it still may do what you want.
Searching for the software.
As you might have guessed, other scientists and Google are going to be your friends here. Lots of these tools are available on GitHub, and a Google search for the tool will help you find its GitHub page, since tools tend to be named whatever the people who made them wanted.
Another great method for finding bioinformatics and data science tools created by scientists for purposes that might be applicable to you is a literature search. Scientists often publish their software alongside a comparison to currently available tools. For instance, looking for genome assembly software? A literature review on the most efficient genome assemblers will probably give you a good picture of the variety of tools out there and the ones most commonly used for specific types of genomes.
Downloading the software.
This will vary. Sometimes it is available using easier methods like apt install [software]. Sometimes you will have to manually install it by cloning the GitHub repository and following its instructions. Much of the software you might use that requires this has already been installed on another EC2 instance that you can open up and use at any time through the GBI.
Software Dependencies.
This goes hand-in-hand with the previous step (downloading). Most software (if not all of what you will use) requires other software to operate! One of the main precepts of programming is to not reinvent the wheel. For example, if you needed to solve a quadratic equation for its roots, there is already a “function” (a set of instructions for the computer to follow) for finding polynomial roots within the software package “NumPy”. Therefore, you don’t need to waste your time trying to program the quadratic formula. So if you were writing a program that needed to solve quadratic equations, you would download “NumPy” and then just call its root-finding function whenever you needed it. This is an example of a software dependency. The authors of any software will tell you which dependencies need to be downloaded.
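One quick way to see whether the dependencies a tool lists are already present is to ask the shell whether it can find each one. The following is a minimal sketch, assuming the dependencies are command-line programs (it will not detect libraries such as Python packages). The function name check_deps is made up for this example.

```shell
#!/bin/sh
# Report which of the listed commands can be found on the PATH.
# Returns 0 if every command was found, 1 if any is missing.
# (check_deps is a hypothetical helper name, not a standard command.)
check_deps() {
    missing=0
    for dep in "$@"; do
        if command -v "$dep" >/dev/null 2>&1; then
            echo "found:   $dep"
        else
            echo "missing: $dep"
            missing=1
        fi
    done
    return "$missing"
}

# Example: sh and ls exist on essentially any Unix-like system.
check_deps sh ls
```

To use it for a real tool, replace sh and ls with the command names given in that tool's installation instructions.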
Installing the software (and software dependencies).
Okay, now that you’ve downloaded the software and all its dependencies, you need to install them onto your computer. The difference between the two is that “downloading” software simply copies the program's files onto your computer, like buying the pieces of an Ikea chair that aren’t yet put together. “Installing” software requires you to put it together and to put it in the right place. The chair isn’t installed correctly in your apartment if it’s lying on its side on top of your desk, now is it? No, it should be right side up and next to the desk, just like the software has to be put together and placed somewhere on the PATH for the computer to find and use it.
There will usually be instructions for you to follow to do this, but one common sequence of commands looks like the following:
$ cd [software folder]
$ make
$ sudo make install
Another common approach is to add the folder containing the software to your PATH:
$ cd [folder with the software executable]
$ export PATH=$PATH:$(pwd)
- This appends the path of your current working directory (the folder containing the software) to your PATH
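To see what the export command above actually changes, here is a small self-contained demonstration. It is only a sketch: the tool name mytool and the temporary folder are made up for the example and stand in for whatever software you downloaded.

```shell
#!/bin/sh
# Create a throwaway folder holding a tiny executable called "mytool".
# ("mytool" is a made-up name used only for this demonstration.)
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from mytool"\n' > "$tooldir/mytool"
chmod +x "$tooldir/mytool"

# Before: the shell cannot find it, because its folder is not on the PATH.
command -v mytool >/dev/null 2>&1 || echo "mytool: not found yet"

# Add the folder to the end of the PATH, as in the export command above.
export PATH="$PATH:$tooldir"

# After: the shell finds the program by name, from any directory.
mytool
```

Note that export only changes the PATH for the current terminal session. If you want the change to persist, one common option is to add the export line to your ~/.bashrc file.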
Using the software!
Of course, now that we’ve done all this work, we want to use the program! Your main access points for learning new software are going to be wherever you downloaded the software itself (this is usually the GitHub repository), the README file / manual, forum pages related to it (like https://www.biostars.org/), and (of course) your preferred search engine. When you are starting out with a program or having issues with it, other people are likely to have had the same issue and to have asked about it or posted the answers they found.
I know we’ve already covered using the man command to read a command's manual, but it bears mentioning again. Honestly this is one of the most important takeaways you should get from this entire training: read the manuals for the software you really want to use. You don’t have to do this for all of the commands you use - I don’t expect you to read the manual for grep, sed, etc. (though you can if you want!). But if you are planning on using genome assembly software or another bioinformatics tool to analyze data that you one day hope to publish? You really should read the documentation.
Why? Because it will give you a good understanding of what you are doing, the options that are available to you, and likely ways to avoid random errors, failures, or potential miscalculations. You wouldn’t use a multiplication sign in an equation between two numbers if you had no idea what it did, would you? I know it might sound a bit boring, but set aside some time to look at the manual of the software you intend to use and at least skim it for the parts that seem most relevant to you. Some manuals are a hundred pages long - don’t read the whole thing. Look at the table of contents and find the parts that seem important!
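When the documentation is long, you can search it from the command line rather than scrolling through all of it. The sketch below uses a made-up stand-in program called fakeassembler, with invented options, so that the example runs anywhere. With real software you would pipe that program's own --help output into grep the same way, assuming it supports --help.

```shell
#!/bin/sh
# Stand-in for a real tool: a made-up script that only prints help text.
# (The name "fakeassembler" and its options are invented for this example.)
tooldir=$(mktemp -d)
cat > "$tooldir/fakeassembler" <<'EOF'
#!/bin/sh
echo "Usage: fakeassembler [options] reads.fastq"
echo "  --threads N    number of CPU threads to use"
echo "  --kmer K       k-mer size"
echo "  --out DIR      output directory"
EOF
chmod +x "$tooldir/fakeassembler"

# Search the help text for a keyword instead of reading all of it.
# The -i option makes the match case-insensitive.
"$tooldir/fakeassembler" --help | grep -i "thread"
```

For software that ships a man page, you can also search inside the manual viewer itself: after opening it with man, type / followed by a search term and press Enter, then press n to jump to the next match.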
I have one last comment on this subject… people are more likely to respond to you when you have made an attempt to understand the issue you are asking about. If you have a question about a piece of bioinformatics software, the author is much more likely to help you (and be enthusiastic about helping you) if your question describes the ways you already tried to solve the problem and the places you looked for the answer. Include links! The forums for many programs are moderated by the author of that software, who is looking to make it more effective, so asking clear questions has the potential to help both you and the author.
Review Questions
What are 2 tools you can use for finding software that is relevant to the analysis you are interested in?
- Both a literature search and Google are good options for finding the names and use cases of analysis software.
Where should you look to find out how to download and install third-party software?
- Look at the documentation pages for the software. This is usually in the GitHub repository or on the university / academic lab's website.
Where can you find out more information about how to use the software? What command or option is useful for doing this?
- Look at the documentation, README, or manual pages (these are all effectively synonymous). Use the command man [software-package] or the --help option after you have installed the software.
What information should you include when you ask a question on an online forum (or any time you ask a question for that matter!)?
- Be sure to include:
  - The question you are asking, worded as clearly as you can,
  - How you have already tried to answer the question: other questions you have asked, webpages or sources you have looked at, and the solutions you have already attempted.