BMA231 II: Galaxy intro
Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)
The aim of this practical is to get you started using Galaxy, an open-source, web-based platform for data-intensive biomedical research.
Download xlaev_unknown.fasta from CANVAS.
Q1. What type of sequence is it?
Q2. How long is the sequence?
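If you want to double-check your answers outside Galaxy, a few lines of Python are enough. The sketch below is only an illustration: `read_fasta` and `guess_type` are hypothetical helper names, the 90% threshold is an arbitrary heuristic, and the toy record stands in for the real file.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def guess_type(seq):
    """Crude heuristic: nucleotide if at least 90% of the residues are A, C, G, T, U or N."""
    seq = seq.upper()
    nucleotide_like = sum(seq.count(base) for base in "ACGTUN")
    return "nucleotide" if nucleotide_like >= 0.9 * len(seq) else "protein"

# Toy record for illustration only; use open("xlaev_unknown.fasta") for the real data.
example = StringIO(">toy_record\nMKVLAAGIVG\nLLLAQPSE\n")
records = list(read_fasta(example))
for header, seq in records:
    print(header, len(seq), guess_type(seq))
```

Sequence lines are joined before counting, so the reported length does not depend on how the file is wrapped.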
For this part, you will be using Galaxy.
- Log in to your account
- Rename your history to Testing galaxy. You do this in the History panel on the right.
- Click on Upload in the left menu
- The Upload from Disk or Web to Testing galaxy window will open
  - Click on Choose local file
  - Select xlaev_unknown.fasta
  - Click on Start
  - Wait until the data has been uploaded
  - You will see that your job was added to the History panel on your right
- Click on the job's name
  - You will see format: fasta, database: ? along with a preview of the file
- Click on the eye icon
  - You will see the actual sequence. This is useful for checking that your data is in the correct format.
To annotate an unknown protein it is common to use BLAST, which returns similar sequences that have (usually) already been annotated. In our case, we will use the SwissProt database, since we have a protein sequence:
- Search for blastp in the Tools panel
- Click on the NCBI BLAST+ blastp tool
Set the following parameters as indicated:
- Protein query sequence(s)? --> X: xlaev_unknown.fasta
- Subject database/sequences --> Locally installed BLAST database
- Protein BLAST database --> SwissProt (22 Jan 2018)
- Type of BLAST --> blastp - Traditional BLASTP to compare a protein query to a protein database
- Output format --> Tabular (select which columns)
- Other identifier columns --> stitle = Subject Title
- Click Run Tool
- Click on the job's name
Q3. In which format are the results presented?
- Click on the eye icon
Q4. Based on these results, what kind of sequence do you have?
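The BLAST results can also be read outside Galaxy once downloaded. The sketch below assumes that the twelve standard BLAST+ tabular columns are kept in their default order and that stitle is appended as the last column; check this against your own output. `read_hits` is a hypothetical helper, and every value in the toy row except the subject identifier is made up.

```python
import csv
from io import StringIO

# The 12 standard BLAST+ tabular columns, followed by the extra stitle column
# requested under "Other identifier columns" (assumed to come last).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle"]

def read_hits(handle):
    """Return one dict per hit from a tab-separated BLAST table."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]

# Hypothetical row: only the subject identifier comes from the practical,
# the numbers and the title are invented for illustration.
toy = StringIO(
    "query1\tgi|120521|sp|P17663.2|FRIHB_XENLA\t100.0\t50\t0\t0"
    "\t1\t50\t1\t50\t1e-30\t110\tExample title\n"
)
hits = read_hits(toy)
for hit in hits:
    print(hit["sseqid"], hit["pident"], hit["evalue"], hit["stitle"])
```

Naming the columns makes it easier to see which values describe the match (identity, e-value, bit score) and which describe the subject sequence.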
Now we will extract the sequences to make a phylogenetic comparison. To do this we need the identifiers, which are in the second column. Furthermore, if we use the pipe (|) as separator, we can directly target the UniProt identifier, which is in the 4th position (gi|120521|sp|P17663.2|FRIHB_XENLA). Let's use this:
- Search for cut in the Tools panel
- Click on the Cut columns from a table tool
Set the following parameters as indicated:
- Cut columns --> c4
- Delimited by --> Pipe
- From --> Select X: blastp xlaev_unknown.fasta vs 'swissprot_2018-01-22'
- Click Run Tool
- Click on the job's name
Q5. How many identifiers do you have?
- Click on the eye icon
  - You will see the list of identifiers from the BLAST results
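To see why c4 with a pipe delimiter lands on the accession, the cut step can be mimicked as follows. This is a minimal sketch, not the Galaxy tool itself: `cut_field` is a hypothetical helper, returning an empty string for short lines is this sketch's own choice, and the row is shortened for illustration.

```python
def cut_field(line, field, delimiter="|"):
    """Return the 1-based field of a line split on delimiter ("" if the field is absent)."""
    parts = line.rstrip("\n").split(delimiter)
    return parts[field - 1] if len(parts) >= field else ""

# Hypothetical, shortened BLAST row; only the subject identifier is from the practical.
row = "query1\tgi|120521|sp|P17663.2|FRIHB_XENLA\t100.0"
accession = cut_field(row, 4)
print(accession)  # P17663.2
```

The whole line is split on the pipe, so the first field also contains the query name and the tab. Since that field is discarded, the fourth field is still the accession.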
Some programs do not allow duplicates, so let's remove them. First we need to sort our data:
- Search for sort in the Tools panel
- Click on the Sort data in ascending or descending order tool
Set the following parameters as indicated:
- Sort Dataset --> X: Cut on data X
- on column --> Column: 1
- with flavor --> Alphabetical sort
- Click Run Tool
And then remove the duplicates:
- Search for unique in the Tools panel
- Click on the Unique line assuming sorted input file tool
Set the following parameters as indicated:
- File to scan for unique values --> X: Sort on data X
- Click Run Tool
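The reason for sorting first is that this kind of unique filter only compares each line with the one directly before it. A minimal Python sketch of the two steps, assuming a hypothetical identifier list and the hypothetical helper names `drop_adjacent_duplicates` and `sort_unique`:

```python
from itertools import groupby

def drop_adjacent_duplicates(lines):
    """Keep the first line of every run of identical neighbouring lines."""
    return [key for key, _run in groupby(lines)]

def sort_unique(lines):
    """Sort alphabetically, then drop the duplicates that are now adjacent."""
    return drop_adjacent_duplicates(sorted(lines))

ids = ["P17663", "Q5HN41", "P17663", "O65100"]  # hypothetical identifier list

unsorted_result = drop_adjacent_duplicates(ids)  # the duplicate survives
unique_ids = sort_unique(ids)                    # the duplicate is removed
print(unsorted_result)
print(unique_ids)
```

Without the sort, the two copies of P17663 are not neighbours and both are kept.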
Now let's extract the sequences:
- Search for blastdbcmd in the Tools panel
- Click on the NCBI BLAST+ blastdbcmd entry(s) tool
Set the following parameters as indicated:
- Type of BLAST database --> Protein
- Subject database/sequences --> Locally installed BLAST database
- Protein BLAST database --> SwissProt (22 Jan 2018)
- Type of identifier list --> From file
- Sequence identifier(s) --> X: Unique lines on data X
- Click Run Tool
Inspect the file by clicking on the eye icon; you should see the retrieved sequences in FASTA format.
MAFFT is a program similar to ClustalW/O. It will help us create a multiple sequence alignment.
- Search for mafft in the Tools panel
- Click on the tool
Set the following parameters as indicated:
- Sequences to align --> X: Sequences from blastdbcmd 'swissprot_2018-01-22'
- Data type --> Amino acids
- MAFFT flavour --> auto
- Click Run Tool
- Click on the job's name
  - Click on the download symbol
  - Save the file
- Open Jalview
  - Load the file
  - Have a quick look at the alignment
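Besides looking at the alignment in Jalview, you can run a quick sanity check on it: in a multiple sequence alignment every row has the same length, with gaps written as "-". The sketch below assumes the aligned sequences are already loaded as (name, sequence) pairs; the helper names and the toy rows are hypothetical.

```python
def alignment_summary(records):
    """Return (number of sequences, alignment length); raise if row lengths differ."""
    lengths = {len(seq) for _name, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"row lengths differ: {sorted(lengths)}")
    return len(records), lengths.pop()

def gap_fraction(seq):
    """Fraction of alignment columns in which this sequence has a gap."""
    return seq.count("-") / len(seq)

# Toy alignment for illustration only.
toy = [("seq_a", "MK-LV"), ("seq_b", "MKALV"), ("seq_c", "M--LV")]
n_sequences, n_columns = alignment_summary(toy)
print(n_sequences, n_columns)
for name, seq in toy:
    print(name, gap_fraction(seq))
```

A sequence with a very high gap fraction is often a distant or partial hit and is worth a closer look before building a tree.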
There are different programs for generating a phylogenetic tree. Galaxy offers FastTree, an algorithm based on maximum likelihood:
- Search for fasttree in the Tools panel
- Click on the tool
Set the following parameters as indicated:
- Aligned sequences file (FASTA or Phylip format) --> fasta
- FASTA file --> X: MAFFT on data X
- Protein or nucleotide alignment --> Protein
- Click Run Tool
- Click on the eye icon
  - You will see a file based on the Newick format, which embeds additional information about each node in the tree. This format is known as the NHX, or New Hampshire X, format.
- Click on the job's name
  - Click on the graph icon
  - Click on Phylogenetic Tree Visualization
  - You will see the phylogenetic tree
To make it easier:
- Click on <<
- Tree types --> Circular
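The Newick/NHX text shown by the eye icon can also be searched directly, which helps when the tree is large. The sketch below is a rough approximation rather than a full parser: it strips bracketed comments such as NHX annotations, does not handle quoted labels, and uses a made-up toy tree. `leaf_labels` is a hypothetical helper name.

```python
import re

def leaf_labels(newick):
    """Return the leaf labels of a Newick string, in order of appearance."""
    # Drop bracketed comments, e.g. [&&NHX:...] annotations.
    without_comments = re.sub(r"\[[^\]]*\]", "", newick)
    # A leaf label directly follows "(" or ","; internal labels follow ")".
    found = re.findall(r"[(,]\s*([^(),:;]+)", without_comments)
    return [label.strip() for label in found]

# Toy tree with invented topology and branch lengths, for illustration only.
toy_tree = "((P17663:0.1,Q5HN41:0.2)0.95:0.3,O65100:0.4);"
leaves = leaf_labels(toy_tree)
print(leaves)

wanted = ["P17663", "Q5HN41", "O65100"]
missing = [acc for acc in wanted if not any(acc in leaf for leaf in leaves)]
print(missing)
```

The substring test in the last step is deliberate, since the labels in your own tree may carry extra parts around the accession.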
Look for the following sequences:
- P17663
- Q5HN41
- O65100
Q6. Do they cluster together? Why/why not?
Developed by Marcela Dávila, 2021. Updated by Marcela Dávila, 2023