Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)

The aim of this practical is to get you started with Galaxy, an open-source, web-based platform for data-intensive biomedical research.

The Data

Download xlaev_unknown.fasta from CANVAS.

Q1. What type of sequence is it?

Q2. How long is the sequence?
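
If you would like to check your answers outside a text editor, the following is a minimal Python sketch, not part of the course material. It assumes the file sits in your working directory; the function names `read_fasta` and `guess_type` are made up for this illustration, and the 90% threshold is an arbitrary heuristic.

```python
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def guess_type(seq):
    """Crude heuristic: nucleotide if at least 90% of letters are A, C, G, T, U or N."""
    seq = seq.upper()
    if not seq:
        return "empty"
    nucleotide_letters = sum(seq.count(letter) for letter in "ACGTUN")
    return "nucleotide" if nucleotide_letters / len(seq) >= 0.9 else "protein"


if __name__ == "__main__":
    path = Path("xlaev_unknown.fasta")  # assumed location of the download
    if path.exists():
        for header, seq in read_fasta(path.read_text()):
            print(header, len(seq), guess_type(seq))
```

The heuristic can be fooled by short or unusual sequences, so treat its answer as a hint and look at the sequence yourself.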

Galaxy login

For this part, you will be using Galaxy.

Log in to your account
Rename your history to Testing galaxy. You do this in the History panel on the right.

Upload your data

Click on Upload on the left menu

The Upload from Disk or Web to Testing galaxy window will open
Click on Choose local file

Select xlaev_unknown.fasta
Click on Start

Wait until the data has been uploaded

You will see that your job has been added to the History panel on your right

Click on the job's name
- You will see the format: fasta, database: ? along with a preview of the file
  
Click on the eye icon
- You will see the actual sequence. This is useful for checking that your data is in the correct format.
  

Annotation using BLAST

To annotate an unknown protein it is common to use BLAST, which returns similar sequences that have (usually) already been annotated. In our case, we will use the SwissProt database since we have a protein sequence:

Search for blastp in the Tools panel
Click on the NCBI BLAST+ blastp tool

Set the following parameters as indicated:

Protein query sequence(s)? --> X: xlaev_unknown.fasta
Subject database/sequences --> Locally installed BLAST database
Protein BLAST database --> SwissProt (22 Jan 2018)
Type of BLAST --> blastp - Traditional BLASTP to compare a protein query to a protein database
Output format --> Tabular (select which columns)
Other identifier columns --> stitle = Subject Title
Click Run Tool
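
If you download the result, a table like this can also be read with a few lines of Python. The sketch below is an illustration, not part of the practical: it assumes the output holds the 12 standard BLAST tabular columns followed by the extra stitle column selected above, and the demo row is invented apart from the subject identifier quoted in this practical.

```python
import csv
import io

# Assumed column order: the 12 standard BLAST tabular columns plus stitle.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle",
]


def read_blast_table(handle):
    """Return one dict per hit from tab-separated BLAST output."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Invented demo row: every value except the subject ID is a placeholder,
# not a real BLAST result.
DEMO = (
    "query1\tgi|120521|sp|P17663.2|FRIHB_XENLA"
    "\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\tplaceholder title\n"
)

hits = read_blast_table(io.StringIO(DEMO))
print(hits[0]["sseqid"])
print(hits[0]["stitle"])
```

With a real file you would pass `open("your_blast_output.tabular")` (a hypothetical file name) instead of the `io.StringIO` demo.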

Click on the job's name

Q3. In which format are the results presented?

Click on the eye icon

Q4. Based on these results, what kind of sequence do you have?

Homologues extraction

Now we will extract the sequences to make a phylogenetic comparison. To do this we need the identifiers, which are in the second column. Furthermore, if we use the pipe as separator, we can directly target the UniProt identifier, which is in the 4th position (gi|120521|sp|P17663.2|FRIHB_XENLA). Let's use this:

Search for cut in the Tools panel
Click on the Cut columns from a table tool

Set the following parameters as indicated:

Cut columns --> c4
Delimited by --> Pipe
From --> Select X: blastp xlaev_unknown.fasta vs 'swissprot_2018-01-22'
Click Run Tool
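
The effect of cutting column 4 with a pipe delimiter can be sketched in plain Python. This is only an illustration of the idea, not Galaxy's implementation; `cut_field` is a made-up helper, and the query name and trailing field in the demo line are placeholders.

```python
def cut_field(line, field=4, delimiter="|"):
    """Return the 1-based field of a line split on the delimiter ('' if missing)."""
    parts = line.rstrip("\n").split(delimiter)
    return parts[field - 1] if len(parts) >= field else ""


# Hypothetical BLAST line: only the identifier comes from the practical.
demo_line = "query1\tgi|120521|sp|P17663.2|FRIHB_XENLA\trest_of_line"

print(cut_field(demo_line))  # P17663.2
```

Note that the whole line is split on the pipe, so the first field also contains the query name and the tab; the 4th field is still the accession on its own.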

Click on the job's name

Q5. How many identifiers do you have?

Click on the eye icon
- You will see the list of identifiers from the BLAST results
  

Some programs do not allow duplicates, so let's remove them. First we need to sort our data:

Search for sort in the Tools panel
Click on the Sort data in ascending or descending order tool

Set the following parameters as indicated:

Sort Dataset --> X: Cut on data X
on column --> Column: 1
with flavor --> Alphabetical sort
Click Run Tool

And then remove the duplicates:

Search for unique in the Tools panel
Click on the Unique lines assuming sorted input file tool

Set the following parameters as indicated:

File to scan for unique values --> X: Sort on data X
Click Run Tool
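
Why sort first? A tool that assumes sorted input typically only compares each line with the one before it, so duplicates are removed only when they are adjacent. The Python sketch below illustrates that behaviour under this assumption; `unique_adjacent` is a made-up helper and the identifiers are placeholders.

```python
from itertools import groupby


def unique_adjacent(lines):
    """Collapse runs of identical neighbouring lines into one line each."""
    return [key for key, _ in groupby(lines)]


# Placeholder identifiers with a duplicate that is not adjacent.
ids = ["B_ID", "A_ID", "B_ID"]

print(unique_adjacent(ids))          # ['B_ID', 'A_ID', 'B_ID']
print(unique_adjacent(sorted(ids)))  # ['A_ID', 'B_ID']
```

Without sorting, the two copies of `B_ID` are never neighbours and both survive; after sorting they sit next to each other and collapse into one.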

Now let's extract the sequences:

Search for blastdbcmd in the Tools panel
Click on the NCBI BLAST+ blastdbcmd entry(s) tool

Set the following parameters as indicated:

Type of BLAST database --> Protein
Subject database/sequences --> Locally installed BLAST database
Protein BLAST database --> SwissProt (22 Jan 2018)
Type of identifier list --> From file
Sequence identifier(s) --> X: Unique lines on data X
Click Run Tool

Inspect the file by clicking on the eye icon; you should see the retrieved sequences in FASTA format.

Multiple sequence alignment with MAFFT

MAFFT is a program similar to ClustalW/O. It will help us create a multiple sequence alignment.

Search for mafft in the Tools panel
Click on the tool

Set the following parameters as indicated:

Sequences to align --> X: Sequences from blastdbcmd 'swissprot_2018-01-22'
Data type --> Amino acids
MAFFT flavour --> auto
Click Run Tool

Click on the job's name
- Click on the download symbol
- Save the file
  
Open Jalview
- Load the file
- Have a quick look at the alignment.

Phylogenetic tree

There are different programs for generating a phylogenetic tree. Galaxy provides FastTree, an algorithm based on (approximate) maximum likelihood.

Search for fasttree in the Tools panel
Click on the tool

Set the following parameters as indicated:

Aligned sequences file (FASTA or Phylip format) --> fasta
FASTA file --> X: MAFFT on data X
Protein or nucleotide alignment --> Protein
Click Run Tool

Click on the eye icon
- You will see a file format based on the Newick file format, which embeds additional information about each node in the tree. This format is known as the NHX format or the New Hampshire X format.
  
Click on the job's name
- Click on the graph icon
- Click on the Phylogenetic Tree Visualization
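
The Newick-style text shown by the eye icon can also be inspected programmatically. As a rough sketch, the Python snippet below lists the leaf names of a tree string. It assumes plain labels without quotes or bracketed annotations; `leaf_names` is a made-up helper, and the demo tree (labels, branch lengths and support value) is invented and does not reflect the result of this practical.

```python
import re


def leaf_names(newick):
    """Return leaf labels: names that directly follow '(' or ',' in the string."""
    return re.findall(r"(?<=[(,])\s*([^():,;\s]+)", newick)


# Invented example tree: two leaves grouped together, one leaf outside.
DEMO_TREE = "((seqA:0.1,seqB:0.2)0.95:0.3,seqC:0.4);"

print(leaf_names(DEMO_TREE))  # ['seqA', 'seqB', 'seqC']
```

Labels that follow a closing parenthesis, such as `0.95` here, belong to internal nodes and are therefore skipped, as are the branch lengths after each colon.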
  

You will see the phylogenetic tree.

To make it easier to read:

Click on <<
Tree types --> Circular

Look for the following sequences:

P17663
Q5HN41
O65100

Q6. Do they cluster together? Why/why not?

Well done! Now you should be able to use Galaxy for future exercises!

Developed by Marcela Dávila, 2021. Updated by Marcela Dávila, 2023

BMA231 II: Galaxy intro - bcfgothenburg/HT24 GitHub Wiki
