BMA231 II: Galaxy intro

Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)


The aim of this practical is to get you started with Galaxy, an open-source, web-based platform for data-intensive biomedical research.



The Data

Download xlaev_unknown.fasta from CANVAS.

Q1. What type of sequence is it?

Q2. How long is the sequence?
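If you prefer to check the file outside a text editor, the two questions above can be approached with a short script. This is only a minimal sketch: `read_fasta` and `guess_type` are hypothetical helper names, the type guess is a crude heuristic based on the alphabet used, and the record shown is made up (it is not the course file).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def guess_type(sequence):
    """Heuristic: 'nucleotide' if only A, C, G, T, U or N occur, else 'protein'."""
    return "nucleotide" if set(sequence.upper()) <= set("ACGTUN") else "protein"


# Made-up example record, not the content of xlaev_unknown.fasta
example = ">demo\nMKTAYIAKQR\nQISFVKSHFS\n"
for header, seq in read_fasta(example):
    print(header, len(seq), guess_type(seq))
```

To apply it to the downloaded file, pass `open("xlaev_unknown.fasta").read()` to `read_fasta` instead of the example string.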

Galaxy login

For this part, you will be using Galaxy.

  • Log in to your account

  • Rename your history to Testing galaxy. You do this in the History panel on the right.


Upload your data

  • Click on Upload on the left menu


  • The Upload from Disk or Web to Testing galaxy window will open

  • Click on Choose local file


  • Select xlaev_unknown.fasta

  • Click on Start


Wait until the data has been uploaded


You will see that your job was added to the History panel on your right


  • Click on the job's name

    • You will see the format: fasta, database: ? along with a preview of the file


  • Click on the eye icon

    • You will see the actual sequence. This is useful for checking that your data is in the correct format.


Annotation using BLAST

To annotate an unknown protein, it is common to use BLAST, which returns similar sequences that have (usually) already been annotated. In our case, we will use the SwissProt database since we have a protein sequence:

  1. Search for blastp in the Tools panel
  2. Click on the NCBI BLAST+ blastp tool

Set the following parameters as indicated:

  1. Protein query sequence(s)? --> X: xlaev_unknown.fasta

  2. Subject database/sequences --> Locally installed BLAST database

  3. Protein BLAST database --> SwissProt (22 Jan 2018)

  4. Type of BLAST --> blastp - Traditional BLASTP to compare a protein query to a protein database

  5. Output format --> Tabular (select which columns)

  6. Other identifier columns --> stitle = Subject Title

  7. Click Run Tool

  • Click on the job's name

Q3. In which format are the results presented?

  • Click on the eye icon

Q4. Based on these results, what kind of sequence do you have?
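If you download the BLAST result, it can also be read programmatically. The sketch below assumes the output keeps the 12 standard BLAST+ tabular columns and appends stitle as the last column, as selected above; check the column order in your own file. `read_blast_tabular` is a hypothetical helper name, and every value in the example row except the subject identifier is made up.

```python
# Assumed column order: 12 standard BLAST+ tabular columns, then stitle
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle"]


def read_blast_tabular(text):
    """Turn tab-separated BLAST hits into a list of dicts keyed by column name."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hits.append(dict(zip(COLUMNS, line.split("\t"))))
    return hits


# Made-up row for illustration only; the numbers are not real results
example = ("query1\tgi|120521|sp|P17663.2|FRIHB_XENLA\t100.0\t176\t0\t0"
           "\t1\t176\t1\t176\t1e-100\t350\tExample subject title\n")

hits = read_blast_tabular(example)
print(hits[0]["sseqid"], "|", hits[0]["stitle"])
```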

Homologues extraction

Now we will extract the sequences to make a phylogenetic comparison. To do this, we need the identifiers, which are in the second column. Furthermore, if we use the pipe as the separator, we can directly target the UniProt identifier, which is in the 4th position (gi|120521|sp|P17663.2|FRIHB_XENLA). Let's use this:

  1. Search for cut in the Tools panel
  2. Click on the Cut columns from a table tool

Set the following parameters as indicated:

  1. Cut columns --> c4

  2. Delimited by --> Pipe

  3. From --> Select X: blastp xlaev_unknown.fasta vs 'swissprot_2018-01-22'

  4. Click Run Tool


  • Click on the job's name

Q5. How many identifiers do you have?

  • Click on the eye icon
    • You will see the list of identifiers from the BLAST results
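The cut step above amounts to splitting each line on the pipe character and keeping the 4th piece. A minimal sketch of that idea follows; `cut_field` is a hypothetical helper, and the query name and trailing value in the second example line are made up. Note that the split is applied to the whole line, so the first piece also contains the query column, yet the 4th piece is still the identifier.

```python
def cut_field(line, field=4, delimiter="|"):
    """Return the 1-based `field` of `line` split on `delimiter` ('' if missing)."""
    parts = line.rstrip("\n").split(delimiter)
    return parts[field - 1] if len(parts) >= field else ""


# Subject identifier exactly as shown in the text above
print(cut_field("gi|120521|sp|P17663.2|FRIHB_XENLA"))

# Whole tabular line: 'query1' and '100.0' are illustrative values only
print(cut_field("query1\tgi|120521|sp|P17663.2|FRIHB_XENLA\t100.0"))
```

Both calls print the same identifier, which is why c4 with the pipe delimiter works directly on the BLAST output.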


Some programs do not allow duplicates, so let's remove them. First we need to sort our data:

  1. Search for sort in the Tools panel
  2. Click on the Sort data in ascending or descending order tool

Set the following parameters as indicated:

  1. Sort Dataset --> X: Cut on data X

  2. on column --> Column: 1

  3. with flavor --> Alphabetical sort

  4. Click Run Tool


And then remove the duplicates:

  1. Search for unique in the Tools panel
  2. Click on the Unique lines assuming sorted input file tool

Set the following parameters as indicated:

  1. File to scan for unique values --> X: Sort on data X

  2. Click Run Tool
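The reason for sorting first is that this kind of tool only collapses identical lines that are next to each other. A small sketch of the behaviour, assuming adjacent-duplicate semantics (as the tool name suggests); the helper names are hypothetical, and the list reuses identifiers mentioned later in this practical, in a made-up order.

```python
from itertools import groupby


def unique_adjacent(lines):
    """Collapse runs of identical neighbouring lines (what 'unique' does on its own)."""
    return [key for key, _ in groupby(lines)]


def sort_then_unique(lines):
    """Sort alphabetically first, so that all duplicates become neighbours."""
    return unique_adjacent(sorted(lines))


# Illustrative identifier list with one non-adjacent duplicate
ids = ["P17663", "Q5HN41", "P17663", "O65100"]

print(unique_adjacent(ids))   # the duplicate survives because it is not adjacent
print(sort_then_unique(ids))  # the duplicate is removed
```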


Now let's extract the sequences:

  1. Search for blastdbcmd in the Tools panel
  2. Click on the NCBI BLAST+ blastdbcmd entry(s) tool

Set the following parameters as indicated:

  1. Type of BLAST database --> Protein

  2. Subject database/sequences --> Locally installed BLAST database

  3. Protein BLAST database --> SwissProt (22 Jan 2018)

  4. Type of identifier list --> From file

  5. Sequence identifier(s) --> X: Unique lines on data X

  6. Click Run Tool


Inspect the file by clicking on the eye icon; it should contain the sequences for the identifiers in your list.



Multiple sequence alignment with MAFFT

MAFFT is a program similar to ClustalW/O. It will help us create a multiple sequence alignment.
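In a multiple sequence alignment, gap characters (-) are inserted so that every sequence ends up with the same length. The sketch below shows one way to sanity-check an aligned FASTA file on that property; `aligned_lengths` and `is_aligned` are hypothetical helpers, and the three short sequences are made up.

```python
def aligned_lengths(fasta_text):
    """Return {header: sequence length} for every record in FASTA text."""
    lengths = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths


def is_aligned(fasta_text):
    """True when all records share one length, as expected after alignment."""
    return len(set(aligned_lengths(fasta_text).values())) == 1


# Made-up aligned records; '-' marks a gap
example = ">seq1\nMK-TAY\n>seq2\nMKQTAY\n>seq3\nM--TAY\n"
print(aligned_lengths(example))
print(is_aligned(example))
```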

  1. Search for mafft in the Tools panel
  2. Click on the tool

Set the following parameters as indicated:

  1. Sequences to align --> X: Sequences from blastdbcmd 'swissprot_2018-01-22'

  2. Data type --> Amino acids

  3. MAFFT flavour --> auto

  4. Click Run Tool


  • Click on the job's name

    • Click on the download symbol

    • Save the file


  • Open Jalview

    • Load the file

    • Have a quick look at the alignment.



Phylogenetic tree

There are different programs to generate a phylogenetic tree. Galaxy offers FastTree, an algorithm based on maximum likelihood:

  1. Search for fasttree in the Tools panel
  2. Click on the tool

Set the following parameters as indicated:

  1. Aligned sequences file (FASTA or Phylip format) --> fasta

  2. FASTA file --> X: MAFFT on data X

  3. Protein or nucleotide alignment --> Protein

  4. Click Run Tool


  • Click on the eye icon

    • You will see a file format based on the Newick file format, which embeds additional information about each node in the tree. This format is known as the NHX format or the New Hampshire X format.


  • Click on the job's name

    • Click on the graph icon

    • Click on the Phylogenetic Tree Visualization


You will see the phylogenetic tree:



To make it easier:

  1. Click on <<

  2. Tree types --> Circular


Look for the following sequences:

  • P17663
  • Q5HN41
  • O65100

Q6. Do they cluster together? Why/why not?
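If the three sequences are hard to spot in the drawing, you can first confirm that they appear as leaves in the downloaded tree file. This is a rough sketch, not a full Newick/NHX parser: `leaf_names` is a hypothetical helper that strips bracketed annotations and picks labels that follow "(" or ",", it does not handle quoted labels, and the tree string with its branch lengths is made up.

```python
import re


def leaf_names(newick):
    """Extract leaf labels from a Newick/NHX string (simple, unquoted labels only)."""
    # Drop bracketed annotations such as [&&NHX:...] before scanning
    stripped = re.sub(r"\[[^\]]*\]", "", newick)
    # A leaf label directly follows '(' or ',' and ends at ':', ',', ')' or ';'
    return re.findall(r"[(,]\s*([^(),:;\s]+)", stripped)


# Made-up tree; 0.95 is an internal node label, not a leaf
tree = "((P17663:0.1,Q5HN41:0.2)0.95:0.3,O65100:0.4);"
leaves = leaf_names(tree)
print(leaves)

# Check which of the identifiers of interest occur in any leaf label
wanted = ["P17663", "Q5HN41", "O65100"]
found = {w: any(w in leaf for leaf in leaves) for w in wanted}
print(found)
```

Matching with `in` rather than `==` is deliberate, since the labels in your own tree may carry extra text around the accession.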

Well done! Now you should be able to use Galaxy for future exercises!



Developed by Marcela Dávila, 2021. Updated by Marcela Dávila, 2023
