Assembly service - BU-ISCIII/BU-ISCIII GitHub Wiki
How to perform the Assembly Service
Welcome to this (as) brief (as possible) tutorial on how to perform an Assembly Service as a member of the ISCIII's Bioinformatics Unit!
First of all, take the service and click on Add resolution in iskyLIMS after loggin in with your user and password. For this to happen, you need to specify the estimated delivery date, your user and a service acronym. In order to know which acronym to use for this new resolution, log into your WS user and execute the following commands:
cd /data/bi/services_and_colaborations/CNM/bacteriology
ll -tr
Once these commands have been executed, you'll get a list of all the folders (each one related to one service) contained within the services_and_colaborations
folder, sorted by time in a reverse order. Therefore, you'll see in the last place the last service that was delivered. Have a look at the last service that contains the word ASSEMBLY
in its name, and take note of the number linked to this ASSEMBLY
term.
For the new service, your service acronym will be this number + 1. For example, if the last folder after executing ll -tr
is SRVCNM1120_20240418_ASSEMBLY482_lherrera_S
, the service acronym for your new service will be ASSEMBLY483. Specify this in the form that appears after clicking on Add resolution, and click on Accept.
Now, considering you've already created a buisciii-tools conda environment and installed the tools (this will be necessary eventually) in your local PC, log into your HPC user.
Once you're logged in, go into the services_and_colaborations
folder:
cd /data/bi/services_and_colaborations/CNM/bacteriology/
ll -tr
Now, let's execute the first BU-ISCIII tool: new-service
, where you'll need to specify the resolution ID associated to this service.
bu-isciii new-service SRVCNMXXX.X
Once new-service
is executed, you'll be asked:
Do you want to skip folder creation?
: Unless it is not the first resolution associated with the service, answer NO, because the folder corresponding to the service has not yet been created in theservices_and_collaborations
folder.- Next, specify
assembly_annot
, since this is the service we're running.
Once the new-service
tool is finished, you'll have a new folder in services_and_colaborations
with the following structure: SRVCNMXXX_YYYYMMDD_ASSEMBLYXXX_researcher_S
. Your service will now appear within the In progress tab in iskyLIMS
If we get into this folder, we'll find 6 folders: ANALYSIS
, DOC
, RAW
, REFERENCES
, RESULTS
and TMP
. We should check, before going any further, that the number of files contained within the RAW
folder is equal to the number of samples specified in iskyLIMS x 2, since these are paired-end reads.
If everything is OK, we can get into the ANALYSIS
folder and we'll find the following items inside:
lablog_assembly
: an executable file that checks whether our reads are paired-end or single-end and creates the 00-reads folder which will contain all the reads.samples_id.txt
: a.txt
file containing all the sample names, one per line, so there will be as many lines as samples associated with our service.
Let's execute the lablog_assembly
file:
bash lablog_assembly
After executing this file, if everything is OK, we can now proceed with the new BU-ISCIII tool: scratch
. This tool will copy the content from services_and_colaborations
to the scratch_tmp
folder contained within /data/bi
, since this scratch_tmp
folder will be the one used for the assembly analysis.
bu-isciii scratch SRVCNMXXX.X
Once scratch
is executed, you'll be asked:
Direction of the service
: in this case, we want to copy our files from service to scratch, so we have to select theservice_to_scratch
option.
Once this function is finished, we should go into the scratch_tmp
folder and the specific folder associated with our service:
cd /data/bi/scratch_tmp/bi/SRVCNMXXX_YYYYMMDD_ASSEMBLYXXX_researcher_S/ANALYSIS/DATE_ANALYSIS01_ASSEMBLY
Once we're inside, we can execute our next executable file: lablog
, which will create symbolic links to our raw reads and the samples_id.txt
file, apart from asking us the following information:
Indicate the preferred assembly mode: short, long or hybrid
.Do you want to save trimmed reads?
: Unless the researcher asks us to keep the trimmed reads, we can replyNO
.Is Gram - or +?
: Since the researcher tells us the organism, we can check online whether this bacterium is Gram - or +.
bash lablog
This information will be contained within the samplesheet.csv
file within the DATE_ANALYSIS01_ASSEMBLY
folder. Now, we should check we've loaded all the needed dependencies and perform the assembly analysis:
module load Nextflow/23.10.0 singularity
sbatch assembly.sbatch
After this, the analysis will start, and we'll be able to check the status of the process with:
squeue -u youruser
Once the process is finished, within the DATE_ANALYSIS01_ASSEMBLY
folder, we'll have the following content:
01-processing
: all files associated with the quality control analysis performed byfastp
andfastqc
.02-taxonomy_contamination
: a group of folders, with one folder containing the kmerfinder results for each sample.03-assembly
: a folder containing the results fromquast
andunicycler
.05-annotation
: a folder contaning the results from annotation using PROKKA.99-stats
: a folder containing a .csv file summarizing the results from kmerfinder and quast.pipeline-info
: a folder containing reports about the execution of the assembly pipeline._01_nf_assembly.sh
: an executable file that performs the assembly analysis (bash _01_nf_assembly.sh
does the same assbatch assembly.sbatch
).DATE_assembly01.log
: a file reporting whether all the procedures of the pipeline have correctly been executed or not.assembly.sbatch
lablog
samplesheet.csv
After checking that the pipeline has been executed correctly and we have all the files we should have (mainly the kmerfinder and quast results), we can now go to the RESULTS
folder, still inside scratch_tmp
, and execute the following file, which creates symbolic links to our kmerfinder and quast reports, apart from our raw reads:
bash lablog_assembly_results
Once we've executed this, we'll have a new folder within the RESULTS
folder named DATE_entrega
, which will contain, at the same time, a folder named assembly
. Inside assembly
, we should have a symbolic link (called assemblies
) linked to the unicycler
folder containing all the raw reads, and other symbolic links associated with the kmerfinder .csv
summary, all quast
reports and the multiqc
report, similarly to this:
assemblies quast_GCF_000020105.1_ASM2010v1_report.html summary_assembly_metrics_mqc.csv
kmerfinder_summary.csv quast_GCF_000306985.1_ASM30698v1_report.html
multiqc_report.html quast_global_report.html
If everything is correct and all the files have the expected content, we can proceed to copy the content of the RESULTS
folder to the researcher's SFTP. To do this, we should now execute the next BU-ISCIII tool: finish
, which will delete temporary files, copy the results from scratch back to the services_and_colaborations
folder, rename those folders that should not be copied into the researcher's SFTP and copy those that are of interest to this SFTP:
bu-isciii finish SRVCNMXXX.X
After executing finish
, you'll have to specify again that we are performing an assembly analysis (assembly_annot
) and allow for this tools to rename (RAW
and TMP
will be renamed as RAW_NC
and TMP_NC
) and delete some folders (work
will be deleted).
Once finish
is done, the results will be now at the researcher's SFTP and we can go back to /data/bi/services_and_colaborations/CNM/bacteriology/SRVCNMXXX_YYYYMMDD_ASSEMBLYXXX_researcher_S/RESULTS
. If all the reports have been copied correctly into the corresponding services_and_colaborations
folder, we can now execute the next BU-ISCIII tool: bioinfo-doc
, which will create a .pdf
report with the information that will be delivered to the researcher.
To execute bioinfo_doc
, we have to go back to our WS user, in which we should already have mounted the bioinfo_doc
folder. If this is the case, we can do the following, always after having checked the kmerfinder, quast and multiqc reports and looking for any remarkable aspects that the researcher should be informed about:
bu-isciii bioinfo-doc SRVCNMXXX.X > service_info
Once you've specified the service_info
option, you should execute the bioinfo-doc
BU-ISCIII tool again indicating the delivery
option this time. Please note that the program will ask you to create the markdown files associated to this specific delivery, apart from whether we want to add some notes. If there is something we want to inform the researcher about, we can create a delivery_notes.txt
with this information, by executing the following commands, before executing bu-isciii bioinfo-doc SRVCNMXXX.X > delivery
:
cd /data/bioinfo_doc/services/2024/SRVCNMXXX_YYYYMMDD_ASSEMBLYXXX_researcher_S #Go to the specific folder of the service.
nano delivery_notes.txt #Create and edit the .txt file with notes. Press Crtl + X to save your data and press Enter to exit.
Once this has been done, execute bu-isciii bioinfo-doc SRVCNMXXX.X > delivery
, agree in the fact that you want to create markdown files, indicate the absolute path to the delivery_notes.txt
file and agree to send an email with the results automatically to the researcher. Bioinfo-doc
will have finished and a report with the results and your notes will have been sent to the researcher directly. The service is now delivered and finished.
Lastly, remember to remove all the files related to this service from scratch_tmp
:
bu-isciii scratch SRVCNMXXX.X > remove_scratch
ASSEMBLY REPORT TEMPLATE (TEAM STANDUP)
Once you have finished the analysis and analyzed the results you need to report to the rest of the members. Use this template to do it:
SRVCNMXXX_YYYYMMDD_ASSEMBLYXXX_researcher_S
- Número de muestras: [number of samples analyzed in the service]
- Organismo indicado: [organism provided by the researcher when requesting the service]
- Estado: [status of the service]
- Instrumento y longitud: [instrument used for sequencing and lenght of the reads]
- Cantidad de lecturas: [the median amount of reads]
- Calidad de las lecturas (revisión fastQC): [quality of reads (good, bad, etc)]
- Especie identificada (revisión kmerfinder): [organism reported in general by kmerfinder]
- Incidencias muestras individuales (todo lo anterior + quast): [any information, extracted from kmerfinder and quast reports, that the researcher should be informed about]