Project Overview - PrithiPal/BioInformatics GitHub Wiki

Bioinformatics - Mouse theoretical Glycoanalysis

Introduction to Research[1]

This interdisciplinary research study cell-surface N-glycoproteins through LC-MS based shotgun proteomic tools, an important factor in stem-cell development. The stem-cell behavior is influenced by extra-cellular microenvironment through these proteins and this study attempts to understand the phenomena.

LC-MS based shotgun proteomics uses HPLC that deliver 1 nanoliter/minute constant flow. It operates under intensive atmosphere pressure (100-1000 C) for high resolution separation. Following this, it takes mere few minutes to derive million peptides from thousand protein through mass spectrometry analysis. It detects and quantifies entire proteome of all biological samples and helps with systematic study of human health and diseases.

Repository Overview

Pertaining to the current research comprising the subset of overall research objectives, this repository contains the raw data for analysis. It also contains the relevant files and programs required to obtain the mouse’s final loop stat output. This discourse took immense research to determine the suitable programming language, file-format and tools for data analysis. Following the successful acquisition of results, it’s important to ensure the final results reliability and overall accuracy. For the same reason, all programming scripts incorporate embedded test cases with unit tests.

The scope of intended explanation leaves the technical aspects entailing the research and discuss information relevant to obtain stastical results using programming tools.

Contextual Information

In beginning, the publication summarizing identified peptide of mouse species (mouse publication) and glyco FASTA file for mouse was provided. Afterwards _find_glyco_pep.cpp _was written by previous research students whose understanding is key to associate with research findings.

Glyco protein (.FASTA)

This file uses special format to list the significant properties of glyco proteins. The FASTA format is as below signifies information pertaining to one protein :

>sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3

MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS

WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY

LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY

YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD

AGEGEN

First line tells information for protein identification and then the remainder of lines depicts protein sequence. The standardized glyco data for indexed species can be found in online BASIC (Basic Local Alignment Search Tool) database.

Mouse publication(.pdf)

It lists all identified peptides and associated relevant variables in the study. The mouse publication forms the basis for the information employed for the analysis. This means all piece of peptide information encountered in later parts would be inherited from this single pdf publication. The significant characteristic of each peptide sequence (one row in publication) are as listed below:

CDS ID and Worm pep ID : Technical protein identifiers
Protein description : description of protein
Peptide position (from)
Peptide position (to)
Preceding residue
Peptide sequence
C-terminal peptide residue
No. of potential sites
Glycosylated site-1 and site-2

Again the definitions of these biological variables are not in scope of consideration unless their applicability arises.

Glyco FASTA file and Mouse publication becomes the starting resources for the data analysis. The primary objective is deriving loopstat files which attains special attentions due to the information it encloses.

Find_glyco_pep(.cpp)

This C++ program finds the occurring sequence$$NX(!P)S|T$$ where first character is $$N$$, second is anything except$$P$$ and last is either $$S$$ or $$T$$. The FASTA sequence is used as search source and all instances with respective positions of above combinations are reported.

Usage : ./find_glyco_pep --fasta=[FASTA FILE] --human=[PEPTIDE FILE]

Instructions

The assurance of correctness and durability, each step of data manipulation is broken down in number of steps. Each step signifies milestone whose concrete definitions are determined through appropriate research and requires solving sets of problem statements before proceeding. For instance, first milestone or step may consist of omitting blank rows and convert to tab delimited text file. Afterwards sending the progressed files to Supervisor to receive critical feedback and readjust the methodology accordingly in case of inconsistent calculations/formatting witnessed.
The pdf version of mouse publication is converted and dispersed into 34 different excel Sheets. The original intentions was to obtain files in text file format. The arousal of formatting error during conversion from pdf needs to be first corrected in excel Sheets. Excel is used to correct the inconsistent formatting which included overlapping of column entries into adjacent row proteins data. After the assurance of correction, excel sheet is converted back to 34 discrete text files(from 34 pages pdf)
Now, the extraction of certain columns from corresponding text file is done. The reason for it would be clear in later steps where the extracted information for each protein identifier is utilized. The format is

<Identifier><before residual>.<peptide sequence>.<after residual><peptide start(from)><peptide end(to)><no of potential sites><first site position><second site position>.

Prepare_residual_file.py provides the means for this particular extraction. Basically it accepts mouse publication text files and save the extracted columns in file

<input file>_before_hash.txt

The peptide sequence possess two variables; peptide start and peptide end index which locates sequence’s position within overall protein sequence in FASTA file. Each peptide sequence have one or two glycoanalysis sites and the next step includes insertion of hash (“#”) character in sequence at identified glycoanalysis sites. This functionality is achieved by _place_hash.py _which accepts before_hash text files. The insertion is determined through below calculation :

$$ {i_1 = p_{from} - g_{1} + 3 ; i_2 = p_{from} - g_{2} + 4} $$

where $$i_{1}, i_{2}$$ are index for hash insert , $$g_{1} , g_{2}$$ are index or glyco. sites and $$p_{from}$$ peptide position (start)

[input file]_after_hash.txt

To enhance the format, the removal of secondary identifier such as “CE22235” from “Y49E10.20 CE22235 ” is recommended because it adds one more column and causes inconsistency. The correct_identifier.py will accept any _after_hash files and outputs entries less than secondary identifier.

[input file]_only_first.txt

To remove complete duplicated entries and sort accordingly, write the bash script:

cat f_worm.txt_after_hash_only_first.txt | uniq | sort > f_worm_uniq_after_hash_only_first.txt

The final output file is named with prefix “uniq” for reader convenience and can be renamed. After that, find_glyco_pep.cpp helps extract the position of all occurring expression $$NX(!P)S|T$$ in fasta file. The input are FASTA and _after_hash file.

>> peptide.txt (FORMAT : <protein identifier><List of found NX(!P)S|T>)

To be continued

Until now, the acquisition of mouse peptide file is successfully done which acts as output in subsequent steps to obtain loopstat results. Further documentation to continue with data processes steps will come in existence as soon as the eminent bugs are located and fixed. Following the fixation of bugs, writing of tests also forms integral part of each enclosing python codes for data analysis.

Contribution

Please feel independent to contribute the source code in order to achieve efficiency, robustness and overall durability. Many python modules requires refractoring for the purpose of for instance speeding up search algorithm. In addition, more testing modules can be added to supporting python framework for unittesting.

Citations/References

[1].Sun lab Research Page

[2].Gitbook.com used for documentation

[3] BIOPYTHON for additional bioinformatics data manipulation tools