Computational and Statistical Tools for Research Annotated Publications

[1] Ling, MHT and So, CW. 2003. Architecture of an Open-Sourced, Extensible Data Warehouse Builder: InterBase 6 Data Warehouse Builder (IB-DWB). In Rubinstein, B. I. P., Chan, N., Kshetrapalapuram, K. K. (Eds.), Proceedings of the First Australian Undergraduate Students' Computing Conference. (pp. 40-45).

We report the development of an open-sourced data warehouse builder, InterBase Data Warehouse Builder (IB-DWB), based on Borland InterBase 6 Open Edition Database Server. InterBase 6 is used for its low maintenance and small footprint. IB-DWB is designed modularly and consists of five main components: a Kernel binding together the Data Plug Platform, Discoverer Platform, Multi-Dimensional Cube Builder, and Query Supporter. It is also an extensible system, made possible by the Data Plug Platform and the Discoverer Platform. Currently, extensions are only possible via dynamically linked libraries (DLLs). The Multi-Dimensional Cube Builder provides a basal means of data aggregation. The architectural philosophy of IB-DWB centers around providing an extensible base platform, functionally supported by expansion modules. IB-DWB is currently hosted on sourceforge.net (Project Unix Name: ib-dwb) and licensed under the GNU General Public License, Version 2.

[2] Ling, MHT. 2006. An Anthological Review of Research Utilizing MontyLingua, a Python-Based End-to-End Text Processor. The Python Papers 1 (1): 5-12.

MontyLingua, an integral part of ConceptNet, currently the largest common-sense knowledge base, is an English text processor developed in the Python programming language at the MIT Media Lab. The main feature of MontyLingua is its coverage of all aspects of English text processing, from raw input text to semantic meanings and summary generation; yet each component of MontyLingua is loosely coupled to the others at the architectural and code levels, enabling individual components to be used independently or substituted. However, there has been no review exploring the role of MontyLingua in recent research work utilizing it. This paper reviews the use of and roles played by MontyLingua and its components in research work published in 19 articles between October 2004 and August 2006. We observed a diversified use of MontyLingua in many different areas, both generic and domain-specific. Although use of the text summarization component was not observed, we are optimistic that it will have a crucial role in managing the current trend of information overload in future research.

[3] Ling, MHT. 2007. Firebird Database Backup by Serialized Database Table Dump. The Python Papers 2 (1): 12-16.

This paper presents a simple data dump and load utility for Firebird databases, mimicking mysqldump in MySQL. The utility, fb_dump and fb_load (for dumping and loading respectively), retrieves each database table using kinterbasdb and serializes the data using the marshal module. This utility has two advantages over the standard Firebird database backup utility, gbak. Firstly, it is able to back up and restore single database tables, which might help to recover corrupted databases. Secondly, the output is in a text-coded format (the marshal module), making it more resilient than a compressed text backup, as in the case of using gbak.
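The table dump-and-load idea can be sketched in a few lines. This is a hypothetical illustration, not the actual fb_dump/fb_load code: the function and file names are assumptions, real rows would be fetched through a kinterbasdb cursor, and the actual utility writes a text-coded format.

```python
import marshal

def dump_table(rows, filename):
    # serialize one table (a list of row tuples) to a file with marshal
    with open(filename, "wb") as f:
        marshal.dump(rows, f)

def load_table(filename):
    # read a marshal-serialized table back into a list of row tuples
    with open(filename, "rb") as f:
        return marshal.load(f)
```

Because each table is dumped to its own file, a single table can be restored independently, which is the first advantage noted above.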

[4] Ling, MHT, Lefevre, C, Nicholas, KR, Lin, F. 2007. Re-construction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates. In J.C. Ragapakse, B. Schmidt, and G. Volkert (Eds.), Proceedings of the Second IAPR Workshop on Pattern Recognition in Bioinformatics (PRIB 2007). Lecture Notes in Bioinformatics 4774. (pp. 286-299) Springer-Verlag.

Prior to this project, it was generally considered in the NLP field that biomedical text is domain-specific and requires a certain degree of tool adaptation from the generic domain to be of use. Muscorian refuted this assumption by demonstrating that an un-adapted generic text processor can perform comparably to adapted tools. At the same time, the un-adapted text processor forms a generalized layer that transforms unstructured text into a structured table of subject-verb-object triples on which question-specific tools can be built. This study also demonstrates the flexibility of this generalization-specialization paradigm by using the same generalized layer for two specialized questions.

[5] Ling, MHT, Lefevre, C, Nicholas, KR. 2008. Parts-of-Speech Tagger Errors Do Not Necessarily Degrade Accuracy in Extracting Information from Biomedical Text. The Python Papers 3(1): 65-80.

This manuscript examines why an un-adapted text processor can perform comparably to adapted tools. It was found that although an un-adapted text processor's parts-of-speech (POS) tagging accuracy is lower than that of specialized tools, this has minimal effect on the transformation to subject-verb-object structures due to complementary POS tag use in shallow parsing (breaking down sentences into phrases), thus supporting our previous findings.

[6] Ling, MHT, Lefevre, C, Nicholas, KR. 2008. Filtering Microarray Correlations by Statistical Literature Analysis Yields Potential Hypotheses for Lactation Research. The Python Papers 3(3): 4.

Besides NLP, statistical linguistics, which depends on the appearance of words or names in text, has been used to extract potential protein-protein interactions, as in the case of PubGene and CoPub Mapper. In the case of PubGene, it was found that the presence of two protein names in 1 abstract out of 10 million (1-PubGene) suggests a 60% likelihood of interaction, increasing to 72% when the names appear 5 times or more (5-PubGene). This manuscript analyzed the PubGene methods using the Poisson distribution and found that 1-PubGene is generally more stringent than 99% confidence on the Poisson distribution, thus explaining 1-PubGene's expectedly good performance. This study demonstrated that NLP-extracted interactions were almost a proper subset of statistical extractions, suggesting that NLP can be used to annotate statistical extractions. This study also found that a majority of co-expressed genes from microarray analysis, including 7 pairs of perfectly co-expressed genes, were not mentioned in text, suggesting that these potential interactions had not been studied experimentally. Hence, we suggest that text mining may be used to construct a "state of current knowledge" suitable for identifying potential hypotheses for further experimental research.
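The stringency comparison can be illustrated with a small sketch. The function names and the co-occurrence check below are illustrative assumptions, not the paper's actual code.

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    # P(X <= k) for a Poisson random variable with mean lam
    return sum(exp(-lam) * lam ** i / factorial(i) for i in range(k + 1))

def exceeds_poisson_confidence(observed, expected, confidence=0.99):
    # True if `observed` or more co-occurrences would be beyond the
    # stated confidence level when `expected` is the chance rate
    return poisson_cdf(observed - 1, expected) > confidence
```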

[7] Ling, MHT. 2009. Compendium of Distributions, I: Beta, Binomial, Chi-Square, F, Gamma, Geometric, Poisson, Student's t, and Uniform. The Python Papers Source Codes 1:4.

This paper is the first of a series implementing routines to calculate statistical distributions, which form the basis of other statistical tests.

[8] Ling, MHT. 2009. Ten Z-test Routines from Gopal Kanji's 100 Statistical Tests. The Python Papers Source Codes 1:5.

This paper is the first of a series implementing statistical test routines. For this manuscript, I chose to implement the test routines from Gopal Kanji's book that use the Normal distribution, accounting for 10% of the book.
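As an illustration of the kind of routine involved, here is a minimal two-tailed one-sample Z-test. This is a generic sketch, not the paper's implementation, which follows Kanji's numbered tests.

```python
from math import erf, sqrt

def z_test_one_sample(sample_mean, pop_mean, pop_sd, n):
    # two-tailed one-sample Z-test, assuming the population standard
    # deviation is known; returns (Z statistic, two-tailed p-value)
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p
```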

[9] Ling, MHT. 2009. Understanding Mouse Lactogenesis by Transcriptomics and Literature Analysis. Doctor of Philosophy. Department of Zoology, The University of Melbourne, Australia.

This thesis was advised by Professor Kevin R. Nicholas (currently at Deakin University, Australia) and co-advised by Associate Professors Christophe Lefevre (currently at Deakin University, Australia) and Feng Lin (currently at Nanyang Technological University, Singapore). This thesis refuted the previous assumption that a generic computational linguistics processor is unable to process biomedical text due to domain-specificity, attributing its success to complementary parts-of-speech tag use in the shallow parsing (breaking down sentences into phrases) process. It confirmed that the subject-verb-object structure is a suitable intermediate for extracting protein-protein interactions from text and demonstrated the flexibility of this technique in information extraction. It also demonstrated that information extraction by computational linguistics can supplement information extraction by statistical co-occurrence. Using computational and statistical information extraction, a filter representing the current state of biological knowledge was built to be used with microarray analysis for identifying potential novel hypotheses for further research. This thesis examined the relevance of mouse hormone-treated mammary tissue culture in studying mouse lactogenesis by comparing the transcriptomes of cultured tissues with in vivo mammary tissues across the lactation cycle using Affymetrix microarrays. It concluded that the tissue culture is useful in the study of primary hormonal responses but is unlikely to be useful in studying sustained responses, and that the tissue culture is a useful tool to “re-construct” the set of hormonal stimuli required to stimulate mouse mammary tissues into lactogenesis.

[10] Kuo, CJ, Ling, MHT, Lin, KT, Hsu, CN. 2009. BIOADI: A Machine Learning Approach to Identify Abbreviations and Definitions in Biological Literature. BMC Bioinformatics 10(Suppl 15):S7.

This manuscript addresses a limitation identified in my doctoral thesis: real-time identification of gene/protein names and their abbreviations in text, instead of the dictionary approach used in my thesis. We identified about 1.7 million unique long-form/abbreviation pairs in the entire PubMed with 95.86% precision and 89.9% recall, at an average computational speed of 10.2 seconds per thousand abstracts. At the same time, BIOADI is also a standalone tool that can be incorporated into an analysis pipeline. This study also contributed an annotated corpus to the community for tool evaluation purposes.

[11] Ling, MHT, Lefevre, Christophe, Nicholas, Kevin R. 2009. Biomedical Literature Analysis: Current State and Challenges. In B.G. Kutais (ed). Internet Policies and Issues, Volume 7. Nova Science Publishers, Inc.

This manuscript reviews the central (information retrieval, information extraction and text mining) and allied (corpus collection, databases and system evaluation methods) domains of computational linguistics to present the current state of biomedical literature analysis for protein-protein and protein-gene interactions, and the challenges ahead. Firstly, biomedical text mining is highly dependent on PubMed (MedLine) as a text repository, but neither its implementation details nor its performance in terms of precision and recall is known. Secondly, extraction of interactions depends on the recognition of entity (protein and gene) names in text, and whether different names refer to the same protein remains an open problem. Thirdly, extraction of interactions by co-occurrence and by NLP has been shown to be complementary, suggesting the improvement of future systems in this direction. Fourthly, evidence suggests that generic NLP engines may be able to process text for interaction extraction due to complementary POS tag use in the shallow parsing process, but more extensive evaluations are needed. Fifthly, there is a shortage of suitable corpora for system evaluation, resulting in difficulty in comparison (due to different corpora or databases used in evaluation) and prompting the collection of a common set of corpora for communal use. Lastly, biomedical literature analysis tools must demonstrate real-world applications without a steep learning curve before the slow adoption of these tools by biologists (the intended users) can be reversed.

[12] Lee, CH, Lee, KC, Oon, JSH, Ling, MHT. 2010. Bactome, I: Python in DNA Fingerprinting. In: Peer-Reviewed Articles from PyCon Asia-Pacific 2010. The Python Papers 5(3): 6.

Bactome is a set of functions created for our analysis of DNA fingerprints. It includes functions to find suitable primers for PCR-based DNA fingerprinting given a known genome, to determine restriction digestion profiles, and to analyse the resulting DNA fingerprint features as migration distances of the bands in gel electrophoresis.

[13] Ng, YY and Ling, MHT. 2010. Electronic Laboratory Notebook on Web2Py Framework. In: Peer-Reviewed Articles from PyCon Asia-Pacific 2010. The Python Papers 5(3): 7.

This paper presents CyNote version 1.4, a prototype electronic laboratory notebook built on the Web2py framework. CyNote uses a blog-style structure (entries and comments) as the laboratory notebook and implements a number of bioinformatics and statistical analysis functions. At the same time, this paper evaluates CyNote against US FDA 21 CFR Part 11.

[14] Ling, MHT. 2010. COPADS, I: Distances Measures between Two Lists or Sets. The Python Papers Source Codes 2:2.

This paper implements 35 distance coefficients with worked examples: Jaccard, Dice, Sokal and Michener, Matching, Anderberg, Ochiai, Ochiai 2, First Kulczynski, Second Kulczynski, Forbes, Hamann, Simpson, Russel and Rao, Roger and Tanimoto, Sokal and Sneath, Sokal and Sneath 2, Sokal and Sneath 3, Buser, Fossum, Yule Q, Yule Y, McConnaughey, Stiles, Pearson, Dennis, Gower and Legendre, Tulloss, Hamming, Euclidean, Minkowski, Manhattan, Canberra, Complement Bray and Curtis, Cosine, Tanimoto.
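Two of the simpler coefficients, Jaccard and Dice, can be sketched as set operations. This is an illustration only; the COPADS implementations differ in interface and cover all 35 measures.

```python
def jaccard(a, b):
    # Jaccard coefficient: |A intersect B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dice(a, b):
    # Dice coefficient: 2 * |A intersect B| / (|A| + |B|)
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))
```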

[15] Chay, ZE, Ling, MHT. 2010. COPADS, II: Chi-Square test, F-Test and t-Test Routines from Gopal Kanji's 100 Statistical Tests. The Python Papers Source Codes 2:3.

This paper extends previous work on the implementation of statistical tests as described by Kanji. A total of 8 Chi-square test, 3 F-test and 6 t-test routines are implemented, bringing the total to 27 out of 100 tests implemented to date.

[16] Lim, JZR, Aw, ZQ, Goh, DJW, How, JA, Low, SXZ, Loo, BZL, Ling, MHT. 2010. A genetic algorithm framework grounded in biology. The Python Papers Source Codes 2: 6.

This manuscript describes the implementation of a GA framework that uses a biological hierarchy, from chromosomes to organisms to populations.
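The chromosome-to-organism-to-population hierarchy can be sketched with three classes. Class and method names here are hypothetical, not the framework's actual API.

```python
import random

class Chromosome:
    def __init__(self, sequence):
        self.sequence = list(sequence)
    def mutate(self, alphabet="ATGC", rate=0.01):
        # point mutation: each base may be replaced at the given rate
        for i in range(len(self.sequence)):
            if random.random() < rate:
                self.sequence[i] = random.choice(alphabet)

class Organism:
    def __init__(self, chromosomes):
        self.chromosomes = chromosomes

class Population:
    def __init__(self, organisms):
        self.organisms = organisms
    def generation_step(self):
        # one evolutionary step: mutate every chromosome of every organism
        for org in self.organisms:
            for chrom in org.chromosomes:
                chrom.mutate()
```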

[17] Tahat, A, Ling, MHT. 2010. Mapping Relational Operations onto Hypergraph Model. The Python Papers 6(1): 4.

The relational model is the most commonly used data model for storing large datasets. However, many real-world objects are recursive and associative in nature, which makes storage in the relational model difficult. The hypergraph model is a generalization of the graph model, where each hypernode can be made up of other nodes or graphs and each hyperedge can be made up of one or more edges. It may address the recursive and associative limitations of the relational model. However, the hypergraph model is non-tabular and thus loses the simplicity of the relational model. In this study, we consider the means to convert a relational model into a hypergraph model in two layers and present a reference implementation of relational operators (project, rename, select, inner join, natural join, left join, right join, outer join and Cartesian join) on a hypergraph model.

[18] Ling, MHT. 2010. Specifying the Behaviour of Python Programs: Language and Basic Examples. The Python Papers 5(2): 4.

This manuscript describes BeSSY, a function-centric language for formal behavioural specification that requires no more than high-school mathematics on arithmetic, functions, Boolean algebra and set theory. An object can be modelled as a union of data sets and functions, whereas an inherited object can be modelled as a union of supersets and a set of object-specific functions. Python list and dictionary operations are specified in BeSSY for illustration.

[19] Ling, MHT, Lefevre, Christophe, Nicholas, KR. 2010. Mining Protein-Protein Interactions from Published Abstracts with MontyLingua. In Zhongming Zhao(ed). Sequence and Genome Analysis: Methods and Applications. iConcept Press Pty Ltd.

[20] Ling, MHT. 2011. Bactome II: Analyzing Gene List for Gene Ontology Over-Representation. The Python Papers Source Codes 3: 3.

Microarray is an experimental tool that allows for the screening of several thousand genes in a single experiment, and its analysis often requires mapping onto biological processes. This allows for the examination of processes that are over-represented. A number of tools have been developed, but each differs in the organisms that can be analyzed. The Gene Ontology website has a list of up-to-date annotation files for different organisms that can be used for over-representation analysis; each file maps each gene of the organism to its ontological terms. Bactome II is a simple tool that allows users to use these up-to-date annotation files to generate the expected and observed counts for each GO identifier (GO ID) from a given gene list for further statistical analyses.
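The expected/observed counting can be sketched as follows, assuming the annotation file has already been parsed into a gene-to-GO-IDs mapping. Function and parameter names are hypothetical.

```python
from collections import Counter

def go_counts(gene_list, annotation, background_size):
    # for each GO ID seen in the gene list, return (observed, expected);
    # annotation maps gene -> set of GO IDs, and the expected count
    # assumes annotations are spread uniformly over the background genome
    observed = Counter()
    for gene in gene_list:
        for go_id in annotation.get(gene, ()):
            observed[go_id] += 1
    totals = Counter()
    for gos in annotation.values():
        for go_id in gos:
            totals[go_id] += 1
    return {go_id: (observed[go_id],
                    totals[go_id] * len(gene_list) / background_size)
            for go_id in observed}
```

The choice of statistical test on these counts is left to the user, matching the tool's stated scope.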

[21] Kuo, CJ, Ling, MHT, Hsu, CN. 2011. Soft Tagging of Overlapping High Confidence Gene Mention Variants for Cross-Species Full-Text Gene Normalization. BMC Bioinformatics 12(Suppl 8):S6.

Background: Previous gene normalization (GN) systems mostly focused on disambiguation using contextual information. An effective gene mention tagger was deemed unnecessary because subsequent steps would filter out false positives, and high recall was sufficient. However, unlike similar tasks in past BioCreative challenges, the BioCreative III GN task is particularly challenging because it is not species-specific. Required to process full-length articles, an ineffective gene mention tagger may produce a huge number of ambiguous false positives that overwhelm subsequent filtering steps while still missing many true positives. Results: We present our GN system, which participated in the BioCreative III GN task. Our system applies a typical 2-stage approach to GN but features a soft-tagging gene mention tagger that generates a set of overlapping gene mention variants with a nearly perfect recall. The overlapping gene mention variants increase the chance of a precise match in the dictionary and alleviate the need for disambiguation. Our GN system achieved a precision of 0.9 (F-score 0.63) on the BioCreative III GN test corpus with the silver annotation of 507 articles. Its TAP-k scores are competitive with the best results among all participants. Conclusions: We show that despite the lack of clever disambiguation in our gene normalization system, effective soft tagging of gene mention variants can indeed contribute to performance in cross-species and full-text gene normalization.

[22] Ling, MHT, Jean, A, Liao, D, Tew, BBY, Ho, S, Clancy, K. 2011. Integration of Standardized Cloning Methodologies and Sequence Handling to Support Synthetic Biology Studies. Third International Workshop on Bio-Design Automation (IWBDA). San Diego, California, USA.

[23] Ling, MHT. 2012. An Artificial Life Simulation Library Based on Genetic Algorithm, 3-Character Genetic Code and Biological Hierarchy. The Python Papers 7: 5.

Genetic algorithm (GA) is inspired by the biological evolution of genetic organisms, optimizing the genotypic combinations encoded within each individual with the help of evolutionary operators; this suggests that GA may be a suitable model for studying real-life evolutionary processes. This paper describes the design of a Python library for artificial life simulation, Digital Organism Simulation Environment (DOSE), based on GA and a biological hierarchy from genetic sequence to population. A 3-character instruction set that does not take any operand is introduced as the genetic code for digital organisms; this mimics the 3-nucleotide codon structure in naturally occurring DNA. In addition, the context of a 3-dimensional world composed of ecological cells is introduced to simulate a physical ecosystem. Using DOSE, an experiment examining the changes in genetic sequences with respect to mutation rates is presented.

[24] Ling, MHT. 2012. Ragaraja 1.0: The Genome Interpreter of Digital Organism Simulation Environment (DOSE). The Python Papers Source Codes 4: 2.

This manuscript describes the implementation and test of Ragaraja instruction set version 1.0, which is the core genomic interpreter of DOSE.

[25] Chen, KFQ, Ling, MHT. 2013. COPADS III (Compendium of Distributions II): Cauchy, Cosine, Exponential, Hypergeometric, Logarithmic, Semicircular, Triangular, and Weibull. The Python Papers Source Codes 5: 2.

This manuscript illustrates the implementation and testing of eight statistical distributions, namely the Cauchy, Cosine, Exponential, Hypergeometric, Logarithmic, Semicircular, Triangular, and Weibull distributions, where each distribution consists of three common functions: the Probability Density Function (PDF), Cumulative Density Function (CDF) and the inverse of the CDF (inverseCDF). These codes have been incorporated into the COPADS codebase (https://github.com/copads/copads) and are licensed under the Lesser General Public Licence version 3.

[26] Ling, MHT. 2014. NotaLogger: Notarization Code Generator and Logging Service. The Python Papers 9: 2.

The act of affixing a signature and date to a document, known as notarization, is often used as evidence for sighting or bearing witness to any document in question. Notarization and dating are required to render documents admissible in a court of law. However, the weakest link in the process of notarization is the notary; that is, the person dating and affixing his/her signature. A number of legal cases have shown instances of false dating and falsification of signatures. In this study, NotaLogger is proposed, which can be used to generate a notarization code to be appended to the document to be notarized. During notarization code generation, the user can include relevant information to identify the document, and the date and time of code generation are logged into the system. A generated and used notarization code can be verified by searching in NotaLogger, and such a search results in date-time stamping by a Network Time Protocol server. As a result, NotaLogger can be used as an "independent witness" to any notarization. NotaLogger can be accessed at http://mauricelab.pythonanywhere.com/notalogger/.

[27] Chan, OYW, Keng, BMH, Ling, MHT. 2014. Bactome III: OLIgonucleotide Variable Expression Ranker (OLIVER) 1.0, Tool for Identifying Suitable Reference (Invariant) Genes from Large Microarray Datasets. The Python Papers Source Codes 6: 2.

This manuscript documents the implementation of the OLIgonucleotide Variable Expression Ranker (OLIVER) as described in Chan et al. (2014), which can be downloaded from http://sourceforge.net/projects/bactome/files/OLIVER/OLIVER_1.zip. These codes are licensed under the GNU General Public License version 3 for academic and not-for-profit use.

[28] Castillo, CFG, Ling, MHT. 2014. Digital Organism Simulation Environment (DOSE): A Library for Ecologically-Based In Silico Experimental Evolution. Advances in Computer Science: an International Journal 3(1): 44-50.

Testing evolutionary hypotheses in a biological setting is expensive and time-consuming. Computer simulations of organisms (digital organisms) are commonly used proxies to study evolutionary processes. A number of digital organism simulators have been developed but are deficient in biological and ecological parallels. In this study, we present DOSE (Digital Organism Simulation Environment), a digital organism simulator with biological and ecological parallels. DOSE consists of a biological hierarchy of genetic sequences, organisms, populations, and ecosystem. A 3-character instruction set that does not take any operand is used as the genetic code for digital organisms, which mimics the 3-nucleotide codon structure in naturally occurring DNA. The evolutionary driver is simulated by a genetic algorithm. We demonstrate its utility by examining the effects of migration on heterozygosity, also known as local genetic distance. Our simulation results showed that adjacent migration, such as foraging or nomadic behaviour, increases heterozygosity, while long-distance migration, such as flight covering the entire ecosystem, does not.

[29] Koh, YZ, Ling, MHT. 2014. Catalog of Biological and Biomedical Databases Published in 2013. iConcept Journal of Computational and Mathematical Biology 3: 3.

A large number of biological and biomedical databases have been created over the years, with a steady rise of about 10% from 2005 to 2012. However, it is difficult to navigate the range of databases, as there is no current database inventory and links to databases are embedded in their respective publications. In this study, we developed a set of 91 cataloging tags based on software repositories and listed 379 database papers published in 2013. Of these, only 290 database papers have URL links to the databases; therefore, only 290 databases were cataloged. Our catalog is given in the appendix.

[30] Chew, JS, Ling, MHT. 2016. TAPPS Release 1: Plugin-Extensible Platform for Technical Analysis and Applied Statistics. Advances in Computer Science: an international journal 5(1): 132-141.

In this first article, the main features of TAPPS are described: (1) a thin platform with (2) a CLI-based, domain-specific command language where (3) all analytical functions are implemented as plugins. This results in a well-defined plugin system, which enables rapid prototyping and testing of analysis functions. This article also describes the architecture and implementation of TAPPS in a level of detail sufficient for interested developers to fork the code for further improvements.

[31] Ling, MHT. 2016. COPADS IV: Fixed Time-Step ODE Solvers for a System of Equations Implemented as a Set of Python Functions. Advances in Computer Science: an international journal 5(3): 5-11.

Ordinary differential equation (ODE) systems are commonly used in many different fields. The de-facto method of implementing an ODE system in Python using SciPy requires the entire system to be implemented as a single function, which only allows for inline documentation. Although each equation can be broken up into sub-equations, there is no compartmentalization of sub-equations to their ODE. A better method is to implement each ODE as a function. This encapsulates the sub-equations within their ODE and allows for function and inline documentation, resulting in better maintainability. This study presents the implementation of 11 ODE solvers that enable each ODE in a system to be implemented as a function. Three enhancements are added. Firstly, the solvers are implemented as generators to allow for virtually infinite simulation, returning a stream of intermediate results for analysis. Secondly, the solvers allow for non-ODE-bounded variables in the solution vector to improve code and results documentation. Lastly, a means to set upper and lower boundaries on ODE solutions is added. Validation testing shows that the enhanced ODE solvers give results comparable to SciPy's default ODE solver. The implemented solvers are incorporated into the COPADS repository (https://github.com/copads/copads).
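A minimal sketch of the one-function-per-ODE, generator-based design, using a plain Euler step. This is illustrative only; the COPADS solvers add the boundary and non-ODE-variable enhancements described above, and all names here are assumptions.

```python
def euler_generator(odes, y0, t0, dt):
    # fixed-step Euler solver as a generator: each ODE is its own
    # function f(t, y) -> dy_i/dt, so every equation carries its own
    # docstring; yields (t, y) indefinitely
    t, y = t0, list(y0)
    while True:
        yield t, list(y)
        y = [yi + dt * f(t, y) for yi, f in zip(y, odes)]
        t += dt

def dx_dt(t, y):
    """Exponential decay: dx/dt = -x."""
    return -y[0]
```

Because the solver is a generator, it can run virtually forever and be sampled with itertools.islice, matching the streaming enhancement described above.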

[32] Chay, ZE, Goh, BF, Ling, MHT. 2016. PNet: A Python Library for Petri Net Modeling and Simulation. Advances in Computer Science: an international journal 5(4): 24-30.

A Petri Net is a formalism for describing changes between two or more states across discrete time and has been used to model many systems. We present PNet, a pure Python library for Petri Net modeling and simulation. The design of PNet focuses on reducing the learning curve needed to define a Petri Net by using a text-based language rather than programming constructs to define transition rules. Complex transition rules can be defined as regular Python functions. To demonstrate the simplicity of PNet, we present two examples: bread baking, and epidemiological models.
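The token-moving semantics of a simple place/transition net can be sketched in a few lines. This is a generic illustration, not PNet's actual text-based rule language.

```python
def fire(marking, transitions):
    # one synchronous step of a simple place/transition net;
    # marking: dict place -> token count;
    # transitions: list of (inputs, outputs) dicts of place -> tokens;
    # a transition fires only if all its input places hold enough tokens
    marking = dict(marking)
    for inputs, outputs in transitions:
        if all(marking.get(p, 0) >= n for p, n in inputs.items()):
            for p, n in inputs.items():
                marking[p] -= n
            for p, n in outputs.items():
                marking[p] = marking.get(p, 0) + n
    return marking
```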

[33] Castillo, CFG, Ling, MHT. 2018. Digital Organism Simulation Environment (DOSE) Version 1.0.4. In Current STEM, Volume 1, page 1-106. Nova Science Publishers, Inc. ISBN 978-1-53613-416-2.

Evolution is a fundamental aspect of biology, but examining evolution is difficult, time-consuming and costly. At the same time, molecular analysis of biological organisms is generally destructive, which presents a conundrum between observing possible evolutionary outcomes and in-depth molecular analysis to decipher those outcomes. Artificial life simulations via the use of digital organisms (DO) have been proposed as a feasible means of examining evolution in silico and have yielded biologically relevant findings. Being digital, identical replicates can be made for analysis, thereby resolving the conundrum. Recently, the original implementation of DOSE (Ling, 2012a) was improved (Castillo and Ling, 2014a) for use as a Python library for simplified construction of simulations, enabling database logging and revival of simulations. This manuscript documents the implementation and improvement of DOSE, which is released as DOSE version 1.0.4 (https://github.com/mauriceling/dose/releases/tag/v1.0.4) and licensed under the GNU General Public License version 3. The DOSE codebase is hosted and available for forking at https://github.com/mauriceling/dose.

[34] Ling, MHT. 2018. A Cryptography Method Inspired by Jigsaw Puzzles. In Current STEM, Volume 1, page 129-142. Nova Science Publishers, Inc. ISBN 978-1-53613-416-2.

Cryptography is a critical tool in safeguarding information from “unauthorized” view during the storage and transportation of data. Due to the one-to-one correspondence between plain text and cipher text, encryption algorithms can be seen as a transformation process. This is a deficiency, as all information is present, though encrypted, in the cipher text. Inspired by jigsaw puzzles, a new cryptography system, the Jigsaw Encryption System (JES), is proposed, where a single plain text file results in many cipher text files resembling jigsaw pieces from a single image; thus, the loss of a small number of cipher text files may not compromise the entire plain text contents. Each cipher text can be further processed for added security. This can result in a larger number of permutations needed to decipher by brute force. Reference implementations of preliminary JES versions have been deposited into the COPADS (Collection of Python Algorithms and Data Structures) repository (https://github.com/mauriceling/copads; file name: copads/jigsaw.py).

[35] Ling, MHT. 2018. COPADS V: Lindenmayer System with Stochastic and Function-Based Rules. In Current STEM, Volume 1, page 143-172. Nova Science Publishers, Inc. ISBN 978-1-53613-416-2.

The Lindenmayer system, commonly known as L-system, is a string rewriting system based on a set of rules. In each iteration, the string is rewritten according to the rules given. This has been used to model branching processes, such as plant and animal body patterning, and sedimentation processes. In addition to deterministic rewriting rules, stochastic rules have been used, leading to the development of the stochastic L-system (S-L-system). For more complex modeling, parametric rules have been used, leading to the development of the parametric L-system (P-L-system). Combining the S-L-system and P-L-system leads to an L-system capable of handling both stochastic and parametric rules, the parametric-stochastic L-system (PS-L-system). Currently, there is no pure Python PS-L-system library. In this study, a light-weight, pure Python PS-L-system has been implemented. It has been incorporated into the COPADS repository (https://github.com/copads/copads) and licensed under the GNU General Public License 3.
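The stochastic rewriting step can be sketched as follows. This illustrates S-L-system rewriting only; parametric rules and the actual COPADS API are not shown, and the rule format here is an assumption.

```python
import random

def rewrite(axiom, rules, generations, rng=random):
    # stochastic L-system rewriting: rules maps a symbol to a list of
    # (probability, replacement) pairs whose probabilities sum to 1;
    # symbols without a rule are copied unchanged
    s = axiom
    for _ in range(generations):
        out = []
        for ch in s:
            if ch in rules:
                r = rng.random()
                cumulative = 0.0
                for prob, replacement in rules[ch]:
                    cumulative += prob
                    if r < cumulative:
                        out.append(replacement)
                        break
            else:
                out.append(ch)
        s = "".join(out)
    return s
```

With every rule given probability 1.0, this reduces to a deterministic L-system.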

[36] Ling, MHT. 2018. COPADS VI: Fixed Time-Step ODE Solvers with Mixed ODE and non-ODE Function, and Script Generator. In Current STEM, Volume 1, page 173-212. Nova Science Publishers, Inc. ISBN 978-1-53613-416-2.

Ordinary differential equations (ODEs) are commonly used in mathematical modelling. However, the standard means of implementing a system of ODEs in Python, using SciPy, does not allow each ODE to be implemented as an individual Python function. This results in poor documentation and maintainability. We have re-implemented 11 fixed-step ODE solvers to allow an ODE system to be implemented as a set of Python functions. Here, the ODE solvers are enhanced to take a non-ODE function, allowing for modification of the ODE result vector at each time step, which may be useful in cases where one or more results have to be calibrated using other results. In addition, a script generator is implemented to assist in the generation of a Python ODE script from a set of parameters. This module has been incorporated into the Collection of Python Algorithms and Data Structures (COPADS; https://github.com/mauriceling/copads), under the Python Software Foundation License version 2.

[37] Ling MHT. 2018. RANDOMSEQ: Python Command-Line Random Sequence Generator. MOJ Proteomics & Bioinformatics 7(4): 206-208.

Randomly generated sequences are important in many sequence analysis studies as they represent null hypotheses. Several tools exist to generate random sequences, but each has its own strengths and weaknesses. Building on the strengths of existing tools while addressing their weaknesses, a command-line random sequence generator, RANDOMSEQ, is presented. Generation of random sequences is versatile: (a) fixed- or variable-length nucleotide or amino acid sequences can be generated; (b) a variety of frequencies for sequence generation is accepted – source sequence, or single- or n-length nucleotide / amino acid frequencies; (c) generated sequences can be free of user-defined start codons, stop codons, or both; (d) generated sequences can be flanked with randomly selected start and stop codons; and (e) one or more constant regions can exist within the sequence.
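Frequency-driven generation, the core of feature (b), can be sketched as weighted sampling of residues or n-mers. This is an illustrative sketch, not RANDOMSEQ's command-line interface or code:

```python
import random

def random_sequence(length, freqs, rng=None):
    """Generate a random sequence from a frequency table.

    freqs maps a residue (or n-mer) to its relative frequency;
    picks are concatenated until the requested length is reached,
    then trimmed to exactly that length.
    """
    rng = rng or random.Random()
    symbols, weights = zip(*freqs.items())
    seq, total = [], 0
    while total < length:
        pick = rng.choices(symbols, weights=weights)[0]
        seq.append(pick)
        total += len(pick)
    return "".join(seq)[:length]

# Illustrative AT-rich nucleotide frequencies
seq = random_sequence(60, {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15})
```

Frequencies derived from a source sequence would simply be counted from that sequence and passed in as the same kind of table.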

[38] Ling, MHT. 2018. SEcured REcorder BOx (SEREBO) Based on Blockchain Technology for Immutable Data Management and Notarization. MOJ Proteomics & Bioinformatics 7(6): 169-174.

Several surveys suggest that as many as 33% of scientists have personal knowledge of a colleague who fabricated or falsified research data. This indicates the need for a system that can help assure that research data has not been modified. Blockchain technology ensures data authenticity as recorded data is immutable. In this study, a command-line data recorder and notary service based on blockchain, SEcured REcorder BOx (SEREBO), is presented. SEREBO can help individual scientists or research teams prove data authenticity after logging data files into the system, and provide traceable notarization records. Hence, SEREBO is a potentially important tool for auditing research data against modification, and for auditing notarization events against backdating or postdating. SEREBO is available for forking at https://github.com/mauriceling/serebo under GNU General Public License version 3 for non-commercial or academic use only.
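The immutability property comes from hash chaining: each record's hash covers the previous record's hash, so altering any earlier entry invalidates every later one. The following is a minimal sketch of that idea, with illustrative names rather than SEREBO's actual API:

```python
import hashlib

class Ledger:
    """Minimal hash-chained ledger sketch. Each record's SHA-256 digest
    covers the previous digest, so tampering with any earlier record
    breaks verification of all later records."""

    def __init__(self):
        self.chain = [("genesis", "0" * 64)]

    def record(self, data):
        prev_hash = self.chain[-1][1]
        digest = hashlib.sha256((prev_hash + data).encode()).hexdigest()
        self.chain.append((data, digest))
        return digest

    def verify(self):
        prev_hash = "0" * 64
        for data, digest in self.chain[1:]:
            if hashlib.sha256((prev_hash + data).encode()).hexdigest() != digest:
                return False
            prev_hash = digest
        return True
```

In a notarization setting, `data` would typically be a file's hash plus a timestamp, so the ledger proves the file existed unmodified at recording time.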

[39] Ling, MHT. 2019. Draft Implementation of a Method to Secure Data by File Fragmentation. Acta Scientific Computer Sciences 1(2): 10-13.

Cryptography is fundamental in data security and is a critical tool in safeguarding information from "unauthorized" view during the storage and transportation of data. Due to the one-to-one correspondence between plain text and cipher text, encryption algorithms are transformation processes. This implies that all information is present, though encrypted, in the cipher text. Inspired by jigsaw puzzles, a new cryptography system, the Jigsaw Cryptography System (JCS), is proposed, in which a single plain text file results in many cipher text files, resembling jigsaw pieces cut from a single image. Thus, the interception of a small number of cipher text files may not compromise the entire plain text content. This also enlarges the permutation space that a brute-force attack must search, beyond what most cryptographic methods achieve.
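The fragmentation idea can be sketched as splitting the data into pieces and shuffling their order with a keyed pseudo-random permutation; only the key recovers the order. This sketch is illustrative of the concept only, not the paper's implementation, and a keyed `random.shuffle` is not a cryptographically strong permutation:

```python
import random

def fragment(data, piece_size, key):
    """Split bytes into fixed-size pieces and shuffle their order with a
    keyed PRNG; the key is needed to reassemble the original order."""
    pieces = [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
    order = list(range(len(pieces)))
    random.Random(key).shuffle(order)
    return [pieces[i] for i in order]

def reassemble(shuffled, key):
    """Regenerate the same permutation from the key and invert it."""
    order = list(range(len(shuffled)))
    random.Random(key).shuffle(order)
    pieces = [None] * len(shuffled)
    for dest, piece in zip(order, shuffled):
        pieces[dest] = piece
    return b"".join(pieces)
```

In the full scheme each piece would also be written to its own file (the "jigsaw pieces"), so an attacker holding only some files holds only some pieces.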

[40] Ling, MHT. 2019. Island: A Simple Forward Simulation Tool for Population Genetics. Acta Scientific Computer Sciences 1(2): 20-22.

Changes in population genetic structure can result from genetic drift and/or selective pressure, which may change the adaptability of the population. Computer simulations are commonly used to gain insights into the genetic fate of evolving populations. However, most simulation tools in this area require a firm understanding of the mathematical models of genetic drift, whereas low-cost, hands-on tools are key to making abstract concepts, such as genetic drift, more intuitive. Here, Island is presented as a simple forward simulation tool for population genetics based on Mendelian inheritance, where a population is generated from a comma-delimited file containing allelic frequencies. Forward simulations start from this initial population and track its evolution over multiple generations; each generation results in a population file, which can be examined independently to observe changes in allelic frequencies over generations.
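The core of a forward simulator is resampling alleles each generation. The sketch below uses Wright-Fisher style sampling, a common core of such simulators, purely to illustrate drift; it is not Island's code, and the function names are assumptions:

```python
import random

def next_generation(freqs, pop_size, rng=None):
    """Sample 2N alleles from the current frequencies (one diploid
    generation) and return the new allelic frequencies."""
    rng = rng or random.Random()
    alleles, weights = zip(*freqs.items())
    drawn = rng.choices(alleles, weights=weights, k=2 * pop_size)
    return {a: drawn.count(a) / (2 * pop_size) for a in alleles}

def simulate(freqs, pop_size, generations, rng=None):
    """Track allelic frequencies over generations; one dict per generation,
    analogous to Island's one population file per generation."""
    history = [dict(freqs)]
    for _ in range(generations):
        freqs = next_generation(freqs, pop_size, rng)
        history.append(dict(freqs))
    return history

hist = simulate({"A": 0.5, "a": 0.5}, pop_size=50, generations=20)
```

With a small population size, repeated runs show frequencies wandering and eventually fixing — drift made tangible, which is the pedagogical point of the tool.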

[41] Ling, MHT. 2020. SeqProperties: A Python Command-Line Tool for Basic Sequence Analysis. Acta Scientific Microbiology 3(6): 103-106.

A Python Command-Line Tool for Basic Sequence Analysis.

[42] Ling MHT. 2020. AdvanceSyn Toolkit: An Open-Source Suite for Model Development and Analysis in Biological Engineering. MOJ Proteomics & Bioinformatics 9(4): 83-86.

Modelling and simulations are useful means to screen potential experimental designs for metabolic engineering. Genome-scale models of metabolism (GSMs) and kinetic models (KMs) are the two main modelling approaches, which has resulted in largely disjoint computational tools for GSMs and KMs. Existing tools for GSMs require knowledge of the underlying programming languages, while the development and merger of two or more KMs is difficult. In this work, the AdvanceSyn Toolkit is presented: an open-source, high-level command-line tool to develop KMs and to analyse GSMs and KMs, licensed under the Apache License, Version 2.0, for academic and not-for-profit use. It alleviates the need to know the underlying programming language for GSM analysis. The AdvanceSyn Model (ASM) specification is a simple and modular format for model development, and the AdvanceSyn Toolkit provides a method to merge two or more model files for simulation and sensitivity analysis.
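Model merging can be illustrated as unifying shared species while guarding against clashing reaction definitions. This sketch is a guess at the general shape of such a merge, using plain dictionaries; it is not the ASM format or the toolkit's API:

```python
def merge_models(*models):
    """Merge models given as {"species": {...}, "reactions": {...}} dicts.
    Shared species are unified (later definitions win); duplicate
    reaction names are treated as a conflict and raise an error."""
    merged = {"species": {}, "reactions": {}}
    for model in models:
        merged["species"].update(model.get("species", {}))
        for name, rxn in model.get("reactions", {}).items():
            if name in merged["reactions"]:
                raise ValueError("duplicate reaction: " + name)
            merged["reactions"][name] = rxn
    return merged

glycolysis = {"species": {"glucose": 5.0}, "reactions": {"r1": ("glucose", "G6P")}}
ppp = {"species": {"G6P": 0.0}, "reactions": {"r2": ("G6P", "6PG")}}
combined = merge_models(glycolysis, ppp)
```

A modular model format makes this kind of merge mechanical, which is what enables simulating two separately developed pathway models together.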

[43] Liu, TT, Ling, MHT. 2020. BactClass: Simplifying the Use of Machine Learning in Biology and Medicine. Acta Scientific Medical Sciences 4(11): 43-47.

Machine learning has many applications in biology and medicine. However, most existing tools require substantial programming skills, which can be a challenge to many biologists. Here, we present BactClass, a command-line tool for running machine learning algorithms on formatted data, aiming to reduce the challenges faced by biologists who are interested in using machine learning approaches. BactClass is part of the Bactome project (https://github.com/mauriceling/bactome) and is licensed under GNU General Public License version 3 for academic and non-commercial purposes only.

[44] Ling, MHT. 2020. Low Classification Accuracy by Logistic Regression, Support Vector Classifier, and Multi-Layer Perceptron, but Not Decision Tree, on Random Attributes from Hadamard Matrix. EC Clinical and Medical Case Reports 3(12): 07-10.

The use of machine learning classifiers is increasing, with evidence of them overtaking human judgement. This can be risky if the workings and implications of machine learning classifiers remain a black box. Here, a case is reported where a balanced, algorithmically generated data set, the Hadamard matrix, classifies worse than random using logistic regression (accuracy < 17.4%), support vector classifier (accuracy < 23.4%), and in most cases multi-layer perceptron (accuracy < 27.9%), but not decision tree (accuracy > 77.3%), despite perfect (100%) internal classification accuracy for both the support vector classifier and the multi-layer perceptron. This suggests a systematic and yet currently unexplained source of error.
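A Hadamard matrix is fully determined by its construction, which makes it a clean source of balanced, structured attributes. The sketch below shows the standard Sylvester construction for orders that are powers of two; the paper's exact data preparation from the matrix is not reproduced here:

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2).
    Doubles the order each step: H(2k) = [[H, H], [H, -H]]."""
    H = [[1]]
    while len(H) < n:
        H = ([row + row for row in H] +
             [row + [-x for x in row] for row in H])
    return H

H = hadamard(8)
# Defining property: distinct rows are orthogonal (dot product 0),
# so every attribute column is perfectly balanced between +1 and -1.
```

That orthogonality is what makes the reported result striking: the classes are perfectly separable by construction, yet several classifiers perform worse than chance on held-out rows.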

[45] Sim, KS, Ling, MHT. 2021. Installation and Documentation Evaluation of Recent (01 January 2020 to 15 February 2021) Chatbot Engines from Python Package Index (PyPI). Acta Scientific Computer Sciences 3(8): 38-43.

Chatbots have their roots in the early days of computing and have gained substantial popularity in recent years. The most critical component of a chatbot is the engine that accepts and responds to natural human language input. In this study, we evaluated the installation and documentation of 21 recent chatbot engines (01 January 2020 to 15 February 2021) indexed in the Python Package Index (PyPI). Fourteen engines can be installed and imported without warnings or errors, and four engines have rich documentation. Only three engines with rich documentation (ChatterBot, chatbotAI, and opsdroid) can be installed and imported without warnings or errors. This suggests that the majority of the available recent Python chatbot engines are not ready for widespread use.

[46] Ling, MHT. 2021. ZeroOne: Building and Enhancing Executing Simulation by Incremental Patches. Acta Scientific Computer Sciences 3(10): 50-52.

Identifying all the required aspects before building a simulation is one of the major difficulties in simulation development. This may be resolved by incremental simulation building; however, the simulator must be able to accept new code into the simulation while the simulation is running. Here, I present ZeroOne, a simulation engine that allows incremental simulation building by monitoring a script file and processing new code from it. ZeroOne is licensed under GNU General Public License Version 3.0 for academic and not-for-profit use.
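The monitor-and-patch mechanism can be sketched as polling a script file and executing its contents into the running simulation's namespace whenever it changes. This is an illustrative sketch of the idea, not ZeroOne's engine; all names are assumptions:

```python
import os
import tempfile
import time

def watch_and_exec(path, namespace, polls=3, interval=0.05):
    """Poll a script file; whenever its modification time changes,
    exec its contents into the simulation's namespace (an incremental
    patch applied to a running simulation)."""
    last = None
    for _ in range(polls):
        mtime = os.path.getmtime(path)
        if mtime != last:
            with open(path) as fh:
                exec(fh.read(), namespace)
            last = mtime
        time.sleep(interval)
    return namespace

# Demo: the "simulation script" supplies a new parameter while being watched
fd, path = tempfile.mkstemp(suffix=".py")
os.write(fd, b"step_size = 0.5\n")
os.close(fd)
ns = watch_and_exec(path, {}, polls=1, interval=0)
os.remove(path)
```

A real engine would run this watcher alongside the simulation loop (e.g. in a thread) so patches take effect between simulation steps.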

[47] Amir-Hamzah, N, Kuan, ZJ, Ling, MHT. 2022. Kinetic Models with Default Enzyme Kinetics from Genome-scale Models. Acta Scientific Computer Sciences 4(1): 59-63.

Many genome-scale models of metabolism (GSMs) have been constructed to study the effects of changing native gene expression on metabolism. Kinetic models of metabolism (KMs) can be a useful tool to study the effects of transgenes and regulation on the time-course metabolic profile of the host. However, the availability of KMs is substantially lower, and their scope smaller, than that of GSMs. One possibility is to generate KMs from GSMs, but no such tool is available. Here, we present a converter that turns substrate-product pairs in GSM rate laws into enzyme kinetic equations in a KM using default enzyme kinetics. Our testing results suggest that simulatable KMs can be successfully generated from GSMs to produce time-course metabolic profiles.
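One common choice of default enzyme kinetics is irreversible Michaelis-Menten with placeholder parameters. The sketch below shows how a rate expression might be generated from a reaction's substrates; it is illustrative only, and the default kinetics, parameter names, and function are assumptions rather than the paper's converter:

```python
def default_rate_law(substrates, vmax="Vmax", km="Km"):
    """Build an irreversible Michaelis-Menten rate expression string for a
    reaction's substrates, using default (placeholder) kinetic parameters
    that a modeller can later calibrate."""
    terms = ["(%s / (%s + %s))" % (s, km, s) for s in substrates]
    return "%s * %s" % (vmax, " * ".join(terms))

law = default_rate_law(["glucose", "ATP"])
# -> 'Vmax * (glucose / (Km + glucose)) * (ATP / (Km + ATP))'
```

Applying such a template to every substrate-product pair in a GSM yields a KM that is immediately simulatable, with the default parameters refined later as kinetic data becomes available.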

[48] Maitra, A, Ling, MHT. 2022. DOSSIER: A Toolkit to Extract Data from Digital Life Simulations Using Dose. Acta Scientific Computer Sciences 4(7): 37-40.

Artificial life, also known as digital organisms, has been useful in testing various evolutionary hypotheses. DOSE is one of the platforms for experimenting with digital organisms and has been used in several studies. However, the internal architecture of DOSE does not allow for easy processing of simulation results, despite storing the state of each digital organism and the world for each generation in an SQLite database. To address this problem, we implemented DOSSIER, a Python-based toolkit that connects to the SQLite database and facilitates data extraction and processing into a standardized table of fitness scores.

[49] Tan, JZH, Tan, NTF, Ling, MHT. 2022. Brainopy: A Biologically Relevant SQLite-Based Artificial Neural Network Library. Acta Scientific Computer Sciences 4(12): 13-22.

An artificial neural network (ANN) is a computing system inspired by biological neural networks, but recently there has been a move towards studying biological neural networks using neuronal simulations. Hence, ANNs can be a tool to study biological neural networks. However, most ANN libraries only cater for one signal (equivalent to one neurotransmitter) and generally require neurons to be organized into layers, which may not have a direct biological equivalent. Here, we present Brainopy, a biologically relevant Python-based ANN library that enables multiple neurotransmitters and allows each neuron to connect to any other neuron. The constructed neural network is persisted as an SQLite database file. Despite focusing on biological relevance over computational efficiency, we built and simulated neural networks of up to 15000 neurons (within the neuronal complexity of Caenorhabditis elegans, a well-studied organism in neuroscience) using a retail laptop.
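Persisting an unlayered, multi-transmitter network in SQLite can be sketched with two tables: neurons and synapses, where each synapse row carries its own transmitter and weight. This schema and the query are illustrative assumptions, not Brainopy's actual schema or API:

```python
import sqlite3

# Sketch: neurons and weighted synapses (one weight per transmitter per
# connection) in SQLite; any neuron may connect to any other neuron.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE neuron (id INTEGER PRIMARY KEY, state REAL DEFAULT 0.0);
CREATE TABLE synapse (src INTEGER, dst INTEGER,
                      transmitter TEXT, weight REAL);
""")
con.executemany("INSERT INTO neuron (id) VALUES (?)", [(1,), (2,), (3,)])
con.executemany("INSERT INTO synapse VALUES (?, ?, ?, ?)",
                [(1, 3, "glutamate", 0.8),   # excitatory input to neuron 3
                 (2, 3, "GABA", -0.5)])      # inhibitory input to neuron 3

def total_input(dst):
    """Sum weighted presynaptic states over all transmitters for one neuron."""
    rows = con.execute("""SELECT s.weight, n.state
                          FROM synapse s JOIN neuron n ON n.id = s.src
                          WHERE s.dst = ?""", (dst,)).fetchall()
    return sum(weight * state for weight, state in rows)

con.execute("UPDATE neuron SET state = 1.0 WHERE id IN (1, 2)")
print(total_input(3))  # excitation minus inhibition
```

Keeping the network in a database file means a simulation's state survives restarts and very large networks need not fit in memory at once, at the cost of query overhead per update.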

[50] Ling, MHT, Musttakim, S, Lau, PN. 2023. Development of a Basic Chemistry Conversational Corpus. Acta Scientific Nutritional Health 7(2): 48-54.

Chatbot technology can be an important tool and supplement to education, leading to explorations in this area. Corpus-based chatbot building has a relatively low entry barrier as it only requires a relevant corpus to train a chatbot engine. The corpus is a set of human-readable questions and answers, and may be an amalgamation of existing corpora. However, a suitable chemistry chatbot corpus catering for a freshman general chemistry course addressing inorganic and physical chemistry has not been developed. In this study, we present a basic chemistry conversational corpus consisting of 998 question-and-answer pairs, focused on such a course. Ten human raters evaluated the responses of a chatbot trained on the corpus; their ratings suggest that the corpus resulted in better responses than random (t = 17.4, p-value = 1.86E-53). However, only 20 of the 50 test questions showed better responses compared to random (difference in mean score ≥ 1.9, paired t-test p-value ≤ 0.0324), suggesting that the corpus provides better responses to certain questions rather than overall better responses, with questions related to definitions and computational procedures answered more accurately. Hence, this provides a baseline for future corpus development.
