Text Processing Tool - mubashariq/textanalysistool GitHub Wiki

This wiki is part of a Master's thesis. It describes the software tool that was built to analyze research papers. To perform the text analysis, a web application was developed. It consists of several components that carry out specific tasks and meet the system's requirements, such as the integration of different tools and the technology stack, third-party services, and the database design.

Assumptions and Dependencies

The following assumptions and dependencies were identified for this tool:

Assumptions:

- The system will be delivered as a web tool.
- No hidden features will be built.
- After the analysis, it will be made publicly available on GitHub or Semantic MediaWiki for later use.
- The system will be accessed through an internet browser.
- The system will not collect any personal data during use.
- The system will allow exclusion criteria to be applied (listed in section - section name here).
- The system will allow a refinement process to be performed on the identified topics (listed in section - section name here).

Dependencies:

- The system will use PHP as its programming language.
- The system will use a MySQL database.
- The system will use the PDFParser library to extract data from PDF files.
- The system will interact with the TextRazor API through web services.
- The system will retrieve topics or concepts from design pattern research papers.
- The system will retrieve design pattern parameters from Semantic MediaWiki.

System Components

This system is a combination of components, each serving a specific function. A component can be a process, software, hardware, or any other part that helps build the features of this system. The components are described at a high level below:

[Figure: System components]

Data

Data is any information entered into the system as input. It also covers the output and results that the system produces after processing that input.

Hardware

A computer (desktop or laptop) together with its peripherals and communication devices, such as input, output, and storage devices.

Software

System software (the operating system), the editor used for writing programs, the web server, the database client, the browser, and so on, as well as the text analysis system itself. Its instructions define how data is taken in, processed, operated on, displayed, and stored.

People

The system professional(s) who built this system and the users who will use it to analyze data and perform different operations on it.

Procedure

A series of steps taken together to achieve a defined goal, to produce the expected, consistent results, or to carry out any other specific operation defined for the data.

Technology Stack

I built the tool myself. My expertise in the chosen technology stack meant that I did not have to depend on another professional or software developer to build it.

PHP

PHP (a recursive acronym for "PHP: Hypertext Preprocessor") is an open source, server-side scripting language designed for the web that can easily be embedded in HTML to create web pages. PHP is one of the five most popular programming languages overall and one of the most popular languages for web programming. There were two reasons for selecting PHP for this tool. First, it is easy to implement and maintain. Second, PHP is now rich enough that we are not limited to a few specific features, and it offers quality, a wide range of third-party services, a strong community, and community-driven documentation.

MySQL/phpMyAdmin

MySQL is a free and open source relational database management system (RDBMS). A few editions are paid and offer additional functionality, but they are not required for this system. MySQL is a central component of the open source web application software stack. A MySQL database is easy to install, maintain, and integrate with PHP-based applications. MySQL has a strong community and is now backed by Oracle Corporation.

XAMPP

XAMPP is one of the leading open source, cross-platform web server solutions. It consists of the Apache HTTP Server, a database, and interpreters for the PHP programming language. It is a free, simple, and lightweight Apache distribution that makes it extremely easy for developers to create a local web server for testing and deployment purposes (apachefriends link). XAMPP is easy to install and use on Windows and, above all, it provides both a database and a web server, which is why it was selected for building this system.

HTML/CSS

Hypertext Markup Language (HTML) is the markup language for building static web pages and web applications. Cascading Style Sheets (CSS) are used to style the pages, and JavaScript is used to add interactivity.

JavaScript/AJAX

JavaScript is the client-side programming language for making HTML-based web pages interactive. Ajax is a set of web development techniques that combine several client-side web technologies to create asynchronous web applications. I used them to interact with HTML elements and to make asynchronous calls to different parts of the tool.

Sublime Text Editor

Sublime Text is a sophisticated text editor for writing code. It has a simple interface, extraordinary features, and excellent performance.

PDFParser

PDFParser is used to extract text from the PDF files of design pattern research papers. It is a PHP library that uses TCPDF to extract data from a PDF file. Secured documents are currently not supported and the library is still under active development, but it provided the functionality that was needed.

TextRazor

TextRazor is a natural language processing API. It was used in this thesis to perform text analysis on design pattern research papers. TextRazor analyzes the content in depth to extract topics, topic relevance scores, entities, word relations, and so on.
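As a rough illustration of the request and response involved in topic extraction, the sketch below prepares a TextRazor REST request and filters returned topics by relevance score. It is written in Python for brevity rather than in the PHP used by the tool, and it is not the tool's actual code. The endpoint, the `X-TextRazor-Key` header, the `extractors` parameter, and the `label` and `score` fields of each topic reflect my understanding of TextRazor's public REST documentation and should be verified there. The helper names `build_request` and `relevant_topics`, the 0.5 threshold, and the sample response are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed TextRazor REST endpoint; verify against the current documentation.
TEXTRAZOR_URL = "https://api.textrazor.com/"


def build_request(text, api_key):
    """Prepare (but do not send) a POST request asking for topics and entities."""
    body = urllib.parse.urlencode(
        {"text": text, "extractors": "topics,entities"}
    ).encode("utf-8")
    return urllib.request.Request(
        TEXTRAZOR_URL,
        data=body,
        headers={"X-TextRazor-Key": api_key},
        method="POST",
    )


def relevant_topics(payload, min_score=0.5):
    """Return (label, score) pairs at or above min_score, highest score first."""
    topics = payload.get("response", {}).get("topics", [])
    kept = [
        (t["label"], t["score"])
        for t in topics
        if "label" in t and "score" in t and t["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical response, shaped like the JSON the API is assumed to return.
sample = json.loads("""
{"response": {"topics": [
    {"label": "Software design pattern", "score": 1.0},
    {"label": "Object-oriented programming", "score": 0.82},
    {"label": "Cooking", "score": 0.12}
]}}
""")

print(relevant_topics(sample))
```

Sending the prepared request with `urllib.request.urlopen` and passing the decoded JSON to `relevant_topics` would complete the round trip. A score threshold of this kind is one simple way to discard weakly related topics before storing the rest.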

Database Design

A relational database design was created beforehand. It served as the basis for a relational data model that determines which data is stored in the database and defines the relationships among the different data segments. This model was later used to write the data definition language (DDL) statements. It also captured the logical and physical design choices and the storage parameters needed to create the database.

[Figure: Database design]

Database Tables

A total of 5 database tables were created to gather and structure all necessary information.

Patterns

Contains all 47 design patterns listed on Semantic MediaWiki.

Research_papers_topics

Stores the research paper topics extracted through text analysis.

Parameters

Stores the design pattern parameters collected from Semantic MediaWiki.

Parameters_type

Contains information about parameter types, such as whether a parameter is related to family, etc.

Common_codes

Contains the common codes identified by comparing each design pattern with the other design patterns.
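To make the relationships between these tables more concrete, here is a minimal sketch of how such a schema could look. It uses Python with an in-memory SQLite database as a stand-in for MySQL, so it is not the DDL used by the tool. Only the five table names come from the list above. Every column, every foreign key, and the helper names `create_database` and `table_names` are assumptions made for illustration.

```python
import sqlite3

# Table names follow the list above; all columns and foreign keys
# are assumptions made for illustration only.
SCHEMA = """
CREATE TABLE patterns (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE research_papers_topics (
    id         INTEGER PRIMARY KEY,
    pattern_id INTEGER NOT NULL REFERENCES patterns(id),
    topic      TEXT NOT NULL,
    score      REAL
);
CREATE TABLE parameters_type (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE parameters (
    id         INTEGER PRIMARY KEY,
    pattern_id INTEGER NOT NULL REFERENCES patterns(id),
    type_id    INTEGER NOT NULL REFERENCES parameters_type(id),
    value      TEXT NOT NULL
);
CREATE TABLE common_codes (
    id               INTEGER PRIMARY KEY,
    pattern_id       INTEGER NOT NULL REFERENCES patterns(id),
    other_pattern_id INTEGER NOT NULL REFERENCES patterns(id),
    code             TEXT NOT NULL
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def table_names(conn):
    """List the tables in the database in alphabetical order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


conn = create_database()
print(table_names(conn))
```

In this sketch, topics, parameters, and common codes each point back to a row in `patterns`, so a row that refers to an unknown pattern is rejected by the database.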