Activities Summary: Anuv - petermr/CEVOpen GitHub Wiki
- Operating System: Ubuntu 20.04
- python3 --version: 3.9
- pip --version: 20.0.2
pygetpapers is a fetch tool written in Python, developed by Ayush Garg. It is used to fetch freely available scientific papers from select repositories.
To install pygetpapers, run: pip install pygetpapers
Check if pygetpapers is properly installed:
pygetpapers --help
On Ubuntu, the binaries are installed in ~/.local/bin by default. We can add this directory to the system path so that pygetpapers can be run from the console. To add it for the current shell session, execute:
export PATH="$HOME/.local/bin:$PATH"
ami is a sectioning tool written in Java created by Dr. Peter Murray-Rust. It is used to section a scientific paper into different sections according to their relative position in the document and their usage.
- Java
sudo apt install default-jre
To check if the software is successfully installed, run
java --version
- Maven
sudo apt install maven
After Java and Maven are installed, we clone the repository with git and build it.
git clone https://github.com/petermr/ami3.git
cd ami3
mvn install -Dmaven.test.skip=true
To add ami to system path execute the following command:
export PATH="$HOME/ami3/target/appassembler/bin:$PATH"
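After editing the path, it can help to confirm that the installed tools are actually discoverable. The following is a minimal sketch using Python's standard `shutil.which`; the helper name `find_missing_tools` is hypothetical, and `ami` as the launcher name is an assumption that should be checked against the scripts in ami3/target/appassembler/bin.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # "ami" is an assumed launcher name; adjust it to match the scripts
    # that the Maven build placed in ami3/target/appassembler/bin.
    missing = find_missing_tools(["pygetpapers", "java", "mvn", "ami"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH.")
```

Because `export` only affects the shell in which it is run, the script should be started from that same shell.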
20210916
git clone https://github.com/ShweataNHegde/scilitanalysis.git
Create a virtual environment by following the instructions here: Working with a virtual environment
Move into the cloned directory with cd scilitanalysis/scilitanalysis
Create a requirements file with the following data:
yake
scispacy
spacy
pygetpapers
bs4
https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz
Install the requirements with pip install -r requirements
20210914
Chemical reactions in scientific literature are commonly described in paragraph form. Reaction information encoded in unstructured paragraphs could be far more useful in a machine-readable structured format. Chemical Markup Language (CML) is an application of XML that provides a tagset for encoding chemical information, and it might be suitable for representing reactions found in the literature. Machines cannot simply read and understand a paragraph of plain text the way humans do, but with NLP we might be able to identify important and chemically relevant information in paragraphs and parse it into CML.
There is a vast repository of chemical information locked away in paragraphs of reaction descriptions in scientific literature. A chemist can easily decipher this information, but such a process cannot scale in time and cost when analysing large amounts of scientific literature. Having such information in CML would make analysis and use of chemistry and biochemistry literature scalable.
To identify the components of a paragraph rich in chemical reaction information and correctly encode the information in CML.
- We can get a sense of the structure of a reaction by looking for certain words or word groups.
- Look for words such as ‘reacts with’, ‘undergoes reaction’, ‘undergoes elimination’, ‘combusts’, etc. These words or phrases might indicate the presence of a chemical reaction and also tell us about the products and the type of reaction.
- A number followed by M (for example, 0.5M) indicates concentration.
- ‘Catalysed by’, ‘in presence of’ indicate catalysts and reaction conditions
- ‘At K’ and ‘atm’, ‘temperature’, ‘pressure’, ‘NTP’, etc. indicate reaction conditions.
- ‘Gives’ and ‘to form’ are usually followed by the reaction product.
- We can match words against a dictionary of chemical names to check whether they are valid compounds or elements.
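The cue-based heuristics above can be sketched with regular expressions. This is only an illustration under stated assumptions: the pattern lists are not exhaustive, ‘to give’ is added as a variant of ‘gives’, and the names `find_cues` and the pattern constants are hypothetical.

```python
import re

# Illustrative cue patterns based on the heuristics listed above.
REACTION_CUES = re.compile(
    r"\b(reacts with|undergoes (?:reaction|elimination)|combusts)\b",
    re.IGNORECASE,
)
# "to give" is an assumed variant of the listed cue "gives".
PRODUCT_CUES = re.compile(r"\b(gives|to give|to form)\b", re.IGNORECASE)
CATALYST_CUES = re.compile(r"\b(catalysed by|in presence of)\b", re.IGNORECASE)
# A number followed by M, K or atm; pressure may be a range such as 4-7atm.
CONCENTRATION = re.compile(r"\b\d+(?:\.\d+)?\s?M\b")
TEMPERATURE = re.compile(r"\b\d+(?:\.\d+)?\s?K\b")
PRESSURE = re.compile(r"\b\d+(?:\.\d+)?(?:\s?-\s?\d+(?:\.\d+)?)?\s?atm\b")


def find_cues(sentence):
    """Return the cue phrases and condition values found in one sentence."""
    return {
        "reaction": REACTION_CUES.findall(sentence),
        "product": PRODUCT_CUES.findall(sentence),
        "catalyst": CATALYST_CUES.findall(sentence),
        "concentration": CONCENTRATION.findall(sentence),
        "temperature": TEMPERATURE.findall(sentence),
        "pressure": PRESSURE.findall(sentence),
    }


if __name__ == "__main__":
    sample = ("Phenol reacts with NaOH and CO2 at 400K and 4-7atm "
              "to give Sodium Salicylate.")
    for label, hits in find_cues(sample).items():
        print(label, hits)
```

Pattern matching of this kind only flags candidate sentences; deciding which words are reactants or products would still need the dictionary lookup or an NLP model.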
Phenol reacts with NaOH and CO2 at 400K and 4-7atm to give Sodium Salicylate.
<reaction>
<reactant>
<formula>C6 H6 O</formula>
<name>Phenol</name>
</reactant>
<reactant>
<formula>Na O H</formula>
<name>Sodium Hydroxide</name>
</reactant>
<reactant>
<formula>C O2</formula>
<name>Carbon Dioxide</name>
</reactant>
<product>
<formula>C7 H5 Na O3</formula>
<name>Sodium Salicylate</name>
</product>
<reaction-conditions>
<temperature>400K</temperature>
<pressure>4-7atm</pressure>
</reaction-conditions>
</reaction>
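Once the components of a reaction have been identified, a structure like the one above can be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `build_reaction` is hypothetical, and the tag names simply mirror the simplified example above rather than claiming conformance to the formal CML schema.

```python
import xml.etree.ElementTree as ET


def build_reaction(reactants, products, conditions):
    """Build a <reaction> element from (formula, name) pairs and a conditions dict."""
    reaction = ET.Element("reaction")
    for tag, species in (("reactant", reactants), ("product", products)):
        for formula, name in species:
            node = ET.SubElement(reaction, tag)
            ET.SubElement(node, "formula").text = formula
            ET.SubElement(node, "name").text = name
    # Conditions become child elements named after the dict keys.
    cond = ET.SubElement(reaction, "reaction-conditions")
    for key, value in conditions.items():
        ET.SubElement(cond, key).text = value
    return reaction


reaction = build_reaction(
    reactants=[
        ("C6 H6 O", "Phenol"),
        ("Na O H", "Sodium Hydroxide"),
        ("C O2", "Carbon Dioxide"),
    ],
    products=[("C7 H5 Na O3", "Sodium Salicylate")],
    conditions={"temperature": "400K", "pressure": "4-7atm"},
)

if __name__ == "__main__":
    ET.indent(reaction)  # pretty-printing helper, available from Python 3.9
    print(ET.tostring(reaction, encoding="unicode"))
```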
- Identify passages containing description of a chemical reaction
- Convert molecules descriptions into CML
- Identify images depicting chemical molecules and reactions
- Convert chemical molecules or reactions presented as images into CML
- Encoding metabolic pathways as XML
20210915
Sometimes we may be using software that requires a specific version of a package, or we may need to run multiple programs that require conflicting package versions. For such cases, and for software development in general, it is useful to work in a virtual environment. When we activate a Python virtual environment, the packages available in that environment are independent of the packages installed on the system. As a result, it is often necessary to install commonly used packages in the virtual environment after creating it. You can create as many virtual environments as you want; typically you would create a separate virtual environment for every project.
python3 -m venv /path/to/virtual/environment
The path also includes the name of the virtual environment. For example, to create a virtual environment named 'scilit_venv' in /home/anuv/scilitanalysis/, I would run the command:
python3 -m venv /home/anuv/scilitanalysis/scilit_venv
source path/to/venv/bin/activate
You need to run this command every time you want to enter the virtual environment. Note that if you are not using bash, you can use the alternative activate files for specific shells. For example, if you are using the fish shell, use source venv/bin/activate.fish.
Continuing the above example, we can activate scilit_venv by running:
source /home/anuv/scilitanalysis/scilit_venv/bin/activate
To leave the virtual environment simply run:
deactivate
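To confirm from within Python whether a virtual environment is currently active, one option is to compare `sys.prefix` with `sys.base_prefix`, which differ only inside an environment created by `venv`. The helper name `in_virtual_environment` below is hypothetical; this is a minimal sketch, not part of the original workflow.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("interpreter prefix:", sys.prefix)
```

Running the script before and after activation should show the prefix switching between the system installation and the environment directory.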
This wiki has been continued here.