Coding Sessions: Meeting Record
Meeting Record 1
Date: June 1 2021
Participants: PMR, Shweata, Chaitanya, Sagar, Radhu
Key Points
- Changes were made to `ami_gui.py` and `search_lib.py` to add functionality to the pre-existing code.
- PMR talked about the importance of debugging and logging in Python.
- Different development methodologies were discussed, pair programming being the primary one. Test-driven development was also discussed briefly.
- `ami_gui.py` was tested with different ami dictionaries (such as plant activity and plant compound). A Python dictionary was created to store the data about hits/matches and where they occur in different ctrees and sections, which was then printed out. We used `icecream`, a Python package, to pretty-print this dictionary (see the sketch below).
- The next step would be to store the data on hits in `.xml` or `.json` format.
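A minimal sketch of the hit bookkeeping described above. The nested shape (ctree -> section -> matched terms) and the PMC ids/terms are invented for illustration, not the actual data:

```python
# Hypothetical hits dictionary: ctree id -> section -> matched dictionary terms.
from icecream import ic

hits = {
    "PMC8123456": {
        "introduction": ["citronellol"],
        "results": ["geraniol", "linalool"],
    },
    "PMC8234567": {"methods": ["limonene"]},
}

ic(hits)  # icecream pretty-prints the structure, prefixed with the variable name
```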
Immediate Tasks
- Radhu and others: smoke test the latest `ami_gui.py`
- Chaitanya: explore `pygetpapers`
Meeting Record 2
Date: June 7 2021
Attendees: PMR, Kanishka, Radhu, Sagar, Shweata, Bhavini, Chaitanya, Vasant
Key Points:
- Interns whose internships are about to end should focus on their deliverables and theses. They should also prepare a 2-3 minute video presentation explaining their mini project / literature review for interns who join us in the future (keep it simple, so that it can be understood easily by people from all domains).
- "Your research is only as important as your documentation of it." - PMR
- Initiated the process of preparing a workflow for the machine learning mini project, which aims to classify text, develop labels, and cluster similar scientific literature.
- We are potentially looking to do text classification at:
  - document level: how a given paper is similar to other papers
  - section level: how sections within a paper differ from each other based on the words used
- Paragraphs are well defined, both in EPMC and PDFs, and will be our basic unit.
- Standard machine learning methods will be used, since scientific literature is much more structured than some other text online, such as tweets (which might require deep learning models). Tools such as TF-IDF and a count vectorizer might potentially be used for our purposes (see the sketch after this list).
- Step 1 of the workflow: come up with a list of useful labels.
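A minimal sketch of the vectorization step mentioned above, using scikit-learn; the sample paragraphs are invented for illustration:

```python
# Turn paragraphs into feature vectors: raw counts, and counts reweighted by IDF.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

paragraphs = [
    "Essential oils were extracted by steam distillation of fresh leaves.",
    "Antibacterial activity was assayed against Staphylococcus aureus.",
]

counts = CountVectorizer().fit_transform(paragraphs)  # sparse term-count matrix
tfidf = TfidfVectorizer().fit_transform(paragraphs)   # TF-IDF weighted matrix
print(tfidf.shape)  # (n_paragraphs, vocabulary_size)
```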
TODO:
- Improve documentation of `pyami`, with emphasis on debugging.
- Jupyter notebooks can come in handy for the machine learning project; one merit is ease of packaging.
Meeting Record 3
Date: June 16 2021
Participants: Shweata, Bhavini, Ayush, Radhu, Sagar, PMR
Key Points:
- Code review of Shweata's ethics statements named entity recognition mini project, which uses spaCy. Debugging and logging of the code was performed, and docstrings were added to improve documentation.
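A minimal spaCy NER sketch of the kind of extraction involved; the model name and sentence are assumptions, not the project's actual inputs (the model must be installed first with `python -m spacy download en_core_web_sm`):

```python
# Run spaCy's small English pipeline and list the named entities it finds.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The study was approved by the Institutional Review Board of Example University.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Example University" tagged ORG
```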
Date: June 23, 2021
Agenda
PLEASE SUBMIT REQUESTS FOR ITEMS
- These can be:
- things you have done (e.g. systems to present)
- code reviews
- major problems
- discussions of style and conformance
- Please support these with Wiki pages, Github code, etc.
- Possible items (please indicate whether you wish to present); arbitrary order:
  - pygetpapers (@Ayush Garg)
  - `pyami` config.ini files and strategy (@Peter Murray-Rust); these will provide symbolic names for:
    - dictionaries
    - projects
    - support files (e.g. stopwords)
  - These will form the basis of `pyami` software and require individual users to use config files that they can configure to point to the projects and dictionaries (see the config sketch after this list).
  - Review of ML software (@Chaitanya Sharma). Please include what data will be input and what ancillary files are needed.
  - Review of data display/analysis (@Bhavini Malhotra). Please include what data will be input and what ancillary files are needed.
  - Bug tracking
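A hypothetical sketch of symbolic names in a `config.ini`, read with Python's standard `configparser`; the section and key names are illustrative assumptions, not pyami's actual schema:

```python
# ExtendedInterpolation resolves ${SECTION:key} references, so one symbolic
# "home" can anchor dictionaries, projects, and support files.
import configparser

INI = """
[DIRS]
home = /Users/me/projects

[DICTIONARIES]
plant_compound = ${DIRS:home}/dictionaries/plant_compound.xml

[PROJECTS]
oil186 = ${DIRS:home}/corpora/oil186

[FILES]
stopwords = ${DIRS:home}/support/stopwords.txt
"""

config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read_string(INI)
print(config["DICTIONARIES"]["plant_compound"])  # fully expanded path
```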
Participants: PMR, Ayush, Bhavini, Chaitanya, Radhu, Sagar, Shweata
Key Points:
- Data visualization and historically significant infographics. We discussed Florence Nightingale's Rose Graph and watched a video on climate change data over the past 50 years. The importance of visual information was discussed as a mode of communication and understanding.
- Explored `.ini` files in `pyami`.
- Reviewed the ML mini project. TODO: text preprocessing.
- Tried fixing Radhu and Sagar's duplicate synonyms problem.
- Demonstration of pygetpapers Crossref support and use cases.
Date: 2021-06-30
Participants: PMR, Shweata, Ayush
- Ayush and Shweata gave an update on their Ethics Statement work. They have come up with a prototype feedback loop: looking for ethics committees and key phrases in labelled sections, and using them to filter unlabelled ethics statements. They also spent a large chunk of time debugging their code. They had to rethink their logic of deriving the dictionary key by splitting the file path, which resulted in getting only the last paragraph of each section - that was the problem. We resolved it by changing the dictionary key to the section part of the path, independent of the user's working directory (a sketch of the fix follows).
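A hypothetical illustration of the dictionary-key fix described above; the directory layout (a `sections` folder inside each ctree) is an assumption:

```python
# Key each paragraph by its path below the ctree, not by the absolute path,
# so the key is unique per section and independent of the working directory.
from pathlib import Path

def section_key(file_path: str) -> str:
    parts = Path(file_path).parts
    return "/".join(parts[parts.index("sections"):])

print(section_key("/home/user/corpus/PMC812345/sections/ethics/para_3.xml"))
# -> sections/ethics/para_3.xml
```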
- Comments from PMR in relation to the Ethics Statement project:
  - A linear workflow doesn't always work. As the project grows, it gets complex; there will be more looping and branching.
  - There are different levels that we are working with: Project -> CTree -> sections -> paragraphs -> sentences -> words. It's important, at each step, to know what you are working with.
  - KNIME, a workflow visualization tool, could be employed.
- We then moved on to `pyami`. The points we mainly discussed were:
  - config.ini
  - symbols
  - loggers
- We also discussed what makes our project unique: what is it that users, 10 years down the line, would use `pygetpapers` and `pyami` for? The answers that came up were:
  - `pygetpapers` can act as a single point of entry for multiple repositories, thanks to its modular format. It acts as a wrapper for all of what's out there.
  - The whole `pygetpapers` and `pyami` system gives power to the readers.
  - Even if people were to reinvent the wheels of the functionality `pyami` has, the data structure (the C-project structure) is something that will last, because it is based on strong ideas from Linux. (Data Structure + Algorithm) = Program
- PMR also mentioned being invited to speak at the INYAS mid-year meeting. His primary focus for the talk would be open science - tools for liberating science in the context of the global south - and the role of young scientists. The details will evolve over the week with active contributions from the team.
Date: 2021-07-07
Participants: PMR, Shweata, Ayush, Chaitanya
Key Points:
- We discussed the use of lxml to tackle the problem of parsing child tags and mixed content in XML. To tackle immediate problems faced by Shweata, we decided to create a list of tags in Python to omit from the flattened text file, so as to remove noise (see the sketch below). Challenges include: loss of information when parsing tags such as italic, sub (subscript), and sup (superscript).
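A minimal lxml sketch of dropping noisy inline tags from mixed content; the tag list and sample XML are illustrative assumptions. Note how it also shows the information-loss challenge: the superscript -1 becomes a plain -1:

```python
# strip_tags removes the markup but keeps the text, merging it into the parent.
from lxml import etree

xml = "<p>Yield was 4.2 mg g<sup>-1</sup> of <italic>Mentha</italic> leaf.</p>"
root = etree.fromstring(xml)

etree.strip_tags(root, "italic", "sup", "sub")
print(etree.tostring(root, encoding="unicode"))
# -> <p>Yield was 4.2 mg g-1 of Mentha leaf.</p>
```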
- Sentence splitter and symbolic filenames using underscores. Details in the words of PMR:
TEXT PROCESSING TOOLS
- Some of us now need to carry out transformations on file content, and it will be useful to share our experience of what works and an indication of how to use it. These thoughts should go on a Tools wiki page...
- I am going to create semantic filenames - where the filename gives some of the history. This is risky but I think it will work.
- Each file has a final extension of the type :
- .txt
- .xml
- .png
- .jpg
- .html (more will come)
- Then each action creates a history tag in the filename. For example, a PDF (fulltext.pdf) converted to text and split into sections might be: methods.pd.tx.sc.txt. This signifies that:
  - the file started as PDF (pd)
  - it was converted to text (tx)
  - it was split into (text) sections (sc)
  - the section gave a hint that it should be labelled methods
- So when someone comes to read this file they have a rough idea what they are looking at. A small sketch of this naming scheme follows.
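A hypothetical helper for building such semantic filenames; the tag codes follow PMR's example above, but the function itself is an assumption, not existing pyami code:

```python
# Join a stem, a list of two-letter history tags, and a final extension.
def with_history(stem: str, tags: list[str], ext: str) -> str:
    """Build a filename whose dotted history tags record each processing step."""
    return ".".join([stem, *tags, ext])

# fulltext.pdf -> converted to text -> split into sections -> labelled "methods"
print(with_history("methods", ["pd", "tx", "sc"], "txt"))  # methods.pd.tx.sc.txt
```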
NER and the Ethics Statement project
- Shweata discussed the potential use of textacy for topic modelling and sentence similarity in the ethics statement project. Shweata further discussed the use of fuzzywuzzy for string matching (see the sketch below).
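A minimal fuzzywuzzy sketch; the two phrases are invented examples of the kind of ethics-statement variants being matched:

```python
# token_set_ratio scores similarity on shared word sets, so it is robust
# to word order and extra words.
from fuzzywuzzy import fuzz

a = "approved by the institutional ethics committee"
b = "Institutional Ethics Committee approval was obtained"
print(fuzz.token_set_ratio(a, b))  # high score despite different word order
```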
- Chaitanya talked about using Hugging Face to improve the results of named entity recognition, followed by a brief discussion about the usage of state-of-the-art technologies such as transformers and BERT. This is a common engineering problem: an engineer often has to decide between different methods to solve the same problem. One method might be better than another, but if it disrupts the project structure/workflow, it is often a challenge for engineering teams and leads to technical debt if not implemented correctly. Factors such as timeframe, ease of use, and understanding of the technology should be taken into consideration before switching to a new model/technology to achieve slightly better results. The product should do the job / solve the problem without over-engineering.
Restructuring of `pyami` and creating a separate GitHub repository
- We want to create a separate repository for `pyami`; this would also accelerate the cleaning and documentation of the openDiagram repository.
Web Scraping of biorxiv
- https://www.biorxiv.org/tdm
- You have to pay AWS for this. I think our strategy should be:
- to retrieve the hit list through the public API.
- to have a parser for this list which extracts links
- to be able to download these links, probably in PDF
- to analyze the PDF (@Peter Murray-Rust is doing this)
- This toolkit will be fairly generic. We need a hitlist parser-downloader. I wrote one for biorxiv in Java. We'll discuss this. It should be a lot easier to do it in Python. Since publishers all use this mechanism (though with different syntaxes), this will be a generic hitlist-scraper. It's also worth looking at quickscrape and seeing whether that architecture is good for (say) `pyquickscrape`. A hedged sketch of the hit-list retrieval step follows.
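A minimal sketch of retrieving the hit list through the public API; the endpoint shape follows bioRxiv's published details API (server/interval/cursor), but treat the URL pattern and field names as assumptions, and the helper name is hypothetical:

```python
# Fetch one page of the bioRxiv public details API and list DOIs/titles.
import requests

def fetch_hitlist(server: str, interval: str, cursor: int = 0) -> list:
    """Retrieve one page of hits from the bioRxiv public API."""
    url = f"https://api.biorxiv.org/details/{server}/{interval}/{cursor}"
    return requests.get(url, timeout=30).json().get("collection", [])

for hit in fetch_hitlist("biorxiv", "2021-06-01/2021-06-07")[:3]:
    print(hit["doi"], hit["title"])
```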
Date: 2021-07-14
Participants: PMR, Shweata, Ayush, Chaitanya, Bhavini
Key Points:
- Started off with the code review of `pygetpapers`. Discussed the importance of conformant code, and the use of functions and classes for repeated building blocks of code.
- As we move to a very crucial phase in the project, we are refactoring a lot of code and rearranging repositories (cleaning and proper documentation). The team discussed Python package project structure, seeking inspiration from this article.
- Creating `pyami`: in the words of PMR
- I'm starting to create a new project (https://github.com/petermr/pyami) which will: use best Python practices. I may need help on this (e.g. from @Ayush Garg). Please critique me - if you don't understand what I'm doing or think it's wrong, say so.
- remove the cruft from openDiagram. This will just be pyami at present - a workflow tool, with relatively few addons (e.g. image processing)
- build minimal data resources, primarily for testing and tutorial. (No 100+ corpora).
- use unit-tests and integration tests. At present I'm thinking of pytest
- use a virtual environment (venv)
- try to use a python package hierarchy and avoid import problems.
- use docstrings, pylint and pyment. readthedocs.
- package pyami for pip install
- create requirements.txt (and try to limit the dependencies, maybe with helper packages).
- use continuous integration (CI) - automated testing - on github.
- use version numbers
- use branching in git
- Pull Requests
- setup.py
- MANIFEST.in
- .gitignore
- CONTRIBUTING.md, CODE_OF_CONDUCT, ISSUE_TEMPLATE
- Issue annotation (priorities, status)
- code coverage
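A minimal setup.py sketch matching the checklist above; the package name is from the project, but the version, dependency handling, and Python floor are illustrative placeholders, not the actual packaging config:

```python
# Package metadata for a pip-installable distribution; reads dependencies
# from requirements.txt as the checklist suggests.
from setuptools import setup, find_packages

setup(
    name="pyami",
    version="0.0.1",                       # use version numbers
    packages=find_packages(),              # python package hierarchy
    install_requires=open("requirements.txt").read().splitlines(),
    python_requires=">=3.7",
)
```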
- Created a new repository called docanalysis. Shweata, Ayush, and Chaitanya should document sentence-level semantic classification of sections in this repository.
- Two-week timeline to get all the software running, including geotagging (@Bhavini), the ethics statements project, and sections classification for the INYAS presentation.
- The importance of `config.ini` files was discussed, along with logging in all software for ease of debugging (a minimal logging sketch follows).
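A minimal sketch of the standard-library logging setup being recommended; the logger name and format string are illustrative assumptions:

```python
# Configure a root handler once, then get named loggers per module.
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
logger = logging.getLogger("pyami")
logger.debug("searching dictionary %s", "plant_compound")
```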
Date: 2021-07-21
Participants: PMR, Shweata, Ayush, Chaitanya, Sagar
Key Points:
- Shweata presented her progress with the phytochemical atlas and [interactive map](https://github.com/petermr/dictionary/blob/main/interactive_map/geotext_geopy_map_oil186.ipynb). There was a long and useful discussion about creating a dictionary of human settlements from Wikidata (potential size: 500,000 terms), and about the need to optimize the code and extract more sentences using tools other than geotext.
- Chaitanya discussed the use of naive Bayes classification with scikit-learn (see the sketch below), how to tackle faulty JATS labelling of scientific articles by publishers using XPath, and fine-tuning of feature names in acknowledgment statements.
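A minimal scikit-learn naive Bayes sketch of the section-classification idea; the training sentences and labels are invented for illustration:

```python
# Pipeline: TF-IDF features into a multinomial naive Bayes classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "We thank the funding agency for support.",
    "Oil was extracted by hydrodistillation.",
]
labels = ["acknowledgment", "method"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
print(clf.predict(["The authors thank Dr. X for assistance."]))  # ['acknowledgment']
```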
- Review of `pygetpapers` usage and documentation with Ayush; `pyami` config.ini files and symbols.
- Sagar discussed the problem of quotation marks in the secondary metabolite dictionary.