# Miniproject: Machine Learning
1. We created sections using ami3, which look like this:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<ack>
  <title>Acknowledgments</title>
  <p>The authors are grateful to CNPq-Programa “Ciências sem fronteiras” (Grant No. 233761/2014-4) for financial support.</p>
</ack>
```
- Sections in a substantial number of published scholarly documents aren't labelled with a universally controlled vocabulary; labelling conventions vary across journals/publications. For example, from PMCID: PMC6015887:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<fn-group>
  <fn fn-type="financial-disclosure">
    <p>
      <bold>Funding.</bold> This work was supported by the Scientific Research Fund Project of Science and Technology Department of Sichuan Province (Grant Nos. 2016NZ0105, 2017NZ0039, and 2018NZ0010).
    </p>
  </fn>
</fn-group>
```
- Read more about JATS
- We want to improve our knowledge resource by clustering similar articles together on a paragraph or section basis. For example, if unsupervised learning shows that "gas chromatography" is a frequently used phrase, we can use it as a label to group together other articles that mention gas chromatography. This involves manually agreeing on the labels, which can be extracted from the methods sections using unsupervised clustering methods (see the sketch below).
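A minimal sketch of that idea, assuming the section texts extracted by ami3 have already been read into a Python list called `paragraphs` (a hypothetical variable name): tf-idf vectors are clustered with k-means, and the highest-weighted terms of each cluster centre are printed as candidate labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# paragraphs: list of paragraph/section texts extracted by ami3 (assumed to exist)
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(paragraphs)

# cluster the paragraphs; the number of clusters is an arbitrary starting point
km = KMeans(n_clusters=8, random_state=0)
km.fit(X)

# the top terms of each cluster centre are candidate labels,
# e.g. a cluster dominated by "gas chromatography" vocabulary
terms = vectorizer.get_feature_names_out()
for i, centre in enumerate(km.cluster_centers_):
    top_terms = [terms[j] for j in centre.argsort()[::-1][:10]]
    print(f"cluster {i}: {', '.join(top_terms)}")
```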
- Extract common themes from acknowledgment statements. At the same time, build binary classifiers that can automate the process reliably ('Acknowledgments' is a very broad term, and ambiguities may arise even during inter-annotator agreement while creating gold standards). This PLOS ONE article describes in depth the complexity of the problems mentioned above.
- We plan to extract keywords and phrases with NLTK RAKE and pke, using both supervised and unsupervised learning methods. We create bag-of-words and tf-idf representations of the entire corpus of acknowledgment statements, and manually agree on the features to be used for phrase matching (a RAKE sketch follows below).
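As a hedged illustration of the RAKE part only (not the project pipeline), rake-nltk can score candidate phrases in a single acknowledgment statement; the input here is the example sentence quoted earlier on this page:

```python
import nltk
from rake_nltk import Rake

# RAKE relies on NLTK's stopword list and sentence tokenizer
nltk.download("stopwords")
nltk.download("punkt")

text = (
    "The authors are grateful to CNPq-Programa Ciências sem fronteiras "
    "(Grant No. 233761/2014-4) for financial support."
)

rake = Rake()
rake.extract_keywords_from_text(text)

# ranked (score, phrase) pairs, highest-scoring first
for score, phrase in rake.get_ranked_phrases_with_scores()[:10]:
    print(round(score, 2), phrase)
```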
- Create a robust open-access knowledge resource for invasive plant species and aromatic plants.
- This project is entirely experimental. We want to work with different tools and libraries in Python and discover which tools serve our purpose best. We will start with classic machine learning models, e.g. SVMs and Bayesian classification. Only if we aren't able to achieve the results we desire will we experiment with state-of-the-art technologies such as transformers, LSTMs, etc. Here's a brief list of libraries that could be used in future (a topic-modelling sketch follows the list):
- Scikit-learn clustering models https://scikit-learn.org/stable/modules/clustering.html
- gensim https://pypi.org/project/gensim/
- CountVectorizer https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/
- tf-idf https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- LDA https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
- cosine similarity https://en.wikipedia.org/wiki/Cosine_similarity
- spaCy https://spacy.io/models#conventions
- Jupyter Notebook for ease of collaboration, documentation and packaging.
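To show how one of the listed libraries might be applied, here is a short, assumption-laden sketch of LDA topic modelling with gensim; `docs` is a hypothetical list of acknowledgment (or methods) paragraphs, and the number of topics is arbitrary:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# docs: list of raw paragraph strings (assumed to exist); a real run would use
# proper tokenisation and stopword removal rather than a bare split()
tokenised = [doc.lower().split() for doc in docs]

dictionary = Dictionary(tokenised)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenised]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
               passes=10, random_state=0)

# each topic is a weighted mixture of words; the top words suggest a theme
for topic_id, words in lda.print_topics(num_words=6):
    print(topic_id, words)
```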
- Extract features (frequently occurring words) using unsupervised k-means clustering, followed by keyphrase extraction using the pke library (see the sketch below).
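A hedged sketch of the pke step, using its TopicRank model as one possible choice; the input text is an invented placeholder, not project data:

```python
import pke

# TopicRank is one of pke's unsupervised keyphrase extraction models
extractor = pke.unsupervised.TopicRank()

# load_document accepts raw text (or a file path); spaCy is used internally
extractor.load_document(
    input="Essential oils were analysed by gas chromatography "
          "coupled with mass spectrometry.",
    language="en",
)

extractor.candidate_selection()   # select keyphrase candidates
extractor.candidate_weighting()   # weight them with the TopicRank graph

# best candidates as (keyphrase, score) pairs
for keyphrase, score in extractor.get_n_best(n=5):
    print(round(score, 3), keyphrase)
```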
- Use the obtained features and keyphrases to build an ami XML dictionary and implement semi-supervised sentence-level search using spaCy's PhraseMatcher, using the weight and number of hits per sentence to retrieve unlabelled acknowledgments (a PhraseMatcher sketch follows the demo.py listing below). This can be done using docanalysis. Basic instructions:
- Change the glob function in extract_entities.py according to your searching and recursion needs:
```python
all_paragraphs = glob(os.path.join(
    output, '*', 'sections', '**', '*method*', '[1-9]_p.xml'), recursive=True)
```
- Specify the corpus to use for sentence-level matching and information retrieval. Also provide terms.xml, an ami XML dictionary, to search for the given phrases in the documents at both paragraph and sentence level. Use Excel to manipulate entities.csv, which docanalysis writes to the working directory. The following script is demo.py:
```python
import os
from docanalysis import DocAnalysis

ethic_statement_creator = DocAnalysis()

# search the oil186 corpus for the phrases listed in the ami XML dictionary
dict_for_entities = ethic_statement_creator.extract_entities_from_papers(
    CORPUS_PATH=os.path.join(
        os.getcwd(), "corpus", "oil186",
    ),
    TERMS_XML_PATH=os.path.join(
        os.getcwd(), "ethics_dictionary", "acknowledgment_feature_names.xml"
    ),
    removefalse=False
)

# pull out organisation (ORG) entities and write them to a text file
list_with_orgs = ethic_statement_creator.extract_particular_fields(
    dict_for_entities, 'ORG')
with open('org.text', 'w') as f:
    f.write(str(list_with_orgs))

# pull out geopolitical (GPE) entities and write them to a text file
list_with_gpe = ethic_statement_creator.extract_particular_fields(
    dict_for_entities, 'GPE')
with open('GPE.text', 'w') as f:
    f.write(str(list_with_gpe))
```
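For readers who want to see the sentence-level matching idea without docanalysis, here is a minimal sketch using spaCy's PhraseMatcher to count dictionary hits per sentence; the term list and the input text are invented placeholders, not the project dictionary:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# placeholder terms; in the project these would come from the ami XML dictionary
terms = ["gas chromatography", "essential oil", "financial support"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("TERMS", [nlp.make_doc(term) for term in terms])

doc = nlp("The essential oil was analysed by gas chromatography. "
          "The authors thank CNPq for financial support.")

# count how many dictionary phrases hit each sentence
hits_per_sentence = {}
for match_id, start, end in matcher(doc):
    sentence = doc[start:end].sent.text
    hits_per_sentence[sentence] = hits_per_sentence.get(sentence, 0) + 1

for sentence, hits in hits_per_sentence.items():
    print(hits, sentence)
```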
- Build a training dataset (also known as the gold standard) by manually labelling statements using human intelligence and inference, giving a binary classification dataset with Acknow vs Not_Acknow labels.
- Build classifiers using support vector machines, multinomial naive Bayes (MultinomialNB), random forest, logistic regression and k-nearest neighbours (a sketch follows). The code for the same can be found here.
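A sketch of how such a comparison could be run with scikit-learn, assuming `texts` and `labels` hold the gold-standard statements and their Acknow/Not_Acknow labels (hypothetical variable names); the mean cross-validated accuracy per model is collected into a pandas Series like the one reported further down:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# texts: list of statements; labels: "Acknow" / "Not_Acknow" (assumed to exist)
X = TfidfVectorizer(stop_words="english").fit_transform(texts)

models = [
    KNeighborsClassifier(),
    LogisticRegression(max_iter=1000),
    MultinomialNB(),
    RandomForestClassifier(n_estimators=200, random_state=0),
    LinearSVC(),
]

rows = []
for model in models:
    scores = cross_val_score(model, X, labels, cv=5, scoring="accuracy")
    rows.extend((model.__class__.__name__, score) for score in scores)

cv_df = pd.DataFrame(rows, columns=["model_name", "accuracy"])
print(cv_df.groupby("model_name").accuracy.mean())
```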
- Cross-validate the classifier by testing the model on external data (from the oil1000 corpus).
- Obtained a scatter plot of dimensionality-reduced features using seaborn (see the sketch below). Data points for the different categories were separable, overlapping only in a few cases.
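A minimal sketch of how such a plot could be produced, reusing the tf-idf matrix `X` and `labels` from the classifier sketch above; TruncatedSVD is used here simply because it handles sparse matrices and may differ from the reduction actually used:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# project the sparse tf-idf features onto two components for plotting
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

sns.scatterplot(x=coords[:, 0], y=coords[:, 1], hue=labels)
plt.title("Acknowledgment statements after dimensionality reduction")
plt.show()
```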
- Performed metric evaluation using different classical machine learning models. Accuracy scores obtained for each model were as follows:
| model_name | accuracy |
| --- | --- |
| KNeighborsClassifier | 0.836013 |
| LogisticRegression | 0.910713 |
| MultinomialNB | 0.985391 |
| RandomForestClassifier | 0.866756 |
- Obtained a confusion matrix to evaluate the model for false negatives and false positives (a sketch follows).
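A sketch of the confusion-matrix evaluation, again reusing `X` and `labels` from above and picking MultinomialNB only because it scored highest in the table; the split parameters are arbitrary:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB

# hold out a test set, fit the classifier, and inspect its errors
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

clf = MultinomialNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```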