Parsers for English - doraithodla/notes GitHub Wiki
Is this turning into a learning log?
Did a little bit of work but mostly reading, with lots more reading left. Two things I learned about are context-free grammars (CFGs) and probabilistic context-free grammars (PCFGs). I need to study both in much more detail, but a general picture is emerging of how text turns into knowledge.
Here is my understanding.
- Text is tokenized
- Text/tokens are parsed based on rules driven by CFGs and PCFGs.
- A parse tree is constructed
- The parsing process depends on previously tagged data, which helps in labeling the elements of the tree.
- The original text is annotated with POS (part-of-speech) tags.
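The first two steps above can be sketched in a few lines. This is a toy illustration, not a real tagger: the lexicon is an invented stand-in for the prior tagged data, whereas real taggers are trained on corpora such as the Penn Treebank.

```python
import re

# Invented toy lexicon standing in for prior tagged data.
LEXICON = {
    "the": "DT", "a": "DT",
    "dog": "NN", "cat": "NN", "park": "NN",
    "saw": "VBD", "chased": "VBD",
    "in": "IN",
}

def tokenize(text):
    """Step 1: split raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def pos_tag(tokens):
    """Step 2: tag each token with a part of speech from the lexicon."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]

print(pos_tag(tokenize("The dog chased a cat")))
```

The tagged sequence is what a CFG- or PCFG-driven parser would then assemble into a parse tree.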
There must be a way to verify the accuracy of the data tagged this way. How is it done? The usual supervised machine learning technique? Human verification?
The role of deep learning
- Automatic feature extraction
- Prioritization of features
Feature extraction can benefit from removing information that is irrelevant to the domain, e.g. punctuation removal, stopword removal, and case folding.
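These three cleanup steps can be sketched as below. The stopword list here is a small invented one; real pipelines use larger, domain-tuned lists (for example NLTK's stopwords corpus or spaCy's defaults).

```python
import string

# Small illustrative stopword set; real lists are much larger.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and"}

def preprocess(text):
    text = text.lower()                                          # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]      # drop stopwords

print(preprocess("The cat, of course, is on the mat!"))
```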
- Todo - Step by step conversion of raw text to POS tagged text.
- Todo - Step by step conversion of POS tagged text to Named Entities and Relationships
- Question - How is the relationship extracted? Vectorization?
- Todo: Explore how spaCy's parsers relate to CFGs and PCFGs
Statistical Parsing
- Statistical parsing uses a probabilistic model of syntax to assign a probability to each parse tree.
- Provides a principled approach to resolving syntactic ambiguity.
- Allows supervised learning of parsers from tree-banks of parse trees provided by human linguists.
- Also allows unsupervised learning of parsers from unannotated text, but the accuracy of such parsers has been limited.
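The ambiguity-resolution idea can be shown with a classic PP-attachment example. A tree's probability is the product of the probabilities of the rules used to build it; the grammar fragment and its probabilities below are invented for illustration (a real PCFG estimates them from a treebank).

```python
# Toy PCFG fragment with invented rule probabilities.
RULE_PROB = {
    ("VP", ("V", "NP", "PP")): 0.2,   # PP attaches to the verb
    ("VP", ("V", "NP")): 0.5,
    ("NP", ("NP", "PP")): 0.3,        # PP attaches to the noun phrase
    ("NP", ("DT", "NN")): 0.6,
    ("PP", ("P", "NP")): 1.0,
}

def tree_prob(node):
    """node = (label, word) for a leaf, or (label, [children]) otherwise."""
    label, body = node
    if isinstance(body, str):
        return 1.0  # lexical probabilities folded into preterminals here
    p = RULE_PROB[(label, tuple(child[0] for child in body))]
    for child in body:
        p *= tree_prob(child)
    return p

np_man = ("NP", [("DT", "the"), ("NN", "man")])
pp = ("PP", [("P", "with"), ("NP", [("DT", "the"), ("NN", "telescope")])])

# Two parses of the ambiguous "saw the man with the telescope":
verb_attach = ("VP", [("V", "saw"), np_man, pp])            # seeing with a telescope
noun_attach = ("VP", [("V", "saw"), ("NP", [np_man, pp])])  # the man has a telescope

print(tree_prob(verb_attach))  # 0.2 * 0.6 * 1.0 * 0.6
print(tree_prob(noun_attach))  # 0.5 * 0.3 * 0.6 * 1.0 * 0.6
```

Under these made-up probabilities the verb-attachment reading scores higher, so the parser would prefer it; different rule probabilities would flip the preference.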
Three useful PCFG Tasks
- Observation likelihood: To classify and order sentences.
- Most likely derivation: To determine the most likely parse tree for a sentence.
- Maximum likelihood training: To train a PCFG to fit empirical training data.
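The most-likely-derivation task can be computed with probabilistic CKY when the grammar is in Chomsky normal form. A minimal sketch with a toy grammar (all rules and probabilities are invented for illustration):

```python
# Probabilistic CKY over a toy PCFG in Chomsky normal form.
LEX = {  # preterminal rules: A -> word, with probability
    ("DT", "the"): 1.0,
    ("NN", "dog"): 0.5, ("NN", "cat"): 0.5,
    ("V", "chased"): 1.0,
}
BIN = {  # binary rules: A -> B C, with probability
    ("NP", "DT", "NN"): 1.0,
    ("VP", "V", "NP"): 1.0,
    ("S", "NP", "VP"): 1.0,
}

def cky(words):
    n = len(words)
    # best[i][j][A] = (prob, backpointer) for the best A spanning words[i:j]
    best = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (A, word), p in LEX.items():
            if word == w:
                best[i][i + 1][A] = (p, w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (A, B, C), p in BIN.items():
                    if B in best[i][k] and C in best[k][j]:
                        prob = p * best[i][k][B][0] * best[k][j][C][0]
                        if prob > best[i][j].get(A, (0,))[0]:
                            best[i][j][A] = (prob, (k, B, C))
    return best

def build(best, i, j, A):
    """Follow backpointers to recover the most likely parse tree."""
    _, bp = best[i][j][A]
    if isinstance(bp, str):
        return (A, bp)
    k, B, C = bp
    return (A, build(best, i, k, B), build(best, k, j, C))

words = "the dog chased the cat".split()
chart = cky(words)
prob, _ = chart[0][len(words)]["S"]
print(build(chart, 0, len(words), "S"), prob)
```

The chart stores the best probability for each nonterminal over each span, so the most likely derivation is recovered in one pass of backpointer-following. The other two tasks differ only in the chart operation: observation likelihood sums over derivations (the inside algorithm) instead of maximizing.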