Parsers for English - doraithodla/notes GitHub Wiki
Is this turning into a learning log?
Did a little bit of work but mostly reading, with lots more reading left. Two things I learned about are context-free grammars (CFGs) and probabilistic context-free grammars (PCFGs). I need to study both in much more detail, but a general picture is emerging of how text turns into knowledge.
Here is my understanding.
- Text is tokenized
- Text/tokens are parsed based on rules driven by CFGs and PCFGs.
- A parse tree is constructed
- The parsing process depends on previously tagged data, which helps in labeling the elements of the tree.
- The original text is annotated with POS (part-of-speech) tags.
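The first two steps above can be sketched in a few lines. This is a toy illustration, not a real tagger: the lexicon is an invented stand-in for the prior tagged data, whereas real taggers are trained on corpora such as the Penn Treebank.

```python
import re

# Invented toy lexicon standing in for prior tagged data.
LEXICON = {
    "the": "DT", "a": "DT",
    "dog": "NN", "cat": "NN", "park": "NN",
    "saw": "VBD", "chased": "VBD",
    "in": "IN",
}

def tokenize(text):
    """Step 1: split raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def pos_tag(tokens):
    """Step 2: tag each token with a part of speech from the lexicon."""
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]

print(pos_tag(tokenize("The dog chased a cat")))
```

The tagged sequence is what a CFG- or PCFG-driven parser would then assemble into a parse tree.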
There must be a way to verify the accuracy of the data tagged this way. How is it done? The usual supervised machine learning technique? Human verification?
The role of deep learning
- Automatic feature extraction
- Prioritization of features
Feature extraction can benefit from removing information that is irrelevant to the domain, e.g. punctuation removal, stopword removal, and case folding.
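These three cleanup steps can be sketched as below. The stopword list here is a small invented one; real pipelines use larger, domain-tuned lists (for example NLTK's stopwords corpus or spaCy's defaults).

```python
import string

# Small illustrative stopword set; real lists are much larger.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and"}

def preprocess(text):
    text = text.lower()                                          # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]      # drop stopwords

print(preprocess("The cat, of course, is on the mat!"))
```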
- Todo - Step by step conversion of raw text to POS tagged text.
- Todo - Step by step conversion of POS tagged text to Named Entities and Relationships
- Question - How is the relationship extracted? Vectorization?
- Todo: Explore how spaCy's parsers relate to CFGs and PCFGs
Statistical Parsing
- Statistical parsing uses a probabilistic model of syntax to assign a probability to each parse tree.
- Provides a principled approach to resolving syntactic ambiguity.
- Allows supervised learning of parsers from tree-banks of parse trees provided by human linguists.
- Also allows unsupervised learning of parsers from unannotated text, but the accuracy of such parsers has been limited.
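The ambiguity-resolution idea can be shown with a classic PP-attachment example. A tree's probability is the product of the probabilities of the rules used to build it; the grammar fragment and its probabilities below are invented for illustration (a real PCFG estimates them from a treebank).

```python
# Toy PCFG fragment with invented rule probabilities.
RULE_PROB = {
    ("VP", ("V", "NP", "PP")): 0.2,   # PP attaches to the verb
    ("VP", ("V", "NP")): 0.5,
    ("NP", ("NP", "PP")): 0.3,        # PP attaches to the noun phrase
    ("NP", ("DT", "NN")): 0.6,
    ("PP", ("P", "NP")): 1.0,
}

def tree_prob(node):
    """node = (label, word) for a leaf, or (label, [children]) otherwise."""
    label, body = node
    if isinstance(body, str):
        return 1.0  # lexical probabilities folded into preterminals here
    p = RULE_PROB[(label, tuple(child[0] for child in body))]
    for child in body:
        p *= tree_prob(child)
    return p

np_man = ("NP", [("DT", "the"), ("NN", "man")])
pp = ("PP", [("P", "with"), ("NP", [("DT", "the"), ("NN", "telescope")])])

# Two parses of the ambiguous "saw the man with the telescope":
verb_attach = ("VP", [("V", "saw"), np_man, pp])            # seeing with a telescope
noun_attach = ("VP", [("V", "saw"), ("NP", [np_man, pp])])  # the man has a telescope

print(tree_prob(verb_attach))  # 0.2 * 0.6 * 1.0 * 0.6
print(tree_prob(noun_attach))  # 0.5 * 0.3 * 0.6 * 1.0 * 0.6
```

Under these made-up probabilities the verb-attachment reading scores higher, so the parser would prefer it; different rule probabilities would flip the preference.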
Three useful PCFG Tasks
- Observation likelihood: To classify and order sentences.
- Most likely derivation: To determine the most likely parse tree for a sentence.
- Maximum likelihood training: To train a PCFG to fit empirical training data.
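The most-likely-derivation task can be computed with probabilistic CKY when the grammar is in Chomsky normal form. A minimal sketch with a toy grammar (all rules and probabilities are invented for illustration):

```python
# Probabilistic CKY over a toy PCFG in Chomsky normal form.
LEX = {  # preterminal rules: A -> word, with probability
    ("DT", "the"): 1.0,
    ("NN", "dog"): 0.5, ("NN", "cat"): 0.5,
    ("V", "chased"): 1.0,
}
BIN = {  # binary rules: A -> B C, with probability
    ("NP", "DT", "NN"): 1.0,
    ("VP", "V", "NP"): 1.0,
    ("S", "NP", "VP"): 1.0,
}

def cky(words):
    n = len(words)
    # best[i][j][A] = (prob, backpointer) for the best A spanning words[i:j]
    best = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (A, word), p in LEX.items():
            if word == w:
                best[i][i + 1][A] = (p, w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (A, B, C), p in BIN.items():
                    if B in best[i][k] and C in best[k][j]:
                        prob = p * best[i][k][B][0] * best[k][j][C][0]
                        if prob > best[i][j].get(A, (0,))[0]:
                            best[i][j][A] = (prob, (k, B, C))
    return best

def build(best, i, j, A):
    """Follow backpointers to recover the most likely parse tree."""
    _, bp = best[i][j][A]
    if isinstance(bp, str):
        return (A, bp)
    k, B, C = bp
    return (A, build(best, i, k, B), build(best, k, j, C))

words = "the dog chased the cat".split()
chart = cky(words)
prob, _ = chart[0][len(words)]["S"]
print(build(chart, 0, len(words), "S"), prob)
```

The chart stores the best probability for each nonterminal over each span, so the most likely derivation is recovered in one pass of backpointer-following. The other two tasks differ only in the chart operation: observation likelihood sums over derivations (the inside algorithm) instead of maximizing.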