NLP questions
1. Why TF-IDF?
TF-IDF is an information-retrieval weighting scheme that combines a term's frequency in a document (TF) with its inverse document frequency across the collection (IDF). Put simply, a term gets a high TF-IDF score when it occurs often in a document but is rare in the corpus as a whole.
The TF*IDF score weighs a keyword in a piece of content and assigns importance to it based on the number of times it appears in the document. Just as importantly, it checks how common the keyword is across the whole document collection, which is referred to as the corpus. The IDF (inverse document frequency) of a word measures how distinctive that term is in the corpus: terms that appear in almost every document carry little information and are down-weighted. A common formulation is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.
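As a concrete illustration, here is a minimal sketch using scikit-learn's TfidfVectorizer (assuming scikit-learn is installed; the three toy documents are made up for illustration, and scikit-learn applies a smoothed variant of the IDF formula above):

```python
# Minimal TF-IDF sketch with scikit-learn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on a mat",
    "the dog sat on a log",
    "cats and dogs make good pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Weights of the terms in the first document: terms shared across documents
# (e.g. 'the', 'sat', 'on') score lower than document-specific terms
# (e.g. 'cat', 'mat').
for term, col in sorted(vectorizer.vocabulary_.items()):
    weight = tfidf[0, col]
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```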
2. Shallow parsing vs Dependency parsing
Shallow parsing
Chunking or shallow parsing segments a sentence into a sequence of syntactic constituents or chunks, i.e. sequences of adjacent words grouped on the basis of linguistic properties (Abney, 1996).
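As a small illustration, a regular-expression chunker in NLTK can group PoS-tagged words into noun-phrase chunks (a sketch assuming nltk is installed along with its 'punkt' tokenizer and perceptron-tagger resources; the grammar is deliberately a toy one):

```python
# Shallow parsing (chunking) sketch with NLTK's RegexpParser.
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
tagged = nltk.pos_tag(tokens)  # chunking operates on PoS-tagged tokens

# Toy grammar: a noun phrase (NP) is an optional determiner, any number
# of adjectives, then a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)
print(tree)  # flat tree with NP chunks such as (NP the/DT lazy/JJ dog/NN)
```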
Dependency parsing
Syntactic parsing is the task of analyzing a sentence and assigning a syntactic structure to it. One widely used structure is the parse tree, which can be generated by a parsing algorithm; dependency parsing instead represents the sentence as a set of directed, labeled head-dependent relations between words. These structures are useful in applications like grammar checking and, more importantly, play a critical role in the semantic analysis stage. For example, to answer the question "Who is the point guard for the LA Lakers in the next game?", we need to identify its subject, objects and attributes so we can tell that the user wants the point guard of the LA Lakers specifically for the next game.
Syntactic parsing is quite complex because a given sentence can have multiple parse trees, a problem known as ambiguity. Consider the sentence "Book that flight.", which can form multiple parse trees based on its ambiguous part-of-speech tags ('book' can be a noun or an imperative verb) unless those ambiguities are resolved.
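A quick way to see dependency relations in practice is spaCy's pretrained parser (a sketch assuming spaCy and its small English model en_core_web_sm are installed):

```python
# Dependency-parsing sketch with spaCy's pretrained English pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Who is the point guard for the LA Lakers in the next game?")

# Every token is attached to a syntactic head via a labeled dependency arc;
# the subject, objects and modifiers fall out of these labels.
for token in doc:
    print(f"{token.text:10} --{token.dep_}--> {token.head.text}")
```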
3. Character N-gram or Word N-gram
First of all, what is an n-gram?
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
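Since n-grams are just sliding windows over a sequence, a small dependency-free sketch makes the definition concrete (the helper function ngrams here is ours, not a library call):

```python
def ngrams(items, n):
    """Return all contiguous subsequences of length n."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "buy now and save"
print(ngrams(sentence.split(), 2))  # word bigrams: ('buy', 'now'), ('now', 'and'), ...
print(ngrams(list("buy"), 2))       # character bigrams: ('b', 'u'), ('u', 'y')
```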
Word-based n-grams
Word-based text representations require language-dependent tools, such as a tokenizer (to split the message into tokens) and usually a lemmatizer plus a list of stop words (to reduce the dimensionality of the problem). Word n-grams, i.e. contiguous sequences of n words, have also been examined. Such approaches attempt to exploit contextual phrasal information (e.g., 'buy now') that distinguishes spam from legitimate messages. However, word n-grams considerably increase the dimensionality of the problem, and the results so far are not encouraging.
Advantages of character-based n-grams
The bag of character n-grams representation is language-independent and does not require any text pre-processing (tokenizer, lemmatizer, or other ‘deep’ NLP tools). It has already been used in several tasks including language identification, authorship attribution, and topic-based text categorization with remarkable results in comparison to word-based representations.
An important characteristic of character-level n-grams is that they avoid (at least to a great extent) the sparse-data problem that arises when using word-level n-grams. That is, there are far fewer possible character combinations than word combinations, so fewer n-grams will have zero frequency.
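To compare the two feature spaces side by side, scikit-learn's CountVectorizer can extract either word or character n-grams from the same texts (a sketch assuming scikit-learn is installed; the two messages are toy data, so the gap in vocabulary size only becomes dramatic on a real corpus):

```python
# Word n-grams vs. character n-grams with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

messages = ["buy now, limited offer", "see you at the meeting tomorrow"]

word_bigrams = CountVectorizer(analyzer="word", ngram_range=(2, 2)).fit(messages)
char_trigrams = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(messages)

# 'char_wb' extracts character n-grams only inside word boundaries and
# needs no language-specific tokenizer or lemmatizer.
print("word bigram features:", len(word_bigrams.vocabulary_))
print("char trigram features:", len(char_trigrams.vocabulary_))
```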
4. POS tagging (Part of Speech tagging)
A part-of-speech (PoS) tagger is a software tool that labels words as one of several categories to identify the word's function in a given language. In the English language, words fall into one of eight or nine parts of speech. Part-of-speech categories include noun, verb, article, adjective, preposition, pronoun, adverb, conjunction and interjection.
PoS taggers use algorithms to label terms in text bodies. They often assign categories finer-grained than the basic parts of speech, with tags such as "noun-plural" or even more specific labels. PoS taggers categorize terms by their position in a phrase, their relationship with nearby terms, and the word's definition. Broadly, taggers fall into two families: stochastic (probability-based) and rule-based.
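For a quick look at tagger output, NLTK ships a pretrained perceptron tagger (a sketch assuming nltk plus its 'punkt' and 'averaged_perceptron_tagger' resources are downloaded):

```python
# Off-the-shelf PoS tagging with NLTK's pretrained perceptron tagger.
import nltk

tokens = nltk.word_tokenize("Book that flight.")
print(nltk.pos_tag(tokens))
# Pairs each token with a Penn Treebank tag, e.g. ('that', 'DT');
# 'Book' illustrates the noun-vs-imperative-verb ambiguity from section 2.
```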
How to build a POS tagger? Use supervised learning on an already-labeled corpus with an agreed tagset. Sequence models such as HMMs or RNNs are commonly used.
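As a sketch of the supervised route, NLTK's HMM trainer can be fit on the bundled Penn Treebank sample (assuming nltk and its 'treebank' corpus are downloaded; the 3000-sentence split is arbitrary):

```python
# Supervised training of an HMM PoS tagger on the NLTK treebank sample.
import nltk
from nltk.tag import hmm

tagged_sents = list(nltk.corpus.treebank.tagged_sents())
train_sents = tagged_sents[:3000]  # labeled (word, tag) sequences

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)

# Note: the default unsmoothed estimator degrades on unseen words;
# real systems add smoothing or back-off.
print(tagger.tag("Book that flight .".split()))
```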
Which algorithm is better for a POS tagger, a Hidden Markov Model or an SVM? The HMM-based TnT tagger obtains lower accuracy than the SVM-based SVMTool tagger (see the results at "POS Tagging (State of the art)"). Rule-based and HMM approaches cannot easily handle many overlapping features, which motivated SVM-based taggers: they are efficient, portable, scalable, and trainable. Support vector machines are a supervised learning method with good performance and generalization.
Further reading on Conditional Random Fields, another popular sequence-labeling approach: http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/