[Text Mining][Week 6-1] Parts of Speech - mingoori0512/minggori GitHub Wiki
Distribution (= vector): review
- Words that appear in similar contexts have similar representations (and similar meanings, by the distributional hypothesis)
Parts of speech
- Parts of speech are categories of words defined distributionally by the morphological and syntactic contexts a word appears in.
Morphological distribution
- POS classes are often defined by distributional properties; e.g., verbs are the class of words that each combine with the same set of affixes. "Distributional" here means defined by the contexts a word appears in, not by its meaning.
- Words are grouped under the base (lemma) form of the verb or noun.
- Supplement: The distributional hypothesis suggests that the more semantically similar two words are, the more distributionally similar they will be in turn, and thus the more that they will tend to occur in similar linguistic contexts.
- Supplement: Computational analyses of child-directed speech have shown that distributional information (information about how words pattern with one another in sentences) could be a useful source of initial category information.
- We can look to the function of the affix (denoting past tense) to include irregular inflections: irregular forms must be captured as well (e.g., "ate" marks past tense without the regular -ed affix).
Syntactic distribution
- Substitution test: if a word is replaced by another word, does the sentence remain grammatical?
- These tests can often be too strict; some contexts admit substitutability for some pairs but not others, e.g., both verbs but transitive vs. intransitive, or both nouns but common vs. proper.
Open class
Nouns: People, places, things, actions-made-nouns ("I like swimming"). Inflected for singular/plural.
Verbs: Actions, processes. Inflected for tense, aspect, number, person.
Adjectives: Properties, qualities. Usually modify nouns.
Adverbs: Qualify the manner of verbs ("She ran downhill extremely quickly yesterday").
Closed class
Determiner: Mark the beginning of a noun phrase ("a dog")
Pronouns: Refer to a noun phrase (he, she, it)
Prepositions: Indicate spatial/temporal relationships (on the table)
Conjunctions: Conjoin two phrases, clauses, sentences (and, or)
- OOV (out of vocabulary)? Guess noun (word2vec cannot produce a vector for an unseen word; when an unknown word appears, guessing noun gives the best chance of being correct).
POS tagging
: Labeling the tag that's correct for the context.
- Even tokens with the same vector (the same word form) can have different POS tags in different contexts.
- "Classification": deciding which of the possible tags applies in this context.
- State of the art (SOTA): the best-performing model for a given task.
- Baseline: most frequent class = 92.34%
- Token accuracy: ~97% (English news)
- Optimistic: includes punctuation and words with only one possible tag (deterministic tagging)
- Substantial drop across domains (e.g., train on news, test on literature): performance depends on domain characteristics
- Whole-sentence accuracy: 55% (the sketch below contrasts token and whole-sentence accuracy)
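
The gap between 97% token accuracy and 55% sentence accuracy follows from the fact that a single wrong token makes the whole sentence wrong. A minimal sketch of the two metrics, on invented toy data (not from the lecture):

```python
# Token-level vs. whole-sentence accuracy for POS tag predictions (toy data).
def token_accuracy(gold_sents, pred_sents):
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for g, p in zip(gold, pred):
            correct += (g == p)
            total += 1
    return correct / total

def sentence_accuracy(gold_sents, pred_sents):
    # A sentence counts as correct only if every tag in it is correct.
    right = sum(gold == pred for gold, pred in zip(gold_sents, pred_sents))
    return right / len(gold_sents)

gold = [["DT", "NN", "VBZ"], ["PRP", "VBD", "RB"]]
pred = [["DT", "NN", "VBZ"], ["PRP", "VBD", "JJ"]]
print(token_accuracy(gold, pred))     # 5/6 ≈ 0.83
print(sentence_accuracy(gold, pred))  # 1/2 = 0.5
```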
Why is part-of-speech tagging useful? It helps in understanding the structure of language.
POS is indicative of syntax.
POS is indicative of pronunciation: in speech recognition, the stress/tone of a word can differ when it is used as a verb vs. a noun.
Tagsets
Datasets exist in which humans have tagged sentences by part of speech:
- Penn Treebank
- Universal Dependencies
- Twitter POS
Verbs
VB: base form
VBD: past tense
VBG: present participle
VBP: present (non-3rd-sing)
VBZ: present (3rd-sing)
MD: modal verbs
Nouns
NN: non-proper, singular or mass
NNS: non-proper, plural
NNP: proper, singular
NNPS: proper, plural
DT (Determiner/Article)
Articles are grammatical function words that carry little content for learning; they contribute little to the surrounding context (stopwords), so they are sometimes removed before training.
- Articles (a, the, every, no)
- Indefinite determiners (another, any, some, each)
- That, these, this, those when preceding a noun
- All, both when not preceding another determiner or possessive pronoun
JJ (Adjectives)
- General adjectives (happy person, new mail)
- Ordinal numbers (fourth person)
RB (Adverb)
- Most words that end in -ly
- Degree words (quite, too, very)
- Negative markers: not, n't, never
IN (preposition, subordinating conjunction)
- All prepositions (except to) and subordinating conjunctions ("He jumped on the table because he was excited")
POS tagging
"classification" task로 볼 수도 있음
Labeling the tag that's correct for the context.(Just tags in evidence within the Penn Treebank-more are possible!)
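
As a quick illustration (not from the lecture), NLTK's off-the-shelf tagger assigns Penn Treebank tags; this sketch assumes the `punkt` and `averaged_perceptron_tagger` resources can be downloaded:

```python
import nltk

# Download the tokenizer and tagger models on first use.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Time flies like an arrow")
print(nltk.pos_tag(tokens))
# e.g. [('Time', 'NNP'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN')]
```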
Sequence Labeling
- Classic: HMM (Hidden Markov Model), MEMM (Maximum Entropy Markov Model), CRF (Conditional Random Field); these are rarely used anymore.
- Neural (2015~): RNN, CNN, Transformer.
- x = {x_1, ..., x_n}: the words; y = {y_1, ..., y_n}: the POS (or NER) tags.
- For a set of inputs x with n sequential time steps, there is one corresponding label y_i for each x_i.
- Modeling approach: probability-based models -> neural network models (see the sketch below).
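
A minimal neural sequence-labeling sketch, assuming PyTorch (the class name `BiLSTMTagger` and all sizes are illustrative, not from the lecture): the model emits one tag-score vector y_i per input token x_i.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(emb)         # (batch, seq_len, 2 * hidden_dim)
        return self.out(hidden)            # one tag-score vector per time step

model = BiLSTMTagger(vocab_size=10000, tagset_size=45)
scores = model(torch.randint(0, 10000, (1, 6)))  # 6 tokens -> 6 tag distributions
print(scores.shape)                              # torch.Size([1, 6, 45])
```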
Named entity recognition
- Named entities: roughly, nouns distinctive enough to have a Wikipedia entry.
- In "tim cook is the ceo of apple", "apple" and "tim cook" share the common property of being named entities (see the example below).
- Beyond POS, NER further distinguishes whether a noun refers to, e.g., a place or a person.
- 3- or 4-class: person, location, organization, (misc); MISC = miscellaneous
- 7-class: person, location, organization, time, money, percent, date
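
A hedged example of off-the-shelf NER with spaCy's small English pipeline (assumes `python -m spacy download en_core_web_sm` has been run; the exact labels may vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook is the CEO of Apple")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Tim Cook PERSON" and "Apple ORG"
```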
Supersense tagging
(POS, NER, and supersense tagging can all be used together.)
The noun supersense categories (26 in WordNet): person, communication, artifact, act, group, food, cognition, possession, location, substance, state, time, attribute, object, process, Tops, phenomenon, event, quantity, motive, animal, body, feeling, shape, plant, relation (a WordNet lookup sketch follows).
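
The categories above correspond to WordNet's noun lexicographer files ("lexnames"), which give one way to look up a word's supersense; a small sketch using NLTK's WordNet interface (assumes the `wordnet` corpus is installed, and only looks at each word's first noun sense):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

for word in ["pizza", "teacher", "river"]:
    supersense = wn.synsets(word, pos=wn.NOUN)[0].lexname()
    print(word, supersense)
# e.g. pizza -> noun.food, teacher -> noun.person, river -> noun.object
```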
POS tagging training data
- Wall Street Journal (~1M tokens, 45 tags, English)
- Universal Dependencies (universal dependency treebanks for many languages; common POS tags for all)
Majority class
- Pick the label each word is seen most with in the training data (i.e., predict each word's most frequently observed tag); cf. Hidden Markov Models, which additionally model tag-to-tag transitions. A baseline sketch follows.
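
A minimal sketch of this baseline on toy data (not the lecture's code): tag each word with the label it co-occurred with most often in training, and back off to NN (noun) for OOV words, as suggested above.

```python
from collections import Counter, defaultdict

def train_majority(tagged_sents):
    # Count how often each word appears with each tag.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # Keep only the most frequent tag per word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_majority(words, word_to_tag, default="NN"):
    # Unknown words fall back to the noun tag, the largest open class.
    return [word_to_tag.get(w, default) for w in words]

train = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
         [("the", "DT"), ("runs", "VBZ")],
         [("two", "CD"), ("runs", "NNS")]]
model = train_majority(train)
print(tag_majority(["the", "runs", "quokka"], model))  # ['DT', 'VBZ', 'NN']
```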
Naive Bayes
- Treat each prediction as independent of the others; computed with conditional probabilities, i.e., choose the tag maximizing P(y)P(x|y) for each token (a toy sketch follows).
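
A toy per-token Naive Bayes decision (the numbers and tag set are invented for illustration): choose the tag y maximizing the prior P(y) times the class-conditional P(word|y).

```python
import math

prior = {"NN": 0.4, "VB": 0.3, "DT": 0.3}                  # P(y), toy numbers
likelihood = {("book", "NN"): 0.02, ("book", "VB"): 0.005,
              ("book", "DT"): 1e-6}                        # P(word | y), toy numbers

def nb_tag(word):
    # argmax_y log P(y) + log P(word | y); unseen pairs get a small floor value.
    return max(prior, key=lambda y: math.log(prior[y])
               + math.log(likelihood.get((word, y), 1e-8)))

print(nb_tag("book"))  # 'NN', since 0.4 * 0.02 > 0.3 * 0.005 > 0.3 * 1e-6
```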
Logistic Regression (as classification)
- Treat each prediction as independent of the others, but condition on a much more expressive set of features.
Discriminative features
- Many kinds of information can go into the input x.
- Features are scoped over the entire observed input, so they can also encode context such as word order (see the feature sketch below).
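
A sketch of what "features scoped over the entire observed input" might look like for a logistic-regression tagger (the feature names are illustrative assumptions, not from the lecture): each token's feature dictionary can include its own form, its suffix and capitalization, and its neighbors.

```python
def features(words, i):
    w = words[i]
    return {
        "word=" + w.lower(): 1,
        "suffix3=" + w[-3:]: 1,                      # morphological cue (e.g., -ing, -ly)
        "is_capitalized": int(w[0].isupper()),
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1,
        "next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>"): 1,
    }

print(features(["She", "ran", "downhill"], 1))
# {'word=ran': 1, 'suffix3=ran': 1, 'is_capitalized': 0, 'prev_word=she': 1, 'next_word=downhill': 1}
```

Dictionaries like these could then be vectorized and fed to an ordinary classifier (e.g., scikit-learn's DictVectorizer with LogisticRegression).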
Sequences: how do we define a probability distribution over data that has an order?
- Models that make independent predictions for elements in a sequence can reason over expressive representations of the input x (including correlations among inputs at different time steps x_i and x_j).
- But they don't capture another important source of information: correlations in the labels y.
- Using the relationships among the three elements x_{i-1}, x_i, and y_i, each label y_i can be predicted.
- This amounts to modeling the probability of what comes next in the sequence.
- Most common tag bigrams in the Penn Treebank training data ("data set"): these probabilities are precomputed from counts (see the sketch below).
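
A sketch of precomputing those statistics by counting tag bigrams, using NLTK's small bundled sample of the Penn Treebank as an assumed stand-in for the full training set:

```python
import nltk
from collections import Counter

nltk.download("treebank", quiet=True)

bigram_counts = Counter()
for sent in nltk.corpus.treebank.tagged_sents():
    tags = ["<s>"] + [tag for _, tag in sent]   # prepend a start-of-sentence marker
    bigram_counts.update(zip(tags, tags[1:]))

print(bigram_counts.most_common(5))
# e.g. pairs like ('NNP', 'NNP'), ('DT', 'NN'), ('IN', 'DT') near the top
```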
Generative vs. Discriminative models
- Generative models (which can generate data) specify a joint distribution over the labels and the data; with this you could generate new data (HMM, GAN, Naive Bayes): P(x,y) = P(y)P(x|y)
- Discriminative models (which learn the boundary y = f(x), e.g., via a softmax) specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes (Logistic Regression, RNN with a softmax output): P(y|x). A toy numeric check relating the two follows.
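
A toy numeric check (numbers invented for illustration) that the generative decomposition P(x,y) = P(y)P(x|y) recovers the discriminative quantity P(y|x) via Bayes' rule:

```python
# Toy distributions for a single word x = "run" over two candidate tags.
p_y = {"NN": 0.6, "VB": 0.4}            # prior P(y)
p_x_given_y = {"NN": 0.01, "VB": 0.03}  # P(x = "run" | y)

joint = {y: p_y[y] * p_x_given_y[y] for y in p_y}   # P(x, y) = P(y) P(x | y)
p_x = sum(joint.values())                            # P(x) = sum over y of P(x, y)
posterior = {y: joint[y] / p_x for y in joint}       # P(y | x) by Bayes' rule
print(posterior)  # {'NN': 0.333..., 'VB': 0.666...}
```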