[Text Mining][Week 6-1] Parts of Speech - mingoori0512/minggori GitHub Wiki
Distribution (= vector): review
- Words that appear in similar contexts have similar representations (and similar meanings, by the distributional hypothesis)
Parts of speech
- Parts of speech are categories of words defined distributionally by the morphological and syntactic contexts a word appears in.
Morphological distribution
- POS classes are often defined by distributional properties; e.g., verbs are the class of words that each combine with the same set of affixes. "Distributional" here means defined by the contexts a word appears in, not by its meaning.
- Words are grouped under the base (lemma) form of the verb or noun.
- Supplement: The distributional hypothesis suggests that the more semantically similar two words are, the more distributionally similar they will be in turn, and thus the more that they will tend to occur in similar linguistic contexts.
- Supplement: Computational analyses of child-directed speech have shown that distributional information (information about how words pattern with one another in sentences) could be a useful source of initial category information.
- We can look to the function of the affix (denoting past tense) to include irregular inflections: irregular forms must be captured as well (e.g., "ate" marks past tense without the regular -ed affix).
Syntactic distribution
- Substitution test: if a word is replaced by another word, does the sentence remain grammatical?
- These tests can often be too strict; some contexts admit substitutability for some pairs but not others, e.g., both verbs but transitive vs. intransitive, or both nouns but common vs. proper.
Open class
Nouns: People, places, things, actions-made-nouns ("I like swimming"). Inflected for singular/plural.
Verbs: Actions, processes. Inflected for tense, aspect, number, person.
Adjectives: Properties, qualities. Usually modify nouns.
Adverbs: Qualify the manner of verbs ("She ran downhill extremely quickly yesterday").
Closed class
Determiner: Mark the beginning of a noun phrase ("a dog")
Pronouns: Refer to a noun phrase (he, she, it)
Prepositions: Indicate spatial/temporal relationships (on the table)
Conjunctions: Conjoin two phrases, clauses, sentences (and, or)
- OOV (out of vocabulary)? Guess noun (word2vec cannot produce a vector for an unseen word; when an unknown word appears, guessing noun gives the best chance of being correct).
POS tagging
: Labeling the tag that's correct for the context.
- Even tokens with the same vector (the same word form) can have different POS tags in different contexts.
- "Classification": deciding which of the possible tags applies in this context.
- State of the art (SOTA): the best-performing model for a given task.
- Baseline: most frequent class = 92.34%
- Token accuracy: ~97% (English news)
- Optimistic: includes punctuation and words with only one possible tag (deterministic tagging)
- Substantial drop across domains (e.g., train on news, test on literature): performance depends on domain characteristics
- Whole-sentence accuracy: 55% (the sketch below contrasts token and whole-sentence accuracy)
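
The gap between 97% token accuracy and 55% sentence accuracy follows from the fact that a single wrong token makes the whole sentence wrong. A minimal sketch of the two metrics, on invented toy data (not from the lecture):

```python
# Token-level vs. whole-sentence accuracy for POS tag predictions (toy data).
def token_accuracy(gold_sents, pred_sents):
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for g, p in zip(gold, pred):
            correct += (g == p)
            total += 1
    return correct / total

def sentence_accuracy(gold_sents, pred_sents):
    # A sentence counts as correct only if every tag in it is correct.
    right = sum(gold == pred for gold, pred in zip(gold_sents, pred_sents))
    return right / len(gold_sents)

gold = [["DT", "NN", "VBZ"], ["PRP", "VBD", "RB"]]
pred = [["DT", "NN", "VBZ"], ["PRP", "VBD", "JJ"]]
print(token_accuracy(gold, pred))     # 5/6 ≈ 0.83
print(sentence_accuracy(gold, pred))  # 1/2 = 0.5
```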
Why is part-of-speech tagging useful? It helps in understanding the structure of language.
POS is indicative of syntax.
POS is indicative of pronunciation: in speech recognition, the stress/tone of a word can differ when it is used as a verb vs. a noun.
Tagsets
Datasets exist in which humans have tagged sentences by part of speech:
- Penn Treebank
- Universal Dependencies
- Twitter POS
Verbs
VB: base form
VBD: past tense
VBG: present participle
VBP: present (non-3rd-sing)
VBZ: present (3rd-sing)
MD: modal verbs
Nouns
NN: non-proper, singular or mass
NNS: non-proper, plural
NNP: proper, singular
NNPS: proper, plural
DT (Determiner/Article)
Articles are grammatical function words that carry little content for learning; they contribute little to the surrounding context (stopwords), so they are sometimes removed before training.
- Articles (a, the, every, no)
- Indefinite determiners (another, any, some, each)
- That, these, this, those when preceding a noun
- All, both when not preceding another determiner or possessive pronoun
JJ (Adjectives)
- General adjectives (happy person, new mail)
- Ordinal numbers (fourth person)
RB (Adverb)
- Most words that end in -ly
- Degree words (quite, too, very)
- Negative markers: not, n't, never
IN (preposition, subordinating conjunction)
- All prepositions (except to) and subordinating conjunctions ("He jumped on the table because he was excited")
POS tagging
"classification" task로 볼 수도 있음
Labeling the tag that's correct for the context.(Just tags in evidence within the Penn Treebank-more are possible!)
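
As a quick illustration (not from the lecture), NLTK's off-the-shelf tagger assigns Penn Treebank tags; this sketch assumes the `punkt` and `averaged_perceptron_tagger` resources can be downloaded:

```python
import nltk

# Download the tokenizer and tagger models on first use.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Time flies like an arrow")
print(nltk.pos_tag(tokens))
# e.g. [('Time', 'NNP'), ('flies', 'VBZ'), ('like', 'IN'), ('an', 'DT'), ('arrow', 'NN')]
```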
Sequence Labeling
- Classic: HMM (Hidden Markov Model), MEMM (Maximum Entropy Markov Model), CRF (Conditional Random Field); these are rarely used anymore.
- Neural (2015~): RNN, CNN, Transformer.
- x = {x_1, ..., x_n}: the words; y = {y_1, ..., y_n}: the POS (or NER) tags.
- For a set of inputs x with n sequential time steps, there is one corresponding label y_i for each x_i.
- Modeling approach: probability-based models -> neural network models (see the sketch below).
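
A minimal neural sequence-labeling sketch, assuming PyTorch (the class name `BiLSTMTagger` and all sizes are illustrative, not from the lecture): the model emits one tag-score vector y_i per input token x_i.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(emb)         # (batch, seq_len, 2 * hidden_dim)
        return self.out(hidden)            # one tag-score vector per time step

model = BiLSTMTagger(vocab_size=10000, tagset_size=45)
scores = model(torch.randint(0, 10000, (1, 6)))  # 6 tokens -> 6 tag distributions
print(scores.shape)                              # torch.Size([1, 6, 45])
```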
Named entity recognition
- Named entities: roughly, nouns distinctive enough to have a Wikipedia entry.
- In "tim cook is the ceo of apple", "apple" and "tim cook" share the common property of being named entities (see the example below).
- Beyond POS, NER further distinguishes whether a noun refers to, e.g., a place or a person.
- 3- or 4-class: person, location, organization, (misc); MISC = miscellaneous
- 7-class: person, location, organization, time, money, percent, date
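
A hedged example of off-the-shelf NER with spaCy's small English pipeline (assumes `python -m spacy download en_core_web_sm` has been run; the exact labels may vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook is the CEO of Apple")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Tim Cook PERSON" and "Apple ORG"
```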
Supersense tagging
(POS, NER, and supersense tagging can all be used together.)
The noun supersense categories (26 in WordNet): person, communication, artifact, act, group, food, cognition, possession, location, substance, state, time, attribute, object, process, Tops, phenomenon, event, quantity, motive, animal, body, feeling, shape, plant, relation (a WordNet lookup sketch follows).
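
The categories above correspond to WordNet's noun lexicographer files ("lexnames"), which give one way to look up a word's supersense; a small sketch using NLTK's WordNet interface (assumes the `wordnet` corpus is installed, and only looks at each word's first noun sense):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

for word in ["pizza", "teacher", "river"]:
    supersense = wn.synsets(word, pos=wn.NOUN)[0].lexname()
    print(word, supersense)
# e.g. pizza -> noun.food, teacher -> noun.person, river -> noun.object
```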
POS tagging training data
- Wall Street Journal (~1M tokens, 45 tags, English)
- Universal Dependencies (universal dependency treebanks for many languages; common POS tags for all)
Majority class
- Pick the label each word is seen most with in the training data (i.e., predict each word's most frequently observed tag); cf. Hidden Markov Models, which additionally model tag-to-tag transitions. A baseline sketch follows.
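
A minimal sketch of this baseline on toy data (not the lecture's code): tag each word with the label it co-occurred with most often in training, and back off to NN (noun) for OOV words, as suggested above.

```python
from collections import Counter, defaultdict

def train_majority(tagged_sents):
    # Count how often each word appears with each tag.
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # Keep only the most frequent tag per word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_majority(words, word_to_tag, default="NN"):
    # Unknown words fall back to the noun tag, the largest open class.
    return [word_to_tag.get(w, default) for w in words]

train = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
         [("the", "DT"), ("runs", "VBZ")],
         [("two", "CD"), ("runs", "NNS")]]
model = train_majority(train)
print(tag_majority(["the", "runs", "quokka"], model))  # ['DT', 'VBZ', 'NN']
```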
Naive Bayes
- Treat each prediction as independent of the others; computed with conditional probabilities, i.e., choose the tag maximizing P(y)P(x|y) for each token (a toy sketch follows).
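
A toy per-token Naive Bayes decision (the numbers and tag set are invented for illustration): choose the tag y maximizing the prior P(y) times the class-conditional P(word|y).

```python
import math

prior = {"NN": 0.4, "VB": 0.3, "DT": 0.3}                  # P(y), toy numbers
likelihood = {("book", "NN"): 0.02, ("book", "VB"): 0.005,
              ("book", "DT"): 1e-6}                        # P(word | y), toy numbers

def nb_tag(word):
    # argmax_y log P(y) + log P(word | y); unseen pairs get a small floor value.
    return max(prior, key=lambda y: math.log(prior[y])
               + math.log(likelihood.get((word, y), 1e-8)))

print(nb_tag("book"))  # 'NN', since 0.4 * 0.02 > 0.3 * 0.005 > 0.3 * 1e-6
```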
Logistic Regression (as classification)
- Treat each prediction as independent of the others, but condition on a much more expressive set of features.
Discriminative features
- Many kinds of information can go into the input x.
- Features are scoped over the entire observed input, so they can also encode context such as word order (see the feature sketch below).
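
A sketch of what "features scoped over the entire observed input" might look like for a logistic-regression tagger (the feature names are illustrative assumptions, not from the lecture): each token's feature dictionary can include its own form, its suffix and capitalization, and its neighbors.

```python
def features(words, i):
    w = words[i]
    return {
        "word=" + w.lower(): 1,
        "suffix3=" + w[-3:]: 1,                      # morphological cue (e.g., -ing, -ly)
        "is_capitalized": int(w[0].isupper()),
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1,
        "next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>"): 1,
    }

print(features(["She", "ran", "downhill"], 1))
# {'word=ran': 1, 'suffix3=ran': 1, 'is_capitalized': 0, 'prev_word=she': 1, 'next_word=downhill': 1}
```

Dictionaries like these could then be vectorized and fed to an ordinary classifier (e.g., scikit-learn's DictVectorizer with LogisticRegression).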
Sequences: how do we define a probability distribution over data that has an order?
- Models that make independent predictions for elements in a sequence can reason over expressive representations of the input x (including correlations among inputs at different time steps x_i and x_j).
- But they don't capture another important source of information: correlations in the labels y.
- Using the relationships among the three elements x_{i-1}, x_i, and y_i, each label y_i can be predicted.
- This amounts to modeling the probability of what comes next in the sequence.
- Most common tag bigrams in the Penn Treebank training data ("data set"): these probabilities are precomputed from counts (see the sketch below).
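
A sketch of precomputing those statistics by counting tag bigrams, using NLTK's small bundled sample of the Penn Treebank as an assumed stand-in for the full training set:

```python
import nltk
from collections import Counter

nltk.download("treebank", quiet=True)

bigram_counts = Counter()
for sent in nltk.corpus.treebank.tagged_sents():
    tags = ["<s>"] + [tag for _, tag in sent]   # prepend a start-of-sentence marker
    bigram_counts.update(zip(tags, tags[1:]))

print(bigram_counts.most_common(5))
# e.g. pairs like ('NNP', 'NNP'), ('DT', 'NN'), ('IN', 'DT') near the top
```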
Generative vs. Discriminative models
- Generative models (which can generate data) specify a joint distribution over the labels and the data; with this you could generate new data (HMM, GAN, Naive Bayes): P(x,y) = P(y)P(x|y)
- Discriminative models (which learn the boundary y = f(x), e.g., via a softmax) specify the conditional distribution of the label y given the data x; these models focus on how to discriminate between the classes (Logistic Regression, RNN with a softmax output): P(y|x). A toy numeric check relating the two follows.
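
A toy numeric check (numbers invented for illustration) that the generative decomposition P(x,y) = P(y)P(x|y) recovers the discriminative quantity P(y|x) via Bayes' rule:

```python
# Toy distributions for a single word x = "run" over two candidate tags.
p_y = {"NN": 0.6, "VB": 0.4}            # prior P(y)
p_x_given_y = {"NN": 0.01, "VB": 0.03}  # P(x = "run" | y)

joint = {y: p_y[y] * p_x_given_y[y] for y in p_y}   # P(x, y) = P(y) P(x | y)
p_x = sum(joint.values())                            # P(x) = sum over y of P(x, y)
posterior = {y: joint[y] / p_x for y in joint}       # P(y | x) by Bayes' rule
print(posterior)  # {'NN': 0.333..., 'VB': 0.666...}
```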