Questions MEMM - ufal/NPFL095 GitHub Wiki
Andrew McCallum, Dayne Freitag, Fernando Pereira: Maximum Entropy Markov Models for Information Extraction and Segmentation
-
Explain (roughly) how the new formula for α_{t+1}(s) is derived (i.e., Formula 1 in the paper).
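To make the recurrence concrete, here is a minimal sketch of the MEMM forward pass. The `trans(s_prev, s, o)` callable standing in for the per-state transition functions P_{s'}(s|o) is our own illustrative interface, not something from the paper:

```python
# Sketch of the MEMM forward recurrence (Formula 1), assuming
# trans(s_prev, s, o) returns P_{s_prev}(s | o). The function and state
# representation here are illustrative choices, not from the paper.

def forward(states, observations, trans, init_state):
    # alpha[s] = probability of being in state s after the first observation
    alpha = {s: trans(init_state, s, observations[0]) for s in states}
    for o in observations[1:]:
        # Formula 1: alpha_{t+1}(s) = sum over s' of alpha_t(s') * P_{s'}(s | o_{t+1})
        alpha = {s: sum(alpha[sp] * trans(sp, s, o) for sp in states)
                 for s in states}
    return alpha
```

Note that, unlike the HMM forward variable, each step conditions on the observation through P_{s'}(s|o) rather than multiplying in a separate emission probability.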
-
Section 2.1 states “we will split P(s|s',o) into |S| separately trained transition functions”. What are the advantages and disadvantages of this approach?
-
Let S = {V, N} (verb and non-verb)
Training data = he/N can/V can/V a/N can/N
Observation features are:
b1 = current word is “he”
b2 = current word is “can”
b3 = current word is “a” and next word is “can”
When implementing MEMM you need to define s0, i.e. the previous state before the first token. It may be a special NULL, but for simplicity let’s define it as N.
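If you want to experiment with the toy example, the setup above can be encoded in a few lines. The representation (parallel word/tag lists, boolean feature functions indexed by token position) is our own choice for illustration:

```python
# Toy data from the exercise: he/N can/V can/V a/N can/N, with s0 = N.
words = ["he", "can", "can", "a", "can"]
tags  = ["N",  "V",  "V",  "N", "N"]
s0 = "N"  # previous state before the first token, as defined above

def b1(t):
    # observation feature b1: current word is "he"
    return words[t] == "he"

def b2(t):
    # observation feature b2: current word is "can"
    return words[t] == "can"

def b3(t):
    # observation feature b3: current word is "a" and next word is "can"
    return words[t] == "a" and t + 1 < len(words) and words[t + 1] == "can"
```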
-
3a) What are the states (s) and observations (o) for this training data?
-
3b) Equation (2) defines features f_a based on observation features b. How many such f_a features do we have?
-
3c) Equation (3) defines constraints. How many such constraints do we have?
-
3d) List all the constraints involving feature b2, i.e. substitute (whenever possible) concrete numbers into Equation (3).
-
3e) In step 3 of the GIS algorithm you need to compute P_{s'}^{(j)}(s|o). Compute P_N^{(0)}(N|can) and P_N^{(0)}(V|can).
Hint: You might be confused by the m_{s'} variable (and t_1, …, t_{m_{s'}}) in Equation (3).
For a given s', t_1, …, t_{m_{s'}} are the time stamps whose previous state (at time stamp t_i − 1) is s'. For example, in our training data:
for s'=N: t_1=1 (because s_0=N), t_2=2 (because s_1=N) and t_3=5 (because s_4=N), i.e. m_{s'}=3;
for s'=V: t_1=3 (because s_2=V) and t_2=4 (because s_3=V), i.e. m_{s'}=2.
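The time stamps in the hint can be computed mechanically. A small sketch (our own encoding, with 1-based time stamps as in the hint):

```python
# For each previous state s', collect the 1-based time stamps t such that
# the state at time t-1 is s'. Matches the worked example in the hint.
tags = ["N", "V", "V", "N", "N"]   # s1..s5 for he/N can/V can/V a/N can/N
s0 = "N"
prev = [s0] + tags                  # prev[i] holds s_i for i = 0..5

timestamps = {"N": [], "V": []}
for t in range(1, len(tags) + 1):
    timestamps[prev[t - 1]].append(t)
# timestamps["N"] gives t_1..t_3 for s'=N; len(...) gives m_{s'}
```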
OPTIONAL (these materials may or may not help you answer the questions, and may be useful for learning more about Maximum Entropy Markov Models):
http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
http://www.mit.edu/~6.863/spring2011/jmnew/6.pdf
http://www.cs.cornell.edu/courses/cs6784/2010sp/lecture/09-McCallumEtAl00.pdf
http://see.stanford.edu/materials/aimlcs229/cs229-hmm.pdf