MWE Annotation Guidelines - nschneid/nanni GitHub Wiki
This document gives a detailed description of a linguistic annotation scheme for multiword expressions (MWEs) in English sentences. A conference paper summarizing the scheme and a dataset created with it are available at http://www.ark.cs.cmu.edu/LexSem/.
The input to annotators is a tokenized sentence. The goal is to join tokens where appropriate; this is done with the special characters underscore (_) for strong multiword links and tilde (~) for weak links:
- This is a highly~recommended fast_food restaurant .
Weak links can join strongly-linked expressions, such as fast_food + chain:
- This is a highly~recommended fast_food~chain .
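The contiguous markup above can be read mechanically. The following is a minimal sketch, not part of the annotation tooling: the function names `parse_chunk` and `parse_sentence` are hypothetical, and it assumes that tokens never contain `_` or `~` themselves and that `_` binds more tightly than `~`. Gappy expressions are not handled here.

```python
def parse_chunk(chunk):
    """Split one whitespace-delimited chunk into strong and weak groupings.

    Assumption: '_' binds tighter than '~', so the chunk is first split on
    '~' (weak links) and each resulting part is then split on '_' (strong links).
    """
    weak_parts = chunk.split("~")
    # Every part that contains '_' is a strong expression made of several tokens.
    strong = [part.split("_") for part in weak_parts if "_" in part]
    # A weak expression exists only if at least one '~' was present.
    weak = weak_parts if len(weak_parts) > 1 else None
    return strong, weak


def parse_sentence(annotated):
    """Collect all strong and weak expressions in a contiguously annotated sentence."""
    strong_mwes, weak_mwes = [], []
    for chunk in annotated.split():
        strong, weak = parse_chunk(chunk)
        strong_mwes.extend(strong)
        if weak is not None:
            weak_mwes.append(weak)
    return strong_mwes, weak_mwes


if __name__ == "__main__":
    print(parse_sentence("This is a highly~recommended fast_food~chain ."))
```

Here the weak expression keeps its strongly linked part intact (`fast_food`), mirroring the idea that a weak link can join an already strong expression to another token.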
Where an expression is interrupted by other tokens, use trailing and leading joiners:
- Do n't give_ Jon _such_a_hard_time !
This even works when a contiguous expression falls within the gap:
- Do n't give_ Jonathan_Q._Arbuckle _such_a_hard_time !
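As an illustration of how trailing and leading joiners pair up, here is a small sketch that recovers strong expressions across a single gap. The function name `parse_gappy` is hypothetical. It assumes only strong (`_`) links, tokens that contain no underscore of their own, and at most one open gappy expression at a time, so nested gaps and weak links fall outside its scope.

```python
def parse_gappy(annotated):
    """Return the strong expressions in a sentence, allowing one gap at a time.

    A chunk ending in '_' opens an expression that is waiting for its
    continuation; a chunk starting with '_' continues the open expression.
    Chunks in between are treated as gap material and parsed on their own.
    """
    expressions = []   # completed strong expressions, each a list of tokens
    open_expr = None   # tokens of the expression awaiting a continuation
    for chunk in annotated.split():
        continues = chunk.startswith("_")
        pending = chunk.endswith("_")
        tokens = [t for t in chunk.split("_") if t]
        if continues and open_expr is not None:
            open_expr.extend(tokens)
            current = open_expr
        else:
            current = tokens
        if pending:
            # Keep the expression open until a chunk with a leading joiner arrives.
            open_expr = current
        else:
            if len(current) > 1:
                expressions.append(current)
            if current is open_expr:
                open_expr = None
    return expressions


if __name__ == "__main__":
    print(parse_gappy("Do n't give_ Jonathan_Q._Arbuckle _such_a_hard_time !"))
```

Expressions are reported in the order they are completed, so the contiguous name inside the gap comes out before the gappy expression that surrounds it.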
On rare occasions it may be necessary to use multiple gappy expressions, in which case indexing notation is available: a|1 b|2 c|1$3 d|2 e$3 implies two strong expressions (a_c and b_d) and one weak expression, a_c~e. An example: put_ whole$1 _heart_in$1 (amounts to: put_heart_in~whole). Also: make_a_ big$1 _to_do$1
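The indexing notation can be unpacked along the following lines. This is only a sketch under stated assumptions: the name `parse_indexed` is hypothetical, `|n` is read as membership in strong group n and `$n` as membership in weak group n, and the input uses indices only (mixing indices with `_` joiners, as in the put_heart_in example, is not covered). A token that legitimately contains `$` or `|` followed by digits, such as a price, would be misread.

```python
import re
from collections import defaultdict

# Matches an index tag: '|' (strong) or '$' (weak) followed by a group number.
TAG = re.compile(r"([|$])(\d+)")


def parse_indexed(annotated):
    """Return (strong, weak) dictionaries mapping group index to expression text."""
    words = []
    strong = defaultdict(list)  # group index -> token positions
    weak = defaultdict(list)    # group index -> token positions
    for pos, chunk in enumerate(annotated.split()):
        words.append(TAG.sub("", chunk))
        for marker, index in TAG.findall(chunk):
            group = strong if marker == "|" else weak
            group[int(index)].append(pos)

    strong_text = {i: "_".join(words[p] for p in ps) for i, ps in strong.items()}
    # Map each strongly linked token position to the full strong expression.
    owner = {p: strong_text[i] for i, ps in strong.items() for p in ps}

    weak_text = {}
    for i, ps in weak.items():
        parts = []
        for p in ps:
            # A weak member that belongs to a strong group stands for the whole group.
            part = owner.get(p, words[p])
            if part not in parts:
                parts.append(part)
        weak_text[i] = "~".join(parts)
    return strong_text, weak_text


if __name__ == "__main__":
    print(parse_indexed("a|1 b|2 c|1$3 d|2 e$3"))
```

Token positions rather than token strings are used as keys so that a word occurring twice in a sentence is not confused with itself.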
Don't correct spelling mistakes in general, but interpret misspelled words according to their intended meaning. (E.g., if put it is clearly a typo of the particle verb put in, mark put_in.)
Do add _ if two separate tokens should really be one word. Include intermediate hyphens as part of an MWE:
- anti - oil wells, or anti - oil_wells, or anti_-_oil_wells
- NOT: anti_- oil_wells or anti -_oil_wells
If there’s a nonstandard variant of a term (more than just a misspelling), don’t join: craft_beer BUT handcraft beer
In general, don’t worry about the inflection of the word: come_in, came_in, have been coming_in all map to the same MWE lemma; likewise for grocery_store, grocery_stores.
If different pronouns can be used in a slot, do not join the pronoun. But if the slot requires a possessive, do join the possessive clitic if present: take_ her _time, take_ one _’s_time
While many idioms/MWEs are figurative, not all figurative language is used in a lexically specific way. For example, “my Sicilian family” referring to a pizza community may be nonliteral, but is not an MWE.
We do not annotate foreign sentences, but foreign names within an English sentence are MWEs.
A collocation is a pairing between two or more words that is unexpectedly frequent, but not syntactically or semantically unusual.
- eternally~grateful, can~not~wait, place~is~packed
- what s.o. has~to~say [is willing to say, brings to the conversation; not obligation (compare: I have to say good things to my boss to get promoted)]
A collocation may include components that are themselves MWEs:
- after~all~was_said_and_done
Drawing the line between free combination, collocation, and multiword is often difficult; annotators’ opinions will vary.
TODO (join with ~)
- Cj~and_company (ambiguous whether it is actually the name of a company or a guy and his crew)
Some semi-productive idioms are not well captured as lexicalized multiwords. These should not be joined:
- have + GOODNESS.ADJ + TIME.PERIOD: had a bad day, have a great year, etc.
- EVALUATIVE.ATTRIBUTE of s.o.: (real) Christian of you
- NUMERIC.QUANTITY PLURAL.TIME.NOUN running: two years running
- come + MENTAL.CHANGE.INFINITIVE: come to realize, believe, learn, adapt, have faith, …
Rarely, a token will seemingly participate in multiple MWEs, which cannot be represented in our annotation scheme. Use your best judgment in such cases.
- I recently threw a surprise birthday party for my wife at Fraiser 's .
Possible pairs:
surprise_party
birthday_party
threw_party
Decision:
threw~birthday_party
- triple_chocolate_chunk brownie: multiplier+chocolate, chocolate_chunk
Don’t worry if the parts of an expression are noncanonically ordered: gave_estimates, give_ an _estimate, the estimate_ that was _given
If one of the lexicalized parts is repeated due to coordination, attach the instance closest to the other lexicalized part: talked_to Bob and to Jill; North and South_America
- DO join Dr., Mr., and other titles to a personal name: Dr._Lori_Levin, Henry_,_Prince_of_Wales, Captain_Jack_Sparrow
- DO join Ave., Rd., St., etc. in street names: Forbes_Ave.
- DO join city-state-region expressions: Bellevue~,~WA or Bellevue~WA (include the comma if there is one). Likewise: Ohiopyle_State_Park~,~Pennsylvania; Miami_University~,~Miami~,~Ohio; Amsterdam~,~The_Netherlands
- DON’T join normal dates/times together (but: Fourth_of_July for the holiday)
- Symbols
- DON’T join normal % sign
- DO join letter grade followed by plus or minus: A_+
- DON’T join mathematical operators: 3 x the speed, 3 x 4 = 12 [x meaning “times”]
- DO join # sign when it can be read as “number”: #_1
- DO join a number and “star(s)” in the sense of a rating: 5_-_star
- When in doubt, join cardinal directions: north_east, north_west, south_east, south_west, north_-_northeast, …
- DO attach ’s if part of the name of a retail establishment: Modell_’s
- DO join product expressions such as car Year/Make/Model or software Name/Version
- excludes appositions that are not in a standard format (McDonald’s Dollar_Menu Chicken Sandwich)
- DO join names of foods/dishes if (a) the expression is noncompositional in some way, or (b) there is well-established cultural knowledge about the dish. Use ~ if unsure. For example:
- General_Tso_’s_chicken, macaroni_and_cheese, green_tea, red_velvet cake, ice_cream_sandwich, chicken_salad salad
- triple_chocolate_chunk brownie [multiplier+chocolate, chocolate_chunk]
- pizza~roll, ham~and~cheese, cheese~and~crackers, spaghetti~with~meatballs
- grilled BBQ chicken, pumpkin spice latte, green pepper, turkey sandwich, eggplant parmesan, strawberry banana milkshake
- DO join established varieties of animals/natural kinds: yellow_lab, desert_chameleon, Indian_elephant, furcifer_pardalis; BUT: brown dog
- DO join slogans: Trust_The_Midas_Touch, Just_Do_It, etc.
- pleased/happy/angry_with, mad_at
- good_for s.o. [healthy, desirable]
This is a special use of the preposition ‘on’, but it does not generally join to form an MWE:
- drop_the_ball on s.o. [not literally!], die on s.o.
- hang_up~on s.o. [collocation]
- (‘step on s.o.’ is different: here it is the semantics of ‘step on’ that could convey negativity in certain contexts, not ‘on’ by itself)
- (appropriate) for_ {one's, a certain, ...} _age (of child)
3 years_old, month_old project. (Note that ago should NOT be joined because it is always postpositional.)
Join unless ‘all’ is paraphrasable as ‘completely’ or ‘entirely’:
- participle: all gone, all done, all_told [overall, in total]
- other adj: all ready, all_right [well, OK] (informal spelling: alright)
Do not join, even though the ‘as’ PPs are correlated. Exceptions:
- as_long_as [while]
- one_by_one [one at a time]
- Don’t join if by indicates a product, as in a multidimensional measurement: three by five paper = 3 x 5 paper
A few English nouns take idiosyncratic measure words: 3 sheets~of~paper, 2 pairs~of~pants, a piece~of~information
Do not attach the modifier if it has an ordinary meaning, e.g. go clear through the wall
- highly~recommended, highly~trained
- family~owned company
- capital_punishment
- big_rig [slang for truck]
- road_construction [the road isn't actually being constructed, but reconstructed!]
- silver_ Mariott _member [rewards program]
- electric_blanket
- last_minute
- price_range
- second_chance
- grocery_stores
- pizza_parlor, pizza place, burger joint (diagnostic: does “favorite X” occur on the web? [to filter out proper names])
- little~danger/risk
- public~welfare
- this place is a hidden~gem
- strike_one/two/three (unusual syntax!)
Cf. Quirk pp. 669–670
- out_of, in_between, in_front_of, next_to
- along_with
- as_well, as_well_as
- in_addition_to, in_light_of, on_account_of
- due_to, owing_to, because_of
From Quirk et al. 1972:
- but_that, in_that, in_order_that, insofar_that, in_the_event_that, save_that, so_that, such_that, except_that, for_all_that, now_that
- as_{far,long,soon}_as, inasmuch_as, insofar_as, as_if, as_though, in_case
- Do NOT mark the participial ones: assuming, considering, excepting, … that
- to_start_off_with
- that_said
- of_course
Though as a postmodifier it is a bit odd syntactically (anything else, who else, etc.), it does not generally participate in lexicalized idioms.
- “What does ‘else’ even mean?!” - Henrietta
- do_ X _a_favor_and Y; do_ X _a_favor_, Y; vs. plain do_favor
- you_get_what_you_pay_for (NOT: ‘you get what you purchase’)
- get_ the_hell _out
- why in_the_hell [can be any WH word]
- do_n’t_forget, forget_it !, never_mind
- I have~to~say, gotta~say, etc.: semantics = obligation on self to say something, pragmatics = can’t restrain self from expressing an opinion
- Who_knows [rhetorical question]
- no_way
- Phatic expressions: I_’m_sorry, Thank_you
We do not mark ordinary there be existentials as multiwords.
get + "accomplishment" V: get_upgraded, get_ cat _neutered, get_ a bill _passed
In the sense of ‘arrive’, not really a multiword:
- get back home, got to the school
- get_ready, get_done, get_busy, get_older
- get_a_flat
- get_correct
If a verb, adjective, or noun typically takes an infinitival complement, and the infinitive verb is not fixed, don't join to:
- little to say
- important to determine his fate
- able/ability to find information
- chose/choice to do nothing
- willing(ness) to sail
But if it is a special construction, the to should be joined:
- in_order_to VP
- at_liberty_to VP
- ready_to_rumble
- special modal/tense constructions: ought_to, have_to (obligation or necessity), going_to, about_to (but want to, need to, try to)
- a long~{day, week, year} (long in the sense of bad/difficult; cannot be referring to an actual duration because the noun is a time unit with precise duration)
Join these (including n’t and a, but not do) if sufficiently conventional: did n't_sleep_a_wink, did n't_charge_a_cent/penny/dime, did n't_eat_a_morsel/scrap/bite/crumb (of food)
These include the so-called determiner-less PPs (in_town vs. in the town).
- in_a_ nice/good/... _way
- out_of_site
- on_staff
- at_all
- at_liberty_to
- for_sure
- mediocre at_best
- to_boot
- in_town
- on_earth, on~the~planet, in~the~world, in the country/universe
- in_cash/quarters/Euros
- to_ her amusement, to our chagrin, to the _surprise of all present
- to_ my _satisfaction
- capacity_for love
- his problem_with the decision
- extensive damage_to the furniture
Sometimes these participate in verb-noun constructions:
- have_a_problem_with [be annoyed with], have_ a _problem_with [have trouble with something not working]
- do_damage_to the furniture
- TODO: explain principles
- talk_with/to, speak_with/to, filled_with
- NOT: learn about, booked at (hotel)
- wait_for
- look_for
- test_for
- (a)rising_from
- disposed_of
- take_care_of
- trust_with
- listen_to, pay_attention_to
- compare_ X _to Y, X compared_to Y
- been_to LOCATION/doctor etc.
- trip_over, trip_on
- (do_)damage_to the furniture
- take_in [bring to an establishment for some purpose, e.g. a car for service]
- focus_on
- nibble/nosh/snack/munch_on
- kept_up_with [keep pace, manage not to fall behind]
- looking_for my friend [seeking out my friend] vs. looking for my friend [on behalf of]
- not multiword:
- stay at hotel
- supply with, fit_out with [‘with’ marks the thing transferred/given]
Join: close_to, close_by(_to), far_from, far_away(_from)
- Join article if not followed by a noun (paraphrasable with ‘identical’): his objective was the_same each time / each time he had the same objective
- exact_same
- walks_around (path covering an area)
- stay~away = keep_away
- run_out [get used up] vs. run out_of the filter [leak]
- back
- Generally do not include literal uses: go back [motion], came/headed back [returned to a location]
- money_back, cash_back [change of medium: overpaying with credit card so as to receive the difference in cash]
- s.o.’s money_back, CONDITION or_your_money_back [refund]
- pay_ s.o. _back, get money back [returning a loan; get money back is possible but not really idiomatic with this meaning]
- brought back [taking a car to the shop again for further repairs] vs. brought_back [returning a purchase for a refund]
- turned_back [turned around to travel back]
- get_back_to s.o. [return to communicate information to s.o.]
- TODO: explain principles
- rent_out
- with ‘out’, disambiguates the landlord/permanent owner vs. tenant/temporary user
- (BUT: rent out an entire place?)
- turn_on, turn_off
- pick_up [retrieve from store]
- If prenominal, don’t join of: a_lot/little/bit/couple/few (of), some/plenty of
- EXCEPTION: a_number_of (TODO: why?) check what H did in xxxx5x
- Join ‘square’ or ‘cubic’ within a unit of measurement: square_miles/yards/..., cubic_centimeter/...
- Join half_a when modifying a quantity: half_a day’s work, half_a million
- cf. classifiers
- less than, more than: Join if functioning as a relational “operator.” Heuristic: can ‘<’ or ‘>’ be substituted?
- less_than a week later (‘< a week’)
- more happy than sad (NOT: ‘> happy than sad’)
- I agree with him more than with you (NOT: ‘> with you’)
- trying_to_say (with implication of dishonesty or manipulativeness)
- went_out_of_ their _way (went to extra effort)
- go_on_and_on [= talk endlessly] (cf. go_on by itself meaning ‘continue’)
- went_so_far_as_to_say: include ‘say’ because it has a connotation of negativity (beyond ‘went_so_far_as_to (do something)’)
- have_a_gift_for [= possess talent]
- love_ Boston _to_death
- go_to_the_bathroom [use the potty]
- give_ X _a_try [test it out]
- dropped_the_issue, drop_the_subject, drop_it !
- say/tell/lie_ (s.t.) to_ s.o. _’s_face
- made_to_order
- took~forever
A support verb is a semantically “light” (mostly contentless) verb whose object is a noun that by itself denotes a state, event, or relation; the noun can be thought of as providing the bulk of the meaning/determining the meaning of the verb [FN Book 2010, p. 31]. We join the verb to the noun in question:
- make_ a _decision/statement/request/speech/lecture
- take_ a(n) _test/exam
- take_ a _picture/photo
- give_speeches/lectures/interviews
- undergo/have/receive/get_ an _operation
- do/perform_surgery
Some useful properties:
- Most commonly, support verbs are extremely frequent verbs that exhibit a certain degree of grammaticalization: have, get, give, make, take, etc.
- One indication of lightness is when the noun cannot felicitously be omitted in a question (She made a decision. / #What did she make?; She had an operation. / ?#What did she have?; They perform surgery on newborns. / #What do they perform on newborns?)
- Support verb constructions can often be paraphrased with a semantically “heavy” verb, which may be derivationally related to the noun: make_ a _decision = decide, give_ an _interview = be interviewed, undergo_ an _operation = be operated_on. (The noun surgery has no verb in English, but we could imagine “surgure” as a word! In other cases it would not be unreasonable to incorporate the noun into a verb: take_ a _test = test-take.)
- Caution is required: some expressions are not support verbs, though they appear to be at first blush:
- get a donation: donation refers to the money donated, not the act of donating. (What did she get in the mail? A donation.)
- have a barbecue: here have has the sense of hold (an organized event). (What did she have at her house? A barbecue.)
- have a disease/an illness
- witnessed an operation: the verb and the noun refer to distinct events.
- NOTE: We exclude the copula from our definition of support, though on rare occasions an idiom lexicalizes a copula: be_the_case.
Following [Calzolari et al. 2002], we distinguish “Type II” support verbs which do contribute some meaning, though it is understood in the context of the event/scenario established by the noun:
- start~ a ~race
- most aspectual verbs—begin/end/start/stop/continue/finish/repeat/interrupt etc.—would qualify when used with an eventive noun
- pass~ an ~exam
- keep~ a ~promise
- answer~ a ~question
- execute~ a ~program, evaluate~ a ~function
Type II support verbs are lower priority for us than “core” support verbs.
verb-noun idioms
Some verb-object combinations are idiomatic, though they do not qualify as support verb constructions. We count these as multiwords as well.
- pay_attention
- take...time: There are several related idioms involving use of one’s time for some purpose. Include ‘the’ for the “extra effort” sense: take_the_time to help. Include a preposition for took_time_out_of (sacrifice), took_time_out/off (scheduled vacation).
- waste/spend/save/have~time/money
- CHANGE: was waste_time
- give_ an _estimate, give_ a _quote on something [typically includes the process of estimation as well as offering that estimate to the customer]
Typically, don’t join:
- well done/made/oiled
Exceptions:
- my hamburger is well_done
- that was a_job_well_done
- a well_-_oiled_machine
- he is well_-_read
- well_fed
- a_lot
- N_star
- in_cash
- possibly: get/have_done (hair, etc.)
- highly~recommended, highly~trained
- city, state, etc. locations: _ => ~
- waste_time, spend_time => ~
TODO
- good job, great job, look good
- good job, good work, hard work [I’d be OK with ~ for these but we decided previously that good/great work should be left alone]
- ‘look at it’: include ‘it’? could be specific or not
- Short of that, One more thing -- ?
- fix problem -- I'd say this is a collocation, so fix~problem
- best restaurant out_there
- fast and friendly [sounds slightly better than “friendly and fast”, but that probably reflects a preference for the word with fewer syllables to come first]
- walk_in_the_door: entering a room or establishment
- have/get_ nails/hair/tattoo/etc. _done (grooming)
- ?? have/get done [work/repairs]
- ?? do~work/job (cf. surgery)
- ?? do~dishes