MWE Annotation Guidelines - nschneid/nanni GitHub Wiki

This document gives a detailed description of a linguistic annotation scheme for multiword expressions (MWEs) in English sentences. A conference paper summarizing the scheme and a dataset created with it are available at http://www.ark.cs.cmu.edu/LexSem/.

Markup

The input to annotators is a tokenized sentence. The goal is to join tokens where appropriate; this is done with two special characters: an underscore (_) for strong multiword links and a tilde (~) for weak links:

  • This is a highly~recommended fast_food restaurant .

Weak links can join strongly-linked expressions, such as fast_food + chain:

  • This is a highly~recommended fast_food~chain .
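To make the contiguous notation concrete, here is a minimal Python sketch that reads an annotated sentence back into tokens, strong groups, and weak groups. The function name `parse_contiguous` and the index-list output format are illustrative assumptions, not part of the annotation tooling, and the sketch assumes that no token is itself a literal underscore or tilde.

```python
def parse_contiguous(sentence):
    """Split an annotated sentence into tokens plus strong and weak groups.

    Hypothetical helper: groups are returned as lists of token indices.
    Assumes only contiguous expressions (no gap joiners, no index notation).
    """
    tokens, strong, weak = [], [], []
    for chunk in sentence.split():
        weak_members = []
        # A tilde separates the strongly linked parts of a weak expression.
        for part in chunk.split("~"):
            words = part.split("_")
            start = len(tokens)
            tokens.extend(words)
            idxs = list(range(start, len(tokens)))
            if len(idxs) > 1:
                strong.append(idxs)  # two or more tokens joined by "_"
            weak_members.extend(idxs)
        if "~" in chunk:
            weak.append(weak_members)  # the weak group spans every part
    return tokens, strong, weak


tokens, strong, weak = parse_contiguous(
    "This is a highly~recommended fast_food~chain ."
)
```

In this sketch a weak group lists every token it covers, so fast_food~chain yields one strong group (fast, food) nested inside one weak group (fast, food, chain).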

Where an expression is interrupted by other tokens, use trailing and leading joiners:

  • Do n't give_ Jon such _a_hard_time !

This works even when a contiguous expression falls within the gap:

  • Do n't give_ Jonathan_Q._Arbuckle such _a_hard_time !
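The trailing and leading joiners can be read mechanically. The following Python sketch handles strong links only and assumes at most one gappy expression is open at a time; `parse_gappy` is a hypothetical name, and the sketch is one possible reading of the notation rather than the project's own tooling.

```python
def parse_gappy(sentence):
    """Return (tokens, groups) for strong links, including gappy ones.

    Hypothetical helper. Assumes at most one gappy expression is open at
    any point and that no token is a literal underscore.
    """
    tokens, groups = [], []
    pending = None  # token indices of an expression waiting for its remainder
    for chunk in sentence.split():
        continues = chunk.startswith("_") and pending is not None
        stays_open = chunk.endswith("_")
        words = chunk.strip("_").split("_")
        idxs = list(range(len(tokens), len(tokens) + len(words)))
        tokens.extend(words)
        group = pending + idxs if continues else idxs
        if stays_open:
            pending = group  # more of this expression follows after a gap
        else:
            if continues:
                pending = None  # the gappy expression is now complete
            if len(group) > 1:
                groups.append(group)
    return tokens, groups


tokens, groups = parse_gappy(
    "Do n't give_ Jonathan_Q._Arbuckle such _a_hard_time !"
)
```

Groups are emitted in the order they are completed, so the contiguous name inside the gap is listed before the gappy expression that surrounds it.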

On rare occasions it may be necessary to use multiple gappy expressions, in which case indexing notation is available: a|1 b|2 c|1$3 d|2 e$3 implies two strong expressions (a_c and b_d) and one weak expression, a_c~e. An example: put_ whole$1 _heart_in$1 (amounts to: put_heart_in~whole). Also: make_a_ big$1 _to_do$1.
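As an illustration of the indexing notation, the sketch below decodes a purely indexed sentence: `|k` assigns a token to strong group k and `$k` to weak group k, and each weak group is expanded to cover the strong expressions its members belong to. The name `parse_indexed` is hypothetical, and the sketch does not handle sentences that mix indices with `_` or `~` joiners (as in the put_ whole$1 _heart_in$1 example).

```python
import re
from collections import defaultdict

# A chunk is a word followed by zero or more "|k" or "$k" tags.
SPLIT = re.compile(r"(.*?)((?:[|$]\d+)*)")
TAG = re.compile(r"([|$])(\d+)")


def parse_indexed(sentence):
    """Decode index notation into (tokens, strong_groups, weak_groups).

    Hypothetical helper; assumes no token ends in text that looks like a tag.
    """
    tokens = []
    strong, weak = defaultdict(list), defaultdict(list)
    for i, chunk in enumerate(sentence.split()):
        word, tags = SPLIT.fullmatch(chunk).groups()
        tokens.append(word)
        for kind, index in TAG.findall(tags):
            (strong if kind == "|" else weak)[index].append(i)
    strong_groups = list(strong.values())
    weak_groups = []
    for members in weak.values():
        expanded = set(members)
        # A weak link to one token covers that token's whole strong expression.
        for group in strong_groups:
            if set(members) & set(group):
                expanded |= set(group)
        weak_groups.append(sorted(expanded))
    return tokens, strong_groups, weak_groups


tokens, strong_groups, weak_groups = parse_indexed("a|1 b|2 c|1$3 d|2 e$3")
```

For the abstract example this gives the strong groups a_c and b_d and a single weak group covering a, c, and e, matching the reading a_c~e.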

Tokenization, hyphenation, spelling, and morphology

Don't correct spelling mistakes in general, but interpret misspelled words according to their intended meaning. (E.g., if put it is clearly a typo of the particle verb put in, mark put_in.)

Do add _ if two separate tokens should really be one word. Include intermediate hyphens as part of an MWE:

anti - oil wells

or

anti - oil_wells

or

anti_-_oil_wells

NOT

anti_- oil_wells

or

anti -_oil_wells
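The hyphen rule above lends itself to an automatic check. This Python sketch flags any whitespace-delimited chunk in which a hyphen token is joined on one side only; `half_joined_hyphens` is a hypothetical name, and the check assumes contiguous markup (it does not account for gap joiners or index notation).

```python
import re


def half_joined_hyphens(annotated):
    """Return chunks where a hyphen token is joined on exactly one side.

    Hypothetical checker for contiguous markup only.
    """
    bad = []
    for chunk in annotated.split():
        parts = re.split(r"[_~]", chunk)
        # A hyphen at the edge of a multi-token chunk is joined on one side only.
        if len(parts) > 1 and "-" in (parts[0], parts[-1]):
            bad.append(chunk)
    return bad


acceptable = half_joined_hyphens("anti_-_oil_wells")   # nothing flagged
rejected = half_joined_hyphens("anti_- oil_wells")     # flags "anti_-"
```

A hyphen that stands alone or sits in the middle of a joined expression passes; one that opens or closes a joined chunk is reported.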

If there’s a nonstandard variant of a term (more than just a misspelling), don’t join: craft_beer BUT handcraft beer

In general, don’t worry about the inflection of the word: come_in, came_in, have been coming_in all map to the same MWE lemma; likewise for grocery_store, grocery_stores.

If different pronouns can be used in a slot, do not join the pronoun. But if the slot requires a possessive, do join the possessive clitic if present: take_ her _time, take_ one _’s_time

Figurative language

While many idioms/MWEs are figurative, not all figurative language is used in a lexically specific way. For example, “my Sicilian family” referring to a pizza community may be nonliteral, but is not an MWE.

Foreign languages

We do not annotate foreign sentences, but foreign names within an English sentence are MWEs.

Collocations vs. “strong” MWEs

A collocation is a pairing between two or more words that is unexpectedly frequent, but not syntactically or semantically unusual.

  • eternally~grateful, can~not~wait, place~is~packed
  • what s.o. has~to~say [is willing to say, brings to the conversation; not obligation (compare: I have to say good things to my boss to get promoted)]

A collocation may include components that are themselves MWEs:

  • after~all~was_said_and_done

Drawing the line between free combination, collocation, and multiword is often difficult; annotators’ opinions will vary.

Borderline/ambiguous cases

TODO (join with ~)

  • Cj~and_company (ambiguous whether it is actually the name of a company or a guy and his crew)

Constructions with only 1 lexicalized word

Some semi-productive idioms are not well captured as lexicalized multiwords. These should not be joined:

  • have + GOODNESS.ADJ + TIME.PERIOD: had a bad day, have a great year, etc.
  • EVALUATIVE.ATTRIBUTE of s.o.: (real) Christian of you
  • NUMERIC.QUANTITY PLURAL.TIME.NOUN running: two years running
  • come + MENTAL.CHANGE.INFINITIVE: come to realize, believe, learn, adapt, have faith, …

Overlapping expressions

Rarely, a token will seemingly participate in multiple MWEs, which cannot be represented in our annotation scheme. Use your best judgment in such cases.

  • I recently threw a surprise birthday party for my wife at Fraiser 's .

Possible pairs:

surprise_party

birthday_party

threw_party

Decision:

threw~birthday_party

  • triple_chocolate_chunk brownie: multiplier+chocolate, chocolate_chunk

Syntactically perverted expressions

Don’t worry if the parts of an expression are noncanonically ordered: gave_estimates, give_ an _estimate, the estimate_ that was _given

If one of the lexicalized parts is repeated due to coordination, attach the instance closest to the other lexicalized part: talked_to Bob and to Jill; North and South_America

Special kinds of expressions

  • DO join Dr., Mr., etc. and other titles to a personal name: Dr._Lori_Levin, Henry_,_Prince_of_Wales, Captain_Jack_Sparrow
  • DO join Ave., Rd., St., etc. in street names: Forbes_Ave.
  • DO join city-state-region expressions: Bellevue~,~WA or Bellevue~WA (include the comma if there is one). Likewise: Ohiopyle_State_Park~,~Pennsylvania; Miami_University~,~Miami~,~Ohio; Amsterdam~,~The_Netherlands
  • DON’T join normal dates/times together (but: Fourth_of_July for the holiday)
  • Symbols
    • DON’T join normal % sign
    • DO join letter grade followed by plus or minus: A_+
    • DON’T join mathematical operators: 3 x the speed, 3 x 4 = 12 [x meaning “times”]
    • DO join # sign when it can be read as “number”: #_1
  • DO join a number and “star(s)” in the sense of a rating: 5_-_star
  • When in doubt, join cardinal directions: north_east, north_west, south_east, south_west, north_-_northeast, …
  • DO attach ’s if part of the name of a retail establishment: Modell_’s
  • DO join product expressions such as car Year/Make/Model or software Name/Version
    • excludes appositions that are not in a standard format (McDonald’s Dollar_Menu Chicken Sandwich)
  • DO join names of foods/dishes if (a) the expression is noncompositional in some way, or (b) there is well-established cultural knowledge about the dish. Use ~ if unsure. For example:
    • General_Tso_’s_chicken, macaroni_and_cheese, green_tea, red_velvet cake, ice_cream_sandwich, chicken_salad salad
    • triple_chocolate_chunk brownie [multiplier+chocolate, chocolate_chunk]
    • pizza~roll, ham~and~cheese, cheese~and~crackers, spaghetti~with~meatballs
    • grilled BBQ chicken, pumpkin spice latte, green pepper, turkey sandwich, eggplant parmesan, strawberry banana milkshake
  • DO join established varieties of animals/natural kinds: yellow_lab, desert_chameleon, Indian_elephant, furcifer_pardalias; BUT: brown dog
  • DO join slogans: Trust_The_Midas_Touch, Just_Do_It, etc.

By construction

A+[P+Pobj]

  • pleased/happy/angry_with, mad_at
  • good_for s.o. [healthy, desirable]

affective on

This is a special use of the preposition ‘on’, but it does not generally join to form an MWE:

  • drop_the_ball on s.o. [not literally!], die on s.o.
  • hang_up~on s.o. [collocation]
  • (‘step on s.o.’ is different: here it is the semantics of ‘step on’ that could convey negativity in certain contexts, not ‘on’ by itself)

age

  • (appropriate) for_ {one's, a certain, ...} _age (of child)

age construction: TEMPORAL.QUANTITY old

3 years_old, month_old project. (Note that ago should NOT be joined because it is always postpositional.)

all + A

Join unless ‘all’ is paraphrasable as ‘completely’ or ‘entirely’:

  • participle: all gone, all done, all_told [overall, in total]
  • other adj: all ready, all_right [well, OK] (informal spelling: alright)

as X as Y

Do not join, even though the as PPs are correlated. Exceptions:

  • as_long_as [while]

X by Y

  • one_by_one [one at a time]
  • Don’t join if by indicates a product, as in a multidimensional measurement: three by five paper = 3 x 5 paper

classifiers: MEASURE.WORD of N

A few English nouns take idiosyncratic measure words: 3 sheets_of_paper, 2 pairs_of_pants, a piece_of_information

clear/straight/right + P

Do not attach the modifier if it has an ordinary meaning, e.g. go clear through the wall

complex adjective phrases

  • highly~recommended, highly~trained
  • family~owned company

complex nominals: A+N, N+N

  • capital_punishment
  • big_rig [slang for truck]
  • road_construction [the road isn't actually being constructed, but reconstructed!]
  • silver_ Mariott _member [rewards program]
  • electric_blanket
  • last_minute
  • price_range
  • second_chance
  • grocery_stores
  • pizza_parlor, pizza place, burger joint (diagnostic: does “favorite X” occur on the web? [to filter out proper names])
  • little~danger/risk
  • public~welfare
  • this place is a hidden~gem
  • strike_one/two/three (unusual syntax!)

complex prepositions

Cf. Quirk pp. 669–670

  • out_of, in_between, in_front_of, next_to
  • along_with
  • as_well, as_well_as
  • in_addition_to, in_light_of, on_account_of
  • due_to, owing_to, because_of

complex subordinators

From Quirk et al. 1972:

  • but_that, in_that, in_order_that, insofar_that, in_the_event_that, save_that, so_that, such_that, except_that, for_all_that, now_that
  • as_{far,long,soon}_as, inasmuch_as, insofar_as, as_if, as_though, in_case
  • Do NOT mark the participial ones: assuming, considering, excepting, … that

discourse connectives

  • to_start_off_with
  • that_said
  • of_course

else

Though as a postmodifier it is a bit odd syntactically (anything else, who else, etc.), it does not generally participate in lexicalized idioms.

  • “What does ‘else’ even mean?!” - Henrietta

exhortative, emotive, expletive, and proverb idioms

  • do_ X _a_favor_and Y
    do_ X _a_favor_, Y
    vs. plain do_favor
  • you_get_what_you_pay_for (NOT: ‘you get what you purchase’)
  • get_ the_hell _out
  • why in_the_hell [can be any WH word]
  • do_n’t_forget, forget_it !, never_mind
  • i have_to_say, gotta~say, etc.: semantics = obligation on self to say something, pragmatics = can’t restrain self from expressing an opinion
  • Who_knows [rhetorical question]
  • no_way
  • Phatic expressions: I_’m_sorry, Thank_you

existential there

We do not mark ordinary there be existentials as multiwords.

get_upgraded, get_ cat _neutered, get_ a bill _passed

get + destination

In the sense of ‘arrive’, not really a multiword:

  • get back home, got to the school

get + result A

  • get_ready, get_done, get_busy, get_older
  • get_a_flat
  • get_correct

infinitival to

If a verb, adjective, or noun typically takes an infinitival complement, and the infinitive verb is not fixed, don't join to:

  • little to say
  • important to determine his fate
  • able/ability to find information
  • chose/choice to do nothing
  • willing(ness) to sail

But if it is a special construction, the to should be joined:

  • in_order_to VP
  • at_liberty_to VP
  • ready_to_rumble
  • special modal/tense constructions: ought_to, have_to (obligation or necessity), going_to, about_to (but want to, need to, try to)

long + TIME.PERIOD

  • a long~{day, week, year} (long in the sense of bad/difficult; cannot be referring to an actual duration because the noun is a time unit with precise duration)

negative polarity items

Join these (including n’t and a, but not do) if sufficiently conventional: did n't_sleep_a_wink, did n't_charge_a_cent/penny/dime, did n't_eat_a_morsel/scrap/bite/crumb (of food)

on: see affective on

prepositional phrase idioms

These include the so-called determiner-less PPs (in_town vs. in the town).

  • in_a_ nice/good/... _way
  • out_of_site
  • on_staff
  • at_all
  • at_liberty_to
  • for_sure
  • mediocre at_best
  • to_boot
  • in_town
  • on_earth, on_the_planet, in_the_world, in the country/universe

in + method of payment

  • in_cash/quarters/Euros

to + mental state

  • to_ her amusement, to our chagrin, to the _surprise of all present
  • to_ my _satisfaction

prepositional nouns: N+[P+Pobj]

  • capacity_for love
  • his problem_with the decision
  • extensive damage_to the furniture

Sometimes these participate in verb-noun constructions:

  • have_a_problem_with [be annoyed with], have_ a _problem_with [have trouble with something not working]
  • do_damage_to the furniture

prepositional verbs: V+[P+Pobj]

  • TODO: explain principles
  • talk_with/to, speak_with/to, filled_with
  • NOT: learn about, booked at (hotel)
  • wait_for
  • look_for
  • test_for
  • (a)rising_from
  • disposed_of
  • take_care_of
  • trust_with
  • listen_to, pay_attention_to
  • compare_ X _to Y, X compared_to Y
  • been_to LOCATION/doctor etc.
  • trip_over, trip_on
  • (do_)damage_to the furniture
  • take_in [bring to an establishment for some purpose, e.g. a car for service]
  • focus_on
  • nibble/nosh/snack/munch_on
  • kept_up_with [keep pace, manage not to fall behind]
  • looking_for my friend [seeking out my friend] vs. looking for my friend [on behalf of]
  • not multiword:
    • stay at hotel
    • supply with, fit_out with [‘with’ marks the thing transferred/given]

proximity expressions: A+P, A+P+P

Join: close_to, close_by(_to), far_from, far_away(_from)

same

  • Join article if not followed by a noun (paraphrasable with ‘identical’): his objective was the_same each time / each time he had the same objective
  • exact_same

there: see existential there

verbs with intransitive prepositions

V+P

  • walks_around (path covering an area)
  • stay~away = keep_away
  • run_out [get used up] vs. run out_of the filter [leak]
  • back
    • Generally do not include literal uses: go back [motion], came/headed back [returned to a location]
    • money_back, cash_back [change of medium: overpaying with credit card so as to receive the difference in cash]
    • s.o.’s money_back, CONDITION or_your_money_back [refund]
    • pay_ s.o. _back, get money back [returning a loan; get money back is possible but not really idiomatic with this meaning]
    • brought back [taking a car to the shop again for further repairs] vs. brought_back [returning a purchase for a refund]
    • turned_back [turned around to travel back]
    • get_back_to s.o. [return to communicate information to s.o.]

particle verbs: V+P+Vobj, V+Vobj+P

  • TODO: explain principles
  • rent_out
    • with ‘out’, disambiguates the landlord/permanent owner vs. tenant/temporary user
    • (BUT: rent out an entire place?)
  • turn_on, turn_off
  • pick_up [retrieve from store]

quantifiers/quantity modifiers

  • If prenominal, don’t join of: a_lot/little/bit/couple/few (of), some/plenty of
    • EXCEPTION: a_number_of (TODO: why?) check what H did in xxxx5x
  • Join ‘square’ or ‘cubic’ within a unit of measurement: square_miles/yards/..., cubic_centimeter/...
  • Join half_a when modifying a quantity: half_a day’s work, half_a million
  • cf. classifiers

quantity comparisons

  • less than, more than: Join if functioning as a relational “operator.” Heuristic: can ‘<’ or ‘>’ be substituted?
    • less_than a week later (‘< a week’)
    • more happy than sad (NOT: ‘> happy than sad’)
    • I agree with him more than with you (NOT: ‘> with you’)

VP idioms

  • trying_to_say (with implication of dishonesty or manipulativeness)
  • went_out_of_ their _way (went to extra effort)
  • go_on_and_on [= talk endlessly] (cf. go_on by itself meaning ‘continue’)
  • went_so_far_as_to_say: include ‘say’ because it has a connotation of negativity (beyond ‘went_so_far_as_to (do something)’)
  • have_a_gift_for [= possess talent]
  • love_ Boston _to_death
  • go_to_the_bathroom [use the potty]
  • give_ X _a_try [test it out]
  • dropped_the_issue, drop_the_subject, drop_it !
  • say/tell/lie_ (s.t.) to_ s.o. _’s_face
  • made_to_order
  • took~forever

support verb constructions

A support verb is a semantically “light” (mostly contentless) verb whose object is a noun that by itself denotes a state, event, or relation; the noun can be thought of as providing the bulk of the meaning/determining the meaning of the verb [FN Book 2010, p. 31]. We join the verb to the noun in question:

  • make_ a _decision/statement/request/speech/lecture
  • take_ a(n) _test/exam
  • take_ a _picture/photo
  • give_speeches/lectures/interviews
  • undergo/have/receive/get_ an _operation
  • do/perform_surgery

Some useful properties:

  1. Most commonly, support verbs are extremely frequent verbs that exhibit a certain degree of grammaticalization: have, get, give, make, take, etc.

  2. One indication of lightness is when the noun cannot felicitously be omitted in a question (She made a decision. / #What did she make?; She had an operation. / ?#What did she have?; They perform surgery on newborns. / #What do they perform on newborns?)

  3. Support verb constructions can often be paraphrased with a semantically “heavy” verb, which may be derivationally related to the noun: make_ a _decision = decide, give_ an _interview = be interviewed, undergo_ an _operation = be operated_on. (The noun surgery has no verb in English, but we could imagine “surgure” as a word! In other cases it would be not unreasonable to incorporate the noun into a verb: take_ a _test = test-take.)

  4. Caution is required: some expressions are not support verbs, though they appear to be at first blush:

    • get a donation: donation refers to the money donated, not the act of donating. (What did she get in the mail? A donation.)
    • have a barbecue: here have has the sense of hold (an organized event). (What did she have at her house? A barbecue.)
    • have a disease/an illness
    • witnessed an operation: the verb and the noun refer to distinct events.
  5. NOTE: We exclude the copula from our definition of support, though on rare occasions an idiom lexicalizes a copula: be_the_case.

Following [Calzolari et al. 2002], we distinguish “Type II” support verbs which do contribute some meaning, though it is understood in the context of the event/scenario established by the noun:

  • start~ a ~race
    • most aspectual verbs—begin/end/start/stop/continue/finish/repeat/interrupt etc.—would qualify when used with an eventive noun
  • pass~ an ~exam
  • keep~ a ~promise
  • answer~ a ~question
  • execute~ a ~program, evaluate~ a ~function

Type II support verbs are lower priority for us than “core” support verbs.

verb-noun idioms

Some verb-object combinations are idiomatic, though they do not qualify as support verb constructions. We count these as multiwords as well.

  • pay_attention
  • take...time: There are several related idioms involving use of one’s time for some purpose. Include ‘the’ for the “extra effort” sense: take_the_time to help. Include a preposition for took_time_out_of (sacrifice), took_time_out/off (scheduled vacation).
  • waste/spend/save/have~time/money
    • CHANGE: was waste_time
  • give_ an _estimate, give_ a _quote on something [typically includes the process of estimation as well as offering that estimate to the customer]

well V-ed

Typically, don’t join:

  • well done/made/oiled

Exceptions:

  • my hamburger is well_done
  • that was a_job_well_done
  • a well_-_oiled_machine
  • he is well_-_read
  • well_fed

Changes requiring revisions of old annotations

  • a_lot
  • N_star
  • in_cash
  • possibly: get/have_done (hair, etc.)
  • highly~recommended, highly~trained
  • city, state, etc. locations: _ => ~
  • waste_time, spend_time => ~

TODO

  • good job, great job, look good
  • good job, good work, hard work [I’d be OK with ~ for these but we decided previously that good/great work should be left alone]
  • ‘look at it’: include ‘it’? could be specific or not
  • Short of that, One more thing -- ?
  • fix problem -- I'd say this is a collocation, so fix~problem
  • best restaurant out_there
  • fast and friendly [sounds slightly better than “friendly and fast”, but that probably reflects a preference for the word with fewer syllables to come first]
  • walk_in_the_door: entering a room or establishment
  • have/get_ nails/hair/tattoo/etc. _done (grooming)
  • ?? have/get done [work/repairs]
  • ?? do~work/job (cf. surgery)
  • ?? do~dishes