Home - ndobb/clinical-trials-seq2seq-annotation GitHub Wiki

Task Description

The goal of the Leaf Clinical Trials Logical Form Annotation Task (hereafter simply logical form annotation) is to create a corpus of clinical trial eligibility criteria with associated logical forms. The logical forms look similar to a series of nested functions and chained methods in JavaScript or Python, and are in fact designed to be valid code in those languages in order to simplify the annotation task. Using this annotated corpus, we will train a Seq2Seq (Sequence-to-Sequence) model to transform an input criterion text into a logical form, which can in turn be transformed into a SQL query.

Each example input criterion is composed of three parts:

'EXC'  

'-  Pregnant women'

'-  cond("Pregnant") female()'
  1. EXC or INC indicates whether the criterion is part of the inclusion or exclusion criteria. Often this is irrelevant to a given annotation, but may be useful for context.

  2. The original "raw" criterion.

  3. The augmented criterion, with relevant spans of text replaced with named entities. 'Augmented' means that existing named entity recognition models are used to replace raw text with named entities similar to the annotation output. This simplifies both the annotation task and Seq2Seq model training. For example, replacing the text "diabetes" with cond("diabetes") means the Seq2Seq model does not have to learn that "diabetes" is a condition; it needs only to learn where to position cond("diabetes").
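As a rough illustration of the augmentation step, the sketch below replaces recognized spans with function-style named entities. The `augment` function and the span format (`start`, `end`, `type`) are assumptions made for this example; the actual named entity recognition models and their output format are not described here.

```javascript
// Hypothetical sketch: replace recognized spans with named-entity calls.
// The entity format ({ start, end, type }) is an assumption for illustration.
function augment(text, entities) {
  // Apply replacements right-to-left so earlier offsets stay valid.
  const sorted = [...entities].sort((a, b) => b.start - a.start);
  let out = text;
  for (const { start, end, type } of sorted) {
    const span = out.slice(start, end);
    out = out.slice(0, start) + `${type}(${JSON.stringify(span)})` + out.slice(end);
  }
  return out;
}

const raw = "History of diabetes";
const augmented = augment(raw, [{ start: 11, end: 19, type: "cond" }]);
console.log(augmented); // History of cond("diabetes")
```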

Annotators should focus on item (3) and write the annotated logical form two lines below it:

[Image: ex1]

In this case, the criterion specifies that patients must be cond("Pregnant") and female(), and thus we use an intersect() function with those two arguments in the order they appear in the text. Note that:

  1. As with most programming languages, there should be a comma separating the arguments
  2. Indentation is not strictly necessary but helps for readability
  3. Functions and methods should be written in the order (left-to-right) they appear in the text
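Because logical forms are designed to be valid code, the form described above can be run as plain JavaScript once the function names are defined. The stubs below (`cond`, `female`, `intersect`) are hypothetical helpers written for this sketch, not part of any released tooling.

```javascript
// Hypothetical stubs: each function returns a plain object describing the call.
const cond = (name) => ({ fn: "cond", args: [name] });
const female = () => ({ fn: "female", args: [] });
const intersect = (...args) => ({ fn: "intersect", args });

// The logical form for "Pregnant women", with arguments in text order
// and a comma between them.
const form = intersect(
  cond("Pregnant"),
  female()
);

console.log(form.args.map((a) => a.fn)); // [ 'cond', 'female' ]
```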

Tools

This annotation task should be done using Visual Studio Code or a similar IDE. As each annotation file is technically valid JavaScript code, this allows the annotator to leverage VS Code's syntax highlighting and validation.
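The same kind of syntax validation can be approximated outside the editor. The sketch below is one possible check, not the project's own tooling: it compiles a logical form with `new Function()` without running it, so only the syntax is tested and undefined function names are not reported.

```javascript
// Sketch: check whether a logical form parses as JavaScript.
// new Function() compiles the source without running it, and throws a
// SyntaxError when the text is not valid JavaScript.
function isValidSyntax(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(isValidSyntax('intersect(cond("Pregnant"), female())')); // true
console.log(isValidSyntax('intersect(cond("Pregnant") female())'));  // false (missing comma)
console.log(isValidSyntax('drug().for(cond("hypertension"))'));      // true
```

Note that method names such as .for() and .except() are accepted because JavaScript allows reserved words as property names after a dot.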

The file structure of the annotation looks like:

- completed
  - batch1
  - batch2
- reviewing
  - batch1
  - batch2

As each annotation file is finished, move the completed file from the Reviewing/batch<x> folder to the corresponding Completed/batch<x> folder.

[Image: tool1]

Process

Batches of 100 files to be annotated can be found under Reviewing/batch<x> in each repo, as discussed above. Before annotating each batch, begin by doing the following:

  1. Make sure your local copy is up to date and make a new git branch.

# Pull latest
$ git pull

# Make new branch. The branch name can be whatever you'd like
$ git checkout -b batch1_attempt1

  2. Next, complete the annotation for the 100 files you are working on.

  3. Have git track your changed files and create a commit, then push to GitHub.

$ git add .
$ git commit -m "Committing first attempt at batch1"
$ git push --set-upstream origin batch1_attempt1

  4. Finally, in GitHub, create a Pull Request and set ndobb as a Reviewer.

  5. Nic will review and make comments. Revise as needed, and afterward merge your changes.

  6. Nic will load the next batch of files. Begin the next set of annotations by switching back to main and pulling again.

$ git checkout main
$ git pull

Examples

Example 1

'INC'

'-  age 20 - 34 years ;'

'-  age() eq(val("20"), op(BETWEEN), val("34"), temporal_unit(YEAR)) ;'

This example introduces the eq() function, which wraps numeric and date comparisons. In general, you should simply copy and paste eq() functions from the augmented criterion text and need not worry about their parameters.

The age() function takes no parameters. Instead, its .num_filter() method is called with the eq() as its argument.

Answer
age()
  .num_filter(
      eq(val("20"), op(BETWEEN), val("34"), temporal_unit(YEAR))
  )
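To make the meaning of this answer concrete, the sketch below defines hypothetical stubs so that the answer runs as JavaScript, then tests two ages against it. All stub definitions, the `matches` helper, and the choice to treat BETWEEN as inclusive on both ends are assumptions for illustration only.

```javascript
// Hypothetical stubs so the Example 1 answer runs as plain JavaScript.
const BETWEEN = "BETWEEN";
const YEAR = "YEAR";
const val = (v) => ({ val: Number(v) });
const op = (o) => ({ op: o });
const temporal_unit = (u) => ({ unit: u });
// eq() merges its parts; multiple val() entries are collected in order.
const eq = (...parts) => {
  const out = { vals: [] };
  for (const p of parts) {
    if ("val" in p) out.vals.push(p.val);
    else Object.assign(out, p);
  }
  return out;
};
// age() returns a node whose num_filter() stores the comparison.
const age = () => ({
  fn: "age",
  num_filter(filter) {
    return { ...this, filter };
  },
});

const form = age()
  .num_filter(
      eq(val("20"), op(BETWEEN), val("34"), temporal_unit(YEAR))
  );

// Assumption: BETWEEN is treated as inclusive on both ends.
const matches = (node, years) =>
  node.filter.op === BETWEEN &&
  years >= node.filter.vals[0] &&
  years <= node.filter.vals[1];

console.log(matches(form, 25)); // true
console.log(matches(form, 40)); // false
```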

Example 2

'EXC'

'-  Previous laryngeal framework surgery ( any type of thyroplasty , arytenoid adduction )'

'-  eq(temporal_per(PAST)) proc("laryngeal framework surgery") ( any type of proc("thyroplasty") , 
    proc("arytenoid adduction") )'

This example states that patients must have had a proc("laryngeal framework surgery") which occurred in the past (eq(temporal_per(PAST))). Moreover, the example states that proc("thyroplasty") and proc("arytenoid adduction") should be considered equivalent to proc("laryngeal framework surgery").

Answer
proc("laryngeal framework surgery")
  .temporality(
      eq(temporal_per(PAST))
  )
  .equiv(
      union(
          proc("thyroplasty"), 
          proc("arytenoid adduction")
      )
  )

Example 3

'EXC'

'-  Pregnant or lactating woman'

'-  cond("Pregnant") or cond("lactating") female()'

This example states that patients must be cond("Pregnant") or cond("lactating"), but in either case must also be female(). The union() function allows us to indicate that patients can be cond("Pregnant") or cond("lactating"), and the wrapping intersect() function ensures that they are also female().

Answer
intersect(
  union(
      cond("Pregnant"), 
      cond("lactating")
  ),
  female()    
)
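One way to read this answer is as set operations over patients. The sketch below is an illustration under assumed data: the patient IDs and the `cohorts` table are made up, and the stub functions are hypothetical.

```javascript
// Sketch: union() and intersect() as set operations over patient IDs.
// The patient IDs and the cohorts table are made up for illustration.
const cohorts = {
  Pregnant: new Set([1, 2]),
  lactating: new Set([2, 3]),
  female: new Set([1, 2, 3, 4]),
};

const cond = (name) => cohorts[name];
const female = () => cohorts.female;
// union(): patients present in at least one argument.
const union = (...sets) => new Set(sets.flatMap((s) => [...s]));
// intersect(): patients present in every argument.
const intersect = (first, ...rest) =>
  new Set([...first].filter((id) => rest.every((s) => s.has(id))));

const result = intersect(
  union(
      cond("Pregnant"),
      cond("lactating")
  ),
  female()
);

console.log([...result]); // [ 1, 2, 3 ]
```

Patient 4 is female but neither pregnant nor lactating, so the intersect() leaves that patient out.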

Example 4

'INC'

'-  obesity ( body mass index [ BMI ] > 30 kg / m 2 )'

'-  cond("obesity") ( vital("body mass index") [ vital("BMI") ] 
    eq(op(GT), val("30"), unit("kg"), unit("m 2")) )'

This example states that vital("body mass index") > 30 should be equivalent to cond("obesity"), while the acronym vital("BMI") is in turn equivalent to vital("body mass index").

Answer
cond("obesity")
  .equiv(
      vital("body mass index")
          .equiv(
              vital("BMI")
          )
          .num_filter(
              eq(op(GT), val("30"), unit("kg"), unit("m 2"))
          )
  )

Example 5

'EXC'

'-  prior immunotherapy or treatment with another anti PD - 1 agent besides nivolumab'

'-  eq(temporal_per(PAST)) proc("immunotherapy") or proc() with another 
    drug("anti PD - 1 agent") except() drug("nivolumab")'

This example finds patients who've had either proc("immunotherapy") or any kind of treatment .using() drug("anti PD - 1 agent") with the exception of drug("nivolumab").

Answer
union(
  proc("immunotherapy"),
  drug("anti PD - 1 agent")
      .except(
          drug("nivolumab")
      )
)
  .temporality(
      eq(temporal_per(PAST))
  )

Example 6

'INC'

'-  Present to Stanford Emergency Department as a trauma with a major operative lower extremity injury'

'-  enc() to loc(hosp("Stanford Emergency Department")) as a obs("trauma") with a 
    mod("major operative lower extremity") obs("injury")'

This example introduces a sequence of events, or seq(). The example states that patients must have been seen at loc(hosp("Stanford Emergency Department")) with obs("trauma"), for a mod("major operative lower extremity") obs("injury"), all at the same time (or at least in the course of the same encounter).

To do so, we specify a seq() in the following format, where each event after the anchor is wrapped in before(), during(), or after():

seq(
  <anchor event>,
  <next event> { before() | during() | after() },
  ...
)

Answer
seq(
  enc()
      .loc(hosp("Stanford Emergency Department")),
  during(
      obs("trauma")
  ),
  during(
      obs("injury")
          .mod("major operative lower extremity")
  )
)
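The seq() format can be checked mechanically. The sketch below uses hypothetical stubs to run the answer above and to reject any later event that is not wrapped in before(), during(), or after(); the object shapes and the error message are assumptions for illustration.

```javascript
// Hypothetical stubs for checking the seq() template.
const node = (fn, args = []) => ({
  fn,
  args,
  // Chained methods used in Example 6; each returns the node for chaining.
  loc(x) { this.location = x; return this; },
  mod(x) { this.modifier = x; return this; },
});
const enc = () => node("enc");
const obs = (name) => node("obs", [name]);
const hosp = (name) => node("hosp", [name]);
const before = (e) => ({ rel: "before", event: e });
const during = (e) => ({ rel: "during", event: e });
const after = (e) => ({ rel: "after", event: e });

// seq(): the first argument is the anchor event; every later argument
// must be wrapped in before(), during(), or after().
function seq(anchor, ...next) {
  for (const step of next) {
    if (!["before", "during", "after"].includes(step.rel)) {
      throw new Error("events after the anchor need before/during/after");
    }
  }
  return { fn: "seq", anchor, next };
}

const form = seq(
  enc()
      .loc(hosp("Stanford Emergency Department")),
  during(
      obs("trauma")
  ),
  during(
      obs("injury")
          .mod("major operative lower extremity")
  )
);

console.log(form.next.map((s) => s.rel)); // [ 'during', 'during' ]
```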

Example 7

'INC'

'-  Systolic blood pressure ≥ 130 mmHg OR diastolic blood pressure ≥ 80 mmHg 
    ( or on hypertension medication )'

'-  vital("Systolic blood pressure") eq(op(GTEQ), val("130"), unit("mmHg")) OR 
    vital("diastolic blood pressure") eq(op(GTEQ), val("80"), unit("mmHg")) ( or 
    eq(temporal_per(PRESENT)) cond("hypertension") drug() )'

This example includes several union() arguments (vital("Systolic blood pressure") or vital("diastolic blood pressure") or a drug()) and also introduces the .for() method indicating drugs used as treatment .for() cond("hypertension").

Answer
union(
  vital("Systolic blood pressure")
      .num_filter(
          eq(op(GTEQ), val("130"), unit("mmHg"))
      ), 
  vital("diastolic blood pressure")
      .num_filter(
          eq(op(GTEQ), val("80"), unit("mmHg"))
      ), 
  drug()
      .for(
          cond("hypertension")
      )
      .temporality(
          eq(temporal_per(PRESENT))
      )
)

Functions

Functions form the core of the logical form annotation task. Annotated functions, as in many programming languages, are alphanumeric names followed by opening and closing parentheses, e.g., cond().
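As a small illustration of this naming convention, the sketch below lists the function names that appear in a logical form. The `functionNames` helper and its pattern (a letter or underscore, then letters, digits, or underscores, directly before an opening parenthesis) are assumptions for this example; chained method names such as num_filter would be listed as well.

```javascript
// Sketch: list the function names used in a logical form.
// The name pattern is an assumption about what counts as a function name.
function functionNames(source) {
  // Blank out quoted strings first so text inside quotes is not scanned.
  const code = source.replace(/"[^"]*"/g, '""');
  return [...code.matchAll(/[A-Za-z_][A-Za-z0-9_]*(?=\()/g)].map((m) => m[0]);
}

console.log(functionNames('intersect(union(cond("Pregnant"), cond("lactating")), female())'));
// [ 'intersect', 'union', 'cond', 'cond', 'female' ]
```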
