BERT-ER: Query-specific BERT Entity Representations for Entity Ranking - shubham526/SIGIR2022-BERT-ER GitHub Wiki

Shubham Chatterjee and Laura Dietz. 2022. BERT-ER: Query-specific BERT Entity Representations for Entity Ranking. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22).

This is an online appendix for the paper. This page contains:

  • Instructions on how to execute the code.
  • Links to datasets used and additional resources developed for this paper.

All data associated with this work is licensed and released under a Creative Commons Attribution-ShareAlike 4.0 International License. The TREC Complex Answer Retrieval and entity aspect linking datasets have their individual licenses. For details, check the respective web pages.

1. Downloads

1.1. Publicly available datasets

  • TREC Complex Answer Retrieval (CAR) dataset: We use the following subsets: BenchmarkY1-Train, BenchmarkY2-Test, unprocessedAllButBenchmark, paragraphCorpus.
  • DBpedia-Entity v2 CAR dataset: The DBpedia-Entity v2 dataset projected onto TREC CAR year 2 entities.
  • Entity aspect linking dataset: We use the aspect catalog.

1.2. Resources released with this work

  • CAR: Data, runs, and trained models for the TREC CAR dataset.
  • DBpedia-Entity v2 [Part 1] [Part 2]: Data, runs, and trained models for the DBpedia-Entity v2 dataset.
  • Embeddings: Mapping from CAR EntityIds to Wikipedia2Vec/E-BERT/ERNIE embeddings.
  • Indexes: Pre-built Lucene indexes for Wikipedia used in this work.
  • Entity2Psg: TSV file mapping each EntityId to a List[PassageId] (which passages in the corpus contain the entity).

2. ❗ Data leakage for TREC CAR

In the construction of the CAR AllButBenchmark dataset, the organizers utilized several metadata fields, one of which was inlinks (links from other Wikipedia pages to the current page). The ground truth for each query was created by incorporating all entity links from the corresponding article. Consequently, all true entities possess an inlink from the query entity.

However, the page index we used contained only the full text and lacked the metadata fields. This caused a problem when computing the entity expansions (such as ECM, ECM-RM, ECM-Psg), for which we relied on inlinks from the entity index. A further complication is that we could not simply exclude all query entities from the inlinks, because any page could potentially serve as a query. For instance, a page titled "Large Train" could be a valid query.

To overcome these challenges, we implemented a solution referred to as "killed" runs. The essence of this solution is the careful construction of an expansion distribution over entities in which we always filter out the query entity, i.e., any entity whose ID matches that of the query entity. We develop versions of the entity-link expansion similar to RM3, normalizing by the total number of entity mentions.
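
The snippet below is a minimal sketch of this idea, not the exact implementation in our code; the mentions input (a list of entity-id/weight pairs collected from the feedback passages) is a hypothetical stand-in for our internal data structures.

from collections import defaultdict
from typing import Dict, List, Tuple

def killed_expansion_distribution(query_entity_id: str,
                                  mentions: List[Tuple[str, float]]) -> Dict[str, float]:
    """RM3-style expansion distribution over entities that always filters out
    ("kills") the query entity before renormalizing."""
    weights = defaultdict(float)
    for entity_id, weight in mentions:
        if entity_id == query_entity_id:  # kill any entity with the same ID as the query entity
            continue
        weights[entity_id] += weight
    total = sum(weights.values())         # normalize by the total mass of entity mentions
    return {e: w / total for e, w in weights.items()} if total > 0 else {}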

In addition, we experimented with an NER PageViaSection approach. This method issues a section-path query for each section and then aggregates the resulting rankings via either Reciprocal Rank (RecipRank) or rank-score normalization as used in RM3. In the latter case, the scores are renormalized into a multinomial distribution.
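
As a rough illustration (a sketch, not our exact code), reciprocal-rank aggregation over a set of per-section entity rankings can be written as:

from collections import defaultdict
from typing import Dict, List

def recip_rank_aggregate(section_rankings: List[List[str]]) -> Dict[str, float]:
    """Aggregate several section-level entity rankings into a single score per entity:
    score(e) = sum over rankings of 1 / rank(e)."""
    scores = defaultdict(float)
    for ranking in section_rankings:
        for rank, entity_id in enumerate(ranking, start=1):
            scores[entity_id] += 1.0 / rank
    return dict(scores)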

3. ❗ New results/data for BERT-ER on TREC CAR

Following the identification of the data leakage described above, we re-ran our code using new runs on TREC CAR. You can find the new runs, data, models, etc. for the CAR dataset below.

  • CAR: New data, runs, and trained models for the TREC CAR dataset.

4. How do I create the Wikipedia indexes used in this work myself?

The code to create the various indexes used in this work can be found here. See the README of the repository for how to use the code.

5. Necessary Installations

Coding. The code is partially in Java and partially in Python. Hence, you would need both Java and Python installed on your system. The code has been tested using Java openJDK 13 and Python 3.7.

Libraries. To run the code, the following libraries need to be installed:

  1. TREC CAR Tools. The TREC CAR data can be read using the official trec-car-tools. The Python code for creating QueryId → Query and EntityId → EntityName mappings requires trec-car-tools to read the TREC CAR data. It can be installed via pip as follows: pip install trec-car-tools. For more information, read the documentation, and see the Github repository.
  2. TagMe Entity Linker. We use the TagMe entity linker to entity link the queries. It can be installed via pip as follows: pip install tagme. For more details, read the documentation, see the demo, and read the paper.
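
A minimal usage sketch for both libraries is below; the CBOR file path and the GCUBE token are placeholders, and the attribute names follow the libraries' documentation.

import tagme
from trec_car.read_data import iter_outlines

# Read TREC CAR outlines and print QueryId -> QueryName (page-level queries).
with open('train.pages.cbor-outlines.cbor', 'rb') as f:  # placeholder path
    for page in iter_outlines(f):
        print(page.page_id, page.page_name)

# Annotate a query with TagMe (requires a D4Science gcube token).
tagme.GCUBE_TOKEN = '<your-gcube-token>'
response = tagme.annotate('power nap benefits')
for ann in response.get_annotations(0.1):  # keep annotations with rho score >= 0.1
    print(ann.begin, ann.end, ann.entity_title, ann.mention, ann.score)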

Install Java Code. To install the Java code, you need Maven. We tested the code using Maven v3.6.3.

  • Clone this repository using git clone.
  • Inside the repository, there are two folders: java (containing the java code) and python (containing the python code). From the java directory, run the following command: mvn clean install. This should create a jar file called SIGIR2022-BERT-ER-1.0-SNAPSHOT-jar-with-dependencies.jar inside java/target.

The Python scripts have a help function which may be called using the --help option with the script. For example, python create_train_data.py --help.

6. Create the initial resources

Before running the code to produce the run files (features), we need several resources such as indexes, entity-linked queries, QueryID/Query mappings, etc. Below, we detail how to create these resources.

6.1. Indexes

From a Wikipedia dump and a text corpus, we extract the following types of information which are used to derive features for the BERT-ER++ framework.

  • Page: Full text of Wikipedia pages, i.e., all visible text, including the title, headings, and content paragraphs.
  • Entity: Knowledge graph representation of entities, using only head information such as the title, lead text, and name variations derived from the anchor text of incoming links, redirects, and disambiguations. This is the representation commonly used by entity-linking methods such as TagMe.
  • Section: Sections (top-level) of Wikipedia pages as a representation of topical entity aspects, which include heading and section content, as well as page title and lead text.
  • Paragraph: Paragraphs from the corpus, with full text and entity links preserved.

In this work, we use the TREC CAR benchmark. We derive the page, entity, and section indexes from the allButBenchmark data (omitting query pages) and the paragraph index from the paragraphCorpus.

Code. The code to create the above types of indexes can be found here. Please read the README of the repository for how to use the code.

6.2. Create a TSV file of mappings

  1. Mapping from QueryIds to Queries
python3 create_query_id_to_name_mapping.py --outlines $outlines_file --output $output_file --query-model $query_model
  • outlines_file: Available with the TREC CAR dataset. We need to use the outlines file corresponding to the benchmark we are using. For example, for BenchmarkY1-Train, the outlines file to be used is train.pages.cbor-outlines.cbor.
  • query_model: Should be either title (for page-level queries) or section (for section-level queries).
  2. Mapping from EntityIds to EntityNames
python3 create_entity_id_to_name_mapping.py --corpus $paragraph_cbor --save $output_file 
  • paragraph_cbor: paragraphCorpus available with the TREC CAR dataset.

Note. The above scripts write the output in a TSV format, hence the file names should end with .tsv.

Reading TSV files. TSV files can be read easily using pandas in Python as follows: pandas.read_csv('myfile.tsv', sep='\t'). Below, we provide a code snippet to read TSV files into a Python dict.

from typing import Dict

def read_tsv(file: str) -> Dict[str, str]:
    res = {}
    with open(file, 'r') as f:
        for line in f:
            # Split on the first tab only and strip the trailing newline from the value.
            key, value = line.rstrip('\n').split('\t', 1)
            res[key] = value
    return res

Analogously, read_tsv may also be written in another language such as Java.

6.3. Create a TSV file of query annotations

The GEEER-based baselines in our paper need entity link annotations for the queries. We use the TagMe entity linker for this purpose.

python3 create_query_annotations.py --queries $queries_file --save $output_file
  • queries_file: The QueryId to QueryName mapping file created above.
  • output_file: A .tsv file.

Annotation. The TSV file contains a JSON-encoded string for each query. The JSON string is a list of key-value pairs representing information about each entity in the query. The following information about each entity is available:

  • begin: Starting character offset where the entity is found (included).
  • end: Ending character offset where the entity is found (excluded).
  • entity_id: Id of the entity in Wikipedia (not the same as CAR entity-ids).
  • entity_name: Title of the Wikipedia page of the entity.
  • score: Annotation accuracy.
  • mention: Anchor text of the entity mentioned in the query.

The entity annotation can be loaded using Python's json module: json.loads(annotation).
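
For illustration, here is a minimal sketch of reading such an annotation file, assuming one QueryId and one JSON string per tab-separated line (the exact layout may differ slightly):

import json
from typing import Dict, List

def read_annotations(file: str) -> Dict[str, List[dict]]:
    """Read the query annotation TSV: QueryId <tab> JSON list of entity annotations."""
    annotations = {}
    with open(file, 'r') as f:
        for line in f:
            query_id, json_str = line.rstrip('\n').split('\t', 1)
            annotations[query_id] = json.loads(json_str)  # dicts with begin, end, entity_id, ...
    return annotations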

6.4. Create an entity-support passage run file

We use the method EntityContextNeighbors (the best performing feature) from Chatterjee et al., 2019 to create the entity-support passage run file used in this work.

java -jar SIGIR2022-BERT-ER-1.0-SNAPSHOT-jar-with-dependencies.jar make-support-psg-run $mode $indexDir $entityPassageFile $entityRunFile $entityFile $queryIdToNameFile $entityIdToNameFile $stopWordsFile $outFile $parallel 
  • mode: Can be either train or test. The train mode is used to create the entity-support passage run file for the training data, and the test mode for the test data.
  • indexDir: Index of CAR paragraphs (created above).
  • entityPassageFile: Mappings from CAR EntityId -> List[PassageIds] (which passages contain a given entity)
  • entityRunFile: Entity run file in TREC format.
  • entityFile: TSV file containing entity descriptions. Required only when mode=train.
  • queryIdToNameFile: TSV file of mappings from QueryId to Query.
  • entityIdToNameFile: TSV file of mappings from EntityId to EntityName.
  • stopWordsFile: File containing stop words (one on each line).
  • parallel: Whether to run the code in parallel. Use -Djava.util.concurrent.ForkJoinPool.common.parallelism=N to set the number of threads.

All the above data can be downloaded from the links above.

6.5. Create the TSV file of entity descriptions

The paper uses several types of entity descriptions to fine-tune BERT for entity ranking. The code to create TSV files with these descriptions can be found in the java folder. First, install the Java code using Maven as described above.

To create the entity description files, run the JAR file created after installing the Java code. The general syntax for running the JAR file is: java -jar <JarFile>.jar <mode> <type> <arguments>.

  • Available modes: train, dev, and test.
  • Available types: SupportPsg, LeadText, AspectCandidateSet, AspectSupportPsg, BM25Psg, ECNRun

Run the JAR file with only the mode and type to see the arguments for that mode and type. For example, running java -jar SIGIR2022-BERT-ER-1.0-SNAPSHOT-jar-with-dependencies.jar make-support-psg-run train BM25Psg displays the following message:

BM25Psg:
  <paraIndex>: Path to the paragraph index file.
  <entityParaFile>: Path to the entity paragraph file.
  <entityFile>: Path to the entity file.
  <queriesFile>: Path to the queries file.
  <entitiesFile>: Path to the entities file.
  <stopWordsFile>: Path to the stop words file.
  <outFile>: Path to the output file.
  <parallel>: Whether to run in parallel (true/false).

The code will produce a TSV file of the format <QueryId>\t<EntityId>\t<Description>. The description is a JSON-encoded string with the following keys: score, aspect_id, text, and para_id. Depending on the type of description, some of these fields will be empty. For example, for a BM25 description, the field aspect_id will be empty. Also, the field score may be zero if the file is created using a positive/negative entities file (see below) for training data, because the positive/negative entities are not ranked and hence have no score. These fields are included in the TSV file only as metadata; they are not used elsewhere.

An example line from the TSV file created using positive entities file for BM25 description:

enwiki:Polyelectrolyte	enwiki:Shutter%20(photography)	{"score":0,"aspect_id":" ","text":"In photography, the shutter-release button (sometimes just shutter release or shutter button) is a push-button found on many cameras, used to take a picture. When pressed, the shutter of the camera is \"released\", so that it opens to capture a picture, and then closes, allowing an exposure time as determined by the shutter speed setting (which may be automatic).  Some cameras also utilize an electronic shutter, as opposed to a mechanical shutter. ","para_id":"f3a105c6a95997d3c3298291c3dc0d80503ff08d"}
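
A short sketch of parsing such a file, assuming the <QueryId>\t<EntityId>\t<Description> layout described above:

import json
from typing import Dict, Tuple

def read_descriptions(file: str) -> Dict[Tuple[str, str], dict]:
    """Read the entity description TSV: QueryId <tab> EntityId <tab> JSON description."""
    descriptions = {}
    with open(file, 'r') as f:
        for line in f:
            query_id, entity_id, desc_json = line.rstrip('\n').split('\t', 2)
            # The description has the keys: score, aspect_id, text, para_id.
            descriptions[(query_id, entity_id)] = json.loads(desc_json)
    return descriptions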

Note.

  • The argument paraIndex above is actually an index of entity aspect linked CAR corpus. You can download a pre-built Lucene index from the links above.
  • The argument entityFile is the file containing the positive/negative entities. The files are included in the data available for download above. Alternatively, you can use the scripts car_make_neg_entity_file.py and dbpedia_make_pos_neg_entity_file_for_train_data.py to create them. Run the scripts with the --help argument to see the arguments for the scripts.
  • The arguments must be specified in the order in which they are printed in the help message.

6.6. Create train and test data to fine-tune BERT for entity ranking using various descriptions

  1. Create train data:
python3 make_train_data.py $type $pos_ent_data $neg_ent_data $k $queries $save_dir
  • type: One of (pairwise|pointwise).
  • pos_ent_data: File containing descriptions of positive examples (created as above).
  • neg_ent_data: File containing descriptions of negative examples (created as above).
  • k: Number of positive and negative entities to consider while making the data.
  • queries: File containing queries.
  • save_dir: Directory where to save.
  2. Create dev or test data:
python3 make_dev_or_test_data.py $entity_data $queries $qrels $save_dir
  • entity_data: File containing descriptions of entities.
  • qrels: Qrels file.
  • queries: File containing queries.
  • save_dir: Directory where to save.

The above code will create the data in JSON-L format (each line is a JSON encoded string). An example data string using BM25Psg is below:

{"doc": "The supercritical CO acts selectively on the caffeine, releasing the alkaloid and nothing else. Water-soaked coffee beans are placed in an extraction vessel. The extractor is then sealed and supercritical CO is forced into the coffee at pressures of 1,000 pounds per square inch to extract the caffeine. The CO acts as the solvent to dissolve and draw the caffeine from the coffee beans, leaving the larger-molecule flavor components behind. The caffeine-laden CO is then transferred to another container called the absorption chamber where the pressure is released and the CO returns to its gaseous state and evaporates, leaving the caffeine behind. The caffeine is removed from the CO using charcoal filters, and the caffeine free CO is pumped back into a pressurized container for reuse on another batch of beans. This process has the advantage that it avoids the use of potentially harmful substances. Because of its cost, this process is primarily used to decaffeinate large quantities of commercial-grade, less-exotic coffee found in grocery stores.", "label": 1, "query_id": "enwiki:Decaffeination", "query": "Decaffeination", "doc_id": "enwiki:Charcoal"}

7. Baselines

Below, we detail how to use our code to reproduce some of the baselines used in this work.

  1. BERT-LeadText++: Fine-tune BERT for entity ranking using the lead text of entities. For this, we use the code here.
python3 train.py $model_type $train $max_len $save_dir $dev $qrels $save $checkpoint $run $metric $epoch $batch_size $learning_rate $n_warmup_steps $eval_every $num_workers $freeze_bert $cuda $use_cuda
  • model_type: Type of model (pairwise|pointwise). Default: pairwise.
  • train: Training data.
  • max_len: Maximum length for truncation/padding. Default: 512
  • save_dir: Directory where model is saved.
  • dev: Development data.
  • qrels: Ground truth file in TREC format.
  • save: Name of checkpoint to save. Default: bert.bin
  • checkpoint: Name of checkpoint to load. Default: None
  • run: Output run file in TREC format. Default: dev.run
  • metric: Metric to use for evaluation. Default: map
  • epoch: Number of epochs. Default: 20
  • batch_size: Size of each batch. Default: 8.
  • learning_rate: Learning rate. Default: 2e-5.
  • n_warmup_steps: Number of warmup steps for scheduling. Default: 1000.
  • eval_every: Evaluate every number of epochs. Default: 1
  • num_workers: Number of workers to use for DataLoader. Default: 8
  • cuda: CUDA device number. Default: 0.
  • use_cuda: Whether or not to use CUDA. Default: False.
python3 test.py $model_type $test $max_len $save_dir $qrels $eval_run $checkpoint $run $batch_size $num_workers $cuda $use_cuda
  • eval_run: Whether or not to evaluate the run file. Default: False. Provide the qrels if you set this flag.
  2. GEEER: The entity retrieval system from Gerritse et al., 2020. You can use the same script to re-rank entities using Wikipedia2Vec (as in the original GEEER paper), E-BERT, or ERNIE embeddings. A rough score-combination sketch follows this list.
python3 geeer_entity_rerank.py $run $annotations $embeddings $embedding_method $name2id $k $save
  • run: Entity run file to re-rank.
  • annotations: File containing TagMe annotations for queries.
  • embeddings: Entity embedding file.
  • embedding_method: Entity embedding method (Wiki2Vec|ERNIE|E-BERT).
  • name2id: EntityName to EntityId mappings.
  • k: Top-K entities to re-rank from run file.
  • save: Output run file (re-ranked).
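
For background, here is a rough sketch of embedding-based re-ranking in the spirit of Gerritse et al.; our script may differ in details such as normalization and the interpolation weight. Each candidate entity is scored by its embedding similarity to the query's linked entities, weighted by the TagMe confidence, and interpolated with the retrieval score.

import numpy as np
from typing import Dict, List, Tuple

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(candidates: List[Tuple[str, float]],      # (entity_id, retrieval_score)
           query_entities: List[Tuple[str, float]],  # (entity_id, tagme_confidence)
           emb: Dict[str, np.ndarray],
           lam: float = 0.5) -> List[Tuple[str, float]]:
    """Interpolate the retrieval score with confidence-weighted embedding similarity."""
    reranked = []
    for entity_id, ret_score in candidates:
        sim = sum(conf * cosine(emb[q], emb[entity_id])
                  for q, conf in query_entities
                  if q in emb and entity_id in emb)
        reranked.append((entity_id, lam * ret_score + (1.0 - lam) * sim))
    return sorted(reranked, key=lambda x: x[1], reverse=True)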

8. BERT-based features for Learning-To-Rank

In the paper, we fine-tune BERT for entity ranking using various descriptions and then combine the resulting features with other features using LTR. To create the BERT-based features, use the code here. Run the code in the same way as described for BERT-LeadText++ above; the only difference is that we use the train/test data created with a different description.

9. Other entity features ("++" features)

The "++" features are created using the various types of indexes described above. The repository contains a bash script called query_car.sh. Change the values in the script to correctly point to the resources on your system. Then execute the script as follows: ./query_car.sh page for page-level runs and ./query_car.sh section for section-level runs.

10. Learning-to-Rank

We perform our learning-to-rank experiments using the toolkit ranklips. Read about it here.

11. Utility bash scripts

We provide some utility bash scripts to automate some of the data creation in this work. See the bash folder in this repository. The folder includes the following scripts:

  1. divide_file_by_dataset.sh: Divide the DBpedia runs by subsets (INEX_LD, QALD2, etc.). Execute the script without arguments to display usage instructions.
  2. make_bert_data.sh: Script to automate the creation of the train/test data using various entity description files. Change the arguments to point to the correct resources on your system.
  3. make_data_files.sh: Script to automate the creation of description files. Change the arguments to point to the correct resources on your system.
  4. train.sh: Script to automate 5-fold CV for training.
  5. inference.sh: Script to automate inference on models trained using 5-fold CV.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1846017. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Cite

@inproceedings{chatterjee2022berter,
  author = {Chatterjee, Shubham and Dietz, Laura},
  title = {BERT-ER: Query-specific BERT Entity Representations for Entity Ranking},
  year = {2022},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3477495.3531944},
  doi = {10.1145/3477495.3531944},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  numpages = {10},
  location = {Madrid, Spain},
  series = {SIGIR '22}
}

Contact

If you have any questions, please contact Shubham Chatterjee at [email protected] or [email protected].
