Eye tracking - theDebbister/cognitiveNLP-dataCollection GitHub Wiki

Eye-tracking

An eye tracker measures eye positions and eye movements. Gaze patterns are an indirect measurement of the cognitive processes that occur during reading. Many eye-tracking corpora are available in multiple languages and can be used for NLP purposes.

This collection contains corpora in the following languages:

Chinese
Danish
Dutch
English
French
German
Hindi
Persian
Portuguese (Brazilian)
Russian
Spanish
Swedish
Turkish
Multilingual

Chinese

Hong Kong Corpus of Chinese Sentence and Passage Reading (HKC)

Stimulus: 300 one-line single sentences and 7 multiline passages in simplified Chinese
Participants: 98 native speakers
Data: https://osf.io/7uq3j/
Reference: Wu & Kit (2023)

The Database of Eye-Movement Measures on Words in Chinese Reading

Stimulus: 8,551 Chinese words in full sentences
Participants: 1,718 students, native speakers
Data: https://osf.io/94wue/
Reference: Zhang et al. (2022)

GECO-CN

Stimulus: Novel "The Mysterious Affair at Styles" ("斯泰尔斯庄园奇案" in Chinese) written by Agatha Christie; contains L1 Chinese reading and L2 English reading
Participants: 30 participants
Data: https://osf.io/pmvhd/?view_only=77def2827a514254957cc846e14826cf
Reference: Sui et al. (2022)

Beijing Sentence Corpus

Stimulus: 150 sentences
Participants: 60 participants
Data: https://osf.io/vr3k8/
Reference: Pan et al. (2021)

Gaze Behavior in Text Summarization Dataset

Stimulus: 100 articles from public news websites
Participants: 50 participants
Data: https://github.com/MMLabTHUSZ/ADEGBTS
Reference: Yi et al. (2020)

Chinese Word Length Effects

Stimulus: 90 sentences
Participants: 35
Provided features: first fixation duration, single fixation duration, gaze duration, total fixation duration, skipping probability
Data: https://osf.io/e2ws6/
Reference: Zang et al. (2018)
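
Word-level measures like the ones listed for this corpus are usually derived from the temporal sequence of fixations. The following is a minimal sketch under commonly used definitions. The function name, the `(word_index, duration_ms)` input format, and the output keys are hypothetical and do not reflect the file format of this or any other corpus listed here; individual corpora may define the measures slightly differently.

```python
def word_measures(fixations, n_words):
    """Derive first-pass and total measures per word.

    fixations: list of (word_index, duration_ms) tuples in temporal order
               (a hypothetical input format chosen for illustration).
    n_words:   number of words in the sentence.
    """
    measures = {
        w: {"ffd": None, "sfd": None, "gd": None, "tfd": 0, "skipped": True}
        for w in range(n_words)
    }

    # Group consecutive fixations on the same word into runs.
    runs = []
    for word, dur in fixations:
        if runs and runs[-1][0] == word:
            runs[-1][1].append(dur)
        else:
            runs.append((word, [dur]))

    furthest = -1  # rightmost word fixated so far
    for word, durs in runs:
        m = measures[word]
        # Total fixation duration: every fixation on the word counts.
        m["tfd"] += sum(durs)
        # A run is first-pass only if no word at or beyond this one
        # has been fixated before.
        if word > furthest:
            m["ffd"] = durs[0]                 # first fixation duration
            m["gd"] = sum(durs)                # gaze duration
            if len(durs) == 1:
                m["sfd"] = durs[0]             # single fixation duration
            m["skipped"] = False
        furthest = max(furthest, word)
    return measures


# Word 2 is skipped in first pass and only fixated after a regression.
example = [(0, 200), (1, 180), (1, 120), (3, 250), (2, 150), (3, 100)]
result = word_measures(example, n_words=4)
print(result[1])  # gaze duration 300, no single fixation duration
print(result[2])  # skipped in first pass, total fixation duration 150
```

Skipping probability for a word is then the proportion of readers (or trials) in which `skipped` is true for that word.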

Reading Attention

Stimulus: 15 questions and 60 answer documents
Participants: 29
Provided features: fixations and saccades
Data: http://www.thuir.cn/group/~YQLiu/
Reference: Li et al. (2018)

Danish

CopCo

Stimulus: 20 speech manuscripts
Participants: 22 native speakers, 19 native speakers with dyslexia, 10 Danish L2 speakers of various levels
Provided features: first fixation duration, mean fixation duration, first pass duration, go-past time, total reading time, landing position, number of fixations, mean saccade duration, peak saccade velocity
Data: https://osf.io/ud8s5/
Reference: Hollenstein et al. (2022)

Dutch

RaCCooNS (Radboud Coregistration Corpus of Narrative Sentences)

Also includes EEG data.
Stimulus: 200 Dutch sentences from the SONAR-500 Dutch corpus (book section)
Participants: 37
Provided features: raw eye-tracking data, the preprocessed eye-tracking data at the fixation, word, and trial levels
Data: https://data.ru.nl/collections/ru/cls/eeg_et_sentence_reading_dsc_556?0
Reference: Frank & Aumeistere (2023)

GECO Corpus

Also has an English part.
Stimulus: novel by Agatha Christie
Participants: 19 bilingual, 14 monolingual readers
Provided features: first fixation duration, single fixation duration, go-past time, total reading time, gaze duration
Data: http://expsy.ugent.be/downloads/geco/
Reference: Cop et al. (2017)

Mental Simulation Corpus

Stimulus: 3 existing Dutch short stories (2143, 2659, and 2988 words)
Participants: 102
Data: https://osf.io/qgx26/
Reference: Mak & Willems (2019)

English

GECO-CN (see the entry under Chinese; includes L2 English reading)

CELER

Stimulus: Sentences from the Wall Street Journal
Participants: 69 native English speakers and 296 English learners
Data: https://github.com/berzak/celer (access restricted due to licensing of the reading materials)
Reference: Berzak et al. (2022)

VQA-MHUG

Stimulus: 3990 question and image pairs (tailored towards visual question answering), tagged and balanced by reasoning type and difficulty.
Participants: 49 participants
Data: https://perceptualui.org/publications/sood21_conll/
Reference: Sood et al. (2021)

SAT Reading Dataset

Stimulus: four SAT passages for reading comprehension from practice tests
Participants: 95 undergraduate students
Data: https://github.com/ahnchive/SB-SAT
Reference: Ahn et al. (2020)

MQA-RC

Stimulus: 32 documents (around 200-250 words each) containing movie plot synopses from the MovieQA dataset
Participants: 23 English native speakers
Data: https://perceptualui.org/research/datasets/MQA-RC/
Reference: Sood et al. (2020)

Zurich Cognitive Language Processing Corpus (ZuCo) 1 + 2

Simultaneous eye tracking and EEG recordings.

ZuCo 1

Stimulus: Sentences from Wikipedia and the Stanford Sentiment Treebank, normal reading and annotation
Participants: 12
Provided features: number of fixations, first fixation duration, single fixation duration, go-past time, total reading time, gaze duration, pupil size
Data: https://osf.io/q3zws/
Reference: Hollenstein et al. (2018)

ZuCo 2

Stimulus: 700 sentences from Wikipedia, normal reading and annotation
Participants: 18
Provided features: number of fixations, first fixation duration, single fixation duration, go-past time, total reading time, gaze duration, pupil size
Data: https://osf.io/2urht/
Reference: Hollenstein et al. (2020)

GECO Corpus

Also includes a Dutch part.
Stimulus: novel by Agatha Christie
Participants: 19 bilingual, 14 monolingual readers
Provided features: first fixation duration, single fixation duration, go-past time, total reading time, gaze duration
Data: http://expsy.ugent.be/downloads/geco/
Reference: Cop et al. (2017)

Passage reading

Stimulus: 40 passages of text
Participants: 48
Data: https://osf.io/4qtnf/
Reference: Parker et al. (2017)

Provo Corpus

Stimulus: online news articles, popular science magazines, and public-domain works of fiction
Participants: 84
Data: https://osf.io/sjefs/
Reference: Luke & Christianson (2016)

ASD Data

Stimulus: 27 individual texts from various domains (4,658 words in total)
Participants: 14-20 (contains data of subjects with and without autism)
Data: https://github.com/anomymous1/ASD-Data/
Provided features: Time to 1st View (sec), Time Viewed (sec), Time Viewed (%), Fixations (#), Revisits (#). Texts also include readability scores and comprehension questions.
Reference: Yaneva (2016)

CFILT Datasets

The Center for Indian Language Technology (CFILT) offers 6 eye tracking datasets specifically recorded for NLP purposes.
Data: http://www.cfilt.iitb.ac.in/cognitive-nlp/

Essay Grading

Stimulus: 48 essays selected from the ASAP AEG dataset
Participants: 8 fluent English speakers
Reference: Mathias et al. (2020)

Text Quality

Stimulus: 30 texts from different sources (Wikipedia, Simple Wikipedia, news articles)
Participants: 20 fluent English speakers
Reference: Mathias et al. (2018)

Scanpath

Stimulus: sentences from Wikipedia and Simple Wikipedia
Participants: 16
Reference: Mishra et al. (2017)

Sarcasm

Stimulus: Twitter or Amazon movie reviews
Participants: 7
Reference: Mishra et al. (2016)

Coreference

Stimulus: MUC-6 dataset
Participants: 14
Reference: Cheri et al. (2016)

Sentiment

Stimulus: movie reviews from a movie corpus and from Twitter
Participants: 5
Reference: Joshi et al. (2014)

UCL Corpus

Stimulus: 205 sentences
Participants: 43
Data: https://link.springer.com/article/10.3758/s13428-012-0313-y#SupplementaryMaterial
Reference: Frank et al. (2013)

This dataset also includes self-paced reading times.

Dundee Corpus

Stimulus: newspaper articles
Participants: 10
Data: can be provided by Alan Kennedy upon request.
Reference: Kennedy et al. (2003)

Also includes a French part.

French

Dundee Corpus

Stimulus: newspaper articles
Participants: 10
Data: can be provided by Alan Kennedy upon request.
Reference: Kennedy et al. (2003)

Also includes an English part.

German

Potsdam Textbook Corpus

Stimulus: Scientific texts read by experts and non-experts
Participants: 75
Data: https://osf.io/dn5hp/
Reference: Jäger et al. (2021)

Multimodal Duolingo Bio-Signal Dataset

Stimulus: German language lessons using the web-based Duolingo
Participants: 22 participants (either native English speakers or fluent in English)
Data: https://figshare.com/s/688e387fbfdc000f4e90
Reference: Notaro et al. (2018)

This dataset also contains EEG and mouse-movement metrics.

Potsdam Sentence Corpus

Stimulus: 144 sentences
Participants: 33
Provided features: predictability estimates
Data: http://read.psych.uni-potsdam.de/
Reference: Kliegl et al. (2004)

Hindi

Potsdam-Allahabad Hindi Eyetracking Corpus

Stimulus: 153 sentences from the Hindi-Urdu treebank
Participants: 30
Provided features: lexical features, first fixation duration, total fixation time, first-pass reading time, regression path duration, etc.
Data: https://osf.io/dh54b/
Reference: Husain et al. (2015)

Persian

Dependency Resolution Dataset

Stimulus: 136 sentences
Participants: 40
Provided features: FFD, FFP, SFD, FPRT, RBRT, TFT, RPD, CRPD, RRT, RRTP, RRTR, RBRC, TRC, LPRT
Data: http://www.ling.uni-potsdam.de/~vasishth/code/SafaviEtAl2016DataCode.zip
Reference: Safavi et al. (2016)

Also contains self-paced reading times.

Portuguese (Brazilian)

RastrOS

Stimulus: 50 short paragraphs of various genres
Participants: 37
Data: https://osf.io/9jxg3/
Reference: Leal et al. (forthcoming)

Russian

Russian Sentence Corpus

Stimulus: 144 sentences
Participants: 96
Data: https://osf.io/x5q2r/
Reference: Laurinavichyute et al. (2018)

Spanish

Nicenboim et al. (2015)

Stimulus: 264 sentences including various syntactic phenomena
Participants: 76
Data: https://github.com/bnicenboim/papers/tree/master/NicenboimEtAl2015.%20Working%20memory%20differences%20in%20long-distance%20dependency%20resolution
Reference: Nicenboim et al. (2015)

This study also contains a self-paced reading experiment.

Swedish

HEN: Processing of gender-neutral pronouns

Stimulus: 48 sentence pairs where the first sentence included a noun referring to a person (e.g., sister, hairdresser, person) and the second included a pronoun referring to the noun.
Participants: 120 participants
Data: https://figshare.com/articles/dataset/Open_data_Are_new_gender-neutral_pronouns_difficult_to_process_in_reading_The_case_of_hen_in_Swedish/13143158/1
Reference: Vergoossen et al. (2020)

Turkish

TURead

Stimulus: 192 short texts, each composed of 1-3 sentences
Participants: 215
Provided features: total fixation duration, gaze duration, first fixation duration & more
Data: https://osf.io/w53cz/
Reference: Acartürk et al. (2023)

Multilingual

MECO L1

Stimulus: 12 short texts about general domain topics, native speakers reading in their own language
Participants: 580 readers of 13 languages (Dutch, English, Estonian, Finnish, German, Greek, Hebrew, Italian, Korean, Norwegian, Russian, Spanish, and Turkish)
Data: https://osf.io/3527a/
Reference: Siegelman et al. (2022)

MECO L2

Stimulus: 12 short texts about general domain topics, L2 speakers reading in English
Participants: 543 readers of 12 languages (Dutch, English, Estonian, Finnish, German, Greek, Hebrew, Italian, Norwegian, Russian, Spanish, and Turkish)
Data: https://osf.io/q9h43/
Reference: Kuperman et al. (2022)