asianlang_resources - shigashiyama/nlp_survey GitHub Wiki

Asian Language Resources

List of Language Resources

A curated list of Japanese, Korean and Vietnamese open speech corpora
- https://www.hieuthi.com/blog/2018/04/22/speech-japanese-korean-vietnamese.html

Multilingual

Simultaneous Interpretation/Translation

NAIST-SIC: NAIST Simultaneous Interpretation Corpus
- Japanese <--> English
- https://dsc-nlp.naist.jp/data/NAIST-SIC/
AHC-SI: the first large scale English-Japanese Interpretation Data
- Japanese <--> English
- https://github.com/mingzi151/AHC-SI
CIAIR Simultaneous Interpretation Corpus (Simultaneous Interpretation Database)
- Japanese <--> English
- http://sidb.jp/
JNPC Corpus (GSK2020-A Interpretation Database)
- Japanese <--> English
- https://www.gsk.or.jp/catalog/gsk2020-a/
2020 Duolingo Shared Task: Simultaneous Translation And Paraphrase for Language Education (STAPLE)
- English <--> Portuguese, Hungarian, Japanese, Korean, and Vietnamese
- https://sharedtask.duolingo.com/

News, Web, and General Text

OPUS
- http://opus.nlpl.eu/
TUFS Asian Language Parallel Corpus (TALPCo)
- Japanese --> English, Korean, Burmese, Malay, Indonesian, Thai, and Vietnamese
- https://github.com/matbahasa/TALPCo
TUFS Media Corpus
- Japanese <-- Arabic, Bengali, Burmese, Indonesian, Persian, Turkish, Urdu, and Vietnamese
- http://ngc2068.tufs.ac.jp/tufsmedia-corpus/
Asian Language Treebank (ALT)
- English --> Bengali, Indonesian, Japanese, Filipino, Vietnamese, Myanmar, Thai, Malay, and Khmer
- http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
Facebook Low Resource MT Benchmark (FLoRes)
- English --> Nepali, Sinhala, Khmer, Pashto
- https://github.com/facebookresearch/flores
JParaCrawl
- Japanese <--> English, Chinese
- http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/
JIJI Corpus (only available for WAT participants)
- Japanese --> English
- https://lotus.kuee.kyoto-u.ac.jp/WAT/jiji-corpus/2020/TaskDescription.html
Japanese-Russian-English News Commentary Parallel Data
- https://github.com/aizhanti/JaRuNC
Japanese-Vietnamese Parallel Corpora
- https://github.com/ngovinhtn/JaViCorpus
WCC-JC: A Web Crawled Corpus for Japanese-Chinese NMT
- https://github.com/zhang-jinyi/Web-Crawled-Corpus-for-Japanese-Chinese-NMT
IIT Bombay English-Hindi Corpus
- http://www.cfilt.iitb.ac.in/iitb_parallel/
EnTam: An English-Tamil Parallel Corpus
- http://ufal.mff.cuni.cz/~ramasamy/parallel/html/
UCSY corpus and ALT corpus for WAT Myanmar-English task
- http://lotus.kuee.kyoto-u.ac.jp/WAT/my-en-data/
ECCC corpus and ALT corpus for WAT Khmer-English task
- http://lotus.kuee.kyoto-u.ac.jp/WAT/km-en-data/
Korean Parallel Corpora
- Korean <--> English
- https://github.com/jungyeul/korean-parallel-corpora

Science, Patent, and Technical Documents

Asian Scientific Paper Excerpt Corpus (ASPEC)
- Japanese <--> English, Chinese
- http://lotus.kuee.kyoto-u.ac.jp/ASPEC/
NTCIR-10 PatentMT (Patent Machine Translation Test Collection)
- Japanese <--> English; Chinese --> English
- http://research.nii.ac.jp/ntcir/permission/ntcir-10/perm-en-PatentMT.html
JPO Patent Corpus
- Japanese <--> English, Chinese, Korean
- https://alaginrc.nict.go.jp/resources/jpo-info/jpo-outline.html
- https://lotus.kuee.kyoto-u.ac.jp/WAT/patent/jpc2021.html (only available for WAT participants)
Timely Disclosure Documents Corpus (Tokyo Stock Exchange Timely Disclosure Corpus)
- Japanese --> English
- https://www.jpx.co.jp/corporate/news/news-releases/0060/20200916-01.html

Speech and Dialogue

Open Speech and Language Resources
- https://www.openslr.org/resources.php
MAGICHUB - Datasets
- https://magichub.com/category/datasets
Web Inventory of Transcribed and Translated Talks (WIT3)
- IWSLT 2017 data: English --> Arabic, German, French, Japanese, Korean, Chinese
- IWSLT 2015 data: English --> French, German, Chinese, Thai, Vietnamese
- https://wit3.fbk.eu/home
Multilingual TEDx
- English <-- Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, and German
- https://www.openslr.org/100/
The Multitarget TED Talks Task (MTTT)
- https://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/
MuST-C: Multilingual Speech Translation Corpus from English TED Talks
- English --> Arabic, Chinese, Czech, Dutch, French, German, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Turkish, and Vietnamese
- https://ict.fbk.eu/must-c/
The Business Scene Dialogue corpus
- Japanese <--> English
- https://github.com/tsuruoka-lab/BSD
The AMI Meeting Parallel Corpus
- English --> Japanese
- https://github.com/tsuruoka-lab/AMI-Meeting-Parallel-Corpus
Japanese-to-English Discourse Translation Test Set
- https://github.com/nttcslab-nlp/discourse-mt-test-sets
kosp2e: Korean Speech to English Translation Corpus
- https://github.com/warnikchow/kosp2e

User-generated Text

μtopia - Microblog Translated Posts Parallel Corpus
- Weibo Corpus: Chinese --> English, Arabic, Russian, Korean, German, French, Spanish, Portuguese, Czech
- Twitter Corpus: English <--> Chinese, Arabic, Russian, Korean, Japanese
- Twitter Gold Corpus: English <--> Spanish, French, Russian, Korean, Japanese
- http://www.cs.cmu.edu/~lingwang/microtopia/
MTNT: Machine Translation of Noisy Text
- English <--> French, Japanese
- https://pmichel31415.github.io/mtnt/index.html
PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents
- Japanese <--> English
- https://github.com/cl-tohoku/PheMT

Other

Japanese-English Subtitle Corpus
- https://nlp.stanford.edu/projects/jesc/
Graham Neubig: Japanese Parallel Data
- http://www.phontron.com/japanese-translation-data.php?lang=en
small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods
- https://github.com/odashi/small_parallel_enja

Monolingual

Multilingual Amazon Reviews Corpus
- En, De, Es, Fr, Ja, Zh
- https://registry.opendata.aws/amazon-reviews-ml/
Amazon Multilingual Counterfactual Dataset (AMCD)
- https://github.com/amazon-research/amazon-multilingual-counterfactual-dataset
CC-100: Monolingual Datasets from Web Crawl Data
- http://data.statmt.org/cc-100/
Tweets2011: TREC 2011 microblog track
- https://trec.nist.gov/data/tweets/

Chinese

SIGHAN 2005 Chinese Word Segmentation Bakeoff dataset
- http://sighan.cs.uchicago.edu/bakeoff2005/
SCTB: A Chinese Treebank in Scientific Domain
- Word segmentation, phrase structure
- http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29
Baidu Dataset
- https://ai.baidu.com/broad/introduction?dataset
THCHS-30 (A Free Chinese Speech Corpus Released by CSLT@Tsinghua University)

Korean

Open Korean Corpora
- https://github.com/ko-nlp/Open-korean-corpora
KAIST Corpus
- http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus
AI HUB Korean Speech Corpus
- https://aihub.or.kr/aidata/105#
Zeroth-Korean Speech Corpus
- http://www.openslr.org/40/
Zeroth: Kaldi-based Korean ASR open-source project
- https://github.com/goodatlas/zeroth
West Point Korean Speech
- https://catalog.ldc.upenn.edu/LDC2006S36
Pansori-TEDxKR
speech.ko
- https://github.com/homink/speech.ko
Korean Conversational Speech Corpus
- https://magichub.com/datasets/korean-conversational-speech-corpus/

Vietnamese

Python Vietnamese Toolkit
- Tokenization, POS tagging, Accents removal, Accents adding
- https://github.com/trungtv/pyvi
VLSP 2013 datasets
- Word segmentation, POS tag
- https://vlsp.org.vn/resources-vlsp2013
VNESEcorpus and VNTQcorpus
- http://viet.jnlp.org/download-du-lieu-tu-vung-corpus
PhoBERT: Pre-trained language models for Vietnamese
- https://github.com/VinAIResearch/PhoBERT

Burmese (Myanmar)

Burmese ALT
- Tokenization, POS tag, phrase structure
- http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
- https://dl.acm.org/doi/pdf/10.1145/3373268

Thai

InterBEST 2009 dataset
- Word segmentation, named entity
- https://thailang.nectec.or.th/downloadcenter/indexae01.html?option=com_docman&task=cat_view&gid=40&Itemid=61

Khmer

Khmer ALT
- Tokenization, POS tag
- http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/