asianlang_resources - shigashiyama/nlp_survey GitHub Wiki
Asian Language Resources
List of Language Resources
- A curated list of Japanese, Korean and Vietnamese open speech corpora
https://www.hieuthi.com/blog/2018/04/22/speech-japanese-korean-vietnamese.html
Multilingual
Simultaneous Interpretation/Translation
- NAIST-SIC: NAIST Simultaneous Interpretation Corpus
- Japanese <--> English
- https://dsc-nlp.naist.jp/data/NAIST-SIC/
- AHC-SI: the first large scale English-Japanese Interpretation Data
- Japanese <--> English
- https://github.com/mingzi151/AHC-SI
- CIAIR Simultaneous Interpretation Corpus (Simultaneous Interpretation Database)
- Japanese <--> English
- http://sidb.jp/
- JNPC Corpus (GSK2020-A Interpretation Database)
- Japanese <--> English
- https://www.gsk.or.jp/catalog/gsk2020-a/
- 2020 Duolingo Shared Task: Simultaneous Translation And Paraphrase for Language Education (STAPLE)
- English <--> Portuguese, Hungarian, Japanese, Korean, and Vietnamese
- https://sharedtask.duolingo.com/
News, Web, and General Text
- OPUS
- TUFS Asian Language Parallel Corpus (TALPCo)
- Japanese --> English, Korean, Burmese, Malay, Indonesian, Thai, and Vietnamese
- https://github.com/matbahasa/TALPCo
- TUFS Media Corpus
- Japanese <-- Arabic, Bengali, Burmese, Indonesian, Persian, Turkish, Urdu, and Vietnamese
- http://ngc2068.tufs.ac.jp/tufsmedia-corpus/
- Asian Language Treebank (ALT)
- English --> Bengali, Indonesian, Japanese, Filipino, Vietnamese, Myanmar, Thai, Malay, and Khmer
- http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
- Facebook Low Resource MT Benchmark (FLoRes)
- English --> Nepali, Sinhala, Khmer, Pashto
- https://github.com/facebookresearch/flores
- JParaCrawl
- Japanese <--> English, Chinese
- http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/
- JIJI Corpus (only available for WAT participants)
- Japanese --> English
- https://lotus.kuee.kyoto-u.ac.jp/WAT/jiji-corpus/2020/TaskDescription.html
- Japanese-Russian-English News Commentary Parallel Data
- Japanese-Vietnamese Parallel Corpora
- WCC-JC: A Web Crawled Corpus for Japanese-Chinese NMT
- IIT Bombay English-Hindi Corpus
- EnTam: An English-Tamil Parallel Corpus
- UCSY corpus and ALT corpus for WAT Myanmar-English task
- ECCC corpus and ALT corpus for WAT Khmer-English task
- Korean Parallel Corpora
- Korean <--> English
- https://github.com/jungyeul/korean-parallel-corpora
Science, Patent, and Technical Documents
- Asian Scientific Paper Excerpt Corpus (ASPEC)
- Japanese <--> English, Chinese
- http://lotus.kuee.kyoto-u.ac.jp/ASPEC/
- NTCIR-10 PatentMT (Patent Machine Translation Test Collection)
- Japanese <--> English; Chinese --> English
- http://research.nii.ac.jp/ntcir/permission/ntcir-10/perm-en-PatentMT.html
- JPO Patent Corpus
- Japanese <--> English, Chinese, Korean
- https://alaginrc.nict.go.jp/resources/jpo-info/jpo-outline.html
- https://lotus.kuee.kyoto-u.ac.jp/WAT/patent/jpc2021.html (only available for WAT participants)
- Timely Disclosure Documents Corpus (Tokyo Stock Exchange Timely Disclosure Corpus)
- Japanese --> English
- https://www.jpx.co.jp/corporate/news/news-releases/0060/20200916-01.html
Speech and Dialogue
- Open Speech and Language Resources
- MAGICHUB - Datasets
- Web Inventory of Transcribed and Translated Talks (WIT3)
- IWSLT 2017 data: English --> Arabic, German, French, Japanese, Korean, Chinese
- IWSLT 2015 data: English --> French, German, Chinese, Thai, Vietnamese
- https://wit3.fbk.eu/home
- Multilingual TEDx
- English <-- Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, and German
- https://www.openslr.org/100/
- The Multitarget TED Talks Task (MTTT)
- MUST-C - multilingual speech translation corpus from English TED Talks
- English --> Arabic, Chinese, Czech, Dutch, French, German, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Turkish, and Vietnamese
- https://ict.fbk.eu/must-c/
- The Business Scene Dialogue corpus
- Japanese <--> English
- https://github.com/tsuruoka-lab/BSD
- The AMI Meeting Parallel Corpus
- English --> Japanese
- https://github.com/tsuruoka-lab/AMI-Meeting-Parallel-Corpus
- Japanese-to-English Discourse Translation Test Set
- kosp2e: Korean Speech to English Translation Corpus
User-generated Text
- μtopia - Microblog Translated Posts Parallel Corpus
- Weibo Corpus: Chinese --> English, Arabic, Russian, Korean, German, French, Spanish, Portuguese, Czech
- Twitter Corpus: English <--> Chinese, Arabic, Russian, Korean, Japanese
- Twitter Gold Corpus: English <--> Spanish, French, Russian, Korean, Japanese
- http://www.cs.cmu.edu/~lingwang/microtopia/
- MTNT: Machine Translation of Noisy Text
- English <--> French, Japanese
- https://pmichel31415.github.io/mtnt/index.html
- PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents
- Japanese <--> English
- https://github.com/cl-tohoku/PheMT
Other
- Japanese-English Subtitle Corpus
- Graham Neubig: Japanese Parallel Data
- small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods
Monolingual
- Multilingual Amazon Reviews Corpus
- En, De, Es, Fr, Ja, Zh
- https://registry.opendata.aws/amazon-reviews-ml/
- Amazon Multilingual Counterfactual Dataset (AMCD)
https://github.com/amazon-research/amazon-multilingual-counterfactual-dataset
- CC-100: Monolingual Datasets from Web Crawl Data
http://data.statmt.org/cc-100/
- Tweets2011: TREC 2011 microblog track
https://trec.nist.gov/data/tweets/
Chinese
- SIGHAN 2005 Chinese Word Segmentation Bakeoff dataset
http://sighan.cs.uchicago.edu/bakeoff2005/
- SCTB: A Chinese Treebank in Scientific Domain
- Word segmentation, phrase structure
- http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29
- Baidu Dataset
https://ai.baidu.com/broad/introduction?dataset
- THCHS-30 (A Free Chinese Speech Corpus Released by CSLT@Tsinghua University)
Korean
- Open Korean Corpora
- KAIST Corpus
- AI HUB Korean Speech Corpus
- Zeroth-Korean Speech Corpus
- Zeroth: Kaldi-based Korean ASR open-source project
https://github.com/goodatlas/zeroth
- West Point Korean Speech
https://catalog.ldc.upenn.edu/LDC2006S36
- Pansori-TEDxKR
- speech.ko
https://github.com/homink/speech.ko
- Korean Conversational Speech Corpus
https://magichub.com/datasets/korean-conversational-speech-corpus/
Vietnamese
- Python Vietnamese Toolkit
- Tokenization, POS tagging, accent removal, accent restoration
- https://github.com/trungtv/pyvi
- VLSP 2013 datasets
- Word segmentation, POS tag
- https://vlsp.org.vn/resources-vlsp2013
- VNESEcorpus and VNTQcorpus
- PhoBERT: Pre-trained language models for Vietnamese
https://github.com/VinAIResearch/PhoBERT
Burmese (Myanmar)
- Burmese ALT
- Tokenization, POS tag, phrase structure
- http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
- https://dl.acm.org/doi/pdf/10.1145/3373268
Thai
- InterBEST 2009 dataset
- Word segmentation, named entity
- https://thailang.nectec.or.th/downloadcenter/indexae01.html?option=com_docman&task=cat_view&gid=40&Itemid=61
Khmer
- Khmer ALT
- Tokenization, POS tag
- http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/