resources - shigashiyama/nlp_survey GitHub Wiki

Tools and Resources

Resources

Language resources

GSK: Language Resources Catalog
http://www.gsk.or.jp/catalog/
NTCIR: List of test collections
http://research.nii.ac.jp/ntcir/data/data-ja.html
Catalog of Japanese language resources and tools
https://www.jaist.ac.jp/project/NLP_Portal/doc/LR/lr-cat-j.html
Datasets for Natural Language Processing https://machinelearningmastery.com/datasets-natural-language-processing/

Text corpus

Large-scale raw text
- WaCky
  http://wacky.sslmit.unibo.it/doku.php
- Chinese Gigaword 5th Edition https://catalog.ldc.upenn.edu/ldc2011t13
Japanese linguistic analysis
- Balanced Corpus of Contemporary Written Japanese (BCCWJ)
  http://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html
  - BCCWJ-DepPara (bunsetsu-level dependency and coordination structure annotation data)
    https://github.com/masayu-a/BCCWJ-DepPara2
    http://www.anlp.jp/proceedings/annual_meeting/2013/pdf_dir/X1-2.pdf
  - BCCWJ NE corpus
    https://sites.google.com/site/projectnextnlpne/
  - Japanese Wikification corpus
    http://www.cl.ecei.tohoku.ac.jp/jawikify/
  - Project Next NLP morphological analysis group (ID list of BCCWJ test data)
    http://plata.ar.media.kyoto-u.ac.jp/mori/research/topics/PST/NextNLP.html
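- Kurohashi-Kawahara Lab, Kyoto University
  http://nlp.ist.i.kyoto-u.ac.jp/index.php?NLP%E3%83%AA%E3%82%BD%E3%83%BC%E3%82%B9
  - Kyoto University Text Corpus
  - Kyoto University Web Document Leads Corpus, etc.
- Academic Center for Computing and Media Studies, Kyoto University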
  http://www.ar.media.kyoto-u.ac.jp/data/
  - Japanese Dependency Corpus
  - Japanese Wikification Corpora etc.
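- Kyoto University Graduate School of Informatics / NTT Communication Science Laboratories joint research unit
  http://nlp.ist.i.kyoto-u.ac.jp/kuntt/
  - Analyzed (annotated) blog corpus, etc.
- TMU (Tokyo Metropolitan University) Japanese Twitter Corpus
  https://github.com/tmu-nlp/TwitterCorpus
- Cookpad dataset / flow graph corpus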
  http://www.nii.ac.jp/dsc/idr/cookpad/cookpad.html
  http://www.ar.media.kyoto-u.ac.jp/data/recipe/
Chinese linguistic analysis
- [Weibo] NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Weibo Text
  https://github.com/FudanNLP/NLPCC-WordSeg-Weibo
- [AS, CityU, PKU, MSR] Second International Chinese Word Segmentation Bakeoff
  http://sighan.cs.uchicago.edu/bakeoff2005/
- [CTB6] Chinese Treebank 6.0
  https://catalog.ldc.upenn.edu/LDC2007T36
Multilingual linguistic analysis
- Universal Dependencies
  http://universaldependencies.org/
- CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989
Named entity recognition / classification
- "Extended Named Entity + Wikipedia" data
  http://www.languagecraft.com/enew/
- GSK2014-A Extended Named Entity tagged corpus (BCCWJ core data, Mainichi Shimbun '95)
  http://www.gsk.or.jp/catalog/gsk2014-a/
- FIGER
  https://github.com/xiaoling/figer
- FNET - preprocessor for BBN, OntoNotes, and Wiki data
  (Abhishek 2017)
  https://github.com/abhipec/fnet
Relation Extraction
- RANIS - Relational representation of context-dependent roles on information science papers
  http://mynlp.github.io/ranis/
- Datasets for the automatic keyphrase extraction task
  https://github.com/snkim/AutomaticKeyphraseExtraction
Language Modeling
- One Billion Word Benchmark
  https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark
RTE
- The Stanford Natural Language Inference (SNLI) Corpus
  https://nlp.stanford.edu/projects/snli/
Parallel corpus
- Japanese-English Subtitle Corpus
  http://cs.stanford.edu/~rpryzant/jesc/
- Asian Language Treebank (ALT) Project
  http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/index.html
- Graham Neubig: Japanese parallel translation data
  http://www.phontron.com/japanese-translation-data.php?lang=ja
- small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods
  https://github.com/odashi/small_parallel_enja
- OPUS - the open parallel corpus
  http://opus.lingfil.uu.se/
QA
- SQuAD
  https://rajpurkar.github.io/SQuAD-explorer/
- TriviaQA
  http://nlp.cs.washington.edu/triviaqa/
Sentiment Analysis
- AcademiaSinicaNLPLab/sentiment_dataset (SST, MR, CR, TREC etc.)
  https://github.com/AcademiaSinicaNLPLab/sentiment_dataset
Semantic Parsing
- Geoquery
  http://www.cs.utexas.edu/users/ml/geo.html

Tools

Text analysis

Text analysis toolkit
- Miyao Lab, National Institute of Informatics https://mynlp.github.io/ja/projects
  - Enju - Accurate natural language parser for English
  - Corbit - Chinese text analyzer etc.
- Kurohashi-Kawahara Lab, Kyoto University
  http://nlp.ist.i.kyoto-u.ac.jp/index.php?NLP%E3%83%AA%E3%82%BD%E3%83%BC%E3%82%B9
  - JUMAN/JUMAN++ - Japanese morphological analysis system
  - KNP - Japanese syntactic parser, etc.
- gensim
  https://radimrehurek.com/gensim/apiref.html
- NLP4J - NLP Toolkit for JVM Languages
  https://emorynlp.github.io/nlp4j/
  - (Choi 2016)
- spaCy
  https://spacy.io/
- Natural Language Toolkit (NLTK)
  http://www.nltk.org/
- Stanford NLP Group
  https://nlp.stanford.edu/software/
  https://github.com/stanfordnlp
- harvardnlp
  http://nlp.seas.harvard.edu/code/
  - OpenNMT etc.
- CMU Chris Dyer's lab
  http://www.cs.cmu.edu/~cdyer/code.html
  https://github.com/clab
Morphological analysis / word segmentation
- MeCab http://taku910.github.io/mecab/
- KyTea
  http://www.phontron.com/kytea/index-ja.html
- (Yang 2017)
  https://github.com/jiesutd/RichWordSegmentor
- (Zhang 2016)
  https://github.com/SUTDNLP/NNSegmentation
Dependency parsing
- CaboCha
  https://taku910.github.io/cabocha/
- J.DepP - C++ implementation of Japanese Dependency Parsers
  http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/
- EDA dependency parser
  http://plata.ar.media.kyoto-u.ac.jp/tool/EDA/
- Comainu - middle- and long-unit word analyzer
  http://comainu.org/
Tokenization
- SentencePiece
  https://github.com/google/sentencepiece
NER
- NeuroNER
  https://github.com/Franck-Dernoncourt/NeuroNER
- (Lample 2016)
  https://github.com/glample/tagger (Theano)
  https://github.com/clab/stack-lstm-ner (dynet)
NE classification
- (Shimaoka 2017)
  https://github.com/shimaokasonse/NFGEC

Word Embedding

Word embedding
- word2vec
  https://code.google.com/archive/p/word2vec/
- fastText
  https://github.com/facebookresearch/fastText
- GloVe
  https://nlp.stanford.edu/projects/glove/
- (Chen 2015) CWE
  https://github.com/Leonard-Xu/CWE
- (Ling 2015) C2W
  https://github.com/wlin12/JNN
Sentence embedding
- (Conneau 2017) https://github.com/facebookresearch/SentEval
StarSpace
https://github.com/facebookresearch/StarSpace

Machine Translation

NMT
- OpenNMT
  http://opennmt.net/
- SYSTRAN
  https://arxiv.org/abs/1610.05540
- (Caglayan 2017) NMTPY
  https://github.com/lium-lst/nmtpy (Theano)
  https://arxiv.org/abs/1706.00457
- (Sennrich 2017)
  https://github.com/rsennrich/nematus (Theano)
- (Luong 2017) etc.
  https://nlp.stanford.edu/projects/nmt/ (Matlab)
- (Bahdanau 2015) RNNsearch
  https://github.com/lisa-groundhog/GroundHog/tree/master/experiments/nmt (Theano)
SMT
- Moses
  http://www.statmt.org/moses/
- GIZA++
  http://www.statmt.org/moses/giza/GIZA++.html

Other NLP techniques

Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE
https://github.com/thunlp/KB2E

Preprocess for NLP

Data Preprocessor
- WikiExtractor
  https://github.com/attardi/wikiextractor
- nwc-toolkit
  https://code.google.com/archive/p/nwc-toolkit/
  https://github.com/xen/nwc-toolkit
- Chazutsu
  https://github.com/chakki-works/chazutsu
Annotation
- brat rapid annotation tool
  http://brat.nlplab.org/

Search and visualization

Faiss: A library for efficient similarity search
https://code.facebook.com/posts/1373769912645926/faiss-a-library-for-efficient-similarity-search/
scattertext
https://github.com/JasonKessler/scattertext
Picasso: A free open-source visualizer for Convolutional Neural Networks
https://medium.com/merantix/picasso-a-free-open-source-visualizer-for-cnns-d8ed3a35cfc5

Deep learning framework and implementations

TensorFlow implementations
https://github.com/tensorflow/models
- Language Model on 1B Word Benchmark
  https://github.com/tensorflow/models/tree/master/lm_1b
Quasi-Recurrent Neural Network (QRNN) for PyTorch
https://github.com/salesforce/pytorch-qrnn

Reinforcement learning and dialog

OpenAI Baselines
https://github.com/openai/baselines
OpenAI Universe
https://github.com/openai/universe
Dialog
- ParlAI: a framework for dialog AI research
  https://github.com/facebookresearch/ParlAI
