resources - shigashiyama/nlp_survey GitHub Wiki
Tools and Resources
Resources
Language resources
- GSK: Language Resources Catalog
http://www.gsk.or.jp/catalog/
- NTCIR: List of test collections
http://research.nii.ac.jp/ntcir/data/data-ja.html
- Catalog of Japanese language resources and tools
https://www.jaist.ac.jp/project/NLP_Portal/doc/LR/lr-cat-j.html
- Datasets for Natural Language Processing
https://machinelearningmastery.com/datasets-natural-language-processing/
Text corpus
- Large-scale raw text
- WaCky
http://wacky.sslmit.unibo.it/doku.php
- Chinese Gigaword 5th Edition
https://catalog.ldc.upenn.edu/ldc2011t13
- Japanese linguistic analysis
- Balanced Corpus of Contemporary Written Japanese (BCCWJ)
http://pj.ninjal.ac.jp/corpus_center/bccwj/dvd-index.html
  - BCCWJ-DepPara (bunsetsu-level dependency and coordinate structure annotation data)
  https://github.com/masayu-a/BCCWJ-DepPara2
  http://www.anlp.jp/proceedings/annual_meeting/2013/pdf_dir/X1-2.pdf
  - BCCWJ NE corpus
  https://sites.google.com/site/projectnextnlpne/
  - Japanese Wikification corpus
  http://www.cl.ecei.tohoku.ac.jp/jawikify/
  - Project Next NLP morphological analysis group (id list of BCCWJ test data)
  http://plata.ar.media.kyoto-u.ac.jp/mori/research/topics/PST/NextNLP.html
- Kyoto University, Kurohashi-Kawahara Laboratory
http://nlp.ist.i.kyoto-u.ac.jp/index.php?NLP%E3%83%AA%E3%82%BD%E3%83%BC%E3%82%B9
  - Kyoto University Text Corpus
  - Kyoto University Web Document Leads Corpus etc.
- Kyoto University, Academic Center for Computing and Media Studies
http://www.ar.media.kyoto-u.ac.jp/data/
  - Japanese Dependency Corpus
  - Japanese Wikification Corpora etc.
- Kyoto University Graduate School of Informatics and NTT Communication Science Laboratories joint research unit
http://nlp.ist.i.kyoto-u.ac.jp/kuntt/
  - Annotated blog corpus etc.
- Tokyo Metropolitan University Japanese Twitter corpus
https://github.com/tmu-nlp/TwitterCorpus
- Cookpad dataset / flow graph corpus
http://www.nii.ac.jp/dsc/idr/cookpad/cookpad.html
http://www.ar.media.kyoto-u.ac.jp/data/recipe/
- Chinese linguistic analysis
- [Weibo] NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Weibo Text
https://github.com/FudanNLP/NLPCC-WordSeg-Weibo
- [AS, CityU, PKU, MSR] Second International Chinese Word Segmentation Bakeoff
http://sighan.cs.uchicago.edu/bakeoff2005/
- [CTB6] Chinese Treebank 6.0
https://catalog.ldc.upenn.edu/LDC2007T36
- Multilingual linguistic analysis
- Universal Dependencies
http://universaldependencies.org/
- CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989
- Named entity recognition / classification
- "Extended Named Entity + Wikipedia" data
http://www.languagecraft.com/enew/
- GSK2014-A Extended Named Entity tagged corpus (BCCWJ core data, Mainichi Shimbun '95)
http://www.gsk.or.jp/catalog/gsk2014-a/
- FIGER
https://github.com/xiaoling/figer
- FNET - preprocessor for BBN, OntoNotes, and Wiki data (Abhishek 2017)
https://github.com/abhipec/fnet
- Relation Extraction
- RANIS - Relational representation of context-dependent roles on information science papers
http://mynlp.github.io/ranis/
- Datasets for the automatic keyphrase extraction task
https://github.com/snkim/AutomaticKeyphraseExtraction
- Language Modeling
- One Billion Word Benchmark
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark
- RTE
- The Stanford Natural Language Inference (SNLI) Corpus
https://nlp.stanford.edu/projects/snli/
- Parallel corpus
- Japanese-English Subtitle Corpus
http://cs.stanford.edu/~rpryzant/jesc/
- Asian Language Treebank (ALT) Project
http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/index.html
- Graham Neubig: Japanese parallel translation data
http://www.phontron.com/japanese-translation-data.php?lang=ja
- small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods
https://github.com/odashi/small_parallel_enja
- OPUS - the open parallel corpus
http://opus.lingfil.uu.se/
- QA
- Sentiment Analysis
- AcademiaSinicaNLPLab/sentiment_dataset (SST, MR, CR, TREC etc.)
https://github.com/AcademiaSinicaNLPLab/sentiment_dataset
- Semantic Parsing
Knowledge resources for NLP
- Dictionary
- UniDic
http://pj.ninjal.ac.jp/corpus_center/unidic/
- mecab-ipadic-NEologd
https://github.com/neologd/mecab-ipadic-neologd/
- Thesaurus
- WordNet
https://wordnet.princeton.edu/
- Paraphrase database
- PPDB: The Paraphrase Database
http://www.cis.upenn.edu/~ccb/ppdb/
- PPDB: Japanese - Japanese paraphrase database
http://ahclab.naist.jp/resource/jppdb/
Word embedding
- word similarity dataset
- (Sakaizawa 2017) Japanese word similarity dataset
https://github.com/tmu-nlp/JapaneseWordSimilarityDataset/blob/master/README.md
- (Chen 2015) Chinese wordsim and analogy dataset
https://github.com/Leonard-Xu/CWE/tree/master/data
- (Finkelstein 2002) The WordSimilarity-353 Test Collection
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
- (Bruni 2012) The MEN Test Collection
https://staff.fnwi.uva.nl/e.bruni/MEN
- (Huang 2012) SCWS
http://ai.stanford.edu/~ehhuang/
- (Luong 2013) LX-Rare Word Similarity Dataset
http://metashare.metanet4u.eu/repository/browse/lx-rare-word-similarity-dataset/f8dd0332e6d911e6a2aa782bcb074135a226cf379cf746a8976dd3420f5a2813/
- (Hill 2014) SimLex-999
https://www.cl.cam.ac.uk/~fh295/simlex.html
- (Baker 2014) Verb Similarity Dataset
https://ie.technion.ac.il/~roiri/
- (Gerz 2016) SimVerb-3500
http://people.ds.cam.ac.uk/dsg40/simverb.html
- phrase similarity dataset
- Relational pattern similarity dataset
https://github.com/takase/relPatSim
- analogy task dataset
- (Mikolov 2013) Google analogy test set: contained in the word2vec repository
- pre-trained word embedding models
- chakin - downloader for pre-trained word vectors
https://github.com/chakki-works/chakin
http://qiita.com/Hironsan/items/85b281270671dde3555d
- word2vec Google News model
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
- fastText - Pre-trained word vectors
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
- Tohoku University, Inui-Okazaki Laboratory: Japanese Wikipedia entity vectors
http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/
- Shiroyagi Corporation: pre-trained Japanese word2vec model
http://aial.shiroyagi.co.jp/2017/02/japanese-word2vec-model-builder/
Other resources
- Reasoning
- csi-corpus - CSI: Crime Scene Investigation episodes
https://github.com/EdinburghNLP/csi-corpus
- Image captioning
- Computer Vision
- ImageNet
http://www.image-net.org/
- Audio Recognition
- AudioSet
https://research.google.com/audioset/
- Voice Actor Statistics Corpus
http://voice-statistics.github.io/
Tools
Text analysis
- Text analysis toolkit
- National Institute of Informatics, Miyao Laboratory
https://mynlp.github.io/ja/projects
  - Enju - Accurate natural language parser for English
  - Corbit - Chinese text analyzer etc.
- Kyoto University, Kurohashi-Kawahara Laboratory
http://nlp.ist.i.kyoto-u.ac.jp/index.php?NLP%E3%83%AA%E3%82%BD%E3%83%BC%E3%82%B9
  - JUMAN/JUMAN++ Japanese morphological analysis system
  - KNP Japanese syntactic parsing system etc.
- gensim
https://radimrehurek.com/gensim/apiref.html
- NLP4J - NLP Toolkit for JVM Languages (Choi 2016)
https://emorynlp.github.io/nlp4j/
- spaCy
https://spacy.io/
- Natural Language Toolkit (NLTK)
http://www.nltk.org/
- Stanford NLP Group
https://nlp.stanford.edu/software/
https://github.com/stanfordnlp
- harvardnlp
http://nlp.seas.harvard.edu/code/
  - OpenNMT etc.
- CMU Chris Dyer's lab
http://www.cs.cmu.edu/~cdyer/code.html
https://github.com/clab
- Morphological analysis / word segmentation
- MeCab
http://taku910.github.io/mecab/
- KyTea
http://www.phontron.com/kytea/index-ja.html
- (Yang 2017)
https://github.com/jiesutd/RichWordSegmentor
- (Zhang 2016)
https://github.com/SUTDNLP/NNSegmentation
- Dependency parsing
- CaboCha
https://taku910.github.io/cabocha/
- J.DepP - C++ implementation of Japanese Dependency Parsers
http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/
- EDA dependency parser
http://plata.ar.media.kyoto-u.ac.jp/tool/EDA/
- Comainu: middle- and long-unit word analyzer
http://comainu.org/
- Tokenization
- SentencePiece
https://github.com/google/sentencepiece
- NER
- NeuroNER
https://github.com/Franck-Dernoncourt/NeuroNER
- (Lample 2016)
https://github.com/glample/tagger (Theano)
https://github.com/clab/stack-lstm-ner (dynet)
- NE classification
- (Shimaoka 2017)
https://github.com/shimaokasonse/NFGEC
Word Embedding
- Word embedding
- word2vec
https://code.google.com/archive/p/word2vec/
- fastText
https://github.com/facebookresearch/fastText
- GloVe
https://nlp.stanford.edu/projects/glove/
- (Chen 2015) CWE
https://github.com/Leonard-Xu/CWE
- (Ling 2015) C2W
https://github.com/wlin12/JNN
- Sentence embedding
- (Conneau 2017) SentEval
https://github.com/facebookresearch/SentEval
- StarSpace
https://github.com/facebookresearch/StarSpace
Machine Translation
- NMT
- OpenNMT
http://opennmt.net/
- SYSTRAN
https://arxiv.org/abs/1610.05540
- (Caglayan 2017) NMTPY
https://github.com/lium-lst/nmtpy (Theano)
https://arxiv.org/abs/1706.00457
- (Sennrich 2017)
https://github.com/rsennrich/nematus (Theano)
- (Luong 2017) etc.
https://nlp.stanford.edu/projects/nmt/ (Matlab)
- (Bahdanau 2015) RNNsearch
https://github.com/lisa-groundhog/GroundHog/tree/master/experiments/nmt (Theano)
- SMT
Other NLP techniques
- Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE
https://github.com/thunlp/KB2E
Preprocess for NLP
- Data Preprocessor
- Annotation
- brat rapid annotation tool
http://brat.nlplab.org/
Search and visualization
- Faiss: A library for efficient similarity search
https://code.facebook.com/posts/1373769912645926/faiss-a-library-for-efficient-similarity-search/
- scattertext
https://github.com/JasonKessler/scattertext
- Picasso: A free open-source visualizer for Convolutional Neural Networks
https://medium.com/merantix/picasso-a-free-open-source-visualizer-for-cnns-d8ed3a35cfc5
Deep learning framework and implementations
- TensorFlow implementations
https://github.com/tensorflow/models
  - Language Model on 1B Word Benchmark
  https://github.com/tensorflow/models/tree/master/lm_1b
- Quasi-Recurrent Neural Network (QRNN) for PyTorch
https://github.com/salesforce/pytorch-qrnn
Reinforcement learning and dialog
- OpenAI Baselines
https://github.com/openai/baselines
- OpenAI Universe
https://github.com/openai/universe
- Dialog
- ParlAI: a framework for dialog AI research
https://github.com/facebookresearch/ParlAI