Useful Datasets - BD-SEARCH/MLtutorial GitHub Wiki

01. NLP

Multidomain Sentiment Analysis Dataset

page: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/

Used for sentiment analysis.

IMDB Reviews

page: http://ai.stanford.edu/~amaas/data/sentiment/

25,000 movie reviews for training and another 25,000 for testing. Used for sentiment analysis.

Stanford Sentiment Treebank

page: https://nlp.stanford.edu/sentiment/code.html

10,000 reviews from Rotten Tomatoes, which are longer than those in other review datasets. Used for sentiment analysis.

Sentiment140

page: http://help.sentiment140.com/for-students/

1,600,000 tweets, each with polarity, ID, tweet date, query, user, and text fields. Used for sentiment analysis.
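The six fields listed above map naturally onto a CSV row. The following is a minimal sketch of reading such rows, assuming the download is a header-less CSV with the columns in that order and that polarity is coded as 0 (negative), 2 (neutral), and 4 (positive). The function name `read_sentiment140` and the sample rows are made up for illustration.

```python
import csv
import io

# Column order as listed in the description; the Python names are illustrative.
FIELDS = ["polarity", "id", "date", "query", "user", "text"]

# Assumption: polarity is coded as 0 = negative, 2 = neutral, 4 = positive.
POLARITY_LABELS = {"0": "negative", "2": "neutral", "4": "positive"}


def read_sentiment140(lines):
    """Yield one dict per tweet from an iterable of header-less CSV lines."""
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        record = dict(zip(FIELDS, row))
        record["label"] = POLARITY_LABELS.get(record["polarity"], "unknown")
        yield record


# Two made-up rows in the assumed shape, so the sketch runs without the download.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","great, thanks!"\n'
)
tweets = list(read_sentiment140(sample))
```

With the real file, pass an open file object instead of `sample`. The file may need an explicit `encoding` argument to `open`, depending on how it was exported.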

Twitter US Airline Sentiment

page: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

Used for sentiment analysis.

20 Newsgroups

page: http://qwone.com/~jason/20Newsgroups/

A collection of 20,000 documents across 20 topics.

Reuters News Dataset

page: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

The WikiQA Corpus

page: https://www.microsoft.com/en-us/download/details.aspx?id=52419&from=http%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fmobile%2Fdownload.aspx%3Fp%3D4495da01-db8c-4041-a7f6-7984a4f6a905

A dataset of question-and-answer pairs.

UCI’s Spambase

page: https://archive.ics.uci.edu/ml/datasets/Spambase

A dataset useful for spam filtering.

Yelp Reviews

page: https://www.yelp.com/dataset

5,000,000 reviews from Yelp.

WordNet

page: https://wordnet.princeton.edu/

An ontology-style dataset that records synonyms, antonyms, and other relations between words.

Enron Dataset

page: https://www.cs.cmu.edu/~./enron/

A dataset of 500,000 messages, intended to give a deeper understanding of email tools.

Amazon Reviews

page: https://snap.stanford.edu/data/web-Amazon.html

35 million Amazon reviews spanning 18 years. Includes user information, ratings, and plaintext reviews.

Google Books Ngrams

page: https://aws.amazon.com/ko/datasets/google-books-ngrams/

A collection of n-grams from Google Books.

Blogger Corpus

page: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

140 million words from 681,277 blog posts collected from blogger.com.

Wikipedia Links Data

page: https://code.google.com/archive/p/wiki-links/downloads

A dataset containing 13 million documents.

Gutenberg eBooks List

page: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs

Project Gutenberg. An eBook dataset.

Hansards Text Chunks of Canadian Parliament

page: https://www.isi.edu/natural-language/download/hansard/

A corpus of 1.3 million pairs of aligned text chunks from the records of the 36th Canadian Parliament.

Jeopardy

page: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

200,000 question-and-answer pairs from the quiz show Jeopardy. Includes information, category of question, show number, and air date.
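Since the data ships as a JSON file, grouping questions by category takes only a few lines. This is a rough sketch that assumes the file is a JSON array of objects with keys such as `category`, `show_number`, `air_date`, `question`, and `answer`. Those key names are an assumption and should be checked against the downloaded file. The sample records are placeholders, not real clues.

```python
import json
from collections import defaultdict

# Placeholder records in the assumed shape; the key names are an assumption.
sample_json = """
[
  {"show_number": "1", "air_date": "2000-01-01", "category": "SCIENCE",
   "question": "Placeholder clue A", "answer": "Placeholder answer A"},
  {"show_number": "1", "air_date": "2000-01-01", "category": "HISTORY",
   "question": "Placeholder clue B", "answer": "Placeholder answer B"},
  {"show_number": "2", "air_date": "2000-01-02", "category": "SCIENCE",
   "question": "Placeholder clue C", "answer": "Placeholder answer C"}
]
"""


def group_by_category(records):
    """Return a dict mapping each category to the list of its records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record)
    return dict(grouped)


records = json.loads(sample_json)
by_category = group_by_category(records)
```

For the real file, replace `json.loads(sample_json)` with `json.load` on an open file object.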

SMS Spam Collection in English

page: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

5,574 English SMS messages labeled as ham (legitimate) or spam. 425 of them are spam messages extracted from the Grumbletext website.
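A quick way to inspect a collection like this is to count messages per label. The sketch below assumes each line of the data file has the form `label<TAB>message`, with the label being `ham` or `spam`. The helper name `parse_sms_line` and the sample lines are made up for illustration.

```python
from collections import Counter


def parse_sms_line(line):
    """Split one 'label<TAB>message' line into a (label, message) tuple."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message


# Made-up lines in the assumed format, so the sketch runs without the download.
sample_lines = [
    "ham\tSee you at lunch?\n",
    "spam\tYou have won a prize! Call now.\n",
    "ham\tOk, running late.\n",
]

messages = [parse_sms_line(line) for line in sample_lines]
label_counts = Counter(label for label, _ in messages)
```

Splitting with a limit of 1 keeps any tab characters inside the message body intact.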

2000 HUB5 English

link: https://catalog.ldc.upenn.edu/LDC2002T43

English speech data from 40 telephone conversations.

LibriSpeech

link: http://www.openslr.org/12/

1,000 hours of English speech audio read by multiple speakers. Organized by book chapter. Used as speech recognition data.

Spoken Wikipedia Corpora

link: https://nats.gitlab.io/swc/

Hundreds of hours of audio. Spoken recordings of Wikipedia articles in English, German, and Dutch.

Free Spoken Digit Dataset

git: https://github.com/Jakobovski/free-spoken-digit-dataset

1,500 recordings of spoken English digits.
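The label for each recording can be recovered from its filename. This sketch assumes the repository's naming scheme is `<digit>_<speaker>_<index>.wav` (for example `7_jackson_32.wav`). The `Recording` type and the `parse_recording_name` helper are hypothetical names introduced here.

```python
from pathlib import Path
from typing import NamedTuple


class Recording(NamedTuple):
    digit: int    # the spoken digit, 0-9
    speaker: str  # speaker name as it appears in the filename
    index: int    # repetition number for this speaker and digit


def parse_recording_name(path):
    """Parse a '<digit>_<speaker>_<index>.wav' path into a Recording."""
    digit, speaker, index = Path(path).stem.split("_")
    return Recording(int(digit), speaker, int(index))


example = parse_recording_name("recordings/7_jackson_32.wav")
```

Only the filename is parsed here. Reading the audio itself would need a WAV reader, which is out of scope for this sketch.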

TIMIT

link: https://catalog.ldc.upenn.edu/LDC93S1

Speech recordings of phonetically rich sentences read by 630 speakers of American English.

02. Vision

Moments

page: http://moments.csail.mit.edu/
paper: http://moments.csail.mit.edu/#paper
git: https://github.com/metalbubble/moments_models
You can download the cached raw data (305 GB) or the cached data (256x256, 30 fps).

339 classes. Widely used for action recognition.

UCF101

Kinetics

Reference

https://lionbridge.ai/datasets/the-best-25-datasets-for-natural-language-processing/