Datasets for deep learning

This page lists some datasets used for deep learning, neural networks, classification, and related tasks. I try to update this page with new datasets as soon as I can.

I'm currently looking for translation datasets. If you know of any (top quality only) that are not listed on this page, please contact me.

If you find a broken link or a mistake on this page, please contact me.

Natural images

The MNIST database

A database of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples.

NIST Special Database 19

The entire corpus of training materials for handprinted document and character recognition. It contains over 800,000 images with hand-checked classifications.

The CIFAR-10 dataset

Consists of 60,000 32x32 colour images in 10 classes (airplane, bird, cat, truck, ...), with 6,000 images per class. There are 50,000 training images and 10,000 test images.

Caltech 101

Pictures of objects belonging to 101 categories, with about 40 to 800 images per category. Each image is roughly 300 x 200 pixels.

Caltech 256

Pictures of objects belonging to 256 categories.

Text

The 20 Newsgroups data set

A collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Reuters Corpora

A large collection of Reuters News articles for use in the research and development of natural language processing, information retrieval, and machine learning systems.

English Web Treebank Propbank

Provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses and all nouns considered to be predicative.

The New York Times Annotated Corpus

Contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007.

Web 1T 5-gram Version 1

Contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams.
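To make the idea of n-gram frequency counts concrete, here is a minimal sketch that counts every n-gram from unigrams up to five-grams in a toy sentence. It is only an illustration: the function name `ngram_counts` is hypothetical, whitespace splitting stands in for real tokenization, and it does not read or reproduce the file format of the Web 1T corpus itself.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens.

    Hypothetical helper for illustration; not part of any corpus toolkit.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy input; a real corpus would need proper tokenization.
tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)

# Show the observed frequency of one unigram and one bigram.
print(counts[("the",)], counts[("the", "cat")])
```

Each key is a tuple of words and each value is the number of times that sequence was observed, which is the same kind of information the corpus provides at a much larger scale.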

Wikipedia Database download

Wikipedia offers free copies of all available content to interested users.
