MNIST Datasets for Machine Learning

October 16, 2019

Many data scientists consider the MNIST dataset (Modified National Institute of Standards and Technology database) to be one of the benchmark datasets for machine learning. Partly due to its small size and ease of use, it is often one of the first datasets that data scientists work with. Many also use it to compare different machine learning algorithms and test their performance.

The dataset contains 60,000 images of handwritten digits for training and 10,000 images for testing. As a benchmark in machine learning, it has inspired others to create datasets in a similar style. The datasets on this list use a format similar to the original, and many of them were created as drop-in replacements for the original MNIST dataset.
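For readers who want to see what "the same format" means in practice, here is a minimal sketch of a parser for the binary IDX layout in which the original MNIST files are distributed. The layout used here (two zero bytes, a type code, a dimension count, then big-endian 32-bit dimension sizes) is not described in this article and is an assumption taken from the original MNIST documentation. The function name `parse_idx` is hypothetical, and the sample data is synthetic rather than real MNIST pixels.

```python
import struct


def parse_idx(data: bytes):
    """Parse an unsigned-byte IDX byte string into (dims, payload)."""
    # Header: two zero bytes, a type code (0x08 = unsigned byte),
    # and the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    dims = struct.unpack(f">{ndim}I", data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    # The payload must hold exactly one byte per element.
    expected = 1
    for size in dims:
        expected *= size
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return dims, payload


# Synthetic stand-in: two 2x2 "images" instead of 60,000 28x28 ones.
sample = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
dims, pixels = parse_idx(sample)
```

With a real image file, `dims` would be expected to read as (number of images, rows, columns), and a label file would have a single dimension. A drop-in replacement dataset that truly follows the same layout should pass through the same parser unchanged.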

 

MNIST Reformattings, Extensions, and Modifications

EMNIST – Extended MNIST (EMNIST) is a set of six datasets derived from NIST Special Database 19. While the original MNIST includes only handwritten digits, EMNIST applies the same conversion process to the handwritten letters portion of the NIST database as well.

MNIST as JPG – As the title suggests, this is a reformatting of the original dataset. Instead of the original binary file format, the images are provided as individual JPEG files.

MNIST in CSV – This is a simple reformatting of the original MNIST into the more accessible CSV file format.
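As a rough illustration of why the CSV version is easier to work with, the sketch below reads MNIST-style CSV text using only the Python standard library. It assumes that each row holds the label followed by one value per pixel (784 values for a 28×28 image) and that a header row may or may not be present. That column layout is an assumption about the CSV release, not something stated in this article, and `parse_mnist_csv` is a hypothetical helper name.

```python
import csv
import io


def parse_mnist_csv(text, width=28, height=28):
    """Return a list of (label, pixels) pairs from MNIST-style CSV text."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        if not record[0].strip().isdigit():
            continue  # skip a header row such as "label,1x1,..."
        if len(record) != 1 + width * height:
            raise ValueError(
                f"expected {1 + width * height} columns, got {len(record)}"
            )
        label = int(record[0])
        pixels = [int(value) for value in record[1:]]
        rows.append((label, pixels))
    return rows


# Tiny synthetic 2x2 example; real rows would carry 784 pixel values.
sample = "label,p0,p1,p2,p3\n7,0,128,255,64\n3,1,2,3,4\n"
rows = parse_mnist_csv(sample, width=2, height=2)
```

For a real file, the text could be read with `open(path, newline="").read()` and passed in with the default 28×28 size.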

 

Datasets for Machine Learning Inspired by MNIST

3D MNIST – The creator of this dataset aimed to provide a resource for those working with 3D computer vision problems. The dataset was formed by generating 3D point clouds from the original MNIST images. It includes 5,000 training and 1,000 testing point clouds.

Fashion MNIST – From Zalando Research, this dataset contains clothing and accessory images from Zalando’s product catalogue. The format follows the original MNIST: it contains 60,000 training images and 10,000 test images, all 28×28 pixels and in grayscale. Each image has one of the following labels: ankle boot, bag, coat, dress, pullover, sandal, shirt, sneaker, T-shirt/top, and trouser.
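Because Fashion MNIST stores its labels as integers from 0 to 9, just like the digit labels in the original, a small lookup table is usually needed to turn predictions back into class names. The sketch below shows one way to do that. The index order is an assumption taken from the Fashion MNIST project's own documentation (the list above is alphabetical and does not give the numeric order), and `label_name` is a hypothetical helper.

```python
# Class names in label-index order (0 through 9). The order is assumed
# from the Fashion MNIST project documentation, not from this article.
FASHION_MNIST_LABELS = (
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
)


def label_name(index):
    """Map an integer label (0-9) to its class name."""
    if not 0 <= index < len(FASHION_MNIST_LABELS):
        raise ValueError(f"label index out of range: {index}")
    return FASHION_MNIST_LABELS[index]


# Example: translate a few predicted label indices into names.
predicted = [9, 0, 5]
names = [label_name(i) for i in predicted]
```

Keeping the names in a tuple indexed by label means a model trained on digit MNIST can be pointed at Fashion MNIST with only this mapping changed.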

Sign Language MNIST – A drop-in MNIST replacement, this dataset was created to help train hand gesture recognition models. It closely matches the format of the original MNIST, and its creation was inspired by Fashion MNIST.

Colorectal Histology MNIST – With data taken from Zenodo.org, this medical image dataset contains over 5,000 histological images of colorectal cancer.

Skin Cancer MNIST: HAM10000 – The Skin Cancer MNIST medical image dataset contains 10,015 dermatoscopic images of skin lesions. It was created for the ISIC 2018 challenge: Skin Lesion Analysis Towards Melanoma Detection.


For more info and access to the original MNIST database, please visit the creator’s website. For more datasets and reading on OCR and handwritten data, please see our related resources below.

Multilingual OCR Data Services

Lionbridge provides professional OCR data services in over 300 languages.
Some of our most popular languages include:

  • Chinese OCR data
  • Italian OCR data
  • Dutch OCR data
  • Japanese OCR data
  • French OCR data
  • Portuguese OCR data
  • German OCR data
  • Spanish OCR data

If you’re looking for custom OCR datasets or handwritten data collection services, Lionbridge can help. Get in touch with our sales team to learn about our training data services.
