How to Find Datasets for Machine Learning: Tips for Open Source and Custom Datasets

Article by Hengtee Lim | June 17, 2020

Knowing the basics of finding and creating datasets is a valuable skill for anybody in machine learning. After all, data is at the core of any machine learning model, and understanding your data and its labeling requirements is an important first step towards training an accurate model.

Put simply, the creation of a dataset can be broken into two parts: data collection and data labeling. Collection is the process of gathering raw data as text, audio, image, or video, while labeling is the process of preparing that data for training.
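
As a minimal sketch of these two steps, the Python snippet below keeps collection and labeling as separate stages. The RawSample and LabeledSample types, the label names, and the simple punctuation rule standing in for a human annotator are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RawSample:
    """One collected item, before any labeling."""
    text: str


@dataclass
class LabeledSample:
    """A collected item paired with the label a model will train on."""
    text: str
    label: str


def collect() -> list[RawSample]:
    # Stand-in for real collection (scraping, in-house logs, recordings).
    return [
        RawSample("What time does the store open?"),
        RawSample("I love this product."),
    ]


def label(sample: RawSample) -> LabeledSample:
    # Hypothetical rule standing in for a human annotator or labeling tool.
    tag = "question" if sample.text.rstrip().endswith("?") else "statement"
    return LabeledSample(sample.text, tag)


# Collection produces raw data; labeling turns it into training data.
dataset = [label(s) for s in collect()]
for item in dataset:
    print(f"{item.label}: {item.text}")
```

In a real project the labeling step would usually be done by people or a dedicated tool rather than a one-line rule, but the shape of the pipeline is the same.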

However, creating a dataset is often complicated and time-consuming. Do you always have to create your data from scratch? In this article, we’ll look at two ways to find datasets for machine learning projects: open-source datasets and custom datasets. We’ll look at when each type of dataset is used, and what options are available when sourcing specific datasets.


When can I use open-source datasets?

The Internet contains a wealth of data for machine learning projects, and open-source datasets are a good way to find this data already packaged for labeling or use. Kaggle and Google’s Dataset Search are good examples of general repositories for both structured and unstructured data. For examples of commonly used open-source datasets, look no further than ImageNet and the COCO dataset. Both are large image datasets commonly used for object detection, segmentation, and classification tasks.

An open-source dataset is often perfect for those who are studying machine learning algorithms and want to experiment with them. A pre-labeled dataset can allow you to skip the collection and labeling process to focus on preprocessing, testing, and application. In our recent studies of multiclass classification for text and image datasets, open-source datasets were the perfect option for exploring the development of specific ML systems and how they work.
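
To illustrate what skipping straight to testing can look like, here is a rough sketch that splits an already-labeled dataset into training and test sets using only the Python standard library. The eight sample records and the split_dataset helper are made up for this example and do not come from any particular open-source dataset.

```python
import random

# Tiny stand-in for a pre-labeled open-source dataset: (text, label) pairs.
labeled_data = [
    ("great movie", "positive"),
    ("terrible plot", "negative"),
    ("loved the acting", "positive"),
    ("boring and slow", "negative"),
    ("a wonderful film", "positive"),
    ("waste of time", "negative"),
    ("really enjoyable", "positive"),
    ("not worth watching", "negative"),
]


def split_dataset(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_dataset(labeled_data)
print(len(train), "training samples,", len(test), "test samples")
```

With the eight samples above and a test fraction of 0.25, this yields six training samples and two test samples. Because the labels already exist, all of the effort goes into the split and whatever preprocessing and evaluation follow it.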

Similarly, open-source datasets are a good option when you need to quickly build and test a proof of concept or use-case example. Once the concept is proven, you can decide whether to improve and scale the model with in-house data, a custom dataset, or other open-source datasets.

With this in mind, an open-source dataset might be enough for your purposes. But how do you know when it isn’t?


When do I need a custom dataset?

Generally speaking, custom datasets differ from open-source in that they are built from the ground up for a very specific purpose. Custom datasets are most often necessary when the data for a specific project is not available elsewhere, but they are also often built from in-house data in need of cleaning and labeling, or adapted and refined from an existing dataset.

The Japanese company Zaizen, for example, creates chatbots that engage in natural conversation with users based on their emotions. In order to train their system, they required samples of everyday conversation in Japanese labeled for intent. To ensure a wide enough range of text data for their model, they opted for a custom-made dataset of 5,000+ Q/A samples.

Custom datasets are also needed when new data becomes necessary during a model’s development cycle. Sometimes the data is needed to teach the system something it doesn’t know (i.e. providing missing data), and sometimes it simply improves a model by giving it more data to train on. You can see this in our ongoing work with Japan’s NICT (National Institute of Information and Communications Technology). To help with the development of a translation app for travelers, we are regularly delivering audio and text datasets.

In both of the examples above, the necessary datasets were created and curated to fit specific training purposes. This allowed the data science teams to then get the best from their specific systems.


Resources for Data Collection, Data Labeling, and Open-source Datasets

Data Collection

Data collection methods can range from data scraping and in-house data to synthetic data creation and data augmentation. Which method is best for a particular project depends on a variety of factors including time, money, and manpower. For a detailed guide on how to choose a data collection method, be sure to check our article here.
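
As a small illustration of data augmentation, the sketch below creates variants of a text sample by randomly dropping words. Random word dropout is just one simple technique picked for this example, and the augment_by_word_dropout function and its parameters are hypothetical names rather than part of any library.

```python
import random


def augment_by_word_dropout(text, drop_prob=0.2, seed=None):
    """Return a copy of `text` with each word removed with probability drop_prob."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    # Never return an empty sample: fall back to the original words.
    return " ".join(kept if kept else words)


original = "where is the nearest train station"
# A different seed per variant gives different (but reproducible) results.
variants = [augment_by_word_dropout(original, seed=i) for i in range(3)]
for v in variants:
    print(v)
```

Each variant keeps the original word order and adds no new words, so the label of the original sample can usually be reused. Whether that holds depends on the task, so augmented samples are worth spot-checking before training.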


Data Labeling

Once you have collected data for a dataset, you then need to label it for its particular purpose. Labeling requirements vary widely depending on the type of data and its use, so it’s important to make sure you have access to the right tools or platform to work with. You can find a full walkthrough of how to choose a data labeling approach in our guide here.


Open-Source Data Repositories

The following resources should provide a good starting point for finding and exploring open-source datasets.

Google Dataset Search:
Launched in 2018, this service is designed to help researchers find online datasets that are freely available for use. It boasts access to close to 25 million publicly available datasets.

Kaggle:
Kaggle is an online community of data scientists and machine learning practitioners. The website is home to a huge number of resources for data science projects including a wide range of datasets.

The Lionbridge Ultimate Dataset Aggregator:
This dataset repository contains links to curated collections of datasets for a wide variety of purposes. It includes text and audio datasets in a variety of languages and image and video datasets for machine learning projects.


The Essential Guide to Training Data

Our essential guide to training data is a comprehensive look at important issues in the training data process, such as how to build, format, and annotate a training dataset. It also looks at how to improve the quality of your dataset, and where to get more data for it.


If you don’t know how to find the right dataset for your project, or are unsure of how to approach the collection or labeling process, get in touch. Our access to leading data scientists and a global community of over 1 million contributors makes us well-equipped for collecting and preparing datasets for a variety of machine learning uses.

The Author
Hengtee Lim

Hengtee is a writer with the Lionbridge marketing team. An Australian who now calls Tokyo home, he can often be found crafting short stories in cafes and coffee shops around the city.

