It’s logical to want to know how much training data you will need to ensure top-of-the-range performance from your algorithm. After all, no one has ever built a functioning model without a strong foundation in richly detailed data. However, it’s not as easy as you might think to come up with the magic number of data points that will turn your model from good into great. In fact, trying to figure it out can seriously complicate things for your data scientists.
The quantity of data points you need is affected by a huge range of factors, all of which have a varying degree of influence on the eventual size of your dataset. To work a perfect number out beforehand, you’d probably need another machine learning algorithm, dedicated to calculating the effect of everything from the type of model you’re building to the way you plan to use it.
Despite this, it’s still useful to know whether you’re going to need hundreds, thousands, or millions of data points, even if you can’t pinpoint an exact figure. With this in mind, let’s explore some of the most common issues affecting dataset size. Afterwards, we’ll look at some ways to navigate them and figure out roughly how much data you need for your project.
Why is it difficult to estimate the size of your dataset?
A lot of the difficulty around locking in a target number of data points stems from the goals of the training process. We’ve discussed how AI training works before, but it’s worth remembering that the goal of training is to build a model that understands the patterns and relationships behind the data, rather than just the data itself. When gathering data, you need to be sure that you have enough to give your algorithm an accurate picture of the complex network of meaning that exists behind and between your data points.
This might seem like a straightforward exercise on the surface. However, as we’ve seen in previous articles, the varied goals of machine learning projects can result in a vast range of training data types. Consequently, every project has a unique combination of factors that make it extremely difficult to work out your data needs ahead of time. These may include some or all of the following:
- Complexity of model: Each parameter that your model has to consider in order to perform its task increases the amount of data that it will need for training. For example, a model that is asked to identify the make of a specific car has a small, set number of parameters relating mostly to the shape of the vehicle. A model that has to determine how much that car costs has a far bigger picture it needs to understand, including not only the make and condition of the car but also economic and social factors. Thanks to this greater degree of complexity, the second model will need significantly more data than the first.
- Training method: As models are forced to understand a greater number of interlinking parameters, the resulting complexity forces a change in the way that they are trained. Traditional machine learning algorithms typically learn from structured data with hand-engineered features, which means that they quickly reach a point where additional data has very little ROI. In contrast, deep learning models learn their own features directly from the raw data. This means that they not only require significantly more data, but also have a much longer learning curve over which further data continues to have a positive impact. As a result, the training method you use will cause significant variation in the amount of training data that is useful to your model.
- Labeling needs: Depending on the task you’re doing, data points can be annotated in different ways. This can cause significant variation in the number of labels your data produces, as well as the effort it takes to create those labels. For example, if you have 1,000 sentences of input data for sentiment analysis, you may only need to label them as positive or negative and therefore produce one label per sentence. However, if those same 1,000 sentences are annotated for entity extraction, you may need to label 5 words per sentence. Despite having the same raw input data, one task yields five times as many labels as the other. The way you prepare your data can therefore affect the amount you need for your project, as well as the price of procuring it.
- Tolerance for errors: The intended role of the model within your business also affects data quantity. A 20% error rate may be acceptable for a model that predicts the weather, but not for one that detects patients at risk of heart attacks. Reducing that risk means improving performance on edge cases, which are rare by definition and therefore take more data to cover. If your algorithm is highly risk-averse or integral to the success of your business, the amount of data you need will increase to reflect the need for near-flawless performance.
- Diversity of input: We live in a complex world, which is capable of throwing a variety of inputs at your model. For example, a chatbot has to be able to understand a variety of languages, written in a range of formal, informal and even grammatically incorrect styles. In cases where your model’s input won’t be highly controlled, more data will be necessary to help your model function in that unpredictable environment.
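To make the labeling example above concrete, here is a minimal Python sketch of the arithmetic. The sentence and label counts come from the example; the price per label is a purely hypothetical figure (expressed in cents) chosen for illustration, not a real quote.

```python
def label_count(num_items: int, labels_per_item: int) -> int:
    """Total labels produced when every item receives the same number of labels."""
    return num_items * labels_per_item


def annotation_cost_cents(num_labels: int, price_per_label_cents: int) -> int:
    """Total annotation cost in cents, assuming a flat price per label."""
    return num_labels * price_per_label_cents


sentences = 1000

# Sentiment analysis: one positive/negative label per sentence.
sentiment_labels = label_count(sentences, labels_per_item=1)

# Entity extraction: roughly five labeled words per sentence.
entity_labels = label_count(sentences, labels_per_item=5)

# Hypothetical flat price of 5 cents per label (an assumption, not a real rate).
PRICE_PER_LABEL_CENTS = 5

print(f"Sentiment labels: {sentiment_labels}")
print(f"Entity labels:    {entity_labels}")
print(f"Label ratio:      {entity_labels // sentiment_labels}x")
print(f"Entity cost:      {annotation_cost_cents(entity_labels, PRICE_PER_LABEL_CENTS)} cents")
```

In practice, entity labels usually take more effort per label than a single sentiment tag, so a flat price is likely to understate the difference in cost between the two tasks.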
From this, it’s clear that the amount of data you need is decided by your project’s unique needs and goals. In the end, project leaders have to balance these factors out for themselves and come up with their own target. With this in mind, let’s have a look at some ways to begin determining your data needs.
How can you calculate your data needs?
It’s probably impossible to determine the exact number of data points that any given algorithm requires. Fortunately, a general estimate based on analysis of your project is more than enough to get you started. Here are two common methods of doing this to help you get the ball rolling:
- Rule of 10: One common and much debated rule of thumb is that a model will often need ten times as many data points as it has degrees of freedom. A degree of freedom can be a parameter which affects the model’s output, an attribute of one of your data points or, more simply, a column in your dataset. The rule of 10 aims to compensate for the variability that those combined parameters bring to the model’s input. For complex models this is arguably unhelpful, as it simply reframes the debate around another question that’s impossible to answer: how many degrees of freedom the model really has. However, this rule provides a quick estimate that might be enough to keep your project moving.
- Learning curves: If you already have some data and want to make a decision based on a little more evidence, you could run a study that evaluates the model’s skill as a function of dataset size. By plotting the results on a graph, you should be able to see the relationship between the size of your dataset and the model’s skill, and identify the point after which more data provides diminishing returns. This is a more labor-intensive method, since it involves training and evaluating a model several times on progressively larger subsets of your data (a simple model such as logistic regression can serve for a first pass), but it may give you a more reliable result than simple guessing.
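Both methods can be sketched in a few lines of Python. The block below is only an illustration under stated assumptions: the data is synthetic (two overlapping clusters of 2D points), and a nearest-centroid classifier stands in for whatever model your project actually uses. The function names are hypothetical, and the results are printed as a table rather than plotted.

```python
import random
from statistics import fmean


def rule_of_ten(degrees_of_freedom: int, factor: int = 10) -> int:
    """Rule-of-thumb estimate: `factor` data points per degree of freedom."""
    return factor * degrees_of_freedom


def make_samples(n, rng):
    """Synthetic two-class data (an assumption for illustration only):
    class 0 is centred on (0, 0) and class 1 on (2, 2), both with unit noise."""
    samples = []
    for i in range(n):
        label = i % 2  # alternate labels so every even-sized prefix is balanced
        centre = 2.0 * label
        point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        samples.append((point, label))
    return samples


def fit_centroids(train):
    """'Train' a nearest-centroid classifier: one mean point per class."""
    centroids = {}
    for label in (0, 1):
        points = [p for p, lab in train if lab == label]
        centroids[label] = (
            fmean(px for px, _ in points),
            fmean(py for _, py in points),
        )
    return centroids


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid matches their label."""
    def predict(point):
        return min(
            centroids,
            key=lambda lab: (point[0] - centroids[lab][0]) ** 2
            + (point[1] - centroids[lab][1]) ** 2,
        )

    correct = sum(predict(p) == lab for p, lab in test)
    return correct / len(test)


def learning_curve(train_sizes, test_size=2000, seed=0):
    """Train on progressively larger subsets and score each on one fixed test set.
    Training sizes should be even so that both classes are equally represented."""
    rng = random.Random(seed)
    pool = make_samples(max(train_sizes), rng)
    test = make_samples(test_size, rng)
    return [(n, accuracy(fit_centroids(pool[:n]), test)) for n in train_sizes]


# Rule of 10: a dataset with 20 columns suggests about 200 rows as a first guess.
print(rule_of_ten(20))

# Learning curve: look for the size after which accuracy stops improving much.
for n, acc in learning_curve([10, 30, 100, 300, 1000]):
    print(f"{n:>5} training samples -> accuracy {acc:.3f}")
```

On a problem this simple the curve is likely to flatten early. On a real project, the training size at which the curve levels off gives a rough indication of whether collecting more data is still worthwhile.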
Often it’s best to simply begin working on your model with the data that you have, adding more data when you feel that you need it. Your data needs will become more obvious once your project has some results. However, for those who would prefer to see a concrete figure before beginning, below are a few estimates of dataset size for projects that we’ve found across the Internet. Perhaps these specific examples will give you an idea of a number to aim for with your own project.
| Project | Task | Amount of data |
|---|---|---|
| FaceNet | Facial detection and recognition | 450,000 samples |
| MIT CSAIL | Image annotation | 185,000 images, 62,000 annotated images, 650,000 labeled objects |
| Sprout | Sentiment analysis for Twitter | ‘tens of thousands of Tweets’ |
| ‘Twitter Sentiment Analysis: The Good, the Bad and the OMG!’ | Research on sentiment analysis for Twitter | Selections from 3 corpora totaling 600,000 data points |
| ‘Analysis and Classification of Arabic Newspapers’ Facebook Pages using Text Mining Techniques’ | Sentiment analysis and classification of Facebook pages in Arabic | 62,000 posts, 9,000 comments |
| ‘Improved Text Language Identification for the South African Languages’ | Text language identification | 3,000 training samples and 1,000 test samples per language |
| TransPerfect | Machine translation | 4 million words |
| ‘Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics’ | Chatbot training | 2 million answers paired with 200,000 questions |
| Online Learning Library | Natural Language Processing experiments | 15,000 training points, 1 million+ features |
Quantity vs Quality
The limits of your data are the limits of your model’s world. However, with all the discussion about how much data you need, don’t forget that this applies to data quality as well as quantity. A million messy data points will be far worse for your model than 100 spotlessly clean, richly detailed examples that will help your algorithm to home in on its goal. Whatever you’re building, make sure that the data you’re using is going to give you a solid foundation and the best possible chance of success.
At Lionbridge, our twenty years of experience in building a crowdsourcing platform has helped us to perfect the blend of quality and quantity. Our professional crowd of 500,000 qualified contributors have all the skills necessary to create, annotate and improve your data across a wide range of use cases. For a gold standard that will break down barriers and help your model to scale, we’re the obvious choice. Contact us now to find out how we can take your data to the next level.