Data bias in machine learning is a type of error in which certain elements of a dataset are more heavily weighted and/or represented than others. A biased dataset does not accurately represent a model’s use case, resulting in skewed outcomes, low accuracy levels, and analytical errors.
In general, training data for machine learning projects has to be representative of the real world. This is important because this data is how the machine learns to do its job. Data bias can occur in a range of areas, from human reporting and selection bias to algorithmic and interpretation bias. The image below is a good example of the sorts of biases that can appear in the data collection and annotation phases alone.
Resolving data bias in machine learning projects means first determining where it is. It’s only after you know where a bias exists that you can take the necessary steps to remedy it, whether that means filling gaps in your data or improving your annotation processes. With this in mind, it’s extremely important to be vigilant about the scope, quality, and handling of your data to avoid bias where possible. Bias affects not just the accuracy of your model, but can also extend to issues of ethics, fairness, and inclusion.
Below, we’ve listed seven of the most common types of data bias in machine learning to help you analyze and understand where it happens, and what you can do about it.
And if you’re looking for detailed information on data collection and data labeling for machine learning projects, be sure to check out our in-depth guide to training data for machine learning.
Types of data bias:
Though not exhaustive, this list contains common examples of data bias in the field, along with examples of where it occurs.
Sample bias: Sample bias occurs when a dataset does not reflect the realities of the environment in which a model will run. An example of this is certain facial recognition systems trained primarily on images of white men. These models have considerably lower levels of accuracy with women and people of different ethnicities. Another name for this bias is selection bias.
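One rough way to spot sample bias is to compare how often each group appears in the training set with how often you expect it to appear where the model will run. The sketch below is only an illustration: the function name `representation_gap`, the group labels, and the 50/50 expected shares are all hypothetical, and real projects would draw the expected shares from user research or population data.

```python
from collections import Counter


def representation_gap(train_groups, expected_share):
    """Return observed share minus expected share for each group."""
    counts = Counter(train_groups)
    total = len(train_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in expected_share.items()
    }


# Hypothetical group labels for a training set of 10 face images.
train_groups = ["men"] * 8 + ["women"] * 2
# Assumed share of each group among the people the model will serve.
expected_share = {"men": 0.5, "women": 0.5}

gaps = representation_gap(train_groups, expected_share)
for group, gap in gaps.items():
    # A positive gap means over-represented, a negative gap under-represented.
    print(f"{group}: {gap:+.2f}")
```

A large negative gap for a group is a signal to collect more data for that group before training.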
Exclusion bias: Exclusion bias is most common at the data preprocessing stage. Most often it’s a case of deleting valuable data thought to be unimportant. However, it can also occur due to the systematic exclusion of certain information. For example, imagine you have a dataset of customer sales in America and Canada. 98% of the customers are from America, so you choose to delete the location data, thinking it is irrelevant. However, this means your model will not pick up on the fact that your Canadian customers spend twice as much.
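Before deleting a column, it can help to check whether the outcome you care about differs across that column's values. The following is a minimal sketch of that check using made-up sales records; the field names `country` and `spend`, the helper `mean_by_category`, and the figures are assumptions chosen to mirror the example above.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(rows, category_key, value_key):
    """Average value_key separately for each distinct value of category_key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[category_key]].append(row[value_key])
    return {category: mean(values) for category, values in grouped.items()}


# Hypothetical sales records: most customers are American,
# but the single Canadian customer spends far more.
sales = [
    {"country": "US", "spend": 50.0},
    {"country": "US", "spend": 60.0},
    {"country": "US", "spend": 40.0},
    {"country": "CA", "spend": 100.0},
]

spend_by_country = mean_by_category(sales, "country", "spend")
print(spend_by_country)
```

If the averages differ noticeably between categories, the column carries information and probably should not be dropped just because one category is rare.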
Measurement bias: This type of bias occurs when the data collected for training differs from that collected in the real world, or when faulty measurements result in data distortion. A good example of this bias occurs in image recognition datasets, where the training data is collected with one type of camera, but the production data is collected with a different camera. Measurement bias can also occur due to inconsistent annotation during the data labeling stage of a project.
Recall bias: This is a kind of measurement bias, and is common at the data labeling stage of a project. Recall bias arises when you label similar types of data inconsistently. This results in lower accuracy. For example, let’s say you have a team labeling images of phones as damaged, partially damaged, or undamaged. If someone labels one image as damaged, but a similar image as partially damaged, your data will be inconsistent.
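Labeling inconsistency of this kind can be estimated by having two annotators label the same items and measuring their agreement. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects for chance agreement; it is one possible measure rather than the only one, and the annotator labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for the same six phone images.
annotator_a = ["damaged", "damaged", "partial", "undamaged", "undamaged", "partial"]
annotator_b = ["damaged", "partial", "partial", "undamaged", "undamaged", "partial"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 suggests consistent labeling, while values near 0 suggest the annotators agree no more often than chance would predict.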
Observer bias: Also known as confirmation bias, observer bias is the effect of seeing what you expect to see or want to see in data. This can happen when researchers go into a project with subjective thoughts about their study, either conscious or unconscious. We can also see this when labelers let their subjective thoughts control their labeling habits, resulting in inaccurate data.
Racial bias: Though not data bias in the traditional sense, this still warrants mentioning due to its prevalence in AI technology of late. Racial bias occurs when data skews in favor of particular demographics. This can be seen in facial recognition and automatic speech recognition technology that fails to recognize people of color as accurately as it does Caucasians. Google’s Inclusive Images competition included good examples of how this can occur.
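Accuracy differences like these are easy to miss when you only look at one overall score. A simple check, sketched below under the assumption that each evaluation example carries a demographic group tag, is to compute accuracy separately per group. The function `accuracy_by_group`, the groups "A" and "B", and the predictions are placeholders, not real results.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation set: true labels, model predictions, group tags.
y_true = [1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group)
```

Here the overall accuracy of 62.5% hides the fact that every error falls on group "B".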
Association bias: This bias occurs when the data for a machine learning model reinforces and/or multiplies a cultural bias. Your dataset may have a collection of jobs in which all men are doctors and all women are nurses. This does not mean that women cannot be doctors, and men cannot be nurses. However, as far as your machine learning model is concerned, female doctors and male nurses do not exist. Association bias is best known for creating gender bias, as was visible in the Excavating AI study.
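One way to surface this kind of association is to count how often each pair of attribute values occurs together and list the combinations that never appear. The sketch below assumes records with hypothetical `gender` and `job` fields that mirror the doctor and nurse example; `missing_pairs` is an illustrative helper, not part of any library.

```python
from collections import Counter


def missing_pairs(records, key_a, key_b):
    """List value combinations of key_a and key_b that never co-occur."""
    counts = Counter((record[key_a], record[key_b]) for record in records)
    values_a = sorted({record[key_a] for record in records})
    values_b = sorted({record[key_b] for record in records})
    return [
        (a, b)
        for a in values_a
        for b in values_b
        if counts[(a, b)] == 0
    ]


# Hypothetical dataset in which every man is a doctor and every woman a nurse.
records = (
    [{"gender": "man", "job": "doctor"}] * 3
    + [{"gender": "woman", "job": "nurse"}] * 3
)

gaps = missing_pairs(records, "gender", "job")
print(gaps)
```

Any pair returned is a combination the model will never see during training, which is a prompt to collect or rebalance data for it.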
How do I avoid data bias in machine learning projects?
The prevention of data bias in machine learning projects is an ongoing process. It is sometimes difficult to know when your data or model is biased, but there are a number of steps you can take to help prevent bias or catch it early. Though far from a comprehensive list, the bullet points below provide an entry-level guide for thinking about data bias in machine learning projects.
- To the best of your ability, research your users in advance. Be aware of your general use-cases and potential outliers.
- Ensure your team of data scientists and data labelers is diverse.
- Where possible, combine inputs from multiple sources to ensure data diversity.
- Create a gold standard for your data labeling. A gold standard is a set of data that reflects the ideal labeled data for your task. It enables you to measure your team’s annotations for accuracy.
- Make clear guidelines for data labeling expectations so data labelers are consistent.
- Use multi-pass annotation for any project where data accuracy may be prone to bias. Examples of this include sentiment analysis, content moderation, and intent recognition.
- Enlist the help of someone with domain expertise to review your collected and/or annotated data. Someone from outside of your team may see biases that your team has overlooked.
- Analyze your data regularly. Keep track of errors and problem areas so you can respond to and resolve them quickly. Carefully analyze data points before making the decision to delete or keep them.
- Make bias testing a part of your development cycle. Google, IBM, and Microsoft have all released tools and guides to help with analyzing bias for a number of different data types.
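Two of the labeling steps above, scoring annotators against a gold standard and using multi-pass annotation, can be sketched in a few lines. This is an illustrative outline only: the item IDs, labels, and helper names `accuracy_against_gold` and `majority_label` are hypothetical, and majority voting is just one possible way to combine multiple passes.

```python
from collections import Counter


def accuracy_against_gold(labels, gold):
    """Share of gold-standard items that the annotator labeled correctly.

    Only items present in both mappings are scored.
    Returns None if there is no overlap.
    """
    scored = [item for item in gold if item in labels]
    if not scored:
        return None
    hits = sum(labels[item] == gold[item] for item in scored)
    return hits / len(scored)


def majority_label(votes):
    """Return the most common label, or None if empty or tied for first place."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # Tie: send the item back for another review pass.
    return ranked[0][0]


# Hypothetical gold standard and one annotator's labels.
gold = {
    "img1": "damaged",
    "img2": "undamaged",
    "img3": "partially damaged",
    "img4": "damaged",
}
annotator = {
    "img1": "damaged",
    "img2": "undamaged",
    "img3": "damaged",
    "img4": "damaged",
}

score = accuracy_against_gold(annotator, gold)
print(f"accuracy vs gold: {score:.2f}")

# Three hypothetical passes over one sentiment-analysis item.
print(majority_label(["positive", "positive", "negative"]))
```

Items where `majority_label` returns None have no clear winner and are good candidates for review by a domain expert.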
If you’re looking for a deeper dive into how bias occurs, its effects on machine learning models, and past examples of it in automated technology, we recommend checking out Margaret Mitchell’s “Bias in the Vision and Language of Artificial Intelligence” presentation. You can take a look at the slides for the presentation here, or watch the video below.
It’s important to be aware of the potential biases in machine learning for any data project. By putting the right systems in place early and keeping on top of data collection, labeling, and implementation, you can notice bias before it becomes a problem, or respond to it when it appears.
Alternatively, if you are looking at putting together a team of diverse data scientists and data labelers to ensure high quality data, get in touch. With access to leading data scientists in a variety of fields and a global community of 1,000,000+ contributors, Lionbridge can help you define, collect, and prepare the data you need for your machine learning project.