How to Use Inaccurate Data for Machine Learning with Weakly Supervised Learning

Article by Bharath Raj | April 29, 2020

Supervised learning techniques provide a simple framework for solving complex tasks such as semantic segmentation and object detection, but they also require large quantities of high quality annotations as labels. For tasks like semantic segmentation, the time and cost of creating annotations are often higher than for tasks like classification. These difficulties, and with them the cost and time involved, can shoot up drastically if you need to consult subject matter experts to create the annotations.

To overcome these limitations, one low cost/time option is to make use of lower quality annotations collected in large quantities. But of course, the natural question then is, “How do we use low quality annotations, when the supervised learning framework requires high quality ones?”

In this article, we’ll look at Weakly Supervised Learning (WSL), which provides a solution by leveraging “weak” annotations to learn the task. But before we dive deeper into the techniques, it is worth exploring the various types of WSL techniques and the sections we intend to cover in this article.

 

Types of Weakly Supervised Learning

This Wikipedia article on weak supervision and Zhou’s literature survey on WSL mention slightly different classifications of WSL techniques. I have combined their respective classifications and listed them below:

  • Inexact/Imprecise Supervision: When our data has higher level (abstract), less precise labels.
  • Inaccurate Labels: In this case, we might have some data that is labelled incorrectly.
  • Existing Resources: Making use of existing resources like knowledge bases, alternative datasets, or pre-trained models to create labels that are helpful, though not perfectly suited for the given task.
  • Incomplete Supervision: Only a small subset of the training data has labels.

 

Illustration depicting sources of Weak Supervision (source: http://ai.stanford.edu/blog/weak-supervision/)

 

As we can see from the list above, WSL is a rather broad topic. In this article, we’re mostly interested in “how to work with data where labels are higher-level (or imprecise), and/or where labels are potentially noisy (or incorrect).” Moreover, we primarily focus on applying WSL techniques to image-based data. That being said, some of the techniques presented in this article (like Snorkel) can be used for other types of data, such as text.

 

Weakly Supervised Learning Techniques


1. Weakly Supervised Semantic Segmentation

It should come as no surprise that getting high quality labelled data for semantic segmentation is often hard and expensive. The authors of the paper BoxSup note: “the workload of labeling segmentation masks is more than 15 times heavier than that of spotting object locations.” 

It would be a tremendous reduction in labeling effort if we could use inexact labels such as the list of objects in an image (aka image level labels) or their bounding boxes to train a semantic segmentation model.

So, let’s explore a few papers that use the inexact labels described above to perform Weakly Supervised Semantic Segmentation (WSSS).

Multiple Instance Learning (MIL) is a type of learning framework where the user provides data in which instances are not individually labelled. Rather, the user provides a set of labelled bags, which are a collection of instances. In the case of binary classification, a bag is labelled positive if at least one instance in the bag is positive. A bag is labelled negative if all instances inside the bag are negative. From our set of labelled bags, we then try to infer the labels of the individual instances.
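The binary bag-labelling rule above can be written in a couple of lines. The bags below are hypothetical, made up purely for illustration (at training time we would only observe the bag labels, not the instance labels):

```python
def bag_label(instance_labels):
    """Binary MIL rule: a bag is positive iff at least one instance is positive."""
    return int(any(instance_labels))

# Hypothetical bags with known instance labels, used here only to show the rule.
bags = {
    "bag_a": [0, 0, 1],  # contains one positive instance -> positive bag
    "bag_b": [0, 0, 0],  # all instances negative -> negative bag
}
print({name: bag_label(instances) for name, instances in bags.items()})
# {'bag_a': 1, 'bag_b': 0}
```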

 

Illustration of the concept of bags and instances (or “examples”) in MIL (Source: https://www.researchgate.net/figure/An-illustration-of-the-concept-of-multiple-instance-learning-In-MIL-training-examples_fig1_315925709)

 

The above MIL concept can be applied for WSSS when only image level labels are available. As mentioned in this paper, we can consider every pixel in the image as an “instance” and the image itself as a “bag”. The image level labels then act as the bag labels. By inferring the labels of the instances (pixels) we get our pixel-level segmentation map. More recent MIL methods are available in this literature survey.
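As a minimal, hypothetical sketch of this idea (not any specific paper's method): if a network produces a per-pixel score map for each class, an image-level (bag) score can be obtained by max-pooling over the pixels (instances), so the strongest pixel determines the bag prediction:

```python
import numpy as np

def image_level_scores(pixel_scores):
    """MIL-style aggregation: the image (bag) score for each class is the
    maximum score over all of its pixels (instances).

    pixel_scores: array of shape (num_classes, H, W) with per-pixel class scores.
    Returns an array of shape (num_classes,) with image-level scores.
    """
    return pixel_scores.reshape(pixel_scores.shape[0], -1).max(axis=1)

# Toy example: 2 classes over a 2x2 image.
scores = np.array([
    [[0.1, 0.9], [0.2, 0.3]],   # class 0: strongest pixel score = 0.9
    [[0.4, 0.1], [0.05, 0.2]],  # class 1: strongest pixel score = 0.4
])
print(image_level_scores(scores))  # [0.9 0.4]
```

During training, a loss on these pooled scores against the image-level labels pushes the per-pixel scores (our desired segmentation map) in the right direction.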

The above survey also mentions the use of Self-Supervised Learning (SSL) techniques to solve WSSS problems. The authors state that SSL approaches are similar to MIL except that they use the inferred pixel-level activations as pseudo ground truth cues (seeds) for self-supervised learning of the final pixel-level segmentation maps. Methods of this type often train a backbone classifier to produce Class Activation Maps (CAMs) as seeds, and then train a segmentation network on these seeds.

 

Class Activation Map highlighting the class specific discriminative regions (Source: http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf)

 

For instance, consider the paper “Seed, Expand and Constrain” (SEC). CAMs are generated for each class (and background) as weak localization cues and used for training a neural network. A three part loss function is used for the training process:

  1. Seed loss comparing the network output with the weak localization cues.
  2. Expansion loss comparing the network output with the image level labels.
  3. Constraint loss comparing the network output with the network output refined by a dense Conditional Random Field (CRF).

At test time, a dense CRF was used for post processing.
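The three losses above can be sketched in simplified form. This is a toy illustration, not the paper's implementation: the expansion term here uses plain global max pooling instead of SEC's global weighted rank pooling, and `crf_probs` simply stands in for the CRF-refined network output:

```python
import numpy as np

EPS = 1e-8

def seed_loss(probs, cues):
    """Cross-entropy restricted to pixels covered by the weak localization cues.
    probs: (C, N) per-pixel class probabilities; cues: (C, N) binary seed mask."""
    return -np.sum(cues * np.log(probs + EPS)) / max(cues.sum(), 1)

def expand_loss(probs, image_labels):
    """Simplified expansion term: per-class global max pooling compared against
    the image-level labels (SEC uses global weighted rank pooling, of which
    max pooling is a special case)."""
    pooled = probs.max(axis=1)
    return -np.mean(image_labels * np.log(pooled + EPS)
                    + (1 - image_labels) * np.log(1 - pooled + EPS))

def constrain_loss(probs, crf_probs):
    """Mean KL divergence from the CRF-refined distribution to the network output."""
    kl = np.sum(crf_probs * np.log((crf_probs + EPS) / (probs + EPS)), axis=0)
    return np.mean(kl)

# Toy setup: 2 classes over 4 pixels, both classes present in the image.
probs = np.array([[0.9, 0.8, 0.3, 0.2],
                  [0.1, 0.2, 0.7, 0.8]])
cues = np.array([[1, 0, 0, 0],
                 [0, 0, 0, 1]])
image_labels = np.array([1, 1])
crf_probs = np.array([[0.95, 0.9, 0.2, 0.1],
                      [0.05, 0.1, 0.8, 0.9]])
total = (seed_loss(probs, cues) + expand_loss(probs, image_labels)
         + constrain_loss(probs, crf_probs))
print(round(float(total), 4))
```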

 

Schematic of the SEC network (Source: https://arxiv.org/pdf/1603.06098.pdf)

 

Methods that rely on raw CAM information tend to produce good segmentations only for discriminative parts of the image (i.e. parts of the image useful for the backbone classifier to distinguish between classes). To overcome this limitation, techniques like Adversarial Erasing and Region Growing are proposed.

Distinct from the MIL and SSL techniques, Papandreou et al. present Expectation Maximization (EM) methods for learning the semantic segmentation task from both image level labels and bounding box annotations. When only image level labels are available, they consider the image values x and the image level labels z as observed variables and the pixel level segmentations y as latent (hidden) variables. The same framework can be adapted for bounding box annotations. Using the EM methods described in the paper, we can estimate the pixel level segmentations.
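To make the EM idea concrete, here is a deliberately tiny, hypothetical sketch (not the paper's DCNN-based method): pixel intensities are modeled with one fixed-variance Gaussian per class, only the classes named by the image-level label are allowed, and EM alternates between soft-assigning pixels to classes (E-step) and re-estimating the class means (M-step):

```python
import numpy as np

def em_segment(pixels, present_classes, sigma=0.1, n_iter=20):
    """Toy EM for the latent pixel labels y given observed pixels x and an
    image-level label set z (the classes allowed to appear in this image).
    pixels: (N,) intensities. Returns hard per-pixel labels after EM."""
    means = np.linspace(pixels.min(), pixels.max(), len(present_classes))
    for _ in range(n_iter):
        # E-step: responsibilities under fixed-variance Gaussians per class.
        dist = (pixels[None, :] - means[:, None]) ** 2
        resp = np.exp(-dist / (2 * sigma ** 2))
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate each class mean from the soft assignments.
        means = (resp * pixels[None, :]).sum(axis=1) / resp.sum(axis=1)
    return np.array(present_classes)[resp.argmax(axis=0)]

# Toy "image": dark background pixels and a bright object; the image-level
# label says classes {0: background, 1: object} are both present.
pixels = np.array([0.05, 0.1, 0.08, 0.9, 0.95, 0.88])
print(em_segment(pixels, [0, 1]))  # [0 0 0 1 1 1]
```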

 

Illustrations taken from the paper of Papandreou et al. which describe WSSS techniques using image level labels (left) and bounding box level annotations (right) (Source: https://arxiv.org/pdf/1502.02734.pdf)

 

There are also other methods, like BoxSup, which use region proposal methods and bounding box annotations for the WSSS task. For the sake of brevity, we have only explored a few key ideas from a small number of papers here; for more information, we recommend the literature surveys and paper collections linked in this section.

 

2. Weakly Supervised Object Localization

Object Localization refers to the process of learning to draw a bounding box around an object of interest. While traditional supervised methods require bounding box annotations to learn such a task, Weakly Supervised Object Localization (WSOL) methods can learn with just image level labels (such as the list of objects in the image).

A common deep learning approach to solving this problem is to find the Class Activation Map (CAM) of the object of interest and fit a bounding box onto it. Since we already explored a few CAM approaches in the previous section, in this section we will only explore how CAMs can be used for creating a bounding box.

As mentioned in this paper, one option is to use a simple thresholding method to segment the CAM and obtain a bounding box. First, the regions of the CAM whose values exceed 20% of the CAM's maximum value are used to create a segmentation map. Then, a bounding box is drawn to tightly encompass the largest connected component in the segmentation map.
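This thresholding scheme is straightforward to sketch. The code below is an illustrative implementation (with a simple flood fill for connected components), not the authors' original code:

```python
import numpy as np
from collections import deque

def largest_component(mask):
    """4-connected flood fill; returns the pixel list of the largest component."""
    visited = np.zeros_like(mask, dtype=bool)
    best, best_size = None, 0
    H, W = mask.shape
    for r in range(H):
        for c in range(W):
            if mask[r, c] and not visited[r, c]:
                comp, queue = [], deque([(r, c)])
                visited[r, c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            queue.append((ny, nx))
                if len(comp) > best_size:
                    best, best_size = comp, len(comp)
    return best

def cam_to_bbox(cam, threshold_ratio=0.2):
    """Threshold the CAM at a fraction of its max value and fit a tight box
    (row_min, col_min, row_max, col_max) around the largest connected component."""
    mask = cam >= threshold_ratio * cam.max()
    comp = largest_component(mask)
    if comp is None:
        return None
    rows = [p[0] for p in comp]
    cols = [p[1] for p in comp]
    return min(rows), min(cols), max(rows), max(cols)

# Toy CAM with a single activated blob at rows 1-3, cols 2-4.
cam = np.zeros((6, 6))
cam[1:4, 2:5] = 1.0
print(cam_to_bbox(cam))  # (1, 2, 3, 4)
```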

 

Localization results by using the method presented by Zhou et al. Green Boxes represent the ground truth whereas the Red Boxes represent the prediction by the model (Source: http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf)

 

You can refer to these papers to learn more about how the authors create CAMs, and to the literature surveys and paper collections on WSOL for more information about other approaches.

 

3. Multi Source Inexact and Inaccurate Supervision

In this section we’ll primarily explore the workflow of Snorkel, a framework which can be used to learn from multiple sources of inaccurate (noisy) labels. Snorkel assumes that we can query high level or less precise information about data points from multiple labeling functions.

For instance, let’s consider the task of classifying the existence of an object for which we do not have accurately labelled data. Each labeling function can use its own heuristics and approximate rules to provide a guess as to whether the object is present in the image. Snorkel then uses a generative model to learn the accuracies of the labeling functions, and outputs a probabilistic label for each data point. We can then train a discriminative model (like a neural network) on these probabilistic labels to robustly classify the existence of objects.
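As a rough illustration of the workflow (not Snorkel's actual API or its generative model), the sketch below combines three made-up labeling functions with a simple accuracy-weighted vote to produce a probabilistic label:

```python
import numpy as np

# Hypothetical labeling functions: each returns +1 (object present),
# -1 (absent), or 0 (abstain) for a data point described by simple features.
def lf_bright(x):   # heuristic: bright images tend to contain the object
    return 1 if x["brightness"] > 0.5 else 0

def lf_keyword(x):  # heuristic: the caption mentions the object
    return 1 if "cat" in x["caption"] else -1

def lf_size(x):     # heuristic: very small files rarely contain the object
    return -1 if x["size_kb"] < 10 else 0

def combine(lfs, x, weights=None):
    """Simplified stand-in for Snorkel's generative model: an (optionally
    accuracy-weighted) vote over non-abstaining labeling functions,
    returning a probabilistic label P(object present)."""
    votes = np.array([lf(x) for lf in lfs], dtype=float)
    w = np.ones_like(votes) if weights is None else np.asarray(weights, float)
    active = votes != 0
    if not active.any():
        return 0.5  # every function abstained: no evidence either way
    score = np.sum(w[active] * votes[active]) / np.sum(w[active])
    return (score + 1) / 2  # map [-1, 1] -> [0, 1]

x = {"brightness": 0.8, "caption": "a cat on a sofa", "size_kb": 250}
print(combine([lf_bright, lf_keyword, lf_size], x))  # 1.0
```

In the real framework, the weights are not hand-set: the generative model estimates each labeling function's accuracy from the agreements and disagreements between functions, without any ground truth labels.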

 

Snorkel Workflow (Source: http://ai.stanford.edu/blog/weak-supervision/)

 

The key advantage of this approach is evident when you consider that accurate labeling for some domain tasks can be very expensive and time consuming. With the above technique, we can ask domain experts to write a few labeling functions to provide less accurate labels for a fraction of the cost and time investment.

The beauty of abstracting away sources of weak labels as labeling functions is that it enables us to extend the same approach to a few interesting applications. For example, we can intelligently combine crowd-sourced labels for the same instance to resolve discrepancies and noise. One could also use a similar approach for sensor fusion, or for fusing information from multiple modalities. Another possibility is using the same approach to combine the results of an ensemble of pre-trained weak classifiers.

 

Combining weak supervisory information from various knowledge resources (A) to train a machine learning model (C) for content/event classification of web data (X) using Snorkel Drybell (B) (Source: https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html)

 

There are many other frameworks that tackle similar problems. For instance, the FlyingSquid framework proposes an improvement over Snorkel. Other systems, such as Snorkel Drybell, Overton, and Osprey, make use of WSL in various ways to combine information from multiple sources or labeling functions.

 

4. Miscellaneous

There are of course many other problem domains where WSL has been used. For instance, Facebook used WSL to improve the accuracy of their image recognition algorithm by using large sets of public images with hashtags as weak labels. They have also used WSL techniques to incorporate Open Street Map data for their own mapping algorithms. Elsewhere, the HazyResearch group has a blog about how weak supervision has been used in science and medicine.

 

Conclusion

In this article we presented a high level overview of WSL techniques that can be used for a variety of tasks. While we explored a variety of techniques, one crucial aspect we did not discuss is the accuracy/performance of these techniques.

The accuracy of some of these WSL techniques (especially those using a single source of inexact/inaccurate information) may be lower than that of their supervised learning counterparts. One way to mitigate this issue is to include a small subset of high quality, accurate labels to motivate the model to learn better features (as in Papandreou et al.).

In some domains, however, the potential drop in accuracy might be a fair price to pay. For example, where high quality annotations are expensive, having some model instead of no model can be beneficial. Another example is deriving value from multiple sources of inexact/inaccurate data that would otherwise have gone unused.

Since WSL is an active research field, we expect to see better performance and further interesting advances in the coming years. To keep this article concise while covering a wide range of techniques, we heavily constrained its scope and depth of explanation. We encourage you to visit the links provided in each section, and to explore how WSL is used in other domains to augment your knowledge.

The Author
Bharath Raj

Bharath is a curious learner who explores the depths of Computer Vision and Machine Learning. He often writes code to build hobby projects and jots down his findings and musings as blogs. You can check out more of his work on Medium and GitHub @thatbrguy.
