Supervised learning techniques provide a simple framework to solve complex tasks such as semantic segmentation and object detection, but they require large quantities of high-quality annotations as labels. For tasks like semantic segmentation, the time and cost of creating annotations is often higher than for tasks like classification. Both can rise drastically if you need to consult subject matter experts to create the annotations.
To overcome these limitations, one cheaper and faster option is to use lower-quality annotations collected in large quantities. The natural question then is, “How do we use low-quality annotations, when the supervised learning framework requires high-quality ones?”
In this article, we’ll look at Weakly Supervised Learning (WSL), which provides a solution by leveraging “weak” annotations to learn the task. But before we dive deeper into the techniques, it is worth exploring the various types of WSL techniques and the sections we intend to cover in this article.
Types of Weakly Supervised Learning
This Wikipedia article on weak supervision and Zhou’s literature survey on WSL mention slightly different classifications of WSL techniques. I have combined their respective classifications and listed them below:
- Inexact/Imprecise Supervision: When our data has higher level (abstract), less precise labels.
- Inaccurate Labels: Some of our data might be labelled incorrectly.
- Existing Resources: Making use of existing resources like knowledge bases, alternative datasets, or pre-trained models to create labels that are helpful, though not perfectly suited for the given task.
- Incomplete Supervision: Only a small subset of the training data has labels.
As we can see from the list above, WSL is a rather broad topic. In this article, we’re mostly interested in “how to work with data where labels are higher-level (or imprecise), and/or where labels are potentially noisy (or incorrect).” Moreover, we primarily focus on applying WSL techniques to image-based data. That being said, some of the techniques presented in this article (like Snorkel) can be used for other types of data, such as text.
Weakly Supervised Learning Techniques
1. Weakly Supervised Semantic Segmentation
It should come as no surprise that getting high quality labelled data for semantic segmentation is often hard and expensive. The authors of the paper BoxSup note: “the workload of labeling segmentation masks is more than 15 times heavier than that of spotting object locations.”
It would be a tremendous reduction in labeling effort if we could use inexact labels such as the list of objects in an image (aka image level labels) or their bounding boxes to train a semantic segmentation model.
So, let’s explore a few papers that use the inexact labels described above to perform Weakly Supervised Semantic Segmentation (WSSS).
Multiple Instance Learning (MIL) is a learning framework in which instances are not individually labelled. Instead, the user provides a set of labelled bags, each of which is a collection of instances. In the case of binary classification, a bag is labelled positive if at least one instance in the bag is positive, and negative if all instances in the bag are negative. From our set of labelled bags, we then try to infer the labels of the individual instances.
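To make the bag rule concrete, here is a minimal sketch of how a binary bag label follows from its instance labels. The function and bag names are made up for illustration. In a real MIL problem we only observe the bag labels and have to work backwards to the instances.

```python
from typing import Dict, List


def bag_label(instance_labels: List[int]) -> int:
    """Binary MIL rule: a bag is positive (1) if any instance is positive."""
    return int(any(label == 1 for label in instance_labels))


# Hypothetical bags; in practice the instance labels are hidden from us.
bags: Dict[str, List[int]] = {
    "bag_a": [0, 0, 1],  # one positive instance -> positive bag
    "bag_b": [0, 0, 0],  # all negative -> negative bag
}
labels = {name: bag_label(instances) for name, instances in bags.items()}
print(labels)
```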
The above MIL concept can be applied for WSSS when only image level labels are available. As mentioned in this paper, we can consider every pixel in the image as an “instance” and the image itself as a “bag”. The image level labels then act as the bag labels. By inferring the labels of the instances (pixels) we get our pixel-level segmentation map. More recent MIL methods are available in this literature survey.
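The pixel-as-instance idea can be sketched as follows. We assume a model has already produced a per-pixel score map for each class; the class names, the scores and the choice of pooling function are all illustrative assumptions, not taken from a specific paper. Pooling turns pixel scores into one image-level score per class, which is what can be compared against the image level labels during training.

```python
import math
from typing import Dict, List

ScoreMap = List[List[float]]  # per-pixel scores for one class


def max_pool(score_map: ScoreMap) -> float:
    """Bag score = highest instance score (the hard MIL rule)."""
    return max(max(row) for row in score_map)


def log_sum_exp_pool(score_map: ScoreMap, r: float = 5.0) -> float:
    """Smooth alternative to max: large r approaches max, small r the mean."""
    values = [v for row in score_map for v in row]
    peak = max(values)  # subtracted inside exp for numerical stability
    mean_exp = sum(math.exp(r * (v - peak)) for v in values) / len(values)
    return peak + math.log(mean_exp) / r


def segment(pixel_scores: Dict[str, ScoreMap]) -> List[List[str]]:
    """Instance (pixel) labels: the best-scoring class at each pixel."""
    classes = list(pixel_scores)
    first = pixel_scores[classes[0]]
    return [
        [max(classes, key=lambda c: pixel_scores[c][i][j])
         for j in range(len(first[0]))]
        for i in range(len(first))
    ]


# Hypothetical 2x2 score maps for two classes.
pixel_scores = {
    "cat": [[0.1, 0.9], [0.2, 0.8]],
    "dog": [[0.3, 0.2], [0.1, 0.1]],
}
image_scores = {c: max_pool(m) for c, m in pixel_scores.items()}
segmentation = segment(pixel_scores)
```

Max pooling passes the training signal through a single pixel per class, which is one reason a smoother pooling function can be attractive.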
The above survey also mentions the use of Self-Supervised Learning (SSL) techniques to solve WSSS problems. The authors state that SSL approaches are similar to MIL except that they use the inferred pixel-level activations as pseudo ground truth cues (seeds) for self-supervised learning of the final pixel-level segmentation maps. Methods of this type often train a backbone classifier to produce Class Activation Maps (CAMs) as seeds, and then train a segmentation network on these seeds.
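As a rough illustration of how a CAM is formed, the sketch below assumes a classifier whose last convolutional feature maps are globally average pooled and fed to a linear layer. Under that assumption, the CAM for a class is the sum of the feature maps weighted by that class's linear weights. The feature maps and weights here are toy values.

```python
from typing import List

Map2D = List[List[float]]


def class_activation_map(feature_maps: List[Map2D],
                         class_weights: List[float]) -> Map2D:
    """CAM for one class: sum over channels k of weight_k * feature_map_k."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


# Two hypothetical 2x2 feature maps and the weights of one class.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
cam = class_activation_map(feature_maps, class_weights=[0.5, 1.0])
print(cam)
```

High values in the resulting map mark the locations that contributed most to the class score, which is what makes them usable as seeds.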
For instance, consider the paper “Seed, Expand and Constrain” (SEC). CAMs are generated for each class (and background) as weak localization cues and used for training a neural network. A three part loss function is used for the training process:
- Seed loss comparing the network output with the weak localization cues.
- Expansion loss comparing the network output with the image level labels.
- Constraint loss comparing the network output with the network output refined by a dense Conditional Random Field (CRF).
At test time, a dense CRF was used for post processing.
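To give a feel for the first of these terms, here is a simplified sketch of a seed-style loss: the average negative log-probability that the network assigns to the seeded class at each seed pixel. The data layout and function name are assumptions for illustration, and the expansion and constraint terms are left out.

```python
import math
from typing import Dict, List, Tuple

ProbMaps = List[List[List[float]]]  # probs[c][i][j]: P(class c) at pixel (i, j)


def seed_loss(probs: ProbMaps,
              seeds: Dict[int, List[Tuple[int, int]]]) -> float:
    """Mean negative log-probability over all (class, seed pixel) pairs."""
    total, count = 0.0, 0
    for cls, pixels in seeds.items():
        for i, j in pixels:
            # Clamp to avoid log(0) when a probability collapses to zero.
            total -= math.log(max(probs[cls][i][j], 1e-12))
            count += 1
    return total / count if count else 0.0


# Hypothetical 1x2 image with two classes (0 = background, 1 = object).
probs = [
    [[0.5, 0.2]],  # background probabilities
    [[0.5, 0.8]],  # object probabilities
]
seeds = {0: [(0, 0)], 1: [(0, 1)]}  # weak localization cues
loss = seed_loss(probs, seeds)
```

Pixels without a seed do not contribute to this term, so the network is free to label them according to the other two terms.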
Methods that rely on raw CAM information tend to produce good segmentations only for the discriminative parts of the image (i.e. the parts the backbone classifier finds useful for distinguishing between classes). To overcome this limitation, techniques like Adversarial Erasing and Region Growing have been proposed.
Distinct from the MIL and SSL techniques, Papandreou et al. present Expectation Maximization (EM) methods for learning the semantic segmentation task from both image level labels and bounding box annotations. When only image level labels are available, they consider the image values x and the image level labels z as observed variables and the pixel level segmentations y as latent (hidden) variables. The same framework can be adapted for bounding box annotations. Using the EM methods described in the paper, we can estimate the pixel level segmentations.
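The alternation can be illustrated with a hard-EM style E-step: given the current per-pixel class scores, estimate the latent segmentation y by taking the best class at each pixel, after nudging the scores toward the classes named in the image level labels z. This is a loose sketch of the idea rather than the paper's exact procedure. The bias value and all names are assumptions, and the M-step (updating the network on the estimated segmentation) is omitted.

```python
from typing import List, Set

ScoreMaps = List[List[List[float]]]  # scores[c][i][j] for class c at pixel (i, j)


def e_step(scores: ScoreMaps, present: Set[int],
           bias: float = 0.5) -> List[List[int]]:
    """Estimate the latent label map, favouring classes present in the image."""
    height, width = len(scores[0]), len(scores[0][0])

    def adjusted(cls: int, i: int, j: int) -> float:
        # Classes listed in the image level labels receive a fixed bonus.
        return scores[cls][i][j] + (bias if cls in present else 0.0)

    return [
        [max(range(len(scores)), key=lambda c: adjusted(c, i, j))
         for j in range(width)]
        for i in range(height)
    ]


# Hypothetical 1x2 image; classes: 0 = background, 1 = cat, 2 = dog.
scores = [
    [[1.0, 0.2]],
    [[0.8, 0.5]],
    [[0.9, 0.9]],
]
# The image level labels say only background and cat are present.
estimated = e_step(scores, present={0, 1})
unbiased = e_step(scores, present=set())
```

In this toy case the second pixel flips from "dog" to "cat" once the image level labels are taken into account.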
There are also other methods like BoxSup which use region proposal methods and bounding box annotations for the WSSS task. For the sake of brevity, we have only explored a few key ideas from a small number of papers here. For more information, we recommend the following literature surveys and paper collections:
- A Comprehensive Analysis of Weakly-Supervised Semantic Segmentation in Different Image Domains
- Weakly Supervised Segmentation List — Github
2. Weakly Supervised Object Localization
Object Localization refers to the process of learning to draw a bounding box around an object of interest. While traditional supervised methods require bounding box annotations to learn such a task, Weakly Supervised Object Localization (WSOL) methods can learn with just image level labels (such as the list of objects in the image).
A common deep learning approach to solving this problem is to find the Class Activation Map (CAM) of the object of interest and fit a bounding box onto it. Since we already explored a few CAM approaches in the previous section, in this section we will only explore how CAMs can be used for creating a bounding box.
As mentioned in this paper, one option is to use a simple thresholding method to segment the CAM and obtain a bounding box. First, regions of the CAM whose value is above 20% of the CAM's maximum value are used to create a segmentation map. Then, a bounding box is drawn to tightly enclose the largest connected component in the segmentation map.
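The two steps above can be sketched in plain Python as follows. The 20% threshold comes from the description above; the use of 4-connectivity and the (row_min, col_min, row_max, col_max) box format are our own assumptions.

```python
from collections import deque
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (row_min, col_min, row_max, col_max)


def cam_to_bbox(cam: List[List[float]], fraction: float = 0.2) -> Optional[Box]:
    """Threshold the CAM, then box the largest 4-connected component."""
    height, width = len(cam), len(cam[0])
    threshold = fraction * max(max(row) for row in cam)
    mask = [[value > threshold for value in row] for row in cam]
    seen = [[False] * width for _ in range(height)]
    best: List[Tuple[int, int]] = []
    for i in range(height):
        for j in range(width):
            if not mask[i][j] or seen[i][j]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(i, j)])
            seen[i][j] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(component) > len(best):
                best = component
    if not best:
        return None  # nothing exceeded the threshold
    rows = [r for r, _ in best]
    cols = [c for _, c in best]
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical 4x4 CAM with one large blob and two isolated pixels.
cam = [
    [0.0, 0.1, 0.0, 0.9],
    [0.0, 0.8, 1.0, 0.0],
    [0.0, 0.7, 0.9, 0.0],
    [0.3, 0.0, 0.0, 0.0],
]
box = cam_to_bbox(cam)
```

Keeping only the largest component discards the isolated activations at the top right and bottom left, so the box covers just the central blob.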
For more information on WSOL, we recommend the following resources:
- WSOL Papers — GitHub
- Weakly Supervised Object Detection — CVPR 2018 tutorial slides by Hakan Bilen; University of Edinburgh
- Evaluating Weakly Supervised Object Localization Methods Right
3. Multi Source Inexact and Inaccurate Supervision
In this section we’ll primarily explore the workflow of Snorkel, a framework which can be used to learn from multiple sources of inaccurate (noisy) labels. Snorkel assumes that we can query high level or less precise information about data points from multiple labeling functions.
For instance, let’s consider the task of classifying whether an object is present in an image when we do not have accurately labelled data. Each labeling function can use its own heuristics and approximate rules to guess whether the object is present or not. Snorkel then uses a generative model to learn the accuracies of the labeling functions, and outputs a probabilistic label for each data point. We can then train a discriminative model (like a neural network) on these probabilistic labels to robustly classify the presence of objects.
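The sketch below illustrates the labeling function idea in plain Python. It does not use the Snorkel API, and it replaces Snorkel's generative model with a simple unweighted vote, so every labeling function is treated as equally accurate. The sample fields, the heuristics and the detector threshold are all hypothetical.

```python
from typing import Callable, Dict, List

ABSTAIN, ABSENT, PRESENT = -1, 0, 1
Sample = Dict[str, object]


def lf_filename(sample: Sample) -> int:
    """Heuristic: the filename mentions the object."""
    return PRESENT if "cat" in str(sample["filename"]) else ABSTAIN


def lf_caption(sample: Sample) -> int:
    """Heuristic: the caption mentions the object."""
    return PRESENT if "cat" in str(sample["caption"]) else ABSTAIN


def lf_detector(sample: Sample) -> int:
    """Heuristic: a weak pre-trained detector's score, thresholded at 0.5."""
    return PRESENT if float(sample["detector_score"]) > 0.5 else ABSENT


def probabilistic_label(votes: List[int]) -> float:
    """Fraction of non-abstaining votes for PRESENT (0.5 if all abstain)."""
    counted = [vote for vote in votes if vote != ABSTAIN]
    if not counted:
        return 0.5
    return sum(1 for vote in counted if vote == PRESENT) / len(counted)


labeling_functions: List[Callable[[Sample], int]] = [
    lf_filename, lf_caption, lf_detector,
]
samples: List[Sample] = [
    {"filename": "cat_001.jpg", "caption": "a cat on a sofa", "detector_score": 0.3},
    {"filename": "img_002.jpg", "caption": "an empty street", "detector_score": 0.1},
]
soft_labels = [
    probabilistic_label([lf(sample) for lf in labeling_functions])
    for sample in samples
]
```

The resulting soft labels play the role of the probabilistic labels described above. A learned label model would go further by weighting each function according to its estimated accuracy.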
The key advantage of this approach is evident when you consider that accurate labeling for some domain tasks can be very expensive and time consuming. With the above technique, we can ask domain experts to write a few labeling functions to provide less accurate labels for a fraction of the cost and time investment.
Abstracting away sources of weak labels as labeling functions lets us extend the same approach to a few interesting applications. For example, we can intelligently combine crowd sourced labels for the same instance to resolve discrepancies and noise. One can also think about using a similar approach for sensor fusion or for fusing information from multiple modalities. Another possibility is to use the same approach to combine the results of an ensemble of pre-trained weak classifiers.
Many other frameworks tackle similar problems. For instance, the FlyingSquid framework proposes an improvement over the Snorkel framework. Other systems, such as Snorkel Drybell, Overton and Osprey, make use of WSL in various manners to combine information from multiple sources or labeling functions.
There are of course many other problem domains where WSL has been used. For instance, Facebook used WSL to improve the accuracy of their image recognition algorithm by using large sets of public images with hashtags as weak labels. They have also used WSL techniques to incorporate Open Street Map data for their own mapping algorithms. Elsewhere, the HazyResearch group has a blog about how weak supervision has been used in science and medicine.
In this article we presented a high level overview of WSL techniques that can be used for a variety of tasks. While we explored a variety of techniques, one crucial aspect we did not discuss is the accuracy/performance of these techniques.
The accuracy of some of these WSL techniques (especially those using a single source of inexact/inaccurate information) may be lower than that of their supervised learning counterparts. One way to mitigate this issue is to add a small subset of high quality, accurate labels to help the model learn better features (as in Papandreou et al.).
In some domains, however, the potential drop in accuracy might be a fair price to pay. For example, having some model instead of no model can be beneficial where high quality annotations are expensive. Another example is deriving value from multiple sources of inexact/inaccurate data which would otherwise have gone unused.
Since WSL is an active research field, we expect to see better performance and further interesting advances in the coming years. To keep the blog concise while covering a wide range of techniques, the scope of this introduction and the depth of explanation were heavily constrained. We encourage you to visit the links provided at each section and to explore how WSL is used in other domains to augment your knowledge.