March 22, 2023

How to effectively label images for computer vision projects ?

ML6 is often contacted by clients having a problem that looks like it can be solved with some cameras and a clever machine learning solution. Great, so where do we start? There is one more thing we need: labeled data. Currently, the most practical and robust way to implement a real-world computer vision algorithm is to teach it what patterns it needs to be able to recognize. This means creating a training dataset, a collection of labeled example data which the model parameters can be tuned on to learn to solve the problem.

In our experience with projects involving labeled data there is one constant truth: having a well-annotated training set is crucial to success. As the saying goes: garbage in, garbage out. In this post, we focus on the most common ones for computer vision applications. Most of the problems we tackle for clients fall into one of the following categories: Image Classification, Object Detection, or Instance Segmentation.

‍

*Illustration showing the computer vision tasks discussed here.*

‍

How to Gather a Training Dataset

To get a good grip on what your training set needs to look like, you have to consider what problem you’re actually trying to solve. How different are the various things you want to be able to detect? Is there a lot of variation in how they look, how they are lit, the angle the image is taken from? Is one camera type used, or multiple different models?

While various techniques exist to make the most of annotating on a limited time budget, it is still good practice to assemble a training set that covers images covering all of the above variety that applies to your problem. A machine learning model will have an easier time generalizing what it knows to a new example closely resembling to what it was trained on, rather than something entirely novel. For example, if you are gathering a training set for a car detection model you want to label examples that show as many different colors, makes, angles, and lighting conditions as you expect the model to see. If you only label compact city cars, the model is not guaranteed to detect a monster truck. Of course, not every possible combination of attributes needs to be in the dataset, just a healthy mix of real-world variety.

Before labeling, do a pass over the dataset to spot patterns that could make for some difficult decisions in your labeling approach. Did you think of all relevant categories? Is there a certain type of object that is in between two categories? Do you have some partially visible objects in some images? Is it actually impossible to recognise some objects in certain conditions (e.g. dark objects in dark images) Think about these questions and decide on an appropriate plan beforehand. If your labeling is inconsistent, the quality of the model trained on your data will be negatively affected. You might need to leave out images that are not clear enough. A good rule of thumb is using your own eye: if you can make a split-second decision on a visual task, a machine learning model can usually learn to replicate it. However, if enough visual information is just not present in the image, the model will not be able to pull it from thin air.

In our experience, it is essential to include both the machine learning and domain experts from the very start of the project, and come up with an approach together. It is best to start by annotating a small batch of images and reviewing these in an iterative process. This avoids misunderstandings and lets you work together to get the most value out of your time spent annotating.

In the world of machine learning, there are many publicly available datasets. These often contain millions of images and dozens or even hundreds of classes of all kinds. Examples of this are ImageNet and COCO. Similarly, many models are available for reuse. While your problem might not exactly overlap with these datasets, at a basic level there is still similar visual knowledge related to ‘detecting things’ needed in both tasks. We can use this to our advantage by adapting such a powerful model to our task, instead of starting from scratch.

‍

Image Classification

In image classification, the goal is to assign a single label to the entire image. This is interesting if you have a collection of images that belong to one of several categories, or where you want to make a yes/no decision on the basis of a prominent feature in the image. This type of model could be used to detect what type of part passes by a factory camera, or perhaps suggest an item category from a sellers’ uploaded picture on a second-hand marketplace.

This is usually also the easiest task to label for. Be sure to create clearly distinct and well-defined categories. For example, if you’re labeling defective parts in a production line, carefully think about the different defects that could occur, and consider how many categories can be defined that are clearly visually distinct. It is also a good practice to go through your dataset and think about which cases might be difficult, and come up with a consistent strategy to deal with them.

Object Detection

If we are interested in a count of objects, their general location in the images, or we’re looking for something relatively small in a larger image, object detection is the name of the game.

For this purpose, we train an algorithm to detect rectangular regions containing the objects or regions of interest. Just like before, one model can learn to distinguish many different categories of object. This could be applied for detecting pedestrians, cyclists and cars in a street scene (think self-driving cars or CCTV), pinpoint and identify defects on products in a factory line, or track a ball and players in a sports game.

Annotation of your training data consists of drawing boxes across objects of interest in your images. Each box is also assigned a class label. Boxes are allowed to overlap.

‍

*Rectangular annotations in a street scene. Note the generally tight fit, and overlap between close objects*

In general, it is a good idea to draw the box boundaries reasonably tight around the object. Inconsistent and loose annotations making the model’s learning process needlessly hard. Additionally, this means your evaluation metrics will be less informative, as the model will be penalized for drawing bounds in a different place.

‍

Instance Segmentation

If your problem is really about precisely localizing an object in the image down to the exact pixels, you’ll want to go to instance segmentation. This could be needed in order to calculate measures like object area or cross-section, or to use the information for downstream tasks, like automated image editing. This can sometimes be helpful for complicated objects that are not easily approximated by a rectangle, e.g. thin objects often leave too much irrelevant space in the box to make simple object detection work well. Instance segmentation is also helpful to distinguish closely-packed objects into individual items.

Roughly speaking, two approaches exist for this task, polygonal annotation, and pixel-based annotation. The first essentially works by drawing a series of dots around the object outline, which are connected by line segments. The second consists of bitmap approaches, similar to a paint brush tool in image editing software. Which approach is easier depends on your problem, the shape of the objects you want to recognize, and the tools offered by your annotation software. (Some offer Photoshop-like tools such as flood fill, magic lasso, etc.) These shape annotations are also accompanied by a class label.

Many more advanced labeling software also offer problem-independent semi-automated approaches, such as DEXTR, which lets you extract a detailed mask from only four key points. These are also very handy to quickly label large amounts of objects.

As with rectangle annotations, you want to find a good balance between annotating a single object tightly enough to provide a consistent training signal to the algorithm, and moving quickly enough to be able to label many examples in the time you have available for annotation.

*Manual polygon-based instance labeling (left), and semi-manual DEXTR annotation, generated from four extreme points (right).*

Other Tasks

The above categories are not an exhaustive list of all possible annotation types. For some problems, semantic segmentation, where you label individual image pixels without detecting separate objects, is more appropriate. There’s also keypoint annotation, which is the annotation of, well, key points in an object, for example, the joints in a person dancing or a piece of machinery moving.

In addition, not every problem is appropriately modeled using discrete class labels. In some cases, a continuous or numbered label is more appropriate. You might want to label a house in a satellite image with an estimated sale value, grade a biopsy microscopy image with a disease grade, or estimate a person’s age from a picture. Free-form continuous label input is generally less widely supported in annotation tools, so you might have to make sure this is supported by your tool of choice.

*Semantic segmentation: pixels have classes, but not grouped into individual objects*

*Keypoint annotation for pose estimation*

Labeling Software and Third Party Labeling Services

Choosing the right labeling software is essential to boost the efficiency of your labeling efforts. Make sure your software allows easy labeling for your specific task, with as few manual actions per annotation. Be sure to learn all relevant shortcuts: little fractions of time saved on each annotation will add up to significant time savings over the entire dataset.

If you’re working on a larger project, check if the software has a smart way to upload and manage datasets. If you have multiple people labeling, having the option to assign roles, permissions and labeling tasks to individual users can also be necessary to keep everything manageable. Some tools allow you to define a more complex labeling pipeline, allowing for gains in efficiency by dividing a problem into multiple smaller, quicker steps, or by using a previously trained machine learning algorithm to generate a first approximation of object contours.

Multiple free open-source tools exist that have many of these features, for example CVAT, Label Studio, or LOST. There are also commercial solutions that may come with proprietary implementations of these features and other extras.

Also consider how this software will be deployed. Some older or smaller software runs as a regular application, others can be deployed as a service. Some, such as the aforementioned two, have docker containers available for rapid deployment.

‍

*A typical labeling interface (CVAT), with tools (left), data navigation (top) and label management (right)*

‍

If you have a particularly large dataset that needs to be labeled, it might be worth getting in touch with third party labeling service providers. Alternatively, you can also set up your own crowd-sourcing task using a crowdsourcing platform. This approach can allow for a massive increase in labeling throughput, but also introduces some overhead and setup time, and requires an additional party to fully understand the problem domain. It is particularly important to keep an eye on labeling quality, both between individuals and across time.

With this, we have covered the most important aspects of tackling a computer vision problem by labeling your own data. If you have a problem that might be automated using one of the above techniques, we encourage you to get in touch with us at ML6. We can help you find the optimal workflow for your computer vision project.

‍