The Computer Vision Chapter is the ML6 expert group on all things related to Computer Vision.
As the ML6 special unit for computer vision, our goal is to keep in touch with the latest developments in the field and share our learnings with colleagues, customers, the open-source community, and the public at large. Some areas we are currently active in are Object Detection, Video Analysis, Generative AI, and Edge Vision.
Object detection
Develop custom, high-performance machine learning models for detecting objects at high speed, at high resolutions, and in challenging real-world circumstances. Different use cases demand different approaches to data pre-processing, modeling, tuning and setup.
Video analysis
Use object tracking across frames to support object detection and segmentation. Detect phenomena or activities that can only be recognized by taking the entire stream of images into account. Video analysis presents unique challenges in terms of resource management and model architecture.
Generative AI
Neural networks can transfer faces, poses, or stylistic attributes, or generate unseen instances of faces, people, objects, or even artworks based on examples. We are only scratching the surface of the potential of generative modeling in media, but also in design, retail, and other areas. For more information, visit gener8.ai
Edge Vision
Processing video on the edge, near the camera, can reduce network traffic and increase data security. Example applications include a high-performing solution for anonymization and re-identification on the edge. Edge processing presents a number of challenges in terms of performance, architecture, ops, and security.
Quality control
Vision-based quality control and assurance, built on the latest advancements in machine vision. With machine learning, we can detect a wide range of defects on a diverse set of products. Using these state-of-the-art algorithms, production processes can be monitored, steered, and optimized.
How to spot a deepfake?
This video explains and illustrates the small clues that can help you distinguish deepfakes from real videos.
Using StyleCLIP to generate a face based on a description
MLSox explained: How to create a sock-matching application from scratch using YOLOv4 and Siamese networks
Jeroom dancing through pose transfer (with VT4/GoPlay)
Using generative AI for image manipulation: discrete absorbing diffusion models explained
Building a coral segmentation model using sparse data
How to detect small objects in (very) large images
In the context of video analysis, action recognition is the task of recognizing (human) actions in a video [1]. Actions range from extreme outdoor activities, such as abseiling, to everyday activities, such as scrambling eggs (Figure 1). Action recognition is typically also used to describe the broader field of event detection, for example in sports. It is considered one of the most important tasks of video understanding. It has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. Action recognition can be further divided into classification and localisation. Classification only involves assigning a label to the entire video, whereas localisation additionally involves localising the actions in space and/or time.
Figure 1. Example video action classes.
With the emergence of large high-quality datasets over the past decade, there has been a growing interest in video action recognition research. Datasets have increased both in the number of videos and the number of classes: they went from 7K videos and 51 classes with HMDB51 to 8 million videos and 3,862 classes in YouTube-8M. Additionally, the rate at which new datasets are released is increasing: from 3 datasets in 2011-2015 to 13 datasets in 2016-2020. Thanks to the availability of these growing datasets and steady innovation in deep learning, action recognition models are rapidly improving.
While there is growing interest, video action recognition still faces major challenges in developing effective algorithms. Some of these challenges are summarized below:
Keywords: action recognition, event detection, sports, video analysis
Although limited to a subset of actions, event detection in sports is very challenging. A sports event is typically not only defined by the actions of a single person but rather by a combination of the actions of multiple people and their environment. As a result, modeling the environment and location of players may be required to gain a proper understanding of the sports game being played.
Sports event detection can greatly enhance the user experience, both during and after the game. During the game, relevant statistics can be shown on screen without the need for manual data entry. After the game, automatic video summaries can be created. Additionally, the gathered statistics can be linked back to previous games to create interesting reports and dashboards.
Research and create a machine learning algorithm to detect events, such as a tackle, goal attempt or tumble, in sports games (e.g. soccer, field hockey, cycling). The algorithm should be built with computational cost and data scarcity in mind. Data-efficient domain adaptation through transfer learning is one possible solution.
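As a rough illustration of the transfer learning direction, the sketch below freezes an ImageNet-pretrained backbone and trains only a small classification head on the scarce event data. The per-frame setup, layer sizes, and the three event classes are illustrative assumptions, not part of the project description.

```python
import tensorflow as tf

# Minimal sketch of data-efficient transfer learning: reuse a pretrained
# backbone as a frozen per-frame feature extractor and train only a small
# head on the scarce sports-event data. The three classes are illustrative.
backbone = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg")
backbone.trainable = False  # keep the pretrained weights fixed

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # tackle / goal attempt / tumble
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```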
Keywords: action recognition, event detection, monitoring, video analysis, edge
Transmitting video feeds to a centralized data center for processing is expensive and requires a high investment in infrastructure, especially for use cases where a large number of cameras is needed. Moreover, there are security risks in transmitting video containing privacy-sensitive data over the network. At the same time, ever more powerful, lightweight edge processing devices with GPUs and TPUs have been entering the market. Hence, there is a growing interest in processing video on the edge while only statistics and/or representations are transmitted and processed centrally. This reduces infrastructure requirements and improves security, as images need not leave the camera location. Notable use cases include surveillance as well as traffic, environmental and other types of monitoring.
Research and create an optimized action detection/recognition algorithm for edge devices, such as the NVIDIA Jetson Xavier, to be used in a monitoring context. The particular use case is open and could be related to traffic, animals, people or other phenomena, including the sports case above. An important focus will be to compare, select and optimize different machine learning models for use on the edge. Techniques that can be used for optimization include quantization, pruning and knowledge distillation.
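As one hedged example of the optimization step, TensorFlow's TFLite converter supports post-training quantization out of the box. The sketch below assumes a model has already been exported to a SavedModel directory; the paths are placeholders.

```python
import tensorflow as tf

# Minimal sketch: post-training quantization with the TFLite converter.
# "saved_model_dir" is a placeholder for an exported recognition model.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)  # compact model suitable for edge deployment
```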
In the context of videos, the goal of anomaly detection is to temporally or spatially localise anomalous events in video [2]. Anomalous events are defined as events or activities that are unusual and signify irregular behavior (Figure 3). Temporal localisation involves identifying the start and end frames of the anomaly event. Spatial localisation means to spatially identify the anomaly in each corresponding frame. Video anomaly detection has extensive applications in surveillance monitoring, such as detecting illegal activities, traffic accidents and unusual events. It not only increases monitoring efficiency but also significantly reduces the burden of manual live monitoring by allowing humans to focus on events that are likely of interest.
Figure 3. Example anomalous events from four datasets.
Research into video anomaly detection is growing due to the increase of cameras being used in public places. Cameras are deployed on squares, streets, intersections, banks, shopping malls, etc. to increase public safety. However, the capabilities of surveillance agencies have not kept pace. There is a glaring deficiency in the utilisation of surveillance cameras due to an imbalanced ratio of cameras to human monitors.
Video anomaly detection is still in its early stages and faces major challenges in effective deployment. These challenges are summarized below:
Keywords: anomaly detection, event detection, crime, violence, smart video surveillance
The detection of violence and other harmful patterns has become an active research area due to the abundance of surveillance cameras and the need to respond rapidly to incidents to prevent further escalation. Among all anomalous events, violence is one of the more challenging to detect: it can happen at any point, in any environment, and there is no fixed scenario. A timely response to violent events can significantly increase public safety. Furthermore, detection can help with incident reports and automated statistics to prevent future incidents.
Research and create a machine learning algorithm to detect violent events in surveillance footage. The algorithm should be built with computational cost and data scarcity in mind. Footage of fights from other domains, e.g. ice hockey, can be used to build a dataset.
Keywords: anomaly detection, event detection, monitoring, video analysis, edge
As with the action recognition case above, transmitting video feeds to a centralized data center is expensive and poses security risks, so there is growing interest in processing video on the edge, with only statistics and/or representations transmitted and processed centrally.
Research and create an optimized video anomaly detection algorithm for edge devices, such as the NVIDIA Jetson Xavier, to be used in a monitoring context. The particular use case is open and could be related to traffic, animals, people or other phenomena. An important focus will be to compare, select and optimize different machine learning models for use on the edge. Techniques that can be used for optimization include quantization, pruning and knowledge distillation.
In industrial manufacturing processes, quality assurance is an important topic, and it is one of the top priorities for Industry 4.0 with good reason: defect detection improves quality and efficiency and saves a lot of money. It is about to become more accessible; however, this problem faces a number of unique challenges:
For this use case, we want to explore anomaly detection methods that use anomaly-free training data combined with probabilistic AI to detect anomalies. The goal is that all a new client has to do is provide us with a dataset containing non-defective samples, and we can build a custom anomaly detection solution for their use case. Recently, Intel released the Anomalib library, which implements a couple of the current state-of-the-art methods. Some initial exploration of the library has been done by ML6; however, there is much more work to be done before we can use it for a client.
You can take a head start when working on this project, as some work has already been done: an initial comparison of three of the library's algorithms. However, multiple interesting algorithms were excluded from that comparison, and there is still a gap to be bridged before Anomalib can be used in practice.
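To make the approach concrete, the sketch below shows the core idea behind several of these anomaly-free training methods (PaDiM-style models, for instance): fit a Gaussian to features of defect-free samples, then score new samples by Mahalanobis distance. The `extract_features` function and `normal_images` collection are hypothetical placeholders, not Anomalib APIs.

```python
import numpy as np

# Sketch of the idea behind several Anomalib models: model the feature
# distribution of defect-free samples, then flag outliers.
# `extract_features` (a pretrained CNN feature extractor returning a
# 1-D vector) and `normal_images` are hypothetical placeholders.
train_feats = np.stack([extract_features(img) for img in normal_images])
mean = train_feats.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(train_feats, rowvar=False))

def anomaly_score(image):
    # Mahalanobis distance to the "normal" distribution; higher = more anomalous.
    d = extract_features(image) - mean
    return float(np.sqrt(d @ cov_inv @ d))
```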
During this internship you will:
The duration of the internship is flexible and depends on the candidate's preference and the project requirements. The typical duration is 6 to 8 weeks; the preferred duration for this specific project is 6 weeks.
SMOG (NL: Spreken Met Ondersteuning van Gebaren; EN: speaking with gesture support) is a form of supportive communication. It allows children, young people and adults with a communication disability to clarify their needs and wishes and to better understand their environment. Unfortunately, most people do not understand or even know about SMOG. This thesis aims to close this gap by allowing a broader audience to understand the gestures with the aid of technology.
The goal of this thesis is to recognize SMOG sign language gestures using a Google Glass. The word corresponding to the gesture must be shown to the user, and the model must be able to operate in (near) real-time.
A machine learning model should process the Google Glass’ camera feed to detect (a subset of) the 500 base SMOG gestures. Optionally, the user can use the Glass’ controls to indicate the start and end of a gesture. A model must then be able to classify the gesture and return the corresponding word to the user, using the Glass’ display.
Google Glass (Enterprise Edition 2) applications are based on the Android Oreo 8.1 SDK. ML Kit and MediaPipe can be used for machine learning workloads. The ML model must be developed in TensorFlow. Android and/or TensorFlow experience is a plus. The thesis contains both a theoretical and a practical aspect: research into optimal lightweight model architectures is required in addition to the development of an Android app.
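As a rough sketch of what such a pipeline could look like, MediaPipe's hand-tracking solution can turn each camera frame into hand landmarks, which a lightweight TensorFlow classifier (not shown) would then map to SMOG gestures. The image file here is a placeholder for a frame from the Glass camera feed.

```python
import cv2
import mediapipe as mp

# Sketch: per-frame hand landmark extraction with MediaPipe Hands.
# "frame.jpg" is a placeholder for a frame from the Glass camera feed.
hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
frame = cv2.imread("frame.jpg")
result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if result.multi_hand_landmarks:
    # 21 (x, y, z) landmarks per detected hand; a sequence of these
    # feature vectors would be fed to a lightweight gesture classifier.
    landmarks = [(lm.x, lm.y, lm.z)
                 for hand in result.multi_hand_landmarks
                 for lm in hand.landmark]
```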
Remember the typical scene in a crime series when they have a blurry image of a suspect and ask their technology expert to “zoom in and enhance”?
Although those scenes are nowhere near technically accurate, there exist techniques that take low-resolution images as input and upscale them to higher-resolution ones. Super-resolution is one of them, and for a long time the idea was thought to be science fiction, as the data processing inequality theorem states that post-processing cannot add information that was not already there. However, with the advent of neural networks and GANs, you can add information that was learned by training these networks on large amounts of examples, thus allowing, for example, actual reconstruction of faces.
Super-resolution has a lot of interesting real-world applications that are only just starting to be explored: reducing the file sizes of images and videos, serving as a preprocessing step for AI applications such as deepfakes, and acting as a post-processing step in industries such as the medical field and cosmology, or simply enhancing your favorite old movies and pictures.
Figure 1. Example of upscaling a blurry image.
Although the idea is not new, the field of super-resolution has been revived by the advent of GANs and has made significant improvements in only a couple of years. Moreover, a big advantage of this particular field is that an unlimited amount of data is available, since you can easily downscale high-resolution images and use these pairs as training data. There are also many publicly available datasets, such as DIV2K (https://data.vision.ee.ethz.ch/cvl/DIV2K/).
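Generating such training pairs can be as simple as the sketch below, which assumes `hr` is a high-resolution float image tensor; the scale factor of 4 is an illustrative choice.

```python
import tensorflow as tf

# Sketch: create an LR/HR training pair by bicubic downscaling.
# `hr` is assumed to be a float image tensor of shape (H, W, 3).
def make_pair(hr, factor=4):
    h, w = hr.shape[0], hr.shape[1]
    lr = tf.image.resize(hr, (h // factor, w // factor), method="bicubic")
    return lr, hr  # the super-resolution model learns to map lr back to hr
```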
While there is growing interest, super-resolution still faces major challenges in developing effective algorithms. These challenges are summarized below:
A recently published paper, DFDNet [1], achieved state-of-the-art results on the upscaling of human faces. However, it can only upscale the faces themselves and keeps the surroundings as they are. In this thesis, you would investigate the possibility of also upscaling the background, either with a separate network or incorporated into the DFDNet architecture. This would open the door to video upscaling, since there are currently clearly visible artifacts when only someone's face is upscaled while the background stays blurred.
Research and create a machine learning algorithm that can upscale the resolution of an image and fill in the details realistically. Technologies that can be used are Python, TensorFlow, Keras and, in general, the Python data science and machine learning stack.
Ever wondered how the old photo album of your family heritage would look in color? Interested in bringing the past to life? Then this might be a subject for you.
Image colorization is the process of converting a grayscale image to a colored one while filling in the colors as realistically as possible. The idea is not new: people have been hand-coloring photos for decades, and computer-aided, reference-based techniques popped up in the early 2000s. However, there has been tremendous progress in the last 5 years through the use of diverse deep-learning architectures, ranging from the early brute-force networks [3] to more recent custom-designed Generative Adversarial Networks [4].
Figure 2. Image colorization example.
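As a hedged illustration of how such models are commonly trained, the sketch below builds a training pair in Lab color space: the network input is the lightness channel L (effectively the grayscale image) and the targets are the two chrominance channels a and b. The normalization constants are conventional choices, not from a specific paper.

```python
import numpy as np
from skimage import color

# Sketch: build a colorization training pair in Lab space. The model sees
# only the lightness channel L and learns to predict chrominance (a, b).
def make_training_pair(rgb_image):
    # rgb_image: float array in [0, 1] with shape (H, W, 3)
    lab = color.rgb2lab(rgb_image)
    L = lab[..., :1] / 100.0    # lightness, normalized to roughly [0, 1]
    ab = lab[..., 1:] / 128.0   # chrominance, normalized to roughly [-1, 1]
    return L, ab
```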
While there is growing interest, image colorization still faces major challenges in developing effective algorithms. These challenges are summarized below:
Figure 3. Comparison between Color Image and Gray Image. [5]
What if these colorization techniques could be applied to videos? Research on colorization has almost exclusively centered around images, and video colorization is currently mostly just the application of image colorization to the individual frames of a video. There are many possibilities to improve the state of the art for video colorization, for example by taking the temporal component into account when coloring in the frames, or by tackling challenges that are specific to old videos, such as mitigating the flickering effect.
Research and create a machine learning algorithm that can colorize videos realistically, improving on the current state of the art of colorizing individual frames by taking temporal components into account. Technologies that can be used are Python, TensorFlow, Keras and, in general, the Python data science and machine learning stack.
Ever wonder how you would look in a certain t-shirt or pair of shoes without having to try it on? Well, that's the problem garment transfer is trying to solve: given an image of a person and a piece of clothing as input, the goal is to produce a photo-realistic picture of that person wearing that piece of clothing.
Garment transfer existed as science fiction for a long time and only recently became solvable with the advent of GANs. Since then, it has evolved into a popular research subtopic and seen a lot of progress, as can be seen in the figure below.
Figure 4. Garment transfer example. [7]
Garment transfer comes in a variety of flavours, with slight variations on the inputs (e.g. from a single image of the clothing that should be transferred, to a collection of images, to an image of another person wearing the clothes that should be transferred), but in general the problem can be divided into two subproblems. First, the algorithm should learn to separate a person's body (pose, shape, skin color) from their clothing. Second, it should generate new images of the person wearing the new clothing item. The outputs also come in different forms, ranging from a single generated image to a full 3D clothing transfer [8], where images from different viewpoints and poses can be generated.
While there is growing interest, garment transfer still faces major challenges in developing effective algorithms. These challenges are summarized below:
Because garment transfer research is still in its infancy and there is a lack of consensus on how to approach the problem, it can be hard to see the forest for the trees. Summarizing and organizing the different approaches and their advances, along with an analysis and comparison of their advantages and drawbacks, can add a lot of value to the field: it lowers the threshold for new researchers to enter the field and helps current researchers make connections between existing approaches.
Research, analyse and summarize the current state of the art in garment transfer techniques.
In a well-trained GAN, the generator is able to generate new, photo-realistic examples of the type of images the network was trained on. However, it is hard to control what kind of image the GAN generates, beyond getting a random image from the same distribution as the training set.
Take, for example, the StyleGAN architecture from NVIDIA [9], which powers the well-known website thispersondoesnotexist.com and generates photo-realistic faces of people who don't exist.
Figure 5. Examples from thispersondoesnotexist.com.
Once fully trained, it is easy to ask StyleGAN to generate a new realistic-looking face, but there is no way to ask it for, say, an image of a middle-aged Asian man with long hair, other than to keep generating images until you get a face with the desired properties.
This problem significantly reduces the usability of GANs in real-world applications.
There have already been various approaches to solving this problem, the most popular being conditional GANs and controllable generation. Conditional GANs receive an additional input during the training phase: the label of the class the image belongs to. Controllable generation happens after training and consists of adjusting the latent feature vector in an attempt to control the features of the output image.
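The sketch below illustrates the conditional idea on the generator side: the class label is fed in alongside the noise vector, so the desired class can be chosen at generation time. All layer sizes, the 10-class label, and the 28x28 output are illustrative assumptions.

```python
import tensorflow as tf

# Sketch of a conditional generator: noise and a one-hot class label are
# concatenated, so the desired class can be requested at inference time.
noise = tf.keras.Input(shape=(128,))
label = tf.keras.Input(shape=(10,))  # one-hot class label (10 classes assumed)
x = tf.keras.layers.Concatenate()([noise, label])
x = tf.keras.layers.Dense(7 * 7 * 64, activation="relu")(x)
x = tf.keras.layers.Reshape((7, 7, 64))(x)
x = tf.keras.layers.Conv2DTranspose(1, kernel_size=4, strides=4,
                                    padding="same", activation="tanh")(x)
generator = tf.keras.Model([noise, label], x)  # outputs a 28x28 image
```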
While there is growing interest, conditional GAN’s still faces major challenges in developing effective algorithms. These challenges are summarized below:
With controllable generation, you try to tweak the latent feature vector of the generator so that the output changes in the desired direction. However, when different features are highly correlated in the dataset used to train your GAN, it becomes difficult to control specific features without also modifying the ones correlated with them. For example, if you want to add a beard to the picture of a woman, this will likely also change other facial features, like the nose and jawline, to look more masculine. This is not desirable if you only want to edit a single feature. Furthermore, the same applies even to features that aren't correlated in the training set, since, without special attention, the learned Z-space becomes entangled.
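In code, controllable generation often boils down to moving the latent code along a learned direction, as in the hedged sketch below. Here `generator` (mapping z to an image) and `beard_direction` are hypothetical stand-ins; such a direction could come, for instance, from a linear classifier fit in latent space.

```python
import numpy as np

# Sketch of controllable generation by latent-space editing.
# `generator` (z -> image) and `beard_direction` are hypothetical placeholders.
z = np.random.randn(1, 512)  # random latent code, StyleGAN-style size
for strength in (0.0, 1.0, 2.0):
    z_edit = z + strength * beard_direction  # step along the "beard" direction
    image = generator(z_edit)  # with an entangled Z-space, other features shift too
```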
Research and create a GAN that has a disentangled Z-space in a particular subdomain, such as medical imaging. The goal is to be able to influence single, relevant features of medical images, such as the size of a tumor. Technologies that can be used are Python, TensorFlow, Keras and, in general, the Python data science and machine learning stack.