The Computer Vision Chapter is the ML6 expert group on all things related to Computer Vision.
Our goal is to keep in touch with the latest developments in the field and share our learnings with colleagues, customers, the open-source community, and the public at large. Some areas we are currently active in are Object Detection, Video Analysis, Generative AI, Edge Vision, and Visual Inspection.
Object detection
Develop custom, high-performance machine learning models for detecting objects at high speed, at high resolutions, and in challenging real-world circumstances. Different use cases demand different approaches to data pre-processing, modeling, tuning and setup.
Video analysis
Use object tracking across frames to support object detection and segmentation. Detect phenomena or activities that can only be recognized taking the entire stream of images into account. Video analysis presents unique challenges in terms of resource management and model architecture.
Generative AI
Neural networks can transfer faces, poses and stylistic attributes, or generate unseen instances of faces, people, objects or even artworks based on examples. We are only scratching the surface of the potential of generative modeling in media, but also in design, retail and other areas. For more information, visit gener8.ai
Edge Vision
Processing video on edge, near the camera can reduce network traffic and increase data security. Example applications include a high-performing solution for anonymization and re-identification on edge. Edge processing presents a number of challenges in terms of performance, architecture, ops and security.
Visual inspection
Vision based quality control and assurance, based on the latest advancements in machine vision. With machine learning, we can detect a wide range of defects on a diverse set of products. Using these SOTA algorithms, production processes can be monitored, steered and optimized.
Demos
How to spot a deepfake?
This video explains and illustrates the small clues that can help you distinguish deepfakes from real videos.
In the context of video analysis, action recognition is the task of recognizing (human) actions in a video [1]. Actions range from extreme outdoor activities, such as abseiling, to everyday activities, such as scrambling eggs (Figure 1). Action recognition is typically also used to describe the broader field of event detection, for example in sports. It is considered one of the most important tasks of video understanding. It has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. Action recognition can be further divided into classification and localisation. Classification only involves assigning a label to the entire video, whereas localisation additionally involves localising the actions in space and/or time.
Figure 1. Example video action classes.
With the emergence of large, high-quality datasets over the past decade, there has been a growing interest in video action recognition research. Datasets have grown in both the number of videos and the number of classes: from 7K videos and 51 classes in HMDB51 to 8 million videos and 3,862 classes in YouTube-8M. Additionally, the rate at which new datasets are released is increasing: from 3 datasets in 2011-2015 to 13 datasets in 2016-2020. Thanks to the availability of these growing datasets and steady innovation in deep learning, action recognition models are rapidly improving.
Challenges
While there is growing interest, video action recognition still faces major challenges in developing effective algorithms. Some of these challenges are summarized below:
Datasets typically define a limited label space. Human actions are composite concepts and the hierarchy of these concepts is ill-defined. Furthermore, each action typically has multiple descriptions, at different levels of granularity. Additionally, labeling videos is time-consuming and ambiguous. In order to properly annotate all actions, one must watch the entire video and mark the start and end of each action. To further complicate matters, these actions may overlap or even together constitute a higher-level combined action.
Human actions have strong intra- and inter-class variations. The same action can be performed at different speeds and under various viewpoints. Moreover, some actions share similar motions and/or environments, making them difficult to distinguish.
Action recognition requires understanding of both short- and long-range temporal dynamics. The span of an action may range from a couple of seconds to hours. An action may involve various types of temporal information and the model must be able to capture these. Moreover, models must handle different perspectives. Clearly, sophisticated models are required to capture the more challenging aspects of action recognition.
The added temporal dimension involved in action recognition leads to high computational cost for both training and inference. Current datasets typically contain actions of around 10 seconds. At 25 frames per second, this results in 250 frames to be analysed. The temporal dimension thus adds significant complexity compared to image analytics. Not only are there more frames to process; the dynamics between these frames also require additional modelling. In essence, action recognition can be seen as a sequence modelling task.
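One common way to tame this cost is sparse temporal sampling: rather than processing all 250 frames, a fixed number of evenly spaced frames is selected per clip. The sketch below is illustrative only; real training pipelines typically add random jitter and average predictions over multiple clips at test time.

```python
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int) -> np.ndarray:
    """Uniformly sample `clip_len` frame indices from a video of `num_frames` frames."""
    # Evenly spaced positions across the full video, rounded to valid frame indices.
    return np.linspace(0, num_frames - 1, clip_len).round().astype(int)

# A 10-second clip at 25 fps has 250 frames; sampling only 8 of them
# reduces the per-clip frame count by a factor of ~31 while still
# covering the whole temporal extent of the action.
idx = sample_frame_indices(250, 8)
```

The sequence model then only sees the sampled frames, which is what makes long clips tractable in the first place.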
Example topics
Sports event detection
Keywords: action recognition, event detection, sports, video analysis
Although limited to a subset of actions, event detection in sports is very challenging. A sports event is typically not only defined by the actions of a single person but rather by a combination of the actions of multiple people and their environment. As a result, modeling the environment and location of players may be required to gain a proper understanding of the sports game being played.
Sports event detection can greatly enhance the user experience, both during and after the game. During the game, relevant statistics can be shown on screen without the need for manual data entry. After the game, automatic video summaries can be created. Additionally, the gathered statistics can be linked back to previous games to create interesting reports and dashboards.
Goal
Research and create a machine learning algorithm to detect events, such as a tackle, goal attempt, or tumble, in sports games (e.g. soccer, field hockey, cycling). The algorithm should be built with computational cost and data scarcity in mind. Data-efficient domain adaptation through transfer learning is one possible solution.
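To illustrate the transfer-learning route, one data-efficient setup freezes a pretrained video backbone and trains only a small linear head on its clip embeddings. The sketch below uses random features as stand-ins for real backbone outputs; all sizes and names are hypothetical, and the point is only that the trainable part stays tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 512-d clip embeddings from a frozen, pretrained
# video backbone. Only the linear head below is trained, which is what
# makes the approach data-efficient.
n_clips, dim, n_events = 64, 512, 3
features = rng.normal(size=(n_clips, dim))        # stand-in for frozen backbone output
labels = rng.integers(0, n_events, size=n_clips)  # stand-in event annotations

# Linear classification head trained with softmax cross-entropy.
W = np.zeros((dim, n_events))
one_hot = np.eye(n_events)[labels]
lr = 0.1
for _ in range(200):
    logits = features @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    W -= lr * features.T @ (probs - one_hot) / n_clips

train_acc = (np.argmax(features @ W, axis=1) == labels).mean()
```

With only `dim * n_events` trainable parameters, far fewer labeled clips are needed than for training a full video model from scratch.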
Edge inference
Keywords: action recognition, event detection, monitoring, video analysis, edge
Transmitting video feeds to a centralized datacenter for processing is expensive and requires a high investment in infrastructure, especially for use cases where a high number of cameras is needed. Moreover, there are security risks in transmitting video containing privacy-sensitive data over the network. At the same time, ever more powerful, lightweight edge processing devices with GPUs and TPUs have been entering the market. Hence, there is a growing interest in processing video on edge while only statistics and/or representations are transmitted and processed centrally. This reduces infrastructure requirements and improves security, as images need not leave the camera location. Notable use cases in this regard include surveillance as well as traffic, environment and other types of monitoring.
Goal
Research and create an optimized action detection / recognition algorithm for edge devices such as NVIDIA Jetson Xavier to be used in a monitoring context. The particular use case is open and could be related to traffic, animals, people or other phenomena including the sports case above. An important focus will be to compare, select and optimize different machine learning models for use on edge. Techniques that can be used for optimization include quantization, pruning and knowledge distillation.
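As an illustration of one of these optimization techniques, the sketch below applies symmetric post-training int8 quantization to a weight tensor, cutting its memory footprint by 4x at the cost of a bounded rounding error. This is a toy numpy version of the idea; actual edge deployments would use a framework's quantization toolkit (e.g. on-device calibration and fused int8 kernels).

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor post-training quantization of float weights to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in layer weights
q, scale = quantize_int8(w)

# Worst-case per-weight rounding error is bounded by half a quantization step.
error = np.abs(dequantize(q, scale) - w).max()
```

The same pattern, repeated per layer (or per channel for tighter error bounds), is what lets a video model fit into the memory and compute budget of a Jetson-class device.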
Anomaly detection
In the context of videos, the goal of anomaly detection is to temporally or spatially localise anomalous events in video [2]. Anomalous events are defined as events or activities that are unusual and signify irregular behavior (Figure 3). Temporal localisation involves identifying the start and end frames of the anomalous event. Spatial localisation means identifying the anomaly spatially in each corresponding frame. Video anomaly detection has extensive applications in surveillance monitoring, such as detecting illegal activities, traffic accidents and unusual events. It not only increases monitoring efficiency but also significantly reduces the burden of manual live monitoring by allowing humans to focus on events that are likely of interest.
Figure 3. Example anomalous events from four datasets.
Research into video anomaly detection is growing due to the increase of cameras being used in public places. Cameras are deployed on squares, streets, intersections, banks, shopping malls, etc. to increase public safety. However, the capabilities of surveillance agencies have not kept pace. There is a glaring deficiency in the utilisation of surveillance cameras due to an imbalanced ratio of cameras to human monitors.
Challenges
Video anomaly detection is still in its early stages and faces major challenges in effective deployment. These challenges are summarized below:
Anomalous events are complicated, diverse and typically very rare. Large and diverse video anomaly datasets are hard to come by. As a result, various semi-supervised training techniques are used to overcome the scarcity of labeled data. Additionally, models are commonly trained on a very small subset of anomalous events and do not translate well to other domains. It is not sufficient to model normal behaviour and flag everything that deviates from it as an anomaly.
Video anomalies are very diverse and can span a couple of seconds to hours. A model must be able to capture both short- and long-range temporal information.
As with most video analytics techniques, video anomaly detection has high computational complexity due to the need for temporal sequence modeling. Furthermore, it is commonly used to alleviate the burden of live CCTV surveillance monitoring. An effective algorithm must thus be able to flag anomalous events in real-time.
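The one-class baseline touched on above (model normal behaviour, flag deviations) can be sketched with a simple subspace reconstruction error. The toy numpy example below is only an illustration of the idea, not a production detector: the synthetic "features" stand in for real frame embeddings, and the subspace fit stands in for a learned model of normality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 16-d feature vectors per frame. "Normal" frames
# lie near a low-dimensional subspace; anomalies fall outside it.
normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 16))  # rank-2 structure
anomaly = rng.normal(size=(5, 16)) * 3.0

# Fit a linear subspace (PCA via SVD) on normal data only - a
# semi-supervised baseline that needs no anomaly labels at all.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
basis = vt[:2]  # top-2 principal directions

def reconstruction_error(x: np.ndarray) -> np.ndarray:
    centered = x - mean
    return np.linalg.norm(centered - centered @ basis.T @ basis, axis=1)

# Threshold chosen from normal data; frames above it are flagged.
threshold = np.quantile(reconstruction_error(normal), 0.99)
flags = reconstruction_error(anomaly) > threshold
```

As the challenges above note, this kind of one-class scheme is a starting point rather than a solution: real deployments combine it with richer temporal models and careful threshold calibration to keep false alarms manageable in real time.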
Example topics
Violence detection
Keywords: anomaly detection, event detection, crime, violence, smart video surveillance
Violence and harmful pattern detection has become an active research area due to the abundance of surveillance cameras and the need to respond rapidly to incidents to prevent further escalation. Among all anomalous events, violence is one of the more challenging to detect: it can happen at any time, in any environment, and follows no fixed scenario. A timely response to violent events can significantly increase public safety. Furthermore, it can help with incident reports and automated statistics to prevent future incidents.
Goal
Research and create a machine learning algorithm to detect violent events in surveillance footage. The algorithm should be built with computational cost and data scarcity in mind. Footage of fights from other domains, e.g. ice hockey, can be used to build a dataset.
Edge inference
Keywords: anomaly detection, event detection, monitoring, video analysis, edge
Transmitting video feeds to a centralized datacenter for processing is expensive and requires a high investment in infrastructure, especially for use cases where a high number of cameras is needed. Moreover, there are security risks in transmitting video containing privacy-sensitive data over the network. At the same time, ever more powerful, lightweight edge processing devices with GPUs and TPUs have been entering the market. Hence, there is a growing interest in processing video on edge while only statistics and/or representations are transmitted and processed centrally. This reduces infrastructure requirements and improves security, as images need not leave the camera location. Notable use cases in this regard include surveillance as well as traffic, environment and other types of monitoring.
Goal
Research and create an optimized video anomaly detection algorithm for edge devices such as NVIDIA Jetson Xavier to be used in a monitoring context. The particular use case is open and could be related to traffic, animals, people or other phenomena. An important focus will be to compare, select and optimize different machine learning models for use on edge. Techniques that can be used for optimization include quantization, pruning and knowledge distillation.
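Of the optimization techniques listed, knowledge distillation trains a small student model (suitable for the edge device) to match the temperature-softened outputs of a large teacher. A minimal numpy sketch of the distillation loss is shown below; the logits are made-up examples, and a real setup would add this term to the student's ordinary hard-label loss.

```python
import numpy as np

def softmax(z: np.ndarray, t: float = 1.0) -> np.ndarray:
    z = z / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # Scaled by T^2 so gradient magnitudes stay comparable to the hard-label loss.
    return temperature**2 * np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()

# Made-up logits: a student that matches the teacher incurs zero loss,
# a student that disagrees incurs a positive loss.
teacher = np.array([[5.0, 1.0, 0.5]])
aligned = np.array([[5.0, 1.0, 0.5]])
off = np.array([[0.5, 5.0, 1.0]])
```

The high temperature exposes the teacher's relative confidence across classes ("dark knowledge"), which is what lets a much smaller student approach the teacher's accuracy.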