In the context of video analysis, action recognition is the task of recognizing (human) actions in a video [1]. Actions range from extreme outdoor activities, such as abseiling, to everyday activities, such as scrambling eggs (Figure 1). The term is also commonly used for the broader field of event detection, for example in sports. Action recognition is considered one of the most important tasks in video understanding and has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. It can be further divided into classification and localisation. Classification only involves assigning a label to the entire video, whereas localisation additionally involves localising the actions in space and/or time.
Figure 1. Example video action classes.
With the emergence of large, high-quality datasets over the past decade, there has been growing interest in video action recognition research. Datasets have increased both in the number of videos and in the number of classes: from 7K videos and 51 classes in HMDB51 to 8 million videos and 3,862 classes in YouTube-8M. Additionally, the rate at which new datasets are released is increasing: from 3 datasets in 2011-2015 to 13 datasets in 2016-2020. Thanks to the availability of these growing datasets and steady innovation in deep learning, action recognition models are rapidly improving.
While there is growing interest, video action recognition still faces major challenges in developing effective algorithms. Some of these challenges are summarized below:
Keywords: action recognition, event detection, sports, video analysis
Although limited to a subset of actions, event detection in sports is very challenging. A sports event is typically not only defined by the actions of a single person but rather by a combination of the actions of multiple people and their environment. As a result, modeling the environment and location of players may be required to gain a proper understanding of the sports game being played.
Sports event detection can greatly enhance the user experience, both during and after the game. During the game, relevant statistics can be shown on screen without the need for manual data entry. After the game, automatic video summaries can be created. Additionally, the gathered statistics can be linked back to previous games to create interesting reports and dashboards.
Research and create a machine learning algorithm to detect events, such as a tackle, goal attempt or tumble, in sports games (e.g. soccer, field hockey, cycling). The algorithm should be built with computational cost and data scarcity in mind. Data-efficient domain adaptation through transfer learning is one possible solution.
Keywords: action recognition, event detection, monitoring, video analysis, edge
Transmitting video feeds to a centralized datacenter for processing is expensive and requires a high investment in infrastructure, especially for use cases that need a large number of cameras. Moreover, transmitting video that contains privacy-sensitive data over the network poses security risks. At the same time, ever more powerful, lightweight edge processing devices with GPUs and TPUs have been entering the market. Hence, there is growing interest in processing video on the edge, with only statistics and/or representations transmitted and processed centrally. This reduces infrastructure requirements and improves security, as images need not leave the camera location. Notable use cases include surveillance as well as traffic, environmental and other types of monitoring.
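The idea of sending only statistics from the edge can be sketched as follows. This is a minimal illustration in plain Python, assuming that some on-device model has already produced one label (or None) per frame; the function name `summarise_window`, the payload fields and the camera identifier are hypothetical and chosen only for this example.

```python
import json
from collections import Counter


def summarise_window(frame_labels, camera_id, window_start):
    """Reduce per-frame labels to a compact JSON payload; no image data is included."""
    # Count how often each event label occurred, ignoring frames without a detection.
    counts = Counter(label for label in frame_labels if label is not None)
    payload = {
        "camera_id": camera_id,        # hypothetical identifier of the edge device
        "window_start": window_start,  # e.g. index of the first frame in the window
        "frames": len(frame_labels),
        "event_counts": dict(counts),
    }
    return json.dumps(payload, sort_keys=True)


# Example: four frames analysed on the device, only the summary leaves it.
message = summarise_window(["car", "car", None, "cyclist"], "cam-01", 0)
```

Only the short JSON string would be transmitted centrally in such a setup, which is what keeps bandwidth needs low and the raw footage at the camera location.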
Research and create an optimized action detection/recognition algorithm for edge devices, such as the NVIDIA Jetson Xavier, to be used in a monitoring context. The particular use case is open and could be related to traffic, animals, people or other phenomena, including the sports case above. An important focus will be to compare, select and optimize different machine learning models for use on the edge. Techniques that can be used for optimization include quantization, pruning and knowledge distillation.
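To make one of these optimization techniques concrete, the following is a minimal sketch of post-training affine quantization to 8-bit integers in plain Python. The function names `quantize_int8` and `dequantize` and the sample weights are assumptions made for illustration and do not belong to any framework; edge toolchains apply the same idea per tensor or per channel with their own APIs.

```python
def quantize_int8(weights):
    """Map float weights to integers in [0, 255] using a scale and a zero point."""
    # Extend the range to include 0.0 so that zero is represented exactly.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / 255.0 or 1.0  # fall back to 1.0 if all weights are zero
    zero_point = round(-w_min / scale)
    # Round each weight to the nearest step and clamp to the 8-bit range.
    quantized = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float weights from their 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # illustrative values
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)
```

Each weight is stored in one byte instead of four, at the cost of a small reconstruction error that is bounded by the step size `scale`.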
In the context of videos, the goal of anomaly detection is to temporally or spatially localise anomalous events in a video [2]. Anomalous events are defined as events or activities that are unusual and signify irregular behavior (Figure 3). Temporal localisation involves identifying the start and end frames of the anomalous event. Spatial localisation involves identifying where the anomaly occurs within each corresponding frame. Video anomaly detection has extensive applications in surveillance monitoring, such as detecting illegal activities, traffic accidents and unusual events. It not only increases monitoring efficiency but also significantly reduces the burden of manual live monitoring by allowing humans to focus on events that are likely to be of interest.
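Temporal localisation can be illustrated with a small sketch. It assumes that a model has already produced one anomaly score per frame (how those scores are obtained is outside the scope of this example) and groups consecutive frames above a threshold into segments. The function name `localise_segments`, the threshold and the `min_length` parameter are assumptions chosen for illustration.

```python
def localise_segments(scores, threshold=0.5, min_length=1):
    """Return inclusive (start_frame, end_frame) pairs where the score exceeds the threshold."""
    segments = []
    start = None  # index of the first frame of the segment currently being tracked
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a new anomalous segment begins
        elif score <= threshold and start is not None:
            # The segment ended on the previous frame; keep it if it is long enough.
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    # Close a segment that runs until the last frame.
    if start is not None and len(scores) - start >= min_length:
        segments.append((start, len(scores) - 1))
    return segments


frame_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.1, 0.9, 0.9]  # illustrative values
all_segments = localise_segments(frame_scores)
long_segments = localise_segments(frame_scores, min_length=2)
```

Requiring a minimum segment length is one simple way to suppress single-frame spikes, which would otherwise be reported as separate events.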
Figure 3. Example anomalous events from four datasets.
Research into video anomaly detection is growing due to the increasing number of cameras in public places. Cameras are deployed in squares, streets, intersections, banks, shopping malls, etc. to increase public safety. However, the monitoring capabilities of surveillance agencies have not kept pace. Surveillance cameras are severely underutilised because the number of cameras far exceeds the number of human monitors.
Video anomaly detection is still in its early stages and faces major challenges in effective deployment. These challenges are summarized below:
Keywords: anomaly detection, event detection, crime, violence, smart video surveillance
Violence and harmful pattern detection has become an active research area due to the abundance of surveillance cameras and the need to rapidly respond to incidents to prevent further escalation. Among all anomalous events, violence is one of the more challenging to detect. It can happen at any point, in any environment and there is no fixed scenario. A timely response to violent events can significantly increase public safety. Furthermore, it can help with incident reports and automated statistics to prevent future incidents.
Research and create a machine learning algorithm to detect violent events in surveillance footage. The algorithm should be built with computational cost and data scarcity in mind. Footage of fights from other domains, e.g. ice hockey, can be used to build a dataset.
Keywords: anomaly detection, event detection, monitoring, video analysis, edge
Transmitting video feeds to a centralized datacenter for processing is expensive and requires a high investment in infrastructure, especially for use cases that need a large number of cameras. Moreover, transmitting video that contains privacy-sensitive data over the network poses security risks. At the same time, ever more powerful, lightweight edge processing devices with GPUs and TPUs have been entering the market. Hence, there is growing interest in processing video on the edge, with only statistics and/or representations transmitted and processed centrally. This reduces infrastructure requirements and improves security, as images need not leave the camera location. Notable use cases include surveillance as well as traffic, environmental and other types of monitoring.
Research and create an optimized video anomaly detection algorithm for edge devices, such as the NVIDIA Jetson Xavier, to be used in a monitoring context. The particular use case is open and could be related to traffic, animals, people or other phenomena. An important focus will be to compare, select and optimize different machine learning models for use on the edge. Techniques that can be used for optimization include quantization, pruning and knowledge distillation.
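As an illustration of knowledge distillation, the sketch below computes a common form of the distillation loss in plain Python: the KL divergence between temperature-softened teacher and student class probabilities, scaled by the squared temperature. The logits, the temperature value and the function names are assumptions made for this example; in practice this term is usually combined with the ordinary label loss and computed inside a deep learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened probabilities, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


teacher = [4.0, 1.0, -2.0]  # illustrative logits from a large model
student = [2.5, 1.5, -1.0]  # illustrative logits from a small edge model
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, so minimising it pushes the smaller model toward the behaviour of the larger one.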