With an ever-increasing number of cameras, the abundance of video material is driving a new frontier in video analysis. By leveraging the temporal dimension, video analysis models enable a wide range of impactful use cases, from retail consumer analytics to real-time sports monitoring.
At ML6 we’ve noticed an increased interest in time-based video analysis. ML has been applied to videos for years, but it typically remained limited to frame-based techniques. Image-based models fail to unlock the true potential of video analytics, and the limits of what’s possible with frame-based methods are becoming increasingly clear. The solution, incorporating the temporal dimension, unveils new information and opens the door to a whole range of new possibilities.
Take the following frame and video for example. The frame on the left seemingly shows two people trying to fix their broken-down car. Based on a single frame, it is impossible to determine whether they are truly fixing the car or trying to steal it. Only by taking the entire context into account do you see that the two people arrive in another car, forcefully open the hood and swiftly leave, all whilst nervously looking around.
A sample stealing video from the UCF Crime dataset. Source: https://www.crcv.ucf.edu/projects/real-world/
This blogpost gives an overview of the most important research areas in video analysis. For each of the following subfields, we present some of the most relevant use cases:
Tracking is one of the most fundamental techniques in the realm of video analytics. In the case of a single image, each object is unique and known. As soon as we add the time dimension, we have multiple images, or frames, of the same unique object. The goal of tracking is to associate these sightings of the same object to form a track through time. Depending on the number of objects and viewpoints, various types of tracking exist. These are discussed next.
Single object tracking (SOT), also referred to as visual object tracking (VOT), aims to follow a single object throughout a video. A bounding box of the target object in the first frame is given to the tracker. The tracker will then track this object throughout the subsequent frames. There is no need for further object detections after the initial bounding box is given. These types of trackers are called detection-free: they do not rely on a detector. As a result, any type of object can be tracked as there is no dependence on an object detector with a fixed set of classes.
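As a toy illustration of the detection-free idea, the sketch below locates a target patch in a new grayscale frame by exhaustive sum-of-squared-differences template matching. The function name and array interface are our own invention; real SOT trackers (e.g. correlation-filter or Siamese-network based) are far more robust to appearance changes and much faster:

```python
import numpy as np

def track_in_frame(frame, template):
    """Locate the template patch in a grayscale frame by exhaustive
    sum-of-squared-differences search (a toy detection-free tracker).
    Returns the (y, x) of the best-matching window's top-left corner."""
    th, tw = template.shape
    best, best_cost = (0, 0), float("inf")
    for y in range(frame.shape[0] - th + 1):
        for x in range(frame.shape[1] - tw + 1):
            window = frame[y:y + th, x:x + tw]
            cost = np.sum((window - template) ** 2)
            if cost < best_cost:
                best, best_cost = (y, x), cost
    return best
```

In a real tracker, the template given in the first frame would also be updated over time to cope with changes in appearance, scale and lighting.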
As the name suggests, multi-object tracking (MOT) involves multiple objects to track. A MOT tracker is detection-based: it needs object detections as input and outputs a set of new bounding boxes with corresponding track identifiers. MOT trackers typically associate detections based on movement and visual appearance. They are limited to a fixed set of classes because of the dependence on both the underlying detector and the visual appearance model. For example, a model trained to distinguish between people will not perform well on vehicles because it has learned to look for discriminative features of people.
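The association step can be sketched with a minimal greedy matcher based on bounding-box overlap (IoU). The function names are illustrative; production trackers typically combine motion models (e.g. Kalman filters) and appearance embeddings, and solve the assignment optimally (Hungarian algorithm) rather than greedily:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match each track's last box to the detection with highest IoU.
    tracks: {track_id: last_box}, detections: list of boxes in the new frame.
    Returns matched {track_id: detection_index} and unmatched detection indices
    (candidates for new tracks)."""
    matches, unmatched = {}, list(range(len(detections)))
    for track_id, last_box in tracks.items():
        best, best_iou = None, iou_threshold
        for d in unmatched:
            score = iou(last_box, detections[d])
            if score > best_iou:
                best, best_iou = d, score
        if best is not None:
            matches[track_id] = best
            unmatched.remove(best)
    return matches, unmatched
```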
Multi-target multi-camera tracking (MTMCT) adds another level of complexity on top of MOT and introduces multiple cameras or viewpoints. The tracker now has a notion of depth and is able to output more accurate tracks. Unfortunately, this typically comes at the cost of computational complexity because of the additional information to be processed. At each point in time, the MTMCT tracker receives a frame from each viewpoint.
Re-identification (ReID) is a subfield of MTMCT: it concerns multiple objects and viewpoints, but the temporal relation among the detections differs. With MTMCT, the tracker receives, at each point in time, a set of detections from each viewpoint. In contrast, ReID is commonly performed on detections from multiple viewpoints at multiple timestamps. Another difference with MTMCT is that ReID cameras don’t have to point at the same area; ReID viewpoints are generally scattered across a larger area.
Most use cases first involve MOT to track people at each viewpoint. Based on the unique people per viewpoint, a gallery of all people seen across viewpoints is built. Given an image of a query person, ReID then aims to retrieve detections of this person from other viewpoints in the gallery. ReID thus effectively takes over where MOT stops, and tracks across viewpoints.
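At its core, this retrieval step is a nearest-neighbour search over appearance embeddings. The sketch below assumes each detection has already been encoded into a feature vector by some ReID model (the embeddings and function name are placeholders) and ranks the gallery by cosine similarity to the query:

```python
import numpy as np

def retrieve(query_embedding, gallery_embeddings, top_k=3):
    """Rank gallery detections by cosine similarity to the query embedding.
    query_embedding: (dim,) vector; gallery_embeddings: (n, dim) matrix.
    Returns the indices of the top_k most similar gallery entries."""
    q = query_embedding / np.linalg.norm(query_embedding)
    g = gallery_embeddings / np.linalg.norm(gallery_embeddings, axis=1, keepdims=True)
    similarities = g @ q
    return np.argsort(similarities)[::-1][:top_k]
```

In practice, the quality of the ReID system hinges almost entirely on how discriminative the learned embeddings are, not on this ranking step.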
Three frames, from three different viewpoints at three different times, containing the same object.
Beyond purely video-based techniques, image tasks can greatly benefit from temporal information. Various image-based techniques have been extended to include the temporal dimension. Instead of naively outputting information per frame, the models track objects over time and use the output of the previous frame to improve the next prediction. Examples include instance segmentation and pose estimation.
Video recognition is the ability to recognize entities or events in videos. Similar to image recognition, there are multiple types of ‘recognition’, depending on the information the model outputs. A model can classify, localize or do both. The types of recognition are discussed below.
Among the most basic techniques, video classification assigns a relevant label to an entire video clip. There is no localization in space or time, thus no bounding boxes or timestamps. As a result, the video clips are commonly only a few seconds long.
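A common baseline for clip classification is late fusion: run an image classifier on each frame and average the per-frame class probabilities. The sketch below assumes such per-frame probabilities are already available; dedicated video models (e.g. 3D CNNs or video transformers) instead learn temporal patterns directly:

```python
import numpy as np

def classify_clip(frame_probs):
    """Late-fusion baseline: average per-frame class probabilities
    over the clip and return the winning class index.
    frame_probs: (num_frames, num_classes) array."""
    return int(np.argmax(frame_probs.mean(axis=0)))
```

Note that averaging discards temporal order entirely, which is exactly why frame-based approaches struggle with actions defined by motion.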
With temporal localization, relevant actions or entities are both classified and localized in time. It is also known as frame-level detection. In a single video, multiple events or entities, with their corresponding start and end time, can be detected. Temporal localization is more challenging compared to classification because the model has to predict when an action or entity starts and ends.
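A simple post-processing step for temporal localization is turning per-frame action scores into (start, end) segments by thresholding. The sketch below is our own minimal version; real systems additionally smooth the scores and merge or suppress overlapping segments:

```python
def scores_to_segments(scores, fps, threshold=0.5):
    """Convert per-frame action scores into (start_s, end_s) segments.
    A segment is a maximal run of frames with score >= threshold."""
    segments, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:  # action still running at the end of the video
        segments.append((start / fps, len(scores) / fps))
    return segments
```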
Action or entity detection classifies and localizes relevant actions or entities both in time and space. Entity detection is similar to multi-object tracking: an object is detected and associated across frames. Action detection, however, detects actions that typically only exist across time. Action/entity detection is classified as pixel-level detection.
What if we want to detect events that deviate from regular behavior, but we don’t have a dataset of every such event? What if you are not really interested in what type of event occurs, only that it occurs? This is where anomaly detection comes into play!
Video anomaly detection aims to detect and temporally localize anomalous events or actions in videos. Anomalous events are defined as events that signify irregular behavior. They vary from walking in the wrong direction to violent crimes. Anomaly detection models generally output a score that indicates the likelihood of an anomaly at each point in time. Consequently, there is no classification of a specific type of event.
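One simple way to produce such a score is to measure how far each frame deviates from the recent past. The sketch below assumes per-frame feature vectors from some pretrained encoder (an assumption on our part) and scores each frame by its distance to the mean of the preceding window; real methods often use reconstruction or prediction error from a model trained only on normal footage:

```python
import numpy as np

def anomaly_scores(frame_features, window=5):
    """Score each frame by its distance to the mean of the preceding window.
    frame_features: (num_frames, dim) array of per-frame feature vectors.
    Higher score = more anomalous. The first `window` frames score 0."""
    scores = np.zeros(len(frame_features))
    for t in range(window, len(frame_features)):
        baseline = frame_features[t - window:t].mean(axis=0)
        scores[t] = np.linalg.norm(frame_features[t] - baseline)
    return scores
```

Thresholding this score then yields the anomalous time intervals, without ever naming the type of event.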
To conclude, the world of video analysis is not limited to bounding boxes and class labels. Below are two less common video analysis tasks that aim to give a compact representation of a video: summarization and description.
Video summarization is the process of extracting the most informative or descriptive frames from a video. In the most extreme case, only a single frame is extracted to represent the video (e.g. YouTube thumbnail).
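For that single-frame case, a crude heuristic is to pick the frame whose feature vector lies closest to the average of the clip. The sketch below assumes per-frame features from some encoder; real summarization models learn importance scores and select diverse, informative subsets rather than a single centroid:

```python
import numpy as np

def keyframe_index(frame_features):
    """Pick the frame closest to the clip's mean feature vector,
    a crude stand-in for 'most representative frame'.
    frame_features: (num_frames, dim) array."""
    centroid = frame_features.mean(axis=0)
    distances = np.linalg.norm(frame_features - centroid, axis=1)
    return int(np.argmin(distances))
```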
Automatic video description aims to provide a textual description, indicating what is happening in the video clip. Optionally, description models may also include a segmentation step, splitting the video into distinct chunks and providing a textual description for each.
Video analysis enables a variety of use cases, spanning multiple domains. This blogpost is by no means an exhaustive list but rather sheds light on the potential of the many techniques. Although video analysis has been around for a while, we’ve only recently started seeing a glimpse of its full potential. By leveraging temporal information, the true power of video emerges.
Want to find out what video analysis can do for you? Get in touch!