July 13, 2021

Beyond frames: The new frontier in Video Analysis

Jules Talloen
Machine Learning Engineer

With an ever-increasing number of cameras, the abundance of video material is driving a new frontier in video analysis. By leveraging the temporal dimension, video analysis models enable a wide range of impactful use cases, from retail consumer analytics to real-time sports monitoring.

At ML6 we’ve noticed an increased interest in time-based video analysis. For years, ML has been applied to videos, but it has typically remained limited to frame-based techniques. Image-based models alone fail to unlock the true potential of video analytics, and the limits of what’s possible with frame-based methods are becoming ever clearer. The solution, which involves the temporal dimension, unveils new information and opens the door to a whole range of new possibilities.

Take the following frame and video, for example. The frame on the left seemingly shows two people trying to fix their broken-down car. Based on a single frame, it is impossible to determine whether they are truly fixing the car or trying to steal it. Only by taking the entire video into account do you see that the two people arrive in another car, force open the hood and swiftly leave, all whilst nervously looking around.

A sample stealing video from the UCF Crime dataset. Source: https://www.crcv.ucf.edu/projects/real-world/

This blogpost gives an overview of the most important research areas in video analysis. For each of the following subfields, we present some of the most relevant use cases:

  • Tracking and re-identification
  • Video recognition
  • Video anomaly detection
  • Video summarization and description

Tracking and re-identification

Tracking is one of the most fundamental techniques in video analytics. In a single image, each object appears exactly once. As soon as we add the time dimension, we have multiple images, or frames, of the same object. The goal of tracking is to associate these sightings of the same object to form a track through time. Depending on the number of objects and viewpoints, various types of tracking exist. These are discussed next.

An overview of the tracking subfields.

Single object tracking (SOT)

Single object tracking (SOT), also referred to as visual object tracking (VOT), aims to follow a single object throughout a video. A bounding box of the target object in the first frame is given to the tracker. The tracker will then track this object throughout the subsequent frames. There is no need for further object detections after the initial bounding box is given. These types of trackers are called detection-free: they do not rely on a detector. As a result, any type of object can be tracked as there is no dependence on an object detector with a fixed set of classes.

A single object moves throughout video frames.

An example of SOT. A single can of coke is being tracked. Source: https://cv.gluon.ai/build/examples_tracking/demo_SiamRPN.html

Example use cases

  • Animal monitoring: track an arbitrary animal without the need to train a custom object detector.
  • Robotics: track the object currently being handled by robotic arms.

Multi-object tracking (MOT)

As the name suggests, multi-object tracking (MOT) involves tracking multiple objects. A MOT tracker is detection-based: it takes object detections as input and outputs a set of new bounding boxes with corresponding track identifiers. MOT trackers typically associate detections based on movement and visual appearance. They are limited to a fixed set of classes because they depend on both the underlying detector and the visual appearance model. For example, an appearance model trained to tell people apart will not perform well on vehicles, because it has learned to look for discriminative features of people.

Three distinct objects moving throughout video frames.
An example of MOT for vehicles. Each car is assigned a unique ID. The boxes of the detector and tracker are blue and red respectively.
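The association step described above can be illustrated with a minimal sketch. It matches detections to existing tracks purely on bounding-box overlap (IoU), using a greedy strategy. This is a simplification: production trackers usually add a motion model and appearance features, and often use optimal assignment instead of greedy matching. The class name `GreedyIouTracker` and the `min_iou` threshold are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class GreedyIouTracker:
    """Illustrative tracker: associates detections to tracks by box overlap only."""

    def __init__(self, min_iou: float = 0.3) -> None:
        self.min_iou = min_iou          # assumed threshold, tune per use case
        self.tracks: Dict[int, Box] = {}
        self._next_id = 0

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        # Score every (track, detection) pair and visit the best overlaps first.
        pairs = sorted(
            (
                (iou(box, det), track_id, det_idx)
                for track_id, box in self.tracks.items()
                for det_idx, det in enumerate(detections)
            ),
            reverse=True,
        )
        updated: Dict[int, Box] = {}
        used = set()
        for score, track_id, det_idx in pairs:
            if score < self.min_iou:
                break
            if track_id in updated or det_idx in used:
                continue
            updated[track_id] = detections[det_idx]
            used.add(det_idx)
        # Unmatched detections start new tracks; unmatched tracks are dropped.
        for det_idx, det in enumerate(detections):
            if det_idx not in used:
                updated[self._next_id] = det
                self._next_id += 1
        self.tracks = updated
        return updated
```

Calling `update` once per frame with the detector's boxes returns a mapping from track identifier to box. Note that this sketch drops a track as soon as it misses a single frame, whereas real trackers typically keep lost tracks alive for a while.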

Example use cases

  • People/vehicle counting: count the number of unique people/vehicles passing through a certain area.
  • In-store retail analytics: track customers in a store and analyse behavioral patterns to optimize the store layout.
  • Crowd management: analyse crowd movement and patterns.
  • Traffic monitoring: monitor traffic patterns.
  • Broadcast sports analytics: track players and analyse their movement.
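To make the counting use case concrete, here is a small sketch of why track identifiers matter: counting boxes per frame would count the same person many times, whereas counting unique track IDs does not. The input format (one dictionary of track ID to box per frame) and the function name `count_unique_in_zone` are assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def count_unique_in_zone(frames: Iterable[Dict[int, Box]], zone: Box) -> int:
    """Count distinct track IDs whose box center enters the rectangular zone."""
    seen = set()
    for tracks in frames:
        for track_id, (x1, y1, x2, y2) in tracks.items():
            cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
            if zone[0] <= cx <= zone[2] and zone[1] <= cy <= zone[3]:
                seen.add(track_id)
    return len(seen)
```

The result depends entirely on the quality of the tracker: every identity switch inflates the count.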

Multi-target multi-camera tracking (MTMCT)

Multi-target multi-camera tracking builds on MOT and adds another level of complexity by introducing multiple cameras or viewpoints. The tracker now has a notion of depth and is able to output more accurate tracks. Unfortunately, this typically comes at a higher computational cost because of the additional information to be processed. At each point in time, the MTMCT tracker receives a frame from each viewpoint.

Two viewpoints of the same three objects moving throughout video frames.
Four viewpoints of the same area, showcasing MTMCT annotations. Source: https://www.youtube.com/watch?v=dliRQ9zOFPU

Example use cases

  • Sports analytics: accurate player and ball tracking in ball sport games.

Re-identification (ReID)

Re-identification is a subfield of MTMCT: it also concerns multiple objects and viewpoints, but the temporal relation among the detections differs. With MTMCT, the tracker receives a set of detections from each viewpoint at each point in time. In contrast, ReID is commonly performed on detections from multiple viewpoints at multiple timestamps. Another difference with MTMCT is that ReID cameras don’t have to point at the same area. Generally, ReID viewpoints are scattered across a larger area.

Most use cases first involve MOT to track people at each viewpoint. Based on the unique people per viewpoint, a gallery of all people seen across viewpoints is built. Given an image of a query person, ReID then aims to retrieve detections of this person from other viewpoints in the gallery. ReID thus effectively takes over where MOT stops, and tracks across viewpoints.

Three frames, from three different viewpoints at three different times, containing the same object.

A schematic representation of re-identification. A gallery is built with people detected (and tracked) in each viewpoint. The gallery is then queried with an image to retrieve detections from the same person, from other viewpoints.
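The gallery query can be sketched as a nearest-neighbour search over appearance embeddings. The sketch below assumes that some appearance model has already turned each detected person into a fixed-length vector; that model is not shown, and the gallery layout `(camera_id, track_id, embedding)` and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

GalleryEntry = Tuple[str, int, Sequence[float]]  # (camera_id, track_id, embedding)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_gallery(
    query: Sequence[float], gallery: List[GalleryEntry], top_k: int = 3
) -> List[Tuple[float, str, int]]:
    """Return the top_k gallery entries most similar to the query embedding."""
    scored = [
        (cosine_similarity(query, embedding), camera_id, track_id)
        for camera_id, track_id, embedding in gallery
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]
```

In practice a similarity threshold is usually applied on top of the ranking, so that a query person who never appears in the gallery does not produce a false match.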

Example use cases

  • Person/vehicle retrieval: search for a target person or vehicle across viewpoints, in an entire camera network.
  • Cross-camera trajectory estimation: estimate an entity’s trajectory over multiple camera viewpoints.
  • Returning visitor detection: detect returning visitors based on their visual appearance.

End-to-end multi-task tracking

In addition to video-based techniques, image tasks can also benefit greatly from temporal information. Various image-based techniques have been extended to include the temporal dimension. Instead of naively outputting information per frame, the models track objects over time and use the output of the previous frame to improve the next prediction. Examples include instance segmentation and pose estimation.

Multi-task tracking: in addition to pose keypoints/segmentation masks, each person is tracked and assigned a unique color.

Video recognition

Video recognition is the ability to recognize entities or events in videos. As with image recognition, there are multiple types of ‘recognition’, depending on how much information the model outputs. A model can classify, localize or do both. The types of recognition are discussed below.

An overview of the video recognition subfields.


Video classification

Among the most basic techniques, video classification assigns a relevant label to an entire video clip. There is no localization in space or time, thus no bounding boxes or timestamps. As a result, the video clips are commonly only a few seconds long.

A video consisting of n frames is classified as “class A”.
Example clips of an action recognition dataset (THUMOS14). Classes include boxing, drumming, knitting… Source: http://crcv.ucf.edu/THUMOS14/home.html

Example use cases

  • Content monitoring: classify inappropriate videos containing nudity or violence for example.
  • Automatic video categorization: automatically categorize video material into a fixed set of categories.

Temporal localization

With temporal localization, relevant actions or entities are both classified and localized in time. It is also known as frame-level detection. In a single video, multiple events or entities, with their corresponding start and end times, can be detected. Temporal localization is more challenging than classification because the model has to predict when an action or entity starts and ends.

A group of frames of a video is classified as “class A”.
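As an illustration of what frame-level output looks like, the sketch below converts a sequence of per-frame labels into segments with a start and end time. It is only a post-processing step under the assumption that a model already produced one label per frame; dedicated temporal localization models may predict segments more directly. The function name and the `"background"` label are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def frames_to_segments(
    labels: Sequence[str], fps: float, background: str = "background"
) -> List[Tuple[str, float, float]]:
    """Merge runs of identical frame labels into (label, start_s, end_s) segments."""
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label != background:
            # A segment ends where its last frame ends, hence index + length.
            segments.append((label, index / fps, (index + length) / fps))
        index += length
    return segments
```

The resulting segments map directly onto use cases such as clip extraction (cut between start and end) and event counting (count segments per label).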

Example use cases

  • Automatic video cropping: crop interesting parts of a video.
  • Content search: find a specific event in hours of video footage.
  • Automatic clip extraction: automatically extract interesting clips from a longer video.
  • Event counting: count the occurrence of specific events.


Action/entity detection

Action or entity detection classifies and localizes relevant actions or entities both in time and space. Entity detection is similar to multi-object tracking: an object is detected and associated across frames. Action detection, however, detects actions that typically only exist across time. Action/entity detection is classified as pixel-level detection.

A group of frames from a video contain an action or entity of “class A”, as indicated by the bounding boxes.

The output of the SlowFast action detection model by Facebook research. Source: https://github.com/facebookresearch/SlowFast

Example use cases

  • Automatic video cropping/zooming: automatically crop/zoom a video to only contain a certain entity or event.
  • Workplace safety monitoring: detect unsafe events to reduce industrial accidents, e.g. by detecting missing protective equipment.

Video anomaly detection

What if we want to detect events that deviate from regular behavior but we don’t have a dataset of every such event? What if we are not really interested in what type of event occurs, only that it occurs? This is where anomaly detection comes into play!

Video anomaly detection aims to detect and temporally localize anomalous events or actions in videos. Anomalous events are defined as events that signify irregular behavior. They vary from walking in the wrong direction to violent crimes. Anomaly detection models generally output a score that indicates the likelihood of an anomaly at each point in time. Consequently, there is no classification of a specific type of event.

A video with frames containing various levels of anomalous events, as indicated by the anomaly score.

An example clip of an anomalous event (robbery) and the corresponding anomaly score for each frame. The red zone indicates the ground truth. Source: https://www.youtube.com/watch?v=8TKkPePFpiE
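A per-frame anomaly score only becomes actionable once it is turned into alerts. The sketch below smooths the scores with a moving average, to suppress single-frame spikes, and reports the frame ranges where the smoothed score stays above a threshold. The scores themselves are assumed to come from an anomaly detection model that is not shown, and the `threshold` and `window` values are arbitrary illustration values.

```python
from typing import List, Sequence, Tuple


def moving_average(scores: Sequence[float], window: int) -> List[float]:
    """Centered moving average; the window shrinks at the sequence borders."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def anomalous_intervals(
    scores: Sequence[float], threshold: float = 0.5, window: int = 3
) -> List[Tuple[int, int]]:
    """Return inclusive (first_frame, last_frame) ranges with a high smoothed score."""
    smoothed = moving_average(scores, window)
    intervals = []
    start = None
    for i, score in enumerate(smoothed):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(smoothed) - 1))
    return intervals
```

The threshold trades off missed events against false alarms, so in a surveillance setting it would typically be tuned on footage from the actual cameras.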

Example use cases

  • Smart CCTV surveillance: notify camera operators of potential anomalous events to focus their attention.
  • Safety monitoring: detect possibly unsafe events to prevent incidents.

Video summarization and description

To conclude, the world of video analysis is not limited to bounding boxes and class labels. Below are two less common video analysis tasks that aim to give a compact representation of a video: summarization and description.


Video summarization

Video summarization is the process of extracting the most informative or descriptive frames from a video. In the most extreme case, only a single frame is extracted to represent the video (e.g. YouTube thumbnail).

Video summarization extracts the most important frames from a video.

An example of a video summarization model outputting an importance score for each frame. Source: https://www.microsoft.com/en-us/research/publication/video-summarization-learning-deep-side-semantic-embedding/
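Given per-frame importance scores like those in the figure, selecting the summary frames is a small ranking problem. The sketch below picks the `k` highest-scoring frames while keeping them at least `min_gap` frames apart, so that the summary does not consist of near-identical neighbouring frames. The scores are assumed to come from a summarization model that is not shown; the function name and the `min_gap` parameter are hypothetical.

```python
from typing import List, Sequence


def select_keyframes(importance: Sequence[float], k: int, min_gap: int = 1) -> List[int]:
    """Pick up to k high-importance frame indices, at least min_gap frames apart."""
    if k <= 0:
        return []
    # Visit frames from most to least important.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    chosen: List[int] = []
    for idx in ranked:
        if all(abs(idx - other) >= min_gap for other in chosen):
            chosen.append(idx)
        if len(chosen) == k:
            break
    return sorted(chosen)  # chronological order
```

With `k=1` this reduces to the thumbnail case mentioned above: the single most important frame represents the whole video.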


Video description

Automatic video description aims to provide a textual description, indicating what is happening in the video clip. Optionally, description models may also include a segmentation step, splitting the video into distinct chunks and providing a textual description for each.

Example output from a (grounded) video description model. Source: https://github.com/facebookresearch/grounded-video-description

Example use cases

  • Automatic thumbnail selection: automatically select the most descriptive thumbnail for a video.
  • Textual content retrieval: search videos with textual queries describing the content of the video.


Conclusion

Video analysis enables a variety of use cases, spanning multiple domains. This blogpost is by no means an exhaustive list but rather sheds light on the potential of the many techniques. Although video analysis has been around for a while, we’ve only recently started seeing a glimpse of its full potential. By leveraging temporal information, the true power of video emerges.

Want to find out what video analysis can do for you? Get in touch!
