From schools to shopping centers, closed-circuit and monitoring cameras are now found everywhere. With their rising prevalence, concerns about individual privacy and data protection have risen. Even if these cameras do not use facial-recognition technology, the current reality is that wherever there are security cameras, in large cities, buildings, or other locations, there are people looking at these images irrespective of GDPR or other regulations. Watching this footage invades the privacy of its subjects and the information so obtained is only protected by the discretion of the observer. Anonymization of these images would improve compliance with data protection regulations while also increasing public confidence in such monitoring systems.
A major step in any endeavor to protect the privacy of individuals is the removal of personally identifiable information, of which facial information is a large part. While simple techniques such as blurring or obfuscating detected faces serve this purpose, there is a risk of the identity being revealed if the face detection algorithm fails for a few frames. One way to mitigate this is to replace the real faces with artificially generated ones. If the real face is not replaced for a few frames, the lapse will be less obvious to a viewer than a momentary lapse in blurring or obfuscation.
In this blog post, our Machine Learning Engineer, Nikhil Nagaraj, explores two facial anonymization techniques, DeepPrivacy and CLEANIR, and expounds on their strengths and weaknesses.
DeepPrivacy proposes a conditional generative adversarial network (CGAN) to anonymize faces in a given image. The model considers the original pose and background and aims to generate realistic faces that fit seamlessly into the original image.
Before delving into the architecture proposed by DeepPrivacy, this section takes a brief look at generative adversarial networks and their conditional counterparts.
Based on the concept of a zero-sum game and used in generative modeling, a GAN consists of two networks that compete against each other. One of these networks is the generator, which aims to generate data as similar as possible to the real data distribution. The other is the discriminator, which aims to distinguish between the real data distribution and the distribution of the generated samples. Both these networks are trained in an adversarial manner, alternating between optimizing the generator and the discriminator. Ideally, training stops when the discriminator is no longer able to distinguish between the real samples and their generated counterparts.
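To make the adversarial objective concrete, here is a minimal sketch of the two losses that are optimized in alternation. It assumes the discriminator outputs a probability that a sample is real, and it uses the common non-saturating form of the generator loss; neither detail is taken from the DeepPrivacy code, and the function names are illustrative.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating loss (an assumption): the generator wants fakes to score near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A discriminator that can no longer tell real from generated samples
# outputs 0.5 for everything, which is the ideal stopping point described above.
confused = [0.5, 0.5, 0.5]
print(discriminator_loss(confused, confused))  # 2 * ln(2), about 1.386
print(generator_loss(confused))                # ln(2), about 0.693
```

In a real training loop, each loss would be minimized with respect to its own network's parameters while the other network is held fixed.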
In a CGAN, the generator/discriminator must generate/discriminate based on certain auxiliary conditions that are fed to the network. For example, a CGAN might be required to generate images of handwritten numbers in accordance with a number that is given as its conditional input. The discriminator on the other hand must additionally check if the number in the real/generated image matches the condition.
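The handwritten-digit example can be sketched as follows. One simple way to feed a condition to a network is to concatenate a one-hot encoding of the label with the other inputs; that choice, the vector sizes, and the function names are assumptions made for illustration only.

```python
import random

def one_hot(label, num_classes=10):
    """Encode a digit label as a one-hot condition vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(label, noise_dim=8, rng=None):
    """Concatenate random noise with the condition the generator must obey."""
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label)

def discriminator_input(image_pixels, label):
    """The discriminator sees the image together with the condition it should match."""
    return list(image_pixels) + one_hot(label)

g_in = generator_input(label=7, rng=random.Random(0))
print(len(g_in))   # 18: 8 noise values followed by 10 condition values
print(g_in[-10:])  # one-hot vector with 1.0 at index 7
```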
DeepPrivacy proposes a CGAN, which generates images based on the surroundings of the face and sparse pose information.
The official implementation of the model uses the dual shot face detector (DSFD) to detect faces in the given image. A Mask R-CNN is used to estimate seven keypoints that describe the pose of the face: left/right eye, left/right ear, left/right shoulder, and nose. The detected face is then obfuscated, and the resulting image, along with the pose information, is fed to the generative network, which has a U-Net architecture and employs progressive growing during training.
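The input preparation described above can be sketched roughly as follows. The image is a plain list of grayscale rows, the bounding box and keypoints are made-up values standing in for detector output, and normalizing keypoints relative to the box is an assumption of this sketch, not necessarily what the official implementation does.

```python
def obfuscate_face(image, box, fill=0.0):
    """Return a copy of `image` with the pixels inside `box` replaced by `fill`.

    `box` is (x0, y0, x1, y1) with exclusive upper bounds (sketch convention).
    """
    x0, y0, x1, y1 = box
    return [
        [fill if (x0 <= x < x1 and y0 <= y < y1) else value
         for x, value in enumerate(row)]
        for y, row in enumerate(image)
    ]

def normalize_keypoints(keypoints, box):
    """Express keypoints relative to the face box; points outside it fall outside [0, 1]."""
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    return [((x - x0) / width, (y - y0) / height) for x, y in keypoints]

image = [[1.0] * 6 for _ in range(6)]
box = (2, 1, 5, 4)                      # hypothetical face detector output
keypoints = [(3, 2), (4, 2), (3.5, 3)]  # hypothetical: two eyes and the nose

masked = obfuscate_face(image, box)
pose = normalize_keypoints(keypoints, box)
# `masked` and `pose` together form the conditional input to the generator.
print(masked[2])  # [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```

The generator never sees the original face pixels, only the surroundings and the sparse pose, which is what prevents identity information from leaking into the output.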
Results obtained using DeepPrivacy. The left image in each pair is the original image while the image on the right is its anonymized counterpart. In the original image(s), the red bounding box indicates the face detected by the DSFD. The red points are the facial keypoints as detected by the Mask R-CNN.
DeepPrivacy is capable of generating relatively good-quality anonymized faces that fit the background and pose information. However, the anonymization process is non-deterministic, so the anonymized faces are not consistent even when the original face is the same. This is evident in the video below, which was anonymized using DeepPrivacy. Note the lack of temporal consistency in the anonymized face.
DeepPrivacy aims to generate an anonymized face without any emphasis on preserving auxiliary non-identifiable information that can be gleaned from the face. CLEANIR aims to address this shortcoming.
CLEANIR proposes a model to anonymize faces while preserving non-privacy-violating information. The model, based on a variational autoencoder, aims to change the facial identity to a completely new one while preserving the other attributes that are only loosely related to personal identity.
Before delving into the architecture proposed by CLEANIR, this section takes a brief look at the basic concepts involved in a variational autoencoder.
An autoencoder is used to learn efficient data codings in an unsupervised manner. It essentially distills the input data down to its most discriminative/representative features. Generally, the autoencoder aims to be able to reproduce input data from its representation vector (latent vector). This leads to a phenomenon where the latent vectors for different inputs are quite different (mathematically distant) from each other.
A variational autoencoder (VAE) is an autoencoder with the additional goal of compactness in the latent space (the space containing all the latent vectors). This allows the latent space to be sampled to generate new samples that are similar to real data points. A more detailed explanation on VAEs and the math behind them can be found here.
CLEANIR proposes a VAE-based architecture to generate anonymized faces while preserving information loosely related to the subject’s identity. It aims to preserve information such as skin color and the emotion(s) portrayed. The network is trained to separate the latent vector into features related to the subject’s identity and other features that are only loosely related to it.
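The compactness goal is usually enforced with a Kullback-Leibler (KL) penalty that pulls each encoded distribution towards a standard normal. The sketch below shows that penalty and the reparameterization trick used for sampling. It assumes a diagonal Gaussian encoder that outputs a mean and a log-variance per dimension, which is the standard VAE setup and not a detail taken from CLEANIR.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, exp(log_var)) and N(0, 1), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

log_var = [0.0, 0.0]
print(kl_to_standard_normal([0.0, 0.0], log_var))  # 0.0: already a standard normal
print(kl_to_standard_normal([2.0, 0.0], log_var))  # 2.0: penalized for drifting from the origin

z = sample_latent([0.0, 0.0], log_var, random.Random(0))
print(len(z))  # 2
```

A plain autoencoder has no such penalty, which is why its latent vectors can end up far apart and the space between them does not decode to anything meaningful.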
During inference, the part of the latent vector which is related to the person’s identity is transformed and fed together with the auxiliary feature vector to the decoder to produce a face that has been anonymized but still preserves general characteristics of the face.
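A rough sketch of this inference step follows. The text above only says the identity part is "transformed"; rotating it by a fixed angle towards a random orthogonal direction is one plausible transformation and is an assumption here, as are the contiguous split of the latent vector, the seed handling, and all function names.

```python
import math
import random

def rotate_identity(identity, angle_deg, seed=0):
    """Rotate `identity` by `angle_deg` towards a random direction orthogonal to it."""
    rng = random.Random(seed)
    norm = math.sqrt(sum(v * v for v in identity))
    unit = [v / norm for v in identity]
    # Gram-Schmidt: remove the component of a random vector that lies along `unit`.
    rand = [rng.gauss(0.0, 1.0) for _ in identity]
    dot = sum(r * u for r, u in zip(rand, unit))
    ortho = [r - dot * u for r, u in zip(rand, unit)]
    ortho_norm = math.sqrt(sum(o * o for o in ortho))
    ortho = [o / ortho_norm for o in ortho]
    theta = math.radians(angle_deg)
    return [norm * (math.cos(theta) * u + math.sin(theta) * o)
            for u, o in zip(unit, ortho)]

def anonymize_latent(latent, identity_dim, angle_deg=90.0, seed=0):
    """Transform only the identity part; the attribute part passes through unchanged."""
    identity, attributes = latent[:identity_dim], latent[identity_dim:]
    return rotate_identity(identity, angle_deg, seed) + attributes

latent = [0.5, -1.0, 2.0, 0.3, 0.7]  # toy vector: 3 identity + 2 attribute features
anonymized = anonymize_latent(latent, identity_dim=3, angle_deg=90.0, seed=42)
print(anonymized[3:])  # [0.3, 0.7]: attribute features are preserved
```

Because the seed fixes the rotation direction, the same input latent always maps to the same output in this sketch, and the result would then be passed to the decoder.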
Results obtained using CLEANIR. From the anonymized images (right), it is evident that CLEANIR can preserve general information about the face while also anonymizing the face to protect personal identity.
CLEANIR generates high quality anonymized faces while preserving the general information about the person. Due to the nature of the model involved, the anonymization process can produce similar anonymized faces for the same original image. This ensures greater temporal consistency when used to anonymize faces in videos.
Since CLEANIR only generates a cropped face, the anonymized image might be quite jarring if smoothing techniques are not used to merge the original image with the new anonymized face.
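One simple smoothing technique is to alpha-blend the generated crop into the frame with a feathered mask, so that the crop border fades into the original pixels. The sketch below works on plain lists of grayscale values; the linear feathering and the function names are illustrative assumptions, not part of CLEANIR.

```python
def feather_mask(height, width, feather):
    """Weights rise linearly from the crop border to 1.0 over `feather` pixels."""
    return [
        [min(1.0, (min(y, height - 1 - y, x, width - 1 - x) + 1) / (feather + 1))
         for x in range(width)]
        for y in range(height)
    ]

def blend_face(background, face, top, left, feather=2):
    """Alpha-blend `face` into a copy of `background` at position (top, left)."""
    height, width = len(face), len(face[0])
    mask = feather_mask(height, width, feather)
    result = [row[:] for row in background]
    for y in range(height):
        for x in range(width):
            w = mask[y][x]
            result[top + y][left + x] = (
                w * face[y][x] + (1.0 - w) * background[top + y][left + x]
            )
    return result

background = [[0.0] * 7 for _ in range(7)]
face = [[1.0] * 5 for _ in range(5)]
blended = blend_face(background, face, top=1, left=1, feather=2)
print(blended[3][3])  # 1.0: the centre of the crop is fully the new face
print(blended[1][1])  # about 0.33: the corner of the crop is mostly background
```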
After expounding on the techniques involved in both CLEANIR and DeepPrivacy, this section provides a comparison concerning the strengths and weaknesses of both techniques.
Both models generate an anonymized face that cannot be visually matched with the original face. However, when quantitatively tested on the Labeled Faces in the Wild dataset using FaceNet to generate embeddings, CLEANIR seems to marginally outperform DeepPrivacy in terms of the distance between an anonymized face and the original face. This is probably a result of CLEANIR’s training methodology, which uses FaceNet as a part of its training pipeline.
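The kind of measurement described above can be sketched as follows. The toy vectors stand in for face embeddings, and the choice of Euclidean distance on L2-normalized embeddings is an assumption, since the exact metric is not stated here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def embedding_distance(a, b):
    """Euclidean distance between two L2-normalized embeddings (ranges from 0 to 2)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_anonymization_distance(pairs):
    """Average distance over (original, anonymized) embedding pairs."""
    return sum(embedding_distance(o, a) for o, a in pairs) / len(pairs)

# Hypothetical embeddings: a larger distance means the anonymized face
# is further from the original identity in embedding space.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [1.0, 0.0])]
print(embedding_distance(*pairs[0]))  # sqrt(2), about 1.414
print(embedding_distance(*pairs[1]))  # 0.0: identical embeddings
print(mean_anonymization_distance(pairs))
```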
In the case of videos, CLEANIR is capable of producing better temporal consistency in terms of the anonymized faces.
The bottleneck for both CLEANIR and DeepPrivacy in the anonymization pipeline is face/keypoint detection. The actual anonymization process is quite fast in comparison.
Due to its GAN-based approach to generating anonymized images, DeepPrivacy is more susceptible to the ill effects of a noisy or complex background. However, post-processing that replaces only the original face with its anonymized version can reduce any such impact.
Both CLEANIR and DeepPrivacy are limited in their anonymization capabilities by the quality of their face detectors. Spurious face detections can cause large-scale distortion in the anonymized image, especially in the case of DeepPrivacy.
If no face is detected, no face is anonymized.
Facial anonymization is a topic that’s seeing a lot of activity these days. With the increasing demand for greater privacy protections and adherence to stricter regulations, the amount of buzz and research in this area is sure to increase. Both DeepPrivacy and CLEANIR are excellent attempts, each with its own strengths and weaknesses. Future work could involve generating hyper-realistic anonymized images consistently and accurately, with the same face always being anonymized in a similar way. It would also be interesting to try anonymizing segmented faces instead of using bounding boxes.
The code to generate the images and videos seen above is available in this Google Colab notebook. As a small reward for reading through this blog, here’s a video comparing the performance of both models.
A comparison of CLEANIR (center) and DeepPrivacy (right). Original video (left) source: CCTV Camera Pros.