Wouldn’t it be cool if people without knowledge of Sign Language could understand it? And what is stopping us (researchers and developers) from achieving this goal?
The work in this blog was done as part of my internship at ML6 and MSc Thesis at TU Delft.
Sign Language (SL) is the primary language of the deaf and mute community. According to the World Federation of the Deaf, more than 70 million deaf people around the world use sign language. It is a natural and complete language with its own linguistic intricacies. Different countries and regions have their own sign languages, such as American Sign Language (ASL), Chinese Sign Language (CSL), German Sign Language (DGS), and so on. In total, there are around 300 different sign languages. Sign languages are not a one-to-one mapping of spoken languages; each has its own distinct grammar.
For instance, a well-constructed question must be accompanied by the correct eyebrow position. When a person asks a question about who, where, what, why, or when, the eyebrows are expected to be in one position; for a yes/no question, they are expected to be in a different one. SL does not rely on hand gestures alone but also uses facial expressions, hand movements and positions, and body posture. A change in any of these can change the entire meaning of a sign. That is why it is generally hard for someone with no knowledge of sign languages to understand them.
All of these factors make translation into spoken language difficult. There are two main research areas in Sign Language interpretation: Sign Language Recognition (SLR) and Sign Language Translation (SLT). We introduce both below and then use a state-of-the-art architecture for translation. We also discuss and list some of the crucial gaps in that architecture and in the current research setting for SLT in real time.
SLR is about recognizing actions from sign language. It is often treated as a plain gesture recognition problem, although it is not limited to alphabets and numbers. Continuous SLR focuses on recognizing a sequence of continuous signs but disregards the rich grammatical and linguistic structures of sign language that differ from spoken language. The main goal is to interpret the signs, whether isolated or in a continuous sequence.
SLT, on the other hand, is about interpreting sign language as natural language while respecting the grammar of both. The primary objective of SLT is to translate sign language videos into spoken language forms, taking into account the different grammatical aspects of the language. It is a relatively new and complex problem, as it involves facial features and body posture in addition to hand movements and positions. The image below presents the difference between Continuous SLR and SLT.
There are several gaps and challenges in the current research landscape for SLT in real-time human interaction. To get a better idea of these gaps, we used a state-of-the-art architecture for Continuous SLR (CSLR), proposed in the paper “Visual alignment constraint for continuous sign language recognition” by Min et al. [2] and referred to here as VAC_CSLR. To use this architecture for the SLT problem, we added a two-layer Transformer for translation on top of the VAC_CSLR architecture, as shown in the image below. The RWTH Phoenix Weather 14T dataset [3] was used to train both networks separately. This dataset is extracted from weather forecast airings of the German TV station PHOENIX. It has 9 different signers, gloss-level annotations with a vocabulary of 1,066 different signs, and translations into spoken German with a vocabulary of 2,887 different words.
The architecture is based on a two-step translation, Sign-to-Gloss followed by Gloss-to-Text: the first step obtains glosses from the video sequence, and the second step converts the glosses into spoken language sentences. After the training and testing phase, the model was used in a real-time setting. It was tested on different videos, with translation happening on the go over sets of frames read with OpenCV. MediaPipe was used to identify when a sign sequence starts and ends.
The first stage uses the VAC_CSLR network to obtain glosses from the video sequences. The Visual Alignment Constraint network focuses on enhancing the feature extractor with alignment supervision by proposing two auxiliary losses: the Visual Enhancement (VE) loss and the Visual Alignment (VA) loss. The VE loss provides direct supervision for the feature extractor: an auxiliary classifier is added on top of the visual features to produce auxiliary logits. This auxiliary loss forces the feature extractor to make predictions based on local visual information only.
Then, to compensate for the contextual information that the VE loss lacks, the VA loss is proposed. The VA loss is implemented as a knowledge distillation loss that treats the entire network as the teacher model and the visual feature extractor as the student model. The final objective function is composed of the primary Connectionist Temporal Classification (CTC) loss, the visual enhancement loss, and the visual alignment loss. In the second stage, to obtain translations from glosses, a two-layer Transformer was trained to maximize the log-likelihood over all gloss-text pairs.
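To make the structure of this objective more concrete, here is a minimal sketch of a distillation-style VA loss and of how the three terms could be combined. It is an illustration in plain Python, not the VAC_CSLR implementation: the function names, the `temperature` and `alpha` parameters, and their default values are assumptions made for this example, and the CTC and VE losses are simply passed in as numbers.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of one frame's logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=8.0):
    """Mean KL(teacher || student) over frames, on temperature-softened distributions.

    teacher_logits: per-frame logits from the entire network (teacher).
    student_logits: per-frame logits from the auxiliary classifier on visual features (student).
    The temperature value is an illustrative assumption.
    """
    per_frame = []
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_frame, temperature)
        log_q = log_softmax(s_frame, temperature)
        per_frame.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return sum(per_frame) / len(per_frame)

def total_loss(ctc_loss, ve_loss, va_loss, alpha=1.0):
    """Combine the primary CTC loss with the two auxiliary losses.

    alpha weights the VA term; its value here is a placeholder, not a tuned setting.
    """
    return ctc_loss + ve_loss + alpha * va_loss
```

When the student already matches the teacher the distillation term is zero, and it grows as the two distributions drift apart, which is what pushes the visual feature extractor toward the predictions of the full network.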
We refer the reader to the original Transformer paper [4] for more details.
After hyper-parameter tuning and model validation, the model was applied to videos from the published datasets and to clips from various SL-friendly news channels. The videos were mainly selected from German SL sources, as the models were trained on a German SL dataset. For evaluation, we used random videos from the RWTH-Phoenix-Weather 2014 and RWTH-Phoenix-Weather 2014-T datasets, along with SL snippets from Tagesschau, a news show in Germany. The videos were short, one sentence each (up to 8–10 seconds).
In the translation pipeline, a video is broken down into frames, and the MediaPipe Holistic model is run on every frame to identify key points. As soon as the identified key points include left-hand or right-hand key points, the SLR model starts collecting frames for prediction. The set of frames is determined by this hand key-point detection: frames are collected for as long as at least one hand stays in the frame. The glosses produced by the VAC model are then passed to the Transformer model, which provides the spoken-language translation. The final translations were compared to the actual text for the SL video sequence.
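The segmentation logic described above can be sketched as follows. This is a simplified illustration rather than our exact code: it assumes the per-frame hand detection has already been reduced to a list of booleans (for example, whether MediaPipe Holistic returned left-hand or right-hand landmarks for that frame), and the names `sign_segments`, `translate_video`, `sign_to_gloss`, `gloss_to_text` and the `min_length` filter are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def sign_segments(hand_visible: Sequence[bool], min_length: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, for runs where a hand is visible.

    min_length is an assumed filter that drops very short runs, e.g. spurious detections.
    """
    segments = []
    start = None
    for index, visible in enumerate(hand_visible):
        if visible and start is None:
            start = index  # a hand entered the frame: the sign sequence starts
        elif not visible and start is not None:
            if index - start >= min_length:
                segments.append((start, index))
            start = None  # no hand in the frame: the sign sequence ended
    if start is not None and len(hand_visible) - start >= min_length:
        segments.append((start, len(hand_visible)))  # video ended mid-sequence
    return segments

def translate_video(
    frames: Sequence,
    hand_visible: Sequence[bool],
    sign_to_gloss: Callable[[Sequence], List[str]],
    gloss_to_text: Callable[[List[str]], str],
) -> List[str]:
    """Run the two-step pipeline (frames -> glosses -> text) on every detected segment."""
    translations = []
    for start, end in sign_segments(hand_visible):
        glosses = sign_to_gloss(frames[start:end])  # stage 1: stand-in for the VAC model
        translations.append(gloss_to_text(glosses))  # stage 2: stand-in for the Transformer
    return translations
```

Passing the two stages in as callables keeps the segmentation independent of the models, so the same loop works with any Sign-to-Gloss and Gloss-to-Text implementation.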
We also applied different transformations to the frames captured from the videos. The following transformations were applied:
★ Segmentation masks: A mask is used to segment an image. It is used to identify the parts of an image containing a particular object, in this case, a human. It was mainly used to avoid noise in the images, with the background being insignificant for prediction.
★ Image rotation: It is a common image augmentation operation. The image is rotated at various angles to capture the different aspects of the image features in different orientations.
★ Image resizing: The size of the image was changed by center cropping to different dimensions.
★ Image scaling: This differs from image resizing in that the entire image is kept and resampled. The images were scaled by a random factor between 0.5 and 1.5.
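To illustrate the difference between the last two transformations, here is a small dependency-free sketch that treats an image as a list of pixel rows. It is only meant to show the idea: the helper names are hypothetical, nearest-neighbour resampling is an assumption chosen for brevity, and a real pipeline would normally rely on an image library instead.

```python
import random

def center_crop(image, crop_height, crop_width):
    """Keep only the central crop_height x crop_width window of the image."""
    height, width = len(image), len(image[0])
    if crop_height > height or crop_width > width:
        raise ValueError("crop size must not exceed image size")
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return [row[left:left + crop_width] for row in image[top:top + crop_height]]

def scale_nearest(image, factor):
    """Resample the whole image by the given factor using nearest-neighbour lookup."""
    height, width = len(image), len(image[0])
    new_height = max(1, round(height * factor))
    new_width = max(1, round(width * factor))
    return [
        [image[min(height - 1, int(r / factor))][min(width - 1, int(c / factor))]
         for c in range(new_width)]
        for r in range(new_height)
    ]

def random_scale(image, rng, low=0.5, high=1.5):
    """Scale the image by a factor drawn uniformly from [low, high]; returns (image, factor)."""
    factor = rng.uniform(low, high)
    return scale_nearest(image, factor), factor
```

Cropping discards everything outside the central window, while scaling keeps the full content and only changes its resolution, so the two affect the signer's apparent size in the frame in different ways.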
After several experiments on the architecture with different videos, we listed the gaps we observed that need to be addressed to improve SLT for real-world applications. Here are the observed gaps:
In this blog, we presented the gaps we identified in architectures for SLR/SLT by experimenting with existing state-of-the-art architectures. These gaps suggest that architectures and datasets need to advance further before high-quality real-world applications become possible. We conclude that although the current architectures for SLR/SLT might not be fully equipped for real-world SL interpretation, the progress in datasets and architectures looks promising. Because SLT is a difficult problem, the various aspects of SL must all be considered to solve it.
References
[1] Camgoz, Necati Cihan, et al. “Neural sign language translation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. https://openaccess.thecvf.com/content_cvpr_2018/papers/Camgoz_Neural_Sign_Language_CVPR_2018_paper.pdf
[2] Min, Yuecong, et al. “Visual alignment constraint for continuous sign language recognition.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. https://arxiv.org/abs/2104.02330
[3] Camgöz, Necati Cihan, et al. “Neural sign language translation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/
[4] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017). https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[5] Papineni, Kishore, et al. “Bleu: a method for automatic evaluation of machine translation.” Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002. https://aclanthology.org/P02-1040.pdf