Better Sign Language Translation with STMC-Transformer

Sign Language Translation (SLT) first uses a Sign Language Recognition (SLR) system to extract sign language glosses from videos. A translation system then generates spoken language translations from the sign language glosses. This paper focuses on the translation system and introduces the STMC-Transformer, which improves on the current state of the art by over 5 and 7 BLEU respectively for gloss-to-text and video-to-text translation on the PHOENIX-Weather 2014T dataset. On the ASLG-PC12 corpus, we report an increase of over 16 BLEU. We also demonstrate a problem with current methods that rely on gloss supervision: the video-to-text translation of our STMC-Transformer outperforms translation of ground truth (GT) glosses. This contradicts previous claims that GT gloss translation acts as an upper bound for SLT performance and reveals that glosses are an inefficient representation of sign language. For future SLT research, we therefore suggest end-to-end training of the recognition and translation models, or using a different sign language annotation scheme.


Introduction
Communication holds a central position in our daily lives and social interactions. Yet, in a predominantly aural society, sign language users are often deprived of effective communication. Deaf people still face social isolation and miscommunication on a daily basis (Souza et al., 2017). This paper is motivated by the goal of providing assistive technology that allows Deaf people to communicate in their own language.
In general, sign languages developed independently of spoken language and do not share the grammar of their spoken counterparts (Stokoe, 1960). For this reason, Sign Language Recognition (SLR) systems on their own cannot capture the underlying grammar and complexities of sign language, and Sign Language Translation (SLT) faces the additional challenge of taking these unique linguistic features into account during translation. As shown in Figure 1, current SLT approaches involve two steps. First, a tokenization system generates glosses from sign language videos. Then, a translation system translates the recognized glosses into spoken language. Recent work (Orbay and Akarun, 2020; Zhou et al., 2020) has addressed the first step, but none has improved the translation system. This paper aims to fill this research gap by leveraging recent success in Neural Machine Translation (NMT), namely Transformers.
Another limitation of current SLT models is their use of glosses as an intermediate representation of sign language. We show that even a perfect continuous SLR system would not necessarily improve SLT results. We introduce the STMC-Transformer model for video-to-text translation, which surpasses translation of ground truth glosses and thereby reveals that glosses are a flawed representation of sign language.
The contributions of this paper can be summarized as:
1. A novel STMC-Transformer model for video-to-text translation that surpasses GT gloss translation, contrary to previous assumptions.
2. The first successful application of Transformers to SLT, achieving state-of-the-art results in both gloss-to-text and video-to-text translation on the PHOENIX-Weather 2014T and ASLG-PC12 datasets.
3. The first use of weight tying, transfer learning, and ensemble learning in SLT, together with a comprehensive series of Transformer baseline results to underpin future research.

Methods
Despite considerable advancements in machine translation (MT) between spoken languages, sign language processing lags behind for many reasons. Unlike spoken language, sign language is a multidimensional form of communication that relies on both manual and non-manual cues, which presents additional computer vision challenges (Asteriadis et al., 2012). These cues may occur simultaneously, whereas spoken language follows a linear pattern in which words are processed one at a time. Signs also vary in both space and time, and the number of video frames associated with a single sign is not fixed.

Sign Language Glossing
Glossing corresponds to transcribing sign language word-for-word by means of another written language. Glosses differ from translations in that they merely indicate what each part of a sign language sentence means, but do not form a grammatical sentence in the spoken language. While various sign language corpus projects have provided different guidelines for gloss annotation (Crasborn et al., 2007; Johnston, 2013), there is no universal standard, which hinders the exchange of data between projects and consistency between different sign language corpora. Gloss annotations are also an imprecise representation of sign language and can create an information bottleneck when the multi-channel signal of sign language is reduced to a single-dimensional stream of glosses.

Sign Language Recognition
SLR consists of identifying isolated single signs from videos. Continuous sign language recognition (CSLR) is a more challenging task that identifies a sequence of glosses from a continuous video. Works in SLR and CSLR, however, only perform visual recognition and ignore the underlying linguistic features of sign language.

Sign Language Translation
As illustrated in Figure 1, the SLT system takes CSLR as a first step to tokenize the input video into glosses. Then, an additional step translates the glosses into a valid sentence in the target language.
SLT is novel and difficult compared to other translation problems because it involves two steps: accurately extracting meaningful features from a video of a multi-cue language, then generating translations from an intermediate gloss representation instead of translating directly from the source language.


Related Work

Sign Language Recognition
Early approaches for SLR rely on hand-crafted features (Tharwat et al., 2014; Yang, 2010) and use Hidden Markov Models (Forster et al., 2013) or Dynamic Time Warping (Lichtenauer et al., 2008) to model sequential dependencies. More recently, 2D convolutional neural networks (2D-CNN) and 3D convolutional neural networks (3D-CNN) effectively model spatio-temporal representations from sign language videos (Cui et al., 2017; Molchanov et al., 2016). Most existing work on CSLR divides the task into three sub-tasks: alignment learning, single-gloss SLR, and sequence construction (Zhang et al., 2014), while others perform the task in an end-to-end fashion using deep learning (Huang et al., 2015; Camgoz et al., 2017).

Sign Language Translation
SLT was formalized in Camgoz et al. (2018), where the authors introduce the PHOENIX-Weather 2014T dataset and jointly use a 2D-CNN model to extract gloss-level features from video frames and a seq2seq model to perform German sign language translation. Subsequent works on this dataset (Orbay and Akarun, 2020; Zhou et al., 2020) all focus on improving the CSLR component of SLT. A contemporaneous paper (Camgoz et al., 2020) also obtains encouraging results with multi-task Transformers for both tokenization and translation; however, their CSLR performance is sub-optimal, with a higher Word Error Rate than baseline models. Similar work has been done on Korean sign language by Ko et al. (2019), who estimate human keypoints to extract glosses and then use seq2seq models for translation. Arvanitis et al. (2019) use seq2seq models to translate ASL glosses of the ASLG-PC12 dataset (Othman and Jemni, 2012).

Neural Machine Translation
Neural Machine Translation (NMT) employs neural networks to carry out automated text translation. Recent methods typically use an encoder-decoder architecture, also known as seq2seq models.
Earlier approaches use recurrent (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) and convolutional networks (Kalchbrenner et al., 2016; Gehring et al., 2017) for the encoder and the decoder. However, standard seq2seq networks are unable to model long-term dependencies in long input sentences without creating an information bottleneck. To address this issue, recent works use attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015) that calculate context-dependent alignment scores between encoder and decoder hidden states. Vaswani et al. (2017) introduce the Transformer, a seq2seq model relying on self-attention that obtains state-of-the-art results in NMT.

Model architecture
For translation from videos to text, we propose the STMC-Transformer network illustrated in Figure 2.

Spatial-Temporal Multi-Cue (STMC) Network
Our work is the first to use STMC networks (Zhou et al., 2020) for SLT. A spatial multi-cue (SMC) module with a self-contained pose estimation branch decomposes the input video into spatial features of multiple visual cues (face, hand, full-frame and pose). Then, a temporal multi-cue (TMC) module with stacked TMC blocks and temporal pooling (TP) layers calculates temporal correlations within each cue (intra-cue) and between cues (inter-cue) at different time steps, which preserves each unique cue while exploring their relations at the same time. The inter-cue and intra-cue features are each analyzed by Bi-directional Long Short-Term Memory (BiLSTM) (Sutskever et al., 2014) and Connectionist Temporal Classification (CTC) (Graves et al., 2006) units for sequence learning and inference.
This architecture efficiently processes multiple visual cues from sign language video in collaboration with each other, and achieves state-of-the-art performance on three SLR benchmarks. On the PHOENIX-Weather 2014T dataset, it achieves a Word Error Rate of 21.0 for the SLR task.

Transformer
For translation, we train a two-layered Transformer to maximize the log-likelihood
$$\mathcal{L} = \sum_{(x_i, y_i) \in D} \log p(y_i \mid x_i),$$
where $D$ contains the gloss-text pairs $(x_i, y_i)$. Using two layers, compared to the six layers common in spoken language translation, is empirically shown to be optimal in Section 6.1, likely because our datasets are limited in size. We refer to the original Transformer paper (Vaswani et al., 2017) for further architecture details.

Table 1: Statistics of the RWTH-PHOENIX-Weather 2014T and ASLG-PC12 datasets. Out-of-vocabulary (OOV) words are those that appear in the development and testing sets, but not in the training set. Singletons are words that appear only once during training.

Datasets
PHOENIX-Weather 2014T (Camgoz et al., 2018) This dataset is extracted from weather forecast airings of the German TV station PHOENIX. It consists of a parallel corpus of German sign language videos from 9 different signers, gloss-level annotations with a vocabulary of 1,066 different signs, and translations into spoken German with a vocabulary of 2,887 different words. It contains 7,096 training pairs, 519 development pairs and 642 test pairs.

ASLG-PC12 (Othman and Jemni, 2012)
This dataset is constructed from English text from Project Gutenberg that has been transformed into ASL glosses following a rule-based approach. With 87,709 training pairs, this corpus allows us to evaluate Transformers on a larger dataset, which suits deep learning models that usually require large amounts of data. It also allows us to compare performance across different sign languages. However, the data is limited in that it does not contain sign language videos, and it is less complex because it was created semi-automatically. We make our data and code publicly available.

Experiments and Discussions
Our models are built using PyTorch (Paszke et al., 2019) and OpenNMT (Klein et al., 2017). We configure Transformers with word embedding size 512, gloss-level tokenization, sinusoidal positional encoding, 2,048 hidden units and 8 heads. For optimization, we use Adam (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.998, the Noam learning rate schedule, 0.1 dropout, and 0.1 label smoothing. We evaluate on the dev set every half-epoch and employ early stopping with patience 5. During decoding, generated unk tokens are replaced by the source token with the highest attention weight. This is useful when unk symbols correspond to proper nouns that can be directly transposed between languages (Klein et al., 2017). We perform a series of experiments to find the optimal setup for this novel application. We also apply to SLT various techniques commonly used in classic NMT, such as transfer learning, weight tying and ensembling, to improve model performance.
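For illustration, the following is a minimal PyTorch sketch of the translation model and optimization setup described above. It is not the exact OpenNMT configuration we use: the data pipeline is omitted, the vocabulary sizes are taken from Table 1, and the learning rate and warm-up values shown are those tuned later for PHOENIX-Weather 2014T G2T.

```python
import torch
import torch.nn as nn

# Hyperparameters from the configuration described above.
D_MODEL, FFN_DIM, HEADS, LAYERS = 512, 2048, 8, 2
DROPOUT, LABEL_SMOOTHING = 0.1, 0.1
WARMUP_STEPS, BASE_LR = 3000, 0.5          # tuned values for PHOENIX-Weather 2014T G2T
SRC_VOCAB, TGT_VOCAB = 1066, 2887          # gloss / German vocabulary sizes (Table 1)

class GlossToText(nn.Module):
    """Two-layer encoder-decoder Transformer over gloss and word embeddings."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=HEADS,
            num_encoder_layers=LAYERS, num_decoder_layers=LAYERS,
            dim_feedforward=FFN_DIM, dropout=DROPOUT, batch_first=True)
        self.generator = nn.Linear(D_MODEL, TGT_VOCAB)

    def forward(self, src, tgt):
        # Sinusoidal positional encodings are omitted here for brevity.
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=causal)
        return self.generator(out)

model = GlossToText()
criterion = nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING, ignore_index=0)
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR, betas=(0.9, 0.998))

# Noam learning rate schedule: linear warm-up, then inverse-square-root decay.
def noam(step, d_model=D_MODEL, warmup=WARMUP_STEPS):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam)

# For a batch of (src, tgt) index tensors, the training loss would be computed as:
# logits = model(src, tgt[:, :-1])
# loss = criterion(logits.reshape(-1, TGT_VOCAB), tgt[:, 1:].reshape(-1))
```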
For evaluation we use BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005). For BLEU, we report BLEU-1, 2, 3 and 4 scores, and for ROUGE we report the ROUGE-L F1 score. These metrics allow us to compare directly with previous works. METEOR is reported in addition, as it demonstrates a higher correlation with human judgment than BLEU on several MT tasks. All reported results, unless otherwise specified, are averaged over 10 runs with different random seeds.
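For reference, BLEU-1 through BLEU-4 can be computed as in the minimal NLTK sketch below; the toy sentences are hypothetical, and the exact scoring scripts and tokenization behind the reported numbers may differ.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical toy data: one reference per hypothesis, whitespace-tokenized.
references = [[["am", "tag", "zeigt", "sich", "die", "sonne"]]]
hypotheses = [["am", "tag", "scheint", "die", "sonne"]]

smooth = SmoothingFunction().method3  # avoids zero scores on short toy sentences
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```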
We organize our experiments into two groups:
1. Gloss2Text (G2T), in which we translate GT gloss annotations to simulate perfect tokenization, on both PHOENIX-Weather 2014T and ASLG-PC12.
2. Sign2Gloss2Text (S2G2T), in which we perform video-to-text translation on PHOENIX-Weather 2014T with the STMC-Transformer.

G2T is a text-to-text translation task that is novel and challenging compared to classic translation between spoken languages because of the high linguistic variance between source and target sentences, the scarcity of resources, and the information loss or imprecision in the source sentence itself. For ASLG-PC12, many ASL glosses are English words with an added prefix, so during data preprocessing we remove all such prefixes (a sketch of this preprocessing is given below). We also map all glosses that appear fewer than 5 times during training to unk to reduce the vocabulary size.

Table 2: Statistics of the ASLG-PC12 dataset before and after preprocessing.

Table 2 shows that the source and target corpora in ASLG-PC12 are more similar to each other, with much shared vocabulary and a relatively high BLEU-4 score on the raw data. This allows us to compare Transformer performance on a larger and less challenging dataset.
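As a rough illustration of the ASLG-PC12 preprocessing step described above, the sketch below strips gloss prefixes and replaces rare glosses with unk. The prefix inventory used here (X- and DESC-) is an assumption for illustration only; the actual set should be taken from the corpus documentation.

```python
from collections import Counter

# Assumed gloss prefixes, for illustration only.
PREFIXES = ("X-", "DESC-")

def strip_prefixes(sentence):
    """Remove annotation prefixes so glosses line up with plain English words."""
    tokens = []
    for tok in sentence.split():
        for pre in PREFIXES:
            if tok.startswith(pre):
                tok = tok[len(pre):]
                break
        tokens.append(tok)
    return tokens

def build_vocab(train_sentences, min_count=5):
    """Keep only glosses seen at least min_count times in training."""
    counts = Counter(t for s in train_sentences for t in strip_prefixes(s))
    return {t for t, c in counts.items() if c >= min_count}

def preprocess(sentence, vocab):
    return [t if t in vocab else "unk" for t in strip_prefixes(sentence)]

# Hypothetical toy usage (min_count lowered so the toy vocabulary is non-empty).
train = ["X-I DESC-HAPPY MEET X-YOU", "X-I GO HOME"]
vocab = build_vocab(train, min_count=1)
print(preprocess("X-I DESC-HAPPY GO HOME", vocab))  # ['I', 'HAPPY', 'GO', 'HOME']
```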

Model size
The original Transformer (Vaswani et al., 2017) uses 6 layers for both the encoder and the decoder. However, our task differs from a standard MT task between two spoken languages, so we first train Transformers with 1, 2, 4 and 6 encoder-decoder layers. Networks are trained with batch size 2,048 and initial learning rate 1. To choose the best model, we mainly take into account BLEU-4, as it is currently the most widely used metric in MT; we do find that our final model outperforms the other models across all metrics. Table 3 shows that on PHOENIX-Weather 2014T, using 2 layers obtains the highest BLEU-4. Because our dataset is much smaller than spoken language datasets, larger networks may be at a disadvantage. Moreover, a smaller model has the advantage of requiring less memory and computation time. Repeating the same experiment on ASLG-PC12, we also find 2 layers to be the optimal model size; ASLG-PC12 is larger but less complex, which may also explain why smaller networks are more suitable. We carry out the rest of our experiments using 2 encoder-decoder layers.

Embedding schemes
Press and Wolf (2017) show that tying the input and output embeddings while training language models can provide better performance. Since our decoder is in fact a language model conditioned on the encoding of the source sentence and on previous outputs, we can tie the decoder embeddings by using a shared weight matrix for the input and output word embeddings.
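Concretely, weight tying amounts to sharing one parameter matrix between the decoder's input embedding and its output projection, as in this minimal PyTorch sketch (the vocabulary size and embedding dimension follow the configuration above).

```python
import torch.nn as nn

VOCAB, D_MODEL = 2887, 512  # German target vocabulary and model dimension

tgt_embedding = nn.Embedding(VOCAB, D_MODEL)                # decoder input embedding
output_projection = nn.Linear(D_MODEL, VOCAB, bias=False)   # pre-softmax projection

# Tie the two layers: they now share (and jointly update) one weight matrix.
output_projection.weight = tgt_embedding.weight
assert output_projection.weight.data_ptr() == tgt_embedding.weight.data_ptr()
```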
In addition, models are often initialized with pre-trained embeddings for transfer learning. These embeddings are typically trained in an unsupervised manner on a large corpus of text in the desired language. We perform experiments on PHOENIX-Weather 2014T using two popular sets of word embeddings: GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). To the best of our knowledge, neither weight tying nor pre-trained embeddings have previously been employed in SLT. Table 4 shows that there is only one matching token between the German glosses and the pre-trained embeddings, while over 90% of the words in the German text appear in both pre-trained embedding vocabularies. We therefore initialize pre-trained embeddings on the decoder only and keep random initialization for the encoder; the embedding layers are fine-tuned during training (a loading sketch is given below). Table 5 shows that the new embedding schemes do not improve performance on PHOENIX-Weather 2014T. This may be because pre-trained embeddings have been shown to be more effective when used on the encoder side (Qi et al., 2018). Another possible reason is the mismatch between the domain of our dataset and that of the corpus the embeddings were trained on. We therefore keep random initialization of word embeddings for the experiments on PHOENIX-Weather 2014T. Using this setting, we run a parameter search over the learning rate and warm-up steps, and we use an initial learning rate of 0.5 with 3,000 warm-up steps for the remaining experiments. Details of the parameter search are included in Appendix A.1.
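The sketch below shows one way to initialize a decoder embedding matrix from text-format vectors such as GloVe. It is a minimal sketch: the file path, toy vocabulary and the 300-dimensional vectors are illustrative assumptions, and in practice a dimensionality mismatch with the 512-dimensional model would also have to be handled.

```python
import numpy as np
import torch
import torch.nn as nn

def load_pretrained_embeddings(path, word2idx, dim):
    """Fill an embedding matrix with pre-trained vectors where available,
    keeping random initialization for words not found in the vector file."""
    matrix = np.random.normal(scale=0.1, size=(len(word2idx), dim)).astype("float32")
    hits = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in word2idx and len(vec) == dim:
                matrix[word2idx[word]] = np.asarray(vec, dtype="float32")
                hits += 1
    print(f"initialized {hits}/{len(word2idx)} words from pre-trained vectors")
    return torch.from_numpy(matrix)

# Hypothetical usage: initialize the decoder embedding, then fine-tune it.
word2idx = {"<unk>": 0, "sonne": 1, "regen": 2}
weights = load_pretrained_embeddings("glove_de_vectors.txt", word2idx, dim=300)
decoder_embedding = nn.Embedding.from_pretrained(weights, freeze=False)
```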
Both the GloVe and fastText English vectors have a reasonable overlap with the vocabulary of the ASL glosses as well as with the English targets (Table 4). Therefore, on ASLG-PC12 we load pre-trained embeddings on the decoder only, as well as on both the encoder and decoder.

Table 6: G2T performance comparison using different embedding schemes on ASLG-PC12.

Table 6 shows that fastText pre-trained embeddings on the decoder improve performance, and that tied decoder embeddings with random initialization give the best performance. Weight tying is likely better suited to this dataset because it acts as regularization and combats overfitting, while the previous dataset is more complex and therefore less prone to overfitting. For the remaining experiments, we use tied decoder embeddings, an initial learning rate of 0.2 and 8,000 warm-up steps.

A naive method for decoding is greedy search, where the model simply chooses the word with the highest probability at each time step. However, this approach may be sub-optimal in the context of the entire sequence. Beam search addresses this by expanding all possible candidates at each time step and keeping a fixed number of the most likely partial sequences (the beam width); a sketch is given below. Larger beam widths do not always result in better performance and require more memory and decoding time. We search for and find the optimal beam width to be 4 on PHOENIX-Weather 2014T and 5 on ASLG-PC12.
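As a reference for the decoding strategy, here is a minimal beam search sketch over a generic step function that returns next-token log-probabilities given a prefix. The toy scoring function is a stand-in for the Transformer decoder, and length normalization and coverage penalties are omitted.

```python
import math

BOS, EOS, BEAM_WIDTH, MAX_LEN = 0, 1, 4, 20

def beam_search(step_fn, vocab_size):
    """step_fn(prefix) -> list of log-probabilities over the vocabulary."""
    beams = [([BOS], 0.0)]            # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(MAX_LEN):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            for tok in range(vocab_size):
                candidates.append((prefix + [tok], score + log_probs[tok]))
        # Keep only the BEAM_WIDTH best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:BEAM_WIDTH]:
            (finished if prefix[-1] == EOS else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy stand-in for a decoder: prefers token 2, and favors EOS once the prefix grows.
def toy_step(prefix):
    probs = [0.05, 0.25 if len(prefix) > 3 else 0.05, 0.6, 0.2, 0.1]
    total = sum(probs)
    return [math.log(p / total) for p in probs]

print(beam_search(toy_step, vocab_size=5))
```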

Ensemble decoding
Ensemble methods combine multiple models to improve performance. We perform ensemble decoding, where we combine the outputs of different models by averaging their prediction distributions (sketched below). We chose the 9 models from our experiments that gave the highest BLEU-4 on the PHOENIX-Weather 2014T test set. The number of models is chosen empirically: using fewer models gives less benefit from ensembling, while including too many weaker models may lessen the quality of the ensemble. These models share the same architecture but are initialized with different seeds and trained with different batch sizes and/or learning rates; individually, they obtain between 22.92 and 23.41 BLEU-4 on the test set. Table 7 gives a performance comparison on PHOENIX-Weather 2014T of the recurrent seq2seq model by Camgoz et al. (2018), the Transformer trained concurrently by Camgoz et al. (2020), our single model, and our ensemble model. We also provide the scores on the gloss annotations to illustrate the difficulty of this task.
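A minimal sketch of the distribution-averaging step: at each decoding position, the next-token distributions from all ensemble members are averaged before the next token is chosen. Greedy selection is shown for brevity, and the models are assumed to have the same interface as the GlossToText sketch above; in practice this averaging is combined with beam search.

```python
import torch
import torch.nn.functional as F

def ensemble_step(models, src, prefix):
    """Average the next-token distributions of several models for one decoding step."""
    probs = []
    with torch.no_grad():
        for model in models:
            logits = model(src, prefix)             # (batch, prefix_len, vocab)
            probs.append(F.softmax(logits[:, -1, :], dim=-1))
    return torch.stack(probs).mean(dim=0)           # (batch, vocab)

def ensemble_greedy_decode(models, src, bos=2, eos=3, max_len=30):
    prefix = torch.full((src.size(0), 1), bos, dtype=torch.long)
    for _ in range(max_len):
        avg = ensemble_step(models, src, prefix)
        next_tok = avg.argmax(dim=-1, keepdim=True)
        prefix = torch.cat([prefix, next_tok], dim=1)
        if bool((next_tok == eos).all()):
            break
    return prefix
```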
Without any additional training, ensembling improves test performance by over 1 BLEU-4, and we report an improvement of over 5 BLEU-4 over the previous state of the art. A single Transformer already gives an improvement of over 4 BLEU-4 over the state of the art, which shows the advantage of Transformers for SLT, as also observed by Camgoz et al. (2020). We also combine the 5 best models from our experiments on ASLG-PC12 in an ensemble. Individually, these models obtain between 81.72 and 82.41 BLEU-4 on the test set. Table 8 shows that our single Transformer surpasses the recurrent seq2seq model of Arvanitis et al. (2019) by over 16 BLEU-4, and the ensemble model yields a further improvement of 0.46 BLEU-4 over the single model. The gain from ensembling is smaller here, possibly because there is less variance across the different models.

German Sign2Gloss2Text (S2G2T)
In S2G2T, both gloss recognition from videos and translation of the glosses into text are performed automatically. Camgoz et al. (2018) claim the previous G2T setup to be an upper bound for translation performance, since it simulates having a perfect recognition system. However, this claim assumes that the ground truth gloss annotations provide a full understanding of sign language, which ignores the information bottleneck in glosses. Camgoz et al. (2020) hypothesize that it is therefore possible to surpass G2T performance without using GT glosses, which we confirm in this section.
We perform experiments on the PHOENIX-Weather 2014T dataset as it contains parallel video, gloss and text data. On the other hand, the ASLG-PC12 corpus does not have sign language video information.

S2G→G2T
To begin, we use the best-performing German G2T model to translate glosses predicted by a trained STMC network. Table 9 shows that, despite no additional training for translation, this model already obtains a relatively high score that beats the previous state of the art by over 5 BLEU-4.

Recurrent seq2seq networks
For comparison, we also train and evaluate the STMC network combined with recurrent seq2seq networks for translation. The translation models are composed of four stacked layers of Gated Recurrent Units (GRU) (Chung et al., 2014), with either Luong (Luong et al., 2015) or Bahdanau (Bahdanau et al., 2015) attention; the core of the Luong scoring step is sketched below.
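For reference, the core of Luong (global, dot-product) attention can be sketched as follows. This is a generic illustration rather than the exact baseline implementation, and the tensor sizes are toy values.

```python
import torch
import torch.nn.functional as F

def luong_attention(decoder_state, encoder_outputs):
    """decoder_state: (batch, hidden); encoder_outputs: (batch, src_len, hidden).
    Returns the context vector and attention weights (dot-product scoring)."""
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)                        # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

# Toy usage with random tensors.
dec = torch.randn(2, 256)
enc = torch.randn(2, 7, 256)
context, weights = luong_attention(dec, enc)
print(context.shape, weights.shape)   # torch.Size([2, 256]) torch.Size([2, 7])
```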
In Table 9, recurrent seq2seq models obtain slightly better performance with Luong attention. Surprisingly, these models outperform previous models of similar architecture that translate GT glosses.

Transformer
For the STMC-Transformer, we train Transformer models with the same architecture as in G2T. Parameter search yields an initial learning rate of 1 with 3,000 warm-up steps and a beam size of 4. We empirically find that using the 8 best models in ensemble decoding is optimal; these models individually obtain between 23.51 and 24.00 BLEU-4.
Again, we observe that the STMC-Transformer outperforms previous systems that translate ground truth glosses, including Transformers trained on GT glosses. While STMC performs imperfect CSLR, its gloss predictions may be more useful than the ground truth annotations during SLT and may be more readily processed by the Transformer. The ground truth glosses are merely a simplified intermediate representation of the actual sign language, so it is not entirely unexpected that translating them does not give the best performance. This result also reveals that training the recognition model to output more accurate glosses will not necessarily improve translation.
Both our STMC-Transformer and STMC-RNN also outperform the model of Camgoz et al. (2020). Their best model jointly trains Transformers for recognition and translation; however, it obtains a 24.49 WER on recognition whereas STMC obtains a better WER of 21.0, which suggests their model is weaker at processing the videos.
Moreover, Transformers outperform recurrent networks in this setup as well, and the STMC-Transformer improves the state of the art for video-to-text translation by 7 BLEU-4.

Qualitative comparison
Example outputs of the G2T and S2G2T models (Table 10) show that the translations are generally of good quality, even when BLEU scores are low. Most translations have only slight differences in word choice that do not change the overall meaning of the sentence or introduce grammatical errors, which suggests that BLEU does not fully capture the aspects of translation quality that matter to human users of SLT. As for the comparison between the G2T and S2G2T networks, there does not seem to be a clear pattern in the cases where S2G2T outperforms G2T and vice versa. Note, however, that PHOENIX-Weather 2014T is restricted to the weather forecast domain, and an SLT dataset with a wider domain would be required to fully assess the performance of our model in more general real-life settings.
We also provide sample G2T outputs on the ASLG-PC12 corpus in Appendix A.2.

Conclusions and Future Work
In this paper, we proposed Transformers for SLT, notably the STMC-Transformer. Our experiments demonstrate that Transformers obtain better SLT performance than previous RNN-based networks. We also achieve new state-of-the-art results on several translation tasks on the PHOENIX-Weather 2014T and ASLG-PC12 datasets.
A key finding is that we obtain better performance by using an STMC network for tokenization than by translating GT glosses. This calls into question current methods that use glosses as an intermediate representation, since the reference glosses themselves are suboptimal.
End-to-end training without gloss supervision is one promising direction, though the end-to-end model of Camgoz et al. (2020) does not yet surpass their jointly trained model. As future work, we suggest continuing work on end-to-end training of the recognition and translation models, so that the recognition model learns an intermediate representation optimized for translation, or using a different sign language annotation scheme with less information loss.

A Appendices
A.1 Experiments on German G2T learning rate

A learning rate that is too low results in notably slower convergence, while a learning rate that is too high risks causing the model to diverge. To prevent divergence, we apply the Noam learning rate schedule, in which the learning rate increases linearly during the first training steps (the warm-up stage) and then decreases proportionally to the inverse square root of the step number. The number of warm-up steps is a parameter that has been shown to influence Transformer performance (Popel and Bojar, 2018), so we first run a parameter search over the number of warm-up steps before finding the optimal initial learning rate.
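For reference, the Noam schedule from Vaswani et al. (2017) can be written as
$$\mathrm{lr}(t) = \lambda \cdot d_{\text{model}}^{-0.5} \cdot \min\left(t^{-0.5},\; t \cdot w^{-1.5}\right),$$
where $t$ is the training step, $w$ the number of warm-up steps, and $\lambda$ a constant multiplier corresponding to the initial learning rate referred to above.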

A.2 Qualitative G2T Results on ASLG-PC12
Because quantitative metrics provide only a limited evaluation of translation performance, manual inspection of the translation outputs may give a better assessment of translation quality. Table 11 provides examples of SLT output on the ASLG-PC12 dataset. Here we can see how ASL glosses include prefixes that are not necessary to capture the meaning of the phrase, which we removed during data pre-processing before training. With a BLEU-4 test score of 82.87, most predictions by our system are very close to the target English phrases and convey the same meaning. We have also selected translation examples with lower BLEU-4 scores; common errors include the mistranslation of numbers and proper nouns, which are likely corner cases with few examples during training.