A Dataset for Linguistic Understanding, Visual Evaluation, and Recognition of Sign Languages: The K-RSL

The paper presents the first dataset that aims to serve interdisciplinary purposes for both the computer vision community and sign language linguistics. To date, the majority of Sign Language Recognition (SLR) approaches treat sign language recognition as a manual gesture recognition problem. However, signers also use other articulators: facial expressions and head and body position and movement convey linguistic information. Given the important role of non-manual markers, this paper proposes a dataset and presents a use case to stress the importance of including non-manual features to improve the recognition accuracy of signs. To the best of our knowledge, no prior publicly available dataset explicitly focuses on the non-manual components responsible for the grammar of sign languages. To this end, the proposed dataset contains 28,250 high-resolution, high-quality videos of signs, with annotation of manual and non-manual components. We conducted a series of evaluations to investigate whether non-manual components improve sign recognition accuracy. We release the dataset to encourage SLR researchers and help advance current progress in this area toward real-time sign language interpretation. Our dataset will be made publicly available at https://krslproject.github.io/krsl-corpus


Introduction
There exist over 300 sign languages around the world that are native to 70 million deaf people (Bragg et al., 2019). Sign languages are comprised of hand gestures, arms and body movements, head position, facial expressions, and lip patterns (Sandler and Lillo-Martin, 2006). While automatic speech recognition has progressed to being commercially available, automatic Sign Language Recognition (SLR) is still in its infancy (Cooper et al., 2011).
To date, more than half of published vision-based research utilizes isolated sign language data with a vocabulary size of less than 50 signs (Koller, 2020). But the real-world utility of SLR solutions requires continuous recognition, which is significantly more challenging than recognising individual signs due to co-articulation (the ending of one sign affecting the start of the next), depiction (visually representing or enacting content), epenthesis effects (insertion of extra features into signs), generalization, and so on (Bragg et al., 2019). As a result, realistic, generalisable, and large datasets are necessary to advance SLR.
Current efforts in SLR do not address the complexities of sign language linguistics, and thus have limited real-world value (Bragg et al., 2019). Chatzis et al. (2020) highlight the importance of non-manual components of sign languages. For example, they can change the meaning of a verb, or differentiate between objects and people. According to Koller (2020), there is an overall lack of non-manual parameters in medium and larger vocabulary recognition systems. For example, many computer vision approaches focus on the signers' hands only and tend to ignore the rich channel of information conveyed by non-manual articulators: facial expressions, mouthing, and movement and position of the head and body, all of which convey important grammatical and lexical information. In addition, many datasets allowed novice or non-native contributions (i.e. students), as well as slower signing and a simplified style and vocabulary, which makes the computer vision problem easier but of no real value (Bragg et al., 2019). Progress in SLR requires interdisciplinary efforts that involve native signers and sign language linguists.
Beyond targeting the local need of creating the first corpus within the CIS (Commonwealth of Independent States) region suitable for machine learning, the motivation behind the proposed dataset is the need to stress the importance of non-manual components present in many signs. The proposed dataset contains continuous sign language data with a focus on specifically selected cases where non-manual markers play a vital role in differentiating between similar signs or sentences. This approach to corpus creation allows researchers from different fields to conduct experiments using this dataset. To date, SL linguists and ML researchers were rarely able to utilize the same datasets due to limitations of both kinds. Thus, we make the following contributions:
• we release the first Kazakh-Russian Sign Language (KRSL) corpus, consisting of 10 signers, 28,250 continuous sentences, and a vocabulary of 600 signs, appropriate for ML research;
• we release raw videos appropriate for linguists and the general population;
• we release isolated signs, extracted frames, and features for easy and fast experiments, aiming at compatibility with the formats of other SL datasets;
• we evaluate pose estimation and action recognition approaches to set up baselines on the K-RSL dataset.

Table 1 (fragment). Dataset | Signers | Vocabulary | Videos
[name missing] (Huang et al., 2018) | 50 | 178 | 25,000
KETI (2019) (Ko et al., 2019) | 12 | 419 | 11,578
GSL SI (2019) (Chatzis et al., 2020) | 7 | 310 | 10,290
K-RSL | 10 | 600 | 28,250
Section 2 presents the background on sign languages and non-manual components followed by a brief description of other SL datasets. Section 3 outlines the proposed dataset. Section 4 details a series of baseline evaluations conducted in order to investigate whether non-manual components would improve recognition accuracy. Section 5 details our use case evaluation. Section 6 concludes the paper.

Related work
This section discusses related work on sign language datasets, state of the art in SLR, and the importance of non-manual features for sign languages.

Sign Language Datasets
Sign language datasets consist of videos of either isolated or continuous signing. Table 1 presents a comparison of the continuous sign language datasets commonly utilized for sign language recognition, including the proposed K-RSL, ordered by date. Bragg et al. (2019) specify that dataset size, continuous signing, involvement of native signers, and signer variety are the main concerns related to current datasets. These challenges limit the accuracy and robustness of SLR models intended for deployment in real-world applications.

Sign Language Recognition
The latest works in the area of SLR focus on vision-based continuous sign language recognition. All the evaluations are performed on the RWTH-PHOENIX-Weather 2014 dataset (Cihan Camgoz et al., 2018). Various approaches offer recognition frameworks that utilize deep neural networks, reinforcement learning, or recurrent neural networks. For example, one proposed approach applies an encoder-decoder structure to reinforcement learning. This method achieved competitive results when compared with other methods and has a Word Error Rate (WER) of 38.3%. Temporal segmentation creates additional challenges for continuous SLR. To address this issue, Huang et al. (2018) proposed the Hierarchical Attention Network with Latent Space (LS-HAN). This framework eliminated the preprocessing step of temporal segmentation and achieved an accuracy of 0.617. An I3D-TEM-CTC framework with iterative optimization has also been proposed for continuous sign language recognition. By increasing the quality of pseudo labels, the final performance of the system was improved and achieved a WER of 34.5%. However, the most promising results were achieved by combining different modalities. Cui et al. (2019) proposed a recurrent convolutional neural network on multi-modal fusion data of RGB images along with optical flow data and achieved a WER of 22.86%. Other approaches focused on sequential parallelism to learn sign language, mouth shape, and handshape classifiers, and improved the WER to 26.0%. This clearly shows that combining manual and non-manual features such as mouth shape can significantly improve the performance of recognition systems.

Importance of Non-manual Features
Sign languages are natural languages existing in the visual modality (Sandler and Lillo-Martin, 2006). Signs in sign languages are produced not only by the manual articulators (the hands), but also by non-manual articulators (the body, head, facial features). The importance of the non-manual features is evidenced, for example, by the fact that signers focus their attention not on the hands of the interlocutor, but on the face (Pfau and Quer, 2010). It has been shown that non-manual markers function at different levels in sign languages (Pfau and Quer, 2010). On the lexical level, signs which are manually identical can be distinguished by facial expression or specifically by mouthing (silent articulation of a word from a spoken language) (Crasborn et al., 2008). Signs referring to emotions are obligatorily accompanied by lexicalized facial expressions related to the corresponding emotion. Non-manual markers are especially important on the level of the sentence and beyond. Specifically, negation in many sign languages is expressed by head movements (Zeshan, 2004a), and questions are distinguished from statements by eyebrow and head position almost universally (Zeshan, 2004b). Of course, signers also use the face to express their emotions, so emotional and linguistic non-manual markers can interact in complex ways (De Vos et al., 2009). Antonakos et al. (2015) presented an overview of non-manual parameter employment for SLR and concluded that a limited number of works focused on employing non-manual features in SLR. There have been works that focused on combining both manual and non-manual features (Freitas et al., 2017; Liu et al., 2014; Yang and Lee, 2013; Mukushev et al., 2020) or on non-manual features only (Kumar et al., 2017). While the importance of non-manual markers has been thoroughly demonstrated in linguistic research, their role in sign language recognition has not been investigated in detail yet.

The Proposed K-RSL Corpus
Given the important role of non-manual markers, in this paper we present a corpus which is motivated by the importance of both manual and non-manual features. We focus on specific cases where non-manual markers play a vital role in differentiating between similar signs or similar sentences.

Kazakh-Russian Sign Language (KRSL)
KRSL is the sign language used in the Republic of Kazakhstan. KRSL is closely related to Russian Sign Language (RSL), as the centralized language policy of the Soviet Union led to the spread of RSL in the Soviet republics. According to Kimmelman et al.

The Data
The K-RSL dataset consists of videos of phrases recorded by five professional sign language interpreters; one subset was additionally recorded by five deaf participants who are also native signers. From a linguistic point of view, the dataset can be divided into four subsets: question-statement pairs, signs of emotion, emotional question-statement pairs, and phonologically similar signs (minimal pairs). The signers were asked to sign 200 phrases for the first subset, 60 phrases for the second subset, 30 phrases with 3 emotional characteristics for the third subset, and 125 phrases for the fourth subset. Each phrase was repeated at least ten times in a row by each signer.
The five hearing participants are native signers of KRSL, as they grew up with parents using KRSL at home. Four of them are employed as news interpreters on national television. The recording setup consisted of a green background and a Logitech C920 HD Pro webcam. Recording took place in an office space without professional lighting sources. A summary of the K-RSL dataset is presented in Table 2.

Question vs Statement
Similar to question words in many spoken languages, question signs in KRSL can be used not only in questions (Who came?) but also in statements (I know who came). Thus, each question sign can occur either with non-manual question marking (eyebrow raise, sideward or backward head tilt), or without it. In addition, question signs are usually accompanied by mouthing of the corresponding Russian/Kazakh word (e.g. kto/kim for 'who', and chto/ne for 'what'). While question signs are also distinguished from each other by manual features, mouthing provides extra information, which can be used in recognition. Thus, the two types of non-manual markers (eyebrow and head position vs. mouthing) can play different roles in recognition: the former can be used to distinguish statements from questions, and the latter can be used to help distinguish different question signs from each other. To this end, we selected ten words and composed twenty phrases with each word (ten statements and ten questions): 'what', 'who', 'which', 'which one', 'when', 'where (direction)', 'where (location)', 'why', 'how', and 'how much'. We divide them into twenty classes, since each of the ten words occurs in both a statement and a question form.

Emotion signs
In KRSL, as in other sign languages, the signs for emotions, such as ANGRY, SAD, SURPRISED, SCARED, PITY, and HAPPY, are accompanied by facial expressions corresponding to the emotion named by the sign. Therefore, we collected phrases containing the six signs for basic emotions. We hypothesized that, since facial expressions in these signs are lexically associated with them, including non-manual components can improve recognition of these signs.

Emotional questions vs. emotional statements
De Vos et al. (2009) analyzed the interaction of emotional facial expressions and grammatical non-manual markers in Sign Language of the Netherlands (NGT). They elicited polar and content questions in NGT, as well as sentences with topic marking, signed neutrally, with anger, or with surprise. Polar questions and topics are normally accompanied by raised eyebrows, while content questions are accompanied by furrowed eyebrows; the emotion of anger causes eyebrow furrowing, and the emotion of surprise causes eyebrow raise. Therefore, in some of the contexts emotions and grammar were in agreement (e.g. surprised polar questions), while in others they were in competition (e.g. angry polar questions). The researchers found that emotional and grammatical non-manuals interact in complex ways.
We created a similar dataset for KRSL. The signers were asked to sign ten sentences as either a statement (no eyebrow movement expected), a polar question (eyebrow raise expected), or a wh-question (adding a single question sign), and with three different emotions: neutral, surprise (eyebrow raise expected), and anger (eyebrow furrowing expected). We hypothesized that emotions and grammatical markers would interact in complex ways, and that these interactions might negatively influence recognition accuracy when recognizing sentence types (questions vs statements).

Minimal pairs
Similar to words in spoken languages, signs can form minimal pairs: one can find signs that are minimally different in their manual component (Sandler and Lillo-Martin, 2006). For instance, the KRSL signs "Moscow", "old", and "grandmother" all have the same handshape (the fist) and location (the cheek), but different movements. It is possible to find signs which are distinguished by handshape only or by location only as well.
We hypothesized that minimal pairs of signs are potentially difficult for recognition, as they are quite similar in shape. However, these signs are additionally distinguished by mouthing (see above). Therefore, including non-manual components can improve sign recognition for such pairs of signs. We thus created a dataset with 15 minimal pairs of signs signed as parts of phrases.

Openpose Feature Extraction
We utilized the OpenPose library (Cao et al., 2017; Wei et al., 2016) to extract the keypoints of the person in the videos. OpenPose is a real-time multi-person keypoint detection library for body, face, hand, and foot estimation provided by Carnegie Mellon University. It detects 2D information for 25 keypoints (joints) on the body and feet, 2 × 21 keypoints on the two hands, and 70 keypoints on the face, that is, 137 keypoints in total. It also provides 3D single-person keypoint detection in real time on multi-camera videos. OpenPose outputs the keypoint values for each frame in JSON format. Since the dataset we use consists of RGB videos, we only consider 2D keypoints in this work.
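As an illustration of how the per-frame JSON output could be turned into coordinate lists, the following is a minimal sketch and not the authors' extraction code. The key names follow OpenPose's JSON output, in which each keypoint is stored as an (x, y, confidence) triple; the function name `frame_to_xy`, the choice of the first detected person, and the zero-filling of missing parts are assumptions made for this example.

```python
import json

# OpenPose JSON keys and the number of keypoints each one holds.
PARTS = {
    "pose": ("pose_keypoints_2d", 25),
    "face": ("face_keypoints_2d", 70),
    "hand_left": ("hand_left_keypoints_2d", 21),
    "hand_right": ("hand_right_keypoints_2d", 21),
}


def frame_to_xy(frame_json, parts):
    """Return a flat [x0, y0, x1, y1, ...] list for the requested parts.

    Hypothetical helper: uses the first detected person and drops the
    confidence value that follows every (x, y) pair.
    """
    people = json.loads(frame_json).get("people", [])
    xy = []
    for part in parts:
        key, count = PARTS[part]
        flat = people[0].get(key, []) if people else []
        if len(flat) != 3 * count:
            # Assumption: a missing or malformed part is zero-filled.
            flat = [0.0] * (3 * count)
        for j in range(count):
            xy.extend(flat[3 * j:3 * j + 2])
    return xy
```

With this layout, the two hands yield 84 values per frame and all four parts together yield 274 values per frame.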

Baseline methods
Sign recognition can be considered a variation of action recognition or human pose estimation tasks. The keypoint detection library OpenPose (Cao et al., 2017; Wei et al., 2016) enables us to evaluate both manual (hand keypoints) and non-manual features (face and pose keypoints). One of the latest works in action recognition (Tran et al., 2018) introduces a new spatiotemporal convolutional block, R(2+1)D, that achieves state-of-the-art results. In order to analyze and classify the collected dataset, we employ both approaches as baseline models for isolated sign recognition. We have extracted isolated clips of the following signs from the statement-question subset: 'what', 'who', 'which', 'which one', 'when', 'where (direction)', 'where (location)', 'why', 'how', and 'how much'. We divide them into twenty classes, since each of the ten words occurs in both a statement and a question form.

Pose estimation baseline
Our subsets mainly pose classification problems over sequential features. We extract features from each video frame using the OpenPose library (Cao et al., 2017; Wei et al., 2016) and then feed them to the classification algorithm. We use a classical machine learning technique, namely logistic regression, and concatenate the sequence of per-frame keypoints into one sample. The sequence of keyframes holds the frames of each sign video. Since we aim to measure the contribution of non-manual features, we prepared two conditions: manual features only, and manual and non-manual features combined. Consequently, in the first case one datapoint consists of the concatenated keypoints of one video and has a maximum of 30 frames * 84 keypoint coordinates = 2520 manual-only features (42 hand keypoints with an x and a y value each), while in the second case one datapoint has a maximum of 30 frames * 274 keypoint coordinates = 8220 manual and non-manual features (137 body, hand, and face keypoints with an x and a y value each). In both conditions the task is classification into the twenty classes. We used the scikit-learn library for Python to classify the keypoints in the experiments presented in this paper.
In this paper, we employ the R(2+1)D model (Ghadiyaram et al., 2019), which is highly accurate and significantly faster than other approaches. It is additionally pre-trained on over 65 million videos. Also, it uses only video frames as input, which makes it faster compared to other approaches that require optical flow fields as additional input. In order to recognize signs from our dataset, we fine-tuned R(2+1)D on the statement-question subset. Since our subset has a different number of classes, only the last fully connected layer of the model is re-trained.
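The feature dimensions quoted for the two conditions can be checked with a short sketch of the concatenation step. This is an illustration under stated assumptions rather than the authors' implementation: the function name `video_to_sample` is hypothetical, and zero-padding of videos shorter than 30 frames and truncation of longer ones are assumptions, since the text only states a maximum of 30 frames.

```python
MAX_FRAMES = 30  # maximum number of frames per sign video, as stated in the text


def video_to_sample(frames, values_per_frame, max_frames=MAX_FRAMES):
    """Concatenate per-frame coordinate lists into one fixed-length sample.

    frames: list of flat per-frame lists [x0, y0, x1, y1, ...].
    Assumption: longer videos are truncated and shorter ones are
    zero-padded so that every sample has the same length.
    """
    sample = []
    for frame in frames[:max_frames]:
        if len(frame) != values_per_frame:
            raise ValueError("unexpected number of values in a frame")
        sample.extend(frame)
    sample.extend([0.0] * (max_frames * values_per_frame - len(sample)))
    return sample


# Manual only: 42 hand keypoints * 2 coordinates = 84 values per frame.
manual_sample = video_to_sample([[0.5] * 84 for _ in range(12)], 84)
# Manual and non-manual: 137 keypoints * 2 coordinates = 274 values per frame.
full_sample = video_to_sample([[0.5] * 274 for _ in range(12)], 274)
```

Samples built this way have 2520 and 8220 values respectively, matching the two conditions, and can be stacked into a matrix for a scikit-learn classifier.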

Implementation details
The action recognition baseline is implemented in PyTorch (Paszke et al., 2019) and uses a pre-trained R(2+1)D model (Ghadiyaram et al., 2019). The model input size (number of consecutive frames) is set to 8 and the batch size is 16. We train the model for 20 epochs with a starting learning rate of 0.0001. All frames are scaled to a resolution of 112 × 112 while keeping the original aspect ratio. In addition, during training frames are randomly cropped with a scale between 0.6 and 1. The pose estimation baseline is implemented using the scikit-learn library (Pedregosa et al., 2011) and takes as input a sequence of keypoints extracted using the OpenPose library (Cao et al., 2017; Wei et al., 2016). We train a logistic regression classifier using the 'lbfgs' solver and an L2 penalty.

Suggested Train-Test Splits
As stated in Table 2, each subset has 5 signers, each of whom was assigned an approximately equal number of videos. The only exception is the Emotional Question-Statement subset, which has 10 signers. We assign all videos performed by 4 signers to the train set and the videos of the remaining signer to the test set. In addition, we choose the remaining signer randomly for each class, to diversify the train and test data. The validation set is randomly chosen from the train set and contains 20% of it.
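The split procedure can be sketched as follows. This is a minimal illustration, not the released split script: the function name `split_by_signer`, the tuple layout of a video record, and the fixed seed are assumptions, and the validation set is taken as 20% of the videos left after removing the test signers.

```python
import random
from collections import defaultdict


def split_by_signer(videos, val_fraction=0.2, seed=0):
    """Hold out one randomly chosen signer per class for testing.

    videos: list of (video_id, signer_id, class_label) tuples
    (a hypothetical record layout). Returns (train, val, test).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for video in videos:
        by_class[video[2]].append(video)

    train_pool, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        signers = sorted({signer for _, signer, _ in items})
        held_out = rng.choice(signers)  # test signer for this class
        for video in items:
            (test if video[1] == held_out else train_pool).append(video)

    # The validation set is a random 20% of the remaining training videos.
    rng.shuffle(train_pool)
    n_val = int(len(train_pool) * val_fraction)
    return train_pool[n_val:], train_pool[:n_val], test
```

Because the held-out signer is drawn per class, different classes may be tested on different signers, while no (class, signer) combination appears in both the train and test sets.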

Data augmentation
The main problem in developing sign language recognition algorithms is that the data is usually not large and/or diverse enough for generalization. Thus, we suggest a simple method to generate augmented fixed-length image sequences from videos with a variable number of frames. The only constraint is that a video has to be at least as long as the chosen fixed length.
Given a sign video $V = (f_1, f_2, \dots, f_m)$ that contains $m$ frames and satisfies the condition $m \geq n$, where $n$ is the chosen fixed sequence length, we pick equally spaced frames from the video, starting from a random initial frame. By the distance between frames we mean the difference between their indexes, denoted $s$; the bounds given below imply $s = \lfloor m/n \rfloor$.
The initial frame is picked among all possible candidates, which are the first $s$ frames together with the $k$ leftover frames after them. Here, $k = m \bmod n$, so that $m = ns + k$. Therefore, the augmented fixed-size sequence is $S = (f_i, f_{i+s}, f_{i+2s}, \dots, f_{i+(n-1)s})$, where $i$ is a random integer from 1 to $s + k$. The sequence contains exactly $n$ frames, and its last index satisfies $i + (n-1)s \leq ns + k = m$.
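The sampling rule can be sketched in a few lines. This is an illustrative sketch that assumes the stride is s = floor(m / n), which is what the stated range of the initial index implies; the function name `sample_frame_indices` is hypothetical, and indices are 1-based to match the notation above.

```python
import random


def sample_frame_indices(m, n, rng=random):
    """Pick n equally spaced 1-based frame indices from a video of m frames.

    Assumes stride s = m // n and leftover k = m % n; the initial index i
    is drawn uniformly from 1..s+k, so the last index never exceeds m.
    """
    if n <= 0 or m < n:
        raise ValueError("the video must contain at least n frames")
    s = m // n           # distance between consecutive picked frames
    k = m % n            # leftover frames
    i = rng.randint(1, s + k)  # randint is inclusive on both ends
    return [i + j * s for j in range(n)]


# One video with 47 frames yields several different 8-frame sequences.
rng = random.Random(0)
augmented = [sample_frame_indices(47, 8, rng) for _ in range(5)]
```

Each call returns one of s + k possible sequences, so a single video contributes up to s + k distinct training samples of length n.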

Experimental Results
A series of experiments was conducted to investigate whether non-manual features improve recognition accuracy. All experiments were performed on isolated signs extracted from the Question-Statement subset and divided into 20 classes (10 signs, each in statement and question form). The first experiment was the classification of the 20 classes. For this purpose we trained two baseline models: a logistic regression model, with either manual features only or manual and non-manual features as input, and an R(2+1)D model with full frames as input. Evaluation of each model was repeated 10 times with random train/test splits to avoid extreme cases. Table 3 presents the mean scores and standard deviations for the first experiment. The second experiment used the same dataset with 20 classes to compare the accuracy obtained with different combinations of non-manual components.

Question vs. Statement
Our first experiment used the Question-Statement subset divided into 20 classes (10 signs used in statements and questions). We extracted manual and non-manual features for the isolated signs of the Question-Statement subset. The highest accuracy, 86%, was achieved by the R(2+1)D model, which is 9 percentage points higher than the logistic regression model. For the logistic regression model trained on sequences of keypoints, the mean testing accuracy scores are 73.4% and 77% on manual-only and on combined manual and non-manual features, respectively. As expected, non-manual features improved the results, by 3.6 percentage points on average (from 73.4% accuracy to 77% accuracy). At the same time, the improvement was not very high. The reason could be that the number of non-manual features is larger than the number of manual features.

A case of combining different modalities
In this experiment, different combinations of non-manual markers (eyebrow and head position vs. mouthing) were compared and their role in recognition was analyzed. The lowest testing accuracy was 73.25%, for the combination of manual features and eyebrow keypoints. Eyebrows without any other non-manual feature did not provide valuable information for recognition. Only when they were used in combination with other features was the accuracy improved. The highest testing accuracy was 78.2%, for the combination of manual features with faceline, eyebrow, and mouth keypoints. When only mouth keypoints were used in combination with the manual features, the accuracy also increased, by 0.5 percentage points compared to the baseline of 77%. Thus, we see that mouthing provides extra information that can be used in recognition, because signers usually articulate words while performing the corresponding signs. Eyebrows and head position provide additional grammatical markers to differentiate statements from questions.
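Selecting such keypoint groups from the 70 face keypoints could look like the sketch below. The index ranges are an assumption: they follow the common 68-point facial landmark layout (jawline 0 to 16, eyebrows 17 to 26, mouth 48 to 67) that OpenPose's face model is understood to use, with two additional pupil points, and the mapping of "faceline" to the jawline is also an assumption. The names `FACE_GROUPS` and `select_face_values` are hypothetical.

```python
# Assumed index ranges within the 70 face keypoints (68-point layout + pupils).
FACE_GROUPS = {
    "faceline": range(0, 17),   # assumption: jawline contour
    "eyebrows": range(17, 27),
    "mouth": range(48, 68),
}


def select_face_values(face_xy, groups):
    """Keep only the (x, y) values of the requested face keypoint groups.

    face_xy: flat list [x0, y0, ..., x69, y69] for one frame.
    """
    if len(face_xy) != 140:
        raise ValueError("expected 70 face keypoints with x and y values")
    selected = []
    for group in groups:
        for idx in FACE_GROUPS[group]:
            selected.extend(face_xy[2 * idx:2 * idx + 2])
    return selected
```

The selected values can then be appended to the 84 manual values of each frame to form one feature combination per experiment.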

Conclusion
This paper presents the K-RSL dataset, motivated by the need to create SL datasets for interdisciplinary purposes, e.g. for computer vision and computational linguistics research. Given the challenging nature of SLR, the linguistically rich K-RSL dataset aims to attract the attention of the computer vision community. The data was carefully selected to cover various cases in which manual gestures alone do not provide good performance, stressing the need to take non-manual components into consideration. In addition to the computer vision community, this dataset can be utilized by the linguistics community to explore research questions and computationally test their hypotheses. Future work will include expanding the vocabulary of the corpus, as well as diversifying and increasing the number of signers and recording in noisy environmental conditions (e.g. outside of the office environment).