Seeing Both the Forest and the Trees: Multi-head Attention for Joint Classification on Different Compositional Levels

In natural languages, words are used in association to construct sentences. It is not words in isolation, but the appropriate use of hierarchical structures that conveys the meaning of the whole sentence. Neural networks have the ability to capture expressive language features; however, insights into the link between words and sentences are difficult to acquire automatically. In this work, we design a deep neural network architecture that explicitly wires lower and higher linguistic components; we then evaluate its ability to perform the same task at different hierarchical levels. Focusing on broad text classification tasks, we show that our model, MHAL, learns to simultaneously solve them at different levels of granularity by fluidly transferring knowledge between hierarchies. Using a multi-head attention mechanism to tie the representations between single words and full sentences, MHAL systematically outperforms equivalent models that are not incentivized towards developing compositional representations. Moreover, we demonstrate that, with the proposed architecture, the sentence information flows naturally to individual words, allowing the model to behave like a sequence labeler (which is a lower, word-level task) even without any word supervision, in a zero-shot fashion.


Introduction
Compositional reasoning is fundamental in human cognition: we use it to interact with objects, take actions, reason about numbers, and move in space (Spelke and Kinzler, 2007). This is also reflected in some aspects of human language (Wagner et al., 2011; Piantadosi and Aslin, 2016; Sandler, 2018), since we use words and phrases in association to construct sentences. Consequently, there are lower linguistic components that act as building blocks for higher levels.
In this work, we focus on two levels of the compositional hierarchy -words and sentences -and ask the following question: are deep neural networks (DNNs) trained for a higher-level task (i.e., at the sentence level) able to pick up the features of the compounds needed to solve the same task but at a lower level, such as at the word level? Moreover, how are the different hierarchical levels interacting under varying supervision signals? To the best of our knowledge, very few studies have investigated the transferability of a task solution between compositional levels using DNNs in controlled experiments.
It has been shown that neural networks are universal function approximators (Hornik, 1991; Leshno et al., 1993); they can perform arbitrary function combinations to learn expressive features. DNNs trained for language tasks are not an exception to this rule, and recent studies have shown their power in extracting linguistically-rich representations (Mikolov et al., 2013; Devlin et al., 2018). However, when trained end-to-end, learning the connection between the different compositional levels is not trivial for these models. This is in part due to the vast syntactic and semantic complexity of natural language. There are also data limitations on most tasks, resulting in networks picking up the noise and biases of the datasets. Crucially, DNNs trained to solve a task at a higher hierarchical level are usually treated as black boxes with respect to the lower levels.
We propose a novel DNN design that stimulates the development of hierarchical connections. The architecture is based on a multi-head attention mechanism that ties the representations between single words and full sentences in a way that enables them to reinforce each other. The proposed multi-level architecture can be viewed as a sentence classifier, where each customized attention head is guided to behave like a sequence labeler, detecting one particular label on each token. Thus, it can simultaneously solve language tasks that are situated at different levels of granularity. Based on experiments, this architecture systematically outperforms equivalent models that focus only on one level. The token-level supervision explicitly teaches the classifier which areas it needs to focus on in each sentence, while the sentence-level objective provides a regularizing effect and encourages the model to return coherent sequence labeling predictions. Moreover, we show that the sentence-level information flows naturally to individual words, allowing the model to behave like a sequence labeler even when it does not receive any word-level supervision. Our model exhibits strong transfer capabilities, which we validated on three different tasks: sentiment analysis, named entity recognition, and grammatical error detection.

Multi-head attention labeling (MHAL)
We describe an architecture that directly ties together the sentence and word representations for multiclass classification, incentivizing the model to make better use of the information on each level of granularity. In addition, we present several auxiliary objectives that guide this architecture towards useful hierarchical representations and better performance.

Architecture
Our model is based on a bidirectional long short-term memory (BiLSTM) that builds contextual vector representations for each word. These vectors are then passed through a multi-head attention mechanism, which predicts label distributions for both individual words and the whole sentence. Each attention head is incentivized to be predictive of a particular label, allowing the system to also assign labels to individual words while composing a sentence-level representation for sentence classification.
The network takes as input a tokenized sentence of length N and maps it to a sequence of vectors [x 1 , x 2 , ..., x N ]. Each vector x i , corresponding to the i th token in a sentence, is the concatenation of its pre-trained GloVe word embedding w i (Pennington et al., 2014) with its character-level representation c i , similar to Lample et al. (2016). Passing each vector x i to a BiLSTM (Graves and Schmidhuber, 2005), we obtain compact token representations z i by concatenating the hidden states from each direction at every time step and projecting these onto a joint feature space using a tanh activation (Equations 1-3). This is followed by a multi-head attention mechanism with H heads (Vaswani et al., 2017). By setting H equal to the size of our token-level tagset, we can create a direct one-to-one correspondence between attention heads and possible token labels -attention head h ∈ {1, 2, ..., H} gets assigned to the h-th token-level label. For each attention head we calculate keys, queries and values at every word position through a non-linear projection of z i (Equations 4-6). All the queries for a given attention head are then combined into a single vector through averaging, which will represent a query for the corresponding token-level label in the context of the given sentence (Equation 7).
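As a minimal sketch of the per-head projections and the query averaging of Equation 7: the weights below are random placeholders standing in for the learned matrices W_kh, W_qh, W_vh, the dimensions are toy-sized, and a tanh non-linearity is assumed for the "non-linear projection" mentioned above.

```python
import math
import random

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def affine(W, v, b):
    # W @ v + b for a list-of-rows matrix W
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

random.seed(0)
N, D, H = 4, 6, 3  # toy sizes: N tokens, D-dim states, H heads (= token labels)

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

# z[i]: BiLSTM output for token i (random placeholder here)
z = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

keys, queries, values = [], [], []
for h in range(H):
    Wk, Wq, Wv = rand_mat(D, D), rand_mat(D, D), rand_mat(D, D)
    b = [0.0] * D
    keys.append([tanh_vec(affine(Wk, zi, b)) for zi in z])     # keys (Eq. 4)
    queries.append([tanh_vec(affine(Wq, zi, b)) for zi in z])  # queries (Eq. 5)
    values.append([tanh_vec(affine(Wv, zi, b)) for zi in z])   # values (Eq. 6)

# Average the position-wise queries of each head into a single label query (Eq. 7).
q = [[sum(col) / N for col in zip(*queries[h])] for h in range(H)]
```

The averaged query q[h] acts as a sentence-contextualized probe for the h-th token label.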
where →z_i and ←z_i are the LSTM hidden states in either direction; W_z, W_kh, W_qh, and W_vh are weight matrices; and b_z, b_kh, b_qh, and b_vh are bias vectors.
The unnormalized attention scores a_ih ∈ R are then calculated through a dot product between the query and the associated key for a particular token in position i (Equation 8). Given the established correspondence between attention heads and token labels, this score now represents the model's confidence that the token in position i has label h. Therefore, we can predict the probability distribution over the token-level labels by normalizing a_ih with a softmax function (Equation 9). By concatenating the scores and normalizing them across the heads h, we get t̃_i ∈ R^H, which we use as the token-level output from the model, both for optimization and for evaluation as a token-level tagger.

Figure 1: Illustration of the MHAL architecture for one head h only. We present the computations performed for the i-th word in a sentence, mapped to its vector representation x_i.
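The score computation of Equations 8 and 9 can be sketched as follows; the keys k[h][i] and averaged queries q[h] are random placeholders for the projections described earlier.

```python
import math
import random

random.seed(0)
N, D, H = 4, 6, 3  # toy sizes: N tokens, D-dim vectors, H heads/labels
k = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(H)]
q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# a[i][h]: model confidence that the token in position i carries label h (Eq. 8).
a = [[dot(q[h], k[h][i]) for h in range(H)] for i in range(N)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# t[i]: probability distribution over token labels, normalized across heads (Eq. 9).
t = [softmax(row) for row in a]
```

Each row t[i] is directly usable as a sequence labeling prediction for token i.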
Next, we use the same attention scores a ih to construct sentence-level representations of the input. We apply a sigmoid activation (σ) and normalization to produce the normalized attention weights α ih ∈ [0, 1] (Equation 10). Standard attention functions use a softmax activation, which is best suited for assigning most of the attention weight to a single token and effectively ignoring the rest. However, it is often necessary that higher-level representations pay attention to many different locations, or, in our case, to multiple tokens in a given sentence. By using the sigmoid instead of softmax, similar to Shen and Lee (2016), the model will need to make separate decisions for each token, which in turn encourages the attention scores to behave more similarly to sequence labeling predictions.
A sentence-level representation s h is obtained as the weighted sum over all the value vectors in the sentence (Equation 11). This is followed by two feed-forward layers: the first one is non-linear and projects the sentence representation onto a smaller feature space, while the second one is linear and outputs a scalar sentence-level score o h ∈ R for each head h (Equation 12).
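The sentence-composition step for a single head can be sketched as below, again with random placeholder scores, values, and feed-forward weights; the tanh in the first feed-forward layer is an assumption about the non-linearity.

```python
import math
import random

random.seed(1)
N, D = 4, 6
a_h = [random.gauss(0, 1) for _ in range(N)]  # attention scores of one head (placeholders)
v_h = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]  # value vectors

# Sigmoid activation followed by normalization over the sentence (Eq. 10).
sig = [1.0 / (1.0 + math.exp(-x)) for x in a_h]
total = sum(sig)
alpha = [s / total for s in sig]

# Sentence representation: weighted sum of the value vectors (Eq. 11).
s_h = [sum(alpha[i] * v_h[i][d] for i in range(N)) for d in range(D)]

# Two feed-forward layers: a tanh projection to a smaller space,
# then a linear layer producing the scalar sentence score o_h (Eq. 12).
W1 = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(D // 2)]
w2 = [random.gauss(0, 0.1) for _ in range(D // 2)]
hidden = [math.tanh(sum(w * s for w, s in zip(row, s_h))) for row in W1]
o_h = sum(w * h for w, h in zip(w2, hidden))
```

Unlike a softmax, the sigmoid lets several tokens receive high weight simultaneously before normalization.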
To make a sentence prediction, the sentence scores need to be collected across all heads. The challenge arises because these H scores (equal to the number of token labels) have to map to the number of sentence labels S, which are not always in direct correspondence. To solve this, we use the fact that many text classification tasks have a default label that is common between the token and the sentence label sets; for example, the neutral label for sentiment analysis or the no-named-entity label for NER. In our datasets, only two situations arise:

1. H = S: Each sentence label has a corresponding word-level tag, and thus one head associated with it. Therefore, we can directly concatenate the sentence scores across all heads into a vector õ = [o_1; o_2; ...; o_H]. An example of such a task is sentiment analysis, as the possible labels (positive, negative, and neutral) are the same for both sentences and tokens.
2. H ≠ S and S = 2: The sentence labels are binary, while the token labels are multi-class, so an appropriate correspondence between the heads and the two sentence labels needs to be found. We concatenate the score obtained for the default head (o_d) with the maximum score across the non-default heads. Named entity recognition is an example of such a task: while there are many possible tags on the token level, we only detect the binary presence of any named entities on the sentence level.
A probability distribution ỹ ∈ R^S over the sentence labels is obtained by applying a softmax to the extracted scores: ỹ = softmax(õ). The most probable label is returned as the sentence-level prediction.
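The mapping from head scores to a sentence label distribution can be sketched as follows. The handling of the binary case (S = 2), pairing the default head's score with the best non-default score, is one plausible reading of the correspondence described above, so treat it as illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_distribution(o, num_sentence_labels, default_head=0):
    """o[h]: sentence score of head h; returns a distribution over sentence labels."""
    H, S = len(o), num_sentence_labels
    if H == S:
        # e.g. sentiment analysis: heads and sentence labels correspond one-to-one
        scores = o
    elif S == 2:
        # e.g. NER: default head vs. best-scoring non-default head (illustrative pairing)
        non_default = max(s for h, s in enumerate(o) if h != default_head)
        scores = [o[default_head], non_default]
    else:
        raise ValueError("unsupported head/sentence-label correspondence")
    return softmax(scores)
```

For instance, with scores [0.1, 2.0, 1.5] and binary sentence labels, the non-default class receives most of the probability mass.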
In our model, the sentence-level scores are directly intertwined with the token-level predictions in order to improve performance on both levels. The attention weights explicitly use the same predicted scores as the token-level output. Therefore, when the model learns to detect specific types of tokens, it will also assign more importance to those tokens in the corresponding attention heads. At the same time, when the model learns to attend to particular tokens for sentence classification, this will also help in identifying the correct labels for the token-level tagging task. By joining the two tasks, the architecture is able to share the information on both levels of granularity and achieve better results. In addition, this allows us to explicitly teach the model to focus on the same evidence as humans when performing text classification, leading to more explainable systems.
In Figure 1, we illustrate how this architecture, which we refer to as the multi-head attention labeler (MHAL), is applied on one input word to compute one attention head.

Optimization objectives
Our model can be optimized both as a sentence classifier and as a sequence labeler using a cross-entropy loss. Both L_sent and L_tok minimize the negative log likelihood between the predicted sentence (or token) label distribution and the gold annotation:

L_sent = − Σ_j y_j log(ỹ_j)        L_tok = − Σ_i Σ_j y_ij log(t̃_ij)

where y_j and y_ij are binary indicator variables specifying whether sentence s truly has label j and whether the token at position i in sentence s truly has tag j, respectively.
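Both objectives reduce to standard cross-entropy over predicted distributions; a minimal sketch with made-up predictions:

```python
import math

def cross_entropy(gold_index, predicted):
    # Negative log likelihood of the gold label under the predicted distribution.
    return -math.log(predicted[gold_index])

# Sentence loss: one distribution per sentence.
sent_pred = [0.7, 0.2, 0.1]   # predicted sentence label distribution (toy values)
L_sent = cross_entropy(0, sent_pred)

# Token loss: sum over every position in the sentence.
tok_preds = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]  # one distribution per token
gold_tags = [0, 1]
L_tok = sum(cross_entropy(g, p) for g, p in zip(gold_tags, tok_preds))
```

In training, both terms are averaged over the batch and combined with the auxiliary objectives described below.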
Recall that the sentence label distribution is based on the attention evidence scores, which represent, in turn, the token scores used for word-level classifications. If we train our model solely as a sentence classifier (by providing only sentence-level annotations), the network will also learn to label individual tokens. As all the parameters used by the token labeling component are also part of the sentence classifier, they will be optimized during the sentence-level training. Moreover, the network will learn the important areas of a sentence, combining the scores from individual words to determine the overall sentence label. In this way, our model performs zero-shot sequence labeling, a type of transductive transfer learning (Ruder, 2017). In addition, when both levels receive supervision, the token signal encourages the network to put more weight on the attention heads indicative of the correct labels.
We include an auxiliary attention loss objective, based on Rei and Søgaard (2019), which encourages the model to more closely connect the two labeling tasks on different granularity levels. In its original formulation, the loss could only operate over binary labels, whereas we extend it to general multi-class classification by imposing two conditions on the attention heads:

1. There should be at least one word with the same label as the ground-truth sentence label. Intuitively, most of the focus should be on the words indicative of the sentence type.
2. There should be at least one word that has a default label. Even if the sentence has a non-default class, it should still contain at least one default word.
While these conditions are not true for every text classification task, they are applicable in many settings and hold true for all the datasets that we experimented with. The two conditions can be formulated as a loss function and optimized during training:

L_attn = Σ_s [ (1 − max_i t̃^(s)_{i,l})² + (1 − max_i t̃^(s)_{i,d})² ]

where d is the default label, l is the true sentence label, t̃^(s)_i is the predicted token label distribution for word i in sentence s, and thus t̃^(s)_{i,h} is the predicted probability of word i having label h.

Next, we propose a custom regularization term for the multi-head attention mechanism to motivate the network to learn a truly distinct representation sub-space for each of the query vectors q_h. As opposed to the keys and values, which are associated with different words, the queries q_h encapsulate the essence of a certain tag. Therefore, these vectors need to capture the distinctive features of a particular label and how it differs from other labels. To push the network towards this goal, we introduce the term R_q and calculate it as the average cosine similarity between every pair of queries q_h and q_i, with h ≠ i (Equation 16). R_q penalizes high similarity between any two query vectors and motivates the model to push them apart. Thus, this technique imposes a wider angle between the queries, encouraging the model to learn unique, diverse, and meaningful vector representations for the tags.
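The query regularizer R_q is simple to compute; a sketch averaging cosine similarity over all distinct query pairs:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def query_regularizer(queries):
    # Average cosine similarity over all pairs of distinct query vectors;
    # minimizing this term pushes the per-label queries apart.
    H = len(queries)
    pairs = [(h, i) for h in range(H) for i in range(H) if h < i]
    return sum(cosine(queries[h], queries[i]) for h, i in pairs) / len(pairs)
```

Orthogonal queries yield R_q = 0, while identical (or parallel) queries yield R_q = 1, so the penalty is largest exactly when the label sub-spaces collapse into one another.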
Lastly, we include an auxiliary objective for language modeling (LM) operating over characters and words, following the settings proposed by Rei (2017). The hidden representations from the forward and backward LSTMs are mapped to a new, non-linear space and used to predict the next word in the sequence, from a fixed smaller vocabulary. Recently, many NLP systems using multi-task learning include LM objectives alongside the core task to inject corpus-specific information into the model, as well as syntactic and semantic patterns (Dai and Le, 2015; Peters et al., 2017; Akbik et al., 2018; Marvin and Linzen, 2018). In our case, we include an LM loss to help the model learn general language features. While performing well on language modelling itself is not an objective, we expect it to provide improved biases and language-specific knowledge that would benefit performance.
The final loss function L is a weighted sum of all the objectives described above. Setting particular coefficients λ allows us to investigate the effect of the different components as well as controlling the flow of the supervision signal and the importance of each auxiliary task: L = λ_sent L_sent + λ_tok L_tok + λ_attn L_attn + λ_Rq R_q + λ_LM L_LM.
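The combined objective is a plain weighted sum; the sketch below also shows how zeroing coefficients selects a training regime (the specific λ settings here mirror the model variants discussed in the Experiments section, but the dictionary keys are our own naming).

```python
def total_loss(losses, lambdas):
    # L = λ_sent·L_sent + λ_tok·L_tok + λ_attn·L_attn + λ_Rq·R_q + λ_LM·L_LM
    return sum(lambdas[k] * losses[k] for k in losses)

# Fully supervised (joint) training: both main objectives active.
joint = {"sent": 1.0, "tok": 1.0, "attn": 0.0, "Rq": 0.0, "LM": 0.0}
# Sentence supervision only: the zero-shot sequence labeling setting.
sent_only = {"sent": 1.0, "tok": 0.0, "attn": 0.0, "Rq": 0.0, "LM": 0.0}
```

Because the token labeler shares all its parameters with the sentence classifier, even the sent_only configuration still trains the token-level pathway indirectly.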

Experiments
In this section, our goal is to test whether the proposed joint training architecture is able to 1) transfer knowledge between words and sentences and improve on both text labeling tasks, 2) learn to re-use the supervision signal received on the sentence level to perform a word-level task, and 3) use the auxiliary objectives and regularization loss to improve its performance. We perform three main experiments under different training regimes:

• Fully supervised: full annotations are provided both for sentences and words. The model has all the information needed to perform well at each separate level (i.e., in isolation). However, we are mainly interested in how performance changes as we train two related tasks together: does such a model take advantage of the joint learning regime and the supplemental labeled data?
• Semi-supervised: some supervision signal is provided, but only for a subset of the words, while sentences always receive it in full. Under this setting, we determine the proportion of token annotation that is sufficient for the network to reach as good a performance as the fully supervised one. We check whether the (more instructed) sentence representations can pass unified, reusable knowledge about the entire sentence to its composing words.
• Unsupervised: no word-level annotations are provided, but we test whether the model learns to perform sequence labeling, solely based on the sentence level signal, which is always provided in full (this is called zero-shot sequence labeling). In other words, we train a sentence classifier and evaluate it as a sequence labeler. Under this setting, we aim to assess how much implicit knowledge a model can acquire about a low-level task (on words) solely by being trained on a higher-level task (on sentences). This zero-shot experiment is challenging: supervision signal is solely received at a higher, abstract sentence-level, while the task to be evaluated is at a lower, fine-grained token-level.
If successful, this model will be able to perform sophisticated word-predictions solely based on the considerably cheaper sentence annotations.

Data
To evaluate all the different properties of the model, we focus on three text classification datasets where annotations are available either for both individual words and full sentences, or only for the words, in which case the sentence labels can be inferred. We show some concrete examples in Table 1.

SST: The Stanford Sentiment Treebank (Socher et al., 2013) is a dataset of human-annotated movie reviews used for sentiment analysis. It contains not only sentence annotations for positive (P), negative (N), and neutral (O) reviews but also phrase-level annotations, which we converted to token labels by accounting for the minimum spans of tokens (up to length three) of a certain sentiment. Therefore, on SST we have three labels both at the sentence and at the word level.
CoNLL03: The CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003) for named entity recognition (NER) contains five possible word-level tags: person, organization, location, miscellaneous, and O, which is used for non-named entities. At the sentence level, binary classification labels can be inferred based on whether the sentence contains at least one named entity or none at all (in which case the sentence receives the default label O).
FCE: The First Certificate in English dataset (Yannakoudakis et al., 2011) is used for fine-grained grammatical error detection. Ungrammatical words can contain five possible types of mistake: in content, form, function, orthography, or other. There is also a sixth label, O, for grammatical words. A sentence that contains at least one word-level mistake is ungrammatical overall. Therefore, a binary sentence classification task naturally arises, as sentences can be grammatical (annotated with the default label O) or ungrammatical.
All datasets are already tokenized and split into training, dev, and test sets. In Appendix A, we provide statistics on the corpora (Table 4) and on the annotations available per split and label.

Hyperparameter settings
We chose the best values for our hyperparameters based on the performance on the development set (see Table 6 in Appendix A). We perform each experiment with five different random seeds and report the average results. Following Vaswani et al. (2017), we also applied label smoothing (Szegedy et al., 2016) with ε = 0.15 to increase the robustness to noise and regularize the label predictions during training. As evaluation metrics, we report (depending on the task) the precision (P), accuracy (Acc), and micro-averaged F1 score over all labels and over all non-default labels (the latter denoted by a superscript *), as is common in the multi-task learning literature (Changpinyo et al., 2018; Martínez Alonso and Plank, 2017). For CoNLL03, we use the dedicated CoNLL evaluation script, which calculates F1 on the entity level.
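Label smoothing with ε = 0.15 redistributes a fraction of the probability mass from the gold label to all labels uniformly; a minimal sketch:

```python
def smooth_labels(one_hot, eps=0.15):
    # Label smoothing (Szegedy et al., 2016): keep (1 - eps) of the mass on the
    # gold label and spread eps uniformly over all K labels.
    K = len(one_hot)
    return [(1.0 - eps) * y + eps / K for y in one_hot]

# A gold one-hot target [1, 0, 0] becomes a softened distribution,
# so confident-but-wrong predictions are penalized less sharply.
smoothed = smooth_labels([1.0, 0.0, 0.0])
```

The smoothed targets still sum to 1 and are used in place of the one-hot vectors inside the cross-entropy losses.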

Model variants
We can optimize different variations of the architecture by changing the λ weights in the loss and thereby choosing which components are active. We experiment with the following variations of the model:

• MHAL-joint: Corresponds to the fully supervised experiment, and is optimized both as a sentence classifier and a sequence labeler by setting λ_sent = λ_tok = 1.0, while all the other λ values are 0.0.
• MHAL-sent: The model receives only sentence-level supervision (λ sent = 1.0) while all the other λ values are set to 0.0. No supervision is provided on the token level, which means this model performs zero-shot sequence labeling.
For comparison, we also evaluate two baseline models that do not connect the different hierarchical levels and specialize only on sentence classification or sequence labeling.
• BiLSTM-sent: Following the description and implementation of Yang et al. (2016), we built one of the strongest neural sentence classifiers based on BiLSTMs and soft attention; we tuned the hyper-parameters based on the development set to achieve the best performance on our tasks.
• BiLSTM-tok: Widely-used bidirectional LSTM architecture for sequence labeling, which has been applied to many tasks including part-of-speech tagging (Plank et al., 2016) and named entity recognition (Panchendrarajan and Amaresan, 2018). We also tuned the hyperparameters based on the development set in order to achieve the best results on each of the evaluation datasets.

Results
Fully-supervised: In this setting, we investigate whether training a joint model to solve the task on multiple levels provides a performance improvement over focusing only on one level. Table 2 compares the MHAL joint text classification performance to a BiLSTM attention-based sentence classifier and a BiLSTM sequence labeler. The results show that the multi-task models systematically outperform the single-task models across all tasks and datasets, emphasizing the effectiveness of sharing information between hierarchical levels. While additional annotation is required to train the multi-task models, the same input sentences are used in all cases, indicating that the benefits come directly from the model solving the task on multiple levels, as opposed to just from seeing more data examples. Despite the sentence-level labels for CoNLL03 and FCE having been derived automatically from the existing token-level annotation, they still provide a performance improvement for sequence labeling, further showing the benefit of the multi-level architecture. By teaching the model where to focus at the token level, the architecture is able to make better decisions on the sentence-level classification task. In addition, the sentence-level objective acts as a contextual regularizer and encourages the model to learn better compositional representations, thereby improving performance also on the token-level labeling task. Comparing MHAL-joint against MHAL-joint+, which adds the auxiliary objectives, shows further improvements. The attention loss optimizes the model to make matching predictions on both levels, while the language modeling objective encourages the network to learn more informative word representations and composition functions. We also separately evaluated the regularization term, R_q, and found that it helps more on the sequence labeling tasks.
This implies that using the intermediate per-head queries, the model indeed learns unique sub-space representations that help it assess the uncertainty of each word-tag pair and strengthen its labeling decision.
Semi-supervised: We further experiment with MHAL-joint+, using the supervision signal of all sentences but varying the percentage p of the word-level annotations. In Figure 2, we present the sequence labeling results of the multi-task model in comparison to the single-task BiLSTM-tok, gradually increasing p to allow more tokens to guide learning. We observe that using only 30%-50% of the token-annotated data, our model already approaches the fully-supervised performance of the regular model (BiLSTM-tok-100%), suggesting that the two tasks are positively influencing each other. General, abstract knowledge about the entire sentence meaning fluidly flows down to the words, while fine-grained word-level information is propagated up to the sentence, showing a beneficial transfer in both directions.

Table 2: Results on sentence classification and sequence labeling, comparing MHAL (which solves the tasks simultaneously by joining the two levels) with BiLSTM-sent and BiLSTM-tok, its equivalent single-task models. Note that metrics over the non-default labels are denoted by a superscript *.

Table 3: Zero-shot sequence labeling results.

Zero-shot: In Table 3, we evaluate the architecture as a zero-shot sequence labeler, trained without any token-level supervision signal. In this setting, the model learns to label individual tokens by seeing only examples of sentence-level annotations. Because the multi-level attention is directly wired together with the sequence labeling output, the model is still able to learn in this difficult setting. This experiment also illustrates that the information does indeed flow from the sentence level down to individual tokens in this architecture. As no other model can operate in this setting, we can only compare against a random baseline, in which the labels are assigned uniformly at random from the available set of labels. For datasets where the label distribution is very skewed, this can still be a difficult baseline to beat.
While MHAL-sent only outperforms this baseline on the CoNLL03 dataset, MHAL-sent+ outperforms both on all datasets and metrics. These results show that our auxiliary losses introduce a necessary inductive bias and allow for better transfer of information from the sentence level to the tokens. We visualized the decisions computed inside the attention heads for different example sentences and provide them in Appendix A (Figures 3 and 4). We also observed that the choice of the metric based on which the stopping criterion is selected plays an important and interesting role in our zero-shot experiments (see Appendix B for details).

Related work
Most methods for text classification (hierarchical or not) treat sentence classification and sequence labeling as completely separate tasks (Lample et al., 2016; Huang et al., 2015; Lei et al., 2018; Cui and Zhang, 2019). More recently, training a model end-to-end for more than one language task has become increasingly popular (Yang et al., 2017; Devlin et al., 2018), as well as using auxiliary objectives to inject useful inductive biases (Martínez Alonso and Plank, 2017; Plank et al., 2016; Bingel and Søgaard, 2017). Our work is similar in terms of motivation for the auxiliary objectives and multi-task training procedure. However, instead of learning to perform multiple tasks on the same level, we focus on performing the same task on multiple levels. By carefully designing the network and including specific auxiliary objectives, these levels are able to provide mutually beneficial information to each other. Other hierarchical multi-task systems, such as the models proposed by Hashimoto et al. (2017) and Sanh et al. (2018), solve each task at a different DNN layer, but their formulation does not follow a compositional linguistic motivation.
Our work is most similar to Rei and Søgaard (2019), who described an architecture for supervising attention in a binary text classification setting. Barrett et al. (2018) also used a related model to guide the network to focus on similar areas as humans, based on human gaze recordings. We build on these ideas and describe a more general framework, extending it to both multiclass text classification and multiclass sequence labeling. An important part of our new architecture is based on attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015), and, in particular, on the properties of multi-head attention (Vaswani et al., 2017; Li et al., 2019). Other regularization techniques that explicitly encourage the learning of more diverse attention functions have been proposed by Li et al. (2018), who introduced a disagreement regularization term, and by Niculae and Blondel (2017) and Correia et al. (2019), who proposed sparse attention for increased interpretability.

Conclusion
We investigated a novel neural architecture for natural language representations, which explicitly ties together predictions on multiple levels of granularity. The dynamically calculated weights in a multi-head attention component for composing sentence representations are also connected to token-level predictions, with each attention head focusing on detecting one particular label. This model can then be optimized as either a sentence classifier or a token labeler, or jointly for both tasks, with information being shared between the two levels. Supervision on the token labeling task also teaches the model where to assign more attention when composing sentence representations. In return, the sentence-level objective acts as a contextual regularizer for the token labeler and encourages the model to predict token-level tags that cohere with the rest of the sentence. We also introduce several auxiliary objectives that further incentivize the architecture to share information between the different levels and help this model get the most benefit from the available training data, increasing its efficiency.
We evaluated the proposed architecture on three different text classification tasks: sentiment analysis, named entity recognition, and grammatical error detection. The experiments showed that supervision on both levels of granularity consistently outperformed models that were optimized only on one level. This held true even for cases where the sentence-level labels could be automatically derived from token-level annotation, therefore requiring manually annotated labels only on one level. The auxiliary objectives, designed to connect the predictions between the two tasks more closely, further improved model performance. The semi-supervised experiments showed that this architecture can also be used with partial labeling -with a 50-70% reduction in token-annotated data, the model was able to get comparable results to the baseline architecture using the full dataset. Finally, we presented the first experiments for multi-class zero-shot sequence labeling, where the model needs to label tokens while only learning from the sentence-level annotation. As the architecture connects each attention head to a particular label, it was able to learn even in this challenging setting, with the auxiliary objectives being particularly beneficial. The overall multi-level learning approach also has potential future applications in the area of neural network interpretability, as the model can be trained to focus on the same evidence as human users when classifying text and the resulting token-level decisions can be both measured and visualized.

Appendix B. Stopping criterion: results and discussion
During model training, we measure performance on the development set and apply one of two stopping criteria: 1. the sentence-level classification performance (S-F*_1µ), adopted by all models that do not receive any token-level annotation, such as MHAL-sent; 2. the token-level classification performance (F*_1µ), adopted by all models that receive some token annotation, such as MHAL-joint. We observed that, even in the case of MHAL-sent, stopping based on the token performance improves the word-level predictions at test time, but usually hurts the sentence predictions. However, as suggested by the results in Table 7, stopping based on the average of these two metrics generally improves both the token and the sentence predictions.
The network usually takes more time to reach the common optimal point when we include the token-based stopping criterion. Sentence classification converges faster than sequence labeling: being predicted at a higher layer in the network hierarchy, it accumulates more information and thus builds solid abstractions, while having fewer unique instances to learn from. For these reasons, the network falls into a local minimum when guided by the sentence-level performance. However, choosing tokens as a stopping criterion requires annotated development data, which would not comply with the framing of our zero-shot learning experiment. Nevertheless, reporting this finding emphasizes that the stopping criterion requires careful consideration: it is responsible for choosing the best performing model used during testing and for driving the application of the learning rate decay. Several performance percentage points could be gained by carefully selecting the stopping metric.

Table 6 (excerpt): stopping criterion (the development metric used as the stopping criterion); optimization algorithm: AdaDelta; initializer: Glorot (method for random initialization).