The OSU/Facebook Realizer for SRST 2019: Seq2Seq Inflection and Serialized Tree2Tree Linearization

We describe our exploratory system for the shallow surface realization task, which combines morphological inflection using character sequence-to-sequence models with a baseline linearizer that implements a tree-to-tree model using sequence-to-sequence models on serialized trees. Results for morphological inflection were competitive across languages. Due to time constraints, we could only submit complete results (including linearization) for English. Preliminary linearization results were decent, with a small benefit from reranking to prefer valid output trees, but inadequate control over the words in the output led to poor quality on longer sentences.


Introduction
With our entry in the shallow surface realization shared task, we aimed to (1) implement an up-to-date morphological inflection model based on the approach of Faruqui et al. (2016) and Kann and Schütze (2016), and (2) conduct exploratory experiments with linearization using the constrained decoding approach of Balakrishnan et al. (2019), adapted to dependency trees.
Our system is a pipeline that begins by generating inflected wordforms from uninflected terminals in the tree using character seq2seq models. We then serialize these inflected syntactic trees as constituent trees by converting the relations to non-terminals. The serialized constituent trees are fed to seq2seq models (including models with copy and with tree-LSTM encoders), whose outputs also contain tokens marking the tree structure. We obtain n-best outputs for orderings and choose the highest-confidence output sequence with a valid tree (i.e., one where the input and output trees are isomorphic up to sibling order) in order to obtain a projective linearization where possible, given that the vast majority of gold linearizations are projective. While we found that this validity checking step provided a small benefit, fully adapting the constrained decoding approach to dependency trees would have required adding a step to ensure that all and only the input words appeared in the output tree, and enforcing these constraints during beam search. Due to time constraints, however, we were only able to obtain preliminary linearization results for English without these word-level checks.
Development results for morphological inflection were competitive across languages as compared to previous implementations (King and White, 2018; Puzikov and Gurevych, 2018). With linearization, the preliminary results were decent, but showed substantial degradation for longer sentences, where problems with lack of control over the output words became more severe.
In the rest of the paper, we describe our inflection and linearization components in more detail, along with our experimental results.

Inflection
Our pipeline begins by producing fully inflected word forms from the citation forms provided in the UD input. In a sense, at this stage, the system has to be able to perform the wug test (Berko, 1958): having never seen a word before, we need to be able to produce the correct form for a given paradigm cell. We use sequence-to-sequence models (Bahdanau et al., 2014) in keeping with previous successful approaches (Kann and Schütze, 2016; Faruqui et al., 2016; King and White, 2018). Additionally, we reimplemented Kann and Schütze's 2016 approach in PyTorch. We also follow Kann and Schütze's approach by training our inflection model at the language level rather than at the level of individual paradigm cells as originally proposed by Faruqui et al. More formally, our LSTMs (Hochreiter and Schmidhuber, 1997) create an encoding whose hidden state $h_t = f(x_t, h_{t-1})$ depends on the input $x_t$, the hidden state from the previous time step $h_{t-1}$, and a nonlinear function $f$, while the context $c$ summarizes all previous time steps. Additionally, since we use bidirectional LSTMs, we set $h_j$ to be the concatenation of the forward and backward encodings.
During inference (i.e. decoding), the output $y_t$ depends on the input sequence and on previous inference steps. We also use the same attention as described by Bahdanau et al. and Kann and Schütze: the context vector $c_i = \sum_j \alpha_{ij} h_j$ is a weighted sum of the encoder states, with weights $\alpha_{ij}$ given by a softmax over alignment scores $e_{ij} = a(s_{i-1}, h_j)$ between the previous decoder state and each encoder state.

As seen in Table 1, uncased results are almost always higher than cased ones. This should not surprise us: operating at the word-internal level, any sequence-to-sequence model has no access to syntagmatic information beyond what UD encodes in the morphosyntactic feature sets. Also, Arabic, Hindi, and Japanese do not have cased orthography and therefore show no difference between their cased and uncased accuracies.
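To make this concrete, the following is a minimal PyTorch sketch of a bidirectional character encoder with additive attention in the style just described; module names, dimensions, and hyperparameters are illustrative assumptions rather than our exact implementation.

import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    # Bidirectional LSTM over characters plus morphosyntactic feature tokens.
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)

    def forward(self, char_ids):                     # (batch, src_len)
        states, _ = self.lstm(self.embed(char_ids))  # h_j = [forward; backward]
        return states                                # (batch, src_len, 2 * hid_dim)

class AdditiveAttention(nn.Module):
    # e_ij = v^T tanh(W s_{i-1} + U h_j); alpha_ij = softmax_j(e_ij); c_i = sum_j alpha_ij h_j
    def __init__(self, dec_dim, enc_dim, att_dim=128):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)
        self.U = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):        # (batch, dec_dim), (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W(dec_state).unsqueeze(1) + self.U(enc_states)))
        alpha = torch.softmax(scores, dim=1)          # attention weights over source positions
        context = (alpha * enc_states).sum(dim=1)     # context vector c_i
        return context, alpha

The decoder LSTM (omitted here) consumes the previous output character, its own state, and the context vector at each step.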
As for feature sets, we include the same set as described by King and White. In addition to the morphosyntactic features provided by the UD schema, we also use the POS tag and dependency relation name as input to the system. Differing from previous shared tasks (Cotterell et al., 2016, 2017), we do not alter the token frequencies. In traditional SIGMORPHON inflection tasks, each system sees a word form only once per epoch. We found that this causes the system to miss irregulars: since irregular forms tend to occur with higher frequency, allowing the system to see more examples during each epoch increased performance on irregular forms. We also found that adding a rule specifically designed to account for the "to be" paradigm raises English accuracy another 0.6% to 98.5%. Finally, for Korean and Chinese, we simply write rule sets for their morphology. The Korean dataset exclusively uses concatenation: the input forms list lexical items and their corresponding affixes, in order, and simply removing the morpheme boundary token (a "+") yielded 100% accuracy. For Chinese, the plural marker " " (men) only ever occurred with " " (rén, "person"), " " (wǒ, "I"), " " (tā, "it" [animals]), " " (tā, "it" [inanimate]), " " (tā, "she"), and " " (tā, "he"). Writing a rule that adds " " when any of these characters co-occurs with the Num=Plur feature also gives us 100% accuracy for Chinese.
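For these rule-based languages, the rules are trivial to state in code. The sketch below illustrates the Korean case (the function name is illustrative); the Chinese rule is analogous, appending the plural marker to the listed words when the plural feature is present.

def inflect_korean(segmented_lemma: str) -> str:
    # Korean inputs list the stem and its affixes in order, joined by "+"
    # boundary markers; the surface form is simply their concatenation.
    return segmented_lemma.replace("+", "")

# e.g. inflect_korean("stem+affix1+affix2") -> "stemaffix1affix2"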

Linearization
To help assess the potential of using tree-to-tree models with constrained decoding (Balakrishnan et al., 2019) for linearization and to guide future work in this direction, we conducted exploratory experiments using off-the-shelf sequence-to-sequence models where the input and output trees are represented as sequences using non-terminal tokens corresponding to dependency relations. In these serialized trees, each non-terminal token is followed by the inflected form, its dependents, and finally a closing bracket indicating the end of the non-terminal's span; Figure 2 shows an example of serialized inputs and outputs.
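A minimal sketch of this serialization scheme follows; the exact token spellings (e.g. "(nsubj" and ")") are illustrative assumptions rather than our exact inventory.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    form: str                      # inflected word form
    deprel: str                    # dependency relation to the head
    children: List["Node"] = field(default_factory=list)

def serialize(node: Node) -> List[str]:
    # Opening non-terminal for the relation, then the form, then the
    # dependents in input order, then a closing bracket ending the span.
    tokens = ["(" + node.deprel, node.form]
    for child in node.children:
        tokens.extend(serialize(child))
    tokens.append(")")
    return tokens

# serialize(Node("barked", "root", [Node("dog", "nsubj", [Node("the", "det")])]))
# -> ['(root', 'barked', '(nsubj', 'dog', '(det', 'the', ')', ')', ')']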
We experimented with three different variants of sequence-to-sequence models:

Seq2Seq: A simple encoder-decoder model with attention (Bahdanau et al., 2014). Both the encoder and decoder are LSTMs.
Tree2Seq: Similar to Seq2Seq, but we use a variant of the N-ary tree-LSTM (Tai et al., 2015) as the encoder, as described in Balakrishnan et al. (2019), thereby potentially taking better advantage of the input tree structure.
Seq2Seq-Copy: A Seq2Seq model with a pointer-generator mechanism (See et al., 2017) for copying tokens from the input. The decoder can choose to either generate a word from the vocabulary or copy an input token instead. We did not have an off-the-shelf implementation for a Tree2Seq-Copy model, though our experiments suggest it would be worth developing one.
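The copy mechanism itself amounts to interpolating the decoder's vocabulary distribution with its attention distribution over the input. The sketch below shows this mixing step; tensor names and shapes are assumptions, and extended-vocabulary handling for out-of-vocabulary source tokens is omitted.

import torch

def copy_distribution(p_gen, vocab_dist, attn_dist, src_ids):
    # p_gen:      (batch, 1)        probability of generating from the vocabulary
    # vocab_dist: (batch, vocab)    softmax over the output vocabulary
    # attn_dist:  (batch, src_len)  attention weights over source tokens
    # src_ids:    (batch, src_len)  vocabulary ids of the source tokens
    final = p_gen * vocab_dist
    # Scatter-add the copy probabilities onto the source tokens' vocabulary ids.
    return final.scatter_add(1, src_ids, (1.0 - p_gen) * attn_dist)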
Additionally, we experimented with constrained decoding (Balakrishnan et al., 2019) with each of the above models. Using this method, at each step of beam search we check for and remove candidates whose tree structures deviate from that of the input tree. The constraints include ensuring that a parent node only accepts valid children, and that all its children have been generated before it can accept a closing bracket, thereby helping to ensure a projective realization. However, as noted in the introduction, we did not have time to extend the constraints to ensure that all and only the input words appeared in the output, so we did not expect this method to work as well as we would have liked. As such, we also experimented with reranking an n-best list to select the highest-scoring output with a valid tree (i.e., one that matches the tree of the input, up to sibling ordering).
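A sketch of the validity check used for reranking, assuming the serialization conventions from the sketch above: the serialized output is parsed back into a tree and compared against the input tree with children treated order-insensitively, and the highest-scoring candidate that passes is returned.

def parse(tokens, i=0):
    # Parse ['(rel', 'form', ..., ')'] into a nested (rel, form, children) tuple.
    rel, form = tokens[i][1:], tokens[i + 1]
    i += 2
    children = []
    while tokens[i] != ")":
        child, i = parse(tokens, i)
        children.append(child)
    return (rel, form, children), i + 1

def canonical(tree):
    # Order-insensitive form: sort children recursively so sibling order is ignored.
    rel, form, children = tree
    return (rel, form, tuple(sorted(canonical(c) for c in children)))

def rerank(nbest, input_tokens):
    # nbest is assumed sorted by model score; fall back to the 1-best output.
    target = canonical(parse(input_tokens)[0])
    for candidate in nbest:
        try:
            if canonical(parse(candidate)[0]) == target:
                return candidate
        except IndexError:           # malformed bracketing
            continue
    return nbest[0]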

Results
We picked the approach that gave the best performance on the dev set. We combined samples of all the English training sets, since training on all sets together gave better dev BLEU scores than training on each set individually. Table 2 shows a comparison of the different models that we tried. In the table, the BLEU scores are calculated with the non-terminals included in both input and output sequences, inflating them somewhat relative to regular BLEU scores; gold inflected forms were also used. Table 3 compares the constrained and unconstrained versions of Seq2Seq-Copy (again with non-terminals in the output and gold inflected forms). Since we did not have time to implement word-level constraints, the results are mixed. In the end, we chose the constrained model for datasets where its dev BLEU was higher than that of its unconstrained counterpart. Table 4 shows the gains obtained by validity reranking (again with gold inflected forms); here the scores shown are calculated without non-terminals.
Given our time constraints, we only submitted English results for evaluation. Although we generated inflected forms for all languages in the T1 task, we could only obtain linearization results for English. Our results are decent (with the exception of the en_partut-ud-test dataset), suggesting that the approach may represent a viable starting point for future work. In particular, in the human evaluation results for English in the shared task overview paper (Mille et al., 2019), our system was ranked in the middle group of systems for meaning preservation and in the large group of systems tied for third through twelfth place in readability. Consistent with the human evaluation, the automatic scores for our system (Table 5) were also in the middle of the pack. Note that the test scores are lower than the dev scores at least in part because only the former are calculated with generated inflected forms.

Discussion
Regarding the en_partut-ud-test dataset, our preliminary error analysis seems to indicate that the inflection model overfit the dev set. Although the model produces relatively sane errors on the other test sets, errors on this particular set are much noisier. For example, on another test file the model emits "multichart" as "multichartart"; this kind of error is highly consistent with known problems with the attention mechanism, and in fact Faruqui et al. explicitly feed the lemma into their decoder for this very reason. In contrast, errors from the en_partut-ud-test file are not as interpretable (e.g. "copyright" → "Sropopyright").

Turning to linearization, Seq2Seq-Copy does much better than the other models. We believe this is due to the architectural prior of copying words from the input, as nearly all output words are present in the input (modulo words whose inflected forms are sensitive to adjacent words). Figure 3 shows that BLEU scores decrease significantly as sequence length increases. Figure 4 shows that the number of extra or missing words increases with length, which could explain the drop in BLEU. Such mistakes could perhaps have been avoided by adding word-level constraints to constrained decoding. Other errors are due to picking the correct words but in the incorrect order: out of 366 mismatches on the en-gum dev set, 193 (53%) are cases of mismatched word order with the correct words. Figure 5 shows examples of linearization model predictions. In example 1, the model misses the word "summer" and repeats "olympic" instead; this could potentially be alleviated by constraining the generation of a word based on the number of times it appears in the input. In example 2, the model picks the right set of words but in an order that differs from the gold order. In example 3, the model fails by stuttering, i.e. it repeats the same phrase over and over.
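As a rough illustration of the count-based constraint suggested above (not implemented in the submitted system), each beam hypothesis could track how many times it has emitted each word and refuse to exceed the input's counts:

from collections import Counter

def may_emit(token, emitted: Counter, budget: Counter) -> bool:
    # budget = Counter of the inflected forms in the input tree;
    # emitted = Counter of the non-structural tokens produced so far.
    if token.startswith("(") or token == ")":
        return True                # structural tokens are governed by the tree constraints
    return emitted[token] < budget[token]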

Conclusions and Future Work
Our exploratory experiments show that combining a morphological inflection component with a baseline linearizer achieves decent results. Our pipeline for the shallow surface realization shared task first produces inflected wordforms from lemmas using a character-level sequence-to-sequence model. We then use those forms in serialized trees as input to a tree-to-tree model, which is also implemented using a sequence-to-sequence architecture, yielding serialized trees as output. This allows outputs to be filtered for validity in most cases, enforcing projective outputs. Due to time limitations, we could only submit fully linearized results for English, and we were not able to implement word-level constraints, so we consider these preliminary baseline results. Given our error analysis, in future work it may be fruitful to update the attention mechanism in the inflection model (Aharoni and Goldberg, 2017), and to use a tree encoder plus copy mechanism in the linearizer together with word-level constraints in decoding.