Chinese Pinyin Aided IME, Input What You Have Not Keystroked Yet

A Chinese pinyin input method engine (IME) converts pinyin into characters so that Chinese characters can be conveniently entered into a computer through a common keyboard. IMEs rely on their core component, pinyin-to-character conversion (P2C). Usually, a Chinese IME simply predicts a list of character sequences for the user to choose from, based only on the user's pinyin input at each turn. However, Chinese input is a multi-turn online procedure, which can be exploited to further improve user experience. This paper thus introduces, for the first time, a sequence-to-sequence model with a gated-attention mechanism for the core task in IMEs. The proposed neural P2C model is learned by encoding the previous input utterance as extra context, enabling our IME to predict character sequences from incomplete pinyin input. Our model is evaluated on different benchmark datasets and shows great user experience improvement compared to traditional models, which demonstrates the first engineering practice of building a Chinese aided IME.


Introduction
Pinyin is the official romanization of Chinese, and P2C, which converts the input pinyin sequence into a Chinese character sequence, is the most basic module of all pinyin-based IMEs.
Most previous research on IMEs (Chen, 2003; Zhang et al., 2006; Lin and Zhang, 2008; Chen and Lee, 2000; Jiang et al., 2007; Cai et al., 2017a) focused on the matching correspondence between pinyin syllables and Chinese characters. (Yang et al., 2012; Jia and Zhao, 2014) regarded P2C as a translation between two languages and solved it in a statistical or neural machine translation framework. The fundamental difference between the neural work in that line and ours is that ours is a fully end-to-end neural IME model with extra attention enhancement, while the former still works on a traditional IME with only a converted neural network language model enhancement. (Zhang et al., 2017) introduced an online algorithm to construct an appropriate dictionary for P2C. All of the above work, however, still relies on a complete input pattern, and IME users have to enter a very long pinyin sequence to guarantee the accuracy of the P2C module, as a longer pinyin sequence is less ambiguous to decode.
A Chinese IME is supposed to let the user enter Chinese characters at the least input cost, i.e., with the fewest keystrokes, so predicting extra content from incomplete input would be extremely welcome to all IME users. (Huang et al., 2015) partially realized such extra prediction by applying maximum suffix matching over the vocabulary as postprocessing after SMT-based P2C, in order to predict longer words than the input pinyin.
To make such an IME as convenient as possible, we treat P2C as a sequence-to-sequence task, like neural machine translation (NMT) between a pinyin sequence and a character sequence, and propose a P2C model in which the entire previous utterance confirmed by the IME user is used as part of the source input. During learning, the type of previous utterance varies from the previous sentence in the same article to the previous turn in a conversation, so the resulting IME can predict far more than what the user actually types.
In this paper, we adopt the attention-based NMT framework of (Luong et al., 2015) for the P2C task. In contrast to related work that simply extended the source side with context windows of different sizes to improve translation quality (Tiedemann and Scherrer, 2017), we add the entire input utterance chosen by the IME user at the previous time (referred to as the context hereafter). Hence the resulting IME can effectively improve P2C quality with the help of the extra information offered by the context, and it supports incomplete pinyin input while predicting complete, extra, and corrected character output. The evaluation and analysis are performed on two Chinese datasets, including a Chinese open-domain conversation dataset, to verify the effectiveness of the proposed method.

Model
As illustrated in Figure 1, the core of our P2C is based on the attention-based neural machine translation model, which converts at word level. We still formulate P2C as a translation between pinyin and character sequences, as shown for a traditional model in Figure 1(a). The key difference from previous work is that our source side includes two types of input: the current source pinyin sequence (denoted as P) as usual, and the extended context, i.e., the target character sequence entered by the IME user last time (denoted as C). As an IME works dynamically, every time it makes a prediction for a source pinyin input, the user has to indicate the 'right answer' as the output target character sequence, which serves P2C model learning. This online working mode of IMEs can be fully exploited by our model, whose workflow is shown in Figure 1(b).
With a hybrid source-side input, our model has to handle document-wide translation by considering the discourse relationship between two consecutive sentences. The most straightforward approach is to simply concatenate the two types of source input with a special token 'BC' as separator, as in Figure 1(c). However, a significant drawback of this model is that many unnecessary words in the extended context (the previous utterance) act as noise in the source-side representation.
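The simple concatenation strategy can be illustrated with a minimal sketch. The function name `build_concat_source` and the ordering (context first, then the separator, then the pinyin) are assumptions made for illustration; the text only states that the two inputs are joined with the special token 'BC'.

```python
from typing import List

SEPARATOR = "BC"  # the special separator token named in the text


def build_concat_source(context_tokens: List[str], pinyin_tokens: List[str]) -> List[str]:
    """Join the previous utterance and the current pinyin into one source sequence.

    Assumption: the context comes first, followed by the separator and the pinyin.
    The paper does not specify the order.
    """
    return list(context_tokens) + [SEPARATOR] + list(pinyin_tokens)


if __name__ == "__main__":
    # Hypothetical example: previous utterance as context, current pinyin as input.
    print(build_concat_source(["今天", "天气"], ["hen", "hao"]))
```

Because every context token enters the encoder on equal footing with the pinyin, a long previous utterance can easily dominate the source sequence, which is the noise problem described above.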
To alleviate the noise introduced by the extra part of the source side, and inspired by (Dhingra et al., 2016; Pang et al., 2016; Zhang et al., 2018c,a,b; Cai et al., 2017b), our model adopts a gated-attention (GA) mechanism that performs multiple hops over the pinyin with the extended context, as shown in Figure 1(d). To ensure the correlation between the two, we build a parallel bilingual training corpus and use it to train the pinyin embeddings and the Chinese embeddings together. We use two bidirectional gated recurrent units (BiGRU) (Cho et al., 2014) to obtain contextual representations of the source pinyin and the context respectively, $H_p = \mathrm{BiGRU}(P)$ and $H_c = \mathrm{BiGRU}(C)$, where the representation of each word is formed by concatenating the forward and backward hidden states.
For each pinyin $p_i$ in $H_p$, the GA module forms a word-specific representation $c_i$ of the context from $H_c$ using soft attention, and then multiplies this context representation element-wise with the pinyin representation, $x_i = p_i \odot c_i$, where $\odot$ is the element-wise multiplication operator.
The pinyin representation $\tilde{H}_p = x_1, x_2, \ldots, x_k$ is thus augmented by the context representation and then sent into the encoder-decoder framework. The encoder is a bi-directional long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997). The vectorized inputs are fed to forward and backward LSTMs to obtain the internal representations of the two directions. The output for each input is the concatenation of the two vectors from both directions. Our decoder is based on the global attentional model proposed by (Luong et al., 2015), which considers the hidden states of the encoder when deriving the context vector. The probability is conditioned on a distinct context vector for each target word. The context vector is computed as a weighted sum of the encoder hidden states. The probability of each candidate word being the recommended one is predicted using a softmax layer over the inner product between source embeddings and candidate target characters.
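The gated-attention step can be sketched in plain Python as follows. This is a single hop only, and the dot-product attention scores follow the gated-attention reader of (Dhingra et al., 2016); the scoring function and the name `gated_attention` are assumptions, since the text above only says "soft attention" followed by an element-wise product.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention(H_p: List[Vector], H_c: List[Vector]) -> List[Vector]:
    """One gated-attention hop: gate each pinyin vector by its attended context.

    Assumption: attention scores are dot products between each context vector
    and the pinyin vector, as in Dhingra et al. (2016).
    """
    augmented = []
    for p in H_p:
        # Soft attention of this pinyin vector over all context vectors.
        scores = [sum(c_d * p_d for c_d, p_d in zip(c, p)) for c in H_c]
        alpha = softmax(scores)
        # Word-specific context representation: weighted sum of context vectors.
        attended = [sum(a * c[d] for a, c in zip(alpha, H_c)) for d in range(len(p))]
        # Element-wise product of the pinyin and context representations.
        augmented.append([p_d * c_d for p_d, c_d in zip(p, attended)])
    return augmented


if __name__ == "__main__":
    H_p = [[1.0, 0.5], [0.2, 0.8]]               # toy pinyin representations
    H_c = [[2.0, 3.0], [0.5, 0.1], [1.0, 1.0]]   # toy context representations
    for x in gated_attention(H_p, H_c):
        print(x)
```

In a multi-hop setting, the output of one hop would serve as the pinyin representation for the next hop; the element-wise gate lets the model suppress pinyin dimensions that the context does not support, rather than appending the whole context as in the concatenation model.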
This work is among the first lines of research that fully introduce an end-to-end deep learning solution to IME implementation, following a series of our previous work (Zhu et al., 2018; Wu and Zhao, 2018; Qin et al., 2018).

Definition of Incomplete Input for IME
Completeness in an IME is not easy to define well, as it is a relative concept within the input procedure. Note that the user (usually) does not actually make the input choice until typing the enter key. Meanwhile, even though the entire/complete input can be strictly defined by the moment the user types enter, the user can still make a decision at any time, and such incompleteness cannot be well evaluated by any current IME metric. As the incomplete form is hard to simulate and diverse in type, we partially evaluate it in the following two ways.

The incomplete pinyin as abbreviated pinyin
To compare with previous work directly, we followed (Huang et al., 2015) and focused on abbreviated pinyin (the consonant letter only) for evaluation (e.g., tian qi becomes t q).

Take incomplete user input as the incomplete
As an IME works as an interactive system, it always gives predictions as long as the user keeps typing. If the user's input does not end with typing enter, we can regard the current input pinyin sequence as an incomplete one. Our code is at https://github.com/YvonneHuang/gaIME

Our model is evaluated on two datasets, namely the People's Daily (PD) corpus and the Douban conversation (DC) corpus. The former is extracted from the People's Daily from 1992 to 1998 and has word segmentation annotations by Peking University. The DC corpus was created by (Wu et al., 2017) from Chinese open-domain conversations. One sentence of the DC corpus contains one complete utterance in a continuous dialogue. The statistics of the two datasets are shown in Table 1. The relativity refers to the proportion of sentences that are associated with their contextual history at word level. For example, 65.8% of the sentences in the DC corpus have words that appear in the context. With the character text available, the needed parallel corpus between pinyin and character texts is created automatically following the approach of (Yang et al., 2012).
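The relativity statistic described above could be computed along the following lines. This is a minimal sketch under stated assumptions: sentences are already segmented into word lists, each sentence is paired with its context, and a sentence counts as related when it shares at least one word with that context. The function name `relativity` is hypothetical.

```python
from typing import List, Tuple


def relativity(pairs: List[Tuple[List[str], List[str]]]) -> float:
    """Proportion of sentences sharing at least one word with their context.

    Each pair is (context_words, sentence_words). Assumption: word overlap
    is tested by exact string match on segmented words.
    """
    if not pairs:
        return 0.0
    related = sum(1 for context, sentence in pairs if set(context) & set(sentence))
    return related / len(pairs)


if __name__ == "__main__":
    # Toy, hand-made examples (not taken from the PD or DC corpora).
    toy_pairs = [
        (["今天", "天气", "很", "好"], ["天气", "不错"]),
        (["你", "吃", "了", "吗"], ["还", "没有"]),
    ]
    print(relativity(toy_pairs))
```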

Our model was implemented using the PyTorch library. The hyperparameters we used are as follows: the RNNs are deep LSTM models with 3 layers and 500 cells; training runs for 13 epochs with plain SGD and a simple learning rate schedule, starting with a learning rate of 1.0 and, after 9 epochs, halving the learning rate every epoch; mini-batches are of size 64 and shuffled; dropout is 0.3; the gated attention layers size is 3 and the number of hidden units of the BiGRU is 100. Word embeddings are pre-trained with the word2vec (Mikolov et al., 2013) toolkit on the adopted corpus, and unseen words are assigned unique random vectors.
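The learning rate schedule can be written down as a small function. This is a sketch under one assumption about the boundary: epochs are counted from 1, the rate stays at 1.0 through epoch 9, and the first halving applies to epoch 10. The function name `learning_rate` and its parameters are illustrative.

```python
def learning_rate(epoch: int, base_lr: float = 1.0, hold_epochs: int = 9) -> float:
    """Learning rate for a 1-indexed epoch.

    Assumption: the rate is constant for the first `hold_epochs` epochs and is
    halved once per epoch afterwards.
    """
    if epoch < 1:
        raise ValueError("epoch is 1-indexed")
    halvings = max(0, epoch - hold_epochs)
    return base_lr * (0.5 ** halvings)


if __name__ == "__main__":
    # Print the schedule for the 13 training epochs mentioned in the text.
    for epoch in range(1, 14):
        print(epoch, learning_rate(epoch))
```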
Two metrics are used: Maximum Input Unit (MIU) accuracy (Zhang et al., 2017) and KeyStroke Score (KySS). The former measures the conversion accuracy of MIUs, where an MIU is defined as the longest uninterrupted Chinese character sequence during input. As P2C conversion aims to output a ranked list of candidate character sequences, the Top-K MIU accuracy is the probability of hitting the target within the first K predicted items. KySS quantifies user experience by keystroke count. For an ideal IME with complete input, KySS = 1. An IME with a higher KySS is expected to perform better.
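Top-K MIU accuracy as described above can be sketched as follows, assuming each MIU comes with a ranked candidate list and one gold character sequence. The function name `top_k_accuracy` and the data layout are illustrative assumptions. KySS is not sketched here because its formula is not given in the text.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of MIUs whose gold sequence appears among the first k candidates."""
    if len(ranked_candidates) != len(targets):
        raise ValueError("one candidate list is required per target")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_candidates, targets) if gold in ranked[:k]
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Toy, hand-made candidate lists (not model output).
    candidates = [["天气", "田七", "天启"], ["你好", "拟好"]]
    gold = ["天启", "你好"]
    print(top_k_accuracy(candidates, gold, k=1))
    print(top_k_accuracy(candidates, gold, k=3))
```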

Model Definition
We considered the following baselines: (a) Google IME: the only commercial Chinese IME currently providing a debuggable API; (b) OMWA: the online model for word acquisition proposed by (Zhang et al., 2017); (c) CoCat: an SMT-based input method proposed by (Huang et al., 2015) that supports incomplete pinyin input.
The following models, with incomplete or complete inputs, are evaluated: (a) Basic P2C, the basic P2C model based on attention-NMT; (b) Basic C2C, the basic character-to-character model based on Seq2Seq; (c) Simple C+ P2C, the simple concatenated P2C conversion model that concatenates the context to the pinyin representation; (d) Gated C+ P2C, our gated-attention based context-enhanced pinyin-to-character model. A model marked with * takes pinyin in abbreviated form as input, following the incomplete-input definition of (Huang et al., 2015).
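The abbreviated form used for the starred models can be produced with a very small helper. This is a sketch assuming that the input is space-separated pinyin syllables and that abbreviation keeps only the first letter of each syllable, as in the tian qi to t q example; how two-letter initials such as zh, ch, sh are treated is not specified in the text, so the first-letter rule here is an assumption. The name `abbreviate_pinyin` is hypothetical.

```python
def abbreviate_pinyin(pinyin: str) -> str:
    """Keep only the first letter of each space-separated pinyin syllable.

    Assumption: every syllable is abbreviated to exactly one letter.
    """
    return " ".join(syllable[0] for syllable in pinyin.split())


if __name__ == "__main__":
    print(abbreviate_pinyin("tian qi"))
```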

Result and Analysis
Effect of Gated Attention Mechanism
Table 3 shows the effect of the gated-attention mechanism. We compared the Gated C+ P2C and Simple C+ P2C models. The MIU accuracy of the P2C model improves by over 10% when the way the extra information is handled changes, which demonstrates the effect of the GA mechanism. Gated C+ P2C achieves the best results on the DC corpus, suggesting that gated attention works extremely well for handling long and diverse context.

Effect of P2C modules with Different Input Forms
It is not surprising that the straightforward concatenation strategy for source inputs performs poorly on the DC corpus when the input pinyin is incomplete, due to the obvious noise in an overly long context. The relatively small gap between the results of CoCat* and CoCat indicates that a statistical learning model may be helpful in obtaining some useful patterns from limited input. When the input contains adequate information, the MIU accuracy of the Gated C+ P2C system improves by more than 20% on both corpora. However, we find that the KySS scores are much closer even with different levels of pinyin completeness, which indicates that user experience in terms of KySS is harder to improve.

Instance Analysis
We input a dialogue to examine how much of the contextual information is used when the P2C module finds the input pinyin unknown. Figure 2 demonstrates the effect of the gated-attention mechanism on candidate offering and unknown word replacement. As shown in Figure 2(a), our IME suggests more suitable candidates when the user's input is obviously inconsistent with what the model has learned previously. This shows that our model goes beyond the Simple C+ P2C, which learns to maximally match the input pinyin: it becomes capable of effectively resisting noise in the user's pinyin input and turns to the potential language knowledge in the previous input history.
As the ability to predict user input from incomplete pinyin is not covered by any current IME performance metric, the reported results actually underestimate our model's performance to some extent. We use Figure 2(b) to illustrate the extra effect of our P2C system in such a situation; it indicates that the gated-attention pattern takes great advantage of contextual information when given an unknown word. In other words, our model enables incomplete-input prediction, although this falls outside current IME performance measurement. We display the attention visualization for Figure 2(b) in Figure 3 to better explain the effect the extended context has on the generation of target characters.

Main Result
Our model is compared to other models in Table 3. So far, (Huang et al., 2015) and (Zhang et al., 2017) have reported the state-of-the-art results among statistical models. We list our top-5 accuracy against the top-10 results of all baselines, and the comparison indicates the noticeable advancement of our P2C model. To our surprise, the top-5 result of our best Gated C+ P2C system on PD approaches the top-10 accuracy of Google IME. On the DC corpus, our Gated C+ P2C model with the best setting achieves 90.14% accuracy, surpassing all the baselines. The comparison shows that our gated-attention system outperforms all state-of-the-art baselines with better user experience.

Conclusion
To the best of our knowledge, this work is the first attempt to introduce additional context into a neural pinyin-to-character converter for pinyin-based Chinese IMEs. We propose a gated-attention enhanced model that digs out significant context information to improve conversion quality. More importantly, the resulting IME supports incomplete user pinyin input but returns complete, extra, and even corrected character outputs, which brings about a story-telling mode change for all existing IMEs.