A Neural Network Architecture for Multilingual Punctuation Generation

Even syntactically correct sentences are perceived as awkward if they do not contain correct punctuation. Still, the problem of automatic generation of punctuation marks has long been largely neglected. We present a novel model that introduces punctuation marks into raw text material with a transition-based algorithm using LSTMs. Unlike the state-of-the-art approaches, our model is language-independent and also neutral with respect to the intended use of the punctuation. Multilingual experiments show that it achieves high accuracy on the full range of punctuation marks across languages.


Introduction
Although omnipresent in (language learner) grammar books, punctuation has received much less attention in linguistics and natural language processing (Krahn, 2014). In linguistics, punctuation is generally acknowledged to possess different functions. Its traditionally most studied function is to encode the prosody of oral speech, i.e., the prosodic rhetorical function; see, e.g., Kirchhoff and Primus (2014) and the references therein. The comma in particular is assumed to possess a strong rhetorical function (Nunberg et al., 2002). Its other functions are the grammatical function, which leads it to form a separate grammatical submodule alongside semantics, syntax, and phonology (Nunberg, 1990), and the syntactic function (Quirk et al., 1972), which makes it reflect the syntactic structure of a sentence.
The different functions of punctuation are also reflected in different tasks in natural language processing (NLP): introduction of punctuation marks into a generated sentence that is to be read aloud, restoration of punctuation in speech transcripts, parsing under consideration of punctuation, or generation of punctuation in written discourse. Our work addresses the last task. We present a novel punctuation generation algorithm that is based on the transition-based algorithm with long short-term memories (LSTMs) by Dyer et al. (2015) and on character-based continuous-space vector embeddings of words using bidirectional LSTMs (Ling et al., 2015b). The algorithm takes as input raw material without punctuation and effectively introduces the full range of punctuation symbols. Although intended, first of all, for use in sentence generation, the algorithm is function- and language-neutral, which distinguishes it from most of the state-of-the-art approaches, which use function- and/or language-specific features.

Related Work
The most prominent punctuation-related NLP task has so far been the introduction (or restoration) of punctuation in speech transcripts. Most often, classifier models are used that are trained on n-gram models (Gravano et al., 2009), on n-gram models enriched by syntactic and lexical features (Ueffing et al., 2013) and/or by acoustic features (Baron et al., 2002; Kolář and Lamel, 2012). Tilk and Alumäe (2015) use a lexical and acoustic (pause duration) feature-based LSTM model for the restoration of periods and commas in Estonian speech transcripts. The grammatical and syntactic functions of punctuation have been addressed in the context of written language. Some of the proposals focus on the grammatical function (Doran, 1998; White and Rajkumar, 2008), while others bring the grammatical and syntactic functions together and design rule-based grammatical resources for parsing (Briscoe, 1994) and surface realization (White, 1995; Guo et al., 2010). Guo et al. (2010) is one of the few works based on a statistical model for the generation of punctuation, in the context of Chinese sentence generation; the model is trained on a variety of syntactic features from LFG f-structures, preceding punctuation bigrams and cue words.
Our proposal is most similar to Tilk and Alumäe (2015), but our task is more complex since we generate the full range of punctuation marks. Furthermore, we do not use any acoustic features. Compared to Guo et al. (2010), we do not use any syntactic features either since our input is just raw text material.

Algorithm
We define a transition-based algorithm that introduces punctuation marks into sentences that do not contain any punctuation. In the context of NLG, the input sentence would be the result of the surface realization task (Belz et al., 2011). As in transition-based parsing (Nivre, 2004), we use two data structures: Nivre's queue is in our case the input buffer and his stack is in our case the output buffer. The algorithm starts with an input buffer full of words and an empty output buffer. The two basic actions of the algorithm are SHIFT, which moves the first word from the input buffer to the output buffer, and GENERATE, which introduces a punctuation mark after the first word in the output buffer. Figure 1 shows an example of the application of the two actions.
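As a rough illustration of the two actions described above, the following Python sketch derives a SHIFT/GENERATE sequence from a punctuated sentence and replays it on the unpunctuated words. The names `oracle_actions`, `apply_actions` and `PUNCT` are hypothetical and chosen for illustration only; in particular, `PUNCT` is a small sample and not the inventory of Table 1.

```python
from collections import deque
from typing import List, Tuple

# An action is ("SHIFT", "") or ("GENERATE", mark).
Action = Tuple[str, str]

# Illustrative subset of marks (assumption, not the paper's Table 1).
PUNCT = {",", ".", ";", ":", "?", "!"}


def oracle_actions(tokens: List[str]) -> List[Action]:
    """Read a punctuated token list and emit one action per token."""
    actions: List[Action] = []
    for tok in tokens:
        if tok in PUNCT:
            actions.append(("GENERATE", tok))
        else:
            actions.append(("SHIFT", ""))
    return actions


def apply_actions(words: List[str], actions: List[Action]) -> List[str]:
    """Replay actions: SHIFT moves a word, GENERATE writes a mark."""
    input_buffer = deque(words)      # all words, no punctuation
    output_buffer: List[str] = []    # starts empty
    for name, mark in actions:
        if name == "SHIFT":
            output_buffer.append(input_buffer.popleft())
        elif name == "GENERATE":
            output_buffer.append(mark)
        else:
            raise ValueError(f"unknown action: {name}")
    if input_buffer:
        raise ValueError("action sequence did not consume all words")
    return output_buffer
```

Every word is shifted exactly once and every mark is generated exactly once, so the number of actions equals the number of words plus the number of marks.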
At each stage t of the application of the algorithm, the state, which is defined by the contents of the output and input buffers, is encoded in terms of a vector s_t; see Section 3.3 for different alternatives of state representation. As in Dyer et al. (2015), we use s_t to compute the probability of the action at time t as:
$$p(z_t \mid s_t) = \frac{\exp\left(g_{z_t}^{\top} s_t + q_{z_t}\right)}{\sum_{z' \in A} \exp\left(g_{z'}^{\top} s_t + q_{z'}\right)} \qquad (1)$$
where g_z is a vector representing the embedding of the action z, and q_z is a bias term for action z. The set A represents the actions (either SHIFT or GENERATE(p)). s_t encodes information about previous actions (since it may include the history with the actions taken, and the generated punctuation symbols are introduced in the output buffer; see Section 3.3), thus the probability of a sequence of actions z given the input sequence w is:
$$p(z \mid w) = \prod_{t=1}^{|z|} p(z_t \mid s_t) \qquad (2)$$
As in Dyer et al. (2015), the model greedily chooses the best action to take given the state, with no backtracking.
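The action probability and the greedy choice described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the state vector, the action embeddings g_z and the biases q_z are passed in as plain lists and dictionaries rather than learned parameters.

```python
import math
from typing import Dict, List


def action_distribution(
    state: List[float],
    action_embeddings: Dict[str, List[float]],  # g_z for each action z
    action_biases: Dict[str, float],            # q_z for each action z
) -> Dict[str, float]:
    """Softmax over actions: p(z | s_t) is proportional to exp(g_z . s_t + q_z)."""
    scores = {
        z: sum(g_i * s_i for g_i, s_i in zip(g, state)) + action_biases[z]
        for z, g in action_embeddings.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {z: math.exp(v - top) for z, v in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}


def greedy_action(probs: Dict[str, float]) -> str:
    """Pick the most probable action; no backtracking is attempted."""
    return max(probs, key=probs.get)


def sequence_log_prob(
    states: List[List[float]],
    actions: List[str],
    action_embeddings: Dict[str, List[float]],
    action_biases: Dict[str, float],
) -> float:
    """Log of the product of per-step action probabilities."""
    return sum(
        math.log(action_distribution(s, action_embeddings, action_biases)[z])
        for s, z in zip(states, actions)
    )
```

Working in log space turns the product over time steps into a sum, which is the usual way to avoid underflow for long action sequences.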

Word Embeddings
Following the tagging model of Ling et al. (2015b), we compute character-based continuous-space vector embeddings of words using bidirectional LSTMs (Graves and Schmidhuber, 2005) to learn similar representations for words that are similar from an orthographic/morphological point of view.
The character-based representations may also be concatenated with a fixed vector representation from a neural language model. The resulting vector is passed through a component-wise rectified linear unit (ReLU). We experiment with and without pretrained word embeddings. To pretrain the fixed vector representations, we use the skip n-gram model introduced by Ling et al. (2015a).

Representing the State
We work with two possible representations of the input and output buffers (i.e., the state s_t): (i) a look-ahead model that takes into account the immediate context (two embeddings for the input and two embeddings for the output), which we use as a baseline, and (ii) the LSTM model, which encodes the entire input sequence and the output sentence with LSTMs.

Baseline: Look-ahead Model
The look-ahead model can be interpreted as a 4-gram model in which two words belong to the input and two belong to the output. The representation takes the average of the first two embeddings of the output and the first two embeddings at the front of the input. The word embeddings contain all the richness provided by the character-based LSTMs and the pretrained skip n-gram model embeddings (if used). The resulting vector is passed through a component-wise ReLU and a softmax transformation to obtain the probability distribution over the possible actions given the state s_t; see Section 3.1.
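The state computation of the look-ahead baseline can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the two output embeddings are taken to be the two most recently written ones, and padding with a `pad` vector when fewer than two embeddings are available is an assumption the text does not specify. The subsequent softmax over actions is omitted here.

```python
from typing import List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Component-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def average(vectors: List[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def look_ahead_state(
    output_embs: List[Vector],  # embeddings of tokens already in the output
    input_embs: List[Vector],   # embeddings of words still in the input
    pad: Vector,                # assumption: used when context is missing
) -> Vector:
    """Average two output and two input embeddings, then apply ReLU."""
    out_ctx = list(output_embs[-2:])  # two most recent output tokens
    in_ctx = list(input_embs[:2])     # two words at the front of the input
    out_ctx = [pad] * (2 - len(out_ctx)) + out_ctx
    in_ctx = in_ctx + [pad] * (2 - len(in_ctx))
    return relu(average(out_ctx + in_ctx))
```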

LSTM Model
The baseline look-ahead model considers only the immediate context for the input and output sequences. In the proposed model, we apply recurrent neural networks (RNNs) that encode the entire input and output sequences in the form of LSTMs. LSTMs are a variant of RNNs designed to deal with the vanishing gradient problem inherent in RNNs (Hochreiter and Schmidhuber, 1997; Graves, 2013). RNNs read a vector x_t at each time step and compute a new (hidden) state h_t by applying a linear map to the concatenation of the previous time step's state h_{t-1} and the input, and then passing the outcome through a logistic sigmoid non-linearity.
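The plain RNN recurrence described above (not the full LSTM with its gates) can be written out in a few lines of Python. The function names and the explicit weight matrix `W` and bias `b` are hypothetical and serve only to make the recurrence concrete.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic sigmoid non-linearity."""
    return 1.0 / (1.0 + math.exp(-x))


def rnn_step(W: Matrix, b: Vector, h_prev: Vector, x_t: Vector) -> Vector:
    """h_t = sigmoid(W [h_{t-1}; x_t] + b), one row of W per hidden unit."""
    concat = list(h_prev) + list(x_t)
    return [
        sigmoid(sum(w * c for w, c in zip(row, concat)) + b_i)
        for row, b_i in zip(W, b)
    ]


def encode(W: Matrix, b: Vector, inputs: List[Vector], h0: Vector) -> Vector:
    """Read the input vectors in order and return the final hidden state."""
    h = h0
    for x_t in inputs:
        h = rnn_step(W, b, h, x_t)
    return h
```

An LSTM replaces this single update with gated updates of a memory cell, which is what mitigates the vanishing gradient problem mentioned above.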
We use a simplified version of the stack LSTM model of Dyer et al. (2015). The input buffer is encoded as a stack LSTM, into which we PUSH the entire sequence at the beginning and from which we POP words at each time step. The output buffer is a sequence, encoded by an LSTM, into which we PUSH the final output sequence. As in Dyer et al. (2015), we include a third sequence with the history of actions taken, which is encoded by another LSTM. As already mentioned above, the three resulting vectors are passed through a component-wise ReLU and a softmax transformation to obtain the probability distribution over the possible actions that can be taken (either to shift or to generate a punctuation mark), given the current state s_t; see Section 3.1.
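The PUSH and POP bookkeeping of a stack-structured recurrent encoder can be sketched as below. This is an illustrative simplification: `StackEncoder` and `toy_cell` are hypothetical names, and `toy_cell` is a trivial stand-in for an LSTM cell so that the example stays self-contained.

```python
from typing import Callable, List

Vector = List[float]
Cell = Callable[[Vector, Vector], Vector]  # (previous state, input) -> new state


class StackEncoder:
    """Stack whose summary is the recurrent state at the current top."""

    def __init__(self, cell: Cell, initial_state: Vector) -> None:
        self._cell = cell
        self._states: List[Vector] = [initial_state]  # one state per depth
        self._items: List[Vector] = []

    def push(self, x: Vector) -> None:
        """Extend the recurrence from the state at the current top."""
        self._states.append(self._cell(self._states[-1], x))
        self._items.append(x)

    def pop(self) -> Vector:
        """Remove the top item; the summary reverts to the previous state."""
        self._states.pop()
        return self._items.pop()

    def summary(self) -> Vector:
        """Vector encoding everything currently on the stack."""
        return self._states[-1]

    def __len__(self) -> int:
        return len(self._items)


def toy_cell(h: Vector, x: Vector) -> Vector:
    """Placeholder recurrence (assumption): not an LSTM."""
    return [0.5 * h_i + x_i for h_i, x_i in zip(h, x)]
```

For the input buffer, the words would be pushed in reverse order at the start so that the first word ends up on top; each SHIFT then pops one word without recomputing the encoding of the remaining ones.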

Experiments
To test our models, we carried out experiments on five languages: Czech, English, French, German, and Spanish. English, French and Spanish are generally assumed to be characterized by prosodic punctuation, while for German the syntactic punctuation is more dominant (Kirchhoff and Primus, 2014). Czech punctuation also leans towards syntactic punctuation (Kolář et al., 2004), but due to its rather free word order we expect it to reflect prosodic punctuation as well.
The punctuation marks that the models attempt to predict (and that also occur in the training sets) for each language are listed in Table 1. Commas represent around 55% and periods around 30% of the total number of marks in the datasets.
In both the look-ahead and the stack LSTM models, character-based embeddings, punctuation embeddings and pretrained embeddings (if used) have 100 dimensions. Both models are trained to maximize the conditional log-likelihood (Eq. 2) of output sentences, given the input sequences. For Czech, English, German, and Spanish, we use the wordforms from the treebanks of the CoNLL 2009 Shared Task (Hajič et al., 2009); the French dataset is by Candito et al. (2010). Development sets are used to optimize the model parameters; the results are reported for the held-out test sets.
Table 2 displays the outcome of the experiments for periods and commas in all five languages and summarizes the overall performance of our algorithm in terms of the micro-average figures. In order to test whether pretrained word embeddings provide further improvements, we incorporate them for English, Spanish and German. The figures show that the LSTMs that encode the entire context of a punctuation mark are better than a strong baseline that takes into account a 4-gram sliding window of tokens. They also show that character-based representations are already useful for the punctuation generation task on their own, but when concatenated with pretrained vectors, they are even more useful.
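To make the notion of micro-averaging concrete, the sketch below pools counts over all marks and sentences before computing the scores. It assumes that the reported figures are precision, recall and F-score, which the text here does not state explicitly, and that at most one mark follows each word; the function name and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# One dict per sentence: word index -> mark placed after that word.
Marks = Dict[int, str]


def micro_prf(gold: List[Marks], pred: List[Marks]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over all punctuation marks."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for pos, mark in p.items():
            if g.get(pos) == mark:
                tp += 1   # right mark in the right place
            else:
                fp += 1   # spurious or wrong mark
        for pos, mark in g.items():
            if p.get(pos) != mark:
                fn += 1   # gold mark missed or replaced
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because counts are pooled before dividing, frequent marks such as commas and periods dominate the micro-average.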

Results and Discussion
The model is capable of providing good results for all languages, being more consistent for English, Czech and German. Average sentence length may indicate why the model seems to be worse for Spanish and French, since sentences are longer in the Spanish (29.8) and French (27.0) datasets, compared to German (18.0), Czech (16.8) or English (24.0). The training set is also smaller in Spanish and French compared to the other languages. It is worth noting that the results across languages are not directly comparable since the datasets are different, and as shown in Table 1, the sets of punctuation marks that are to be predicted diverge significantly.
The figures in Table 2 cannot be directly compared with the figures reported by Tilk and Alumäe (2015) for their LSTM-model on period and comma restoration in speech transcripts: the tasks and datasets are different.
Our results show that the state representation (through LSTMs, which have already been shown to be effective for syntax) and character-based representations (which allow similar embeddings for words that are morphologically similar; Ling et al., 2015b) capture strong linguistic clues for predicting punctuation.

Conclusions
We presented an LSTM-based architecture that is capable of adding punctuation marks, with high quality and in linear time, to sequences of tokens as produced without punctuation in the context of surface realization. Compared to other proposals in the field, the architecture has the advantage of operating on sequences of word forms, without any additional syntactic or acoustic features. This tool could be used for ASR (Tilk and Alumäe, 2015) and grammatical error correction (Ng et al., 2014). In the future, we plan to create cross-lingual models by applying multilingual word embeddings (Ammar et al., 2016).