Statistics-Based Lexical Choice for NLG from Quantitative Information

We discuss a fully statistical approach to the expression of quantitative information in English. We outline the approach, focussing on the problem of Lexical Choice. An initial evaluation experiment suggests that it is worth investigating the method further.


Introduction
NLG systems express information in human language. To do this well, these systems need to "know" which expressions are most suitable for expressing a given piece of information. The most direct way to define words in NLG systems is manual coding, as was done in systems such as FoG (Goldberg et al., 1994) and SumTime-Mousam (Sripada et al., 2003). However, manual coding is time-consuming, arguably theoretically unsatisfactory, and error-prone even when performed by domain experts. The process is further complicated by the fact that words like pink (Roy, 2002) and evening (Reiter et al., 2005) have different meanings for individual speakers.
Recent NLG approaches learn the use of words through statistical analysis of data-text corpora. For example, Belz's semi-automatic system for weather forecasting automatically learns a grammar based on a pre-existing (i.e., manually coded) set of grammar rules (Belz, 2008). Liang et al. (2009) developed a fully statistical alignment-based algorithm that automatically acquires a mapping from quantitative information to English words, adopting a hierarchical hidden semi-Markov model trained by Expectation Maximization. Konstas and Lapata (2013) introduced a generation model based on Liang's algorithm. However, these existing approaches have difficulty handling situations in which a word expresses a combination of data dimensions, for example when the word "mild" expresses a combination of warm temperatures and low wind speeds.
In this paper, we discuss a new approach to the problem: a fully statistical approach that can handle situations in which a word or phrase maps to a combination of data dimensions. We focus on Lexical Choice but are investigating applications to other areas of NLG.

Methodology
In many areas of perception research, a method called "contour stylization" is employed to mimic a complex signal (i.e., a complex graph) by means of a limited number of straight lines ('t Hart and Cohen, 1990). Our method applies a similar idea to two dimensions (i.e., weather data and language) at the same time, building a bridge between quantitative information and words by discretising the data.

Representing Data as Vectors
A continuous dimension can be represented by a set of discrete parameters, so-called key-points. For example, wind speed (ws) is a continuous dimension with values between 0 knots and 36 knots. A group of key-points can then be used to represent any value of wind speed. For instance, a possible key-point group is {ws = 0, ws = 12, ws = 24, ws = 36}, in which the key-points are evenly spaced. The aim of using key-points is to transform the original quantitative dimension into probability dimensions. This process is similar to Signal Analysis (Reiter, 2007), in which each key-point plays the role of a Signal Sensor. In the above example, 4 key-points are used to represent wind speed collectively, where each key-point specifies a specific range of wind speed. In this way, if a word describes wind speed within a certain range, we can find the connection of the word to the relevant key-points.
Based on this formulation, any wind speed can be represented by weighted key-points through linear interpolation. Suppose one would like to represent an arbitrary wind speed, say ws = 5. Note that ws = 5 falls between the key-points ws = 0 and ws = 12 as described above. Using linear interpolation, one can derive the weights of key-points ws = 0 and ws = 12 for representing ws = 5, which are 0.58 and 0.42 respectively. Because the remaining key-points do not contribute to representing ws = 5, their weights are set to 0. Finally, the wind speed ws = 5 can be represented as a vector ⟨0.58, 0.42, 0, 0⟩, which encodes the weights for the key-point group.
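The interpolation step above can be sketched as follows. This is a minimal illustration of the discretisation idea, not the authors' implementation; the helper name `keypoint_weights` is our own.

```python
def keypoint_weights(value, keypoints):
    """Represent a scalar value as a weight vector over key-points,
    using linear interpolation between the two enclosing key-points."""
    weights = [0.0] * len(keypoints)
    if value <= keypoints[0]:
        weights[0] = 1.0
        return weights
    if value >= keypoints[-1]:
        weights[-1] = 1.0
        return weights
    for i in range(len(keypoints) - 1):
        lo, hi = keypoints[i], keypoints[i + 1]
        if lo <= value <= hi:
            # Each weight is proportional to the distance from the *other* key-point.
            weights[i] = (hi - value) / (hi - lo)
            weights[i + 1] = (value - lo) / (hi - lo)
            return weights

# ws = 5 gives weights of about 0.58 and 0.42 on the first two key-points,
# matching the worked example in the text.
print(keypoint_weights(5, [0, 12, 24, 36]))
```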
Although in the above example the key-points {ws = 0, ws = 12, ws = 24, ws = 36} are evenly spaced, it should be noted that the setting of the key-points (e.g., the choice of key-point values) has relatively little impact on predicting the use of words. This is because, from the contour-stylization angle (in addition to the Signal Analysis one), our method can be regarded as fitting the occurrence function of words with a piecewise straight line, where the key-points give the abscissae of the line's inflection points. Although carefully selecting key-points could possibly enhance the model's performance, our model adopts evenly spaced key-points, which empirically work well enough in general.

Representing Words as Vectors
Expressions such as words can be represented by key-point weight vectors as well. For example, in English the expression calm is only used to describe wind speeds close to 0. So, calm can be represented using the same key-point group as before, i.e., with a high weight for ws = 0 (such as 0.9, for instance) and a low weight for ws = 12 (e.g., 0.01). For the moment, the weights of calm are estimated by hand; in Section 2.4 we will see how the weights can be estimated from a data-text corpus.

Lexical Choice
This section introduces how our proposed approach handles lexical choice in the NLG process through Cosine similarity. Suppose both quantitative information and lexical expressions have been converted into vectors (i.e., q and e) in the same vector space parameterised by the key-points. The problem of finding the most likely expression e for the given quantitative information q can then be transformed into the process of finding the lexical expression vector e that is most similar to q. We exemplify the lexical choice process below, using wind speed as the quantitative dimension.
Suppose the key-points are still {ws = 0, ws = 12, ws = 24, ws = 36}. The candidate expression words are calm and breeze, which can be represented as key-point weight vectors as below:

e_calm = ⟨0.9, 0.01, −0.9, −1⟩    (1)
e_breeze = ⟨0.7, 0.9, −0.8, −1⟩    (2)

Now our goal is to choose the most suitable word to describe wind speed ws = 5 from the available candidate word expressions (i.e., calm and breeze).
As discussed in Section 2.1, ws = 5 can also be represented by a key-point weight vector:

q_ws=5 = ⟨0.58, 0.42, 0, 0⟩

Based on the same key-point vector space, we calculate the Cosine similarity between each candidate word and the target wind speed ws = 5; the most suitable word is naturally the one with the highest similarity to ws = 5.
Sim(e_calm, q_ws=5) = (e_calm · q_ws=5) / (‖e_calm‖ ‖q_ws=5‖) = 0.45
Sim(e_breeze, q_ws=5) = (e_breeze · q_ws=5) / (‖e_breeze‖ ‖q_ws=5‖) = 0.64

As can be seen above, the similarity between q_ws=5 and e_breeze is higher than that between q_ws=5 and e_calm. Therefore, breeze is the better choice for expressing ws = 5.
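The worked example can be reproduced with a few lines of code. This is an illustrative sketch of the cosine-based selection step, using the vectors given above:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# Key-point weight vectors from the example: {ws = 0, 12, 24, 36}.
e_calm = [0.9, 0.01, -0.9, -1.0]
e_breeze = [0.7, 0.9, -0.8, -1.0]
q = [0.58, 0.42, 0.0, 0.0]  # ws = 5

candidates = {"calm": e_calm, "breeze": e_breeze}
best = max(candidates, key=lambda w: cosine(candidates[w], q))
print(round(cosine(e_calm, q), 2), round(cosine(e_breeze, q), 2), best)
# prints: 0.45 0.64 breeze
```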

Estimating Weight Vectors for Word Expressions
One key challenge in applying our approach to learning the relationship between quantitative information and words is to find the optimal vector e for each possible expression word. Suppose we have r data-to-text pairs, denoted ⟨data_i, text_i⟩ for i = 1, …, r, where data_i consists of quantitative dimensions and text_i contains the expression words, as shown in Eq. 6:

⟨data, text⟩ ⇒ {dim_1,…,m, exp_1,…,n}    (6)

Following Section 2.1, for each data-to-text pair we first discretise the data dimensions (dim_1,…,m) into key-point weight vectors {d_1, d_2, …, d_m} ≡ d. Next, we can find the optimal values for the weight vector e_i by solving the system of equations (Eq. 7) constructed from the training data:

d_j · e_i = isOccur(exp_i | text_j),  j = 1, …, r    (7)

The function isOccur(exp_i | text_j) returns 1 if exp_i occurs in the corresponding text_j, and returns 0 otherwise. Generally, there are fewer free parameters than equations, so we can always find an optimised solution for e_i using Least Squares; if there is more than one solution, we adopt the one with the least norm. In the same way, we obtain weight vectors for all candidate expressions.

So far we have described how to estimate the key-point weight vector for every candidate expression from training data, i.e., data-text pairs. In the test phase, to predict the most likely words for unseen data, we first represent the data as a weight vector and then compare its cosine similarity against every candidate expression. Since the weight vectors e_i for expressions are trained through the occurrence function isOccur(), the similarity between unseen data and a candidate expression reflects how suitable the expression is for expressing the data.
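The estimation step can be sketched with a toy example. The training pairs below are invented for illustration; the solver is `numpy.linalg.lstsq`, which returns the minimum-norm least-squares solution, matching the least-norm criterion stated above:

```python
import numpy as np

# Hypothetical toy training set. Each row of D is a data key-point weight
# vector (key-points ws = 0, 12, 24, 36); y[j] is isOccur("calm" | text_j).
D = np.array([
    [1.0,  0.0,  0.0, 0.0],   # ws = 0,  forecast text contains "calm"
    [0.58, 0.42, 0.0, 0.0],   # ws = 5,  forecast text contains "calm"
    [0.0,  0.0,  1.0, 0.0],   # ws = 24, no "calm"
    [0.0,  0.0,  0.0, 1.0],   # ws = 36, no "calm"
])
y = np.array([1.0, 1.0, 0.0, 0.0])

# Least-squares solution of D @ e = y; when the system is underdetermined,
# lstsq returns the solution with the least norm, as the method requires.
e_calm, *_ = np.linalg.lstsq(D, y, rcond=None)
print(e_calm)  # high weights on the low-wind key-points
```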

Discussion: Handling multiple dimensions
One important feature of our approach is its ability to choose expressions for data with multiple dimensions. We stress that both the training process and the lexical choice process are applicable to multiple data dimensions. First, in the training process, information from different quantitative dimensions is converted into key-point weights, so the boundaries between dimensions disappear; the training process can even capture implicit relationships between expressions and quantitative data. Second, the lexical choice process selects expressions based on the whole set of dimensions rather than on each single dimension. This is why the approach can handle multi-dimensional information.

Evaluating the proposed approach to Lexical Choice
To perform an initial sanity check on our approach, we built a small corpus from the SumTime-Meteo Corpus (Sripada et al., 2002), which contains human-written weather forecasts together with meteorological data. We selected 144 wind speed forecasts whose wind speeds do not change much during a forecast period, and summarised these data into three dimensions, as shown in Table 1. We randomly selected 96 of the data records to train the model, and used the remaining records for evaluation. We evaluated 10 words: LESS, N, S, OR, SE, NE, VARIABLE, GUSTS, WS, MAINLY, i.e., the words occurring more than 5 times in the small corpus. For each candidate word w_i, we separated the testing data into two groups: forecast texts in group 1 contain the word w_i, while those in group 2 do not. When we use our model (trained on the SumTime-Meteo Corpus) to predict the occurrence probability of w_i in each group, we expect the probability p(w_i|G_1) from group 1 to be higher than p(w_i|G_2) from group 2. The results are shown in Figure 1.
As shown in Figure 1, the experimental results are in line with our expectation: our approach does produce higher occurrence probabilities in group 1 than in group 2. Recall that one key feature of our approach is its capability to model multi-dimensional features. To show the benefit of this feature, we also applied our approach taking each single dimension into account separately. By comparing Table 1 and Table 2, we can see that the prediction performance based on multiple dimensions outperforms all models that consider a single dimension only, especially when predicting the words variable and mainly.

Conclusion
We have sketched an approach to choosing lexical expressions according to multiple dimensions of quantitative information. To achieve this, the approach learns the relationship between quantitative information and words by the following steps: a) resolving quantitative information and the occurrence of expressions into the same linear space; b) building equations for the expressions' weight vectors; c) finding the best solution of these equations. Initial evaluation suggests that this approach may be on the right track.
The possibility of applications to Lexical Choice in Natural Language Generation is perhaps most obvious, but the mapping that we learn is applicable to interpretation as well. In other words, our proposal aims to solve the age-old problem in Linguistics and Fuzzy Logic of how to specify the meaning of vague words (Van Deemter, 2012), which resist traditional approaches to semantics because these words admit borderline cases.