Grounding learning of modifier dynamics: An application to color naming

Grounding is crucial for natural language understanding. An important subtask is to understand modified color expressions, such as “light blue”. We present a model of color modifiers that, compared with previous additive models in RGB space, learns more complex transformations. In addition, we present a model that operates in the HSV color space. We show that certain adjectives are better modeled in that space. To account for all modifiers, we train a hard ensemble model that selects a color space depending on the modifier-color pair. Experimental results show significant and consistent improvements compared to the state-of-the-art baseline model.


Introduction
Grounded color descriptions are employed to describe colors which are not covered by basic color terms (Monroe et al., 2017). For instance, "greenish blue" cannot be expressed by only "blue" or "green". Grounded learning of modifiers, as a result, is essential for grounded language understanding problems such as image captioning (Karpathy and Fei-Fei, 2015), visual question answering (Goyal et al., 2017) and object recognition (van de Sande et al., 2010).
In this paper, we present models that are able to predict the RGB code of a target color given a reference color and a modifier. For example, as shown in Figure 1, given a reference color code r = (101, 55, 0) and a modifier m = "greenish", our models are trained to predict the target color code t = (105, 97, 18).¹ The state-of-the-art approach for this task (Winn and Muresan, 2018) represents both colors and modifiers as vectors in RGB space, and learns a vector representation of modifiers m as part of a simple additive model, r + m ≈ t, in RGB color space. For instance, given the reference color r = (229, 0, 0) and the target color t = (132, 0, 0), the modifier m = "darker" is learned as the vector m = (−97, 0, 0). This model works well when the modifier is well represented as a single vector independent of the reference color,² but fails to model modifiers requiring more complex transformations, for example color-related modifiers like "greenish", which are better modelled through color interpolation.

Figure 1: Examples of the grounded modifier modelling task, shown in RGB space. Given the reference and modifier, the system must predict the target color.

¹ Code available at https://github.com/HanXudong/GLoM
To fit a better model, we assume that there are approximate intersection points for the extension lines of modifier vectors. For instance, Figure 2c shows the "darker"-related vectors in RGB space, where the intersection point is approximately (0, 0, 0). On this basis, we introduce an RGB model which learns a transformation matrix and an interpolation point for each modifier.
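To make the contrast concrete, here is a minimal numpy sketch of the additive baseline, fit on two illustrative (reference, target) pairs; the color values are made up for illustration and are not taken from the actual dataset.

```python
import numpy as np

# Toy (reference, target) pairs for the modifier "darker" -- illustrative
# values only, not from the dataset.
pairs = [
    (np.array([229.0, 0.0, 0.0]), np.array([132.0, 0.0, 0.0])),  # darker red
    (np.array([0.0, 0.0, 200.0]), np.array([0.0, 0.0, 110.0])),  # darker blue
]

# In the additive model a modifier is one vector m shared across all
# reference colors; the least-squares fit is simply the mean offset.
m = np.mean([t - r for r, t in pairs], axis=0)  # [-48.5, 0.0, -45.0]

# Prediction for a new reference color: t_hat = r + m.
r_new = np.array([150.0, 150.0, 0.0])
t_hat = np.clip(r_new + m, 0, 255)
```

Note how averaging mixes the "darker red" and "darker blue" offsets, which point along different axes of the RGB cube; this non-parallelism is exactly what motivates a per-modifier transformation rather than a single shared vector.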
There are many other color spaces besides RGB, e.g., HSL and HSV (Joblove and Greenberg, 1978; Hanbury, 2008), and mapping between such spaces can be done via invertible transformations (Agoston, 2005). As shown in Figure 2d, for some modifiers, color vectors are approximately parallel in HSV space, which simplifies the modeling problem. We propose an HSV model using a von Mises loss, and show that this model outperforms the RGB method for many modifiers. We also present an ensemble model for color space selection, which determines for each modifier the best color space model. Overall, our methods substantially outperform prior work, achieving state-of-the-art performance on the grounded color modelling task.
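In Python, the standard-library colorsys module implements this invertible RGB↔HSV mapping; a quick sketch:

```python
import colorsys

# colorsys operates on floats in [0, 1], so scale 8-bit channels first.
r, g, b = 101 / 255, 55 / 255, 0 / 255

h, s, v = colorsys.rgb_to_hsv(r, g, b)   # h is a fraction of a full turn

# The transformation is invertible: converting back recovers the input.
r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
assert max(abs(r - r2), abs(g - g2), abs(b - b2)) < 1e-9
```

colorsys represents hue as a fraction in [0, 1) rather than radians; multiplying by 2π gives the angular form needed for a von Mises loss over hue.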

Methods
Here we describe the methods employed for color modeling. Formally, the task is to predict a target color vector t in RGB space from a reference color vector r and a modifier string m.

Modeling in RGB space
Baseline Model Winn and Muresan (2018) present a model (WM18) which represents the modifier as a vector m ∈ R^3, a function of (m, r), pointing from the reference color vector r to the target color vector t, such that t = r + m. In the simplest case the modifier vector m is independent of the reference color r. This assumption, however, does not hold in all situations. For example, when predicting an instance with the reference vector r = (193, 169, 106) and the modifier "greenish", the expected outcome t is (177, 183, 102); however, WM18 predicts t̂ = (195, 156, 95). The cosine similarity between (t − r) and (t̂ − r) is −0.76, i.e., the predicted m points in the opposite direction to where it should.

RGB Model As shown in Figure 2c, pairs of vectors for the same modifier are often not parallel. In theory, such vectors can even be orthogonal: compare "darker red" vs. "darker blue", which fall on different faces of the RGB cube. To model this, we propose a model in RGB space as follows:

t̂ = M r + β    (1)

where M ∈ R^{3×3} is a transformation matrix and β is a modifier vector, both designed to capture the information of m. Given an error term ε ∼ N(0, σI_3), the RGB model is trained to minimize the following loss, the negative log Gaussian likelihood (up to constants):

L = Σ_i ‖t_i − t̂_i‖^2    (2)

where t_i is the target vector of each instance and t̂_i is the prediction.³

Specific Settings Our model generalizes WM18, which can be recovered by setting M = I_3 and β = m.
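A minimal numpy sketch of a forward pass under this model follows. The way M and β are produced from the modifier embedding here (small linear maps with near-identity initialization) is our illustrative assumption, not the exact trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim = 8                             # stand-in for the 300-d word2vec embedding
e_m = rng.normal(size=emb_dim)          # toy modifier embedding
r = np.array([229.0, 0.0, 0.0])         # reference color
t = np.array([132.0, 0.0, 0.0])         # target color

# Hypothetical parameterization: linear maps from the embedding to the
# 3x3 transformation matrix M and the offset vector beta.
W_M = rng.normal(size=(9, emb_dim)) * 0.01
W_b = rng.normal(size=(3, emb_dim)) * 0.01
M = np.eye(3) + (W_M @ e_m).reshape(3, 3)   # near identity at initialization
beta = W_b @ e_m

t_hat = M @ r + beta                        # the model's prediction

# Squared error: the negative log Gaussian likelihood up to constants,
# given epsilon ~ N(0, sigma * I_3).
loss = float(np.sum((t - t_hat) ** 2))
```

Setting M to the identity and β to a shared modifier vector recovers the WM18 baseline, as the text notes.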
Another interesting instance of the model is obtained by setting M = (1 − α_m)I_3 and β = α_m m̄, which we call the Diagonal Covariance (DC) model. In contrast to the RGB model and WM18, which model the modifier vector as a function of r and m, the interpolation point m̄ in the DC model does not depend on r. Given r and m, the DC model first predicts m̄ and then applies a linear transformation to obtain the target color vector:

t̂ = (1 − α_m) r + α_m m̄    (3)

where α_m is a scalar which depends only on m and measures the distance moved from r toward m̄. In the DC model, m̄ is the interpolation point for the modifier, such as (0, 0, 0) for the modifier "darker".
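A sketch of the DC model's interpolation, with an assumed α for "darker" (in the model, the scalar is learned from the modifier):

```python
import numpy as np

def dc_predict(r, m_bar, alpha):
    """DC model prediction: move from the reference color r toward the
    modifier's interpolation point m_bar by a modifier-specific scalar
    alpha (supplied directly here; in the model it is learned from m)."""
    return (1.0 - alpha) * r + alpha * m_bar

# "darker": the interpolation point is black, (0, 0, 0).
r = np.array([229.0, 0.0, 0.0])
t_hat = dc_predict(r, m_bar=np.array([0.0, 0.0, 0.0]), alpha=0.42)
# with alpha = 0.42 this gives (132.82, 0, 0), close to the target
# (132, 0, 0) from the running "darker red" example
```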

Modeling in HSV space
Compared with the RGB color space, when modeling modifiers in HSV there are two main differences: hue is represented as an angular dimension, and the modifier vectors are more frequently parallel (see Figure 2d). As shown in Figure 2b, HSV space forms a cylindrical geometry with hue as its angular dimension, with red occurring at both 0 and 2π. For this reason, modelling the hue with a Gaussian regression loss is not appropriate. To account for the angularity, we model hue with a von Mises distribution, with the following pdf:

f(h | ĥ, k) = exp(k cos(h − ĥ)) / (2π I_0(k))    (4)

where the mean value ĥ represents the center of the hue dimension, k indicates the concentration about the mean, and I_0(k) is the modified Bessel function of order 0. When training the model, the parameter k is assumed constant, and thus the loss function is:

L_hue = −Σ_i k cos(h_i − ĥ_i)

where h_i is the hue value of the target color of each instance and ĥ_i is the prediction.
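A sketch of this hue loss with fixed concentration k, using numpy's np.i0 for the Bessel normalizer; the cosine term is what makes the loss respect the wrap-around at 2π:

```python
import numpy as np

def von_mises_nll(h, h_hat, k=2.0):
    """Negative log-likelihood of target hue angles h (radians) under a
    von Mises distribution with mean h_hat and fixed concentration k;
    np.i0 is the modified Bessel function of order 0."""
    return float(np.sum(-k * np.cos(h - h_hat) + np.log(2 * np.pi * np.i0(k))))

# A target just past the 0/2pi seam is treated as close to a prediction
# just before it, unlike a Gaussian (squared-error) loss on raw angles.
h = np.array([0.05])
near = von_mises_nll(h, np.array([2 * np.pi - 0.05]))  # 0.1 rad apart across the seam
far = von_mises_nll(h, np.array([np.pi]))              # opposite side of the hue circle
assert near < far
```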
The second difference to modeling in RGB space is that modifier behavior is simpler in HSV: vectors from reference colors to target colors are more likely to be parallel (see Figure 2d). As a result, we present an additive model in HSV space where a modifier m is modeled as a vector from r to t:

t̂ = r + m    (5)

where m is a function of both m and r. In addition, modifier modeling is split into two parts: the hue dimension is modeled with the von Mises distribution, and the other dimensions are modeled jointly as a bivariate normal distribution (see Equation 2). Notice that Equation (5) is the same equation as used by WM18; however, here it is applied in a color space that better fits its assumptions. WM18 is presented and evaluated only in RGB color space; to compare its performance with our models, we transform our output into RGB space.

Ensemble model
An ensemble model is trained to make the final prediction, which we frame as a hard binary choice selecting the color space appropriate for a given modifier. This works by applying the general RGB model and the HSV model (Equations 1 and 5) to get their predictions, and converting the HSV prediction into RGB space. The hard ensemble is then trained to predict which color space should be used based on the modifier m and the reference vector r, using as the learning signal which model's prediction had the smallest error against the target color for each instance (measured using Delta-E distance, see §3.2). The probability of the RGB model being selected is p = σ(f(m, r)), where σ is the logistic sigmoid function and f(m, r) is a function of the modifier m and the reference color r.
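A sketch of the hard selection step; the gating score stands in for f(m, r), which in the model is computed from the modifier embedding and the reference color:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_ensemble(rgb_pred, hsv_pred_rgb, score):
    """Pick one color-space model outright: the RGB model's prediction if
    p = sigmoid(score) >= 0.5, otherwise the HSV model's prediction
    (already converted into RGB space)."""
    return rgb_pred if sigmoid(score) >= 0.5 else hsv_pred_rgb

rgb_pred = np.array([132.0, 0.0, 0.0])
hsv_pred_rgb = np.array([120.0, 5.0, 5.0])
chosen = hard_ensemble(rgb_pred, hsv_pred_rgb, score=1.3)   # sigmoid(1.3) > 0.5
```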

Dataset
The dataset⁴ used to train and evaluate our models includes 415 triples (reference color label r, modifier m, and target color label t) in RGB space, presented by Winn and Muresan (2018). Munroe (2010) collected the original dataset, consisting of color description pairs gathered in an open online survey; the dataset was subsequently filtered by McMahan and Stone (2015). Winn and Muresan processed the color labels and converted pairs to triples, with 79 unique reference color labels and 81 unique modifiers.
We train models in both RGB and HSV color spaces, but samples in WM18 are presented only in RGB space. Because modifiers encode the general relationship between r and t, we use the same approach as Winn and Muresan (2018): using the mean value of a set of points to represent a color. A drawback of this approach is that it does not account for our uncertainty about the appropriate RGB encoding of a given color word.

Experiment Setup
Model configuration: The model presented by Winn and Muresan (2018) is initialized with Google's pretrained 300-d word2vec embeddings (Mikolov et al., 2013a,b), which are not updated during training. To perform comparable experiments, all models in this paper use the same pretrained embedding model. Other pretrained word embeddings, such as GloVe (Pennington et al., 2014) and BERT (Devlin et al., 2019), were also tested, but there was no significant difference in performance.

Architecture: An input modifier is represented as a vector by the pretrained word2vec embeddings and is followed by two fully connected layers (FC_1 and FC_2) with sizes 32 and 16 respectively. Let h_1 be the hidden state of FC_2; then h_1 = FC_2(FC_1(E_m, r), r), where E_m is the fixed, pretrained word2vec embedding of m and the reference color r is used as an input to both FC_1 and FC_2. After FC_2, all other layers operate on the hidden state h_1.
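A shape-level numpy sketch of this encoder, with random weights standing in for the trained parameters; the layer sizes follow the text, while the exact concatenation order and activation are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

e_m = rng.normal(size=300)                    # fixed word2vec embedding of the modifier
r = np.array([193.0, 169.0, 106.0]) / 255.0   # reference color, scaled to [0, 1]

# FC1: (embedding ++ reference color) -> 32 units
W1 = rng.normal(size=(32, 300 + 3)) * 0.05
h0 = relu(W1 @ np.concatenate([e_m, r]))

# FC2: (FC1 output ++ reference color again) -> 16 units,
# since r feeds both layers.
W2 = rng.normal(size=(16, 32 + 3)) * 0.05
h1 = relu(W2 @ np.concatenate([h0, r]))       # h1 feeds all later layers
```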
Evaluation: Following Winn and Muresan (2018), we evaluate performance in 5 distinct input conditions:
(1) Seen Pairings: the triple (r, m, t) has been seen during training.
(2) Unseen Pairings: both r and m have been seen in the training data, but not the triple (r, m, t).
(3) Unseen Ref. Color: r has not been seen in training, while m has been seen.
(4) Unseen Modifiers: m has not been seen in training, while r has been seen.
(5) Fully Unseen: neither r nor m has been seen in training.
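The five conditions amount to a small classification over what the training split contained; a sketch (in practice the split is fixed when the dataset is partitioned):

```python
def eval_condition(triple, seen_triples, seen_refs, seen_mods):
    """Classify a test (reference, modifier, target) triple into one of the
    five evaluation conditions described above."""
    r, m, t = triple
    if triple in seen_triples:
        return "seen pairings"
    if r in seen_refs and m in seen_mods:
        return "unseen pairings"
    if m in seen_mods:                      # r unseen, m seen
        return "unseen ref. color"
    if r in seen_refs:                      # r seen, m unseen
        return "unseen modifiers"
    return "fully unseen"                   # neither seen

c = eval_condition(("red", "greeny", "greenish red"),
                   seen_triples=set(), seen_refs={"red"}, seen_mods={"greenish"})
# -> "unseen modifiers"
```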
Because of the small size of the dataset, we report the average performance over 5 runs with different random seeds. Two scores, cosine similarity and Delta-E, are used to evaluate performance. Cosine similarity measures the difference in vector direction in color space, and Delta-E is a perceptually motivated metric for measuring color differences. Delta-E was first defined as the Euclidean distance in CIELAB color space (McLaren, 1976); lower Delta-E values are preferable, as they indicate a better match with the target color. Luo et al. (2001) present the latest and most accurate CIE color difference metric, Delta-E 2000, which improves the original formula by taking into account weighting factors and fixing lightness inaccuracies. Our models are evaluated with Delta-E 2000.

Table 1 shows the results. Compared with WM18, our RGB model outperforms it under all conditions. As we have stated, our model is a generalization of their approach: the more complex transformation matrix in our RGB model is able to learn more information, such as the effects of covariance between color channels, and thus achieves better performance than WM18. Note that our reimplementation of the original WM18 system leads to significantly better performance.⁵

According to cosine similarity, the HSV model is superior for most test conditions (confirming our hypothesis about simpler modifier behaviour in this space). However, for Delta-E, the RGB model and the ensemble perform better. Unlike cosine similarity, Delta-E is sensitive to differences in vector length, and we would argue it is the most appropriate metric, because lengths are critical to measuring the lightness and darkness of colors. Accordingly, the HSV model does worse under this metric, as it more directly models the direction of color modifiers, which in turn leads to errors in its length predictions.
Overall the ensemble does well according to both metrics, and has the best performance for several test conditions with Delta-E.
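For reference, a sketch of the original 1976 Delta-E (Euclidean distance in CIELAB, via the standard sRGB → XYZ → Lab conversion with a D65 white point); the Delta-E 2000 formula used in our evaluation adds weighting and correction terms on top of this and is considerably longer:

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 white point)."""
    c = np.asarray(rgb, dtype=float) / 255.0
    # Inverse sRGB companding to linear RGB.
    c = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    M = np.array([[0.4124, 0.3576, 0.1805],     # linear RGB -> XYZ
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = (M @ c) / np.array([0.95047, 1.0, 1.08883])   # normalize by white
    f = np.where(xyz > 0.008856, np.cbrt(xyz), 7.787 * xyz + 16.0 / 116.0)
    return np.array([116.0 * f[1] - 16.0,      # L*
                     500.0 * (f[0] - f[1]),    # a*
                     200.0 * (f[1] - f[2])])   # b*

def delta_e_76(rgb1, rgb2):
    """CIE76 Delta-E: Euclidean distance between the two colors in CIELAB."""
    return float(np.linalg.norm(srgb_to_lab(rgb1) - srgb_to_lab(rgb2)))

assert delta_e_76([132, 0, 0], [132, 0, 0]) == 0.0
# A prediction nearer the target scores a lower Delta-E.
assert delta_e_76([140, 0, 0], [132, 0, 0]) < delta_e_76([229, 0, 0], [132, 0, 0])
```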

Results
Error Analysis We first focused error analysis on predictions for "Unseen Modifiers" and "Fully Unseen" instances. As shown in Table 1, our models are able to predict target colors given seen modifiers but fail to make predictions for instances with unseen modifiers. All modifiers are represented by word2vec embeddings, so we expect that predictions for unseen modifiers should be close to those for instances with similar seen modifiers. For example, the prediction for a reference color r and the modifier "greeny" should be similar to the prediction for the same reference color r and a similar seen modifier, e.g., "green" or "greenish". However, the prediction for "greeny" is more similar to that for "bluey", a consequence of these terms having highly similar word embeddings (as do other color modifiers with a -y suffix, irrespective of their color). This is related to the problem reported by Mrkšić et al. (2016), whereby words and their antonyms often have similar embeddings, as a result of sharing similar distributional contexts. Accordingly, in the unseen modifier condition our model is often misled by attempting to generalise from nearest-neighbour modifiers which have a different meaning.

Baroni and Zamparelli (2010) were the first to propose an approach to adjective-noun (AN) composition for corpus-based distributional semantics which represents nouns as vectors and adjectives as matrices acting on nominal vectors. However, it is hard to gain an intuition for what such a transformation does, since these embeddings generally live on a highly structured but unknown manifold. In our case, we operate on colors, and we actually know the geometry of the color spaces we use. This makes it easier to interpret the learned mapping (see Figures 2c and 2d, which show convergence to a point in RGB space and parallelism in HSV space).

Conclusion and Future Work
In this paper, we proposed novel models for predicting colors based on textual modifiers, incorporating a matrix transformation rather than the previous, purely additive method. As well as this more general approach, we exploit the properties of another color space, namely HSV, in which modifier behaviours are often simpler. Overall, our method leads to state-of-the-art performance on a standard dataset.
In future work, we intend to develop more accurate modifier representations to allow for better generalisation to unseen modifiers. This might be achieved by using a compositional sub-word representation for modifiers, such as a character-level encoding. Finally, we also aim to acquire larger datasets, a crucial step towards comparing the generalization performance of different color-modifier models. Models trained on larger datasets are likely to be more applicable to real-world problems, since they learn representations for more color terms.