Scalar Adjective Identification and Multilingual Ranking

The intensity relationship that holds between scalar adjectives (e.g., nice < great < wonderful) is highly relevant for natural language inference and common-sense reasoning. Previous research on scalar adjective ranking has focused on English, mainly due to the availability of datasets for evaluation. We introduce a new multilingual dataset in order to promote research on scalar adjectives in new languages. We perform a series of experiments and set performance baselines on this dataset, using monolingual and multilingual contextual language models. Additionally, we introduce a new binary classification task for English scalar adjective identification which examines the models' ability to distinguish scalar from relational adjectives. We probe contextualised representations and report baseline results for future comparison on this task.


Introduction
Scalar adjectives relate the entities they modify to specific positions on the evoked scale (e.g., GOODNESS, TEMPERATURE, SIZE): a wonderful view is nicer than a good view, and one would probably prefer a delicious to a tasty meal. But not all adjectives express intensity or degree. Relational adjectives are derived from nouns (e.g., wood → wooden, chemistry → chemical), have no antonyms and serve to classify nouns (e.g., a wooden table, a chemical substance) (McNally and Boleda, 2004). The distinction between scalar and relational adjectives is an important one. Identifying adjectives that express intensity can serve to assess the emotional tone of a given text, as opposed to words that mostly contribute to its descriptive content. Additionally, estimating the intensity of a scalar adjective is useful for textual entailment (wonderful ⊨ good but good ⊭ wonderful), product review analysis and recommendation systems, emotional chatbots and question answering (de Marneffe et al., 2010).

Table 1: Example translations from each dataset. "||" indicates adjectives at the same intensity level (ties).

DEMELO
  EN: dim < gloomy < dark < black
  FR: terne < sombre < foncé < noir
  ES: sombrío < tenebroso < oscuro < negro
  EL: αμυδρός || αχνός < μουντός < σκοτεινός < μαύρος

WILKINSON
  EN: bad < awful < terrible < horrible
  FR: mauvais < affreux < terrible < horrible
  ES: malo < terrible < horrible < horroroso
  EL: κακός < απαίσιος < τρομερός < φρικτός
Work on scalar adjectives has until now revolved around pre-compiled datasets (de Melo and Bansal, 2013; Taboada et al., 2011; Wilkinson and Oates, 2016; Cocos et al., 2018). Reliance on external resources has also restricted research to English, and has led to the prevalence of pattern-based and lexicon-based approaches. Recently, Garí Soler and Apidianaki (2020) showed that BERT representations (Devlin et al., 2019) encode intensity relationships between English scalar adjectives, paving the way for applying contextualised representations to intensity detection in other languages. In our work, we explicitly address the scalar adjective identification task, overlooked until now due to the focus on pre-compiled resources. We furthermore propose to extend scalar adjective ranking to new languages. We make available two new benchmark datasets for scalar adjective identification and multilingual ranking: (a) SCAL-REL, a balanced dataset of relational and scalar adjectives which can serve to probe model representations for scalar adjective identification; and (b) MULTI-SCALE, a scalar adjective dataset in French, Spanish and Greek. In order to test contextual models on these two tasks, the adjectives need to be seen in sentential context. We thus provide, alongside the datasets, sets of sentences that can be used to extract contextualised representations, in order to promote model comparability. We conduct experiments and report results obtained with simple baselines and state-of-the-art monolingual and multilingual models on these new benchmarks, opening up avenues for research on sentiment analysis and emotion detection in different languages.


The Datasets

The MULTI-SCALE Dataset
We translate two English scalar adjective datasets into French, Spanish and Greek: DEMELO consists of 87 hand-crafted half-scales (de Melo and Bansal, 2013), and WILKINSON contains 12 full scales (Wilkinson and Oates, 2016). We use the partitioning of WILKINSON into 21 half-scales proposed by Cocos et al. (2018). In what follows, we use the term "scale" to refer to half-scales.
The two translators have (near-)native proficiency in each language. They were shown the adjectives in the context of a scale. This context narrows down the possible translations of polysemous adjectives to the ones expressing the meaning described by the scale. For example, the Spanish translations proposed for the adjective hot in the scales {warm < hot} and {flavorful < zesty < hot || spicy} are caliente and picante, respectively. Additionally, the translators were instructed to preserve the number of words in the original scales when possible. In some cases, however, they proposed alternative translations for English words, or none if an adequate translation could not be found. As a result, the translated datasets have different numbers of words and ties.

In French, Spanish and Greek, adjectives are inflected to agree with the nouns they modify. In order to keep the method resource-light, we gather sentences that contain the adjectives in their unmarked form.
For each scale s, we randomly select ten sentences from OSCAR where adjectives from s occur. Then, we generate additional sentences through lexical substitution. Specifically, for every sentence (context) c that contains an adjective a_i from scale s, we replace a_i with every other adjective a_j ∈ s (j ≠ i). This process results in a total of |s| × 10 sentences per scale and ensures that every a ∈ s is seen in the same ten contexts. For English, we use the ukWaC-Random set of sentences compiled by Garí Soler and Apidianaki (2020), which contains sentences randomly collected from the ukWaC corpus (Baroni et al., 2009).
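The substitution step can be sketched as follows (a minimal illustration; `generate_scale_sentences` and the whitespace-based matching are our own simplifications, and real preprocessing would use a proper tokeniser):

```python
import random

def generate_scale_sentences(scale, corpus_sentences, n=10, seed=0):
    """For a scale (list of adjectives), pick up to n corpus sentences
    containing some scale adjective, then substitute every scale member
    into each selected sentence. Each adjective is thus seen in the
    same set of contexts, yielding |scale| * n sentences in total."""
    rng = random.Random(seed)
    # Keep sentences that contain at least one scale adjective.
    # Padding with spaces is a crude word-boundary check that misses
    # inflected forms; a tokeniser and lemmatiser would be used in practice.
    candidates = [s for s in corpus_sentences
                  if any(f" {a} " in f" {s} " for a in scale)]
    picked = rng.sample(candidates, min(n, len(candidates)))
    out = []
    for sent in picked:
        source = next(a for a in scale if f" {a} " in f" {sent} ")
        for adj in scale:  # includes the original adjective itself
            out.append(sent.replace(f" {source} ", f" {adj} ", 1))
    return out
```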

The SCAL-REL Dataset
SCAL-REL contains scalar adjectives from the DEMELO, WILKINSON and CROWD (Cocos et al., 2018) datasets (i.e., 79 additional half-scales compared to MULTI-SCALE). We use all unique scalar adjectives in the datasets (443 in total), and subsample the same number of relational adjectives, which are labelled with the pertainym relationship in WordNet (Fellbaum, 1998). There are 4,316 unique such adjectives in WordNet, including many rare or highly technical terms (e.g., birefringent, anaphylactic). Scalar adjectives in our datasets are much more frequent than these relational adjectives; their average frequencies in Google Ngrams (Brants and Franz, 2006) are 27M and 1.6M, respectively. We balance the relational adjective set by frequency, subsampling 222 frequent and 221 rare adjectives. We use the mean frequency of the 4,316 relational adjectives in Google Ngrams as a threshold. We propose a train/dev/test split of the SCAL-REL dataset (65/10/25%), preserving the balance between the two classes (scalar and relational) in each set. To obtain contextualised representations, we collect ten random sentences from ukWaC for each relational adjective. For scalar adjectives, we use the ukWaC-Random set of sentences (cf. Section 2.1).
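The frequency-balanced subsampling of relational adjectives can be sketched as follows (an illustrative sketch; `balanced_subsample` is a hypothetical helper, and `freqs` stands in for Google Ngrams counts):

```python
import random

def balanced_subsample(adjs, freqs, k, seed=0):
    """Split candidate relational adjectives into frequent/rare halves
    around the mean corpus frequency, then sample roughly k/2 items
    from each half, mirroring the frequency-balancing step described
    in the text. `freqs` maps adjective -> corpus count."""
    rng = random.Random(seed)
    mean_f = sum(freqs[a] for a in adjs) / len(adjs)
    frequent = [a for a in adjs if freqs[a] >= mean_f]
    rare = [a for a in adjs if freqs[a] < mean_f]
    half, rest = k // 2, k - k // 2
    return (rng.sample(frequent, min(half, len(frequent)))
            + rng.sample(rare, min(rest, len(rare))))
```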

Methodology
Models We conduct experiments with state-of-the-art contextual language models and several baselines on the MULTI-SCALE dataset. We use the pre-trained cased and uncased multilingual BERT models (Devlin et al., 2019) and report results of the best variant for each language. We also report results obtained with four monolingual models: bert-base-uncased (Devlin et al., 2019), flaubert_base_uncased (Le et al., 2020), bert-base-spanish-wwm-uncased (Cañete et al., 2020), and bert-base-greek-uncased-v1 (Koutsikakis et al., 2020). We compare to results obtained using fastText static embeddings in each language (Grave et al., 2018).
For a scale s, we feed the corresponding set of sentences to a model and extract the contextualised representation of every a ∈ s from every layer. When an adjective is split into multiple BPE units, we average the representations of all wordpieces (we call this approach "WP") or of all pieces but the last one ("WP-1"). The intuition behind excluding the last WP is that the ending of a word often corresponds to a suffix carrying morphological information.
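The two pooling strategies can be sketched as follows (a minimal sketch; `pool_wordpieces` is our own illustrative helper operating on pre-extracted wordpiece vectors):

```python
import numpy as np

def pool_wordpieces(wp_vectors, strategy="WP"):
    """Average the wordpiece vectors of one adjective into a single
    representation. "WP" averages all pieces; "WP-1" drops the last
    piece (often a morphological suffix), unless the word consists of
    a single piece. wp_vectors: array of shape (n_pieces, dim)."""
    wp_vectors = np.asarray(wp_vectors, dtype=float)
    if strategy == "WP-1" and len(wp_vectors) > 1:
        wp_vectors = wp_vectors[:-1]
    return wp_vectors.mean(axis=0)
```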
The DIFFVEC method We apply to our dataset the adjective ranking method proposed by Garí Soler and Apidianaki (2020), which relies on an intensity vector (called dVec) built from BERT representations. The method yields state-of-the-art results with very little data, which makes it easily adaptable to new languages. We build a sentence-specific intensity representation (dVec) by subtracting the vector of a mild-intensity adjective, a_mild (e.g., smart), from that of a_ext, an extreme adjective on the same scale (e.g., brilliant) in the same context.
We create a dVec representation from every sentence available for these two reference adjectives, and average them to obtain the global dVec for that pair. Garí Soler and Apidianaki (2020) showed that a single positive adjective pair (DIFFVEC-1 (+)) is enough for obtaining highly competitive results in English. We apply this method to the other languages using the translations of a positive English (a_mild, a_ext) pair from the CROWD dataset: perfect-good. Additionally, we learn two dataset-specific representations: one by averaging the dVec's of all (a_ext, a_mild) pairs in WILKINSON that do not appear in DEMELO (DIFFVEC-WK), and another one from pairs in DEMELO that are not in WILKINSON (DIFFVEC-DM). We rank adjectives in a scale by their cosine similarity to each dVec: the higher the similarity, the more intense the adjective.
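The construction of the global dVec and the ranking step can be sketched as follows (an illustrative sketch with toy vectors; `build_dvec` and `rank_scale` are hypothetical names, and the inputs stand in for contextualised representations of a_ext and a_mild in shared contexts):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_dvec(ext_vectors, mild_vectors):
    """Global intensity vector: average, over shared contexts, of the
    difference between the extreme and the mild adjective's
    contextualised vectors (a sketch of the DIFFVEC construction)."""
    diffs = np.asarray(ext_vectors, float) - np.asarray(mild_vectors, float)
    return diffs.mean(axis=0)

def rank_scale(adj_vectors, dvec):
    """Rank adjectives from mild to extreme by cosine similarity to
    dvec (higher similarity = more intense). adj_vectors maps each
    adjective to its averaged contextualised vector."""
    return sorted(adj_vectors, key=lambda a: cosine(adj_vectors[a], dvec))
```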
Baselines We compare our results to a frequency and a polysemy baseline (FREQ and SENSE). These baselines rely on the assumption that low-intensity words (e.g., nice, old) are more frequent and polysemous than their extreme counterparts (e.g., awesome, ancient). Extreme adjectives often limit the denotation of a noun to a smaller class of referents than mild-intensity adjectives (Geurts, 2010). For example, an "awesome view" is rarer than a "nice view". This assumption was confirmed for English by Garí Soler and Apidianaki (2020). FREQ orders words in a scale according to their frequency: words with higher frequency have lower intensity. Given the strong correlation between word frequency and number of senses (Zipf, 1945), we also expect highly polysemous words (which are generally more frequent) to have lower intensity. This is captured by the SENSE baseline, which orders the words according to their number of senses: words with more senses have lower intensity.
Frequency is taken from Google Ngrams for English, and from OSCAR for the other three languages. The number of senses is retrieved from WordNet for English, and from BabelNet (Navigli and Ponzetto, 2012) for Spanish and French. For adjectives that are not present in BabelNet, we use a default value which corresponds to the average number of senses of the adjectives in the dataset (DEMELO or WILKINSON) for which this information is available. We omit the SENSE baseline for Greek due to low coverage.
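The two baselines, including the fallback for adjectives missing from the sense inventory, can be sketched as follows (`sense_scores` and `baseline_rank` are our own illustrative helpers):

```python
def sense_scores(adjs, senses):
    """SENSE baseline scores with the fallback described in the text:
    adjectives missing from the sense inventory receive the average
    sense count of the adjectives that are covered."""
    known = [senses[a] for a in adjs if a in senses]
    default = sum(known) / len(known)
    return {a: senses.get(a, default) for a in adjs}

def baseline_rank(adjs, scores):
    """Order adjectives from mild to extreme: higher frequency or
    polysemy is assumed to mean lower intensity, so we sort by
    descending score."""
    return sorted(adjs, key=lambda a: scores[a], reverse=True)
```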

Evaluation
We use evaluation metrics traditionally employed for ranking evaluation (de Melo and Bansal, 2013; Cocos et al., 2018): pairwise accuracy (P-ACC), Kendall's τ and Spearman's ρ.

Results on this task are given in Table 3. Monolingual models perform consistently better than the multilingual model, except for French. We report the best wordpiece approach for each model: WP-1 works better with all monolingual models and with the multilingual model for English. Using all wordpieces (WP) is a better choice for the multilingual model in the other languages. We believe the lower performance of WP-1 in these settings is due to the multilingual BPE vocabulary being mostly English-driven; this naturally results in highly arbitrary partitionings in these languages (e.g., EL: γιγάντιος (gigantic) → γ-ι-γ-άν-τιος). Tokenisers of the monolingual models instead tend to split words in a way that more closely reflects the morphology of the language (e.g., ES: fantástico → fantás-tico; EL: γιγάντιος → γιγά-ντι-ος). Detailed results can be found in Appendix A.

We observe that DIFFVEC-1 (+) yields comparable and sometimes better results than DIFFVEC-DM and DIFFVEC-WK, which are built from multiple pairs. This is especially important in the multilingual setting, since it shows that a single pair of adjectives is enough for obtaining good results in a new language. The best layer varies across models and configurations. The monolingual French and Greek models generally obtain their best results in earlier layers. A similar behaviour is observed to some extent for the multilingual model for English, whereas for the other models performance improves in the upper half of the Transformer network (layers 6-12). This shows that the semantic information relevant for adjective ranking is not situated at the same level of the Transformer in different languages. We plan to investigate this finding further in future work.
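Of the three metrics, pairwise accuracy is the least standard; it can be sketched as follows (a minimal sketch; `pairwise_accuracy` and the rank-dictionary input format are our own choices, with tied adjectives sharing a rank):

```python
from itertools import combinations

def pairwise_accuracy(gold, pred):
    """P-ACC: fraction of adjective pairs whose relative order in the
    predicted ranking agrees with the gold ranking. gold/pred map each
    adjective to an intensity rank; tied adjectives share a rank."""
    pairs = list(combinations(gold, 2))
    correct = 0
    for a, b in pairs:
        g = (gold[a] > gold[b]) - (gold[a] < gold[b])  # -1, 0 or 1
        p = (pred[a] > pred[b]) - (pred[a] < pred[b])
        correct += (g == p)
    return correct / len(pairs)
```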
The lower results in French may be due to the larger number of ties in the dataset compared to other languages. The baselines obtain competitive results, showing that the underlying linguistic intuitions hold across languages. The best models beat the baselines in all configurations except for Greek on the DEMELO dataset, where FREQ and static embeddings obtain higher results. Overall, results are lower than those reported for English, which shows that there is room for improvement in new languages.

Figure 1: Illustration of two scalar adjectives that are close to dVec and to its opposite (which represents low intensity). The red vector describes a relational adjective that is perpendicular to dVec.

Scalar Adjective Identification
For each English adjective in the SCAL-REL dataset, we generate a representation from the available ten sentences (cf. Section 2.2) using the bert-base-uncased model (with WP and WP-1). We experiment with a simple logistic regression classifier that takes the averaged representation of an adjective (ADJ-REP) as input and predicts whether it is scalar or relational. We also apply the DIFFVEC-1 (+) method to this task and measure how intense an adjective is by calculating its cosine similarity with dVec. The absolute value of the cosine indicates how clearly an adjective encodes the notion of intensity. In Figure 1, we show two scalar adjective vectors with negative and positive cosine similarity to dVec, and another vector that is perpendicular to dVec, i.e., describing a relational adjective for which the notion of intensity does not apply. We train a logistic regression model to find a cosine threshold separating scalar from relational adjectives (DV-1 (+)). Finally, we also use as a feature the cosine similarity of the adjective representation to the vector of "good", which we consider a prototypical scalar adjective (PROTO-SIM).
The best BERT layer is selected based on the accuracy obtained on the development set. We report accuracy on the test set. The baseline classifiers only use frequency (FREQ) and polysemy (SENSE) as features. We use these baselines on SCAL-REL because the WordNet pertainyms included in the dataset are rarer than the scalar adjectives. The intuition behind the SENSE baseline explained in Section 3.1 also applies here.

Results on this task are given in Table 4. The classifier that relies on ADJ-REP BERT representations can distinguish the two types of adjectives with very high accuracy (0.946), closely followed by fastText embeddings (0.929). The DV-1 (+) method does not perform as well as the classifier based on ADJ-REP, which is not surprising since it relies on a single feature (the absolute value of the cosine between dVec and ADJ-REP). Comparing ADJ-REP to a typical scalar word (PROTO-SIM) yields better results than DV-1 (+). The SENSE and FREQ baselines can capture the distinction to some extent. Relational adjectives in our training set are less frequent and have fewer senses on average (2.59) than scalar adjectives (5.30). A closer look at the errors of the best model reveals that they concern tricky cases: one of the four misclassified scalar adjectives is derived from a noun (microscopic), whilst five out of eight wrongly classified relational adjectives can have a scalar interpretation (e.g., sympathetic, imperative). Overall, supervised models obtain very good results on this task. SCAL-REL will enable research on unsupervised methods that could be used in other languages.
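The single-feature DV-1 (+) classifier amounts to finding a decision boundary on |cos(ADJ-REP, dVec)|; a simple grid-search stand-in for the logistic regression (our own simplification, with `train_threshold` a hypothetical helper) could look like:

```python
import numpy as np

def train_threshold(cosines, labels, steps=101):
    """Find the |cos(adj, dVec)| threshold that best separates scalar
    (label 1) from relational (label 0) adjectives on training data:
    a grid-search stand-in for the single-feature logistic regression
    (DV-1 (+)) classifier described in the text."""
    cosines = np.abs(np.asarray(cosines, dtype=float))
    labels = np.asarray(labels)
    best_t, best_acc = 0.0, -1.0
    for t in np.linspace(0, 1, steps):
        # Predict "scalar" when the absolute cosine exceeds the threshold.
        acc = ((cosines >= t) == labels).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```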

Conclusion
We propose a new multilingual benchmark for scalar adjective ranking, and set performance baselines on it using monolingual and multilingual contextual language model representations. Our results show that adjective intensity information is present in the contextualised representations in the studied languages. We also propose a new classification task and a dataset that can serve as a benchmark for estimating the models' capability to identify scalar adjectives when relevant datasets are not available. We make our datasets and sentence contexts available to promote future research on scalar adjective detection and analysis in different languages.


A Comparison of Wordpiece Selection Methods
Table 3 of the main paper contains results of the DIFFVEC method with the best approach for selecting wordpieces (WPs) for each model. In Table 5, we present results obtained using the alternative approach for each model and language:
• for all monolingual models and the multilingual model for English, Table 5 contains results obtained with the WP approach;
• for the multilingual model in the other languages, we show results with WP-1.
The best approach for each model was determined by comparing the average scores of the two approaches across the different methods. Some configurations improve, but each alternative approach yields overall worse results per model, especially in Spanish. Differences between WP and WP-1 are generally more pronounced in the multilingual models than in the monolingual ones.

Table 5: Results of DIFFVEC (DV) methods with contextualised representations derived from monolingual and multilingual models for each language, using an alternative approach to selecting wordpieces (WP, WP-1) than the one used for the results reported in Table 3. For all languages but Greek, the multilingual model is cased.