Predicting Degrees of Technicality in Automatic Terminology Extraction

While automatic term extraction is a well-researched area, computational approaches to distinguish between degrees of technicality are still understudied. We semi-automatically create a German gold standard of technicality across four domains, and illustrate the impact of a web-crawled general-language corpus on technicality prediction. When defining a classification approach that combines general-language and domain-specific word embeddings, we go beyond previous work and align vector spaces to gain comparative embeddings. We suggest two novel models to exploit general- vs. domain-specific comparisons: a simple neural network model with pre-computed comparative-embedding information as input, and a multi-channel model computing the comparison internally. Both models outperform previous approaches, with the multi-channel model performing best.


Introduction
Automatic term extraction, i.e. the task of extracting linguistic expressions characteristic of a specialized domain, is a long-researched field within natural language processing. Assessing the technicality of the extracted terms, however, is still a niche within this area: technicality refers to the degree to which a term is specialized and exclusively used by experts in a domain. To date, studies on term technicality are mostly restricted to medical terminology and relate to the communication between doctors and patients. Especially with the growing number of domain-specific websites with both lay and expert users (e.g. DIY 'do-it-yourself' communities, such as 1-2-do.com), the communication between experts and laypeople becomes increasingly important across all specialized domains. Furthermore, term technicality prediction is important for a range of tasks such as automatic thesaurus creation, assessing text specialization, and domain knowledge acquisition. Above all, predicting technicality can be considered a more fine-grained and expressive form of terminology extraction.
In this work, we first semi-automatically collect German specialized domain corpora to create a gold standard of term technicality across four domains: automotive, cooking, hunting and DIY. Based on a qualitative analysis of terminological phenomena and variants of ambiguity across domain-specific and general-language corpora, we then suggest two methods to explicitly integrate not only vector space model representations derived from the corpora, but also comparisons across the vector spaces. In a first approach, we enrich the combined general-language and domain-specific word embeddings with a difference vector as input for a classification system. In a second approach we design a multi-channel feed-forward neural network with a Siamese network component to represent the vector comparison internally.

Related Work
Existing studies on technicality predominantly focus on levels of familiarity or difficulty of terminology in medical, biomedical or health domains. Term familiarity refers to a user's subjective understanding of term technicality. These studies typically rely on classical readability features such as frequency, term length, syllable count, the Dale-Chall readability formula and affixes (Zeng et al., 2005; Zeng-Treitler et al., 2008; Grabar et al., 2014; Vinod Vydiswaran et al., 2014). They further make use of domain-specific terminology attributes such as neo-classical word components, given that medical terminology is strongly influenced by Greek and Latin (Deléger and Zweigenbaum, 2009; Bouamor et al., 2016). Besides the feature specification, the majority of studies exploit contrastive approaches.
Contrastive approaches compare a term's distribution in a domain corpus and a reference corpus, for example a general-language corpus. For technicality prediction, expert (medical) texts are often compared against reference lay texts. Only a small number of studies rely on context-based approaches, e.g. Zeng-Treitler et al. (2008) use a contextual network; Bouamor et al. (2016) exploit language models; Pérez (2016) compares collocation networks.
For standard term extraction, contrastive techniques represent one of the main strands of methodologies, by comparing a term candidate's frequencies in a domain-specific and a general-language corpus (Ahmad et al., 1994; Rayson and Garside, 2000; Drouin, 2003; Kit and Liu, 2008; Bonin et al., 2010; Kochetkova, 2015; Lopes et al., 2016; Mykowiecka et al., 2018, i.a.). Recent approaches use word embeddings trained separately on contrastive corpora; e.g. Amjadian et al. (2016, 2018) concatenate general and domain-specific word embeddings and use them as input for classifiers, such as a multilayer perceptron. Similarly, Hazem and Morin (2017) and Liu et al. (2018) apply such a concatenation to represent a term in one language, as a data enrichment pre-step for bilingual terminology extraction.
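To illustrate the frequency-based contrastive idea, the following minimal sketch scores a term candidate by the log ratio of its relative frequencies in a domain corpus and in a general-language corpus. The function name `contrastive_score`, the additive smoothing and the toy counts are assumptions made for illustration only; the cited approaches use a range of more refined measures.

```python
import math
from collections import Counter

def contrastive_score(word, domain_counts, general_counts, smoothing=1.0):
    """Log ratio of smoothed relative frequencies (domain vs. general)."""
    domain_total = sum(domain_counts.values())
    general_total = sum(general_counts.values())
    # Additive smoothing (an assumption) avoids division by zero for unseen words.
    domain_rel = (domain_counts.get(word, 0) + smoothing) / (domain_total + smoothing)
    general_rel = (general_counts.get(word, 0) + smoothing) / (general_total + smoothing)
    return math.log(domain_rel / general_rel)

# Toy counts, invented for this example.
domain_counts = Counter({"blanchieren": 40, "Wasser": 60})
general_counts = Counter({"blanchieren": 2, "Wasser": 500, "Haus": 498})

scores = {w: contrastive_score(w, domain_counts, general_counts) for w in domain_counts}
```

A positive score marks a word that is relatively more frequent in the domain corpus; in the toy data, blanchieren scores far higher than Wasser.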
In sum, approaches using contrastive corpora are popular in both automatic term extraction and term technicality prediction studies. The few approaches that use word embeddings as basis for a contrastive approach separately train word embeddings on general-language and domain corpora. In our work, we extend these methodologies by aligning vector spaces in order to more adequately represent meaning variation across corpora.

Definition of Technicality
According to Ha and Hyland (2017), there is no consensus among researchers about what exactly defines technicality. They provide an overview of what characterizes technical vocabulary, and observe two main categories. On the one hand, technical terms often exhibit a narrow range of senses specific to the domain. They are only understood by a limited set of people, because they require domain knowledge. On the other hand, there are terms which are also frequently used in general language. These terms are ambiguous: they carry specialized meanings in a particular domain which are different from the general-language meanings.
Like Ha and Hyland (2017), we see technicality as a continuum. In this paper, we adopt a simplified view and distinguish between three broad classes of technicality: technical terms, basic terms and non-terms.
Table 1 (corpus sizes): "Preprocessed" refers to the lemmatized corpus without punctuation, "Lemma:POS" to the version reduced to content words.

Data and Gold Standard Creation
Data. We collect German texts for four domains: automotive, cooking, DIY and hunting. Besides including technical handbooks, we crawl topic-specific data from Wikipedia and similar resources such as cooking recipes from cooking homepages (e.g. kochwiki.de), and car repair and DIY instructions from wikihow.de. As the general-language reference corpus, we use SdeWaC (Faaß and Eckart, 2013), a cleaned version of the web-crawled corpus deWaC (Baroni et al., 2009). All corpora are lemmatized and POS-tagged with the TreeTagger (Schmid, 1995), and reduced to content words (nouns, verbs and adjectives). We follow the preprocessing steps described in Schlechtweg et al. (2019) that led to the best results in that study. The corpus sizes are shown in Table 1.
Gold Standard. As term candidates, we select all words with a minimum frequency of 10 in both the domain corpus and SdeWaC. The gold standard thus contains both simple and complex terms, the latter in the form of closed compounds. We did not extract multi-word terms other than closed compounds because we would have needed specific procedures to identify them (e.g. by chunking or by using association measures to identify valid collocations). Even more importantly, multi-word expressions are prone to variation (e.g. one could say 'wood drill' or 'drill for wood'), and we would likely not find all variants in the glossaries and other resources we use to create the gold standard.
Instead of relying on labour-intensive human annotations, we determine the technicality labels semi-automatically. First, we collect domain-specific glossaries for each domain, i.e. textual glosses and specialized terms with their meanings. These glossaries contain terms which require domain knowledge (especially if they are ambiguous) and thus need to be explained to a lay person, i.e. they contain technical terms. Secondly, we collect thematic basic vocabulary lists (from thematic base vocabulary books, thematic vocabulary training lists for foreign apprentices, etc.). These lists contain the basic terminology of a domain, with a low level of technicality. Finally, we collect indices and tables of contents of domain-specific handbooks, which include all kinds of terminological vocabulary. We label the data as follows:
1. technical term: a word is contained in a glossary, but not in a basic vocabulary list
2. basic term: a word is contained in a basic vocabulary list, but not in a glossary
3. non-term: all other words, which do not overlap in more than 4 characters with any term in the glossaries, the basic vocabulary lists, the indices or the tables of contents
The resulting sizes of the gold standards per domain are presented in Table 2. Overall, our semi-automatic labeling method leads to 1,690 technical terms, 1,525 basic terms and 10,956 non-terms, a total of 14,171 term candidates. To evaluate the quality of the gold standard, we randomly extract 30 words per domain and per system-assigned label (30 × 4 × 3 = 360 words in total). Three annotators (including one of the authors) rated the labeling of each word, which was presented together with three random context sentences. We obtain an average Cohen's κ inter-annotator agreement of 0.50 and an average agreement with the gold standard of 0.47.
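The three labeling rules can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that "overlap" means the longest common substring of two lowercased words, and that words listed in a resource without matching rule 1 or rule 2 stay unlabeled. The function names and the toy word lists are hypothetical.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous character sequence shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def assign_label(word, glossary, basic_vocab, other_terms, max_overlap=4):
    """Return 'technical', 'basic', 'non-term', or None if no rule applies."""
    in_glossary = word in glossary
    in_basic = word in basic_vocab
    if in_glossary and not in_basic:
        return "technical"                      # rule 1
    if in_basic and not in_glossary:
        return "basic"                          # rule 2
    all_terms = glossary | basic_vocab | other_terms
    if word in all_terms:
        return None                             # listed, but neither rule 1 nor 2 (assumption)
    # Rule 3: no listed term may share more than max_overlap characters.
    if all(longest_common_substring(word.lower(), term.lower()) <= max_overlap
           for term in all_terms):
        return "non-term"
    return None

# Toy resources, invented for this example.
glossary = {"entgraten", "blanchieren"}
basic_vocab = {"Hammer", "kochen"}
other_terms = {"Bohrmaschine"}              # e.g. from a handbook index

labels = {w: assign_label(w, glossary, basic_vocab, other_terms)
          for w in ["entgraten", "Hammer", "Haus", "Bohrmaschinen"]}
```

In this toy setup, Bohrmaschinen stays unlabeled because it overlaps strongly with the index entry Bohrmaschine, so it cannot safely be called a non-term.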
A κ of about 0.5 corresponds to "moderate" agreement, which we judge as sufficient for our gold standard, given that term annotation is considered a difficult task (Terryn et al., 2019).

Qualitative Analysis
We perform an in-depth analysis of our four domain corpora to identify the range of term phenomena and variants of ambiguity within and across general and domain-specific data, to motivate and apply an appropriate model. The automotive domain contains many compounds (such as Antriebsschlupfregelung 'traction slip control') and English words (Frontairbags). In the cooking and DIY corpora we find many complex verbs (such as entgraten 'deburr' for DIY and abbinden 'thicken (a sauce)' for cooking). Ambiguous terminology is an outstanding characteristic of the hunting domain, which contains many ambiguous expressions whose domain-specific meaning is completely unknown to laypeople, such as Licht 'light' as a term for the eyes of game. With all those variations, it seems likely that surface form features will not be useful in a prediction task. Furthermore, frequency-based features might not be useful due to the high degree of ambiguity.
Regarding levels of technicality, we find technical terms that seem to be rather unambiguous and have a very restricted usage, such as blanchieren 'blanch' for cooking, which often co-occurs with Salzwasser 'salted water' in the domain-specific context sentences. Surprisingly, we find very similar domain-specific contexts in the general-language corpus, where we would not expect them. Since the general-language corpus is web-crawled, it obviously contains a certain amount of domain-specific texts as well; especially if a highly technical term is not ambiguous, the general-language corpus contains only such contexts. Consequently, the general-language and domain-specific contexts are maximally similar in these cases. In contrast, we assume that the contexts will vary more strongly for basic terms, and for non-terms we do not expect to find domain-specific sentences in the general-language corpus at all.
The picture is different for ambiguous terminology, where sense distributions vary across corpora. For example, for the hunting term Licht 'light/eyes of game' we find both general and domain-specific meanings in the domain corpus; for the cooking term Zauberstab 'wand/hand blender' senses seem to be largely disjoint across the corpora. Example sentences for this phenomenon are given in Table 3 for illustration.
Based on these observations, we adopt the approach by Amjadian et al. (2016, 2018) as the basis to detect degrees of technicality, since both general-language and domain-specific word embeddings will encode termhood attributes. On top of that, we hypothesize that a comparison of the word vectors represents valuable information for a prediction system.

Models
Baselines. As a baseline, we use a decision tree classifier (DT) with three standard features commonly used for term familiarity prediction: frequency (corpus-size normalized), word length and character n-grams. Further, we implement the approach by Amjadian et al. (2016, 2018) using a Multilayer Perceptron (MLP) and the concatenation of general-language word embeddings (GEN) and domain-specific word embeddings (SPEC) of a term candidate as input (MLP, GEN⊕SPEC), in comparison to using only one of the embeddings. We learn two separate word2vec SGNS vector spaces (Mikolov et al., 2013) for GEN and SPEC.
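As an illustration of the three baseline features, the sketch below derives a feature dictionary for a single term candidate. The n-gram order (3), the boundary markers `<` and `>`, the function names and the toy counts are assumptions; the text does not specify these details.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Character n-grams of a lowercased word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def baseline_features(word, corpus_counts, corpus_size, n=3):
    """Feature dict: corpus-size-normalized frequency, word length, n-gram counts."""
    features = {
        "rel_freq": corpus_counts.get(word, 0) / corpus_size,
        "length": len(word),
    }
    for gram, count in Counter(char_ngrams(word, n)).items():
        features[f"ngram={gram}"] = count
    return features

# Toy counts, invented for this example.
feats = baseline_features("entgraten", {"entgraten": 25}, corpus_size=50_000)
```

Such sparse dictionaries would still need to be converted into fixed-length vectors before a decision tree can be trained on them.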
Centering and Batch Normalization. Across neural models we apply batch normalization (Ioffe and Szegedy, 2015), which normalizes the output of a preceding activation layer by subtracting the batch mean and then dividing by the batch standard deviation. This reduces the effect of inhomogeneous input data, in our case the different domain corpora. We further length-normalize the embeddings and apply element-wise column mean-centering to them, which has proven to be beneficial as a preprocessing step for rotational alignment of vector spaces (Artetxe et al., 2016; Schlechtweg et al., 2019) and as a general post-processing step for word embeddings (Mu and Viswanath, 2018).
Note that the reason for the beneficial effect of centering is still unclear. Artetxe et al. (2016) provide an intuitive explanation that centering moves randomly similar embeddings further apart, while Mu and Viswanath (2018) consider centering as an operation making vectors "more isotropic", i.e., more uniformly distributed across the directions in the space.
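The two embedding preprocessing steps, length normalization followed by column mean-centering, can be sketched in plain Python as follows. The function names and the toy two-dimensional vectors are assumptions for illustration; real embedding matrices would normally be handled with a numerical library.

```python
import math

def length_normalize(vectors):
    """Scale every row vector to unit Euclidean length."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(v * v for v in vec))
        normalized.append([v / norm for v in vec] if norm > 0 else list(vec))
    return normalized

def mean_center(vectors):
    """Subtract the per-dimension (column) mean from every row vector."""
    n_rows = len(vectors)
    col_means = [sum(col) / n_rows for col in zip(*vectors)]
    return [[v - m for v, m in zip(vec, col_means)] for vec in vectors]

# Toy embedding matrix: one row per word.
embeddings = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
unit = length_normalize(embeddings)
centered = mean_center(unit)
```

After centering, every column sums to zero, but the rows are in general no longer of unit length.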

Comparative Embeddings and Multi-Channel Model. Simple vector concatenation does not incorporate any kind of comparison of the embeddings. We thus suggest two novel models to exploit general- vs. domain-specific comparisons.
Comparative Embeddings (MLP, CON⊕DIFF) use an MLP classifier and add a difference vector to the input vector concatenation GEN⊕SPEC. Since the word embeddings were trained separately on different corpora, this model requires an alignment of the vector spaces. We use a state-of-the-art alignment method (Artetxe et al., 2016; Hazem and Morin, 2017), where the best rotation GW of a vector space G onto a vector space S is determined, with the rotation matrix W. W is computed as $W = UV^\top$, where $U \Sigma V^\top$ is the singular value decomposition of $G^\top S$ (Schönemann, 1966). After the alignment, the vectors are normalized to unit length again (since they are no longer of unit length after alignment) and the absolute difference vector (DIFF) is computed. The concatenation vector GEN⊕SPEC⊕DIFF is then taken as input to the model.
As our second model, we use a Multi-Channel Feed-Forward Neural Network (MULTI-CHANNEL). The network takes as input the unaligned GEN and SPEC vectors, and processes GEN and SPEC each in a separate channel. The third channel is a variant of a Siamese network (Chopra et al., 2005), a dual-channel network with shared weights. Both GEN and SPEC are processed through the shared weight layer, in order to map them onto the same space. Then the element-wise absolute difference is computed, and the output of all three channels is concatenated. In the network definition, x is a term candidate, and E(x) is the embedding layer, a function $E: x_i \mapsto z_i$ that maps the word $x_i$ onto its corresponding 300-dimensional vector $z_i$. W denotes the weight matrices, b the bias, σ the activation functions, and l denotes the sizes of the hidden layers.
Table 3: Example sentences for Licht 'light/eyes of game'.
Lichter ist die Bezeichnung für die Augen, die Ohren werden auch Lauscher genannt. ('Lichter is the term for the eyes; the ears are also called Lauscher.')
Auch bei schwachem Licht können sie noch sehr gut sehen. ('Even in weak light, they can still see very well.')
Training. We use SMOTE oversampling (Chawla et al., 2002) and train our network to minimize the cross-entropy loss, using back-propagation with stochastic gradient descent. We perform a randomized search for hyperparameter optimization for each model, i.e. we randomly sample parameter combinations. We test with the following parameters: hidden layers, epochs and batch size with values between 16 and 64, learning rate between 0.001 and 0.3, momentum between 0.0 and 0.9, and tanh and rectified linear unit (ReLU) as activation functions.
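The vector-space alignment behind the comparative embeddings (MLP, CON⊕DIFF) is an instance of the orthogonal Procrustes problem. As a reading aid, the block below states the problem and its closed-form solution in the notation G, S and W of the model description. The symbols U, Σ and V, the convention that the rows of G and S hold the embeddings of the shared vocabulary in the same order, and the hatted unit-length vectors are notation assumed here.

```latex
\begin{align}
W^{*} &= \operatorname*{arg\,min}_{W:\; W^{\top} W = I} \lVert G W - S \rVert_{F}^{2}
       = \operatorname*{arg\,max}_{W:\; W^{\top} W = I} \operatorname{tr}\!\left( W^{\top} G^{\top} S \right), \\
G^{\top} S &= U \Sigma V^{\top} \quad \text{(singular value decomposition)}, \\
W^{*} &= U V^{\top}, \\
\mathrm{DIFF}(x) &= \left\lvert \hat{g}_{x} - \hat{s}_{x} \right\rvert,
\qquad
\hat{g}_{x} = \frac{g_{x} W^{*}}{\lVert g_{x} W^{*} \rVert},
\quad
\hat{s}_{x} = \frac{s_{x}}{\lVert s_{x} \rVert}.
\end{align}
```

The first line holds because an orthogonal W leaves the Frobenius norm of G unchanged, so minimizing the distance is equivalent to maximizing the trace term. The absolute value in the last line is taken element-wise.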
To initialize the weights of the embedding layer, we use word2vec SGNS trained with a window size of 2, negative sampling with k=1 and subsampling with a threshold of t = 0.001. These parameter settings obtained the best results in our recent study on terminological meaning shifts (Schlechtweg et al., 2019). We do not train the embedding layer parameters, in order to preserve the original word meaning. Due to the relatively small size of the training data, we use 5-fold cross-validation for training.
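To make the multi-channel architecture with a fixed embedding layer more concrete, the following sketch implements a single forward pass in plain Python. It is only one possible reading of the description: one hidden layer per channel, tanh activations, a softmax output over three classes, random untrained weights and toy dimensions are all assumptions, batch normalization is left out, and the function names are hypothetical.

```python
import math
import random

def dense(vec, weights, bias, activation=math.tanh):
    """One fully connected layer: activation(W · vec + b)."""
    return [activation(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def multi_channel_forward(word, gen_emb, spec_emb, params):
    """GEN channel, SPEC channel and a shared-weight difference channel."""
    gen, spec = gen_emb[word], spec_emb[word]    # fixed embedding lookup E(x)
    h_gen = dense(gen, *params["gen"])           # channel 1: GEN only
    h_spec = dense(spec, *params["spec"])        # channel 2: SPEC only
    s_gen = dense(gen, *params["shared"])        # channel 3: the same weights ...
    s_spec = dense(spec, *params["shared"])      # ... are applied to both inputs
    h_diff = [abs(a - b) for a, b in zip(s_gen, s_spec)]
    hidden = h_gen + h_spec + h_diff             # concatenate the three channels
    logits = dense(hidden, *params["out"], activation=lambda z: z)
    return softmax(logits)

rng = random.Random(0)
dim, hidden_size, n_classes = 4, 3, 3            # toy sizes instead of 300 dimensions
gen_emb = {"Licht": [rng.uniform(-1, 1) for _ in range(dim)]}
spec_emb = {"Licht": [rng.uniform(-1, 1) for _ in range(dim)]}
params = {
    "gen": init_layer(rng, hidden_size, dim),
    "spec": init_layer(rng, hidden_size, dim),
    "shared": init_layer(rng, hidden_size, dim),
    "out": init_layer(rng, n_classes, 3 * hidden_size),
}
probs = multi_channel_forward("Licht", gen_emb, spec_emb, params)
```

Because the shared layer maps both inputs with identical weights, the difference channel is zero whenever the GEN and SPEC vectors of a word coincide, so no prior alignment of the two spaces is needed in this design.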

Results
We use Macro-Precision, Recall and F1-Score for evaluation, to put more weight on the correctness of the smaller classes basic term and technical term.
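Macro-averaging computes each score per class and then takes the unweighted mean over classes, so small classes count as much as the large non-term class. The sketch below is a minimal illustration; the function name, the toy labels, and the choice to define macro F1 as the mean of the per-class F1 values are assumptions.

```python
def macro_scores(gold, predicted):
    """Macro-averaged precision, recall and F1 over all occurring classes."""
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy data: the single basic term is misclassified as a non-term.
gold = ["non-term"] * 4 + ["basic", "technical"]
predicted = ["non-term"] * 4 + ["non-term", "technical"]
macro_p, macro_r, macro_f1 = macro_scores(gold, predicted)
```

In the toy data, five of six items are classified correctly, yet macro-recall is only 2/3 because the missed basic term pulls its whole class down to zero.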
The experiment results are shown in Table 4. The multi-layer perceptron (MLP) results outperform the decision-tree (DT) baseline with standard term familiarity prediction features. Using only a general-language vector GEN for classification performs better than using only a domain-specific vector SPEC, and the concatenation of both (GEN⊕SPEC) performs better than each of them individually. This is most likely due to more training data and having both domain-specific and general-language parts in the general-language corpus.
Table 4: Macro-Precision (P), Recall (R) and F1-Score results. The main results apply centering and batch normalization; results without centering are in brackets.
The models integrating a notion of vector comparison perform best, with the multi-channel network achieving slightly better results than the MLP comparative embeddings. Centering improves all but one of the results, i.e. it has an overall beneficial effect for our task.

Conclusion
We semi-automatically created the first large-scale gold standard for technicality prediction across domains and proposed two novel neural network models to fine-tune automatic terminology extraction by distinguishing between degrees of technicality. The models integrate general- vs. domain-specific word embedding information in different ways. An adapted Siamese multi-channel network model performed best, and centering as a pre-processing step for the vector spaces has an overall beneficial effect.