A Graph Auto-encoder Model of Derivational Morphology

There has been little work on modeling the morphological well-formedness (MWF) of derivatives, a problem judged to be complex and difficult in linguistics. We present a graph auto-encoder that learns embeddings capturing information about the compatibility of affixes and stems in derivation. The auto-encoder models MWF in English surprisingly well by combining syntactic and semantic information with associative information from the mental lexicon.


Introduction
A central goal of morphology is, as famously put by Aronoff (1976), "to tell us what sort of new words a speaker can form." This definition is tightly intertwined with the notion of morphological well-formedness (MWF). While non-existing morphologically well-formed words such as pro$computer$ism conform to the morphological patterns of a language and could be formed, non-existing morphologically ill-formed words such as pro$and$ism violate the patterns and are deemed impossible (Allen, 1979).
More recent research has shown that MWF is a gradient rather than binary property: non-existing words that conform to the morphological patterns of a language differ in how likely they are to be actually created by speakers (Pierrehumbert, 2012). This is particularly true in the case of derivational morphology, which is not obligatory and often serves communicative needs (Bauer, 2019). As a result, the degree of MWF of a non-existing derivative is influenced by a multitude of factors and judged to be hard to predict (Bauer, 2001).
In NLP, the lack of reliable ways to estimate the MWF of derivatives poses a bottleneck for generative models, particularly in languages exhibiting a rich derivational morphology; e.g., while inflected forms can be translated by generating morphologically corresponding forms in the target language (Minkov et al., 2007), generating derivatives is still a major challenge for machine translation systems (Sreelekha and Bhattacharyya, 2018). Similar problems exist in the area of automatic language generation (Gatt and Krahmer, 2018).
This study takes a first step towards computationally modeling the MWF of English derivatives. We present a derivational graph auto-encoder (DGA) that combines semantic and syntactic information with associative information from the mental lexicon, achieving very good results on MWF prediction and performing on par with a character-based LSTM at a fraction of the number of trainable parameters. The model produces embeddings that capture information about the compatibility of affixes and stems in derivation and can be used as pretrained input to other NLP applications.


Derivational Morphology

Inflection and Derivation
Linguistics divides morphology into inflection and derivation. While inflection refers to the different word forms of a lexeme, e.g., listen, listens, and listened, derivation refers to the different lexemes of a word family, e.g., listen, listener, and listenable. There are several differences between inflection and derivation, some of which are highly relevant for NLP.
Firstly, while inflection is obligatory and determined by syntactic needs, the existence of derivatives is mainly driven by communicative goals, allowing speakers to express a varied spectrum of meanings (Acquaviva, 2016). Secondly, derivation can produce a larger number of new words than inflection since it is iterable (Haspelmath and Sims, 2010); derivational affixes can be combined, in some cases even recursively (e.g., post$post$modern$ism). However, morphotactic constraints restrict the ways in which affixes can be attached to stems and other affixes (Hay and Plag, 2004); e.g., the suffix $less can be combined with $ness (atom$less$ness) but not with $ity (atom$less$ity).
The semantic and formal complexity of derivation makes predicting the MWF of derivatives more challenging than the MWF of inflectional forms (Anshen and Aronoff, 1999; Bauer, 2019). Here, we model the MWF of derivatives as the likelihood of their existence in the mental lexicon.

Derivatives in the Mental Lexicon
How likely a derivative is to exist is influenced by various factors (Bauer, 2001; Pierrehumbert and Granell, 2018). In this study, we concentrate on the role of the structure of the mental lexicon.
The mental lexicon can be thought of as a set of associations between meaning m and form f, i.e., words, organized in a network, where links correspond to shared semantic and phonological properties (see Pierrehumbert (2012) for a review). Since we base our study on textual data, we will treat the form of words orthographically rather than phonologically. We will refer to the type of information conveyed by the cognitive structure of the mental lexicon as associative information.
Sets of words with similar semantic and formal properties form clusters in the mental lexicon (Alegre and Gordon, 1999). The semantic and formal properties reinforced by such clusters create abstractions that can be extended to new words (Bybee, 1995). If the abstraction hinges upon a shared derivational pattern, the effect of such an extension is a new derivative. The extent to which a word conforms to the properties of the cluster influences how likely the abstraction (in our case a derivational pattern) is to be extended to that word. This is what is captured by the notion of MWF.

Derivational Graphs
The main goal of this paper is to predict the MWF of morphological derivatives (i.e., how likely is a word to be formed as an extension of a lexical cluster) by directly leveraging associative information. Since links in the mental lexicon reflect semantic and formal similarities of various sorts, many of which are not morphological (Tamariz, 2008), we want to create a distilled model of the mental lexicon that only contains derivational information. One way to achieve this is by means of a derivational projection of the mental lexicon, a network that we call the Derivational Graph (DG).
Let L = (W, Q) be a graph of the mental lexicon consisting of a set of words W and a set of links between the words Q. Let W_a ⊂ W be a set of words forming a fully interconnected cluster in L due to a shared derivational pattern a. We define S_a as the set of stems resulting from stripping off a from the words in W_a and R_a = {(s, a)}_{s ∈ S_a} as the corresponding set of edges between the stems and the shared derivational pattern. We then define the two-mode derivational projection B of L as the Derivational Graph (DG) where B = (V, E), V = ⋃_a (S_a ∪ {a}), and E = ⋃_a R_a. Figure 1 gives an example of L and DG (= B).
The DG is a bipartite graph whose nodes consist of stems s ∈ S with S = ⋃_a S_a and derivational patterns a ∈ A with A = ⋃_a {a}. The derivational patterns are sequences of affixes such as re$$ize$ate$ion in the case of revitalization. The cognitive plausibility of this setup is supported by findings that affix groups can trigger derivational generalizations in the same way as individual affixes (Stump, 2017, 2019). We define B ∈ ℝ^{|V|×|V|} to be the adjacency matrix of B. The degree of an individual node n is d(n). We further define Γ_1(n) as the set of one-hop neighbors and Γ_2(n) as the set of two-hop neighbors of n. Notice that Γ_1(s) ⊆ A, Γ_1(a) ⊆ S, Γ_2(s) ⊆ S, and Γ_2(a) ⊆ A for any s and a since the DG is bipartite.
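As a concrete illustration, constructing a DG from already segmented derivatives takes only a few lines. The following Python sketch is ours (the function name and the toy segmentations are invented for illustration, not part of the original pipeline):

```python
def build_derivational_graph(segmented):
    """Build the bipartite Derivational Graph (DG) B = (V, E).

    `segmented` maps each derivative to a (stem, derivational pattern) pair,
    where the pattern is the affix group stripped from the word.
    Returns the stem set S, the pattern set A, and the edge set E.
    """
    stems, patterns, edges = set(), set(), set()
    for stem, pattern in segmented.values():
        stems.add(stem)
        patterns.add(pattern)
        edges.add((stem, pattern))
    return stems, patterns, edges

# toy example with hypothetical segmentations
segmented = {
    "listener": ("listen", "$er"),
    "listenable": ("listen", "$able"),
    "readable": ("read", "$able"),
}
S, A, E = build_derivational_graph(segmented)
```

Note that two derivatives sharing a stem or a pattern automatically become two-hop neighbors in the resulting bipartite graph.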
The advantage of this setup of DGs is that it abstracts away information not relevant to derivational morphology while still allowing results to be interpreted in the light of the mental lexicon. The creation of a derivative corresponds to a new link between a stem and a derivational pattern in the DG, which in turn reflects the inclusion of a new word into a lexical cluster with a shared derivational pattern in the mental lexicon.

Figure 2: Experimental setup. We extract DGs from Reddit and train link prediction models on them. In the shown toy example, the derivatives super$minecraft and affirm$ation are held out for the test set.

Corpus
We base our study on data from the social media platform Reddit. Reddit is divided into so-called subreddits (SRs), smaller communities centered around shared interests. SRs have been shown to exhibit community-specific linguistic properties (del Tredici and Fernández, 2018).
We draw upon the Baumgartner Reddit Corpus, a collection of publicly available comments posted on Reddit since 2005. The preprocessing of the data is described in Appendix A.1. We examine data in the SRs r/cfb (cfb, college football), r/gaming (gam), r/leagueoflegends (lol), r/movies (mov), r/nba (nba), r/nfl (nfl), r/politics (pol), r/science (sci), and r/technology (tec) between 2007 and 2018. These SRs were chosen because they are of comparable size and are among the largest SRs (see Table 1). They reflect three distinct areas of interest, i.e., sports (cfb, nba, nfl), entertainment (gam, lol, mov), and knowledge (pol, sci, tec), thus allowing for a multifaceted view on how topical factors impact MWF: seeing MWF as an emergent property of the mental lexicon entails that communities with different lexica should differ in what derivatives are most likely to be created.

Morphological Segmentation
Many morphologically complex words are not decomposed into their morphemes during cognitive processing (Sonnenstuhl and Huth, 2002). Morphological segmentation therefore determines which words count as a derivative (in a given SR). Segmentation is performed by means of an iterative affix-stripping algorithm introduced in Hofmann et al. (2020) that is based on a representative list of productive prefixes and suffixes in English (Crystal, 1997). The algorithm is sensitive to most morpho-orthographic rules of English (Plag, 2003): when $ness is removed from happi$ness, e.g., the result is happy, not happi. See Appendix A.2 for details. The segmented texts are then used to create DGs as described in Section 2.3. All processing is done separately for each SR, i.e., we create a total of nine different DGs. Figure 2 illustrates the general experimental setup of our study.

Models
Let W be a Bernoulli random variable denoting the property of being morphologically well-formed. We want to model P(W | d, C_r) = P(W | s, a, C_r), i.e., the probability that a derivative d consisting of stem s and affix group a is morphologically well-formed according to SR corpus C_r.
Given the established properties of derivational morphology (see Section 2), a good model of P(W | d, C_r) should include both semantic and formal structure. We write m_s, f_s and m_a, f_a for the meaning and form (here modeled orthographically, see Section 2.2) of the involved stem and affix group, respectively. The models we examine in this study vary in which of these features are used, and how they are used.

Derivational Graph Auto-encoder
We model P(W | d, C_r) by training a graph auto-encoder (Kipf and Welling, 2016, 2017) on the DG B of each SR. The graph auto-encoder attempts to reconstruct the adjacency matrix B (Section 2.3) of the DG by means of an encoder function g_θ and a decoder function h_θ, i.e., its basic structure is

B̂ = h_θ(g_θ(B)),    (2)

where B̂ is the reconstructed version of B. The specific architecture we use (see Figure 3), which we call a Derivational Graph Auto-encoder (DGA), is a variation of the bipartite graph auto-encoder (van den Berg et al., 2018).
Encoder. The encoder g θ takes as one of its inputs the adjacency matrix B of the DG B. This means we model f s and f a , the stem and affix group forms, by means of the associative relationships they create in the mental lexicon. Since a DG has no information about semantic relationships between nodes within S and A, we reintroduce meaning as additional feature vectors x s , x a ∈ R n for m s and m a , stem and affix group embeddings that are trained separately on the SR texts. The input to g θ is thus designed to provide complementary information: associative information (B) and semantic information (x s and x a ).
For the encoder to be able to combine the two types of input in a meaningful way, the choice of g_θ is crucial. We model g_θ as a graph convolutional network (Kipf and Welling, 2016, 2017), providing an intuitive way to combine information from the DG with additional information. The graph convolutional network consists of L convolutional layers. Each layer (except for the last one) performs two steps: message passing and activation.
During the message passing step (Dai et al., 2016; Gilmer et al., 2017), transformed versions of the embeddings x_s and x_a are sent along the edges of the DG, weighted, and accumulated. We define Γ_1^+(s) = Γ_1(s) ∪ {s} as the set of nodes whose transformed embeddings are weighted and accumulated for a particular stem s. Γ_1^+(s) is extracted from the adjacency matrix B and consists of the one-hop neighbors of s and s itself. The message passing propagation rule (Kipf and Welling, 2016, 2017) can then be written as

x̃_s^(l) = Σ_{n ∈ Γ_1^+(s)} 1/√(|Γ_1^+(s)| |Γ_1^+(n)|) · W^(l) x_n^(l−1),    (3)

where W^(l) is the trainable weight matrix of layer l, x_n^(l−1) is the embedding of node n from layer l − 1 with x_n^(0) = x_n, and 1/√(|Γ_1^+(s)| |Γ_1^+(n)|) is the weighting factor. The message passing step is performed analogously for affix groups. The matrix form of Equation 3 is given in Appendix A.3.
Intuitively, a message passing step takes embeddings of all neighbors of a node and the embedding of the node itself, transforms them, and accumulates them by a normalized sum. Given that the DG B is bipartite, this means for a stem s that the normalized sum contains d(s) affix group embeddings and one stem embedding (and analogously for affix groups). The total number of convolutional layers L determines how far the influence of a node can reach. While one convolution allows nodes to receive information from their one-hop neighbors (stems from affix groups they co-occur with and vice versa), two convolutions add information from the two-hop neighbors (stems from stems co-occurring with the same affix group and vice versa), etc. (see Figure 4).
During the activation step, the output of the convolutional layer l for a particular stem s is

x_s^(l) = ReLU(x̃_s^(l)),

where ReLU(·) = max(0, ·) is a rectified linear unit (Nair and Hinton, 2010). The final output of the encoder is

z_s = x̃_s^(L),

i.e., there is no activation in the last layer. The activation step is again performed analogously for affix groups. z_s and z_a are representations of s and a enriched with information about the semantics of nodes in their DG neighborhood.

Decoder. We model the decoder as a simple bilinear function,

h_θ(z_s, z_a) = σ(z_s^⊤ z_a),

where σ is the sigmoid and z_s and z_a are the outputs of the encoder. We set P(W | d, C_r) = h_θ(z_s, z_a) and interpret this as the probability that the corresponding edge in a DG constructed from a corpus drawn from the underlying distribution exists. The resulting matrix B̂ in Equation 2 is then the reconstructed adjacency matrix of the DG.
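A minimal numeric sketch of the encoder-decoder computation may help make the layer structure concrete. This is our own illustration in NumPy with random toy weights, not the trained model:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dga_encoder(B, X, weights):
    """Graph convolutions over the DG with symmetric normalization.

    B: (N, N) adjacency matrix of the DG, X: (N, n) input node embeddings,
    weights: one trainable matrix per layer. As in the DGA, every layer
    except the last is followed by a ReLU activation.
    """
    B_tilde = B + np.eye(B.shape[0])            # self-loops: n is in its own neighborhood
    d = B_tilde.sum(axis=1)
    A_norm = B_tilde / np.sqrt(np.outer(d, d))  # the 1/sqrt(|G1+(s)||G1+(n)|) weighting
    Z = X
    for l, W in enumerate(weights):
        Z = A_norm @ Z @ W                      # message passing
        if l < len(weights) - 1:
            Z = relu(Z)                         # no activation after the last layer
    return Z

def dga_decoder(z_s, z_a):
    """Inner-product decoder: P(W | d, C_r) = sigma(z_s . z_a)."""
    return sigmoid(z_s @ z_a)

# toy DG with one stem (node 0) and one affix group (node 1)
rng = np.random.default_rng(0)
B = np.array([[0.0, 1.0], [1.0, 0.0]])
X = rng.normal(size=(2, 3))
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
Z = dga_encoder(B, X, weights)
p = dga_decoder(Z[0], Z[1])
```

The decoder output p can then be read as the predicted edge probability for the stem/affix-group pair.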
Notice that the only trainable parameters of the DGA are the weight matrices W^(l). To put the performance of the DGA into perspective, we compare against four baselines, which we present in decreasing order of sophistication.

Baseline 1: Character-based Model (CM)
We model P(W | d, C_r) as P(W | f_s, f_a, C_r) using a character-based model (CM), i.e., as opposed to the DGA, f_s and f_a are modeled directly by means of their orthographic form. This provides the CM with phonological information, a central predictor of MWF (see Section 2.2). CM might also learn semantic information during training, but it is not directly provided with it. Character-based models show competitive results on derivational tasks (Deutsch et al., 2018), a good reason to test their performance on MWF prediction.
We use two one-layer bidirectional LSTMs to encode the stem and affix group into a vector o by concatenating the last hidden states from both LSTM directions,

o = h_s^→ ⊕ h_s^← ⊕ h_a^→ ⊕ h_a^←,

where ⊕ denotes concatenation. o is then fed into a two-layer feed-forward neural network with a ReLU non-linearity after the first layer. The activation function after the second layer is σ.

Baseline 2: Neural Classifier (NC)
We model P(W | d, C_r) as P(W | m_s, m_a, C_r) using a neural classifier (NC) whose architecture is similar to the auto-encoder setup of the DGA.
Similarly to the DGA, m_s and m_a are modeled by means of stem and affix group embeddings trained separately on the SRs. The first encoder-like part of the NC is a two-layer feed-forward neural network with a ReLU non-linearity after the first layer. The second decoder-like part of the NC is an inner-product layer as in the DGA. Thus, the NC is identical to the DGA except that it does not use associative information from the DG via a graph convolutional network; it only has information about the stem and affix group meanings.

Baseline 3: Jaccard Similarity (JS)
We model P(W | d, C_r) as P(W | f_s, f_a, C_r). As in the DGA, we model the stem and affix group forms by means of the associative relationships they create in the mental lexicon. Specifically, we predict links without semantic information.
In feature-based machine learning, link prediction is performed by defining similarity measures on a graph and ranking node pairs according to these features (Liben-Nowell and Kleinberg, 2003). We apply four common measures, most of which have to be modified to accommodate the properties of bipartite DGs. Here, we only cover the best performing measure, Jaccard similarity (JS). JS is one of the simplest graph-based similarity measures, so it is a natural baseline for answering the question: how far does simple graph-based similarity get you at predicting MWF? See Appendix A.4 for the other three measures.
The JS score of an edge (s, a) is traditionally defined as

ζ_JS(s, a) = |Γ_1(s) ∩ Γ_1(a)| / |Γ_1(s) ∪ Γ_1(a)|.

However, since Γ_1(s) ∩ Γ_1(a) = ∅ for any s and a (the DG is bipartite), we redefine the set of common neighbors of two nodes n and m, Γ_∩(n, m), as Γ_2(n) ∩ Γ_1(m), i.e., the intersection of the two-hop neighbors of n and the one-hop neighbors of m, and analogously Γ_∪(n, m) as Γ_2(n) ∪ Γ_1(m). Since these are asymmetric definitions, we define

ζ_JS(s, a) = |Γ_∩(s, a)| / |Γ_∪(s, a)| + |Γ_∩(a, s)| / |Γ_∪(a, s)|.

JS assumes that a stem that is already similar to a lexical cluster in its derivational patterns is more likely to become even more similar to the cluster than a less similar stem.
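A small sketch of this bipartite Jaccard score, operating on a neighbor dictionary (our own helper; whether a node counts as its own two-hop neighbor is our assumption, and we exclude it here):

```python
def jaccard_score(nbrs, s, a):
    """Bipartite Jaccard score zeta_JS(s, a) on a DG.

    `nbrs` maps every node to its set of one-hop neighbors. The two-hop
    neighborhood is derived from it; the node itself is excluded (assumption).
    """
    def two_hop(n):
        return {m for k in nbrs[n] for m in nbrs[k]} - {n}

    def ratio(common, union):
        return len(common) / len(union) if union else 0.0

    # asymmetric common/union neighbor sets, summed over both directions
    return (ratio(two_hop(s) & nbrs[a], two_hop(s) | nbrs[a])
            + ratio(two_hop(a) & nbrs[s], two_hop(a) | nbrs[s]))

# toy DG: stem s2 shares affix group a1 with s1; candidate edge (s2, a2)
nbrs = {
    "s1": {"a1", "a2"},
    "s2": {"a1"},
    "a1": {"s1", "s2"},
    "a2": {"s1"},
}
score = jaccard_score(nbrs, "s2", "a2")
```

In the toy graph, s2 behaves exactly like s1 so far, so the candidate edge (s2, a2) receives the maximal score.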

Baseline 4: Bigram Model (BM)
We again model P(W | d, C_r) as P(W | f_s, f_a, C_r), leaving aside semantic information. However, in contrast to JS, this model implements the classic approach of Fabb (1988), according to which pairwise constraints on affix combinations, or combinations of a stem and an affix, determine the allowable sequences. Taking into account more recent results on morphological gradience, we do not model these selection restrictions with binary rules. Instead, we use transition probabilities, beginning with the POS of the stem s and working outwards to each following suffix a^(s) or preceding prefix a^(p). Using a simple bigram model (BM), we can thus calculate the MWF of a derivative as

P(W | d, C_r) = P(a^(s) | s) · P(a^(p) | s),

where P(a^(s) | s) is the probability of the suffix group conditioned on the POS of the stem. P(a^(p) | s) is defined analogously for prefix groups.
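Under these definitions, the BM reduces to counting. A sketch with maximum likelihood estimates (the data format and helper names are ours, invented for illustration):

```python
from collections import Counter

def train_bigram_model(train_edges):
    """Estimate P(a_suffix | POS) and P(a_prefix | POS) by MLE.

    `train_edges` is a list of (stem_pos, prefix_group, suffix_group)
    triples; None marks an absent prefix or suffix group.
    """
    pos_counts, pre_counts, suf_counts = Counter(), Counter(), Counter()
    for pos, pre, suf in train_edges:
        pos_counts[pos] += 1
        if pre is not None:
            pre_counts[(pos, pre)] += 1
        if suf is not None:
            suf_counts[(pos, suf)] += 1

    def score(pos, pre, suf):
        # P(W | d, C_r) = P(a_suffix | POS of stem) * P(a_prefix | POS of stem)
        p = 1.0
        if suf is not None:
            p *= suf_counts[(pos, suf)] / pos_counts[pos]
        if pre is not None:
            p *= pre_counts[(pos, pre)] / pos_counts[pos]
        return p

    return score

# toy training data: POS of stem plus observed affix groups
score = train_bigram_model([
    ("VERB", None, "$er"),
    ("VERB", None, "$er"),
    ("VERB", None, "$able"),
    ("NOUN", "un$", None),
])
```

The returned closure scores unseen stem/affix-group combinations by the product of the two conditional probabilities, as in the BM equation above.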

Setup
We train all models on the nine SRs using the same split of E into training, validation, and test edges. As negative examples, we additionally sample n_train non-edges (s, a) ∉ E in every epoch (i.e., the set of sampled non-edges changes in every epoch).
Nodes are sampled according to their degree with P(n) ∝ d(n), a common strategy in bipartite link prediction (Chen et al., 2017). We make sure non-edges sampled in training are not in the validation or test sets. During the test phase, we rank all edges according to their predicted scores. We evaluate the models using average precision (AP) and area under the ROC curve (AUC), two common evaluation measures in link prediction that do not require a decision threshold. AP emphasizes the correctness of the top-ranked edges more than AUC (Su et al., 2015).
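Degree-proportional negative sampling can be sketched as follows (our own helper; it assumes enough non-edges exist to draw k distinct ones):

```python
import random

def sample_non_edges(edges, stems, patterns, k, seed=0):
    """Sample k non-edges (s, a) with node probability P(n) proportional to d(n).

    `edges` is the set of observed (stem, pattern) pairs; sampled pairs that
    are real edges are rejected. In training this is repeated every epoch.
    """
    rng = random.Random(seed)
    stem_deg = {s: 0 for s in stems}
    pat_deg = {a: 0 for a in patterns}
    for s, a in edges:
        stem_deg[s] += 1
        pat_deg[a] += 1
    s_nodes, s_weights = zip(*stem_deg.items())
    a_nodes, a_weights = zip(*pat_deg.items())
    non_edges = set()
    while len(non_edges) < k:
        s = rng.choices(s_nodes, weights=s_weights)[0]  # P(n) proportional to d(n)
        a = rng.choices(a_nodes, weights=a_weights)[0]
        if (s, a) not in edges:
            non_edges.add((s, a))
    return non_edges

# toy DG: the only possible non-edge is (s2, a2)
edges = {("s1", "a1"), ("s1", "a2"), ("s2", "a1")}
negatives = sample_non_edges(edges, {"s1", "s2"}, {"a1", "a2"}, k=1)
```

In a full pipeline one would additionally reject pairs that appear in the validation or test sets, as described above.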

Training Details
DGA, DGA+: We use binary cross entropy as loss function. Hyperparameter tuning is performed on the validation set. We train the DGA for 600 epochs using Adam (Kingma and Ba, 2015) with a learning rate of 0.01; the number of epochs until convergence lies within the typical range of values for graph convolutional networks. We use L = 2 hidden layers in the DGA with a dimension of 100. For regularization, we apply dropout of 0.1 after the input layer and 0.7 after the hidden layers. For x_s and x_a, we use 100-dimensional GloVe embeddings (Pennington et al., 2014) trained on the segmented text of the individual SRs with a window size of 10. These can be seen as GloVe variants of traditional morpheme embeddings as proposed, e.g., by Qiu et al. (2014), with the sole difference that we use affix groups instead of individual affixes. For training the embeddings, derivatives are segmented into prefix group, stem, and suffix group. In the case of both prefix and suffix groups, we add prefix and suffix group embeddings.

Since the window size impacts the information represented by the embeddings, with larger windows tending to capture topical and smaller windows morphosyntactic information (Lison and Kutuzov, 2017), we also train the DGA with 200-dimensional embeddings consisting of concatenated 100-dimensional embeddings trained with window sizes of 10 and 1, respectively (DGA+). We experimented with using vectors trained on isolated pairs of stems and affix groups instead of window-1 vectors trained on the full text, but the performance was comparable; we also implemented the DGA using only window-1 vectors (without concatenating them with window-10 vectors), but it performed considerably worse. Since DGA already receives associative information from the DG and semantic information from the embeddings trained with window size 10, the main advantage of DGA+ should lie in additional syntactic information.

CM: We use binary cross entropy as loss function. We train the CM for 20 epochs using Adam with a learning rate of 0.001. Both input character embeddings and hidden states of the bidirectional LSTMs have 100 dimensions. The output of the first feed-forward layer has 50 dimensions. We apply dropout of 0.2 after the embedding layer as well as the first feed-forward layer.
NC, NC+: All hyperparameters are identical to the DGA and the DGA+, respectively.
JS: Similarity scores are computed on the SR training sets.
BM: Transition probabilities are maximum likelihood estimates from the SR training sets. If a stem is assigned several POS tags by the tagger, we take the most frequent one.

Table 2 summarizes the number of trainable parameters for the neural models. Notice that CM has more than 10 times as many trainable parameters as DGA+, DGA, NC+, and NC.

Results
The overall best performing models are DGA+ and CM (see Table 3). While DGA+ beats CM on all SRs except for lol in AP, CM beats DGA+ on all SRs except for cfb and tec in AUC. Except for CM, DGA+ beats all other models on all SRs in both AP and AUC, i.e., it is always the best or second-best model. DGA beats all models except for DGA+ and CM on all SRs in AP but has lower AUC than NC+ on three SRs. It also outperforms CM on three SRs in AP. NC+ and NC mostly have scores above 0.7, showing that traditional morpheme embeddings also capture information about the compatibility of affixes and stems (albeit to a lesser degree than models with associative or orthographic information). Among the non-neural methods, JS outperforms BM (and the other non-neural link prediction models, see Appendix A.4) in AP, but is beaten by BM in AUC on six SRs.
The fact that DGA+ performs on par with CM while using less than 10% of CM's parameters demonstrates the power of incorporating associative information from the mental lexicon in modeling the MWF of derivatives. This result is even more striking since DGA+, as opposed to CM, has no direct access to orthographic (i.e., phonological) information. At the same time, CM's high performance indicates that orthographic information is an important predictor of MWF.

Comparison with Input Vectors
To understand better how associative information from the DG increases performance, we examine how DGA+ changes the shape of the vector space by comparing input vs. learned embeddings (X vs. Z_DGA+), and contrast that with NC+ (X vs. Z_NC+). A priori, there are two opposing demands the embeddings need to respond to: (i) as holds for bipartite graphs in general (Gao et al., 2018), the two node sets (stems and affix groups) should form two separated clusters in embedding space; (ii) stems associated with the same affix group should form clusters in embedding space that are close to the embedding of the respective affix group.
For this analysis, we define δ(N, v) as the mean cosine similarity between the embeddings of a node set N and an individual embedding v,

δ(N, v) = (1/|N|) Σ_{n ∈ N} cos(u_n, v),

where u_n is the embedding of node n. We calculate δ for the set of stem nodes S and their centroid c_S = (1/|S|) Σ_{s ∈ S} u_s as well as the set of affix group nodes A and their centroid c_A = (1/|A|) Σ_{a ∈ A} u_a. Table 4 shows that while NC+ makes the embeddings of both S and A more compact (higher similarity in Z_NC+ than in X), DGA+ makes S more compact, too, but decreases the compactness of A (lower similarity in Z_DGA+ than in X). Z_NC+ meets (i) to a greater extent than Z_DGA+.

Figure 5: Comparison of input embeddings X with learned representations Z_NC+ and Z_DGA+ ((a) X, (b) Z_NC+, (c) Z_DGA+). The plots are t-SNE projections (van der Maaten and Hinton, 2008) of the embedding spaces. We highlight two example sets of stems occurring with a common affix: the blue points are stems occurring with $esque, the orange points stems occurring with $ful. × marks the embedding of $esque, + the embedding of $ful.
We then calculate δ for all sets of stems S_a occurring with a common affix group a and their centroids c_{S_a} = (1/|S_a|) Σ_{s ∈ S_a} u_s. We also compute δ for all S_a and the embeddings of the corresponding affix groups u_a. As Table 4 shows, both values are much higher in Z_DGA+ than in X, i.e., DGA+ brings stems with a common affix group a (lexical clusters in the mental lexicon) close to each other while at the same time moving a into the direction of the stems. The embeddings Z_NC+ exhibit a similar pattern, but more weakly than Z_DGA+ (see Table 4 and Figure 5). Z_DGA+ meets (ii) to a greater extent than Z_NC+.
Thus, DGA+ and NC+ solve the tension between (i) and (ii) differently; the associative information from the mental lexicon allows DGA+ to put a greater emphasis on (ii), leading to higher performance in MWF prediction.
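The δ statistic used above is straightforward to compute. A sketch in NumPy (the toy cluster embeddings are invented for illustration):

```python
import numpy as np

def delta(U, v):
    """Mean cosine similarity delta(N, v) between a set of node embeddings
    (the rows of U) and a single embedding v."""
    U = np.asarray(U, dtype=float)
    U_norm = U / np.linalg.norm(U, axis=1, keepdims=True)
    v_norm = v / np.linalg.norm(v)
    return float(np.mean(U_norm @ v_norm))

# compactness of a toy lexical cluster S_a around its centroid
S_a = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]])
centroid = S_a.mean(axis=0)
compactness = delta(S_a, centroid)
```

The same function covers both analyses: δ of a node set against its centroid (compactness) and δ of a stem cluster S_a against the affix group embedding u_a.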

Comparison between SRs
Another reason for the higher performance of the models with associative information could be that their embeddings capture differences in derivational patterns between the SR communities. To examine this hypothesis, we map the embeddings Z_DGA+ of all SRs into a common vector space by means of orthogonal procrustes alignment (Schönemann, 1966), i.e., for every SR we optimize

Ω̂^(i) = argmin_{Ω: Ω^⊤Ω = I} ‖Z^(i)_DGA+ Ω − Z^(0)_DGA+‖_F,

where Z^(i)_DGA+ is the embedding matrix of the SR i, and Z^(0)_DGA+ is the embedding matrix of a randomly chosen SR (which is the same for all projections). We then compute the intersection of stem and affix group nodes from all SRs, S_∩ = ⋂_i S^(i) and A_∩ = ⋂_i A^(i), where S^(i) and A^(i) are the stem and affix group sets of SR i, respectively. To probe whether differences between SRs are larger or smaller for affix embeddings as compared to stem embeddings, we define

∆(S^(i), S^(j)) = (1/|S_∩|) Σ_{s ∈ S_∩} cos(ẑ_s^(i), ẑ_s^(j)),

i.e., the mean cosine similarity between projected embedding pairs ẑ_s from two SRs i and j representing the same stem s in the intersection set S_∩, with ẑ_s^(i) the row of Z^(i)_DGA+ Ω̂^(i) corresponding to s. ∆(A^(i), A^(j)) is defined analogously for affix groups.
The mean value for ∆(A^(i), A^(j)) (0.723 ± 0.102) is lower than that for ∆(S^(i), S^(j)) (0.760 ± 0.087), i.e., differences between affix group embeddings are more pronounced than between stem embeddings. Topically connected SRs are more similar to each other than SRs of different topic groups, with the differences being larger in ∆(A^(i), A^(j)) than in ∆(S^(i), S^(j)) (see Figure 6). These results can be related to Section 6.1: affix groups are very close to the stems they associate with in Z_DGA+, i.e., if an affix group is used with stems of meaning p in one SR and stems of meaning q in the other SR, then the affix group also has embeddings close to p and q in the two SRs. Most technical vocabulary, on the other hand, is specific to a SR and does not make it into S_∩. A qualitative analysis supports this hypothesis: affix groups with low cosine similarities between SRs associate with highly topical stems; e.g., the affix group $ocracy has a low cosine similarity of -0.189 between the SRs nba and pol, and it occurs with stems such as kobe, jock in nba but left, wealth in pol.
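Schönemann's solution to the orthogonal procrustes problem is a single SVD. A sketch (our own function name; the sanity check below rotates a reference matrix and recovers it):

```python
import numpy as np

def procrustes_align(Z_i, Z_0):
    """Map one SR's embedding matrix into the space of a reference SR.

    Finds the orthogonal Omega minimizing ||Z_i @ Omega - Z_0||_F
    (Schoenemann, 1966) and returns the projected embeddings Z_i @ Omega.
    """
    U, _, Vt = np.linalg.svd(Z_i.T @ Z_0)
    omega = U @ Vt  # closed-form solution via SVD
    return Z_i @ omega

# sanity check: a rotated copy of Z_0 should align back onto Z_0
rng = np.random.default_rng(0)
Z_0 = rng.normal(size=(20, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
Z_i = Z_0 @ Q
aligned = procrustes_align(Z_i, Z_0)
```

After alignment, per-node cosine similarities such as ∆(S^(i), S^(j)) can be computed directly between rows of the projected matrices.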

Related Work
Much recent computational research on derivational morphology in NLP has focused on two related problems: predicting the meaning of a derivative given its form, and predicting the form of a derivative given its meaning.
The first group of studies models the meaning of derivatives as a function of their morphological structure by training embeddings directly on text segmented into morphemes (Luong et al., 2013; Qiu et al., 2014) or by inferring morpheme embeddings from whole-word vector spaces, e.g., using the vector offset method (Lazaridou et al., 2013; Padó et al., 2016). Formally, given a derived form f_d, this line of research tries to find the meaning m_d that maximizes P(m_d | f_d).
The second group of studies models the form of derivatives as a function of their meaning. The meaning is represented by the base word and a semantic tag (Deutsch et al., 2018) or the sentential context. Formally, given a meaning m_d, these studies try to find the derived form f_d of a word that maximizes P(f_d | m_d).
Our study differs from these two approaches in that we model P (W |f d , m d ), i.e., we predict the overall likelihood of a derivative to exist. For future research, it would be interesting to apply derivational embeddings in studies of the second type by using them as pretrained input.
Neural link prediction is the task of inferring the existence of unknown connections between nodes in a graph. Advances in deep learning have prompted various neural models for link prediction that learn distributed node representations (Tang et al., 2015; Grover and Leskovec, 2016). Kipf and Welling (2016, 2017) proposed a convolutional graph auto-encoder that allows the inclusion of feature vectors for each node. The model was adapted to bipartite graphs by van den Berg et al. (2018).
Previous studies on neural link prediction for bipartite graphs have shown that the embeddings of the two node sets should ideally form separated clusters (Gao et al., 2018). Our work demonstrates that relations transcending the two-mode graph structure can lead to a trade-off between clustering and dispersion in embedding space.

Conclusion
We have introduced a derivational graph auto-encoder (DGA) that combines syntactic and semantic information with associative information from the mental lexicon to predict morphological well-formedness (MWF), a task that has not been addressed before. The model achieves good results and performs on par with a character-based LSTM at a fraction of the number of trainable parameters (less than 10%). Furthermore, the model learns embeddings capturing information about the compatibility of affixes and stems in derivation.

Table 5: Performance on MWF prediction. The table shows AP and AUC of the models for the nine Subreddits as well as averaged scores. Grey highlighting illustrates the best score in a column, light grey the second-best.

A.2 Morphological Segmentation
We start by defining a set of potential stems O^(i) for each Subreddit i. A word w is given the status of a potential stem and added to O^(i) if it consists of at least 4 characters and has a frequency count of at least 100 in the Subreddit. Then, to determine the stem of a specific word w, we employ an iterative algorithm. Let V^(i) be the vocabulary of the Subreddit, i.e., all words occurring in it. Define the set B_1(w) as the bases in V^(i) that remain when one affix is removed from w, and that have a higher frequency count than w in the Subreddit. For example, reaction can be segmented as re$action and react$ion, so B_1(reaction) = {action, react} (assuming action and react both occur in the Subreddit and are more frequent than reaction). We then iteratively create

B_{i+1}(w) = ⋃_{b ∈ B_i(w)} B_1(b).

Let further B_0(w) = {w}. We define S(w) = O^(i) ∩ B_m(w) with m = max{k | O^(i) ∩ B_k(w) ≠ ∅} as the set of stems of w. If |S(w)| > 1 (which is rarely the case in practice), the element with the lowest number of suffixes is chosen.
The algorithm is sensitive to most morpho-orthographic rules of English (Plag, 2003): when $ness is removed from happi$ness, e.g., the result is happy, not happi.
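The iterative part of the algorithm can be sketched as follows. This is a simplified version of ours: the affix lists, frequencies, and potential-stem set are toy stand-ins, and morpho-orthographic adjustments such as happi → happy are omitted:

```python
def bases_one_affix(word, freq, prefixes, suffixes):
    """B_1(word): bases left after stripping one affix, restricted to
    strings that are more frequent than `word` in the Subreddit."""
    bases = set()
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p):
            bases.add(word[len(p):])
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s):
            bases.add(word[:-len(s)])
    return {b for b in bases if freq.get(b, 0) > freq.get(word, 0)}

def find_stems(word, freq, prefixes, suffixes, potential_stems):
    """Return O ∩ B_m(word) for the largest m with non-empty intersection."""
    level = {word}
    result = potential_stems & level  # B_0(word) = {word}
    while True:
        nxt = set()
        for b in level:
            nxt |= bases_one_affix(b, freq, prefixes, suffixes)
        if not nxt:                   # no further bases: m has been reached
            return result
        level = nxt
        hits = potential_stems & level
        if hits:
            result = hits
    return result

# toy frequencies: both segmentations of "reaction" survive the frequency check
freq = {"reaction": 10, "action": 50, "react": 40}
stems = find_stems("reaction", freq, ["re"], ["ion", "tion"],
                   potential_stems={"action", "react"})
```

In the toy run, B_1(reaction) = {action, react}, exactly as in the example above; the tie-break by number of suffixes would then pick one element.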

A.3 Message Passing Rule
Let B̃ ∈ ℝ^{|V|×|V|} be the adjacency matrix of the DG B with added self-loops, i.e., B̃_ii = 1, and D̃ ∈ ℝ^{|V|×|V|} the degree matrix of B̃ with D̃_ii = Σ_j B̃_ij. The matrix form of the message passing step can be expressed as

X̃^(l) = D̃^(−1/2) B̃ D̃^(−1/2) X^(l−1) W^(l),

where W^(l) is the trainable weight matrix of layer l, and X^(l−1) ∈ ℝ^{|V|×n} is the matrix containing the node feature vectors from layer l − 1 (Kipf and Welling, 2016, 2017). The activation step then is

X^(l) = ReLU(X̃^(l)).

A.4 Feature-based Link Prediction
Besides Jaccard similarity, we implement three other feature-based link prediction methods.
Adamic-Adar. The Adamic-Adar (AA) index (Adamic and Adar, 2003) has to take the bipartite structure of DGs into account. Using the modified definition of common neighbors as with ζ_JS, we calculate it as

ζ_AA(s, a) = Σ_{n ∈ Γ_∩(s, a)} 1/log d(n) + Σ_{n ∈ Γ_∩(a, s)} 1/log d(n).
Common Neighbors. The score of an edge (s, a) is calculated as the cardinality of the set of common neighbors (CN) of s and a. Similarly to ζ_JS and ζ_AA, we calculate the CN score as

ζ_CN(s, a) = |Γ_∩(s, a)| + |Γ_∩(a, s)|.

Preferential Attachment. For preferential attachment (PA), the score of an edge (s, a) is the product of the two node degrees,

ζ_PA(s, a) = d(s) · d(a).
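Under the same bipartite common-neighbor sets, all three measures reduce to short set computations. A sketch (our own helper; skipping degree-1 common neighbors in AA, to avoid division by log 1 = 0, is our choice):

```python
import math

def feature_scores(nbrs, s, a):
    """AA, CN, and PA scores for a candidate edge (s, a) of a bipartite DG.

    `nbrs` maps every node to its set of one-hop neighbors; the modified
    common-neighbor sets use two-hop neighborhoods as for zeta_JS.
    """
    deg = {n: len(v) for n, v in nbrs.items()}

    def two_hop(n):
        return {m for k in nbrs[n] for m in nbrs[k]} - {n}

    common_sa = two_hop(s) & nbrs[a]
    common_as = two_hop(a) & nbrs[s]
    # degree-1 common neighbors are skipped (assumption: avoids log(1) = 0)
    aa = sum(1.0 / math.log(deg[n])
             for n in common_sa | common_as if deg[n] > 1)
    cn = len(common_sa) + len(common_as)
    pa = deg[s] * deg[a]
    return aa, cn, pa

# toy DG: two stems, two affix groups, all four edges present
nbrs = {
    "s1": {"a1", "a2"}, "s2": {"a1", "a2"},
    "a1": {"s1", "s2"}, "a2": {"s1", "s2"},
}
aa, cn, pa = feature_scores(nbrs, "s1", "a1")
```

The two common-neighbor sets are disjoint (one lies in S, the other in A), so summing over their union is equivalent to the two separate sums in the definitions above.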
The training regime is identical to Jaccard similarity. AA outperforms PA and CN but is consistently beaten by JS (see Table 5).