300-sparsans at SemEval-2018 Task 9: Hypernymy as interaction of sparse attributes

This paper describes 300-sparsans' participation in SemEval-2018 Task 9: Hypernym Discovery, with a system based on sparse coding and a formal concept hierarchy obtained from word embeddings. Our system took first place in subtasks (1B) Italian (all and entities), (1C) Spanish entities, and (2B) music entities.


Introduction
Natural language phenomena are extremely sparse by their nature, whereas continuous word embeddings employ dense representations of words. Turning these dense representations into a much sparser form can help in focusing on the most salient parts of word representations (Berend, 2017; Subramanian et al., 2018).
Sparsity-based techniques often involve the coding of a large number of signals over the same dictionary (Rubinstein et al., 2008). Sparse, overcomplete representations have been motivated in various domains as a way to increase separability and interpretability (Olshausen and Field, 1997) and stability in the presence of noise.
Non-negativity has also been argued to be advantageous for interpretability (Fyshe et al., 2015; Arora et al., 2016). Subramanian et al. (2018) illustrate this in the language domain, where sparse features are interpreted as lexical attributes: "to describe the city of Pittsburgh, one might talk about phenomena typical of the city, like erratic weather and large bridges. It is redundant and inefficient to list negative properties, like the absence of the Statue of Liberty". Berend (2018) utilizes non-negative sparse coding for word translation by training sparse word vectors for the two languages such that coding bases correspond to each other.
Here we apply sparse feature pairs to hypernym extraction. The role of an attribute pair (i, j) ∈ φ(q) × φ(h) (where q is the query word, h is the hypernym candidate, and φ(w) is the set of indices of the non-zero components in the sparse representation of w) is similar to that of interaction terms in regression; see section 2 for details.
Sparse representation is related to hypernymy in various natural ways. One of them is through Formal Concept Analysis (FCA). The idea of acquiring concept hierarchies from a text corpus with the tools of FCA is relatively new (Cimiano et al., 2005). Our submissions experiment with the formal concept analysis tool of Endres et al. (2010). See the next section for a description of formal concept lattices and of how hypernyms can be found in them.
Another natural formulation is related to hierarchical sparse coding (Zhao et al., 2009), where trees describe the order in which variables "enter the model" (i.e., take non-zero values). A node may take a non-zero value only if its ancestors also do: the dimensions that correspond to top level nodes should focus on "general" meaning components that are present in most words. Yogatama et al. (2015) offer an implementation that is efficient for gigaword corpora. Exploiting the correspondence between the variable tree and the hypernym hierarchy offers itself as a natural choice.
The task (Camacho-Collados et al., 2018) evaluated systems on their ability to extract hypernyms for query words in five subtasks (three languages, English, Italian, and Spanish, and two domains, medical and music). Queries have been categorized as concepts or entities. Results were reported for each category separately as well as in combined form, thus resulting in 5 × 3 combinations. Our system took first place in subtasks (1B) Italian (all and entities), (1C) Spanish entities, and (2B) music entities. Detailed results for our system appear in section 3. Our source code is available online.

Formal concept analysis
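Formal Concept Analysis (FCA) is the mathematization of concept and conceptual hierarchy (Ganter and Wille, 2012; Endres et al., 2010). In FCA terminology, a context C is a set of objects O, a set of attributes A, and a binary incidence relation I ⊆ O × A between members of O and A. In our application, I associates a word w ∈ O with the indices of its non-zero sparse coding coordinates i ∈ A. A formal concept of C is a pair ⟨A_1, B_1⟩ consisting of a set of objects A_1 ⊆ O (the extent) and a set of attributes B_1 ⊆ A (the intent), such that B_1 contains exactly the attributes shared by all objects in A_1, and A_1 contains exactly the objects that have all attributes in B_1. There is an order defined in the context: if ⟨A_1, B_1⟩ and ⟨A_2, B_2⟩ are concepts in C, then ⟨A_1, B_1⟩ ≤ ⟨A_2, B_2⟩ iff A_1 ⊆ A_2 (equivalently, B_2 ⊆ B_1). The concept order forms a lattice. The smallest concept whose extent contains a word is said to introduce the object. We expect that h will be a hypernym of q iff n(q) ≤ n(h), where n(w) denotes the node in the concept lattice that introduces w.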
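The lattice-based test n(q) ≤ n(h) can be illustrated with a minimal Python sketch. The toy context, the word list, and the function names below are hypothetical and serve only as an illustration; this is not the FCA tool of Endres et al. (2010), which builds the full lattice.

```python
# Toy formal context: each word (object) maps to the set of indices of its
# non-zero sparse coordinates (attributes). Values are made up for illustration.
context = {
    "dog": {0, 1, 2},
    "cat": {0, 1, 3},
    "animal": {0, 1},
    "entity": {0},
}


def introducing_concept(word, context):
    """Return (extent, intent) of the smallest concept whose extent contains `word`."""
    # The intent of the introducing concept is the attribute set of the word.
    intent = frozenset(context[word])
    # The extent collects every object that has all attributes in the intent.
    extent = frozenset(w for w, attrs in context.items() if intent <= attrs)
    return extent, intent


def is_subconcept(c1, c2):
    """Concept order: c1 <= c2 iff the extent of c1 is a subset of the extent of c2."""
    return c1[0] <= c2[0]


def predicted_hypernym(q, h, context):
    """Lattice-based test n(q) <= n(h) for a query q and a candidate h."""
    return is_subconcept(introducing_concept(q, context),
                         introducing_concept(h, context))


dog_animal = predicted_hypernym("dog", "animal", context)
animal_dog = predicted_hypernym("animal", "dog", context)
```

In this toy context the test holds for ("dog", "animal") but not in the reverse direction, because the attribute set of "animal" is contained in that of "dog".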
The closedness of extents and intents has an important structural consequence. Adding attributes to A (e.g. responses of additional neurons) will very probably grow the model. However, the original concepts will be embedded as a substructure in the larger lattice, with their ordering relationships preserved.
We use the popular skip-gram (SG) approach (Mikolov et al., 2013) to train d = 100 dimensional dense distributed word representations for each sub-corpus. The word embeddings are trained over the text corpora provided by the shared task organizers with the default training parameters of word2vec (w2v), i.e. a window size of 10 and 25 negative samples for each positive context.
We derived multi-token units by relying on the word2phrase software accompanying the w2v toolkit. An additional source for identifying multi-token units in the training corpora was the list of potential hypernyms released for each subtask by the shared task organizers.
Let W_x ∈ R^{d×|V_x|} denote the dense embedding matrix for some subcorpus of the shared task x ∈ {1A, 1B, 1C, 2A, 2B}, where |V_x| is the size of the vocabulary and d is set to 100. As a subsequent step, we turn W_x into sparse word vectors akin to Berend (2017) by solving

\[
\min_{D \in \mathcal{C},\; \alpha \in \mathbb{R}_{\geq 0}^{k \times |V_x|}} \lVert W_x - D\alpha \rVert_F^2 + \lambda \lVert \alpha \rVert_1 \tag{1}
\]

where C refers to the convex set of R^{d×k} matrices consisting of d-dimensional column vectors with norm at most 1, and α contains the sparse coefficients for the elements of the vocabulary. The only difference compared to Berend (2017) is that here we enforce a non-negativity constraint on the elements of α.
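To make the role of the non-negativity constraint concrete, the following Python sketch codes a single word vector over a fixed dictionary. It assumes the objective has the usual form, squared reconstruction error plus λ times the L1 norm of the coefficients, and it uses a projected proximal-gradient (ISTA-style) update. Both the solver and the function name `nonneg_sparse_code` are illustrative assumptions; the text does not say which solver was used, and the dictionary D is learned jointly in the actual model rather than kept fixed.

```python
def nonneg_sparse_code(w, D, lam=0.3, n_iter=500):
    """Approximately minimise ||D a - w||^2 + lam * sum(a) subject to a >= 0.

    w: dense vector of length d.
    D: list of k basis vectors, each of length d and of norm at most 1.
    The dictionary is kept fixed here (an assumption made for brevity).
    """
    k, d = len(D), len(w)
    # The gradient of the quadratic term is Lipschitz with constant at most 2k
    # when every basis vector has norm at most 1, so this step size is safe.
    step = 1.0 / (2.0 * k)
    a = [0.0] * k
    for _ in range(n_iter):
        # Residual of the current reconstruction, D a - w.
        residual = [sum(a[j] * D[j][i] for j in range(k)) - w[i] for i in range(d)]
        # Gradient of the squared error with respect to each coefficient.
        grad = [2.0 * sum(D[j][i] * residual[i] for i in range(d)) for j in range(k)]
        # Gradient step, L1 shrinkage, and projection onto a >= 0 in one go.
        a = [max(0.0, a[j] - step * (grad[j] + lam)) for j in range(k)]
    return a


# Toy example: two axis-aligned basis vectors and one that matches w exactly.
D = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
w = [0.6, 0.8]
alpha = nonneg_sparse_code(w, D)
```

In this toy example the minimiser places all weight on the third basis vector, with coefficient 1 − λ/2 = 0.85, so φ(w) = {2}.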
For the elements of the vocabulary we ran the formal concept analysis tool of Endres et al. (2010). In order to keep the size of the DAG output by the FCA algorithm manageable, we only included in the analysis the query words and those hypernyms which occur in the training dataset for the corpora. As we will see in the next section, this restriction turns out to be very useful.
Next, we determine a handful of features for a pair of expressions (q, h) consisting of a query q and its potential hypernym h. Table 1 provides an overview of the features employed for a pair (q, h). We also use q and h to denote the 100-dimensional dense vector representations of the two expressions. Additionally, we denote by Q and H the sequences of tokens constituting the query and hypernym phrases. Finally, we refer to the sets of basis vectors (in FCA terminology, attributes) which are assigned non-zero weights in the reconstruction of the vector representations of q and h as φ(q) and φ(h).
A further feature (isFrequentHypernym) indicates whether a particular candidate hypernym h belongs to the top-50 most frequent hypernyms for the category of q (i.e. concept or entity). At submission time, this feature did not work properly. Modeling the two categories separately played an important role in the success of our systems.
Three additional features are defined for incorporating the concept lattice output by FCA. With n(w) denoting the concept that introduces w, i.e. the most specific location within the DAG for w, our features indicate whether n(q) (1) coincides with n(h), (2) is the parent (immediate successor) of n(h), or (3) is the child (immediate predecessor) of n(h). Parents, and even the inverse relation, proved to be more predictive than the conceptually motivated q ≤ h. In Table 1, n_1 ≺ n_2 denotes that n_1 is an immediate predecessor of n_2. As the post-evaluation ablation experiments will show, these three features, which we refer to as the FCA features, were not useful in our submissions.
The attributePair_ij features above, our most important features, are indicator features for every possible interaction term between the sparse coefficients in α. That is, for a pair of words (q, h) we take φ(q) × φ(h), i.e. each candidate pair is assigned the Cartesian product of the index sets of the non-zero coefficients of q and h in α. Note that this feature template induces k^2 features, with k being the number of basis vectors in the dictionary matrix D according to Eq. 1.
In order to rank potential hypernym candidates over the test set, we trained a logistic regression classifier for each of the two query types (concepts and entities) using the sklearn package (Pedregosa et al., 2011), with the regularization parameter left at its default value of 1.0.
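The attribute pair template can be sketched in a few lines of Python. The function names and the encoding of the pair (i, j) as the single feature index i·k + j are illustrative assumptions, chosen only to show that the template spans k^2 indicator features.

```python
from itertools import product


def nonzero_indices(alpha_column):
    """phi(w): indices of the non-zero coefficients in the sparse code of w."""
    return {i for i, value in enumerate(alpha_column) if value > 0}


def attribute_pair_features(alpha_q, alpha_h):
    """Active indicator features for the pair (q, h), one per (i, j) in phi(q) x phi(h).

    The pair (i, j) is mapped to the index i * k + j, so the feature space has
    k * k dimensions (this encoding is an assumption of the sketch).
    """
    k = len(alpha_q)
    phi_q, phi_h = nonzero_indices(alpha_q), nonzero_indices(alpha_h)
    return sorted(i * k + j for i, j in product(phi_q, phi_h))


# Toy sparse codes with k = 4 basis vectors (made-up values).
alpha_q = [0.0, 0.7, 0.0, 0.2]   # phi(q) = {1, 3}
alpha_h = [0.5, 0.0, 0.0, 0.1]   # phi(h) = {0, 3}
active = attribute_pair_features(alpha_q, alpha_h)
```

With two non-zero coefficients on each side, four of the 16 possible indicator features are active for this pair.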
For each appropriate (q, h) pair of words for which h is a hypernym of q, we generated a number of negative samples (q, h′), such that the training data does not include h′ as a valid hypernym for q. For a given query q, belonging to either the concept or the entity category, we sampled h′ from those hypernyms which were included as a valid hypernym in the training data with respect to some query phrase q′ ≠ q.
When making predictions for the hypernyms of a query, we relied on our query-type-sensitive logistic regression model to determine the ranking of the hypernym candidates. In our official submission, we ranked only those phrases that were included in the training data as a proper hypernym at least once.
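The sampling of false hypernyms can be sketched as follows. The function name, the data layout (one dictionary per query category, mapping each query to its set of gold hypernyms), and sampling without replacement are assumptions of this sketch; the fixed seed mirrors the random seeding used to make results reproducible.

```python
import random


def sample_negatives(train, query, n_neg=50, seed=0):
    """Sample false hypernyms for `query` from the hypernyms of other queries.

    train: dict mapping each training query of one category (concept or
    entity) to the set of its gold hypernyms.
    """
    gold = train[query]
    # Candidate pool: hypernyms attested for some other query, minus the
    # hypernyms that are valid for this query. Sorted for determinism.
    pool = sorted({h for q, hs in train.items() if q != query for h in hs} - gold)
    rng = random.Random(seed)
    # Without replacement (an assumption); capped by the size of the pool.
    return rng.sample(pool, min(n_neg, len(pool)))


# Toy training data (made-up values).
train = {
    "dog": {"animal", "mammal"},
    "rose": {"plant", "flower"},
    "oak": {"plant", "tree"},
}
negatives = sample_negatives(train, "dog", n_neg=2)
```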
After the appropriate model ranked the hypernym candidates, we selected the top 15 ranked candidates and applied a post-ranking heuristic over them, i.e. reordered them according to their background frequency from the training corpus.
Our assumption here is that more frequent words tend to refer to more general concepts and more general hypernymy relations potentially tend to be more easily detectable than more specialized ones.
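The post-ranking heuristic can be sketched as below. The function name and the input layout are hypothetical, and placing the most frequent candidate first is an assumption that follows from the stated intuition about frequent words.

```python
def rerank_by_frequency(scored_candidates, corpus_freq, top_n=15):
    """Keep the top_n candidates by model score, then reorder them by frequency.

    scored_candidates: list of (hypernym, model_score) pairs.
    corpus_freq: dict mapping a phrase to its frequency in the training corpus.
    """
    # Step 1: select the candidates ranked highest by the model.
    top = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)[:top_n]
    # Step 2: reorder only those, most frequent first. The sort is stable, so
    # ties in frequency keep the order given by the model.
    return sorted((h for h, _ in top), key=lambda h: corpus_freq.get(h, 0), reverse=True)


# Toy scores and frequencies (made-up values), with top_n lowered to 3.
scored = [("poodle", 0.9), ("animal", 0.8), ("dog", 0.7), ("thing", 0.1)]
freq = {"animal": 500, "dog": 900, "poodle": 10, "thing": 10000}
reranked = rerank_by_frequency(scored, freq, top_n=3)
```

Note that "thing" is never returned in this toy example despite its high frequency, since frequency only reorders the candidates the model has already selected.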

Our submissions
Our submissions were based on k = 200 dimensional sparse vectors computed from unit-normed 100-dimensional dense vectors with λ = 0.3. The sum of the two dimensions motivates our group name. For training the regression model with negative samples, 50 false hypernyms were sampled for each query q in the training dataset. One of our submissions involved attribute pairs, the other did not. Both submissions used the conceptually motivated but practically harmful FCA-based features. Table 2 shows submission results. The figures that can be reproduced with the code in the project repo (reprd) are slightly different from our official submissions (offic) for two reasons: because the implementation of isFreqHyp contained a bug, and because of the natural randomness in negative sampling. For reproducibility, we report results without the isFreqHyp feature. The randomness introduced by negative sampling is now factored out by random seeding.

Table 3: Baseline results, most frequent training hypernyms. We (upper) consider the most frequent hypernym in the given query type (concept or entity). For comparison, we also show the MFH baseline provided by the organizers (lower) that is based on the most frequent hypernyms in general.

Table 4: Number of in-vocabulary (and out-of-vocabulary, OOV) queries per query type. The ratio of the latter is also shown.

Query type sensitive baselining
Our submission with attribute pairs achieved first place in categories (1B) Italian (all and entities), (1C) Spanish entities, and (2B) music entities. This is in part due to our good choice of a fallback solution in the case of OOV queries: we applied a category-sensitive baseline returning the most frequent training hypernyms in the corresponding query type (concept or entity). Table 4 shows how frequently we had to rely on this fallback, and Table 3 shows the corresponding pure baseline results.
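A minimal sketch of the category-sensitive fallback is given below. The function names, the triple-based data layout, the cut-off of 15 returned hypernyms, and the `rank_with_model` callback are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_hypernyms(train_triples, top_n=15):
    """Per query type, list the training hypernyms by descending frequency.

    train_triples: iterable of (query, query_type, hypernym) triples.
    """
    counts = {}
    for _, qtype, hypernym in train_triples:
        counts.setdefault(qtype, Counter())[hypernym] += 1
    return {qtype: [h for h, _ in c.most_common(top_n)] for qtype, c in counts.items()}


def predict(query, qtype, vocabulary, rank_with_model, mfh):
    """Use the trained ranker for in-vocabulary queries, the baseline for OOV ones."""
    if query in vocabulary:
        return rank_with_model(query, qtype)
    # OOV query: no embedding is available, so fall back to the most frequent
    # training hypernyms of the same query type.
    return mfh[qtype]


# Toy training data (made-up values).
triples = [
    ("dog", "Concept", "animal"), ("cat", "Concept", "animal"),
    ("cat", "Concept", "pet"),
    ("Paris", "Entity", "city"), ("Rome", "Entity", "city"),
    ("Rome", "Entity", "capital"),
]
mfh = most_frequent_hypernyms(triples)
oov_prediction = predict("Lyon", "Entity", {"dog", "cat"}, lambda q, t: [], mfh)
```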

Post-evaluation analysis
After the evaluation closed, we conducted ablation experiments, the results of which are included in Table 6. In these experiments, we investigated the contribution of the features derived from sparse attribute pairs and FCA. These ablation experiments corroborate the importance of features derived from sparse attribute pairs and reveal that turning off FCA-based features does not hurt performance at all. For this reason, even though our official shared task submission included FCA-related features, we no longer employed them in our post-evaluation experiments.
In our post-evaluation experiments we investigated the effects of generating more negative samples, i.e. we regarded all the valid hypernyms over the training set that are not a proper hypernym for q as h′ upon the creation of the (q, h′) negative training instances. This strategy is referenced as ns = all in Table 5.
In our official submission we regarded only those hypernyms as potential candidates to rank at test time which occurred at least once as a correct hypernym in the training data. We call this strategy candidate filtering. Historically, we applied this restriction to speed up the FCA algorithm, because this way the size of the concept lattice could be made smaller. As there are valid hypernyms in the test set which never occurred in the training data, our official submission would not be able to obtain a perfect score even in theory. Table 7 contains the best possible metrics on the test set that we could achieve when candidate filtering is applied. In our post-evaluation experiments we also investigated the effects of turning this filtering step off. As Table 5 illustrates, however, our scores degrade after turning candidate filtering off.
Our post-evaluation experiments in Table 5 suggest that it is advantageous to apply sparse representations of more expressive power (i.e. a higher number of basis vectors). Generating more negative samples also provides some additional performance boost. These observations hold irrespective of whether candidate filtering is employed or not; however, their effects are more pronounced when hypernym candidates are not filtered. Finally, we report our post-evaluation results for all the subtasks and compare them to the official scores of the best performing systems in Table 8. It can be seen from these enhanced results for category "all" (concepts and entities mixed) that we would win (1B) Italian and (1C) Spanish. Our post-evaluation system, which differs from our participating system only in that it fixes the calculation of a feature, does not rely on FCA-based features, and uses k = 1000, would also place third in the rest of the subtasks.

Conclusion
In this paper we experimented with the integration of sparse word representations into the task of hypernym discovery. We strove to utilize sparse word representations in two ways, i.e. via building concept lattices using formal concept analysis and via modeling the hypernymy relation with the help of interaction terms. While our former approach for deriving formal concepts from sparse word representations was not successful, the interaction terms derived from sparse word representations proved to be highly beneficial.