Enhancing Automatic Wordnet Construction Using Word Embeddings

Researchers have shown that a wordnet for a new language, possibly resource-poor, can be constructed automatically by translating wordnets of resource-rich languages. The quality of these constructed wordnets is affected by the quality of the resources used in the construction process, such as dictionaries and translation methods. Recent work shows that vector representations of words (word embeddings) can be used to discover related words in text. In this paper, we propose a method that performs such similarity computation using word embeddings to improve the quality of automatically constructed wordnets.


Introduction
A wordnet is a lexical ontology of words. High-quality wordnets have been developed for only a few languages. Wordnets, other than the Princeton WordNet (PWN) (Fellbaum, 1998), are typically constructed by one of two approaches. The translation approach translates the PWN to target languages (Saveski and Trajkovski, 2010; Oliver and Climent, 2012; Lam et al., 2014). In contrast, the merge approach builds the semantic taxonomy of a wordnet in a target language, and then aligns it with the Princeton WordNet by generating translations (Gunawan and Saputra, 2010; Rodríguez et al., 2008).
In this paper, we propose a method to enhance the translation approach using word embeddings produced by the word2vec algorithm (Mikolov et al., 2013). We produce wordnets in several languages although the current paper focuses only on the new Arabic wordnet we construct.

Constructing Initial Wordnet
We start by automatically generating wordnet synsets for a target language T using the method presented by Lam et al. (2014), which translates synsets from several intermediate wordnets and ranks them. The approach generates wordnet synsets that do not include any semantic links between them. This paper discusses how we construct the semantic links between synsets in T. Figure 1 shows that we take advantage of the fact that the wordnet synsets created in the previous step are aligned with PWN. This means that synsets with the same meaning for different languages share the same synset ID. To construct the links between synsets in our new wordnet TWN for language T, we extract each synset TWN_i from TWN and find the corresponding synset in PWN, PWN_i. Here, i is the ID of the synset. Then, for each synset PWN_i, we extract each semantic relation r_j and every linked synset PWN_k within PWN. Finally, if a synset with ID k is present in TWN, we add a link of type r_j between TWN_i and TWN_k in the newly constructed TWN.
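The link-construction step can be sketched as follows. This is a minimal illustration under assumed data structures, not the implementation used in the paper: the dictionary layout, the function name build_links, and the toy synset IDs and relation labels are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: PWN relations keyed by synset ID,
# each entry listing (relation name, linked synset ID) pairs.
PwnRelations = Dict[str, List[Tuple[str, str]]]


def build_links(twn_ids: Set[str], pwn_relations: PwnRelations) -> List[Tuple[str, str, str]]:
    """Copy a PWN relation into TWN whenever both endpoint IDs exist in TWN."""
    links = []
    for i in sorted(twn_ids):                      # each synset TWN_i
        for relation, k in pwn_relations.get(i, []):  # relations of PWN_i
            if k in twn_ids:                       # linked synset also exists in TWN
                links.append((i, relation, k))
    return links


# Toy data (made up for illustration only).
pwn_relations = {
    "s1": [("hypernym", "s2"), ("part_holonym", "s3")],
    "s2": [("hyponym", "s1")],
}
twn_ids = {"s1", "s2"}  # "s3" was not produced by the translation step

links = build_links(twn_ids, pwn_relations)
print(links)
```

In this toy example the part_holonym link from s1 is dropped because its target s3 has no counterpart in TWN.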

Generating Word Embeddings
In order to validate the synsets we create using translation and obtain relations between them, we use the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm uses a feedforward neural network to predict the vector representation of words within a multi-dimensional language model. Word2vec has two variations: Skip-Gram (SG) and Continuous Bag-Of-Words (CBOW). In the SG version, the neural network predicts words adjacent to a given word on either side, while in the CBOW model the network predicts the word in the middle of a given sequence of words. In the work presented in this paper, we generate representations of words using both models with several different vector and window sizes to obtain the settings for the highest precision. The purpose of the steps discussed next is to improve the quality of synsets produced by the translation process in addition to generating relations among the synsets.

Pair                  Cosine Similarity
(word_1, word_2)      0.91
(word_1, word_3)      0.22
(word_1, word_4)      0.82
(word_2, word_3)      0.34
(word_2, word_4)      0.72
(word_3, word_4)      0.12
Table 1: An example of cosine similarity between words in a candidate synset

Removing irrelevant words in synsets
We compute the cosine similarity between word vectors within each single synset in TWN, the wordnet being constructed in language T, to filter false word members within synsets. To filter the initially constructed synsets in TWN, we pick a threshold value α such that the selected words have cosine similarity larger than α with each other. For example, let synset^c_i = {word_1, word_2, word_3, word_4} be a candidate synset to be potentially included in TWN. We compute the cosine similarity between all the possible pairs of words in synset^c_i. Then, we extract the pair of words with the highest cosine similarity. If this pair of words has cosine similarity larger than α, the pair is kept in the final synset, synset_i; otherwise, synset^c_i itself is discarded, since it may have been a low-quality candidate synset generated in the translation process. Next, among the remaining words in synset^c_i, a word is kept if its similarity with any word already in synset_i is higher than α. For example, let us assume that the cosine similarities between the words in synset^c_i are as shown in Table 1 and α = 0.70. First, the pair with the highest cosine similarity, (word_1, word_2), is kept in the final synset_i since its cosine similarity is larger than α. Then, word_3 is discarded since it does not have cosine similarity larger than α with any of the words in the current final synset_i. Finally, word_4 is kept in synset_i since it does have a cosine similarity with word_1 that satisfies the threshold α.
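The filtering procedure can be sketched in a few lines, using the values of Table 1. This is an illustrative sketch rather than the authors' code: the function name filter_synset and the lookup-table representation of similarities are assumptions, as are two details the text leaves open, namely that the remaining words are examined in a single pass in their original order and that a candidate with fewer than two words is discarded.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional


def filter_synset(words: List[str],
                  sim: Dict[FrozenSet[str], float],
                  alpha: float) -> Optional[List[str]]:
    """Return the filtered synset, or None if the candidate is discarded."""
    pairs = list(combinations(words, 2))
    if not pairs:                        # assumption: too few words to validate
        return None
    # Seed the final synset with the most similar pair.
    best = max(pairs, key=lambda p: sim[frozenset(p)])
    if sim[frozenset(best)] <= alpha:    # even the best pair fails the threshold
        return None
    kept = list(best)
    # Assumption: one pass over the remaining words in their original order.
    for w in words:
        if w in kept:
            continue
        if any(sim[frozenset((w, k))] > alpha for k in kept):
            kept.append(w)
    return kept


# Cosine similarities from Table 1.
table_1 = {
    frozenset(("word_1", "word_2")): 0.91,
    frozenset(("word_1", "word_3")): 0.22,
    frozenset(("word_1", "word_4")): 0.82,
    frozenset(("word_2", "word_3")): 0.34,
    frozenset(("word_2", "word_4")): 0.72,
    frozenset(("word_3", "word_4")): 0.12,
}
words = ["word_1", "word_2", "word_3", "word_4"]

kept = filter_synset(words, table_1, alpha=0.70)
print(kept)  # ['word_1', 'word_2', 'word_4']
```

With α = 0.70 the sketch reproduces the worked example: word_3 is dropped and word_4 is retained through its similarity with word_1.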

Validating candidate relations
Similarly, we compute the cosine similarity between words within pairs of semantically related synsets. This allows us to verify the constructed relations between synsets in TWN. For example, let synset_i = {word_i1, word_i2, word_i3, word_i4} and synset_j = {word_j1, word_j2, word_j3, word_j4} be synsets in TWN, and let ρ_ij be a candidate semantic relation between synset_i and synset_j. We compute the cosine similarity between all the possible pairs of words, one from synset_i and one from synset_j, and take the maximum similarity. If this value is larger than a threshold α_ρ, we retain the relation ρ_ij; otherwise, we discard it.
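A minimal sketch of this relation check is given below. The two-dimensional toy vectors, the helper names cosine and keep_relation, and the choice to skip words that have no embedding are illustrative assumptions; the paper does not specify how out-of-vocabulary words are handled.

```python
import math
from itertools import product
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_relation(synset_i: List[str], synset_j: List[str],
                  vectors: Dict[str, Sequence[float]],
                  alpha_rho: float) -> bool:
    """Retain the relation if the best cross-synset pair exceeds alpha_rho."""
    sims = [
        cosine(vectors[a], vectors[b])
        for a, b in product(synset_i, synset_j)
        if a in vectors and b in vectors   # assumption: skip words without embeddings
    ]
    return bool(sims) and max(sims) > alpha_rho


# Toy embeddings (made up for illustration only).
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

related = keep_relation(["a"], ["c"], vectors, alpha_rho=0.5)    # cos is about 0.707
unrelated = keep_relation(["a"], ["b"], vectors, alpha_rho=0.5)  # cos is 0.0
print(related, unrelated)
```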

Selecting thresholds
To pick the synset similarity threshold value α and the threshold α ρ for each semantic relation we create, we compute the cosine similarity between pairs of synonym words, semantically related words, and non-related words obtained from existing wordnets. Then, based on the previous data, we select the threshold values that are associated with higher precision and maximum coverage.

We discuss the generation of a wordnet for Arabic as an example although we have worked with several other languages.

Datasets and Resources Used
To construct the core wordnet, i.e., wordnet synsets, we use the Microsoft Translator to translate English synsets from PWN to Arabic synsets. We have selected the Microsoft Translator because it gives acceptable quality free of cost. For generating vector representations of the Arabic words, we use the following freely available corpora: the Watan-2004 corpus (12 million words) (Abbas et al., 2011), the Khaleej-2004 corpus (3 million words) (Abbas and Smaili, 2005), and 21 million words of Arabic Wikipedia articles, combined into a single file. In order to compute the synset similarity threshold value α and the threshold α_ρ for each semantic relation, we use the freely available Arabic wordnet (AWN) (Rodríguez et al., 2008). AWN was manually constructed in 2006 and has been semi-automatically enhanced and extended several times. We start by extracting synonym words, semantically related words, and non-related words from AWN. Then, we use histograms of the cosine similarities of these sets of words to set the thresholds. As Figure 2 shows, more than 67% of the non-related words have cosine similarity less than 0.1, while about 23% of the synonym words in AWN have a cosine similarity less than 0.1. Furthermore, about 34% of the semantically related words in AWN have cosine similarity less than 0.1. Table 2 shows the weighted average cosine similarity between synonyms, hypernyms, topic-domain related, part-holonyms, instance-hypernyms, and member-meronyms in AWN, where the frequency of the similarity value is the weight.
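The two statistics used here, the share of pairs below a cutoff and the frequency-weighted average similarity, can be sketched as follows. This reflects one reading of "the frequency of the similarity value is the weight": similarities are rounded to two decimals (an assumed bin width), and each distinct value is weighted by how often it occurs. The function names and the sample values are hypothetical.

```python
from collections import Counter
from typing import List


def weighted_average(similarities: List[float], ndigits: int = 2) -> float:
    """Average of distinct (rounded) similarity values, weighted by frequency."""
    counts = Counter(round(s, ndigits) for s in similarities)
    total = sum(counts.values())
    return sum(value * freq for value, freq in counts.items()) / total


def fraction_below(similarities: List[float], cutoff: float = 0.1) -> float:
    """Share of pairs whose cosine similarity is below the cutoff."""
    return sum(1 for s in similarities if s < cutoff) / len(similarities)


# Made-up similarities for a handful of word pairs.
sims = [0.05, 0.05, 0.30, 0.60]

avg = weighted_average(sims)
low = fraction_below(sims, cutoff=0.1)
print(avg, low)
```

Comparing fraction_below across synonym, related, and non-related pairs gives the kind of separation reported for Figure 2, which is what motivates a cutoff such as 0.1.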

Creating an Arabic Wordnet
We choose the algorithm for creating wordnet synsets presented by Lam et al. (2014) because it requires a limited number of freely available resources, which makes it applicable to resource-poor languages. Then, we apply the method we propose in Section 2 to create semantic links between Arabic synsets. Table 3 shows statistics of some of the created links between synsets in our Arabic wordnet.

Producing Word Embeddings for Arabic
We test the word2vec algorithm with different window sizes. We generate word embeddings using the CBOW version with window sizes 3, 5 and 8. Next, we compute the weighted averages of the cosine similarity between the synonyms in AWN. The highest weighted average we obtained was 0.288 with window size 3, while the weighted averages obtained with window sizes 5 and 8 were 0.283 and 0.277 respectively. Then, we compare the SG and CBOW versions with different vector sizes. Table 4 shows the weighted average cosine similarity obtained between 16,000 pairs of synonyms in AWN using both variations of word2vec, with window size = 3 and vector size set to 100, 200, and 500. We notice that both versions produce very similar results, with a slight advantage to SG at the cost of more execution time. However, for the corpus we use, a smaller vector size produces better precision.

Evaluation & Discussion
We compute cosine similarity between semantically related words extracted from our initial Arabic wordnet produced in Section 4.2. The language model used to calculate the cosine similarity is created using CBOW with vector size = 100 and window size = 3. Table 5 shows a comparison between the number of Arabic synsets we create and the number of synsets in AWN. We notice that the translation method we use produces a high number of synsets compared to the manually constructed AWN. However, the number of synsets sharply decreases after filtering the initial synonyms using the method described in Section 3. Although our Arabic wordnet is automatically created, the number of synsets we create is 60% of the number of synsets in the manually created AWN when filtering the synsets using α = 0.1. We evaluate precision by comparing 600 pairs of synonyms, hypernyms, part-holonyms, and member-meronyms with three ranges of cosine similarity values: 0 to 0.1, 0.1 to 0.288, and 0.288 to 1. We asked 3 Arabic speakers to evaluate the pairs using a 0 to 5 scale, where 0 represents the minimum score and 5 represents the maximum score. We compute precision by taking the average score and converting it to a percentage. See Table 6.
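As a small worked illustration of the precision computation, the sketch below averages rater scores on the 0 to 5 scale and expresses the mean as a percentage of the maximum score. The ratings are made up, and the function name precision_percent and the choice to pool all raters' scores before averaging are assumptions.

```python
from typing import List


def precision_percent(ratings: List[List[int]], max_score: int = 5) -> float:
    """Mean rating over all pairs and raters, as a percentage of max_score."""
    scores = [score for pair_scores in ratings for score in pair_scores]
    return 100.0 * sum(scores) / (len(scores) * max_score)


# Hypothetical scores: one inner list per word pair, one score per rater.
ratings = [
    [4, 5, 3],
    [5, 5, 5],
]

precision = precision_percent(ratings)
print(precision)  # 90.0
```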
The precision of the synonyms, hypernyms, part-holonyms, and member-meronyms we produce is 78.4%, 84.4%, 90.4%, and 79.6% respectively, with the threshold set to 0.288. This is higher than the precision obtained by Lam et al. (2014), whose method produces synonyms with 76.4% precision when just using PWN. Our results suggest that using lower precision for producing synsets reduces the quality of the other created semantic relations. Our results clearly show that pairs with higher cosine similarity are more likely to be semantically related. This confirms the benefit of combining the translation method with word embeddings in the process of automatically generating new wordnets.

Conclusion & Future Work
In this paper, we discuss an approach for automatically generating new wordnets for low-resource languages. Our approach takes advantage of word embeddings to enhance the translation method for automatic wordnet creation. We present an application of our approach to producing a new Arabic wordnet. Our method automatically produces Arabic synonyms with 78.4% precision and semantically related pairs of words with up to 90.4% precision. Currently, we are in the process of applying our method to other languages such as Assamese, Bengali and Tamil.