THU_NGN at SemEval-2018 Task 10: Capturing Discriminative Attributes with MLP-CNN model

Existing semantic models are capable of identifying the semantic similarity of words. However, it is hard for these models to discriminate between a word and another similar word. Thus, the aim of SemEval-2018 Task 10 is to predict whether a word is a discriminative attribute between two concepts. In this task, we apply a multilayer perceptron (MLP)-convolutional neural network (CNN) model to identify whether an attribute is discriminative. The CNNs are used to extract low-level features from the inputs. The MLP takes both the flattened CNN feature maps and the raw inputs to predict the labels. The evaluation F-score of our system on the test set is 0.629 (ranked 15th), which indicates that our system still needs improvement. However, the behaviour of our system in our experiments provides useful information, which can help to improve the collective understanding of this novel task.


Introduction
Evaluating the similarity of words is an important task in semantic modeling. There have been different approaches based on corpus statistics (Jiang and Conrath, 1997; Mihalcea et al., 2006) and ontology (Seco et al., 2004; Sánchez et al., 2012). After the effective word representation proposed by Mikolov et al. (2013), word similarity can also be evaluated based on word embedding weights (Levy and Goldberg, 2014). Usually, higher cosine similarity between word embedding vectors indicates higher semantic similarity.
However, existing semantic methods are not capable of discriminating similar words from each other without additional information. For example, it is easy for these models to tell that "dog" and "puppy" are similar, but they cannot tell the differences between them. This limits the use of these models for mining such fine-grained semantic information from texts. Thus, SemEval-2018 Task 10 is proposed to determine whether an attribute can help to discriminate between two words (Krebs et al., 2018). One can express semantic differences between concepts by referring to attributes associated with those concepts, and the differences between concepts can usually be identified by the presence or absence of specific attributes. For example, the attributes "red" and "yellow" are discriminative for the concepts "apple" and "banana", while "sweet" or "fruit" are not discriminative.
Capturing such discriminative attributes can be regarded as a binary classification task: given two words and an attribute, predict whether the attribute discriminates between the two words. Existing methods to capture discriminative attributes are mainly based on dictionaries (Parikh and Grauman, 2011). In recent years, CNNs have been successfully applied to text classification tasks (Kim, 2014). In order to address this task, we develop a system based on an MLP-CNN model. First, the input words are converted into dense vectors using a combination of different word embeddings. Then CNN layers are used to extract features from these vectors. Finally, an MLP classifier predicts binary labels based on both the embedding and CNN features. The experimental results show that our model outperforms several baseline neural models, and that the additional features improve model performance. Our system still has room for improvement according to the experimental analysis, and its behaviour in our experiments can help to further fix and extend our model.

MLP-CNN Model
The architecture of our MLP-CNN model is shown in Figure 1. The input of our system is a pair of words with an attribute. First, an embedding layer is used to provide different kinds of pre-trained embedding weights (v1-dim) and the word features (vf-dim). We use three different pre-trained embedding weights and concatenate them together with the additional features of each word. Thus, the output of the embedding layer is a 3 x (3v1 + vf) matrix. Second, a 2-layer convolutional neural network takes these vectors as input and outputs the flattened feature maps. We pad with zeros on both sides to keep the output length unchanged. Since the input length is 3, the three time steps of the convolutional feature maps can respectively extract the inherent relatedness of the first word with the attribute, the second word with the attribute, and all three words together. The feature map dimensions of the two CNN layers are v2 and v3 respectively.
In order to reduce the difficulty of gradient propagation in neural networks, we use an over-layer connection between the input and output of the CNN: we concatenate the flattened feature maps with all word embeddings and features. Finally, an MLP with ReLU and sigmoid activation functions is used to predict the normalized binary label. With the help of the over-layer connection, the MLP classifier can learn from high-level word information and raw semantic information at the same time. Since the final labels are obtained from the word triples through the embedding, CNN and dense layers, all parameters can be tuned during model training.
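The forward pass described above can be sketched in plain numpy. This is a minimal illustration, not the actual implementation: the toy dimensions, random initialization, and all function names here are our own, and the real system trains its weights end-to-end.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, W, b):
    """1-D convolution over the time steps with zero padding on both
    sides ('same' output length), followed by ReLU.
    x: (T, d_in), W: (k, d_in, d_out), b: (d_out,)."""
    k = W.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], W, axes=([0, 1], [0, 1])) + b
                    for t in range(x.shape[0])])
    return relu(out)

def forward(x, p):
    """x: (3, D) -- rows are word1, word2 and attribute vectors
    (concatenated embeddings plus features)."""
    h = conv1d_same(x, p["W1"], p["b1"])           # first CNN layer
    h = conv1d_same(h, p["W2"], p["b2"])           # second CNN layer
    feat = np.concatenate([h.ravel(), x.ravel()])  # over-layer connection
    z = relu(feat @ p["Wd"] + p["bd"])             # MLP hidden layer, ReLU
    return sigmoid(z @ p["Wo"] + p["bo"])          # sigmoid output in (0, 1)

# Toy dimensions; the real system uses 900-dim embeddings plus features,
# 256-dim feature maps (v2, v3) and a 300-dim dense layer.
rng = np.random.default_rng(0)
D, F, H = 16, 8, 10
params = {
    "W1": rng.normal(0, 0.1, (3, D, F)), "b1": np.zeros(F),
    "W2": rng.normal(0, 0.1, (3, F, F)), "b2": np.zeros(F),
    "Wd": rng.normal(0, 0.1, (3 * F + 3 * D, H)), "bd": np.zeros(H),
    "Wo": rng.normal(0, 0.1, H), "bo": 0.0,
}
prob = forward(rng.normal(size=(3, D)), params)
```

Because the padding keeps three time steps, each feature-map position stays aligned with one of the three word-attribute combinations described above.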

Word Embedding
Since several words in the dataset are out-of-vocabulary for any single pre-trained word embedding, we use three different embedding models to cover them. The three embedding models are the pre-trained word2vec embedding 1 provided by Mikolov et al. (2013), the GloVe embedding 2 provided by Pennington et al. (2014) and the fastText embedding 3 released by Bojanowski et al. (2016). These embedding weights are all 300-dim. They are concatenated together as the representation of the input words.
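A minimal sketch of this concatenation, assuming each embedding model is available as a word-to-vector mapping (the function name and the zero-vector backoff for words missing from one space are our own illustrative choices):

```python
import numpy as np

DIM = 300  # each pre-trained embedding space is 300-dim

def concat_embedding(word, w2v, glove, fasttext):
    """Build a 900-dim representation by concatenating the word2vec,
    GloVe and fastText vectors. A word missing from one vocabulary
    backs off to zeros in that slice, so the word is only fully
    out-of-vocabulary when all three models miss it."""
    return np.concatenate([table.get(word, np.zeros(DIM))
                           for table in (w2v, glove, fasttext)])

# Toy lookup tables standing in for the real pre-trained weights.
w2v = {"dog": np.ones(DIM)}
glove = {}                       # "dog" is OOV here
fasttext = {"dog": 2 * np.ones(DIM)}
vec = concat_embedding("dog", w2v, glove, fasttext)
```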

Word Feature
In our model, we use one-hot encoded POS tags and two binary features obtained by WordNet (Miller and Fellbaum, 1998). In the dataset, the words to be discriminated are nouns, but the attributes can be nouns, adjectives, verbs and so on. Thus, the POS tags of words can help to identify the types of attributes and the relationships between them. We use the Stanford parser tool 4 to get the POS tags of words. The WordNet feature we use is based on synsets. Among the three input words, if one word is in the synset of another word, the corresponding feature digit is set to 1; otherwise it is set to 0. In this way, a 2-dim synset feature for each word can be obtained. We use the nltk tool (Bird et al., 2009) to generate the WordNet features. The features above are concatenated with the word embeddings as the input of the MLP-CNN model.
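The synset feature can be sketched as follows. The actual system queries WordNet through nltk; the toy synset table and function name below are our own stand-ins so the idea is self-contained.

```python
# Toy stand-in for WordNet synset lemma lookup (the real system uses
# nltk's wordnet interface to collect lemma names per word).
SYNSETS = {
    "puppy": {"puppy", "dog"},
    "dog": {"dog", "domestic_dog", "canine"},
}

def synset_features(triple, synsets=SYNSETS):
    """For each of the three input words, return a 2-dim binary
    feature: whether each of the other two words appears among its
    synset lemmas (1) or not (0)."""
    feats = []
    for w in triple:
        lemmas = synsets.get(w, {w})
        feats.append([int(other in lemmas) for other in triple if other != w])
    return feats
```

For the triple ("puppy", "dog", "black"), "dog" is among the lemmas of "puppy", so the first word's feature fires in one position while the others stay zero.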

Model Training and Ensemble
Since the train set is unbalanced, we randomly select equal numbers of positive (the attribute is discriminative) and negative (the attribute is not discriminative) samples from the train set each time. Thus, the training data in our experiments consists of the sampled data from the train set and 80% of the data sampled from the dev set. The remaining 20% of the dev set is used for validation.
Model ensembling has been proven useful for neural networks (Wu et al., 2017). Therefore, we build different training samples using the method described in the above paragraph and train our model 10 times. The final predictions on the test set are the average of the predictions of the 10 models. In this way, the performance of the neural model can be further improved.
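The sampling and ensembling steps above can be sketched as below; the function names and the tuple layout (label in the last position) are illustrative assumptions, not the submitted code.

```python
import random

def balanced_sample(samples, seed=0):
    """Randomly draw equal numbers of positive (discriminative) and
    negative samples by down-sampling the majority class."""
    rng = random.Random(seed)
    pos = [s for s in samples if s[-1] == 1]
    neg = [s for s in samples if s[-1] == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)

def ensemble_predict(models, x):
    """Average the probability outputs of independently trained models
    (10 models in our experiments)."""
    return sum(m(x) for m in models) / len(models)

# Toy usage: 30 positive vs. 10 negative triples get balanced to 10/10.
data = [("apple", "banana", "red", 1)] * 30 + [("apple", "banana", "sweet", 0)] * 10
balanced = balanced_sample(data)
avg = ensemble_predict([lambda x: 0.2, lambda x: 0.8], None)
```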

Experiment Settings
The dataset we use is constructed based on the approach proposed by Lazaridou et al. (2016), and the initial source of data is provided by McRae et al. (2005). The entire dataset contains 17,547 samples for training, 2,722 for validation and 2,340 for testing. The training set is automatically generated, while the validation and test sets are manually refined. The models are evaluated by F1-measure, as is standard for binary classification tasks.
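For reference, the F1-measure is the harmonic mean of precision and recall over the positive (discriminative) class:

```python
def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives and false
    negatives: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```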
In our network, the kernel sizes of the CNN are set to 3. The dimensions of the feature maps v2 and v3 are set to 256, and the dimension of the dense layer is 300. The dropout rate on both the embedding weights and the CNN is set to 0.2. The training batch size is set to 50. We use Adam as the optimizer for network training, with 10 epochs per run. We train our model 10 times and average the predictions on the test set.

Performance Evaluation
The experimental results on the test and validation sets are shown in Table 1. For comparison, we also present several baseline models. Our official submission is the MLP-CNN model with the ensemble technique. Our F-score is 62.9 (ranked 15th) in the evaluation phase. From the evaluation results, we can see that our model outperforms these baseline models, which shows that our network architecture can learn more semantic information from the words and attributes. However, our system still needs improvement compared with the top system (F-score of 75.0). In addition, the test results are much lower than the validation results. Detailed analysis follows in the next section.

Influence of Trainable Word Embedding
The influence of different word embedding weights, and of fine-tuning them or not, is shown in Table 2. Note that we do not apply the model ensemble technique here. From the results, we can see that the combination of different word embeddings significantly improves model performance. It may be because using different word embeddings provides richer semantic information. In addition, combining different word embeddings covers more words, reducing the out-of-vocabulary rate relative to any single embedding file; thus, the predictions for such words can be more accurate. However, we find that fine-tuning the pre-trained word embeddings is not a good choice: the fine-tuned model performance is significantly worse than that of models with frozen embeddings. Since the training, validation and test sets have no feature overlap, fine-tuning the embedding weights leads to serious over-fitting and poor model generalization. We fine-tuned the embeddings of the models used in our official submission, so those results are lower than the models with untrainable embeddings.

Influence of Word Features
The influence of the two types of features is shown in Table 3. The results show that the additional word features can improve the performance of our neural model. Attributes with different POS tags provide different semantic information. For example, given the pair of words "boy" and "woman", the attributes "young" and "run" describe very different aspects. Therefore, POS tag features can help the model to extract different features from the input words. The feature based on WordNet can also improve our model. It may be because, if the attribute is in the synsets of a concept, it is usually not a discriminative attribute.

Case Study
Several examples of model predictions on the test set are shown in Table 4. From the true predictions, we can see that our model can capture simple attributes of concepts such as colors. However, more complex relationships between words and attributes are difficult for our system to mine. For example, the word "mouse" can be an animal or electronic device. It's hard to identify such semantic differences without incorporating external knowledge, since the information provided by the training data is limited.
Table 4: Examples of model predictions on the test set.
True Positive: corn,tomato,yellow; ant,snail,black
False Positive: alcohol,liquor,strong; bar,shop,sell
True Negative: father,brother,family; father,mother,parent
False Negative: mouse,dog,plastic; engine,vehicle,component

Conclusion
Discriminating between similar words without additional information is difficult for existing semantic models. Therefore, SemEval-2018 Task 10 is proposed to fill this gap. In this paper, we apply an MLP-CNN model with word features to this task. In our model, the input and output of the CNN are joined by an over-layer (highway-style) connection, and both are taken by an MLP classifier for binary classification. Based on this model, the local relationships between each pair of words can be mined. Our evaluation F-score is 62.9 (ranked 15th). The detailed analysis of our system shows that it can be further improved.