Leveraging Adjective-Noun Phrasing Knowledge for Comparison Relation Prediction in Text-to-SQL

One key component of text-to-SQL is predicting the comparison relations between columns and their values. To the best of our knowledge, no existing model explicitly introduces external common knowledge to address this problem, so their ability to predict comparison relations is limited beyond the training data. In this paper, we propose to leverage adjective-noun phrasing knowledge mined from the web to predict comparison relations in text-to-SQL. Experimental results on both the original and a re-split Spider dataset show that our approach achieves significant improvements over state-of-the-art methods on comparison relation prediction.


Introduction
Text-to-SQL (Yaghmazadeh et al., 2017; Zhong et al., 2017), which aims at mapping natural language to SQL queries, is one of the most important tasks in natural language processing. Most state-of-the-art models are end-to-end neural networks (Zhang et al., 2017; Xu et al., 2017; Yu et al., 2018a; Herzig and Berant, 2018; Dong and Lapata, 2018; Yu et al., 2018b), which mainly extend the Seq2Seq architecture with more sophisticated network structures. As shown by Yu et al. (2018b), the performance of most methods is overstated, because they merely match semantic parsing results rather than truly understand the meaning of the inputs (Finegan-Dollak et al., 2018). In this paper, we study the comparison relation prediction problem, as the comparison relations between columns and their values are not well understood by existing methods.
In SQL queries, comparison relations are expressed by the comparison operators (=, ≠, <, ≤, >, ≥) and the value ordering keywords (ASC, DESC in the ORDER BY expression). In most text-to-SQL models, the comparison relations are either generated using Seq2Seq architectures (Zhong et al., 2017; Dong and Lapata, 2018) or predicted using classifiers trained with output decoding features (Xu et al., 2017; Yu et al., 2018b). We give an example to show that external common knowledge is indispensable to truly understand the comparison relations on unseen data. Table 1 shows basic information about the athletes of the 100-meter sprint. Given the query "what is the name of the oldest player ?", the goal is to generate the SQL "SELECT name FROM players ORDER BY birth date ASC LIMIT 1". Note that ASC encodes the common knowledge that a small value in the birth date column means "old". Supposing that the training data contains columns named age and some queries containing "old", it is easy for trained models to remember that "old" means selecting a large value from the column age. But if birth date is unseen in the training data, there is little chance of predicting the comparison relation correctly without the common knowledge that "age" and "birth date" both represent age but have opposite value polarities. Similarly, for the query "fast runner", models should select a large value from a column speed or a small value from a column time. Some related works address the intensity of adjectives (De Melo and Bansal, 2013; Ruppenhofer et al., 2014; Sharma et al., 2017); however, no existing work studies the relations between the value polarities of adjective-noun phrasing pairs.
In this paper, we propose to explicitly incorporate column value polarities as external knowledge in text-to-SQL models. Our goal is to extend the comparison relation prediction capabilities of existing models to unseen data. We refer to the column value polarities as "adjective-noun phrasing knowledge", which can be easily constructed from the web. We further formulate the phrasing knowledge as feature vectors, which can be generically fed to existing models. Experimental results on both the original and the re-split Spider dataset show that our approach achieves significant improvement over state-of-the-art methods on comparison relation prediction.

Adjective-Noun Phrasing Knowledge
There are three steps to construct the knowledge: 1) adjective-noun pair candidate extraction, 2) value polarity mining, and 3) adjective-noun pair clustering. The resulting adjective-noun phrasing knowledge consists of two clusters, each of which is a list of adjective-noun pairs with the same value polarity.

Adjective-Noun Pair Candidate Extraction
As the value polarities depend on both adjectives and nouns, we first extract adjective-noun pairs that could be candidates of the phrasing knowledge. Typically, table column names could be considered as hypernyms of the corresponding cell values. Motivated by this, we gather a list of noun candidates, which consists of the concepts (hypernyms) from Microsoft Concept Graph 1 (Cheng et al., 2015;Wu et al., 2012) and the column names whose value types are number or date in the Web Table Corpora 2 (Lehmberg et al., 2016).
To extract the adjectives that modify the noun candidates, we introduce two POS tag patterns: (1) [ADJ] [NOUN]; (2) [NOUN] is|are|was|were|be [ADJ], where [ADJ] and [NOUN] are the POS tags for adjectives and nouns, respectively. We apply these two patterns to the Wikipedia dump 3 to extract adjective-noun pair candidates. Taking price as an example, we obtain adjectives such as high, expensive, and cheap.
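The extraction step can be sketched as a simple pattern matcher over POS-tagged sentences. A real pipeline would run a POS tagger over the Wikipedia dump; the Universal POS tag names and helper below are illustrative:

```python
# Sketch of the two POS-tag extraction patterns over pre-tagged sentences.
# The tag names ("ADJ", "NOUN") follow the Universal POS tag set and are
# an assumption about the tagger's output format.

COPULAS = {"is", "are", "was", "were", "be"}

def extract_adj_noun_pairs(tagged):
    """tagged: list of (token, pos_tag) tuples for one sentence."""
    pairs = set()
    for i, (tok, pos) in enumerate(tagged):
        # Pattern 1: [ADJ] [NOUN], e.g. "high price"
        if pos == "ADJ" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
            pairs.add((tok.lower(), tagged[i + 1][0].lower()))
        # Pattern 2: [NOUN] is|are|was|were|be [ADJ], e.g. "the price is high"
        if (pos == "NOUN" and i + 2 < len(tagged)
                and tagged[i + 1][0].lower() in COPULAS
                and tagged[i + 2][1] == "ADJ"):
            pairs.add((tagged[i + 2][0].lower(), tok.lower()))
    return pairs
```

Running both patterns over the same sentence naturally merges candidates, since the output is a set of (adjective, noun) pairs.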

Value Polarity Mining
As shown in Table 1, the value polarities of age and birth date are opposite though both columns are about "age". We propose to mine the value polarities from the Web Table Corpora automatically. We assume that if the values of two columns in the same table have negative correlations, the two columns have opposite value polarities 4 . We use Spearman's ρ coefficient to measure the correlation between two columns. For each pair of columns with value type number or date, if the frequency of their co-occurrences is above 20 and the coefficient ρ is below −0.9 for more than 50% of the co-occurrences in the corpora, the two columns are determined to have opposite value polarities.
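The mining rule above can be sketched in a few lines; the Spearman computation below is simplified (it assumes no tied values), and the thresholds follow the text:

```python
# Simplified sketch of the polarity-mining test: two numeric columns are
# flagged as having opposite value polarities when enough of their
# co-occurrences are strongly anti-correlated.

def spearman_rho(xs, ys):
    """Spearman's rho for two equal-length value lists without ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def opposite_polarity(cooccurrences, min_count=20, rho_thresh=-0.9):
    """cooccurrences: one (col1_values, col2_values) entry per table
    in which the two columns co-occur."""
    if len(cooccurrences) <= min_count:
        return False
    neg = sum(1 for xs, ys in cooccurrences if spearman_rho(xs, ys) < rho_thresh)
    return neg / len(cooccurrences) > 0.5
```

For example, age and birth year columns from many tables would yield ρ close to −1 in most tables, triggering the opposite-polarity decision.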

Adjective-Noun Pair Clustering
So far, we have obtained the adjective-noun pairs and the value polarity relations for nouns. However, it is still unclear whether a polarity is positive or negative. Positive (negative) polarity means that large (small) column values shall be selected when the noun is modified by the adjective. For example, the polarities of <age, old> and <price, expensive> are positive, while <age, young> and <price, cheap> are negative. Our goal is to separate the adjective-noun pairs into two clusters based on the value polarities. Supposing that we know the polarity of <age, old> is positive, we can derive that the polarities of <age, young> and <birth date, old> are negative, because "old" and "young" are antonyms, and the value polarities of age and birth date are opposite. Motivated by this, we extend and group the adjective-noun pairs into clusters based on the following rules:
• the polarities of two pairs are the same if the adjectives are synonyms and the nouns are the same, or the nouns are synonyms and the adjectives are the same;
• the polarities of two pairs are opposite if the adjectives are antonyms and the nouns are the same, or the nouns have opposite value polarities and the adjectives are the same.
It should be noted that each cluster contains two sets of pairs with opposite polarities. We manually assign polarity labels to the largest clusters and those with the highest total frequencies. One potential issue is that the synonym and antonym resources may introduce conflicts, i.e., two pairs in the same cluster may be assigned opposite polarities. In practice, there are only a few such cases; we separate them into individual clusters and label these pairs manually. After that, we obtain the adjective-noun phrasing knowledge organized in two clusters, each of which is a list of adjective-noun pairs with the same polarity.
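One way to implement the grouping rules is a union-find structure that tracks a polarity parity bit per pair: "same" constraints come from synonym relations, "opposite" constraints from antonyms and opposite-polarity noun pairs, and conflicting constraints (the rare cases separated out above) are detected when a union fails. This is a sketch under our own design choices, not necessarily the paper's implementation:

```python
# Union-find with parity: each adjective-noun pair carries a bit giving its
# polarity relative to its cluster root, so "same" and "opposite" constraints
# propagate transitively and contradictions are caught.

class PolarityClusters:
    def __init__(self):
        self.parent = {}
        self.parity = {}  # 0 = same polarity as root, 1 = opposite

    def find(self, x):
        """Return (root, parity of x relative to root), with path compression."""
        if x not in self.parent:
            self.parent[x], self.parity[x] = x, 0
        if self.parent[x] != x:
            root, p = self.find(self.parent[x])
            self.parent[x] = root
            self.parity[x] = (self.parity[x] + p) % 2
        return self.parent[x], self.parity[x]

    def union(self, a, b, opposite):
        """Link pairs a and b; opposite=True means opposite polarity.
        Returns False if the constraint conflicts with earlier ones."""
        ra, pa = self.find(a)
        rb, pb = self.find(b)
        want = 1 if opposite else 0
        if ra == rb:
            return (pa + pb) % 2 == want  # False signals a conflict
        self.parent[rb] = ra
        self.parity[rb] = (pa + pb + want) % 2
        return True
```

With this structure, knowing that <age, old> is positive and that old/young are antonyms immediately yields a negative polarity for <age, young>.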

Phrasing Knowledge As Features
We incorporate the adjective-noun phrasing knowledge by formulating it as features. This approach is very generic, because the knowledge feature can easily be combined with the input or hidden output of existing neural models.

Baseline Models
We use Spider (Yu et al., 2018c) as the dataset. We do not use other datasets like WikiSQL (Zhong et al., 2017) because the WikiSQL queries that contain comparison relations are very simple and do not require additional knowledge for understanding. We introduce the knowledge feature to two baseline models, namely syntaxSQLNet 5 (Yu et al., 2018b) and SQLNet 6 (Xu et al., 2017; Yu et al., 2018c). We do not use other syntax tree-based or sequence-based baselines like coarse-to-fine (Dong and Lapata, 2018) because they cannot handle the Spider dataset.
Currently 7 , syntaxSQLNet achieves the best performance on Spider. SyntaxSQLNet and SQLNet solve the text-to-SQL task using a sequence-to-set structure. They decompose the SQL generation procedure into multiple individual modules. The comparison relation prediction consists of two parts: the comparison operator prediction in the OP module of the WHERE and HAVING components, and the ordering prediction in the DESC/ASC/LIMIT (DAL) module of the ORDER BY component. Both modules are classifiers defined as:

y = F(V · tanh(W · X))

where X represents the input features; y is the output label; and W and V are trainable parameters. The function F is defined as sigmoid in SQLNet.

5 https://github.com/taoyds/syntaxSQL
6 https://github.com/taoyds/spider/tree/master/baselines/sqlnet
7 As of mid May when we prepared this submission.

Knowledge as Features
The adjective-noun phrasing knowledge is formulated as additional input features to existing models. Let X_K denote the knowledge feature; we rewrite the classifier for the OP and DAL modules as:

y = F(V · tanh(W · [X : X_K]))

where [:] denotes the concatenation of feature vectors. The intuition is that even if X is unseen, the knowledge feature X_K will help to make the correct prediction. We give an example to show how to construct the knowledge feature.
As the adjective-noun phrasing knowledge consists of two clusters, we use two fixed random vectors to represent the two clusters, denoted as polarity features. We define the knowledge feature as the attention-weighted polarity features with knowledge masks. In Figure 1, the upper part shows the attention matrix between column names and question tokens. In syntaxSQLNet and SQLNet, the attention between column names and question tokens is defined by:

A = softmax(H_CS · H_Q^T)

where H_CS and H_Q represent the LSTM hidden states for column names and question tokens, respectively. Suppose that we are predicting the ordering for column age using the DAL module. We first construct the positive and negative knowledge mask vectors, whose lengths are equal to the number of question tokens. In a mask vector, 1 means that the pair of the column name and the corresponding question token exists in the phrasing knowledge. Then, we calculate the weights of the positive and negative polarity features using the inner product (·) of the knowledge mask vectors and the attention vector. For example, in Figure 1, the weights of the positive and negative polarity features are 0.8 and 0, respectively. After that, we obtain the knowledge feature as the element-wise sum (⊕) of the weighted positive and negative polarity features, denoted by + weighted .
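Numerically, the + weighted feature for one column reduces to two inner products and an element-wise sum; a pure-Python sketch with illustrative dimensions and values:

```python
# The +weighted knowledge feature for one column: inner products of the
# attention row with the positive/negative mask vectors give scalar weights,
# and the feature is the element-wise sum of the weighted polarity vectors.
# All dimensions and example values below are illustrative.

def knowledge_feature(attn_row, pos_mask, neg_mask, pos_vec, neg_vec):
    """attn_row: attention weights over question tokens for this column;
    pos_vec / neg_vec: the two fixed random polarity feature vectors."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w_pos = dot(attn_row, pos_mask)  # e.g. 0.8 when attention sits on "oldest"
    w_neg = dot(attn_row, neg_mask)  # e.g. 0 when no negative pair matches
    # element-wise sum of the weighted positive and negative polarity features
    return [w_pos * p + w_neg * n for p, n in zip(pos_vec, neg_vec)]
```

For the Figure 1 example, with 0.8 attention mass on "oldest" and a positive mask selecting that token, the feature is simply 0.8 times the positive polarity vector.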
To obtain the mask vectors, we need to match the nouns in the phrasing pairs with the column names in the Spider dataset. For example, the column names birth_date, birthdate, birth date and date of birth should all be matched with the same noun "date of birth" in the phrasing knowledge. We heuristically define the matching score for column name n_1 and noun n_2 as:

s(n_1, n_2) = R_t(n_1, n_2) + R_c(n_1, n_2) + cos(e(n_1), e(n_2))

where R_t and R_c are the token- and char-level fuzzy matching ratios, which are calculated as the output of fuzzywuzzy 8 divided by 100; and the last term is the cosine similarity between the inverse document frequency (idf) weighted sums of token embeddings, where a "document" refers to one column name in the dataset. Given a column name in the OP or DAL module, we select the phrasing pairs whose noun matching scores rank in the top 10 as the knowledge to obtain the knowledge mask vectors.
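A simplified sketch of this matching score, with difflib standing in for fuzzywuzzy's ratios and the idf-weighted embedding cosine passed in as a precomputed scalar; combining the three terms by plain summation is our assumption:

```python
# Simplified sketch of the column-name/noun matching score. difflib's
# SequenceMatcher stands in for fuzzywuzzy's char- and token-level ratios,
# and the embedding cosine term is stubbed out as an argument.
from difflib import SequenceMatcher

def _tokens(name):
    return name.lower().replace("_", " ").split()

def char_ratio(a, b):
    # char-level fuzzy ratio in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_ratio(a, b):
    # token-level ratio over sorted tokens, akin to a token_sort_ratio
    return SequenceMatcher(None, " ".join(sorted(_tokens(a))),
                           " ".join(sorted(_tokens(b)))).ratio()

def match_score(col_name, noun, cos_sim=0.0):
    return token_ratio(col_name, noun) + char_ratio(col_name, noun) + cos_sim
```

Under this score, birth_date matches "date of birth" far better than an unrelated column such as name, which is what the top-10 selection relies on.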
Another straightforward option is to use a polarity feature directly as the knowledge feature, denoted as + direct , where the polarity is determined by the attention-weighted knowledge mask score. In the example above, the positive polarity feature is taken as the knowledge feature, since the positive mask has a higher attention-weighted score (0.8 > 0). We evaluate both the + weighted and + direct knowledge features in the experiments.

Data & Settings
For quality assurance, we manually choose 91 adjectives that have value polarities from the top 200 most frequent adjectives in Wikipedia. Adjectives with no value polarity, like "important", "happy" and "reasonable", are filtered out. We then cluster the adjective-noun pairs using the adjective synonyms crawled from Bing Dictionary 9 and the noun synonyms and adjective antonyms from WordNet (Miller, 1995). We obtain 4,133 distinct phrasing pairs in 2,689 clusters. After that, we manually label the clusters whose total pair frequency or total number of pairs ranks in the top 50. Finally, we obtain 970 distinct adjective-noun pairs, which cover about 79.65% of the usages among all pairs. We manually evaluate the value polarities of all 970 pairs; the accuracy is 87.8%. The pairs after manual revision serve as the adjective-noun phrasing knowledge.
For the Spider dataset, we remove the nested SQL queries and the queries containing JOIN, because the performance of the baseline models on complex queries is rather low, and we are more interested in whether the comparison relation is correctly predicted given the right column. There are 4,490 questions after removing the complex queries, with 3,946 for training and 544 for development. We manually review all 4,490 questions and the corresponding tables, and find that 1,212 questions contain comparison relations (excluding =, ≠), among which 668 questions require adjective phrasing knowledge. The total number of required distinct adjective-noun pairs is 728, which we consider as the ground-truth knowledge (denoted as gt). After matching the mined adjective phrasing knowledge (970 pairs) with the Spider column names using Equation 4, we obtain 631 distinct pairs, of which 598 exist in the ground-truth knowledge; the precision of knowledge matching using Equation 4 is thus 94.8% (598/631), and the recall is 82.1% (598/728).
To further evaluate the effectiveness of the phrasing knowledge, we re-split the training and development data of Spider to ensure that the knowledge applied in the development set has not been seen or applied in the training set. The sizes of the re-split training and development sets are 3,906 and 584, respectively. For model training, the dimension of the knowledge feature is 50, with all other parameters remaining the same as in the baseline models.

Phrasing Knowledge Examples
In the value polarity mining step (described in Section 2.2), we obtain 758 noun pairs with opposite value polarities. Table 3 shows some examples of the adjective-noun phrasing knowledge. It can be seen that the value polarities are opposite for age and birth date, and for score and rank. The results show that both the + weighted and the + direct knowledge features are effective: they significantly improve the performance of the components that involve comparison relations. The results also show that the overall performance is improved, which demonstrates that the adjective-noun phrasing knowledge is effective and generic for existing state-of-the-art models.

Conclusion & Future Work
In this paper, we introduce adjective-noun phrasing knowledge as a feature to improve comparison relation prediction on unseen data. Experimental results show that our approach achieves promising performance. For future work, we will further improve the quality of the phrasing knowledge and incorporate other concept-level knowledge in text-to-SQL.