MICRON: Multigranular Interaction for Contextualizing RepresentatiON in Non-factoid Question Answering

This paper studies the problem of non-factoid question answering, where the answer may span multiple sentences. Existing solutions can be categorized into representation- and interaction-focused approaches. We combine their complementary strengths in a hybrid approach that allows multi-granular interactions but represents them at the word level, enabling easy integration with strong word-level signals. Specifically, we propose MICRON: Multigranular Interaction for Contextualizing RepresentatiON, a novel approach which derives contextualized uni-gram representations from n-grams. Our contributions are as follows. First, we enable multi-granular matches between question and answer n-grams. Second, by contextualizing word representations with surrounding n-grams, MICRON can naturally utilize word-based signals for query term weighting, known to be effective in information retrieval. We validate MICRON on two public non-factoid question answering datasets, WikiPassageQA and InsuranceQA, showing that our model achieves the state of the art among baselines with reported performances on both datasets.


Introduction
Non-factoid questions, unlike factoid questions that are answered by short facts such as a word or a phrase, may require a long answer spanning multiple sentences. Following the definition in (Guo et al., 2019), neural approaches for this task can be roughly categorized into representation- and interaction-focused approaches.
First, representation-focused approaches (Rücklé and Gurevych, 2017; Shao et al., 2019) encode query and answer into vectors of the same size, and match the two by computing vector similarity. Models in this category have the advantage of efficiency, as representations can be pre-computed and indexed for efficient retrieval. However, structural information, such as a question word matching a word in the answer, is missing in this representation. In addition, structural information can be diluted when squeezing a long text into a single vector. This weakness is often complemented by auxiliary information such as attention (Wang and Jiang, 2016). Figure 1a illustrates a representative architecture in this category (Shao et al., 2019).
* The authors contribute equally to this paper.
Second, interaction-focused approaches aim to preserve the structural information above. A naive form of structural information is a matrix storing pairwise word interactions, or 1:1 matches. However, due to the typical length difference between a question and a long answer in our problem setting, most answer words are left unmatched, except for a few uni-grams in the answer. Later work relaxes the 1:1 constraint to 1:N and M:N, by allowing a match to an n-gram (1:N) or a match between query and answer bi-grams (2:2). A state-of-the-art model in this category (Rücklé et al., 2019), shown in Figure 1b, uses a bi-gram Convolutional Neural Network (CNN) to represent query/answer bi-grams and their interactions. A similar architecture was generalized to N:N or N:M matches (Song et al., 2019; Chen et al., 2018), which may introduce a new challenge of multigranular interaction that we discuss later.
Our work combines the strengths of the two, as shown in Figure 1c. We illustrate our technical contributions using the following running example.
Example 1. Consider matching a question, "Who is in charge of this education process", with a matching passage on "the institution of higher learning". The interaction between the query bi-gram "education process" and the 5-gram "the institution of higher learning" is a key indicator explaining this match. In addition, external word-level importance signals, such as Inverse Document Frequency (IDF), are observed to be simple yet most powerful (Guo et al., 2016) in matching a short query (or question) with a long document, as in information retrieval or non-factoid question answering scenarios. For our example question with eight words, the IDF weight is highest for education, which appears rarely in other questions, while it is lower for common words.
Below are our two key contributions, inspired by the above running example.
1) Multigranular interaction: Figure 1c shows a dotted area, where interactions between m- and n-grams are represented. This enables matching between n-grams of different sizes: $\eta_{25}$ captures the interaction between the bi-gram "education process" and the 5-gram "the institution of higher learning" in our running example. However, existing multigranular interaction (Chen et al., 2018) cannot incorporate word-level signals, such as the high IDF score of the word "education".
2) N-gram contextualized word representation: Our next step is to combine this matching signal into a contextualized word representation. For example, we represent the word higher as an aggregation of the consecutive 5-grams it participates in, where "... the institution of higher" and "the institution of higher ..." disambiguate that the term should not be matched to a question on "high school". Similarly, the question word education is represented by its surrounding 2-grams: "education process" and "of education". Contextualizing into a word-level representation makes it natural to combine with word-level IDF scores in the model, and also enables indexing (Hwang and Chang, 2005). This shares the spirit of contextualized embeddings, such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018), but is specialized for short-distance phrase context localized within the question and passage.
We summarize the main contributions of this paper as follows. First, we utilize multigranular interaction to extract important information from question/passage matching by proposing MICRON: Multigranular Interaction for Contextualizing RepresentatiON. Second, we leverage strong word-level signals, which we discuss later.
We evaluate our method on two public non-factoid QA datasets: WikiPassageQA (Cohen et al., 2018) and InsuranceQA (Feng et al., 2015). The results show that our model achieves the state of the art among baselines with reported performances on both datasets. Our source code is freely available at https://github.com/stovecat/MICRON for further study.

Our approach
In this section, we introduce our method in detail. MICRON mainly consists of three modules: an encoding module, a matching module, and a scoring module. We use a Siamese architecture for the encoding module, which is a common setting in our target problem (Rücklé and Gurevych, 2017; Shao et al., 2019; Rücklé et al., 2019).

Encoding Module
For a word vector sequence $W \in \mathbb{R}^{|W| \times d}$ with dimensionality $d$, we encode it with an n-gram CNN to obtain $\Gamma_n(W)$, where $n$ is the window size of the n-gram CNN. Each $\Gamma_n(W) \in \mathbb{R}^{|W| \times d}$ represents n-gram semantics. As a distinction from other interaction-focused approaches, we introduce an additional Contextualization Layer $\Phi$, which returns a word representation contextualized by the surrounding n-gram phrases that the word belongs to. In our work, we define $\Phi$ as the arithmetic mean of these n-gram representations¹: each row of $\Phi_n(W) \in \mathbb{R}^{|W| \times d}$ is the contextualized n-gram representation of the corresponding word, computed by averaging the row vectors $[\Gamma_n(W)]_k$ (the $k$-th row of $\Gamma_n(W)$) of the n-grams containing that word.
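
The contextualization step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that row k of the n-gram matrix encodes the n-gram starting at word k, so that word i is covered by rows i-n+1 through i, and it skips windows that would start before the sequence. The function name `contextualize` and the toy matrix `gamma` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def contextualize(gamma: Matrix, n: int) -> Matrix:
    """Average, for each word, the n-gram vectors whose window covers it.

    Assumption (not from the paper): row k of `gamma` encodes the n-gram
    that starts at word k, so word i is covered by rows i-n+1 .. i.
    Windows that would start before the sequence are skipped.
    """
    num_words = len(gamma)
    dim = len(gamma[0]) if gamma else 0
    phi: Matrix = []
    for i in range(num_words):
        start = max(0, i - n + 1)
        window = gamma[start:i + 1]  # n-gram rows that contain word i
        phi.append([sum(row[j] for row in window) / len(window)
                    for j in range(dim)])
    return phi


# Toy bi-gram matrix with |W| = 3 words and d = 2 dimensions.
gamma = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
phi = contextualize(gamma, n=2)
# phi == [[1.0, 0.0], [2.0, 1.0], [4.0, 3.0]]
```

With n = 1 each word is covered only by its own row, so the output equals the input; larger n blends in the neighbouring n-grams while keeping one row per word.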

Matching Module
The query and candidate answer, $Q \in \mathbb{R}^{|Q| \times d}$ and $P \in \mathbb{R}^{|P| \times d}$, are encoded into $\Phi_n(Q)$ and $\Psi_m(P)$. We build an interaction matrix $\eta_{nm}$ by computing the dot product between $\Phi_n(Q)$ and $\Psi_m(P)$:
$$\eta_{nm} = \Phi_n(Q)\,\Psi_m(P)^{\top}$$
The output matrix $\eta_{nm} \in \mathbb{R}^{|Q| \times |P|}$ contains the relevance scores of all pairs between n-grams in the query and m-grams in the answer. From $\eta_{nm}$, we conduct row-wise max-pooling to obtain $A_{nm}$, relaxing the length constraint in interactions (Rücklé et al., 2019).
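
As a rough illustration of the matching module, the sketch below builds the interaction matrix from two already-contextualized matrices and applies row-wise max-pooling. The helper names and the toy inputs are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def interaction_matrix(phi_q: Matrix, psi_p: Matrix) -> Matrix:
    """eta[i][j] is the dot product of question row i and passage row j."""
    return [[sum(a * b for a, b in zip(q_row, p_row)) for p_row in psi_p]
            for q_row in phi_q]


def row_max_pool(eta: Matrix) -> List[float]:
    """Keep the best passage match for each question word."""
    return [max(row) for row in eta]


phi_q = [[1.0, 0.0], [0.0, 2.0]]               # |Q| = 2, d = 2
psi_p = [[1.0, 1.0], [0.0, 3.0], [2.0, 0.0]]   # |P| = 3, d = 2
eta = interaction_matrix(phi_q, psi_p)
a_nm = row_max_pool(eta)
# eta  == [[1.0, 0.0, 2.0], [2.0, 6.0, 0.0]]
# a_nm == [2.0, 6.0]
```

Because the pooled vector has one entry per question word regardless of passage length, it can later be combined element-wise with any per-word weight.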

Scoring Module
We then aggregate the best matching scores $A_{nm}$ across all combinations of question n-grams and answer m-grams with $n, m \in F = \{1, 2, 3, 5\}$, following the convention of (Shao et al., 2019), yielding the cumulative score for each question word, $\gamma \in \mathbb{R}^{|Q|}$:
$$\gamma = \sum_{n \in F} \sum_{m \in F} A_{nm}$$
Finally, we obtain the relevance score $\Omega$ from the vector $\gamma$. Note that we could adopt any effective word-based signal $\tau \in \mathbb{R}^{|Q|}$ known a priori. By applying the dot product between $\gamma$ and $\tau$, we can contrast matching scores by word importance. Specifically,
$$\Omega = \gamma \cdot \tau$$
A widely adopted example of $\tau$ is IDF, computed either globally (treating all passages in the dataset as a corpus) or locally (treating only the candidate passages of a given question as a corpus) (Blair-Goldensohn et al., 2003). Note that effective word-level signals may depend on the characteristics of the dataset. We will further show empirically which measure is more effective for each dataset and explain why in a later section.

Loss function
Our model is trained with the loss function studied in (Cohen and Croft, 2016), which is defined in terms of $\mathrm{BCE}_q$, the standard binary cross entropy for the question $q$; $\mu^q_r$, the mean score of all relevant answers; and $\max^q_{nr}$, the max score of all irrelevant answers for $q$.

¹ We omit $\Psi$ for simplicity. $\Phi$ and $\Psi$ are the same in our architecture.
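
The scoring step and the global/local IDF distinction can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: IDF is taken as log(N / df), one common variant; question words absent from the corpus receive weight 0.0; and only two (n, m) pairs are pooled for brevity instead of all pairs from F. All function and variable names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def idf_table(passages: Sequence[Sequence[str]]) -> Dict[str, float]:
    """IDF over `passages` as log(N / df); the exact variant is an assumption."""
    n_docs = len(passages)
    df: Dict[str, int] = {}
    for tokens in passages:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


def relevance_score(pooled: Dict[Tuple[int, int], List[float]],
                    tau: List[float]) -> float:
    """Sum the pooled scores over all (n, m) pairs, then weight by tau."""
    gamma = [sum(scores) for scores in zip(*pooled.values())]
    return sum(g * t for g, t in zip(gamma, tau))


question = ["who", "runs", "education"]
# Local IDF: only the candidate passages of this question form the corpus.
# Passing every passage in the dataset instead would give global IDF.
candidates = [["education", "board", "who"], ["who", "is", "he"]]
idf = idf_table(candidates)
tau = [idf.get(word, 0.0) for word in question]  # unseen words get 0.0

# Max-pooled scores per question word, keyed by (n, m).
pooled = {(1, 1): [0.5, 0.2, 1.0], (1, 2): [0.5, 0.1, 2.0]}
score = relevance_score(pooled, tau)
```

In this toy case "who" occurs in every candidate and so gets zero weight, and the score is driven entirely by the matches on "education".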

Dataset
We evaluate MICRON on two non-factoid question answering datasets: 1) WikiPassageQA (Cohen et al., 2018) is a recent Wikipedia-based collection. There is high contextual similarity between answers and non-answers, since all candidate answers are from the same document. 2) InsuranceQA (Feng et al., 2015) is another well-known large-scale non-factoid QA dataset from the insurance domain, constructed by putting the ground-truth answers into the pool and randomly sampling negative answers. Table 2 shows the statistics of the two datasets.
As the state of the art on one dataset is not likely to be that on another, we focus on baselines that are either open source or have reported results on both datasets. We implement two interaction-focused and one representation-focused baselines: N-gram CNN builds N:N matching matrices, with the same values of N as our method for fair comparison. Unigram CNN uses 1:1 word matching, and is able to utilize word-based signals as query term weights.

Implementation Details
For word embeddings, we use 300d pre-trained GloVe (Pennington et al., 2014). The passage sequence length differs by dataset: 400 tokens for WikiPassageQA and 200 tokens for InsuranceQA. Dropout is applied after every layer with a keep rate of 0.7. All weights except the embedding matrices are constrained by L2 regularization with constant values of $10^{-7}$ and $10^{-5}$ for WikiPassageQA and InsuranceQA, respectively. We use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of $10^{-6}$ and $10^{-4}$ for each dataset. The learning parameters were chosen by the best performance on the dev set.

Results
Table 1 shows the results on the WikiPassageQA and InsuranceQA datasets. We observe that our proposed approach, MICRON, significantly outperforms both representation-focused and interaction-focused baselines in various evaluation metrics, achieving the best performance on both datasets. Our findings can be summarized as follows.
First, we demonstrate the effectiveness of multigranular interaction. Compared to N-gram CNN, MICRON allows matching between n-grams of different sizes (e.g., 2:3, 3:5) and improves on both datasets, by 4.0 percentage points in accuracy and 2.57 percentage points in MAP, respectively.
Second, we relax the length constraint on n-grams from COALA and achieve a relative gain on the InsuranceQA dataset. However, this gain is marginal when phrases are short, as in the WikiPassageQA dataset, considering the better performance of COALA over N-gram CNN.
Third, word-based signals may help considerably on WikiPassageQA, where both global and local IDF scores of words vary significantly (high variance). This variance is especially high for local IDF, which serves as a strong signal, as consistently observed in (Blair-Goldensohn et al., 2003). In contrast, on InsuranceQA, the variance of word signals is low. Consequently, the use of IDF does not contribute to performance, or even contributes negatively.

Qualitative Examples
We illustrate several qualitative examples of MICRON in Figure 2. In Figure 2a, multigranular interaction (2:1) between the bi-gram "United States" and the uni-gram "USA" allows the match. Figure 2b shows a case where the contextualized representation lowers the matching score between "red ocean" and "ocean view". From Figure 2c, we can see that word-based signals can control the impact of each contextualized word score: amplifying the "Sweden"-"Sweden" match and reducing the "is"-"is" match.

Conclusion
In this paper, we study non-factoid question answering. Specifically, our approach is inspired by the complementary strengths of representation- and interaction-focused approaches. We combine the strengths of the two by allowing multigranular interactions that are represented on a per-word basis, contextualized by participating n-grams. For this purpose, we propose MICRON, which matches flexible n-grams and combines them with word-based query term weighting, achieving the state of the art among baselines with reported performances on both datasets.