Difference between revisions of "Question Answering (State of the art)"

From ACL Wiki
(Highlighted different SOTA methods for the MAP and MRR metrics)
 
(33 intermediate revisions by 11 users not shown)
Line 5: Line 5:
  * [http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz QA Answer Sentence Selection Dataset]: labeled sentences using TREC QA track data, provided by [http://cs.stanford.edu/people/mengqiu/ Mengqiu Wang] and first used in [http://www.aclweb.org/anthology/D/D07/D07-1003.pdf Wang et al. (2007)].
  * Over time, the original dataset diverged to two versions due to different pre-processing in recent publications: both have the same training set but their development and test sets differ. The Raw version has 82 questions in the development set and 100 questions in the test set; The Clean version (Wang and Ittycheriah et al. 2015, Tan et al. 2015, dos Santos et al. 2016, Wang et al. 2016) removed questions with no answers or with only positive/negative answers, thus has only 65 questions in the development set and 68 questions in the test set.
- * Note: MAP/MRR scores on the two versions of TREC QA data (Clean vs Raw) are not comparable according to [http://www.cs.umd.edu/~jinfeng/publications/PairwiseNeuralNetwork_CIKM2016.pdf Rao et al. (2016)].
+ * Note: MAP/MRR scores on the two versions of TREC QA data (Clean vs Raw) are not comparable according to [https://dl.acm.org/authorize.cfm?key=N27026 Rao et al. (2016)].
  
  
Line 79: Line 79:
  | 0.746
  | 0.808
+ |-
+ | Yang (2016) - Attention-Based Neural Matching Model
+ | Yang et al. (2016)
+ | 0.750
+ | 0.811
+ |-
+ | Tay (2017) - Holographic Dual LSTM Architecture
+ | Tay et al. (2017)
+ | 0.750
+ | 0.815
  |-
  | H&L (2016) - Pairwise Word Interaction Modelling
Line 89: Line 99:
  | 0.762
  | 0.830
+ |-
+ | Tay (2017) - HyperQA (Hyperbolic Embeddings)
+ | Tay et al. (2017)
+ | 0.770
+ | 0.825
  |-
  | Rao (2016) - PairwiseRank + Multi-Perspective CNN
Line 94: Line 109:
  | 0.780
  | 0.834
+ |-
+ | Rao (2019) - Hybrid Co-Attention Network (HCAN)
+ | Rao et al. (2019)
+ | 0.774
+ | 0.843
+ |-
+ | Tayyar Madabushi (2018) - Question Classification + PairwiseRank + Multi-Perspective CNN
+ | Tayyar Madabushi et al. (2018)
+ | 0.836
+ | 0.863
+ |-
+ | Kamath (2019) - Question Classification + RNN + Pre-Attention
+ | Kamath et al. (2019)
+ | 0.852
+ | 0.891
+ |-
+ | Laskar et al. (2020) - CETE (RoBERTa-Large)
+ | Laskar et al. (2020)
+ | '''0.950'''
+ | '''0.980'''
  |}
  
Line 119: Line 154:
  | 0.851
  |-
- | Wang et al.  (2016) - Lexical Decomposition and Composition
+ | Wang et al.  (2016) - L.D.C Model
  | Wang et al. (2016)
  | 0.771
Line 128: Line 163:
  | 0.777
  | 0.836
+ |-
+ | Tay et al. (2017) - HyperQA (Hyperbolic Embeddings)
+ | Tay et al. (2017)
+ | 0.784
+ | 0.865
  |-
  | Rao et al.  (2016) - PairwiseRank + Multi-Perspective CNN
Line 133: Line 173:
  | 0.801
  | 0.877
+ |-
+ | Wang et al.  (2017) - BiMPM
+ | Wang et al.  (2017)
+ | 0.802
+ | 0.875
+ |-
+ | Bian et al.  (2017) - Compare-Aggregate
+ | Bian et al.  (2017)
+ | 0.821
+ | 0.899
+ |-
+ | Shen et al.  (2017) - IWAN
+ | Shen et al.  (2017)
+ | 0.822
+ | 0.889
+ |-
+ | Tran et al. (2018) - IWAN + sCARNN
+ | Tran et al. (2018)
+ | 0.829
+ | 0.875
+ |-
+ | Tay et al. (2018) - Multi-Cast Attention Networks (MCAN)
+ | Tay et al. (2018)
+ | 0.838
+ | 0.904
+ |-
+ | Tayyar Madabushi (2018) - Question Classification + PairwiseRank + Multi-Perspective CNN
+ | Tayyar Madabushi et al. (2018)
+ | 0.865
+ | 0.904
+ |-
+ | Yoon et al. (2019) - Compare-Aggregate + LanguageModel + LatentClustering
+ | Yoon et al. (2019)
+ | 0.868
+ | 0.928
+ |-
+ | Lai et al. (2019) - BERT + GSAMN + Transfer Learning
+ | Lai et al. (2019)
+ | 0.914
+ | 0.957
+ |-
+ | Garg et al. (2019) - TANDA-RoBERTa (ASNQ, TREC-QA)
+ | Garg et al. (2019)
+ | '''0.943'''
+ | 0.974
+ |-
+ | Laskar et al. (2020) - CETE (RoBERTa-Large)
+ | Laskar et al. (2020)
+ | 0.936
+ | '''0.978'''
  |}
  
Line 152: Line 242:
  * Ming Tan, Cicero dos Santos, Bing Xiang & Bowen Zhou. 2015. [http://arxiv.org/abs/1511.04108 LSTM-Based Deep Learning Models for Nonfactoid Answer Selection]. In eprint arXiv:1511.04108.
  * Cicero dos Santos, Ming Tan, Bing Xiang & Bowen Zhou. 2016. [http://arxiv.org/abs/1602.03609 Attentive Pooling Networks]. In eprint arXiv:1602.03609.
- * Zhiguo Wang, Haitao Mi and Abraham Ittycheriah. 2016. [http://arxiv.org/pdf/1602.07019v1.pdf Sentence Similarity Learning by Lexical Decomposition and Composition]. In eprint arXiv:1602.07019.
+ * Zhiguo Wang, Haitao Mi and Abraham Ittycheriah. 2016. [http://arxiv.org/pdf/1602.07019v1.pdf Sentence Similarity Learning by Lexical Decomposition and Composition]. In Coling 2016.
  * Hua He, Kevin Gimpel and Jimmy Lin. 2015. [http://aclweb.org/anthology/D/D15/D15-1181.pdf Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks]. In EMNLP 2015.
  * Hua He and Jimmy Lin. 2016. [https://cs.uwaterloo.ca/~jimmylin/publications/He_etal_NAACL-HTL2016.pdf Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement]. In NAACL 2016.
- * Jinfeng Rao, Hua He and Jimmy Lin. 2016. [http://www.cs.umd.edu/~jinfeng/publications/PairwiseNeuralNetwork_CIKM2016.pdf Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks]. In CIKM 2016
+ * Liu Yang, Qingyao Ai, Jiafeng Guo, W. Bruce Croft. 2016. [http://maroo.cs.umass.edu/pub/web/getpdf.php?id=1240 aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model]. In CIKM 2016.
+ * Jinfeng Rao, Hua He and Jimmy Lin. 2016. [https://dl.acm.org/authorize.cfm?key=N27026 Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks]. In CIKM 2016.
+ * Yi Tay, Minh C. Phan, Luu Anh Tuan and Siu Cheung Hui. 2017 [https://arxiv.org/abs/1707.06372 Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture]. In SIGIR 2017.
+ * Yi Tay, Luu Anh Tuan, Siu Cheung Hui. 2017 [https://arxiv.org/pdf/1707.07847 Enabling Efficient Question Answer Retrieval via Hyperbolic Neural Networks]. In eprint arXiv: 1707.07847.
  [[Category:State of the art]]
+ * Zhiguo Wang, Wael Hamza and Radu Florian. 2017.  [https://arxiv.org/pdf/1702.03814.pdf Bilateral Multi-Perspective Matching for Natural Language Sentences]. In eprint arXiv:1702.03814.
+ * Weijie Bian, Si Li, Zhao Yang, Guang Chen, Zhiqing Lin. 2017. [https://dl.acm.org/citation.cfm?id=3133089&CFID=791659397&CFTOKEN=43388059 A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection]. In CIKM 2017.
+ * Gehui Shen, Yunlun Yang, Zhi-Hong Deng. 2017. [https://aclanthology.info/pdf/D/D17/D17-1122.pdf Inter-Weighted Alignment Network for Sentence Pair Modeling.]. In EMNLP 2017.
+ * Quan Hung Tran, Tuan Manh Lai, Gholamreza Haffari, Ingrid Zukerman, Trung Bui, Hung Bui, [http://www.aclweb.org/anthology/N18-1115 The Context-dependent Additive Recurrent Neural Net], In NAACL 2018
+ * Yi Tay, Luu Anh Tuan, Siu Cheung Hui, [https://arxiv.org/abs/1806.00778 Multi-Cast Attention Networks], In KDD 2018
+ * Harish Tayyar Madabushi, Mark Lee and John Barnden. [https://aclanthology.coli.uni-saarland.de/papers/C18-1278/c18-1278 Integrating Question Classification and Deep Learning for improved Answer Selection], In COLING 2018
+ * Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Kyomin Jung. 2019. [https://arxiv.org/abs/1905.12897 A Compare-Aggregate Model with Latent Clustering for Answer Selection]. In CIKM 2019.
+ * Sanjay Kamath, Brigitte Grau and Yue Ma. 2019. [https://hal.archives-ouvertes.fr/hal-02104488/ Predicting and Integrating Expected Answer Types into a Simple Recurrent Neural Network Model for Answer Sentence Selection]. In CICLING 2019
+ * Jinfeng Rao, Linqing Liu, Yi Tay, Wei Yang, Peng Shi, Jimmy Lin, [https://jinfengr.github.io/publications/Rao_etal_EMNLP2019.pdf Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling], In EMNLP 2019
+ * Tuan Lai, Quan Hung Tran, Trung Bui, Daisuke Kihara, [https://arxiv.org/pdf/1909.09696.pdf A Gated Self-attention Memory Network for Answer Selection], In EMNLP 2019
+ * Siddhant Garg, Thuy Vu, Alessandro Moschitti, [https://arxiv.org/abs/1911.04118 TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection], in AAAI 2020
+ * Md Tahmid Rahman Laskar, Jimmy Huang, Enamul  Hoque, [http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.676.pdf Contextualized Embeddings based Transformer Encoder for Sentence Similarity Modeling in Answer Selection Task], In LREC 2020

Latest revision as of 15:52, 13 July 2020

Answer Sentence Selection

The task of answer sentence selection is designed for the open-domain question answering setting. Given a question and a set of candidate sentences, the task is to choose the correct sentence that contains the exact answer and can sufficiently support the answer choice.

  • QA Answer Sentence Selection Dataset: labeled sentences using TREC QA track data, provided by Mengqiu Wang and first used in Wang et al. (2007).
  • Over time, the original dataset diverged into two versions because recent publications pre-processed it differently: both have the same training set, but their development and test sets differ. The Raw version has 82 questions in the development set and 100 questions in the test set. The Clean version (Wang and Ittycheriah 2015, Tan et al. 2015, dos Santos et al. 2016, Wang et al. 2016) removed questions with no answers or with only positive or only negative answers, and thus has only 65 questions in the development set and 68 questions in the test set.
  • Note: MAP/MRR scores on the two versions of TREC QA data (Clean vs Raw) are not comparable according to Rao et al. (2016).
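
The Clean-version filtering described above can be sketched roughly as follows. This is a minimal illustration, not the preprocessing script used by any of the cited papers; the data layout (a dict from question id to a list of 0/1 candidate labels) and the name `clean_filter` are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical layout: question id -> one 0/1 label per candidate sentence
# (1 = the sentence answers the question, 0 = it does not).
Labels = Dict[str, List[int]]


def clean_filter(questions: Labels) -> Labels:
    """Keep only questions that have at least one positive AND one negative candidate."""
    return {
        qid: labels
        for qid, labels in questions.items()
        if any(label == 1 for label in labels) and any(label == 0 for label in labels)
    }


raw = {
    "q1": [1, 0, 0],  # mixed labels: kept
    "q2": [0, 0],     # only negative candidates: dropped
    "q3": [1, 1],     # only positive candidates: dropped
    "q4": [],         # no candidates at all: dropped
}
clean = clean_filter(raw)
print(sorted(clean))  # ['q1']
```

Because the filter changes which questions are scored, the Raw and Clean test sets are effectively different benchmarks.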


{|
! Algorithm - Raw Version of TREC QA !! Reference !! MAP !! MRR
|-
| Punyakanok (2004) || Wang et al. (2007) || 0.419 || 0.494
|-
| Cui (2005) || Wang et al. (2007) || 0.427 || 0.526
|-
| Wang (2007) || Wang et al. (2007) || 0.603 || 0.685
|-
| H&S (2010) || Heilman and Smith (2010) || 0.609 || 0.692
|-
| W&M (2010) || Wang and Manning (2010) || 0.595 || 0.695
|-
| Yao (2013) || Yao et al. (2013) || 0.631 || 0.748
|-
| S&M (2013) || Severyn and Moschitti (2013) || 0.678 || 0.736
|-
| Shnarch (2013) - Backward || Shnarch (2013) || 0.686 || 0.754
|-
| Yih (2013) - LCLR || Yih et al. (2013) || 0.709 || 0.770
|-
| Yu (2014) - TRAIN-ALL bigram+count || Yu et al. (2014) || 0.711 || 0.785
|-
| W&N (2015) - Three-Layer BLSTM+BM25 || Wang and Nyberg (2015) || 0.713 || 0.791
|-
| Feng (2015) - Architecture-II || Tan et al. (2015) || 0.711 || 0.800
|-
| S&M (2015) || Severyn and Moschitti (2015) || 0.746 || 0.808
|-
| Yang (2016) - Attention-Based Neural Matching Model || Yang et al. (2016) || 0.750 || 0.811
|-
| Tay (2017) - Holographic Dual LSTM Architecture || Tay et al. (2017) || 0.750 || 0.815
|-
| H&L (2016) - Pairwise Word Interaction Modelling || He and Lin (2016) || 0.758 || 0.822
|-
| H&L (2015) - Multi-Perspective CNN || He et al. (2015) || 0.762 || 0.830
|-
| Tay (2017) - HyperQA (Hyperbolic Embeddings) || Tay et al. (2017) || 0.770 || 0.825
|-
| Rao (2016) - PairwiseRank + Multi-Perspective CNN || Rao et al. (2016) || 0.780 || 0.834
|-
| Rao (2019) - Hybrid Co-Attention Network (HCAN) || Rao et al. (2019) || 0.774 || 0.843
|-
| Tayyar Madabushi (2018) - Question Classification + PairwiseRank + Multi-Perspective CNN || Tayyar Madabushi et al. (2018) || 0.836 || 0.863
|-
| Kamath (2019) - Question Classification + RNN + Pre-Attention || Kamath et al. (2019) || 0.852 || 0.891
|-
| Laskar et al. (2020) - CETE (RoBERTa-Large) || Laskar et al. (2020) || '''0.950''' || '''0.980'''
|}


{|
! Algorithm - Clean Version of TREC QA !! Reference !! MAP !! MRR
|-
| W&I (2015) || Wang and Ittycheriah (2015) || 0.746 || 0.820
|-
| Tan (2015) - QA-LSTM/CNN+attention || Tan et al. (2015) || 0.728 || 0.832
|-
| dos Santos (2016) - Attentive Pooling CNN || dos Santos et al. (2016) || 0.753 || 0.851
|-
| Wang et al. (2016) - L.D.C Model || Wang et al. (2016) || 0.771 || 0.845
|-
| H&L (2015) - Multi-Perspective CNN || He et al. (2015) || 0.777 || 0.836
|-
| Tay et al. (2017) - HyperQA (Hyperbolic Embeddings) || Tay et al. (2017) || 0.784 || 0.865
|-
| Rao et al. (2016) - PairwiseRank + Multi-Perspective CNN || Rao et al. (2016) || 0.801 || 0.877
|-
| Wang et al. (2017) - BiMPM || Wang et al. (2017) || 0.802 || 0.875
|-
| Bian et al. (2017) - Compare-Aggregate || Bian et al. (2017) || 0.821 || 0.899
|-
| Shen et al. (2017) - IWAN || Shen et al. (2017) || 0.822 || 0.889
|-
| Tran et al. (2018) - IWAN + sCARNN || Tran et al. (2018) || 0.829 || 0.875
|-
| Tay et al. (2018) - Multi-Cast Attention Networks (MCAN) || Tay et al. (2018) || 0.838 || 0.904
|-
| Tayyar Madabushi (2018) - Question Classification + PairwiseRank + Multi-Perspective CNN || Tayyar Madabushi et al. (2018) || 0.865 || 0.904
|-
| Yoon et al. (2019) - Compare-Aggregate + LanguageModel + LatentClustering || Yoon et al. (2019) || 0.868 || 0.928
|-
| Lai et al. (2019) - BERT + GSAMN + Transfer Learning || Lai et al. (2019) || 0.914 || 0.957
|-
| Garg et al. (2019) - TANDA-RoBERTa (ASNQ, TREC-QA) || Garg et al. (2019) || '''0.943''' || 0.974
|-
| Laskar et al. (2020) - CETE (RoBERTa-Large) || Laskar et al. (2020) || 0.936 || '''0.978'''
|}
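
For readers unfamiliar with the MAP and MRR columns, the sketch below shows one common way to compute them from ranked 0/1 relevance labels. It is an illustration only: the function names are hypothetical, and it assumes a question with no correct candidate scores 0 for both metrics, which is not necessarily how the evaluation scripts behind the reported numbers treat such questions.

```python
from typing import List, Tuple


def average_precision(ranked_labels: List[int]) -> float:
    """Mean of precision@k over the ranks k that hold a correct answer."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    # Assumption: a question with no correct candidate scores 0.
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels: List[int]) -> float:
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def mean_metrics(rankings: List[List[int]]) -> Tuple[float, float]:
    """Return (MAP, MRR) averaged over all questions."""
    n = len(rankings)
    map_score = sum(average_precision(r) for r in rankings) / n
    mrr_score = sum(reciprocal_rank(r) for r in rankings) / n
    return map_score, mrr_score


# Each inner list holds the labels of one question's candidates,
# already sorted by model score (best first).
rankings = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
print(mean_metrics(rankings))      # roughly (0.528, 0.500)
print(mean_metrics(rankings[:2]))  # roughly (0.792, 0.750)
```

In this toy data, dropping the single question without a correct candidate raises both averages, which illustrates why scores on the Raw and Clean versions should not be compared directly.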

References