<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://www.aclweb.org/aclwiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Doboandris</id>
	<title>ACL Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://www.aclweb.org/aclwiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Doboandris"/>
	<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/Special:Contributions/Doboandris"/>
	<updated>2026-04-21T18:38:49Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.6</generator>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12976</id>
		<title>MEN Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12976"/>
		<updated>2020-09-06T08:05:27Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2014)&lt;br /&gt;
* 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 pairs in the test part of the dataset&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 50 subjects&lt;br /&gt;
* See also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the test part of the dataset (1000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.867&lt;br /&gt;
| 0.866&lt;br /&gt;
|-&lt;br /&gt;
| DC20-hybrid&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.869&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| Ch18&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Corpus-based, predictive&lt;br /&gt;
| 0.84&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.808&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.798&lt;br /&gt;
| 0.798&lt;br /&gt;
|-&lt;br /&gt;
| DC20-corpus&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.781&lt;br /&gt;
| 0.749&lt;br /&gt;
|-&lt;br /&gt;
| Br13&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.78&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019)&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.705&lt;br /&gt;
| 0.709&lt;br /&gt;
|}&lt;br /&gt;
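The Spearman (ρ) and Pearson (r) columns above compare a model's similarity scores for the word pairs against the mean human judgments. A minimal pure-Python sketch of both coefficients (assuming no tied values; the pair data and model scores below are hypothetical, for illustration only):

```python
# Sketch: Spearman's rho and Pearson's r for MEN-style evaluation.
# Real evaluations use the 1000 test pairs or all 3000 pairs.

def pearson(x, y):
    # Pearson product-moment correlation of two equal-length lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks
    # (this simple ranking assumes no tied values).
    def ranks(v):
        order = {value: i for i, value in enumerate(sorted(v))}
        return [order[value] for value in v]
    return pearson(ranks(x), ranks(y))

# Mean human judgments (averaged over 50 subjects) for four pairs,
# and hypothetical similarity scores from some model
human = [50.0, 48.0, 42.0, 18.0]
model = [0.92, 0.88, 0.71, 0.20]

rho = spearman(human, model)
r = pearson(human, model)
print(rho, r)
```

Here the hypothetical model ranks the pairs exactly as the humans do, so ρ is 1.0 even though r stays slightly below 1; this is why the two columns in the tables can order systems differently.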
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the full dataset (3000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| DC20-hybrid&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.865&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.846&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.861&lt;br /&gt;
| 0.859&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.809&lt;br /&gt;
| 0.803&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.802&lt;br /&gt;
| 0.801&lt;br /&gt;
|-&lt;br /&gt;
| DC20-corpus&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Dobó and Csirik (2020)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.771&lt;br /&gt;
| 0.746&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.702&lt;br /&gt;
| 0.707&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Bruni, E., Tran, N. K., and Baroni, M. (2014). [https://www.jair.org/index.php/jair/article/download/10857/25905/ Multimodal distributional semantics]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 49, pp. 1-47.&lt;br /&gt;
&lt;br /&gt;
Christopoulou, F., Briakou, E., Iosif, E., and Potamianos, A. (2018). [https://ieeexplore.ieee.org/abstract/document/8334459/ Mixture of topic-based distributional semantic and affective models]. &#039;&#039;IEEE ICSC 2018&#039;&#039;, pp. 203-210. IEEE.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019). [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the best parameter settings in the creation and comparison of feature vectors in distributional semantic models across multiple languages]. &#039;&#039;AIAI 2019: Artificial Intelligence Applications and Innovations&#039;&#039;, pp. 487-499.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2020). [https://doi.org/10.1080/09296174.2019.1570897 A comprehensive study of the parameters in the creation and comparison of feature vectors in distributional semantic models]. &#039;&#039;Journal of Quantitative Linguistics&#039;&#039;, 27(3), pp. 244-271.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]. GitHub repository.&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12975</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12975"/>
		<updated>2020-09-06T07:59:36Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.inf.u-szeged.hu/~dobo/ András Dobó, PhD]&lt;br /&gt;
&lt;br /&gt;
Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics,&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 [http://www.inf.u-szeged.hu/~dobo/ www.inf.u-szeged.hu/~dobo]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2019	[http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf PhD in Computer Science] (summa cum laude) - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August-Universität Göttingen, Germany&lt;br /&gt;
&lt;br /&gt;
2010	[https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Master of Science in Computer Science]&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2009	[http://diploma.bibl.u-szeged.hu/3370/ Bachelor of Science in Computer Program Designer]&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
Artificial Intelligence I. tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Formal Languages tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Databases tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
Introduction to Informatics tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
English:	Level C2, Cambridge ESOL (2010)&lt;br /&gt;
&lt;br /&gt;
German:		Level B2, Goethe Institut (2005)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Prizes and further information&lt;br /&gt;
&lt;br /&gt;
2013	Best Young Researcher Prize&lt;br /&gt;
	MSZNY 2013 - IX. Magyar Számítógépes Nyelvészeti Konferencia, Szeged&lt;br /&gt;
&lt;br /&gt;
2010	3rd prize&lt;br /&gt;
	Regional Scientific Conference of Students (TDK), Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2000	1st place&lt;br /&gt;
	Makkosházi Mathematics Competition (city level)&lt;br /&gt;
&lt;br /&gt;
Erdős number:	3  (András Dobó – János Csirik – Vilmos Totik – Pál Erdős)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Dobó, A., Csirik, J.: [https://doi.org/10.1080/09296174.2019.1570897 A Comprehensive Study of the Parameters in the Creation and Comparison of Feature Vectors in Distributional Semantic Models]. Journal of Quantitative Linguistics. 27(3), 244-271. (2020)&lt;br /&gt;
&lt;br /&gt;
2.	Dobó, A.: [http://journal.sepln.org/index.php/pln/article/viewFile/6205/3656 A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. Procesamiento del Lenguaje Natural. 64, 127-130. (2020)&lt;br /&gt;
&lt;br /&gt;
3.	Dobó, A.: [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. Ph.D. thesis, University of Szeged (2019) - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the Best Parameter Settings in the Creation and Comparison of Feature Vectors in Distributional Semantic Models Across Multiple Languages]. In: MacIntyre J., Maglogiannis I., Iliadis L., Pimenidis E. (eds) Artificial Intelligence Applications and Innovations. AIAI 2019. IFIP Advances in Information and Communication Technology, vol 559. 487-499. Springer, Cham. (2019)&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A.: [http://siba-ese.unisalento.it/index.php/ejasa/article/download/19296/17273 A measure of adjusted difference between values of a variable]. Electronic Journal of Applied Statistical Analysis. 12(1), 153-175. (2019)&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A.: [http://www.rcs.cic.ipn.mx/rcs/2018_147_6/Multi-D%20Kneser-Ney%20Smoothing%20Preserving%20the%20Original%20Marginal%20Distributions.pdf Multi-D Kneser-Ney Smoothing Preserving the Original Marginal Distributions]. Research in Computing Science. 147(6), 11-25. (2018)&lt;br /&gt;
&lt;br /&gt;
7.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V., Zsibrita, J.: [http://link.springer.com/chapter/10.1007/978-3-319-13817-6_32 Information Extraction from Hungarian, English and German CVs for a Career Portal]. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. 333-341. Springer International Publishing, Switzerland (2014)&lt;br /&gt;
&lt;br /&gt;
8.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V., Zsibrita, J.: [http://rgai.inf.u-szeged.hu/mszny2014/MSZNY2014_press_b5.pdf#page=369 Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz]. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. 359-360. University of Szeged, Szeged (2014)&lt;br /&gt;
&lt;br /&gt;
9.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Computing%20semantic%20similarity%20using%20large%20static%20corpora.pdf Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. 491-502. Springer-Verlag, Berlin Heidelberg (2013)&lt;br /&gt;
&lt;br /&gt;
10.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=221 Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása]. In: Tanács, A., Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 213-224. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
11.	Dobó, A., Pulman, S.G.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=43 Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével]. In: Tanács, A., Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 35-46. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
12.	Dobó, A., Pulman, S.G.: [http://journal.sepln.org/index.php/pln/article/viewFile/842/697 Interpreting noun compounds using paraphrases]. Procesamiento del Lenguaje Natural. 46, 59-66. (2011)&lt;br /&gt;
&lt;br /&gt;
13.	Dobó, A.: [http://www.inf.u-szeged.hu/sites/default/files/kutatas/konferenciak/tdk2010osz/Dobo_Andras.pdf Angol szavak szinonimáinak automatikus keresése]. TDK. University of Szeged (2010)&lt;br /&gt;
&lt;br /&gt;
14.	Dobó, A.: [https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Interpreting Noun Compounds]. University of Oxford (2010)&lt;br /&gt;
&lt;br /&gt;
15.	Dobó, A.: [http://diploma.bibl.u-szeged.hu/3370/ A Közelítő és szimbolikus számítások tárgy során MATLAB-ban írt zárthelyi dolgozatok automatizált javítása]. University of Szeged (2009)&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12974</id>
		<title>MEN Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12974"/>
		<updated>2020-09-06T07:51:37Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2014)&lt;br /&gt;
* 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 pairs in the test part of the dataset&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 50 subjects&lt;br /&gt;
* See also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the test part of the dataset (1000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.867&lt;br /&gt;
| 0.866&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.869&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| Ch18&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Corpus-based, predictive&lt;br /&gt;
| 0.84&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.808&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.798&lt;br /&gt;
| 0.798&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.781&lt;br /&gt;
| 0.749&lt;br /&gt;
|-&lt;br /&gt;
| Br13&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.78&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.705&lt;br /&gt;
| 0.709&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the full dataset (3000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.865&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.846&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.861&lt;br /&gt;
| 0.859&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.809&lt;br /&gt;
| 0.803&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.802&lt;br /&gt;
| 0.801&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.771&lt;br /&gt;
| 0.746&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.702&lt;br /&gt;
| 0.707&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Bruni, E., Tran, N. K., and Baroni, M. (2014). [https://www.jair.org/index.php/jair/article/download/10857/25905/ Multimodal distributional semantics]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 49, pp. 1-47.&lt;br /&gt;
&lt;br /&gt;
Christopoulou, F., Briakou, E., Iosif, E., and Potamianos, A. (2018). [https://ieeexplore.ieee.org/abstract/document/8334459/ Mixture of topic-based distributional semantic and affective models]. &#039;&#039;IEEE ICSC 2018&#039;&#039;, pp. 203-210. IEEE.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019a). [https://doi.org/10.1080/09296174.2019.1570897 A comprehensive study of the parameters in the creation and comparison of feature vectors in distributional semantic models]. &#039;&#039;Journal of Quantitative Linguistics&#039;&#039;, 27(3), pp. 244-271.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019b). [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the best parameter settings in the creation and comparison of feature vectors in distributional semantic models across multiple languages]. &#039;&#039;AIAI 2019: Artificial Intelligence Applications and Innovations&#039;&#039;, pp. 487-499.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]. GitHub repository.&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12973</id>
		<title>MEN Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12973"/>
		<updated>2020-09-06T07:50:14Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2014)&lt;br /&gt;
* 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 pairs in the test part of the dataset&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 50 subjects&lt;br /&gt;
* See also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the test part of the dataset (1000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.867&lt;br /&gt;
| 0.866&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.869&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| Ch18&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Corpus-based, predictive&lt;br /&gt;
| 0.84&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.808&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.798&lt;br /&gt;
| 0.798&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.781&lt;br /&gt;
| 0.749&lt;br /&gt;
|-&lt;br /&gt;
| Br13&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.78&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.705&lt;br /&gt;
| 0.709&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the full dataset (3000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.865&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.846&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.861&lt;br /&gt;
| 0.859&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.809&lt;br /&gt;
| 0.803&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.802&lt;br /&gt;
| 0.801&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.771&lt;br /&gt;
| 0.746&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.702&lt;br /&gt;
| 0.707&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Bruni, E., Tran, N. K., and Baroni, M. (2014). [https://www.jair.org/index.php/jair/article/download/10857/25905/ Multimodal distributional semantics]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 49, pp. 1-47.&lt;br /&gt;
&lt;br /&gt;
Christopoulou, F., Briakou, E., Iosif, E., and Potamianos, A. (2018). [https://ieeexplore.ieee.org/abstract/document/8334459/ Mixture of topic-based distributional semantic and affective models]. &#039;&#039;IEEE ICSC 2018&#039;&#039;, pp. 203-210. IEEE.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019a). [https://doi.org/10.1080/09296174.2019.1570897 A comprehensive study of the parameters in the creation and comparison of feature vectors in distributional semantic models]. &#039;&#039;Journal of Quantitative Linguistics&#039;&#039;, 27(3), pp. 244-271.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019b). [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the best parameter settings in the creation and comparison of feature vectors in distributional semantic models across multiple languages]. &#039;&#039;AIAI 2019: Artificial Intelligence Applications and Innovations&#039;&#039;, pp. 487-499.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12876</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12876"/>
		<updated>2020-05-16T23:10:35Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.inf.u-szeged.hu/~dobo/ András Dobó, PhD]&lt;br /&gt;
&lt;br /&gt;
Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics,&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 [http://www.inf.u-szeged.hu/~dobo/ www.inf.u-szeged.hu/~dobo]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2019	[http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf PhD in Computer Science] (summa cum laude) - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2010	[https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Master of Science in Computer Science]&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2009	[http://diploma.bibl.u-szeged.hu/3370/ Bachelor of Science in Computer Program Designer]&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
Artificial Intelligence I. tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Formal Languages tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Databases tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
Introduction to Informatics tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
English:	Level C2, Cambridge ESOL (2010)&lt;br /&gt;
&lt;br /&gt;
German:		Level B2, Goethe Institut (2005)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Prizes and further information&lt;br /&gt;
&lt;br /&gt;
2013	Best Young Researcher Prize&lt;br /&gt;
	MSZNY 2013 - IX. Magyar Számítógépes Nyelvészeti Konferencia, Szeged&lt;br /&gt;
&lt;br /&gt;
2010	III. prize&lt;br /&gt;
	Regional Scientific Conference of Students (TDK), Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2000	I. place&lt;br /&gt;
	Makkosházi Mathematics Competition (city level)&lt;br /&gt;
&lt;br /&gt;
Erdős number:	3  (András Dobó – János Csirik – Vilmos Totik – Pál Erdős)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Dobó, A.: [http://journal.sepln.org/index.php/pln/article/viewFile/6205/3656 A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. Procesamiento del Lenguaje Natural. 64, 127-130. (2020)&lt;br /&gt;
&lt;br /&gt;
2.	Dobó, A.: [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. Ph.D. thesis, University of Szeged (2019) - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
3.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the Best Parameter Settings in the Creation and Comparison of Feature Vectors in Distributional Semantic Models Across Multiple Languages]. In: MacIntyre J., Maglogiannis I., Iliadis L., Pimenidis E. (eds) Artificial Intelligence Applications and Innovations. AIAI 2019. IFIP Advances in Information and Communication Technology, vol 559. 487-499. Springer, Cham. (2019)&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A., Csirik, J.: [https://doi.org/10.1080/09296174.2019.1570897 A Comprehensive Study of the Parameters in the Creation and Comparison of Feature Vectors in Distributional Semantic Models]. Journal of Quantitative Linguistics (2019)&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A.: [http://siba-ese.unisalento.it/index.php/ejasa/article/download/19296/17273 A measure of adjusted difference between values of a variable]. Electronic Journal of Applied Statistical Analysis. 12(1), 153-175. (2019)&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A.: [http://www.rcs.cic.ipn.mx/rcs/2018_147_6/Multi-D%20Kneser-Ney%20Smoothing%20Preserving%20the%20Original%20Marginal%20Distributions.pdf Multi-D Kneser-Ney Smoothing Preserving the Original Marginal Distributions]. Research in Computing Science. 147(6), 11-25 (2018)&lt;br /&gt;
&lt;br /&gt;
7.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V., Zsibrita, J.: [http://link.springer.com/chapter/10.1007/978-3-319-13817-6_32 Information Extraction from Hungarian, English and German CVs for a Career Portal]. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. 333-341. Springer International Publishing, Switzerland (2014)&lt;br /&gt;
&lt;br /&gt;
8.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V., Zsibrita, J.: [http://rgai.inf.u-szeged.hu/mszny2014/MSZNY2014_press_b5.pdf#page=369 Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz]. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. 359-360. University of Szeged, Szeged (2014)&lt;br /&gt;
&lt;br /&gt;
9.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Computing%20semantic%20similarity%20using%20large%20static%20corpora.pdf Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. 491-502. Springer-Verlag, Berlin Heidelberg (2013)&lt;br /&gt;
&lt;br /&gt;
10.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=221 Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása]. In: Tanács, A., Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 213-224. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
11.	Dobó, A., Pulman, S.G.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=43 Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével]. In: Tanács, A., Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 35-46. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
12.	Dobó, A., Pulman, S.G.: [http://journal.sepln.org/index.php/pln/article/viewFile/842/697 Interpreting noun compounds using paraphrases]. Procesamiento del Lenguaje Natural. 46, 59-66. (2011)&lt;br /&gt;
&lt;br /&gt;
13.	Dobó, A.: [http://www.inf.u-szeged.hu/sites/default/files/kutatas/konferenciak/tdk2010osz/Dobo_Andras.pdf Angol szavak szinonimáinak automatikus keresése]. TDK. University of Szeged (2010)&lt;br /&gt;
&lt;br /&gt;
14.	Dobó, A.: [https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Interpreting Noun Compounds]. University of Oxford (2010)&lt;br /&gt;
&lt;br /&gt;
15.	Dobó, A.: [http://diploma.bibl.u-szeged.hu/3370/ A Közelítő és szimbolikus számítások tárgy során MATLAB-ban írt zárthelyi dolgozatok automatizált javítása]. University of Szeged (2009)&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12783</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12783"/>
		<updated>2020-02-04T00:59:18Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: Updated information&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.inf.u-szeged.hu/~dobo/ András Dobó, PhD]&lt;br /&gt;
&lt;br /&gt;
Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics,&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 [http://www.inf.u-szeged.hu/~dobo/ www.inf.u-szeged.hu/~dobo]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2019	[http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf PhD in Computer Science] (summa cum laude) - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2010	[https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Master of Science in Computer Science]&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2009	[http://diploma.bibl.u-szeged.hu/3370/ Bachelor of Science in Computer Program Designer]&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
Artificial Intelligence I. tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Formal Languages tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Databases tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
Introduction to Informatics tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
English:	Level C2, Cambridge ESOL (2010)&lt;br /&gt;
&lt;br /&gt;
German:		Level B2, Goethe Institut (2005)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Prizes and further information&lt;br /&gt;
&lt;br /&gt;
2013	Best Young Researcher Prize&lt;br /&gt;
	MSZNY 2013 - IX. Magyar Számítógépes Nyelvészeti Konferencia, Szeged&lt;br /&gt;
&lt;br /&gt;
2010	III. prize&lt;br /&gt;
	Regional Scientific Conference of Students (TDK), Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2000	I. place&lt;br /&gt;
	Makkosházi Mathematics Competition (city level)&lt;br /&gt;
&lt;br /&gt;
Erdős number:	3  (András Dobó – János Csirik – Vilmos Totik – Pál Erdős)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Dobó, A.: A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages. Procesamiento del Lenguaje Natural. 64. (2020) (in press)&lt;br /&gt;
&lt;br /&gt;
2.	Dobó, A.: [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. Ph.D. thesis, University of Szeged (2019) - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
3.	Dobó A., Csirik J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the Best Parameter Settings in the Creation and Comparison of Feature Vectors in Distributional Semantic Models Across Multiple Languages]. In: MacIntyre J., Maglogiannis I., Iliadis L., Pimenidis E. (eds) Artificial Intelligence Applications and Innovations. AIAI 2019. IFIP Advances in Information and Communication Technology, vol 559. 487-499. Springer, Cham. (2019)&lt;br /&gt;
&lt;br /&gt;
4.	Dobó A., Csirik J.: [https://doi.org/10.1080/09296174.2019.1570897 A Comprehensive Study of the Parameters in the Creation and Comparison of Feature Vectors in Distributional Semantic Models]. Journal of Quantitative Linguistics (2019)&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A.: [http://siba-ese.unisalento.it/index.php/ejasa/article/download/19296/17273 A measure of adjusted difference between values of a variable]. Electronic Journal of Applied Statistical Analysis. 12(1), 153-175. (2019)&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A.: [http://www.rcs.cic.ipn.mx/rcs/2018_147_6/Multi-D%20Kneser-Ney%20Smoothing%20Preserving%20the%20Original%20Marginal%20Distributions.pdf Multi-D Kneser-Ney Smoothing Preserving the Original Marginal Distributions]. Research in Computing Science. 147(6), 11-25 (2018)&lt;br /&gt;
&lt;br /&gt;
7.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://link.springer.com/chapter/10.1007/978-3-319-13817-6_32 Information Extraction from Hungarian, English and German CVs for a Career Portal]. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. 333-341. Springer International Publishing, Switzerland (2014)&lt;br /&gt;
&lt;br /&gt;
8.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://rgai.inf.u-szeged.hu/mszny2014/MSZNY2014_press_b5.pdf#page=369 Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz]. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. 359-360. University of Szeged, Szeged (2014)&lt;br /&gt;
&lt;br /&gt;
9.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Computing%20semantic%20similarity%20using%20large%20static%20corpora.pdf Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. 491-502. Springer-Verlag, Berlin Heidelberg (2013)&lt;br /&gt;
&lt;br /&gt;
10.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=221 Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 213-224. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
11.	Dobó, A., Pulman, S.G.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=43 Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 35-46. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
12.	Dobó, A., Pulman, S.G.: [http://journal.sepln.org/index.php/pln/article/viewFile/842/697 Interpreting noun compounds using paraphrases]. Procesamiento del Lenguaje Natural. 46, 59-66. (2011)&lt;br /&gt;
&lt;br /&gt;
13.	Dobó, A.: [http://www.inf.u-szeged.hu/sites/default/files/kutatas/konferenciak/tdk2010osz/Dobo_Andras.pdf Angol szavak szinonimáinak automatikus keresése]. TDK. University of Szeged (2010)&lt;br /&gt;
&lt;br /&gt;
14.	Dobó, A.: [https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Interpreting Noun Compounds]. University of Oxford (2010)&lt;br /&gt;
&lt;br /&gt;
15.	Dobó, A.: [http://diploma.bibl.u-szeged.hu/3370/ A Közelítő és szimbolikus számítások tárgy során MATLAB-ban írt zárthelyi dolgozatok automatizált javítása]. University of Szeged (2009)&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12679</id>
		<title>MEN Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12679"/>
		<updated>2019-09-16T00:36:16Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2014)&lt;br /&gt;
* 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 pairs in the test part of the dataset&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 50 subjects&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the test part of the dataset (1000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.867&lt;br /&gt;
| 0.866&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.869&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| Ch18&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Corpus-based, predictive&lt;br /&gt;
| 0.84&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.808&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.798&lt;br /&gt;
| 0.798&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.781&lt;br /&gt;
| 0.749&lt;br /&gt;
|-&lt;br /&gt;
| Br14&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Bruni et al. (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.78&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.705&lt;br /&gt;
| 0.709&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the full dataset (3000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.865&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.846&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.861&lt;br /&gt;
| 0.859&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.809&lt;br /&gt;
| 0.803&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.802&lt;br /&gt;
| 0.801&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.771&lt;br /&gt;
| 0.746&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.702&lt;br /&gt;
| 0.707&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Bruni, E., Tran, N. K., and Baroni, M. (2014). [https://www.jair.org/index.php/jair/article/download/10857/25905/ Multimodal distributional semantics]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 49, pp. 1-47.&lt;br /&gt;
&lt;br /&gt;
Christopoulou, F., Briakou, E., Iosif, E., and Potamianos, A. (2018). [https://ieeexplore.ieee.org/abstract/document/8334459/ Mixture of topic-based distributional semantic and affective models]. &#039;&#039;IEEE ICSC 2018&#039;&#039;, pp. 203-210. IEEE.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019a). [https://doi.org/10.1080/09296174.2019.1570897 A comprehensive study of the parameters in the creation and comparison of feature vectors in distributional semantic models]. &#039;&#039;Journal of Quantitative Linguistics&#039;&#039;, pp. 1-28.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019b). [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the best parameter settings in the creation and comparison of feature vectors in distributional semantic models across multiple languages]. &#039;&#039;AIAI 2019: Artificial Intelligence Applications and Innovations&#039;&#039;, pp. 487-499.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)&amp;diff=12678</id>
		<title>WordSimilarity-353 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)&amp;diff=12678"/>
		<updated>2019-09-16T00:35:50Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;
* [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ WordSimilarity-353 Test Collection]&lt;br /&gt;
* contains two sets of English word pairs along with human-assigned similarity judgements&lt;br /&gt;
* first set (set1) contains 153 word pairs along with their similarity scores assigned by 13 subjects&lt;br /&gt;
* second set (set2) contains 200 word pairs with similarity assessed by 16 subjects&lt;br /&gt;
* WordSimilarity-353 dataset is available [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ here]&lt;br /&gt;
* performance is measured by [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rank correlation coefficient]&lt;br /&gt;
* introduced by [http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf Finkelstein et al. (2002)]&lt;br /&gt;
* subsequently used by many other researchers&lt;br /&gt;
* [https://www.wikidata.org/wiki/Q31845205 Wikidata] and [https://tools.wmflabs.org/scholia/use/Q31845205 Scholia]&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of increasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! Spearman&#039;s rho&lt;br /&gt;
! Pearson&#039;s r&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.302&lt;br /&gt;
| 0.356&lt;br /&gt;
|-&lt;br /&gt;
| WNE&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.305&lt;br /&gt;
| 0.271&lt;br /&gt;
|-&lt;br /&gt;
| J&amp;amp;C&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.318&lt;br /&gt;
| 0.354&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.348&lt;br /&gt;
| 0.341&lt;br /&gt;
|-&lt;br /&gt;
| H&amp;amp;S&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.302&lt;br /&gt;
| 0.356&lt;br /&gt;
|-&lt;br /&gt;
| Lin&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.348&lt;br /&gt;
| 0.357&lt;br /&gt;
|-&lt;br /&gt;
| Resnik&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.353&lt;br /&gt;
| 0.365&lt;br /&gt;
|-&lt;br /&gt;
| ROGET&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.415&lt;br /&gt;
| 0.536&lt;br /&gt;
|-&lt;br /&gt;
| C&amp;amp;W&lt;br /&gt;
| Collobert and Weston (2008)&lt;br /&gt;
| Collobert and Weston (2008)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.5&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| WikiRelate&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| N/A&lt;br /&gt;
| 0.48&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.579&lt;br /&gt;
| 0.577&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.581&lt;br /&gt;
| 0.492&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.581&lt;br /&gt;
| 0.563&lt;br /&gt;
|-&lt;br /&gt;
| simVB+simWN&lt;br /&gt;
| Finkelstein et al. (2002)&lt;br /&gt;
| Finkelstein et al. (2002)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| N/A&lt;br /&gt;
| 0.55&lt;br /&gt;
|-&lt;br /&gt;
| SSA&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.622&lt;br /&gt;
| 0.629&lt;br /&gt;
|-&lt;br /&gt;
| HSMN+csmRNN&lt;br /&gt;
| Luong et al. (2013)&lt;br /&gt;
| Luong et al. (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.65&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.706&lt;br /&gt;
| 0.705&lt;br /&gt;
|-&lt;br /&gt;
| Multi-prototype&lt;br /&gt;
| Huang et al. (2012)&lt;br /&gt;
| Huang et al. (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.71&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Multi-lingual SSA&lt;br /&gt;
| Hassan et al. (2011)&lt;br /&gt;
| Hassan et al. (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.713&lt;br /&gt;
| 0.674&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.733&lt;br /&gt;
| 0.704&lt;br /&gt;
|-&lt;br /&gt;
| ESA&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.748&lt;br /&gt;
| 0.503&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.795&lt;br /&gt;
| 0.276&lt;br /&gt;
|-&lt;br /&gt;
| TSA&lt;br /&gt;
| Radinsky et al. (2011)&lt;br /&gt;
| Radinsky et al. (2011)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.80&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| CLEAR&lt;br /&gt;
| Halawi et al. (2012)&lt;br /&gt;
| Halawi et al. (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.81&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Y&amp;amp;Q&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| ConceptNet Numberbatch&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.828&lt;br /&gt;
| N/A&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in alphabetical order.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. (2002). [http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf Placing Search in Context: The Concept Revisited]. ACM Transactions on Information Systems, 20(1):116-131.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, Evgeniy, and Shaul Markovitch, [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007.&lt;br /&gt;
&lt;br /&gt;
Halawi, Guy, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. (2012). [http://gabrilovich.com/publications/papers/Halawi2012LSL.pdf Large-scale learning of word relatedness with constraints]. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1406-1414. ACM.&lt;br /&gt;
&lt;br /&gt;
Hassan, Samer, and Rada Mihalcea: [http://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/download/3616/3972/ Semantic Relatedness Using Salient Semantic Analysis]. AAAI 2011&lt;br /&gt;
&lt;br /&gt;
Hirst, Graeme and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 305–332, 1998.&lt;br /&gt;
&lt;br /&gt;
Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (ACL &#039;12), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 873-882.&lt;br /&gt;
&lt;br /&gt;
Islam, A., and Inkpen, D. 2006. [http://www.site.uottawa.ca/~mdislam/publications/LREC_06_242.pdf Second order co-occurrence PMI for determining the semantic similarity of words]. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006) 1033–1038.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M. 2003. [http://www.arxiv.org/pdf/1204.0140 Roget’s thesaurus as a Lexical Resource for Natural Language Processing]. Ph.D. Dissertation, Ottawa-Carleton Institute for Computer Science, School of Information Technology and Engineering, University of Ottawa.&lt;br /&gt;
&lt;br /&gt;
Jiang, Jay J. and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics (ROCLING X), Taiwan, pages 19–33, 1997.&lt;br /&gt;
&lt;br /&gt;
Landauer, T. K.; Laham, D.; Rehder, B.; and Schreiner, M. E. 1997. How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans.&lt;br /&gt;
&lt;br /&gt;
Leacock, Claudia and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 265–283, 1998.&lt;br /&gt;
&lt;br /&gt;
Lin, Dekang. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, pages 296–304, 1998.&lt;br /&gt;
&lt;br /&gt;
Luong, Minh-Thang, Richard Socher, and Christopher D. Manning. (2013). [http://nlp.stanford.edu/~lmthang/data/papers/conll13_morpho.pdf Better word representations with recursive neural networks for morphology]. CoNLL-2013: 104.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., D. Jurgens and R. Navigli. [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity]. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pp. 1341-1351.&lt;br /&gt;
&lt;br /&gt;
Radinsky, Kira, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. (2011). [http://gabrilovich.com/publications/papers/Radinsky2011WTS.pdf A word at a time: computing word relatedness using temporal semantic analysis]. In Proceedings of the 20th international conference on World wide web, pp. 337-346. ACM.&lt;br /&gt;
&lt;br /&gt;
Resnik, Philip. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448–453, Montreal, Canada, 1995.&lt;br /&gt;
&lt;br /&gt;
Salle A., Idiart M., and Villavicencio A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&lt;br /&gt;
&lt;br /&gt;
Speer, Rob, Joshua Chin and Catherine Havasi. (2017). [http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972 ConceptNet 5.5: An Open Multilingual Graph of General Knowledge]. Proceedings of The 31st AAAI Conference on Artificial Intelligence, San Francisco, CA.&lt;br /&gt;
&lt;br /&gt;
Strube, Michael and Simone Paolo Ponzetto. (2006). [http://www.aaai.org/Papers/AAAI/2006/AAAI06-223.pdf WikiRelate! Computing Semantic Relatedness Using Wikipedia]. Proceedings of The 21st National Conference on Artificial Intelligence (AAAI), Boston, MA.&lt;br /&gt;
&lt;br /&gt;
Yih, W. and Qazvinian, V. (2012). [http://aclweb.org/anthology/N/N12/N12-1077.pdf Measuring Word Relatedness Using Heterogeneous Vector Space Models]. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012).&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=SimLex-999_(State_of_the_art)&amp;diff=12677</id>
		<title>SimLex-999 (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=SimLex-999_(State_of_the_art)&amp;diff=12677"/>
		<updated>2019-09-16T00:34:47Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.cl.cam.ac.uk/~fh295/simlex.html SimLex-999] is designed as a cleaner benchmark of similarity (as opposed to relatedness). Word pairs were chosen to cover different ranges of similarity, with either high or low association, and subjects were instructed to distinguish similarity from relatedness and to rate pairs on the former only.&lt;br /&gt;
&lt;br /&gt;
See also: [[Similarity (State of the art)]], [[Similar-Associated-Both Test Collection (State of the art)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm !! Reference for algorithm !! Reference for reported results  !! Type !! Spearman&#039;s rho !! Pearson&#039;s r !! Notes&lt;br /&gt;
|-&lt;br /&gt;
| Re16&lt;br /&gt;
| Recski et al. (2016)&amp;lt;ref name=recski16&amp;gt;Recski, G., Iklódi, E., Pajkossy, K., &amp;amp; Kornai, A. (2016). [https://www.aclweb.org/anthology/W16-1622 Measuring semantic similarity of words using concept networks]. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 193-200.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Recski et al. (2016)&amp;lt;ref name=recski16/&amp;gt;&lt;br /&gt;
| Hybrid || 0.76 || -&lt;br /&gt;
|-&lt;br /&gt;
| SVR4&lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| Combined || 0.642 || 0.658&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19&amp;gt;Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Hybrid || 0.621 || 0.481&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&amp;lt;ref name=speer17&amp;gt;Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. AAAI-17, pp. 4444-4451.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Hybrid || 0.616 || 0.634&lt;br /&gt;
|-&lt;br /&gt;
| joint(SP+,skip-gram)&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus&amp;gt;Schwartz, R., Reichart, Roi, Rappoport, A. (2015). Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction, CoNLL 2015.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus/&amp;gt;&lt;br /&gt;
| Distributional || 0.56 || - || Trained on word2vec corpus, best results for pure distributional model.&lt;br /&gt;
|-&lt;br /&gt;
| UMBC&lt;br /&gt;
| Han et al. (2013)&amp;lt;ref&amp;gt;Han, L., Kashyap, A., Finin, T., Mayfield, J., Weese, J.: UMBC EBIQUITY-CORE: Semantic textual similarity systems. In: Proceedings of the Second Joint Conference on Lexical and Computational Semantics, vol. 1, pp. 44–52 (2013)&amp;lt;/ref&amp;gt; &lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| || 0.558 || 0.557 || without using POS information&lt;br /&gt;
|-&lt;br /&gt;
| SP+&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus/&amp;gt;&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus/&amp;gt;&lt;br /&gt;
| Distributional || 0.52 || -&lt;br /&gt;
|-&lt;br /&gt;
| RNNenc&lt;br /&gt;
| Hill et al. (2014b)&amp;lt;ref name=rnnenc&amp;gt;Hill, F., Cho, K., Jean, S., Devin, C., &amp;amp; Bengio, Y. (2014b). Not All Neural Embeddings are Born Equal, 1–5.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014b)&amp;lt;ref name=rnnenc/&amp;gt;&lt;br /&gt;
| Distributional, multilingual || 0.52 || -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&amp;lt;ref name=salle18&amp;gt;Salle A., Idiart M., and Villavicencio A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&amp;lt;/ref&amp;gt; &lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Distributional || 0.417 || 0.426&lt;br /&gt;
|-&lt;br /&gt;
| Word2vec&lt;br /&gt;
| Mikolov et al. (2013)&amp;lt;ref&amp;gt;Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of International Conference of Learning Representations, Scottsdale, Arizona, USA.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014a)&amp;lt;ref name=simlex&amp;gt;Hill, F., Reichart, R., &amp;amp; Korhonen, A. (2014a). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computation and Language.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Distributional || 0.414 || - || Trained on Wikipedia&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&amp;lt;ref name=pennington14&amp;gt;Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. EMNLP 2014, pp. 1532-1543.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Distributional || 0.406 || 0.433&lt;br /&gt;
|-&lt;br /&gt;
| Lesk&lt;br /&gt;
| &lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea&amp;gt;Banjade, R., Maharjan, N., Niraula, N., Rus, V., &amp;amp; Gautam, D. (2015). Lemon and Tea Are Not Similar: Measuring Word-to-Word Similarity by Combining Different Methods. Computational Linguistics and Intelligent Text Processing, 9041, 335–346. doi:10.1007/978-3-319-18111-0_25&amp;lt;/ref&amp;gt;&lt;br /&gt;
| || 0.404 || 0.347&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Distributional || 0.393 || 0.401&lt;br /&gt;
|-&lt;br /&gt;
| ESA&lt;br /&gt;
| &lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| || 0.271 || 0.145&lt;br /&gt;
|-&lt;br /&gt;
| Neural language model&lt;br /&gt;
| Collobert &amp;amp; Weston (2008)&amp;lt;ref&amp;gt;R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014a)&amp;lt;ref name=simlex/&amp;gt; &lt;br /&gt;
| Distributional || 0.268 || - || Trained on Wikipedia&lt;br /&gt;
|-&lt;br /&gt;
| Neural language model with global context&lt;br /&gt;
| Huang et al. (2012)&amp;lt;ref&amp;gt;Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 873–882. Association for Computational Linguistics.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014a)&amp;lt;ref name=simlex/&amp;gt; &lt;br /&gt;
| Distributional || 0.098 || - || Trained on Wikipedia&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=12676</id>
		<title>MC-28 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=12676"/>
		<updated>2019-09-16T00:34:01Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* state of the art on the Miller &amp;amp; Charles 28 (MC-28) dataset [Resnik, 1995]&lt;br /&gt;
* the 28 word pairs of the original Miller &amp;amp; Charles 30 (MC-30) dataset [Miller and Charles, 1991], itself a subset of the [[RG-65 Test Collection (State of the art)|Rubenstein &amp;amp; Goodenough (RG-65) dataset]]; two word pairs are generally omitted in semantic similarity evaluation because their words were missing from earlier versions of WordNet&lt;br /&gt;
* Similarity of each pair is scored according to a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number);&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 38 subjects [Miller and Charles, 1991].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] [with 95% confidence intervals]&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Human upper bound&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Human&lt;br /&gt;
| 0.934 [0.861, 0.969]&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.92 [0.833, 0.962]&lt;br /&gt;
|-&lt;br /&gt;
| Gloss Vector&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.91 [0.813, 0.957]&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.893 [0.780, 0.949]&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.892 [0.778, 0.949]&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.87 [0.736, 0.938]&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.856 [0.710, 0.931]&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.853 [0.704, 0.930]&lt;br /&gt;
|-&lt;br /&gt;
| KC&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.835 [0.671, 0.921]&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.832 [0.666, 0.919]&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.822 [0.648, 0.914]&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.82 [0.644, 0.913]&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81 [0.627, 0.908]&lt;br /&gt;
|-&lt;br /&gt;
| DC13&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.773 [0.562, 0.889]&lt;br /&gt;
|-&lt;br /&gt;
| GM&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.72 [0.475, 0.861]&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.70 [0.443, 0.850]&lt;br /&gt;
|-&lt;br /&gt;
| SH&lt;br /&gt;
| Sahami and Heilman (2006)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.618 [0.319, 0.805]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A. (2009). [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. In: &#039;&#039;10th Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 19–27.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, E., and Markovitch, S. (2007). [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], &#039;&#039;Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI)&#039;&#039;, Hyderabad, India.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Kulkarni, S., Caragea, D. (2009). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4693&amp;amp;rep=rep1&amp;amp;type=pdf Computation of the Semantic Relatedness between Words using Concept Clouds]. In: &#039;&#039;International Conference on Knowledge Discovery and Information Retrieval&#039;&#039;. INSTICC Press, Setubal. pp. 183–188.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://webdocs.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. In &#039;&#039;Proceedings of the 15th International Conference on Machine Learning&#039;&#039;, Madison, WI, pp. 296–304.&lt;br /&gt;
&lt;br /&gt;
Miller, G., and Charles, W. (1991). [http://www.tandfonline.com/doi/abs/10.1080/01690969108406936#.VjUdmjZRGUk Contextual correlates of semantic similarity]. &#039;&#039;Language and Cognitive Processes&#039;&#039;, 6(1), 1–28.&lt;br /&gt;
&lt;br /&gt;
Milne, D., Witten, I.H. (2008). [http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links], In &#039;&#039;Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy&#039;&#039;, AAAI Press, Chicago, USA pp. 25-30.&lt;br /&gt;
&lt;br /&gt;
Patwardhan, S., and Pedersen, T. (2006). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6642&amp;amp;rep=rep1&amp;amp;type=pdf#page=7 Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts]. In: &#039;&#039;11th Conference of the European Chapter of the Association for Computational Linguistics&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 1–8.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://arxiv.org/pdf/cmp-lg/9511007 Using information content to evaluate semantic similarity]. In &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence&#039;&#039;, Montreal, Canada, pages 448–453.&lt;br /&gt;
&lt;br /&gt;
Sahami, M., Heilman, T.D. (2006). [http://robotics.stanford.edu/users/sahami/papers-dir/www2006.pdf A web-based kernel function for measuring the similarity of short text snippets]. In: &#039;&#039;15th international conference on World Wide Web&#039;&#039;. ACM Press, New York. pp. 377–386.&lt;br /&gt;
&lt;br /&gt;
Salle A., Idiart M., and Villavicencio A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039; 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)&amp;diff=12675</id>
		<title>RG-65 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)&amp;diff=12675"/>
		<updated>2019-09-16T00:33:40Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* state of the art on the Rubenstein &amp;amp; Goodenough (RG-65) dataset&lt;br /&gt;
* 65 word pairs; &lt;br /&gt;
* Similarity of each pair is scored according to a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number);&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 51 subjects [Rubenstein and Goodenough, 1965].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar and Navigli (2015)&lt;br /&gt;
| Pilehvar and Navigli (2015)&lt;br /&gt;
| Knowledge-based (Wiktionary)&lt;br /&gt;
| 0.920&lt;br /&gt;
| 0.910&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.901&lt;br /&gt;
| 0.896&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.899&lt;br /&gt;
| 0.914&lt;br /&gt;
|-&lt;br /&gt;
| Y&amp;amp;Q&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.890&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| NASARI&lt;br /&gt;
| Camacho-Collados et al. (2015)&lt;br /&gt;
| Camacho-Collados et al. (2015)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.880&lt;br /&gt;
| 0.910&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Knowledge-based (WordNet)&lt;br /&gt;
| 0.868&lt;br /&gt;
| 0.810&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Hughes and Ramage (2007)&lt;br /&gt;
| Hughes and Ramage (2007)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.838&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| SSA&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.833&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.830&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| H&amp;amp;S&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.732&lt;br /&gt;
|-&lt;br /&gt;
| Roget&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.804&lt;br /&gt;
| 0.818&lt;br /&gt;
|-&lt;br /&gt;
| J&amp;amp;C&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.804&lt;br /&gt;
| 0.731&lt;br /&gt;
|-&lt;br /&gt;
| WNE&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.801&lt;br /&gt;
| 0.787&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.797&lt;br /&gt;
| 0.852&lt;br /&gt;
|-&lt;br /&gt;
| Lin&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.788&lt;br /&gt;
| 0.834&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.769&lt;br /&gt;
| 0.770&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.763&lt;br /&gt;
| 0.792&lt;br /&gt;
|-&lt;br /&gt;
| ESA*&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based &lt;br /&gt;
| 0.749&lt;br /&gt;
| 0.716&lt;br /&gt;
|-&lt;br /&gt;
| SOCPMI*&lt;br /&gt;
| Islam and Inkpen (2006)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.741&lt;br /&gt;
| 0.729&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.732&lt;br /&gt;
| 0.737&lt;br /&gt;
|-&lt;br /&gt;
| Resnik&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.731&lt;br /&gt;
| 0.800&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.640&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| LSA*&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.609&lt;br /&gt;
| 0.644&lt;br /&gt;
|-&lt;br /&gt;
| WikiRelate&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| -&lt;br /&gt;
| 0.530&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note: values reported by (Hassan and Mihalcea, 2011) are &amp;quot;based on the collected raw data from the respective authors&amp;quot;, and those highlighted by (*) are re-implementations.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, Eneko, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa: [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. HLT-NAACL 2009: 19-27&lt;br /&gt;
&lt;br /&gt;
Camacho-Collados, José, Pilehvar, Mohammad Taher, and Navigli, Roberto: [http://aclweb.org/anthology/N/N15/N15-1059.pdf NASARI: a Novel Approach to a Semantically-Aware Representation of Items]. NAACL 2015, pp. 567-577, Denver, USA.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, Evgeniy, and Shaul Markovitch, [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007.&lt;br /&gt;
&lt;br /&gt;
Hassan, Samer, and Rada Mihalcea: [http://www.cse.unt.edu/~rada/papers/hassan.aaai11.pdf Semantic Relatedness Using Salient Semantic Analysis]. AAAI 2011&lt;br /&gt;
&lt;br /&gt;
Hirst, Graeme and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 305–332, 1998.&lt;br /&gt;
&lt;br /&gt;
Hughes, Thad, and Daniel Ramage: Lexical Semantic Relatedness with Random Graph Walks. EMNLP-CoNLL 2007: 581-589.&lt;br /&gt;
&lt;br /&gt;
Islam, A., and Inkpen, D. 2006. [http://www.site.uottawa.ca/~mdislam/publications/LREC_06_242.pdf Second order co-occurrence pmi for determining the semantic similarity of words]. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006) 1033–1038.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M. 2003. [http://www.arxiv.org/pdf/1204.0140 Roget’s Thesaurus as a Lexical Resource for Natural Language Processing]. Ph.D. Dissertation, Ottawa Carleton Institute for Computer Science, School of Information Technology and Engineering, University of Ottawa.&lt;br /&gt;
&lt;br /&gt;
Jiang, Jay J. and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics (ROCLING X), Taiwan, pages 19–33, 1997.&lt;br /&gt;
&lt;br /&gt;
Landauer, T. K.; Laham, D.; Rehder, B.; and Schreiner, M. E. 1997. How well can passage meaning be derived without using word order? A comparison of latent semantic analysis and humans.&lt;br /&gt;
&lt;br /&gt;
Leacock, Claudia and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 265–283, 1998.&lt;br /&gt;
&lt;br /&gt;
Lin, Dekang. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison,WI, pages 296–304, 1998.&lt;br /&gt;
&lt;br /&gt;
Milne, David, and Ian H. Witten, An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links, In Proceedings of AAAI 2008.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., Jurgens, D. and Navigli, R. [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity]. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pp. 1341-1351.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T. and Navigli, R. [http://www.sciencedirect.com/science/article/pii/S000437021500106X From Senses to Texts: An All-in-one Graph-based Approach for Measuring Semantic Similarity]. Artificial Intelligence, Elsevier.&lt;br /&gt;
&lt;br /&gt;
Resnik, Philip. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448–453, Montreal, Canada, 1995.&lt;br /&gt;
&lt;br /&gt;
Rubenstein, Herbert,  and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.&lt;br /&gt;
&lt;br /&gt;
Salle A., Idiart M., and Villavicencio A. (2018) [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Strube, Michael, Simone Paolo Ponzetto: WikiRelate! Computing Semantic Relatedness Using Wikipedia. AAAI 2006: 1419-1424&lt;br /&gt;
&lt;br /&gt;
Yih, W. and Qazvinian, V. (2012). [http://aclweb.org/anthology/N/N12/N12-1077.pdf Measuring Word Relatedness Using Heterogeneous Vector Space Models]. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012).&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=12674</id>
		<title>TOEFL Synonym Questions (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=12674"/>
		<updated>2019-09-16T00:32:59Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;the TOEFL questions are available on request by contacting [http://lsa.colorado.edu/mail_sub.html LSA Support at CU Boulder]&#039;&#039;&#039;, the people who manage the [http://lsa.colorado.edu/ LSA web site at Colorado]&lt;br /&gt;
* TOEFL = Test of English as a Foreign Language&lt;br /&gt;
* 80 multiple-choice synonym questions; 4 choices per question&lt;br /&gt;
* introduced in Landauer and Dumais (1997) as a way of evaluating algorithms for measuring degree of similarity between words&lt;br /&gt;
* subsequently used by many other researchers&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Sample question ==&lt;br /&gt;
&lt;br /&gt;
::{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;1&amp;quot; cellspacing=&amp;quot;1&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Stem:&lt;br /&gt;
|&lt;br /&gt;
| levied&lt;br /&gt;
|-&lt;br /&gt;
! Choices:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (b)&lt;br /&gt;
| believed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (c)&lt;br /&gt;
| requested&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (d)&lt;br /&gt;
| correlated&lt;br /&gt;
|-&lt;br /&gt;
! Solution:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for experiment&lt;br /&gt;
! Type&lt;br /&gt;
! Correct&lt;br /&gt;
! 95% confidence&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 20.31%&lt;br /&gt;
| 12.89–31.83%&lt;br /&gt;
|-&lt;br /&gt;
| LC&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 21.88%&lt;br /&gt;
| 13.91–33.21%&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 24.06%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| Random&lt;br /&gt;
| Random guessing&lt;br /&gt;
| 1 / 4 = 25.00%&lt;br /&gt;
| Random&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| JC&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 64.38%&lt;br /&gt;
| 52.90–74.80%&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Average non-native-English-speaking US college applicant&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Human&lt;br /&gt;
| 64.50%&lt;br /&gt;
| 53.01–74.88%&lt;br /&gt;
|-&lt;br /&gt;
| RI&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 72.50%&lt;br /&gt;
| 61.38–81.90%&lt;br /&gt;
|-&lt;br /&gt;
| DS&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.00%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.75%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PairClass&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 76.25%&lt;br /&gt;
| 65.42–85.06%&lt;br /&gt;
|-&lt;br /&gt;
| HSO&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 77.91%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 78.75%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 80.00%&lt;br /&gt;
| 69.56–88.11%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| LC-IR&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| CWO&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 82.55%&lt;br /&gt;
| 72.38–90.09%&lt;br /&gt;
|-&lt;br /&gt;
| PPMIC&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 85.00%&lt;br /&gt;
| 75.26–92.00%&lt;br /&gt;
|-&lt;br /&gt;
| GLSA&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 86.25%&lt;br /&gt;
| 76.73–92.93%&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 87.50%&lt;br /&gt;
| 78.21–93.84%&lt;br /&gt;
|-&lt;br /&gt;
| DC13&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 88.75%&lt;br /&gt;
| 79.72–94.72%&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 90.00%&lt;br /&gt;
| 81.24–95.58%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 92.50%&lt;br /&gt;
| 84.39–97.20%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 95.00%&lt;br /&gt;
| 87.69–98.62%&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| WordNet graph-based (unsupervised)&lt;br /&gt;
| 96.25%&lt;br /&gt;
| 89.43–99.22%&lt;br /&gt;
|-&lt;br /&gt;
| PR&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 97.50%&lt;br /&gt;
| 91.26–99.70%&lt;br /&gt;
|-&lt;br /&gt;
| Sp19&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 98.75%&lt;br /&gt;
| 93.23–99.97%&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 98.75%&lt;br /&gt;
| 93.23–99.97%&lt;br /&gt;
|-&lt;br /&gt;
| PCCP&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 100.00%&lt;br /&gt;
| 96.32–100.00%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Explanation of table ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Algorithm&#039;&#039;&#039; = name of algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for algorithm&#039;&#039;&#039; = where to find out more about given algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for experiment&#039;&#039;&#039; = where to find out more about evaluation of given algorithm with TOEFL questions&lt;br /&gt;
* &#039;&#039;&#039;Type&#039;&#039;&#039; = general type of algorithm: corpus-based, lexicon-based, web-based, hybrid, etc.&lt;br /&gt;
* &#039;&#039;&#039;Correct&#039;&#039;&#039; = percent of 80 questions that given algorithm answered correctly&lt;br /&gt;
* &#039;&#039;&#039;95% confidence&#039;&#039;&#039; = confidence interval calculated using the [[Statistical calculators|Binomial Exact Test]]&lt;br /&gt;
* table rows sorted in order of increasing percent correct&lt;br /&gt;
* several WordNet-based similarity measures are implemented in [http://www.d.umn.edu/~tpederse/ Ted Pedersen]&#039;s [http://www.d.umn.edu/~tpederse/similarity.html WordNet::Similarity] package&lt;br /&gt;
* LSA = Latent Semantic Analysis&lt;br /&gt;
* PCCP = Principal Component vectors with Caron P&lt;br /&gt;
* PMI-IR = Pointwise Mutual Information - Information Retrieval&lt;br /&gt;
* PR = Product Rule&lt;br /&gt;
* PPMIC = Positive Pointwise Mutual Information with Cosine&lt;br /&gt;
* GLSA = Generalized Latent Semantic Analysis&lt;br /&gt;
* CWO = Context Window Overlapping&lt;br /&gt;
* DS = Dependency Space&lt;br /&gt;
* RI = Random Indexing&lt;br /&gt;
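The Binomial Exact Test intervals in the table can be reproduced without a statistics library. The sketch below (a stdlib-only illustration, not the calculator the wiki links to) inverts the binomial tail probability by bisection, applied to the PMI-IR result of 59/80 correct, whose row lists 62.72–82.96%:

```python
import math

def tail_ge(n, k, p):
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def invert(f, target, increasing):
    # bisection on [0, 1] for p with f(p) == target, f monotone
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if (f(mid) > target) == increasing:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    # exact (Clopper-Pearson) two-sided binomial confidence interval
    lower = 0.0 if k == 0 else invert(lambda p: tail_ge(n, k, p), alpha / 2, True)
    upper = 1.0 if k == n else invert(lambda p: 1 - tail_ge(n, k + 1, p), alpha / 2, False)
    return lower, upper

# PMI-IR (Turney, 2001): 73.75% of 80 questions = 59 correct
lo, hi = clopper_pearson(59, 80)
```

At the 100.00% boundary row, the table's 96.32% lower bound corresponds to the one-sided value 0.05^(1/80), which some exact-test calculators report at the boundary instead of the two-sided 0.025 value.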
&lt;br /&gt;
== Notes ==&lt;br /&gt;
&lt;br /&gt;
* the performance of a corpus-based algorithm depends on the corpus, so the difference in performance between two corpus-based systems may be due to the different corpora, rather than the different algorithms&lt;br /&gt;
* the TOEFL questions include nouns, verbs, and adjectives, but some of the WordNet-based algorithms were only designed to work with nouns; this explains some of the lower scores&lt;br /&gt;
* some of the algorithms may have been tuned on the TOEFL questions; read the references for details&lt;br /&gt;
* Landauer and Dumais (1997) report scores that were corrected for guessing by subtracting a penalty of 1/3 for each incorrect answer; they report a score of 52.5% when this penalty is applied; when the penalty is removed, their performance is 64.4% correct&lt;br /&gt;
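The arithmetic behind that correction can be checked directly: 64.38% of 80 questions is 51.5 correct, and subtracting one third of the 28.5 incorrect answers yields exactly 52.5%. A quick check:

```python
n = 80
correct = 51.5                         # implied by the table's 64.38%
wrong = n - correct                    # 28.5 incorrect answers
raw = correct / n                      # 0.64375, reported as 64.4%
corrected = (correct - wrong / 3) / n  # (51.5 - 9.5) / 80 = 0.525
```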
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2007). [http://www.cs.bham.ac.uk/~jxb/PUBS/BRM.pdf Extracting semantic representations from word co-occurrence statistics: A computational study]. &#039;&#039;Behavior Research Methods&#039;&#039;, 39(3), 510-526.&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2012). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.9582&amp;amp;rep=rep1&amp;amp;type=pdf Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD]. &#039;&#039;Behavior Research Methods&#039;&#039;,  44(3):890-907.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged. [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502&lt;br /&gt;
&lt;br /&gt;
Han, L. (2014). [http://ebiquity.umbc.edu/paper/html/id/658/Schema-Free-Querying-of-Semantic-Data Schema Free Querying of Semantic Data], Ph.D. dissertation, University of Maryland, Baltimore County, Baltimore, MD, USA.&lt;br /&gt;
&lt;br /&gt;
Higgins, D. (2005). [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.329.1517 Which Statistics Reflect Semantics? Rethinking Synonymy and Word Similarity.] In: Kepser, S., Reis, M. (eds.) &#039;&#039;Linguistic Evidence: Empirical, Theoretical and Computational Perspectives&#039;&#039;. Mouton de Gruyter, Berlin, pp. 265–284.&lt;br /&gt;
&lt;br /&gt;
Hirst, G., and St-Onge, D. (1998). [http://mirror.eacoss.org/documentation/ITLibrary/IRIS/Data/1997/Hirst/Lexical/1997-Hirst-Lexical.pdf Lexical chains as representations of context for the detection and correction of malapropisms]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, 305-332.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Jiang, J.J., and Conrath, D.W. (1997). [http://wortschatz.uni-leipzig.de/~sbordag/aalw05/Referate/03_Assoziationen_BudanitskyResnik/Jiang_Conrath_97.pdf Semantic similarity based on corpus statistics and lexical taxonomy]. &#039;&#039;Proceedings of the International Conference on Research in Computational Linguistics&#039;&#039;, Taiwan.&lt;br /&gt;
&lt;br /&gt;
Karlgren, J. and Sahlgren, M. (2001). [http://www.sics.se/~jussi/Artiklar/2001_RWIbook/KarlgrenSahlgren2001.pdf From Words to Understanding]. In Uesaka, Y., Kanerva, P., &amp;amp; Asoh, H. (Eds.), &#039;&#039;Foundations of Real-World Intelligence&#039;&#039;, Stanford: CSLI Publications, pp. 294–308. &lt;br /&gt;
&lt;br /&gt;
Landauer, T.K., and Dumais, S.T. (1997). [http://lsa.colorado.edu/papers/plato/plato.annote.html A solution to Plato&#039;s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge]. &#039;&#039;Psychological Review&#039;&#039;, 104(2):211–240.&lt;br /&gt;
&lt;br /&gt;
Leacock, C., and Chodorow, M. (1998). [http://books.google.ca/books?id=Rehu8OOzMIMC&amp;amp;lpg=PA265&amp;amp;ots=IpnaLkZUec&amp;amp;lr&amp;amp;pg=PA265#v=onepage&amp;amp;q&amp;amp;f=false Combining local context and WordNet similarity for word sense identification]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, pp. 265-283.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://www.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. &#039;&#039;Proceedings of the 15th International Conference on Machine Learning (ICML-98)&#039;&#039;, Madison, WI, pp. 296-304.&lt;br /&gt;
&lt;br /&gt;
Matveeva, I., Levow, G., Farahat, A., and Royer, C. (2005). [http://people.cs.uchicago.edu/~matveeva/SynGLSA_ranlp_final.pdf Generalized latent semantic analysis for term representation]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Pado, S., and Lapata, M. (2007). [http://www.nlpado.de/~sebastian/pub/papers/cl07_pado.pdf Dependency-based construction of semantic space models]. &#039;&#039;Computational Linguistics&#039;&#039;, 33(2), 161-199.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., Jurgens D., and Navigli R. (2013). [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, disambiguate and walk: A unified approach for measuring semantic similarity]. &#039;&#039;Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013),&#039;&#039; Sofia, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Rapp, R. (2003). [http://www.amtaweb.org/summit/MTSummit/FinalPapers/19-Rapp-final.pdf Word sense discovery based on sense descriptor dissimilarity]. &#039;&#039;Proceedings of the Ninth Machine Translation Summit&#039;&#039;, pp. 315-322.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://citeseer.ist.psu.edu/resnik95using.html Using information content to evaluate semantic similarity]. &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95)&#039;&#039;, Montreal, pp. 448-453.&lt;br /&gt;
&lt;br /&gt;
Ruiz-Casado, M., Alfonseca, E. and Castells, P. (2005) [http://alfonseca.org/pubs/2005-ranlp1.pdf Using context-window overlapping in Synonym Discovery and Ontology Extension]. &#039;&#039;Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP-2005)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Salle A., Idiart M., and Villavicencio A. (2018) [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Terra, E., and Clarke, C.L.A. (2003). [http://acl.ldc.upenn.edu/N/N03/N03-1032.pdf Frequency estimates for statistical word similarity measures]. &#039;&#039;Proceedings of the Human Language Technology and North American Chapter of Association of Computational Linguistics Conference 2003 (HLT/NAACL 2003)&#039;&#039;, pp. 244–251.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039; 37, 1–39&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2001). [http://arxiv.org/abs/cs.LG/0212033 Mining the Web for synonyms: PMI-IR versus LSA on TOEFL]. &#039;&#039;Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001)&#039;&#039;, Freiburg, Germany, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D., Littman, M.L., Bigham, J., and Shnayder, V. (2003). [http://arxiv.org/abs/cs.CL/0309035 Combining independent modules to solve multiple-choice synonym and analogy problems]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, pp. 482-489.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2008). [http://arxiv.org/abs/0809.0124 A uniform approach to analogies, synonyms, antonyms, and associations]. &#039;&#039;Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)&#039;&#039;, Manchester, UK, pp. 905-912.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12673</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12673"/>
		<updated>2019-09-16T00:31:02Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.inf.u-szeged.hu/~dobo/ András Dobó]&lt;br /&gt;
&lt;br /&gt;
PhD Student&lt;br /&gt;
&lt;br /&gt;
PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 [http://www.inf.u-szeged.hu/~dobo/ www.inf.u-szeged.hu/~dobo]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
Artificial Intelligence I. tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Formal Languages tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Databases tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
Introduction to Informatics tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2011-	[http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf PhD in Computer Science] - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2009-2010	[https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Master of Science in Computer Science]&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2006-2009	[http://diploma.bibl.u-szeged.hu/3370/ Bachelor of Science in Computer Program Designer]&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
English:	Level C2, Cambridge ESOL (2010)&lt;br /&gt;
&lt;br /&gt;
German:		Level B2, Goethe Institut (2005)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Prizes and further information&lt;br /&gt;
&lt;br /&gt;
2013	Best Young Researcher Prize&lt;br /&gt;
	MSZNY 2013 - IX. Magyar Számítógépes Nyelvészeti Konferencia, Szeged&lt;br /&gt;
&lt;br /&gt;
2010	III. prize&lt;br /&gt;
	Regional Scientific Conference of Students (TDK), Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2000	I. place&lt;br /&gt;
	Makkosházi Mathematics Competition (city level)&lt;br /&gt;
&lt;br /&gt;
Erdős number:	3  (András Dobó – János Csirik – Vilmos Totik – Pál Erdős)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Dobó, A.: [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged (2019) - [https://github.com/doboandras/dsm-parameter-analysis GitHub repository]&lt;br /&gt;
&lt;br /&gt;
2.	Dobó A., Csirik J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the Best Parameter Settings in the Creation and Comparison of Feature Vectors in Distributional Semantic Models Across Multiple Languages]. In: MacIntyre J., Maglogiannis I., Iliadis L., Pimenidis E. (eds) Artificial Intelligence Applications and Innovations. AIAI 2019. IFIP Advances in Information and Communication Technology, vol 559. 487-499. Springer, Cham. (2019)&lt;br /&gt;
&lt;br /&gt;
3.	Dobó A., Csirik J.: [https://doi.org/10.1080/09296174.2019.1570897 A Comprehensive Study of the Parameters in the Creation and Comparison of Feature Vectors in Distributional Semantic Models]. Journal of Quantitative Linguistics (2019)&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A.: [http://siba-ese.unisalento.it/index.php/ejasa/article/download/19296/17273 A measure of adjusted difference between values of a variable]. Electronic Journal of Applied Statistical Analysis. 12(1), 153-175. (2019)&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A.: [http://www.rcs.cic.ipn.mx/rcs/2018_147_6/Multi-D%20Kneser-Ney%20Smoothing%20Preserving%20the%20Original%20Marginal%20Distributions.pdf Multi-D Kneser-Ney Smoothing Preserving the Original Marginal Distributions]. Research in Computing Science. 147(6), 11-25 (2018)&lt;br /&gt;
&lt;br /&gt;
6.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://link.springer.com/chapter/10.1007/978-3-319-13817-6_32 Information Extraction from Hungarian, English and German CVs for a Career Portal]. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. 333-341. Springer International Publishing, Switzerland (2014)&lt;br /&gt;
&lt;br /&gt;
7.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://rgai.inf.u-szeged.hu/mszny2014/MSZNY2014_press_b5.pdf#page=369 Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz]. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. 359-360. University of Szeged, Szeged (2014)&lt;br /&gt;
&lt;br /&gt;
8.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Computing%20semantic%20similarity%20using%20large%20static%20corpora.pdf Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. 491-502. Springer-Verlag, Berlin Heidelberg (2013)&lt;br /&gt;
&lt;br /&gt;
9.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=221 Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 213-224. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
10.	Dobó, A., Pulman, S.G.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=43 Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 35-46. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
11.	Dobó, A., Pulman, S.G.: [http://journal.sepln.org/index.php/pln/article/viewFile/842/697 Interpreting noun compounds using paraphrases]. Procesamiento del Lenguaje Natural. 46, 59-66. (2011)&lt;br /&gt;
&lt;br /&gt;
12.	Dobó, A.: [http://www.inf.u-szeged.hu/sites/default/files/kutatas/konferenciak/tdk2010osz/Dobo_Andras.pdf Angol szavak szinonimáinak automatikus keresése]. TDK. University of Szeged (2010)&lt;br /&gt;
&lt;br /&gt;
13.	Dobó, A.: [https://ora.ox.ac.uk/objects/uuid:1b771160-3f13-4362-a69a-c73401bec321/download_file?file_format=pdf&amp;amp;safe_filename=Dissertation.pdf&amp;amp;type_of_work=Thesis Interpreting Noun Compounds]. University of Oxford (2010)&lt;br /&gt;
&lt;br /&gt;
14.	Dobó, A.: [http://diploma.bibl.u-szeged.hu/3370/ A Közelítő és szimbolikus számítások tárgy során MATLAB-ban írt zárthelyi dolgozatok automatizált javítása]. University of Szeged (2009)&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12610</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=12610"/>
		<updated>2019-08-17T00:47:20Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.inf.u-szeged.hu/~dobo/ András Dobó]&lt;br /&gt;
&lt;br /&gt;
PhD Student&lt;br /&gt;
&lt;br /&gt;
PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 [http://www.inf.u-szeged.hu/~dobo/ www.inf.u-szeged.hu/~dobo]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
Artificial Intelligence I. tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Formal Languages tutorials (3 semesters)&lt;br /&gt;
&lt;br /&gt;
Databases tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
Introduction to Informatics tutorials (1 semester)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2011-	[http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf PhD in Computer Science]&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2009-2010	[http://solo.bodleian.ox.ac.uk/primo_library/libweb/action/dlDisplay.do?vid=OXVU1&amp;amp;docId=oxfaleph017405947&amp;amp;fn=permalink Master of Science in Computer Science]&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2006-2009	[http://diploma.bibl.u-szeged.hu/3370/ Bachelor of Science in Computer Program Designer]&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
English:	Level C2, Cambridge ESOL (2010)&lt;br /&gt;
&lt;br /&gt;
German:		Level B2, Goethe Institut (2005)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Prizes and further information&lt;br /&gt;
&lt;br /&gt;
2013	Best Young Researcher Prize&lt;br /&gt;
	MSZNY 2013 - IX. Magyar Számítógépes Nyelvészeti Konferencia, Szeged&lt;br /&gt;
&lt;br /&gt;
2010	3rd prize&lt;br /&gt;
	Regional Scientific Conference of Students (TDK), Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
2000	1st place&lt;br /&gt;
	Makkosházi Mathematics Competition (city level)&lt;br /&gt;
&lt;br /&gt;
Erdős number:	3  (András Dobó – János Csirik – Vilmos Totik – Pál Erdős)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Dobó, A.: [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged (2019)&lt;br /&gt;
&lt;br /&gt;
2.	Dobó A., Csirik J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the Best Parameter Settings in the Creation and Comparison of Feature Vectors in Distributional Semantic Models Across Multiple Languages]. In: MacIntyre J., Maglogiannis I., Iliadis L., Pimenidis E. (eds) Artificial Intelligence Applications and Innovations. AIAI 2019. IFIP Advances in Information and Communication Technology, vol 559. 487-499. Springer, Cham. (2019)&lt;br /&gt;
&lt;br /&gt;
3.	Dobó A., Csirik J.: [https://doi.org/10.1080/09296174.2019.1570897 A Comprehensive Study of the Parameters in the Creation and Comparison of Feature Vectors in Distributional Semantic Models]. Journal of Quantitative Linguistics (2019)&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A.: [http://siba-ese.unisalento.it/index.php/ejasa/article/download/19296/17273 A measure of adjusted difference between values of a variable]. Electronic Journal of Applied Statistical Analysis. 12(1), 153-175. (2019)&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A.: [http://www.rcs.cic.ipn.mx/rcs/2018_147_6/Multi-D%20Kneser-Ney%20Smoothing%20Preserving%20the%20Original%20Marginal%20Distributions.pdf Multi-D Kneser-Ney Smoothing Preserving the Original Marginal Distributions]. Research in Computing Science. 147(6), 11-25 (2018)&lt;br /&gt;
&lt;br /&gt;
6.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://link.springer.com/chapter/10.1007/978-3-319-13817-6_32 Information Extraction from Hungarian, English and German CVs for a Career Portal]. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. 333-341. Springer International Publishing, Switzerland (2014)&lt;br /&gt;
&lt;br /&gt;
7.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://rgai.inf.u-szeged.hu/mszny2014/MSZNY2014_press_b5.pdf#page=369 Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz]. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. 359-360. University of Szeged, Szeged (2014)&lt;br /&gt;
&lt;br /&gt;
8.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/~dobo/Publications/Computing%20semantic%20similarity%20using%20large%20static%20corpora.pdf Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. 491-502. Springer-Verlag, Berlin Heidelberg (2013)&lt;br /&gt;
&lt;br /&gt;
9.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=221 Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 213-224. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
10.	Dobó, A., Pulman, S.G.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=43 Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. 35-46. University of Szeged, Szeged (2012)&lt;br /&gt;
&lt;br /&gt;
11.	Dobó, A., Pulman, S.G.: [http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/download/842/697 Interpreting noun compounds using paraphrases]. Procesamiento del Lenguaje Natural. 46, 59-66. (2011)&lt;br /&gt;
&lt;br /&gt;
12.	Dobó, A.: [http://www.inf.u-szeged.hu/sites/default/files/kutatas/konferenciak/tdk2010osz/Dobo_Andras.pdf Angol szavak szinonimáinak automatikus keresése]. TDK. University of Szeged (2010)&lt;br /&gt;
&lt;br /&gt;
13.	Dobó, A.: [http://solo.bodleian.ox.ac.uk/primo_library/libweb/action/dlDisplay.do?vid=OXVU1&amp;amp;docId=oxfaleph017405947&amp;amp;fn=permalink Interpreting Noun Compounds]. University of Oxford (2010)&lt;br /&gt;
&lt;br /&gt;
14.	Dobó, A.: [http://diploma.bibl.u-szeged.hu/3370/ A Közelítő és szimbolikus számítások tárgy során MATLAB-ban írt zárthelyi dolgozatok automatizált javítása]. University of Szeged (2009)&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12609</id>
		<title>MEN Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12609"/>
		<updated>2019-08-17T00:37:31Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2013)&lt;br /&gt;
* 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 pairs in the test part of the dataset&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 50 subjects&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the test part of the dataset (1000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.867&lt;br /&gt;
| 0.866&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.869&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| Ch18&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Corpus-based, predictive&lt;br /&gt;
| 0.84&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.808&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.798&lt;br /&gt;
| 0.798&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.781&lt;br /&gt;
| 0.749&lt;br /&gt;
|-&lt;br /&gt;
| Br13&lt;br /&gt;
| Bruni et al. (2013)&lt;br /&gt;
| Bruni et al. (2013)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.78&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.705&lt;br /&gt;
| 0.709&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the full dataset (3000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.865&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.846&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.861&lt;br /&gt;
| 0.859&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.809&lt;br /&gt;
| 0.803&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.802&lt;br /&gt;
| 0.801&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.771&lt;br /&gt;
| 0.746&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.702&lt;br /&gt;
| 0.707&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Bruni, E., Tran, N. K., and Baroni, M. (2014). [https://www.jair.org/index.php/jair/article/download/10857/25905/ Multimodal distributional semantics]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 49, pp. 1-47.&lt;br /&gt;
&lt;br /&gt;
Christopoulou, F., Briakou, E., Iosif, E., and Potamianos, A. (2018). [https://ieeexplore.ieee.org/abstract/document/8334459/ Mixture of topic-based distributional semantic and affective models]. &#039;&#039;IEEE ICSC 2018&#039;&#039;, pp. 203-210. IEEE.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019a). [https://doi.org/10.1080/09296174.2019.1570897 A comprehensive study of the parameters in the creation and comparison of feature vectors in distributional semantic models]. &#039;&#039;Journal of Quantitative Linguistics&#039;&#039;, pp. 1-28.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019b). [http://www.inf.u-szeged.hu/~dobo/Publications/Comparison%20of%20the%20best%20parameter%20settings%20of%20DSMs%20across%20languages.pdf Comparison of the best parameter settings in the creation and comparison of feature vectors in distributional semantic models across multiple languages]. &#039;&#039;AIAI 2019: Artificial Intelligence Applications and Innovations&#039;&#039;, pp. 487-499.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12606</id>
		<title>MEN Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12606"/>
		<updated>2019-08-13T00:51:13Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2013)&lt;br /&gt;
* 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 pairs in the test part of the dataset&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 50 subjects&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the test part of the dataset (1000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.867&lt;br /&gt;
| 0.866&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.869&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| Ch18&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Corpus-based, predictive&lt;br /&gt;
| 0.84&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.808&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.798&lt;br /&gt;
| 0.798&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.781&lt;br /&gt;
| 0.749&lt;br /&gt;
|-&lt;br /&gt;
| Br13&lt;br /&gt;
| Bruni et al. (2013)&lt;br /&gt;
| Bruni et al. (2013)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.78&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.705&lt;br /&gt;
| 0.709&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the full dataset (3000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.865&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.846&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.861&lt;br /&gt;
| 0.859&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.809&lt;br /&gt;
| 0.803&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.802&lt;br /&gt;
| 0.801&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.771&lt;br /&gt;
| 0.746&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.702&lt;br /&gt;
| 0.707&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Bruni, E., Tran, N. K., and Baroni, M. (2014). [https://www.jair.org/index.php/jair/article/download/10857/25905/ Multimodal distributional semantics]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 49, pp. 1-47.&lt;br /&gt;
&lt;br /&gt;
Christopoulou, F., Briakou, E., Iosif, E., and Potamianos, A. (2018). [https://ieeexplore.ieee.org/abstract/document/8334459/ Mixture of topic-based distributional semantic and affective models]. &#039;&#039;IEEE ICSC 2018&#039;&#039;, pp. 203-210. IEEE.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019a). [https://doi.org/10.1080/09296174.2019.1570897 A comprehensive study of the parameters in the creation and comparison of feature vectors in distributional semantic models]. &#039;&#039;Journal of Quantitative Linguistics&#039;&#039;, pp. 1-28.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019b). [https://doi.org/10.1007/978-3-030-19823-7_41 Comparison of the best parameter settings in the creation and comparison of feature vectors in distributional semantic models across multiple languages]. &#039;&#039;AIAI 2019: Artificial Intelligence Applications and Innovations&#039;&#039;, pp. 487-499.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12605</id>
		<title>MEN Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MEN_Test_Collection_(State_of_the_art)&amp;diff=12605"/>
		<updated>2019-08-13T00:49:52Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: Created page with &amp;quot;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2013) * 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 p...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the [https://staff.fnwi.uva.nl/e.bruni/MEN MEN dataset] (Bruni et al., 2013)&lt;br /&gt;
* 3000 word pairs: 2000 pairs in the development part of the dataset, 1000 pairs in the test part of the dataset&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 50 subjects&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the test part of the dataset (1000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.867&lt;br /&gt;
| 0.866&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.869&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.866&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| Ch18&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Christopoulou et al. (2018)&lt;br /&gt;
| Corpus-based, predictive&lt;br /&gt;
| 0.84&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.808&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.798&lt;br /&gt;
| 0.798&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.781&lt;br /&gt;
| 0.749&lt;br /&gt;
|-&lt;br /&gt;
| Br13&lt;br /&gt;
| Bruni et al. (2013)&lt;br /&gt;
| Bruni et al. (2013)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.78&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Dobó (2019), Dobó and Csirik (2019b)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.705&lt;br /&gt;
| 0.709&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results for the full dataset (3000 word pairs) ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.846&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-hybrid&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.862&lt;br /&gt;
| 0.865&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.861&lt;br /&gt;
| 0.859&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.809&lt;br /&gt;
| 0.803&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019c)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.802&lt;br /&gt;
| 0.801&lt;br /&gt;
|-&lt;br /&gt;
| DC19a-corpus&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Dobó and Csirik (2019a)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.771&lt;br /&gt;
| 0.746&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based, distributional&lt;br /&gt;
| 0.702&lt;br /&gt;
| 0.707&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Bruni, E., Tran, N. K., and Baroni, M. (2014). [https://www.jair.org/index.php/jair/article/download/10857/25905/ Multimodal distributional semantics]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 49, pp. 1-47.&lt;br /&gt;
&lt;br /&gt;
Christopoulou, F., Briakou, E., Iosif, E., and Potamianos, A. (2018). [https://ieeexplore.ieee.org/abstract/document/8334459/ Mixture of topic-based distributional semantic and affective models]. &#039;&#039;IEEE ICSC 2018&#039;&#039;, pp. 203-210. IEEE.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019a). [https://doi.org/10.1080/09296174.2019.1570897 A comprehensive study of the parameters in the creation and comparison of feature vectors in distributional semantic models]. &#039;&#039;Journal of Quantitative Linguistics&#039;&#039;, pp. 1-28.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2019b). [https://doi.org/10.1007/978-3-030-19823-7_41 Comparison of the best parameter settings in the creation and comparison of feature vectors in distributional semantic models across multiple languages]. &#039;&#039;AIAI 2019: Artificial Intelligence Applications and Innovations&#039;&#039;, pp. 487-499.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=State_of_the_art&amp;diff=12604</id>
		<title>State of the art</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=State_of_the_art&amp;diff=12604"/>
		<updated>2019-08-13T00:23:10Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The purpose of this section of the ACL wiki is to be a repository of &#039;&#039;k&#039;&#039;-best state-of-the-art results (i.e., methods and software) for various core natural language processing tasks. &lt;br /&gt;
&lt;br /&gt;
As a side effect, this should hopefully evolve into a knowledge base of standard evaluation methods and datasets for various tasks, as well as encourage more effort into reproducibility of results. This will help newcomers to a field appreciate what has been done so far and what the main tasks are, and will help keep active researchers informed on fields other than their specific research. The next time you need a system for PP attachment, or wonder what is the current state of word sense disambiguation, this will be the place to visit. &lt;br /&gt;
&lt;br /&gt;
Please contribute! (This is also a good place for you to display your results!)&lt;br /&gt;
&lt;br /&gt;
As a historical point of reference, you may want to refer to the [http://web.archive.org/web/20100325144600/http://cslu.cse.ogi.edu/HLTsurvey/ Survey of the State of the Art in Human Language Technology] ([http://www.lt-world.org/hlt_survey/master.pdf also available as PDF]), edited by R. Cole, J. Mariani, H. Uszkoreit, G. B. Varile, A. Zaenen, A. Zampolli, V. Zue, 1996.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Please keep this list in alphabetical order --&amp;gt;&lt;br /&gt;
* [[Analogy (State of the art)|Analogy]] -- [[SAT Analogy Questions (State of the art)|SAT]],  [[SemEval-2012 Task 2 (State of the art)|SemEval-2012 Task 2]], [[Syntactic Analogies (State of the art)|Syntactic Analogies]], [[Google analogy test set (State of the art)|Google analogy test set]], [[Bigger analogy test set (State of the art)|Bigger analogy test set]]&lt;br /&gt;
* [[Anaphora Resolution (State of the art)|Anaphora Resolution]] (stub)&lt;br /&gt;
* [[Automatic Text Summarization (State of the art)|Automatic Text Summarization]] (stub)&lt;br /&gt;
* [[Chunking (State of the art)|Chunking]] (stub)&lt;br /&gt;
* [[Dependency Parsing (State of the art)|Dependency Parsing]] (stub)&lt;br /&gt;
* [[Document Classification (State of the art)|Document Classification]] (stub)&lt;br /&gt;
* [[Language Identification (State of the art)|Language Identification]] (stub)&lt;br /&gt;
* [[Named Entity Recognition (State of the art)|Named Entity Recognition]]&lt;br /&gt;
* [[Noun-Modifier Semantic Relations (State of the art)|Noun-Modifier Semantic Relations]]&lt;br /&gt;
* [[NP Chunking (State of the art)|NP Chunking]] &lt;br /&gt;
* [[Paraphrase Identification (State of the art)|Paraphrase Identification]]&lt;br /&gt;
* [[Parsing (State of the art)|Parsing]] &lt;br /&gt;
* [[POS Induction (State of the art) |POS Induction]]&lt;br /&gt;
* [[POS Tagging (State of the art) |POS Tagging]]&lt;br /&gt;
* [[PP Attachment (State of the art)|PP Attachment]] (stub)&lt;br /&gt;
* [[Question Answering (State of the art)|Question Answering]]&lt;br /&gt;
* [[Semantic Role Labeling (State of the art)|Semantic Role Labeling]] (stub)&lt;br /&gt;
* [[Sentiment Analysis (State of the art)|Sentiment Analysis]] (stub)&lt;br /&gt;
* [[Similarity (State of the art)|Similarity]] -- [[ESL Synonym Questions (State of the art)|ESL]], [[SAT Analogy Questions (State of the art)|SAT]], [[TOEFL Synonym Questions (State of the art)|TOEFL]], [[RG-65 Test Collection (State of the art)|RG-65 Test Collection]], [[MC-28 Test Collection (State of the art)|MC-28 Test Collection]], [[SimLex-999 (State of the art)|SimLex-999 Similarity Test Collection]], [[WordSimilarity-353 Test Collection (State of the art)|WordSimilarity-353]], [[SemEval-2012 Task 2 (State of the art)|SemEval-2012 Task 2]], [[MEN Test Collection (State of the art)|MEN Test Collection]]&lt;br /&gt;
* [[Speech Recognition (State of the art)|Speech Recognition]] (article request)&lt;br /&gt;
* [[Temporal Information Extraction (State of the art)|Temporal Information Extraction]]&lt;br /&gt;
* [[Cleaneval (State of the art)| Web Corpus Cleaning]] (stub)&lt;br /&gt;
* [[Word Segmentation (State of the art)|Word Segmentation]] (stub)&lt;br /&gt;
* [[Word Sense Disambiguation (State of the art)|Word Sense Disambiguation]] (stub)&lt;br /&gt;
&amp;lt;!-- Please keep this list in alphabetical order --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art)&amp;diff=12603</id>
		<title>Similarity (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art)&amp;diff=12603"/>
		<updated>2019-08-13T00:21:39Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* see also: [[State of the art]]&lt;br /&gt;
&lt;br /&gt;
== Attributional similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;attributional similarity:&#039;&#039;&#039; the degree to which two words are synonymous&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[TOEFL Synonym Questions (State of the art)|TOEFL Synonym Questions]]&lt;br /&gt;
** [[ESL Synonym Questions (State of the art)|ESL Synonym Questions]]&lt;br /&gt;
** [[RG-65 Test Collection (State of the art)|RG-65 Test Collection]]&lt;br /&gt;
** [[MC-28 Test Collection (State of the art)|MC-28 Test Collection]]&lt;br /&gt;
** [[SimLex-999 (State of the art)|SimLex-999 Similarity Test Collection]]&lt;br /&gt;
** [[WordSimilarity-353 Test Collection (State of the art)|WordSimilarity-353 Test Collection]]&lt;br /&gt;
** [[MEN Test Collection (State of the art)|MEN Test Collection]]&lt;br /&gt;
&lt;br /&gt;
== Similarity versus Association ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;similarity versus association&#039;&#039;&#039;: the contrast between taxonomical similarity (co-hyponymy) and association (co-occurrence)&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[Similar-Associated-Both Test Collection (State of the art)|Similar-Associated-Both Test Collection]]&lt;br /&gt;
** [[SimLex-999 (State of the art)|SimLex-999 Similarity Test Collection]]&lt;br /&gt;
&lt;br /&gt;
== Relational similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;relational similarity:&#039;&#039;&#039; the degree to which two relations are analogous&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[SAT Analogy Questions (State of the art)|SAT Analogy Questions]]&lt;br /&gt;
** [[SemEval-2012 Task 2 (State of the art)|SemEval-2012 Task 2: Measuring Degrees of Relational Similarity]]&lt;br /&gt;
** [[Syntactic Analogies (State of the art)|Microsoft Research Syntactic Analogies Dataset]]&lt;br /&gt;
&lt;br /&gt;
== Phrase similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;phrase similarity:&#039;&#039;&#039; the degree to which two phrases are similar&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[Noun-Modifier Questions (State of the art)|Noun-Modifier Questions]]&lt;br /&gt;
&lt;br /&gt;
== Sentence similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;sentence similarity:&#039;&#039;&#039; sentence paraphrase, paraphrase identification, paraphrase recognition&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[Paraphrase Identification (State of the art)|Microsoft Research Paraphrase Corpus]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
&lt;br /&gt;
* SemEval-2012 Task 2: [https://sites.google.com/site/semeval2012task2/ Measuring Degrees of Relational Similarity]&lt;br /&gt;
* SemEval-2012 Task 6: [http://www.cs.york.ac.uk/semeval-2012/task6/ Semantic Textual Similarity]&lt;br /&gt;
* SEM 2013 Shared Task: [http://ixa2.si.ehu.es/sts/ Semantic Textual Similarity]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)&amp;diff=12602</id>
		<title>WordSimilarity-353 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)&amp;diff=12602"/>
		<updated>2019-08-12T23:12:17Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;
* [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ WordSimilarity-353 Test Collection]&lt;br /&gt;
* contains two sets of English word pairs along with human-assigned similarity judgments&lt;br /&gt;
* first set (set1) contains 153 word pairs along with their similarity scores assigned by 13 subjects&lt;br /&gt;
* second set (set2) contains 200 word pairs with similarity assessed by 16 subjects&lt;br /&gt;
* performance is measured by [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rank correlation coefficient]&lt;br /&gt;
* introduced by [http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf Finkelstein et al. (2002)]&lt;br /&gt;
* subsequently used by many other researchers&lt;br /&gt;
* [https://www.wikidata.org/wiki/Q31845205 Wikidata] and [https://tools.wmflabs.org/scholia/use/Q31845205 Scholia]&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of increasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! Spearman&#039;s rho&lt;br /&gt;
! Pearson&#039;s r&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.302&lt;br /&gt;
| 0.356&lt;br /&gt;
|-&lt;br /&gt;
| WNE&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.305&lt;br /&gt;
| 0.271&lt;br /&gt;
|-&lt;br /&gt;
| J&amp;amp;C&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.318&lt;br /&gt;
| 0.354&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.348&lt;br /&gt;
| 0.341&lt;br /&gt;
|-&lt;br /&gt;
| H&amp;amp;S&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.302&lt;br /&gt;
| 0.356&lt;br /&gt;
|-&lt;br /&gt;
| Lin&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.348&lt;br /&gt;
| 0.357&lt;br /&gt;
|-&lt;br /&gt;
| Resnik&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.353&lt;br /&gt;
| 0.365&lt;br /&gt;
|-&lt;br /&gt;
| ROGET&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.415&lt;br /&gt;
| 0.536&lt;br /&gt;
|-&lt;br /&gt;
| C&amp;amp;W&lt;br /&gt;
| Collobert and Weston (2008)&lt;br /&gt;
| Collobert and Weston (2008)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.5&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| WikiRelate&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| N/A&lt;br /&gt;
| 0.48&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.579&lt;br /&gt;
| 0.577&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.581&lt;br /&gt;
| 0.492&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.581&lt;br /&gt;
| 0.563&lt;br /&gt;
|-&lt;br /&gt;
| simVB+simWN&lt;br /&gt;
| Finkelstein et al. (2002)&lt;br /&gt;
| Finkelstein et al. (2002)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| N/A&lt;br /&gt;
| 0.55&lt;br /&gt;
|-&lt;br /&gt;
| SSA&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.622&lt;br /&gt;
| 0.629&lt;br /&gt;
|-&lt;br /&gt;
| HSMN+csmRNN&lt;br /&gt;
| Luong et al. (2013)&lt;br /&gt;
| Luong et al. (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.65&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.706&lt;br /&gt;
| 0.705&lt;br /&gt;
|-&lt;br /&gt;
| Multi-prototype&lt;br /&gt;
| Huang et al. (2012)&lt;br /&gt;
| Huang et al. (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.71&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Multi-lingual SSA&lt;br /&gt;
| Hassan et al. (2011)&lt;br /&gt;
| Hassan et al. (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.713&lt;br /&gt;
| 0.674&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.733&lt;br /&gt;
| 0.704&lt;br /&gt;
|-&lt;br /&gt;
| ESA&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.748&lt;br /&gt;
| 0.503&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.795&lt;br /&gt;
| 0.276&lt;br /&gt;
|-&lt;br /&gt;
| TSA&lt;br /&gt;
| Radinsky et al. (2011)&lt;br /&gt;
| Radinsky et al. (2011)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.80&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| CLEAR&lt;br /&gt;
| Halawi et al. (2012)&lt;br /&gt;
| Halawi et al. (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.81&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Y&amp;amp;Q&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| ConceptNet Numberbatch&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.828&lt;br /&gt;
| N/A&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in alphabetical order.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. (2002) [http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf Placing Search in Context: The Concept Revisited]. ACM Transactions on Information Systems, 20(1):116-131.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, Evgeniy, and Shaul Markovitch, [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007.&lt;br /&gt;
&lt;br /&gt;
Halawi, Guy, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. (2012). [http://gabrilovich.com/publications/papers/Halawi2012LSL.pdf Large-scale learning of word relatedness with constraints]. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1406-1414. ACM.&lt;br /&gt;
&lt;br /&gt;
Hassan, Samer, and Rada Mihalcea: [http://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/download/3616/3972/ Semantic Relatedness Using Salient Semantic Analysis]. AAAI 2011&lt;br /&gt;
&lt;br /&gt;
Hirst, Graeme and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 305–332, 1998.&lt;br /&gt;
&lt;br /&gt;
Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (ACL &#039;12), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 873-882.&lt;br /&gt;
&lt;br /&gt;
Islam, A., and Inkpen, D. 2006. [http://www.site.uottawa.ca/~mdislam/publications/LREC_06_242.pdf Second order co-occurrence pmi for determining the semantic similarity of words]. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006) 1033–1038.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M. 2003. [http://www.arxiv.org/pdf/1204.0140 Roget’s thesaurus as a Lexical Resource for Natural Language Processing]. Ph.D. Dissertation, Ottawa Carleton Institute for Computer Science, School of Information Technology and Engineering, University of Ottawa.&lt;br /&gt;
&lt;br /&gt;
Jiang, Jay J. and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics (ROCLING X), Taiwan, pages 19–33, 1997.&lt;br /&gt;
&lt;br /&gt;
Landauer, T. K.; Laham, D.; Rehder, B.; and Schreiner, M. E. 1997. How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans.&lt;br /&gt;
&lt;br /&gt;
Leacock, Claudia and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 265–283, 1998.&lt;br /&gt;
&lt;br /&gt;
Lin, Dekang. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, pages 296–304, 1998.&lt;br /&gt;
&lt;br /&gt;
Luong, Minh-Thang, Richard Socher, and Christopher D. Manning. (2013). [http://nlp.stanford.edu/~lmthang/data/papers/conll13_morpho.pdf Better word representations with recursive neural networks for morphology]. CoNLL-2013: 104.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., D. Jurgens and R. Navigli. [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity]. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pp. 1341-1351.&lt;br /&gt;
&lt;br /&gt;
Radinsky, Kira, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. (2011). [http://gabrilovich.com/publications/papers/Radinsky2011WTS.pdf A word at a time: computing word relatedness using temporal semantic analysis]. In Proceedings of the 20th international conference on World wide web, pp. 337-346. ACM.&lt;br /&gt;
&lt;br /&gt;
Resnik, Philip. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448–453, Montreal, Canada, 1995.&lt;br /&gt;
&lt;br /&gt;
Salle A., Idiart M., and Villavicencio A. (2018) [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&lt;br /&gt;
&lt;br /&gt;
Speer, Rob, Joshua Chin and Catherine Havasi. (2017). [http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972 ConceptNet 5.5: An Open Multilingual Graph of General Knowledge]. Proceedings of The 31st AAAI Conference on Artificial Intelligence, San Francisco, CA.&lt;br /&gt;
&lt;br /&gt;
Strube, Michael and Simone Paolo Ponzetto. (2006). [http://www.aaai.org/Papers/AAAI/2006/AAAI06-223.pdf WikiRelate! Computing Semantic Relatedness Using Wikipedia]. Proceedings of The 21st National Conference on Artificial Intelligence (AAAI), Boston, MA.&lt;br /&gt;
&lt;br /&gt;
Yih, W. and Qazvinian, V. (2012). [http://aclweb.org/anthology/N/N12/N12-1077.pdf Measuring Word Relatedness Using Heterogeneous Vector Space Models]. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012).&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=SimLex-999_(State_of_the_art)&amp;diff=12601</id>
		<title>SimLex-999 (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=SimLex-999_(State_of_the_art)&amp;diff=12601"/>
		<updated>2019-08-12T23:00:55Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.cl.cam.ac.uk/~fh295/simlex.html SimLex-999] aims at a cleaner benchmark of similarity (but not relatedness). Pairs of words were chosen to represent different ranges of similarity and with either high or low association. Subjects were instructed to differentiate between similarity and relatedness and rate regarding the former only.&lt;br /&gt;
&lt;br /&gt;
See also: [[Similarity (State of the art)]], [[Similar-Associated-Both Test Collection (State of the art)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm !! Reference for algorithm !! Reference for reported results  !! Type !! Spearman&#039;s rho !! Pearson&#039;s r !! Notes&lt;br /&gt;
|-&lt;br /&gt;
| Re16&lt;br /&gt;
| Recski et al. (2016)&amp;lt;ref name=recski16&amp;gt;Recski, G., Iklódi, E., Pajkossy, K., &amp;amp; Kornai, A. (2016). [https://www.aclweb.org/anthology/W16-1622 Measuring semantic similarity of words using concept networks]. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 193-200.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Recski et al. (2016)&amp;lt;ref name=recski16/&amp;gt;&lt;br /&gt;
| Hybrid || 0.76 || -&lt;br /&gt;
|-&lt;br /&gt;
| SVR4&lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| Combined || 0.642 || 0.658&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19&amp;gt;Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Hybrid || 0.621 || 0.481&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&amp;lt;ref name=speer17&amp;gt;Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. AAAI-17, pp. 4444-4451.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Hybrid || 0.616 || 0.634&lt;br /&gt;
|-&lt;br /&gt;
| joint(SP+,skip-gram)&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus&amp;gt;Schwartz, R., Reichart, Roi, Rappoport, A. (2015). Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction, CoNLL 2015.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus/&amp;gt;&lt;br /&gt;
| Distributional || 0.56 || - || Trained on word2vec corpus, best results for pure distributional model.&lt;br /&gt;
|-&lt;br /&gt;
| UMBC&lt;br /&gt;
| Han et al. (2013)&amp;lt;ref&amp;gt;Han, L., Kashyap, A., Finin, T., Mayfield, J., Weese, J.: UMBC EBIQUITY-CORE: Semantic textual similarity systems. In: Proceedings of the Second Joint Conference on Lexical and Computational Semantics, vol. 1, pp. 44–52 (2013)&amp;lt;/ref&amp;gt; &lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| || 0.558 || 0.557 || without using POS information&lt;br /&gt;
|-&lt;br /&gt;
| SP+&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus/&amp;gt;&lt;br /&gt;
| Schwartz et al. (2015)&amp;lt;ref name=spplus/&amp;gt;&lt;br /&gt;
| Distributional || 0.52 || -&lt;br /&gt;
|-&lt;br /&gt;
| RNNenc&lt;br /&gt;
| Hill et al. (2014b)&amp;lt;ref name=rnnenc&amp;gt;Hill, F., Cho, K., Jean, S., Devin, C., &amp;amp; Bengio, Y. (2014b). Not All Neural Embeddings are Born Equal, 1–5.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014b)&amp;lt;ref name=rnnenc/&amp;gt;&lt;br /&gt;
| Distributional, multilingual || 0.52 || -&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&amp;lt;ref name=salle18&amp;gt;Salle A., Idiart M., and Villavicencio A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&amp;lt;/ref&amp;gt; &lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Distributional || 0.417 || 0.426&lt;br /&gt;
|-&lt;br /&gt;
| Word2vec&lt;br /&gt;
| Mikolov et al. (2013)&amp;lt;ref&amp;gt;Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of International Conference of Learning Representations, Scottsdale, Arizona, USA.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014a)&amp;lt;ref name=simlex&amp;gt;Hill, F., Reichart, R., &amp;amp; Korhonen, A. (2014a). SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computation and Language.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Distributional || 0.414 || - || Trained on Wikipedia&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&amp;lt;ref name=pennington14&amp;gt;Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. EMNLP 2014, pp. 1532-1543.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Distributional || 0.406 || 0.433&lt;br /&gt;
|-&lt;br /&gt;
| Lesk&lt;br /&gt;
| &lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea&amp;gt;Banjade, R., Maharjan, N., Niraula, N., Rus, V., &amp;amp; Gautam, D. (2015). Lemon and Tea Are Not Similar: Measuring Word-to-Word Similarity by Combining Different Methods. Computational Linguistics and Intelligent Text Processing, 9041, 335–346. doi:10.1007/978-3-319-18111-0_25&amp;lt;/ref&amp;gt;&lt;br /&gt;
| || 0.404 || 0.347&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Dobó (2019)&amp;lt;ref name=dobo19/&amp;gt;&lt;br /&gt;
| Distributional || 0.393 || 0.401&lt;br /&gt;
|-&lt;br /&gt;
| ESA&lt;br /&gt;
| &lt;br /&gt;
| Banjade et al. (2015)&amp;lt;ref name=lemontea/&amp;gt;&lt;br /&gt;
| || 0.271 || 0.145&lt;br /&gt;
|-&lt;br /&gt;
| Neural language model&lt;br /&gt;
| Collobert &amp;amp; Weston (2008)&amp;lt;ref&amp;gt;R. Collobert and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, ICML.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014a)&amp;lt;ref name=simlex/&amp;gt; &lt;br /&gt;
| Distributional || 0.268 || - || Trained on Wikipedia&lt;br /&gt;
|-&lt;br /&gt;
| Neural language model with global context&lt;br /&gt;
| Huang et al. (2012)&amp;lt;ref&amp;gt;Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 873–882. Association for Computational Linguistics.&amp;lt;/ref&amp;gt;&lt;br /&gt;
| Hill et al. (2014a)&amp;lt;ref name=simlex/&amp;gt; &lt;br /&gt;
| Distributional || 0.098 || - || Trained on Wikipedia&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=12600</id>
		<title>MC-28 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=12600"/>
		<updated>2019-08-12T22:54:15Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* state of the art on the Miller &amp;amp; Charles 28 (MC-28) dataset [Resnik, 1995]&lt;br /&gt;
* the 28 word pairs of the original Miller &amp;amp; Charles 30 (MC-30) dataset [Miller and Charles, 1991], itself a subset of the [[RG-65 Test Collection (State of the art)|Rubenstein &amp;amp; Goodenough (RG-65) dataset]]; two word pairs are generally omitted from semantic similarity evaluation because their words were absent from earlier versions of WordNet&lt;br /&gt;
* Similarity of each pair is scored on a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number)&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 38 subjects [Miller and Charles, 1991].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] [with 95% confidence intervals]&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Human upper bound&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Human&lt;br /&gt;
| 0.934 [0.861, 0.969]&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.92 [0.833, 0.962]&lt;br /&gt;
|-&lt;br /&gt;
| Gloss Vector&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.91 [0.813, 0.957]&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.893 [0.780, 0.949]&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.892 [0.778, 0.949]&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.87 [0.736, 0.938]&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.856 [0.710, 0.931]&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.853 [0.704, 0.930]&lt;br /&gt;
|-&lt;br /&gt;
| KC&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.835 [0.671, 0.921]&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.832 [0.666, 0.919]&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.822 [0.648, 0.914]&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.82 [0.644, 0.913]&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81 [0.627, 0.908]&lt;br /&gt;
|-&lt;br /&gt;
| DC13&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.773 [0.562, 0.889]&lt;br /&gt;
|-&lt;br /&gt;
| GM&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.72 [0.475, 0.861]&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.70 [0.443, 0.850]&lt;br /&gt;
|-&lt;br /&gt;
| SH&lt;br /&gt;
| Sahami and Heilman (2006)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.618 [0.319, 0.805]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A. (2009). [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. In: &#039;&#039;10th Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 19–27.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, E., and Markovitch, S. (2007). [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], &#039;&#039;Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI)&#039;&#039;, Hyderabad, India.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Kulkarni, S., Caragea, D. (2009). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4693&amp;amp;rep=rep1&amp;amp;type=pdf Computation of the Semantic Relatedness between Words using Concept Clouds]. In: &#039;&#039;International Conference on Knowledge Discovery and Information Retrieval&#039;&#039;. INSTICC Press, Setubal. pp. 183–188.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://webdocs.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. In &#039;&#039;Proceedings of the 15th International Conference on Machine Learning&#039;&#039;, Madison, WI, pp. 296–304.&lt;br /&gt;
&lt;br /&gt;
Miller, G., and Charles, W. (1991). [http://www.tandfonline.com/doi/abs/10.1080/01690969108406936#.VjUdmjZRGUk Contextual correlates of semantic similarity]. &#039;&#039;Language and Cognitive Processes&#039;&#039;, 6(1), 1–28.&lt;br /&gt;
&lt;br /&gt;
Milne, D., Witten, I.H. (2008). [http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links], In &#039;&#039;Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy&#039;&#039;, AAAI Press, Chicago, USA pp. 25-30.&lt;br /&gt;
&lt;br /&gt;
Patwardhan, S., and Pedersen, T. (2006). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6642&amp;amp;rep=rep1&amp;amp;type=pdf#page=7 Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts]. In: &#039;&#039;11th Conference of the European Chapter of the Association for Computational Linguistics&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 1–8.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://arxiv.org/pdf/cmp-lg/9511007 Using information content to evaluate semantic similarity]. In &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence&#039;&#039;, Montreal, Canada, pages 448–453.&lt;br /&gt;
&lt;br /&gt;
Sahami, M., Heilman, T.D. (2006). [http://robotics.stanford.edu/users/sahami/papers-dir/www2006.pdf A web-based kernel function for measuring the similarity of short text snippets]. In: &#039;&#039;15th international conference on World Wide Web&#039;&#039;. ACM Press, New York. pp. 377–386.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039; 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)&amp;diff=12599</id>
		<title>RG-65 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)&amp;diff=12599"/>
		<updated>2019-08-12T22:53:54Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the Rubenstein &amp;amp; Goodenough (RG-65) dataset&lt;br /&gt;
* 65 word pairs; &lt;br /&gt;
* Similarity of each pair is scored according to a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number);&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 51 subjects [Rubenstein and Goodenough, 1965].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar and Navigli (2015)&lt;br /&gt;
| Pilehvar and Navigli (2015)&lt;br /&gt;
| Knowledge-based (Wiktionary)&lt;br /&gt;
| 0.920&lt;br /&gt;
| 0.910&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.901&lt;br /&gt;
| 0.896&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.899&lt;br /&gt;
| 0.914&lt;br /&gt;
|-&lt;br /&gt;
| Y&amp;amp;Q&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.890&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| NASARI&lt;br /&gt;
| Camacho-Collados et al. (2015)&lt;br /&gt;
| Camacho-Collados et al. (2015)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.880&lt;br /&gt;
| 0.910&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Knowledge-based (WordNet)&lt;br /&gt;
| 0.868&lt;br /&gt;
| 0.810&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Hughes and Ramage (2007)&lt;br /&gt;
| Hughes and Ramage (2007)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.838&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| SSA&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.833&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.830&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| H&amp;amp;S&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.732&lt;br /&gt;
|-&lt;br /&gt;
| Roget&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.804&lt;br /&gt;
| 0.818&lt;br /&gt;
|-&lt;br /&gt;
| J&amp;amp;C&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.804&lt;br /&gt;
| 0.731&lt;br /&gt;
|-&lt;br /&gt;
| WNE&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.801&lt;br /&gt;
| 0.787&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.797&lt;br /&gt;
| 0.852&lt;br /&gt;
|-&lt;br /&gt;
| Lin&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.788&lt;br /&gt;
| 0.834&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.769&lt;br /&gt;
| 0.770&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.763&lt;br /&gt;
| 0.792&lt;br /&gt;
|-&lt;br /&gt;
| ESA*&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based &lt;br /&gt;
| 0.749&lt;br /&gt;
| 0.716&lt;br /&gt;
|-&lt;br /&gt;
| SOCPMI*&lt;br /&gt;
| Islam and Inkpen (2006)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.741&lt;br /&gt;
| 0.729&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.732&lt;br /&gt;
| 0.737&lt;br /&gt;
|-&lt;br /&gt;
| Resnik&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.731&lt;br /&gt;
| 0.800&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.640&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| LSA*&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.609&lt;br /&gt;
| 0.644&lt;br /&gt;
|-&lt;br /&gt;
| WikiRelate&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| -&lt;br /&gt;
| 0.530&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note: values reported by Hassan and Mihalcea (2011) are &amp;quot;based on the collected raw data from the respective authors&amp;quot;, and those marked with (*) are re-implementations.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, Eneko, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa: [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. HLT-NAACL 2009: 19-27&lt;br /&gt;
&lt;br /&gt;
Camacho-Collados, José, Pilehvar, Mohammad Taher, and Navigli, Roberto: [http://aclweb.org/anthology/N/N15/N15-1059.pdf NASARI: a Novel Approach to a Semantically-Aware Representation of Items]. NAACL 2015, pp. 567-577, Denver, USA.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, Evgeniy, and Shaul Markovitch, [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007.&lt;br /&gt;
&lt;br /&gt;
Hassan, Samer, and Rada Mihalcea: [http://www.cse.unt.edu/~rada/papers/hassan.aaai11.pdf Semantic Relatedness Using Salient Semantic Analysis]. AAAI 2011.&lt;br /&gt;
&lt;br /&gt;
Hirst, Graeme and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 305–332, 1998.&lt;br /&gt;
&lt;br /&gt;
Hughes, Thad, and Daniel Ramage: Lexical Semantic Relatedness with Random Graph Walks. EMNLP-CoNLL 2007: 581-589.&lt;br /&gt;
&lt;br /&gt;
Islam, A., and Inkpen, D. 2006. [http://www.site.uottawa.ca/~mdislam/publications/LREC_06_242.pdf Second order co-occurrence pmi for determining the semantic similarity of words]. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006) 1033–1038.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M. 2003. [http://www.arxiv.org/pdf/1204.0140 Roget’s Thesaurus as a Lexical Resource for Natural Language Processing]. Ph.D. Dissertation, Ottawa-Carleton Institute for Computer Science, School of Information Technology and Engineering, University of Ottawa.&lt;br /&gt;
&lt;br /&gt;
Jiang, Jay J. and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics (ROCLING X), Taiwan, pages 19–33, 1997.&lt;br /&gt;
&lt;br /&gt;
Landauer, T. K.; Laham, D.; Rehder, B.; and Schreiner, M. E. 1997. How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans.&lt;br /&gt;
&lt;br /&gt;
Leacock, Claudia and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 265–283, 1998.&lt;br /&gt;
&lt;br /&gt;
Lin, Dekang. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, pages 296–304, 1998.&lt;br /&gt;
&lt;br /&gt;
Milne, David, and Ian H. Witten, An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links, In Proceedings of AAAI 2008.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., Jurgens, D. and Navigli, R. [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity]. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pp. 1341-1351.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T. and Navigli, R. [http://www.sciencedirect.com/science/article/pii/S000437021500106X From Senses to Texts: An All-in-one Graph-based Approach for Measuring Semantic Similarity]. Artificial Intelligence, Elsevier.&lt;br /&gt;
&lt;br /&gt;
Resnik, Philip. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448–453, Montreal, Canada, 1995.&lt;br /&gt;
&lt;br /&gt;
Rubenstein, Herbert, and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Strube, Michael, Simone Paolo Ponzetto: WikiRelate! Computing Semantic Relatedness Using Wikipedia. AAAI 2006: 1419-1424&lt;br /&gt;
&lt;br /&gt;
Yih, W. and Qazvinian, V. (2012). [http://aclweb.org/anthology/N/N12/N12-1077.pdf Measuring Word Relatedness Using Heterogeneous Vector Space Models]. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012).&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=12598</id>
		<title>TOEFL Synonym Questions (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=12598"/>
		<updated>2019-08-12T22:53:19Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;the TOEFL questions are available on request by contacting [http://lsa.colorado.edu/mail_sub.html LSA Support at CU Boulder]&#039;&#039;&#039;, the people who manage the [http://lsa.colorado.edu/ LSA web site at Colorado]&lt;br /&gt;
* TOEFL = Test of English as a Foreign Language&lt;br /&gt;
* 80 multiple-choice synonym questions; 4 choices per question&lt;br /&gt;
* introduced in Landauer and Dumais (1997) as a way of evaluating algorithms for measuring degree of similarity between words&lt;br /&gt;
* subsequently used by many other researchers&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Sample question ==&lt;br /&gt;
&lt;br /&gt;
::{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;1&amp;quot; cellspacing=&amp;quot;1&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Stem:&lt;br /&gt;
|&lt;br /&gt;
| levied&lt;br /&gt;
|-&lt;br /&gt;
! Choices:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (b)&lt;br /&gt;
| believed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (c)&lt;br /&gt;
| requested&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (d)&lt;br /&gt;
| correlated&lt;br /&gt;
|-&lt;br /&gt;
! Solution:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for experiment&lt;br /&gt;
! Type&lt;br /&gt;
! Correct&lt;br /&gt;
! 95% confidence&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 20.31%&lt;br /&gt;
| 12.89–31.83%&lt;br /&gt;
|-&lt;br /&gt;
| LC&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 21.88%&lt;br /&gt;
| 13.91–33.21%&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 24.06%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| Random&lt;br /&gt;
| Random guessing&lt;br /&gt;
| 1 / 4 = 25.00%&lt;br /&gt;
| Random&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| JC&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 64.38%&lt;br /&gt;
| 52.90–74.80%&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Average non-native English-speaking US college applicant&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Human&lt;br /&gt;
| 64.50%&lt;br /&gt;
| 53.01–74.88%&lt;br /&gt;
|-&lt;br /&gt;
| RI&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 72.50%&lt;br /&gt;
| 61.38–81.90%&lt;br /&gt;
|-&lt;br /&gt;
| DS&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.00%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.75%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PairClass&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 76.25%&lt;br /&gt;
| 65.42–85.06%&lt;br /&gt;
|-&lt;br /&gt;
| HSO&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 77.91%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 78.75%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 80.00%&lt;br /&gt;
| 69.56–88.11%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| LC-IR&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| CWO&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 82.55%&lt;br /&gt;
| 72.38–90.09%&lt;br /&gt;
|-&lt;br /&gt;
| PPMIC&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 85.00%&lt;br /&gt;
| 75.26–92.00%&lt;br /&gt;
|-&lt;br /&gt;
| GLSA&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 86.25%&lt;br /&gt;
| 76.73–92.93%&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 87.50%&lt;br /&gt;
| 78.21–93.84%&lt;br /&gt;
|-&lt;br /&gt;
| DC13&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 88.75%&lt;br /&gt;
| 79.72–94.72%&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 90.00%&lt;br /&gt;
| 81.24–95.58%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 92.50%&lt;br /&gt;
| 84.39–97.20%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 95.00%&lt;br /&gt;
| 87.69–98.62%&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| WordNet graph-based (unsupervised)&lt;br /&gt;
| 96.25%&lt;br /&gt;
| 89.43–99.22%&lt;br /&gt;
|-&lt;br /&gt;
| PR&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 97.50%&lt;br /&gt;
| 91.26–99.70%&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 98.75%&lt;br /&gt;
| 93.23–99.97%&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 98.75%&lt;br /&gt;
| 93.23–99.97%&lt;br /&gt;
|-&lt;br /&gt;
| PCCP&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 100.00%&lt;br /&gt;
| 96.32–100.00%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Explanation of table ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Algorithm&#039;&#039;&#039; = name of algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for algorithm&#039;&#039;&#039; = where to find out more about given algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for experiment&#039;&#039;&#039; = where to find out more about evaluation of given algorithm with TOEFL questions&lt;br /&gt;
* &#039;&#039;&#039;Type&#039;&#039;&#039; = general type of algorithm: corpus-based, lexicon-based, hybrid&lt;br /&gt;
* &#039;&#039;&#039;Correct&#039;&#039;&#039; = percent of 80 questions that given algorithm answered correctly&lt;br /&gt;
* &#039;&#039;&#039;95% confidence&#039;&#039;&#039; = confidence interval calculated using the [[Statistical calculators|Binomial Exact Test]]&lt;br /&gt;
* table rows sorted in order of increasing percent correct&lt;br /&gt;
* several WordNet-based similarity measures are implemented in [http://www.d.umn.edu/~tpederse/ Ted Pedersen]&#039;s [http://www.d.umn.edu/~tpederse/similarity.html WordNet::Similarity] package&lt;br /&gt;
* LSA = Latent Semantic Analysis&lt;br /&gt;
* PCCP = Principal Component vectors with Caron P&lt;br /&gt;
* PMI-IR = Pointwise Mutual Information - Information Retrieval&lt;br /&gt;
* PR = Product Rule&lt;br /&gt;
* PPMIC = Positive Pointwise Mutual Information with Cosine&lt;br /&gt;
* GLSA = Generalized Latent Semantic Analysis&lt;br /&gt;
* CWO = Context Window Overlapping&lt;br /&gt;
* DS = Dependency Space&lt;br /&gt;
* RI = Random Indexing&lt;br /&gt;
&lt;br /&gt;
== Notes ==&lt;br /&gt;
&lt;br /&gt;
* the performance of a corpus-based algorithm depends on the corpus, so the difference in performance between two corpus-based systems may be due to the different corpora, rather than the different algorithms&lt;br /&gt;
* the TOEFL questions include nouns, verbs, and adjectives, but some of the WordNet-based algorithms were only designed to work with nouns; this explains some of the lower scores&lt;br /&gt;
* some of the algorithms may have been tuned on the TOEFL questions; read the references for details&lt;br /&gt;
* Landauer and Dumais (1997) report scores that were corrected for guessing by subtracting a penalty of 1/3 for each incorrect answer; they report a score of 52.5% when this penalty is applied; when the penalty is removed, their performance is 64.4% correct&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2007). [http://www.cs.bham.ac.uk/~jxb/PUBS/BRM.pdf Extracting semantic representations from word co-occurrence statistics: A computational study]. &#039;&#039;Behavior Research Methods&#039;&#039;, 39(3), 510-526.&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2012). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.9582&amp;amp;rep=rep1&amp;amp;type=pdf Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD]. &#039;&#039;Behavior Research Methods&#039;&#039;,  44(3):890-907.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502&lt;br /&gt;
&lt;br /&gt;
Lushan Han. (2014). [http://ebiquity.umbc.edu/paper/html/id/658/Schema-Free-Querying-of-Semantic-Data Schema Free Querying of Semantic Data], Ph.D. dissertation, University of Maryland, Baltimore County, Baltimore, MD USA.&lt;br /&gt;
&lt;br /&gt;
Higgins, D. (2005). [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.329.1517 Which Statistics Reflect Semantics? Rethinking Synonymy and Word Similarity.] In: Kepser, S., Reis, M. (eds.) &#039;&#039;Linguistic Evidence: Empirical, Theoretical and Computational Perspectives&#039;&#039;. Mouton de Gruyter, Berlin, pp. 265–284.&lt;br /&gt;
&lt;br /&gt;
Hirst, G., and St-Onge, D. (1998). [http://mirror.eacoss.org/documentation/ITLibrary/IRIS/Data/1997/Hirst/Lexical/1997-Hirst-Lexical.pdf Lexical chains as representations of context for the detection and correction of malapropisms]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, 305-332.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Jiang, J.J., and Conrath, D.W. (1997). [http://wortschatz.uni-leipzig.de/~sbordag/aalw05/Referate/03_Assoziationen_BudanitskyResnik/Jiang_Conrath_97.pdf Semantic similarity based on corpus statistics and lexical taxonomy]. &#039;&#039;Proceedings of the International Conference on Research in Computational Linguistics&#039;&#039;, Taiwan.&lt;br /&gt;
&lt;br /&gt;
Karlgren, J. and Sahlgren, M. (2001). [http://www.sics.se/~jussi/Artiklar/2001_RWIbook/KarlgrenSahlgren2001.pdf From Words to Understanding]. In Uesaka, Y., Kanerva, P., &amp;amp; Asoh, H. (Eds.), &#039;&#039;Foundations of Real-World Intelligence&#039;&#039;, Stanford: CSLI Publications, pp. 294–308. &lt;br /&gt;
&lt;br /&gt;
Landauer, T.K., and Dumais, S.T. (1997). [http://lsa.colorado.edu/papers/plato/plato.annote.html A solution to Plato&#039;s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge]. &#039;&#039;Psychological Review&#039;&#039;, 104(2):211–240.&lt;br /&gt;
&lt;br /&gt;
Leacock, C., and Chodorow, M. (1998). [http://books.google.ca/books?id=Rehu8OOzMIMC&amp;amp;lpg=PA265&amp;amp;ots=IpnaLkZUec&amp;amp;lr&amp;amp;pg=PA265#v=onepage&amp;amp;q&amp;amp;f=false Combining local context and WordNet similarity for word sense identification]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, pp. 265-283.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://www.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. &#039;&#039;Proceedings of the 15th International Conference on Machine Learning (ICML-98)&#039;&#039;, Madison, WI, pp. 296-304.&lt;br /&gt;
&lt;br /&gt;
Matveeva, I., Levow, G., Farahat, A., and Royer, C. (2005). [http://people.cs.uchicago.edu/~matveeva/SynGLSA_ranlp_final.pdf Generalized latent semantic analysis for term representation]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Pado, S., and Lapata, M. (2007). [http://www.nlpado.de/~sebastian/pub/papers/cl07_pado.pdf Dependency-based construction of semantic space models]. &#039;&#039;Computational Linguistics&#039;&#039;, 33(2), 161-199.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., Jurgens, D., and Navigli, R. (2013). [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, disambiguate and walk: A unified approach for measuring semantic similarity]. &#039;&#039;Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013)&#039;&#039;, Sofia, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Rapp, R. (2003). [http://www.amtaweb.org/summit/MTSummit/FinalPapers/19-Rapp-final.pdf Word sense discovery based on sense descriptor dissimilarity]. &#039;&#039;Proceedings of the Ninth Machine Translation Summit&#039;&#039;, pp. 315-322.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://citeseer.ist.psu.edu/resnik95using.html Using information content to evaluate semantic similarity]. &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95)&#039;&#039;, Montreal, pp. 448-453.&lt;br /&gt;
&lt;br /&gt;
Ruiz-Casado, M., Alfonseca, E., and Castells, P. (2005). [http://alfonseca.org/pubs/2005-ranlp1.pdf Using context-window overlapping in Synonym Discovery and Ontology Extension]. &#039;&#039;Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP-2005)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Terra, E., and Clarke, C.L.A. (2003). [http://acl.ldc.upenn.edu/N/N03/N03-1032.pdf Frequency estimates for statistical word similarity measures]. &#039;&#039;Proceedings of the Human Language Technology and North American Chapter of Association of Computational Linguistics Conference 2003 (HLT/NAACL 2003)&#039;&#039;, pp. 244–251.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2001). [http://arxiv.org/abs/cs.LG/0212033 Mining the Web for synonyms: PMI-IR versus LSA on TOEFL]. &#039;&#039;Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001)&#039;&#039;, Freiburg, Germany, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D., Littman, M.L., Bigham, J., and Shnayder, V. (2003). [http://arxiv.org/abs/cs.CL/0309035 Combining independent modules to solve multiple-choice synonym and analogy problems]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, pp. 482-489.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2008). [http://arxiv.org/abs/0809.0124 A uniform approach to analogies, synonyms, antonyms, and associations]. &#039;&#039;Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)&#039;&#039;, Manchester, UK, pp. 905-912.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=12597</id>
		<title>MC-28 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=12597"/>
		<updated>2019-08-12T22:20:54Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* state of the art on the Miller &amp;amp; Charles 28 (MC-28) dataset [Resnik, 1995]&lt;br /&gt;
* 28 word pairs from the original Miller &amp;amp; Charles 30 (MC-30) dataset [Miller and Charles, 1991], which is a subset of the [[RG-65 Test Collection (State of the art)|Rubenstein &amp;amp; Goodenough (RG-65) dataset]]; two word pairs have generally been omitted for semantic similarity evaluation, as their words were not included in earlier versions of WordNet&lt;br /&gt;
* Similarity of each pair is scored according to a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number);&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 38 subjects [Miller and Charles, 1991].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] [with 95% confidence intervals]&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Human upper bound&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Human&lt;br /&gt;
| 0.934 [0.861, 0.969]&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.92 [0.833, 0.962]&lt;br /&gt;
|-&lt;br /&gt;
| Gloss Vector&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.91 [0.813, 0.957]&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.893 [0.780, 0.949]&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.892 [0.778, 0.949]&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.87 [0.736, 0.938]&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.856 [0.710, 0.931]&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.853 [0.704, 0.930]&lt;br /&gt;
|-&lt;br /&gt;
| KC&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.835 [0.671, 0.921]&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.832 [0.666, 0.919]&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.822 [0.648, 0.914]&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.82 [0.644, 0.913]&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81 [0.627, 0.908]&lt;br /&gt;
|-&lt;br /&gt;
| DC13&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.773 [0.562, 0.889]&lt;br /&gt;
|-&lt;br /&gt;
| GM&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.72 [0.475, 0.861]&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.70 [0.443, 0.850]&lt;br /&gt;
|-&lt;br /&gt;
| SH&lt;br /&gt;
| Sahami and Heilman (2006)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.618 [0.319, 0.805]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A. (2009). [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. In: &#039;&#039;10th Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 19–27.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, E., and Markovitch, S. (2007). [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], &#039;&#039;Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI)&#039;&#039;, Hyderabad, India.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Kulkarni, S., Caragea, D. (2009). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4693&amp;amp;rep=rep1&amp;amp;type=pdf Computation of the Semantic Relatedness between Words using Concept Clouds]. In: &#039;&#039;International Conference on Knowledge Discovery and Information Retrieval&#039;&#039;. INSTICC Press, Setubal. pp. 183–188.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://webdocs.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. In &#039;&#039;Proceedings of the 15th International Conference on Machine Learning&#039;&#039;, Madison, WI, pp. 296–304.&lt;br /&gt;
&lt;br /&gt;
Miller, G., and Charles, W. (1991). [http://www.tandfonline.com/doi/abs/10.1080/01690969108406936#.VjUdmjZRGUk Contextual correlates of semantic similarity]. &#039;&#039;Language and Cognitive Processes&#039;&#039;, 6(1), 1–28.&lt;br /&gt;
&lt;br /&gt;
Milne, D., Witten, I.H. (2008). [http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links], In &#039;&#039;Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy&#039;&#039;, AAAI Press, Chicago, USA pp. 25-30.&lt;br /&gt;
&lt;br /&gt;
Patwardhan, S., and Pedersen, T. (2006). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6642&amp;amp;rep=rep1&amp;amp;type=pdf#page=7 Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts]. In: &#039;&#039;11th Conference of the European Chapter of the Association for Computational Linguistics&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 1–8.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://arxiv.org/pdf/cmp-lg/9511007 Using information content to evaluate semantic similarity]. In &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence&#039;&#039;, Montreal, Canada, pages 448–453.&lt;br /&gt;
&lt;br /&gt;
Sahami, M., Heilman, T.D. (2006). [http://robotics.stanford.edu/users/sahami/papers-dir/www2006.pdf A web-based kernel function for measuring the similarity of short text snippets]. In: &#039;&#039;15th international conference on World Wide Web&#039;&#039;. ACM Press, New York. pp. 377–386.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)&amp;diff=12596</id>
		<title>RG-65 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)&amp;diff=12596"/>
		<updated>2019-08-12T22:09:03Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* state of the art on the Rubenstein &amp;amp; Goodenough (RG-65) dataset&lt;br /&gt;
* 65 word pairs; &lt;br /&gt;
* Similarity of each pair is scored according to a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number);&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 51 subjects [Rubenstein and Goodenough, 1965].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] (ρ)&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient Pearson correlation] (r)&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar and Navigli (2015)&lt;br /&gt;
| Pilehvar and Navigli (2015)&lt;br /&gt;
| Knowledge-based (Wiktionary)&lt;br /&gt;
| 0.920&lt;br /&gt;
| 0.910&lt;br /&gt;
|-&lt;br /&gt;
| Sp17&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.901&lt;br /&gt;
| 0.896&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.899&lt;br /&gt;
| 0.914&lt;br /&gt;
|-&lt;br /&gt;
| Y&amp;amp;Q&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.890&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| NASARI&lt;br /&gt;
| Camacho-Collados et al. (2015)&lt;br /&gt;
| Camacho-Collados et al. (2015)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.880&lt;br /&gt;
| 0.910&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Knowledge-based (WordNet)&lt;br /&gt;
| 0.868&lt;br /&gt;
| 0.810&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Hughes and Ramage (2007)&lt;br /&gt;
| Hughes and Ramage (2007)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.838&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| SSA&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.833&lt;br /&gt;
| 0.861&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.830&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| H&amp;amp;S&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.813&lt;br /&gt;
| 0.732&lt;br /&gt;
|-&lt;br /&gt;
| Roget&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.804&lt;br /&gt;
| 0.818&lt;br /&gt;
|-&lt;br /&gt;
| J&amp;amp;C&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.804&lt;br /&gt;
| 0.731&lt;br /&gt;
|-&lt;br /&gt;
| WNE&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.801&lt;br /&gt;
| 0.787&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.797&lt;br /&gt;
| 0.852&lt;br /&gt;
|-&lt;br /&gt;
| Lin&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.788&lt;br /&gt;
| 0.834&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.769&lt;br /&gt;
| 0.770&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.763&lt;br /&gt;
| 0.792&lt;br /&gt;
|-&lt;br /&gt;
| ESA*&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based &lt;br /&gt;
| 0.749&lt;br /&gt;
| 0.716&lt;br /&gt;
|-&lt;br /&gt;
| SOCPMI*&lt;br /&gt;
| Islam and Inkpen (2006)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.741&lt;br /&gt;
| 0.729&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.732&lt;br /&gt;
| 0.737&lt;br /&gt;
|-&lt;br /&gt;
| Resnik&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.731&lt;br /&gt;
| 0.800&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.640&lt;br /&gt;
| -&lt;br /&gt;
|-&lt;br /&gt;
| LSA*&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.609&lt;br /&gt;
| 0.644&lt;br /&gt;
|-&lt;br /&gt;
| WikiRelate&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| -&lt;br /&gt;
| 0.530&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note: values reported by Hassan and Mihalcea (2011) are &amp;quot;based on the collected raw data from the respective authors&amp;quot;, and those marked with (*) are re-implementations.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., and Soroa, A. (2009). [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. &#039;&#039;HLT-NAACL 2009&#039;&#039;, pp. 19-27.&lt;br /&gt;
&lt;br /&gt;
Camacho-Collados, J., Pilehvar, M.T., and Navigli, R. (2015). [http://aclweb.org/anthology/N/N15/N15-1059.pdf NASARI: a Novel Approach to a Semantically-Aware Representation of Items]. &#039;&#039;NAACL 2015&#039;&#039;, Denver, USA, pp. 567-577.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, E., and Markovitch, S. (2007). [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis]. &#039;&#039;Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI)&#039;&#039;, Hyderabad, India.&lt;br /&gt;
&lt;br /&gt;
Hassan, S., and Mihalcea, R. (2011). [http://www.cse.unt.edu/~rada/papers/hassan.aaai11.pdf Semantic Relatedness Using Salient Semantic Analysis]. &#039;&#039;AAAI 2011&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Hirst, G., and St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, pp. 305-332.&lt;br /&gt;
&lt;br /&gt;
Hughes, T., and Ramage, D. (2007). Lexical Semantic Relatedness with Random Graph Walks. &#039;&#039;EMNLP-CoNLL 2007&#039;&#039;, pp. 581-589.&lt;br /&gt;
&lt;br /&gt;
Islam, A., and Inkpen, D. (2006). [http://www.site.uottawa.ca/~mdislam/publications/LREC_06_242.pdf Second order co-occurrence PMI for determining the semantic similarity of words]. &#039;&#039;Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006)&#039;&#039;, pp. 1033-1038.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M. (2003). [http://www.arxiv.org/pdf/1204.0140 Roget’s thesaurus as a Lexical Resource for Natural Language Processing]. Ph.D. Dissertation, Ottawa Carleton Institute for Computer Science, School of Information Technology and Engineering, University of Ottawa.&lt;br /&gt;
&lt;br /&gt;
Jiang, J.J., and Conrath, D.W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. &#039;&#039;Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X)&#039;&#039;, Taiwan, pp. 19-33.&lt;br /&gt;
&lt;br /&gt;
Landauer, T.K., Laham, D., Rehder, B., and Schreiner, M.E. (1997). How well can passage meaning be derived without using word order? A comparison of latent semantic analysis and humans.&lt;br /&gt;
&lt;br /&gt;
Leacock, C., and Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, pp. 265-283.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). An information-theoretic definition of similarity. &#039;&#039;Proceedings of the 15th International Conference on Machine Learning&#039;&#039;, Madison, WI, pp. 296-304.&lt;br /&gt;
&lt;br /&gt;
Milne, D., and Witten, I.H. (2008). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. &#039;&#039;Proceedings of AAAI 2008&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 Glove: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., Jurgens, D., and Navigli, R. (2013). [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity]. &#039;&#039;Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013)&#039;&#039;, Sofia, Bulgaria, pp. 1341-1351.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., and Navigli, R. (2015). [http://www.sciencedirect.com/science/article/pii/S000437021500106X From Senses to Texts: An All-in-one Graph-based Approach for Measuring Semantic Similarity]. &#039;&#039;Artificial Intelligence&#039;&#039;, Elsevier.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). Using information content to evaluate semantic similarity. &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence&#039;&#039;, Montreal, Canada, pp. 448-453.&lt;br /&gt;
&lt;br /&gt;
Rubenstein, H., and Goodenough, J.B. (1965). Contextual correlates of synonymy. &#039;&#039;Communications of the ACM&#039;&#039;, 8(10), 627-633.&lt;br /&gt;
&lt;br /&gt;
Salle, A., Idiart, M., and Villavicencio, A. (2018). [https://github.com/alexandres/lexvec/blob/master/README.md LexVec].&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 Conceptnet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Strube, M., and Ponzetto, S.P. (2006). WikiRelate! Computing Semantic Relatedness Using Wikipedia. &#039;&#039;AAAI 2006&#039;&#039;, pp. 1419-1424.&lt;br /&gt;
&lt;br /&gt;
Yih, W., and Qazvinian, V. (2012). [http://aclweb.org/anthology/N/N12/N12-1077.pdf Measuring Word Relatedness Using Heterogeneous Vector Space Models]. &#039;&#039;Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012)&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=12595</id>
		<title>TOEFL Synonym Questions (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=12595"/>
		<updated>2019-08-12T21:39:57Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;the TOEFL questions are available on request by contacting [http://lsa.colorado.edu/mail_sub.html LSA Support at CU Boulder]&#039;&#039;&#039;, the people who manage the [http://lsa.colorado.edu/ LSA web site at Colorado]&lt;br /&gt;
* TOEFL = Test of English as a Foreign Language&lt;br /&gt;
* 80 multiple-choice synonym questions; 4 choices per question&lt;br /&gt;
* introduced in Landauer and Dumais (1997) as a way of evaluating algorithms for measuring degree of similarity between words&lt;br /&gt;
* subsequently used by many other researchers&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Sample question ==&lt;br /&gt;
&lt;br /&gt;
::{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;1&amp;quot; cellspacing=&amp;quot;1&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Stem:&lt;br /&gt;
|&lt;br /&gt;
| levied&lt;br /&gt;
|-&lt;br /&gt;
! Choices:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (b)&lt;br /&gt;
| believed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (c)&lt;br /&gt;
| requested&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (d)&lt;br /&gt;
| correlated&lt;br /&gt;
|-&lt;br /&gt;
! Solution:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for experiment&lt;br /&gt;
! Type&lt;br /&gt;
! Correct&lt;br /&gt;
! 95% confidence&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 20.31%&lt;br /&gt;
| 12.89–31.83%&lt;br /&gt;
|-&lt;br /&gt;
| LC&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 21.88%&lt;br /&gt;
| 13.91–33.21%&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 24.06%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| Random&lt;br /&gt;
| Random guessing&lt;br /&gt;
| 1 / 4 = 25.00%&lt;br /&gt;
| Random&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| JC&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 64.38%&lt;br /&gt;
| 52.90–74.80%&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Average non-English US college applicant&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Human&lt;br /&gt;
| 64.50%&lt;br /&gt;
| 53.01–74.88%&lt;br /&gt;
|-&lt;br /&gt;
| RI&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 72.50%&lt;br /&gt;
| 61.38–81.90%&lt;br /&gt;
|-&lt;br /&gt;
| DS&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.00%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.75%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PairClass&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 76.25%&lt;br /&gt;
| 65.42–85.06%&lt;br /&gt;
|-&lt;br /&gt;
| HSO&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 77.91%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 78.75%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| Sa18&lt;br /&gt;
| Salle et al. (2018)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 80.00%&lt;br /&gt;
| 69.56–88.11%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| LC-IR&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| Do19-corpus&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| CWO&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 82.55%&lt;br /&gt;
| 72.38–90.09%&lt;br /&gt;
|-&lt;br /&gt;
| PPMIC&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 85.00%&lt;br /&gt;
| 75.26–92.00%&lt;br /&gt;
|-&lt;br /&gt;
| GLSA&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 86.25%&lt;br /&gt;
| 76.73–92.93%&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 87.50%&lt;br /&gt;
| 78.21–93.84%&lt;br /&gt;
|-&lt;br /&gt;
| DC13&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 88.75%&lt;br /&gt;
| 79.72–94.72%&lt;br /&gt;
|-&lt;br /&gt;
| Pe14&lt;br /&gt;
| Pennington et al. (2014)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 90.00%&lt;br /&gt;
| 81.24–95.58%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 92.50%&lt;br /&gt;
| 84.39–97.20%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 95.00%&lt;br /&gt;
| 87.69–98.62%&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| WordNet graph-based (unsupervised)&lt;br /&gt;
| 96.25%&lt;br /&gt;
| 89.43–99.22%&lt;br /&gt;
|-&lt;br /&gt;
| PR&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 97.50%&lt;br /&gt;
| 91.26–99.70%&lt;br /&gt;
|-&lt;br /&gt;
| Sp19&lt;br /&gt;
| Speer et al. (2017)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 98.75%&lt;br /&gt;
| 93.23–99.97%&lt;br /&gt;
|-&lt;br /&gt;
| Do19-hybrid&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Dobó (2019)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 98.75%&lt;br /&gt;
| 93.23–99.97%&lt;br /&gt;
|-&lt;br /&gt;
| PCCP&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 100.00%&lt;br /&gt;
| 96.32–100.00%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Explanation of table ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Algorithm&#039;&#039;&#039; = name of the algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for algorithm&#039;&#039;&#039; = where to find out more about the given algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for experiment&#039;&#039;&#039; = where to find out more about the evaluation of the given algorithm on the TOEFL questions&lt;br /&gt;
* &#039;&#039;&#039;Type&#039;&#039;&#039; = general type of algorithm: corpus-based, lexicon-based, web-based, or hybrid&lt;br /&gt;
* &#039;&#039;&#039;Correct&#039;&#039;&#039; = percentage of the 80 questions that the given algorithm answered correctly&lt;br /&gt;
* &#039;&#039;&#039;95% confidence&#039;&#039;&#039; = confidence interval calculated using the [[Statistical calculators|Binomial Exact Test]]&lt;br /&gt;
* table rows sorted in order of increasing percent correct&lt;br /&gt;
* several WordNet-based similarity measures are implemented in [http://www.d.umn.edu/~tpederse/ Ted Pedersen]&#039;s [http://www.d.umn.edu/~tpederse/similarity.html WordNet::Similarity] package&lt;br /&gt;
* LSA = Latent Semantic Analysis&lt;br /&gt;
* PCCP = Principal Component vectors with Caron P&lt;br /&gt;
* PMI-IR = Pointwise Mutual Information - Information Retrieval&lt;br /&gt;
* PR = Product Rule&lt;br /&gt;
* PPMIC = Positive Pointwise Mutual Information with Cosine&lt;br /&gt;
* GLSA = Generalized Latent Semantic Analysis&lt;br /&gt;
* CWO = Context Window Overlapping&lt;br /&gt;
* DS = Dependency Space&lt;br /&gt;
* RI = Random Indexing&lt;br /&gt;
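The 95% confidence intervals in the table can be reproduced with the Clopper-Pearson method, the usual "binomial exact" interval. A minimal sketch in Python, assuming SciPy is available (the helper name is ours, not from the wiki):

```python
# Sketch of the Clopper-Pearson ("binomial exact") 95% confidence interval
# used for the "95% confidence" column; exact_binom_ci is our own helper name.
from scipy.stats import beta

def exact_binom_ci(k, n, conf=0.95):
    """Clopper-Pearson interval for k correct answers out of n questions."""
    alpha = 1.0 - conf
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2.0, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1.0 - alpha / 2.0, k + 1, n - k)
    return lower, upper

# Example: PairClass answered 76.25% of the 80 questions correctly (61 of 80).
lo, hi = exact_binom_ci(61, 80)
```

Multiplying the returned bounds by 100 should recover the tabulated percentage ranges.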
&lt;br /&gt;
== Notes ==&lt;br /&gt;
&lt;br /&gt;
* the performance of a corpus-based algorithm depends on the corpus, so the difference in performance between two corpus-based systems may be due to the different corpora, rather than the different algorithms&lt;br /&gt;
* the TOEFL questions include nouns, verbs, and adjectives, but some of the WordNet-based algorithms were only designed to work with nouns; this explains some of the lower scores&lt;br /&gt;
* some of the algorithms may have been tuned on the TOEFL questions; read the references for details&lt;br /&gt;
* Landauer and Dumais (1997) report scores that were corrected for guessing by subtracting a penalty of 1/3 for each incorrect answer; they report a score of 52.5% when this penalty is applied; when the penalty is removed, their performance is 64.4% correct&lt;br /&gt;
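The correction for guessing in the last note is simple arithmetic: subtract a 1/3 penalty for each incorrect answer from the number correct, then divide by the 80 questions. A small illustration of that arithmetic (our own sketch, not from the wiki):

```python
# Landauer and Dumais (1997) corrected TOEFL scores for guessing by
# subtracting a penalty of 1/3 per incorrect answer (illustration only).
def corrected_score(correct, total=80.0):
    incorrect = total - correct
    return (correct - incorrect / 3.0) / total

# A raw score of 64.4% on 80 questions is 51.52 correct answers;
# applying the penalty gives about 0.525, i.e. the reported 52.5%.
raw_correct = 0.644 * 80.0
penalized = corrected_score(raw_correct)
```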
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2007). [http://www.cs.bham.ac.uk/~jxb/PUBS/BRM.pdf Extracting semantic representations from word co-occurrence statistics: A computational study]. &#039;&#039;Behavior Research Methods&#039;&#039;, 39(3), 510-526.&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2012). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.9582&amp;amp;rep=rep1&amp;amp;type=pdf Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD]. &#039;&#039;Behavior Research Methods&#039;&#039;,  44(3):890-907.&lt;br /&gt;
&lt;br /&gt;
Dobó, A. (2019). [http://doktori.bibl.u-szeged.hu/10120/1/AndrasDoboThesis2019.pdf A comprehensive analysis of the parameters in the creation and comparison of feature vectors in distributional semantic models for multiple languages]. University of Szeged.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502&lt;br /&gt;
&lt;br /&gt;
Han, L. (2014). [http://ebiquity.umbc.edu/paper/html/id/658/Schema-Free-Querying-of-Semantic-Data Schema Free Querying of Semantic Data], Ph.D. dissertation, University of Maryland, Baltimore County, Baltimore, MD, USA.&lt;br /&gt;
&lt;br /&gt;
Higgins, D. (2005). [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.329.1517 Which Statistics Reflect Semantics? Rethinking Synonymy and Word Similarity.] In: Kepser, S., Reis, M. (eds.) &#039;&#039;Linguistic Evidence: Empirical, Theoretical and Computational Perspectives&#039;&#039;. Mouton de Gruyter, Berlin, pp. 265–284.&lt;br /&gt;
&lt;br /&gt;
Hirst, G., and St-Onge, D. (1998). [http://mirror.eacoss.org/documentation/ITLibrary/IRIS/Data/1997/Hirst/Lexical/1997-Hirst-Lexical.pdf Lexical chains as representation of context for the detection and correction of malapropisms]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, 305-332.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Jiang, J.J., and Conrath, D.W. (1997). [http://wortschatz.uni-leipzig.de/~sbordag/aalw05/Referate/03_Assoziationen_BudanitskyResnik/Jiang_Conrath_97.pdf Semantic similarity based on corpus statistics and lexical taxonomy]. &#039;&#039;Proceedings of the International Conference on Research in Computational Linguistics&#039;&#039;, Taiwan.&lt;br /&gt;
&lt;br /&gt;
Karlgren, J. and Sahlgren, M. (2001). [http://www.sics.se/~jussi/Artiklar/2001_RWIbook/KarlgrenSahlgren2001.pdf From Words to Understanding]. In Uesaka, Y., Kanerva, P., &amp;amp; Asoh, H. (Eds.), &#039;&#039;Foundations of Real-World Intelligence&#039;&#039;, Stanford: CSLI Publications, pp. 294–308. &lt;br /&gt;
&lt;br /&gt;
Landauer, T.K., and Dumais, S.T. (1997). [http://lsa.colorado.edu/papers/plato/plato.annote.html A solution to Plato&#039;s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge]. &#039;&#039;Psychological Review&#039;&#039;, 104(2):211–240.&lt;br /&gt;
&lt;br /&gt;
Leacock, C., and Chodorow, M. (1998). [http://books.google.ca/books?id=Rehu8OOzMIMC&amp;amp;lpg=PA265&amp;amp;ots=IpnaLkZUec&amp;amp;lr&amp;amp;pg=PA265#v=onepage&amp;amp;q&amp;amp;f=false Combining local context and WordNet similarity for word sense identification]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, pp. 265-283.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://www.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. &#039;&#039;Proceedings of the 15th International Conference on Machine Learning (ICML-98)&#039;&#039;, Madison, WI, pp. 296-304.&lt;br /&gt;
&lt;br /&gt;
Matveeva, I., Levow, G., Farahat, A., and Royer, C. (2005). [http://people.cs.uchicago.edu/~matveeva/SynGLSA_ranlp_final.pdf Generalized latent semantic analysis for term representation]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Pado, S., and Lapata, M. (2007). [http://www.nlpado.de/~sebastian/pub/papers/cl07_pado.pdf Dependency-based construction of semantic space models]. &#039;&#039;Computational Linguistics&#039;&#039;, 33(2), 161-199.&lt;br /&gt;
&lt;br /&gt;
Pennington, J., Socher, R., and Manning, C. (2014). [https://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation]. &#039;&#039;EMNLP 2014&#039;&#039;, pp. 1532-1543.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., Jurgens D., and Navigli R. (2013). [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, disambiguate and walk: A unified approach for measuring semantic similarity]. &#039;&#039;Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013),&#039;&#039; Sofia, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Rapp, R. (2003). [http://www.amtaweb.org/summit/MTSummit/FinalPapers/19-Rapp-final.pdf Word sense discovery based on sense descriptor dissimilarity]. &#039;&#039;Proceedings of the Ninth Machine Translation Summit&#039;&#039;, pp. 315-322.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://citeseer.ist.psu.edu/resnik95using.html Using information content to evaluate semantic similarity]. &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95)&#039;&#039;, Montreal, pp. 448-453.&lt;br /&gt;
&lt;br /&gt;
Ruiz-Casado, M., Alfonseca, E. and Castells, P. (2005) [http://alfonseca.org/pubs/2005-ranlp1.pdf Using context-window overlapping in Synonym Discovery and Ontology Extension]. &#039;&#039;Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP-2005)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Salle A., Idiart M., and Villavicencio A. (2018) [https://github.com/alexandres/lexvec/blob/master/README.md LexVec]&lt;br /&gt;
&lt;br /&gt;
Speer, R., Chin, J., and Havasi, C. (2017). [https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14972/14051 ConceptNet 5.5: An open multilingual graph of general knowledge]. &#039;&#039;AAAI-17&#039;&#039;, pp. 4444-4451.&lt;br /&gt;
&lt;br /&gt;
Terra, E., and Clarke, C.L.A. (2003). [http://acl.ldc.upenn.edu/N/N03/N03-1032.pdf Frequency estimates for statistical word similarity measures]. &#039;&#039;Proceedings of the Human Language Technology and North American Chapter of Association of Computational Linguistics Conference 2003 (HLT/NAACL 2003)&#039;&#039;, pp. 244–251.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039; 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2001). [http://arxiv.org/abs/cs.LG/0212033 Mining the Web for synonyms: PMI-IR versus LSA on TOEFL]. &#039;&#039;Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001)&#039;&#039;, Freiburg, Germany, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D., Littman, M.L., Bigham, J., and Shnayder, V. (2003). [http://arxiv.org/abs/cs.CL/0309035 Combining independent modules to solve multiple-choice synonym and analogy problems]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, pp. 482-489.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2008). [http://arxiv.org/abs/0809.0124 A uniform approach to analogies, synonyms, antonyms, and associations]. &#039;&#039;Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)&#039;&#039;, Manchester, UK, pp. 905-912.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=11292</id>
		<title>MC-28 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=11292"/>
		<updated>2015-11-02T16:29:04Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the Miller &amp;amp; Charles 28 (MC-28) dataset [Resnik, 1995]&lt;br /&gt;
* The 28 word pairs of the original Miller &amp;amp; Charles 30 (MC-30) dataset [Miller and Charles, 1991], which is a subset of the [[RG-65 Test Collection (State of the art)|Rubenstein &amp;amp; Goodenough (RG-65) dataset]]; two word pairs are generally omitted in semantic similarity evaluations, as their words were not included in earlier versions of WordNet&lt;br /&gt;
* The similarity of each pair is scored on a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number)&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 38 subjects [Miller and Charles, 1991].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] [with 95% confidence intervals]&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Human upper bound&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Human&lt;br /&gt;
| 0.934 [0.861, 0.969]&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.92 [0.833, 0.962]&lt;br /&gt;
|-&lt;br /&gt;
| Gloss Vector&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.91 [0.813, 0.957]&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.87 [0.736, 0.938]&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.856 [0.710, 0.931]&lt;br /&gt;
|-&lt;br /&gt;
| KC&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.835 [0.671, 0.921]&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.82 [0.644, 0.913]&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81 [0.627, 0.908]&lt;br /&gt;
|-&lt;br /&gt;
| DC-best&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.773 [0.562, 0.889]&lt;br /&gt;
|-&lt;br /&gt;
| GM&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.72 [0.475, 0.861]&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.70 [0.443, 0.850]&lt;br /&gt;
|-&lt;br /&gt;
| SH&lt;br /&gt;
| Sahami and Heilman (2006)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.618 [0.319, 0.805]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A. (2009). [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. In: &#039;&#039;10th Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 19–27.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, E., and Markovitch, S. (2007). [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], &#039;&#039;Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI)&#039;&#039;, Hyderabad, India.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Kulkarni, S., Caragea, D. (2009). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4693&amp;amp;rep=rep1&amp;amp;type=pdf Computation of the Semantic Relatedness between Words using Concept Clouds]. In: &#039;&#039;International Conference on Knowledge Discovery and Information Retrieval&#039;&#039;. INSTICC Press, Setubal. pp. 183–188.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://webdocs.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. In &#039;&#039;Proceedings of the 15th International Conference on Machine Learning&#039;&#039;, Madison, WI, pp. 296–304.&lt;br /&gt;
&lt;br /&gt;
Miller, G., and Charles, W. (1991). [http://www.tandfonline.com/doi/abs/10.1080/01690969108406936#.VjUdmjZRGUk Contextual correlates of semantic similarity]. &#039;&#039;Language and Cognitive Processes&#039;&#039;, 6(1), 1–28.&lt;br /&gt;
&lt;br /&gt;
Milne, D., Witten, I.H. (2008). [http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links], In &#039;&#039;Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy&#039;&#039;, AAAI Press, Chicago, USA pp. 25-30.&lt;br /&gt;
&lt;br /&gt;
Patwardhan, S., and Pedersen, T. (2006). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6642&amp;amp;rep=rep1&amp;amp;type=pdf#page=7 Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts]. In: &#039;&#039;11th Conference of the European Chapter of the Association for Computational Linguistics&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 1–8.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://arxiv.org/pdf/cmp-lg/9511007 Using information content to evaluate semantic similarity]. In &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence&#039;&#039;, Montreal, Canada, pages 448–453.&lt;br /&gt;
&lt;br /&gt;
Sahami, M., Heilman, T.D. (2006). [http://robotics.stanford.edu/users/sahami/papers-dir/www2006.pdf A web-based kernel function for measuring the similarity of short text snippets]. In: &#039;&#039;15th international conference on World Wide Web&#039;&#039;. ACM Press, New York. pp. 377–386.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039; 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)&amp;diff=11290</id>
		<title>WordSimilarity-353 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)&amp;diff=11290"/>
		<updated>2015-11-01T13:45:51Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
* [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ WordSimilarity-353 Test Collection]&lt;br /&gt;
* contains two sets of English word pairs along with human-assigned similarity judgements&lt;br /&gt;
* first set (set1) contains 153 word pairs along with their similarity scores assigned by 13 subjects&lt;br /&gt;
* second set (set2) contains 200 word pairs with similarity assessed by 16 subjects&lt;br /&gt;
* WordSimilarity-353 dataset is available [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ here]&lt;br /&gt;
* performance is measured by [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rank correlation coefficient]&lt;br /&gt;
* introduced by [http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf Finkelstein et al. (2002)]&lt;br /&gt;
* subsequently used by many other researchers&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
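The Spearman's rank correlation used to score systems on this dataset can be computed with SciPy. A minimal sketch, assuming SciPy is installed; the scores below are made up for illustration, not taken from WordSimilarity-353:

```python
# Illustrative only: Spearman's rho between hypothetical system scores and
# human similarity judgements (the numbers are invented, not from WS-353).
from scipy.stats import spearmanr

human_scores = [7.35, 1.62, 5.77, 9.10, 3.88]
system_scores = [0.61, 0.12, 0.48, 0.93, 0.35]

rho, p_value = spearmanr(human_scores, system_scores)
# rho is 1.0 here because the two rankings agree perfectly; real systems
# land somewhere below that, as in the table that follows.
```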
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of increasing [http://en.wikipedia.org/wiki/Spearman_rank_correlation Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! Spearman&#039;s rho&lt;br /&gt;
! Pearson&#039;s r&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.302&lt;br /&gt;
| 0.356&lt;br /&gt;
|-&lt;br /&gt;
| WNE&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.305&lt;br /&gt;
| 0.271&lt;br /&gt;
|-&lt;br /&gt;
| J&amp;amp;C&lt;br /&gt;
| Jiang and Conrath 1997&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.318&lt;br /&gt;
| 0.354&lt;br /&gt;
|-&lt;br /&gt;
| L&amp;amp;C&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.348&lt;br /&gt;
| 0.341&lt;br /&gt;
|-&lt;br /&gt;
| H&amp;amp;S&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.302&lt;br /&gt;
| 0.356&lt;br /&gt;
|-&lt;br /&gt;
| Lin&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.348&lt;br /&gt;
| 0.357&lt;br /&gt;
|-&lt;br /&gt;
| Resnik&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.353&lt;br /&gt;
| 0.365&lt;br /&gt;
|-&lt;br /&gt;
| ROGET&lt;br /&gt;
| Jarmasz (2003)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.415&lt;br /&gt;
| 0.536&lt;br /&gt;
|-&lt;br /&gt;
| C&amp;amp;W&lt;br /&gt;
| Collobert and Weston (2008)&lt;br /&gt;
| Collobert and Weston (2008)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.5&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| WikiRelate&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Strube and Ponzetto (2006)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| N/A&lt;br /&gt;
| 0.48&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.581&lt;br /&gt;
| 0.492&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer et al. (1997)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.581&lt;br /&gt;
| 0.563&lt;br /&gt;
|-&lt;br /&gt;
| simVB+simWN&lt;br /&gt;
| Finkelstein et al. (2002)&lt;br /&gt;
| Finkelstein et al. (2002)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| N/A&lt;br /&gt;
| 0.55&lt;br /&gt;
|-&lt;br /&gt;
| SSA&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Hassan and Mihalcea (2011)&lt;br /&gt;
| Knowledge-based&lt;br /&gt;
| 0.622&lt;br /&gt;
| 0.629&lt;br /&gt;
|-&lt;br /&gt;
| HSMN+csmRNN&lt;br /&gt;
| Luong et al. (2013)&lt;br /&gt;
| Luong et al. (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.65&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Multi-prototype&lt;br /&gt;
| Huang et al. (2012)&lt;br /&gt;
| Huang et al. (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.71&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Multi-lingual SSA&lt;br /&gt;
| Hassan et al. (2011)&lt;br /&gt;
| Hassan et al. (2011)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.713&lt;br /&gt;
| 0.674&lt;br /&gt;
|-&lt;br /&gt;
| ESA&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.748&lt;br /&gt;
| 0.503&lt;br /&gt;
|-&lt;br /&gt;
| TSA&lt;br /&gt;
| Radinsky et al. (2011)&lt;br /&gt;
| Radinsky et al. (2011)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.80&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| CLEAR&lt;br /&gt;
| Halawi et al. (2012)&lt;br /&gt;
| Halawi et al. (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.81&lt;br /&gt;
| N/A&lt;br /&gt;
|-&lt;br /&gt;
| Y&amp;amp;Q&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Yih and Qazvinian (2012)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81&lt;br /&gt;
| N/A&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in alphabetical order.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. (2002) [http://www.cs.technion.ac.il/~gabr/papers/tois_context.pdf Placing Search in Context: The Concept Revisited]. ACM Transactions on Information Systems, 20(1):116-131.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, Evgeniy, and Shaul Markovitch, [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007.&lt;br /&gt;
&lt;br /&gt;
Halawi, Guy, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. (2012). [http://gabrilovich.com/publications/papers/Halawi2012LSL.pdf Large-scale learning of word relatedness with constraints]. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1406-1414. ACM.&lt;br /&gt;
&lt;br /&gt;
Hassan, Samer, and Rada Mihalcea: [http://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/download/3616/3972/ Semantic Relatedness Using Salient Semantic Analysis]. AAAI 2011&lt;br /&gt;
&lt;br /&gt;
Hirst, Graeme and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 305–332, 1998.&lt;br /&gt;
&lt;br /&gt;
Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (ACL &#039;12), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 873-882.&lt;br /&gt;
&lt;br /&gt;
Islam, A., and Inkpen, D. 2006. [http://www.site.uottawa.ca/~mdislam/publications/LREC_06_242.pdf Second order co-occurrence pmi for determining the semantic similarity of words]. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006) 1033–1038.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M. 2003. [http://www.arxiv.org/pdf/1204.0140 Roget’s thesaurus as a Lexical Resource for Natural Language Processing]. Ph.D. Dissertation, Ottawa Carleton Institute for Computer Science, School of Information Technology and Engineering, University of Ottawa.&lt;br /&gt;
&lt;br /&gt;
Jiang, Jay J. and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics (ROCLING X), Taiwan, pages 19–33, 1997.&lt;br /&gt;
&lt;br /&gt;
Landauer, T. K., Laham, D., Rehder, B., and Schreiner, M. E. (1997). How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans.&lt;br /&gt;
&lt;br /&gt;
Leacock, Claudia and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, pages 265–283, 1998.&lt;br /&gt;
&lt;br /&gt;
Lin, Dekang. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI, pages 296–304, 1998.&lt;br /&gt;
&lt;br /&gt;
Luong, Minh-Thang, Richard Socher, and Christopher D. Manning. (2013). [http://nlp.stanford.edu/~lmthang/data/papers/conll13_morpho.pdf Better word representations with recursive neural networks for morphology]. CoNLL-2013: 104.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., D. Jurgens and R. Navigli. [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity]. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pp. 1341-1351.&lt;br /&gt;
&lt;br /&gt;
Radinsky, Kira, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. (2011). [http://gabrilovich.com/publications/papers/Radinsky2011WTS.pdf A word at a time: computing word relatedness using temporal semantic analysis]. In Proceedings of the 20th international conference on World wide web, pp. 337-346. ACM.&lt;br /&gt;
&lt;br /&gt;
Resnik, Philip. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448–453, Montreal, Canada, 1995.&lt;br /&gt;
&lt;br /&gt;
Strube, Michael and Simone Paolo Ponzetto. (2006). [http://www.aaai.org/Papers/AAAI/2006/AAAI06-223.pdf WikiRelate! Computing Semantic Relatedness Using Wikipedia]. Proceedings of The 21st National Conference on Artificial Intelligence (AAAI), Boston, MA.&lt;br /&gt;
&lt;br /&gt;
Yih, W. and Qazvinian, V. (2012). [http://aclweb.org/anthology/N/N12/N12-1077.pdf Measuring Word Relatedness Using Heterogeneous Vector Space Models]. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012).&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=11289</id>
		<title>MC-28 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=11289"/>
		<updated>2015-11-01T13:25:59Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the Miller &amp;amp; Charles 28 (MC-28) dataset [Resnik, 1995]&lt;br /&gt;
* 28 word pairs of the original Miller &amp;amp; Charles 30 (MC-30) dataset [Miller and Charles, 1991], which is a subset of the [[RG-65 Test Collection (State of the art)|Rubenstein &amp;amp; Goodenough (RG-65) dataset]]; two word pairs have generally been omitted for semantic similarity evaluation, as words in these word pairs have not been included in previous versions of WordNet&lt;br /&gt;
* Similarity of each pair is scored according to a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number);&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 38 subjects [Miller and Charles, 1991].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] [with 95% confidence intervals]&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Human upper bound&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Human&lt;br /&gt;
| 0.934 [0.861, 0.969]&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.92 [0.833, 0.962]&lt;br /&gt;
|-&lt;br /&gt;
| Gloss Vector&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.91 [0.813, 0.957]&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.87 [0.736, 0.938]&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.856 [0.710, 0.931]&lt;br /&gt;
|-&lt;br /&gt;
| KC&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.835 [0.671, 0.921]&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.82 [0.644, 0.913]&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81 [0.627, 0.908]&lt;br /&gt;
|-&lt;br /&gt;
| bnc-bagofwords-num-cos-qw+enwiki-parsed-num-cos-freq&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.773 [0.562, 0.889]&lt;br /&gt;
|-&lt;br /&gt;
| GM&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.72 [0.475, 0.861]&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.70 [0.443, 0.850]&lt;br /&gt;
|-&lt;br /&gt;
| SH&lt;br /&gt;
| Sahami and Heilman (2006)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.618 [0.319, 0.805]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A. (2009). [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. In: &#039;&#039;10th Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 19–27.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, E., and Markovitch, S. (2007). [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], &#039;&#039;Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI)&#039;&#039;, Hyderabad, India.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Kulkarni, S., Caragea, D. (2009). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4693&amp;amp;rep=rep1&amp;amp;type=pdf Computation of the Semantic Relatedness between Words using Concept Clouds]. In: &#039;&#039;International Conference on Knowledge Discovery and Information Retrieval&#039;&#039;. INSTICC Press, Setubal. pp. 183–188.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://webdocs.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. In &#039;&#039;Proceedings of the 15th International Conference on Machine Learning&#039;&#039;, Madison, WI, pp. 296–304.&lt;br /&gt;
&lt;br /&gt;
Miller, G., and Charles, W. (1991). [http://www.tandfonline.com/doi/abs/10.1080/01690969108406936#.VjUdmjZRGUk Contextual correlates of semantic similarity]. &#039;&#039;Language and Cognitive Processes&#039;&#039;, 6(1), 1–28.&lt;br /&gt;
&lt;br /&gt;
Milne, D., Witten, I.H. (2008). [http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links], In &#039;&#039;Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy&#039;&#039;, AAAI Press, Chicago, USA, pp. 25-30.&lt;br /&gt;
&lt;br /&gt;
Patwardhan, S., and Pedersen, T. (2006). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6642&amp;amp;rep=rep1&amp;amp;type=pdf#page=7 Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts]. In: &#039;&#039;11th Conference of the European Chapter of the Association for Computational Linguistics&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 1–8.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://arxiv.org/pdf/cmp-lg/9511007 Using information content to evaluate semantic similarity]. In &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence&#039;&#039;, Montreal, Canada, pages 448–453.&lt;br /&gt;
&lt;br /&gt;
Sahami, M., Heilman, T.D. (2006). [http://robotics.stanford.edu/users/sahami/papers-dir/www2006.pdf A web-based kernel function for measuring the similarity of short text snippets]. In: &#039;&#039;15th international conference on World Wide Web&#039;&#039;. ACM Press, New York. pp. 377–386.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039; 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=State_of_the_art&amp;diff=11288</id>
		<title>State of the art</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=State_of_the_art&amp;diff=11288"/>
		<updated>2015-11-01T13:23:17Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The purpose of this section of the ACL wiki is to be a repository of &#039;&#039;k&#039;&#039;-best state-of-the-art results (i.e., methods and software) for various core natural language processing tasks. &lt;br /&gt;
&lt;br /&gt;
As a side effect, this should hopefully evolve into a knowledge base of standard evaluation methods and datasets for various tasks, as well as encourage more effort into reproducibility of results. This will help newcomers to a field appreciate what has been done so far and what the main tasks are, and will help keep active researchers informed on fields other than their specific research. The next time you need a system for PP attachment, or wonder what the current state of word sense disambiguation is, this will be the place to visit. &lt;br /&gt;
&lt;br /&gt;
Please contribute! (This is also a good place for you to display your results!)&lt;br /&gt;
&lt;br /&gt;
As a historical point of reference, you may want to refer to the [http://web.archive.org/web/20100325144600/http://cslu.cse.ogi.edu/HLTsurvey/ Survey of the State of the Art in Human Language Technology] ([http://www.lt-world.org/hlt_survey/master.pdf also available as PDF]), edited by R. Cole, J. Mariani, H. Uszkoreit, G. B. Varile, A. Zaenen, A. Zampolli, V. Zue, 1996.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Please keep this list in alphabetical order --&amp;gt;&lt;br /&gt;
* [[Anaphora Resolution (State of the art)|Anaphora Resolution]] (stub)&lt;br /&gt;
* [[Automatic Text Summarization (State of the art)|Automatic Text Summarization]] (stub)&lt;br /&gt;
* [[Chunking (State of the art)|Chunking]] (stub)&lt;br /&gt;
* [[Dependency Parsing (State of the art)|Dependency Parsing]] (stub)&lt;br /&gt;
* [[Document Classification (State of the art)|Document Classification]] (stub)&lt;br /&gt;
* [[Language Identification (State of the art)|Language Identification]] (stub)&lt;br /&gt;
* [[Named Entity Recognition (State of the art)|Named Entity Recognition]]&lt;br /&gt;
* [[Noun-Modifier Semantic Relations (State of the art)|Noun-Modifier Semantic Relations]]&lt;br /&gt;
* [[NP Chunking (State of the art)|NP Chunking]] &lt;br /&gt;
* [[Paraphrase Identification (State of the art)|Paraphrase Identification]]&lt;br /&gt;
* [[Parsing (State of the art)|Parsing]] &lt;br /&gt;
* [[POS Induction (State of the art) |POS Induction]]&lt;br /&gt;
* [[POS Tagging (State of the art) |POS Tagging]]&lt;br /&gt;
* [[PP Attachment (State of the art)|PP Attachment]] (stub)&lt;br /&gt;
* [[Question Answering (State of the art)|Question Answering]]&lt;br /&gt;
* [[Semantic Role Labeling (State of the art)|Semantic Role Labeling]] (stub)&lt;br /&gt;
* [[Sentiment Analysis (State of the art)|Sentiment Analysis]] (stub)&lt;br /&gt;
* [[Similarity (State of the art)|Similarity]] -- [[ESL Synonym Questions (State of the art)|ESL]], [[SAT Analogy Questions (State of the art)|SAT]], [[TOEFL Synonym Questions (State of the art)|TOEFL]], [[RG-65 Test Collection (State of the art)|RG-65 Test Collection]], [[MC-28 Test Collection (State of the art)|MC-28 Test Collection]], [[WordSimilarity-353 Test Collection (State of the art)|WordSimilarity-353]], [[SemEval-2012 Task 2 (State of the art)|SemEval-2012 Task 2]]&lt;br /&gt;
* [[Speech Recognition (State of the art)|Speech Recognition]] (article request)&lt;br /&gt;
* [[Temporal Information Extraction (State of the art)|Temporal Information Extraction]]&lt;br /&gt;
* [[Cleaneval (State of the art)| Web Corpus Cleaning]] (stub)&lt;br /&gt;
* [[Word Segmentation (State of the art)|Word Segmentation]] (stub)&lt;br /&gt;
* [[Word Sense Disambiguation (State of the art)|Word Sense Disambiguation]] (stub)&lt;br /&gt;
&amp;lt;!-- Please keep this list in alphabetical order --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11287</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11287"/>
		<updated>2015-10-31T21:44:28Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.inf.u-szeged.hu/~dobo/ András Dobó]&lt;br /&gt;
&lt;br /&gt;
PhD Student&lt;br /&gt;
&lt;br /&gt;
PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 [http://www.inf.u-szeged.hu/~dobo/ www.inf.u-szeged.hu/~dobo]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
2013/2014 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2013/2014 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2011/2012 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Databases tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Introduction to informatics tutorials&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2011-	PhD in Computer Science&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
	Research topic: Automatic interpretation of English and Hungarian noun&lt;br /&gt;
	compounds&lt;br /&gt;
&lt;br /&gt;
2009-2010	[http://solo.bodleian.ox.ac.uk/primo_library/libweb/action/dlDisplay.do?vid=OXVU1&amp;amp;docId=oxfaleph017405947&amp;amp;fn=permalink Master of Science in Computer Science]&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2006-2009	[http://diploma.bibl.u-szeged.hu/3370/ Bachelor of Science in Computer Program Designer]&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
2010	English, level C2, Cambridge ESOL&lt;br /&gt;
&lt;br /&gt;
2005	German, level B2, Goethe Institut&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://link.springer.com/chapter/10.1007/978-3-319-13817-6_32 Information Extraction from Hungarian, English and German CVs for a Career Portal]. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. Springer International Publishing, Switzerland (2014) 333-341&lt;br /&gt;
&lt;br /&gt;
2.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: [http://rgai.inf.u-szeged.hu/mszny2014/MSZNY2014_press_b5.pdf#page=369 Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz]. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2014) 359-360&lt;br /&gt;
&lt;br /&gt;
3.	Dobó, A., Csirik, J.: [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. Springer-Verlag, Berlin Heidelberg (2013) 491-502&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A., Csirik, J.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=221 Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 213-224&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A., Pulman, S.G.: [http://www.inf.u-szeged.hu/projectdirs/mszny2013/images/stories/kepek/MSZNY2013_press.pdf#page=43 Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével]. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 35-46&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A., Pulman, S.G.: [http://journal.sepln.org/index.php/pln/article/view/842 Interpreting noun compounds using paraphrases]. Procesamiento del Lenguaje Natural, Vol. 46 (2011) 59-66&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=11286</id>
		<title>MC-28 Test Collection (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=MC-28_Test_Collection_(State_of_the_art)&amp;diff=11286"/>
		<updated>2015-10-31T20:02:41Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: Created page with &amp;quot;* state of the art in Miller &amp;amp; Charles 28 (MC-28) dataset * 28 word pairs of the original Miller &amp;amp; Charles 30 (MC-30) dataset, which is a subset of the RG-65 Test Collection...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* State of the art on the Miller &amp;amp; Charles 28 (MC-28) dataset&lt;br /&gt;
* 28 word pairs of the original Miller &amp;amp; Charles 30 (MC-30) dataset, which is a subset of the [[RG-65 Test Collection (State of the art)|Rubenstein &amp;amp; Goodenough (RG-65) dataset]]; two word pairs have generally been omitted for semantic similarity evaluation, as words in these word pairs have not been included in previous versions of WordNet&lt;br /&gt;
* Similarity of each pair is scored according to a scale from 0 to 4 (the higher the &amp;quot;similarity of meaning,&amp;quot; the higher the number);&lt;br /&gt;
* The similarity values in the dataset are the means of judgments made by 38 subjects [Miller and Charles, 1991].&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed in order of decreasing [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman&#039;s rho].&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for reported results&lt;br /&gt;
! Type&lt;br /&gt;
! [http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient Spearman correlation] [with 95% confidence intervals]&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Human upper bound&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Human&lt;br /&gt;
| 0.934 [0.861, 0.969]&lt;br /&gt;
|-&lt;br /&gt;
| PPR&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.92 [0.833, 0.962]&lt;br /&gt;
|-&lt;br /&gt;
| Gloss Vector&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Patwardhan and Pedersen (2006)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.91 [0.813, 0.957]&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.87 [0.736, 0.938]&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 0.856 [0.710, 0.931]&lt;br /&gt;
|-&lt;br /&gt;
| KC&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Kulkarni and Caragea (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.835 [0.671, 0.921]&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.82 [0.644, 0.913]&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 0.81 [0.627, 0.908]&lt;br /&gt;
|-&lt;br /&gt;
| bnc-bagofwords-num-cos-qw+enwiki-parsed-num-cos-freq&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.773 [0.562, 0.889]&lt;br /&gt;
|-&lt;br /&gt;
| GM&lt;br /&gt;
| Gabrilovich and Markovitch (2007)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 0.72 [0.475, 0.861]&lt;br /&gt;
|-&lt;br /&gt;
| WLM&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Milne and Witten (2008)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.70 [0.443, 0.850]&lt;br /&gt;
|-&lt;br /&gt;
| SH&lt;br /&gt;
| Sahami and Heilman (2006)&lt;br /&gt;
| Agirre et al. (2009)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 0.618 [0.319, 0.805]&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Listed alphabetically.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A. (2009). [http://www.aclweb.org/anthology/N09-1003 A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches]. In: &#039;&#039;10th Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 19–27.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Gabrilovich, E., and Markovitch, S. (2007). [http://www.cs.technion.ac.il/~gabr/papers/ijcai-2007-sim.pdf Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis], &#039;&#039;Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI)&#039;&#039;, Hyderabad, India.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Kulkarni, S., Caragea, D. (2009). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4693&amp;amp;rep=rep1&amp;amp;type=pdf Computation of the Semantic Relatedness between Words using Concept Clouds]. In: &#039;&#039;International Conference on Knowledge Discovery and Information Retrieval&#039;&#039;. INSTICC Press, Setubal. pp. 183–188.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://webdocs.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. In &#039;&#039;Proceedings of the 15th International Conference on Machine Learning&#039;&#039;, Madison, WI, pp. 296–304.&lt;br /&gt;
&lt;br /&gt;
Miller, G., and Charles, W. (1991). [http://www.tandfonline.com/doi/abs/10.1080/01690969108406936#.VjUdmjZRGUk Contextual correlates of semantic similarity]. &#039;&#039;Language and Cognitive Processes&#039;&#039;, 6(1), 1–28.&lt;br /&gt;
&lt;br /&gt;
Milne, D., Witten, I.H. (2008). [http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links], In &#039;&#039;Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy&#039;&#039;, AAAI Press, Chicago, USA, pp. 25-30.&lt;br /&gt;
&lt;br /&gt;
Patwardhan, S., and Pedersen, T. (2006). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.145.6642&amp;amp;rep=rep1&amp;amp;type=pdf#page=7 Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts]. In: &#039;&#039;11th Conference of the European Chapter of the Association for Computational Linguistics&#039;&#039;. Association for Computational Linguistics, Stroudsburg. pp. 1–8.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://arxiv.org/pdf/cmp-lg/9511007 Using information content to evaluate semantic similarity]. In &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence&#039;&#039;, Montreal, Canada, pages 448–453.&lt;br /&gt;
&lt;br /&gt;
Sahami, M., Heilman, T.D. (2006). [http://robotics.stanford.edu/users/sahami/papers-dir/www2006.pdf A web-based kernel function for measuring the similarity of short text snippets]. In: &#039;&#039;15th international conference on World Wide Web&#039;&#039;. ACM Press, New York. pp. 377–386.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039; 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art)&amp;diff=11285</id>
		<title>Similarity (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art)&amp;diff=11285"/>
		<updated>2015-10-31T19:55:42Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: /* Attributional similarity */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* see also: [[State of the art]]&lt;br /&gt;
&lt;br /&gt;
== Attributional similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;attributional similarity:&#039;&#039;&#039; the degree to which two words are synonymous&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[TOEFL Synonym Questions (State of the art)|TOEFL Synonym Questions]]&lt;br /&gt;
** [[ESL Synonym Questions (State of the art)|ESL Synonym Questions]]&lt;br /&gt;
** [[RG-65 Test Collection (State of the art)|RG-65 Test Collection]]&lt;br /&gt;
** [[MC-28 Test Collection (State of the art)|MC-28 Test Collection]]&lt;br /&gt;
** [[SimLex-999 (State of the art)|SimLex-999 Similarity Test Collection]]&lt;br /&gt;
** [[WordSimilarity-353 Test Collection (State of the art)|WordSimilarity-353 Test Collection]]&lt;br /&gt;
&lt;br /&gt;
== Similarity versus Association ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;similarity versus association&#039;&#039;&#039;: the contrast between taxonomical similarity (co-hyponymy) and association (co-occurrence)&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[Similar-Associated-Both Test Collection (State of the art)|Similar-Associated-Both Test Collection]]&lt;br /&gt;
** [[SimLex-999 (State of the art)|SimLex-999 Similarity Test Collection]]&lt;br /&gt;
&lt;br /&gt;
== Relational similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;relational similarity:&#039;&#039;&#039; the degree to which two relations are analogous&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[SAT Analogy Questions (State of the art)|SAT Analogy Questions]]&lt;br /&gt;
** [[SemEval-2012 Task 2 (State of the art)|SemEval-2012 Task 2: Measuring Degrees of Relational Similarity]]&lt;br /&gt;
** [[Syntactic Analogies|Microsoft Research Syntactic Analogies Dataset]]&lt;br /&gt;
&lt;br /&gt;
== Phrase similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;phrase similarity:&#039;&#039;&#039; the degree to which two phrases are similar&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[Noun-Modifier Questions (State of the art)|Noun-Modifier Questions]]&lt;br /&gt;
&lt;br /&gt;
== Sentence similarity ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;sentence similarity:&#039;&#039;&#039; sentence paraphrase, paraphrase identification, paraphrase recognition&lt;br /&gt;
* state-of-the-art results for:&lt;br /&gt;
** [[Paraphrase Identification (State of the art)|Microsoft Research Paraphrase Corpus]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
&lt;br /&gt;
* SemEval-2012 Task 2: [https://sites.google.com/site/semeval2012task2/ Measuring Degrees of Relational Similarity]&lt;br /&gt;
* SemEval-2012 Task 6: [http://www.cs.york.ac.uk/semeval-2012/task6/ Semantic Textual Similarity]&lt;br /&gt;
* SEM 2013 Shared Task: [http://ixa2.si.ehu.es/sts/ Semantic Textual Similarity]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11277</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11277"/>
		<updated>2015-10-31T09:25:48Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.inf.u-szeged.hu/~dobo/ András Dobó]&lt;br /&gt;
&lt;br /&gt;
PhD Student&lt;br /&gt;
&lt;br /&gt;
PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 [http://www.inf.u-szeged.hu/~dobo/ www.inf.u-szeged.hu/~dobo]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
2013/2014 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2013/2014 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2011/2012 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Databases tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Introduction to informatics tutorials&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2011-	PhD in Computer Science&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
	Research topic: Automatic interpretation of English and Hungarian noun&lt;br /&gt;
	compounds&lt;br /&gt;
&lt;br /&gt;
2009-2010	Master of Science in Computer Science&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2006-2009	Bachelor of Science in Computer Program Designer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
2010	English, level C2, Cambridge ESOL&lt;br /&gt;
&lt;br /&gt;
2005	German, level B2, Goethe Institut&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: Information Extraction from Hungarian, English and German CVs for a Career Portal. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. Springer International Publishing, Switzerland (2014) 333-341&lt;br /&gt;
&lt;br /&gt;
2.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2014) 359-360&lt;br /&gt;
&lt;br /&gt;
3.	Dobó, A., Csirik, J.: Computing semantic similarity using large static corpora. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. Springer-Verlag, Berlin Heidelberg (2013) 491-502&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A., Csirik, J.: Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 213-224&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A., Pulman, S.G.: Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 35-46&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A., Pulman, S.G.: Interpreting noun compounds using paraphrases. Procesamiento del Lenguaje Natural, Vol. 46 (2011) 59-66&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11276</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11276"/>
		<updated>2015-10-30T22:22:43Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;András Dobó&lt;br /&gt;
&lt;br /&gt;
PhD Student&lt;br /&gt;
&lt;br /&gt;
PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics&lt;br /&gt;
	 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
Web:	 www.inf.u-szeged.hu/~dobo&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
2013/2014 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2013/2014 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2011/2012 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Databases tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Introduction to informatics tutorials&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2011-	PhD in Computer Science&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
	Research topic: Automatic interpretation of English and Hungarian noun&lt;br /&gt;
	compounds&lt;br /&gt;
&lt;br /&gt;
2009-2010	Master of Science in Computer Science&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2006-2009	Bachelor of Science in Computer Program Designer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
2010	English, level C2, Cambridge ESOL&lt;br /&gt;
&lt;br /&gt;
2005	German, level B2, Goethe Institut&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: Information Extraction from Hungarian, English and German CVs for a Career Portal. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. Springer International Publishing, Switzerland (2014) 333-341&lt;br /&gt;
&lt;br /&gt;
2.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2014) 359-360&lt;br /&gt;
&lt;br /&gt;
3.	Dobó, A., Csirik, J.: Computing semantic similarity using large static corpora. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. Springer-Verlag, Berlin Heidelberg (2013) 491-502&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A., Csirik, J.: Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 213-224&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A., Pulman, S.G.: Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 35-46&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A., Pulman, S.G.: Interpreting noun compounds using paraphrases. Procesamiento del Lenguaje Natural, Vol. 46 (2011) 59-66&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=11272</id>
		<title>TOEFL Synonym Questions (State of the art)</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art)&amp;diff=11272"/>
		<updated>2015-10-29T20:48:39Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* TOEFL = Test of English as a Foreign Language&lt;br /&gt;
* 80 multiple-choice synonym questions; 4 choices per question&lt;br /&gt;
* the TOEFL questions are available on request by contacting [http://lsa.colorado.edu/mail_sub.html LSA Support at CU Boulder], the people who manage the [http://lsa.colorado.edu/ LSA web site at Colorado]&lt;br /&gt;
* introduced in Landauer and Dumais (1997) as a way of evaluating algorithms for measuring degree of similarity between words&lt;br /&gt;
* subsequently used by many other researchers&lt;br /&gt;
* see also: [[Similarity (State of the art)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Sample question ==&lt;br /&gt;
&lt;br /&gt;
::{| border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;1&amp;quot; cellspacing=&amp;quot;1&amp;quot; &lt;br /&gt;
|-&lt;br /&gt;
! Stem:&lt;br /&gt;
|&lt;br /&gt;
| levied&lt;br /&gt;
|-&lt;br /&gt;
! Choices:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (b)&lt;br /&gt;
| believed&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (c)&lt;br /&gt;
| requested&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
| (d)&lt;br /&gt;
| correlated&lt;br /&gt;
|-&lt;br /&gt;
! Solution:&lt;br /&gt;
| (a)&lt;br /&gt;
| imposed&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Table of results ==&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;1&amp;quot; width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Algorithm&lt;br /&gt;
! Reference for algorithm&lt;br /&gt;
! Reference for experiment&lt;br /&gt;
! Type&lt;br /&gt;
! Correct&lt;br /&gt;
! 95% confidence&lt;br /&gt;
|-&lt;br /&gt;
| RES&lt;br /&gt;
| Resnik (1995)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 20.31%&lt;br /&gt;
| 12.89–31.83%&lt;br /&gt;
|-&lt;br /&gt;
| LC&lt;br /&gt;
| Leacock and Chodorow (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 21.88%&lt;br /&gt;
| 13.91–33.21%&lt;br /&gt;
|-&lt;br /&gt;
| LIN&lt;br /&gt;
| Lin (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 24.06%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| Random&lt;br /&gt;
| Random guessing&lt;br /&gt;
| 1 / 4 = 25.00%&lt;br /&gt;
| Random&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| JC&lt;br /&gt;
| Jiang and Conrath (1997)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 25.00%&lt;br /&gt;
| 15.99–35.94%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 64.38%&lt;br /&gt;
| 52.90–74.80%&lt;br /&gt;
|-&lt;br /&gt;
| Human&lt;br /&gt;
| Average non-native-English-speaking US college applicant&lt;br /&gt;
| Landauer and Dumais (1997)&lt;br /&gt;
| Human&lt;br /&gt;
| 64.50%&lt;br /&gt;
| 53.01–74.88%&lt;br /&gt;
|-&lt;br /&gt;
| RI&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Karlgren and Sahlgren (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 72.50%&lt;br /&gt;
| 61.38–81.90%&lt;br /&gt;
|-&lt;br /&gt;
| DS&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Pado and Lapata (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.00%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Turney (2001)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 73.75%&lt;br /&gt;
| 62.72–82.96%&lt;br /&gt;
|-&lt;br /&gt;
| PairClass&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Turney (2008)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 76.25%&lt;br /&gt;
| 65.42–85.06%&lt;br /&gt;
|-&lt;br /&gt;
| HSO&lt;br /&gt;
| Hirst and St-Onge (1998)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 77.91%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| JS&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Jarmasz and Szpakowicz (2003)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 78.75%&lt;br /&gt;
| 68.17–87.11%&lt;br /&gt;
|-&lt;br /&gt;
| PMI-IR&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Terra and Clarke (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| LC-IR&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Higgins (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 81.25%&lt;br /&gt;
| 70.97–89.11%&lt;br /&gt;
|-&lt;br /&gt;
| CWO&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Ruiz-Casado et al. (2005)&lt;br /&gt;
| Web-based&lt;br /&gt;
| 82.55%&lt;br /&gt;
| 72.38–90.09%&lt;br /&gt;
|-&lt;br /&gt;
| PPMIC&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Bullinaria and Levy (2007)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 85.00%&lt;br /&gt;
| 75.26–92.00%&lt;br /&gt;
|-&lt;br /&gt;
| GLSA&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Matveeva et al. (2005)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 86.25%&lt;br /&gt;
| 76.73–92.93%&lt;br /&gt;
|-&lt;br /&gt;
| SR&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Tsatsaronis et al. (2010)&lt;br /&gt;
| Lexicon-based&lt;br /&gt;
| 87.50%&lt;br /&gt;
| 78.21–93.84%&lt;br /&gt;
|-&lt;br /&gt;
| bnc-parsed-num-cos-loglh+enwiki-parsed-num-cos-pmi&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Dobó and Csirik (2013)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 88.75%&lt;br /&gt;
| 79.72–94.72%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Rapp (2003)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 92.50%&lt;br /&gt;
| 84.39–97.20%&lt;br /&gt;
|-&lt;br /&gt;
| LSA&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Han (2014)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 95.00%&lt;br /&gt;
| 87.69–98.62%&lt;br /&gt;
|-&lt;br /&gt;
| ADW&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| Pilehvar et al. (2013)&lt;br /&gt;
| WordNet graph-based (unsupervised)&lt;br /&gt;
| 96.25%&lt;br /&gt;
| 89.43–99.22%&lt;br /&gt;
|-&lt;br /&gt;
| PR&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Turney et al. (2003)&lt;br /&gt;
| Hybrid&lt;br /&gt;
| 97.50%&lt;br /&gt;
| 91.26–99.70%&lt;br /&gt;
|-&lt;br /&gt;
| PCCP&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Bullinaria and Levy (2012)&lt;br /&gt;
| Corpus-based&lt;br /&gt;
| 100.00%&lt;br /&gt;
| 96.32–100.00%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Explanation of table ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Algorithm&#039;&#039;&#039; = name of algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for algorithm&#039;&#039;&#039; = where to find out more about given algorithm&lt;br /&gt;
* &#039;&#039;&#039;Reference for experiment&#039;&#039;&#039; = where to find out more about evaluation of given algorithm with TOEFL questions&lt;br /&gt;
* &#039;&#039;&#039;Type&#039;&#039;&#039; = general type of algorithm: corpus-based, lexicon-based, web-based, hybrid&lt;br /&gt;
* &#039;&#039;&#039;Correct&#039;&#039;&#039; = percent of 80 questions that given algorithm answered correctly&lt;br /&gt;
* &#039;&#039;&#039;95% confidence&#039;&#039;&#039; = confidence interval calculated using the [[Statistical calculators|Binomial Exact Test]]&lt;br /&gt;
* table rows sorted in order of increasing percent correct&lt;br /&gt;
* several WordNet-based similarity measures are implemented in [http://www.d.umn.edu/~tpederse/ Ted Pedersen]&#039;s [http://www.d.umn.edu/~tpederse/similarity.html WordNet::Similarity] package&lt;br /&gt;
* LSA = Latent Semantic Analysis&lt;br /&gt;
* PCCP = Principal Component vectors with Caron P&lt;br /&gt;
* PMI-IR = Pointwise Mutual Information - Information Retrieval&lt;br /&gt;
* PR = Product Rule&lt;br /&gt;
* PPMIC = Positive Pointwise Mutual Information with Cosine&lt;br /&gt;
* GLSA = Generalized Latent Semantic Analysis&lt;br /&gt;
* CWO = Context Window Overlapping&lt;br /&gt;
* DS = Dependency Space&lt;br /&gt;
* RI = Random Indexing&lt;br /&gt;
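The exact intervals in the &#039;&#039;&#039;95% confidence&#039;&#039;&#039; column can be reproduced by inverting the two binomial tail probabilities (the Clopper-Pearson construction). A minimal sketch in plain Python — function names are illustrative, and this is not the code of the linked calculator:&lt;br /&gt;

```python
from math import comb

def binom_cdf(k, n, p):
    # P(X at most k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def at_most(x, y):
    # comparison helper: True iff x is at most y
    return min(x, y) == x

def clopper_pearson(k, n, alpha=0.05):
    # Exact interval: bisect for the p where each binomial tail equals alpha/2.
    def solve(p_is_below_bound):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # 60 halvings give precision far beyond two decimals
            mid = (lo + hi) / 2
            if p_is_below_bound(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # lower bound: the p at which P(X at least k) rises to alpha/2
    lower = 0.0 if k == 0 else solve(
        lambda p: at_most(1 - binom_cdf(k - 1, n, p), alpha / 2))
    # upper bound: the p at which P(X at most k) falls to alpha/2
    upper = 1.0 if k == n else solve(
        lambda p: not at_most(binom_cdf(k, n, p), alpha / 2))
    return lower, upper

# PMI-IR (Turney, 2001): 59 of 80 questions correct, i.e. 73.75%
lo_p, hi_p = clopper_pearson(59, 80)
```

For 59 correct out of 80 this yields approximately 0.627–0.830, matching the 62.72–82.96% row in the table.&lt;br /&gt;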
&lt;br /&gt;
== Notes ==&lt;br /&gt;
&lt;br /&gt;
* the performance of a corpus-based algorithm depends on the corpus, so the difference in performance between two corpus-based systems may be due to the different corpora, rather than the different algorithms&lt;br /&gt;
* the TOEFL questions include nouns, verbs, and adjectives, but some of the WordNet-based algorithms were only designed to work with nouns; this explains some of the lower scores&lt;br /&gt;
* some of the algorithms may have been tuned on the TOEFL questions; read the references for details&lt;br /&gt;
* Landauer and Dumais (1997) report scores that were corrected for guessing by subtracting a penalty of 1/3 for each incorrect answer; they report a score of 52.5% when this penalty is applied; when the penalty is removed, their performance is 64.4% correct&lt;br /&gt;
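The correction for guessing in the last note is standard formula scoring: with c answer choices, each wrong answer costs 1/(c-1) points, here 1/3. A small check in plain Python (illustrative names) confirms that the two reported figures are consistent:&lt;br /&gt;

```python
def corrected_for_guessing(raw_correct, total=80, choices=4):
    # Formula scoring: each wrong answer subtracts 1/(choices - 1) points,
    # i.e. a penalty of 1/3 per error on the 4-choice TOEFL questions.
    wrong = total - raw_correct
    return (raw_correct - wrong / (choices - 1)) / total

# 64.4% raw accuracy corresponds to 51.5 of 80 (fractional, as reported);
# applying the penalty recovers the 52.5% corrected score.
corrected = corrected_for_guessing(51.5)
```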
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2007). [http://www.cs.bham.ac.uk/~jxb/PUBS/BRM.pdf Extracting semantic representations from word co-occurrence statistics: A computational study]. &#039;&#039;Behavior Research Methods&#039;&#039;, 39(3), 510-526.&lt;br /&gt;
&lt;br /&gt;
Bullinaria, J.A., and Levy, J.P. (2012). [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.9582&amp;amp;rep=rep1&amp;amp;type=pdf Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD]. &#039;&#039;Behavior Research Methods&#039;&#039;, 44(3), 890-907.&lt;br /&gt;
&lt;br /&gt;
Dobó, A., and Csirik, J. (2013). [http://link.springer.com/chapter/10.1007/978-3-642-35843-2_42 Computing semantic similarity using large static corpora]. In: van Emde Boas, P. et al. (eds.) &#039;&#039;SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741&#039;&#039;. Springer-Verlag, Berlin Heidelberg, pp. 491-502&lt;br /&gt;
&lt;br /&gt;
Lushan Han. (2014). [http://ebiquity.umbc.edu/paper/html/id/658/Schema-Free-Querying-of-Semantic-Data Schema Free Querying of Semantic Data], Ph.D. dissertation, University of Maryland, Baltimore County, Baltimore, MD USA.&lt;br /&gt;
&lt;br /&gt;
Higgins, D. (2005). [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.329.1517 Which Statistics Reflect Semantics? Rethinking Synonymy and Word Similarity.] In: Kepser, S., Reis, M. (eds.) &#039;&#039;Linguistic Evidence: Empirical, Theoretical and Computational Perspectives&#039;&#039;. Mouton de Gruyter, Berlin, pp. 265–284.&lt;br /&gt;
&lt;br /&gt;
Hirst, G., and St-Onge, D. (1998). [http://mirror.eacoss.org/documentation/ITLibrary/IRIS/Data/1997/Hirst/Lexical/1997-Hirst-Lexical.pdf Lexical chains as representation of context for the detection and correction of malapropisms]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, 305-332.&lt;br /&gt;
&lt;br /&gt;
Jarmasz, M., and Szpakowicz, S. (2003). [http://www.csi.uottawa.ca/~szpak/recent_papers/TR-2003-01.pdf Roget’s thesaurus and semantic similarity], &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, September, pp. 212-219.&lt;br /&gt;
&lt;br /&gt;
Jiang, J.J., and Conrath, D.W. (1997). [http://wortschatz.uni-leipzig.de/~sbordag/aalw05/Referate/03_Assoziationen_BudanitskyResnik/Jiang_Conrath_97.pdf Semantic similarity based on corpus statistics and lexical taxonomy]. &#039;&#039;Proceedings of the International Conference on Research in Computational Linguistics&#039;&#039;, Taiwan.&lt;br /&gt;
&lt;br /&gt;
Karlgren, J. and Sahlgren, M. (2001). [http://www.sics.se/~jussi/Artiklar/2001_RWIbook/KarlgrenSahlgren2001.pdf From Words to Understanding]. In Uesaka, Y., Kanerva, P., &amp;amp; Asoh, H. (Eds.), &#039;&#039;Foundations of Real-World Intelligence&#039;&#039;, Stanford: CSLI Publications, pp. 294–308. &lt;br /&gt;
&lt;br /&gt;
Landauer, T.K., and Dumais, S.T. (1997). [http://lsa.colorado.edu/papers/plato/plato.annote.html A solution to Plato&#039;s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge]. &#039;&#039;Psychological Review&#039;&#039;, 104(2):211–240.&lt;br /&gt;
&lt;br /&gt;
Leacock, C., and Chodorow, M. (1998). [http://books.google.ca/books?id=Rehu8OOzMIMC&amp;amp;lpg=PA265&amp;amp;ots=IpnaLkZUec&amp;amp;lr&amp;amp;pg=PA265#v=onepage&amp;amp;q&amp;amp;f=false Combining local context and WordNet similarity for word sense identification]. In C. Fellbaum (ed.), &#039;&#039;WordNet: An Electronic Lexical Database&#039;&#039;. Cambridge: MIT Press, pp. 265-283.&lt;br /&gt;
&lt;br /&gt;
Lin, D. (1998). [http://www.cs.ualberta.ca/~lindek/papers/sim.pdf An information-theoretic definition of similarity]. &#039;&#039;Proceedings of the 15th International Conference on Machine Learning (ICML-98)&#039;&#039;, Madison, WI, pp. 296-304.&lt;br /&gt;
&lt;br /&gt;
Matveeva, I., Levow, G., Farahat, A., and Royer, C. (2005). [http://people.cs.uchicago.edu/~matveeva/SynGLSA_ranlp_final.pdf Generalized latent semantic analysis for term representation]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Pado, S., and Lapata, M. (2007). [http://www.nlpado.de/~sebastian/pub/papers/cl07_pado.pdf Dependency-based construction of semantic space models]. &#039;&#039;Computational Linguistics&#039;&#039;, 33(2), 161-199.&lt;br /&gt;
&lt;br /&gt;
Pilehvar, M.T., Jurgens D., and Navigli R. (2013). [http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf Align, disambiguate and walk: A unified approach for measuring semantic similarity]. &#039;&#039;Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013),&#039;&#039; Sofia, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Rapp, R. (2003). [http://www.amtaweb.org/summit/MTSummit/FinalPapers/19-Rapp-final.pdf Word sense discovery based on sense descriptor dissimilarity]. &#039;&#039;Proceedings of the Ninth Machine Translation Summit&#039;&#039;, pp. 315-322.&lt;br /&gt;
&lt;br /&gt;
Resnik, P. (1995). [http://citeseer.ist.psu.edu/resnik95using.html Using information content to evaluate semantic similarity]. &#039;&#039;Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95)&#039;&#039;, Montreal, pp. 448-453.&lt;br /&gt;
&lt;br /&gt;
Ruiz-Casado, M., Alfonseca, E. and Castells, P. (2005) [http://alfonseca.org/pubs/2005-ranlp1.pdf Using context-window overlapping in Synonym Discovery and Ontology Extension]. &#039;&#039;Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP-2005)&#039;&#039;, Borovets, Bulgaria.&lt;br /&gt;
&lt;br /&gt;
Terra, E., and Clarke, C.L.A. (2003). [http://acl.ldc.upenn.edu/N/N03/N03-1032.pdf Frequency estimates for statistical word similarity measures]. &#039;&#039;Proceedings of the Human Language Technology and North American Chapter of Association of Computational Linguistics Conference 2003 (HLT/NAACL 2003)&#039;&#039;, pp. 244–251.&lt;br /&gt;
&lt;br /&gt;
Tsatsaronis, G., Varlamis, I., and Vazirgiannis, M. (2010). [http://arxiv.org/abs/1401.5699 Text Relatedness Based on a Word Thesaurus]. &#039;&#039;Journal of Artificial Intelligence Research&#039;&#039;, 37, 1–39.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2001). [http://arxiv.org/abs/cs.LG/0212033 Mining the Web for synonyms: PMI-IR versus LSA on TOEFL]. &#039;&#039;Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001)&#039;&#039;, Freiburg, Germany, pp. 491-502.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D., Littman, M.L., Bigham, J., and Shnayder, V. (2003). [http://arxiv.org/abs/cs.CL/0309035 Combining independent modules to solve multiple-choice synonym and analogy problems]. &#039;&#039;Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03)&#039;&#039;, Borovets, Bulgaria, pp. 482-489.&lt;br /&gt;
&lt;br /&gt;
Turney, P.D. (2008). [http://arxiv.org/abs/0809.0124 A uniform approach to analogies, synonyms, antonyms, and associations]. &#039;&#039;Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)&#039;&#039;, Manchester, UK, pp. 905-912.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:State of the art]]&lt;br /&gt;
[[Category:Similarity]]&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11270</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11270"/>
		<updated>2015-10-29T15:49:24Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;András Dobó&lt;br /&gt;
PhD Student&lt;br /&gt;
PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
&lt;br /&gt;
Address: University of Szeged, Institute of Informatics&lt;br /&gt;
	 Room 222, 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
&lt;br /&gt;
2013/2014 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2013/2014 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2011/2012 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Databases tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Introduction to informatics tutorials&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2011-	PhD in Computer Science&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
	Research topic: Automatic interpretation of English and Hungarian noun&lt;br /&gt;
	compounds&lt;br /&gt;
&lt;br /&gt;
2009-2010	Master of Science in Computer Science&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2006-2009	Bachelor of Science in Computer Program Designer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
2010	English, level C2, Cambridge ESOL&lt;br /&gt;
&lt;br /&gt;
2005	German, level B2, Goethe Institut&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: Information Extraction from Hungarian, English and German CVs for a Career Portal. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. Springer International Publishing, Switzerland (2014) 333-341&lt;br /&gt;
&lt;br /&gt;
2.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2014) 359-360&lt;br /&gt;
&lt;br /&gt;
3.	Dobó, A., Csirik, J.: Computing semantic similarity using large static corpora. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. Springer-Verlag, Berlin Heidelberg (2013) 491-502&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A., Csirik, J.: Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 213-224&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A., Pulman, S.G.: Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 35-46&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A., Pulman, S.G.: Interpreting noun compounds using paraphrases. Procesamiento del Lenguaje Natural, Vol. 46 (2011) 59-66&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
	<entry>
		<id>https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11269</id>
		<title>User:Doboandris</title>
		<link rel="alternate" type="text/html" href="https://www.aclweb.org/aclwiki/index.php?title=User:Doboandris&amp;diff=11269"/>
		<updated>2015-10-29T15:49:00Z</updated>

		<summary type="html">&lt;p&gt;Doboandris: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;András Dobó&lt;br /&gt;
PhD Student&lt;br /&gt;
PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Contact information&lt;br /&gt;
Address: University of Szeged, Institute of Informatics&lt;br /&gt;
	 Room 222, 2 Árpád tér, Szeged, 6720, Hungary&lt;br /&gt;
Email:	 dobo@inf.u-szeged.hu&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Professional experience&lt;br /&gt;
2012-2014	Research mathematician&lt;br /&gt;
	nexum Magyarország Kft.	&lt;br /&gt;
&lt;br /&gt;
2010-2011	Software developer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary	&lt;br /&gt;
&lt;br /&gt;
2009		Software developer&lt;br /&gt;
	Biological Research Centre, Hungarian Academy of Sciences, Hungary	&lt;br /&gt;
&lt;br /&gt;
2008-2009	Software developer&lt;br /&gt;
	Szeged és Környéke Vízgazdálkodási Társulat, Szeged, Hungary&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
Teaching&lt;br /&gt;
2013/2014 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2013/2014 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2012/2013 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2011/2012 I. semester	Artificial intelligence I. tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 II. semester	Formal languages tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Databases tutorials&lt;br /&gt;
&lt;br /&gt;
2008/2009 I. semester	Introduction to informatics tutorials&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Education&lt;br /&gt;
2012	Guest student (1 semester)&lt;br /&gt;
	Georg-August Universität Göttingen&lt;br /&gt;
&lt;br /&gt;
2011-	PhD in Computer Science&lt;br /&gt;
	PhD School in Computer Science, University of Szeged, Hungary&lt;br /&gt;
	Research topic: Automatic interpretation of English and Hungarian noun&lt;br /&gt;
	compounds&lt;br /&gt;
&lt;br /&gt;
2009-2010	Master of Science in Computer Science&lt;br /&gt;
	Computing Laboratory, University of Oxford, UK&lt;br /&gt;
&lt;br /&gt;
2006-2009	Bachelor of Science in Computer Program Designer&lt;br /&gt;
	Institute of Informatics, University of Szeged, Hungary&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Language exams&lt;br /&gt;
&lt;br /&gt;
2010	English, level C2, Cambridge ESOL&lt;br /&gt;
&lt;br /&gt;
2005	German, level B2, Goethe Institut&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Publications&lt;br /&gt;
&lt;br /&gt;
1.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Nagy, Á., Vincze, V. and Zsibrita, J.: Information Extraction from Hungarian, English and German CVs for a Career Portal. In: Prasath, R. et al. (eds.) Mining Intelligence and Knowledge Exploration. LNAI, Vol. 8891. Springer International Publishing, Switzerland (2014) 333-341&lt;br /&gt;
&lt;br /&gt;
2.	Farkas, R., Dobó, A., Kurai, Z., Miklós, I., Miszori, A., Nagy, Á., Vincze, V. and Zsibrita, J.: Információkinyerés magyar nyelvű önéletrajzokból a nexum Karrierportálhoz. In: Tanács, A. et al. (eds.) X. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2014) 359-360&lt;br /&gt;
&lt;br /&gt;
3.	Dobó, A., Csirik, J.: Computing semantic similarity using large static corpora. In: van Emde Boas, P. et al. (eds.) SOFSEM 2013: Theory and Practice of Computer Science. LNCS, Vol. 7741. Springer-Verlag, Berlin Heidelberg (2013) 491-502&lt;br /&gt;
&lt;br /&gt;
4.	Dobó, A., Csirik, J.: Magyar és angol szavak szemantikai hasonlóságának automatikus kiszámítása. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 213-224&lt;br /&gt;
&lt;br /&gt;
5.	Dobó, A., Pulman, S.G.: Angol nyelvű összetett főnevek értelmezése parafrázisok segítségével. In: Tanács, A. and Vincze, V. (eds.) IX. Magyar Számítógépes Nyelvészeti Konferencia. SZTE Informatikai Tanszékcsoport, Szeged (2012) 35-46&lt;br /&gt;
&lt;br /&gt;
6.	Dobó, A., Pulman, S.G.: Interpreting noun compounds using paraphrases. Procesamiento del Lenguaje Natural, Vol. 46 (2011) 59-66&lt;/div&gt;</summary>
		<author><name>Doboandris</name></author>
	</entry>
</feed>