Evaluation by Association: A Systematic Study of Quantitative Word Association Evaluation

Recent work on evaluating representation learning architectures in NLP has established a need for evaluation protocols based on subconscious cognitive measures rather than manually tailored intrinsic similarity and relatedness tasks. In this work, we propose a novel evaluation framework that enables large-scale evaluation of such architectures in the free word association (WA) task, which is firmly grounded in cognitive theories of human semantic representation. This evaluation is facilitated by the existence of large manually constructed repositories of word association data. In this paper, we (1) present a detailed analysis of the new quantitative WA evaluation protocol, (2) suggest new evaluation metrics for the WA task inspired by its direct analogy with information retrieval problems, (3) evaluate various state-of-the-art representation models on this task, and (4) discuss the relationship between WA and prior evaluations of semantic representation with well-known similarity and relatedness evaluation sets. We have made the WA evaluation toolkit publicly available.


Introduction
The quality of word representations in semantic models is often measured using intrinsic evaluations that capture particular types of relationships (typically semantic similarity and relatedness) between word pairs (Finkelstein et al., 2002; Schnabel et al., 2015; Tsvetkov et al., 2015, inter alia).
Whereas the notions of semantic similarity and relatedness constitute key concepts in such evaluations, they are in fact vaguely defined (Batchkarov et al., 2016; Ettinger and Linzen, 2016). The construction of ground truth evaluation sets that reflect these relations, such as SimLex-999, SimVerb-3500, MEN (Bruni et al., 2014) or Rare Words (Luong et al., 2013), relies on manually constructed guidelines that trigger subjective human interpretation of the task at hand. This in turn introduces inter-annotator variability (Batchkarov et al., 2016) and does not account for the fact that human similarity judgements are asymmetric by nature (Tversky, 1977).
What is more, given that humans perform linguistic comparisons between concepts on a subconscious level (Kutas and Federmeier, 2011), it is at least debatable whether current similarity/relatedness evaluation sets fully capture the implicit relational structure underlying human language representation and understanding.
As evidenced by recent workshops on evaluation of semantic representations 1, the community appears to recognise that current evaluation methods are inadequate. To fill this gap, recent work has proposed using subconscious cognitive measures of semantic connection instead, as a proxy for measuring the ability of statistical models to tackle various problems in human language understanding (Ettinger and Linzen, 2016; Søgaard, 2016; Mandera et al., 2017).
Motivated by these insights, this work proposes an evaluation framework based on the word association (WA) task, firmly rooted in and described by the psychology literature, e.g., Nelson et al. (2000). 2 Word associations, provided as simple (cue, response) concept pairs, are naturally asymmetric: they tend to be given as a repository of ranked lists of concepts collected as responses (i.e., associations) given a target cue/query concept. The ranking of the response list is based on the WA strength between the cue and each generated response. WAs are directly tied to language use and the memory systems that support online linguistic processing (Till et al., 1988; Nelson et al., 1998).
We build our WA evaluation framework around a large repository of the University of South Florida (USF) association norms (Nelson et al., 2000). After post-processing, the repository contains ∼5K queries and ∼70,000 (cue, response) pairs, making it one of the largest semantic evaluation databases available (by contrast, the largest word pair scoring data sets in NLP, SimVerb and MEN, contain 3,500 and 3,000 word pairs respectively). This new resource enables comprehensive quantitative studies of WA and may be used to guide the future development of representation learning architectures.
While parts of the USF data set have been used for evaluation in NLP before (Michelbacher et al., 2007; Silberer and Lapata, 2012, inter alia), we conduct the first full study of evaluation on the quantitative WA task. We compare a wide variety of semantic representation models, discuss various evaluation metrics and analyse the links between word association and semantic similarity and relatedness. In summary, the main contributions of this paper are as follows: 3
(C1) We present an end-to-end evaluation framework for the WA task, and provide new evaluation metrics and detailed guidelines for evaluating semantic models on the WA task.
(C2) We conduct a systematic study and comparison of current state-of-the-art representation learning architectures on the WA task.
(C3) We present a systematic quantitative analysis of the connections between the models' performance on the subconscious WA task and their performance on benchmarking similarity and relatedness evaluation sets.

Motivation: Association and USF
Implicit Cognitive Measures: Means of Semantic Evaluation? Several studies have shown clear correspondence between implicit cognitive measures (most notably semantic priming) and semantic relations encountered in vector space models (VSMs) (McDonald and Brew, 2004; Jones et al., 2006; Padó and Lapata, 2007; Herdagdelen et al., 2009), suggesting that some of the implicit relation structure in the human brain is already reflected in current statistical models of meaning.
These findings encouraged Ettinger and Linzen (2016) to propose a preliminary evaluation framework based on semantic priming experiments (Meyer and Schvaneveldt, 1971). 4 They demonstrate the feasibility of such an evaluation using a subconscious language processing task. They use the online database of the Semantic Priming Project (SPP), which compiles priming data for over 6,000 word pairs.
Here, we go one step further and demonstrate that another subconscious language processing task, with much more available data, can also be used to evaluate representations. We construct an evaluation framework based on the USF free word association (WA) norms quantifying the strength of association between cue and response concepts for more than 70,000 concept pairs.
Word Association WA has been a long-standing research topic in cognitive psychology, as evidenced by the following statement (Deese, 1966): "Are there any more fascinating data in psychology than tables of association?" Word association remains one of the fundamental questions in cognitive psychology, as emphasised, e.g., by the following statement: "Association has been part of the theoretical armory of cognitive psychologists since Thomas Hobbes used the notion to account for the structure of our 'trayne of thoughts' in 1651."
These insights illustrate how WA can provide a useful benchmark for evaluating models of human semantic representation. WA norms are commonly used in constructing memory experiments (Dennis and Humphreys, 2001; Steyvers and Malmberg, 2003), and statistics derived from them have been shown to be important in predicting cued recall and recognition (Nelson et al., 1998), and false memories (Roediger et al., 2001). 5
WA Evaluation Set: USF The USF norms data set (hereafter USF) is the largest database of free word association collected for English. It was generated by presenting human subjects with one of 5,000 cue concepts and asking them to write the first word coming to their mind that is associated with that concept. Each cue concept was normed by at least 100 participants, resulting in a set of associates (or responses) for each cue, for a total of ∼72,000 (cue, response) pairs. A sample of the USF data is presented in Tab. 1. The data are accessible online. 6 For each such pair, the proportion of participants that produced the response $w_r$ when presented with cue word $w_c$ can be used as a proxy for the strength of association between the two words (FSG in Tab. 1). BSG denotes the backward association strength, obtained when the roles of cue and response are reversed; the difference between FSG and BSG shows that the WA relation is inherently asymmetrical.
5 From another viewpoint, the WA evaluation aims to answer a different question than a typical intrinsic evaluation on data sets such as SimLex-999, MEN, WordSim-353, or SimVerb-3500. The goal of the latter is to assess the quality of learned text representations as a proxy for downstream NLP tasks. The goal of the former is to assess the capability of representation learning and NLP architectures to help in advancing our understanding and modeling of human cognitive processes (occurring on a subconscious level), while at the same time it could still be used as a proxy evaluation in NLP.
6 http://w3.usf.edu/FreeAssociation/

Evaluation Protocol
Terminology $W_c = \{w_{c_1}, \ldots, w_{c_i}, \ldots, w_{c_{|W_c|}}\}$ denotes a set of $|W_c|$ cue or normed words (more generally, concepts) in the evaluation set. For each cue word $w_{c_i}$, the data set contains a ranked list of concepts or responses $R_i$ sorted according to the strength of forward association, from cue to response (i.e., the FSG field in Tab. 1). The list $R_i$ contains entries of the format $w_{r,j} : fsg_{i,j}$, where $w_{r,j}$ is the $j$-th most associated concept in the ranked list, and $fsg_{i,j}$ is the accompanying strength of forward association between cue $w_{c_i}$ and response $w_{r,j}$. Let $R^g_i$ refer to the ground truth ranked list for $w_{c_i}$, which contains only responses where $fsg_{i,j} > 0$ in the USF data, and $R^s_i$ to the ranked list retrieved by an automatic system.
The vocabulary or search space from which responses for all cues are drawn is labeled $V_r$. Note that $V_r$ may also contain words from $W_c$ and that $V_r$ may contain words that do not occur in any of the ground truth lists $R^g_i$.
Why Evaluate on Word Association? A standard evaluation protocol with word pair scoring evaluation sets such as SimLex-999 or MEN is to compute Spearman's ρ correlations between the ranking obtained by an automatic system and the ground truth ranking. This protocol, however, is not directly applicable to the USF test data. First, the evaluated relation of WA is asymmetric, and the pairs (X, Y) and (Y, X) may differ dramatically in their WA scores (see the difference in FSG and BSG values from Tab. 1). Second, instead of one global list of pairs, the data comprises a series of ranked lists conditioned on the cue/normed word $w_c$ (see Tab. 1 again). Finally, unlike with SimLex-999 or MEN scores, where it is difficult to interpret "what a similarity/relatedness of 7.69 exactly means" (Batchkarov et al., 2016; Avraham and Goldberg, 2016), the USF FSG scores have a direct meaningful interpretation: FSG = #P/#G, i.e., the proportion of participants presented with the cue who produced the response.
To fully capture all aspects of the ground truth USF data set, an evaluation protocol should ideally be based not only on response rankings, but also on the actual scores, i.e., the association strength.
In this paper, we propose and investigate two different families of evaluation metrics on the USF data: Sect. 3.1 discusses rank correlation evaluation metrics inspired by recent work on the evaluation of vector space models in distributional semantics (Bruni et al., 2014, inter alia). Sect. 3.2 draws inspiration from research on evaluation in information retrieval (IR). We show that the problem of evaluating USF association lists may be naturally framed as an ad-hoc IR task (Manning et al., 2008). This enables the application of standard IR evaluation methodology.

Rank Correlation Evaluation
Averaged Standard Spearman's Correlation The first protocol, labeled ρ-std, first computes the standard Spearman's ρ correlation between $R^g_i$ and $R^s_i$. The system list $R^s_i$ is pruned so that it contains only those items that also occur in $R^g_i$. The two lists are then correlated to obtain the score $\rho_i$ for cue $w_{c_i}$. Following that, the correlation scores are averaged. First, we apply the Fisher z-transformation (Fisher, 1915) and then average over the transformed scores:

\[ z_i = \operatorname{arctanh}(\rho_i) = \frac{1}{2} \ln \frac{1 + \rho_i}{1 - \rho_i} \qquad (1) \]

\[ z_{avg} = \frac{1}{|W_c|} \sum_{i=1}^{|W_c|} z_i \qquad (2) \]

The final output score is obtained by applying the inverse z-transformation on $z_{avg}$:

\[ \rho_{avg} = \tanh(z_{avg}) = \frac{e^{2 z_{avg}} - 1}{e^{2 z_{avg}} + 1} \qquad (3) \]
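The ρ-std protocol described above can be illustrated with a minimal Python sketch. It is not the released toolkit: the function names, the dictionary layout (cue mapped to response scores), the handling of ties by mean ranks, and the clipping constant `eps` that keeps the Fisher transformation finite when a per-cue correlation equals ±1 are all assumptions made for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, largest value first; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho as the Pearson correlation of rank vectors (None if undefined)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def rho_std(gold, system, eps=1e-6):
    """gold, system: dict cue -> dict response -> score (hypothetical layout)."""
    zs = []
    for cue, gold_scores in gold.items():
        system_scores = system.get(cue, {})
        # Prune the system list to responses that also occur in the gold list.
        shared = [r for r in gold_scores if r in system_scores]
        if len(shared) < 2:
            continue
        rho = spearman([gold_scores[r] for r in shared],
                       [system_scores[r] for r in shared])
        if rho is None:
            continue
        rho = max(-1 + eps, min(1 - eps, rho))  # assumption: clip so atanh stays finite
        zs.append(math.atanh(rho))              # Fisher z-transformation
    if not zs:
        raise ValueError("no cue with at least two shared responses")
    return math.tanh(sum(zs) / len(zs))         # inverse transformation of the mean
```

Cues with fewer than two shared responses, or with constant scores, are skipped here because the correlation is undefined for them; a real evaluation would need to state how such cues are treated.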

Averaged Weighted Spearman's Correlation
The previous protocol treats all ranks equally, despite the fact that the system should be rewarded more for getting the strongest responses correct (and penalised when failing to do so). Therefore, we also experiment with weighted rank correlation measures, which weigh the distance between two ranks, and assign more importance to higher ranks (i.e., in our setting, to stronger associates). Several weighted correlation metrics have been proposed (Blest, 2000; Pinto da Costa and Soares, 2005; Dancelli et al., 2013; Pinto da Costa, 2015). We show results with the weighted Spearman's correlation (further labelled ρ-w) from Pinto da Costa (2015). 7 Let us denote by $Q_1 = [Q_{1,1}, Q_{1,2}, \ldots, Q_{1,n}]$ and $Q_2 = [Q_{2,1}, Q_{2,2}, \ldots, Q_{2,n}]$ two vectors of ranks obtained on a sample of size $n$. The weighted rank correlation ρ between the two vectors is computed as in Eq. (4).
7 We also experimented with other weighted variants, but detected similar trends in reported model rankings.
We refer the interested reader to the relevant literature (Pinto da Costa, 2015) for further details, theoretical implications and property proofs related to Eq. (4). The scores $\rho_i$ for all cue words in $W_c$ are then obtained using Eq. (4), and the averaged score $\rho_{avg}$ is computed as before, see Eqs. (1)-(3).
While the two metrics are intuitive and capture the ability of models to correctly rank (a subset of) associates/responses, note that they have deficiencies. They only evaluate the rankings of words occurring in $R^g_i$, which effectively reduces the search space $V_r$ to the small subset $\{w_1, \ldots, w_{|R^g_i|}\} \subset V_r$. This means that the final score simply ignores incorrect responses that are ranked highly by a system but that do not occur in $R^g_i$. The two metrics also do not take into account the actual strength of association.
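To make the idea of a top-weighted rank correlation concrete, the sketch below implements one widely cited form of weighted Spearman's coefficient, the r_W measure of Pinto da Costa and Soares (2005), in which each squared rank difference is weighted by how close the two ranks are to the top. Whether this matches Eq. (4) term by term is an assumption; the function name is hypothetical.

```python
def weighted_spearman(q1, q2):
    """Weighted rank correlation r_W for two rank vectors (rank 1 = strongest).

    Assumed form (Pinto da Costa and Soares, 2005):
        r_W = 1 - 6 * sum((a - b)^2 * ((n - a + 1) + (n - b + 1)))
                  / (n^4 + n^3 - n^2 - n)
    """
    n = len(q1)
    if n != len(q2) or n < 2:
        raise ValueError("need two rank vectors of equal length >= 2")
    numerator = 6 * sum(
        (a - b) ** 2 * ((n - a + 1) + (n - b + 1)) for a, b in zip(q1, q2)
    )
    denominator = n ** 4 + n ** 3 - n ** 2 - n
    return 1 - numerator / denominator


# Swapping the two strongest responses costs more than swapping the two weakest.
top_swap = weighted_spearman([1, 2, 3, 4], [2, 1, 3, 4])
bottom_swap = weighted_spearman([1, 2, 3, 4], [1, 2, 4, 3])
```

Per-cue scores obtained this way would then be averaged with the same Fisher z-transformation as for ρ-std.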

IR-Inspired Evaluation
Intuition Another set of evaluation metrics is inspired by the resemblance of the USF data structure to the typical output of ad-hoc IR systems (Manning et al., 2008; Pound et al., 2010). That is, each cue word $w_c$ can be thought of as an input query issued against some target concept collection $V_r$, where the goal of our association retrieval system is to rank items from the target collection according to their relevance (i.e., their association strength) to the issued query. The output of the system is the ranked list $R^s_i$ of length $|V_r|$, with ground truth relevance assessments provided in $R^g_i$.
MRR and MAP The first two metrics assume non-weighted or binary relevance: the retrieved response is either relevant to the issued cue (labeled 1) or it is non-relevant (0). We assume that all responses found in the ground truth lists $R^g_i$ where $fsg_{i,j} > t$ are relevant responses, where $t$ is a threshold. 8 We label this reduced set of relevant responses $RR^g_i$. The most lenient evaluation metric is Mean Reciprocal Rank (MRR) (Voorhees, 1999; Craswell, 2009). The reciprocal rank of a query response is the multiplicative inverse of the rank of the first relevant answer, and the final score is then averaged over all $|W_c|$ queries/cues. More formally:

\[ MRR = \frac{1}{|W_c|} \sum_{i=1}^{|W_c|} \frac{1}{rank_i} \]

where $rank_i$ is the rank position of the first relevant response (i.e., the first response found in the set $RR^g_i$) for the cue word $w_{c_i}$. Since MRR cannot assess multiple correct answers and their ranking in the retrieved list, an alternative metric is Mean Average Precision (MAP):

\[ MAP = \frac{1}{|W_c|} \sum_{i=1}^{|W_c|} AP(w_{c_i}), \qquad AP(w_{c_i}) = \frac{\sum_{k=1}^{N} P_k \cdot irel_k}{|RR^g_i|} \]

Here, $AP(w_{c_i})$ denotes Average Precision for query/cue $w_{c_i}$, and $N \leq |V_r|$ denotes the number of responses retrieved by the system. $P_k$ is the precision at cut-off $k$ in the list, and $irel_k$ is an indicator function which 'turns on' only if the response at rank $k$ is a relevant response (i.e., present in $RR^g_i$).
The average is computed over all relevant responses, and the non-retrieved relevant responses from $V_r$ get a precision score of 0. $N \ll |V_r|$ is typically used (e.g., standard values are N = 100 or N = 1000) to reduce the execution time of the evaluation procedure, since it is expected that a good retrieval system should obtain a majority of relevant responses in the first N responses.
Compared to measures from Sect. 3.1, MRR and MAP are better estimators of the model's ability to capture word association, as they operate over the entire search space $V_r$ for each cue word. This effectively means that systems get rewarded if they are able to consistently rank relevant responses higher than non-relevant responses. However, these metrics still rely on binary non-weighted relevance judgements, and are therefore unable to reward models that rank highly relevant responses (i.e., strongly associated responses, see Tab. 1) higher than weakly relevant responses.
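The two binary-relevance metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the released evaluation code: the function names and the data layout (a ranked list of responses per cue and a set of relevant responses per cue) are hypothetical, and a cue for which no relevant response is retrieved is assumed to contribute a reciprocal rank of 0.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant response; 0.0 if none is retrieved (assumption)."""
    for k, response in enumerate(ranked, start=1):
        if response in relevant:
            return 1.0 / k
    return 0.0


def average_precision(ranked, relevant, n=1000):
    """Precision at each relevant hit in the top n, averaged over ALL relevant responses."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, response in enumerate(ranked[:n], start=1):
        if response in relevant:
            hits += 1
            total += hits / k  # precision at cut-off k
    # Relevant responses that were never retrieved contribute a precision of 0.
    return total / len(relevant)


def mean_over_cues(metric, system, gold):
    """Average a per-cue metric over all cues in the gold standard (MRR or MAP)."""
    return sum(metric(system.get(cue, []), gold[cue]) for cue in gold) / len(gold)


system = {
    "lunch": ["food", "table", "dinner", "box"],
    "cat": ["moon", "tree", "dog"],
}
gold = {"lunch": {"dinner", "food", "sandwich"}, "cat": {"dog"}}
mrr = mean_over_cues(reciprocal_rank, system, gold)
map_score = mean_over_cues(average_precision, system, gold)
```

In the toy data, "sandwich" is relevant for "lunch" but never retrieved, so it lowers the Average Precision for that cue without affecting its reciprocal rank.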
NDCG@k In other words, the most expressive evaluation metric should be able to distinguish that cue-response pairs such as (lunch, dinner) and (lunch, food) should be ranked higher than weakly associated pairs such as (lunch, box) or (lunch, sandwich). In addition, the metric should still reward models that rank relevant responses higher than non-relevant ones.
An IR metric which takes all these aspects into account is Discounted Cumulative Gain (DCG) (Järvelin and Kekäläinen, 2002). DCG operates with weighted relevance values: in the USF scenario, these are forward association strengths, i.e., scores $fsg_{i,j}$. The main idea behind using DCG is that highly relevant responses appearing lower in a ranked list should be penalised. The penalty is implemented by reducing the weighted relevance value logarithmically proportional to the position of the particular response. We opt for a more recent variant of DCG which puts more emphasis on retrieving relevant responses (Burges et al., 2005). DCG@k, the DCG score accumulated at a particular rank position $k$, is computed as follows:

\[ DCG@k = \sum_{i=1}^{k} \frac{2^{wrel_i} - 1}{\log_2(i + 1)} \]

where $wrel_i$ is the graded relevance of the response at rank $i$ given by the ground truth data, i.e., the corresponding FSG score if the cue-response pair occurs in $R^g_i$, or 0 otherwise. To make results comparable across different queries, a normalised variant of DCG is typically used. First, all relevant responses are sorted by their graded relevance value, producing the maximum possible DCG at each position $k$. The score of the ideal ranking at rank $k$ is called Ideal DCG (IDCG@k). NDCG@k for a single query is then:

\[ NDCG@k = \frac{DCG@k}{IDCG@k} \]

Finally, the mean NDCG@k is produced for the entire collection $W_c$ by averaging over all single NDCG@k values. In all experiments we rely on a standard choice for k: NDCG@100, while similar trends are observed with NDCG@10.
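A minimal Python sketch of NDCG@k for a single cue follows. It assumes the exponential-gain form commonly attributed to Burges et al. (2005), with gain 2^rel − 1 and discount log2(rank + 1); the function names and the dictionary layout for gold FSG scores are hypothetical, and a cue without any gold response is assumed to score 0.

```python
import math


def dcg_at_k(gains, k):
    """DCG@k with exponential gain and log2 position discount (assumed variant)."""
    return sum(
        (2 ** gain - 1) / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranked, fsg, k=100):
    """ranked: system responses for one cue; fsg: dict response -> gold FSG score."""
    # Graded relevance is the FSG score, or 0 for responses absent from the gold list.
    gains = [fsg.get(response, 0.0) for response in ranked]
    ideal_gains = sorted(fsg.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


fsg_lunch = {"dinner": 0.30, "food": 0.20, "box": 0.05}
perfect = ndcg_at_k(["dinner", "food", "box"], fsg_lunch, k=3)
reversed_order = ndcg_at_k(["box", "food", "dinner"], fsg_lunch, k=3)
with_intruder = ndcg_at_k(["table", "dinner", "food", "box"], fsg_lunch, k=3)
```

The mean NDCG@k over the whole evaluation set is then the plain average of the per-cue values.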

Experimental Setup and Models
LDA-Based Approach First, we evaluate an approach based on latent topic modeling, rooted in the psychology literature (Steyvers et al., 2004). 9 The following quantitative model of word association has been proposed:

\[ P(w_r | w_c) = \sum_{i=1}^{M} P(w_r | to_i) \, P(to_i | w_c) \]

where $w_c$ is a cue word, $w_r \in V_r$ is any concept from the search space, and $to_i$ is the $i$-th latent topic from the set of $M$ topics induced from the corpus data (using LDA). We label this model LDA-assoc. The probability scores $P(w_r | to_i)$ select words that are highly descriptive for each particular topic. The scores $P(to_i | w_c)$ are computed as in prior work, by assuming topic independence and applying Bayes' rule on the LDA output per-topic word distributions $P(\cdot | to_i)$ (Vulić and Moens, 2013). 10 We train LDA with 1,000 topics using suggested parameters.
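The LDA-assoc scoring step can be sketched as follows, given per-topic word distributions that are assumed to have been estimated already. The sketch is illustrative only: the function names and the list-of-dictionaries layout are hypothetical, and the uniform topic prior used when none is supplied is an assumption, since the text only states that Bayes' rule is applied under topic independence.

```python
def topic_posterior(cue, topic_word, topic_prior=None):
    """P(topic | cue) via Bayes' rule from per-topic word distributions P(word | topic)."""
    num_topics = len(topic_word)
    # Assumption: uniform prior over topics unless one is given explicitly.
    prior = topic_prior or [1.0 / num_topics] * num_topics
    joint = [topic_word[i].get(cue, 0.0) * prior[i] for i in range(num_topics)]
    norm = sum(joint)
    if norm == 0:
        raise ValueError(f"cue {cue!r} has zero probability under every topic")
    return [value / norm for value in joint]


def lda_assoc(cue, response, topic_word, topic_prior=None):
    """P(response | cue) = sum_i P(response | topic_i) * P(topic_i | cue)."""
    posterior = topic_posterior(cue, topic_word, topic_prior)
    return sum(
        topic_word[i].get(response, 0.0) * posterior[i]
        for i in range(len(topic_word))
    )


# Two toy topics; each dictionary is a distribution P(word | topic).
toy_topics = [
    {"lunch": 0.4, "dinner": 0.4, "food": 0.2},
    {"box": 0.5, "lunch": 0.1, "paper": 0.4},
]
forward = lda_assoc("lunch", "dinner", toy_topics)   # cue lunch, response dinner
backward = lda_assoc("dinner", "lunch", toy_topics)  # roles reversed
```

Because the posterior over topics depends on the cue, the forward and backward scores generally differ, which mirrors the asymmetry of the FSG and BSG values in the USF norms.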

Count-Based Models
We evaluate the best performing reduced count-based model from . We label this model count-ppmi-500d. 11 For a more detailed description of the model's training data and setup we refer the reader to the original work and supplementary material.
Vector Space Models We also compare the performance of prominent representation models on the WA USF task. We include: (1) unsupervised models that learn from distributional information in text, including Glove (Pennington et al., 2014) with d = 50 and d = 300 dimensions (glove-6B-50d and glove-6B-300d), the skip-gram negative-sampling (SGNS) 300-dimensional vectors (Mikolov et al., 2013) with various contexts (bow = bag-of-words; deps = dependency contexts) as in Levy and Goldberg (2014) (sgns-pw-bow-w2, sgns-pw-bow-w5, sgns-pw-deps, sgns-8b-bow-w2), and the symmetric-pattern based vectors (sympat-500d); (2) models that rely on linguistic hand-crafted resources or curated knowledge bases. Here, we use vectors fine-tuned to a paraphrase database (paragram-25d, paragram-300d; Wieting et al., 2015), further refined using linguistic constraints (paragram+cf-300d; Mrkšić et al., 2016); (3) multilingual embedding models (biskip-256d), including those from Faruqui and Dyer (2014) (bicca-512d). More detailed descriptions of all VSM models are available in the listed papers and supplementary material attached to this work.
10 The generative model closely resembles the actual process in the human brain: when we generate responses, we first tend to associate the cue word with a related semantic/cognitive concept, i.e., a latent topic (the factor $P(to_i | w_c)$), and then, after establishing the concept, we output a list of words that we consider the most prominent/descriptive for that concept (words with high scores in the factor $P(w_r | to_i)$).
11 We have also experimented with simple count-based asymmetric association measures proposed by Michelbacher et al. (2007), estimated using the same corpus as the count-ppmi-500d model. We do not report the results with these measures, as they show very poor performance when compared to all other models in our comparison.

USF Data Processing and Parameters
Only USF pairs where both words are single word expressions were retained, and the rest were discarded. This yields 4,992 single word queries in total. The total number of finally retained USF pairs is ≈ 70,000. Note that this evaluation set is an order of magnitude larger than current benchmarking word pair scoring datasets such as MEN (3000 word pairs in total), SimVerb (3500), SimLex (999) and Rare Words (2034), and thus allows for a truly comprehensive evaluation of quantitative WA models. Only responses generated by at least 3 human subjects in each list of responses are taken as relevant in all experiments (see Foot. 7 in Sect. 3.2); all other (cue, response) pairs and pairs not present in the USF data are considered non-relevant. 12
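The filtering described above can be sketched in Python as follows. The row layout `(cue, response, group_size, produced)`, standing for the cue, the response, the number of participants normed on the cue (#G) and the number who produced the response (#P), is a hypothetical simplification of the USF files, and the function name is an assumption.

```python
def build_gold(rows, min_subjects=3):
    """Return (fsg, relevant): graded FSG scores and binary relevance sets per cue.

    rows: iterable of (cue, response, group_size, produced) tuples (assumed layout).
    """
    fsg, relevant = {}, {}
    for cue, response, group_size, produced in rows:
        # Keep only pairs where both cue and response are single-word expressions.
        if len(cue.split()) != 1 or len(response.split()) != 1:
            continue
        fsg.setdefault(cue, {})[response] = produced / group_size  # FSG = #P / #G
        relevant.setdefault(cue, set())
        # Binary relevance: the response was produced by at least min_subjects people.
        if produced >= min_subjects:
            relevant[cue].add(response)
    return fsg, relevant


toy_rows = [
    ("lunch", "dinner", 150, 45),
    ("lunch", "box", 150, 2),
    ("ice cream", "cold", 100, 30),
    ("cat", "dog", 100, 50),
]
toy_fsg, toy_relevant = build_gold(toy_rows)
```

In this toy example, the pair with the multi-word cue is dropped entirely, while (lunch, box) keeps its graded FSG score but is not counted as relevant for the binary metrics.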

Results and Discussion
Exp. I: Making the Evaluation Tractable Computational complexity is not an issue for standard semantic benchmarks such as SimLex-999 or MEN: these data sets require only $N_{gt}$ similarity computations in total, where $N_{gt}$ is the number of word pairs in each benchmark (999 or 3000). However, complexity plays a major role in the USF evaluation: the system has to compute $|W_c| \cdot |V_r|$ similarity scores, where $|W_c| \approx 5{,}000$, and $|V_r|$ is large for large vocabularies (typically covering > 100K words). In addition, each list of $|V_r|$ items has to be sorted according to the WA strength: this means that the complexity is $O(|W_c| \cdot (|V_r| + |V_r| \log |V_r|))$.
Since this is prohibitively expensive, our solution is to restrict the search space $V_r$ only to words (both cues and responses) occurring in USF: $|V_r| = 10{,}070$. 13 Besides the gains in evaluation efficiency, when using the USF vocabulary all models operate over exactly the same search space: therefore, their results are directly comparable as the data coverage bias should be largely mitigated.
12 For efficiency reasons with IR metrics, we evaluate results only over the top N = 1000 retrieved responses for each cue.
13 Prior work shows that the USF data represents a good range of distinct semantic phenomena, which suggests that the USF vocabulary represents a balanced sample of the English vocabulary.
0.174 (7) 0.048 (6) 0.121 (7) 0.309 (7) 0.092 (7) 0.198 (7)
paragram+cf-300d [4971] 0.221 (5) 0.067 (4)
Table 3: Results on the USF WA task using different evaluation metrics proposed in Sect. 3. $V_r = USF$ for all models. The best results per column are in bold, second best in italic.
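The retrieval step over the restricted search space can be sketched as below. With the figures given in the text, restricting the search space means roughly 4,992 × 10,070 ≈ 5 × 10^7 cue-candidate scores instead of about 5 × 10^8 for a 100K-word vocabulary. The sketch assumes cosine similarity as the scoring function and excludes the cue from its own response list; both choices, as well as the function names, are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rank_responses(cue, vectors, search_space, n=1000):
    """Rank candidates from the search space by similarity to the cue; keep the top n."""
    cue_vector = vectors[cue]
    scored = [
        (cosine(cue_vector, vectors[word]), word)
        for word in search_space
        if word != cue and word in vectors  # skip the cue itself and uncovered words
    ]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:n]]


toy_vectors = {
    "lunch": [1.0, 0.0],
    "dinner": [0.9, 0.1],
    "box": [0.1, 0.9],
    "moon": [0.0, 1.0],
}
toy_space = ["dinner", "box", "moon", "lunch", "unknown"]
toy_ranking = rank_responses("lunch", toy_vectors, toy_space)
```

The resulting ranked list per cue is what the rank correlation and IR-style metrics consume.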
To fully support this choice, we perform a simple experiment using a subset of models from Sect. 4. In the first evaluation, $V_r$ contains the most frequent 100K words for all models, where frequency was computed on their respective training data. In the second evaluation, $V_r$ contains only the USF vocabulary words. The results with IR-style metrics are shown in Tab. 2, and similar trends are observed with Spearman's ρ correlations.
The results support several conclusions. (i) Coverage over cue words is very high for all models (the model with the lowest coverage from Tab. 2 has a coverage of 98.2%). This, along with the same search space (the USF vocabulary), indicates a fair comparison of different models. (ii) Different IR metrics produce consistent model rankings, with a slight variation in the middle of the rankings. Interestingly, the best scoring model is Glove, a model which uses document-level co-occurrence, which steers it towards learning topical similarity. On the other hand, the worst performing model relies on dependency-based contexts which better capture functional similarity (Levy and Goldberg, 2014) and outperform other context choices in word similarity tasks on SimLex and SimVerb (Melamud et al., 2016). (iii) Most importantly, the reduction of $V_r$ again yields consistent rankings with all metrics, which are also fairly consistent with the rankings obtained in the ten times larger 100K search space. Therefore, in all further experiments we use the USF vocabulary as our search space.
Exp. II: Results on USF WA Next, we evaluate all models from Sect. 4 on the WA task. The results with different metrics are summarised in Tab. 3. The results suggest that all proposed evaluation metrics indeed reflect the ability of different models to capture WA. We observe strong correlations of the models' rankings with all five metrics (Tab. 4). ρ-w is a slightly more conservative metric than ρ-std on average, but it does not affect model rankings at all (see also Tab. 4).
Further, the LDA-based WA model is largely outperformed by VSM-based approaches. As expected, similar VSMs with more dimensions are more expressive and score higher (e.g., note the scores with glove and paragram models). Additionally, models trained on larger corpora are also able to improve the overall results (e.g., note the scores with sgns trained on the Polyglot Wikipedia (PW, 2B tokens) vs. the 8B word2vec corpus). The paragram models specialised for similarity tasks are unable to match unsupervised VSMs that train on running text (e.g., paragram+cf-300d obtains a SimLex score of 0.74 compared to 0.46 with sgns-8b-bow-w2).
Two models using bilingual training (biskip-256d and bicca-512d) seem unable to match the best performing monolingual models: however, we plan to further analyse the influence of bilingual information in the WA task in future work. Finally, a comparison of sgns-pw-* models (where the only varied parameter is the context used in training) reveals that (i) larger windows improve WA scores (we test this phenomenon further in Exp. III), and (ii) sgns-pw-deps, which captures functional similarity through dependency-based contexts, yields lower WA scores, while it improves on SimLex-999 compared to the other two models. This insight leads us to further investigate this phenomenon in Exp. IV.
Exp. III: Window Size In the next experiment, we analyse the effect of the window size on models' ability to capture similarity, relatedness, and association. We train the sgns-pw-bow model (d = 300) with varying window sizes in the interval [1, 30]. The results on similarity (SimLex-999), relatedness (MEN), and WA benchmarks (USF) are presented in Fig. 1(a)-1(b). It is clear that using larger windows deteriorates the performance on SimLex-999 as the focus of the model is shifted from functional to topical similarity. This shift has been detected in prior work on vector space models. However, we also observe a similar trend with MEN scores, although an opposite effect was expected, which questions the ability of MEN to accurately evaluate relatedness. The opposite effect is, however, visible with the WA evaluation, where it is evident that larger windows (leading to topical similarity) lead to better WA estimates. This also provides the first hint that WA and semantic similarity capture two completely distinct semantic phenomena.
Exp. IV: WA vs. Similarity vs. Relatedness We delve deeper into this conjecture by computing correlations between model rankings on the WA task and two prominent similarity and relatedness data sets. The results from Tab. 4 indicate the following. First, semantic relatedness and similarity are correlated although they clearly refer to two distinct semantic phenomena, as emphasised in prior work. The correlations between different metrics proposed for the WA task are very high (e.g., the lowest correlation score between any two of them is ρ = 0.921). Second, WA and similarity capture very distinct relations (this is evident from low, even negative ρ correlation scores). Third, WA and relatedness are strongly correlated, 14 but the correlation is not as high as expected, given that the two are often considered equivalent, e.g., by Kiela et al. (2015). Future work should investigate whether the difference originates from inadequate evaluation data and protocols (see Fig. 1(a)-1(b) again), or whether the difference is fundamental.

Conclusion and Future Work
We have proposed and released a new end-to-end evaluation framework for the task of free word association (WA). We have also provided new evaluation metrics inspired by research in IR, and guidelines for evaluating semantic representation models on the quantitative WA task.
Besides serving as a gold standard in NLP, the comprehensive WA evaluation resource and accompanying evaluation protocol should enable the development of data-driven automatic systems that can capture the notion of word association, and further analysis on how humans perceive (types of) semantic relatedness and similarity (Spence and Owens, 1990;Maki and Buchanan, 2008;De Deyne et al., 2013). These systems, as discussed in this paper, may additionally facilitate research in cognitive psychology pertaining to human semantic representation and memory.
In future work, we plan to test the portability of the evaluation protocol and apply it to other repositories of word association data in English (De Deyne et al., 2016), as well as in other languages, using existing WA tables in, e.g., German (Schulte im Walde et al., 2008), Dutch (De Deyne and Storms, 2008; Brysbaert et al., 2014), Italian (Guida and Lenci, 2007), Japanese (Joyce, 2005), or Cantonese (Kwong, 2013). 15 In another line of future work, we will experiment with other "cognitively plausible" evaluation data such as N400 (Kutas and Federmeier, 2011), and will analyse the similarities and differences between WA and other such "cognitive" evaluation protocols, such as the one relying on semantic priming (SPP) (Hutchison et al., 2013; Ettinger and Linzen, 2016).
All evaluation scripts and detailed guidelines related to this work are freely available at: github.com/cambridgeltl/wa-eval/