Modeling bilingual word associations as connected monolingual networks

Word associations are a common tool in research on the mental lexicon. Studies report that bilinguals produce different word associations in their non-native language than monolinguals, and propose at least three mechanisms responsible for this difference: bilinguals may rely on their native associations (through translation), on collocational patterns, and on the phonological similarity between words. In this paper, we first test the differences between monolingual and bilingual responses, showing that these differences are consistent and significant. Second, we present a computational model of bilingual word associations, implemented as a semantic network paired with a retrieval mechanism. Our model predicts bilingual word associations better than monolingual baselines, and translation is the main mechanism explaining its success, while collocational and phonological associations do not improve the model.


Introduction
In a free association task, participants are given a cue word (e.g., apple) and produce the first word that comes to their mind (e.g., red or fruit). Free associations have been a common tool in the study of the mental lexicon because the observed pattern of associations can reflect the nature and strength of connections between words in semantic memory.
We focus on free associations as a means to better understand the structure and processing of the mental lexicon in bilinguals. Bilingual word associations have been studied for decades (see an overview in Meara, 2009). Despite a number of important findings, which we summarize in the following section, high-level conclusions about the association norms in bilinguals' non-native language are unclear, not only because of high variability in bilingual populations (DeKeyser, 2013), but also due to methodological factors (as explained by Boulton, 2003; Krzemińska-Adamek, 2014). Of specific concern for us is the lack of robust statistical analyses of the results. Many studies provide a selective qualitative analysis of the responses, and their findings can be inconsistent. In particular, it is unclear whether there are significant differences between native and non-native word associations (as compared, for example, to the instability of responses within a group of speakers over time).
We address this issue by providing a statistical analysis of the differences in English word association responses of Dutch[L1]-English[L2] bilinguals (collected by van Hell and de Groot, 1998) compared to English monolingual word association norms. After demonstrating a quantifiable difference between them, we then present the first computational model of bilingual word associations, which we use to investigate how the structure and processing of the bilingual lexicon could lead to the observed differences.
Related work

Non-native word associations
In general, non-native speakers' responses tend to differ from those of native speakers (e.g., Wolter, 2001; Zareva, 2007; Antón-Méndez and Gollan, 2010; Hui, 2011). Non-native speakers often produce responses that are translation equivalents of responses they would give in their native language (Meara, 1978); in other words, L1 mediates their L2 responses (Nam, 2014). Such translations are produced more frequently when the cue word and its translation are cognates (Taylor, 1976; van Hell and de Groot, 1998). Also, collocational responses (called 'syntagmatic'; e.g., duty-free, opportunity-take: Politzer, 1978; Riegel and Zivian, 1972) and phonological responses (favor-flavor: Meara, 1978; Namei, 2004) tend to be produced by non-native speakers more frequently than by monolinguals. Multiple examples of all these effects are well-documented, yet open questions remain regarding how systematic these differences are between bilinguals and monolinguals.
Van Hell and de Groot (1998, henceforth vHdG) carry out a free association experiment with Dutch-English bilinguals (i.e., native Dutch speakers who have been learning English). For us, their study is interesting in two respects. First, vHdG work with two similar groups of bilinguals and test one of the groups twice, which allows us to measure the consistency of responses between two groups of bilinguals, as well as within a single group. Second, large-scale monolingual association norms are available for both Dutch and English, which helps us both with our statistical analyses and in building a computational model. We use vHdG's data (1) to carry out a systematic comparison of monolingual and bilingual responses, and (2) to train and test a computational model that helps us predict whether the effects described above are systematic or not.

Existing computational models
Graph-based models (or semantic networks) have been widely used in research on semantic memory (see an overview by Beckage and Colunga, 2016). Despite their 'localist' approach in which a word is simply represented by a node (rather than using distributed representations), such models are a useful tool in the study of lexical access and acquisition. In particular, they have successfully replicated patterns of human verbal behavior in free word association (Enguix et al., 2014; Gruenenfelder et al., 2015), semantic fluency tasks (Abbott et al., 2015; Nematzadeh et al., 2016), lexical growth/acquisition (Stella et al., 2017; Bilson et al., 2015), and assessment of semantic similarity (Jackson and Bolger, 2014; De Deyne et al., 2016), among others.
Naturally, a graph is only a static representation of the lexicon, although its structure presumably reflects lexical processing (Beckage and Colunga, 2016). To simulate the actual processing dynamics, various mechanisms have been proposed, such as spreading activation, random walk, entanglement, etc. (Galea et al., 2011; Zemla and Austerweil, 2017). In a spreading activation model, the activation starts at a given node and spreads across the graph over adjacent edges proportionally to edge weights (Anderson, 1983; Roelofs, 1992). Recently, De Deyne et al. (2016) used this approach on a free association graph to predict human similarity judgments for weakly-related concepts. We use a similar approach to model bilingual free associations in our computational model.

Data analysis
While vHdG explored various aspects of bilingual word associations, they did not compare the bilingual responses they collected to independent monolingual data. Here, we quantitatively compare vHdG's data against monolingual association norms, to see whether the non-native responses are indeed systematically different from those of native speakers. As vHdG argue, there is a lot of variability among bilinguals. Therefore, we need to compare the between-group differences (monolinguals vs. bilinguals) against within-group differences (two sets of bilinguals), to ensure that any between-group difference we find is due to more than the variation in responses among bilinguals.

Distance measures
Our goal is to compare two sets of responses to a particular cue word against each other. For this, we use two measures. The first is based on average precision, widely used in information retrieval. This measure treats one (unordered) set of responses as a gold standard and compares this set against another (ordered) set, considering only the top n responses. Because we are interested in measuring the distance between the two sets, we employ a complementary measure ρ to assess the distance between an unordered (shorter) set X and an ordered (longer) set Y:
$$\rho(X, Y) = 1 - \frac{1}{n} \sum_{k=1}^{n} \mathbb{1}_k \times P_k$$
where 1_k is an indicator function taking the value of 1 if Y_k ∈ X and 0 otherwise, and P_k is the precision at k:
$$P_k = \frac{|X \cap Y_{1:k}|}{k}$$
where Y_{1:k} is the subset consisting of the first k responses in Y.
While average precision is frequently used in information retrieval, a shortcoming of this measure is that the order of responses in X does not matter. In practice, however, some of the responses can be several times more frequent than others. To account for this fact, we use a second measure, total variation distance υ, which considers two probability distributions $\mathcal{X}$ and $\mathcal{Y}$, associated with the likelihoods of responses in X and Y, respectively: e.g., $\mathcal{X} \sim \{L(X_i), 1 \le i \le |X|\}$, where the likelihood is proportional to the response frequency in the human data (and later, to the association score in our model). The measure υ is then defined as:
$$\upsilon(X, Y) = \frac{1}{2} \sum_{i} \left| \mathcal{X}(r_i) - \mathcal{Y}(r_i) \right|$$
Sometimes response r_i does not appear in one of the lists; if, e.g., r_i is not in Y, we take $\mathcal{Y}(r_i) = 0$.
For both measures, we test two values of n: n = 3 to compare only the top three responses per cue in the data, and n = |X| to compare the maximum possible number of responses per cue. Note that in the latter case, n varies per cue word, depending on the number of responses in X. We denote the respective measures as ρ_3, υ_3, and ρ_max, υ_max.
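The two distance measures described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names `rho` and `upsilon` are hypothetical, the normalization of ρ by n is an assumption, and any truncation of the response lists to the top n items is left to the caller.

```python
def rho(x, y, n=None):
    """Complement of average precision (a distance in [0, 1]).

    x: responses treated as the unordered gold standard.
    y: responses ranked from most to least frequent.
    n: number of top-ranked responses in y to consider
       (assumption: defaults to the size of x).
    """
    gold = set(x)
    n = len(gold) if n is None else n
    if n == 0:
        return 1.0
    hits, total = 0, 0.0
    for k, response in enumerate(y[:n], start=1):
        if response in gold:      # indicator 1_k
            hits += 1
            total += hits / k     # precision at k
    return 1.0 - total / n


def _normalize(freqs):
    """Turn a {response: frequency} mapping into a probability distribution."""
    total = sum(freqs.values())
    return {r: f / total for r, f in freqs.items()}


def upsilon(freq_x, freq_y):
    """Total variation distance between two response distributions.

    A response missing from one list gets probability 0 in that list.
    """
    px, py = _normalize(freq_x), _normalize(freq_y)
    responses = set(px) | set(py)
    return 0.5 * sum(abs(px.get(r, 0.0) - py.get(r, 0.0)) for r in responses)
```

Both functions return 0 for identical response sets and 1 for sets with no responses in common, so lower values mean more similar responses.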
To focus on systematic differences between word associations and eliminate the noise from occasional responses and various word forms, in all reported analyses we remove hapax legomena (responses that are only given by one participant) and lemmatize all the responses, using Frog (van den Bosch et al., 2007) for Dutch and the NLTK WordNet lemmatizer (Bird et al., 2009) for English.
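A minimal sketch of this preprocessing step is shown below. The function name `clean_responses` is hypothetical, the lemmatizer is passed in as a plain function (Frog or the NLTK lemmatizer could be plugged in there), and two details are assumptions not stated in the text: responses are lowercased, and lemmatization is applied before hapax legomena are removed.

```python
from collections import Counter


def clean_responses(responses, lemmatize=lambda word: word):
    """Lemmatize responses to one cue and drop those given only once.

    responses: one response string per participant.
    lemmatize: function mapping a word form to its lemma
               (identity by default; a real lemmatizer is assumed in practice).
    Returns a Counter of lemma frequencies without hapax legomena.
    """
    lemmas = [lemmatize(r.strip().lower()) for r in responses]
    counts = Counter(lemmas)
    # Keep only lemmas produced by more than one participant.
    return Counter({lemma: c for lemma, c in counts.items() if c > 1})
```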

Same vs. different bilinguals
First, we test if our measures are sensitive enough to find expected differences between sets of free association responses. For this, we compare the difference in responses from two different sets of bilinguals to the difference in responses from a single set of bilinguals at two different times; i.e., we expect more variation in the two response sets in the former case than in the latter, in line with vHdG's results. We use their data, in which one group of bilinguals, B_1, performed the free association task twice (B_{1-1} and B_{1-2}), while another group performed it only once (B_2). We then expect that ρ_3(B_{1-1}, B_{1-2}) < ρ_3(B_{1-1}, B_2), and the same for υ_3, ρ_max, and υ_max. We compute the ρ and υ values for responses given by vHdG's bilinguals to each of the 58 cue words. Figure 1 (left panel) shows the distances in terms of υ_3 only (the differences in distances on the three other measures are more pronounced). We statistically compare the distances using a Wilcoxon signed-rank test on pairwise differences per cue word. The results confirm our prediction on all measures: mean ρ_3(B_{1-1}, B_{1-2}) = 0.35 is less than mean ρ_3(B_{1-1}, B_2) = 0.47 (p = .002); for υ_3, the respective means are 0.40 and 0.49 (p = .003); for ρ_max, the means are 0.38 and 0.49 (p = .004); for υ_max, they are 0.35 and 0.46 (p = .002). The consistency of the observed differences across the four measures suggests that the same set of bilinguals gives more consistent responses across sessions than two different sets of bilinguals, and this effect cannot be explained by random variation. Ideally, we would carry out a similar analysis for monolingual speakers, but individual-level data for monolingual speakers is not available at the moment.

Bilinguals vs. English monolinguals
Given that our measures are sensitive to differences across response populations, we can now turn to our main goal of verifying differences in the responses of non-native speakers (that is, Dutch-English bilinguals tested in English) compared to native English speakers. We expect more consistency in the responses given by the two groups of bilinguals (B_{1-1} vs. B_2), compared to bilinguals vs. monolinguals (B_{1-1} ∪ B_2 vs. M_E); see Figure 1 (left panel). (For English monolingual responses M_E, we use the University of South Florida association norms: Nelson et al., 2004.) The results confirm our prediction: mean ρ_3(B_{1-1}, B_2) = 0.47 is less than mean ρ_3(B_{1-1} ∪ B_2, M_E) = 0.63 (p = .003); for υ_3, the respective means are 0.49 and 0.60 (p = .014); for ρ_max, the means are 0.49 and 0.65 (p = .002); for υ_max, they are 0.46 and 0.60 (p = .002). In short, despite the high variation in bilinguals' responses, there is still significantly more consistency between groups of bilinguals than between monolinguals vs. bilinguals.

Bilinguals vs. Dutch monolinguals
Finally, we check whether the difference reported in the previous section is only observed in bilinguals' L2 (English), or is also found in their L1 (Dutch). Intuitively, we expect little difference between the responses of Dutch monolinguals and Dutch-English bilinguals tested in Dutch.
In other words, there should be a similar degree of consistency in the responses given by, on the one hand, the two groups of bilinguals (B_{1-1} vs. [...]
To summarize our human data analyses, we have shown quantitatively that Dutch-English bilinguals give systematically different responses in English (their L2) from English monolinguals. While such a difference has long been observed, to our knowledge we are the first to statistically analyze this difference and show that it is greater than the inconsistency in responses across participants. Moreover, this difference is specific to bilinguals' L2, as we did not observe it in bilinguals' L1 Dutch.

Computational model
We develop a computational model intended to investigate the difference found above between bilinguals and monolinguals in free association. Our hypothesis is that bilingual associations in L2 are influenced by their L1 through connections between the lexicons of their two languages. We create a bilingual Dutch-English semantic network as a weighted directed graph G with a set of nodes N, where N consists of cue and response words obtained from (monolingual) word association norms in the two languages: De Deyne et al. (2013) for Dutch and Nelson et al. (2004) for English. We next describe the various types of edges connecting the nodes, and the spreading activation mechanism used as a retrieval mechanism.

Edge types and weights
Dutch and English associative edges, which connect nodes within the same language, effectively create two monolingual sub-networks.
L1 associative edges (DA) start at a Dutch cue word and end at all its Dutch responses, based on the monolingual Dutch association norms. The edge weights are proportional to conditional probabilities p(response|cue) obtained from the norms.
L2 associative edges (EA) are created the same way, using the English association norms. The two resulting sub-networks are then connected to each other with the following two types of edges.
Translation equivalent edges (TE) connect nodes that are translations of each other. Translations are obtained from two dictionaries: FreeDict and dict.cc. In many cases a node n has more than one translation (e.g., a and b). To determine which one is more frequent, we use OpenSubtitles, a bilingual corpus of Dutch-English subtitles (Lison and Tiedemann, 2016). Word alignment was performed on a random sample of 50 million sentences using the method of Liang et al. (2006), and conditional probabilities of each Dutch-English and English-Dutch translation were extracted. If a and b are translations of node n, edges E_na and E_nb are weighted proportionally to the conditional probabilities p(a|n) and p(b|n).
Cognate edges (CG) are placed between translation equivalents that have similar orthographic forms. Cognates are believed to enjoy a special status in bilinguals (van Hell and de Groot, 1998; Voga and Grainger, 2007). These edges are defined using a similarity measure S, which is complementary to the normalized Levenshtein distance (Ciobanu and Dinu, 2013). Given two words w_i and w_j, S is computed as:
$$S(w_i, w_j) = 1 - \frac{L(w_i, w_j)}{\max(|w_i|, |w_j|)}$$
where L(w_i, w_j) is the Levenshtein distance between the words, and |w| is the number of characters in w. We consider w_i and w_j to be cognates when they are translation equivalents in our dictionary, and S(w_i, w_j) ≥ 0.5. This rather low threshold was chosen to capture cognates that are spelled differently due to morphological or etymological reasons, yet are similar in their pronunciation: swell-zwellen, photography-fotografie, etc.
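The similarity measure S can be sketched as follows. This is a minimal illustration, assuming unit costs for insertion, deletion, and substitution and normalization by the length of the longer word; the function names are hypothetical, and the dictionary check for translation equivalence is left out.

```python
def levenshtein(a, b):
    """Levenshtein distance with unit insertion, deletion, and substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """S = 1 - normalized Levenshtein distance (assumed: divide by longer word)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_similar(a, b, threshold=0.5):
    """Threshold check; 0.5 is the cognate threshold given in the text."""
    return similarity(a, b) >= threshold
```

For instance, `similarity("swell", "zwellen")` is 1 - 3/7, which passes the 0.5 threshold under these assumptions.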
Finally, we consider two extra types of edges, which connect English nodes to each other. As we mentioned earlier, there is some evidence that bilinguals tend to produce more orthographic and syntagmatic responses in their non-native language, and the following types of edges are intended to test whether this is a systematic effect.
Orthographic edges (OR) connect English words with similar spelling; they are weighted using the measure S defined above. We chose a higher threshold than for cognates, 0.75, to prevent the English network from becoming too dense.
Here, for simplicity we assume that word spelling captures not only orthographic, but also phonological similarity between words, although in principle, phonological edges could be added as an independent type in the model.
Syntagmatic edges (SY) reflect collocations or pairs of words that frequently co-occur. Sometimes participants produce syntagmatic responses in the free association task, such as duty-free, opportunity-take, or apple-red. While our DA and EA edges capture such responses, there is some evidence that bilinguals produce more of these in their non-native language, so we add these SY edges. Specifically, we consider the most frequent bigrams and trigrams (one million each; from the Corpus of Contemporary American English: Davies, 2008), convert trigrams into skip-bigrams (take opportunity), and exclude stopwords (using the NLTK list: Bird et al., 2009) and words that do not appear in the English free association norms. For each pair of words, we compute their total number of co-occurrences in both bigrams and skip-bigrams, F(w_i, w_j), and their total individual frequency, F(w_i) and F(w_j). Each weight for SY edge E_ij is set proportional to the respective conditional probability:
$$w(E_{ij}) \propto p(w_j \mid w_i) = \frac{F(w_i, w_j)}{F(w_i)}$$
Figure 2 shows a small part of the bilingual network with various types of edges.
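The construction of SY edge weights might look roughly like the sketch below. The input format (dictionaries of n-gram counts), the function name `syntagmatic_weights`, and the use of a separate unigram frequency table are assumptions made for illustration; the edge direction (from the first word to the later word) is likewise assumed.

```python
from collections import Counter


def syntagmatic_weights(bigrams, trigrams, unigram_freq, vocab, stopwords):
    """Compute p(w2 | w1) = F(w1, w2) / F(w1) for co-occurring word pairs.

    bigrams:      {(w1, w2): count}
    trigrams:     {(w1, w2, w3): count}; converted to skip-bigrams (w1, w3)
    unigram_freq: {word: count}, the individual word frequencies F(w)
    vocab:        words present in the English association norms
    stopwords:    words to exclude
    """
    pair_counts = Counter()
    for (w1, w2), count in bigrams.items():
        pair_counts[(w1, w2)] += count
    for (w1, _, w3), count in trigrams.items():
        pair_counts[(w1, w3)] += count    # skip the middle word

    weights = {}
    for (w1, w2), count in pair_counts.items():
        if w1 in stopwords or w2 in stopwords:
            continue
        if w1 not in vocab or w2 not in vocab:
            continue
        weights[(w1, w2)] = count / unigram_freq[w1]
    return weights
```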

Normalization of edge weights
We further weight each edge type differently, to reflect its relative importance in the spreading activation process. These relative weights are the main parameters of our model. The model has six edge weight coefficients κ: κ_DA, κ_EA, κ_TE, κ_CG, κ_OR, and κ_SY, set as discussed in Section 5.2.
We normalize the edge weights of all outgoing edges of each node n to sum to 1, so that n passes on to its neighbors collectively the same amount of activation that it received. To do so, we first consider all outgoing edges of n of a particular type, e.g., DA. We normalize the weights of all DA edges so that they sum to 1, and then multiply each weight by the respective coefficient, κ_DA. The same is done for all edge types. After that, we normalize the weights of all outgoing edges of n to sum to 1.
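The two-stage normalization described above can be sketched for a single node as follows. The data layout and the function name `normalize_outgoing` are hypothetical, and summing the weights of parallel edges that lead to the same target (e.g., a TE and a CG edge between cognates) is an assumption of this sketch.

```python
def normalize_outgoing(edges_by_type, kappa):
    """Normalize the outgoing edge weights of one node.

    edges_by_type: {edge_type: {target: raw_weight}}
    kappa:         {edge_type: coefficient}
    Returns {target: weight}, with weights summing to 1.
    """
    combined = {}
    for edge_type, targets in edges_by_type.items():
        type_total = sum(targets.values())
        coefficient = kappa.get(edge_type, 0)
        if type_total == 0 or coefficient == 0:
            continue
        for target, weight in targets.items():
            # Stage 1: normalize within the edge type, then scale by kappa.
            scaled = coefficient * weight / type_total
            combined[target] = combined.get(target, 0.0) + scaled
    grand_total = sum(combined.values())
    if grand_total == 0:
        return {}
    # Stage 2: normalize across all outgoing edges of the node.
    return {target: w / grand_total for target, w in combined.items()}
```

With raw EA weights of 3 and 1, a single TE edge, and coefficients of 5 and 20, the node passes 0.15 and 0.05 of its activation along the EA edges and 0.8 along the TE edge.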

Retrieval algorithm
Given graph G with nodes N and edges E, the activation algorithm starts at a cue node n_cue, and activation spreads over edges to neighboring nodes, proportionally to the edge weights. This process is bounded in time by a parameter T, which is the upper limit on the number of edges the activation can pass through. At the end, the model returns a set of nodes (responses) M = {n_1, n_2, ..., n_k} ranked by activation, where A_t(n_i) is the activation score of n_i at time t:
$$A_t(n_i) = \sum_{n_j \in N} A_{t-1}(n_j) \, w(E_{ji})$$
where E_ji is the edge connecting n_j to n_i, and w(E_ji) = 0 if the two are not connected. Initially, A_0(n_cue) = 1; for all other nodes A_0(n_i) = 0.
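A minimal sketch of the retrieval algorithm is given below. The graph representation and the function name `spread_activation` are hypothetical. One further assumption is labeled in the code: the final score of a node is taken to be the sum of its activation over time steps 1 to T, which the text above does not specify.

```python
def spread_activation(graph, cue, T=3):
    """Spread activation from a cue node for T steps.

    graph: {source: {target: weight}}, outgoing weights summing to 1.
    Returns a list of (node, score) pairs sorted by descending score.
    """
    current = {cue: 1.0}     # A_0: all activation starts at the cue
    totals = {}
    for _ in range(T):
        following = {}
        for node, activation in current.items():
            for target, weight in graph.get(node, {}).items():
                following[target] = following.get(target, 0.0) + activation * weight
        # Assumption: a node's score accumulates over time steps.
        for node, activation in following.items():
            totals[node] = totals.get(node, 0.0) + activation
        current = following
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```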

Task, models, and baselines
We test our model on the English free association task given to bilinguals in vHdG; i.e., Dutch-English speakers were given English cue words and asked to respond in English. We consider two versions of spreading activation in the model, unconstrained and constrained (see Figure 3). In both versions, we set T, the maximum path length of spreading activation, to 3, following the intuition that bilinguals may translate the English cue into Dutch (time t = 1), think of Dutch word associations (t = 2), and translate them back into English (t = 3).
In the unconstrained version (UCS) of the model, activation crosses all types of edges at each time step. Note that a T value of 3 enables activation to spread from the English to the Dutch subnetwork and back, but also allows activation to spread beyond the direct English associates. The next version of the model controls for this.
The constrained version (CS) simulates a bilingual who accesses direct English associates of the cue word, as well as English translations of direct Dutch associates of Dutch translations of the cue word. That is, they combine their direct English associations with direct Dutch associations. At t = 1, activation passes from the cue node to its English associates and to its Dutch translations, via EA and TE/CG edges, respectively. At time t = 2, activation passes only from the just-activated Dutch nodes via DA edges to their Dutch associates. Finally, at t = 3, activation passes only from the newly activated Dutch nodes (the associates of cue translations) via TE and CG edges back to English nodes. Conceptually, this version implements a speaker who relies on the word translation mechanism.
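The constrained version can be sketched as a per-step restriction on which edge types activation may cross. In this illustration, the names `SCHEDULE` and `constrained_spread` and the graph layout are hypothetical; accumulating scores over time steps and using the edge weights without renormalizing them over the allowed types are assumptions. Restricting the output to English nodes and normalizing their scores follows the evaluation setup of the paper.

```python
# Edge types that activation may cross at t = 1, 2, 3.
SCHEDULE = [{"EA", "TE", "CG"}, {"DA"}, {"TE", "CG"}]


def constrained_spread(graph, cue, english_nodes, schedule=SCHEDULE):
    """Constrained spreading activation.

    graph: {source: [(target, edge_type, weight), ...]}
    Returns {english_node: probability}, normalized to sum to 1.
    """
    current = {cue: 1.0}
    totals = {}
    for allowed in schedule:
        following = {}
        for node, activation in current.items():
            for target, edge_type, weight in graph.get(node, []):
                if edge_type in allowed:
                    following[target] = following.get(target, 0.0) + activation * weight
        # Assumption: scores accumulate over time steps.
        for node, activation in following.items():
            totals[node] = totals.get(node, 0.0) + activation
        current = following
    english = {n: a for n, a in totals.items() if n in english_nodes}
    norm = sum(english.values())
    return {n: a / norm for n, a in english.items()} if norm else {}
```

Because DA edges only connect Dutch nodes, the second step passes activation on only from the Dutch translations of the cue, so indirect English associates are never reached.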
Because we have shown that human bilingual responses to the English free association task differ from those of monolinguals, we need to compare our model's performance to a monolingual (English) baseline. The association norms baseline (BASE-AN) corresponds to the English word association data set itself: i.e., we use EA edges only in the English subnetwork and set the maximum path length T = 1. An improvement over BASE-AN ensures that our model is producing a better match to bilingual data than simply outputting English monolingual associations. We also use a second monolingual baseline with the same subnetwork and edges; this spreading activation baseline (BASE-SA) instead uses T = 3, as in our model. This setting enables access to indirect English associations of the cue word (as in our model), but only through English connections (unlike our model). Comparing our model to BASE-SA indicates whether any improvement we see in our model is due to accessing the Dutch subnetwork (our theoretical claim) and not simply due to making indirect associations in English.

Model evaluation
In the test task, the model receives a set of cue words and generates multiple responses to each cue. Only English nodes can serve as responses, and their probabilities are normalized to sum to 1. The model responses are compared to human data using the measures defined in Section 3.1.
Our main goal is to test which types of edges systematically contribute to predicting bilinguals' (non-native) free word associations, and which do not. We have six parameters of the model related to edge weights (κ weights for the six types of edges) and a relatively low number of test items (58 cue words). To prevent overfitting, we perform cross-validation on our data set, initially fitting only some of the κ parameters. Specifically, we first determine the best weights for the word association edges (κ_DA and κ_EA, which are essential for the task) and for the cross-language edges (κ_TE and κ_CG, which ensure that activation can pass from English to Dutch and back). We later test whether adding other edge types (SY and OR) improves the model. For cross-validation, we use the Monte-Carlo method with 10,000 iterations: in each iteration, the 58 cue words are randomly split into 48 training items and 10 test items. For each training subsample, we consider values {0, 1, 5, 10, 20, 25} for each edge weight (κ_DA, κ_EA, κ_TE, κ_CG), run a grid search to find the best combination, and choose the four combinations (one per evaluation measure) which minimize the distance between the human and the model responses. These combinations are then evaluated on the respective test sub-sample.
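The cross-validation procedure can be sketched as below for a single evaluation measure. The function name `monte_carlo_cv` and its signature are hypothetical: `distance(params, cue)` stands for any function that runs the model with the given κ values on one cue and returns its distance from the human responses. The default number of iterations is kept small here; the paper uses 10,000.

```python
import itertools
import random
from collections import Counter

GRID = (0, 1, 5, 10, 20, 25)   # candidate values for each kappa


def monte_carlo_cv(cues, distance, n_iter=100, n_test=10, n_params=4, seed=0):
    """Monte-Carlo cross-validation with a grid search on each training split.

    Returns (how often each combination won, mean distance on the test splits).
    """
    rng = random.Random(seed)
    combos = list(itertools.product(GRID, repeat=n_params))
    winners = Counter()
    test_scores = []
    for _ in range(n_iter):
        shuffled = list(cues)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        # Grid search: pick the combination with the lowest mean training distance.
        best = min(combos,
                   key=lambda k: sum(distance(k, c) for c in train) / len(train))
        winners[best] += 1
        test_scores.append(sum(distance(best, c) for c in test) / len(test))
    return winners, sum(test_scores) / len(test_scores)
```

Counting how often each combination wins across iterations corresponds to the frequency-based selection of the best parameter combination reported in the results.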

Results
Testing the basic model
Table 1 provides average cross-validation scores for the two baselines and the two models. Recall that our scores are distances from human data, so lower values are better. We see that BASE-AN is a stronger baseline than BASE-SA. The UCS model shows little to no improvement over the baselines, and we only consider the CS model henceforth. The CS model shows a noticeable improvement over the stronger BASE-AN baseline, of 0.03-0.04 in terms of absolute distances, an improvement of 5%-6%.
Although the best combinations of edge weights of the CS model differ per iteration, one of them appears much more frequently than the others, over 12,000 times: (κ_DA, κ_EA, κ_TE, κ_CG) = (10, 5, 20, 25). To determine whether this combination makes significantly better predictions than the baselines, we test it on the full data set with responses to 58 cue words and run a series of Wilcoxon signed-rank tests (one per measure). The results show that the model (average scores ρ_3 = 0.57, υ_3 = 0.56, ρ_max = 0.60, υ_max = 0.55) is significantly better than both baselines on all measures, apart from υ_3 when compared to BASE-AN. The comparisons to the baselines show that the CS model, but not the UCS model, predicts bilingual responses better than simply using monolingual responses, and it does so by using edges that link translations across English and Dutch.

Testing the model with extra edges
Here we test whether adding the two further types of edges, OR and SY, improves the model predictions. We use the CS model with the best parameter combination, (κ_DA, κ_EA, κ_TE, κ_CG) = (10, 5, 20, 25). Again, we cross-validate the model, this time running a grid search to find the best weights of the extra edges only, κ_OR and κ_SY. We look for the most frequent parameter combinations. The combination of the best CS model without the extra edges, that is, (κ_OR, κ_SY) = (0, 0), is about as frequent as a particular combination with syntagmatic edges, (κ_OR, κ_SY) = (0, 1), and both of these perform the same on the full data set. Thus, OR and SY do not improve the model's performance overall. We return to this issue in the discussion.
Note that for both the UCS and CS models, we start by first fitting the κ values for associative edges and cross-language (translation and cognate) edges, because the literature generally agrees that L2 speakers use the translation mechanism at least to some extent (e.g., Meara, 2009). The other two mechanisms, collocations and form similarity, are tested as additions to the model. Effectively, this makes our basic CS model implement the learner relying on word associations (DA and EA edges) and translation equivalence (TE and CG edges), but not on collocation patterns or orthographic similarity between L2 words. One could also design a model without cross-language edges, that is, relying on L2 word associations (EA edges) together with collocations and/or orthography (OR and/or SY edges), which we do not present in this study for lack of space.

Best model and error analysis
Here we look in detail at the best CS model and provide an error analysis. (For simplicity, we consider the model without SY edges.) This model weights direct monolingual associations more in Dutch than in English: κ_DA = 10 vs. κ_EA = 5. Translation equivalents are also strongly connected to each other (κ_TE = 20), and cognates even more so (κ_CG = 25, which is in addition to the existing TE edge between them). This pattern of weights ensures that the translation operation is "cheap", and Dutch associates are readily activated; together these effectively make the contributions of English and Dutch associations similar in size. Figure 4 provides a toy example showing why this is the case. At the first step, a small share of the activation passes from the English cue to the English association, while the lion's share goes to the Dutch translation. At the second step, less than half of the activation at the Dutch translation proceeds to its associate; then in the third step, this activation is passed to the Dutch associate's translation.
Figure 4: An illustration of the spreading activation in the best CS model (CG edges are not shown).
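As a rough check on the toy example, the activation shares can be worked out from the κ values. The calculation below is only a sketch under assumptions not stated in the text: the cue has only EA and TE outgoing edges (it is not a cognate), the Dutch translation has only DA and TE outgoing edges, and weights are normalized over all outgoing edges of a node.

```latex
\begin{align*}
\text{cue} \to \text{English associates:} \quad
  & \frac{\kappa_{EA}}{\kappa_{EA} + \kappa_{TE}} = \frac{5}{5 + 20} = 0.2 \\
\text{cue} \to \text{Dutch translation:} \quad
  & \frac{\kappa_{TE}}{\kappa_{EA} + \kappa_{TE}} = \frac{20}{5 + 20} = 0.8 \\
\text{Dutch translation} \to \text{Dutch associates:} \quad
  & \frac{\kappa_{DA}}{\kappa_{DA} + \kappa_{TE}} = \frac{10}{10 + 20} = \tfrac{1}{3} \\
\text{activation reaching Dutch associates:} \quad
  & 0.8 \times \tfrac{1}{3} \approx 0.27
\end{align*}
```

Under these assumptions, about 0.27 of the initial activation reaches the Dutch associates, compared with 0.2 for the English associates, which is in line with the two contributions being similar in size.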
This figure also shows why we cannot make conclusions about the contribution of a particular factor (e.g., translation equivalence, or the strength of English and Dutch word associations) based on the κ value of the corresponding edge type alone. Even though κ_TE = 20, κ_DA = 10, and κ_EA = 5, the contributions of native and non-native word associations to the final set of responses given by the model are similar, because English associations (the top right rhombus) are connected directly to the cue word, and the activation reaches them immediately upon the presentation of the cue, while Dutch associations (the top left rhombus) are further away from the cue word, and activation gets more dispersed as it passes through the network.
Table 2 shows the performance of the best model (vs. BASE-AN) for the best and worst cue words. For the majority of these cues the model is better than the baseline. For eight of these (apple, block, bottle, chance, memory, season, shame, shoulder), the improvement is consistent across the four measures. While the baseline relies on English word associations only, the model benefits from considering Dutch associations. This is because many bilinguals' responses (e.g., chance→possibility, shame→red, farm→farmer) are missing in the monolingual data. In addition, some responses appear in the English monolingual data too, but are uncommon (e.g., apple→pear, green). In both cases, it is the translation edges that are responsible for the model's better performance. Cue words on which the model is consistently worse than the baseline are attempt, daughter, and winter. For hospital, the model is only worse in predicting the top three responses. We find several reasons that may explain the model's errors.
Lack of data for some cues. The cue attempt is translated as poging, which activates a Dutch associate probeersel ['trial']. Because this word is not a cue in the Dutch association norms, all its activation is passed over its translation edges directly to trial, which yields relatively less activation for the more common response suicide.
Lack of word frequency information. For some cues (e.g., hospital, winter), the top human responses are words that are generally more frequent in English than are their Dutch translations (nurse vs. verpleegster, spring vs. lente). In these cases, high frequency of English response words may lead speakers to rely more on English than on Dutch associations, which our model does not take into account.
Language change. The data sets are not from the same time period (Dutch: 2010s; English: 1970s; bilingual: 1990s), so some responses that the model fails to reproduce may be attributed to language change: e.g., the response duty→army appears in the two older data sets, but not in the monolingual Dutch data, perhaps because conscription in the Netherlands was suspended in 1997.

Conclusion
We first showed that Dutch-English bilinguals in their L2 English give responses different from those of English monolinguals, but their L1 Dutch responses are not significantly different from those of Dutch monolinguals. While related observations have been reported in the literature (Wolter, 2001;Zareva, 2007;Antón-Méndez and Gollan, 2010;Hui, 2011, etc.), here we use a set of 58 cue words to demonstrate that this difference is consistent and is significantly larger than the difference between responses given by two groups of bilinguals.
Next, we presented a computational model based on a graph constructed from two monolingual word association data sets that were connected with additional cross-language edges. Our model predicts bilingual responses better than the monolingual baselines. The edge weights in the best model suggest that the contribution of L1 and L2 word associations is approximately equal in a group of Dutch-English bilinguals, and that translation equivalents (and cognates even more so) are strongly connected in the bilingual lexicon (in line with the findings on bilingual lexical access: e.g., Kroll et al., 2006; Dimitropoulou et al., 2011). Bilinguals may often translate L2 cues into L1, generate L1 associations, and translate them back into L2. In contrast, syntagmatic and orthographic responses that have been reported (e.g., Meara, 1978; Namei, 2004; Politzer, 1978) are not useful on the data set we used. Our results also suggest that it is not the case that bilinguals simply activate a broader cluster of L2 words and sample from those.
Van Hell and de Groot (1998) showed that bilinguals' responses might depend on the type of the cue word (e.g., noun-verb, abstract-concrete, cognate-non-cognate). As we intended to test how consistently various types of responses are produced across multiple cue words, we did not adjust the weights depending on the word type (except for cognates). Future research will consider enriching our network with such semantic and syntactic properties, as well as word frequency information. Another fruitful direction is to consider how to learn the association weights themselves, from textual and/or perceptual input (e.g., Griffiths et al., 2007; Gruenenfelder et al., 2015; Nematzadeh et al., 2016), rather than building them in from human norms; this would enable us to more realistically model the emergence of the bilingual lexicon.