SyntaxFest 2019 Invited talk - Quantitative Computational Syntax: dependencies, intervention effects and word embeddings

In the computational study of intelligent behaviour, the domain of language is distinguished by the complexity of its representations and the vast amounts of quantitative text-driven data. In this talk, I will let these two aspects of the study of language inform each other and will discuss current work investigating whether the notion of similarity in the intervention theory of locality is related to current notions of similarity in word-embedding space. Despite their practical success and impressive performance, neural-network-based and distributed-semantics techniques have often been criticized because they remain fundamentally opaque and difficult to interpret. Several recent pieces of work have investigated the linguistic abilities of these representations and shown that they can capture long-distance agreement, and thus hierarchical notions. In this vein, we study another core, defining and more challenging property of language: the ability to establish long-distance dependencies. We present results showing that word embeddings and the similarity spaces they define do not correlate with experimental results on intervention similarity in long-distance dependencies. These results indicate that the linguistic encoding in distributed representations is not human-like, and they also bear on the debate between narrow and broad definitions of similarity in syntax and sentence processing.

Studies of long-distance dependencies: equally inconclusive.
Long-distance dependencies and intervention
Not all long-distance dependencies are equally acceptable.
(1a) What do you think John bought <what>?
(1b) * What do you wonder who bought <what>?
(2a) Show me the tiger that the lion is washing <the tiger>.
(2b) Show me the tiger that <the tiger> is washing the lion.
(3) ??/ok Jules sourit aux étudiant(s) que l'orateur <étudiant(s)> endort <étudiant(s)> sérieusement depuis le début.
'Jules smiles to the student(s) who the speaker is seriously putting to sleep from the beginning.'
Intervention theory (Rizzi 1990, 2004)
Core to the explanation of these facts is the notion of intervener.
Intervener: an element that is similar to the two elements in a long-distance relation and structurally intervenes between them, blocking the relation (shown in bold).
Lexical restriction improves acceptability. Acceptability judgements (< = better): c < b < a.
Agreement features: number creates intervention effects (so decreases acceptability) but person doesn't.
Animacy: children do not seem to be sensitive to it in relative clauses, but intervention effects have been found in weak islands (Franck et al., 2015).

Merlo SyntaxFest 2019
Intervention theory notion of similarity: summary
Long-distance dependencies are acceptable if there is no intervener.
Establishing if an element is an intervener requires the calculation of similarity of feature vectors, where some features are morpho-syntactic and some are semantic.
This is very reminiscent of current notions of similarity over distributional semantic spaces.

Vector spaces
Word embeddings: a definition of lexical proximity in feature spaces; a vectorial representation of the meaning of a word, defined as the usage of the word in its context.
Tasks that confirm this interpretation are association, analogy, lexical similarity, and entailment.
Does the similarity space defined by word embeddings capture the grammatically-relevant notion of similarity at work in long-distance dependencies?
The work is done on French.

Weak island intervention and animacy
Experiment 1 manipulated the lexical restriction of the wh-elements (both bare vs. both lexically restricted) and the match in animacy between the two wh-elements, as shown. All verbs required animate subjects.
Data: acceptability judgments collected off-line on a seven-point Likert scale. No time constraints.
Results: clear effect of animacy match for lexically restricted phrases and less so for bare wh-phrases.

Weak island intervention and animacy
Both the pair (class, student) and the pair (professor, student) are close in a semantic space that measures semantic-field and association-based similarity.
Human speakers rate the first sentence as on average slightly better, as there is a mismatch in animacy, and hence the effect of intervention is weaker.
If word embeddings learn grammatically relevant notions of similarity, then (professor, student), where both members are animate, should be more similar than (class, student), a pair with a mismatch in animacy, predicting lower acceptability.
Human speakers read the verb endort in the second sentence on average faster than in the first, as there is a mismatch in number, and hence the effect of intervention is weaker.
If word embeddings learn grammatically relevant notions of similarity, then (student, speaker), where both members are singular, should be more similar than (students, speaker), a pair with a mismatch in number, predicting slower reading times.

Calculating the word and phrase vectors
The pairs of words or phrases (indicated in bold in the examples) were used to construct the vector-based similarity space.
For each of these words, we use French FastText word embeddings (Bojanowski et al., 2016), trained on Wikipedia data with the skip-gram model and a 5-word window, resulting in 300-dimensional vectors. Every word is represented as a bag of character n-grams.
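As a minimal sketch of the FastText subword idea (the actual model additionally keeps the whole word as a special token and sums learned vectors over the n-grams), the character n-grams of a word, with the boundary markers FastText uses, can be generated as follows:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams with boundary markers < and >."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# For "chat": ['<ch', 'cha', 'hat', 'at>', '<cha', 'chat', 'hat>', ...]
print(char_ngrams("chat"))
```

Because rare and morphologically related words share n-grams, this representation gives usable vectors even for words unseen in training.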
Quality of resulting similarity spaces was inspected.
The cosine is a well-known and efficient measure of vector similarity. It is a symmetric measure. It has been shown to capture analogical semantic similarity in vector space.
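A minimal implementation of the cosine measure, using only the standard library:

```python
import math

def cosine(u, v):
    """Cosine similarity: symmetric, and invariant to vector magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
print(cosine(u, [2.0, 4.0, 6.0]))  # parallel vectors -> 1.0
```

Note that cosine(u, v) == cosine(v, u) by construction, which is precisely what makes it unable to express the asymmetric, superset-based notion of intervention discussed below.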
Analysis of the results: do we capture a binary distinction?
Animacy in wh-islands: expected inverse correlation between mean similarity and mean acceptability. Also notice that the average similarity score for the number match condition is lower than for the number mismatch condition.
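The relation between per-item similarity scores and mean acceptability can be checked with a rank correlation. The sketch below uses a pure-Python Spearman coefficient (assuming no ties) on invented illustrative numbers, not the actual experimental data:

```python
def spearman(xs, ys):
    """Spearman rank correlation coefficient (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy numbers for illustration: intervention theory predicts an inverse
# relation, i.e. higher intervener similarity -> lower acceptability.
sims = [0.62, 0.55, 0.71, 0.48]  # hypothetical cosine similarities
acc = [3.1, 4.2, 2.8, 4.9]       # hypothetical mean acceptability ratings
print(spearman(sims, acc))       # -1.0 on this toy data
```

A value near -1 would support the intervention-based prediction; the experiments reported here instead find no reliable correlation.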

Asymmetric operator Human grammaticality judgments differ depending on whether the feature set of the long-distance element is properly included or properly includes the feature set of the intervener. If the features of the long-distance dependency are a superset of the features of the intervener, sentences are judged more acceptable (Rizzi, 2004).
These fine-grained differences in grammaticality judgments suggest that it might be more appropriate to calculate similarity with an asymmetric operator.
The asymmetric measure we use here has been developed to capture the notion of entailment. This operator has been shown to learn the notion of hyponymy with good results (Henderson and Popa, 2016).
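The actual operator of Henderson and Popa (2016) is more involved; as an illustrative stand-in for the general idea, a simple distributional-inclusion score measures how much of one vector's non-negative feature mass is covered by the other, and is asymmetric by construction:

```python
def inclusion(u, v):
    """Asymmetric similarity: fraction of u's non-negative feature mass
    that is covered by v. In general inclusion(u, v) != inclusion(v, u).
    Assumes u has at least one positive component."""
    u = [max(x, 0.0) for x in u]
    v = [max(x, 0.0) for x in v]
    return sum(min(a, b) for a, b in zip(u, v)) / sum(u)

u = [1.0, 0.0, 1.0]  # toy vector whose features are a proper subset of v's
v = [1.0, 1.0, 1.0]
print(inclusion(u, v))  # 1.0: u is fully included in v
print(inclusion(v, u))  # ~0.67: v is not included in u
```

The asymmetry mirrors the grammatical intuition above: a long-distance element whose features properly include those of the intervener is not in the same relation as one whose features are properly included.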

Discussion
These results also confirm a lack of correlation.
The convergence of these results is important as null effects are always hard to confirm and explain.
All experiments, across constructions (weak islands and object relatives), across types of noun phrase (bare or composed), across measurement methods of the experimental dependent variable (off-line grammaticality judgments and online reaction times), and across operators (symmetric and asymmetric), show a consistent lack of correlation between experimental results and the notion of similarity encoded in word embeddings.

Extension to sentence embeddings and prediction task
Prediction task: can we identify the right sentence type?
The items are also translated into a new language: English.
Sentence embeddings: additive bag of vectors model (same word embeddings as previously).
Dependent variable: Accuracy, as a measure of how much the information in the input embeddings supports the discrimination of the four sentence types in a categorical classifier.
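The additive bag-of-vectors model simply sums the word vectors of the sentence. A minimal sketch with a toy embedding table (an actual run would use the FastText vectors described earlier):

```python
def sentence_embedding(sentence, emb, dim):
    """Additive bag-of-vectors: sum the embeddings of the known words."""
    out = [0.0] * dim
    for w in sentence.lower().split():
        if w in emb:  # out-of-vocabulary words are skipped
            out = [a + b for a, b in zip(out, emb[w])]
    return out

toy = {"the": [1.0, 0.0], "cat": [3.0, 2.0]}  # toy 2-dimensional embeddings
print(sentence_embedding("The cat", toy, dim=2))  # [4.0, 2.0]
```

Note that this representation discards word order entirely, so any discrimination of sentence types must come from the lexical content alone.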

Weak islands
LexI Which class do you wonder which student liked?
LexA Which professor do you wonder which student liked?
BareI What do you wonder who liked?
BareA Who do you wonder who liked?

Object Relatives
ORCsg Julie smiles to the student that the speaker is putting to sleep seriously from the beginning.
ORCpl Julie smiles to the students that the speaker is putting to sleep seriously from the beginning.
CMPsg Julia points out to the student that the speaker has been yawning frequently from the beginning.
CMPpl Julia points out to the students that the speaker has been yawning frequently from the beginning.

For French, the prediction on the effect of animacy in the lexically specified case is confirmed, but the others are not.
For English, the prediction for the effect of animacy is confirmed both in bare wh-phrases and in lexicalised wh-phrases, but the others are not.

Discussion
Current word embeddings, i.e. dictionaries in a multi-dimensional vectorial space, clearly encode a notion of similarity, as shown by many experiments on analogical tasks and on textual and lexical similarity.
They do not however encode the notion of similarity that has been shown in many human experiments to be at work and to be definitional in long-distance dependencies.
They do not encode therefore a core linguistic notion.

Discussion: Finer-grained distinctions among intervention theories
Narrow intervention (grammar-based, explains ungrammaticality, weak islands): only morpho-syntactic features are relevant to define intervention, so the fact that word embeddings, which are meant to capture a semantic notion of similarity, do not correlate with a grammar-based notion of similarity is to be expected.
Cue-based memory models (processing-based, explain difficulty, object relatives): similarity can take any feature type into account (as demonstrated in the experiment on weak islands above, which also manipulates semantic reversibility), and intervention is a kind of interference at retrieval in memory. Here a correlation is expected.
Cross-lingual word embedding models
VECMAP: cross-lingual word embeddings, the state of the art for bilingual lexicon induction (Artetxe et al., 2018).
M2VEC: a weakly-supervised, concept-based adversarial model (Wang, Henderson and Merlo, 2019). This method is based on the idea that languages use similar words to express similar concepts. It uses concepts drawn from Wikipedia, rather than words, to learn competitive cross-lingual word embeddings.
FastText, with its subword sequences, is important for the false- and true-friends experiments. The embeddings are then aligned with VecMap.