Learning Word Representations with Cross-Sentence Dependency for End-to-End Co-reference Resolution

In this work, we present a word embedding model that learns cross-sentence dependency for improving end-to-end co-reference resolution (E2E-CR). While the traditional E2E-CR model generates word representations by running long short-term memory (LSTM) recurrent neural networks on each sentence of an input article or conversation separately, we propose linear sentence linking and attentional sentence linking models to learn cross-sentence dependency. Both sentence linking strategies enable the LSTMs to make use of valuable information from context sentences while calculating the representation of the current input word. With this approach, the LSTMs learn word embeddings considering knowledge not only from the current sentence but also from the entire input document. Experiments show that learning cross-sentence dependency enriches information contained by the word representations, and improves the performance of the co-reference resolution model compared with our baseline.


Introduction
Co-reference resolution requires models to cluster mentions that refer to the same physical entities. Models based on neural networks typically require different levels of semantic representations of input sentences. Before predicting antecedents, they usually need to calculate the representations of word spans, or mentions, given pre-trained character- and word-level embeddings (Turian et al., 2010; Pennington et al., 2014). The mention-level embeddings are then used to make co-reference decisions, typically by scoring mention pairs and making links (Lee et al., 2017; Clark and Manning, 2016a; Wiseman et al., 2016). Long short-term memory (LSTM) networks are often used to encode the syntactic and semantic information of input sentences.
Articles and conversations include more than one sentence. For accuracy and efficiency, co-reference resolution models usually let the encoder LSTM process input sentences separately as a batch (Lee et al., 2017). The disadvantage of this method is that the models do not consider the dependency among words from different sentences, which plays a significant role in word representation learning and co-reference prediction. For example, pronouns are often linked to entities mentioned in other sentences, while their initial word vectors lack dependency information. As a result, a word representation model cannot learn an informative embedding of such a pronoun without considering cross-sentence dependency.
It is also problematic to encode the input document with cross-sentence dependency by treating the entire document as one sentence. An input article or conversation can be too long for a single LSTM cell to memorize. If the LSTM updates itself for too many steps, gradients will vanish or explode (Pascanu et al., 2013), and the co-reference resolution model will be very difficult to optimize. Treating the entire input corpus as one sequence instead of a batch also significantly increases the time complexity of the model.
To solve the problem that traditional LSTM encoders, which treat the input sentences as a batch, cannot capture cross-sentence dependency, and to avoid the time complexity and training difficulties of concatenating all input sentences, we propose a cross-sentence encoder for end-to-end co-reference resolution (E2E-CR). Borrowing the idea of an external memory module from Sukhbaatar et al. (2015), we add to the standard LSTM model an external memory block containing syntactic and semantic information from context sentences. With this context memory block, the proposed model is able to encode input sentences as a batch while calculating the representations of input words by taking both target and context sentences into consideration. Experiments showed that this approach improved the performance of co-reference resolution models.
Related Work

Co-reference Resolution

A popular method of co-reference resolution is mention ranking (Durrett and Klein, 2013). Reading each mention, the model calculates co-reference scores for all antecedent mentions and picks the mention with the highest positive score as its co-reference. Many recent works are based on this approach. Durrett and Klein (2013) designed a set of feature templates to improve the mention-ranking model. Peng et al. (2015) proposed a mention-ranking model that jointly learns mention heads and co-references. Clark and Manning (2016a) proposed a reinforcement learning framework for the mention-ranking approach. Based on similar ideas but without using parsing features, Lee et al. (2017) proposed the current state-of-the-art model, which uses neural networks to embed mentions and calculate mention and antecedent scores. Later work applied ELMo embeddings (Peters et al., 2018) to improve within-sentence dependency modeling and word representation learning. Wiseman et al. (2016) and Clark and Manning (2016b) proposed models using global entity-level features.

Language Representation Learning
Distributed word embeddings have been used as the basic unit of language representation for over a decade (Bengio et al., 2003). Pre-trained word embeddings, for example GloVe (Pennington et al., 2014) and Skip-Gram, are widely used as the input of natural language processing models.
Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are widely used for sentence modeling. A single-layer LSTM network was applied in the previous state-of-the-art co-reference model (Lee et al., 2017) to generate word and mention representations. To capture longer-distance dependency, Campos et al. (2017) proposed a recurrent model that outputs hidden states by skipping input tokens.
Recently, memory networks (Sukhbaatar et al., 2015) have been applied in language modeling (Cheng et al., 2016; Tran et al., 2016). By applying an attention mechanism over memory cells, memory networks allow the model to focus on significant words or segments for classification and generation tasks. Previous work has shown that adding memory blocks to LSTMs also improves long-distance dependency extraction (Yogatama et al., 2018).

Learning Cross-Sentence Dependency
To improve the word representation learning model for better co-reference resolution performance, we propose two word representation models that learn cross-sentence dependency.

Linear Sentence Linking
Instead of treating the entire input document as separate sentences and encoding them as a batch with an LSTM, the most direct way to consider cross-sentence dependency is to initialize LSTM states with the encodings of adjacent sentences. We name this method linear sentence linking (LSL).
In LSL, we encode input sentences with a 2-layer bidirectional LSTM. Given input sentences $[s_1, s_2, \ldots, s_n]$, the outputs of the first layer are

$x_i = \mathrm{BiLSTM}^{(1)}(s_i)$

In the second LSTM layer, the initial state of the forward LSTM of $s_i$ is initialized as

$\overrightarrow{h}^{(2)}_{i,0} = \overrightarrow{x}_{i-1}, \quad \overrightarrow{c}^{(2)}_{i,0} = c^{(2)}_0$

while the backward state is initialized as

$\overleftarrow{h}^{(2)}_{i,0} = \overleftarrow{x}_{i+1}, \quad \overleftarrow{c}^{(2)}_{i,0} = c^{(2)}_0$

where $c^{(2)}_0$ stands for the initial cell state of the second layer, and $\overrightarrow{x}_{i-1}$ and $\overleftarrow{x}_{i+1}$ stand for the final outputs of the forward and backward LSTMs in the first layer on the adjacent sentences. We then concatenate the outputs of the forward and backward LSTMs in the second layer as the word representations for co-reference prediction.
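The linking pattern above can be sketched in a few lines. The following is a toy illustration, not the paper's implementation: it uses a plain tanh RNN cell in place of the bidirectional LSTMs, shows only the forward direction, and all names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                      # toy embedding / hidden size

# Stand-in recurrent cell (plain tanh RNN); the paper uses LSTMs.
W = rng.normal(size=(D, D)) * 0.1
U = rng.normal(size=(D, D)) * 0.1

def run_rnn(sentence, h0):
    """Run the cell over one sentence; return all states and the final state."""
    h, states = h0, []
    for x in sentence:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states, h

# Toy document: three sentences of random word vectors.
doc = [[rng.normal(size=D) for _ in range(n)] for n in (4, 6, 3)]

# First layer: encode every sentence independently (forward direction only).
layer1 = [run_rnn(sent, np.zeros(D)) for sent in doc]
finals = [final for _, final in layer1]

# Second layer with linear sentence linking: the pass over sentence i
# starts from the final first-layer state of the previous sentence i-1.
layer2 = []
for i, sent in enumerate(doc):
    h0 = finals[i - 1] if i > 0 else np.zeros(D)
    states, _ = run_rnn(sent, h0)
    layer2.append(states)
```

The backward direction mirrors this pattern, starting the pass over sentence i from the final backward state of sentence i+1.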

Attentional Sentence Linking
It is difficult for LSTMs to embed enough information about a long sentence into a low-dimensional distributed vector. To collect richer knowledge from neighboring sentences, we propose a long short-term recurrent memory module and an attention mechanism to improve sentence linking.
To describe the architecture of the proposed model, we focus on adjacent input sentences $s_{i-1}$ and $s_i$. We denote the input embedding of the $j$-th word in the $i$-th sentence by $x_{i,j}$.

Long Short-Term Memory RNNs
To address the vanishing gradient problem of traditional recurrent neural networks, Hochreiter and Schmidhuber (1997) proposed the LSTM architecture. The recurrent state update of an LSTM, $h_t = f_{lstm}(x_t, h_{t-1}, c_{t-1})$, is given by the following equations:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

where $x_t$ is the input embedding and $h_t$ is the output representation of the $t$-th word.
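The update above can be written directly in NumPy. The following is a minimal reference implementation of a single LSTM step (toy dimensions and randomly initialized weights; `lstm_step` and `init_params` are our names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(dim, rng):
    """Four (W, U, b) triples for the input, forget, output, and cell gates."""
    return [(rng.normal(size=(dim, dim)) * 0.1,
             rng.normal(size=(dim, dim)) * 0.1,
             np.zeros(dim)) for _ in range(4)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update: h_t = f_lstm(x_t, h_{t-1}, c_{t-1})."""
    (Wi, Ui, bi), (Wf, Uf, bf), (Wo, Uo, bo), (Wc, Uc, bc) = params
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)        # input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)        # forget gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)        # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                # new cell state
    h_t = o_t * np.tanh(c_t)                          # output representation
    return h_t, c_t

rng = np.random.default_rng(1)
dim = 4
params = init_params(dim, rng)
h, c = np.zeros(dim), np.zeros(dim)
for _ in range(5):            # run the cell over five random word vectors
    h, c = lstm_step(rng.normal(size=dim), h, c, params)
```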

LSTMs with Cross-Sentence Attention
We design an LSTM module with cross-sentence attention for capturing cross-sentence dependency. We name this method attentional sentence linking (ASL). Considering an input word $x_{i,t}$ in the $i$-th sentence and all words from the previous sentence $X_{i-1} = [x_{i-1,1}, x_{i-1,2}, \ldots, x_{i-1,m}]$, we regard the matrix $X_{i-1}$ as an external memory module and calculate an attention distribution over its cells, where each cell contains a word embedding:

$e_{t,j} = v^\top \tanh(W_a [x_{i,t}; x_{i-1,j}])$
$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_k \exp(e_{t,k})}$

With the attention distribution $\alpha$, we can get a vector summarizing related information from $X_{i-1}$:

$m_t = \sum_{j=1}^{m} \alpha_{t,j}\, x_{i-1,j}$

The model decides whether it needs to pay more attention to the current input or to cross-sentence information with a context gate:

$g_t = \sigma(W_g [x_{i,t}; m_t])$
$\tilde{x}_t = g_t \odot x_{i,t} + (1 - g_t) \odot m_t$

where $\sigma(\cdot)$ stands for the sigmoid function. The word representation of the target word is calculated as

$h_{i,t} = f_{lstm}(\tilde{x}_t, h_{i,t-1}, c_{i,t-1})$

where $f_{lstm}$ stands for the standard LSTM update described in section 3.2.1.
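A minimal sketch of the attention-and-gate computation, assuming an MLP attention scorer (consistent with the one-hidden-layer MLP mentioned in the hyperparameter setup) and an elementwise context gate; the exact parameterization in the paper may differ, and all names and sizes here are ours:

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 8, 6   # embedding size and attention-MLP hidden size (toy values)

Wa = rng.normal(size=(H, 2 * D)) * 0.1   # attention MLP, hidden layer
va = rng.normal(size=H) * 0.1            # attention MLP, output layer
Wg = rng.normal(size=(D, 2 * D)) * 0.1   # context-gate weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def asl_input(x_t, prev_sentence):
    """Mix the current word vector with attended context from the previous sentence."""
    scores = np.array([va @ np.tanh(Wa @ np.concatenate([x_t, x_j]))
                       for x_j in prev_sentence])
    alpha = softmax(scores)                   # attention over memory cells
    m_t = alpha @ np.stack(prev_sentence)     # context summary vector
    g_t = 1.0 / (1.0 + np.exp(-(Wg @ np.concatenate([x_t, m_t]))))  # context gate
    return g_t * x_t + (1.0 - g_t) * m_t, alpha   # gated input for the LSTM

prev = [rng.normal(size=D) for _ in range(5)]   # previous sentence, 5 words
x_t = rng.normal(size=D)                        # current input word
mixed, alpha = asl_input(x_t, prev)
```

The returned vector `mixed` would then be fed to the standard LSTM update in place of the raw input embedding.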

Co-reference Prediction
In this work, we apply the mention-ranking end-to-end co-reference resolution (E2E-CR) model proposed by Lee et al. (2017) for co-reference prediction. The word representations applied in the E2E-CR model are formed by concatenating pre-trained word embeddings and the outputs of LSTMs. In our work, we represent words by concatenating pre-trained word embeddings and the outputs of the LSL- and ASL-LSTMs.
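The mention-ranking decision of Lee et al. (2017) can be sketched schematically: each candidate pair (i, j) receives a score combining unary mention scores and a pairwise antecedent score, and a dummy antecedent with fixed score 0 lets the model predict no co-reference. The scorers below are simplified stand-ins with toy weights, not the paper's actual feed-forward networks:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10                                    # toy span-representation size

w_m = rng.normal(size=D) * 0.1            # unary mention-score weights
w_a = rng.normal(size=3 * D) * 0.1        # pairwise antecedent-score weights

def score_mention(g):                     # s_m: how mention-like is this span?
    return w_m @ g

def score_pair(g_i, g_j):                 # s_a: how compatible are the two spans?
    return w_a @ np.concatenate([g_i, g_j, g_i * g_j])

spans = [rng.normal(size=D) for _ in range(4)]   # toy span representations
i = 3                                            # current mention

# Score every candidate antecedent j < i; the dummy antecedent scores 0,
# so only a positive-scoring candidate can be selected.
scores = {j: score_mention(spans[i]) + score_mention(spans[j])
             + score_pair(spans[i], spans[j]) for j in range(i)}
best = max(scores, key=scores.get)
antecedent = best if scores[best] > 0 else None  # None = no co-reference found
```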

Experiments
We train and evaluate our model on the English corpus of the CoNLL-2012 shared task (Pradhan et al., 2012). We implement our model based on the published implementation of the baseline E2E-CR model (Lee et al., 2017) 1 . Our implementation is also available online for reproducing the results reported in this paper 2 . In this section, we first describe our hyperparameter setup, and then show the experimental results of previous work and our proposed models.

Model and Hyperparameter Setup
In practice, the LSTM modules applied in our model have 200 output units. In ASL, we calculate cross-sentence dependency using a multilayer perceptron with one hidden layer of 150 units. The initial learning rate is set to 0.001 and decays by 0.001% every 100 steps. The model is optimized with the Adam algorithm (Kingma and Ba, 2014). If the input is too long, we randomly select up to 40 consecutive sentences for training. In co-reference prediction, we select 250 candidate antecedents, as in our baseline model.

Experiment Results and Discussion
We evaluate our model on the test set of the CoNLL-2012 shared task. The performance of previous work and of our model is shown in Table 1. We mainly focus on the average F1 score of the MUC, B^3, and CEAF metrics. Compared with the baseline model, which achieved a 67.2% F1 score, the ASL model improved performance by 0.6% and achieved a 67.8% average F1.

Table 1: Experimental results of previous models and cross-sentence dependency learning models on the CoNLL-2012 shared task.

Table 2: Example sentence pairs comparing the ASL model and the baseline (annotation formatting lost in extraction).
- I remember receiving an SMS like this one last year before it snowed, since snowfall would affect road conditions in Beijing to a large extent.
- Uh-huh. However, it did not give people such a special feeling as it did this time.
- Reporters are tired of the usual stand ups.
- They want to be riding on a train or walking in the rain or something to get attention.
- Planned terrorist bombing that ripped a 20 x 40-foot hole in the Navy destroyer USS Cole in the Yemeni port of Aden.
- The ship was there for refueling.
- Yemeni authorities claimed they have detained over 70 people for questioning.
- These include some Afghan-Arab volunteers.

Experiments show that the models that consider cross-sentence dependency significantly outperform the baseline model, which encodes each sentence of the input document separately. Experiments also indicate that the ASL model performs better than the LSL model, since it extracts context information with an attention mechanism instead of simply using sentence-level embeddings. This gives the model a better ability to model cross-sentence dependency.
Examples comparing the performance of the ASL model and the baseline are shown in Table 2. Each example contains two continuous sentences with co-references distributed across different sentences. Underlined spans in bold are target mentions and annotated co-references. Spans in green are ASL predictions, and spans in red are baseline predictions. A prediction of "-" means that no mention is predicted as a co-reference. Table 2 shows that the baseline model, which does not consider cross-sentence dependency, has difficulty learning the semantics of pronouns whose co-references are not in the same sentence. The pre-trained embeddings of pronouns are not informative enough. In the first example, "it" is not semantically similar to "SMS" in GloVe without any context, and in this case, "it" and "SMS" are in different sentences. As a result, when reading these two sentences separately, it is hard for the encoder to represent "it" with the semantics of "SMS". This difficulty makes the co-reference resolution model either predict a wrong antecedent mention or fail to find any co-reference.
However, with ASL, the model learns the semantics of pronouns with attention to words in other sentences. With the proposed context gate, ASL takes knowledge from context sentences when local inputs are not informative enough. Based on word representations enhanced with cross-sentence dependency, the co-reference scoring model can make better predictions.

Conclusion and Future Work
We proposed linear and attentional sentence linking models for learning word representations that capture cross-sentence dependency. Experiments showed that the embeddings learned by the proposed models successfully improved the performance of the state-of-the-art co-reference resolution model, indicating that cross-sentence dependency plays an important role in semantic learning for articles and conversations consisting of multiple sentences. It is worth exploring whether our model can improve the performance of other natural language processing applications whose inputs contain multiple sentences, for example reading comprehension, dialog generation, and sentiment analysis.