DeepNL: a Deep Learning NLP pipeline

We present the architecture of a deep learning pipeline for natural language processing. Based on this architecture we built a set of tools, both for creating distributional vector representations and for performing specific NLP tasks. Three methods are available for creating embeddings: a feed-forward neural network, sentiment-specific embeddings, and embeddings based on co-occurrence counts and Hellinger PCA. Two methods are provided for training a network to perform sequence tagging: a window approach and a convolutional approach. The window approach is used for implementing a POS tagger and a NER tagger; the convolutional network is used for Semantic Role Labeling. The library is implemented in Python, with core numerical processing written in C++ using a parallel linear algebra library for efficiency and scalability.


Introduction
Distributional Semantic Models (DSM), which represent words as vectors of weights over a high-dimensional feature space (Hinton et al., 1986), have proved very effective in representing semantic or syntactic aspects of the lexicon. Incorporating such representations has improved many natural language processing tasks. They also reduce the burden of feature selection, since these models can be learned from text through unsupervised techniques.
Deep learning algorithms for NLP tasks exploit distributional representations of words. In tagging applications such as POS tagging, NER tagging and Semantic Role Labeling (SRL), this has proved quite effective in reaching state-of-the-art accuracy and reducing reliance on manually engineered features.
Word embeddings have been exploited also in constituency parsing and in dependency parsing (Chen and Manning, 2014).
A further benefit of a deep learning approach is to allow performing multiple tasks jointly, and therefore reducing error propagation as well as improving efficiency. This paper presents DeepNL, an NLP pipeline based on a common Deep Learning architecture: it consists of tools for creating embeddings, and tools that exploit word embeddings as features. The current release includes a POS tagger, a NER, an SRL tagger and a dependency parser.
Two methods are supported for creating embeddings: an approach that uses a neural network and one using Hellinger PCA (Lebret and Collobert, 2014).

NLP Toolkits
A short survey of NLP toolkits is presented by Krithika and Akondi (2014).
NLTK is among the most well-known and comprehensive NLP toolkits: it is written in Python and provides a number of basic processing facilities (tokenization, sentence splitting, statistical analysis of corpora, etc.) as well as machine learning algorithms for classification and clustering. Currently it does not provide any tool based on word embeddings; however, it can be interfaced to SENNA or used in conjunction with Gensim, which provides several algorithms for performing unsupervised semantic modeling from plain text, including word embeddings, random indexing, and LDA (Latent Dirichlet Allocation).
The Stanford NLP Toolkit is written in Java and provides tools for tokenization, sentence splitting, POS tagging, NER, parsing, sentiment analysis and temporal expression tagging. As a recent inclusion, it provides a dependency parser based on a neural network and word embeddings (Chen et al., 2014).
OpenNLP is a machine learning library written in Java that supports the most common NLP tasks, such as tokenization, sentence segmentation, POS tagging, named entity extraction, chunking, parsing, and coreference resolution.
Typically each tool built with these libraries uses a different approach, or the algorithm most suitable for the task: for example, Stanford NLP uses Conditional Random Fields for NER while its POS tagger uses Maximum Entropy, and both require a rich set of features that must be manually engineered.
DeepNL differs from these toolkits since it is based on a common deep learning architecture: all tools exploit the same core neural network and use mostly just word embeddings as features. For example the POS tagger and the NER tagger have an identical structure, and they differ only in the way they read/write documents and in the configuration of the discrete features used: the POS tagger uses word suffixes while the NER uses gazetteer dictionaries. Embeddings are used as features, providing a continuous rather than discrete representation of text.
The ability to create suitable embeddings for various tasks is critical for the proper working of the tools in DeepNL; hence the toolkit integrates algorithms for creating word embeddings from text, in either an unsupervised or a supervised fashion.

Building Word Embeddings
Word embeddings provide a low dimensional vector space representation for words, where values in each dimension may represent syntactic or semantic properties.
DeepNL provides two methods for building embeddings: one based on the use of a neural language model, as proposed in (Turian et al., 2010; Mikolov et al., 2010), and one based on a spectral method, as proposed by Lebret and Collobert (2013).
The neural language method can be hard to train and the process is often quite time consuming, since several iterations are required over the whole training set. Some researchers provide precomputed embeddings for English. The Polyglot project (Al-Rfou et al., 2013) makes available embeddings for several languages, built from the plain text of Wikipedia in the respective language, along with the Python code for computing them, which supports GPU computations by means of Theano. Mikolov et al. (2013) developed an alternative solution for computing word embeddings, which significantly reduces the computational costs. They propose two log-linear models, called the continuous bag-of-words model and the skip-gram model. The bag-of-words approach is similar to a feed-forward neural network language model and learns to classify the current word in a given context, except that instead of concatenating the vectors of the words in the context window of each token, it just averages them, eliminating a network layer and reducing the data dimensions. The skip-gram model tries instead to estimate context words based on the current word. A further speed-up in the computation is obtained by exploiting a mini-batch Asynchronous Stochastic Gradient Descent algorithm, splitting the training corpus into partitions and assigning them to multiple threads. An optimistic approach is also exploited to avoid synchronization costs: updates to the current weight matrix are performed concurrently, without any locking, assuming that updates to the same rows of the matrix will be infrequent and will not harm convergence.
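The effect of averaging rather than concatenating the context vectors can be sketched as follows (illustrative code, not the word2vec implementation; function names are hypothetical):

```python
import numpy as np

def cbow_input(embeddings, context_ids):
    """Average the context word vectors (CBOW-style input).

    Averaging keeps the input dimension equal to the embedding size d,
    regardless of window width; `embeddings` is a |V| x d matrix and
    `context_ids` indexes the window words around the target word.
    """
    return embeddings[context_ids].mean(axis=0)

def window_input(embeddings, context_ids):
    """Concatenate the context vectors (feed-forward NNLM-style input),
    whose dimension grows linearly with the window width."""
    return embeddings[context_ids].reshape(-1)
```

With a window of c context words, the concatenated input has dimension c·d while the averaged one stays at d, which is what eliminates a layer's worth of parameters.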
The authors published single-machine multithreaded C++ code for computing the word vectors. A reimplementation of the algorithm in Python is included in the Gensim library (Řehůřek and Sojka, 2010). In order to obtain speed comparable to the C++ version, it uses Cython to interface with a C implementation of the core function that trains the network on a single sentence, which in turn exploits the BLAS library for algebraic computations.
The DeepNL implementation is written in Cython and uses C++ code that exploits the Eigen library for efficient parallel linear algebra computations. Data is exchanged between NumPy arrays in Python and Eigen matrices by means of Eigen Map types. On the Cython side, a pointer to the location where the data of a NumPy array is stored is obtained with a call like:

    <FLOAT_t*>np.PyArray_DATA(self.nn.hidden_weights)

and passed to a C++ method. On the C++ side this is turned into an Eigen matrix, with no computational costs due to conversion or allocation, with the code:

    Map<Matrix> hidden_weights(hidden_weights, numHidden, numInput);

which interprets the pointer to a double as a matrix with numHidden rows and numInput columns. Since Eigen by default uses column-major order while NumPy uses row-major order, the class Matrix above is declared as:

    typedef Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> Matrix;

Lebret and Collobert (2013) have shown that embeddings can be computed efficiently from word co-occurrence counts, applying Principal Component Analysis (PCA) to reduce dimensionality while optimizing the Hellinger similarity distance. Levy and Goldberg (2014) have shown similarly that the skip-gram model by Mikolov et al. (2013) can be interpreted as implicitly factorizing a word-context matrix, whose values are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant.

PCA
DeepNL provides an implementation of the Hellinger PCA algorithm using Cython and the LAPACK routine SSYEVR, accessed through SciPy.
Co-occurrence frequencies are computed by counting the number of times each context word w ∈ D occurs after a sequence of T words:

    p(w|T) = n(w, T) / Σ_{w'∈D} n(w', T)

where n(w, T) is the number of times word w occurs after a sequence of T words. The set D of context words is normally chosen as the subset of the most frequent words in the vocabulary V.
The word co-occurrence matrix C, of size |V| × |D|, is built from these frequencies. The coefficients of C are square-rooted and the matrix is then multiplied by its transpose to obtain a symmetric square matrix of size |V| × |V|, to which PCA is applied to obtain the desired dimensionality reduction.
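The steps above can be sketched in NumPy as follows (an illustrative reconstruction, not the DeepNL code; the function name is hypothetical, and `numpy.linalg.eigh` stands in for the SSYEVR routine, which belongs to the same symmetric-eigensolver family of LAPACK):

```python
import numpy as np

def hellinger_pca_sketch(counts, dim):
    """Reduce a |V| x |D| co-occurrence count matrix to |V| x dim embeddings.

    counts: rows index vocabulary words, columns index context words.
    """
    # Normalize rows to co-occurrence probabilities p(w|T).
    probs = counts / counts.sum(axis=1, keepdims=True)
    # Square-root the coefficients (Hellinger transform).
    root = np.sqrt(probs)
    # Multiply by the transpose: symmetric |V| x |V| matrix.
    sym = root @ root.T
    # Eigendecomposition of the symmetric matrix (ascending eigenvalues).
    eigvals, eigvecs = np.linalg.eigh(sym)
    # Keep the top `dim` principal components, scaled by singular values.
    return eigvecs[:, -dim:] * np.sqrt(eigvals[-dim:])
```

Since the matrix is symmetric and positive semi-definite, its top eigenvectors give the principal directions, and scaling by the square roots of the eigenvalues yields the PCA projection of the rows.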

Sentiment Specific Word Embeddings
For the task of sentiment analysis, semantic similarity is not appropriate, since antonyms end up at close distance in the embedding space. One needs to learn a vector representation where words of opposite polarity are farther apart. Tang et al. (2014) propose an approach for learning sentiment-specific word embeddings, incorporating supervised knowledge of polarity in the loss function of the learning algorithm. The original hinge loss function of the algorithm is:

    loss_cw(x, x^c) = max(0, 1 − f_θ(x) + f_θ(x^c))

where x is an ngram, x^c is the same ngram corrupted by replacing the target word with a randomly chosen one, and f_θ(·) is the function computed by the neural network with parameters θ. The sentiment-specific network outputs a vector of two dimensions: the first component, f^0_θ, models the generic syntactic/semantic aspects of words, and the second, f^1_θ, models polarity.
A second loss function is introduced as an objective for minimization:

    loss_s(x, x^c) = max(0, 1 − δ_s(x) f^1_θ(x) + δ_s(x) f^1_θ(x^c))

where δ_s(x) is an indicator function reflecting the sentiment polarity of the sentence (+1 if positive, −1 if negative), determined from the gold distribution f_g(x) for ngram x. The overall hinge loss is a linear combination of the two:

    loss(x, x^c) = α · loss_cw(x, x^c) + (1 − α) · loss_s(x, x^c)

The gradient at the output layer is obtained by combining the subgradients of the two hinge terms with the same weights. DeepNL provides an algorithm for training polarized embeddings, performing gradient descent with an adaptive learning rate according to the AdaGrad method (Duchi et al., 2011). The algorithm requires a training set consisting of sentences annotated with their polarity, for example a corpus of tweets. The algorithm builds embeddings for both unigrams and ngrams at the same time, performing variations on a training sentence by replacing not just a single word, but a sequence of words, with either another word or another ngram.
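The combined objective can be sketched as follows (an illustrative reconstruction of the loss computation, not the DeepNL training code; the default value of alpha is hypothetical):

```python
def sswe_hinge_loss(f_x, f_xc, polarity, alpha=0.5):
    """Combined hinge loss over a two-dimensional network output.

    f_x, f_xc: outputs for the original ngram x and the corrupted ngram
    x^c; index 0 models syntax/semantics, index 1 models polarity.
    polarity: +1 for a positive sentence, -1 for a negative one.
    alpha: mixing weight between the two objectives.
    """
    # Syntactic loss: the true ngram should outscore the corrupted one.
    loss_cw = max(0.0, 1.0 - f_x[0] + f_xc[0])
    # Sentiment loss: the polarity score should agree with the gold label.
    loss_s = max(0.0, 1.0 - polarity * f_x[1] + polarity * f_xc[1])
    return alpha * loss_cw + (1.0 - alpha) * loss_s
```

When both margins are satisfied the loss is zero and no gradient is propagated; otherwise the violated term contributes its subgradient, scaled by alpha or (1 − alpha).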

Deep Learning Architecture
DeepNL adopts a multi-layer neural network architecture, consisting of the following layers:
1. Lookup layer. Maps word feature indices to a feature vector, as described below.
2. Linear layer. Fully connected network layer, represented by matrix M^1 and input bias b^1.
3. Activation layer. A non-linear function, e.g. hardtanh.
4. Linear layer. Fully connected network layer, represented by matrix M^2 and input bias b^2.
5. Softmax layer. Computes the softmax of the output values, producing a probability distribution over the outputs.
Overall, the network computes the following function:

    f_θ(x) = softmax(M^2 a(M^1 x + b^1) + b^2)

where θ = (M^1 ∈ ℝ^{h×d}, b^1 ∈ ℝ^h, M^2 ∈ ℝ^{o×h}, b^2 ∈ ℝ^o) are the parameters, with d the dimension of the input, h the number of hidden units, o the number of output classes, and a(·) the activation function.
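The forward computation can be sketched in NumPy (an illustrative reconstruction, not the library code; tanh stands in here for the hardtanh activation):

```python
import numpy as np

def forward(x, M1, b1, M2, b2):
    """Forward pass of the window network.

    x: input vector of dimension d; M1 (h x d), b1 (h), M2 (o x h),
    b2 (o) are the trainable parameters.
    """
    hidden = np.tanh(M1 @ x + b1)        # linear layer + activation
    scores = M2 @ hidden + b2            # output linear layer
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()
```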

Lookup layer
The first layer of the network transforms the input into a feature vector representation. Individual words are represented by a tuple of K discrete features, w ∈ D^1 × … × D^K, where D^k is the dictionary for the k-th feature. Each feature has its own lookup table LT_{W^k}(·), with a matrix of parameters W^k ∈ ℝ^{d_k × |D^k|} to be learned, where d_k is a user-specified vector size. The lookup table layer LT_{W^k}(·) associates a vector of weights to each discrete feature f ∈ D^k:

    LT_{W^k}(f) = ⟨W^k⟩_f

where ⟨W^k⟩_f ∈ ℝ^{d_k} is the f-th column of W^k and d_k is the word vector size (a hyper-parameter to be chosen by the user).
The feature vector for word w is then the concatenation of the vectors for all its features:

    LT(w) = (LT_{W^1}(f_1), …, LT_{W^K}(f_K))

This vector of features for word w is passed as input to the network. W^k, M^1, b^1, M^2 and b^2 are the parameters to be learned by backpropagation.
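The lookup-and-concatenate step can be sketched as follows (hypothetical helper code, not the DeepNL API):

```python
import numpy as np

def lookup_concat(tables, feature_ids):
    """Lookup layer: one weight table per feature type.

    tables[k] is a d_k x |D_k| matrix W^k; feature_ids[k] is the index
    of the k-th feature of the word. The word representation is the
    concatenation of the selected columns, of dimension sum(d_k).
    """
    return np.concatenate([W[:, f] for W, f in zip(tables, feature_ids)])
```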

Feature Extractors
The library has a modular architecture that allows customizing a network for specific tasks, in particular its first layer, by supplying extractors for various types of features. An extractor is defined as a class that inherits from an abstract class with the following interface:

    class Extractor(object):
        def extract(self, tokens): ...
        def lookup(self, feature): ...
        def save(self, file): ...
        def load(self, file): ...

Method extract, applied to a list of tokens, extracts features from each token and returns a list of IDs for those features. The argument is a list of tokens rather than a single token, since features might depend on consecutive tokens: for instance, a gazetteer extractor needs to look at a sequence of tokens to determine whether they are mentioned in its dictionary. Method lookup returns the vector of weights for a given feature. Methods save/load allow saving and reloading the extractor data to/from disk.
Extractors currently include an Embeddings extractor, implementing the word lookup feature; Caps, Prefix and Postfix extractors for dealing with capitalization and prefix/postfix features; a Gazetteer extractor for dealing with the gazetteers typically used in a NER; and a customizable AttributeFeature extractor that extracts features from the state of a Shift/Reduce dependency parser, i.e. from the tokens in the stack or buffer, as described for example in Nivre (2007).
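As an illustration of the interface, the following is a minimal suffix extractor in the same style (hypothetical code, not the DeepNL implementation; lookup, save and load are omitted for brevity):

```python
class SuffixExtractor(object):
    """Map each token to the ID of its longest known suffix."""

    def __init__(self, suffixes):
        # Map each known suffix to an ID; reserve 0 for "no known suffix".
        self.ids = {s: i + 1 for i, s in enumerate(suffixes)}

    def extract(self, tokens):
        # One feature ID per token in the sequence.
        return [self._suffix_id(tok) for tok in tokens]

    def _suffix_id(self, token):
        for n in (2, 1):  # prefer two-character suffixes over single ones
            sid = self.ids.get(token[-n:])
            if sid is not None:
                return sid
        return 0
```

The returned IDs index columns of the corresponding lookup table W^k, so the extractor fully determines how a discrete feature enters the first layer.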

Sequence Taggers
For sequence tagging, two approaches have been proposed: a window approach and a sentence approach. The window approach assumes that the tag of a word depends mainly on the neighboring words, and is suitable for tasks like POS and NE tagging. The sentence approach assumes that the whole sentence must be taken into account, by adding a convolution layer after the first lookup layer, and is more suitable for tasks like SRL.
We can train a neural network to maximize the log-likelihood over the training data. Denoting by θ the trainable parameters, including the network parameters and the transition scores, we want to maximize the following log-likelihood with respect to θ:

    Σ_{(x,t)} log p(t | x, θ)

where the x are the training sentences and t their corresponding tag sequences.
The score s(x, t, θ) of a sequence of tags t for a sentence x, with parameters θ, is given by the sum of the transition scores and the tag scores:

    s(x, t, θ) = Σ_i ( T(t_{i−1}, t_i) + f_θ(t_i, x_i) )

where T(i, j) is the score for the transition from tag i to tag j, and f_θ(t_i, x_i) is the output of the network at word x_i for tag t_i. The probability of a sequence t for sentence x can be expressed as:

    log p(t | x, θ) = s(x, t, θ) − logadd_{t'} s(x, t', θ)

where logadd_i z_i = log Σ_i e^{z_i}. In order to avoid numeric overflow, the function logadd must be computed carefully, i.e. by subtracting the maximum value from the coefficients before performing the exponentiation and then re-adding the maximum.
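The stable computation of logadd described above can be written in a few lines (a sketch, not the library's implementation):

```python
import numpy as np

def logadd(a):
    """Numerically stable log(sum(exp(a))): subtract the maximum before
    exponentiating, then re-add it, so that exp never overflows."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))
```

A naive `np.log(np.sum(np.exp(a)))` would overflow for scores around 1000, while the shifted version returns the exact result.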
The computation of the gradients can be performed at once for the whole sequence exploiting matrix operations whose computation can be optimized and parallelized using suitable linear algebra libraries. We implemented two versions of the network trainer, one in Python using NumPy 11 and one in C++ using Eigen 12 .
The Python version, for example, computes the delta values in the above equation using three arrays: scores[i, j] contains the output of the neural network for the i-th element of the sequence and tag j; delta[i, j] represents the logadd of the scores of all tag paths ending at the i-th token with tag j; transitions[i, j] contains the current estimate of the score of a transition from tag i to tag j.
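The recurrence over the sequence can be sketched as follows (a reconstruction for illustration, not the library's actual code), exploiting matrix operations so that each step processes all tag pairs at once:

```python
import numpy as np

def logadd(a, axis=0):
    """Stable log(sum(exp(a))) along an axis, via the max-shift trick."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

def compute_delta(scores, transitions):
    """Forward recurrence over a tag sequence.

    scores: T x K outputs (token i, tag j); transitions: K x K scores
    from tag i to tag j. delta[i, j] accumulates the logadd of the
    scores of all tag paths ending at token i with tag j.
    """
    delta = np.empty_like(scores)
    delta[0] = scores[0]
    for i in range(1, len(scores)):
        # logadd over the previous tag of (delta + transition), plus the
        # network score of the current token for each tag.
        delta[i] = scores[i] + logadd(delta[i - 1][:, None] + transitions,
                                      axis=0)
    return delta
```

The final logadd over delta[-1] gives the normalization term of the sequence probability.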

Experiments
We tested the DeepNL sequence tagger on the CoNLL 2003 challenge, a NER benchmark based on Reuters data. The tagger was trained with three types of features: the word embeddings from SENNA; a "caps" feature telling whether a word is in lowercase, uppercase, title case, or has at least one non-initial capital letter; and a gazetteer feature, based on the lists provided by the organizers. The window size was set to 5, 300 hidden variables were used, and training was iterated for 40 epochs. In the following table we report the scores compared with the system by Ando et al. (2005). The slight difference with SENNA might be explained by the use of different gazetteers.

The same sequence tagger can be used for POS tagging. In this case the discrete features used are the same capitalization feature as for the NER and a suffix feature, which denotes whether a token ends with one of the 455 most frequent suffixes of length one or two characters in the training corpus. Both these experiments confirm that word embeddings can replace complex manually engineered features in typical natural language processing tasks.

Dependency Parsing
We have adapted our original transition-based dependency parser DeSR (Attardi et al., 2009), which was already based on a neural network, to the use of embeddings. The parser uses the neural network to decide which action to perform at each step in the analysis of a sentence. Looking at a short context of past analyzed tokens and next input tokens, it must decide whether the two current focus tokens can be connected by a dependency relation. In that case it performs a reduction, creating the dependency; otherwise it advances on the input. The original implementation used a large set of discrete features to represent the current context. The deep learning version of the parser exploits word embeddings as features and also creates a dense vector representation for the remaining discrete features. A specific extractor (AttributeExtractor) was built for this purpose.
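The decision loop just described can be sketched schematically as follows (a simplified arc-standard-style illustration with hypothetical names, not DeSR's actual API; `decide` stands for the neural network classifier):

```python
def parse_sketch(tokens, decide):
    """Transition-based parsing loop driven by a classifier.

    decide(stack, buffer) returns "shift", "left" or "right".
    Returns the dependency arcs as (head, dependent) pairs.
    """
    stack, buffer, arcs = [], list(tokens), []
    while buffer or len(stack) > 1:
        action = decide(stack, buffer) if len(stack) >= 2 else "shift"
        if action == "shift" and buffer:
            stack.append(buffer.pop(0))      # advance on the input
        elif action == "left":               # second-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "right":              # top depends on second-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        else:
            break                            # no applicable action: stop
    return arcs
```

In the deep learning version, `decide` would score the actions from the embeddings of the tokens in the stack and buffer, as extracted by the AttributeExtractor.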

Conclusions
We have presented the architecture of DeepNL, a library for building NLP applications based on a deep learning architecture. The implementation is written in Python/Cython and uses C++ linear algebra libraries for efficiency and scalability, exploiting multithreading or GPUs where available.
The implementation of DeepNL is available on GitHub.
The availability of a library that allows creating embeddings and training a deep learning architecture using them might contribute to the development of further tools for linguistic analysis.
For example we are planning to build a classifier for performing identification of affirmative, negative or speculative contexts in sentences.
We are also considering additional ways of creating embeddings, for example to generate context sensitive embeddings that could provide word representations that disambiguate among word senses.