SoPa: Bridging CNNs, RNNs, and Weighted Finite-State Machines

Recurrent and convolutional neural networks comprise two distinct families of models that have proven to be useful for encoding natural language utterances. In this paper we present SoPa, a new model that aims to bridge these two approaches. SoPa combines neural representation learning with weighted finite-state automata (WFSAs) to learn a soft version of traditional surface patterns. We show that SoPa is an extension of a one-layer CNN, and that such CNNs are equivalent to a restricted version of SoPa, and accordingly, to a restricted form of WFSA. Empirically, on three text classification tasks, SoPa is comparable to or better than both a BiLSTM (RNN) baseline and a CNN baseline, and is particularly useful in small data settings.


Introduction
Recurrent neural networks (RNNs; Elman, 1990) and convolutional neural networks (CNNs; LeCun, 1998) are two of the most useful text representation learners in NLP (Goldberg, 2016). These methods are generally considered to be quite different: the former encodes an arbitrarily long sequence of text, and is highly expressive (Siegelmann and Sontag, 1995). The latter is more local, encoding fixed-length windows, and accordingly less expressive. In this paper, we seek to bridge the gap between RNNs and CNNs, presenting SoPa (for Soft Patterns), a model that lies in between them.
SoPa is a neural version of a weighted finite-state automaton (WFSA), with a restricted set of transitions. Linguistically, SoPa is appealing as it is able to capture a soft notion of surface patterns (e.g., "what a great X !"; Hearst, 1992), where some words may be dropped, inserted, or replaced with similar words (see Figure 1). From a modeling perspective, SoPa is interesting because WFSAs are well studied and come with efficient and flexible inference algorithms (Mohri, 1997; Eisner, 2002) that SoPa can take advantage of. SoPa defines a set of soft patterns of different lengths, with each pattern represented as a WFSA (Section 3). While the number and lengths of the patterns are hyperparameters, the patterns themselves are learned end-to-end. SoPa then represents a document with a vector that is the aggregate of the scores computed by matching each of the patterns with each span in the document. Because SoPa defines a hidden state that depends on the input token and the previous state, it can be thought of as a simple type of RNN. We show that SoPa is an extension of a one-layer CNN (Section 4). Accordingly, one-layer CNNs can be viewed as a collection of linear-chain WFSAs, each of which can only match fixed-length spans, while our extension allows matches of flexible length. As a simple type of RNN that is more expressive than a CNN, SoPa helps to link CNNs and RNNs.

Figure 1: A representation of a surface pattern as a six-state automaton. Self-loops allow for repeatedly inserting words (e.g., "funny"). ε-transitions allow for dropping words (e.g., "a").
To test the utility of SoPa, we experiment with three text classification tasks (Section 5). We compare against four baselines, including both a bidirectional LSTM and a CNN. Our model performs on par with or better than all baselines on all tasks (Section 6). Moreover, when training with smaller datasets, SoPa is particularly useful, outperforming all models by substantial margins. Finally, building on the connections discovered in this paper, we offer a new, simple method to interpret SoPa (Section 7). This method applies equally well to CNNs. We release our code at https://github.com/Noahs-ARK/soft_patterns.

Background
Surface patterns. Patterns (Hearst, 1992) are a particularly useful tool in NLP (Lin et al., 2003; Etzioni et al., 2005). The most basic definition of a pattern is a sequence of words and wildcards (e.g., "X is a Y"), which can either be manually defined or extracted from a corpus using co-occurrence statistics. Patterns can then be matched against a specific text span by replacing wildcards with concrete words. Davidov and Rappoport (2008) introduced a flexible notion of patterns, which supports partial matching of a pattern with a given text by skipping some of the words in the pattern, or introducing new words. In their framework, when a sequence of text partially matches a pattern, hard-coded partial scores are assigned to the pattern match. Here, we represent patterns as WFSAs with neural weights, and support these partial matches in a soft manner.
WFSAs. We review weighted finite-state automata with ε-transitions before we move on to our special case in Section 3. A WFSA-ε with d states over a vocabulary V is formally defined as a tuple F = ⟨π, T, η⟩, where π ∈ R^d is an initial weight vector, T : (V ∪ {ε}) → R^{d×d} is a transition weight function, and η ∈ R^d is a final weight vector. Given a sequence of words in the vocabulary x = x_1, …, x_n, the Forward algorithm (Baum and Petrie, 1966) scores x with respect to F. Without ε-transitions, Forward can be written as a series of matrix multiplications:

p(x) = π^⊤ T(x_1) T(x_2) ⋯ T(x_n) η.    (1)

ε-transitions are followed without consuming a word, so Equation 1 must be updated to reflect the possibility of following any number (zero or more) of ε-transitions in between consuming each word:

p(x) = π^⊤ T(ε)^* T(x_1) T(ε)^* T(x_2) T(ε)^* ⋯ T(x_n) T(ε)^* η,    (2)

where ^* is matrix asteration: A^* := ∑_{j=0}^{∞} A^j. In our experiments we use a first-order approximation, A^* ≈ I + A, which corresponds to allowing zero or one ε-transition at a time. When the FSA F is probabilistic, the result of the Forward algorithm can be interpreted as the marginal probability of all paths through F while consuming x (hence the symbol "p").
The Forward algorithm can be generalized to any semiring (Eisner, 2002), a fact that we make use of in our experiments and analysis. The vanilla version of Forward uses the sum-product semiring: ⊕ is addition, ⊗ is multiplication. A special case of Forward is the Viterbi algorithm (Viterbi, 1967), which sets ⊕ to the max operator. Viterbi finds the highest scoring path through F while consuming x. Both Forward and Viterbi have runtime O(d^3 + d^2 n), requiring just a single linear pass through the phrase. Using first-order approximate asteration, this runtime drops to O(d^2 n). Finally, we note that Forward scores are for exact matches: the entire phrase must be consumed. We show in Section 3.2 how phrase-level scores can be summarized into a document-level score.
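To make the semiring generalization concrete, the following is a minimal NumPy sketch of Forward over a WFSA-ε with first-order asteration. It is illustrative only (the released implementation is in PyTorch), and the function and argument names are ours. Passing plus=np.maximum recovers Viterbi (max-product), and additionally passing times=np.add gives the max-sum semiring used in Section 4.

```python
# Sketch: Forward over a WFSA-epsilon, generalized to a semiring (plus, times).
# Defaults give sum-product Forward; plus=np.maximum gives Viterbi.
import numpy as np

def forward_score(pi, eta, T_word, T_eps, tokens,
                  plus=np.add, times=np.multiply):
    """pi, eta: (d,) initial/final weights; T_word: token -> (d, d) matrix;
    T_eps: (d, d) epsilon-transition matrix; tokens: list of tokens."""
    def sr_matvec(v, M):
        # Semiring analogue of v @ M: combine with `times`, reduce with `plus`.
        return plus.reduce(times(v[:, None], M), axis=0)

    def eps_closure(v):
        # First-order asteration A* ~ I + A: take zero or one epsilon transitions.
        return plus(v, sr_matvec(v, T_eps))

    h = eps_closure(pi)                   # initial weights, then optional epsilon
    for tok in tokens:
        h = eps_closure(sr_matvec(h, T_word(tok)))
    return plus.reduce(times(h, eta))     # combine with the final weight vector
```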

SoPa: A Weighted Finite-State Automaton RNN
We introduce SoPa, a WFSA-based RNN, which is designed to represent text as a collection of surface pattern occurrences. We start by showing how a single pattern can be represented as a WFSA-ε (Section 3.1). Then we describe how to score a complete document using a pattern (Section 3.2), and how multiple patterns can be used to encode a document (Section 3.3). Finally, we show that SoPa can be seen as a simple variant of an RNN (Section 3.4).

Patterns as WFSAs
We describe how a pattern can be represented as a WFSA-ε. We first assume a single pattern. A pattern is a WFSA-ε, but we impose hard constraints on its shape, and its transition weights are given by differentiable functions that have the power to capture concrete words, wildcards, and everything in between. Our model is designed to behave similarly to flexible hard patterns (see Section 2), but to be learnable directly and "end-to-end" through backpropagation. Importantly, it will still be interpretable as a simple, almost linear-chain, WFSA-ε. Each pattern has a sequence of d states (in our experiments we use patterns of varying lengths between 2 and 7). Each state i has exactly three possible outgoing transitions: a self-loop, which allows the pattern to consume a word without moving states; a main path transition to state i + 1, which allows the pattern to consume one token and move forward one state; and an ε-transition to state i + 1, which allows the pattern to move forward one state without consuming a token. All other transitions are given score 0. When processing a sequence of text with a pattern p, we start with a special START state, and only move forward (or stay put), until we reach the special END state (to ensure this, we fix π = [1, 0, …, 0] and η = [0, …, 0, 1]). A pattern with d states will tend to match token spans of length d − 1 (but possibly shorter spans due to ε-transitions, or longer spans due to self-loops). See Figure 1 for an illustration.
Our transition function, T, is a parameterized function that returns a d × d matrix. For a word x:

[T(x)]_{i,j} = E(u_i · v_x + a_i)   if j = i (self-loop),
               E(w_i · v_x + b_i)   if j = i + 1 (main path),
               0                    otherwise,    (3)

where u_i and w_i are vectors of parameters, a_i and b_i are scalar parameters, v_x is a fixed pre-trained word vector for x (we use GloVe 300d 840B; Pennington et al., 2014), and E is an encoding function, typically the identity function or sigmoid. ε-transitions are also parameterized, but do not consume a token and depend only on the current state:

[T(ε)]_{i,j} = E(c_i)   if j = i + 1,
               0        otherwise,    (4)

where c_i is a scalar parameter. Adding ε-transitions to WFSAs does not increase their expressive power, and in fact slightly complicates the Forward equations; we use them because they require fewer parameters and make the modeling connection between (hard) flexible patterns and our (soft) patterns more direct and intuitive. As we have only three non-zero diagonals in total, the matrix multiplications in Equation 2 can be implemented using vector operations, and the overall runtimes of Forward and Viterbi are reduced to O(dn).

Words vs. wildcards. Traditional hard patterns distinguish between words and wildcards. Our model does not explicitly capture the notion of either, but the transition weight function can be interpreted in those terms. Each transition is a logistic regression over the next word vector v_x. For example, for a main path transition out of state i, T has two parameters, w_i and b_i. If w_i has large magnitude and is close to the word vector for some word y (e.g., w_i ≈ 100·v_y), and b_i is a large negative bias (e.g., b_i ≈ −100), then the transition essentially matches the specific word y. Whereas if w_i has small magnitude (w_i ≈ 0) and b_i is a large positive bias (e.g., b_i ≈ 100), then the transition ignores the current token and matches a wildcard. The transition could also be something in between, for instance by focusing on specific dimensions of a word's meaning encoded in the vector, such as POS or semantic features like animacy or concreteness (Rubinstein et al., 2015; Tsvetkov et al., 2015).
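As a concrete, purely illustrative sketch of Equations 3 and 4, the three non-zero diagonals can be computed with a few vector operations; the helper name and the sigmoid encoder are our assumptions, not the released code.

```python
# Sketch of the per-pattern transition parameterization (Equations 3-4).
# Only three diagonals of T are non-zero, so we return them as vectors.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def transition_scores(v_x, u, a, w, b, c, E=sigmoid):
    """v_x: (e,) word vector; u, w: (d, e) parameter matrices; a, b, c: (d,)."""
    self_loop = E(u @ v_x + a)   # [T(x)]_{i,i}:     consume x, stay in state i
    main_path = E(w @ v_x + b)   # [T(x)]_{i,i+1}:   consume x, advance one state
    eps_path = E(c)              # [T(eps)]_{i,i+1}: advance without consuming
    return self_loop, main_path, eps_path
```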

Scoring Documents
So far we have described how to calculate how well a pattern matches a token span exactly (consuming the whole span). To score a complete document, we prefer a score that aggregates over all matches on subspans of the document (similar to "search" instead of "match" in regular expression parlance). We still assume a single pattern. Either the Forward algorithm can be used to calculate the expected count of the pattern in the document, ∑_{1 ≤ i ≤ j ≤ n} p_span(x_{i:j}), or Viterbi to calculate s_doc(x) = max_{1 ≤ i ≤ j ≤ n} s_span(x_{i:j}), the score of the highest-scoring match. In short documents, we expect patterns to typically occur at most once, so in our experiments we choose the Viterbi algorithm, i.e., the max-product semiring.
Implementation details. We give the specific recurrences we use to score documents in a single pass with this model. We define an operator for taking zero or one ε-transitions, where max is element-wise max:

eps(h) := max(h, h ⊗ T(ε)),    (5)

and an initial state vector:

h_0 := eps(π^⊤).    (6)

We maintain a row vector h_t (equivalently, a 1 × d matrix) at each token:

h'_t := eps(h_{t−1} ⊗ T(x_t)),    (7a)
h_t := max(h'_t, h_0),    (7b)

and then extract and aggregate END state values:

s_t := [h_t]_d,    (8)
s_doc := max_{1 ≤ t ≤ n} s_t.    (9)

[h_t]_i represents the score of the best path through the pattern that ends in state i after consuming t tokens. By including h_0 in Equation 7b, we are accounting for spans that start at time t + 1. s_t is the maximum of the exact match scores for all spans ending at token t, and s_doc is the maximum score of any subspan in the document.
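The recurrence is simple enough to sketch directly. The following NumPy snippet is illustrative only (the released PyTorch implementation batches over patterns and documents); it scores a document with a single pattern under the max-product semiring, reusing the diagonal parameterization sketched in Section 3.1.

```python
# Sketch: single-pass max-product (Viterbi) document scoring for one pattern.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_document(doc_vectors, u, a, w, b, c, E=sigmoid):
    """doc_vectors: iterable of (e,) word vectors; u, w: (d, e); a, b, c: (d,)."""
    d = a.shape[0]
    eps_score = E(c)

    def eps(h):
        # Take zero or one epsilon transitions (element-wise max).
        shifted = np.zeros(d)
        shifted[1:] = h[:-1] * eps_score[:-1]
        return np.maximum(h, shifted)

    h0 = eps(np.eye(d)[0])               # weight 1 on START, optionally one epsilon
    h, s_doc = h0, 0.0
    for v_x in doc_vectors:              # a single pass over the document
        self_loop = E(u @ v_x + a)       # [T(x)]_{i,i}
        main_path = E(w @ v_x + b)       # [T(x)]_{i,i+1}
        nxt = h * self_loop                                   # stay in the same state
        nxt[1:] = np.maximum(nxt[1:], (h * main_path)[:-1])   # advance one state
        h = np.maximum(eps(nxt), h0)     # allow a new match to start at the next token
        s_doc = max(s_doc, h[-1])        # best exact match ending at this token
    return s_doc
```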

Aggregating Multiple Patterns
We describe how k patterns are aggregated to score a document. These k patterns give k different s_doc scores for the document, which are stacked into a vector z ∈ R^k and constitute the final document representation of SoPa. This vector representation can be viewed as a feature vector. In this paper, we feed it into a multilayer perceptron (MLP), culminating in a softmax to give a probability distribution over document labels. We minimize cross-entropy, allowing the SoPa and MLP parameters to be learned end-to-end. SoPa uses a total of (2e + 3)dk parameters, where e is the word embedding dimension, d is the number of states, and k is the number of patterns. For comparison, an LSTM with a hidden dimension of h has 4((e + 1)h + h^2) parameters. In Section 6 we show that SoPa consistently uses fewer parameters than a BiLSTM baseline to achieve its best result.
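A small sketch of the aggregation step and the parameter count follows; the names are illustrative, and `scorer` stands in for a per-pattern document scorer such as the one sketched in Section 3.2.

```python
# Sketch: stack k per-pattern document scores into z and classify with an MLP.
import numpy as np

def encode(doc_vectors, patterns, scorer):
    # patterns: list of k parameter tuples, one per soft pattern;
    # scorer: a per-pattern document scorer, e.g. score_document above.
    return np.array([scorer(doc_vectors, *p) for p in patterns])   # z in R^k

def mlp_logits(z, W1, b1, W2, b2):
    hidden = np.maximum(0.0, W1 @ z + b1)   # one hidden layer with ReLU
    return W2 @ hidden + b2                 # class scores; softmax applied in the loss

def sopa_param_count(e, d, k):
    # (2e + 3)dk: per state, u_i and w_i (e parameters each) plus a_i, b_i, c_i.
    return (2 * e + 3) * d * k
```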

SoPa as an RNN
SoPa can be considered an RNN. As shown in Section 3.2, a single pattern with d states has a hidden state vector of size d. Stacking the k hidden state vectors of k patterns into one vector of size k × d can be thought of as the hidden state of our model. This hidden state is, like in any other RNN, dependent on the input and the previous state. Using self-loops, the hidden state at time point i can in theory depend on the entire history of tokens up to x_i (see Figure 2 for an illustration). We do, however, want to discourage the model from following too many self-loops, taking them only when doing so results in a better fit with the remainder of the pattern. To do this we use the sigmoid function as our encoding function E (see Equation 3), which means that all transitions have scores strictly less than 1. This works to keep pattern matches close to their intended length. Using other encoders, such as the identity function, can result in different dynamics, potentially encouraging rather than discouraging self-loops.
Although even single-layer RNNs are Turing complete (Siegelmann and Sontag, 1995), SoPa's expressive power depends on the semiring. When a WFSA is thought of as a function from finite sequences of tokens to semiring values, it is restricted to the class of functions known as rational series (Schützenberger, 1961; Droste and Gastin, 1999; Sakarovitch, 2009). It is unclear how limiting this theoretical restriction is in practice, especially when SoPa is used as a component in a larger network. We defer the investigation of the exact computational properties of SoPa to future work. In the next section, we show that SoPa is an extension of a one-layer CNN, and hence more expressive.

SoPa as a CNN Extension
A convolutional neural network (CNN; LeCun, 1998) moves a fixed-size sliding window over the document, producing a vector representation for each window. These representations are then often summed, averaged, or max-pooled to produce a document-level representation (Kim, 2014; Yin and Schütze, 2015). In this section, we show that SoPa is an extension of one-layer, max-pooled CNNs.
Figure 2: State activations of two patterns as they score a document. pattern1 (length three) matches on "in years". pattern2 (length five) matches on "funniest and most likeable book", using a self-loop to consume the token "most". Active states in the best match are marked with arrow cursors.

To recover a CNN from a soft pattern with d + 1 states, we first remove self-loops and ε-transitions, retaining only the main path transitions. We also use the identity function as our encoder E (Equation 3), and use the max-sum semiring. With only main path transitions, the network will not match any span that is not exactly d tokens long. Using max-sum, a span of length d starting after token i is assigned the score

s_span(x_{i+1:i+d}) = ∑_{j=1}^{d} (w_j · v_{x_{i+j}} + b_j) = w · v_{x_{i+1:i+d}} + b,

where w is the concatenation of the main path parameter vectors w_1, …, w_d and b = ∑_{j=1}^{d} b_j. Rearranged this way, we recognize the span score as an affine transformation of the concatenated word vectors v_{x_{i+1:i+d}}. If we use k patterns, then together their span scores correspond to a linear filter with window size d and output dimension k (this variant of SoPa has d bias parameters, which correspond to only a single bias parameter in a CNN; the redundant biases may affect optimization but are an otherwise unimportant difference). A single pattern's score for a document is:

s_doc(x) = max_{0 ≤ i ≤ n−d} s_span(x_{i+1:i+d}).    (10)

The max in Equation 10 is calculated for each pattern independently, corresponding exactly to element-wise max-pooling of the CNN's output layer.
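The equivalence is easy to check numerically. The snippet below is a self-contained sketch with synthetic data (not taken from the released code): it scores one document both as a restricted soft pattern under max-sum with the identity encoder, and as a width-d CNN filter with max-pooling.

```python
# Numerical check: restricted SoPa (identity encoder, max-sum, main path only)
# equals a single width-d convolutional filter followed by max-pooling.
import numpy as np

rng = np.random.default_rng(0)
e, d, n = 4, 3, 10                       # embedding dim, pattern length, doc length
W = rng.normal(size=(d, e))              # main-path weight vectors w_1..w_d
b = rng.normal(size=d)                   # main-path biases b_1..b_d
X = rng.normal(size=(n, e))              # document word vectors

# Restricted SoPa: best (max-sum) path consuming exactly d consecutive tokens.
sopa = max(sum(W[j] @ X[i + j] + b[j] for j in range(d))
           for i in range(n - d + 1))

# CNN: one filter over the concatenated window, then max-pooling over positions.
filt, bias = W.reshape(-1), b.sum()
cnn = max(filt @ X[i:i + d].reshape(-1) + bias for i in range(n - d + 1))

assert np.isclose(sopa, cnn)
```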
Based on the equivalence between this impoverished version of SoPa and CNNs, we conclude that one-layer CNNs are learning an even more restricted class of WFSAs (linear-chain WFSAs) that capture only fixed-length patterns.
One notable difference between SoPa and arbitrary CNNs is that in general CNNs can use any filter (an MLP over v_{x_{i+1:i+d}}, for example). In contrast, in order to efficiently pool over flexible-length spans, SoPa is restricted to operations that follow the semiring laws. As a model that is more flexible than a one-layer CNN, but (arguably) less expressive than many RNNs, SoPa lies somewhere on the continuum between these two approaches. Continuing to study the bridge between CNNs and RNNs is an exciting direction for future research.

Experiments
To evaluate SoPa, we apply it to text classification tasks. Below we describe our datasets and baselines. More details can be found in Appendix A.
Datasets. We experiment with three binary classification datasets.
• SST. The Stanford Sentiment Treebank (Socher et al., 2013) contains roughly 10K movie reviews from Rotten Tomatoes, labeled on a scale of 1-5. We consider the binary task, which treats 1 and 2 as negative and 4 and 5 as positive (ignoring 3s). It is worth noting that this dataset also contains syntactic phrase-level annotations, providing a sentiment label for parts of sentences. In order to experiment in a realistic setup, we only consider complete sentences, and ignore the syntactic annotations at train and test time. The number of training/development/test sentences in the dataset is 6,920/872/1,821.
• Amazon. The Amazon Review Corpus (McAuley and Leskovec, 2013) contains electronics product reviews, a subset of a larger review dataset. Each document in the dataset contains a review and a summary. Following Yogatama et al. (2015), we use only the review text, focusing on positive and negative reviews. The number of training/development/test samples is 20K/5K/25K.
• ROC. The ROC story cloze task (Mostafazadeh et al., 2016) is a story understanding task. The task is composed of four-sentence story prefixes, followed by two competing endings: one that makes the joint five-sentence story coherent, and another that makes it incoherent. Following Schwartz et al. (2017), we treat it as a style detection task: we treat all "right" endings as positive samples and all "wrong" ones as negative, and we ignore the story prefix. We split the development set into train and development sets (of sizes 3,366 and 374 sentences, respectively), and take the test set as-is (3,742 sentences).
Reduced training data. In order to test our model's ability to learn from small datasets, we also randomly sample 100, 500, 1,000 and 2,500 SST training instances and 100, 500, 1,000, 2,500, 5,000, and 10,000 Amazon training instances. Development and test sets remain the same.
Baselines. We compare to four baselines: a BiLSTM, a one-layer CNN, DAN (a simple alternative to RNNs), and a feature-based classifier trained with hard-pattern features.
• BiLSTM. Bidirectional LSTMs have been successfully used in the past for text classification tasks (Zhou et al., 2016). We learn a one-layer BiLSTM representation of the document, and feed the average of all hidden states to an MLP.
• CNN. CNNs are particularly useful for text classification (Kim, 2014). We train a one-layer CNN with max-pooling, and feed the resulting representation to an MLP.
• DAN. We learn a deep averaging network with word dropout (Iyyer et al., 2015), a simple but strong text-classification baseline.
• Hard. We train a logistic regression classifier with hard-pattern features. Following Davidov and Rappoport (2008), we replace low-frequency words with a special wildcard symbol. We learn sequences of 1-6 concrete words, where any number of wildcards can come between two adjacent words. We consider words occurring with frequency of at least 0.01% of our training set as concrete words, and words occurring with frequency of 1% or less as wildcards (some words may serve as both words and wildcards; see Davidov and Rappoport, 2008, for discussion).

Number of patterns. SoPa requires specifying the number of patterns to be learned, and their lengths. Preliminary experiments showed that the model does not benefit from more than a few dozen patterns. We experiment with several configurations of patterns of different lengths, generally considering 0, 10, or 20 patterns of each pattern length between 2 and 7. The total number of patterns learned ranges between 30 and 70. The number of patterns and their lengths are hyperparameters tuned on the development data (see Appendix A).

Results

Table 1 shows our main experimental results. In two of the three tasks (SST and ROC), SoPa outperforms all models. On Amazon, SoPa performs within 0.3 points of CNN and BiLSTM, and outperforms the other two baselines. The table also shows the number of parameters used by each model for each task. Given enough data, models with more parameters should be expected to perform better. However, SoPa performs better than or roughly the same as a BiLSTM, which has 3-6 times as many parameters. Figure 3 compares all models on the SST and Amazon datasets with varying training set sizes. SoPa substantially outperforms all baselines, in particular BiLSTM, on the smallest datasets (100 samples). This suggests that SoPa is better suited to learning from small datasets.

Ablation analysis. Table 1 also shows an ablation of the differences between SoPa and CNN: max-product semiring with a sigmoid encoder vs. max-sum semiring with an identity encoder, self-loops, and ε-transitions. The last line of the ablation is equivalent to a one-layer CNN with max-pooling (see Section 4).

Interpretability
We turn to another key aspect of SoPa: its interpretability. We start by demonstrating how we interpret a single pattern, and then describe how to interpret the decisions made by downstream classifiers that rely on SoPa (in this case, a sentence classifier). Importantly, these visualization techniques are equally applicable to CNNs.
Interpreting a single pattern. In order to visualize a pattern, we compute the pattern's matching scores with each phrase in our training dataset, and select the k phrases with the highest scores. Table 2 shows examples of six patterns learned using the best SoPa model on the SST dataset, along with their highest scoring phrases in the training data. Second, it seems our patterns are relatively soft, and allow lexical flexibility. While some patterns do seem to fix specific words, e.g., "of" in the first example or "minutes" in the last one, even in those cases some of the top matching spans replace these words with other, similar words ("with" and "half-hour", respectively). Encouraging SoPa to have more concrete words, e.g., by jointly learning the word vectors, might make SoPa useful in other contexts, particularly as a decoder. We defer this direction to future work.
Finally, SoPa makes limited but non-negligible use of self-loops and epsilon steps. Interestingly, the second example shows that one of the pat-Analyzed Documents it 's dumb , but more importantly , it 's just not scary though moonlight mile is replete with acclaimed actors and actresses and tackles a subject that 's potentially moving , the movie is too predictable and too self-conscious to reach a level of high drama While its careful pace and seemingly opaque story may not satisfy every moviegoer 's appetite, the film 's final scene is soaringly , transparently moving unlike the speedy wham-bam effect of most hollywood offerings , character development -and more importantly, character empathyis at the heart of italian for beginners .
the band 's courage in the face of official repression is inspiring , especially for aging hippies ( this one included ) . Table 3: Documents from the SST training data. Phrases with the largest contribution toward a positive sentiment classification are in bold green, and the most negative phrases are in italic orange.
terns had an -transition at the same place in every phrase. This demonstrates a different function of -transitions than originally designed-they allow a pattern to effectively shorten itself, by learning a high -transition parameter for a certain state.
Interpreting a document. SoPa provides an interpretable representation of a document: a vector of the maximal matching score of each pattern with any span in the document. To visualize the decisions of our model for a given document, we can observe the patterns and corresponding phrases that score most highly within it.
To understand which of the k patterns contributes most to the classification decision, we apply a leave-one-out method. We run the forward method of the MLP layer in SoPa k times, each time zeroing out the score of a different pattern p. The difference between the resulting score and the original model score is considered p's contribution. We then consider the highest contributing patterns, and pair each one with its highest scoring phrase in that document. Table 3 shows example texts along with their most positive and negative contributing phrases.
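A minimal sketch of this leave-one-out procedure (illustrative names only; `score_fn` stands in for the MLP over the pattern-score vector z):

```python
# Sketch: leave-one-out contribution of each pattern to a classification decision.
import numpy as np

def pattern_contributions(z, score_fn, label):
    """z: (k,) per-pattern document scores; score_fn: maps z to class scores
    (e.g., the MLP from Section 3.3); label: the predicted class index."""
    base = score_fn(z)[label]
    contribs = np.empty_like(z)
    for p in range(len(z)):
        z_ablated = z.copy()
        z_ablated[p] = 0.0                     # zero out pattern p's score
        contribs[p] = base - score_fn(z_ablated)[label]
    return contribs                            # larger value => larger contribution
```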

Related Work
Weighted finite-state automata. WFSAs and hidden Markov models (HMMs are a special case of WFSAs; Mohri et al., 2002) were once popular in automatic speech recognition (Hetherington, 2004; Moore et al., 2006; Hoffmeister et al., 2012) and remain popular in morphology (Dreyer, 2011; Cotterell et al., 2015). Most closely related to this work, neural networks have been combined with weighted finite-state transducers to do morphological reinflection (Rastogi et al., 2016). These prior works learn a single FSA or FST, whereas our model learns a collection of simple but complementary FSAs, together encoding a sequence. We are the first to incorporate neural networks both before WFSAs (in their transition scoring functions) and after (in the function that turns their vector of scores into a final prediction), to produce an expressive model that remains interpretable.
Recurrent neural networks. The ability of RNNs to represent arbitrarily long sequences of embedded tokens has made them attractive to NLP researchers. The most notable variants, the long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU; Cho et al., 2014), have become ubiquitous in NLP algorithms (Goldberg, 2016). Recently, several works introduced simpler versions of RNNs, such as recurrent additive networks (Lee et al., 2017) and Quasi-RNNs (Bradbury et al., 2017). Like SoPa, these models can be seen as points along the bridge between RNNs and CNNs.
Other works have studied the expressive power of RNNs, in particular in the context of WFSAs or HMMs (Cleeremans et al., 1989; Giles et al., 1992; Visser et al., 2001; Chen et al., 2018). In this work we relate CNNs to WFSAs, showing that a one-layer CNN with max-pooling can be simulated by a collection of linear-chain WFSAs.
Convolutional neural networks. CNNs are prominent feature extractors in NLP, both for generating character-based embeddings (Kim et al., 2016) and as sentence encoders for tasks like text classification (Yin and Schütze, 2015) and machine translation (Gehring et al., 2017). Similarly to SoPa, several recently introduced variants of CNNs support varying window sizes, either by allowing several fixed window sizes (Yin and Schütze, 2015) or by supporting non-consecutive n-gram matching (Lei et al., 2015; Nguyen and Grishman, 2016).
Neural networks and patterns. Some works used patterns as part of a neural network. Schwartz et al. (2016) used pattern contexts for estimating word embeddings, showing improved word similarity results compared to bag-of-word contexts. Shwartz et al. (2016) designed an LSTM representation for dependency patterns, using them to detect hypernymy relations. Here, we learn patterns as a neural version of WFSAs.
Interpretability. There have been several efforts to interpret neural models. The weights of the attention mechanism (Bahdanau et al., 2015) are often used to display the words that are most significant for making a prediction. LIME (Ribeiro et al., 2016) is another approach for visualizing neural models (not necessarily textual ones). Yogatama and Smith (2014) introduced structured sparsity, which encodes linguistic information into the regularization of a model, making it possible to visualize the contribution of different bag-of-words features.
Other works jointly learned to encode text and extract the span which best explains the model's prediction (Yessenalina et al., 2010; Lei et al., 2016). Li et al. (2016) and Kádár et al. (2017) suggested a method that erases pieces of the text in order to analyze their effect on a neural model's decisions. Finally, several works presented methods to visualize deep CNNs (Zeiler and Fergus, 2014; Simonyan et al., 2014; Yosinski et al., 2015), focusing on visualizing the different layers of the network, mainly in the context of image and video understanding. We believe these two types of research approaches are complementary: inventing general-purpose visualization tools for existing black-box models on the one hand, and on the other, designing models like SoPa that are interpretable by construction.

Conclusion
We introduced SoPa, a novel model that combines neural representation learning with WFSAs. We showed that SoPa is an extension of a one-layer CNN. It naturally models flexible-length spans with insertion and deletion, and it can be easily customized by swapping in different semirings. SoPa performs on par with or strictly better than four baselines on three text classification tasks, while requiring fewer parameters than the stronger baselines. On smaller training sets, SoPa outperforms all four baselines. SoPa is a simple variant of an RNN that is more expressive than one-layer CNNs, and we hope it will encourage future research on the bridge between these two mechanisms. To facilitate such research, we release our implementation at https://github.com/Noahs-ARK/soft_patterns.