Bag of What? Simple Noun Phrase Extraction for Text Analysis

Social scientists who do not have specialized natural language processing training often use a unigram bag-of-words (BOW) representation when analyzing text corpora. We offer a new phrase-based method, NPFST, for enriching a unigram BOW. NPFST uses a part-of-speech tagger and a finite state transducer to extract multiword phrases to be added to a unigram BOW. We compare NPFST to both n-gram and parsing methods in terms of yield, recall, and efficiency. We then demonstrate how to use NPFST for exploratory analyses; it performs well, without configuration, on many different kinds of English text. Finally, we present a case study using NPFST to analyze a new corpus of U.S. congressional bills. For our open-source implementation, see http://slanglab.cs.umass.edu/phrases/.


Introduction
Social scientists typically use a unigram representation when analyzing text corpora; each document is represented as a unigram bag-of-words (BOW), while the corpus itself is represented as a document-term matrix of counts. For example, Quinn et al. (2010) and Grimmer (2010) used a unigram BOW as input to a topic model, while Monroe et al. (2008) used a unigram BOW to report the most partisan terms from political speeches. Although the simplicity of a unigram BOW is appealing, unigram analyses do not preserve meaningful multiword phrases, such as "health care" or "social security," and cannot distinguish between politically significant phrases that share a word, such as "illegal immigrant" and "undocumented immigrant." To address these limitations, we introduce NPFST, which extracts multiword phrases to enrich a unigram BOW as additional columns in the document-term matrix. NPFST is suitable for many different kinds of English text; it uses modest computational resources and does not require any specialized configuration or annotations.

Background
We compare NPFST to several other methods in terms of yield, recall, efficiency, and interpretability. Yield refers to the number of extracted phrases; a lower yield requires fewer computational and human resources to process the phrases. Recall refers to a method's ability to recover the most relevant or important phrases, as determined by a human. A good method should have a low yield, but high recall.
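To make these two quantities concrete, the following minimal sketch computes them under the assumption that human judgments are available as a set of labeled spans. The span encoding and the function name `yield_and_recall` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span encoding: (document id, start token, end token).
Span = Tuple[int, int, int]

def yield_and_recall(extracted: Iterable[Span], gold: Set[Span]) -> Tuple[int, float]:
    """Return (yield, recall) for a set of extracted spans against labeled spans."""
    extracted_set = set(extracted)
    # Yield: how many distinct spans the method produced.
    span_yield = len(extracted_set)
    # Recall: fraction of labeled spans that appear among the extracted spans.
    found = sum(1 for span in gold if span in extracted_set)
    recall = found / len(gold) if gold else 0.0
    return span_yield, recall

extracted_spans = [(0, 0, 2), (0, 1, 2), (1, 3, 5)]
gold_spans = {(0, 0, 2), (1, 0, 1)}
span_yield, recall = yield_and_recall(extracted_spans, gold_spans)
```

A method that extracts more spans can only raise recall, which is why the two numbers have to be read together.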

n-grams
Our simplest baseline is AllNGrams(K). This method extracts all n-grams, up to length K, from tokenized, sentence-segmented text, excluding n-grams that cross sentence boundaries. This method is commonly used to extract features for text classification (e.g., Yogatama et al. (2015)), but has several disadvantages in a social scientific context. First, social scientists often want to substantively interpret individual phrases, but fragmentary phrases that cross sentence constituents may not be meaningful. For example, the Affordable Care Act includes the hard-to-interpret 4-gram, "the Internet website of." Second, although AllNGrams(K) has high recall (provided that K is sufficiently large), it suffers from a higher yield and can therefore require substantial resources to process the extracted phrases.
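A minimal sketch of an AllNGrams(K)-style extractor is given below. It assumes the input is already tokenized and sentence-segmented; the function name `all_ngrams` is hypothetical.

```python
from typing import Iterable, List, Tuple

def all_ngrams(sentences: Iterable[List[str]], max_len: int) -> List[Tuple[str, ...]]:
    """Return every n-gram of length 1..max_len, never crossing a sentence boundary."""
    ngrams = []
    for tokens in sentences:
        for start in range(len(tokens)):
            # Stop at the sentence end or after max_len tokens, whichever comes first.
            last_end = min(start + max_len, len(tokens))
            for end in range(start + 1, last_end + 1):
                ngrams.append(tuple(tokens[start:end]))
    return ngrams

sentences = [["the", "Internet", "website", "of"], ["health", "care"]]
grams = all_ngrams(sentences, max_len=3)
```

Even in this two-sentence example, six tokens produce twelve n-grams, most of them fragments, which illustrates why the yield grows quickly with K.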

Parsing
An alternative approach is to use syntax to restrict the extracted phrases to constituents, such as noun phrases (NPs). Unlike verb, prepositional, or adjectival phrases, NPs often make sense even when stripped from their surrounding context (e.g., [Barack Obama]_NP vs. [was inaugurated in 2008]_VP). There are many methods for extracting NPs. Given the long history of constituent parsing research in NLP, one obvious approach is to run an off-the-shelf constituent parser and then retrieve all NP non-terminals from the trees. We refer to this method as ConstitParse. Unfortunately, the major sources of English training data, such as the Penn Treebank (Marcus et al., 1993), include determiners within the NP and non-nested flat NP annotations, leading to low recall in our context (see §4). Since modern parsers rely on these sources of training data, it is very difficult to change this behavior.

Part-of-Speech Grammars
Another approach, proposed by Justeson and Katz (1995), is to use part-of-speech (POS) patterns to find and extract NPs, a form of shallow partial parsing (Abney, 1997). Researchers have used this approach in a variety of different contexts (Benoit and Nulty, 2015; Frantzi et al., 2000; Kim et al., 2010; Chuang et al., 2012; Bamman and Smith, 2014). A pattern-based method can be specified in terms of a triple of parameters: (G, K, M), where G is a grammar, K is a maximum length, and M is a matching strategy. The grammar G is a non-recursive regular expression that defines an infinite set of POS tag sequences (i.e., a regular language); the maximum length K limits the length of the extracted n-grams to n ≤ K; while the matching strategy M specifies how to extract text spans that match the grammar.
The simplest grammar that we consider is defined over a coarse tag set of adjectives, nouns (both common and proper), prepositions, and determiners. We refer to this grammar as SimpleNP. The constituents that match this grammar are bare NPs (with optional PP attachments), N-bars, and names. We do not include any determiners at the root NP.
We also consider three baseline matching strategies, each of which can (in theory) be used with any G and K. The first, FilterEnum, enumerates all possible strings in the regular language, up to length K, as a preprocessing step. Then, at runtime, it checks whether each n-gram in the corpus is present in this enumeration. This matching strategy is simple to implement and extracts all matches up to length K, but it is computationally infeasible if K is large. The second, FilterFSA, compiles G into a finite-state automaton (FSA) as a preprocessing step. Then, at runtime, it checks whether each n-gram matches this FSA. Like FilterEnum, this matching strategy extracts all matches up to length K; however, it can be inefficient if K is large. The third, GreedyFSA, also compiles G into an FSA, but uses a standard greedy matching approach at runtime to extract n-grams that match G. Unlike the other two matching strategies, it cannot extract overlapping or nested matches, but it can extract very long matches.
In their original presentation, Justeson and Katz (1995) defined a grammar that is very similar to SimpleNP and suggested using 2- and 3-grams (i.e., K = 3). With this restriction, their grammar comprises seven unique patterns. They also proposed using FilterEnum to extract text spans that match these patterns. We refer to this method as JK = (SimpleNP, K = 3, FilterEnum). Many researchers have used this method, perhaps because it is described in the NLP textbook by Manning and Schütze (1999).
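The difference between filtering every n-gram and greedy matching can be sketched with an ordinary regular expression over one-letter coarse tags. The pattern below is an assumed SimpleNP-style grammar written for illustration (the text above does not spell the grammar out), and Python's `re` module is a backtracking engine rather than a compiled FSA, so this only approximates the behaviour of the two strategies.

```python
import re
from typing import List, Tuple

# Assumed SimpleNP-style pattern over one-letter coarse tags:
# A = adjective, N = noun, P = preposition, D = determiner, O = anything else.
SIMPLE_NP = re.compile(r"[AN]*N(?:PD*[AN]*N)*")

def filter_spans(tags: str, max_len: int) -> List[Tuple[int, int]]:
    """FilterFSA-style: test every n-gram (n <= max_len), keeping nested and overlapping spans."""
    spans = []
    for start in range(len(tags)):
        for end in range(start + 1, min(start + max_len, len(tags)) + 1):
            if SIMPLE_NP.fullmatch(tags, start, end):
                spans.append((start, end))
    return spans

def greedy_spans(tags: str) -> List[Tuple[int, int]]:
    """GreedyFSA-style: scan left to right and keep non-overlapping greedy matches only."""
    return [match.span() for match in SIMPLE_NP.finditer(tags)]

tokens = ["victims", "of", "domestic", "violence", "increased"]
tags = "NPANO"  # one coarse tag per token
nested = filter_spans(tags, max_len=4)
greedy = greedy_spans(tags)
nested_phrases = [" ".join(tokens[s:e]) for s, e in nested]
```

On this input the filtering strategy recovers "victims", "victims of domestic violence", "domestic violence", and "violence", whereas the greedy strategy returns only the single longest span.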

FullNP Grammar
FullNP extends SimpleNP by adding coordination of pairs of words with the same tag (e.g., (VB CC VB) in (cease and desist) order); coordination of noun phrases; parenthetical post-modifiers (e.g., 401(k), which is a 4-gram because of common NLP tokenization conventions); numeric modifiers and nominals; and support for the Penn Treebank tag set. We provide the complete definition in the appendix.

RewriteFST Matching Strategy
RewriteFST uses a finite-state transducer (FST) to rapidly extract text spans that match G, including overlapping and nested spans. This matching strategy is a form of finite-state NLP (Roche and Schabes, 1997), and therefore builds on an extensive body of previous work on FST algorithms and tools.
The input to RewriteFST is a POS-tagged (footnote 5) sequence of tokens I, represented as an FSA. For a simple tag sequence, this FSA is a linear chain, but, if there is uncertainty in the output of the tagger, it can be a lattice with multiple tags for each position.
The grammar G is first compiled into a phrase transducer P (footnote 6), which takes an input sequence I and outputs the same sequence, but with pairs of start and end symbols ([S] and [E], respectively) inserted to indicate possible NPs (see figure 1). At runtime, RewriteFST computes an output lattice L = I • P using FST composition (footnote 7); since it is nondeterministic, L includes all overlapping and nested spans, rather than just the longest match. Finally, RewriteFST traverses L to find all edges with a [S] symbol. From each one, it performs a depth-first search to find all paths to an edge with an [E] symbol, accumulating all [S]- and [E]-delimited spans (footnote 8). In table 1, we provide a comparison of RewriteFST and the three matching strategies described in §2.3.
Footnote 5: We used the ARK POS tagger for tweets (Owoputi et al., 2013) and used Stanford CoreNLP for all other corpora (Toutanova et al., 2003; Manning et al., 2014).
Footnote 6: We used foma (Hulden, 2009; Beesley and Karttunen, 2003) to compile G into P. foma was designed for building morphological analyzers; it allows a developer to write a grammar in terms of readable production rules with intermediate categories. The rules are then compiled into a single, compact FST.
Footnote 7: We implemented the FST composition using OpenFst (Allauzen et al., 2007) and pyfst (http://pyfst.github.io/).
Footnote 8: There are alternatives to this FST approach, such as a backtracking algorithm applied directly to the original grammar's FSA to retrieve all spans starting at each position in the input.

Experimental Results
In this section, we provide experimental results comparing NPFST to the baselines described in §2 in terms of yield, recall, efficiency, and interpretability. As desired, NPFST has a low yield and high recall, and efficiently extracts highly interpretable phrases.

Yield and Recall
Yield refers to the number of phrases extracted by a method, while recall refers to a method's ability to recover the most relevant or important phrases, as determined by a human. Because relevance and importance are domain-specific concepts that are not easy to define, we compared the methods using three named-entity recognition (NER) data sets: mentions of ten types of entities on Twitter from the WNUT 2015 shared task (Baldwin et al., 2015); mentions of proteins in biomedical articles from the BioNLP shared task 2011 (Kim et al., 2011); and a synthetic data set of named entities in New York Times articles (Sandhaus, 2008), identified using Stanford NER (Manning et al., 2014). Named entities are undoubtedly relevant and important phrases in all three of these different domains. For each data set, we defined a method's yield to be the total number of spans that it extracted and a method's recall to be the percentage of the (labeled) named entity spans that were present in its list of extracted spans.
A good method should have a low yield, but high recall; i.e., the best methods are in the top-left corner of each plot. The pattern-based methods all achieved high recall, with a considerably lower yield than AllNGrams(K). ConstitParse achieved a lower yield than NPFST, but also achieved lower recall. JK performed worse than NPFST, in part because it can only extract 2- and 3-grams, and, for example, the BioNLP data set contains mentions of proteins that are as long as eleven tokens (e.g., "Ca2+/calmodulin-dependent protein kinase (CaMK) type IV/Gr"). Finally, (SimpleNP, K = ∞, GreedyFSA) performed much worse than JK because it cannot extract overlapping or nested spans.
Footnote 11: The WNUT data set is already tokenized; however, we accidentally re-tokenized it in our experiments. Figure 2 therefore only depicts yield and recall for the 1,278 (out of 1,795) tweets for which our re-tokenization matched the original tokenization.
Footnote 12: We used the Stanford CoreNLP shift-reduce parser.
For the WNUT data set, NPFST's recall was relatively low (91.8%). To test whether some of its false negatives were due to POS-tagging errors, we used NPFST's ability to operate on an input lattice with multiple tags for each position. Specifically, we constructed an input lattice I using the tags for each position whose posterior probability was at least t. We experimented with t = 0.01 and t = 0.001. These values increased recall to 96.2% and 98.3%, respectively, in exchange for only a slightly higher yield (lower than that of AllNGrams(2)). We suspect that we did not see a greater increase in yield, even for t = 0.001, because of posterior calibration (Nguyen and O'Connor, 2015; Kuleshov and Liang, 2015).
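The thresholding step used to build the input lattice can be sketched as follows. This is an illustration only: it assumes the tagger exposes a posterior distribution over tags for each token as a dictionary, and the name `build_tag_lattice` and the fallback to the single best tag are assumptions, not details taken from the text.

```python
from typing import Dict, List

def build_tag_lattice(posteriors: List[Dict[str, float]], t: float) -> List[List[str]]:
    """For each token position, keep every tag whose posterior probability is at least t."""
    lattice = []
    for dist in posteriors:
        # Most probable tags first, so the 1-best path is easy to read off.
        ranked = sorted(dist.items(), key=lambda item: -item[1])
        kept = [tag for tag, prob in ranked if prob >= t]
        # Assumed fallback: never leave a position without any tag.
        if not kept:
            kept = [ranked[0][0]]
        lattice.append(kept)
    return lattice

# Hypothetical posteriors for a two-token input.
posteriors = [
    {"N": 0.95, "V": 0.04, "A": 0.01},
    {"V": 0.999, "N": 0.001},
]
lattice_strict = build_tag_lattice(posteriors, t=0.5)
lattice_loose = build_tag_lattice(posteriors, t=0.001)
```

Lowering t adds alternative tags at each position, so more tag paths through the lattice can match the grammar; this is the mechanism by which recall rises at the cost of some additional yield.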
POS tagging is about twenty times faster than parsing, which is helpful for social scientists who may not have fast servers. NPFST is slightly slower than the simpler pattern-based methods; however, 80% of its time is spent constructing the input I and traversing the output lattice L, both of which are implemented in Python and could be made faster.

Interpretability
When analyzing text corpora, social scientists often examine ranked lists of terms, where each term is ranked according to some score. We argue that multiword phrases are more interpretable than unigrams when stripped from their surrounding context and presented as a list. In §4.3.1 we explain how to merge related terms, and in §4.3.2, we provide ranked lists that demonstrate that NPFST extracts more interpretable phrases than other methods.

Merging Related Terms
As described in §3.2, NPFST extracts overlapping and nested spans. For example, when run on a data set of congressional bills about crime, NPFST extracted "omnibus crime control and safe streets act," as well as the nested phrases "crime control" and "safe streets act." Although this behavior is generally desirable, it can also lead to repetition in ranked lists. We therefore outline a high-level algorithm for merging the highest-ranked terms in a ranked list.
The input to our algorithm is a list of terms L. The algorithm iterates through the list, starting with the highest-ranked term, aggregating similar terms according to some user-defined criterion (e.g., whether the terms share a substring) until it has generated C distinct term clusters. The algorithm then selects a single term to represent each cluster. Finally, the algorithm orders the clusters' representative terms to form a ranked list of length C. By starting with the highest-ranked term and terminating after C clusters have been formed, this algorithm avoids the inefficiency of examining all possible pairs of terms.
Footnote 13: We used Python's timeit module.
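One possible reading of this algorithm is sketched below. Several choices are assumptions made for illustration rather than details given above: terms are aggregated when one is a character-level substring of another, the longest member represents its cluster, clusters are ordered by their highest-ranked member, and the loop stops as soon as a term would start cluster C + 1.

```python
from typing import List

def merge_ranked_terms(ranked_terms: List[str], num_clusters: int) -> List[str]:
    """Merge a ranked term list into at most num_clusters representative terms."""
    clusters: List[List[str]] = []
    for term in ranked_terms:
        for cluster in clusters:
            # Assumed criterion: one term is a substring of the other.
            if any(term in member or member in term for member in cluster):
                cluster.append(term)
                break
        else:
            # The term matched no existing cluster, so it would start a new one.
            if len(clusters) == num_clusters:
                break  # C clusters already exist; stop scanning the list.
            clusters.append([term])
    # Assumed representative: the longest term in each cluster.
    return [max(cluster, key=len) for cluster in clusters]

ranked = [
    "crime control",
    "omnibus crime control and safe streets act",
    "mental health",
    "safe streets act",
    "fiscal year",
    "federal prison",
]
merged = merge_ranked_terms(ranked, num_clusters=3)
```

Because each term is compared only against the clusters formed so far, the work is bounded by the number of terms scanned times the size of those clusters, rather than by all pairs of terms in the full list.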

Ranked Lists
To assess the interpretability of the phrases extracted by NPFST, we used three data sets: tweets about climate change, written by (manually identified) climate deniers; transcripts from criminal trials at the Old Bailey in London during the 18th century; and New York Times articles from September, 1993. For each data set, we extracted phrases using ConstitParse, JK, and NPFST and produced a list of terms for each method, ranked by count. We excluded domain-specific stopwords and any phrases that contained them. Finally, we merged related terms using our term-merging algorithm, aggregating terms only if one term was a substring of another, to produce ranked lists of five representative terms. Table 4.3 contains these lists, demonstrating that NPFST produces highly interpretable phrases.

Case Study: Finding Partisan Terms in U.S. Congressional Legislation
Many political scientists have studied the relationship between language usage and party affiliation (Laver et al., 2003;Monroe et al., 2008;Slapin and Proksch, 2008;Quinn et al., 2010;Grimmer and Stewart, 2013). We present a case study, in which we use NPFST to explore partisan differences in U.S. congressional legislation about law and crime. In §5.1, we describe our data set, and in §5.2, we explain our methodology and present our results.

The Congressional Bills Corpus
We used a new data set of 97,221 U.S. congressional bills, introduced in the House and Senate between 1993 and 2014. We created this data set by scraping the Library of Congress website (footnote 17). We used Stanford CoreNLP to tokenize and POS tag the bills. We removed numbers and punctuation, and discarded all terms that occurred in fewer than five bills. We also augmented each bill with its author, its final outcome (e.g., did it survive committee deliberations, did it pass a floor vote in the Senate) from the Congressional Bills Project (Adler and Wilkerson, 2014), and its major topic area (Purpura and Hillard, 2006). For our case study, we focused on a subset of 488 bills, introduced between 2013 and 2014, that are primarily about law and crime. We chose this subset because we anticipated that it would clearly highlight partisan policy differences. For example, the bills include legislation about immigration enforcement and about incarceration of low-level offenders, two areas where Democrats and Republicans tend to have very different policy preferences.
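The rare-term filter mentioned above amounts to a document-frequency threshold. A minimal sketch, assuming each bill is already a list of terms (the function name `drop_rare_terms` is hypothetical):

```python
from collections import Counter
from typing import List

def drop_rare_terms(docs: List[List[str]], min_docs: int = 5) -> List[List[str]]:
    """Remove every term that occurs in fewer than min_docs documents."""
    # Count each term once per document, regardless of how often it repeats there.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [[term for term in doc if doc_freq[term] >= min_docs] for doc in docs]

toy_bills = [["alien", "offense", "alien"], ["alien"], ["alien", "grant"]]
filtered = drop_rare_terms(toy_bills, min_docs=2)
```

The threshold applies to the number of bills containing a term, not to its total count, so a term repeated many times in a single bill is still discarded.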

Partisan Terms
We used NPFST to extract phrases from the bills, and then created ranked lists of terms for each party using the informative Dirichlet (footnote 18) feature selection method of Monroe et al. (2008). This method computes a z-score for each term that reflects how strongly that term is associated with Democrats over Republicans: a positive z-score indicates that Democrats are more likely to use the term, while a negative z-score indicates that Republicans are more likely to use the term. We merged the highest-ranked terms for each party, aggregating terms only if one term was a substring of another and if the terms were very likely to co-occur in a single bill, to form ranked lists of representative terms. Finally, for comparison, we also used the same approach to create ranked lists of unigrams, one for each party.
Figure 3 depicts z-score versus term count, while table 4 lists the twenty highest-ranked terms. The unigram lists suggest that Democratic lawmakers focus more on legislation related to mental health, juvenile offenders, and possibly domestic violence, while Republican lawmakers focus more on illegal immigration. However, many of the highest-ranked unigrams are highly ambiguous when stripped from their surrounding context. For example, we do not know whether "domestic" refers to "domestic violence," "domestic terrorism," or "domestic programs" without manually reviewing the original bills (e.g., using a keyword-in-context interface (O'Connor, 2014)). Moreover, many of the highest-ranked Republican unigrams, such as "communication," are not unique to law and crime.
Footnote 17: http://congress.gov/
Footnote 18: In order to lower the z-scores of uninformative, high-frequency terms, we set the Dirichlet hyperparameters to be proportional to the term counts from our full data set of bills.
Figure 3: z-score versus term count. Each dot represents a single term and is sized according to that term's z-score. Terms that are more likely to be used by Democrats are shown in blue; terms that are more likely to be used by Republicans are shown in dark red.
In contrast, the phrase-based lists are less ambiguous and much more interpretable. They include names of bills (which are often long) and important concepts, such as "mental health," "victims of domestic violence," "interstate or foreign commerce," and "explosive materials." These lists suggest that Democratic lawmakers have a very strong focus on programs to prevent child abuse and domestic violence, as well as issues related to mental health and gang violence. Republican lawmakers appear to focus on immigration and incarceration. This focus on immigration is not surprising given the media coverage between 2013 and 2014; however, there was much less media coverage of a Democratic focus on crime-related legislation during that time period.
These results suggest that social scientists will be less likely to draw incorrect conclusions from ranked lists of terms if they include multiword phrases. Because phrases are less ambiguous than unigrams, social scientists can more quickly discover meaningful term-based associations for further exploration, without undertaking a lengthy process to validate their interpretation of the terms.
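The z-score used to rank terms in this section can be sketched as follows. This is a minimal illustration of the log-odds-ratio statistic with an informative Dirichlet prior, in the form commonly attributed to Monroe et al. (2008); the function name, the toy counts, and the prior values are hypothetical, and every vocabulary term is assumed to have a strictly positive prior.

```python
import math
from typing import Dict

def informative_dirichlet_z(
    counts_a: Dict[str, int], counts_b: Dict[str, int], prior: Dict[str, float]
) -> Dict[str, float]:
    """z-score of the log-odds ratio of each term in group A versus group B."""
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    prior_total = sum(prior.values())
    scores = {}
    for term, alpha in prior.items():
        y_a = counts_a.get(term, 0)
        y_b = counts_b.get(term, 0)
        # Smoothed log-odds of the term within each group.
        log_odds_a = math.log((y_a + alpha) / (n_a + prior_total - y_a - alpha))
        log_odds_b = math.log((y_b + alpha) / (n_b + prior_total - y_b - alpha))
        # Approximate variance of the difference in log-odds.
        variance = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)
        scores[term] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores

# Toy counts; a real prior would be proportional to counts from the full corpus.
dem_counts = {"mental health": 30, "fiscal year": 20}
rep_counts = {"mental health": 5, "fiscal year": 20, "federal prison": 25}
prior = {"mental health": 3.5, "fiscal year": 4.0, "federal prison": 2.5}
z = informative_dirichlet_z(dem_counts, rep_counts, prior)
```

With group A as Democrats, a positive score marks a term as more Democratic and a negative score as more Republican. A larger prior for a term pulls both smoothed log-odds toward the same value, which lowers the scores of frequent, uninformative terms.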

Conclusions and Future Work
Social scientists typically use a unigram BOW representation when analyzing text corpora, even though unigram analyses do not preserve meaningful multiword phrases. To address this limitation, we presented a new phrase-based method, NPFST, for enriching a unigram BOW. NPFST is suitable for many different kinds of English text; it does not require any specialized configuration or annotations.
We compared NPFST to several other methods for extracting phrases, focusing on yield, recall, efficiency, and interpretability. As desired, NPFST has a low yield and high recall, and efficiently extracts highly interpretable phrases. Finally, to demonstrate the usefulness of NPFST for social scientists, we used NPFST to explore partisan differences in U.S. congressional legislation about law and crime. We found that the phrases extracted by NPFST were less ambiguous and more interpretable than unigrams.
In the future, we plan to use NPFST in combination with other text analysis methods, such as topic modeling; we have already obtained encouraging preliminary results. We have also experimented with modifying the FullNP grammar to select broader classes of phrases, such as subject-verb and verb-object constructions (though we anticipate that more structured syntactic parsing approaches will eventually be useful for these kinds of constructions).

Party / Ranked List

unigrams
Democrat: and, deleted, health, mental, domestic, inserting, grant, programs, prevention, violence, program, striking, education, forensic, standards, juvenile, grants, partner, science, research
Republican: any, offense, property, imprisoned, whoever, person, more, alien, knowingly, officer, not, united, intent, commerce, communication, forfeiture, immigration, official, interstate, subchapter

NPFST
Democrat: mental health, juvenile justice and delinquency prevention act, victims of domestic violence, child support enforcement act of u.s.c., fiscal year, child abuse prevention and treatment act, omnibus crime control and safe streets act of u.s.c., date of enactment of this act, violence prevention, director of the national institute, former spouse, section of the foreign intelligence surveillance act of u.s.c., justice system, substance abuse criminal street gang, such youth, forensic science, authorization of appropriations, grant program
Republican: special maritime and territorial jurisdiction of the united states, interstate or foreign commerce, federal prison, section of the immigration and nationality act, electronic communication service provider, motor vehicles, such persons, serious bodily injury, controlled substances act, department or agency, one year, political subdivision of a state, civil action, section of the immigration and nationality act u.s.c., offense under this section, five years, bureau of prisons, foreign government, explosive materials, other person

Our open-source implementation of NPFST is available at http://slanglab.cs.umass.edu/phrases/.