Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text on Social Media

In this paper, we describe the Lithium Natural Language Processing (NLP) system, a resource-constrained, high-throughput and language-agnostic system for information extraction from noisy user generated text on social media. Lithium NLP extracts a rich set of information including entities, topics, hashtags and sentiment from text. We discuss several real world applications of the system currently incorporated in Lithium products. We also compare our system with existing commercial and academic NLP systems in terms of performance, information extracted and languages supported. We show that Lithium NLP is on par with, and in some cases outperforms, state-of-the-art commercial NLP systems.


Introduction
Social media has become one of the major means of communication and content production. As a result, industrial systems that can process rich user generated content from social media platforms have several real-world applications. Furthermore, due to the content style, size and heterogeneity of information (e.g. text, emoticons, hashtags etc.) available on social media, novel NLP techniques and systems that are designed specifically for such content and can potentially integrate or learn information from different sources are highly useful and applicable.
However, NLP on social media data can be significantly complex and challenging for several reasons:
• Noisy unnormalized data - Social media data is much more informal than traditional text and less consistent in language in terms of style, tone etc. It involves heavy usage of slang, jargon, emoticons, or abbreviations which usually do not follow formal grammatical rules. Hence, novel NLP techniques need to be developed for such content.
• Multi-lingual content - Social media data poses an additional challenge to NLP practitioners because the user generated content on these platforms is often multi-lingual. Hence, any NLP system processing real world data from the web should be able to support multiple languages in order to be practical and applicable.
• Large scale datasets - State-of-the-art NLP systems should be able to work on large scale datasets such as social media data, often involving millions of documents. Moreover, these systems need to have low resource consumption in order to scale to such datasets in a finite amount of time. In addition, in order to be applicable and practical, they should be able to run on off-the-shelf commodity machines.
• Rich set of information - In order to be cost-efficient, state-of-the-art NLP systems need to be exhaustive in terms of information extracted from social media text. This includes extracting entities of different types (such as professional titles, sports, activities etc.) in addition to just named entities (such as persons, organizations, locations etc.), inferring fine-grained and coarse-grained subject matter topics (sports, politics, healthcare, basketball), text sentiment, hashtags, emoticons etc.
In this paper, we present the Lithium NLP system which addresses these challenges. It is a resource-constrained, high-throughput and language-agnostic system for information extraction from noisy user generated text such as that available on social media. It is capable of extracting a rich set of information including entities, topics, hashtags and sentiment. Lithium NLP currently supports multiple languages including Arabic, English, French, German, Italian and Spanish. It handles large scale data from several social media platforms such as Twitter, Facebook, LinkedIn, etc., processing about 500M new social media messages and 0.5M socially relevant URLs shared daily. Since it employs statistical NLP techniques, it uses the large scale of the data to help overcome the noisiness.
Lithium NLP is currently incorporated in several Lithium products. It enables consumer products like Klout, a platform which integrates users' data from multiple social networks such as Twitter, Facebook, Instagram, LinkedIn, GooglePlus, YouTube, and Foursquare, in order to measure their online social influence via the Klout Score. On Klout, it is used to model users' topics of interest (Spasojevic et al., 2014) and expertise (Spasojevic et al., 2016) by building their topical profiles. Figure 1 shows an example of a user's topics of expertise, as inferred on Klout. Currently, we build topical profiles for more than 600M users. These profiles are further used to recommend personalized content to these users by matching their topics of interest or expertise with content topics, as this leads to better user engagement. An example of content personalization is shown in Figure 2. The user scores and topics are also available via the GNIP PowerTrack API.
Lithium NLP also enables enterprise products such as Lithium's social media management tools, Lithium Reach and Lithium Response. It is used to analyze 20M+ new daily engagements across Lithium's 400+ communities. In the past, a version of Lithium NLP was used to enable user targeting applications such as Klout Perks (influencer reward platform), Cinch (Q&A app), and Who-To-Follow recommendations. These involved selecting a group of users for targeting based on given topics and other filtering criteria.

Knowledge Base
Our Knowledge Base (KB) consists of about 1 million Freebase machine ids for entities, chosen from the subset of Freebase entities that map to Wikipedia entities. We prefer to use Freebase rather than Wikipedia as our KB since in Freebase, the same id represents a unique entity across multiple languages. Due to limited resources and the varying usefulness of entities, we restrict the KB to the approximately 1 million most important Freebase entities. This gives us a good balance between coverage and relevance of entities for processing common social media text. Section 3.1 explains how entity importance is calculated, which enables us to rank the top 1 million Freebase entities.
In addition to the KB entities, we also employ two special entities: NIL and MISC. The NIL entity indicates that there is no entity associated with the mention, e.g. the mention 'the' within a sentence may link to the entity NIL. This entity is especially useful for dealing with stop words and false positives. MISC indicates that the mention links to an entity which is outside the selected entity set in our KB.
Figure 3 shows a high level overview of the Lithium NLP system. It has two phases:

Offline Resource Generation
In this phase, we generate several dictionaries that capture language models, probabilities and relations across entities and topics, by leveraging various multi-lingual data sources. Some of these dictionaries are derived using our DAWT data set (https://github.com/klout/opendata/tree/master/wiki_annotation), which consists of densely annotated Wikipedia pages across multiple languages. It is 4.8 times denser than Wikipedia and is designed to be exhaustive across several domains.
The dictionaries generated from the DAWT dataset are:
• Mention-Entity Co-occurrence - This dictionary captures the prior probability that a mention M_i refers to an entity E_j (including NIL and MISC) within the DAWT dataset, which is equivalent to the co-occurrence probability of the mention and the entity. For instance, the mention Michael Jordan can link to Michael Jordan (Professor) or Michael Jordan (Basketball player) with different prior probabilities. Moreover, we generate a separate dictionary for each language.
• Entity-Entity Co-occurrence - This dictionary captures co-occurrence frequencies among entities by counting all the entities that simultaneously appear within a sliding window of 50 tokens. This data is accumulated across all languages and is language independent, in order to capture better relations and keep the memory footprint small when supporting additional languages. Also, for each entity, we consider only the top 30 co-occurring entities which have at least 10 co-occurrences across all supported languages. For instance, the entity Michael Jordan (Basketball player) co-occurs with entities Basketball, NBA etc. while the entity Michael Jordan (Professor) co-occurs with entities Machine Learning, Artificial Intelligence, UC Berkeley etc.
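As a rough illustration of how the two DAWT-derived dictionaries could be built, the Python sketch below estimates the mention-entity prior as count(mention, entity) / count(mention) and counts entity pairs that fall within a 50-token window, keeping the top 30 neighbours with at least 10 co-occurrences. The function names, the input formats and the exact windowing rule are assumptions made for illustration; they are not taken from the Lithium NLP implementation.

```python
from collections import Counter, defaultdict


def mention_entity_priors(annotations):
    """Estimate P(entity | mention) from a list of (mention, entity) pairs.

    Assumed estimator: count(mention, entity) / count(mention).
    """
    pair_counts = Counter(annotations)
    mention_counts = Counter(mention for mention, _ in annotations)
    priors = defaultdict(dict)
    for (mention, entity), n in pair_counts.items():
        priors[mention][entity] = n / mention_counts[mention]
    return dict(priors)


def entity_cooccurrence(docs, window=50, top_k=30, min_count=10):
    """Count entity pairs whose token positions are less than `window` apart.

    `docs` is a list of documents, each a list of (token_position, entity_id)
    pairs (hypothetical input format). For every entity, only the `top_k`
    neighbours with at least `min_count` co-occurrences are kept.
    """
    counts = defaultdict(Counter)
    for doc in docs:
        doc = sorted(doc)
        for i, (pos_a, ent_a) in enumerate(doc):
            for pos_b, ent_b in doc[i + 1:]:
                if pos_b - pos_a >= window:
                    break  # later entities are even further away
                if ent_a != ent_b:
                    counts[ent_a][ent_b] += 1
                    counts[ent_b][ent_a] += 1
    return {
        ent: [(other, n) for other, n in c.most_common(top_k) if n >= min_count]
        for ent, c in counts.items()
    }
```

Because the co-occurrence counts are keyed by entity id rather than by surface form, documents from all languages can be fed into the same call, which matches the language-independent design described above.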
We also generate additional dictionaries:
• Entity Importance - The entity importance score (Bhattacharyya and Spasojevic, 2017) is derived as a global score identifying how important an extracted entity is for a casual observer. This score is calculated using linear regression with features capturing popularity within Wikipedia links, and importance of the entity within Freebase. We used signals such as Wiki page rank, Wiki and Freebase incoming and outgoing links, and type descriptors within our KB etc.
• Topic Parents - This dictionary contains the parent topics for each topic in the Klout Topic Ontology (KTO), a manually curated ontology built to capture social media users' interests and expertise scores, in different topics, across multiple social networks. As of April 2017, it consists of roughly 8,030 topic nodes and 13,441 edges encoding hierarchical relationships among them.
• Topic Hashtags - This dictionary contains hashtags recommended for topics in KTO. We determine the hashtags via co-occurrence counts of topics and hashtags, the importance, recency and popularity of hashtags, as well as the popularity of topics.

Figure 4: An example demonstrating the information extracted and added as semantic annotation at each stage of the Lithium NLP pipeline (best viewed in color)
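To make the Topic Parents dictionary more concrete, the sketch below shows one plausible way such a dictionary could be used: walking from a fine-grained topic up to its coarser ancestors. The toy ontology, the topic names and the function `topic_ancestors` are hypothetical and only illustrate the data structure; the paper does not specify how the dictionary is traversed.

```python
from collections import deque


def topic_ancestors(topic, topic_parents):
    """Return all ancestors of `topic`, nearest first (breadth-first).

    `topic_parents` maps a topic to a list of its parent topics. A topic
    may have several parents, so visited nodes are tracked to avoid
    returning the same ancestor twice.
    """
    ancestors = []
    seen = set()
    queue = deque(topic_parents.get(topic, []))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue
        seen.add(parent)
        ancestors.append(parent)
        queue.extend(topic_parents.get(parent, []))
    return ancestors


# Hypothetical toy ontology, not actual KTO content.
TOY_PARENTS = {
    "nba": ["basketball"],
    "basketball": ["sports"],
    "sports": [],
}
```

With this toy data, `topic_ancestors("nba", TOY_PARENTS)` walks from the fine-grained topic to the coarse-grained one.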

Text Processing
In the Lithium NLP system, an input text document is stored as a Protocol Buffers message. The Text Processing phase of the system processes the input text document through several stages, and the information (entities, topics etc.) extracted at every stage is added as a semantic annotation to the text. Not all annotations are added to a document; the Lithium NLP API (explained in Section 3.3) allows a client application to select specific annotations. However, certain annotations such as language and tokens are prerequisites for later stages. The Text Processing pipeline stages are:
• Language Detection - This stage detects the language of the input document using an open source language detector. This detector employs a naive Bayesian filter which uses characters, spellings and scripts as features to classify language and estimate its probability. It has a precision of 99% for 49 languages.
• Text Normalization - This stage normalizes the text by escaping unescaped characters and replacing special characters (e.g. diacritical marks) based on the detected language. It replaces non-ASCII punctuation marks and hyphens with spaces, collapses multiple spaces into a single space, converts accented characters to regular characters etc.
• Sentence Breaking - This stage breaks the normalized text into sentences using the Java Text API. It can distinguish sentence breakers from other marks, such as periods within numbers and abbreviations, according to the detected language.
• Tokenization - This stage converts each sentence into a sequence of tokens via the Lucene Standard Tokenizer for all languages and the Lucene Smart Chinese Analyzer for Chinese.
• Entity Extraction - This stage extracts mentions in each sentence using the Mention-Entity Co-occurrence dictionary generated offline (Section 3.1). A mention may contain a single token or several consecutive tokens, but a token can belong to at most one mention.
To make this task computationally efficient, we apply a simple greedy strategy that analyzes windows of n-grams (n ∈ [1,6]) and extracts the longest mention found in each window. For each extracted mention, we generate multiple candidate entities. For instance, mention Android can link to candidate entities Android (OS) or Android (Robot).
• Entity Disambiguation and Linking (EDL) - This stage disambiguates and links an entity mention to the correct candidate entity in our KB. It uses several features obtained from the dictionaries generated offline (Section 3.1). These include context-independent features, such as mention-entity co-occurrence, mention-entity Jaccard similarity and entity importance, and context-dependent features, such as entity-entity co-occurrence and entity-topic semantic similarity. It employs machine learning models, such as decision trees and logistic regression, trained on these features to disambiguate a mention and link it to the corresponding entity. This stage has a precision of 63%, recall of 87% and an F-score of 73% when tested on an in-house dataset.
• Topic Projection - In this stage, we associate each entity in our KB with up to 10 of the most relevant topics in KTO. For instance, the entity Android (OS) will be associated with topics such as Smartphones, Software etc.
We use a weighted ensemble of several semi-supervised models that employ entity co-occurrences, GloVe (Pennington et al., 2014) word vectors, Freebase hierarchical relationships and Wikipedia in order to propagate topic labels. A complete description of this algorithm is beyond the scope of this paper.
• Hashtag Recommendation - In this stage, we annotate the text with hashtags recommended based on the topics associated with the text in Topic Projection. This uses the Topic Hashtags dictionary generated offline (Section 3.1).
• Sentiment Analysis - In this stage, we determine the sentiment of the text (positive, negative or neutral) via lexicons and term counting with negation handling. For this, we used several lexicons of positive and negative words (including SentiWordNet (Baccianella et al., 2010; Esuli and Sebastiani, 2007) and AFINN (Nielsen, 2011)) as well as emoticons. We compute the sentiment score as

\[ \frac{W_{Pos} - W_{Neg}}{\log(\text{Total \# of words in text}) + \epsilon} \]

where $W_{Pos}$ is the weighted strength of positive words and emoticons, $W_{Neg}$ is the weighted strength of negative words and emoticons in the text and $\epsilon$ is a smoothing constant. If the score is positive and above a certain threshold, the text is classified as 'Positive'. If it is below a certain threshold, the text is classified as 'Negative'. If it lies within the boundary between 'Positive' and 'Negative' classes, the text is classified as 'Neutral'.
To handle negations, we use a lookback window. Every time we encounter a word from our sentiment lexicons, we look back at a window of size 3 to see if any negation words precede it and, if so, negate the weight of the sentiment word. Overall, this stage has a precision of 47%, recall of 48% and an F-score of 46% when tested on an in-house dataset.
• Entity Metadata Decoration - In this stage, we add entity metadata such as its type (Person, Organization, Location, Film, Event, Book) and location (Population, Time Zone, Latitude/Longitude).
Figure 4 demonstrates how the Lithium NLP pipeline processes a sample text "Google CEO Eric Schmidt said that the competition between Apple and Google and iOS vs. Android is 'the defining fight of the tech industry'." and adds the annotations at every stage.
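Two of the stages above, greedy mention extraction and lexicon-based sentiment scoring, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system's implementation: the function names, the negation word list, the value of the smoothing constant `epsilon`, the use of the natural logarithm and the classification threshold are all hypothetical, since the text does not specify them.

```python
import math

# Hypothetical negation list; the actual lexicon is not given in the text.
NEGATIONS = {"not", "no", "never"}


def extract_mentions(tokens, mention_dict, max_n=6):
    """Greedy longest-match extraction over n-grams with n in [1, max_n].

    Returns (start, end, mention) spans. After a match the scan resumes
    past the matched span, so a token belongs to at most one mention.
    """
    mentions = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in mention_dict:
                mentions.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1  # no mention starts at this token
    return mentions


def sentiment_score(tokens, lexicon, epsilon=1.0, lookback=3):
    """Compute (W_pos - W_neg) / (log(number of tokens) + epsilon).

    `lexicon` maps lowercase words or emoticons to signed weights. A
    weight is negated if a negation word occurs in the `lookback`
    tokens preceding the sentiment word.
    """
    if not tokens:
        return 0.0
    w_pos = w_neg = 0.0
    for i, token in enumerate(tokens):
        weight = lexicon.get(token.lower())
        if weight is None:
            continue
        preceding = tokens[max(0, i - lookback):i]
        if any(t.lower() in NEGATIONS for t in preceding):
            weight = -weight
        if weight > 0:
            w_pos += weight
        else:
            w_neg += -weight
    return (w_pos - w_neg) / (math.log(len(tokens)) + epsilon)


def classify(score, threshold=0.1):
    """Map a score to a label using a symmetric, assumed threshold."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"
```

In the extraction function, trying the longest n-gram first is what makes "Eric Schmidt" win over "Eric" when both are dictionary mentions. In the scoring function, the denominator grows slowly with text length, so a single sentiment word weighs more in a short message than in a long one.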

REST API
The Lithium NLP system provides a REST API via which client applications can send a text document as a request and receive the annotated text as a JSON response. A snippet of an annotated response (which is in our text proto format, https://github.com/klout/opendata/blob/master/wiki_annotation/Text.proto) received through the API is shown in Listing 1. Note that the disambiguated entities are also linked to their Freebase ids and Wikipedia links.

Listing 1 (excerpt):
    "text": "Vlade Divac Serbian NBA player used to play for LA Lakers.",
    "language": "en",
    "annotation_summary": [ { "type": "ENTITY", "an…

Figure 5 shows the computational performance per processing stage of the Lithium NLP system. The overall processing speed is about 22 ms per 1 KB of text. As shown, the time taken by the system is a linear function of text size. The EDL stage takes about 80% of the processing time.

Comparison with existing NLP systems
Currently, due to limited resources at our end and also due to inherent differences in the […] (Ritter et al., 2011, 2012) is not possible. Hence, we compare our system with some of them on a different set of metrics.

Comparison on runtime and entity density
We compare the runtime of Lithium NLP and AIDA across various text sizes. As shown in Figure 6, Lithium NLP is on average 40,000 times faster than AIDA, whose slow runtime can be attributed mainly to Stanford NER. In addition to speed, we also compare the number of entities extracted per KB of text. As shown, Lithium NLP extracts about 2.8 times more entities than AIDA.

Table 1 compares the types of information extracted by the Lithium NLP system with existing systems. In this comparison, we explicitly differentiate between named entities (Person, Location etc.) and other entity types (Sports, Activities), as well as between fine-grained topics (Basketball) and coarse-grained topics (Sports), to demonstrate the rich set of information extracted by Lithium NLP. As evident, most other systems do not provide the rich set of semantic annotations that Lithium NLP provides. A majority of the systems focus on recognizing named entities and types, with only a few focusing on sentiment and coarse-grained topics as well. In contrast, Lithium NLP extracts, disambiguates and links named and other entities, extracts subject matter topics, recommends hashtags and also infers the sentiment of the text.

Table 2 compares the languages supported by the Lithium NLP system with existing systems. As evident, Lithium NLP supports 6 different languages, which is on par with, and in some cases more than, existing systems.

In the future, we plan to extend the capabilities of Lithium NLP to include entity level sentiment as well. We also hope to collaborate actively with academia and open up the Lithium NLP API to academic institutions.