NLP Service APIs and Models for Efficient Registration of New Clients

State-of-the-art NLP inference uses enormous neural architectures and models trained for GPU-months, well beyond the reach of most consumers of NLP. This has led to one-size-fits-all public API-based NLP service models offered by major AI companies, serving millions of clients. Such services cannot afford traditional fine-tuning for individual clients, and many clients cannot afford significant fine-tuning either, owning little or no labeled data. Recognizing that diversity of word usage and salience across clients reduces accuracy, we initiate a study of practical and lightweight adaptation of centralized NLP services to clients. Each client registers with the service using an unsupervised, corpus-based sketch. The server modifies its network mildly to accommodate client sketches, and occasionally trains the augmented network over existing clients. When a new client registers with its sketch, it gets immediate accuracy benefits. We demonstrate the proposed architecture using sentiment labeling, NER, and predictive language modeling.


Introduction
State-of-the-art NLP uses large neural networks with billions of parameters, enormous training data, and intensive optimization over weeks of GPU time, causing more carbon emission than a car over its lifetime (Strubell et al., 2019). Such training prowess is (mercifully) out of reach for most users of NLP methods. Recognizing this, large AI companies have launched NLP cloud services (e.g., Google NLP, Microsoft Azure, IBM Watson) and also provided trained models for download and fine-tuning. But many clients have too little data or hardware for fine-tuning massive networks. Neither can the service be expected to fine-tune for each client.
Distributional mismatch between the giant general-purpose corpus used to train the central service and the corpus from which a client's instances arise leads to lower accuracy. A common source of trouble is mismatch of word salience (Paik, 2013) between client and server corpora (Ruder, 2019). In this respect, our setting also presents a new opportunity. Clients are numerous and form natural clusters, e.g., healthcare, sports, politics. We want the service to exploit commonalities in existing client clusters, without explicitly supervising this space, and to provide some level of generalization to new clients without re-training or fine-tuning.
In response to the above challenges and constraints, we initiate an investigation of practical protocols for lightweight client adaptation of NLP services. We propose a system, KYC ("Know Your Client"), in which each client registers with the service using a simple sketch derived from its (unlabeled) corpus. The service network takes the sketch as additional input with each instance later submitted by the client. The service provides accuracy benefits to new clients immediately.
What form can a client sketch take? How should the service network incorporate it? While this will depend on the task, we initiate a study of these twin problems focused on predictive language modeling, sentiment labeling, and named entity recognition (NER). We show that a simple late-stage intervention in the server network gives visible accuracy benefits, and we provide diagnostic analyses and insights. Our code and data are publicly available.
Contributions In summary, we:
• introduce the on-the-fly client adaptation problem motivated by networked NLP API services;
• present KYC, which learns to compute client-specific biases from unlabeled client sketches;
• show improved accuracy for predictive language modeling, NER, and sentiment labeling;
• diagnose why KYC's simple client-specific label biases succeed, in terms of the relations between word salience, instance length, and label distributions at diverse clients.
Related work Our method addresses the mismatch between a client's data distribution and the server model. The extensive domain-adaptation literature (Daumé III, 2007; Ben-David et al., 2006) is driven by the same goal, but most such methods update model parameters using labeled or unlabeled data from the target domain (client). Unsupervised domain adaptation (summarized by Ramponi and Plank, 2020) relaxes the requirement of labeled client data, but still demands target-specific fine-tuning, which inhibits scalability. Some recent approaches make the adaptation lightweight (Lin and Lu, 2018; Li et al., 2020; Jia et al., 2019; Cai and Wan, 2019; Liu et al., 2020), while others use entity descriptions (Bapna et al., 2017; Shah et al., 2019) for zero-shot adaptation.

Domain generalization is another relevant technique (Chen and Cardie, 2018; Guo et al., 2018; Li et al., 2018a; Wang et al., 2019; Shankar et al., 2018; Carlucci et al., 2019; Dou et al., 2019; Piratla et al., 2020), in which multiple training domains are used to train a model that can generalize to new domains. Of these, the method most relevant to our setting is the mixture-of-experts network of Guo et al. (2018), with which we present an empirical comparison.

Another option is to transform the client's data style to match the data distribution used to train the server model, but existing style-transfer techniques (Yang et al., 2018; Shen et al., 2017; Prabhumoye et al., 2018; Fu et al., 2018; Lample et al., 2019; Li et al., 2018b; Gong et al., 2019) require access to the server's data distribution.

Proposed service protocol
We formalize the constraints on the server and client in the API setting.
(1) The server is expected to scale to a large number of clients, making it impractical to adapt to individual clients.
(2) After registration, the server is expected to provide labeling immediately, and response latency per instance must be kept low, implying that the server's inference network cannot be too compute-intensive.
(3) Finally, the client cannot perform complex pre-processing of every instance before sending it to the server, and does not have any labeled data.

Server network and model These constraints lead us to design a server model that learns to compute client-specific model parameters from the client sketch and requires no client-specific fine-tuning or parameter learning. The original server network is written as ŷ = Y_θ(M_θ(x)), where x is the input instance and Y_θ is a softmax layer that outputs the predicted label ŷ. M_θ is a representation-learning layer that may take diverse forms depending on the task; of late, BERT (Devlin et al., 2018) is used to design M_θ for many tasks. We augment the server network to accept, along with each input x, a client-specific sketch S_c, as shown in Figure 1. We discuss possible forms of S_c in the next subsection. (The dotted arrow represents a generative influence of S_c on x.) The server implements an auxiliary network g = G_φ(S_c), where g can be regarded as a neural digest of the client sketch. A combination module merges M_θ(x) and g; concatenation was found adequate on the tasks we evaluated, but we also discuss other options in Section 3. When the module is concatenation, we are computing a client-specific per-label bias, and even that provides significant gains, as we show in Section 3.
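As a concrete illustration, the augmented forward pass can be sketched as follows. Only the 128-dim g and the two-layer ReLU form of G_φ come from the paper; the remaining dimensions, random stand-in parameters, and function names are our own illustrative assumptions, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (only d_g = 128 is from the paper; the rest
# are invented for this sketch).
d_rep, d_sketch, d_g, n_labels = 16, 50, 128, 3

# Random stand-ins for the trained parameters of G_phi and Y_theta.
W1 = rng.normal(size=(d_sketch, d_g)) * 0.1
W2 = rng.normal(size=(d_g, d_g)) * 0.1
W_out = rng.normal(size=(d_rep + d_g, n_labels)) * 0.1

def G_phi(sketch):
    # Two linear layers with a ReLU, producing the neural digest g.
    return np.maximum(sketch @ W1, 0.0) @ W2

def kyc_forward(m_x, sketch):
    # Concatenate M_theta(x) with g and apply the softmax layer.
    # Because g is constant across a client's instances, its
    # contribution through W_out acts as a client-specific per-label
    # bias on the logits.
    g = G_phi(sketch)
    logits = np.concatenate([m_x, g]) @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = kyc_forward(rng.normal(size=d_rep), rng.normal(size=d_sketch))
```

Note that the sketch digest is computed once per client and cached; only the cheap concatenation and final matrix product run per instance, satisfying the latency constraint (2).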

Client sketch
The design space of client sketches S_c is infinite. We initiate a study of designing S_c from the perspective of term weighting and salience in information retrieval (Paik, 2013). S_c needs to be computed only once by each client, and is thereafter reused with every input instance x. Ideally, S_c and G_φ should be locality-preserving, in the sense that clients with similar corpora and tasks should lead to similar digests g. Suppose the set of clients already registered is C.
A simple client sketch is just a vector of counts of all words in the client corpus. Suppose word w occurs n_{c,w} times at client c, with Σ_w n_{c,w} = N_c. Before input to G_φ, the server normalizes these counts using the counts of other clients as follows. From all of C, the server estimates a background unigram rate for each word w:

p_w = ( Σ_{c∈C} n_{c,w} ) / ( Σ_{c∈C} N_c ).    (1)

The input to G_φ encodes, for each word w, how far the occurrence rate of w at client c deviates from this global estimate. Assuming a multinomial word event distribution, the marginal probability of w occurring n_{c,w} times at client c is proportional to p_w^{n_{c,w}} (1 − p_w)^{N_c − n_{c,w}}. We finally pass a vector containing the normalized negative log probabilities as input to the model:

S_c[w] = −(1/N_c) [ n_{c,w} log p_w + (N_c − n_{c,w}) log(1 − p_w) ].    (2)

We call this the term-saliency sketch. We discuss other sketches, such as TF-IDF and corpus-level statistics like average instance length, in Sec. 3.
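A minimal sketch computation under the formulas above might look as follows. The toy three-client corpora are invented, and the normalization by corpus size N_c is one reading of "normalized" here:

```python
import math
from collections import Counter

# Toy corpora for three hypothetical clients; in practice these are the
# clients' full unlabeled corpora.
corpora = {
    "finance": "million billion shares million market".split(),
    "sports":  "match goal team goal season".split(),
    "news":    "market report team million story".split(),
}

counts = {c: Counter(toks) for c, toks in corpora.items()}
vocab = sorted(set().union(*counts.values()))

# Background unigram rate p_w pooled over all registered clients (Eq. 1).
total_w = {w: sum(cnt[w] for cnt in counts.values()) for w in vocab}
total_N = sum(sum(cnt.values()) for cnt in counts.values())
p = {w: total_w[w] / total_N for w in vocab}

def saliency_sketch(client):
    # Negative log-probability of each word's count under the background
    # multinomial rate, normalized by corpus size (Eq. 2).
    cnt = counts[client]
    N = sum(cnt.values())
    raw = [-(cnt[w] * math.log(p[w]) + (N - cnt[w]) * math.log(1 - p[w]))
           for w in vocab]
    return [v / N for v in raw]

sketch = saliency_sketch("finance")
```

In this toy setup, "million" occurs at twice its background rate in the finance client, so its saliency entry there exceeds its entry in the sports client's sketch, which is the locality-preserving behavior we want.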

Experiments
We evaluate KYC on three NLP tasks as services: NER, sentiment classification, and auto-completion based on predictive language modeling. We compare KYC against the baseline model (without the G_φ network in Figure 1) and the mixture-of-experts (MoE) model (Guo et al., 2018) (see Appendix B). For all three models, the M_θ network is identical in structure. In KYC, G_φ has two linear layers with ReLU giving a 128-dim vector g, with slight exceptions (see Appendix A). We choose datasets that are partitioned naturally across domains, which we use to simulate clients. We evaluate in two settings: in-distribution (ID), on test instances from clients seen during training, and out-of-distribution (OOD), on instances from unseen clients. For this, we perform a leave-k-clients-out evaluation: given a set D of clients, we remove k clients as the OOD test set and use the remaining clients as the training client set C.

Named Entity Recognition (NER) We use OntoNotes (Pradhan et al., 2007), which has 18 entity classes from 31 sources; the sources form our set D of clients. We perform the leave-2-out test five times with 29 training clients as C. We train a cased BERT-based NER model (Devlin et al., 2018) and report F-scores. Table 1 shows that KYC provides substantial gains for OOD clients. For the first two OOD clients (BC/CCTV, Phoenix), the baseline F1 score jumps from 63.8 to 71.8. MoE performs worse than the baseline; we conjecture this is because separate softmax parameters over the large NER label space are not efficiently learnable.

Sentiment Classification We use the popular Amazon dataset (Ni et al., 2019), with each product genre simulating a client. We retain genres with more than 1000 positive and 1000 negative reviews each, and randomly sample 1000 positive and 1000 negative reviews from each of these 22 genres. We perform leave-2-out evaluation five times; Table 2 shows the five OOD genre pairs. We use an uncased BERT model for classification (Sun et al., 2019).
Table 2 shows that average OOD client accuracy increases from 86.1 to 86.8 with KYC.

Auto-complete Task We model this task as a forward language model and measure perplexity. We use the 20 Newsgroups dataset and treat each of its twenty topics as a client; thus D is of size 20. We use the state-of-the-art Mogrifier LSTM (Melis et al., 2020). We perform leave-1-topic-out evaluation six times; the OOD topics are shown in Table 3. For MoE, the client-specific parameter is only the bias and not the full softmax parameters, which would blow up the number of trainable parameters. KYC performs consistently better than the baseline, with average perplexity dropping from 28.2 to 27.9. This drop is particularly significant because the Mogrifier LSTM is a strong baseline to start with. MoE is worse than the baseline.

Statistical Significance We verify the statistical significance of the gains obtained on the sentiment analysis and auto-complete tasks; the gains in the case of NER are much larger than statistical variation. Tables 4 and 5 report the sample estimate and standard deviation over three runs, along with the p-value from significance testing. In both cases, the gains of KYC over the baseline are statistically significant.
Diagnostics We provide insights into why KYC's simple method of learning per-client label biases from client sketches is so effective. One explanation is that the baseline had a large discrepancy between the true and predicted class proportions for several OOD clients; KYC corrects this discrepancy. Figure 2 shows the true, baseline, and KYC-predicted class proportions for one OOD client on NER. Observe how labels like date, GPE, money, and org are underpredicted by the baseline and corrected by KYC. Since KYC only corrects label biases, the instances most impacted are those close to the shared decision boundary that exhibit properties correlated with labels but diverging across clients. We uncovered two such properties.

Ambiguous Tokens In NER, the label of several tokens changes across clients. E.g., tokens like million and billion in finance clients like NW/Xinhua are money 92% of the time, whereas in general only 50% of the time. Based on client sketches, it is easy to spot finance-related topics and increase the bias of the money label. This helps KYC correct the labels of borderline tokens.
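The mechanism can be seen in a small synthetic example: shifting one label's logit bias moves borderline instances across the shared decision boundary and lifts that label's predicted proportion. The logits, shift magnitude, and three-way label set are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shared logits for 1000 instances over labels
# [O, money, org]. The "money" column is systematically low, so a
# shared decision boundary underpredicts it for a finance-like client.
logits = rng.normal(size=(1000, 3))
logits[:, 1] -= 1.0

def predicted_fraction(logits, bias):
    # Fraction of instances predicted as label 1 ("money") after
    # adding a per-label bias to the logits.
    preds = (logits + bias).argmax(axis=1)
    return (preds == 1).mean()

base = predicted_fraction(logits, np.zeros(3))
# A client-specific positive bias on "money", as g would contribute
# through the concatenated softmax layer, restores its proportion.
corrected = predicted_fraction(logits, np.array([0.0, 1.0, 0.0]))
```

Only instances near the boundary flip; confidently labeled tokens are unaffected, which matches the observation that KYC's corrections concentrate on borderline tokens.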
Instance Length For sentiment labeling, review length is another such property. Figure 3 is a scatter plot of the average review length of a client versus the fraction of reviews the baseline predicts as positive. For most clients, review length is clustered around the mean of 61 words, but four clients have average length > 90. Review length is correlated with the label: on average, negative reviews contain 20 more words than positive ones. This causes the baseline to underpredict positives on the few clients with longer reviews. The topics of the four outlying clients (video games, CDs, Toys&Games) are related, so the client sketch is able to shift the decision boundary to correct for this bias. Using only the normalized average sentence length as the client sketch recovers part of KYC's improvement over the baseline (details in Appendix C), implying that average instance length should be part of the client sketch for sentiment classification tasks.

Ablation Studies We explored a number of alternative client sketches and models for harnessing them. We present a summary here; details are in Appendices C and D. Table 6 shows average F1 on NER for three other sketches: TF-IDF, binary bag of words, and a 768-dim pooled BERT embedding of ten summary sentences extracted from the client corpus (Barrios et al., 2016). KYC's default term-saliency features provide the best accuracy, with TF-IDF a close second and embedding-based sketches the worst. Next, we compare three other architectures for harnessing g in Table 6: Deep, where the combination module adds a non-linear layer after concatenating g and M_θ(x), so that the whole decision boundary, and not just the bias, is client-specific; its OOD performance increases slightly over plain concatenation. Decompose, which mixes two softmax matrices with a client-specific weight α learned from g. MoE-g, which is like MoE but uses the client sketch for expert gating. The last two options are worse than KYC.
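The length-only sketch from the ablation can be approximated as a one-dimensional z-score. Only the global mean of 61 words comes from the text; the global standard deviation and the helper name are our assumptions:

```python
# One-dimensional client sketch: z-scored average review length.
GLOBAL_MEAN_LEN = 61.0  # mean review length reported in the analysis
GLOBAL_STD_LEN = 15.0   # assumed, not reported

def length_sketch(reviews):
    # Normalized average token count of a client's reviews.
    lengths = [len(r.split()) for r in reviews]
    mean_len = sum(lengths) / len(lengths)
    return (mean_len - GLOBAL_MEAN_LEN) / GLOBAL_STD_LEN

# A toy client whose reviews are shorter than the global average,
# yielding a negative sketch value.
s = length_sketch([
    "great toy my kid loves it",
    "broke after two days not worth the price at all",
])
```

Clients with unusually long reviews get a large positive value, giving G_φ a direct signal to raise the bias of the positive label for those clients.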

Conclusion
We introduced the problem of lightweight client adaptation in NLP service settings. This is a promising area, ripe for further research on more complex tasks like translation. We proposed client sketches and KYC: an early prototype server network for on-the-fly adaptation. Three NLP tasks showed considerable benefits from simple, per-label bias correction. Alternative architectures and ablations provide additional insights.