Li Du


2023

A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar | Clara Meister | Juan Gastaldi | Li Du | Tim Vieira | Mrinmaya Sachan | Ryan Cotterell
Findings of the Association for Computational Linguistics: ACL 2023

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a 1/σ · (1 − e^(−σ))-approximation of an optimal merge sequence, where σ is the total backward curvature with respect to the optimal merge sequence. Empirically, the lower bound of the approximation is ≈ 0.37. We provide a faster implementation of BPE which improves the runtime complexity from O(NM) to O(N log M), where N is the sequence length and M is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
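
For readers unfamiliar with the procedure, the greedy loop the paper analyzes looks roughly as follows (a minimal Python sketch of standard BPE over a character sequence; function and variable names are ours, and this is the naive O(NM) version rather than the paper's faster O(N log M) implementation):

    from collections import Counter

    def bpe_greedy(sequence, num_merges):
        """Greedy BPE: repeatedly merge the most frequent adjacent pair.

        Naive O(N*M) version: each of the M merge steps rescans the full
        length-N sequence; the paper's faster variant avoids the rescan.
        """
        seq = list(sequence)
        merges = []
        for _ in range(num_merges):
            pair_counts = Counter(zip(seq, seq[1:]))
            if not pair_counts:
                break
            (a, b), _ = pair_counts.most_common(1)[0]
            merges.append((a, b))
            out, i = [], 0
            while i < len(seq):
                # Merge non-overlapping occurrences of the chosen pair.
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return seq, merges

    # e.g. bpe_greedy("abracadabra", 4)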

On the Representational Capacity of Recurrent Neural Language Models
Franz Nowak | Anej Svete | Li Du | Ryan Cotterell
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This work investigates the computational expressivity of language models (LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992) famously showed that RNNs with rational weights and hidden states and unbounded computation time are Turing complete. However, LMs define weightings over strings in addition to just (unweighted) language membership, and the analysis of the computational power of RNN LMs (RLMs) should reflect this. We extend the Turing completeness result to the probabilistic case, showing how a rationally weighted RLM with unbounded computation time can simulate any deterministic probabilistic Turing machine (PTM) with rationally weighted transitions. Since, in practice, RLMs work in real time, processing a symbol at every time step, we treat the above result as an upper bound on the expressivity of RLMs. We also provide a lower bound by showing that, under the restriction to real-time computation, such models can simulate deterministic real-time rational PTMs.

Towards Stable Natural Language Understanding via Information Entropy Guided Debiasing
Li Du | Xiao Ding | Zhouhao Sun | Ting Liu | Bing Qin | Jingshuo Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although achieving promising performance, current Natural Language Understanding models tend to utilize dataset biases instead of learning the intended task, which always leads to performance degradation on out-of-distribution (OOD) samples. To increase performance stability, previous debiasing methods empirically capture bias features from the data to prevent the model from exploiting the corresponding biases. However, our analyses show that such empirical debiasing methods may fail to capture part of the potential dataset biases and may mistake semantic information of the input text for biases, which limits the effectiveness of debiasing. To address these issues, we propose a debiasing framework, IEGDB, that comprehensively detects dataset biases to induce a set of biased features, and then purifies the biased features with the guidance of information entropy. Experimental results show that IEGDB can consistently improve the stability of performance on OOD datasets for a set of widely adopted NLU models.

Tokenization and the Noiseless Channel
Vilém Zouhar | Clara Meister | Juan Gastaldi | Li Du | Mrinmaya Sachan | Ryan Cotterell
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. Nevertheless, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) low-frequency subwords may not appear frequently enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.82 in comparison to just −0.30 for compressed length.
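
As a concrete reading of the definitions above, the following Python sketch (ours) computes the Rényi efficiency of a subword unigram distribution, i.e., its Rényi entropy normalized by the maximum entropy log |V|; the value α = 2.5 in the example is illustrative, and α → 1 recovers the Shannon case:

    import math

    def renyi_entropy(probs, alpha):
        """Rényi entropy H_alpha(p); alpha -> 1 recovers Shannon entropy."""
        if alpha == 1.0:
            return -sum(p * math.log(p) for p in probs if p > 0)
        return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)

    def renyi_efficiency(counts, alpha):
        """Entropy of the subword unigram distribution over its maximum,
        log |V|; values near 1 mean the 'channel' is used efficiently."""
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        return renyi_entropy(probs, alpha) / math.log(len(probs))

    # Toy subword counts from some tokenizer's output.
    counts = {"the": 900, "ing": 300, "token": 40, "izer": 5}
    efficiency = renyi_efficiency(counts, alpha=2.5)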

A Measure-Theoretic Characterization of Tight Language Models
Li Du | Lucas Torroba Hennigen | Tiago Pimentel | Clara Meister | Jason Eisner | Ryan Cotterell
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can “leak” onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
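
To make the notion of leakage concrete, here is a toy example of ours (not from the paper): if the EOS probability shrinks fast enough with the string position, the total probability of all finite strings stays strictly below 1, and the remainder “leaks” onto infinite sequences:

    def finite_string_mass(eos_prob, horizon=100_000):
        """Probability that a string terminates within `horizon` steps,
        under a model emitting EOS at step t with probability eos_prob(t)."""
        mass, survive = 0.0, 1.0   # survive = P(no EOS before step t)
        for t in range(horizon):
            q = eos_prob(t)
            mass += survive * q
            survive *= 1.0 - q
        return mass

    # Tight: constant EOS probability; finite strings carry all the mass.
    finite_string_mass(lambda t: 0.1)                 # ~1.0
    # Non-tight: EOS probability decaying like 1/(t+2)^2; about half the
    # mass "leaks" onto infinite sequences (the result approaches 0.5).
    finite_string_mass(lambda t: 1.0 / (t + 2) ** 2)  # ~0.5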

2022

e-CARE: a New Dataset for Exploring Explainable Causal Reasoning
Li Du | Xiao Ding | Kai Xiong | Ting Liu | Bing Qin
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding causality is of vital importance for various Natural Language Processing (NLP) applications. Beyond the labeled instances, conceptual explanations of the causality can provide a deeper understanding of the causal facts and facilitate the causal reasoning process. However, such explanation information is still absent from existing causal reasoning resources. In this paper, we fill this gap by presenting a human-annotated explainable CAusal REasoning dataset (e-CARE), which contains over 20K causal reasoning questions, together with natural language explanations of the causal questions. Experimental results show that generating valid explanations for causal facts remains especially challenging for state-of-the-art models, and that the explanation information can be helpful for promoting the accuracy and stability of causal reasoning models.

CogBERT: Cognition-Guided Pre-trained Language Models
Xiao Ding | Bowen Chen | Li Du | Bing Qin | Ting Liu
Proceedings of the 29th International Conference on Computational Linguistics

We study the problem of integrating cognitive language processing signals (e.g., eye-tracking or EEG data) into pre-trained language models like BERT. Existing methods typically fine-tune pre-trained models on cognitive data, ignoring the semantic gap between the texts and cognitive signals. To fill the gap, we propose CogBERT, a framework that can induce fine-grained cognitive features from cognitive data and incorporate cognitive features into BERT by adaptively adjusting the weight of cognitive features for different NLP tasks. Extensive experiments show that: (1) Cognition-guided pre-trained models can consistently perform better than basic pre-trained models on ten NLP tasks. (2) Different cognitive features contribute differently to different NLP tasks. Based on this observation, we give a fine-grained explanation of why cognitive data is helpful for NLP. (3) Different transformer layers of pre-trained models should encode different cognitive features, with word-level cognitive features at the bottom and semantic-level cognitive features at the top. (4) Attention visualization demonstrates that CogBERT aligns with human gaze patterns and improves its natural language comprehension ability.
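
As a purely hypothetical illustration of what “adaptively adjusting the weight of cognitive features for different NLP tasks” could look like (a PyTorch sketch of ours, not CogBERT's actual architecture), one could gate cognitive feature channels with a learned task embedding before fusing them into the token states:

    import torch
    import torch.nn as nn

    class CognitiveGate(nn.Module):
        """Hypothetical sketch, not CogBERT's actual design: a learned
        task embedding gates cognitive feature channels before they are
        fused into the token representations."""
        def __init__(self, hidden_dim, cog_dim, num_tasks):
            super().__init__()
            self.task_gate = nn.Embedding(num_tasks, cog_dim)
            self.project = nn.Linear(cog_dim, hidden_dim)

        def forward(self, token_states, cog_features, task_id):
            # Per-task weight for each cognitive feature channel.
            gate = torch.sigmoid(self.task_gate(task_id))   # (cog_dim,)
            return token_states + self.project(cog_features * gate)

    # e.g. CognitiveGate(768, 5, num_tasks=10)(states, cog, torch.tensor(3))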

A Graph Enhanced BERT Model for Event Prediction
Li Du | Xiao Ding | Yue Zhang | Ting Liu | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2022

Predicting the subsequent event for an existing event context is an important but challenging task, as it requires understanding the underlying relationships between events. Previous methods propose to retrieve relational features from an event graph to enhance the modeling of event correlations. However, the sparsity of the event graph may restrict the acquisition of relevant graph information and hence hurt model performance. To address this issue, we consider automatically building an event graph using a BERT model. To this end, we incorporate an additional structured variable into BERT to learn to predict event connections during training. Hence, at test time, the connection relationships for unseen events can be predicted by the structured variable. Results on two event prediction tasks, script event prediction and story ending prediction, show that our approach can outperform state-of-the-art baseline methods.

Hidden State Variability of Pretrained Language Models Can Guide Computation Reduction for Transfer Learning
Shuo Xie | Jiahao Qiu | Ankita Pasad | Li Du | Qing Qu | Hongyuan Mei
Findings of the Association for Computational Linguistics: EMNLP 2022

When transferring a pretrained language model, common approaches conventionally attach a task-specific classifier to the top layer and adapt all of the pretrained layers. We investigate whether one can instead make a task-specific selection of which subset of layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g., fine-tuning or adapter-tuning) without sacrificing performance. We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already “well-specialized” for a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute, requires no training or hyperparameter tuning, and is robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric can yield significantly stronger performance than using the same number of top layers, and often matches the performance of fine-tuning or adapter-tuning the entire language model.
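
A minimal sketch of such a selection score (ours; the paper's exact formulation may differ in details such as normalization): compare the within-class scatter of a layer's hidden states to the between-class scatter on a task corpus, and prefer to adapt the layers where the ratio is still high:

    import numpy as np

    def variability_ratio(hidden, labels):
        """Within-class over between-class variability of one layer's
        hidden states; low values suggest the layer is already
        'well-specialized' for the task. hidden: (n, d); labels: (n,)."""
        global_mean = hidden.mean(axis=0)
        within = between = 0.0
        for c in np.unique(labels):
            h = hidden[labels == c]
            mu = h.mean(axis=0)
            within += np.sum((h - mu) ** 2)
            between += len(h) * np.sum((mu - global_mean) ** 2)
        return within / between

    # Score each layer on a task corpus; adapt only the high-variability
    # layers and place the classifier above the last selected layer.
    # scores = [variability_ratio(h, y) for h in per_layer_states]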

ReCo: Reliable Causal Chain Reasoning via Structural Causal Recurrent Neural Networks
Kai Xiong | Xiao Ding | Zhongyang Li | Li Du | Ting Liu | Bing Qin | Yi Zheng | Baoxing Huai
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Causal chain reasoning (CCR) is an essential ability for many decision-making AI systems, as it requires the model to build reliable causal chains by connecting causal pairs. However, CCR suffers from two main transitivity problems: the threshold effect and scene drift. In other words, the causal pairs to be spliced may have conflicting threshold boundaries or scenarios. To address these issues, we propose a novel Reliable Causal chain reasoning framework (ReCo), which introduces exogenous variables to represent the threshold and scene factors of each causal pair within the causal chain, and estimates the threshold and scene contradictions across the exogenous variables via a structural causal recurrent neural network (SRNN). Experiments show that ReCo outperforms a series of strong baselines on both Chinese and English CCR datasets. Moreover, by injecting reliable causal chain knowledge distilled by ReCo, BERT can achieve better performance on four downstream causal-related tasks than BERT models enhanced with other kinds of knowledge.

2021

ExCAR: Event Graph Knowledge Enhanced Explainable Causal Reasoning
Li Du | Xiao Ding | Kai Xiong | Ting Liu | Bing Qin
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Prior work infers the causation between events mainly based on knowledge induced from annotated causal event pairs. However, additional evidence intermediate to the cause and effect remains unexploited. By incorporating such information, the logical law behind the causality can be unveiled, and the interpretability and stability of the causal reasoning system can be improved. To facilitate this, we present an Event graph knowledge enhanced explainable CAusal Reasoning framework (ExCAR). ExCAR first acquires additional evidence from a large-scale causal event graph as logical rules for causal reasoning. To learn the conditional probabilities of the logical rules, we propose the Conditional Markov Neural Logic Network (CMNLN), which combines the representation learning and structure learning of logical rules in an end-to-end differentiable manner. Experimental results demonstrate that ExCAR outperforms previous state-of-the-art methods. Adversarial evaluation shows the improved stability of ExCAR over baseline systems, and human evaluation shows that ExCAR achieves promising explainability.

Learning Event Graph Knowledge for Abductive Reasoning
Li Du | Xiao Ding | Ting Liu | Bing Qin
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Abductive reasoning aims at inferring the most plausible explanation for observed events and plays a critical role in various NLP applications, such as reading comprehension and question answering. To facilitate this task, a narrative-text-based abductive reasoning task, 𝛼NLI, has been proposed, together with explorations of building reasoning frameworks using pretrained language models. However, abundant event commonsense knowledge has not been well exploited for this task. To fill this gap, we propose a variational-autoencoder-based model, ege-RoBERTa, which employs a latent variable to capture the necessary commonsense knowledge from an event graph to guide the abductive reasoning task. Experimental results show that, by learning the external event graph knowledge, our approach outperforms the baseline methods on the 𝛼NLI task.

Neural Natural Logic Inference for Interpretable Question Answering
Jihao Shi | Xiao Ding | Li Du | Ting Liu | Bing Qin
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Many open-domain question answering problems can be cast as a textual entailment task, where a question and candidate answers are concatenated to form hypotheses. A QA system then determines if the supporting knowledge bases, regarded as potential premises, entail the hypotheses. In this paper, we investigate a neural-symbolic QA approach that integrates natural logic reasoning within deep learning architectures, towards developing effective yet explainable question answering models. The proposed model gradually bridges a hypothesis and candidate premises following natural logic inference steps to build proof paths. Entailment scores between the acquired intermediate hypotheses and candidate premises are measured to determine if a premise entails the hypothesis. As the natural logic reasoning process forms a tree-like, hierarchical structure, we embed hypotheses and premises in a hyperbolic space rather than a Euclidean space to acquire more precise representations. Empirically, our method outperforms prior work on answering multiple-choice science questions, achieving the best results on two publicly available datasets. The natural logic inference process inherently provides evidence to help explain the prediction process.

2019

Modeling Event Background for If-Then Commonsense Reasoning Using Context-aware Variational Autoencoder
Li Du | Xiao Ding | Ting Liu | Zhongyang Li
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Understanding events and event-centered commonsense reasoning is crucial for natural language processing (NLP). Given an observed event, it is trivial for humans to infer its intents and effects, while this type of If-Then reasoning remains challenging for NLP systems. To facilitate this, an If-Then commonsense reasoning dataset, Atomic, has been proposed, together with an RNN-based Seq2Seq model to conduct such reasoning. However, two fundamental problems still need to be addressed: first, an event may have multiple intents, while the generations of RNN-based Seq2Seq models are always semantically close; second, external knowledge of the event background may be necessary for understanding events and conducting If-Then reasoning. To address these issues, we propose a novel context-aware variational autoencoder that effectively learns event background information to guide If-Then reasoning. Experimental results show that our approach improves both the accuracy and the diversity of inferences compared with state-of-the-art baseline methods.