On the Branching Bias of Syntax Extracted from Pre-trained Language Models

Many efforts have been devoted to extracting constituency trees from pre-trained language models, often proceeding in two stages: feature definition and parsing. However, methods of this kind may suffer from a branching bias, which inflates their performance on languages whose dominant branching direction matches the bias. In this work, we propose to quantitatively measure the branching bias by comparing the performance gap between a language and its reversed counterpart, a measure that is agnostic to both language models and extraction methods. Furthermore, we analyze the impact of three factors on the branching bias, namely feature definitions, parsing algorithms, and language models. Experiments show that several existing works exhibit branching biases, and that some implementations of these three factors can introduce the branching bias.


Introduction
Neural language models such as LSTM (Merity et al., 2018; Peters et al., 2018), GPT2 (Radford et al., 2019), and BERT (Devlin et al., 2019; Liu et al., 2019) have achieved state-of-the-art performance in various downstream NLP tasks. Many recent works try to interpret their success by revealing the linguistic properties captured by these language models (Hewitt and Manning, 2019; Clark et al., 2019; Jawahar et al., 2019; Tenney et al., 2019). One interesting line of work tries to extract discrete constituency trees from pre-trained language models (Mareček and Rosa, 2018, 2019; Kim et al., 2020; Wu et al., 2020). The core of these works is to extract syntax in two stages. First, feature scores are defined based on a language model (the feature definition stage). Second, the feature scores are used to build a constituency tree (the parsing stage).
However, the degree to which the extracted constituency trees match gold constituency annotations may not precisely reflect the model's competence in capturing syntax, since the final performance may benefit from a branching bias. For example, as pointed out by Dyer et al. (2019), the syntax extracted from the ordered-neuron-based language model (Shen et al., 2019) is biased toward right-branching languages (e.g., English). Nevertheless, the approach to measuring the bias in Dyer et al. (2019) is highly dependent on the ordered neuron architecture and its parsing algorithm, so applying it to general pre-trained language models is far from trivial. This paper proposes a new approach to reveal the branching bias of syntax extracted from pre-trained language models, which is agnostic to model architectures and parsing algorithms. The key idea is based on the following observation: we can construct a left-branching language by reversing a right-branching language, and vice versa. An illustration is given in Figure 1. If a syntax extracting method has no branching bias, its parsing performance on the original language and on the reversed language should differ little or not at all. Therefore, the performance gap can be used as an indicator of branching bias. Using our approach, we find that some recent works on pre-trained language models suffer from the branching bias (Kim et al., 2020; Wu et al., 2020; Mareček and Rosa, 2018). We further investigate a deeper question: does the bias come from the language models, or from the extraction methods (feature definition and parsing algorithm)? We propose a simple approach to quantitatively analyze the bias in each of them by controlling for the other factors while studying a specific part of the pipeline.

Measuring branching bias
Intuitively, branching bias means that the induced syntax tends toward a specific branching structure, such as the right-branching structure in Shen et al. (2019), as pointed out by Dyer et al. (2019). For example, a right-branching bias will exaggerate a method's performance on a right-branching language and understate its performance on a left-branching language. Therefore, a natural way to quantify the branching bias in syntax is to compare performance on two natural languages with different branching directions (e.g., English and Japanese).
Unfortunately, due to the intrinsic differences between two natural languages, it may be unfair to compare their performances directly. Therefore, for a language L, we build a synthetic language L′ by reversing the word order, so that sentences read right-to-left rather than left-to-right as in language L. If language L is right-branching, then language L′ will be left-branching, as shown in Figure 1. Based on this observation, we use a natural language L and its synthetic counterpart L′ to measure the performance gap.
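The reversal described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a constituency tree is represented as a set of half-open word spans (i, j), and the helper names `reverse_sentence` and `reverse_spans` are hypothetical.

```python
def reverse_sentence(words):
    """Return the sentence of the reversed language (word order flipped)."""
    return list(reversed(words))


def reverse_spans(spans, n):
    """Mirror constituent spans for a sentence of n words.

    Each span (i, j) covers words[i:j]; after reversing the word order,
    the same constituent covers positions n - j up to n - i.
    """
    return {(n - j, n - i) for (i, j) in spans}


# Toy example (assumed, not from the paper): a purely right-branching
# tree over four words becomes purely left-branching when mirrored.
words = ["w0", "w1", "w2", "w3"]
right_branching = {(0, 4), (1, 4), (2, 4)}
mirrored = reverse_spans(right_branching, len(words))
print(reverse_sentence(words), sorted(mirrored))
```

Applying `reverse_spans` twice returns the original spans, so the original and reversed treebanks contain the same constituents and differ only in word order.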
More concretely, the performance gap between languages L and L′, namely the branching gap, is defined as follows:

$$\Delta = m(t, g) - m(t', g') \qquad (1)$$

where m is a metric function that measures the quality of a parse tree (e.g., F1 score); t is a tree extracted by a syntax extracting method on language L, and g is its gold tree; t′ and g′ are defined similarly but on the reversed language L′. To make the comparison in Eq. (1) fairer, we guarantee that the training and test datasets for both languages are identical except for the word order.
If a syntax extracting method is unbiased, the branching gap should be close to 0. The sign of the branching gap indicates the direction of the branching bias. It is worth noting that the proposed approach to measuring branching bias is independent of the model architecture and the syntax extracting method, unlike the approach used in Dyer et al. (2019). Therefore, our approach can be naturally applied to any pre-trained language model and syntax extracting method. Moreover, Dyer et al. (2019) mainly focus on the branching bias in a specific parsing algorithm (Shen et al., 2019), whereas we also analyze the branching bias in feature definitions and language models.
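As a rough illustration of the branching gap, the sketch below assumes that the gap is the score on the original language minus the score on the reversed one, that trees are sets of half-open spans, and that the metric is an unlabeled F1 pooled over the corpus. The function names and the toy trees are hypothetical, and the trivial whole-sentence span is kept for simplicity.

```python
def corpus_f1(pred_trees, gold_trees):
    """Unlabeled span F1 with counts pooled over all sentences."""
    tp = sum(len(p & g) for p, g in zip(pred_trees, gold_trees))
    n_pred = sum(len(p) for p in pred_trees)
    n_gold = sum(len(g) for g in gold_trees)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def branching_gap(pred, gold, pred_rev, gold_rev, metric=corpus_f1):
    """Score on the original language minus score on the reversed one."""
    return metric(pred, gold) - metric(pred_rev, gold_rev)


# Toy setting (assumed): a "parser" that always emits a right-branching
# tree, evaluated on a right-branching gold tree and on its mirror image.
right = {(0, 4), (1, 4), (2, 4)}
left = {(0, 4), (0, 3), (0, 2)}
gap = branching_gap([right], [right], [right], [left])
print(round(gap, 4))
```

In this toy case the always-right-branching output matches the original gold tree exactly but shares only the whole-sentence span with the mirrored one, so the gap is clearly positive.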

Factors affecting branching bias
Since constituency trees are extracted from a pre-trained language model using a syntax extracting method, the branching bias may be due to both the syntax extracting method and the language model. More precisely, the branching bias may be affected by parsing algorithms, definitions of feature scores, and language models. In the rest of this section, we investigate the branching bias in each of these three factors in turn.
Bias in parsing algorithm Since the parsing algorithm sits on top of the language model and the feature definition, analyzing the bias of a parsing algorithm alone requires excluding the influence of these two factors. To this end, we assign a sequence of random scores as the feature scores and then run the parsing algorithm on these random scores to obtain the constituency tree. The random feature scores are drawn from a uniform distribution. Since the feature scores are independent of both the language model and the feature definition, any noticeable branching gap can be attributed solely to the parsing algorithm.
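The random-score probe can be sketched as follows. This is a toy under stated assumptions, not a reimplementation of DIST, MART, or ATTNSPAN: the parser is a generic greedy top-down splitter, the gold trees are purely right-branching (and their mirror images), and the `slope` parameter is a hypothetical knob that deliberately favors early split points in order to simulate a biased algorithm.

```python
import random


def parse(scores):
    """Greedy top-down parse: split each span at its highest-scoring gap.

    scores[k] is the score of the gap between word k and word k + 1.
    """
    spans = set()

    def split(lo, hi):
        if hi - lo < 2:
            return
        spans.add((lo, hi))
        k = max(range(lo, hi - 1), key=lambda i: scores[i])
        split(lo, k + 1)
        split(k + 1, hi)

    split(0, len(scores) + 1)
    return spans


def f1(preds, golds):
    """Unlabeled span F1 pooled over the corpus."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    p = tp / sum(len(x) for x in preds)
    r = tp / sum(len(x) for x in golds)
    return 2 * p * r / (p + r)


def gap_with_random_scores(n_words=6, n_sents=2000, slope=0.0, seed=0):
    """Branching gap of the toy parser when fed uniform random scores."""
    rng = random.Random(seed)

    def sample():
        # slope > 0 lowers later gaps, so earlier gaps are split first.
        return parse([rng.random() - slope * k for k in range(n_words - 1)])

    gold = [{(i, n_words) for i in range(n_words - 1)}] * n_sents
    gold_rev = [{(0, j) for j in range(2, n_words + 1)}] * n_sents
    preds = [sample() for _ in range(n_sents)]
    preds_rev = [sample() for _ in range(n_sents)]
    return f1(preds, gold) - f1(preds_rev, gold_rev)


print(gap_with_random_scores(slope=0.0))  # symmetric splitter
print(gap_with_random_scores(slope=1.0))  # deliberately biased variant
```

With `slope=0.0` the splitter treats all gaps alike, so the gap should stay near zero up to sampling noise. With `slope=1.0` the leftmost gap always wins, every output is right-branching, and the gap becomes large and positive.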
Bias in feature definition A feature definition specifies which type of information from a language model (e.g., hidden vectors or attention matrices) is converted into feature scores and then fed into a parsing algorithm. Some feature definitions may also intrinsically contain a branching bias. To reveal the bias that depends solely on a specific feature definition, instead of using the original weights (e.g., hidden representations and attention weights) output by a pre-trained language model, we propose randomly initializing them and using them to compute feature scores. We then run an unbiased parsing algorithm on the feature scores generated in this way. Because this pipeline is independent of the language model and the fixed parsing algorithm is unbiased, any noticeable branching gap can be attributed to the feature definition.
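A minimal sketch of this probe for a hidden-representation feature is given below. It assumes, for illustration only, that the feature score of a gap is the Euclidean distance between the hidden vectors of the two adjacent words and that the random hidden vectors are standard Gaussian; the function names, dimensions, and sample sizes are hypothetical.

```python
import math
import random


def random_hidden_scores(n_words, dim, rng):
    """Feature scores computed from randomly initialized hidden vectors.

    The score of gap k is the L2 distance between the (random) hidden
    vectors of word k and word k + 1.
    """
    hidden = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_words)]
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(hidden[k], hidden[k + 1])))
        for k in range(n_words - 1)
    ]


def mean_score_per_gap(n_words=6, dim=16, n_sents=2000, seed=0):
    """Average feature score at each gap position over many sentences."""
    rng = random.Random(seed)
    totals = [0.0] * (n_words - 1)
    for _ in range(n_sents):
        for k, s in enumerate(random_hidden_scores(n_words, dim, rng)):
            totals[k] += s
    return [t / n_sents for t in totals]


print([round(m, 2) for m in mean_score_per_gap()])
```

If the per-position averages are flat, the feature definition gives no gap position a systematic advantage, which is what one would expect from a feature that does not favor either branching direction.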

Bias in language model
The pre-trained language model is the input of a syntax extracting method, so we further analyze the branching bias in the language model itself. To do so, we first choose an unbiased syntax extracting pipeline (i.e., both the feature definition and the parsing algorithm are unbiased) and then calculate the branching gap using language models trained on languages L and L′. Since there is no branching bias within the selected extracting method, any observed branching gap can be attributed to the language model.

Settings
Data We choose English as the main language in our experiments. The English data used for training language models is the concatenation of 1M lines of Wikipedia data (Devlin et al., 2019) and the Penn TreeBank (PTB) (Marcus et al., 1993) training data. We use PTB-22 and PTB-23 for validation and test, respectively. Besides, to rule out the impact of other linguistic properties, we also conduct part of our experiments on German and Chinese. We use the German Treebank from the SPMRL (Seddah et al., 2014) and Penn Chinese TreeBank (CTB) (Xue et al., 2005) with their provided test sets to evaluate previous methods on those two languages, respectively.
Language Models In our experiments, we train three different language models (i.e., BERT, GPT2, LSTM) for English and its reversed language 4. The BERT and GPT2 models are trained using Huggingface's Transformers (Wolf et al., 2019), and we use the default parameters of their base settings (Devlin et al., 2019; Radford et al., 2019; Wolf et al., 2019). The LSTM model is trained using awd-lstm-lm 5, with parameters similar to those of Merity et al. (2018). The models used for extracting syntax are selected according to perplexity (PPL) on the validation set. The tokenizers for BERT and GPT2 are trained using the toolkit huggingface/tokenizers 6, and their vocabulary sizes are 22000 and 35000, respectively. The GPT2 tokenizer is shared with the LSTM.

Syntax Extracting Methods
To evaluate the branching bias, we use the code 7 of Kim et al. (2020) and Wu et al. (2020), and re-implement the algorithm in Mareček and Rosa (2018). The parsing algorithms proposed by them are referred to as DIST, MART, and ATTNSPAN, respectively. Note that Kim et al. (2020) propose a trick to explicitly inject a right-branching bias into their method, and we set the weight of this injected external bias to zero in our experiments. For feature definitions, we mainly focus on three types, which are hidden representation (Kim et al., 2020), full attention (Mareček and Rosa, 2018), and prefix attention (Kim et al., 2020; Wu et al., 2020). 8 The hyper-parameters (e.g., choices of attention head and hidden layer) of the syntax extracting methods are tuned on the validation set.
4 We train a language model on the reversed language by reversing the entire training corpus.
5 https://github.com/salesforce/awd-lstm-lm
6 https://github.com/huggingface/tokenizers
7 https://github.com/galsang/trees_from_transformers and https://github.com/LividWo/Perturbed-Masking
8 Prefix-attention means the attention is performed over the prefix words as in GPT2, whereas full-attention is over all words in a sentence as in BERT.

Main Results
As shown in Table 1, the behaviors of different approaches diverge widely. We find that the branching bias in BERT+ATTNSPAN and BERT+DIST is relatively lower than in other approaches. However, the results of GPT2+ATTNSPAN and BERT/GPT2+MART demonstrate significant right-branching biases, and GPT2+DIST shows a tendency towards left-branching. Since these approaches are pipelined, it remains unclear which part of each method is responsible for the branching bias.
The results reported in Table 1 are slightly worse than those reported in Kim et al. (2020) and Wu et al. (2020). One reason is that we follow standard practice and evaluate with the corpus-level F1 score rather than the sentence-level F1 score (Kim et al., 2020). The other reason is that our training data is small, since it is too expensive to train reversed language models on a huge dataset. However, these results are obtained by running the released code of Kim et al. (2020) and Wu et al. (2020), and thus we believe this does not affect our findings.
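The difference between the two averaging schemes can be made concrete with a small sketch. It assumes unlabeled span F1 over sets of half-open spans; the function names and the two toy sentences are hypothetical and are not taken from the experiments.

```python
def span_f1(tp, n_pred, n_gold):
    """F1 from raw counts; returns 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    p, r = tp / n_pred, tp / n_gold
    return 2 * p * r / (p + r)


def corpus_level_f1(preds, golds):
    """Pool the counts over all sentences, then compute a single F1."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    return span_f1(tp, sum(len(p) for p in preds), sum(len(g) for g in golds))


def sentence_level_f1(preds, golds):
    """Compute F1 for each sentence, then average the scores."""
    scores = [span_f1(len(p & g), len(p), len(g)) for p, g in zip(preds, golds)]
    return sum(scores) / len(scores)


# Toy corpus: a short sentence parsed perfectly and a longer one parsed
# poorly (left-branching output against a right-branching gold tree).
preds = [{(0, 2)}, {(0, 2), (0, 3), (0, 4), (0, 5)}]
golds = [{(0, 2)}, {(0, 5), (1, 5), (2, 5), (3, 5)}]
print(corpus_level_f1(preds, golds), sentence_level_f1(preds, golds))
```

Here the short sentence counts as much as the long one under sentence-level averaging, so the sentence-level score comes out higher than the corpus-level score; the two schemes are therefore not directly comparable.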

Factors affecting branching bias
Branching Bias in Parsing Algorithm The branching gaps of different parsing algorithms are shown in Table 2. In the English experiments, the branching gap of MART is significantly larger than 0, which means that it tends toward right-branching. In contrast, the branching gaps of the parsing algorithms ATTNSPAN and DIST are nearly 0, which means that they are biased toward neither left-branching nor right-branching. Although DIST is inspired by the parsing algorithm in Shen et al. (2019), it is unbiased, which is consistent with the claim in Kim et al. (2020). We also evaluate the parsing algorithm of Shen et al. (2019), and its branching gap is +3.22 on English, which is consistent with the finding in Dyer et al. (2019).
To examine whether other language properties might play a role in this process, we also conduct experiments on different languages, which helps rule out the impact of specific language properties. The results in Table 2 show that MART follows the same trend as the RIGHT-B baseline (row 5) on both the Chinese and German datasets, which is consistent with the finding on the English dataset. It is also worth noting that the branching gap of MART is positive on the Chinese and English datasets, whereas it is negative on German. The reason is that both Chinese and English are right-branching languages, while German is inclined to be left-branching. However, from a linguistic viewpoint, both head-initial and head-final structures occur in German. In addition, one interesting observation is that the performance of ATTNSPAN is always higher than that of the RANDOM baseline. We hypothesize that ATTNSPAN may have a bias towards balanced trees, due to the way it computes the weights of split points.
Branching Bias in Feature Definition As shown in Table 3, we choose the unbiased parsing algorithm DIST to further analyze the branching bias in feature definitions. It is worth noting that, after normalization, the attention matrix of PREFIX-ATTN is lower triangular, whereas that of FULL-ATTN is fully filled. We find that the feature definitions based on HIDDEN and FULL-ATTN are unbiased. However, PREFIX-ATTN tends to generate right-branching trees, with a branching gap of +7.27 on English. This finding is consistent with the results on Chinese and German. One possible explanation for PREFIX-ATTN is that the attention scores become more diffuse as the prefix grows, so that the feature scores near the front of the sequence, which have larger values, are picked first.
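The explanation above can be illustrated with a small worked example. The sketch assumes uninformative (all-zero) attention logits and takes the feature score of a gap to be the Euclidean distance between the attention rows of the two adjacent words; the distance choice and the function names are assumptions made for illustration, not the exact definitions used by the evaluated methods.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def attention(logits, causal):
    """Row-normalized attention; with causal=True row i sees words 0..i."""
    n = len(logits)
    rows = []
    for i in range(n):
        visible = i + 1 if causal else n
        rows.append(softmax(logits[i][:visible]) + [0.0] * (n - visible))
    return rows


def adjacent_distances(rows):
    """L2 distance between the attention rows of neighbouring words."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[k], rows[k + 1])))
        for k in range(len(rows) - 1)
    ]


n = 5
zero_logits = [[0.0] * n for _ in range(n)]
prefix_scores = adjacent_distances(attention(zero_logits, causal=True))
full_scores = adjacent_distances(attention(zero_logits, causal=False))
print([round(s, 3) for s in prefix_scores])
print([round(s, 3) for s in full_scores])
```

Under these assumptions, row i of the prefix-attention matrix spreads its mass uniformly over i + 1 words, so the distance between rows k and k + 1 equals 1/sqrt((k + 1)(k + 2)) and shrinks toward the end of the sentence. A top-down parser that splits at the largest score would then split near the front first and produce right-branching trees, while the full-attention rows are identical and give no position an advantage.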

Branching Bias in Language Models
Following the analyses in the previous steps, we use the unbiased parsing algorithm and feature definition, DIST and HIDDEN, to evaluate the branching bias in language models. Note that the results in this section differ from those in Table 1, since HIDDEN is the only feature definition allowed here. The results of our experiments on language models are shown in Table 4. The performance of BERT is nearly the same in both branching directions, with a branching gap of just −0.30. In contrast, a slight branching gap is observed for both GPT2 and LSTM. The branching gap of GPT2 is −2.24. Under the same left-to-right paradigm, LSTM exhibits a positive branching gap of +2.50. The opposite signs may be caused by the difference between model architectures, as GPT2 is based on self-attention (Vaswani et al., 2017) and LSTM is based on a gating mechanism (Hochreiter and Schmidhuber, 1997). However, random noise may also play a role in this observation, since the performance range of GPT2 models trained on the original English dataset with different random seeds can also reach around 1.50. We will investigate this in future work.

Conclusion
In this paper, we propose an approach to quantitatively analyze the branching bias in syntax extracted from pre-trained language models. Unlike previous work, our approach is general enough to be applied to any pre-trained language model and syntax extracting method. Furthermore, we systematically analyze three factors that may affect the branching bias: the language model, the feature definition, and the parsing algorithm. Our experiments show that branching biases are present in many recent works and that these biases can be introduced by each of the three factors. We urge researchers to design their syntax extracting methods carefully, so that the results reveal how much syntax a pre-trained language model actually captures.