PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

We present the first multi-task learning model – named PhoNLP – for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT (Nguyen and Nguyen, 2020) for each task independently. We publicly release PhoNLP as an open-source toolkit under the Apache License 2.0. Although we develop PhoNLP for Vietnamese, our PhoNLP training and evaluation command scripts can in fact directly work for any other language that has a pre-trained BERT-based language model and gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and applications, not only for Vietnamese but also for other languages. Our PhoNLP is available at https://github.com/VinAIResearch/PhoNLP


Introduction
Vietnamese NLP research has grown significantly in recent years, boosted by the success of the national project on Vietnamese language and speech processing (VLSP) KC01.01/2006-2010 and the VLSP workshops that have run shared tasks since 2013. The fundamental tasks of POS tagging, NER and dependency parsing thus play important roles, providing useful features for many downstream application tasks such as machine translation (Tran et al., 2016), sentiment analysis (Bang and Sornlertlamvanich, 2018), relation extraction (To and Do, 2020), semantic parsing, open information extraction (Truong et al., 2017) and question answering (Nguyen et al., 2017; Le-Hong and Bui, 2018). There is thus a need to develop NLP toolkits for linguistic annotations w.r.t. Vietnamese POS tagging, NER and dependency parsing.
VnCoreNLP (Vu et al., 2018) is the previous public toolkit employing traditional feature-based machine learning models to handle these Vietnamese NLP tasks. However, VnCoreNLP is no longer considered state-of-the-art, because its performance results are significantly outperformed by those obtained when fine-tuning PhoBERT (Nguyen and Nguyen, 2020), the current state-of-the-art monolingual pre-trained language model for Vietnamese. Note that there are no publicly available fine-tuned BERT-based models for the three Vietnamese tasks. Even if there were, a potential drawback is that an NLP package wrapping such fine-tuned BERT-based models would take a large storage space, i.e. three times larger than the storage space used by a single BERT model (Devlin et al., 2019), and thus would not be suitable for practical applications that require a smaller storage footprint. Joint multi-task learning is a promising solution, as it can help reduce the storage space. In addition, POS tagging, NER and dependency parsing are related tasks: POS tags are essential input features for dependency parsing and are also used as additional features for NER. Joint multi-task learning thus might also help improve performance over single-task learning (Ruder, 2019).
In this paper, we present a new multi-task learning model, named PhoNLP, for joint POS tagging, NER and dependency parsing. In particular, given an input sentence of words, an encoding layer generates contextualized word embeddings that represent the input words. These contextualized word embeddings are fed into a POS tagging layer, which is in fact a linear prediction layer (Devlin et al., 2019), for POS tag prediction. Each POS tag is then represented by two "soft" embeddings that are later fed into the NER and dependency parsing layers separately. More specifically, based on both the contextualized word embeddings and the "soft" POS tag embeddings, the NER layer uses a linear-chain CRF predictor (Lafferty et al., 2001) to predict NER labels for the input words, while the dependency parsing layer uses a Biaffine classifier (Dozat and Manning, 2017) to predict dependency arcs between the words and another Biaffine classifier to label the predicted arcs. Our contributions are summarized as follows:

• To the best of our knowledge, PhoNLP is the first proposed model to jointly learn POS tagging, NER and dependency parsing for Vietnamese.
• We discuss a data leakage issue in the Vietnamese benchmark datasets that has not been pointed out before. Experiments show that PhoNLP obtains state-of-the-art performance results, outperforming the PhoBERT-based single-task learning approach.
• We publicly release PhoNLP as an open-source toolkit that is simple to set up and efficient to run from both the command line and the Python API. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and downstream applications.

Figure 1 illustrates our PhoNLP architecture, which can be viewed as a mixture of a BERT-based encoding layer and three decoding layers for POS tagging, NER and dependency parsing.

Encoder & Contextualized embeddings
Given an input sentence consisting of n word tokens w_1, w_2, ..., w_n, the encoding layer employs PhoBERT to generate contextualized latent feature embeddings e_i, each representing the i-th word w_i:

e_i = PhoBERT(w_1:n, i)

In particular, the encoding layer employs the PhoBERT base version. Because PhoBERT uses BPE (Sennrich et al., 2016) to segment the input sentence into subword units, the encoding layer in fact represents the i-th word w_i by the contextualized embedding of its first subword.
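Since BPE can split one word into several subword pieces, the alignment from words to first subwords can be sketched as follows. This is a hypothetical helper for illustration, not PhoNLP's actual code; the example segmentation of "làm_việc" is likewise assumed:

```python
def first_subword_indices(word_lens):
    """Given the number of BPE subwords per word, return the index of
    each word's first subword in the flat subword sequence."""
    idx, out = 0, []
    for n in word_lens:
        out.append(idx)
        idx += n
    return out

# words:    [Tôi, đang, làm_việc, tại, VinAI]
# subwords: [Tôi, đang, làm@@, việc, tại, VinAI]  (assumed segmentation)
print(first_subword_indices([1, 1, 2, 1, 1]))  # [0, 1, 2, 4, 5]
```

The returned indices are then used to gather one contextualized embedding per word from the subword-level encoder output.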

POS tagging
Following the common approach of fine-tuning a pre-trained language model for a sequence labeling task (Devlin et al., 2019), the POS tagging layer is a linear prediction layer appended on top of the encoder. In particular, the POS tagging layer feeds the contextualized word embeddings e_i into a feed-forward network (FFNN_POS) followed by a softmax predictor for POS tag prediction:

p_i = softmax(FFNN_POS(e_i))

where the output layer size of FFNN_POS is the number of POS tags. Based on the probability vectors p_i, a cross-entropy objective loss L_POS is calculated for POS tagging during training.
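To make this concrete, here is a minimal NumPy sketch of the POS head with toy sizes and a single linear layer standing in for FFNN_POS (PhoNLP's real dimensions and implementation differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim, n_tags = 5, 8, 4           # toy sizes, not PhoNLP's real ones

E = rng.normal(size=(n_words, dim))      # contextualized embeddings e_1..e_n
W = rng.normal(size=(dim, n_tags))       # one-layer stand-in for FFNN_POS

logits = E @ W
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)        # p_i = softmax(FFNN_POS(e_i))

gold = np.array([0, 1, 2, 3, 0])         # toy gold POS tag ids
loss_pos = -np.log(P[np.arange(n_words), gold]).mean()  # cross-entropy L_POS
print(P.shape, float(loss_pos) > 0)
```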

NER
The NER layer creates a sequence of vectors v_1:n in which each v_i is obtained by concatenating the contextualized word embedding e_i and a "soft" POS tag embedding t_i^(1):

v_i = e_i ∘ t_i^(1)

where, following Hashimoto et al. (2017), the "soft" POS tag embedding t_i^(1) is computed by multiplying a label weight matrix W^(1) with the corresponding probability vector p_i:

t_i^(1) = W^(1) p_i

The NER layer then passes each vector v_i through a FFNN (FFNN_NER):

h_i = FFNN_NER(v_i)

where the output layer size of FFNN_NER is the number of BIO-based NER labels. The NER layer feeds the output vectors h_i into a linear-chain CRF predictor for NER label prediction (Lafferty et al., 2001). A cross-entropy loss L_NER is calculated for NER during training, while the Viterbi algorithm is used for inference.
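The soft POS tag embedding and concatenation step can be sketched with NumPy as follows (toy sizes; the paper uses 100-dimensional soft embeddings, and the CRF predictor is omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, dim, n_tags, soft_dim = 5, 8, 4, 3  # toy sizes

E = rng.normal(size=(n_words, dim))               # contextualized embeddings e_i
P = rng.dirichlet(np.ones(n_tags), size=n_words)  # POS probability vectors p_i
W1 = rng.normal(size=(soft_dim, n_tags))          # label weight matrix W^(1)

T = P @ W1.T                      # soft POS tag embeddings t_i^(1) = W^(1) p_i
V = np.concatenate([E, T], axis=1)  # v_i = e_i concatenated with t_i^(1)
print(V.shape)                    # (5, 11)
```

Because the soft embedding is a weighted mixture over all tag rows of W^(1) rather than a lookup of the single argmax tag, POS uncertainty is propagated into the NER layer.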

Dependency parsing
The dependency parsing layer creates vectors z_1:n in which each z_i is obtained by concatenating e_i and another "soft" POS tag embedding t_i^(2):

z_i = e_i ∘ t_i^(2),   with   t_i^(2) = W^(2) p_i

Following Dozat and Manning (2017), the dependency parsing layer uses FFNNs to split each z_i into head and dependent representations. To predict potential dependency arcs, based on the head and dependent vectors, the parsing layer uses a variant of the Biaffine classifier (Qi et al., 2018) that additionally takes into account the distance and relative ordering between two words to produce a probability distribution of arc heads for each word. For inference, the Chu-Liu/Edmonds algorithm is used to find a maximum spanning tree (Chu and Liu, 1965; Edmonds, 1967). The parsing layer also uses another Biaffine classifier to label the predicted arcs, based on the corresponding head and dependent vectors. An objective loss L_DEP is computed during training by summing a cross-entropy loss for unlabeled dependency parsing and another cross-entropy loss for dependency label prediction, based on gold arcs and arc labels.
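A plain biaffine arc scorer can be sketched as below. This is a simplified illustration with toy sizes: bias terms and the distance/ordering features of the Qi et al. (2018) variant are omitted, and heads are chosen greedily rather than via Chu-Liu/Edmonds decoding:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 6                                # toy sentence length and dims

H_head = rng.normal(size=(n, d))           # head representations
H_dep = rng.normal(size=(n, d))            # dependent representations
U = rng.normal(size=(d, d))                # biaffine weight matrix

scores = H_dep @ U @ H_head.T              # scores[i, j]: score of word j
                                           # being the head of word i
# Greedy head choice per word; PhoNLP instead decodes a maximum
# spanning tree with the Chu-Liu/Edmonds algorithm:
heads = scores.argmax(axis=1)
print(scores.shape, heads.shape)
```

A second biaffine classifier of the same shape (with one output channel per dependency label) scores labels for each predicted arc.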

Joint multi-task learning
The final training objective loss L of our model PhoNLP is the weighted sum of the POS tagging loss L_POS, the NER loss L_NER and the dependency parsing loss L_DEP:

L = λ1 L_POS + λ2 L_NER + (1 − λ1 − λ2) L_DEP   (10)

Discussion: Our PhoNLP can be viewed as an extension of previous joint POS tagging and dependency parsing models (Hashimoto et al., 2017; Li et al., 2018; Nguyen and Verspoor, 2018; Nguyen, 2019; Kondratyuk and Straka, 2019), where we additionally incorporate a CRF-based prediction layer for NER. Unlike Hashimoto et al. (2017), Nguyen and Verspoor (2018), Li et al. (2018) and Nguyen (2019), which use BiLSTM-based encoders to extract contextualized feature embeddings, we use a BERT-based encoder. Kondratyuk and Straka (2019) also employ a BERT-based encoder; however, different from PhoNLP, where we construct a hierarchical architecture over the POS tagging and dependency parsing layers, Kondratyuk and Straka (2019) make their POS tagging and dependency parsing predictions independently from the shared encoder.
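The weighted objective is straightforward; a sketch with the tuned weights reported in the Implementation section (λ1 = 0.4, λ2 = 0.2, so the dependency parsing loss gets weight 0.4):

```python
def joint_loss(l_pos, l_ner, l_dep, lam1=0.4, lam2=0.2):
    """Weighted sum of the three task losses; the dependency parsing
    weight is the remainder 1 - lam1 - lam2."""
    return lam1 * l_pos + lam2 * l_ner + (1.0 - lam1 - lam2) * l_dep

print(joint_loss(1.0, 1.0, 1.0))  # 1.0
```

Because the three weights sum to 1, tuning only λ1 and λ2 fixes the dependency parsing weight automatically.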

Implementation
PhoNLP is implemented based on PyTorch (Paszke et al., 2019), employing the PhoBERT encoder implementation available from the transformers library (Wolf et al., 2020) and the Biaffine classifier implementation from Qi et al. (2020). We set both label weight matrices W^(1) and W^(2) to have 100 rows, resulting in 100-dimensional soft POS tag embeddings. In addition, following Qi et al. (2018, 2020), the FFNNs in the dependency parsing layer use 400-dimensional output layers. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a fixed batch size of 32, and train for 40 epochs. The three training sets differ in size, with the POS tagging training set being the largest (23906 sentences); thus, for each training epoch, we repeatedly sample from the NER and dependency parsing training sets to fill the gaps between the training set sizes. We perform a grid search to select the initial AdamW learning rate, λ1 and λ2, finding the optimal values at 1e-5, 0.4 and 0.2, respectively. Here, we compute the average of the POS tagging accuracy, NER F1-score and dependency parsing LAS after each training epoch on the validation sets. We select the model checkpoint that produces the highest average score over the validation sets to apply to the test sets. Each of our reported scores is an average over 5 runs with different random seeds.

Table 2 presents the results obtained for our PhoNLP and compares them with those of a baseline approach of single-task training. For the single-task training approach: (i) We follow the common approach of fine-tuning a pre-trained language model for POS tagging, appending a linear prediction layer on top of PhoBERT, as briefly described in the POS tagging section above. (ii) For NER, instead of a linear prediction layer, we append a CRF prediction layer on top of PhoBERT.
(iii) For dependency parsing, predicted POS tags are produced by the learned single-task POS tagging model; these POS tags are then represented by embeddings that are concatenated with the corresponding PhoBERT-based contextualized word embeddings, resulting in a sequence of input vectors for the Biaffine-based dependency parsing classifiers (Qi et al., 2018). Here, the single-task training approach is based on the PhoBERT base version, employing the same hyper-parameter tuning and model selection strategy that we use for PhoNLP.

Table 2: Performance results (in %) on the test sets for POS tagging (accuracy), NER (F1-score) and dependency parsing (LAS and UAS scores). "Leak." abbreviates "leakage", denoting results obtained w.r.t. the data leakage issue. "Re-spl" denotes results obtained w.r.t. the data re-split and duplication removal for POS tagging to avoid the data leakage issue. "Single-task" refers to the single-task training approach. † denotes scores taken from the PhoBERT paper. Note that "Single-task" NER is not affected by the data leakage issue.

Note that PhoBERT helps produce state-of-the-art results for multiple Vietnamese NLP tasks (including but not limited to POS tagging, NER and dependency parsing in a single-task training strategy), and obtains higher performance results than VnCoreNLP. However, in both the PhoBERT and VnCoreNLP papers (Nguyen and Nguyen, 2020; Vu et al., 2018), results for POS tagging and dependency parsing are reported w.r.t. the data leakage issue. Our "Single-task" results in Table 2 regarding "Re-spl" (i.e. the data re-split and duplication removal for POS tagging to avoid the data leakage issue) can thus be viewed as new PhoBERT results under a proper experimental setup. Table 2 shows that in both the "Leak." and "Re-spl" setups, our joint multi-task training approach PhoNLP performs better than the PhoBERT-based single-task training approach, thus producing state-of-the-art performance for the three tasks of Vietnamese POS tagging, NER and dependency parsing.

PhoNLP toolkit
We present in this section a basic usage of our PhoNLP toolkit. We make PhoNLP simple to set up, i.e. users can install PhoNLP from either source or pip (e.g. pip3 install phonlp). We also aim to make PhoNLP simple to run from both the command line and the Python API. For example, annotating a corpus with POS tagging, NER and dependency parsing can be performed by using a simple command as in Figure 2.
Assume that the input file "input.txt" in Figure 2 contains the sentence "Tôi đang làm_việc tại VinAI ." (I am working at VinAI).

python3 run_phonlp.py --save_dir ./pretrained_phonlp --mode annotate --input_file input.txt --output_file output.txt

Figure 2: Minimal command to run PhoNLP. Here "save_dir" denotes the path to the local machine folder that stores the pre-trained PhoNLP model.

Table 3: The output in the output file "output.txt" for the sentence "Tôi đang làm_việc tại VinAI ." from the input file "input.txt" in Figure 2. The output is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type.

Table 3 shows the annotated output in plain text form for this sentence. Similarly, we get the same output by using the Python API, as simply as in Figure 3. Furthermore, commands to (re-)train and evaluate PhoNLP using gold annotated corpora are detailed in the PhoNLP GitHub repository. Note that it is possible to directly employ our PhoNLP (re-)training and evaluation command scripts for other languages that have gold annotated corpora available for the three tasks and a pre-trained BERT-based language model available from the transformers library.

Speed test:
We perform a CPU-only speed test using a personal computer with an Intel Core i5 8265U 1.6GHz CPU and 8GB of memory. For a GPU-based speed test, we employ a machine with a single NVIDIA RTX 2080Ti GPU. Performing the three NLP tasks jointly, PhoNLP processes 15 sentences per second in the CPU-based test and 129 sentences per second in the GPU-based test, with an average of 23 word tokens per sentence and a batch size of 8.
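For comparison with token-level throughput figures reported elsewhere, the reported sentence speeds translate to word tokens per second as follows:

```python
# Back-of-the-envelope token throughput from the reported sentence speeds,
# using the average of 23 word tokens per sentence:
avg_tokens = 23
cpu_sps, gpu_sps = 15, 129  # sentences/second (CPU and GPU tests)
print(cpu_sps * avg_tokens, gpu_sps * avg_tokens)  # 345 2967
```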

Conclusion and future work
We have presented the first multi-task learning model, PhoNLP, for joint POS tagging, NER and dependency parsing in Vietnamese. Experiments on Vietnamese benchmark datasets show that PhoNLP outperforms its strong fine-tuned PhoBERT-based single-task training baseline, producing state-of-the-art performance results. We publicly release PhoNLP as an easy-to-use open-source toolkit and hope that PhoNLP can facilitate future NLP research and applications. In future work, we will apply PhoNLP to other languages.