Sentence Boundary Detection on Line Breaks in Japanese

For NLP, sentence boundary detection (SBD) is an essential task that decomposes a text into sentences. Most previous studies have used a simple rule that treats only typical characters as sentence boundaries. However, some characters may or may not be sentence boundaries depending on the context. Among them, we focus on line breaks. We construct new annotated corpora, implement sentence boundary detectors, and analyze the performance of SBD in several settings.


Introduction
Many NLP tasks treat a sentence as a unit of processing. The task of decomposing a text into sentences is called sentence boundary detection (SBD). In Japanese, periods (e.g. "。", "."), exclamation marks, and question marks are the delimiters that segment sentences in most cases. For this reason, most studies perform SBD by taking only the positions of these typical delimiters as sentence boundaries. For example, in the construction of the "Web Japanese N-gram database 1 " provided by Google, Inc., sentences were extracted by segmenting at these positions.
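As an illustration, the typical-delimiter rule can be sketched as follows. The exact delimiter set and the decision to keep each delimiter attached to its sentence are our assumptions, not the precise rule used by any of the cited systems.

```python
import re

# A minimal sketch of the "typical delimiter" rule: split at the Japanese
# period, exclamation marks, and question marks (both full- and half-width).
def split_typical(text):
    # The lookbehind keeps each delimiter attached to the sentence it ends.
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p for p in parts if p]
```

A rule like this cannot detect boundaries marked only by a line break, which is exactly the limitation this paper addresses.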
However, line breaks can also indicate sentence boundaries without periods, as in the following text 2 .
Many line breaks do not follow the typical delimiters. For example, 33.4% of the line breaks in the balanced corpus of contemporary written Japanese (BCCWJ) (Maekawa, 2008) were not preceded by them. On the other hand, line breaks may also be placed in the middle of a sentence. Therefore, we cannot simply treat the positions of line breaks as sentence boundaries.
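The kind of statistic reported above could be computed with a sketch like the following. The delimiter set is our assumption, and the 33.4% BCCWJ figure comes from the corpus authors, not from this code.

```python
# Count line breaks, and how many of them are NOT immediately preceded
# by a typical sentence delimiter.
def linebreak_stats(text):
    delimiters = set("。！？!?.")
    total = no_delim = 0
    for i, ch in enumerate(text):
        if ch == "\n":
            total += 1
            if i == 0 or text[i - 1] not in delimiters:
                no_delim += 1
    return total, no_delim
```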
(Among recent movies, are there any with Gary Oldman?) (2) This type of line break is used to make long sentences easy to read. Shinmori et al. (2003) performed a structural analysis of Japanese patent documents. They reported that the first claim contained in-sentence line breaks in 48.5% of the 59,968 patent documents. They explain that claims written in Japanese are commonly described in a single sentence, and that line breaks are used to improve readability.
There can be more kinds of sentence boundaries than these. Nishimura (2003) showed that there are more than six variations of Japanese sentence boundaries in an online forum, such as descriptions of actions (e.g. "( )": embarrassment, "( )": tears) and "smiley" icons (e.g. "( *ˆ ˆ*)", " ˆ ˆ "). Sakai (2013) conducted a linguistic analysis of Japanese emails written by young people on their mobile phones and found that about 63% of the emails used emoticons instead of punctuation marks as sentence boundaries.
In this paper, we focus on SBD on line breaks in Japanese. We newly construct annotated corpora to answer three research questions.


Related Work

(2011) insist that the concept of a "sentence" is fuzzier and less well defined in Chinese, and that native Chinese writers seldom follow the usage guidelines of punctuation marks. They listed the symbols used as sentence boundaries, such as whitespaces, commas, periods, and line breaks. They reported that the F1 of manual SBD is 81.18 and that of a CRF model is 77.48.
Stanza 3 (Qi et al., 2020) is a language-agnostic, fully neural pipeline for text analysis, including tokenization, multiword token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Unlike most existing toolkits, it performs tokenization and SBD at the same time by applying a bidirectional long short-term memory network (Bi-LSTM) (Graves and Schmidhuber, 2005) to the characters of a text. It provides models for 66 languages including Japanese. The Japanese model is trained with UD Japanese GSD 4 . Its architecture enables SBD on any characters, including line breaks. However, the training corpus does not contain line breaks, so the model cannot perform SBD on line breaks.

BCCWJ
The balanced corpus of contemporary written Japanese (BCCWJ) is a corpus annotated with sentence boundaries.

Jalan Corpora
We create two kinds of Japanese corpora with sentence boundary annotation, Jalan-F and Jalan-A, in order to perform experiments in another domain and another writing style. Both are composed of hotel reviews posted on Jalan 7 , a popular travel information web site. Table 1 shows the statistics of the corpora. All annotations were performed by one worker and confirmed by another. Jalan-F 8 comprises 500 reviews. We fully annotated sentence boundaries for all texts and, as a result, found 3,290 sentences. The corpus contains 1,484 line breaks, 170 of which do not segment sentences.
5 In this paper, we removed all line breaks at the end of documents because they are obvious sentence boundaries. Additionally, if there was a series of line breaks, or a space before or after a line break, we replaced it with a single line break.
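The line-break normalization described above could be sketched as follows; the exact whitespace handling (which space characters count, and how runs are collapsed) is our assumption.

```python
import re

# Drop line breaks at the end of a document, then collapse any run of line
# breaks (with flanking ASCII or ideographic spaces) into a single break.
def normalize_linebreaks(doc):
    doc = doc.rstrip("\n")  # trailing line breaks are obvious boundaries
    doc = re.sub(r"[ \u3000]*\n[ \u3000\n]*", "\n", doc)
    return doc
```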
Jalan-A 9 comprises 298 reviews in an atypical writing style: they do not contain typical Japanese periods. This is an example: (We stayed in a private room </s> The room was clean and spacious, so we'll be back again </s> The staff were great) (3) Some line breaks segment sentences and some do not. We annotated sentence boundaries only on line breaks. While the number of annotated boundaries is 1,374, there may be more sentences. The corpus contains 1,983 line breaks, 153 of which do not segment sentences.

Pseudo Annotation Corpora
To answer the third research question, we created two pseudo annotation corpora: P-BCCWJ and P-Jalan. First, we removed all line breaks from BCCWJ and from 10,000 reviews additionally extracted from Jalan. Then, we replaced the typical Japanese sentence boundary "。" with a line break and regarded all of these positions as sentence boundaries. Finally, we replaced ideographic commas "、" with line breaks with 50% probability. This is an example. Original: (It is to look into the distance from a good view.) Pseudo annotation:
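The pseudo annotation procedure above can be sketched as follows. Whether the replaced punctuation character is dropped or kept is not fully specified in the text, so dropping it is our assumption, as is the fixed random seed.

```python
import random

# Pseudo-annotate a document: remove existing line breaks, replace the
# Japanese period (。) with a line break, and replace the ideographic
# comma (、) with a line break with 50% probability.
def make_pseudo(doc, seed=0):
    rng = random.Random(seed)
    out = []
    for ch in doc.replace("\n", ""):
        if ch == "。":
            out.append("\n")            # always a sentence boundary
        elif ch == "、" and rng.random() < 0.5:
            out.append("\n")            # comma replaced half the time
        else:
            out.append(ch)
    return "".join(out)
```

Every line break in the output is, by construction, a "sentence boundary" for training purposes.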

Experiments

Our sentence boundary detectors are based on a pretrained Japanese BERT model. Texts are first tokenized with the MeCab 11 morphological parser and then split into subwords by WordPiece. The vocabulary size is 32,000. We exploit the sequence labeling implementation in "transformers" 12 by Hugging Face with three labels 13 : "Sentence boundary" (SB) and "Not sentence boundary" (NSB) for line breaks, and "Others" (O) for tokens that are not line breaks. Table 2 shows an example of input, output, and evaluation for the detectors. In training, all tokens except line breaks are labeled "O," and line breaks are labeled "SB" or "NSB." In evaluation, we use only the predictions for line breaks; whatever predictions are output for other tokens, we do not consider them. We recognize sentence boundaries only on the tokens whose predictions are "SB." We set the maximum sequence length to 320, the training batch size to 32, and the number of epochs to five. If the maximum number of input tokens is exceeded, we divide the input into multiple inputs. We perform Unicode NFKC normalization on all inputs.
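The labeling scheme above can be sketched as follows. The function and its `boundaries` argument (the set of indices of line-break tokens that are gold sentence boundaries) are hypothetical illustrations, not the authors' actual implementation.

```python
# Assign sequence labels: non-line-break tokens get "O"; line-break
# tokens get "SB" (sentence boundary) or "NSB" (not a boundary).
def make_labels(tokens, boundaries):
    labels = []
    for i, tok in enumerate(tokens):
        if tok != "\n":
            labels.append("O")
        else:
            labels.append("SB" if i in boundaries else "NSB")
    return labels
```

At prediction time, only the labels emitted for line-break positions matter; predictions on "O" positions are ignored.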
For training and evaluation, we exclude 663 documents from BCCWJ and 164 documents from Jalan-F that do not contain line breaks. Each of BCCWJ, Jalan-F, and Jalan-A is divided 8:2 into training and evaluation data. We built four models: one from each of the three training sources, and one from the combination of Jalan-F and Jalan-A.
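The 8:2 division could be sketched as below; document-level shuffling and the fixed seed are our assumptions.

```python
import random

# Split a list of documents into training and evaluation portions.
def split_corpus(docs, ratio=0.8, seed=0):
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    k = int(len(docs) * ratio)
    return docs[:k], docs[k:]
```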

Experiments 1: Impact of Domains
First, we investigate the impact of domains. As shown in Table 3, on the BCCWJ test data, the F1 score of the model Jalan-F+A (96.8) is not much lower than that of the in-domain model. This shows that we can build reasonably accurate models even with training data from different domains. On the other hand, the F1 scores on the Jalan-F test data are close to 100 for all models. Therefore, we consider that Jalan-F contains only simple cases.
11 https://github.com/taku910/mecab
12 https://github.com/huggingface/transformers
13 We did a preliminary experiment with only the binary labels "Sentence boundary" (SB) and "Not sentence boundary" (NSB), but its performance was low.

Experiments 2: Impact of Writing Styles
Second, we investigate the impact of writing styles. As shown in Table 4, the F 1 score of the model BCCWJ is the best (97.2) among the four models. This shows that models trained on a large amount of data are more accurate, even if the writing styles are different.

Experiments 3: Effect of Pseudo Corpora
Third, we investigate the effect of the pseudo corpora. Table 5 shows the results. The F1 score of the model P-BCCWJ on BCCWJ is 78.8, which is much worse than that of the model BCCWJ (98.2). The following is an example of a false negative (FN) by the model P-BCCWJ.
</s> . . . (It is necessary to build disaster prevention measures. </s> In the fire and disaster management agency, . . .) The model was often wrong even in almost obvious cases where a period "。" appeared just before the line break.
The F1 scores of the models P-BCCWJ and P-Jalan are 94.8 and 92.8, respectively. Though they are better than that of the model Jalan-F (90.2), they are worse than that of the model Jalan-F+A (95.1).
These results suggest that although a sentence boundary detector trained on a pseudo corpus can achieve moderate performance, we can obtain better detectors by training on annotated corpora.

Conclusion
We implemented sentence boundary detectors by using BERT and revealed the following:
• It is possible to train a sentence boundary detector on line breaks with annotated corpora.
• Training with a large amount of annotated data is effective even for texts in another writing style.
• Although it is possible to train a sentence boundary detector with a pseudo corpus to some extent, more performance is gained by training with annotated corpora.
There are two main issues to address in the future. The first is to use active learning to increase the number of training examples and improve accuracy. The second is to perform SBD on atypical sentence boundary expressions other than line breaks.