A Joint Model for Document Segmentation and Segment Labeling

Text segmentation aims to uncover latent structure by dividing text from a document into coherent sections. Where previous work on text segmentation considers the tasks of document segmentation and segment labeling separately, we show that the tasks contain complementary information and are best addressed jointly. We introduce Segment Pooling LSTM (S-LSTM), which is capable of jointly segmenting a document and labeling segments. In support of joint training, we develop a method for teaching the model to recover from errors by aligning the predicted and ground truth segments. We show that S-LSTM reduces segmentation error by 30% on average, while also improving segment labeling.


Introduction
A well-written document is rich not only in content but also in structure. One type of structure is the grouping of content into topically coherent segments. These segmented documents have many uses across various domains and downstream tasks. Segmentation can, for example, be used to convert unstructured medical dictations into clinical reports (Sadoughi et al., 2018), which in turn can help with medical coding (since a diagnosis mentioned in a "Medical History" section might be different from a diagnosis mentioned in an "Intake" section (Ganesan and Subotin, 2014)). Segmentation can also be used downstream for retrieval (Hearst and Plaunt, 2002; Edinger et al., 2017; Allan et al., 1998), where it can be particularly useful when applied to informal text or speech that lacks explicit segment markup. Topically segmented documents are also useful for pre-reading (the process of skimming or surveying a text prior to careful reading), thus serving as an aid for reading comprehension (Swaffar et al., 1991; Ajideh, 2003).

* Work done while interning at Adobe.

Uncovering latent, topically coherent segments of text is difficult because it requires solving a chicken-and-egg problem: determining the segment topics is easier if segment boundaries are given, and identifying the boundaries of segments is easier if the topic(s) addressed in parts of the document are known. Prior approaches to text segmentation can largely be split into two categories that break the cycle by sequentially solving the two problems: those that attempt to directly predict segment bounds (Koshorek et al., 2018), and those that attempt to predict topics per passage (e.g., per sentence) and use measures of coherence for post hoc segmentation (Hearst, 1997; Arnold et al.; Eisenstein and Barzilay, 2008; Riedl and Biemann, 2012; Glavaš et al., 2016).
The benefit of the topic modeling approach is that it can work in unsupervised settings where collecting ground truth segmentations is difficult and labeled data is scarce (Eisenstein and Barzilay, 2008;Choi, 2000). Recent work uses Wikipedia as a source of segmentation labels by eliding the segment bounds of a Wikipedia article to train supervised models (Koshorek et al., 2018;Arnold et al.). This enables models to directly learn to predict segment bounds or to learn sentence-level topics and perform post hoc segmentation.
Our work is motivated by the observation that segment bounds and topicality are tightly interwoven, and should ideally be considered jointly rather than sequentially. We start by examining three properties of text segmentation: (1) segment bounds and segment labels contain complementary supervisory signals, (2) segment labels are a product of lower-level (e.g., sentence) labels which must be composed, and (3) the labeler should not learn only from ground truth segmentations at training time, but should instead learn to be robust to segmentation errors. These properties build on previous work discussed in Section 2. We experimentally evaluate and verify each of these properties in Section 5 with respect to a document segmentation and segment labeling task.
Taking advantage of these properties, we propose a neural model that jointly segments and labels without committing to a priori segmentations, Segment Pooling LSTM (S-LSTM). It consists of three components: a segment proposal LSTM (discussed in Section 3.2), a segment pooling layer (Section 3.3), and a segment aligner for training and evaluation (Section 3.4).
Our main contribution is a model that performs segmentation and labeling jointly rather than separately. By virtue of joint inference, our model takes advantage of the complementary supervisory signals for segmentation and topic inference, considers the contribution of all sentences to the segment label, and avoids committing to early errors in low-level inference.
Our approach improves over neural and non-neural baselines on the document segmentation task. We use a dataset of Wikipedia articles described in Section 5 for training and evaluation. We show that S-LSTM is capable of reducing segmentation error by, on average, 30% while also improving segment classification. We also show that these improvements hold on out-of-domain datasets.

Related Work
Coherence-based Segmentation. Much work on text segmentation uses measures of coherence to find topic shifts in documents. Hearst (1997) introduced the TextTiling algorithm, which uses term co-occurrences to find coherent segments in a document. Eisenstein and Barzilay (2008) introduced BayesSeg, a Bayesian method that can incorporate other features such as cue phrases. Riedl and Biemann (2012) later introduced TopicTiling, which uses coherence shifts in topic vectors to find segment bounds. Glavaš et al. (2016) proposed GraphSeg, which constructs a semantic relatedness graph over the document using lexical features and word embeddings, and segments using cliques. Nguyen et al. (2012) proposed SITS, a model for topic segmentation in dialogues that incorporates a per-speaker likelihood to change topics.
While the above models are unsupervised, Arnold et al. introduced a supervised method to compute sentence-level topic vectors using Wikipedia articles. The authors created the WikiSection dataset and proposed the SECTOR neural model. The SECTOR model predicts a label for each sentence, and then performs post hoc segmentation by looking at the coherence of the latent sentence representations, addressing segmentation and labeling separately. We propose a model capable of jointly learning segmentation boundaries and segment-level labels at training time. Our segmentation does not rely on measures of coherence, and can instead learn from signals in the data, such as cue phrases, to predict segment bounds, while still performing well at the segment labeling task.
Supervised Segmentation. An alternative to using measures of topical coherence to segment text is to learn to directly predict segment bounds from labeled data. This was the approach taken in Koshorek et al. (2018), where the authors used Wikipedia as a source of training data to learn text segmentation as a supervised task. However, learning only to predict segment bounds does not necessarily capture the topicality of a segment that is useful for informative labeling.
The task of document segmentation and labeling is well-studied in the clinical domain, where both segmenting and learning segment labels are important tasks. Pomares-Quimbaya et al. (2019) provide a current overview of work on clinical segmentation. Ganesan and Subotin (2014) trained a logistic regression model on a clinical segmentation task, though they did not consider the task of segment labeling. Tepper et al. (2012) considered both tasks of segmentation and segment labeling, and proposed a two-step pipelined method that first segments and then classifies the segments. Our proposed model is trained jointly on both the segmentation and segment labeling tasks. Concurrent work considers the task of document outline generation (Zhang et al., 2019). The goal of outline generation is to segment and generate (potentially hierarchical) headings for each segment. The authors propose the HiStGen model, a hierarchical LSTM model with a sequence decoder. The work offers an alternative view of the joint segmentation and labeling problem, and is evaluated using exact match for segmentation and ROUGE (Lin, 2004) for heading generation if the segment is predicted correctly. In contrast, we evaluate our models using a commonly-used probabilistic segmentation measure, P_k, which assigns partial credit to incorrect segmentations (Beeferman et al., 1999). We also use an alignment technique to assign partial credit to labels of incorrect segmentations, both for training and evaluation. In addition, we explicitly consider the problem of model transferability, evaluating the pretrained models on additional datasets.

IOB Tagging. The problem of jointly learning to segment and classify is well-studied in NLP, though largely at a lower level, with Inside-Outside-Beginning (IOB) tagging (Ramshaw and Marcus, 1999). Conditional random field (CRF) decoding has long been used with IOB tagging to simultaneously segment and label text, e.g., for named entity recognition (NER; McCallum and Li, 2003). Until BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), the models that performed best at joint segmentation/classification tasks like NER or phrase chunking were IOB tagging models, typically LSTMs with a CRF decoder (Lample et al., 2016). Tepper et al. (2012) proposed the use of IOB tagging to segment and label clinical documents, but argued for a pipelined approach.
CRF-decoded IOB tagging models are more difficult to apply to the multilabel case. Segment bounds need to be consistent across all labels, so modeling the full |L| × |L| transition matrix (where |L| is the size of the label space) at every time step is computationally expensive. In contrast, our joint model performs well at multilabel prediction, while also outperforming a neural CRF-decoded model on a single-label labeling task.

Modeling
In order to jointly model document segmentation and segment classification, we introduce the Segment Pooling LSTM (S-LSTM) model. S-LSTM is a supervised model trained both to predict segment bounds and to pool over and classify the segments. The model consists of three components: a sentence encoder (Section 3.1), a segment predictor LSTM (Section 3.2), and a segment pooling network which pools over predicted segments to classify them (Section 3.3). The segment predictor is allowed to make mistakes that the labeler must learn to be robust to, a process we refer to as exploration and accomplish by aligning predicted and ground truth segments (Section 3.4). The full architecture is presented in Figure 1, and the loss is discussed in Section 3.5.

Encoding Sentences
The first stage is encoding sentences. S-LSTM is agnostic to the choice of sentence encoder, though in this work we use a concat pooled bi-directional LSTM (Howard and Ruder, 2018). First, the embedded words are passed through the LSTM encoder. Then, the maximum and mean of all hidden states are concatenated with the final hidden states, and this is used as the sentence encoding.
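As a minimal sketch, the concat pooling step can be written as follows, simplified to plain arrays and a single direction (in the full model, the hidden states come from a bi-directional LSTM and the final states of both directions are included):

```python
import numpy as np

def concat_pool(hidden_states):
    """Concat pooling over a sequence of LSTM hidden states.

    hidden_states: array of shape (seq_len, hidden_dim).
    Returns a single vector [last; max; mean] of size 3 * hidden_dim.
    """
    last = hidden_states[-1]           # final hidden state
    mx = hidden_states.max(axis=0)     # element-wise max over time
    mean = hidden_states.mean(axis=0)  # element-wise mean over time
    return np.concatenate([last, mx, mean])
```

The resulting vector is the sentence encoding consumed by the later stages.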

Predicting Segment Bounds
The second step of our model is a Segment Predictor LSTM, which predicts segment boundaries within the document. For this step we use a bi-directional LSTM that consumes each sentence vector and predicts an indicator variable marking the sentence as (B)eginning or (I)nside a segment; that is, whether or not the sentence starts a new segment. It is trained from pre-segmented documents using a binary cross entropy loss. This is similar to the approach taken by TextSeg in Koshorek et al. (2018), though we do not estimate a threshold, τ, and instead learn to predict the two classes (B)eginning and (I)nside directly.
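Decoding the per-sentence B/I indicators into segment spans is straightforward; a minimal sketch, representing segments as half-open [start, end) sentence-index ranges:

```python
def bi_tags_to_segments(tags):
    """Convert per-sentence (B)eginning/(I)nside tags into
    [start, end) sentence-index spans, one per segment.
    The first sentence always starts a segment, whatever its tag."""
    segments, start = [], 0
    for i, tag in enumerate(tags):
        if tag == "B" and i > 0:  # a new segment begins here
            segments.append((start, i))
            start = i
    if tags:
        segments.append((start, len(tags)))
    return segments
```

For example, the tag sequence B, I, I, B, I yields two segments covering sentences 0-2 and 3-4.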

Segment Pooling
After segmenting the document, the third stage of the model pools within the predicted segments to predict a label for each segment. The sentence vectors for the predicted segments are all grouped, and a pooling function is run over them. There are several possible sequence-to-vector pooling functions that could be used, such as averaging, and more complex learned pooling functions, such as LSTMs. The full S-LSTM model uses a concat pooling LSTM, and our experimental results show that this yields a better segment label than just averaging. We then use a classifier following the output of the segment pooler, which can provide a distribution over labels for each segment.
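The simpler mean-pooling variant (the "-pool" ablation reported later) can be sketched as below; the full S-LSTM replaces the average with a concat pooling LSTM run over each segment's sentence vectors:

```python
import numpy as np

def mean_pool_segments(sentence_vecs, segments):
    """Average the sentence vectors inside each predicted segment,
    yielding one vector per segment for the downstream classifier.

    sentence_vecs: array of shape (num_sentences, dim).
    segments: list of [start, end) sentence-index spans.
    """
    return [sentence_vecs[s:e].mean(axis=0) for s, e in segments]
```

Each pooled vector is then fed to the classifier to produce a distribution over labels for that segment.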
The combination of segment prediction and pooling is one way that S-LSTM is different from previous hierarchical LSTM models. The model can predict and label segments dynamically, generating a single vector for predicted segments.

Segment Alignment and Exploration
Because segments can be considered dynamically at training time, we propose a method of assigning labels to potentially incorrect segments by aligning the predicted segments with ground truth segments. This label assignment allows segment-labeling loss to be propagated through the end-to-end model.
Teacher Forcing. Teacher forcing, or feeding ground truth inputs into a recurrent network as opposed to model predictions, was first developed in Williams and Zipser (1989). The idea is to use ground truth values for inputs that would normally come from model predictions during the first stages of training, to help with convergence. For S-LSTM, it is the simplest approach to segment pooling and alignment: at training time, feed the ground truth segments (as opposed to the predicted segments) to the segment pooler (step 3 in Figure 1). This gives a one-to-one alignment of "predicted" (forced) segments and ground truth segments, as opposed to using only the predicted segments as the bounds for the segment pooler.
Exploration. Employing only teacher forcing does not allow the segment labeler to learn how to recover from errors in segmentation. The mechanism for allowing the model to explore incorrect segmentations is to align the predicted segments with overlapping ground truth segments at training time, and treat all aligned ground truth labels as correct. While many alignments are possible, we use the one presented in Figure 2. This many-to-many alignment ensures that every ground truth segment is mapped to at least one predicted segment and every predicted segment is mapped to at least one ground truth segment. We can additionally schedule teacher forcing. At the beginning, when the segmentation prediction network performs poorly, the model pools over only ground truth segment bounds, allowing it to learn the cleanest topic representations. As training progresses and the segmentation accuracy begins to converge, we switch from pooling over ground truth segments to aligning predicted and ground truth segments. In this way, the segment pooler learns to be robust to segmentation errors.
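The greedy many-to-many alignment of Figure 2 can be sketched as follows, with segments as half-open [start, end) sentence spans (tie-breaking by first index is an assumption of this sketch, not specified in the figure):

```python
def overlap(a, b):
    """Number of sentences shared by two [start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def align_segments(predicted, gold):
    """Greedy many-to-many alignment of predicted and ground truth segments.

    Each gold segment first aligns to its maximally overlapping predicted
    segment; each leftover predicted segment then aligns to its maximally
    overlapping gold segment. Returns (pred_idx, gold_idx) pairs, used to
    assign gold labels to predicted segments during training.
    """
    pairs = []
    for gi, g in enumerate(gold):
        pi = max(range(len(predicted)), key=lambda i: overlap(predicted[i], g))
        pairs.append((pi, gi))
    matched = {pi for pi, _ in pairs}
    for pi, p in enumerate(predicted):
        if pi not in matched:  # leftover predicted segment
            gi = max(range(len(gold)), key=lambda i: overlap(p, gold[i]))
            pairs.append((pi, gi))
    return sorted(pairs)
```

By construction, every gold segment appears in at least one pair and so does every predicted segment, matching the guarantee described above.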

Joint Training
To jointly train the model, we use a multi-task loss, where y_seg are the labels for the segment prediction LSTM and y_cls are the segment labels. In addition, we pass in an aligner, which determines how to align the predicted segments with the ground truth segments to compute the loss, and either teacher-forces the model or allows it to explore.
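As a sketch, assuming an unweighted sum of the two task losses (the relative weighting is a modeling choice not fixed by the description above): binary cross entropy over the per-sentence boundary indicators plus cross entropy over the aligned segment labels.

```python
import numpy as np

def bce(p, y):
    """Binary cross entropy over boundary probabilities p and 0/1 targets y."""
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(y, dtype=float)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def ce(probs, labels):
    """Cross entropy over per-segment label distributions."""
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]  # probability of the gold label
    return float(-np.log(np.clip(picked, 1e-7, None)).mean())

def joint_loss(seg_probs, y_seg, cls_probs, y_cls):
    """Multi-task loss: segmentation BCE plus segment-label CE, summed."""
    return bce(seg_probs, y_seg) + ce(cls_probs, y_cls)
```

In practice the segment labels fed to `ce` are those assigned by the aligner (or by teacher forcing), so the labeling loss can be propagated through the end-to-end model even when the predicted segmentation is imperfect.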

Experimental Setup
We follow the experimental procedure of Arnold et al. to evaluate S-LSTM for the tasks of document segmentation and segment labeling.

Datasets
WikiSection. Arnold et al. introduced the WikiSection dataset, which contains Wikipedia articles across two languages (English and German) and domains (Cities and Diseases). Articles are segmented using the Wikipedia section structure. The heading of each segment is retained, as well as a normalized label for each heading type (e.g. History, Demography), drawn from a restricted label vocabulary. There are two tasks: (1) jointly segment the document and assign a single restricted-vocabulary label to each segment, and (2) predict the bag-of-words in the title of the Wikipedia section as a label. For instance, the bag-of-words label for the title of this section would be the words [Dataset, Experimental, Setup]. For the second task, we post-process headers to remove stopwords, numbers, and punctuation. We then remove words that occur fewer than 20 times in the training data to get the final label vocabulary sizes. Of note, we encountered a smaller label vocabulary for the bag-of-words generation task than that reported by Arnold et al. For the four datasets, the originally reported header vocabulary sizes were [1.5k, 1.0k, 2.8k, 1.1k]; when reproducing earlier results, we verified with the dataset authors that the actual sizes were [179, 115, 603, 318].

Figure 2: Greedy many-to-many alignment. This alignment is used to assign ground-truth labels to predicted segments for training. Each ground truth segment first aligns to the maximally overlapping predicted segment; each leftover predicted segment then aligns to the maximally overlapping ground truth segment.

Figure 3: Computing P_k. (1) Slide a probe of length k over the items. (2) Increase a counter by 1 whenever: (a) the items are in the same segment in the ground truth, but not the predictions; or (b) the items are in different segments in the ground truth, but not the predictions. (3) Divide the counter by the number of measures taken. A sliding window of length k is run over the text, and a counter increments whenever the same/different status for the two ends of the window doesn't match between the ground truth and predicted segmentation.
The first task aligns closely with the clinical domain, in which headers are typically drawn from a fixed label set (Tepper et al., 2012). The second aligns more closely with learning to segment and label from naturally labeled data, such as contracts or Wikipedia articles, which can potentially then be transferred (Koshorek et al., 2018).

Wiki-50. The Wiki-50 dataset was introduced as a test set in Koshorek et al. (2018), which also introduced the full Wiki-727k dataset. It contains 50 randomly sampled Wikipedia articles, segmented and with their headers, and was used to evaluate computationally expensive methods such as BayesSeg (Eisenstein and Barzilay, 2008).

Clinical. We also evaluate on the Clinical dataset of Eisenstein and Barzilay (2008), which has segment boundaries but no headings.

Experimental Design
We evaluate S-LSTM against previous document segmentation and segment labeling approaches on all four WikiSection datasets: English-language Diseases (en_disease), German-language Diseases (de_disease), English-language Cities (en_city), and German-language Cities (de_city). We evaluate both the single-label and multi-label tasks.
Model Ablation. In order to understand the effect of our proposed segment pooling and segment exploration strategies, we also include results for simpler baselines for each of these modules. For the segment labeling we report not only the full S-LSTM model with LSTM pooling, but also additionally a mean pooling model, which we denote with "-pool". For the segment exploration we report not only the model with exploration, but also a model only trained using teacher forcing, which we denote with "-expl".
Model Transferability. To evaluate model transferability, we test models trained on the English WikiSection tasks (en_disease and en_city) on the Cities, Elements, Wiki-50, and Clinical datasets.

Figure 4: Example segmentations of a Wikipedia article about the town of Żelechów.

Evaluation Measures
Segmentation: P_k. P_k is a probabilistic measure (Beeferman et al., 1999) that works by running a sliding window of width k over the predicted and ground truth segments, and counting the number of times there is disagreement about the two ends of the probe being in the same or different sections (see Figure 3). The number of disagreements is then divided by the total number of window positions, resulting in a score normalized between 0 and 1 (lower is better). Our segmentation results are reported setting k to half the average size of the ground truth segments.
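A minimal sketch of the metric, with segmentations given as sets of boundary indices (a boundary at i separates items i-1 and i; edge handling varies between implementations):

```python
def pk(gold_bounds, pred_bounds, n, k):
    """Compute P_k over n items with window width k.

    gold_bounds / pred_bounds: sets of boundary indices, where a
    boundary at index i separates items i-1 and i.
    """
    def same_segment(bounds, i, j):
        # i and j are in the same segment iff no boundary falls in (i, j]
        return not any(i < b <= j for b in bounds)

    positions = range(n - k)
    disagreements = sum(
        1 for i in positions
        if same_segment(gold_bounds, i, i + k) != same_segment(pred_bounds, i, i + k)
    )
    return disagreements / len(positions)
```

For instance, with six items, k = 2, and a single gold boundary at index 3, a prediction with no boundaries disagrees at two of the four window positions, giving P_k = 0.5, while the correct prediction gives 0.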
Classification: F1, MAP, and Prec@1. For classification, we report three different measures, depending on the task. For the single-label tasks, we report F1 and Mean Average Precision (MAP). For evaluating the bag-of-words (multilabel) tasks, we report Precision at the first rank position (Prec@1) and MAP. In both cases, these are computed by first aligning the predicted segments with the ground truth segments as shown in Figure 2 and described in Section 3.4. In all cases, the metrics are micro-averaged.

Baselines
We report C99 (Choi, 2000), TopicTiling (Riedl and Biemann, 2012), and TextSeg (Koshorek et al., 2018) as baselines on WikiSection segmentation. For a neural baseline, we report the SECTOR model (Arnold et al.) with pre-trained embeddings, denoted in the paper as SEC>T,H+emb. For the additional datasets, we report GraphSeg (Glavaš et al., 2016), BayesSeg (Eisenstein and Barzilay, 2008) and pretrained TextSeg and SECTOR models. In addition, we implemented an LSTM-LSTM-CRF IOB tagging model following Lample et al. (2016). This is only used for the single-label experiments, as CRF-decoded IOB tagging models are more difficult to apply to the multilabel case.

Model Setup
For each task and dataset, we use the same set of hyperparameters: Adam optimizer (Kingma and Ba, 2015) with learning rate 0.001 and weight decay 0.9. Dropout (Srivastava et al., 2014) is applied after each layer except the final classification layers; we use a single dropout probability of 0.1 for every instance. For models with exploration, we employ teacher forcing for 10 epochs. Model weights are initialized using Xavier normal initialization (Glorot and Bengio, 2010). All LSTM hidden-layer sizes are set to 200. We use fixed 300-dimensional FastText embeddings (Bojanowski et al., 2017) for both English and German, and project them down to 200 dimensions using a trainable linear layer.
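The teacher forcing schedule amounts to a simple switch (a sketch; the 10-epoch threshold mirrors the setting above):

```python
def use_teacher_forcing(epoch, forcing_epochs=10):
    """Pool over ground truth segment bounds for the first
    `forcing_epochs` epochs; afterwards, pool over predicted
    segments aligned to ground truth (exploration)."""
    return epoch < forcing_epochs
```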

Results and Analysis
There are five major takeaways from the experimental results and analysis. First, the jointly trained S-LSTM model shows major improvement over prior work that modeled the document segmentation and segment labeling tasks separately. Second, segment alignment and exploration during training reduce error rates. Third, the segment pooling layer leads to improvements for both segmentation and segment labeling. Fourth, S-LSTM outperforms an IOB-tagging CRF-decoded model for single-label segment labeling, and also generalizes easily and tractably to multi-labeling. Fifth, a deeper analysis of the joint modeling demonstrates that segment labeling and segment bound prediction contain complementary information.

Table: Comparison against the baselines C99 (Choi, 2000), TopicTiling (Riedl and Biemann, 2012), and TextSeg (Koshorek et al., 2018), and the best neural SECTOR models from Arnold et al.

Structure Predicts Better Structure
Tables 1 and 2 show that by explicitly predicting segment bounds we can improve segmentation by a large margin. On the header prediction task (Table 2), we reduce P_k by an average of over 30% across the WikiSection datasets. P_k was consistent across both WikiSection tasks, and, unlike the results reported by Arnold et al., did not degrade when going from single-label to multi-label prediction. This shows that we can achieve a more robust segmentation by jointly modeling segmentation and labeling. This is also clear from Figure 4, where S-LSTM predicts a much more accurate segmentation.

Exploration Allows Error Recovery
The results of an ablation experiment (Table 2, bottom) show that there is an additional classification gain from allowing the model to explore and recover from segmentation errors. Exploration has the important property of letting the model optimize more closely to how it is evaluated. This follows a long line of work in NLP showing that tasks such as dependency parsing, constituency parsing (Goodman, 1996), and machine translation (Och, 2003) all improve when trained with a loss that aligns with evaluation. Teacher forcing was important at the beginning of model training: variants of S-LSTM trained without initial teacher forcing, which instead explored from their own (initially poor) segmentations, failed to converge and performed universally poorly.
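A minimal sketch of this training regime (function names and the overlap-based alignment scheme are our illustrative assumptions, not necessarily the authors' exact procedure): gold segment bounds are teacher-forced early on, after which the model explores its own predicted bounds, with each predicted segment aligned to the maximally overlapping gold segment to obtain a supervision label.

```python
def segments_for_epoch(epoch, gold_bounds, pred_bounds, teacher_forcing_epochs=10):
    """Teacher-force gold segment bounds early; explore predicted bounds later."""
    return gold_bounds if epoch < teacher_forcing_epochs else pred_bounds

def align_labels(pred_bounds, gold_bounds, gold_labels):
    """Label each predicted (start, end) span with the label of the maximally
    overlapping gold span, so the labeling loss stays defined even when the
    predicted segmentation is wrong."""
    labels = []
    for ps, pe in pred_bounds:
        overlaps = [max(0, min(pe, ge) - max(ps, gs)) for gs, ge in gold_bounds]
        labels.append(gold_labels[overlaps.index(max(overlaps))])
    return labels
```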

S-LSTM Can Take Advantage of Both of These, Plus Segment Pooling
S-LSTM is capable of taking advantage of the complementary information by jointly learning to segment and label, and of learning to recover from segmentation errors by exploring towards the end of training. But the ablation study shows that there is one more important component of S-LSTM that allows it to improve over previous baselines: LSTM pooling over segments. The addition of the segment pooling layer improves MAP and Prec@1 across all four datasets in the heading prediction task (Table 2), comparing the model without exploration (S-LSTM,-expl) to the model without exploration or segment pooling, which uses average pooling instead (S-LSTM,-expl,-pool). It is the combination of these three improvements that comprises the full S-LSTM.
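A hedged sketch of LSTM segment pooling (PyTorch assumed; the class and argument names are ours, not the authors' code): an LSTM reads the sentence encodings within each segment, and its final hidden state becomes the segment representation, in place of simply averaging the sentence encodings.

```python
import torch
import torch.nn as nn

class SegmentPooler(nn.Module):
    """Pool per-sentence encodings into one vector per segment via an LSTM."""

    def __init__(self, dim=200):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, sent_encs, bounds):
        # sent_encs: (num_sentences, dim); bounds: list of (start, end) spans.
        reps = []
        for start, end in bounds:
            _, (h_n, _) = self.lstm(sent_encs[start:end].unsqueeze(0))
            reps.append(h_n[-1, 0])      # final hidden state of the segment
        return torch.stack(reps)         # (num_segments, dim)
```

The pooled representations then feed the segment-label classifier, one prediction per segment rather than per sentence.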

S-LSTM Outperforms a CRF Baseline
In Table 1, the results demonstrate that S-LSTM outperforms the LSTM-LSTM-CRF baseline in almost every case for single-labeling, and in every case for segmentation. This makes S-LSTM a useful model choice for cases like clinical segmentation and labeling, where segments are drawn from a small fixed vocabulary. S-LSTM also generalizes easily to multi-label problems, in contrast to an IOB-tagging LSTM-LSTM-CRF, since it only requires changing the loss over pooled segments from cross-entropy to binary cross-entropy.
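Concretely, under the usual PyTorch formulation (a sketch with made-up shapes and label indices, not the authors' code), the single-label to multi-label switch is a one-line change of loss over the per-segment logits:

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 27)  # (num_segments, num_labels); 27 is an example label count

# Single-label: each segment has exactly one class.
single_loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 11]))

# Multi-label: each label is an independent Bernoulli; only the loss changes.
targets = torch.zeros(2, 27)
targets[0, 3] = 1.0
targets[1, [5, 11]] = 1.0
multi_loss = nn.BCEWithLogitsLoss()(logits, targets)
```

An IOB-tagging CRF has no comparably simple extension, since its label space would have to grow to cover combinations of labels.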

Predicting Structure Predicts Better Labels (and vice versa)
Though we compare with TextSeg (a neural model that predicts segment bounds) and SECTOR (a neural model that predicts per-sentence labels and segments them post hoc), and show improvements over both, we also directly test the hypothesis that the segmentation and segment labeling tasks contain complementary information. To do so, we conduct two experiments: (1) we fix the segment bounds at training and evaluation time, only training the model to label known segments (results in Table 5); and (2) we have the model predict only segment bounds (results in Table 4). In both cases, adding the loss from the companion task improves performance on the main task. This shows that the two tasks contain complementary information, and directly validates our core hypothesis that the two tasks are tightly interwoven. Thus, considering them jointly improves performance on both tasks.

Conclusion and Future Work
In this paper we introduce the Segment Pooling LSTM (S-LSTM) model for joint segmentation and segment labeling tasks. We find that the model dramatically reduces segmentation error (by 30% on average across four datasets) while improving segment labeling accuracy compared to previous neural and non-neural baselines for both single-label and multi-label tasks. Experiments demonstrate that jointly modeling segmentation and segment labeling, segment alignment and exploration, and segment pooling each contribute to S-LSTM's improved performance.
S-LSTM is agnostic as to the sentence encoder used, so we would like to investigate the potential usefulness of transformer-based language models as sentence encoders. There are additional engineering challenges associated with using models such as BERT as sentence encoders, since encoding entire documents can be too expensive to fit on a GPU without model parallelism. We would also like to investigate the usefulness of an unconsidered source of document structure: the hierarchical nature of sections and subsections. Like segment bounds and headers, this structure is naturally available in Wikipedia. Having shown that segment bounds contain useful supervisory signal, it would be interesting to examine if segment hierarchies might also contain useful signal.