Eliciting Knowledge from Experts: Automatic Transcript Parsing for Cognitive Task Analysis

Cognitive task analysis (CTA) is a type of analysis in applied psychology aimed at eliciting and representing the knowledge and thought processes of domain experts. In CTA, heavy human labor is often required to parse interview transcripts into structured knowledge (e.g., flowcharts for different actions). To reduce human effort and scale the process, automated CTA transcript parsing is desirable. However, this task poses unique challenges: (1) it requires understanding long-range context information in conversational text; and (2) the amount of labeled data is limited and indirect, i.e., context-aware, noisy, and low-resource. In this paper, we propose a weakly-supervised information extraction framework for automated CTA transcript parsing. We partition the parsing process into a sequence labeling task and a text span-pair relation extraction task, with distant supervision from human-curated protocol files. To model long-range context information for extracting sentence relations, neighboring sentences are included as part of the input. Different types of models for capturing context dependency are then applied. We manually annotate real-world CTA transcripts to facilitate the evaluation of the parsing tasks.


Introduction
Cognitive task analysis (CTA) is a powerful tool for training, instructional design, and development of expert systems (Woods et al., 1989; Clark and Estes, 1996), focusing on eliciting the knowledge and thought processes of domain experts (Schraagen et al., 2000). Traditional CTA methods require interviewing domain experts and parsing the interview transcript (transcript) into structured text describing processes (protocol, shown in Fig. 1). However, parsing transcripts requires heavy human labor, which becomes the major hurdle of scaling up CTA. Therefore, automated approaches to extract structured knowledge from CTA interview transcripts are important for expert systems using massive procedural data.

Figure 1: An example of a CTA interview transcript and the human-parsed structured text (protocol). In the protocol, phrases (called protocol phrases) are abstractive descriptions of actions in the transcript, split by the highlighted line numbers indicating their sources in the transcript. In the transcript, the highlighted numbers are line numbers, and the bolded text spans are those matched by protocol phrases. The highlighted line numbers, provided by human parsing, constrain the mapping of protocol phrases back to the transcript, but they are noisy and point back to a large scope of sentences rather than the text span we want to extract.
A natural realization of automated CTA is to apply relation extraction (RE) models to parse interview text. However, the key challenge here is the lack of direct sentence-level supervision for training RE models, because the only available supervision, protocols, are document-level transcript summaries. Furthermore, the information about relations between procedural actions spreads all over the transcripts, which burdens the RE model with processing global information of the text. One previous work (Park and Motahari Nezhad, 2018) studies extracting procedure information from well-structured text using OpenIE and sentence-pair RE models. In this work, however, we focus on unstructured conversational text (i.e., CTA interview transcripts), for which OpenIE is inapplicable.
To address the above challenges, we develop a novel method to effectively extract and leverage weak (indirect) supervision signals from protocols. The key observation is that these protocols are structured at the phrase level (cf. Fig. 1). We split each protocol into a set of protocol phrases. Each protocol phrase is associated with a line number that points back to one sentence in the original transcript. Then, we can map these protocol phrases back to text spans in transcript sentences and obtain useful supervision signals from three aspects. First, these matched text spans provide direct supervision labels for training a text span extraction model. Second, the procedural relations between protocol phrases are transformed into relations between text spans within sentences, which enables us to train RE models. Finally, the local contexts around text spans provide strong signals and can enhance the mention representation in all RE models.
Our approach consists of the following steps: (1) parse the original protocol into a collection of protocol phrases together with their procedural relations, using a deterministic finite automaton (DFA); (2) match the protocol phrases back to text spans in transcripts using fuzzy matching (Pennington et al., 2014; Devlin et al., 2018); (3) generate a text span extraction dataset and train a sequence labeling model (Finkel et al., 2005; Liu et al., 2017) for text span extraction; (4) generate a text span-pair relation extraction (span-pair RE) dataset and fine-tune a pre-trained context-aware span-pair RE model (Devlin et al., 2018). With the trained models, we can automatically extract text spans summarizing actions from transcripts, along with the procedural relations among them. Finally, we assemble the results into protocol knowledge, which lays the foundation for CTA.
We explore our approach from multiple aspects: (i) we experiment with different fuzzy matching methods, relation extraction models, and sequence labeling models; (ii) we present models for context-aware span-pair RE; (iii) we evaluate the approach on real-world data with human annotations, which demonstrates that the best fuzzy matching method achieves 47.1% mention-level accuracy, the best sequence labeling model achieves 38.18% token-level accuracy, and the best text span-pair relation extraction model achieves 74.4% micro F1.

Related Work
Our work is closely related to procedural extraction; however, we focus on conversational text from CTA interviews, which is in a low-resource setting where no sentence-by-sentence label is available.

Cognitive task analysis. Cognitive task analysis is a powerful tool for extracting the knowledge and thought processes of experts, widely used in different domains (Schraagen et al., 2000; Seamster and Redding, 2017). Yet, it is time-consuming and not scalable. In recent years, with the development of natural language processing, techniques have been introduced to aid human expertise (Zhong et al., 2015; Roose et al., 2018). Li et al. (2013) used a learning agent to discover cognitive models in specific domains. Chaplot et al. (2018) explored modeling cognitive knowledge in well-defined tasks with neural models. However, for the most general setting of extracting cognitive processes from interviews, we still need substantial expertise to interpret the interview transcript.

Procedural extraction. Recent advances in machine reading comprehension, textual entailment (Devlin et al., 2018), and relation extraction (Zhang et al., 2017) show that contemporary NLP models have the capability of capturing causal relations to some degree. However, it is still an open problem to extract procedural information from text. There have been some attempts to extract similar procedural information from well-structured instructional text from how-to communities. Park and Motahari Nezhad (2018) treated procedural extraction as a relation extraction problem on sentence pairs extracted by pattern matching. They used OpenIE for pattern extraction and a hierarchical LSTM to classify relation labels of sentence pairs.

Pre-trained language representations. Recent research has shown that language models generically trained on massive corpora are beneficial to various specific NLP tasks (Pennington et al., 2014; Devlin et al., 2018). Language representation has been an active area of research for years. Many effective approaches have been developed, from feature-based approaches (Ando and Zhang, 2005; Mikolov et al., 2013; Peters et al., 2018) to fine-tuning approaches (Dai and Le, 2015; Alec Radford and Sutskever, 2018; Devlin et al., 2018).

Framework
Our automated CTA transcript parsing framework takes interview transcripts as input and outputs structured knowledge consisting of summary phrases. The framework, visualized in Fig. 2, includes two parts: (1) summary text span extraction and (2) text span-pair relation extraction. The extracted knowledge is then structured as a flowchart and supports automated CTA.

Text Spans Extraction
Since CTA interview transcripts are conversational text while the structured knowledge is formed of summary phrases describing actions in transcripts (cf. Fig. 1), we need to first summarize transcript sentences. An intuitive idea is to leverage off-the-shelf text summarization methods (Shen et al., 2007; Nallapati et al., 2016; Liu et al., 2018). However, CTA is a low-resource task, and thus we do not have enough training data for learning seq2seq-based text summarization models. Therefore, in this work, we formulate the summarization of transcript sentences as a sequence labeling task (Liu et al., 2017) and treat the best summarizing text span in a transcript sentence as its corresponding summary phrase.
Given a sentence in a transcript, we denote the sentence as x = {x_i}, where x_i is the token at position i. The text span extraction task aims to obtain the prediction p_t representing the summary text span t of the transcript sentence x using a sequence labeling model p_t = M_s(x), where t is a contiguous subset of x labeled by p_t = {p_ti} with the IOBES schema. To train the model, we utilize the weakly-supervised sequence labels created in Sec. 4.3.
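At inference time, the predicted IOBES tag sequence must be decoded back into a text span. The paper does not give its decoding routine, so the following is a minimal sketch under the simplifying assumption that each sentence carries at most one summary span:

```python
def decode_span(tags):
    """Recover the (start, end) token indices of a summary text span
    from an IOBES tag sequence. Sketch: assumes at most one span per
    sentence; returns None when no well-formed span is found."""
    if "S" in tags:                      # single-token span
        i = tags.index("S")
        return (i, i + 1)
    if "B" in tags and "E" in tags:      # multi-token span B-I*-E
        return (tags.index("B"), tags.index("E") + 1)
    return None
```

For example, `decode_span(["O", "B", "I", "E", "O"])` recovers the span covering tokens 1 through 3.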

Text Span-Pair Relation Extraction
Structural relations between text spans are required to assemble summary text spans into structured knowledge. To extract structural information, following a previous study (Park and Motahari Nezhad, 2018), we formalize text span-pair relation extraction as a sentence-pair classification problem. A directed graph G_t = (T, R_t) is used to represent the structured knowledge parsed from a CTA transcript, consisting of nodes for summary text spans in the transcript (T = {t_i}) and edges for procedural information (R_t = {(u_ti, v_ti, r_ti)}, where u_ti, v_ti ∈ T are summary text spans and r_ti is the procedural relation from text span u_ti to text span v_ti). A span-pair RE model r_ti = M_r(u_ti, v_ti), for all u_ti, v_ti ∈ T, is then applied to extract relations between all summary text spans T in the transcript. We train the model using the span-pair RE dataset generated in Sec. 4.4.
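Applying M_r to all span pairs to build the edge set R_t can be sketched as follows. This is an illustrative skeleton, not the authors' implementation: `relation_model` is a hypothetical stand-in for the trained M_r, and relation names mirror the three types parsed from protocols:

```python
from itertools import permutations

def extract_procedural_graph(spans, relation_model):
    """Apply a span-pair RE model to every ordered pair of summary
    text spans and keep the pairs with a non-trivial relation,
    yielding the edge set R_t of the procedural graph G_t."""
    edges = []
    for u, v in permutations(spans, 2):
        r = relation_model(u, v)        # one of "none", "next", "if"
        if r != "none":
            edges.append((u, v, r))
    return edges
```

Note that enumerating all ordered pairs is quadratic in the number of spans; restricting candidate pairs (e.g., to nearby spans) would be a natural optimization.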
To capture long-range context dependency, we enrich the text span representation t with its surrounding contexts and feed the enhanced span representation t_c into the relation extraction model M_r. Examples are shown in Fig. 3.
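Building the context-enriched input t_c from neighboring sentences (as in Fig. 3) amounts to a windowing operation. A minimal sketch, assuming sentences are held in a list indexed in transcript order:

```python
def span_with_context(sentences, idx, k):
    """Build the context-enriched input t_c: the sentence containing
    the text span (at index idx) plus up to k neighboring sentences
    on each side, clipped at the transcript boundaries."""
    lo = max(0, idx - k)
    hi = min(len(sentences), idx + k + 1)
    return sentences[lo:hi]
```

With k = 2 this yields up to five sentences, matching the K = 2 examples in Fig. 3.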

Context-aware Models for Text Span-Pair Relation Extraction
We apply state-of-the-art models for natural language entailment (Talman et al., 2018; Devlin et al., 2018) to solve the text span-pair relation extraction task as a sentence-pair classification problem. While these models show promising results on the span-pair RE dataset we generated, they do not fully exploit all the information in our dataset. For example, in our dataset, a text span with context information is the combination of the matched text span and its surrounding context sentences.

Hidden states masking. In this model variant, we inject the context segmentation into models by masking out the final-layer hidden states for the context sentences and aggregating the remaining hidden states using a pooling function. This structure enables us to incorporate context segmentation information without introducing any new parameters.
where {h} are the final-layer hidden states and u_t, v_t are the two tokenized text spans. We use p_i and p_t to denote the sentence-level positions of a context sentence and the text span in the transcript, respectively. For i ≠ p_t, c_i = 1 or −1, depending on whether the context is on the left or right of the text span.
The two context position sequences are truncated to a fixed length for computational efficiency, then injected into the BERT model using position-aware attention (Zhang et al., 2017).
Context position as input embedding. Segment embedding is the part of the input embedding designed to encode sentence-pair segmentation information in the BERT model. In this model variant, we expand the segment embedding to encode the context position sequences above.
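The hidden states masking variant can be illustrated with a small sketch. This is a simplified stand-in using plain Python lists rather than tensors, and is not the authors' implementation; `span_mask` marks which token positions belong to the text spans (1) versus context sentences (0):

```python
def masked_pool(hidden_states, span_mask, mode="max"):
    """Pool final-layer hidden states while masking out positions
    belonging to context sentences, as in the Mask_AVG / Mask_MAX
    variants: only positions with span_mask == 1 contribute.
    hidden_states: list of equal-length vectors, one per token."""
    kept = [h for h, m in zip(hidden_states, span_mask) if m == 1]
    dim = len(hidden_states[0])
    if mode == "max":                    # Mask_MAX
        return [max(v[d] for v in kept) for d in range(dim)]
    # Mask_AVG
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]
```

Because the mask and pooling are parameter-free, this variant adds no trainable weights on top of the backbone, as noted above.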

Dataset Generation
To take advantage of the weak supervision from protocols, we build a pipeline to generate datasets for the CTA parsing framework, shown in Fig. 5.

Protocol Parsing
We use a deterministic finite automaton to parse the protocol into a graph G_p = (P, R_p), where nodes represent protocol phrases (P = {p_i}, the set of all protocol phrases parsed from the protocol) and edges represent procedural relations (R_p = {(u_pi, v_pi, r_pi)}, where u_pi, v_pi ∈ P are protocol phrases and r_pi is the procedural relation from phrase u_pi to phrase v_pi). We consider three types of procedural relations during parsing: none for no procedural relation between protocol phrases, next for sequence, and if for decision branching.
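A toy sketch of this parsing step is shown below. The actual protocol format and the DFA's transition rules are not specified here, so the line format and the branching heuristic in this sketch are hypothetical; a real protocol would require a richer automaton:

```python
def parse_protocol(lines):
    """Toy protocol parser: each non-empty line becomes a protocol
    phrase (a node); consecutive phrases get a "next" edge, except
    that a phrase beginning with "if " receives an "if" edge from
    its predecessor (decision branching). Hypothetical format."""
    phrases, edges = [], []
    prev = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        phrases.append(text)
        cur = len(phrases) - 1
        if prev is not None:
            rel = "if" if text.lower().startswith("if ") else "next"
            edges.append((prev, cur, rel))
        prev = cur
    return phrases, edges
```

The returned phrase list and edge list correspond to P and R_p in the graph G_p above.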

Text Spans Matching
To exploit the abundant information in protocols, we map each phrase in the protocol back to its nearest textual representation in the transcript using sentence matching techniques. Following the sequence labeling setting in the transcript summarization of our framework, given a protocol phrase p, we want to find the best matching text span t in the transcript. The scope of the search is limited to the source lines L_p mentioned in the protocol (Fig. 1). We then extract all possible text spans {t_i} from these sentences by enumerating all available n-grams and find the best matching text span t_best for p that maximizes the sentence similarity measure M_sim(p, t_best). For the similarity measure M_sim, we adopt sentence embeddings from different methods (Pennington et al., 2014; Devlin et al., 2018). The similarity is calculated as the cosine similarity between two normalized sentence embeddings. An empirical threshold of 0.5 is adopted for dropping protocol phrases without a good matched text span. We then match each remaining protocol phrase back to its nearest text span in the transcript.
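The selection step, given already-normalized embeddings for the phrase and the candidate n-gram spans, reduces to an argmax over dot products (which equal cosine similarities for unit vectors) with the 0.5 cutoff. A minimal sketch:

```python
def best_matching_span(phrase_vec, span_vecs, threshold=0.5):
    """Return the index of the candidate span whose (unit-normalized)
    embedding has highest cosine similarity to the protocol phrase,
    or None if no span exceeds the empirical 0.5 threshold."""
    def cos(a, b):
        # for unit vectors, the dot product equals cosine similarity
        return sum(x * y for x, y in zip(a, b))
    best, best_sim = None, threshold
    for i, v in enumerate(span_vecs):
        sim = cos(phrase_vec, v)
        if sim > best_sim:
            best, best_sim = i, sim
    return best
```

Returning None here corresponds to dropping the protocol phrase from the generated datasets.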

Sequence Labeling Dataset
With the matched text spans in the transcript, we are able to assign a label to every token in the transcript, denoting whether the token belongs to a matched text span. We adopt the IOBES format (Ramshaw and Marcus, 1999) as the labeling schema for constructing the sequence labeling dataset. The labeled text spans are semantically close to the protocol phrases, which are abstractive descriptions of actions, and we can use the labels to train text span extraction models (Sec. 3.1) in a weakly-supervised manner.
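Converting a matched span into IOBES token labels is mechanical; the following sketch handles one matched span per sentence, using the half-open index convention `[span_start, span_end)`:

```python
def iobes_tags(n_tokens, span_start, span_end):
    """Label a sentence of n_tokens with IOBES tags for one matched
    text span covering token positions [span_start, span_end)."""
    tags = ["O"] * n_tokens              # Outside by default
    length = span_end - span_start
    if length == 1:
        tags[span_start] = "S"           # Single-token span
    elif length > 1:
        tags[span_start] = "B"           # Beginning
        for i in range(span_start + 1, span_end - 1):
            tags[i] = "I"                # Inside
        tags[span_end - 1] = "E"         # End
    return tags
```

Sentences whose phrase failed the matching threshold would simply receive all-"O" labels.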

Text Span-Pair Relation Extraction Dataset
By parsing the protocol, we learn the procedural relations between protocol phrases. Thus, we can apply them to the matched text spans in the transcript to construct the span-pair RE dataset. These relations serve as weak supervision for the span-pair RE model (Sec. 3.2). Corresponding to the relation types parsed from the protocols, the dataset includes three types of labels: <none>, <next> and <if>.
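Projecting protocol-phrase relations onto matched transcript spans can be sketched as below; `span_of` is a hypothetical mapping from phrase id to its matched span (or None when the fuzzy matching failed), and pairs involving unmatched phrases are skipped:

```python
def build_re_examples(span_of, relations):
    """Turn procedural relations between protocol phrases into
    labeled span-pair RE examples by substituting each phrase with
    its matched transcript span. Unmatched phrases are dropped."""
    examples = []
    for u, v, rel in relations:
        su, sv = span_of.get(u), span_of.get(v)
        if su is None or sv is None:
            continue                      # matching failed for one side
        examples.append((su, sv, "<%s>" % rel))
    return examples
```

The resulting tuples are exactly the weakly-labeled examples used to fine-tune the span-pair RE model.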

Human-Annotated Matching Test Set
Since the datasets for the CTA transcript parsing framework are created via matching, we need to evaluate the performance of our matching methods. Thus, for testing purposes, we manually annotated the matched text spans in the transcript for 138 protocol phrases as the manual matching test set.

Experiments
In this section, we evaluate the effectiveness of our proposed automated CTA transcript parsing framework and its models. Specifically, we run three sets of experiments: (1) we evaluate our text span matching methods with the manual matching test set; (2) we evaluate model performance on the CTA text span extraction task with the sequence labeling dataset; (3) we evaluate model performance on the CTA span-pair RE task with the RE dataset.

Text Spans Matching
Implementation. We enumerate all text spans with length in [2, K_t] within the sentences in transcripts, where K_t = 30 truncates long text spans.
For text span matching, we try two sentence encoding methods to extract sentence embeddings: (1) average pooling over Glove word embeddings of the words in sentences and text spans (Pennington et al., 2014); (2) extracting features with a pre-trained BERT_BASE model, summing the features of the last four layers, and then averaging over the words in sentences and text spans (Devlin et al., 2018). We then normalize the embeddings and find the best matching text span for each protocol phrase based on cosine similarity. We also provide exact matching as a baseline, which finds the longest transcript text span matched by a text span in the protocol phrase.

Evaluation. We evaluate the performance of our text span matching methods on the manual matching test set using token-level metrics and mention-level accuracy, where the token-level metrics are normalized by sentence lengths. Results in Table 1 show that the two methods achieve acceptable results, while the exact matching baseline performs poorly in comparison. Glove-300d shows better token-level accuracy and F1 score, while BERT features achieve better mention-level accuracy. For cheaper computation, we use Glove-300d as the sentence encoding method for matching in the following sections. Please refer to the appendix for a case study of matching.
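The Glove average-pooling variant can be sketched as follows; `word_vecs` is a hypothetical token-to-vector dictionary standing in for the Glove-300d lookup table, and out-of-vocabulary tokens are simply skipped (an assumption, since the paper does not state its OOV handling):

```python
import math

def avg_embedding(tokens, word_vecs):
    """Average-pool word vectors into a unit-normalized sentence
    embedding, as in the Glove-300d matching variant. Tokens absent
    from word_vecs are skipped."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    summed = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in summed)) or 1.0
    return [x / norm for x in summed]     # normalize so dot = cosine
```

Normalizing here lets the downstream matching step compute cosine similarity as a plain dot product.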

Text Spans Extraction
Models. We conduct the text span extraction experiments using off-the-shelf sequence labeling models, including CRF (Finkel et al., 2005), LSTM-CRF (Huang et al., 2015), and LM-LSTM-CRF (Liu et al., 2017). The models are trained on the sequence labeling dataset generated by text span matching. For comparison, we also implement a hand-crafted rule extraction baseline with TokensRegex.
LSTM-CRF and LM-LSTM-CRF. We use LM-LSTM-CRF2 to conduct the experiments for both models, with the same setting of a 2-layer word-level LSTM, word-level hidden size H_w = 300, SGD with 0.045 learning rate and 0.05 learning rate decay, and 0.3 dropout ratio. The major difference between the two models is that LM-LSTM-CRF contains an additional character-level structure optimized via a language model loss.

Evaluation. Results for the text span extraction models on the manual matching test set are presented in Table 2, which shows that CRF achieves the best performance and outperforms the neural models (LSTM-CRF, LM-LSTM-CRF). LM-LSTM-CRF, which contains a character-level language model, is even worse (shown as w/ LM in the table, with different character-level hidden sizes). One reason could be that the neural models require a large-scale dataset to train, while our dataset does not meet this requirement.

Text spans with context serve as the input of the span-pair RE models. The level of context is controlled by a hyperparameter K (Fig. 3). We experiment with our models at different levels of context, while fixing the label sampling portion (Sec. 5.3) to <none> : <next> : <if> = 4 : 2 : 1.

Evaluation. The results are available in Table 4, which shows that our proposed model can outperform the baselines (BERT, HBMP, PCNN), and the model variant Mask_MAX reaches the best performance among all variants when using context level K = 2 and sampling portion 4 : 2 : 1.

Evaluation on context level. A condensed version of the evaluation results for different context levels is shown in Table 5; please refer to the appendix for the full version. The results are visualized in Fig. 7 and Fig. 6. The model Mask_MAX reaches the best micro F1 score on the manual matching test set with context level K = 2 over all models and values of K, which shows the effectiveness of the span-pair RE and the hidden state masking.

Evaluation on label sampling. We try three sampling settings and find that <none> : <next> : <if> = 4 : 2 : 1 shows the best performance on the manual matching test set in most cases (Table 6). Please refer to the appendix for the full results on label sampling.

Discussion. We make some observations from the results on the manual matching test set: (1) The model variants injected with context information awareness are more sensitive to changes in the context level K than the vanilla BERT model. These variants outperform the vanilla model when provided with more context, but fall behind when provided with short or even no context. (2) Vanilla models without specific context awareness structures (BERT, HBMP, PCNN) also gain improvements from the context on the manual matching test set.
(3) A big gap in <next> F1 score between K = 1 and K = 2 is observed for most of the models. This is because when K = 1 the context only provides the sentence enclosing the text span, whereas the K = 2 context provides the previous and the next sentence, which is useful for predicting the <next> relation.
The results on the generated test set (Fig. 6) are also interesting: the performance does not increase stably as K increases. This may be caused by error propagation from the fuzzy matching. Since there are some erroneous (noisy) samples in the generated dataset, the models are more likely to capture noisy patterns from them, and the larger the context is, the more noisy patterns it contains. Still, changing K from 1 to 2 gives a noticeable improvement for all models, especially in the <next> F1 score.
Also, the experiments on label sampling (Table 6; see the appendix for the full results) show that the performance of the models is sensitive to the sampling portion. Resampling and reweighting techniques for alleviating label imbalance could help address this problem in future study.

Conclusion
In this paper, we explored automated CTA transcript parsing, a challenging task due to the lack of direct supervision data and the requirement of document-level understanding. We proposed a weakly supervised framework to utilize the full information in the data. We noted the importance of context in the CTA parsing task and developed model variants to make use of context information. Our evaluation on a manually labeled test set shows the effectiveness of our framework.

Acknowledgment
This work has been supported in part by National Science Foundation SMA 18-29268, the Schmidt Family Foundation, an Amazon Faculty Award, a Google Research Award, and a JP Morgan AI Research Award. We would like to thank all the collaborators in the INK research lab for their constructive feedback on this work. We thank the anonymous reviewers for their valuable feedback.

Figure 2 :
Figure 2: The framework of automated CTA transcript parsing. Text spans are extracted via the sequence labeling model; then the relations between text spans are extracted by the text span-pair relation extraction model (span-pair RE model). Finally, we assemble the results into structured knowledge (a flowchart) for CTA.

Figure 3 :
Figure 3: The construction of a text span with context, t_c. The example shows two text spans with context using K = 2. Neighbors of text span t are denoted by t_{+i} and t_{−i}, 0 < i ≤ K.

Figure 4 :
Figure 4: Visualization of the hidden state masking. Hidden states for the context sentences are masked out before pooling.
t_[CLS] and t_[SEP] are the [CLS] and [SEP] tokens (for the BERT model); h_[CLS] and h_[SEP] are the corresponding hidden states, respectively.

Context position as attention. Inspired by position embeddings and position-aware attention (Zeng et al., 2014; Zhang et al., 2017), we define two context position sequences [c_1, ..., c_n] and [c'_1, ..., c'_n], which correspond to the positions of the two text spans, respectively; that is:

Figure 5 :
Figure 5: The dataset generation pipeline. The protocol is first parsed into a graph with relations between protocol phrases (shown as phrase); then the protocol phrases are matched with text spans in transcripts (shown as span). Finally, the sequence labeling dataset and the span-pair RE dataset are created according to the matches and the relations.

Text Span-Pair Relation Extraction

Models. For text span-pair relation extraction, we use the pre-trained BERT_BASE model (Devlin et al., 2018) as our backbone model to address the low-resource issue of our RE dataset generated from the limited CTA data. On this basis, we implement model variants injecting context information awareness (Sec. 3.3) to utilize the full information in our dataset, which include: hidden states masking (Mask_AVG and Mask_MAX), context position as attention (C.Attn.), and context position as input embedding (C.Emb.). For hidden states masking, the subscripts denote different hidden state pooling methods (average pooling and max pooling). For the two models using context position, we empirically use E = 30 as the embedding size and truncate the context position sequences (Sec. 3.3) at ±10. In addition, we experiment with the hierarchical BiLSTM model (Talman et al., 2018) and the Piecewise Convolutional Neural Network (Zeng et al., 2015) as non-pretrained baseline models for comparison. Results are aggregated from 5 runs with different initialization seeds for all experiments.

Figure 6 :
Figure 6: The micro F1 score of models at different context levels K, evaluated on the generated test set.

Figure 7 :
Figure 7: The micro F1 score of models at different context levels K, evaluated on the manual matching test set.

Table 1 :
Matching performance on the manual matching test set with different sentence encodings, in token-level accuracy and mention-level accuracy.
BERT fea. means using features extracted by the BERT model, and Exact is the exact matching baseline.

Table 2 :
Performance of sequence labeling models.

Table 4 :
Performance of span-pair RE models, with sampling portion 4 : 2 : 1 and K = 2, evaluated on the generated test set and the manual matching test set.

Table 5 :
Performance of span-pair RE models at different context levels K, with sampling portion 4 : 2 : 1, evaluated on the generated test set and the manual matching test set.

Table 6 :
Performance of text span relation extraction models under different label sampling settings, with K = 3. Generated represents the sampled generated test set following the sampling portion the model was trained on, while Manual represents the manual matching test set, which is fixed to 6 : 3 : 1.