Automated Scoring of Picture-based Story Narration

This work investigates linguistically motivated features for automatically scoring a spoken picture-based narration task. Specifically, we build scoring models with features for story development, language use and task relevance of the response. Results show that combinations of these features outperform a baseline system that uses state-of-the-art speech-based features, and that the best results are obtained by combining the linguistic and speech features.


Introduction
Story-telling has been used in evaluating the development of language skills (Sun and Nippold, 2012; McKeough and Malcolm, 2011; Botvin and Sutton-Smith, 1977). It has also been incorporated into assessment of English language proficiency in tests such as ETS's TOEFL Junior Comprehensive Test, where the English language skills of non-native middle school students are tested on a task designed to elicit stories based on pictures. The Six-Picture Narration task presents a series of six pictures (similar to a comic strip) to the test taker, who must orally produce a story which incorporates the events depicted in the pictures. As the scoring guide for this task indicates, in addition to fluidity of speech and few pronunciation errors, high-scoring responses must also show good command of language conventions, including grammar and word usage, and must be relevant to the task.
Previous work (Evanini and Wang, 2013) explored automated assessment of the speech component of the spoken responses to the picture narration task, but the linguistic and narrative aspects of the response have not received much attention. In this work, we investigate linguistic and construct-relevant aspects of the test such as (1) relevance and completeness of the content of the responses with respect to the prompt pictures, (2) proper word usage, (3) use of narrative techniques such as detailing to enhance the story, and (4) sequencing strategies to build a coherent story.
The contribution of this work is three-fold. First, we improve the construct coverage of the automated scoring models by incorporating evaluation of elements prescribed in the scoring rubric. Second, our linguistically motivated features allow for clear interpretation and explanation of scores, which is especially important if the automated scoring is to be employed for educational purposes. Finally, our results are promising: we show that the combination of linguistic and construct-relevant features which we explore in this work outperforms the state-of-the-art baseline system, and that the best performance is obtained when the linguistic and construct-relevant features are combined with the speech features.

Related Work
Evanini et al. (2013) use features extracted mainly from speech for scoring the picture narration task. They employ measures capturing fluency, prosody and pronunciation. Our work explores the other (complementary) dimensions of the test such as language use, content relevance and story development. Somasundaran and Chodorow (2014) construct features for awkward word usage and content relevance for a written vocabulary test, which we adapt for our task. Discourse organization features have been employed for scoring written essays in the expository and argumentative genres (Attali and Burstein, 2006). Our discourse features are focused on the structure of spoken narratives. Our relevance measure is intended to capture topicality while providing leeway for creative story telling, which is different from scoring summaries (Loukina et al., 2014). King and Dickinson (2013) use dependency parses of written picture descriptions. Given that our data is automatically recognized speech, parse features are not likely to be reliable. We use measures of n-gram association, such as pointwise mutual information (PMI), that have a long history of use for detecting collocations and measuring their quality (see Manning and Schütze (1999) and Leacock et al. (2014) for reviews). Our application of a large n-gram database and PMI is to encode language proficiency in sentence construction without using a parser.
Picture description tasks have been employed in a number of areas of study ranging from second language acquisition to Alzheimer's disease (Ellis, 2000; Forbes-McKay and Venneri, 2005). Picture-based story narration has also been used to study referring expressions (Lee et al., 2012) and to analyze child narratives (Hassanali et al., 2013).

Data
The TOEFL Junior Comprehensive assessment is a computer-based test intended for middle school students around the ages of 11-15, and is designed to assess a student's English communication skills. As mentioned above, we focus on the Six-Picture Narration task. Human expert raters listen to the recorded responses, which are about 60 seconds in duration, and assign a score to each on a scale of 1-4, with score point 4 indicating an excellent response. In this work, we use the automatic speech recognition (ASR) output transcription of the responses (see Evanini and Wang (2013) for details).
The data consists of 3440 responses to 6 prompts, all of which were scored by human raters. Table 1 shows the data size and partitions for the experiments as well as the score distributions. An ASR partition (with 1538 responses) was created for training the speech recognition models and was also used for our linguistic feature development. Train was used for cross validation experiments as well as for training a final model that was evaluated on the Eval dataset. Quadratic Weighted Kappa (QWK) between human raters is 0.69 for Train and 0.70 for Eval. Responses containing anomalous test taker behavior (such as non-English responses or non-responses) and responses with severe technical difficulties (such as static or background noise) receive separate ratings and are excluded from this study. This filtering resulted in a total of 874 responses in the Train and 672 responses in the Eval data sets.

Features
We explore five different feature sets to help us answer the following questions about the response: Did the test taker construct a story about the pictures in the prompt, or did he/she produce an irrelevant response instead? (Relevance); Did the test taker use words appropriately in the response, where proper usage of words and phrases is characterized by the probabilities of the contexts in which they are used? (Collocation); Did the test taker adequately organize the narrative? (Discourse); Did the test taker enhance the narrative by including details? (Detailing); and Did the test taker develop the story through expression of emotion and character development? (Sentiment).

Relevance
In order to test if a given response tells a story that is relevant to the pictures in the prompt, we calculate the overlap between the content of the response and the content of the pictures, similar to Somasundaran and Chodorow (2014). To facilitate this, each prompt is associated with a reference corpus containing a detailed description of each picture, and also an overall narrative that ties together the events in the pictures. Each reference corpus was created by merging the picture descriptions and narratives that were generated independently by 10 annotators. To calculate overlap, stop words were first removed from lemmatized versions of the response and the reference corpus.
Because test-takers often use synonyms and other words related to the prompt, we expanded the content words in the reference corpus by adding their synonyms, as provided in Lin's thesaurus (Lin, 1998) and in WordNet, and also included their WordNet hypernyms and hyponyms. This gave us the following 6 features which measure the overlap, or coverage, between the lemmatized response and the lemmatized (i) reference corpus (lemmas), (ii) reference corpus expanded using Lin's thesaurus (cov-lin), (iii) reference corpus expanded using WordNet Synonyms (cov-wn-syns), (iv) reference corpus expanded using WordNet Hypernyms (cov-wn-hyper), (v) reference corpus expanded using WordNet Hyponyms (cov-wn-hypo), and (vi) reference corpus expanded using all of the above methods (cov-all).
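The coverage computation can be illustrated with a minimal sketch. The exact overlap formula is not spelled out above, so the normalization used here (the fraction of the response's content lemmas found in the reference set) is an assumption, as are the helper names coverage and expand and the toy synonym table that stands in for a real thesaurus lookup.

```python
def coverage(response_lemmas, reference_lemmas, stop_words=frozenset()):
    """Fraction of the response's content lemmas found in the reference set.

    Assumption: overlap is normalized by the number of content lemmas
    in the response; the source does not state the normalization.
    """
    content = [w for w in response_lemmas if w not in stop_words]
    if not content:
        return 0.0
    reference = set(reference_lemmas) - set(stop_words)
    return sum(1 for w in content if w in reference) / len(content)


def expand(reference_lemmas, related):
    """Add related words (e.g. synonyms) for every reference lemma."""
    expanded = set(reference_lemmas)
    for lemma in reference_lemmas:
        expanded.update(related.get(lemma, ()))
    return expanded


# Toy data (hypothetical): a tiny reference corpus and synonym table.
stop_words = {"the", "a", "to"}
reference = ["boy", "walk", "library", "book"]
synonyms = {"boy": ["kid", "lad"], "walk": ["stroll"]}
response = ["the", "kid", "stroll", "to", "the", "library"]

features = {
    "lemmas": coverage(response, reference, stop_words),
    "cov-syns": coverage(response, expand(reference, synonyms), stop_words),
}
```

In this toy case only "library" matches the unexpanded reference, while all three content lemmas match after expansion, which is the effect the expanded coverage features are meant to capture.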

Collocation
Inexperienced use of language is often characterized by inappropriate combinations of words, indicating the test taker's lack of knowledge of collocations. In order to detect this, we calculate the Pointwise Mutual Information (PMI) of all adjacent word pairs (bigrams), as well as all adjacent word triples (trigrams), in the response, using the Google 1T web corpus (Brants and Franz, 2006). The higher the value of the PMI, the more common is the collocation for the word pair/triple in well-formed texts. On the other hand, negative values of PMI indicate that the given word pair or triple is less likely than chance to occur together. We hypothesized that this would be a good indicator of awkward usage, as suggested in Chodorow and Leacock (2000).
We bin the PMI values and generate two sets of features based on the proportions of bigrams/trigrams falling into each bin, resulting in a total of 16 features (that is, eight bins for each n-gram order). In addition to binning, we also encode as features the maximum, minimum and median PMI value obtained over all bigrams and trigrams. These encode the best and the worst word collocations in a response as well as the overall general quality of the response.
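The PMI features described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the system's implementation: the bin edges are not given in the text, so the single edge at 0 is a placeholder; the n-gram PMI is taken as the log ratio of the joint probability to the product of the unigram probabilities; one total count is used for all probabilities; and the small count tables stand in for the web-scale n-gram database.

```python
import math
from bisect import bisect_right
from statistics import median


def pmi(ngram, ngram_counts, unigram_counts, total):
    """log2 of P(ngram) over the product of its word probabilities."""
    count = ngram_counts.get(ngram, 0)
    if count == 0:
        return None  # unseen n-gram: skipped in this sketch (an assumption)
    independent = math.prod(unigram_counts[w] / total for w in ngram)
    return math.log2((count / total) / independent)


def pmi_features(tokens, n, ngram_counts, unigram_counts, total, edges):
    """Bin proportions plus max, min and median PMI over a response's n-grams."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    scored = (pmi(g, ngram_counts, unigram_counts, total) for g in ngrams)
    values = [v for v in scored if v is not None]
    if not values:
        return {}
    bins = [0] * (len(edges) + 1)
    for v in values:
        bins[bisect_right(edges, v)] += 1  # index of the bin holding v
    feats = {f"bin{i}": c / len(values) for i, c in enumerate(bins)}
    feats.update(max=max(values), min=min(values), median=median(values))
    return feats


# Toy counts (hypothetical) standing in for the n-gram database.
unigrams = {"the": 200, "boy": 100, "went": 100}
bigrams = {("the", "boy"): 160, ("boy", "went"): 5}
feats = pmi_features(["the", "boy", "went"], 2, bigrams, unigrams, 1000,
                     edges=[0.0])
```

Here "the boy" occurs eight times more often than chance would predict (PMI of about 3), whereas "boy went" occurs half as often (PMI of about -1), so the two bigrams fall on opposite sides of the zero edge.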

Discourse
Stories are characterized by events that are related (and ordered) temporally or causally. In order to form a coherent narrative, it is often necessary to use proper transition cues to organize the story. Intuitively, coherent responses are more likely to have these cues than less coherent responses.
In order to detect discourse organization cues, we use two lexicons. The first was obtained from the Penn Discourse Treebank (PDTB) annotation manual (Prasad et al., 2008). The second was developed by manually mining websites giving advice on good narrative writing. The two lexicons gave us a total of over 550 cues. From the PDTB and our lexicon, we extracted the number of times each connective was encountered in a particular sense (sense information such as "Temporal" or "Cause" is directly provided in the PDTB manual, and we added similar information to our manually collected lexicon) and used the frequencies to construct a probability distribution over the senses for that cue. Then, for each response, we produced the following features: the number of cues found in the response (totalCuesCount), the number of cues found in the response divided by the number of words in the response (normalizedCuesCount), the number of cues belonging to the temporal category (temporalCuesCount), the number of cues belonging to the causal category (causalCuesCount), the sum of the probabilities of belonging to the temporal category for each cue found in the response (temporalCuesScore), and the sum of the probabilities of belonging to the causal category for each cue found in the response (causalCuesScore).
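A minimal sketch of these six features is given below. Several details are assumptions: only single-word cues are matched, a cue is counted in a category when that category is its most probable sense, and the three lexicon entries and their probabilities are invented for illustration rather than taken from the PDTB.

```python
def discourse_features(tokens, cue_senses):
    """Cue counts and sense-probability sums for one tokenized response."""
    found = [t for t in tokens if t in cue_senses]

    def top_sense(cue):
        # Assumption: a cue "belongs" to its most probable sense.
        senses = cue_senses[cue]
        return max(senses, key=senses.get)

    def sense_score(sense):
        return sum(cue_senses[c].get(sense, 0.0) for c in found)

    return {
        "totalCuesCount": len(found),
        "normalizedCuesCount": len(found) / len(tokens) if tokens else 0.0,
        "temporalCuesCount": sum(top_sense(c) == "Temporal" for c in found),
        "causalCuesCount": sum(top_sense(c) == "Cause" for c in found),
        "temporalCuesScore": sense_score("Temporal"),
        "causalCuesScore": sense_score("Cause"),
    }


# Hypothetical lexicon: cue -> probability distribution over senses.
cue_senses = {
    "then": {"Temporal": 0.9, "Cause": 0.1},
    "because": {"Cause": 1.0},
    "since": {"Temporal": 0.4, "Cause": 0.6},
}
feats = discourse_features("then he left because it rained".split(), cue_senses)
```

The score features differ from the count features in that an ambiguous cue such as "then" contributes partially to both categories.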

Detailing
We hypothesized that better responses would show evidence of effective narrative techniques, such as providing vivid descriptions of the events and providing depth to the story. For example, one could say "In the afternoon a boy and a man went to the library.", or make the story more interesting by assigning names to the characters and places, as in "One day John went to the Central Public Library because he wanted to do some research for his science project. An old man was walking behind him; his name was Peter." We observed that certain syntactic categories, such as adjectives and adverbs, come into play in the process of detailing. Also, detailing by providing names to the characters and places results in a higher number of proper nouns (NNPs). Thus our detailing feature set consists of the following features: a binary value indicating whether the response contains any proper nouns (presenceNames), the number of proper nouns in the response (countNames), a binary value indicating whether the response contains any adjectives (presenceAdj), the number of adjectives in the response (countAdj), a binary value indicating whether the response contains any adverbs (presenceAdv), and the number of adverbs in the response (countAdv). We use separate features for counts and presence of the syntactic category in order to balance the trade-off between sparsity and informativeness. The count features are more informative, but they can be sparse (especially for higher counts).
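Given part-of-speech tags, the detailing features reduce to simple counts, as the following sketch shows. It assumes the response has already been tagged with Penn Treebank tags by some external tagger (the tagger itself is not specified above), and it treats NNP/NNPS as proper nouns, JJ* as adjectives and RB* as adverbs.

```python
def detailing_features(tagged_tokens):
    """Presence and count features from (token, Penn Treebank tag) pairs."""
    tags = [tag for _, tag in tagged_tokens]
    counts = {
        "Names": sum(t.startswith("NNP") for t in tags),  # NNP, NNPS
        "Adj": sum(t.startswith("JJ") for t in tags),     # JJ, JJR, JJS
        "Adv": sum(t.startswith("RB") for t in tags),     # RB, RBR, RBS
    }
    feats = {}
    for name, count in counts.items():
        feats[f"presence{name}"] = int(count > 0)
        feats[f"count{name}"] = count
    return feats


# Hand-tagged toy input; a real system would obtain tags from a POS tagger.
tagged = [("John", "NNP"), ("quickly", "RB"), ("went", "VBD"), ("to", "TO"),
          ("the", "DT"), ("Central", "NNP"), ("Public", "NNP"),
          ("Library", "NNP")]
feats = detailing_features(tagged)
```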

Sentiment
One common technique used in developing a story is to reveal the character's private states, emotions and feelings. This requires the use of subjectivity and sentiment terms.
We use lexicons for annotating sentiment and subjective words in the response. Specifically, we use a sentiment lexicon (ASSESS) developed in previous work in assessments (Beigman Klebanov et al., 2013) and the MPQA subjectivity lexicon (Wilson et al., 2005). The ASSESS lexicon assigns a positive/negative/neutral polarity probability profile to its entries, and the MPQA lexicon associates a positive, negative or neutral polarity category with its entries. We consider a word from the ASSESS lexicon to be polar if the sum of its positive and negative probabilities is greater than 0.65 (we arrived at this number after manual inspection of the lexicon). This gives us the subjectivity feature set, which comprises the following features: a binary value indicating whether the response contains any polar words from the ASSESS lexicon (presencePolarProfile), the number of polar words from the ASSESS lexicon found in the response (cntPolarProfile), a binary value indicating whether the response contains any polar words from the MPQA lexicon (presenceMpqaPolar), the number of polar words from the MPQA lexicon found in the response (cntMpqaPolar), a binary value indicating whether the response contains any neutral words from the MPQA lexicon (presenceMpqaNeut), and the number of neutral words from the MPQA lexicon found in the response (cntMpqaNeut).
We construct separate features from the ASSESS lexicon and the MPQA lexicon because we found that the neutral category had different meanings in the two lexicons: even the neutral entries in the MPQA lexicon are valuable, as they may indicate speech events and private states (e.g. view, assess, believe, cogitate, contemplate, feel, glean, think, etc.). On the other hand, words with a high probability of being neutral in the ASSESS lexicon are non-subjective words (e.g. woman, undergo, entire, technologies).
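The sentiment features can be sketched as below. The 0.65 threshold comes from the description above; everything else is an assumption made for illustration, including the in-memory lexicon formats and the example entries, whose values are invented and do not reproduce the actual ASSESS or MPQA lexicons.

```python
POLAR_THRESHOLD = 0.65  # threshold stated in the text


def sentiment_features(tokens, assess, mpqa):
    """Presence and count features from two lexicons.

    assess: word -> (p_positive, p_negative, p_neutral)   (assumed format)
    mpqa:   word -> "positive" | "negative" | "neutral"   (assumed format)
    """
    def is_polar_profile(word):
        if word not in assess:
            return False
        p_pos, p_neg, _ = assess[word]
        return p_pos + p_neg > POLAR_THRESHOLD

    cnt_profile = sum(is_polar_profile(t) for t in tokens)
    cnt_polar = sum(mpqa.get(t) in ("positive", "negative") for t in tokens)
    cnt_neut = sum(mpqa.get(t) == "neutral" for t in tokens)
    return {
        "presencePolarProfile": int(cnt_profile > 0),
        "cntPolarProfile": cnt_profile,
        "presenceMpqaPolar": int(cnt_polar > 0),
        "cntMpqaPolar": cnt_polar,
        "presenceMpqaNeut": int(cnt_neut > 0),
        "cntMpqaNeut": cnt_neut,
    }


# Hypothetical lexicon entries with invented values.
assess = {"happy": (0.80, 0.05, 0.15), "woman": (0.10, 0.05, 0.85)}
mpqa = {"happy": "positive", "sad": "negative", "think": "neutral"}
tokens = "the woman was happy but i think she was sad".split()
feats = sentiment_features(tokens, assess, mpqa)
```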

Experiments
For our experiments, we used a supervised learning framework, with the data described above, to build scoring models based on our feature sets. We evaluated several different learning algorithms and found that a Random Forest Classifier consistently produced the best results in cross-validation experiments on the training data when we used our features as well as when we used the baseline set of features. Hence, all of our results in this section are reported using this Random Forest learner. Performance was calculated using Quadratic Weighted Kappa (QWK) (Cohen, 1968), which is the standard evaluation metric used in automated scoring. QWK measures the agreement between the system score and the human-annotated score, correcting for chance agreement and penalizing large disagreements more than small ones.
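For reference, QWK can be computed directly from its definition, as in the following sketch. It uses the standard formulation with weights (i - j)^2 / (k - 1)^2 over k score points and assumes integer scores on the 1-4 scale used for this task; the function name is ours.

```python
def quadratic_weighted_kappa(system, human, min_score=1, max_score=4):
    """QWK between two equal-length lists of integer scores."""
    k = max_score - min_score + 1
    n = len(human)
    # Observed confusion matrix: rows are system scores, columns human scores.
    observed = [[0] * k for _ in range(k)]
    for s, h in zip(system, human):
        observed[s - min_score][h - min_score] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under chance agreement (independent marginals).
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    # Undefined (division by zero) if both raters use one identical score.
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4])
reversed_scores = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1])
```

Perfect agreement yields a kappa of 1, while fully reversed scores in this toy case yield -1, because a disagreement of three score points is penalized nine times as heavily as a disagreement of one point.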

Baseline
We use the previous state-of-the-art features from Evanini and Wang (2013) as our baseline (EW13). They comprise the following subsets: fluency (rate of speech, number of words per chunk, average number of pauses, average number of long pauses), pronunciation (normalized Acoustic Model score, average word confidence, average difference in phone duration from native speaker norms), prosody (mean duration between stressed syllables), and lexical choice (normalized Language Model score).

Results and Analysis
We performed cross validation on our training data (Train) and also performed training on the full training dataset with evaluation on the Eval data. Table 2 reports our results on 10-fold cross validation experiments on the training data (CV), as well as results when training on the full training dataset and testing on the evaluation dataset (Eval). The first 5 rows report the performance of the individual feature sets described in Section 4. Not surprisingly, no individual feature set performs as well as the EW13 baseline, which comprises an array of many features that measure various speech characteristics. One exception to this is the collocation feature set, which performs as well as the EW13 baseline in the cross validation experiments. Notably, the combination of all five feature sets proposed in this work (All Feats) performs better than the EW13 baseline, indicating that our relevance and linguistic features are important for scoring this spoken response item type. Finally, the best performance is obtained when we combine our features with the speech-based features. This improvement of All Feats + EW13 over the baseline is statistically significant at p < 0.01, based on 10K bootstrap samples (Zhang et al., 2004). Somewhat surprisingly, testing on the evaluation dataset showed slightly better performance for most types of features than the cross validation testing. We believe that this might be due to the fact that, for the Eval results, all the training data were available to train the scoring models.
We also performed analysis on the Train set to see if the baseline's performance is impacted when each of our individual feature sets is added to it. As shown in Table 3, each of the feature sets is able to improve the baseline's performance (of 0.48 QWK). Specifically, Discourse and Subjectivity produce a slight improvement while Relevance produces a modest improvement. However, only the improvement produced by the Collocation features was statistically significant (p < 0.01).
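The significance tests reported in this section rely on bootstrap resampling. One common way to set up such a test is a paired bootstrap over responses, sketched below. The exact procedure followed here is not described, so this is an assumed variant; the function names are ours, and exact agreement is used as a stand-in metric to keep the example self-contained (any metric with the same signature, such as QWK, could be passed instead).

```python
import random


def exact_agreement(system, human):
    """Stand-in metric: share of responses scored identically."""
    return sum(s == h for s, h in zip(system, human)) / len(human)


def paired_bootstrap_p(human, sys_a, sys_b, metric=exact_agreement,
                       samples=10_000, seed=0):
    """Share of resamples in which system A fails to beat system B."""
    rng = random.Random(seed)
    n = len(human)
    not_better = 0
    for _ in range(samples):
        # Resample responses with replacement, keeping scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        if metric(a, h) <= metric(b, h):
            not_better += 1
    return not_better / samples


# Toy scores (hypothetical): one system always right, one always wrong.
human = [1, 2, 3, 4] * 5
always_right = list(human)
always_wrong = [5] * len(human)
p_value = paired_bootstrap_p(human, always_right, always_wrong, samples=200)
```

A small returned value indicates that system A outperforms system B in nearly all resamples; identical systems would give a value of 1.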

Conclusions
In this work, we explored five different types of linguistic features for scoring spoken responses in a picture narration task. The features were designed to capture language proficiency, story development and task relevance. Our results are promising: we found that each feature set is able to combine well with a state-of-the-art speech feature system to improve results. The combination of the linguistic features achieved better overall performance than the speech features alone. Finally, the best performance was achieved when linguistic and speech features were combined.