Automated Topical Component Extraction Using Neural Network Attention Scores from Source-based Essay Scoring

While automated essay scoring (AES) can reliably grade essays at scale, automated writing evaluation (AWE) additionally provides formative feedback to guide essay revision. However, a neural AES typically does not provide useful feature representations for supporting AWE. This paper presents a method for linking AWE and neural AES by extracting Topical Components (TCs) representing evidence from a source text, using the intermediate output of attention layers. We evaluate performance using a feature-based AES that requires TCs. Results show that performance is comparable whether using automatically or manually constructed TCs, both for 1) representing essays as rubric-based features and 2) grading essays.

Our work focuses on a particular source-based essay writing task called the response-to-text assessment (RTA) (Correnti et al., 2013). Recently, an RTA AWE system (Zhang et al., 2019) was built by extracting rubric-based features related to the use of Topical Components (TCs) in an essay. However, manual expert effort was first required to create the TCs. For each source, the TCs consist of a comprehensive list of evidence-related topics, comprising: 1) important words indicating the set of evidence topics in the source, and 2) phrases representing specific examples for each topic that students need to find and use in their essays.
To eliminate this expert effort, we propose a method that uses the interpretable output of the attention layers of a neural AES for source-based essay writing, with the goal of extracting TCs. We evaluate this method by using the extracted TCs to support feature-based AES for two RTA source texts. Our results show that 1) the feature-based AES performs comparably whether the TCs are manually created by humans or generated by our neural method, and 2) the values of the rubric-based essay features based on automatic TCs are highly correlated with human Evidence scores.

Related Work
Three recent AWE systems have used non-neural AES to provide rubric-specific feedback. Woods et al. (2017) developed an influence estimation process that used a logistic regression AES to identify sentences needing feedback. Shibani et al. (2019) presented a web-based tool that provides formative feedback on rhetorical moves in writing. Zhang et al. (2019) used features created for a random forest AES to select feedback messages, although human effort was first needed to create TCs from a source text. We automatically extract TCs using neural AES, thereby eliminating this expert effort.
Others have also proposed methods for preprocessing source information external to an essay. Content importance models for AES predict the parts of a source text that students should include when writing a summary (Klebanov et al., 2014). Methods for extracting important keywords or keyphrases also exist, both supervised (unlike our approach) (Meng et al., 2017; Mahata et al., 2018; Florescu and Jin, 2018) and unsupervised (Florescu and Caragea, 2017). Rahimi and Litman (2016) developed a TC extraction model based on LDA (Blei et al., 2003). While the LDA model considers all words equally, our model takes essay scores into account by using attention to represent word importance. Both the unsupervised keyword and LDA models will serve as baselines in our experiments.

Table 1:
Source Excerpt: Today, Yala Sub-District Hospital has medicine, free of charge, for all of the most common diseases. Water is connected to the hospital, which also has a generator for electricity. Bed nets are used in every sleeping site in Sauri...
Essay Prompt: The author provided one specific example of how the quality of life can be improved by the Millennium Villages Project in Sauri, Kenya. Based on the article, did the author provide a convincing argument that winning the fight against poverty is achievable in our lifetime? Explain why or why not with 3-4 examples from the text to support your answer.
Essay: In my opinion I think that they will achieve it in lifetime. During the years threw 2004 and 2008 they made progress. People didnt have the money to buy the stuff in 2004. The hospital was packed with patients and they didnt have alot of treatment in 2004. In 2008 it changed the hospital had medicine, free of charge, and for all the common dieases. Water was connected to the hospital and has a generator for electricity. Everybody has net in their site. The hunger crisis has been addressed with fertilizer and seeds, as well as the tools needed to maintain the food. The school has no fees and they serve lunch. To me thats sounds like it is going achieve it in the lifetime.
In the computer vision area, attention-cropped images have been used for further image classification or object detection (Cao et al., 2015; Yuxin et al., 2018; Ebrahimpour et al., 2019). In the NLP area, Lei et al. (2016) proposed using a generator to find candidate rationales, which are then passed to an encoder for prediction. Our work is similar in spirit to this line of work.

RTA Corpus and Prior AES Systems
The essays in our corpus were written by students in grades 4 to 8 in response to two RTA source texts (Correnti et al., 2013): RTA_MVP (2,970 essays) and RTA_Space (2,076 essays). Table 1 shows an excerpt from RTA_MVP, the associated essay writing prompt, and a student essay. The bolding in the source indicates evidence examples that experts manually labeled as important for students to discuss (i.e., TC phrases). Evidence usage in each essay was manually scored on a scale of 1 to 4 (low to high). The distribution of Evidence scores is shown in Table 2. The essay in Table 1 received a score of 3, with the bolding indicating phrases semantically related to the TCs from the source text.
To date, two approaches to AES have been proposed for the RTA: AES_rubric and AES_neural. To support the needs of AWE, AES_rubric (Zhang and Litman, 2017) used a traditional supervised learning framework where rubric-motivated features were extracted from every essay before model training: Number of Pieces of Evidence (NPE), Concentration (CON), Specificity (SPC), and Word Count (WOC). The two aspects of TCs introduced in Section 1 (topic words, specific example phrases) were used during feature extraction.
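To make the role of TCs in feature extraction concrete, the sketch below shows how an NPE-like count might be computed from a list of topic-word sets; the function name, the simple word-overlap matching, and the data layout are illustrative assumptions, not the exact AES_rubric implementation.

```python
# Illustrative sketch (not the exact AES_rubric implementation): count how many
# evidence topics from the TCs are mentioned at least once in an essay.
def npe_like_feature(essay_text, topic_words):
    """topic_words: list of sets, one set of topic words per evidence topic."""
    essay_tokens = set(essay_text.lower().split())
    # A topic counts as "covered" if any of its topic words appears in the essay.
    return sum(1 for topic in topic_words if essay_tokens & topic)

topics = [{"hospital", "medicine"}, {"malaria", "nets"},
          {"farming", "fertilizer"}, {"school", "fees"}]
print(npe_like_feature("the hospital had medicine and the school had no fees", topics))  # -> 2
```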
Motivated by improving stand-alone AES performance (i.e., when an interpretable model was not needed for subsequent AWE), Zhang and Litman (2018) proposed AES_neural, a neural model with attention; in this model, a high attention score suggests that a sentence or phrase has important source-related information.
To provide intuition, Table 3 shows example sentences from the student essay in Table 1. Bolded are the phrases with the highest self-attention score within each sentence. Italicized are specific example phrases that refer to the manually constructed TCs for the source. Attn_sent is the text-to-essay attention score, which measures which essay sentences are closest in meaning to a source sentence. Attn_phrase is the self-attention score of the bolded phrase, which measures phrase importance. A sentence with a high attention score tends to include at least one specific example phrase, and vice versa. If a sentence has a high attention score, the phrase within it that has the highest attention score tends to include at least one specific example phrase.
Based on these observations, we first extract the output of two layers from the neural network: 1) the attn_sent of each sentence, and 2) the output of the convolutional layer as the representation of the phrase with the highest attn_phrase in each sentence (denoted cnn_phrase). We also extract the plain text of the phrase with the highest attn_phrase in each sentence (denoted text_phrase). Our TC_attn method then uses the extracted information in three main steps: 1) filtering out text_phrase from sentences with low attn_sent, 2) clustering all remaining text_phrase based on cnn_phrase, and 3) generating TCs from the clusters.
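A minimal sketch of this extraction step is shown below. It assumes the per-sentence attention scores, per-token self-attention scores, token lists, and convolutional-layer outputs have already been pulled from the trained AES_neural model; the variable names and the fixed phrase window size are assumptions for illustration.

```python
import numpy as np

def extract_phrase_info(attn_sent, attn_phrase, tokens, cnn_out, window=3):
    """For each essay sentence, keep the sentence-level attention score and the
    highest self-attention phrase (its text and its convolutional representation).

    attn_sent:   (num_sentences,) sentence-to-source attention scores
    attn_phrase: list of (num_positions,) self-attention scores, one array per sentence
    tokens:      list of token lists, one per sentence
    cnn_out:     list of (num_positions, dim) convolutional outputs, one per sentence
    """
    records = []
    for s_idx, sent_score in enumerate(attn_sent):
        pos = int(np.argmax(attn_phrase[s_idx]))           # highest-attention position
        text_phrase = " ".join(tokens[s_idx][pos:pos + window])
        cnn_phrase = cnn_out[s_idx][pos]                    # its vector representation
        records.append({"attn_sent": float(sent_score),
                        "text_phrase": text_phrase,
                        "cnn_phrase": cnn_phrase})
    return records
```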
The first, filtering step keeps all text_phrase whose original sentences have an attn_sent above a threshold. The intuition is that a lower attn_sent indicates less source-related information.
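A short sketch of this filter, continuing from the records produced above; the threshold value is only a placeholder and would be tuned on the development set.

```python
def filter_by_sentence_attention(records, threshold=0.1):
    # Keep only phrases whose source sentence attention exceeds the threshold;
    # a low attn_sent suggests the sentence carries little source-related evidence.
    return [r for r in records if r["attn_sent"] > threshold]
```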
The second step clusters these text_phrase based on their corresponding cnn_phrase representations. We use k-medoids to cluster the text_phrase into M clusters, where M is the number of topics in the source text. Then, within each topic cluster, we use k-medoids again to cluster the text_phrase into N clusters, where N is the number of specific example phrases we want to extract per topic. The output of this step is M * N clusters.
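The two-level clustering might look like the following sketch, which assumes the KMedoids implementation from the scikit-learn-extra package; M and N are placeholders that would be tuned on the development set.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

def cluster_phrases(records, M=4, N=10, seed=0):
    """Cluster phrases into M topic clusters, then up to N example clusters per topic."""
    X = np.vstack([r["cnn_phrase"] for r in records])
    topic_labels = KMedoids(n_clusters=M, random_state=seed).fit_predict(X)
    clusters = {}  # (topic_id, example_id) -> list of phrase records
    for t in range(M):
        idx = np.where(topic_labels == t)[0]
        if len(idx) == 0:
            continue  # guard against empty topic clusters
        n = min(N, len(idx))
        example_labels = KMedoids(n_clusters=n, random_state=seed).fit_predict(X[idx])
        for i, e in zip(idx, example_labels):
            clusters.setdefault((t, int(e)), []).append(records[i])
    return clusters
```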
The third step uses the topic and example clustering to extract TCs. As noted earlier, TCs include two parts: topic words and specific example phrases. Since our method is data-driven and students introduce their own vocabulary into the corpus, the essay text is noisy. To make the TC output cleaner, we filter out words that do not appear in the source text.
To obtain topic words, we combine all text_phrase from each topic cluster to calculate per-topic word frequencies. To keep topics distinct, we assign each word to the topic cluster in which it has the highest normalized word frequency. We then include the top K topic words per topic cluster, ranked by frequency. To obtain example phrases, we combine all text_phrase from each example cluster to calculate per-example word frequencies, then include the top K example words per example cluster, ranked by frequency.
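The topic-word portion of this step could be sketched as below: word frequencies are computed per topic cluster, restricted to the source vocabulary, each word is assigned to the cluster where its normalized frequency is highest, and the top K words are kept. The function name and the normalization by cluster size are illustrative assumptions.

```python
from collections import Counter

def topic_words_from_clusters(topic_clusters, source_vocab, K=20):
    """topic_clusters: topic_id -> list of text_phrase strings."""
    # Per-topic word frequencies, keeping only words that appear in the source text.
    freqs = {t: Counter(w for p in phrases for w in p.lower().split() if w in source_vocab)
             for t, phrases in topic_clusters.items()}
    # Normalize by cluster size so large clusters do not dominate the assignment.
    norm = {t: {w: c / max(sum(f.values()), 1) for w, c in f.items()}
            for t, f in freqs.items()}
    # Assign each word to the topic where its normalized frequency is highest.
    assigned = {}
    for t, f in norm.items():
        for w, v in f.items():
            if w not in assigned or v > assigned[w][1]:
                assigned[w] = (t, v)
    topic_words = {t: [] for t in topic_clusters}
    for w, (t, _) in assigned.items():
        topic_words[t].append(w)
    # Keep the top K words per topic, ranked by raw frequency in that topic.
    return {t: sorted(ws, key=lambda w: -freqs[t][w])[:K] for t, ws in topic_words.items()}
```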

Experimental Setup and Results
Figure 1 shows an overview of the four TC extraction methods to be evaluated. TC_manual (upper bound) uses a human expert to extract TCs from a source text. TC_attn is our proposed method and automatically extracts TCs using both a source text and student essays. TC_lda (Rahimi and Litman, 2016) (baseline) builds on LDA to extract TCs from student essays only, while TC_pr (baseline) builds on PositionRank (Florescu and Caragea, 2017) to instead extract TCs from only the source text. Since PositionRank is not designed for TC extraction, we needed to further process its output to create TC_pr. To extract topic words, we extract all keywords from the PositionRank output. Next, we map each word to a higher-dimensional vector using word embeddings. Lastly, we cluster all keywords with k-medoids into PR_topic topics. To extract example phrases, we place all keyphrases into a single topic and remove redundant example phrases that are subsets of other example phrases.
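The post-processing for TC_pr could look roughly like this sketch. It assumes the PositionRank keywords and keyphrases have already been extracted (as lists of strings) and that a word-embedding lookup is available; the embedding dictionary, the function name, and the PR_topic value are illustrative assumptions.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

def build_tc_pr(keywords, keyphrases, embeddings, pr_topics=10, seed=0):
    """keywords/keyphrases: PositionRank output; embeddings: word -> vector dict."""
    # Topic words: embed each keyword and cluster the vectors into pr_topics groups.
    kept = [w for w in keywords if w in embeddings]
    X = np.vstack([embeddings[w] for w in kept])
    labels = KMedoids(n_clusters=pr_topics, random_state=seed).fit_predict(X)
    topics = {t: [w for w, l in zip(kept, labels) if l == t] for t in range(pr_topics)}
    # Example phrases: a single topic; drop phrases that are substrings of longer phrases.
    examples = [p for p in keyphrases
                if not any(p != q and p in q for q in keyphrases)]
    return topics, examples
```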
We configure experiments to test two hypotheses: H1) the AES_rubric model for scoring Evidence (Zhang and Litman, 2017) will perform comparably when extracting features using either TC_attn or TC_manual, and will perform worse when using TC_lda or TC_pr; H2) the correlation between the human Evidence score and the feature values (NPE and the sum of the SPC features, both of which are extracted based on TCs) will be comparable when extracted using TC_attn and TC_manual, and will be stronger than when using TC_lda and TC_pr. The experiment for H1 tests the impact of our proposed TC extraction method on the downstream AES_rubric task, while the H2 experiment examines the impact on the essay representation itself.
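Both evaluation measures are standard and could be computed as in the sketch below; the score arrays are placeholders, not results from the paper.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

human = [3, 2, 4, 1, 3, 2]          # placeholder human Evidence scores (1-4)
predicted = [3, 2, 3, 1, 4, 2]      # placeholder AES_rubric predictions
npe_values = [5, 3, 6, 1, 4, 3]     # placeholder NPE feature values

# H1: Quadratic Weighted Kappa between human and predicted Evidence scores.
qwk = cohen_kappa_score(human, predicted, weights="quadratic")

# H2: Pearson correlation between a feature value and the human Evidence score.
r, p_value = pearsonr(npe_values, human)
print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f} (p = {p_value:.3f})")
```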
Following Zhang and Litman (2017), we stratify the essay corpora: 40% for training word embeddings and extracting TCs, 20% for selecting the best embedding and parameters, and 40% for testing. We use the hyper-parameters from Zhang and Litman (2018) for neural training, as shown in Table 4. Table 5 shows all other parameters, which were selected using the development set.
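A 40/20/40 stratified split can be produced with two calls to scikit-learn's train_test_split, as in this sketch; the essay and score arrays and the helper name are placeholders.

```python
from sklearn.model_selection import train_test_split

def split_40_20_40(essays, scores, seed=0):
    # First peel off the 40% training portion, stratified by Evidence score.
    train_x, rest_x, train_y, rest_y = train_test_split(
        essays, scores, train_size=0.4, stratify=scores, random_state=seed)
    # Split the remaining 60% into a 20% development set and a 40% test set.
    dev_x, test_x, dev_y, test_y = train_test_split(
        rest_x, rest_y, train_size=1/3, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (dev_x, dev_y), (test_x, test_y)
```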
Results for H1. H1 is supported by the results in Table 6, which compares the Quadratic Weighted Kappa (QWK) between human and AES_rubric Evidence scores (values 1-4) when AES_rubric uses TC_manual versus each of the automatic methods. TC_attn always yields the best performance among the automated methods, and is even significantly better than TC_manual.
Results for H2. The results in Table 7 support H2. TC_attn outperforms the two automated baselines, and for NPE it even yields stronger correlations than the manual TC method.
Qualitative Analysis. The manually-created topic words for RTA_MVP represent 4 topics: "hospital", "malaria", "farming", and "school". Although Table 5 shows that the automated lists have more topics and might have split one topic into several, a good automated list should contain more topics related to the 4 topics above. We manually assign a topic to each set of topic words from the different automated methods. TC_lda has 4 related topics out of 9 (44.44%), TC_pr has 6 related topics out of 19 (31.58%), and TC_attn has 10 related topics out of 16 (62.50%). Clearly, TC_attn preserves more related topics than our baselines.
Moving to the second aspect of TCs (specific example phrases), Table 8 shows the first 10 specific example phrases for a manually-created category that describes the changes made by the MVP project. This category is a mixture of different topics because it discusses the "hospital", "malaria", "school", and "farming" at the same time.
TC_attn overlaps with TC_manual across these different topics. However, TC_lda mainly talks about "hospital", because the nature of the LDA model does not allow specific example phrases about different topics to be mixed within one category. Unfortunately, TC_pr does not include any overlapping specific phrase among its first 10 items; they all refer to general example phrases from the beginning of the source article. Although there are some related specific example phrases in the full list, they are mainly about school. This is because the PositionRank algorithm tends to assign higher scores to words that appear early in the text.

Conclusion and Future Work
This paper proposes TC_attn, a method that uses the attention scores of a neural AES model to automatically extract the Topical Components of a source text. Evaluations show the potential of TC_attn for eliminating expert effort without degrading AES_rubric performance or the feature representations themselves. TC_attn outperforms baselines and generates results comparable to or even better than the manual approach.
Although TC_attn outperforms all baselines and requires no human effort for TC extraction, annotation of essay Evidence scores is still needed to train the neural AES. This suggests an interesting direction for future work: training AES_neural with a gold standard that can itself be extracted automatically.
One of our next steps is to investigate the impact of TC extraction methods on a corresponding AWE system (Zhang et al., 2019), which uses the feature values produced by AES_rubric to generate formative feedback to guide essay revision.
Currently, TC_lda is trained on student essays only, while TC_pr works only on the source article; TC_attn, in contrast, uses both student essays and the source article for TC generation. It is therefore hard to attribute the superior performance of TC_attn to the neural architecture and attention scores rather than to the richer training resources. A comparison between TC_attn and a model that also uses both student essays and the source article is therefore needed.

A.1 Topic Words Results
Table 9 shows all topic words for RTA_MVP from TC_manual. Table 10 shows all topic words for RTA_MVP from TC_lda. Table 11 shows all topic words for RTA_MVP from TC_pr. Table 12 shows all topic words for RTA_MVP from TC_attn.

A.2 Specific Example Phrases Results
Table 13 shows all specific example phrases for RTA_MVP from TC_manual. Table 14 shows all specific example phrases for RTA_MVP from TC_lda. Table 15 shows all specific example phrases for RTA_MVP from TC_pr. Table 16 shows all specific example phrases for RTA_MVP from TC_attn.

Figure 1: An overview of the four TC extraction systems.

Table 1: A source excerpt for the RTA_MVP prompt and an essay with a score of 3.

Table 2: The Evidence score distribution of the RTA.

Table 3: Example attention scores of essay sentences.

Table 4: Hyper-parameters for neural training.

Table 5: Parameters for different models.

Table 6: The performance (QWK) of AES_rubric using different TC extraction methods for feature creation. The numbers in parentheses show the model numbers over which the current model performs significantly better (p < 0.05). The best results among the automated methods in each row are in bold.

Table 7: Pearson's r comparing feature values computed using each TC extraction method with human (gold-standard) Evidence essay scores. All correlation values are significant (p ≤ 0.05). The best results among the automated methods in each row are in bold.

Table 8: Specific example phrases for the RTA_MVP progress topic.

Table 9: Topic words of TC_manual.

Table 14: Specific example phrases of TC_lda.

Table 15: Specific example phrases of TC_pr.

Table 16: Specific example phrases of TC_attn.