Multilingual Short Text Responses Clustering for Mobile Educational Activities: a Preliminary Exploration

Text clustering is a powerful technique to detect topics from document corpora, so as to provide information browsing, analysis, and organization. On the other hand, the Instant Response System (IRS) has been widely used in recent years to enhance student engagement in class and thus improve their learning effectiveness. However, the lack of functions to process short text responses from the IRS prevents the further application of IRS in classes. Therefore, this study aims to propose a proper short text clustering module for the IRS, and demonstrate our implemented techniques through real-world examples, so as to provide experiences and insights for further study. In particular, we have compared three clustering methods and the result shows that theoretically better methods need not lead to better results, as there are various factors that may affect the final performance.


Introduction
The development of Natural Language Processing (NLP) has advanced to a level that affects the research landscape of academic domains and has practical applications in various industrial sectors. Meanwhile, the educational environment has also changed in ways that affect society worldwide, such as the emergence of MOOCs (Massive Open Online Courses), and new learning tools and teaching paradigms have also changed the way classes interact, such as the use of Classroom Response Systems (CRS) (Siau et al., 2006). Advances in these two fields have converged to support online or on-site course activities that were previously infeasible, such as real-time understanding of students' responses (Beatty and Gerace, 2009) and mobile language learning (Cardoso, 2010).
Research issues in this direction have gained more and more attention (Hearst, 2015). Examples include the workshops on Innovative Use of NLP for Building Educational Applications (BEA) since 2003 and the workshops on Natural Language Processing Techniques for Educational Applications (NLPTEA) since 2014, where the former have been held in North America, mainly for English or western languages, while the latter have been held in Asia, mainly for Chinese or oriental languages.
NLP for educational applications not only concerns the academic community, but also has great potential in the educational market. Online writing evaluation (or automated essay scoring) services such as ETS's Criterion and plagiarism identification services such as Turnitin have established their market share. However, these successful services are built upon mature educational activities and deal with relatively long articles or complete sentences for reliable performance. In contrast, the processing of short texts (sub-sentences, non-sentences, or even a few terms) is under-developed for novel educational applications.
Electronic classroom response systems (CRS), also called instant response systems (IRS) or clickers, have been tested and used in higher education classrooms since the 1960s (Deal, 2007). According to a CNET report (Gilbert, 2005), schools and universities, mostly in the United States, bought nearly a million clickers in 2004 alone, using infrared or radio frequency technology for students' transmitters. This number accumulated to nearly nine million units in under a decade for just two of the many companies that make clickers (Hoffman, 2012). Recently, IRS has gained even greater popularity in class interaction (Bartsch and Murphy, 2011; Han, 2014; Morais et al., 2015) due to the ubiquitous availability of personal mobile devices and of cloud-based technology that eases data collection and integration. IRS services in Taiwan such as Zuvio (http://www.zuvio.com.tw/) have attracted local university users in a short time because they are easier to use than traditional transmitter-required IRS and LMSs (Learning Management Systems) such as the Moodle App. For example, over the course of 2014, Zuvio usage at National Taiwan University (NTU) increased from 61 to 263 instructors, 68 to 384 courses, and 2,037 to 11,172 students (Lee and Shih, 2015).
By broadcasting a question to all students' mobile devices and getting responses instantly, such systems help teachers better know the learning status of each student and also help students maintain their attention during the class, thanks to the instant feedback from the teachers and/or their classmates (Bartsch and Murphy, 2011; Beatty and Gerace, 2009). However, the potential of such IRS may still be under-explored (Chien and Chang, 2015a). In the above NTU case, although the majority (54%) of questions deployed in Zuvio were multiple-choice, many instructors also used open-ended questions (20%) and composite questions (21%) to promote deeper engagement and reflection (Lee and Shih, 2015). Previous studies even indicated that multiple-choice examinations pose an obstacle to higher-level thinking in science classes (Stanger-Hall, 2012) and that constructed response (e.g. free text writing) assessments are widely viewed as providing greater insight into student thinking than closed form (e.g. multiple-choice) assessments (Birenbaum and Tatsuoka, 1987).
However, to the best of our knowledge, no IRS has yet provided analysis of these open-ended text responses in real time. By applying NLP techniques to the IRS or similar mobile interaction systems where only short text interaction is feasible, more information for the students could be provided and therefore more meaningful engagement and more efficient learning could be achieved (Chien and Chang, 2015b).
Based on the above trends and observations, this study aims at developing related NLP techniques applicable to the current and future educational environment. More specifically, this paper focuses on the short text response processing in the situation where some forms of instant response systems (IRS) are used in and after the class.

Short Text Response Clustering
As our purpose is to support IRS-related educational activities, an existing IRS is used for integrating the techniques to be developed, so that we can focus on the required new functions without re-inventing the wheel. We choose CloudClassRoom (CCR, http://ccr.tw/) because it is developed by the team of our collaborators (Chien and Chang, 2015a) and because it supports at least 12 languages for international use. This choice facilitates our testing and evaluation of the developed techniques. However, we keep in mind that the techniques to be developed should be independent of the CCR system, so that they can be ported to another IRS easily. In fact, CCR is developed in jQuery and PHP, while the NLP techniques to be developed mainly use Python.
Once we have an IRS platform, we can package the required techniques into one of the IRS's modules to meet the research purposes. Figure 1 shows a series of processing steps packaged into a Semantic Processing Module (SPM), where each rectangular box denotes a processing sub-module and each cylinder denotes a set of language knowledge, corpora, resources, or technical options.
The first row in the figure mainly deals with refining the terms from the response texts, which heavily depends on language knowledge and resources. The second row deals with the semantic processing of the texts, which is basically language independent, except for the term expansion step. This pipeline structure is needed because there are many options in processing texts for a certain task in a certain language. At our early stage of development, each step has options for selection by teachers or by NLP experts to best suit the educational activities in a certain course. At a later stage, we expect that the SPM should learn the options without human selection. For example, in general cases the tokenization needs to transform all different numbers into a single numeric symbol for semantic clustering, but it should leave the numbers intact in courses such as mathematics, where exact numbers from students are expected for accurate processing. The same applies to the morphological step, where lowercasing and stemming are applied for English semantic processing in general cases, but the morphological analysis should be turned off when, e.g., English is taught, or the expected answers are the exact terms used by the students. This consideration would optimize the SPM for each educational activity, but may require years of fine-tuning as more and more activities are encountered in real-world applications. To date, CCR has at least 4,780 registered teachers, 11,784 classrooms established, 23,376 questions asked, and 248,633 responses received, which makes it a valuable resource for NLP experiments and applications.
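The per-course options described above can be sketched as a small configurable tokenizer. This is only an illustration under stated assumptions: the option names (`normalize_numbers`, `lowercase`, `number_symbol`) and the `<NUM>` symbol are hypothetical and are not taken from the SPM, Chinese text is split into single characters rather than segmented with a lexicon, and stemming is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class TokenizerOptions:
    # Hypothetical option names; the SPM's actual settings are not specified here.
    normalize_numbers: bool = True  # map every number to one numeric symbol
    lowercase: bool = True          # fold English case
    number_symbol: str = "<NUM>"    # assumed placeholder symbol


# Numbers, runs of Latin letters, or single CJK characters; punctuation is skipped.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|[\u4e00-\u9fff]")


def tokenize(text, options):
    """Split text into tokens and apply the selected normalization options."""
    tokens = []
    for token in TOKEN_PATTERN.findall(text):
        if token[0].isdigit():
            tokens.append(options.number_symbol if options.normalize_numbers else token)
        elif token[0].isascii() and options.lowercase:
            tokens.append(token.lower())
        else:
            tokens.append(token)
    return tokens


general_course = TokenizerOptions()
math_course = TokenizerOptions(normalize_numbers=False)
print(tokenize("The answer is 42", general_course))
print(tokenize("The answer is 42", math_course))
```

With the general options every number collapses into the same symbol, while the mathematics setting keeps the exact digits; a language course could likewise set `lowercase=False` to keep the students' exact wording.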

Demonstration
To give a concrete idea of the texts submitted by students via CCR, Table 1 shows a set of real-world texts in response to the question asked by a Taiwan university teacher of General Education of Science: "As a marine researcher, if someone presents the photos shown in Figure
As can be seen from Table 1, there are several characteristics in the students' responses: 1) meaningless punctuation, e.g., ID 3; 2) multilingual texts: English responses even in a Chinese class, e.g., ID 4; 3) nonsense responses, e.g., ID 15, 24, 29, etc.; 4) very short texts, e.g., ID 5, 11, 27, etc.; and 5) non-topical texts, e.g., the last part of ID 25, where the student asks for a prize promised by the teacher, who encouraged the students to respond actively to the question by offering a USB storage device as a prize.
Characteristic 1 can be removed at the tokenization stage. Characteristic 2 could be handled by simple word-by-word translation (by way of multilingual lexicons or multilingual WordNets, such as BabelNet), by translation tools such as Goslate, or by customized machine translation techniques (Chuang and Tseng, 2008; Tseng et al., 2011). For Characteristic 4, the texts can be expanded with synonym lexicons or multilingual WordNets to enrich the textual information. However, although we have the eHownet resources from the ACLCLP (Association for Computational Linguistics and Chinese Language Processing), there is no guarantee that the synonyms or hypernyms in eHownet cover the terms used in a class like the one above. After this preprocessing, Characteristics 2, 3, 4, and 5 require an effective text clustering technique to distinguish such responses from the normal meaningful responses, so that the teacher can decide what to do with the improper responses. Once they can be isolated in real time, the teacher can, for example, ask the corresponding students to resubmit their responses, or preset the system to prevent these texts from being submitted by the students.
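As a rough illustration of the word-by-word translation and term expansion mentioned above, the sketch below uses two tiny hand-made dictionaries. They are toy data for this example only and are not entries from eHownet, BabelNet, or any other real resource; the function names are likewise hypothetical.

```python
# Toy lexicons for illustration only; NOT taken from eHownet or BabelNet.
TRANSLATION_LEXICON = {"teeth": "牙齒", "food": "食物"}
RELATED_TERMS = {"深海": ["海洋"], "食物": ["肉食性"]}


def translate_terms(terms, lexicon):
    """Word-by-word translation: replace a term when the lexicon has an entry."""
    return [lexicon.get(term.lower(), term) for term in terms]


def expand_terms(terms, lexicon):
    """Append the related terms of each term, keeping order and dropping duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + lexicon.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


response = ["Teeth", "深海", "food"]
translated = translate_terms(response, TRANSLATION_LEXICON)
print(expand_terms(translated, RELATED_TERMS))
```

Terms missing from a lexicon pass through unchanged, which mirrors the coverage concern raised above: expansion only helps for terms the resource actually contains.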
To get an idea of how well existing clustering techniques can do for these texts, we have tried three approaches: (1) Hierarchical Agglomerative Clustering (HAC) based on a hybrid way of term indexing, namely lexicon-based segmentation followed by keyword extraction using the method of (Tseng, 1998, 2002; Tseng et al., 2010b), implemented in a well-debugged tool called CATAR (Tseng, 2010a; Tseng and Tsay, 2013), as shown in Figure 3.
(2) Latent Semantic Analysis (LSA) based on word segmentation by jieba and the topic modeling tool gensim, without removal of any stopwords or punctuation, as shown in Figure 4.
(3) Latent Dirichlet Allocation (LDA) by jieba and gensim, with stopwords and punctuation removed, as shown in Figure 5.
In Figure 3, based on HAC, there are 3 multi-document clusters and 16 singleton clusters. The result is generally reasonable; only a few texts, such as ID 23 and 31, could not be grouped with other similar texts. This is because a rigorous criterion, complete-linkage clustering, is imposed on the HAC, so that ID 31 was not merged into Cluster 3 even though it contains the salient term "牙齒" of Cluster 3. Also, the lexicon-based segmentation regards "深海", "海洋", and "海洋深層" as different terms, so they are totally different features for text clustering. These two reasons may also apply to other terms and texts, such as "光源" (ID 23), "感光" (ID 33), "暗海" (ID 35), and "夜行性" (ID 25), or "食物" (ID 6) and "肉食性" (ID 3, 33, and 35).
To improve the performance so that the texts containing these semantically related terms are clustered together, LSA or LDA would seem to be better solutions, as past studies have shown the possibility (Blei et al., 2003; Deerwester et al., 1990). Based on the HAC result, there are 3-5 clusters in this case, so we cluster the responses using 5 topics with LSA and LDA. In fact, about 5 clusters for each set of responses is a proper choice for science education, according to the feedback of our co-investigator. However, Figures 4 and 5 show that LSA and LDA alone do not solve this short-text clustering problem better, and they can sometimes lead to worse results. In addition to the shortage of textual information (short texts), other factors influence the performance, such as feature extraction (whether to use 1-grams such as "海" and "光" as features in Chinese short texts) and term expansion (whether to incorporate term-level similarity, such as that between "感光" and "夜行性" or between "食物" and "肉食性", into text clustering). Furthermore, these decisions may depend on the characteristics of the questions asked or the classes taught.
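The effect of the complete-linkage criterion discussed above can be reproduced on a toy example. The sketch below is not the CATAR implementation: it assumes Jaccard similarity over term sets and a fixed similarity threshold, and the four documents are invented for illustration rather than taken from Table 1.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def complete_linkage_hac(docs, threshold):
    """Merge clusters while the LEAST similar cross-cluster pair stays >= threshold."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Complete linkage: cluster similarity is the minimum pairwise similarity.
                sim = min(jaccard(docs[i], docs[j])
                          for i in clusters[x] for j in clusters[y])
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Invented toy responses (term lists), not the data of Table 1.
docs = [
    ["深海", "牙齒"],
    ["深海", "牙齒", "魚"],
    ["牙齒", "肉食性", "食物", "夜行性"],
    ["海洋"],
]
print(complete_linkage_hac(docs, threshold=0.3))
```

In this toy run the third document shares "牙齒" with the first two but stays a singleton, because complete linkage requires it to be similar enough to every member of the cluster. The last document stays apart because "海洋" and "深海" are unrelated features unless term expansion or a different feature extraction links them.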
Therefore, we propose the pipeline SPM in Figure 1 to deal with this problem, so that in each step we could choose proper options for better performance.
To incorporate more semantic information into the SPM, we plan to use language resources such as eHownet, WordNet, and BabelNet for Chinese, English, and multilingual synonym expansion, respectively. Our future study will also use tools like word2vec (Mikolov et al., 2013) and the concept map miner (Tseng et al., 2010; Tseng et al., 2012) to extract paradigmatically and/or topically similar terms for term expansion (Tseng et al., 2010). In addition to term expansion, utilization of the contextual information of the short texts can be enhanced by machine translation (Tang et al., 2012). Direct clustering based on continuous distributed representations of words, sentences, or paragraphs (Chinea-Rios et al., 2015; Mikolov et al., 2013) may also be worth exploring. As is traditional in NLP research, further study will try all the promising combinations of the mentioned techniques to see which combinations perform best under which conditions.
As to clustering performance evaluation, there are intrinsic and extrinsic measures, where the former measure the clustering quality directly and the latter measure the quality indirectly by applying the clustering result to another task and seeing whether a good result can be obtained from that task. For intrinsic evaluation, measures like perplexity, the Rand index, and the Silhouette index have been used, and we have implemented the latter two measures (Rand and Silhouette) in CATAR to help determine the number of clusters (Tseng, Lin, and Lin, 2007; Tseng and Tsay, 2013). Extrinsic evaluation, which is more suitable for IRS applications, depends on how the teacher would like the clustering results. Therefore, our strategy is to implement different clustering techniques and intrinsic evaluation measures to suggest various clustering results, from which the teachers can choose a proper one. Until then, we assist the teachers in quickly understanding a clustering result by providing some intrinsic evaluation result, i.e., the cluster descriptors as shown in Figure 3. In this way, we help the teachers explore the students' responses in a period of time short enough for their lecturing activities using the IRS.
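For reference, the Silhouette index mentioned above can be computed as in the following sketch. It follows the standard definition (for each item, a is the mean distance to its own cluster, b is the lowest mean distance to another cluster, and the score is (b - a) / max(a, b)); it is not the CATAR code, and the one-dimensional points are made up for illustration.

```python
def silhouette_index(distance, labels):
    """Mean silhouette over all items, given a square matrix of pairwise distances."""
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)
    if len(members) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = members[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the item's own cluster
        a = sum(distance[i][j] for j in own if j != i) / (len(own) - 1)
        # b: lowest mean distance to the members of any other cluster
        b = min(sum(distance[i][j] for j in other) / len(other)
                for other_label, other in members.items() if other_label != label)
        denominator = max(a, b)
        total += 0.0 if denominator == 0 else (b - a) / denominator
    return total / len(labels)


# Made-up points on a line with two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
distance = [[abs(p - q) for q in points] for p in points]
good = silhouette_index(distance, [0, 0, 1, 1])
bad = silhouette_index(distance, [0, 1, 0, 1])
print(good > bad)
```

Running the same function over clusterings with different numbers of clusters and keeping the highest score is one common way to suggest a cluster count, which could then be offered to the teacher as one candidate among several.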

Conclusions
This paper describes our preliminary study of short text response clustering for mobile educational activities. We illustrate the characteristics of short text responses from the IRS, propose the SPM module for processing short texts, and demonstrate our implemented techniques via the CCR system. We also compare three clustering methods, and the results show that theoretically better methods need not lead to better results, as there are various factors that may affect the final performance.
In real-case applications, the SPM module based on the LSA technique has been used online for two years, serving thousands of teachers. Informal evaluation from the responses of teachers, including those in Taiwan and Thailand, has shown that the proposed short-text clustering is applicable to their educational activities.