International Conference on Natural Language Processing (2020)


Proceedings of the 17th International Conference on Natural Language Processing (ICON)
Pushpak Bhattacharyya | Dipti Misra Sharma | Rajeev Sangal

The WEAVE Corpus: Annotating Synthetic Chemical Procedures in Patents with Chemical Named Entities
Ravindra Nittala | Manish Shrivastava

The modern pharmaceutical industry depends on the iterative design of novel synthetic routes for drugs while not infringing on existing intellectual property rights. Such a design process calls for analyzing many existing synthetic chemical reactions and planning the synthesis of novel chemicals. These procedures have historically been available in unstructured raw text form in publications and patents. To facilitate automated analysis of synthetic chemical reactions and the design of novel synthetic reactions using Natural Language Processing (NLP) methods, we introduce a Named Entity Recognition (NER) dataset of the Examples sections of 180 full-text patent documents, with 5,188 synthetic procedures annotated by domain experts. All the chemical entities which are part of the synthetic discourse were annotated with suitable class labels. We present the second-largest chemical NER corpus, with 100,129 annotations, and the highest IAA value of 98.73% (F-measure) on a 45-document subset. We discuss this new resource in detail and highlight some specific challenges in annotating synthetic chemical procedures with chemical named entities. We make the corpus available to the community to promote further research and development of downstream NLP applications. We also provide baseline NER results for the community to improve on.

Increasing accuracy of a semantic word labelling tool based on a small lexicon
Hugo Sanjurjo-González

Semantic annotation has become an important source of information within corpus linguistics. This information is usually included for every lexical unit of the corpus, enabling a more exhaustive analysis of language. There are some resources, such as lexicons or ontologies, that allow this type of annotation. However, expanding these resources is a time-consuming task. This paper describes a simple NLP baseline for increasing the accuracy of the existing semantic resources of the UCREL Semantic Analysis System (USAS). In our experiments, Spanish token accuracy is improved by up to 30% using this method.

Treatment of optional forms in Mathematical modelling of Pāṇini
Anupriya Aggarwal | Malhar Kulkarni

Pāṇini in his Aṣṭādhyāyī has written the grammar of Sanskrit in an extremely concise manner in the form of about 4000 sūtras. We have attempted to mathematically remodel the data produced by these sūtras. The mathematical modelling is a way to show that the Pāṇinian approach is a minimal method of capturing the grammatical data for Sanskrit, which is a natural language. The sūtras written by Pāṇini can be written as functions, that is, for a single input the function produces a single output of the form y=f(x), where x and y are the input and output respectively. However, we observe that for some input dhātus, we get multiple outputs. For such cases, we have written multivalued functions, that is, functions which give two or more outputs for a single input. In other words, a multivalued function is a way to represent optional output forms, which are expressed in Pāṇinian grammar with the help of three terms, i.e. vā, vibhāṣā, and anyatarasyām. Comparison between the techniques employed by Pāṇini and our notation of functions helps us understand how Pāṇinian techniques ensure brevity and terseness, hence illustrating that Pāṇinian grammar is minimal.
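
To make the multivalued-function idea concrete, here is a minimal Python sketch; the dhātu-to-form mappings are illustrative placeholders rather than the paper's actual notation or data.

    def sutra_single(dhatu):
        """Ordinary sutra: one input maps to one output, y = f(x)."""
        table = {"bhu": "bhavati"}  # hypothetical single-valued mapping
        return table[dhatu]

    def sutra_optional(dhatu):
        """Multivalued sutra: optional forms (va / vibhasha / anyatarasyam)
        make one input map to two or more outputs."""
        table = {"svanj": {"svajate", "svanjate"}}  # hypothetical optional forms
        return table[dhatu]

    print(sutra_single("bhu"))       # one output: bhavati
    print(sutra_optional("svanj"))   # a set of optional outputs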

Automatic Hadith Segmentation using PPM Compression
Taghreed Tarmom | Eric Atwell | Mohammad Alsalka

In this paper we explore the use of Prediction by Partial Matching (PPM) compression to segment Hadith into its two main components (Isnad and Matan). The experiments utilized the PPMD variant of PPM, showing that PPMD is effective in Hadith segmentation. It was also tested on Hadith corpora of different structures. In the first experiment we used the non-authentic Hadith (NAH) corpus for training models and testing, and in the second experiment we used the NAH corpus for training models and the Leeds University and King Saud University (LK) Hadith corpus for testing the PPMD segmenter. PPMD of order 7 achieved an accuracy of 92.76% and 90.10% in the first and second experiments, respectively.
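
The intuition behind compression-based segmentation is to assign a span of text to whichever component's model predicts (compresses) it more cheaply. The toy sketch below uses zlib as a stand-in compressor, since the PPMD order-7 models used in the paper are not reproduced here, and the training strings are dummies.

    import zlib

    def code_length(model_text, segment):
        """Extra compressed bytes needed to append `segment` to text the
        model was primed on; zlib stands in for a trained PPMD model."""
        base = len(zlib.compress(model_text.encode("utf-8")))
        both = len(zlib.compress((model_text + segment).encode("utf-8")))
        return both - base

    def label_segment(segment, isnad_train, matan_train):
        """Assign the component whose training text encodes the segment best."""
        return ("Isnad" if code_length(isnad_train, segment)
                <= code_length(matan_train, segment) else "Matan")

    # Dummy training strings; the paper trains on annotated Hadith corpora.
    isnad_train = "narrated to us X from Y from Z " * 50
    matan_train = "the Prophet said do good and avoid harm " * 50
    print(label_segment("narrated to us A from B", isnad_train, matan_train))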

Using multiple ASR hypotheses to boost i18n NLU performance
Charith Peris | Gokmen Oz | Khadige Abboud | Venkata sai Varada Varada | Prashan Wanigasekara | Haidar Khan

Current voice assistants typically use the best hypothesis yielded by their Automatic Speech Recognition (ASR) module as input to their Natural Language Understanding (NLU) module, thereby losing helpful information that might be stored in lower-ranked ASR hypotheses. We explore the change in performance of NLU-associated tasks when utilizing the five best ASR hypotheses compared to the status quo for two language datasets, German and Portuguese. To harvest information from the ASR five-best list, we leverage extractive summarization and joint extractive-abstractive summarization models for Domain Classification (DC) experiments, while using a sequence-to-sequence model with a pointer generator network for Intent Classification (IC) and Named Entity Recognition (NER) multi-task experiments. For the DC full test set, we observe significant improvements of up to 7.2% and 15.5% in micro-averaged F1 scores for German and Portuguese, respectively. In cases where the best ASR hypothesis was not an exact match to the transcribed utterance (mismatched test set), we see improvements of up to 6.7% and 8.8% in micro-averaged F1 scores for German and Portuguese, respectively. For IC and NER multi-task experiments, when evaluating on the mismatched test set, we see improvements across all domains in German and in 17 out of 19 domains in Portuguese (improvements based on change in SeMER scores). Our results suggest that the use of multiple ASR hypotheses, as opposed to one, can lead to significant performance improvements in the DC task for these non-English datasets. In addition, it could lead to significant improvement in the performance of IC and NER tasks in cases where the ASR model makes mistakes.

A Grammatical Sketch of Asur: A North Munda language
Zoya Khalid

Asur belongs to the North Munda sub-branch of the Austro-Asiatic languages and now has fewer than 10,000 speakers. This is a very first attempt at describing and documenting the Asur language; therefore, the approach of this paper is descriptive rather than that of answering research questions. The paper attempts to describe grammatical features such as gender, number, case, pronouns, tense-aspect-mood, negation, question formation, etc. of the Asur language. It briefly touches upon the morphosyntactic and typological features of Asur, with the intent to present a concise overview of the language, which has so far remained almost untouched by documentary linguistics.

English to Manipuri and Mizo Post-Editing Effort and its Impact on Low Resource Machine Translation
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay | Mihaela Vela | Josef van Genabith

We present the first study on the post-editing (PE) effort required to build a parallel dataset for English-Manipuri and English-Mizo, in the context of a project on creating data for machine translation (MT). English source text from a local daily newspaper is machine translated into Manipuri and Mizo using PBSMT systems built in-house. A Computer Assisted Translation (CAT) tool is used to record the time, keystrokes and other indicators to measure PE effort in terms of temporal and technical effort. A positive correlation between the technical effort and the number of function words is seen for English-Manipuri and English-Mizo, but a negative correlation between the technical effort and the number of noun words for English-Mizo. However, the average time spent per token in PE of English-Mizo text is negatively correlated with the temporal effort. The main reasons for these results are (i) English and Mizo using the same script, while Manipuri uses a different script, and (ii) the agglutinative nature of Manipuri. Further, we check the impact of training an MT system in an incremental approach, by including the post-edited dataset as additional training data. The result shows an increase in HBLEU of up to 4.6 for English-Manipuri.

Learning to Interact: An Adaptive Interaction Framework for Knowledge Graph Embeddings
. Chandrahas | Nilesh Agrawal | Partha Talukdar

Knowledge Graph (KG) Embedding methods have been widely studied in the past few years and many methods have been proposed. These methods represent entities and relations in the KG as vectors in a vector space, trained to distinguish correct edges from incorrect ones. For this distinction, simple functions of the vectors' dimensions, called interactions, are used. These interactions are used to calculate the candidate tail entity vector, which is matched against all entities in the KG. However, for most of the existing methods, these interactions are fixed and manually specified. In this work, we propose an automated framework for discovering the interactions while training the KG Embeddings. The proposed method learns relevant interactions along with other parameters during training, allowing it to adapt to different datasets. Many of the existing methods can be seen as special cases of the proposed framework. We demonstrate the effectiveness of the proposed method on the link prediction task through extensive experiments on multiple benchmark datasets.
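
For context, a fixed, manually specified interaction of the kind described above is DistMult's elementwise product; the sketch below uses toy numpy embeddings and is not the paper's learned framework.

    import numpy as np

    rng = np.random.default_rng(0)
    E = rng.normal(size=(1000, 64))    # toy entity embedding matrix
    h, r = E[42], rng.normal(size=64)  # a head entity and a relation vector

    t_candidate = h * r                # fixed interaction (DistMult-style)
    scores = E @ t_candidate           # match against all entities in the KG
    print(scores.argsort()[::-1][:5])  # indices of top-5 candidate tails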

Inducing Interpretability in Knowledge Graph Embeddings
. Chandrahas | Tathagata Sengupta | Cibi Pragadeesh | Partha Talukdar

We study the problem of inducing interpretability in Knowledge Graph (KG) embeddings. Learning KG embeddings has been an active area of research in the past few years, resulting in many different models. However, most of these methods do not address the interpretability (semantics) of individual dimensions of the learned embeddings. In this work, we study this problem and propose a method for inducing interpretability in KG embeddings using entity co-occurrence statistics. The proposed method significantly improves the interpretability, while maintaining comparable performance in other KG tasks.

Solving Arithmetic Word Problems Using Transformer and Pre-processing of Problem Texts
Kaden Griffith | Jugal Kalita

This paper outlines the use of Transformer networks trained to translate math word problems to equivalent arithmetic expressions in infix, prefix, and postfix notations. We compare results produced by a large number of neural configurations and find that most configurations outperform previously reported approaches on three of four datasets with significant increases in accuracy of over 20 percentage points. The best neural approaches boost accuracy by 30% on average when compared to the previous state-of-the-art.
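
For readers unfamiliar with the three target notations, a small shunting-yard conversion illustrates the kind of expression such a network is trained to emit; the example expression is made up.

    def infix_to_postfix(tokens):
        """Shunting-yard: convert an infix token list to postfix (RPN)."""
        prec = {"+": 1, "-": 1, "*": 2, "/": 2}
        out, ops = [], []
        for tok in tokens:
            if tok in prec:
                while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[tok]:
                    out.append(ops.pop())
                ops.append(tok)
            elif tok == "(":
                ops.append(tok)
            elif tok == ")":
                while ops[-1] != "(":
                    out.append(ops.pop())
                ops.pop()
            else:
                out.append(tok)  # operand
        return out + ops[::-1]

    # "(8 + 2) * 3" in postfix; its prefix form would be ['*','+','8','2','3']
    print(infix_to_postfix(["(", "8", "+", "2", ")", "*", "3"]))
    # -> ['8', '2', '+', '3', '*']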

Clickbait in Hindi News Media : A Preliminary Study
Vivek Kaushal | Kavita Vemuri

A corpus of Hindi news headlines shared on Twitter was created by collecting tweets from 5 mainstream Hindi news sources over a period of 4 months. 7 independent annotators were recruited to mark the 20 most retweeted news posts from each of the 5 news sources for their clickbait nature. The clickbait score thus generated was assessed for its correlation with interactions on the platform (retweets, favorites, reader replies), tweet word count, and normalized POS (part-of-speech) tag counts in tweets. A positive correlation was observed between readers' interactions with tweets and tweets' clickbait scores. Significant correlations were also observed between POS tag counts and clickbait scores. The prevalence of clickbait in mainstream Hindi news media was found to be similar to its prevalence in English news media. We hope that our observations will provide a platform for discussions on clickbait in mainstream Hindi news media.

Self Attended Stack-Pointer Networks for Learning Long Term Dependencies
Salih Tuc | Burcu Can

We propose a novel deep neural architecture for dependency parsing, which is built upon a Transformer Encoder (Vaswani et al. 2017) and a Stack Pointer Network (Ma et al. 2018). We first encode each sentence using a Transformer network, and then the dependency graph is generated by a Stack Pointer Network by selecting the head of each word in the sentence through a head selection process. We evaluate our model on Turkish and English treebanks. The results show that our transformer-based model learns long-term dependencies efficiently compared to sequential models such as recurrent neural networks. Our self-attended stack pointer network improves the UAS score by around 6% over the LSTM-based stack pointer network (Ma et al. 2018) for Turkish sentences with a length of more than 20 words.

Creation of Corpus and Analysis in Code-Mixed Kannada-English Social Media Data for POS Tagging
Abhinav Reddy Appidi | Vamshi Krishna Srirangam | Darsi Suhas | Manish Shrivastava

Part-of-Speech (POS) tagging is one of the essential tasks for many Natural Language Processing (NLP) applications. There has been a significant amount of work done in POS tagging for resource-rich languages. POS tagging is an essential phase of text analysis in understanding the semantics and context of language. These tags are useful for higher-level tasks such as building parse trees, which can be used for Named Entity Recognition, Coreference Resolution, Sentiment Analysis, and Question Answering. There has been work done on code-mixed social media corpora, but not on POS tagging of Kannada-English code-mixed data. Here, we present a Kannada-English code-mixed social media corpus annotated with corresponding POS tags. We also experimented with machine learning classification models, namely CRF, Bi-LSTM, and Bi-LSTM-CRF, on our corpus.

Identifying Complaints from Product Reviews: A Case Study on Hindi
Raghvendra Pratap Singh | Rejwanul Haque | Mohammed Hasanuzzaman | Andy Way

Automatic recognition of customer complaints about products or services that they purchase can be crucial for organisations, multinationals and online retailers, since they can exploit this information to fulfil their customers' expectations, including managing and resolving the complaints. Recently, researchers have applied supervised learning strategies to automatically identify users' complaints expressed in English on Twitter. The downside of these approaches is that they require labeled training data for learning, which is expensive to create. This poses a barrier to their being applied to low-resource languages and domains for which task-specific data is not available. Machine translation (MT) can be used as an alternative to the tools that require such task-specific data. In this work, we use state-of-the-art neural MT (NMT) models for translating Hindi reviews into English and investigate the performance of the downstream classification task (complaint identification) on their English translations.

Generative Adversarial Networks for Annotated Data Augmentation in Data Sparse NLU
Olga Golovneva | Charith Peris

Data sparsity is one of the key challenges associated with model development in Natural Language Understanding (NLU) for conversational agents. The challenge is made more complex by the demand for high-quality annotated utterances commonly required for supervised learning, usually resulting in weeks of manual labor and high cost. In this paper, we present our results on boosting NLU model performance through training data augmentation using a sequential generative adversarial network (GAN). We explore data generation in the context of two tasks, the bootstrapping of a new language and the handling of low-resource features. For both tasks we explore three sequential GAN architectures, one with a token-level reward function, another with our own implementation of a token-level Monte Carlo rollout reward, and a third with a sentence-level reward. We evaluate the performance of these feedback models across several sampling methodologies and compare our results to upsampling the original data to the same scale. We further improve the GAN model performance through transfer learning of pre-trained embeddings. Our experiments reveal that synthetic data generated using a sequential generative adversarial network provides significant performance boosts across multiple metrics and can be a major benefit to NLU tasks.

BertAA : BERT fine-tuning for Authorship Attribution
Maël Fabien | Esau Villatoro-Tello | Petr Motlicek | Shantipriya Parida

Identifying the author of a given text can be useful in historical literature, plagiarism detection, or police investigations. Authorship Attribution (AA) has been well studied and mostly relies on large feature engineering work. More recently, deep learning-based approaches have been explored for Authorship Attribution (AA). In this paper, we introduce BertAA, a fine-tuning of a pre-trained BERT language model with an additional dense layer and a softmax activation to perform authorship classification. This approach reaches competitive performance on the Enron Email, Blog Authorship, and IMDb (and IMDb62) datasets, up to 5.3% (relative) above current state-of-the-art approaches. We performed an exhaustive analysis allowing us to identify the strengths and weaknesses of the proposed method. In addition, we evaluate the impact of including additional features (e.g. stylometric and hybrid features) in an ensemble approach, improving the macro-averaged F1-score by 2.7% (relative) on average.
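
A minimal PyTorch sketch of the architecture as described, i.e. a pre-trained BERT encoder with an additional dense layer and softmax over authors; the checkpoint name and class count are common Hugging Face defaults assumed for illustration, not taken from the paper.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class BertAASketch(nn.Module):
        """Pre-trained BERT + dense layer + softmax for author classification."""
        def __init__(self, num_authors):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.dense = nn.Linear(self.bert.config.hidden_size, num_authors)

        def forward(self, input_ids, attention_mask):
            pooled = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).pooler_output
            return torch.softmax(self.dense(pooled), dim=-1)

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertAASketch(num_authors=10)
    batch = tokenizer(["an email of unknown authorship"], return_tensors="pt",
                      truncation=True, padding=True)
    print(model(batch["input_ids"], batch["attention_mask"]).shape)  # (1, 10)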

Tree Adjoining Grammar Based Language Independent Generator
Pavan Kurariya | Prashant Chaudhary | Jahnavi Bodhankar | Lenali Singh | Ajai Kumar | Hemant Darbari

This paper proposes a language-independent natural language generator for a Tree Adjoining Grammar (TAG)[8] based Machine Translation System. In this model, the TAG based parsing and generation approach is considered for the syntactic and semantic analysis of a source language. This model provides an efficient and systematic way of encapsulating language resources with an engineering solution to develop the machine translation system. A TAG based generator is developed with existing resources using the TAG formalism to generate the target language from the TAG based parser derivation. The process allows syntactic feature marking, subject-predicate agreement marking and multiple synthesized generated outputs in complex and morphologically rich languages. The challenge in applying such an approach is to handle the linguistically diversified features. This is achieved using a rule-based translation grammar model to align the source language to the corresponding target languages. The computational experiments demonstrate that substantial performance in terms of time and memory can also be obtained by using this approach. Nevertheless, this paper also describes the process of lexicalization and explains the state charts, the TAG based adjunction and substitution functions, and the complexity and challenges beneath the parsing-generation process.

Exploration of Cross-lingual Summarization for Kannada-English Language Pair
Vinayaka R Kamath | Rachana Aithal K R | Vennela K | Mamatha Hr

Cross-lingual summarization (CLS) is the process of generating a summary in one particular language for a source document in a different language. Low-resource languages like Kannada greatly benefit from such systems because they help in delivering a concise representation of the same information in a different, popular language. We propose a novel dataset generation pipeline and a first-of-its-kind dataset that will aid in CLS for the Kannada-English language pair. This work is also an attempt to inspect the existing systems and extend them to the Kannada-English language pair using our dataset.

Hater-O-Genius Aggression Classification using Capsule Networks
Parth Patwa | Srinivas Pykl | Amitava Das | Prerana Mukherjee | Viswanath Pulabaigari

Contending with hate speech in social media is one of the most challenging social problems of our time. There are various types of anti-social behavior in social media. Foremost among them is aggressive behavior, which causes many social issues, such as affecting the social lives and mental health of social media users. In this paper, we propose an end-to-end ensemble-based architecture to automatically identify and classify aggressive tweets. Tweets are classified into three categories - Covertly Aggressive, Overtly Aggressive, and Non-Aggressive. The proposed architecture is an ensemble of smaller subnetworks that are able to characterize the feature embeddings effectively. We demonstrate qualitatively that each of the smaller subnetworks is able to learn unique features. Our best model is an ensemble of Capsule Networks and results in a 65.2% F1 score on the Facebook test set, a performance gain of 0.95% over the TRAC-2018 winners. The code and the model weights are publicly available at https://github.com/parthpatwa/Hater-O-Genius-Aggression-Classification-using-Capsule-Networks.

A New Approach to Claim Check-Worthiness Prediction and Claim Verification
Shukrity Si | Anisha Datta | Sudip Naskar

The more we advance towards a modern world, the more the path to falsification opens in every aspect of life. Even when aware of their surroundings, common people cannot judge the actual scenario, as the promises, comments and opinions of influential people in power keep changing every day. Therefore, computationally determining the truthfulness of such claims and comments has a very important societal impact. This paper describes a unique method to extract check-worthy claims from the 2016 US presidential debates and verify the truthfulness of the check-worthy claims. We classify the claims for check-worthiness with our modified Tf-Idf model, which is used in background training on fact-checking news articles (NBC News and Washington Post). We check the truthfulness of the claims by using POS, sentiment score and cosine similarity features.
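
A simplified sketch of the check-worthiness scoring idea, using scikit-learn's standard Tf-Idf (the paper's modified Tf-Idf weighting is not reproduced) and placeholder texts standing in for the fact-checking articles:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder background corpus standing in for NBC News / Washington
    # Post fact-checking articles.
    fact_check_articles = [
        "fact check the unemployment rate claim made in the debate",
        "checking the senator's statement on trade deficits",
    ]
    vec = TfidfVectorizer(stop_words="english").fit(fact_check_articles)
    background = vec.transform(fact_check_articles)

    def check_worthiness(claim):
        """Score a debate sentence by similarity to fact-checked content."""
        return float(cosine_similarity(vec.transform([claim]), background).max())

    for s in ["The unemployment rate has doubled.", "Thank you all for coming."]:
        print(s, "->", round(check_worthiness(s), 3))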

Improving Passage Re-Ranking with Word N-Gram Aware Coattention Encoder
Chaitanya Alaparthi | Manish Shrivastava

In text matching applications, coattention has proved to be a highly effective attention mechanism. Coattention enables a model to learn to attend based on word-level affinity scores computed between two texts. In this paper, we propose two improvements to the coattention mechanism in the context of passage ranking (re-ranking). First, we extend the coattention mechanism by applying it across all word n-grams of the query and passage. We show that these word n-gram coattentions can capture local context in the query and passage to better judge the relevance between them. Second, we further improve the model performance by proposing a query-based attention pooling on the passage encodings. We evaluate these two methods on the MS MARCO passage re-ranking task. The experimental results show that these two methods result in a relative increase of 8.04% in Mean Reciprocal Rank @10 (MRR@10) compared to the naive coattention mechanism. At the time of writing this paper, our methods are the best non-transformer model on the MS MARCO passage re-ranking task and are competitive with BERT base while having less than 10% of the parameters.
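
The core of coattention is a word-level affinity matrix between the encodings of the two texts. Below is a numpy sketch of the basic (unigram) mechanism, without the paper's n-gram extension or learned projections:

    import numpy as np

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def coattention(Q, P):
        """Q: (m, d) query word encodings; P: (n, d) passage word encodings.
        Returns a query-context vector for each passage word."""
        A = Q @ P.T                  # (m, n) word-level affinity scores
        attn = softmax(A, axis=0)    # each passage word attends over the query
        return attn.T @ Q            # (n, d)

    rng = np.random.default_rng(0)
    Q, P = rng.normal(size=(4, 16)), rng.normal(size=(9, 16))
    print(coattention(Q, P).shape)   # (9, 16)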

Language Model Metrics and Procrustes Analysis for Improved Vector Transformation of NLP Embeddings
Thomas Conley | Jugal Kalita

Artificial neural networks are mathematical models at their core. This truism presents some fundamental difficulty when networks are tasked with Natural Language Processing. A key problem lies in measuring the similarity or distance among vectors in NLP embedding space, since the mathematical concept of distance does not always agree with the linguistic concept. We suggest that the best way to measure linguistic distance among vectors is by employing the Language Model (LM) that created them. We introduce Language Model Distance (LMD) for measuring the accuracy of vector transformations based on the Distributional Hypothesis (LMD Accuracy). We show the efficacy of this metric by applying it to a simple neural network learning the Procrustes algorithm for bilingual word mapping.
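
The Procrustes mapping that the simple network learns has a closed-form solution via the SVD; here is a sketch on synthetic embeddings (LMD itself additionally requires the language model that created the vectors, which is not shown):

    import numpy as np

    def procrustes(X, Y):
        """Orthogonal W minimizing ||XW - Y||_F for row-aligned embeddings."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))                # toy source-language vectors
    W_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))
    Y = X @ W_true                                # target = rotated source
    print(np.allclose(X @ procrustes(X, Y), Y))   # recovers the mapping: True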

Cognitively Aided Zero-Shot Automatic Essay Grading
Sandeep Mathias | Rudra Murthy | Diptesh Kanojia | Pushpak Bhattacharyya

Automatic essay grading (AEG) is a process in which machines assign a grade to an essay written in response to a topic, called the prompt. Zero-shot AEG is when we train a system to grade essays written to a new prompt which was not present in our training data. In this paper, we describe a solution to the problem of zero-shot automatic essay grading, using cognitive information, in the form of gaze behaviour. Our experiments show that using gaze behaviour helps in improving the performance of AEG systems, especially when we provide a new essay written in response to a new prompt for scoring, by an average of almost 5 percentage points of QWK.

Automated Arabic Essay Evaluation
Abeer Alqahtani | Amal Alsaif

Although the manual evaluation of essays is a time-consuming process, writing essays has a significant role in assessing learning outcomes. Therefore, automated essay evaluation represents a solution, especially for schools, universities, and testing companies. Moreover, the existence of such systems overcomes some factors that influence manual evaluation such as the evaluator’s mental state, the disparity between evaluators, and others. In this paper, we propose an Arabic essay evaluation system based on a support vector regression (SVR) model along with a wide range of features including morphological, syntactic, semantic, and discourse features. The system evaluates essays according to five criteria: spelling, essay structure, coherence level, style, and punctuation marks, without the need for domain-representative essays (a model essay). A specific model is developed for each criterion; thus, the overall evaluation of the essay is a combination of the previous criteria results. We develop our dataset based on essays written by university students and journalists whose native language is Arabic. The dataset is then evaluated by experts. The experimental results show that 96% of our dataset is correctly evaluated in the overall score and the correlation between the system and the experts’ evaluation is 0.87. Additionally, the system shows varying results when evaluating the criteria separately.

Semantic Extractor-Paraphraser based Abstractive Summarization
Anubhav Jangra | Raghav Jain | Vaibhav Mavi | Sriparna Saha | Pushpak Bhattacharyya

The anthology of spoken languages today is inundated with textual information, necessitating the development of automatic summarization models. In this manuscript, we propose an extractor-paraphraser based abstractive summarization system that exploits semantic overlap, as opposed to its predecessors that focus more on syntactic information overlap. Our model outperforms the state-of-the-art baselines in terms of ROUGE, METEOR and word mover similarity (WMS), establishing the superiority of the proposed system via extensive ablation experiments. We have also challenged the summarization capabilities of the state-of-the-art Pointer Generator Network (PGN), and through thorough experimentation, shown that PGN is more of a paraphraser, contrary to the prevailing notion of it as a summarizer, illustrating its inability to accumulate information across multiple sentences.

ThamizhiUDp: A Dependency Parser for Tamil
Kengatharaiyer Sarveswaran | Gihan Dias

This paper describes how we developed a neural-based dependency parser, namely ThamizhiUDp, which provides a complete pipeline for the dependency parsing of Tamil language text using the Universal Dependencies formalism. We have considered the phases of the dependency parsing pipeline and identified tools and resources in each of these phases to improve the accuracy and to tackle data scarcity. ThamizhiUDp uses Stanza for tokenisation and lemmatisation, ThamizhiPOSt and ThamizhiMorph for generating Part of Speech (POS) and morphological annotations, and uuparser with multilingual training for dependency parsing. ThamizhiPOSt is our POS tagger, which is based on Stanza and trained with the Amrita POS-tagged corpus. It is the current state of the art in Tamil POS tagging, with an F1 score of 93.27. Our morphological analyzer, ThamizhiMorph, is a rule-based system with very good coverage of Tamil. Our dependency parser, ThamizhiUDp, was trained using multilingual data. It shows a Labelled Attachment Score (LAS) of 62.39, 4 points higher than the current best achieved for Tamil dependency parsing. Therefore, we show that breaking up the dependency parsing pipeline to accommodate existing tools and resources is a viable approach for low-resource languages.

Constructing a Korean Named Entity Recognition Dataset for the Financial Domain using Active Learning
Dong-Ho Jeong | Min-Kang Heo | Hyung-Chul Kim | Sang-Won Park

The performance of deep learning models depends on the quality and quantity of data. Data construction, however, is time-consuming and costly. In addition, when expert domain data are constructed, the availability of experts is limited. In such cases, active learning can efficiently increase the performance of the learning models with minimal data construction. Although various datasets have been constructed using active learning techniques, vigorous studies on the construction of Korean data for expert domains are yet to be conducted. In this study, a corpus for named entity recognition was constructed for the financial domain using the active learning technique. The contributions of the study are as follows: (1) it was verified that the active learning technique could effectively construct a named entity recognition corpus for the financial domain, and (2) a named entity recognizer for the financial domain was developed. Data of 8,043 sentences were constructed using the proposed method, and the performance of the named entity recognizer reached 80.84%. Moreover, the proposed method reduced data construction costs by 12–25%.

Self-Supervised Claim Identification for Automated Fact Checking
Archita Pathak | Mohammad Abuzar Shaikh | Rohini Srihari

We propose a novel, attention-based self-supervised approach to identify “claim-worthy” sentences in a fake news article, an important first step in automated fact-checking. We leverage the aboutness of the headline and the content using an attention mechanism for this task. The identified claims can be used for the downstream task of claim verification, for which we are releasing a benchmark dataset of manually selected compelling articles with veracity labels and associated evidence. This work goes beyond stylistic analysis to identifying content that influences reader belief. Experiments with three datasets show the strength of our model.

SUKHAN: Corpus of Hindi Shayaris annotated with Sentiment Polarity Information
Salil Aggarwal | Abhigyan Ghosh | Radhika Mamidi

Shayari is a form of poetry mainly popular in the Indian subcontinent, in which the poet expresses his emotions and feelings in a very poetic manner. It is one of the best ways to express our thoughts and opinions. Therefore, it is of prime importance to have an annotated corpus of Hindi shayaris for the task of sentiment analysis. In this paper, we introduce SUKHAN, a dataset consisting of Hindi shayaris along with sentiment polarity labels. To the best of our knowledge, this is the first corpus of Hindi shayaris annotated with sentiment polarity information. This corpus contains a total of 733 Hindi shayaris of various genres. Also, this dataset is of utmost value as all the annotation is done manually by five annotators and this makes it a very rich dataset for training purposes. This annotated corpus is also used to build baseline sentiment classification models using machine learning techniques.

Improving Neural Machine Translation for Sanskrit-English
Ravneet Punia | Aditya Sharma | Sarthak Pruthi | Minni Jain

Sanskrit is one of the oldest languages of the Asian subcontinent, having fallen out of common usage around 600 B.C. In this paper, we attempt to translate Sanskrit to English using Neural Machine Translation approaches based on Reinforcement Learning and Transfer Learning that were never tried and tested on Sanskrit. Along with the paper, we also release monolingual Sanskrit and parallel aligned Sanskrit-English corpora for the research community. Our methodologies outperform the previous approaches applied to Sanskrit by various researchers and will further help the linguistic community to accelerate the costly and time-consuming manual translation process.

Parsing Indian English News Headlines
Samapika Roy | Sukhada Sukhada | Anil Kumar Singh

Parsing news headlines is one of the difficult tasks of Natural Language Processing, mostly because news headlines (NHs) are not complete grammatical sentences. News editors use all sorts of tricks to grab readers’ attention, for instance, unusual capitalization as in the headline ‘Ear SHOT ashok rajagopalan’; some demand world knowledge, like ‘Church reformation celebrated’, where ‘Church reformation’ refers to a historical event and not a piece of news about an ordinary church. The lack of transparency in NHs can be linguistic, cultural, social, or contextual. The lack of space provided for a news headline has led to creative liberty. Though much work on NHs, like news value extraction, summary generation, and emotion classification, has been going on, parsing them has been a tough challenge. Linguists have also been interested in NHs for the creativity in the language used by bending traditional grammar rules. Researchers have conducted studies on news reportage, discourse analysis of NHs, and many more. While the creativity seen in NHs is fascinating for language researchers, it poses a computational challenge for Natural Language Processing researchers. This paper presents an outline of the ongoing doctoral research on the parsing of Indian English NHs. The ultimate aim of this research is to provide a module that will generate correctly parsed NHs. The intention is to enhance the broad applicability of newspaper corpora for future Natural Language Processing applications.

Word Sense Disambiguation for Kashmiri Language using Supervised Machine Learning
Tawseef Ahmad Mir | Aadil Ahmad Lawaye

Every language used in this world has ambiguous words. The process of analyzing word tokens and assigning the correct meanings to ambiguous words according to the context in which they are used is called Word Sense Disambiguation (WSD). WSD is a very hot research topic in Natural Language Processing. The main purpose of my research work is to tackle the WSD problem for the Kashmiri language using supervised machine learning approaches.

Sentimental Poetry Generation
Kasper Aalberg Røstvold | Björn Gambäck

The paper investigates how well poetry can be generated to contain a specific sentiment, and whether readers of the poetry experience the intended sentiment. The poetry generator consists of a bi-directional Long Short-Term Memory (LSTM) model, combined with rhyme pair generation, rule-based word prediction methods, and tree search for extending generation possibilities. The LSTM network was trained on a set of English poetry written and published by users on a public website. Human judges evaluated poems generated by the system, both with a positive and negative sentiment. The results indicate that while there are some weaknesses in the system compared to other state-of-the-art solutions, it is fully capable of generating poetry with an inherent sentiment that is perceived by readers.

WEKA in Forensic Authorship Analysis: A corpus-based approach of Saudi Authors
Mashael AlAmr | Eric Atwell

This is a pilot study that aims to explore the potential of using WEKA in forensic authorship analysis. It is corpus-based research using Twitter data collected from thirteen authors from Riyadh, Saudi Arabia. It examines the performance of unbalanced and balanced data sets using different classifiers and parameters of word grams. The attributes are dialect-specific linguistic features categorized as word grams. The findings further support previous studies in computational authorship identification.

Native-Language Identification with Attention
Stian Steinbakken | Björn Gambäck

The paper explores how an attention-based approach can increase performance on the task of native-language identification (NLI), i.e., to identify an author’s first language given information expressed in a second language. Previously, Support Vector Machines have consistently outperformed deep learning-based methods on the TOEFL11 data set, the de facto standard for evaluating NLI systems. The attention-based system BERT (Bidirectional Encoder Representations from Transformers) was first tested in isolation on the TOEFL11 data set, then used in a meta-classifier stack in combination with traditional techniques to produce an accuracy of 0.853. However, more labelled NLI data is now available, so BERT was also trained on the much larger Reddit-L2 data set, containing 50 times as many examples as previously used for English NLI, giving an accuracy of 0.902 on the Reddit-L2 in-domain test scenario, improving the state-of-the-art by 21.2 percentage points.

Does a Hybrid Neural Network based Feature Selection Model Improve Text Classification?
Suman Dowlagar | Radhika Mamidi

Text classification is a fundamental problem in the field of natural language processing. Text classification mainly focuses on giving more importance to all the relevant features that help classify the textual data. Apart from these, the text can have redundant or highly correlated features. These features increase the complexity of the classification algorithm. Thus, many dimensionality reduction methods were proposed for use with traditional machine learning classifiers, achieving good results. In this paper, we propose a hybrid feature selection method for obtaining relevant features by combining various filter-based feature selection methods and the fastText classifier. We then present three ways of implementing a feature selection and neural network pipeline. We observed a reduction in training time when feature selection methods are used along with neural networks. We also observed a slight increase in accuracy on some datasets.
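
One common way to wire a filter-based selector in front of a text classifier, sketched with scikit-learn's chi-squared filter; LogisticRegression stands in for the fastText classifier used in the paper, and the data is made up.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    clf = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("filter", SelectKBest(chi2, k=2)),  # keep only the top-k features
        ("model", LogisticRegression()),
    ])
    texts = ["good match great win", "terrible loss bad game",
             "great victory good play", "bad defeat awful match"]
    labels = ["pos", "neg", "pos", "neg"]
    clf.fit(texts, labels)
    print(clf.predict(["a great good win"]))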

Efforts Towards Developing a Tamang Nepali Machine Translation System
Binaya Kumar Chaudhary | Bal Krishna Bal | Rasil Baidar

The Tamang language is spoken mainly in Nepal, Sikkim, West Bengal, some parts of Assam, and the North East region of India. As per the 2011 census conducted by the Nepal Government, there are about 1.35 million Tamang speakers in Nepal itself. In this regard, a Machine Translation System for the Tamang-Nepali language pair is significant both for research and practical outcomes in terms of enabling communication between the Tamang and the Nepali communities. In this work, we train the Transformer Neural Machine Translation (NMT) architecture with attention using a small hand-labeled, aligned Tamang-Nepali corpus (15K sentence pairs). Our preliminary results show BLEU scores of 27.74 for the Nepali→Tamang direction and 23.74 for the Tamang→Nepali direction. We are currently working on increasing the datasets as well as improving the model to obtain better BLEU scores. We also plan to extend the work to add the English language to the model, thus making it a trilingual Machine Translation System for the Tamang-Nepali-English languages.

Event Argument Extraction using Causal Knowledge Structures
Debanjana Kar | Sudeshna Sarkar | Pawan Goyal

Event Argument Extraction refers to the task of extracting structured information from unstructured text for a particular event of interest. Existing works exhibit poor capabilities to extract causal event arguments like Reason and After Effects. Furthermore, most of the existing works model this task at a sentence level, restricting the context to a local scope. While it may be effective for short spans of text, for longer bodies of text such as news articles, it has often been observed that the arguments for an event do not necessarily occur in the same sentence as that containing an event trigger. To tackle the issue of argument scattering across sentences, the use of global context becomes imperative in this task. In our work, we propose an external knowledge aided approach to infuse document-level event information to aid the extraction of complex event arguments. We develop a causal network for our event-annotated dataset by extracting relevant event causal structures from ConceptNet and phrases from Wikipedia. We use the extracted event causal features in a bi-directional transformer encoder to effectively capture long-range inter-sentence dependencies. We report the effectiveness of our proposed approach through both qualitative and quantitative analysis. In this task, we establish our findings on an event-annotated dataset in 5 Indian languages. This dataset adds further complexity to the task by labeling arguments of entity type (like Time, Place) as well as more complex argument types (like Reason, After-Effect). Our approach achieves state-of-the-art performance across all the five languages. Since our work does not rely on any language-specific features, it can be easily extended to other languages as well.

Claim extraction from text using transfer learning.
Acharya Ashish Prabhakar | Salar Mohtaj | Sebastian Möller

Building an end-to-end fake news detection system consists of detecting claims in text and later verifying them for their authenticity. Although most of the recent works have focused on political claims, fake news can also be propagated in the form of religious intolerance, conspiracy theories, etc. Since there is a lack of training data specific to all these scenarios, we compiled a homogeneous and balanced dataset by combining some of the currently available data. Moreover, the paper shows how recent advancements in transfer learning can be leveraged to detect claims in general. The obtained results show that recently developed transformers can transfer the tendency of research from claim detection to the problem of check-worthiness of claims in domains of interest.

Assamese Word Sense Disambiguation using Genetic Algorithm
Arjun Gogoi | Nomi Baruah | Shikhar Kr. Sarma

Word Sense Disambiguation (WSD) is the problem of determining the sense of a word according to the context in which it occurs. There is a substantial amount of work done in WSD for some languages such as English, but research work on Assamese WSD remains limited. It is a more exigent task because Assamese has an intrinsic complexity in its writing structure and ambiguity, such as syntactic, semantic, and anaphoric ambiguity levels. A novel unsupervised genetic word sense disambiguation algorithm is proposed in this paper. The algorithm first uses WordNet to extract all possible senses for a given ambiguous word; then a genetic algorithm is applied, taking Wu-Palmer’s similarity measure as the fitness function and calculating the similarity measure for all extracted senses. The sense with the highest score is declared the winner sense.
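
The fitness computation at the heart of such an approach can be sketched with NLTK's English WordNet (the Assamese WordNet used in the paper is not bundled with NLTK); the full genetic search over sense assignments (selection, crossover, mutation) is omitted.

    # pip install nltk; then nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def fitness(sense, context_words):
        """Wu-Palmer similarity of a candidate sense against the best-matching
        senses of the context words (the GA's fitness function)."""
        total = 0.0
        for w in context_words:
            sims = [sense.wup_similarity(s) or 0.0 for s in wn.synsets(w)]
            total += max(sims, default=0.0)
        return total

    context = ["money", "loan", "deposit"]
    best = max(wn.synsets("bank"), key=lambda s: fitness(s, context))
    print(best.name(), "-", best.definition())  # expect the financial sense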

Free Word Order in Sanskrit and Well-nestedness
Sanal Vikram | Amba Kulkarni

The common wisdom about Sanskrit is that it is a free word order language. This word order poses challenges such as handling non-projectivity in parsing. Earlier works on the word order of Sanskrit have shown that there are syntactic structures in Sanskrit which cannot be covered even under non-planarity. In this paper, we study these structures further to investigate whether they can fall under well-nestedness or not. A small manually tagged corpus of the verses of the Śrīmad-Bhagavad-Gītā was considered for this study. It was noticed that there are as many well-nested trees as there are ill-nested ones. From the linguistic point of view, we could obtain a list of relations that are involved in the planarity violations. All these relations had one thing in common - they have unilateral expectancy. It was this loose binding, as against the mutual expectancy of certain other relations, that allowed them to cross phrasal boundaries.

A Multi-modal Personality Prediction System
Chanchal Suman | Aditya Gupta | Sriparna Saha | Pushpak Bhattacharyya

Automatic prediction of personality traits has many real-life applications, e.g., in forensics, recommender systems, personalized services, etc. In this work, we have proposed a solution framework for the problem of predicting the personality traits of a user from videos. Ambient, facial and audio features are extracted from the video of the user. These features are used for the final output prediction. The visual and audio modalities are combined in two different ways: averaging of predictions obtained from the individual modalities, and concatenation of features in a multi-modal setting. The dataset released in ChaLearn-16 is used for evaluating the performance of the system. Experimental results illustrate that it is possible to obtain better performance with a handful of images, rather than using all the images present in the video.

D-Coref: A Fast and Lightweight Coreference Resolution Model using DistilBERT
Chanchal Suman | Jeetu Kumar | Sriparna Saha | Pushpak Bhattacharyya

Smart applications are often deployed on edge devices, which require quality solutions within a limited amount of memory usage. In most user-interaction based smart devices, coreference resolution is often required. Keeping this in view, we have developed a fast and lightweight coreference resolution model which meets the minimum memory requirement and converges faster. In order to generate the embeddings for solving the task of coreference resolution, DistilBERT, a lightweight BERT module, is utilized. DistilBERT consumes less memory (only 60% of the memory of a comparable BERT-based heavy model) and is suitable for deployment in edge devices. DistilBERT embeddings help in 60% faster convergence with an accuracy compromise of 2.59% and 6.49% with respect to its base model and the current state of the art, respectively.
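
Loading DistilBERT as the lightweight embedding backbone is straightforward with the Hugging Face library; the coreference scoring layers that sit on top in the paper are not reproduced here.

    import torch
    from transformers import DistilBertModel, DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")

    text = "Sam bought a phone because he needed one."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vecs = model(**inputs).last_hidden_state[0]  # (seq_len, 768)

    # A coreference model would score mention-pair spans ("Sam", "he") over
    # these embeddings; only the embedding step is shown here.
    print(token_vecs.shape)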

Semantic Slot Prediction on low corpus data using finite user defined list
Bharatram Natarajan | Dharani Simma | Chirag Singh | Anish Nediyanchath | Sreoshi Sengupta

Semantic slot prediction is one of the important tasks for natural language understanding (NLU). It depends on the quality and quantity of human-crafted training data, which affects model generalization. With the advent of voice assistants exposing AI platforms to third-party developers, training data quality and quantity matter for any machine learning algorithm to learn and generalize properly. AI platforms provide a provision to add a custom external plist, defined by the developers, for the training data. Hence we explore a dataset, called LowCorpusSlotData, containing low corpus training data with a larger number of slots and significant test data. We also use an external plist for the above dataset to aid in slot identification. We experimented with state-of-the-art architectures like Bidirectional Encoder Representations from Transformers (BERT) with variants and a Bi-directional Encoder with Custom Decoder. To address the low corpus problem, we propose a pipeline approach where we extract candidate slot information using the external plist extractor module and feed it as input along with the utterance.

Leveraging Latent Representations of Speech for Indian Language Identification
Samarjit Karmakar | P Radha Krishna

Identification of the language spoken from speech utterances is an interesting task because of the diversity associated with different languages and human voices. Indian languages have diverse origins, and identifying them from speech utterances would help several language recognition, translation and relationship mining tasks. The current approaches to the problem of language identification in the Indian context heavily use feature engineering and classical speech processing techniques. This is a bottleneck for language identification systems, as the features in speech necessary for machine identification should be learnt by a probabilistic framework rather than through handcrafted feature engineering. In this paper, we tackle the problem of language identification using latent representations learnt from speech using Variational Autoencoders (VAEs) and leverage the representations learnt to train sequence models. Our framework attains an accuracy of 89% in the identification of 8 well-known Indian languages (namely Tamil, Telugu, Punjabi, Marathi, Gujarati, Hindi, Kannada and Bengali) from the CMU Indic Speech Database. The presented approach can be applied to several scenarios for speech processing by employing representation learning and leveraging the representations for sequence models.

Acoustic Analysis of Native (L1) Bengali Speakers’ Phonological Realization of English Lexical Stress Contrast
Shambhu Nath Saha | Shyamal Kr. Das Mandal

Acoustically, English lexical stress is multidimensional, involving manipulation of duration, intensity, fundamental frequency (F0) and vowel quality. The current study investigates the acquisition of English lexical stress by L1 Bengali speakers at the phonological level in terms of the properties of acoustic cues. For this purpose, this study compares 20 L1 Bengali speakers’ use of acoustic correlates for the production of English lexical stress in a context sentence and a neutral frame sentence. The results of this study showed that L1 Bengali speakers were not able to achieve neutral-frame-sentence-like control over duration, intensity, F0 and, to a limited extent, vowel quality in the context sentence. As a result, unlike in the neutral frame sentence, L1 Bengali speakers were not sensitive to the English lexical stress contrast in the context sentence. This analysis reveals that the difference between the neutral frame and context sentences in terms of L1 Bengali speakers’ realization of the phonology of the English lexical stress contrast was probably due to the influence of the Bengali phonology of lexical stress placement (restricted to the initial syllable of a word) on L1 Bengali speakers’ English speech.

Towards Performance Improvement in Indian Sign Language Recognition
Kinjal Mistree | Devendra Thakor | Brijesh Bhatt

Sign language is a complete natural language used by deaf and dumb people. It has its own grammar and differs from spoken language to a great extent. Since people without hearing and speech impairment lack knowledge of sign language, deaf and dumb people find it difficult to communicate with them. The conception of a system that would be able to translate sign language into text would facilitate understanding of sign language without a human interpreter. This paper describes a systematic approach that takes Indian Sign Language (ISL) video as input and converts it into text using frame sequence generation and image augmentation techniques. By incorporating these two concepts, we have increased the dataset size and reduced overfitting. It is demonstrated that using simple image manipulation techniques and batches of shifted frames of videos, the performance of sign language recognition can be significantly improved. The approach described in this paper achieves 99.57% accuracy on a dynamic gesture dataset of ISL.

Question and Answer pair generation for Telugu short stories
Meghana Bommadi | Shreya Terupally | Radhika Mamidi

Question-Answer pair generation is a task that has been worked upon by multiple researchers in many languages. It has been a topic of interest due to its extensive uses in different fields like self-assessment, academics, business website FAQs, etc. Many experiments have been conducted on Question-Answer pair generation in English, concentrating on basic Wh-questions with a rule-based approach. We have built the first hybrid machine learning and rule-based solution in Telugu, which is efficient for short stories or short passages in children’s books. Our work covers the fundamental question forms with the question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using POS tags and UD tags, along with linguistic information from the surrounding context of the word.

Detection of Similar Languages and Dialects Using Deep Supervised Autoencoder
Shantipriya Parida | Esau Villatoro-Tello | Sajit Kumar | Maël Fabien | Petr Motlicek

Language detection is considered a difficult task, especially for similar languages, varieties, and dialects. With the growing amount of online content in different languages, the need for reliable and robust language detection tools has also increased. In this work, we use supervised autoencoders (SAEs) with a Bayesian optimizer for language detection and highlight their efficiency in detecting similar languages with dialect variance in comparison to other state-of-the-art techniques. We evaluated our approach on multiple datasets (Ling10, Discriminating between Similar Languages (DSL), and Indo-Aryan Language Identification (ILI)). The obtained results demonstrate that SAEs are highly effective in detecting languages, with up to 100% accuracy on Ling10. Similarly, we obtain competitive performance in identifying similar languages and dialects, 92% and 85% for the DSL and ILI datasets, respectively.
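
The gist of a supervised autoencoder is a shared encoder trained jointly on a reconstruction loss and a classification loss. A minimal PyTorch sketch with made-up dimensions, and without the Bayesian hyper-parameter optimizer:

    import torch
    import torch.nn as nn

    class SupervisedAutoencoder(nn.Module):
        """Shared encoder feeding a decoder (reconstruction) and a
        classifier head (language label), trained jointly."""
        def __init__(self, in_dim=300, hid=64, n_langs=10):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
            self.decoder = nn.Linear(hid, in_dim)
            self.classifier = nn.Linear(hid, n_langs)

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), self.classifier(z)

    model = SupervisedAutoencoder()
    x = torch.randn(32, 300)               # toy document feature vectors
    y = torch.randint(0, 10, (32,))        # toy language labels
    recon, logits = model(x)
    loss = nn.MSELoss()(recon, x) + nn.CrossEntropyLoss()(logits, y)
    loss.backward()                        # joint loss drives both heads
    print(float(loss))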

Weak Supervision using Linguistic Knowledge for Information Extraction
Sachin Pawar | Girish Palshikar | Ankita Jain | Jyoti Bhat | Simi Johnson

In this paper, we propose to use linguistic knowledge to automatically augment a small manually annotated corpus to obtain a large annotated corpus for training Information Extraction models. We propose a powerful patterns specification language for specifying linguistic rules for entity extraction. We define an Enriched Text Format (ETF) to represent rich linguistic information about a text in the form of XML-like tags. The patterns in our patterns specification language are then matched on the ETF text rather than raw text to extract various entity mentions. We demonstrate how an entity extraction system can be quickly built for a domain-specific entity type for which there are no readily available annotated datasets.

Leveraging Alignment and Phonology for low-resource Indic to English Neural Machine Transliteration
Parth Patel | Manthan Mehta | Pushpak Bhattacharya | Arjun Atreya

In this paper we present a novel transliteration technique based on Orthographic Syllable (OS) segmentation for low-resource Indian languages (ILs). Given that alignment has produced promising results in Statistical Machine Transliteration systems and that phonology plays an important role in transliteration, we introduce a new model which uses an alignment representation similar to that of IBM Model 3 to pre-process the tokenized input sequence and then uses pre-trained source and target OS-embeddings for training. We apply our model for transliteration from ILs to English and report our accuracy based on Top-1 Exact Match. We also compare our accuracy with that of a previously proposed phrase-based model and report improvements.

STHAL: Location-mention Identification in Tweets of Indian-context
Kartik Verma | Shobhit Sinha | Md. Shad Akhtar | Vikram Goyal

We investigate the problem of extracting Indian locations from a given crowd-sourced textual dataset. The problem of extracting fine-grained Indian locations has many challenges. One challenge in the task is to collect a relevant dataset from the crowd-sourced platforms that contains locations. The second challenge lies in extracting the location entities from the collected data. We provide an in-depth review of the information collection process and our annotation guidelines, such that a reliable dataset annotation is guaranteed. We evaluate many recent algorithms and models, including Conditional Random Fields (CRF), Bi-LSTM-CNN and BERT (Bidirectional Encoder Representations from Transformers), on our developed dataset, named STHAL. The study shows the best F1-score of 72.49% for BERT, followed by Bi-LSTM-CNN and CRF. As a result of our work, we prepare a publicly-available annotated dataset of Indian geolocations that can be used by the research community. Code and dataset are available at https://github.com/vkartik2k/STHAL.
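
A sketch of the CRF baseline in the sklearn-crfsuite style, with an illustrative feature template and a made-up training example (not the paper's exact setup):

    # pip install sklearn-crfsuite
    import sklearn_crfsuite

    def word2features(sent, i):
        """Simple per-token features for location-mention tagging."""
        w = sent[i]
        return {"lower": w.lower(), "istitle": w.istitle(),
                "prev": sent[i - 1].lower() if i else "<s>",
                "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>"}

    train_sents = [["Traffic", "jam", "near", "Hauz", "Khas", "today"]]
    train_tags = [["O", "O", "O", "B-LOC", "I-LOC", "O"]]  # illustrative tags
    X = [[word2features(s, i) for i in range(len(s))] for s in train_sents]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, train_tags)
    test = ["Flooding", "in", "Hauz", "Khas"]
    print(crf.predict([[word2features(test, i) for i in range(len(test))]]))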

pdf bib
On-Device detection of sentence completion for voice assistants with low-memory footprint
Rahul Kumar | Vijeta Gour | Chandan Pandey | Godawari Sudhakar Rao | Priyadarshini Pai | Anmol Bhasin | Ranjan Samal

Sentence completion detection (SCD) is an important task for various downstream Natural Language Processing (NLP) applications. For NLP applications that use Automatic Speech Recognition (ASR) from third parties as a service, SCD is essential to prevent unnecessary processing. Conventional approaches to SCD operate within the confines of sentence boundary detection using language models, or sentence-end detection using speech and text features. These have limitations in terms of relevant available training data, performance within memory and latency constraints, and generalizability across voice assistant domains. In this paper, we propose a novel sentence completion detection method with a low memory footprint for on-device applications. We explore various sequence-level and sentence-level experiments using state-of-the-art Bi-LSTM and BERT based models for the English language.

pdf bib
Polarization and its Life on Social Media: A Case Study on Sabarimala and Demonetisation
Ashutosh Ranjan | Dipti Sharma | Radhika Krishnan

This paper is an attempt to study polarisation in social media data. We focus on two hugely controversial and widely discussed events in India, namely 1) the Sabarimala Temple (located in Kerala, India) incident, which became a nationwide controversy when two women under the age of 50 secretly entered the temple, breaking a long-standing temple rule that barred women of menstruating age (10-50) from entering, and 2) the Indian government’s move in November 2016 to demonetise all existing 500 and 1000 denomination banknotes, comprising 86% of the currency in circulation. We gather tweets around these two events over various time periods, preprocess and annotate them with sentiment polarity and emotion category, and analyse trends to understand changing polarity over time around controversial events. The tweets collected are in English, Hindi and code-mixed Hindi-English. Apart from the analysis of the annotated data, we also present the full Twitter dataset, comprising around 1.5 million tweets.

pdf bib
A Rule Based Lightweight Bengali Stemmer
Souvick Das | Rajat Pandit | Sudip Kumar Naskar

In the field of Natural Language Processing (NLP), stemming plays a significant role. A stemmer transforms an inflected word into its root form, and stemming significantly increases the efficiency of Information Retrieval (IR) systems. It is a basic yet fundamental text pre-processing task widely used in many NLP tasks. Several important works on stemming have been carried out in English and other major languages. In this paper, we review existing work on stemming in Bengali and other Indian languages. We then propose a rule-based approach that exploits Bengali morphology and leverages WordNet to achieve better accuracy. Our algorithm produced a stemming accuracy of 98.86% for nouns and 99.75% for verbs.
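
As an illustration of the longest-match suffix-stripping idea behind such rule-based stemmers, the sketch below strips the longest suffix whose residue is validated against a lexicon (standing in for the paper's WordNet lookup). The suffix list and example are tiny illustrative transliterations, not the authors' actual rule set.

```python
# Illustrative, transliterated suffix sample; the real rule set is larger.
SUFFIXES = sorted(["der", "gulo", "guli", "ta", "ti", "ra", "ke", "e"],
                  key=len, reverse=True)

def stem(word, lexicon):
    """Strip the longest matching suffix whose residue is a known root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            root = word[:-len(suffix)]
            if root in lexicon:          # WordNet-style validation step
                return root
    return word

print(stem("chhelera", {"chhele"}))      # -> chhele ("boys" -> "boy")
```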

pdf bib
End-to-End Automatic Speech Recognition for Gujarati
Deepang Raval | Vyom Pathak | Muktan Patel | Brijesh Bhatt

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN) and Bi-directional Long Short Term Memory (BiLSTM) layers, dense layers, and Connectionist Temporal Classification (CTC) as the loss function. To improve the performance of the system given the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and a Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights into our Automatic Speech Recognition (ASR) system, we propose different analysis methods. These insights help us understand our ASR system for Gujarati and can guide improvements to ASR systems for other low-resource languages. We trained the model on the Microsoft Speech Corpus and observe a 5.11% decrease in Word Error Rate (WER) with respect to the base model.

pdf bib
Deep Neural Model for Manipuri Multiword Named Entity Recognition with Unsupervised Cluster Feature
Jimmy Laishram | Kishorjit Nongmeikapam | Sudip Naskar

The recognition of Multi-Word Named Entities (MNEs) is in itself a challenging task when the language is inflectional and agglutinative. Despite breakthrough NLP research with deep neural networks and language modelling techniques, the applicability of such techniques/algorithms to an Indian language like Manipuri remains unexplored. In this paper, we attempt to recognize Manipuri MNEs using a Long Short Term Memory (LSTM) recurrent neural network model in conjunction with Part Of Speech (POS) embeddings. To further improve classification accuracy, word cluster information obtained through a K-means clustering approach is added as a feature embedding. The cluster information is generated from skip-gram based word vectors that capture the semantic and syntactic information of each word. The proposed model does not rely on extensive morphological features of the language to achieve its accuracy. Finally, the model’s performance is compared with other machine learning based Manipuri MNE models.
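
A minimal sketch of the cluster-feature step, assuming gensim and scikit-learn: skip-gram vectors are trained, the vocabulary is clustered with K-means, and each word's cluster id becomes an extra embedding feature for the LSTM. The tokens and cluster count below are illustrative placeholders.

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Illustrative tokenized sentences standing in for the Manipuri corpus.
sentences = [["ei", "lairik", "adu", "phajei"],
             ["mahak", "lairik", "pabi"],
             ["ei", "chak", "chabi"]]

# Skip-gram (sg=1) vectors capture semantic and syntactic context.
w2v = Word2Vec(sentences, vector_size=50, sg=1, min_count=1, epochs=50)

# K-means over the vocabulary; cluster ids are later fed to the tagger
# as an additional feature embedding alongside word and POS embeddings.
words = list(w2v.wv.index_to_key)
kmeans = KMeans(n_clusters=2, n_init=10).fit(w2v.wv[words])
cluster_id = dict(zip(words, kmeans.labels_))
print(cluster_id)
```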

pdf bib
ScAA: A Dataset for Automated Short Answer Grading of Children’s free-text Answers in Hindi and Marathi
Dolly Agarwal | Somya Gupta | Nishant Baghel

Automatic short answer grading (ASAG) techniques are designed to automatically assess short answers written in natural language. Beyond MCQs, evaluating free-text answers is essential to assess children’s knowledge and understanding of a subject. But assessing descriptive answers in low-resource languages in a linguistically diverse country like India poses significant hurdles. To address this assessment problem and advance NLP research in regional Indian languages, we present the Science Answer Assessment (ScAA) dataset of answers by children in the age group of 8-14. ScAA is a 2-way (correct/incorrect) labeled dataset and contains 10,988 and 1,955 pairs of natural answers along with model answers for Hindi and Marathi respectively, covering 32 questions. We benchmark various state-of-the-art ASAG methods, and show the data presents a strong challenge for future research.

pdf bib
Exploring Pair-Wise NMT for Indian Languages
Kartheek Akella | Sai Himal Allu | Sridhar Suresh Ragupathi | Aman Singhal | Zeeshan Khan | C.v. Jawahar | Vinay P. Namboodiri

In this paper, we address the task of improving pair-wise machine translation for specific low-resource Indian languages. Multilingual NMT models have demonstrated reasonable effectiveness on resource-poor languages. In this work, we show that the performance of these models can be significantly improved by using a filtered back-translation process and subsequent fine-tuning on the limited pair-wise language corpora. The analysis in this paper suggests that this method can significantly improve multilingual models’ performance over their baselines, yielding state-of-the-art results for various Indian languages.

pdf bib
Only text? only image? or both? Predicting sentiment of internet memes
Pranati Behera | Mamta | Asif Ekbal

Internet memes now spread rapidly on online social media platforms such as Instagram, Facebook, Reddit, and Twitter, and analyzing their sentiment can provide useful insights. Meme sentiment classification is a new and largely unexplored area of research. Recently, SemEval provided a dataset for meme sentiment classification. As this dataset is highly imbalanced, we extend it by annotating new instances and use a sampling strategy to build a meme sentiment classifier. We propose a multi-modal framework for meme sentiment classification that utilizes both the textual and visual features of the meme; we find that for this task, textual or visual features alone are not sufficient. We use an attention mechanism to improve classification performance. Our proposed framework achieves a macro F1 of 34.23 and an accuracy of 50.02. It increases accuracy by 6.77 and 7.86 points over text-only and image-only features, respectively.
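
A minimal PyTorch sketch of attention-weighted fusion of textual and visual features, in the spirit of (though not identical to) the framework described; dimensions are illustrative (e.g. BERT-like text vectors and ResNet-like image vectors).

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Projects text and image features into a shared space and fuses
    them with learned attention weights before classification."""
    def __init__(self, text_dim, img_dim, hidden, n_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, text_feat, img_feat):
        mods = torch.stack([self.text_proj(text_feat),
                            self.img_proj(img_feat)], dim=1)   # (B, 2, H)
        weights = torch.softmax(self.attn(torch.tanh(mods)), dim=1)
        fused = (weights * mods).sum(dim=1)    # attention-weighted sum
        return self.out(fused)

model = AttentiveFusion(text_dim=768, img_dim=2048, hidden=256)
logits = model(torch.randn(4, 768), torch.randn(4, 2048))
```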

pdf bib
Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations
Md. Rajib Hossain | Mohammed Moshiul Hoque

Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval and question answering. Investigating embedding models helps to reduce the feature space and to capture textual semantic and syntactic relations. This paper presents three embedding techniques (Word2Vec, GloVe, and FastText) with different hyperparameters, implemented on a Bengali corpus consisting of 180 million words. The performance of the embedding techniques is evaluated both extrinsically and intrinsically. Extrinsic performance is evaluated via text classification, which achieved a maximum accuracy of 96.48%. Intrinsic performance is evaluated via word similarity (semantic, syntactic and relatedness) and analogy tasks. The maximum Pearson correlation achieved was 60.66% for semantic similarity and 71.64% for syntactic similarity, while relatedness reached 79.80%. The semantic word analogy task achieved 44.00% accuracy while the syntactic word analogy task reached 36.00%.
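
A minimal sketch of the intrinsic word-similarity evaluation, assuming a toy corpus and hypothetical human ratings: the Pearson correlation is computed between the model's cosine similarities and the human scores, as in the similarity tasks above.

```python
from gensim.models import Word2Vec
from scipy.stats import pearsonr

# Toy corpus and made-up ratings; the real evaluation uses the 180M-word
# corpus and curated Bengali similarity datasets.
corpus = [["raja", "rani", "prasad"], ["boi", "pora", "bhalo"],
          ["raja", "prasad", "boro"], ["boi", "bhalo", "golpo"]]
model = Word2Vec(corpus, vector_size=50, sg=1, min_count=1, epochs=100)

rated_pairs = [("raja", "rani", 0.9), ("raja", "boi", 0.1),
               ("boi", "golpo", 0.7)]
model_scores = [model.wv.similarity(a, b) for a, b, _ in rated_pairs]
human_scores = [r for _, _, r in rated_pairs]
r, _ = pearsonr(model_scores, human_scores)
print(f"Pearson r = {r:.2f}")
```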

pdf bib
Annotated Corpus of Tweets in English from Various Domains for Emotion Detection
Soumitra Ghosh | Asif Ekbal | Pushpak Bhattacharyya | Sriparna Saha | Vipin Tyagi | Alka Kumar | Shikha Srivastava | Nitish Kumar

Emotion recognition is a well-studied problem in Natural Language Processing (NLP). Most existing work on emotion recognition focuses on the general domain or, in some cases, on specific domains like fairy tales, blogs, weather, or Twitter. But emotion analysis systems for the domains of security, social issues, technology, politics, and sports are very rare. In this paper, we create a benchmark setup for emotion recognition in these specialised domains. First, we construct a corpus of 18,921 tweets in English annotated with Paul Ekman’s six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise) and a non-emotive class, Others. Thereafter, we propose a deep neural framework to perform emotion recognition in an end-to-end setting. We build various models based on Convolutional Neural Networks (CNN), Bi-directional Long Short Term Memory (Bi-LSTM), and Bi-directional Gated Recurrent Units (Bi-GRU), and propose a Hierarchical Attention-based deep neural network for Emotion Detection (HAtED). We also develop multiple systems considering different sets of emotion classes for each system and report a detailed comparative analysis of the results. Experiments show that the hierarchical attention-based model achieves the best results among the considered baselines, with an accuracy of 69%.

pdf bib
PhraseOut: A Code Mixed Data Augmentation Method for Multilingual Neural Machine Translation
Binu Jasim | Vinay Namboodiri | C V Jawahar

Data augmentation methods for Neural Machine Translation (NMT) such as back-translation (BT) and self-training (ST) are quite popular. In a multilingual NMT system, simply copying monolingual source sentences to the target (Copying) is an effective data augmentation method. Back-translation augments parallel data by translating monolingual sentences on the target side into the source language. In this work we propose a partial back-translation method in a multilingual setting: instead of translating the entire monolingual target sentence back into the source language, we replace only selected high-confidence phrases and keep the rest of the words in the target language itself. We call this method PhraseOut. Our experiments on low-resource multilingual translation models show that PhraseOut gives reasonable improvements over existing data augmentation methods.

pdf bib
CLPLM: Character Level Pretrained Language Model for Extracting Support Phrases for Sentiment Labels
Raj Pranesh | Sumit Kumar | Ambesh Shekhar

In this paper, we design a character-level pre-trained language model for extracting support phrases from tweets based on the sentiment label. We also propose a character-level ensemble model designed by blending Pre-trained Contextual Embedding (PCE) models (RoBERTa, BERT, and ALBERT) with neural network models (RNN, CNN and WaveNet) at different stages of the model. For a given tweet and associated sentiment label, our model predicts the span of phrases in the tweet that prompts that sentiment. In our experiments, we explored various model architectures and configurations for both single and ensemble models, and performed a systematic comparative analysis of model performance based on the Jaccard score. The best performing ensemble model obtained the highest Jaccard score of 73.5, a relative improvement of 2.4% over the best performing single RoBERTa-based character-level model at 71.5.
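
For concreteness, the token-level Jaccard score used as the evaluation measure above can be computed as follows; the empty-span convention is one commonly used in span-extraction scorers, not necessarily the authors'.

```python
def jaccard(pred: str, gold: str) -> float:
    """Token-level Jaccard similarity between predicted and gold spans."""
    a, b = set(pred.lower().split()), set(gold.lower().split())
    if not a and not b:
        return 0.5        # a common convention when both spans are empty
    return len(a & b) / len(a | b)

print(jaccard("so much fun", "had so much fun today"))   # 0.6
```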

pdf bib
Developing a Faroese PoS-tagging solution using Icelandic methods
Hinrik Hafsteinsson | Anton Karl Ingason

We describe the development of a dedicated, high-accuracy part-of-speech (PoS) tagging solution for Faroese, a North Germanic language with about 50,000 speakers. To achieve this, a state-of-the-art neural PoS tagger for Icelandic, ABLTagger, was trained on a 100,000-word PoS-tagged corpus for Faroese, standardised with methods previously applied to Icelandic corpora. This tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), which contains morphological information on 67,488 Faroese words with about one million inflectional forms. This approach produced a PoS-tagging model for Faroese which achieves 91.40% overall accuracy under 10-fold cross-validation, currently the highest reported accuracy for a dedicated Faroese PoS tagger. The tagging model, morphological database, proposed revised PoS tagset for Faroese, and a revised and standardised PoS-tagged corpus are all presented as products of this project and are made available for further research in Faroese language technology.

pdf bib
Leveraging Multi-domain, Heterogeneous Data using Deep Multitask Learning for Hate Speech Detection
Prashant Kapil | Asif Ekbal

With the exponential rise in user-generated web content on social media, the proliferation of abusive language towards individuals and groups across different sections of the internet is also rapidly increasing. It is very challenging for human moderators to identify and filter out offensive content. Deep neural networks have shown promise, with reasonable accuracy, for hate speech detection and allied applications. However, such classifiers are heavily dependent on the size and quality of the training data, and high-quality large datasets are not easy to obtain. Moreover, the existing datasets that have emerged in recent times were not created following the same annotation guidelines and often concern different types and sub-types of hate. To address this data sparsity problem, and to obtain more globally representative features, we propose Convolutional Neural Network (CNN) based multi-task learning models (MTLs) that leverage information from multiple sources. Empirical analysis on three benchmark datasets shows the efficacy of the proposed approach, with significant improvements in accuracy and F-score that yield state-of-the-art performance with respect to existing systems.

up

pdf (full)
bib (full)
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

pdf bib
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia

pdf bib
MUCS@TechDOfication using FineTuned Vectors and n-grams
Fazlourrahman Balouchzahi | M D Anusha | H L Shashirekha

The increase in domain-specific text processing applications is driving demand for tools and techniques for domain-specific Text Classification (TC), which may be helpful in many downstream applications like Machine Translation, Summarization, and Question Answering. Further, many TC algorithms are applied to globally recognized languages like English, giving less importance to local languages, particularly Indian languages. To boost research on technical domains and text processing in Indian languages, a shared task named ”TechDOfication 2020” was organized at ICON 2020. The objective of this shared task is to automatically identify the technical domain of a given text, covering coarse-grained technical domains and fine-grained subdomains in eight languages. To tackle this challenge, we, team MUCS, propose three models: a DL-FineTuned model applied to all subtasks, and VC-FineTuned and VC-ngrams models applied to some subtasks. n-grams and fine-tuned word embeddings are used as features, and machine learning and deep learning algorithms are used as classifiers. The proposed models performed well in most subtasks and obtained first rank in subtask 1b (Bangla) and subtask 1e (Malayalam), with F1 scores of 0.8353 and 0.3851 respectively, using the DL-FineTuned model for both subtasks.

pdf bib
A Graph Convolution Network-based System for Technical Domain Identification
Alapan Kuila | Ayan Das | Sudeshna Sarkar

This paper presents the IITKGP contribution to the Technical DOmain Identification (TechDOfication) shared task at ICON 2020. In the preprocessing stage, we applied part-of-speech (PoS) taggers and dependency parsers to tag the data. We trained a graph convolutional neural network (GCNN) based system that uses the tokens along with their PoS tags and dependency relations as features to identify the domain of a given document. We participated in the subtasks for coarse-grained domain classification in English (Subtask 1a), Bengali (Subtask 1b) and Hindi (Subtask 1d), and the subtask for fine-grained domain classification within the Computer Science domain in English (Subtask 2a).

pdf bib
Multichannel LSTM-CNN for Telugu Text Classification
Sunil Gundapu | Radhika Mamidi

With the rapid growth of text information, retrieving domain-oriented information from text data has a broad range of applications in Information Retrieval and Natural Language Processing. Thematic keywords give a compressed representation of the text, and domain identification plays a significant role in Machine Translation, Text Summarization, Question Answering, Information Extraction, and Sentiment Analysis. In this paper, we propose a Multichannel LSTM-CNN methodology for Technical Domain Identification for Telugu. This architecture was used and evaluated in the context of the ICON shared task “TechDOfication 2020” (task h); our system achieved an F1 score of 69.9% on the test dataset and 90.01% on the validation set.

pdf bib
Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification
Suman Dowlagar | Radhika Mamidi

In this paper, we present a transfer learning system for technical domain identification on multilingual text data. We submitted two runs: one uses the transformer model BERT, and the other uses XLM-ROBERTa with a CNN model for text classification. These models allowed us to identify the domain of given sentences for the ICON 2020 shared task, TechDOfication: Technical Domain Identification. Our system ranked best for subtasks 1d and 1g on the given TechDOfication dataset.

pdf bib
Technical Domain Identification using word2vec and BiLSTM
Koyel Ghosh | Dr. Apurbalal Senapati | Dr. Ranjan Maity

Coarse-grained and fine-grained classification tasks have mostly been explored for sentiment or basic emotion analysis. In this paper, we instead address technical domain identification: the task of identifying the technical domain of a given English text. In coarse-grained domain classification, a text is assigned to a coarse technical domain such as Computer Science, Physics or Math; in fine-grained classification, it is assigned to a subdomain of Computer Science such as Artificial Intelligence, Algorithms, Computer Architecture, Computer Networks or Database Management Systems. For this task, the Word2Vec skip-gram model is used for word embeddings, and a Bidirectional Long Short Term Memory (BiLSTM) model is then applied to classify coarse-grained domains and fine-grained sub-domains. Accuracy, precision, recall, and F1-score are used to evaluate the performance of the proposed model.
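
A minimal PyTorch sketch of the word2vec-plus-BiLSTM pipeline described above, assuming the skip-gram vectors have already been loaded into an embedding matrix; dimensions and the class count are illustrative.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Pretrained skip-gram embeddings followed by a BiLSTM and a
    linear layer over coarse- or fine-grained domain classes."""
    def __init__(self, emb_matrix, hidden, n_classes):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(emb_matrix, freeze=False)
        self.lstm = nn.LSTM(emb_matrix.size(1), hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        out, _ = self.lstm(self.emb(token_ids))
        return self.fc(out[:, -1, :])    # last timestep -> domain logits

# Toy usage: a random 100-word, 50-dimensional embedding table.
model = BiLSTMClassifier(torch.randn(100, 50), hidden=64, n_classes=7)
logits = model(torch.randint(0, 100, (2, 12)))   # batch of 2 sentences
```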

pdf bib
Automatic Technical Domain Identification
Hema Ala | Dipti Sharma

In this paper we present two machine learning algorithms, namely Stochastic Gradient Descent and Multi Layer Perceptron, for identifying the technical domain of a given text. We performed experiments on coarse-grained technical domains like Computer Science, Physics and Law for English, Bengali, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu, and on fine-grained subdomains of Computer Science, like Operating Systems, Computer Networks and Databases, for English only. Using TF-IDF as the feature extraction method, we show how both machine learning models perform on the mentioned languages.
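
A minimal scikit-learn sketch of the described setup, pairing a TF-IDF vectorizer with SGD and MLP classifiers; the toy texts and labels are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = ["the kernel schedules processes",           # Computer Science
         "packets travel through routers",           # Computer Science
         "the court upheld the statute",             # Law
         "the appeal was dismissed by the judge"]    # Law
labels = ["cse", "cse", "law", "law"]

for clf in (SGDClassifier(), MLPClassifier(max_iter=500)):
    model = make_pipeline(TfidfVectorizer(), clf)    # TF-IDF features
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["routers drop packets"]))
```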

pdf bib
Fine-grained domain classification using Transformers
Akshat Gahoi | Akshat Chhajer | Dipti Mishra Sharma

The introduction of transformers in 2017, and subsequently BERT in 2018, brought about a revolution in the field of natural language processing. Such models are pretrained on vast amounts of data and are easily extensible to a wide variety of tasks through transfer learning. Continual work on transformer-based architectures has led to a variety of new models with state-of-the-art results. RoBERTa (CITATION) is one such model, which brings a series of changes to the BERT architecture and is capable of producing better quality embeddings at the expense of some functionality. In this paper, we tackle the well-known text classification task of fine-grained domain classification using BERT and RoBERTa and perform a comparative analysis of the two. We also evaluate the impact of data preprocessing, specifically in the context of fine-grained domain classification. Our results outperformed all other models at the ICON TechDOfication 2020 (subtask-2a) fine-grained domain classification task and ranked first, demonstrating the effectiveness of our approach.

pdf bib
TechTexC: Classification of Technical Texts using Convolution and Bidirectional Long Short Term Memory Network
Omar Sharif | Eftekhar Hossain | Mohammed Moshiul Hoque

This paper describes a technical text classification system and the results it achieved as part of participation in the TechDOfication 2020 shared task. The shared task consists of two sub-tasks: (i) the first task identifies the coarse-grained technical domain of a given text in a specified language, and (ii) the second task classifies a text from the computer science domain into fine-grained sub-domains. A classification system (called ‘TechTexC’) was developed to perform the classification task using three techniques: a convolutional neural network (CNN), a bidirectional long short term memory (BiLSTM) network, and a combined CNN with BiLSTM. Results show that the CNN with BiLSTM model outperforms the other techniques on task 1 sub-tasks (a, b, c and g) and task 2a. This combined model obtained F1 scores of 82.63 (sub-task 1a), 81.95 (sub-task 1b), 82.39 (sub-task 1c), 84.37 (sub-task 1g), and 67.44 (task 2a) on the development dataset. On the test set, the combined CNN with BiLSTM approach likewise achieved the highest accuracy for sub-tasks 1a (70.76%), 1b (79.97%), 1c (65.45%), 1g (49.23%) and 2a (70.14%).

pdf bib
An Attention Ensemble Approach for Efficient Text Classification of Indian Languages
Atharva Kulkarni | Amey Hengle | Rutuja Udyawar

The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper offers a solution for TechDOfication 2020 subtask-1f, which focuses on coarse-grained technical domain identification of short text documents in Marathi, a Devanagari-script-based Indian language. Leveraging the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models on the given task, achieving the best validation accuracy of 89.57% and F1-score of 0.8875. Furthermore, the solution was the best system submission for this subtask, with a test accuracy of 64.26% and F1-score of 0.6157, surpassing the performance of other teams as well as the baseline system provided by the organizers of the shared task.

up

pdf (full)
bib (full)
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task

pdf bib
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia

pdf bib
Graph Based Automatic Domain Term Extraction
Hema Ala | Dipti Sharma

We present a graph-based approach to automatically extract domain-specific terms from technical domains like Biochemistry, Communication, Computer Science and Law. Our approach is similar to TextRank, with an extra post-processing step to reduce noise. We performed our experiments on the domains provided by the ICON TermTraction 2020 shared task and present precision, recall and F1-score for all experiments. We observe that our method gives promising results without much noise in the extracted domain terms.

pdf bib
Unsupervised Technical Domain Terms Extraction using Term Extractor
Suman Dowlagar | Radhika Mamidi

Terminology extraction, also known as term extraction, is a subtask of information extraction. Its goal is to automatically extract relevant words or phrases from a given corpus. This paper focuses on an unsupervised automated domain term extraction method that combines chunking, preprocessing, and ranking of domain-specific terms using relevance and cohesion functions, for ICON 2020 shared task 2: TermTraction.

pdf bib
N-Grams TextRank A Novel Domain Keyword Extraction Technique
Saransh Rajput | Akshat Gahoi | Manvith Reddy | Dipti Mishra Sharma

The rapid growth of the internet has given us a wealth of information and data spread across the web. However, as data grows we simultaneously face the grave problem of an information explosion: an abundance of data can lead to large-scale data management problems as well as the loss of the true meaning of the data. In this paper, we present a domain-specific keyword extraction algorithm to tackle this problem. Our algorithm is based on a modified version of TextRank, an algorithm based on PageRank, to determine the keywords of a domain-specific document. Furthermore, this paper proposes a modification to the traditional TextRank algorithm that takes into account bigrams and trigrams and returns results with extremely high precision. We observe that the precision and F1-score of this model outperform other models in many domains, and that recall can easily be increased by increasing the number of returned results without affecting precision. We also discuss future work on extending the algorithm to Indian languages.
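
A minimal sketch of TextRank extended with bigram candidates, using networkx's PageRank over a co-occurrence graph; the window size and example are illustrative, and the paper's trigram handling and post-processing are elided.

```python
import itertools
import networkx as nx

def ngram_textrank(tokens, window=4, top_k=5):
    """Rank unigram and bigram candidates by PageRank over a graph
    whose edges link units co-occurring within a token window."""
    units = [(i, t) for i, t in enumerate(tokens)]
    units += [(i, f"{a} {b}")
              for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
    graph = nx.Graph()
    for (i, u), (j, v) in itertools.combinations(units, 2):
        if u != v and abs(i - j) < window:    # co-occurrence edge
            graph.add_edge(u, v)
    scores = nx.pagerank(graph)               # the TextRank core
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

doc = "neural machine translation improves neural translation quality"
print(ngram_textrank(doc.split()))
```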

up

pdf (full)
bib (full)
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

pdf bib
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia

pdf bib
JUNLP@ICON2020: Low Resourced Machine Translation for Indic Languages
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay

In the current work, we present the description of the systems submitted to the machine translation shared task organized by ICON 2020: the 17th International Conference on Natural Language Processing. The systems were developed to show the capability of general-domain machine translation when translating into Indic languages, English-Hindi in our case. The paper describes the training process and quantifies the performance of two state-of-the-art translation systems, viz., Statistical Machine Translation and Neural Machine Translation. While Statistical Machine Translation systems work better in a low-resource setting, Neural Machine Translation systems generate more fluent sentences. Since these systems have contrasting advantages, a hybrid system incorporating both was also developed to leverage their strong points. The submitted systems garnered BLEU scores of 8.70, 0.64, and 11.79 respectively, and the scores of the hybrid system earned us the fourth spot on the competition leaderboard.

pdf bib
AdapNMT : Neural Machine Translation with Technical Domain Adaptation for Indic Languages
Hema Ala | Dipti Sharma

Adapting to a new domain is a highly challenging task for Neural Machine Translation (NMT). In this paper we show the capability of general-domain machine translation when translating into Indic languages (English - Hindi, English - Telugu and Hindi - Telugu), and low-resource domain adaptation of MT systems using existing general parallel data plus small in-domain parallel data for the AI and Chemistry domains. We carried out our experiments using Byte Pair Encoding (BPE), as it mitigates the rare-word problem. We observe that adding a small amount of in-domain data to the general data improves the BLEU score significantly.
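
A minimal sketch of the BPE step using the sentencepiece library; the corpus file name and vocabulary size are illustrative assumptions.

```python
import sentencepiece as spm

# Train a BPE model on a (hypothetical) combined general + in-domain
# corpus file; BPE segments rare technical terms into known subwords.
spm.SentencePieceTrainer.train(input="train.hi-en.txt", model_prefix="bpe",
                               vocab_size=8000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("photosynthesis", out_type=str))
# e.g. ['▁photo', 'synth', 'esis']: a rare word built from subword units
```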

pdf bib
Domain Adaptation of NMT models for English-Hindi Machine Translation Task : AdapMT Shared Task ICON 2020
Ramchandra Joshi | Rusbabh Karnavat | Kaustubh Jirapure | Raviraj Joshi

Recent advancements in Neural Machine Translation (NMT) models have produced state-of-the-art results in machine translation for low-resource Indian languages. This paper describes the neural machine translation systems for English-Hindi presented at the AdapMT Shared Task, ICON 2020. The shared task aims to build translation systems for Indian languages in specific domains like Artificial Intelligence (AI) and Chemistry using a small in-domain parallel corpus. We evaluated the effectiveness of two popular NMT architectures, LSTM and Transformer, for the English-Hindi machine translation task based on BLEU scores. We train these models primarily on out-of-domain data and employ simple domain adaptation techniques based on the characteristics of the in-domain dataset, using fine-tuning and mixed-domain data approaches. The system achieved the second-highest score on the chemistry and general domain En-Hi translation tasks and the third-highest score on the AI domain En-Hi translation task.

pdf bib
Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task
Rejwanul Haque | Yasmin Moslem | Andy Way

This paper describes the ADAPT Centre’s submission to the Adap-MT 2020 AI Translation Shared Task for English-to-Hindi. The neural machine translation (NMT) systems that we built to translate AI domain texts are state-of-the-art Transformer models. In order to improve the translation quality of our NMT systems, we made use of both in-domain and out-of-domain data for training and employed different fine-tuning techniques for adapting our NMT systems to this task, e.g. mixed fine-tuning and on-the-fly self-training. For this, we mined parallel sentence pairs and monolingual sentences from large out-of-domain data, and the mining process was facilitated through automatic extraction of terminology from the in-domain data. This paper outlines the experiments we carried out for this task and reports the performance of our NMT systems on the evaluation test set.

pdf bib
MUCS@Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation
Asha Hegde | H.l. Shashirekha

Machine Translation (MT) is the task of automatically converting text in a source language into text in a target language while preserving the meaning. MT usually requires a large corpus for training translation models. Due to the scarcity of resources, little attention has been given to translating into low-resource languages, and in particular into Indic languages. In this direction, a shared task called “Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation” was organized to illustrate the capability of general-domain MT when translating into Indic languages, and low-resource domain adaptation of MT systems. In this paper, we, team MUCS, describe a simple word-extraction-based domain adaptation approach applied to English-Hindi MT only. MT in the proposed model is carried out using OpenNMT, a popular Neural Machine Translation tool. A general-domain corpus is built by effectively combining the available English-Hindi corpora and removing duplicate sentences. Further, the domain-specific corpus is augmented by extracting from the generic corpus those sentences that contain words given in the domain-specific corpus. The proposed model exhibited satisfactory results on the small domain-specific AI and CHE corpora provided by the organizers, with BLEU scores of 1.25 and 2.72 respectively. The methodology is quite generic and can easily be extended to other low-resource language pairs.
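
A minimal sketch of the corpus-building procedure as described: deduplicate the general corpus, then pull into the domain-specific set every generic sentence that shares a word with the in-domain data. Function and variable names are illustrative.

```python
def build_corpora(general_sents, domain_sents):
    """Deduplicate the general corpus, then extract from it every
    sentence sharing a word with the domain-specific corpus."""
    general = list(dict.fromkeys(general_sents))        # drop duplicates
    domain_vocab = {w for s in domain_sents for w in s.lower().split()}
    extracted = [s for s in general
                 if domain_vocab & set(s.lower().split())]
    return general, domain_sents + extracted            # augmented set

general, domain = build_corpora(
    ["The model learns weights.", "It rained today.",
     "The model learns weights."],                      # duplicate dropped
    ["Neural model training needs data."])
print(domain)   # the weight-learning sentence joins the domain set
```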

up

pdf (full)
bib (full)
Proceedings of the 17th International Conference on Natural Language Processing (ICON): System Demonstrations

pdf bib
Proceedings of the 17th International Conference on Natural Language Processing (ICON): System Demonstrations
Vishal Goyal | Asif Ekbal

pdf bib
Demonstration of a Literature Based Discovery System based on Ontologies, Semantic Filters and Word Embeddings for the Raynaud Disease-Fish Oil Rediscovery
Toby Reed | Vassilis Cutsuridis

A novel literature-based discovery system based on UMLS ontologies, semantic filters, statistics, and word embeddings was developed and validated against the well-established Raynaud’s disease – Fish Oil discovery by mining corpora of Pubmed titles and abstracts of varying size and specificity. Results show an ‘inverse effect’ between open versus closed discovery search modes: in open discovery, a more general and bigger corpus (Vascular disease or Perivascular disease) produces better results than a more specific and smaller corpus (Raynaud disease), whereas in closed discovery the exact opposite is true.

pdf bib
Development of Hybrid Algorithm for Automatic Extraction of Multiword Expressions from Monolingual and Parallel Corpus of English and Punjabi
Kapil Dev Goyal | Vishal Goyal

Identification and extraction of Multiword Expressions (MWEs) is a hard and challenging task in various Natural Language Processing applications like Information Retrieval (IR), Information Extraction (IE), Question-Answering systems, Speech Recognition and Synthesis, Text Summarization and Machine Translation (MT). Multiword Expressions are sequences of two or more consecutive words that are treated as a single unit, whose actual meaning cannot be derived from the meanings of the individual words. If a system treats such an expression as separate words, its results will be incorrect; it is therefore essential to identify these expressions. In this work, our main focus is to develop an automated tool to extract Multiword Expressions from monolingual and parallel corpora of English and Punjabi. Rule-based, linguistic and statistical approaches, among others, were used to identify and extract MWEs, and the tool achieved an F-score above 90% for some types of MWEs.

pdf bib
Punjabi to English Bidirectional NMT System
Kamal Deep | Ajit Kumar | Vishal Goyal

Machine translation has been an active research area for the last few decades, and corpus-based machine translation systems are very popular today. Statistical Machine Translation and Neural Machine Translation are both based on parallel corpora. In this research, a Punjabi to English bidirectional Neural Machine Translation system is developed. To improve the accuracy of the system, word embeddings and Byte Pair Encoding are used. The system achieves a BLEU score of 38.30 for Punjabi to English translation and 36.96 for English to Punjabi translation.

pdf bib
Extracting Parallel Phrases from Comparable English and Punjabi Corpora using an Integrated Approach
Manpreet Singh Lehal | Vishal Goyal

Machine translation from English to Indian languages is a difficult task due to the unavailability of good-quality corpora and the morphological richness of the Indian languages. For a system to produce better translations, the corpus should be large. We employ three similarity and distance measures and have developed software to automatically extract parallel data from comparable corpora with high precision using minimal resources. The software is built upon four algorithms: three compute Cosine Similarity, Euclidean Distance and Jaccard Similarity, and the fourth integrates the outputs of the three in order to improve the efficiency of the system.
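
A minimal sketch of the four-algorithm design, assuming candidate sentence pairs have already been mapped into a shared vocabulary (e.g. via a bilingual lexicon): three measures are computed and then integrated by majority vote, with illustrative thresholds rather than the paper's tuned values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def is_parallel(src, tgt, votes_needed=2):
    """Integrate three similarity/distance measures by majority vote."""
    X = CountVectorizer().fit_transform([src, tgt])
    cos = cosine_similarity(X[0], X[1])[0, 0]
    euc = euclidean_distances(X[0], X[1])[0, 0]
    votes = (cos > 0.5) + (euc < 2.0) + (jaccard(src, tgt) > 0.3)
    return votes >= votes_needed

print(is_parallel("the boy reads a book", "the boy reads the book"))  # True
```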

pdf bib
A Sanskrit to Hindi Language Machine Translator using Rule Based Approach
Prateek Agrawal | Vishu Madaan

Hindi and Sanskrit share the same script, the Devanagari script, which results in some basic similarities in their grammar rules. Hindi ranks fourth in the world in terms of number of speakers, and over 60 million people in India are Hindi internet users. India itself has approximately 120 languages and 240 mother tongues, but only a few languages are recognized worldwide while the others are losing their place in society day by day. Sanskrit is one of those important languages being neglected: as per the 2001 census of India, fewer than 15,000 citizens returned Sanskrit as their mother tongue or preferred medium of communication. A key reason behind the poor acceptance of Sanskrit is the language barrier among Indian masses and the lack of knowledge about this language among people. Our attempt, therefore, is to connect the large community of Hindi users with the Sanskrit language and make them familiar with at least its basics. We developed a translation tool that parses Sanskrit (prose) words one by one and translates them into the equivalent Hindi in a step-by-step manner: (i) we created a strong Hindi-Sanskrit corpus that can deal with Sanskrit words effectively and efficiently; (ii) we proposed a stemming algorithm for Sanskrit that chops off the starts/ends of words to find the root words in the form of nouns and verbs; (iii) after stemming, we developed an algorithm that searches the corpus for the equivalent Hindi meaning of the stemmed words based on semantic analysis; (iv) we developed a semantic analysis algorithm that helps the tool identify required parameters such as gender, number and case; (v) we developed a discourse integration algorithm to split each translated sentence based on subject/noun dependency; (vi) we implemented a pragmatic analysis algorithm that validates the translated Hindi sentences syntactically and semantically; (vii) we further extended our work to summarize the translated story, for which we used the ripple-down-rules-based parts of speech (RDR-POS) tagger for word tagging in the POS tagger corpora; (viii) we proposed a title generation algorithm that suggests a suitable title for the translated text; and (ix) finally, we assembled all phases into one translation tool that takes a story of at most one hundred words as input and translates it into the equivalent Hindi.

pdf bib
Urdu To Punjabi Machine Translation System
Umrinder Pal Singh | Vishal Goyal | Gurpreet Lehal

Machine translation is a popular area of NLP research. There are various approaches to developing a machine translation system: rule-based, statistical, neural and hybrid. A rule-based system relies on grammatical rules and bilingual lexicons, while statistical and neural systems use large parallel corpora to train their respective models; a hybrid MT system is a mixture of different approaches. Corpus-based machine translation systems are currently quite popular in NLP research, but these models demand huge parallel corpora. In this research, we use a hybrid approach to develop an Urdu to Punjabi machine translation system, combining a statistical component with various sub-systems based on linguistic rules. The system yields 80% accuracy on sets of sentences from domains such as politics, entertainment, tourism, sports and health. The complete system was developed in the C#.NET programming language.

pdf bib
The Hindi to Dogri Machine Translation System
Preeti Dubey

The Hindi to Dogri machine translation system is a rule-based MT system developed and copyrighted by the Government of India in 2014. It is the first system developed to convert Hindi text into Dogri (the regional language of Jammu). The system is developed using ASP.Net with MS-Access databases. It accepts Hindi text as input and produces Dogri text as output in Unicode.

pdf bib
Opinion Mining System for Processing Hindi Text for Home Remedies Domain
Arpana Prasad | Neeraj Sharma | Shubhangi Sharma

Opinion Mining (OM) is a field of study in Computer Science that deals with the development of software for text classification and summarization. Researchers in this field contribute lexical resources, computing methodologies, text classification approaches, and summarization modules to perform OM tasks across various domains and languages. This demonstration presents the lexical and computational components of an Opinion Mining System that processes Hindi text taken from weblogs. The texts chosen for processing are those expressing a cause-and-effect relationship between the related entities ‘Food’ and ‘Health Issues’. The work is novel, and the lexical resources developed are useful in current research and may be of importance for future research in the field. The resources are built around an algorithm A such that, for a domain-specific Hindi weblog sentence Y, A(Y) returns a tuple (F, HI, p, s), where F is a subset of FOOD, the set of Hindi words or phrases for edible items, and HI is a subset of HEALTH_ISSUE, the union of BODY_COMPONENT (Hindi words or phrases for parts of the body) and HEALTH_PROBLEM (Hindi words or phrases for health problems a human being faces). The element p takes the value 1 or -1: the value 1 means that, from the text Y, the algorithm computationally derived that the food entities in F have a positive effect on the health issues in HI, and -1 means a negative effect. The element s takes the value 1 or 2, indicating that the strength of the polarity p is medium or strong.

pdf bib
Sentiment Analysis of English-Punjabi Code-Mixed Social Media Content
Mukhtiar Singh | Vishal Goyal

Sentiment analysis is a field of study concerned with analyzing people’s emotions, such as Nice, Happy, ਦੁਖੀ (sad), changa (good), etc., towards entities and attributes expressed in written text. On microblogging websites (Facebook, YouTube, Twitter), most people use more than one language to express their emotions; the change from one language to another within the same written text is called code-mixing. In this research, we gathered an English-Punjabi code-mixed corpus from microblogging websites. We performed language identification on the code-mixed text, which includes phonetic typing, abbreviations, wordplay, intentionally misspelled words and slang, and then tokenized English and Punjabi words with varying spellings. We then performed lexicon-based sentiment analysis: the dictionary created for English-Punjabi code-mixed text consists of opinionated words categorized into positive, negative, and neutral word lists, with the remaining words stored in an unsorted list. Using an N-gram approach, a statistical technique is applied to determine sentence-level sentiment polarity on the English-Punjabi code-mixed dataset. Our results show an accuracy of 83% with an F1 measure of 77%.
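
A minimal sketch of the lexicon-based scoring step, with a tiny illustrative lexicon of Romanized Punjabi and English tokens; the actual resource and its N-gram handling are far richer.

```python
# Tiny illustrative lexicon; the real dictionary also covers slang,
# phonetic-typing variants and normalized contractions.
POSITIVE = {"changa", "vadhia", "nice", "happy"}
NEGATIVE = {"bura", "dukhi", "sad", "bad"}

def polarity(sentence):
    """Sentence-level polarity from the sign of the lexicon hit count."""
    tokens = sentence.lower().split()
    score = sum(t in POSITIVE for t in tokens) - \
            sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("movie bahut changa si but ending sad"))   # -> neutral
```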

pdf bib
NLP Tools for Khasi, a low resource language
Medari Tham

Khasi is an Austro-Asiatic language spoken by one of the tribes of Meghalaya, and in parts of Assam and Bangladesh. The fact that some NLP tools for Khasi are now available online for testing is the culmination of an arduous investment of time and effort. When work on Khasi was initiated, resources such as a tagset, an annotated corpus or any NLP tools were nonexistent. As part of the author’s ongoing doctoral work, the resources now in place for Khasi are the BIS (Bureau of Indian Standards) tagset for Khasi, a 90k-word annotated corpus, and NLP tools such as POS (parts of speech) taggers and shallow parsers. These tools are highlighted in this demonstration paper.

pdf bib
A Chatbot in Malayalam using Hybrid Approach
Praveen Prasannan | Stephy Joseph | Rajeev R R

Chatbots are among the most advanced and promising expressions of interaction between humans and machines; they are sometimes called digital assistants. Many chatbots have already been developed for English, with supporting libraries and packages, but customizing these engines for other languages is a tedious process, and there are many barriers to training them on morphologically rich languages. Artificial Intelligence (AI) or machine learning based chatbots can answer complex, ambiguous questions by creating replies from scratch using Natural Language Processing techniques. Both categories have their advantages and disadvantages: rule-based chatbots give more reliable and grammatically correct answers but fail to respond to questions outside their knowledge base, while machine learning based chatbots need a vast amount of training data and necessitate continuous improvements to the database to improve their cognitive capabilities. A hybrid chatbot employs the concepts of both AI and rule-based bots and can handle situations with both approaches. One of the biggest threats faced by society during the Corona pandemic was misinformation, disinformation and mal-information; the government wanted to establish a single source of truth on which the public could rely for authentic information. To support this cause and to serve the general public during the rapid spread of the COVID-19 pandemic in February and March 2020, ICFOSS developed an interactive bot based on this hybrid technology that interacts with people in the regional language, Malayalam.

pdf bib
Language Identification and Normalization of Code Mixed English and Punjabi Text
Neetika Bansal | Dr. Vishal Goyal | Dr. Simpel Rani

Code-mixing is prevalent when users employ two or more languages while communicating, and it becomes more complex when users prefer romanized text to Unicode typing. Automatic processing of social media data has become a popular area of interest, especially since the COVID period, during which the involvement of youngsters has grown considerably. Our software deals with language identification and normalization of English-Punjabi code-mixed text. It follows a pipeline that includes data collection, pre-processing, language identification, handling of out-of-vocabulary words, and normalization and transliteration of English-Punjabi text. After applying five-fold cross-validation on the corpus, an accuracy of 96.8% is achieved on a trained dataset of around 80,025 tokens. After tag prediction, slang and contractions in the user input are normalized to their standard forms, and words predicted as Punjabi are transliterated into Punjabi.

pdf bib
Punjabi to Urdu Machine Translation System
Nitin Bansal | Ajit Kumar

Development of a Machine Translation System (MTS) for any language pair is a challenging task for several reasons, and the lack of lexical resources is one of the major issues that arise. For example, during the development of our Punjabi to Urdu MTS, many issues were encountered while preparing lexical resources for both languages. No machine-readable Punjabi to Urdu dictionary is available that can be used directly for translation, although various dictionaries exist that explain word meanings. Along with this, handling out-of-vocabulary (OOV) words, handling Punjabi words with multiple senses in Urdu, identifying proper nouns, and identifying collocations in the source (Punjabi) sentence are issues we faced during the development of this system. MTSs have been in great demand for the last decade and are widely used in applications such as smartphones, making the development of such a system all the more pressing and user-oriented; their main uses are large-scale and automated translation, acting as an instrument to bridge the digital divide.

pdf bib
Design and Implementation of Anaphora Resolution in Punjabi Language
Kawaljit Kaur | Dr Vishal Goyal | Dr Kamlesh Dutta

Natural Language Processing (NLP) is a most attention-grabbing field of artificial intelligence, focused on the interaction between humans and computers. Through NLP we can make computers recognize, decode and deduce the meaning of human language. There are numerous difficulties encountered in NLP, and anaphora is one such issue. Anaphora arises often in written text and spoken discourse, and anaphora resolution, the process of finding the antecedent of a corresponding referent, is required in different NLP applications. Appreciable work has been reported on anaphora in English and other languages, but none for Punjabi. This paper introduces anaphora resolution for the Punjabi language. The accuracy achieved by the system is 47%.

pdf bib
Airport Announcement System for Deaf
Rakesh Kumar | Vishal Goyal | Lalit Goyal

People belonging to the hearing-impaired community feel very uncomfortable while travelling through or visiting an airport without the help of a human interpreter. Hearing-impaired people cannot hear announcements made at the airport, such as which flight is heading to which destination; they remain unaware of gate or counter numbers without an interpreter, and cannot tell whether a flight is on time, delayed or cancelled. The Airport Announcement System for Deaf is a rule-based MT system, the first developed in the domain of public places to translate the announcements used at airports into Indian Sign Language (ISL) synthetic animations. The system is developed using Python and the Flask framework. It accepts announcements in the form of English text as input and produces Indian Sign Language (ISL) synthetic animations as output.

pdf bib
Railway Stations Announcement System for Deaf
Rakesh Kumar | Vishal Goyal | Lalit Goyal

People belonging to the hearing-impaired community feel very uncomfortable while travelling through or visiting a railway station without the help of a human interpreter. Hearing-impaired people cannot hear announcements made at railway stations, such as which train is heading to which destination; they remain unaware of platform or counter numbers without an interpreter, and cannot tell whether a train is on time, delayed or cancelled. The Railway Stations Announcement System for Deaf is a rule-based MT system, the first developed in the domain of public places to translate the announcements used at railway stations into Indian Sign Language (ISL) synthetic animations. The system is developed using Python and the Flask framework. It accepts announcements in the form of English text as input and produces Indian Sign Language (ISL) synthetic animations as output.

pdf bib
Automatic Translation of Complex English Sentences to Indian Sign Language Synthetic Video Animations
Deepali Goyal | Vishal Goyal | Lalit Goyal

Sign language is the natural way of expressing thoughts and feelings for the deaf community: a visual, non-verbal language that hearing-impaired people use to communicate with one another. Although we live in an era of technological development in which instant communication is easy, much work remains in sign language automation to improve the quality of life of the deaf community. Traditional approaches represent signs as videos or text, which are expensive, time-consuming, and not easy to use. In this research work, an attempt is made to convert complex and compound English sentences into Indian Sign Language (ISL) using synthetic video animations. The translation architecture includes a parsing module that simplifies input complex or compound English sentences using complex-to-simple and compound-to-simple English grammar rules respectively. The simplified sentence is then forwarded to the conversion segment, which rearranges the words of the English sentence into the corresponding ISL order using the devised grammar rules. The next segment removes unwanted words or stop words from the sentence generated by the ISL grammar rules; this removal is important because ISL needs only the meaningful words of a sentence rather than linking verbs, helping verbs, and so on. The sentence is then sent to the concordance segment, which translates each word into its respective lemma. The lemma is the base form of each word, required because sign language uses base words, unlike spoken languages, which use gerunds, suffixes, three forms of verbs, and different kinds of nouns, adjectives and pronouns in their sentence theory. All words of the sentence are looked up in a lexicon that pairs each English word with its HamNoSys notation; words not present in the lexicon are replaced by a synonym, and if a word is still not found, the HamNoSys code is taken for each letter of the word in sequence. The HamNoSys code is converted into SiGML tags (a form of XML tags), and these SiGML tags are then sent to the animation module, which renders the synthetic animation using an avatar (a computer-generated animated character).

pdf bib
Plagiarism Detection Tool for Indian Languages with Special focus on Hindi and Punjabi
Vishal Goyal | Rajeev Puri | Jitesh Pubreja | Jaswinder Singh

Plagiarism is closely linked with Intellectual Property Rights and copyright laws, both of which were framed to protect the ownership of ideas. Most of the available tools for detecting plagiarism, when tested with sample Punjabi text, failed to recognise the Punjabi text at all, and the ones that did support Punjabi text performed a simple string comparison to detect suspected copy-paste plagiarism, ignoring other forms of plagiarism such as word switching, synonym replacement, and sentence switching.
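The gap between plain string comparison and these other forms of plagiarism can be illustrated with a small sketch: normalising synonyms before comparing word n-grams lets synonym replacement and word switching be caught where exact matching fails. The synonym table here is a toy assumption, not the tool's actual resource:

```python
# Toy sketch: synonym-normalised trigram overlap instead of raw
# string comparison. The synonym table is an illustrative placeholder.

SYNONYMS = {"quick": "fast", "rapid": "fast", "large": "big"}

def normalise(text):
    return [SYNONYMS.get(w, w) for w in text.lower().split()]

def ngrams(words, n=3):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(doc_a, doc_b, n=3):
    """Jaccard overlap of synonym-normalised word trigrams."""
    a, b = ngrams(normalise(doc_a), n), ngrams(normalise(doc_b), n)
    return len(a & b) / len(a | b) if a | b else 0.0

# Exact string comparison misses this pair; normalised n-grams do not.
print(similarity("the quick brown fox jumps", "the rapid brown fox jumps"))
```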

up

pdf (full)
bib (full)
Proceedings of the Workshop on Joint NLP Modelling for Conversational AI @ ICON 2020

pdf bib
Proceedings of the Workshop on Joint NLP Modelling for Conversational AI @ ICON 2020
Praveen Kumar G S | Siddhartha Mukherjee | Ranjan Samal

pdf bib
Neighbor Contextual Information Learners for Joint Intent and Slot Prediction
Bharatram Natarajan | Gaurav Mathur | Sameer Jain

Intent identification and slot identification are two important tasks for Natural Language Understanding (NLU). Exploration in this area has gained significance using networks like RNN, LSTM and GRU. However, models containing the above modules are sequential in nature, which consumes a lot of resources, such as memory, to train the model in the cloud itself. With the advent of many voice assistants delivering offline solutions for many applications, there is a need to find replacements for such sequential networks. Exploration of self-attention and CNN modules has gained pace in recent times. Here we explore CNN-based models like Trellis and modify the architecture to make it bi-directional with fusion techniques. In addition, we propose a CNN with self-attention network called Neighbor Contextual Information Projector using Multi Head Attention (NCIPMA) architecture. These architectures beat the state of the art on open source datasets like ATIS and SNIPS.
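For orientation, a generic joint model of this family can be sketched as below: CNN features mixed by multi-head self-attention, feeding an utterance-level intent head and a token-level slot head. The dimensions and layer choices are illustrative assumptions, not the paper's NCIPMA configuration:

```python
# Generic sketch of joint intent + slot prediction with CNN features
# and multi-head self-attention; sizes are illustrative, not NCIPMA.
import torch
import torch.nn as nn

class JointIntentSlot(nn.Module):
    def __init__(self, vocab, n_intents, n_slots, dim=128, heads=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        # 1-D convolution projects each token with its neighbours.
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intent_head = nn.Linear(dim, n_intents)  # utterance level
        self.slot_head = nn.Linear(dim, n_slots)      # token level

    def forward(self, tokens):                 # tokens: (batch, seq)
        x = self.emb(tokens)                   # (batch, seq, dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2).relu()
        x, _ = self.attn(x, x, x)              # neighbour context mixing
        intent_logits = self.intent_head(x.mean(dim=1))
        slot_logits = self.slot_head(x)        # one label per token
        return intent_logits, slot_logits

model = JointIntentSlot(vocab=5000, n_intents=21, n_slots=120)
intents, slots = model(torch.randint(0, 5000, (8, 16)))
print(intents.shape, slots.shape)   # (8, 21) (8, 16, 120)
```

Avoiding recurrence in the encoder is what makes such models attractive for the offline, on-device setting the abstract motivates.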

pdf bib
Unified Multi Intent Order and Slot Prediction using Selective Learning Propagation
Bharatram Natarajan | Priyank Chhipa | Kritika Yadav | Divya Verma Gogoi

Natural Language Understanding (NLU) involves two important tasks, namely Intent Determination (ID) and Slot Filling (SF). With recent advancements in these tasks, exploration of handling multiple intents in a single utterance is increasing, to make NLU more conversation-based rather than command-execution-based. Many approaches have tackled this task with huge amounts of multi-intent training data, and much of the existing research addresses the multi-intent problem only. Multiple intents also pose the challenge of determining the order in which the detected intents should be executed. Hence, we propose a unified architecture that addresses multi-intent detection, associated slot detection, and the order of execution of the detected intents, using only a low proportion of multi-intent examples in the training data. The architecture consists of a Multi Word Importance relation propagator using Multi-Head GRU and an Importance learner propagator module using self-attention. This architecture beats the state of the art by 2.58% on the MultiIntentData dataset.
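The problem setup itself (which intents are present, plus in what order to execute them) can be sketched with two output heads, as below. This is only a hedged illustration of the task formulation; the heads and dimensions are assumptions and do not reproduce the paper's selective-learning-propagation modules:

```python
# Sketch of the multi-intent task: multi-label presence (sigmoid) plus
# a per-intent execution-order score. Illustrative, not the paper's model.
import torch
import torch.nn as nn

class MultiIntentOrder(nn.Module):
    def __init__(self, vocab, n_intents, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.presence = nn.Linear(2 * dim, n_intents)  # which intents
        self.order = nn.Linear(2 * dim, n_intents)     # execution rank

    def forward(self, tokens):
        h, _ = self.enc(self.emb(tokens))
        pooled = h.mean(dim=1)
        return torch.sigmoid(self.presence(pooled)), self.order(pooled)

model = MultiIntentOrder(vocab=5000, n_intents=10)
present, order_scores = model(torch.randint(0, 5000, (4, 12)))
# Execute detected intents (presence > 0.5) sorted by their order score.
```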

pdf bib
EmpLite: A Lightweight Sequence Labeling Model for Emphasis Selection of Short Texts
Vibhav Agarwal | Sourav Ghosh | Kranti Ch | Bharath Challa | Sonal Kumari | Harshavardhana | Barath Raj Kandur Raja

Word emphasis in textual content aims at conveying the desired intention by changing the size, color, typeface, style (bold, italic, etc.), and other typographical features. Emphasized words are extremely helpful in drawing readers’ attention to the specific information the authors wish to highlight. However, applying such emphasis with a soft keyboard during social media interactions is time-consuming and has an associated learning curve. In this paper, we propose a novel approach to automate emphasis word detection on short written texts. To the best of our knowledge, this work presents the first lightweight deep learning approach for smartphone deployment of emphasis selection. Experimental results show that our approach achieves comparable accuracy at a much smaller model size than existing models. Our best lightweight model has a memory footprint of 2.82 MB with a matching score of 0.716 on the SemEval-2020 public benchmark dataset.
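Emphasis selection reduces to token-level sequence labelling: each token gets a probability of deserving emphasis. The sketch below illustrates that framing, along with how a parameter-count footprint estimate of the kind quoted above can be derived; the tiny GRU sizing is an illustrative assumption, not EmpLite itself:

```python
# Sketch of emphasis selection as per-token labelling, with a rough
# on-device footprint estimate. Not the EmpLite architecture.
import torch
import torch.nn as nn

class TokenEmphasis(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, tokens):                           # (batch, seq)
        h, _ = self.gru(self.emb(tokens))
        return torch.sigmoid(self.score(h)).squeeze(-1)  # per-token prob

model = TokenEmphasis(vocab=8000)
probs = model(torch.randint(0, 8000, (2, 10)))
# Emphasise (e.g. bold) the tokens with the highest probabilities.

# Rough footprint estimate: float32 parameters * 4 bytes.
mb = sum(p.numel() for p in model.parameters()) * 4 / 2**20
print(f"{mb:.2f} MB")
```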

pdf bib
Named Entity Popularity Determination using Ensemble Learning
Vikram Karthikeyan | B Shrikara Varna | Amogha Hegde | Govind Satwani | Shambhavi B R | Jayarekha P | Ranjan Samal

Determining the popularity of a Named Entity after completion of the Named Entity Recognition (NER) task has many applications. This work studies named entities from the music and movie domains and tackles the problem using 11 relevant features. Decision Tree and Random Forest approaches were applied to the dataset, and the latter algorithm achieved acceptable accuracy.
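The modelling step is a standard supervised classification over an 11-dimensional feature vector per entity, roughly as sketched below; the synthetic features and binary popularity label are placeholders, not the paper's data:

```python
# Sketch of a Random Forest over 11 entity-level features predicting a
# popularity label. Features and labels here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 11))            # 11 features per named entity
y = rng.integers(0, 2, 500)          # popular / not popular

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```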

pdf bib
Optimized Web-Crawling of Conversational Data from Social Media and Context-Based Filtering
Annapurna P Patil | Rajarajeswari Subramanian | Gaurav Karkal | Keerthana Purushotham | Jugal Wadhwa | K Dhanush Reddy | Meer Sawood

Building chatbots requires a large amount of conversational data. In this paper, a web crawler is designed to fetch multi-turn dialogues from websites such as Twitter, YouTube and Reddit in the form of JavaScript Object Notation (JSON) files. Tools like the Twitter Application Programming Interface (API), the LXML library, and the JSON library are used to crawl Twitter, YouTube and Reddit and collect conversational chat data. The data obtained in raw form cannot be used directly: alongside the text it carries only metadata such as author name and time, which provide context for the chat data being scraped. The collected data therefore has to be formatted for the intended use case, and Python’s JSON library makes this formatting easy. The scraped dialogues are further filtered based on the context of a search keyword, without introducing bias and with flexible strictness of classification.
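A condensed sketch of the Reddit part of such a crawler is given below: fetch a thread as JSON, walk the comment tree into multi-turn dialogues, and keep only dialogues matching a search keyword. The URL is a placeholder, the comment-tree layout reflects Reddit's public `.json` listing format, and this is an illustration of the approach rather than the authors' crawler:

```python
# Sketch: Reddit thread -> multi-turn dialogues -> keyword filter.
# Reddit rate-limits requests and expects a descriptive User-Agent.
import requests

def fetch_thread(url):
    resp = requests.get(url + ".json",
                        headers={"User-Agent": "dialogue-crawler/0.1"})
    resp.raise_for_status()
    return resp.json()

def walk_comments(children, turns, dialogues):
    """Each root-to-leaf path of replies forms one multi-turn dialogue."""
    for child in children:
        data = child.get("data", {})
        body = data.get("body")
        if not body:
            continue
        path = turns + [body]
        dialogues.append(path)
        replies = data.get("replies")
        if isinstance(replies, dict):
            walk_comments(replies["data"]["children"], path, dialogues)

def crawl(url, keyword):
    listing = fetch_thread(url)   # [post listing, comment listing]
    post = listing[0]["data"]["children"][0]["data"]
    dialogues = []
    walk_comments(listing[1]["data"]["children"],
                  [post.get("title", "")], dialogues)
    # Context-based filtering: keep dialogues mentioning the keyword.
    return [d for d in dialogues
            if any(keyword.lower() in t.lower() for t in d)]

# chats = crawl("https://www.reddit.com/r/AskReddit/comments/<id>/<slug>",
#               "travel")  # then dump to a JSON file with json.dump(...)
```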

pdf bib
A character representation enhanced on-device Intent Classification
Sudeep Deepak Shivnikar | Himanshu Arora | Harichandana B S S

Intent classification is an important task in natural language understanding systems. Existing approaches have achieved near-perfect scores on the benchmark datasets. However, they are not suitable for deployment on low-resource devices like mobiles, tablets, etc. due to their massive model size. Therefore, in this paper, we present a novel lightweight architecture for intent classification that can run efficiently on-device. We use character features to enrich the word representation. Our experiments show that our proposed model outperforms existing approaches and achieves state-of-the-art results on benchmark datasets. We also report that our model has a tiny memory footprint of ~5 MB and a low inference time of ~2 milliseconds, which proves its efficiency in a resource-constrained environment.
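The core idea of character-enhanced word representations can be sketched as follows: a character CNN is pooled into one vector per word and concatenated with the word embedding before classification. Sizes and layers are illustrative assumptions, not the paper's exact architecture:

```python
# Sketch: enrich word embeddings with pooled character-CNN features
# before intent classification. Illustrative, not the paper's model.
import torch
import torch.nn as nn

class CharEnhancedIntent(nn.Module):
    def __init__(self, vocab, char_vocab, n_intents, dim=64, char_dim=16):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # Character CNN, max-pooled into one vector per word.
        self.char_cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        self.classify = nn.Linear(dim + char_dim, n_intents)

    def forward(self, words, chars):
        # words: (batch, seq); chars: (batch, seq, word_len)
        b, s, L = chars.shape
        c = self.char_emb(chars.view(b * s, L)).transpose(1, 2)
        c = self.char_cnn(c).max(dim=2).values.view(b, s, -1)
        x = torch.cat([self.word_emb(words), c], dim=-1)
        return self.classify(x.mean(dim=1))     # utterance-level intent

model = CharEnhancedIntent(vocab=5000, char_vocab=60, n_intents=20)
logits = model(torch.randint(0, 5000, (2, 8)),
               torch.randint(0, 60, (2, 8, 12)))
print(logits.shape)   # (2, 20)
```

Character features of this kind help keep the vocabulary (and hence the embedding table, usually the bulk of the model size) small, which is consistent with the on-device motivation above.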