Natural Language Processing for Achieving Sustainable Development: The Case of Neural Labelling to Enhance Community Profiling

In recent years, there has been an increasing interest in the application of Artificial Intelligence - and especially Machine Learning - to the field of Sustainable Development (SD). However, until now, NLP has not been applied in this context. In this research paper, we show the high potential of NLP applications to enhance the sustainability of projects. In particular, we focus on the case of community profiling in developing countries, where, in contrast to the developed world, a notable data gap exists. In this context, NLP could help to address the cost and time barrier of structuring qualitative data that prohibits its widespread use and associated benefits. We propose the new task of Automatic UPV classification, which is an extreme multi-class multi-label classification problem. We release Stories2Insights, an expert-annotated dataset, provide a detailed corpus analysis, and implement a number of strong neural baselines to address the task. Experimental results show that the problem is challenging, and leave plenty of room for future research at the intersection of NLP and SD.


Introduction
Sustainable Development (SD) is an interdisciplinary field which studies the integration and balancing of economic, environmental and social concerns to tackle the broad goal of achieving inclusive and sustainable growth (Brundtland, 1987; Keeble, 1988; Sachs, 2015). As a collective, trans-national effort toward sustainability, in 2015 the United Nations approved the 2030 Agenda (United Nations, 2015), which identifies 17 Sustainable Development Goals (SDGs) to be reached by 2030 (Lee et al., 2016). In recent years, there has been increasing recognition of the fundamental role played by data in achieving the objectives set out in the SDGs (Griggs et al., 2013; Nilsson et al., 2016; Vinuesa et al., 2020).
In this paper, we focus on data-driven planning and delivery of projects 1 which address one or more of the SDGs in a developing country context. When dealing with developing countries, a deep understanding of project beneficiaries' needs and values (hereafter referred to as User-Perceived Values or UPVs, Hirmer and Guthrie (2016)) is of particular importance. This is because beneficiaries with limited financial means are especially good at assessing needs and values (Hirji, 2015). When a project fails to create value for a benefiting community, the community is less likely to care about its continued operation (Watkins et al., 2012; Chandler et al., 2013; Hirmer, 2018) and, as a consequence, the chances of the project's long-term success are jeopardised (Bishop et al., 2010). Therefore, comprehensive community profiling 2 plays a key role in understanding what is important to a community and acting upon it, thus ensuring a project's sustainability (van der Waldt, 2019).
Obtaining data with such characteristics requires knowledge extraction from qualitative interviews, which come in the form of unstructured free text (Saggion et al., 2010; Parmar et al., 2018). This step is usually done manually by domain experts (Lundegård and Wickman, 2007), which further raises the costs. Thus, structured qualitative data is often unaffordable for project developers. As a consequence, project planning heavily relies upon sub-optimal aggregated statistical data, like household surveys (WHO, 2016) or remotely-sensed satellite imagery (Bello and Aina, 2014; Jean et al., 2016), which unfortunately is of considerably lower resolution in developing countries. Whilst these quantitative data sets are important and necessary, they are insufficient to ensure successful project design, lacking insights on UPVs that are crucial to success. In this context, the application of NLP techniques can help to make qualitative data more accessible to project developers by dramatically reducing the time and costs needed to structure data. However, despite having been successfully applied to many other domains, ranging from biomedicine (Simpson and Demner-Fushman, 2012) to law (Kanapala et al., 2019) and finance (Loughran and McDonald, 2016), to our knowledge NLP has not yet been applied to the field of SD in a systematic and academically rigorous format 3.
In this paper, we make the following contributions: (1) we articulate the potential of NLP to enhance SD; at the time of writing, this is the first time NLP is systematically applied to this field; (2) as a case study at the intersection between NLP and SD, we focus on enhancing project planning in the context of a developing country, namely Uganda; (3) we propose the new task of UPV Classification, which consists in labeling qualitative interviews using an annotation schema developed in the field of SD; (4) we annotate and release Stories2Insights, a corpus of UPV-annotated interviews in English; (5) we provide a set of strong neural baselines for future reference; and (6) we show, through a detailed error analysis, that the task is challenging and important, and we hope it will raise interest from the NLP community.

Related Work

In recent years, there has been a growing interest in AI for Social Good (Hager et al., 2017; Shi et al., 2020). In this context, Machine Learning, in particular in the field of Computer Vision (De-Arteaga et al., 2018), has been applied to contexts ranging from conservation biology (Kwok, 2019), to poverty (Blumenstock et al., 2015) and slavery mapping (Foody et al., 2019), to deforestation and water quality monitoring (Holloway and Mengersen, 2018).

Ethics of AI for Social Good
Despite its positive impact, it is important to recognise that some AI techniques can act both as an enhancer and inhibitor of sustainability. As recently shown by Vinuesa et al. (2020), AI might inhibit meeting a considerable number of targets across the SDGs and may result in inequalities within and across countries due to application biases. Understanding the implications of AI and its related fields on SD, or Social Good more generally, is particularly important for countries where action on SDGs is being focused and where issues are most acute (UNESCO, 2019a,b).

Project biases
Various works highlight the importance of understanding the local context and engaging with local stakeholders, including beneficiaries, to achieve project sustainability. Where such information is not available, projects are designed and delivered based on the judgment of other actors, e.g. project funders, developers or domain experts (Risal, 2014; Axinn, 1988; Harman and Williams, 2014). Their judgment, in turn, is subject to biases (Kahneman, 2011) that are shaped by past experiences, beliefs, preferences and worldviews: such biases can include, for example, preferences towards a specific sector (e.g. energy or water), technology (e.g. solar, hydro) or gender-group (e.g. solutions which benefit a gender disproportionately), which are pushed without considering the local needs.
NLP has the potential to increase the availability of community-specific data to key decision makers and ensure project design is properly informed and appropriately targeted. However, careful attention needs to be paid to the potential for bias in data collection resulting from the interviewers (Bryman, 2016), as well as the potential to introduce new bias through NLP.
Figure 1: Using UPVs (1a) to build sustainable projects: note the role of NLP (purple square in 1b). (a) User-Perceived Value wheel. (b) Flowchart of the intersection between NLP and the delivery of SD projects.

As a means to obtain qualitative data with the characteristics mentioned above, we adapt the User-Perceived Values (UPV) framework (Hirmer, 2018). The UPV framework builds on value theory, which is widely used in marketing and product design in the developed world (Sheth et al., 1991; Woo, 1992; Solomon, 2002; Boztepe, 2007). Value theory assumes that a deep connection exists between what consumers perceive as important and their inclinations to adopt a new product or service (Nurkka et al., 2009).
In the context of developing countries, our UPV framework identifies a set of 58 UPVs which can be used to frame the wide range of perspectives on what is of greatest concern to project beneficiaries (Hirmer and Guthrie, 2016). UPVs (or tier 3 (T3) values) can be clustered into 17 tier 2 (T2) value groups, each one embracing a set of similar T3 values; in turn, T2 values can be categorized into 6 tier 1 (T1) high-level value pillars (Hirmer and Guthrie, 2014). The interplay between T1, T2 and T3 values is graphically depicted in the UPV Wheel (Figure 1a). See Appendix A for the full set of UPV definitions.

Integrating UPVs into Sustainable Project Planning.
The UPV approach offers a theoretical framework to place communities at the centre of project design (Figure 1b). Notably, it allows project planners to (a) facilitate more responsible and beneficial project planning (Gallarza and Saura, 2006); and (b) enable effective communication with rural dwellers. The latter makes it possible to frame messaging of project benefits in a way that resonates with the beneficiaries' own understanding of benefits, as discussed by Hirji (2015). This results in higher end-user acceptance, because the initiative is perceived to have personal value to the beneficiaries: as a consequence, community commitment increases, eventually enhancing the project success rate and leading to more sustainable results (Hirmer, 2018).

The role of NLP to enhance Sustainable Project Planning.
Data conveying the beneficiaries' perspective is seldom considered in practical applications, mainly because it comes in the form of unstructured qualitative interviews. As introduced above, data needs to be structured in order to be useful (OECD, 2017; UN Agenda for Sustainable Development, 2018). This makes the entire process long and costly, and thus almost prohibitively expensive in practice for most small-scale projects. In this context, AI, and more specifically NLP, presents a yet unexplored opportunity. Implementing successful NLP systems to automatically perform the annotation process on interviews (Figure 1b, purple square), which constitutes the major bottleneck in the project planning pipeline (Section 4.1), would dramatically speed up the entire project life-cycle and drastically reduce its costs. In this context, we introduce the task of Automatic UPV classification, which consists of annotating each sentence of a given input interview with the appropriate UPV labels which are (implicitly) conveyed by the interviewee.

The Stories2Insights Corpus: a Corpus Annotated for User-Perceived Values
To enable research in UPV classification, we release S2I, a corpus of labelled reports from 7 rural villages in Uganda (Figure 2c). In this section, we report on the corpus collection and annotation procedures, and outline the challenges the corpus poses for NLP.

Building a Corpus with the UPV game
The UPV game. As widely recognised in marketing practice (Van Kleef et al., 2005), consumers are usually unable to articulate their own values and needs (Ulwick, 2002). This requires the use of methods that elicit what is important, such as laddering (Reynolds and Gutman, 2001) or Zaltman Metaphor Elicitation Technique (ZMET) (Coulter et al., 2001). To avoid direct inquiry (Pinegar, 2006), Hirmer and Guthrie (2016) developed an approach to identify perceived values in low-income settings by means of a game (hereafter referred to as UPV game). Expanding on the items proposed by Peace Child International (2005), the UPV game makes reference to 46 everyday-use items in rural areas 5 , which are graphically depicted (Figure 2a). The decision to represent items graphically stems from the high level of illiteracy across developing countries (UNESCO, 2013).
Building on the techniques proposed by Coulter et al. (2001) and Reynolds et al. (2001), the UPV game is framed in the form of semi-structured interviews: (1) participants are asked to select 20 items, based on what is most important to them (Select stimuli); (2) to rank them in order of importance; and finally, (3) they have to give reasons as to why an item was important to them. Why-probing was used to encourage discussion (Storytelling).

Case-Study Villages. 7 rural villages were studied: 3 in the West Nile Region (Northern Uganda); 1 in Mount Elgon (Eastern Uganda); 2 in the Ruwenzori Mountains (Western Uganda); and 1 in South Western Uganda. All villages are located in remote areas far from the main roads (Figure 2c). A total of 7 languages are spoken across the villages 6.

Data Collection Setting and Guidelines for Interviewers. For each village, 3 native-speaker interviewers guided the UPV game. To ensure consistency and data quality, a two-day training workshop was held at Makerere University (Kampala, Uganda), and a local research assistant oversaw the entire data collection process in the field.

Data Collection. 12 people per village were interviewed, consisting of an equal split between men and women with varying backgrounds and ages. In order to gather complete insight into the underlying decision-making process, which might be influenced by the context (Barry et al., 2008), interviews were conducted both individually and in groups of 6 people following standard focus group methods (Silverman, 2013; Bryman, 2016). Each interview lasted around 90 minutes. The data collection process took place over a period of 3 months and resulted in a total of 119 interviews.

Ethical Considerations. Participants received compensation in the amount of 1 day of labour. An informed consent form was read out loud by the interviewer prior to the UPV game, to cater for the high level of illiteracy amongst participants.
To ensure integrity, a risk assessment following the University of Cambridge's Policy on the Ethics of Research Involving Human Participants and Personal Data was completed. To protect the participants' identity, locations and proper names were anonymized.

Data Annotation. The interviews were translated 7 into English, analysed and annotated by domain experts 8 using the computer-assisted qualitative data analysis software HyperResearch (Hesse-Biber et al., 1991). To ensure consistency across interviews, they were annotated following Bryman (2012), using cross-sectional indexing (Mason, 2002). Due to the considerable size of collected data, the annotation process took around 6 months.

Corpus Statistics and NLP Challenges
We obtain a final corpus of 5102 annotated utterances from the interviews. Samples present an average length of 20 tokens. The average number of samples per T3 label is 169.1, with an extremely skewed distribution: the most frequent T3, Economic Opportunity, occurs 957 times, while the least common, Preservation of the Environment, occurs only 7 times (Figure 3).
58.8% of the samples are associated with more than 1 UPV, and 22.3% with more than 2 UPVs (refer to Appendix B for further details on UPV correlation). Such characteristics make UPV classification highly challenging to model: the task is an extreme multi-class multi-label problem, with high class imbalance. Imbalanced label distributions pose a challenge for many NLP applications, such as sentiment analysis (Li et al., 2011), sarcasm detection (Liu et al., 2014), and NER (Tomanek and Hahn, 2009), but are not uncommon in user-generated data (Imran et al., 2016). The following interview excerpt illustrates the multi-class multi-label characteristics of the problem:

Further challenges for NLP are introduced by the frequent use of non-standard grammar and poor sentence structuring, which often occur in oral production (Cole et al., 1995). Moreover, manual transcription of interviews may lead to spelling errors, thus increasing OOVs. This is illustrated in the excerpt below (spelling errors are underlined):

• Also men like phone there are so jealous for their women for example like in the morning my husband called me and asked that are you in church; so that's why they picked a phone.

User-Perceived Values Classification
As outlined above, given an input interview, the task consists in annotating each sentence with the appropriate UPV(s). The extreme multi-class multi-label nature of the task (Section 4.2) makes it impractical to tackle as a standard multi-class classification problem, where, given an input sample x, a system is trained to predict its label from a tagset T = {l1, l2, l3} as x → l2 (i.e. [0,1,0]). Instead, we model the task as a binary classification problem: given x, the system learns to predict its relatedness to each one of the possible labels, i.e. (x, l1) → 0, (x, l2) → 1 and (x, l3) → 0 9.

We consider the samples from the S2I corpus as positive instances. Then, we generate three kinds of negative instances by pairing the sample text with random labels. To illustrate, consider the three T2 classes Convenience, Identity and Status, each of which groups a set of T3 values (Appendix A). Given a sample x and its gold label Aspiration_T3 (a T3 of the Status T2 class), we can generate the following training samples:

• (x, Modernisation_T3) is a mildly negative sample, as x is linked with a wrong T3 from the same T2;
• (x, Dignity_T3) is a negative sample, as x is associated with a wrong T3 from a different T2 class, but both T2 classes belong to the same T1; and
• (x, Aesthetic_T3) is a strictly negative sample, as x is associated with a wrong label from a T2 class in a different T1.

In this way, during training the system is exposed to positive (real) samples and negative (randomly generated) samples.

A UPV classification system should satisfy the following desiderata: (1) it should be relatively light, given that it will be used in the context of developing countries, which may suffer from access bias 10; and (2) the goal of such a system is not to completely replace the work of human SD experts, but rather to reduce the time needed for interview annotation. In this context, false positives are quick to notice and delete, while false negatives are more difficult to spot and correct.
Moreover, when assessing a community's needs and values, missing a relevant UPV is worse than including one which was not originally present. For these reasons, recall is particularly important for a UPV classifier.
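The three-way negative-sampling scheme can be sketched as follows; the hierarchy dictionary is a small assumed subset of the full UPV schema (Appendix A), shown for illustration only:

```python
import random

# Assumed toy subset of the UPV hierarchy: T1 pillar -> T2 group -> T3 values.
HIERARCHY = {
    "Significance": {
        "Status": ["Aspiration", "Modernisation", "Reputation"],
        "Identity": ["Appearance", "Belongingness", "Dignity", "Personal Performance"],
    },
    "Comfort": {
        "Convenience": ["Reliability", "Usability"],
    },
}

def locate(t3):
    """Return the (T1, T2) pair a given T3 value belongs to."""
    for t1, groups in HIERARCHY.items():
        for t2, values in groups.items():
            if t3 in values:
                return t1, t2
    raise KeyError(t3)

def negative_labels(gold_t3, rng):
    """Sample one mildly negative, one negative and one strictly negative T3."""
    t1, t2 = locate(gold_t3)
    # mildly negative: wrong T3 from the same T2 group
    mildly = rng.choice([v for v in HIERARCHY[t1][t2] if v != gold_t3])
    # negative: wrong T3 from a different T2 group under the same T1 pillar
    neg = rng.choice([v for g, vs in HIERARCHY[t1].items() if g != t2 for v in vs])
    # strictly negative: T3 from a T2 group under a different T1 pillar
    strictly = rng.choice([v for p, gs in HIERARCHY.items() if p != t1
                           for vs in gs.values() for v in vs])
    return mildly, neg, strictly

rng = random.Random(0)
m, n, s = negative_labels("Aspiration", rng)
```

In practice, the sampled (text, label) pairs would then be added to the training set alongside the gold positive pairs.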
In the next Section, we provide a set of strong baselines for future reference.

Baseline Architecture
Embedding Layer. The system receives an input sample (x, T3), where x is the sample text (e_1, ..., e_n), T3 is the T3 label as the sequence of its tokens (e_1, ..., e_m), and e_i is the word embedding representation of the token at position i. We obtain a T3 embedding e_T3 for each T3 label using a max-pool operation over its word embeddings: given the short length of T3 codes, this proved to work well, and it is similar to findings in relation extraction and targeted sentiment analysis (Tang et al., 2016). We replicate e_T3 n times and concatenate it to the text's word embeddings x (Figure 4).

Encoding Layer. We obtain a hidden representation h_text with a forward LSTM (Gers et al., 1999) over the concatenated input. We then apply attention to capture the key parts of the input text w.r.t. the given T3. In detail, given the output matrix of the LSTM layer H = [h_1, ..., h_n], we produce a hidden representation h_text as h_text = H α^T, where α is the vector of attention weights over the LSTM states. This is similar in principle to the attention-based LSTM by Wang et al. (2016), and proved to work better than classic attention over H on our data.

Decoding Layer. We predict ŷ ∈ [0, 1] with a dense layer followed by a sigmoidal activation.
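The attention pooling step (h_text = H α^T) can be sketched in isolation. Below is a minimal numpy illustration, where the scoring function (a single learned vector w) is an assumption for demonstration purposes rather than the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, w):
    """Collapse the LSTM states H (n x d) into a single vector.

    w (d,) is a learned scoring vector; the exact scoring function is an
    assumption here (the paper follows Wang et al., 2016)."""
    scores = H @ w            # (n,): one score per time step
    alpha = softmax(scores)   # attention weights, sum to 1
    return H.T @ alpha        # (d,): weighted sum of the states

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))  # 7 time steps, hidden size 16
w = rng.normal(size=16)
h_text = attention_pool(H, w)
```

Note that with uniform scores the pooled vector reduces to the mean of the LSTM states, which makes the mechanism easy to sanity-check.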

Including Description Information
Each T3 comes with a short description, which was written by domain experts and used during manual labelling (the complete list is in Appendix A). We integrate information from such descriptions into our model as follows: given the ordered word embeddings of the UPV description (e_1, ..., e_d), we obtain a description representation h_descr following the same steps as for the sample text.
In line with previous studies on siamese networks (Yan et al., 2018), we observe better results when sharing the weights between the two LSTMs. We keep two separate attention layers for sample texts and descriptions. We concatenate h_text and h_descr and feed the obtained vector to the output layer.

Multi-task Training
A clear hierarchy exists between T3, T2 and T1 values (Section 3). We integrate such information using multi-task learning (Caruana, 1997; Ruder, 2017). Given an input sample, we predict its relatedness not only w.r.t. a T3 label, but also w.r.t. its corresponding T2 and T1 labels 11. In practice, given the hidden representation h = h_text ⊕ h_descr, we first feed it into a dense layer dense_T1 to obtain h_T1, and predict ŷ_T1 with a sigmoidal function. We then concatenate h_T1 with the previously obtained h, and predict ŷ_T2 with a T2-specific dense layer, σ(dense_T2(h ⊕ h_T1)), where h_T2 = dense_T2(h ⊕ h_T1). Finally, ŷ_T3 is predicted as σ(dense_T3(h ⊕ h_T2)).
In this way, the prediction ŷ_i is based on both the original h and the hidden representation computed at the previous stage of the hierarchy, h_(i−1) (Figure 4).
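The hierarchical prediction chain can be sketched as follows. This is a minimal numpy illustration; the layer sizes, initialisation scale and the scalar readout vectors are assumptions, not the paper's actual hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b):
    return x @ W + b

class HierarchicalHead:
    """Sketch of the hierarchical multi-task head: predict T1, then T2, then T3.

    Sizes and initialisation are illustrative assumptions."""
    def __init__(self, d_in, d_hid, rng):
        s = 0.1
        self.W1, self.b1 = rng.normal(size=(d_in, d_hid)) * s, np.zeros(d_hid)
        self.W2, self.b2 = rng.normal(size=(d_in + d_hid, d_hid)) * s, np.zeros(d_hid)
        self.W3, self.b3 = rng.normal(size=(d_in + d_hid, 1)) * s, np.zeros(1)
        self.u1 = rng.normal(size=d_hid) * s  # scalar readouts for y_T1, y_T2
        self.u2 = rng.normal(size=d_hid) * s

    def forward(self, h):
        h_t1 = dense(h, self.W1, self.b1)                # h_T1 = dense_T1(h)
        y_t1 = sigmoid(h_t1 @ self.u1)                   # T1 relatedness score
        # dense_T2 over h concatenated with h_T1
        h_t2 = dense(np.concatenate([h, h_t1]), self.W2, self.b2)
        y_t2 = sigmoid(h_t2 @ self.u2)                   # T2 relatedness score
        # final T3 score over h concatenated with h_T2
        y_t3 = sigmoid(dense(np.concatenate([h, h_t2]), self.W3, self.b3)[0])
        return y_t1, y_t2, y_t3

rng = np.random.default_rng(0)
head = HierarchicalHead(32, 16, rng)
y_t1, y_t2, y_t3 = head.forward(rng.normal(size=32))
```

Each prediction thus conditions on the representation produced at the previous level of the hierarchy, mirroring the T1 → T2 → T3 structure described above.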
Experiments and Discussion

Experimental Setting

Data Preparation
For each positive sample, we generate 40 negative samples (we found empirically that this was the best performing ratio, see Appendix C).
Moreover, to expose the system to more diverse input, we slightly deform the sample's text when generating negative samples. Following Wei and Zou (2019), we implement 4 operations: random deletion, swap, insertion, and semantically-motivated substitution. We also implement character swapping to increase the system's robustness to spelling errors (Figure 5).
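As an illustration, the character-swapping operation could be implemented along these lines (a sketch; the word-selection strategy and parameters are assumptions, not the paper's exact implementation):

```python
import random

def char_swap(text, n_swaps=1, rng=None):
    """Swap adjacent characters inside random words to mimic typos."""
    rng = rng or random.Random()
    words = text.split()
    for _ in range(n_swaps):
        # only perturb words long enough for the swap to be meaningful
        candidates = [i for i, w in enumerate(words) if len(w) > 3]
        if not candidates:
            break
        i = rng.choice(candidates)
        w = words[i]
        j = rng.randrange(len(w) - 1)
        # swap the characters at positions j and j+1
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

out = char_swap("the borehole gives us clean water", 1, random.Random(0))
```

The perturbation preserves the character multiset and word lengths, so the deformed text stays close to the original while simulating transcription noise.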
We consider only samples belonging to UPV labels with a support higher than 30 in the S2I corpus, thus rejecting 12 very rare UPVs. We select a random 80% proportion of the data as the training set; out of the remaining 980 samples, we randomly select 450 as the dev set and use the rest as the test set.

Training Setting
In order to allow for robust handling of OOVs, typos and spelling errors in the data, we use FastText subword-informed pretrained vectors (Bojanowski et al., 2017) to initialise the word embedding matrix. We train using binary cross-entropy loss, with early stopping monitoring the development set loss with a patience of 5. Sample weighting was used to account for the different seriousness of errors (weight 1 for negative and strictly negative samples, 0.5 for mildly negative ones). Network hyperparameters are reported in Appendix C for replication.
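The sample-weighting scheme amounts to a per-sample weighted binary cross-entropy; a minimal numpy sketch (the batch values below are illustrative, not real data):

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Per-sample weighted binary cross-entropy, averaged over the batch."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(weights * losses))

# Positive, negative and strictly negative samples weigh 1.0,
# mildly negative samples weigh 0.5 (an illustrative batch).
y_true  = np.array([1.0, 0.0, 0.0, 0.0])
y_pred  = np.array([0.9, 0.2, 0.4, 0.1])
weights = np.array([1.0, 0.5, 1.0, 1.0])  # second sample is mildly negative
loss = weighted_bce(y_true, y_pred, weights)
```

Down-weighting mildly negative pairs reflects that confusing two T3 values within the same T2 group is a less serious error than crossing T2 or T1 boundaries.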

Model Performance
During experiments, we monitor precision, recall and F1 score. For evaluation, we consider a test set where negative samples appear in the same proportion as in the training set (1/40 positive/negative ratio). The results of our experiments are reported in Table 1. Notably, adding attention and integrating signal from the descriptions into the base system lead to significant improvements in performance.

Multi-task Training
We take the best performing model and run experiments with the three considered multi-task training settings (Section 5.1.3). We consider 3 levels of performance, corresponding to T3, T2 and T1 labels. This is useful because, in the application context, different levels of granularity can be monitored. As shown in Table 2, we observe relevant improvements in F1 scores when jointly learning more than one training objective. This holds true not only for T3 classification, but also for T2 classification when training with the T3+T2+T1 setting. This seems to indicate that the signal encoded in the additional training objectives indirectly conveys information about the label hierarchy which is indeed useful for classification.

Real-World Simulation and Error Analysis
To simulate a real scenario where we annotate a new interview with the corresponding UPVs, we perform further experiments on the test set by generating, for each sample, all possible negative samples. We annotate using the T1+T2+T3 model, finetuning the decision threshold for each UPV on the development set, and perform a detailed error analysis of the results on the test set. As reported in Table 3, we observe a significant drop in precision, which confirms the extreme difficulty of the task in a real-world setting due to the extreme data imbalance. Note, however, that recall remains relatively stable across evaluation settings. This is particularly important for a system which is meant to enhance the annotators' speed, rather than to completely replace human experts: in this context, missing labels are more time consuming to recover than correcting false positives.
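Per-UPV threshold finetuning on the development set can be sketched as a simple grid search; this minimal illustration uses made-up dev scores, and both the grid and the choice of F1 as the tuning criterion are assumptions (the paper does not specify them):

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 score from 0/1 arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(scores, gold, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, on dev data, the decision threshold maximising F1 for one UPV."""
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        score = f1(gold, (scores >= t).astype(int))
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t

# Illustrative dev-set scores for a single UPV label (not real data).
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.1])
gold   = np.array([1,   1,   0,   0,   1,   0])
t = tune_threshold(scores, gold)
```

Since recall matters most in this application, the criterion could equally be a recall-weighted measure such as F2; the mechanics stay the same.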
Unsurprisingly, particularly good performance is often obtained on T3 labels which tend to correlate with specific terms (such as School Fees, or Faith). In particular, we observe a correlation between a T3 label's support in the corpus and the system's precision in predicting that label: with very few exceptions, all labels for which the system obtained a precision lower than 30% had a support similar to or lower than 3%.
The analysis of the ROC curves shows that, overall, satisfactory results are obtained for all T1 labels considered (Appendix D), leaving, however, considerable room for future research.

Conclusions and Future Work
In this study, we provided a first stepping stone towards future research at the intersection of NLP and Sustainable Development (SD). As a case study, we investigated the potential of NLP to enhance project sustainability through improved community profiling, by providing a cost-effective way of structuring qualitative data.
This research is in line with a general call for AI for social good, where the potential positive impact of NLP is notably missing. In this context, we proposed the new challenging task of Automatic User-Perceived Values Classification: we provided the task definition, an annotated dataset (the Stories2Insights corpus) and a set of light (in terms of overall number of parameters) neural baselines for future reference.
Future work will investigate ways to improve performance (and especially precision scores) on our data, in particular on low-support labels. Possible research directions include more sophisticated threshold selection techniques (Fan and Lin, 2007; Read et al., 2011) to replace the simple threshold finetuning which is currently used for simplicity. While deeper and computationally heavier models such as that of Devlin et al. (2019) could possibly obtain notable gains in performance on our data, it is the responsibility of the NLP community, especially with regard to social good applications, to provide solutions which do not penalise countries suffering from access biases (such as contexts with low access to computational power), as is the case for many developing countries.
We hope our work will spark interest and open a constructive dialogue between the fields of NLP and SD, and result in new interesting applications.

Appendix A - User-Perceived Value Definitions (excerpt).

Rate of output and means that lead to increased productivity
Reliability: The ability to rely or depend on the operation or function of an item or service
Usability: Refers to physical interaction with an item being easy to operate, handle or look after

Social Norm
Celebration: Association chosen as they play an important part during celebration
Manners: Ways of behaving with reference to polite standards and social components
Morality: Following rules and the conduct
Tradition: Expected form of behaviour embedded into the specific culture of a city or village

Religion
Faith: Belief in god or in the doctrines or teachings of religion

Intrinsic Human

Health
Longevity: Means that lead to an extended life span
Health Care Access: Being able to access medical services or medicine
Treatment: To require a hospital or medical attention as a consequence of illness or injury
Preserv. of Health: Practices performed for the preservation of health

Physiological
Education Access: Being able to access educational services
Energy Access: Being able to obtain energy services or resources
Food Security: The ability to have a reliable and continuous supply of food
Shelter: A place giving protection from bad weather or danger
Water Access: Continuous access or availability of water
Water Quality: To have clean water as regards sickness, colour and taste

Quality of Life
Wellbeing: Obtaining good or satisfying living conditions (for people or for the community)

Significance

Identity
Appearance: Act or fact of appearing as to the eye or mind of the public
Belongingness: Association with a certain group, their values and interests
Dignity: The state or quality of being worthy of honour or respect
Personal Performance: The productivity with which someone executes or accomplishes work

Status
Aspiration: Desire or aim to become someone better or more powerful or wise
Modernisation: Transition to a modern society away from a traditional society
Reputation: Commonly held opinion about one's character

Social Interaction
Altruism: The principle and practice of unselfish concern
Family Caring: Displaying kindness and concern for family members
Role Fulfilling: Duty of fulfilling tasks or responsibilities associated with a certain role
Togetherness: Warm fellowship, as among friends or members of a family

Figure 6: Co-occurrence matrix of T3 labels in the S2I corpus.
Appendix B - Co-occurrence matrix of User-Perceived Values in the S2I corpus.
The co-occurrence matrix in Figure 6 depicts the inter-relatedness between different T3 labels. The intensity of colour corresponds to the number of samples in the S2I corpus where the given T3 labels co-occur.
The analysis of label co-occurrence can offer valuable insights into commonly associated User-Perceived Values (UPVs; Hirmer and Guthrie, 2014): this can be useful to highlight challenges and problems in the considered community, which might not be known to the dwellers themselves. While some correlations are typical and expected, others are related to the specific Ugandan context, and might be surprising to those external to the location.
For example, Economic Opportunity, Food Security and Preservation of Health appear to frequently co-occur with other T3 labels. Note that the lack of employment opportunity, the availability of food and the quality of healthcare services represent endemic problems in the rural context studied in this paper. As they constitute primary concerns for most interviewees, it is therefore unsurprising that they were mentioned frequently in relation to many of the items selected as part of the UPV game (Section 4). A further illustrative example of the cultural context, in this case rural Uganda, is the high co-occurrence of Unburden and Mobility. This can be explained by the fact that rural roads are often of poor quality and villages or areas are inaccessible by motorised vehicles. Hence, people are required to find alternative modes of transport to move themselves to hospital, or their crops to the nearest market for sale, for example. As a final example, the frequent mentioning of Faith, Harmony and Morality, which also tend to co-occur in similar contexts, testifies to the fundamental role played by religion in the rural villages considered in this study.
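A co-occurrence matrix like the one in Figure 6 can be computed directly from the per-sentence label sets; a minimal numpy sketch with made-up samples:

```python
import numpy as np

# Illustrative per-sentence T3 label sets (not real S2I data).
samples = [
    {"Economic Opportunity", "Food Security"},
    {"Food Security", "Preservation of Health"},
    {"Unburden", "Mobility"},
    {"Faith", "Harmony", "Morality"},
]

labels = sorted(set().union(*samples))
index = {l: i for i, l in enumerate(labels)}

# cooc[i, j] counts the samples where labels i and j are both present.
cooc = np.zeros((len(labels), len(labels)), dtype=int)
for s in samples:
    for a in s:
        for b in s:
            if a != b:
                cooc[index[a], index[b]] += 1
```

The resulting matrix is symmetric with a zero diagonal, and can be rendered as a heatmap to reproduce a figure in the style of Figure 6.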
The information on the (co-)occurrence of UPVs in a community is also particularly valuable for designing appropriate project communication (Figure 1b), which can increase project buy-in through focused messaging (Section 3).

Appendix C -Experimental Specifications.
In this Appendix, we report the exact experimental settings used, to aid reproducibility.

C.1 Data Specifications
Data Selection and Splitting. We select all sentences from the 119 interviews which are at least 3 tokens long and which were annotated with at least one UPV. We then randomly select an 80% proportion of the data as the training set, and take the rest as held-out data (with a dev/test split of 450 and 530 samples respectively). Figure 7 shows that the obtained label distributions are similar across splits.

Data Anonymization. In order to prevent the identification of the interviewees (Sweeney, 2000), data was manually anonymized. We anonymized all occurrences of proper names; names of villages, cities or other geographical elements; and other names that might be sensitive (such as names of tribes or languages).
Data Sample. We provide a sample of the data in the supplementary material. Each data sample is associated with the following fields:
• id: a unique identifier of the sample;
• text: a sentence to be classified;
• t3 labels: a list of the gold T3 labels associated with the sample.
For privacy reasons, we are not releasing the metadata associated with the samples (such as the interviewee's name, gender, age, or the exact village name).

Data Preprocessing. For sentence splitting and word tokenization, we use NLTK's sent_tokenize and word_tokenize tokenizers (Bird and Loper, 2004) 12. We use a set of regexes to find interviewer comments and questions. Given that why-probing (Section 4.1, Reynolds and Gutman (2001)) was used, interviewers' comments are very limited and standard.

Negative Samples Generation. To generate negative samples (Section 6.1), we slightly modify Wei and Zou (2019)'s implementation 13 of EDA (Easy Data Augmentation techniques) by adding a new function for character swapping and by adapting the stopword list. For semantic-based replacement, we rely on NLTK's interface 14 to WordNet (Fellbaum, 2012). Random shuffling and choice are controlled by a seed.
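The exact regexes are not given in the paper; a minimal sketch of how standardised interviewer turns might be filtered could look as follows (the turn markers "I:"/"Interviewer:" are assumptions about the transcript convention):

```python
import re

# Assumed convention: interviewer turns start with "I:" or "Interviewer:",
# possibly after leading whitespace; everything else is interviewee speech.
INTERVIEWER_RE = re.compile(r"^\s*(I|Interviewer)\s*:", re.IGNORECASE)

def interviewee_turns(transcript_lines):
    """Drop interviewer comments/questions, keeping interviewee utterances."""
    return [line.strip() for line in transcript_lines
            if line.strip() and not INTERVIEWER_RE.match(line)]

lines = [
    "Interviewer: Why is the radio important to you?",
    "R: It gives us news about the market prices.",
    "I: Anything else?",
    "R: We also listen to it together as a family.",
]
kept = interviewee_turns(lines)
```

Because why-probing keeps interviewer turns short and formulaic, a handful of such patterns is typically enough to isolate the interviewee's utterances for annotation.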

C.2 Further Specifications
(Hyper)-Parameters Selection. All parameters used for experiments are reported in Table 4. We use 300-dimensional FastText subword-informed pretrained vectors (Bojanowski et al., 2017) 15 to get the word embedding representations for each input sample. Note that the goal of this paper is to present an interesting new NLP application, namely NLP for Sustainable Development: therefore, our goal here is to provide a set of robust baselines on our new S2I dataset, which can be referenced in future research. For this reason, we do not perform extensive hyper-parameter tuning on the selected models.
The only parameters we optimize are the numbers of generated negative samples of each type (mildly negative, negative and strictly negative). The best ratios were found empirically through experiments; the ratios used for optimization are reported in Table 5.

The analysis of the performance progression over training (Figure 8) shows that, in line with Wei and Zou (2019), adding negative examples is useful to improve performance: in our case, the plateau is reached at around 40 augmented samples. In particular, we observe gains at all considered output levels (T1, T2 and T3 labels).

Number of Parameters and Runtime Specifications. Table 6 reports the total number of (trainable) parameters and the average runtime per step for each considered model. Embeddings are kept fixed over training to avoid overfitting.

Computing Infrastructure. We run experiments on an NVIDIA GeForce GTX 1080 GPU.

Evaluation Specifications. For computing the evaluation metrics, we use sklearn's (Pedregosa

15 We chose the wiki.en.zip model pretrained on the English Wikipedia: https://fasttext.cc/docs/en/pretrained-vectors.html