Discourse Coherence in the Wild: A Dataset, Evaluation and Methods

To date there has been very little work on assessing discourse coherence methods on real-world data. To address this, we present a new corpus of real-world texts (GCDC) as well as the first large-scale evaluation of leading discourse coherence algorithms. We show that neural models, including two that we introduce here (SentAvg and ParSeq), tend to perform best. We analyze these performance differences and discuss patterns we observed in low coherence texts in four domains.


Introduction
Discourse coherence is an important aspect of text quality. It encompasses how sentences are connected as well as how the entire document is organized to convey information to the reader. Developing discourse coherence models to distinguish coherent writing from incoherent writing is useful to a range of applications. An automated coherence scoring model could provide writing feedback, e.g. identifying a missing transition between topics or highlighting a poorly organized paragraph. Such a model could also improve the quality of natural language generation systems.
One approach to modeling coherence is to model the distribution of entities over sentences. The entity grid (Barzilay and Lapata, 2005), based on Centering Theory (Grosz et al., 1995), was the first of these models. Extensions to the entity grid include additional features (Elsner and Charniak, 2008, 2011; Feng et al., 2014), a graph representation (Guinaudeau and Strube, 2013; Mesgar and Strube, 2015), and neural convolutions (Tien Nguyen and Joty, 2017). Other approaches have used lexical cohesion (Morris and Hirst, 1991; Somasundaran et al., 2014), discourse relations (Lin et al., 2011; Feng et al., 2014), and syntactic features (Louis and Nenkova, 2012). Neural networks have also been successfully applied to coherence (Li and Hovy, 2014; Tien Nguyen and Joty, 2017; Li and Jurafsky, 2017). However, until now, these approaches have not been benchmarked on a common dataset.
Past work has focused on the discourse coherence of well-formed texts in domains like newswire (Barzilay and Lapata, 2005;Elsner and Charniak, 2008) via tasks like sentence ordering that use artificially constructed data. It was unknown how well the best methods would fare on real-world data that most people generate.
In this work, we seek to address the above deficiencies via four main contributions. First, we present a new corpus, the Grammarly Corpus of Discourse Coherence (GCDC), for real-world discourse coherence. The corpus contains texts the average person might write, e.g. emails and online reviews, each with a coherence rating from expert annotators (see examples in Table 1 and the supplementary material). Second, we introduce two simple yet effective neural network models to score coherence. Third, we perform the first large-scale benchmarking of 7 leading coherence algorithms. We show that prior models, which performed at a very high level on well-formed and artificially generated data, have markedly lower performance in these new domains. Finally, the data, annotation guidelines, and code have all been made public.

A Corpus for Discourse Coherence

Related Work
Most previous work in discourse coherence has been evaluated on a sentence ordering task that assumes each text is well-formed and perfectly coherent, and any reordering of the same sentences is less coherent. Presented with a pair of texts (the original and a random permutation of the same sentences), a coherence model should be able to identify the original text. More challenging versions of this task (sentence insertion (Elsner and Charniak, 2011) and paragraph reconstruction (Lapata, 2003; Li and Jurafsky, 2017)) all assume that the original text is perfectly coherent. Datasets for the sentence ordering task tend to use texts that have been professionally written and extensively edited. These have included the Accidents and Earthquakes datasets (Barzilay and Lapata, 2005), the Wall Street Journal (Elsner and Charniak, 2008, 2011; Lin et al., 2011; Feng et al., 2014; Tien Nguyen and Joty, 2017), and Wikipedia (Li and Jurafsky, 2017).

Table 1: Examples of texts and coherence scores from the Clinton domain.

Low: "Should I be flattered? Even a little bit? And, as for my alibi, well, let's just say it depends on the snow and the secret service. So, subject to cross for sure. Do you think there could be copycats? Do you think the guy chose that mask or just picked up the nearest one? Please keep me informed as the case unfolds-On another matter, can you believe Dan Burton will be the chair of one of the House subcommittees we'll have to deal w? Irony and satire are the only sane responses. Happy New Year-and here's hoping for many more stories that make us laugh!"

High: "Cheryl, I just spoke with Vidal Jorgensen. They expect to be on the ground in about 8 months. They have not yet raised enough money to get the project started - the total needed is $6M and they need $2M to get started. Vidal said they process has been delayed because their work in Colombia and China is consuming all their resources at the moment. Once on the ground, they will target the poorest of the poor and go to the toughest areas of Haiti. They anticipate an average loan size of $200 and they expect to reach about 10,000 borrowers in five years. They expect to be profitable in 4-5 years. Meghann"
Another task, summary evaluation (Barzilay and Lapata, 2005), uses human coherence judgments, but includes machine-generated texts. Coherence models are only required to identify which of a pair of texts is more coherent (presumably identifying human-written texts).
The line of work most closely related to our approach is the application of coherence modeling to automated essay scoring. Essays are written by test-takers, not professional writers, so they are not assumed to be coherent. Manual annotation is required to assign the essay an overall quality score (Feng et al., 2014) or to rate the coherence of the essay (Somasundaran et al., 2014;Burstein et al., 2010Burstein et al., , 2013. While this line of work goes beyond sentence ordering to examine the qualities of a low-coherence text, it has only been applied to test-taker essays.
In contrast to previous datasets, we collect writing from non-professional writers in everyday contexts. Rather than using permuted or machine-generated texts as examples of low coherence, we want to investigate the ways in which people try but fail to write coherently. We present a corpus that contains texts from four domains, covering a range of coherence, each annotated with a document-level coherence score. In Sections 2.2-2.6, we describe our data collection process and the characteristics of the resulting corpus.

Domains
For a robust evaluation, we selected domains that reflect what an average person writes on a regular basis: forum posts, emails, and product reviews. For online forum posts, we sampled responses from the Yahoo Answers L6 corpus for the Yahoo domain. For emails, we used the State Department's release of emails from Hillary Clinton's office and emails from the Enron Corpus to make up our Clinton and Enron domains. Finally, we sampled reviews of businesses from the Yelp Open Dataset for our Yelp domain.

Text Selection
We randomly selected texts from each domain given a few filters. We wanted each text to be long enough to exhibit a range of characteristics of local and global coherence, but not so long that the labeling process would be tedious for annotators. Therefore, we considered texts between 100 and 300 words in length. We ignored texts containing URLs (as they often quote writing from other sources) and texts with too many line breaks (usually lists).
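The selection filters above can be sketched as a single predicate. The cap on line breaks is illustrative, since the text only says "too many"; the 100-300 word bounds are from the paper.

```python
import re

def keep_text(text, min_words=100, max_words=300, max_line_breaks=10):
    """Return True if a candidate text passes GCDC-style selection filters:
    length bounds, no URLs, and not too many line breaks. The line-break
    cap is an illustrative value, not the paper's."""
    n_words = len(text.split())
    if not (min_words <= n_words <= max_words):
        return False
    if re.search(r"https?://|www\.", text):
        return False  # texts with URLs often quote writing from other sources
    if text.count("\n") > max_line_breaks:
        return False  # texts with many line breaks are usually lists
    return True
```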

Annotation
We collected coherence judgments both from expert raters with prior linguistic annotation experience, as in Burstein et al. (2010) and from untrained raters via Amazon Mechanical Turk. This allows us to assess the efficacy of using untrained raters for this task. We asked the raters to rate the coherence of each text on a 3-point scale from 1 (low coherence) to 3 (high coherence) given the following instructions, which are based on prior coherence annotation efforts (Barzilay and Lapata, 2008;Burstein et al., 2013): A text with high coherence is easy to understand, well-organized, and contains only details that support the main point of the text. A text with low coherence is difficult to understand, not well organized, or contains unnecessary details. Try to ignore the effects of grammar or spelling errors when assigning a coherence rating.

Expert Rater Annotation
We solicited judgments from 13 expert raters with previous annotation experience. We provided a high-level description of coherence but no detailed rubric, as we wanted them to use their own judgment. We also provided examples of low, medium, and high coherence along with a brief justification for each label. The raters went through a calibration phase during which we provided feedback about their judgments. In the annotation phase, we collected 3 expert rater judgments for each text.
Mechanical Turk Annotation

We collected 5 MTurk judgments for each text from a group of 62 Mechanical Turk annotators who passed our qualification test. We again provided a high-level description of coherence. However, we only provided a few examples for each category so as not to overwhelm the annotators. We were mindful of how the characteristics of each domain might affect the resulting coherence scores. For example, after rating a batch of generally low coherence forum data, business emails may appear to be more coherent. However, our goal is to discover the characteristics of a low coherence business email or a low coherence forum post, not to compare the two domains. Therefore, we recruited new MTurk raters for each domain so as not to bias their scores. The same 13 expert raters worked on all four domains, but we specifically instructed them to consider whether each text was a coherent document for its domain.

Grammarly Corpus of Discourse Coherence
The resulting four domains each contain 1200 texts (1000 for training, 200 for testing). Each text has been scored as {low, medium, high} coherence by 5 MTurk raters and 3 expert raters. There is one consensus label for the expert ratings and another consensus label for the MTurk ratings. We computed the consensus label by averaging the integer values of the coherence ratings (low = 1, medium = 2, high = 3) over the MTurk or expert ratings and thresholding the mean coherence score (low ≤ 1.8 < medium ≤ 2.2 < high) to produce a 3-way classification label (Table 2). We observed that the MTurk raters tended to label more texts as "medium" coherence than the expert raters. Since the MTurk raters did not go through an extensive training session, they may be less confident in their ratings, defaulting to medium as the safe option. Table 3 contains type and token counts for the full dataset, and Figure 1 shows the number of paragraphs, sentences, and words per document.
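The consensus computation reduces to averaging integer ratings and thresholding the mean; a minimal sketch using the thresholds stated above:

```python
def consensus_label(ratings):
    """Map a list of ratings ('low', 'medium', 'high') to a consensus class
    by averaging integer values (low=1, medium=2, high=3) and thresholding
    the mean: low <= 1.8 < medium <= 2.2 < high."""
    value = {"low": 1, "medium": 2, "high": 3}
    mean = sum(value[r] for r in ratings) / len(ratings)
    if mean <= 1.8:
        return "low"
    if mean <= 2.2:
        return "medium"
    return "high"
```

For example, three expert ratings of low, medium, high average to 2.0 and yield a "medium" consensus.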

Annotation Agreement
To quantify agreement among annotators, we follow Pavlick and Tetreault (2016)'s approach to simulate two annotators from crowdsourced labels. We repeat the simulation 1000 times and report the mean agreement values in Table 4 for both intraclass correlation (ICC) and quadratic weighted Cohen's κ for an ordinal scale. The expert raters have fair agreement (Landis and Koch, 1977) for three of the domains, but agreement among MTurk raters is quite low. These agreement numbers are the result of an extensive annotation development process and emphasize the difficulty of the task. We recommend that future work in this area leverage raters with a strong annotation background and allow time for in-depth instructions. For evaluation, we use the consensus label from the expert judgments. For comparison, we include an experiment using MTurk consensus labels in the supplementary material.
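A sketch of this simulation, under the assumption (our reading of the simulation protocol) that each trial samples two ratings per item without replacement to form two pseudo-annotators; quadratic weighted κ is implemented directly from its definition over integer labels 1..k.

```python
import random

def quadratic_kappa(a, b, k=3):
    """Quadratic weighted Cohen's kappa for two raters with labels 1..k."""
    n = len(a)
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x - 1][y - 1] += 1
    pa = [sum(row) for row in obs]                              # rater A marginals
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]   # rater B marginals
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2                     # quadratic weight
            num += w * obs[i][j]                                # observed disagreement
            den += w * pa[i] * pb[j] / n                        # expected under chance
    return 1.0 - num / den

def simulated_agreement(ratings_per_item, trials=1000, seed=0):
    """Average kappa over trials; each trial samples two ratings per item
    (without replacement) to act as two pseudo-annotators."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        a, b = [], []
        for ratings in ratings_per_item:
            x, y = rng.sample(ratings, 2)
            a.append(x)
            b.append(y)
        scores.append(quadratic_kappa(a, b))
    return sum(scores) / len(scores)
```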

Models
We evaluate a range of existing discourse coherence models on GCDC: entity-based models, a word embedding graph model, and neural network models. These models from previous work have been very effective on the sentence ordering task, but have not been used to produce coherence scores. We also introduce two new neural sequence models.

Baseline
We compute the Flesch-Kincaid grade level (Kincaid et al., 1975) of each text and treat it as a coherence score. While Flesch-Kincaid is a readability measure, previous work has treated readability and text coherence as overlapping tasks (Barzilay and Lapata, 2008;Mesgar and Strube, 2015). For coherence classification, we search over the grade level scores on the training data and select thresholds that result in the highest accuracy.
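The threshold search can be sketched as a brute-force scan over cutoff pairs; computing the Flesch-Kincaid grade itself is left to an off-the-shelf readability tool. The fixed low-to-high class order is an illustrative choice, since the sign of the relationship between grade level and coherence is not given.

```python
from itertools import combinations

def best_thresholds(scores, labels, classes=("low", "medium", "high")):
    """Brute-force two cutoffs t1 <= t2 over the observed scores so that
    score <= t1 -> classes[0], t1 < score <= t2 -> classes[1], and
    otherwise classes[2], maximizing training accuracy. In practice the
    reversed class order should be tried as well."""
    candidates = sorted(set(scores))
    best = (0.0, None)
    for t1, t2 in combinations(candidates, 2):
        pred = [classes[0] if s <= t1 else classes[1] if s <= t2 else classes[2]
                for s in scores]
        acc = sum(p == g for p, g in zip(pred, labels)) / len(labels)
        if acc > best[0]:
            best = (acc, (t1, t2))
    return best
```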

Entity-based Models
Entity-based models track entity mentions throughout the text. In the majority of our experiments, we apply Barzilay and Lapata (2008)'s coreference heuristic and consider two nouns to be coreferent only if they are identical. As Elsner and Charniak (2011) noted, automatic coreference resolution often fails to improve coherence modeling results. However, we also evaluate the effect of adding an automatic coreference system in Section 4.1.
Entity grid (EGRID) The entity grid (Barzilay and Lapata, 2005) is a matrix that tracks entity mentions over sentences. We reimplemented the model from Barzilay and Lapata (2008), converting the entity grid into a feature vector that expresses the probabilities of local entity transitions. We use scikit-learn (Pedregosa et al., 2011) to train a random forest classifier over the feature vectors.
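A minimal sketch of the feature extraction step, assuming the grid has already been built with the usual roles S (subject), O (object), X (other), and "-" (absent); the feature vector holds the probability of each length-k role transition.

```python
from collections import Counter
from itertools import product

ROLES = ["S", "O", "X", "-"]  # subject, object, other, absent

def transition_features(grid, k=2):
    """Given an entity grid (one row per sentence, one column per entity,
    cells drawn from ROLES), return the probability of each length-k
    role transition, i.e. the entity grid feature vector."""
    counts = Counter()
    n_sent, n_ent = len(grid), len(grid[0])
    for e in range(n_ent):
        for s in range(n_sent - k + 1):
            counts[tuple(grid[s + i][e] for i in range(k))] += 1
    total = sum(counts.values())
    return {t: counts[t] / total for t in product(ROLES, repeat=k)}
```

A random forest classifier (as above) can then be trained over these vectors.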
Entity graph (EGRAPH) The entity graph (Guinaudeau and Strube, 2013) interprets the entity grid as a graph whose nodes are sentences. Two nodes are connected if they share at least one entity. Graph edges can be weighted according to the number of entities shared, the syntactic roles of the entities, or the distance between sentences. The coherence score of a text is the average out-degree of its graph, so for classification we identify the thresholds that maximize accuracy on the training data.
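A sketch of the unweighted variant, representing each sentence by its set of entity mentions; edges point from a sentence to later sentences that share an entity, and the score is the average out-degree.

```python
def egraph_score(sentence_entities):
    """Unweighted entity graph: sentences are nodes, and sentence i has an
    outgoing edge to a later sentence j when they share an entity. The
    coherence score is the average out-degree (a simplification of
    Guinaudeau and Strube, 2013)."""
    n = len(sentence_entities)
    out_degree = 0
    for i in range(n):
        for j in range(i + 1, n):
            if sentence_entities[i] & sentence_entities[j]:
                out_degree += 1
    return out_degree / n
```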

Entity grid with convolutions (EGRIDCONV)
Tien Nguyen and Joty (2017) applied a convolutional neural network to the entity grid to capture long-range transitions. We use the authors' implementation.

Lexical Coherence Graph (LEXGRAPH)
The lexical coherence graph (Mesgar and Strube, 2016) represents sentences as nodes of a graph, connecting nodes with an edge if the two sentences contain a pair of similar words (i.e. the cosine similarity of their pre-trained word vectors is greater than a threshold). From the graph, we can extract a feature vector that expresses the frequency of all k-node subgraphs. We use the authors' implementation and train a random forest classifier over the feature vectors.

Neural Network Models
We reimplemented a neural network model of coherence, the sentence clique model, to evaluate its effectiveness on GCDC. We also introduce two new neural network models that are more straightforward to implement than the clique model.
Sentence clique (CLIQUE) Li and Jurafsky (2017)'s model operates over cliques of adjacent sentences. For the sentence ordering task, a positive clique is a sequence of k sentences from the original document. A negative clique is created by replacing the middle sentence of a positive clique with a random sentence from elsewhere in the text. The model contains a single LSTM (Hochreiter and Schmidhuber, 1997) that takes a sequence of GloVe word embeddings and produces a sentence vector at the final output step. All k sentence vectors are concatenated and passed through a final layer to produce a probability that the clique is coherent. The final coherence score is the average of the scores of all cliques in the document.
We extend CLIQUE to 3-class classification by labeling each clique with the document class label (low, medium, high). To predict the text label, the model averages the predicted coherence class distributions over all cliques.
Sentence averaging (SENTAVG) To investigate the extent to which sentence order is important in our data, we introduce a neural network model that ignores sentence order. The model contains a single LSTM that produces a sentence vector (the final output vector) from a sequence of GloVe embeddings for the words in that sentence. The document vector is the average over all sentence vectors in that document, and is passed through a hidden layer and a softmax to produce a distribution over coherence labels.
Paragraph sequence (PARSEQ) The role of paragraph breaks has not been explicitly discussed in previous work. Models like EGRID assume that entity transitions have the same weight whether adjacent sentences A and B occur in the same paragraph or different paragraphs. We expect paragraph breaks to be important for assessing coherence in longer documents.
Therefore, we introduce a paragraph sequence model, PARSEQ, that can distinguish between paragraphs.
PARSEQ contains three stacked LSTMs: the first takes a sequence of GloVe embeddings to produce a sentence vector, the second takes a sequence of sentence vectors to produce a paragraph vector, and the third takes a sequence of paragraph vectors to produce a document vector. The document vector is passed through a hidden layer and a softmax to produce a distribution over coherence labels. A diagram of this model is available in the supplementary material.
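A PyTorch sketch of this architecture. The dimensions, batching scheme, and final-timestep readout are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class ParSeq(nn.Module):
    """Sketch of the three-level PARSEQ architecture: word -> sentence,
    sentence -> paragraph, and paragraph -> document LSTMs, followed by
    a hidden layer and a softmax over 3 coherence classes. Hyperparameters
    here are illustrative."""

    def __init__(self, emb_dim=300, hidden=100, classes=3):
        super().__init__()
        self.word_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.sent_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.par_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, classes))

    def forward(self, doc):
        # doc: list of paragraphs; each paragraph is a tensor of sentence
        # word embeddings with shape (n_sents, n_words, emb_dim)
        par_vecs = []
        for par in doc:
            _, (h, _) = self.word_lstm(par)         # h: (1, n_sents, hidden)
            sent_vecs = h.squeeze(0).unsqueeze(0)   # (1, n_sents, hidden)
            _, (p, _) = self.sent_lstm(sent_vecs)   # p: (1, 1, hidden)
            par_vecs.append(p.squeeze(0).squeeze(0))
        pars = torch.stack(par_vecs).unsqueeze(0)   # (1, n_pars, hidden)
        _, (d, _) = self.par_lstm(pars)             # d: (1, 1, hidden)
        return self.out(d.squeeze(0).squeeze(0))    # logits over classes
```

Applying a softmax to the returned logits yields the distribution over coherence labels.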

Evaluation
We evaluate the models on multiple coherence prediction tasks. The best model parameters, reported in the supplementary material, are the result of 10-fold cross-validation over the training data.
For all neural models (EGRIDCONV, EGRIDCONV+coref, CLIQUE, SENTAVG, and PARSEQ), the reported results are the mean of 10 runs with different random seeds, as suggested by Reimers and Gurevych (2017).
We indicate (†) when the best neural model result is significantly better (p < 0.05) than the best non-neural result. We use the one-sample Wilcoxon signed rank test and adjust the p-values to account for the false discovery rate.
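A sketch of this procedure with SciPy, using a Benjamini-Hochberg step-up adjustment (one common FDR correction; the exact correction used is not named above). The accuracy numbers are invented for illustration.

```python
from scipy.stats import wilcoxon

def fdr_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running = 1.0
    for rank in range(n - 1, -1, -1):
        i = order[rank]
        running = min(running, pvals[i] * n / (rank + 1))
        adjusted[i] = running
    return adjusted

# One-sample test: do 10 seeded runs of a neural model beat the best
# non-neural accuracy? (Numbers below are made up for illustration.)
neural_runs = [0.551, 0.562, 0.543, 0.574, 0.585,
               0.556, 0.567, 0.548, 0.579, 0.560]
best_non_neural = 0.520
stat, p = wilcoxon([r - best_non_neural for r in neural_runs])
```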

Classification
For this task, each text has a consensus label expressing how coherent it is: {low, medium, high}. We report overall accuracy for all systems on predicting the expert rater consensus label (Table 5). We repeated this evaluation using the MTurk rater labels and included those results in the supplementary material. The neural models outperformed the entity-based and lexical graph models. Non-neural models showed mixed results, performing on par with or worse than our baseline. Most models perform poorly on Yelp, worse than the baseline, perhaps because Yelp has the lowest annotator agreement among expert raters.
We also tried adding coreference information for the entity-based methods, as it has been shown to be useful in some prior work (Barzilay and Lapata, 2008;Elsner and Charniak, 2008). For the base entity model experiments, we used Barzilay and Lapata (2008)'s heuristic to determine whether two nouns are coreferent. For the +coref setting, we used the Stanford coreference annotator (Clark and Manning, 2015) as a preprocessing step before computing the entity grid. The coreference system yielded consistent performance improvements of 1-5% accuracy over the corresponding heuristic results, indicating that automatic coreference resolution can help entity-based models in these domains.

Score Prediction
A 3-point coherence score might not reflect the range of coherence that actually exists in the data. We can instead present a more fine-grained score prediction task where the gold score is the mean of the three expert rater judgments (low coherence = 1, medium = 2, high = 3). In Table 6, we report Spearman's rank correlation coefficient between the gold scores and the predicted coherence scores. As in the classification task, the neural methods convincingly outperformed all other methods, with PARSEQ the top performer in three out of four domains.
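The gold score construction and the correlation measure are straightforward to reproduce; the model scores below are invented for illustration.

```python
from scipy.stats import spearmanr

VALUE = {"low": 1, "medium": 2, "high": 3}

def gold_scores(expert_ratings):
    """Mean of the expert ratings for each text, on the 1-3 scale."""
    return [sum(VALUE[r] for r in rs) / len(rs) for rs in expert_ratings]

# Example: rank correlation between gold scores and (invented) model scores
gold = gold_scores([["low", "low", "medium"],
                    ["medium", "high", "high"],
                    ["high", "high", "high"]])
rho, pval = spearmanr(gold, [0.9, 1.8, 2.7])
```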

Sentence Ordering
The sentence ordering ranking task is a somewhat artificial evaluation, as a document whose sentences have been randomly shuffled does not resemble a human-written text that is not very coherent. However, we still want to assess whether good performance on previous sentence ordering datasets translates to GCDC. Since the sentence ordering task assumes well-formed texts, we use only the high coherence texts. As a result, there are fewer texts than for the classification task, as we show below. Table 7 shows the accuracy of each system on identifying the original text in each (original, permuted) text pair. We leave out the baseline and SENTAVG because they ignore sentence order. We also simplify PARSEQ to a sentence sequence model (SENTSEQ) containing only two LSTMs because the sentence ordering task ignores paragraph information. As in the prior two evaluations, the neural models perform best in most domains, although EGRAPH is best on Yahoo.
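The pairwise evaluation itself can be sketched as follows, given any scoring function; the number of permutations per document is a parameter (20 is used in the WSJ experiment reported in the supplementary material).

```python
import random

def ordering_accuracy(docs, score_fn, permutations=20, seed=0):
    """Sentence ordering evaluation: for each document (a list of
    sentences), generate random permutations and count the fraction of
    (original, permuted) pairs where the model scores the original
    higher. Identity permutations are skipped; ties count as errors."""
    rng = random.Random(seed)
    correct = total = 0
    for sents in docs:
        for _ in range(permutations):
            perm = sents[:]
            rng.shuffle(perm)
            if perm == sents:
                continue
            total += 1
            correct += score_fn(sents) > score_fn(perm)
    return correct / total
```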

Minority Class Classification
One application of a coherence classification system would be to provide feedback to writers by flagging text that is not very coherent. Such a system would need to identify low coherence texts with high precision. To evaluate this setting, we relabel a text as low coherence if at least two expert annotators judged the text to be low coherence, and relabel it as not low coherence otherwise.
We report the F0.5 score of the low coherence class in Table 8, where precision is emphasized twice as much as recall (per-system precision and recall appear in the supplementary material). This is in line with evaluation standards in other writing feedback applications (Ng et al., 2014). Again, the neural models perform best in most domains. However, the results of this experiment in particular show that there is still a large gap between the performance of these models and what might be required for high-precision real-world applications.
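The F-beta measure used here follows directly from its standard definition; with beta = 0.5, precision counts twice as much as recall.

```python
def f_beta(gold, pred, positive="low", beta=0.5):
    """F_beta score for one class; beta=0.5 emphasizes precision twice
    as much as recall."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```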

Cross-Domain Classification
Up to this point, we assumed that the four domains are different enough from one another that we should train separate models for each. To test this assumption, we train PARSEQ, one of the top performing neural models, in one domain (e.g. Yahoo) and evaluate it in a different domain (Clinton, Enron, and Yelp). Table 9 compares the in-domain results (the diagonal) to the cross-domain results. While the model's accuracy generally decreases when transferred to a different domain, sometimes this decrease is not too severe: for example, training on Yahoo/Enron data and testing on Clinton data, or training on Yahoo data and testing on Yelp data. It is reasonable that training on one set of business emails (Clinton or Enron) produces a model that can accurately score the coherence of other sets of business emails. Similarly, both Yahoo and Yelp contain online text written for public consumption which may share coherence characteristics, so it is not surprising that a model trained on Yahoo data works on Yelp (even outperforming the Yelp-trained model).
These results indicate that we might be able to train a better coherence model by combining all our data across multiple domains. We evaluate this theory in Table 10, comparing the results of the PARSEQ model evaluated in-domain (e.g. trained and tested on Yahoo data) to a model trained on the combined training data from all four domains. With four times as much training data, the performance of PARSEQ improves in all domains, indicating that better coherence models may be trained from data outside of a specific, narrow domain.

Discussion
We observe some trends across our experiments. The basic entity models (EGRID and EGRAPH) tend to perform poorly, often barely outperforming the baseline. The entity grids computed from GCDC texts are often extremely sparse, so meaningful entity transitions between sentences are infrequent. In addition, scoring the coherence of a text (either classification or score prediction) is more difficult than the sentence ordering task, where basic entity models do outperform the random baseline by a reasonable margin. Both the data and the difficulty of the tasks contribute to poor performance from the basic entity models.
The neural network models almost always outperform other models. This supports Li and Jurafsky (2017)'s claim that neural models are better able to extend to other domains compared to previous coherence models. Our PARSEQ and SENTAVG models are easier to implement than CLIQUE and outperform CLIQUE on a majority of experiments. EGRIDCONV usually does not perform as well as the other neural models, but it usually improves over EGRID. Finally, the relative success of SENTAVG, which ignores sentence order, is evidence that identifying a document's original sentence order is not the same as distinguishing low and high coherence documents. The large number of parameters in PARSEQ may explain why it is sometimes outperformed by SENTAVG.

Analysis
To better understand what distinguishes a low coherence text from a high coherence text, we manually analyzed Yahoo and Clinton texts whose labels were unanimously agreed on by all three raters. Regardless of the domain, many low coherence texts are not well-organized and appear to be written almost as a stream of consciousness. They often lack connectives, resembling a list of points rather than a coherent document.
Incoherent Yahoo texts often contain extremely long sentences, lack paragraph breaks, and veer off-topic without a transition or any connection back to the main point. This is an especially frequent occurrence with personal anecdotes.
Low coherence Clinton emails make better use of paragraphs, but they too often lack transitions between topics. In addition, missing information was a primary reason for low coherence scores. We provided the raters with individual emails, not the entire email thread, so raters had less information than the original recipient of the email. This amplifies the detrimental effects on coherence of jargon, abbreviation, and missing context. However, overuse of these compression strategies can result in low coherence even for the intended recipient, so it is worth modeling their effects.
Across domains, coherent texts have a clear topic that is maintained throughout the text, and they are well-organized, with sentences, paragraphs, and sub-topics following a logical ordering. Connectives, such as however, for example, in turn, also, and in addition, are used more frequently to assist the structure and flow.
Although sentence order is clearly important, rewriting a disorganized text is not as simple as reordering sentences. Even if changing the location of one sentence increases coherence, a true fix would still require rewriting that sentence or the surrounding sentences. Our analysis indicates that the sentence reordering task is not a good evaluation of whether models can truly be useful to the task of identifying low coherence texts.

Conclusion
In this paper, we examine the evaluation of discourse coherence by presenting a new corpus (GCDC) to benchmark leading methods on real-world data in four domains. While neural models outperform others across multiple evaluations, much work remains before any of these methods can be used for real-world applications. That said, our SENTAVG and PARSEQ models serve as simple and effective methods to use in future work.
We recommend that future evaluations move away from the sentence ordering task. While it is an easy evaluation to carry out, the performance numbers overpredict the success of those systems in real-world conditions. For example, prior evaluations (Tien Nguyen and Joty, 2017; Li and Jurafsky, 2017) report performance numbers around or above 90% accuracy, which contrasts with the much lower figures shown in this paper. In addition, we recommend that future annotation efforts leverage expert raters, preferably with a background in annotation, as this task is difficult for untrained workers on crowdsourcing platforms.
By releasing GCDC, the annotation guidelines, and our code, we hope to encourage future work on more realistic coherence tasks.

A.1 Additional Examples

Table 11 contains additional examples of texts from our corpus, specifically from the Yahoo Answers domain, with their coherence labels.

A.2 Annotator Instructions
The annotation instructions in Section 2.4 are the simplified instructions that we provided to Mechanical Turk workers. The expert annotators received a longer version of those instructions, which are available in Table 12.
A.3 Model Details

Figure 2: Structure of PARSEQ model.

Figure 2 shows the structure of PARSEQ. The sentence vectors pictured are the output at the final timestep from the first LSTM (not pictured), which takes GloVe word embeddings as input. A second LSTM takes these sentence vectors as input and produces paragraph vectors, and a third LSTM takes a sequence of paragraph vectors and produces a single document vector.

Table 13 contains the classification test results of all systems when the consensus labels come from the Mechanical Turk judgments rather than the expert judgments. Table 14 contains the precision and recall results for the minority class classification test. For neural models, we report precision and recall for one run on test (F0.5 scores in Section 4.4 were averaged over 10 runs).

A.4 Additional Results
To compare all models on an established dataset, we report results on the sentence ordering task using the Wall Street Journal (WSJ) portion of the Penn Treebank. Following previous work, we use 20 random permutations of each article and the train/test split defined by Tien Nguyen and Joty (2017) (train = Section 00-13, test = 14-24). Table 15 contains the results of all models on WSJ. These results verify our re-implementation of the EGRID model, as well as establishing the reasonable performance of our neural sequence model on news text.

A.5 Model Parameters
We specify the parameters for all models and experiments in Tables 16 and 17. Additionally, for the combined training data experiment (Table 10), we train PARSEQ with LSTM dimensionality = 100, hidden layer = 200, dropout = 0.5.
EGRID Sequence length is the length of the transition sequences used to compute the feature vector from the entity grid. For salience, we follow Barzilay and Lapata (2008) and split entities into two salience classes (doubling the number of features) based on whether their frequency is greater than the salience threshold. (Salience = off means that there is only one salience class containing all entities.) Syntax indicates whether we consider grammatical roles (subject, object, other) in building the entity grid.
EGRAPH The graph type specifies whether we use an unweighted graph (u), a graph weighted by the number of entities shared between sentences (w), or a graph weighted by syntactic role information (syn). Distance indicates whether edge weights are decreased according to the distance between sentences.
EGRIDCONV We specify dropout rate, batch size, and entity role embedding size. For the convolution layer, we specify filter number, window size, and pooling length.
LEXGRAPH We define the similarity threshold used to filter out edge weights between sentences, and k as the size of the subgraphs we consider when extracting features from the document graph.
CLIQUE We define the dropout rate, the LSTM dimensionality, and the hidden layer dimensionality.

Table 11: Additional examples of texts and coherence labels from the Yahoo domain.

Low: "I see it, but then again almost every war entered by the U.S. is connected to gaining something. The U.S. is just using politically correct was of taking over a country without anybody noticing it. They enter a war and some how we come out better than the country we went in to help. We say we are helping but if the country has nothing for us then we don't bother with it. For example: Korea stated and I quote "we have nuclear weapons and we plan to use them" so how come we are in Iraq who have no weapons? Well maybe the U.S. sees no threat but then again somebody did sneak into the country and take over planes. Also not to long ago it was common for somebody to hijack a plane. Well that is all I have to say on the matter."

High: "Don't be intimidated by Impressionism. It is simply a style worked in loose strokes. The idea is to give an "impression" of the subject. Choose a simple subject, like a still life or bowl of fruit. Then layout your palette using the colors you see (make sure to look for subtle colors only an artist might see...such as the "blue" in an apple), and with a larger than usual brush, stroke the basic shapes in a medium value, then add shadows, then a highlight layer. That should do for a class project in Impressionism. The danger would come from over-working the painting. You don't want fine strokes or details, remember just the "impression" of your subject. The whole idea is to stay loose and free. A lot of people struggle with it. The trick is to just paint without worrying too much. Good luck."

Table 12: Instructions provided to the expert annotators.

You will be given a short text (100-300 words) to read. We will specify which one of several domains the text comes from, and in some domains we will provide additional context for the text.
Your task is to rate the coherence of the text from 1 to 3 (1 means low coherence, 3 means high coherence).
Coherence in writing refers to how well ideas flow from one sentence to the next, and from one paragraph to the next. A text that is highly coherent is easy to understand and easy to read. This usually means the text is well-organized, logically structured, and presents only information that supports the main idea. On the other hand, a text with low coherence is difficult to understand. This may be because the text is not well organized, contains unrelated information that distracts from the main idea, or lacks transitions to connect the ideas in the text.
Try to ignore the effects of grammar or spelling errors when assigning a coherence rating, as long as the errors do not significantly interfere with your ability to read and understand the text. In the email data, assume that jargon and acronyms are used correctly, and do your best to judge coherence despite that.
You should assign a coherence rating to the text based on whether it is a coherent example of text in that domain. A reader has different expectations about how a business email should be written compared to a post on an online forum, and the coherence rating should reflect this difference. A business email with a score of 1 is not necessarily incoherent in the same way that a very incoherent Yahoo Answers post is, but it is not very coherent for a business email.