The Myth of Double-Blind Review Revisited: ACL vs. EMNLP

The review and selection process for scientific paper publication is essential for the quality of scholarly publications in a scientific field. The double-blind review system, which enforces author anonymity during the review period, is widely used by prestigious conferences and journals to ensure the integrity of this process. Although the notion of anonymity in the double-blind review has been questioned before, the availability of full-text paper collections brings new opportunities for exploring the question: Is the double-blind review process really double-blind? We study this question on the ACL and EMNLP paper collections and present an analysis of how well deep learning techniques can infer the authors of a paper. Specifically, we explore Convolutional Neural Networks trained on various aspects of a paper, e.g., content, style features, and references, to understand the extent to which we can infer the authors of a paper and which aspects contribute the most. Our results show that the authors of a paper can be inferred with accuracy as high as 87% on ACL and 78% on EMNLP for the top 100 most prolific authors.


Introduction
The scientific peer-review process is indispensable for the dissemination of high-quality information (Hojat et al., 2003). However, one of the major problems with this process is bias (Williamson, 2003; Tomkins et al., 2017). For example, Tomkins et al. (2017) performed an experiment during the 2017 ACM Web Search and Data Mining conference to understand the potential bias in favor of authors from prestigious institutions and found that, indeed, when reviewers have access to the authors' identities, they more often tend to favor well-known authors from prestigious institutions. The double-blind review process is generally employed by top scientific journals and conferences in order to guarantee fairness of the paper selection, and thus plays an essential role in how scientific quality is eventually measured (Meadows, 1998). It is designed to reduce the risk of bias in paper reviews, ensuring that all papers are judged solely on their content and intrinsic quality and that any author has a fair chance of having a paper accepted, regardless of their prestige or previous work. The double-blind review process implies that submitted papers have to be anonymized, i.e., the authors' names are not explicitly available with the papers, and any direct or indirect indications of who the authors might be (for example, referring to self-citations in the first person) are forbidden. Reviewers have access only to the papers' content, and the authors in turn do not know who their assigned reviewers are. Despite these strict precautions, the notion of anonymity in the double-blind review has still been questioned. Notably, Hill and Provost (2003) showed that the authors of a scientific paper can be inferred with fairly high accuracy using only the papers it references.
Specifically, using vector-space representations of the list of references (or citations) in a paper and measuring the similarity between these representations and the citation pattern of an author, they are able to infer the authors of a paper with accuracy of up to 60% for the top 10% most prolific authors, and show that self-citations are an important predictive factor.
We are interested in further understanding how predictable the authorship of a paper is, and specifically what part of the paper gives it away. For this purpose, we include in our analysis additional text features to reflect various aspects of the text, as well as references, and make use of a more complex machine learning model, based on deep learning, for predicting authors based on these features. We focus on ACL and EMNLP, the top conferences in computational linguistics, which use the double-blind review system to decide whether to accept papers for publication.
Our contributions are as follows: we train deep learning models on papers published in the ACL and EMNLP conferences, using features extracted from each paper's body of text as well as its references, and show that these models are able to predict authors with accuracy of about 87% on ACL and about 78% on EMNLP. We additionally perform an ablation study, for an in-depth analysis of the predictive value of each feature. We finally also show how the number of authors considered for analysis can affect performance.
The rest of the paper is organized as follows: In the next section we present related work. Then in Section 3, we describe our datasets. Section 4 deals with the methodology of our experiments, including our baseline algorithm and the details of the model we propose. In Subsections 4.2 and 4.3 we also discuss the features we used in our model and the details of their extraction and preprocessing steps. Section 5 describes the setup of our deep learning experiments, including the metrics we use for measuring performance. Finally, in Subsection 5.3 we report and discuss our results, and in Section 6 we present our conclusions.

Related Work
There have been several studies approaching the question of the integrity of the double-blind review process. An early study on blind review published in a journal of psychology (Ceci and Peters, 1984) shows that authors of anonymous papers could be identified by surveyed reviewers using the combination of the paper's references and the referee's personal background knowledge.
Statistical studies on the difference between single-blind and double-blind peer review have more recently demonstrated that unveiling the identity of the authors to the reviewers leads to biased reviews, favoring more prestigious authors and institutions. Tomkins et al. (2017) performed a controlled experiment on scientific articles submitted to the 10th ACM Conference on Web Search and Data Mining, where for every article half of the reviewers had access to author information, while the other half did not. They found that single-blind reviewers are more likely to recommend famous authors for acceptance by a factor of 1.58. A few studies have previously proposed automatic approaches for author prediction for scientific articles. For example, Hill and Provost (2003) successfully predicted the authors of scientific articles published as part of the KDD Cup 2003 competition, using only information from the articles' references lists.
In a task related to ours, several studies have looked at authorship attribution on scientific articles, i.e., predicting the authors of scientific articles from an exclusively stylistic point of view. Althoff et al. (2013) studied authorship attribution on scientific articles specifically in the multi-author setting, using various text-based features (including word n-grams and various stylistic features) and models based on logistic regression and expectation maximization. Hitschler et al. (2017) performed experiments for predicting authors of ACL articles, restricting their data to only single-author articles. Their study focused on the style level, using only POS tag sequences, and showed that limiting the number of words considered as features can have a beneficial effect on the predictor's performance. Seroussi et al. (2012) proposed the use of an author-topic model (Rosen-Zvi et al., 2004) for the task of authorship attribution and showed promising results in a scenario with many authors. Rexha et al. (2015) analyzed the style of medical scientific articles and how the stylistic uniformity of an article varies with the number of co-authors.
Outside the world of scientific articles, a few previous studies showed the promise of using neural networks for authorship attribution. Bagnall (2015) successfully used a multi-headed Recurrent Neural Network for an author identification task at PAN 2015. The use of Convolutional Neural Networks (CNNs) for learning from text data was proposed by Kim (2014), where CNNs are successfully applied to several sentence classification tasks. Rhodes (2015) trained a Convolutional Neural Network on word embeddings for predicting authors of medium-sized texts, and Shrestha et al. (2017) used CNNs in an authorship attribution task on tweets. Luyckx and Daelemans (2008) studied the effects of having many authors as classes and of limited training data on author attribution -which are realistic, but difficult scenarios, common to our problem as well.
As far as we are aware, no other study has dealt with analyzing the authorship of articles published at ACL or EMNLP (or a comparably prestigious conference) without restricting the scenario to only a subtask (for example, focusing only on a subset of the data), or limiting the analysis to one aspect of the text (for example, focusing on the stylistic level). While previous studies support the hypothesis that authors of a scientific article are possible to predict from an anonymized paper, we attempt to provide a fuller picture regarding what exactly it is about an anonymous article that can give away its authors.

Datasets
For evaluation, we used two datasets of articles from the computational linguistics conferences ACL and EMNLP, published on or before 2014 (Bird et al., 2008; Anderson et al., 2012). The ACL dataset contains 4,412 articles authored by a total of 6,565 unique authors, whereas the EMNLP dataset is comprised of 1,027 articles written by 1,861 unique authors in total (code and data are available upon request). Note that the EMNLP dataset is much smaller than the ACL dataset since EMNLP is a much newer conference. From each dataset, we normalized the author names to consist of the initial of the first name and the full last name, and removed the authors with fewer than three articles (to ensure enough data for training and evaluation), leaving us with 922 authors for the ACL dataset and 262 authors for the EMNLP dataset (which represent our classes).
As illustrated in Figure 1, which plots the class distribution in each dataset (i.e., the number of articles per author in decreasing order), the distribution is very skewed, with the more prolific authors being responsible for many of the articles in each dataset and many authors contributing only a few articles. A similar, if not more pronounced, imbalance can be observed at the level of cited authors. For the purpose of our experiments, we also extracted and analyzed the references (or citation) lists of each article in our datasets, and looked at the distribution of citations across cited authors.

Author Name Normalization. It is worth mentioning that our method for normalizing author names can produce collisions, and hence ambiguities. However, we chose to normalize the names because the noise resulting from not doing so (i.e., having the same author encoded with multiple different spellings of their name, which is especially prevalent in references) might be even more detrimental to learning than the possible ambiguities.
In order to understand the level of collisions in our classes (i.e., the author names in the headers), we show in Figure 3 the number of author names that result in three collisions, two collisions, or no collisions at all after normalization, for both ACL and EMNLP. As can be seen from the figure, the number of author names with collisions in each dataset is small. Among the names with collisions, 13 occur within the top 100 most prolific authors in ACL, and only 3 occur in the top 100 for EMNLP. Note that a similar analysis for the author names in the references lists is difficult, since many names already appear in the normalized form (first-name initial plus last name). For this reason, normalizing author names is also necessary for computing our baseline, which matches the names of article authors with the names of cited authors. Author name normalization is in some cases useful, e.g., for authors with middle names, which are sometimes explicit and other times omitted (there are 12 cases in the ACL dataset of authors with middle names whose names occur differently in different articles), or for authors whose first names can have different spellings, such as Dan/Daniel Jurafsky.
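The normalization scheme described above can be sketched as follows. The exact output format (lowercased, space-separated) is our assumption for illustration; the paper only specifies "the initial of the first name and the full last name".

```python
import re

def normalize_author(name):
    """Collapse an author name to first-initial + full last name.

    Middle names and initials are dropped, which merges variants such as
    'Christopher D. Manning' and 'Christopher Manning' into one key, at
    the cost of occasional collisions (see the discussion above).
    """
    parts = [p for p in re.split(r"[\s.]+", name.strip()) if p]
    if len(parts) == 1:  # single token: keep as-is
        return parts[0].lower()
    first, last = parts[0], parts[-1]
    return f"{first[0].lower()} {last.lower()}"
```

Note how this also merges different first-name spellings that share an initial, e.g. Dan/Daniel Jurafsky.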

Baseline
For our baseline we chose to focus solely on the citations of each article. As Hill and Provost (2003) have shown before, citations alone can be a strong indicator of the authors of an article, with self-citations being especially telling. Specifically, our baseline algorithm simply ranks the authors cited in an article in decreasing order of how frequently they are cited in the article's references list, and outputs the top 10 most-cited authors in this ranking as the predicted authors.
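A minimal sketch of this baseline (function name is ours):

```python
from collections import Counter

def baseline_predict(cited_authors, k=10):
    """Rank the authors cited by an article by how often they appear in
    its references list and return the top-k as predicted authors.

    `cited_authors` is the flat list of (normalized) author names taken
    from every entry of the article's references section.
    """
    counts = Counter(cited_authors)
    return [author for author, _ in counts.most_common(k)]
```

Self-citations naturally rise to the top of this ranking, which is what makes the baseline surprisingly strong.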

Proposed Model
For the purpose of our machine learning experiments, we formulate the problem as a supervised classification task, where each article is labelled with one or multiple authors, and a machine learning model learns to predict the set of correct labels (authors) for each data point (article). The order of the authors is not taken into consideration.
As our model, we choose a neural network with several subcomponents corresponding to various types of features, as detailed below.

Features
In order to capture as many aspects of a scientific article as possible in our model, we extract and use various features, corresponding to the different levels at which characteristics of the author could manifest. We categorize these into three main types of features:

• Content level: word sequences (consisting of 100-word sequences from the article's title, abstract and body).

• Style level: stopword frequencies and part-of-speech (POS) tag sequences.

• Citation level: bag-of-words of cited authors.

Figure 4 shows a high-level view of the network architecture and its various components.
The network is designed to learn from each separate feature using dedicated subcomponents. At the content level, we use convolutional layers to learn from word sequences. CNNs have been shown to be successful in text classification tasks by Kim (2014). We use settings similar to the ones recommended in that study, passing the word sequences through a single convolutional layer with 300 filters and a kernel size of 9, followed by a max pooling layer. Before going through the convolutions, the word sequences are passed through a word embedding layer of 300 dimensions, initialized with the pre-trained word2vec embeddings available from Google, trained using the skip-gram objective (Mikolov et al., 2013). Using embeddings that are already pre-trained on a large dataset should benefit our task (our dataset being itself not very large), but since we use general-purpose pre-trained embeddings, we choose not to fix the embedding weights, but rather let the network update them further, tuning them to our task and domain.
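The mechanics of this subcomponent can be illustrated with a plain-NumPy sketch of a single 1-D convolution followed by max-over-time pooling, in the style of Kim (2014). The filter count (300) and kernel size (9) follow the paper; the random weights and the ReLU activation are placeholder assumptions, not the trained model.

```python
import numpy as np

def conv1d_maxpool(embedded, filters):
    """One 1-D convolution over an embedded word sequence, then
    max-over-time pooling.

    embedded : (seq_len, emb_dim) array of word vectors
    filters  : (n_filters, kernel_size, emb_dim) filter weights
    Returns a (n_filters,) feature vector for the segment.
    """
    seq_len, emb_dim = embedded.shape
    n_filters, kernel_size, _ = filters.shape
    n_windows = seq_len - kernel_size + 1
    feature_maps = np.empty((n_filters, n_windows))
    for i in range(n_windows):
        window = embedded[i:i + kernel_size]  # (kernel_size, emb_dim)
        feature_maps[:, i] = np.tensordot(filters, window,
                                          axes=([1, 2], [0, 1]))
    # ReLU, then max pooling keeps the strongest response per filter
    return np.maximum(feature_maps, 0).max(axis=1)

rng = np.random.default_rng(0)
segment = rng.normal(size=(100, 300))   # one 100-word segment, 300-dim embeddings
weights = rng.normal(size=(300, 9, 300))  # 300 filters, kernel size 9
features = conv1d_maxpool(segment, weights)
```

Each filter responds to a particular 9-word pattern anywhere in the segment; pooling discards the position, keeping only the strength of the match.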
A separate convolutional layer is dedicated to learning from part-of-speech sequences. As in the case of word sequences, we consider the order of parts of speech in a text segment to be relevant, assuming certain types of syntactic constructs can be specific to certain authors. Thus, after tagging a text segment with parts of speech, we encode the result as part-of-speech sequences and pass them through a convolutional layer of 50 filters and kernel size 4, followed by a max pooling layer. The POS tags are given as one-hot vectors. Stopwords are extracted from each article segment and encoded as bag-of-words, keeping their frequencies but not their order in the text. We used the stopword list available from the NLTK package. Stopword frequencies are traditionally used in stylometry, being among the most indicative features of an author's style (Koppel et al., 2009).
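The stopword feature reduces to a frequency count over a fixed word list. A minimal sketch (the paper uses the full NLTK stopword list; the small hardcoded set here is only for illustration):

```python
from collections import Counter

# A few entries from a typical English stopword list; the actual feature
# uses the complete list shipped with the NLTK package.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "we"}

def stopword_bow(tokens):
    """Bag-of-words over stopwords only: a {stopword: frequency} map,
    discarding word order, as used for the style-level features."""
    return Counter(t for t in tokens if t in STOPWORDS)
```

Because stopword usage is largely topic-independent, these counts carry stylistic rather than content information.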
To extract knowledge from citations, we focus on cited authors, encoding for each article the authors cited in its references section, along with the citation frequency for each author. The total number of cited authors in each of our datasets is much larger than the number of authors that contribute directly to one of the articles, e.g., over 22,000 unique authors are cited in our ACL dataset. This makes the one-hot encoding we use for cited authors very high-dimensional, so we pass the extracted feature through an additional lower-dimensional fully connected layer.
In the final layers of our network, we collect all of the output from each subcomponent dealing with the various features, and pass them through a dense component consisting of a fully connected layer and a Softmax layer that produces the network's predicted probabilities for each class.

Preprocessing and Feature Extraction
We extracted the text from the PDFs using Grobid. Several preprocessing steps were necessary before using the articles' text as features in our model.
For our text-related features we consider the title, abstract and body of the articles, and exclude references from the article's text by removing them both from the references section and from the citations within the article text (so as to isolate text features from citation features). After normalizing and tokenizing the resulting text according to usual practice in natural language processing applications (including lowercasing every word, discarding numbers and punctuation, and resolving end-of-line hyphenation), we construct a vocabulary consisting of the 50,000 most frequent words in all texts. Our choice of vocabulary size was informed by a previous study of authorship on ACL data (Hitschler et al., 2017), which showed that 50,000 words is an optimal vocabulary size for authorship tasks on this dataset. For EMNLP, which is a smaller dataset, we instead require a minimum word frequency of 5 occurrences, leaving us with a vocabulary of approximately 23,000 words. Considering only the words in each vocabulary (and replacing all other words with an "unknown" token), we encode the text as word sequences of 100 words, padding the sequences with zeros if they are shorter. Further, our training examples consist of these word segments, rather than full articles. Before extracting content features, we discard outliers, ignoring articles consisting of either zero or more than 20,000 words.
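The preprocessing pipeline above can be sketched as follows. This is a simplified version: it keeps alphabetic tokens only, and pads with a string token where the actual encoding pads integer word IDs with zeros.

```python
import re
from collections import Counter

def preprocess(text):
    """Resolve end-of-line hyphenation, lowercase, and drop numbers and
    punctuation, keeping alphabetic tokens only."""
    text = re.sub(r"-\s*\n\s*", "", text)   # rejoin hyphenated line breaks
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(token_lists, max_size=50000):
    """Keep the `max_size` most frequent words across all articles."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {w for w, _ in counts.most_common(max_size)}

def segments(tokens, vocab, seg_len=100, unk="<unk>", pad="<pad>"):
    """Split an article into fixed-length word segments, replacing
    out-of-vocabulary words with an unknown token and padding the last
    segment if it falls short."""
    mapped = [t if t in vocab else unk for t in tokens]
    out = []
    for i in range(0, len(mapped), seg_len):
        seg = mapped[i:i + seg_len]
        seg += [pad] * (seg_len - len(seg))
        out.append(seg)
    return out
```

Each resulting segment, not the full article, becomes one training example.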
In addition, we also extract the context around citation mentions within the content of articles, by selecting a window of 100 characters around the citation (and excluding the citation itself), then applying the same text preprocessing steps as above only on this window. This is used as a separate feature, as described in Section 5.2.
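A sketch of the context-window extraction. We read "a window of 100 characters around the citation" as 100 characters per side, which is an assumption; the citation offsets would in practice come from the citation extraction step.

```python
def citation_contexts(text, citation_spans, window=100):
    """Extract `window` characters on each side of every citation
    mention, excluding the citation string itself.

    `citation_spans` holds (start, end) character offsets of citation
    mentions within `text`.
    """
    contexts = []
    for start, end in citation_spans:
        left = text[max(0, start - window):start]
        right = text[end:end + window]
        contexts.append(left + " " + right)
    return contexts
```

The extracted windows then go through the same text preprocessing as the article body.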
The extraction of the part-of-speech features is done by applying the Stanford POS tagger (Toutanova et al., 2003) to the word sequences, resulting in part-of-speech sequences corresponding to each article segment. Stopwords are encoded as bag-of-words for each article segment.
Citations extracted from the "References" section of each article are encoded as bags-of-authors: unordered sets of citation frequencies corresponding to each cited author. Recall that author names (whether they occur as authors of the target article or as authors of a cited paper in a references list) are normalized to consist of the initial of the first name and the full last name (see Section 3).

Experimental Setup and Results
The nature of our dataset requires special attention to the setup of the training experiments, one of the main particularities of the data being the skewness of the label distribution. We split the dataset into three subsets: one for actual training, one for validation (used for tuning hyperparameters), and the third for testing performance. At this stage, we ensure that each of the three sets contains at least one article from each author in our labeled set. This also implies that we exclude any authors with fewer than three articles, obtaining 922 authors for the ACL dataset and 262 for EMNLP (which are the different classes in our supervised learning problem). Since our data points, as explained in the previous section, consist of article segments (word sequences of 100 words) rather than full articles, we also ensure that all segments extracted from a given scientific article are assigned to the same set, and not split between training, validation and test. We take this precaution to make sure that anything our network learns is not an artifact of a particular article, but rather of its author.
Lastly, to reduce the impact of the label skewness on our trained model, we use weighted sampling for generating the training examples, making sure the probability of generating a training example from any class is approximately the same across classes. For this to be possible, a final adjustment had to be made to our training examples. Our datasets, comprising scientific articles written either by a single author or in co-authorship between several authors, essentially consist of multi-label examples. For training, we transform the training examples from multi-label to single-label examples: whenever a text was written by more than one author, we generate several copies of the same data point, each labelled with only one of its authors. This allows us to perform weighted sampling, as well as to use a simple softmax layer as the final layer in the network, which generates one predicted label for any training example, and cross-entropy loss as our loss function. At the evaluation stage, we use the original multi-labeled examples, to be able to correctly measure the model's performance using our metrics. Tables 3 and 4 show the number of article segments and the number of articles we end up with after extracting the features and splitting the ACL and EMNLP datasets, respectively, into the training, validation, and test sets, in each of the experimental settings.
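The label expansion and class-balanced sampling can be sketched as follows (function names are ours; the sampling scheme is one straightforward reading of "weighted sampling" with approximately equal per-class probability):

```python
import random

def expand_single_label(examples):
    """Turn each multi-label example (segment, [authors]) into one copy
    per author, each carrying a single label."""
    return [(seg, a) for seg, authors in examples for a in authors]

def class_weighted_sample(single_label, n_draws, seed=0):
    """Draw training examples so each class (author) is picked with
    approximately equal probability, countering the skewed label
    distribution: first pick a class uniformly, then an example of it."""
    by_class = {}
    for seg, a in single_label:
        by_class.setdefault(a, []).append((seg, a))
    rng = random.Random(seed)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)])
            for _ in range(n_draws)]
```

At evaluation time the original multi-label examples are used instead, so the metrics see the true author sets.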

Metrics
Depending on whether we see the list of authors that contributed to an article as a sorted or unsorted list, a machine learning model that predicts this set of authors can be designed either as a multi-label classifier or as a ranking model. We use the former and disregard the order of the authors of an article, assuming it does not reliably reflect the size of each author's contribution to the article text; we thus represent the authors of an article as an unordered set of labels.
We do, however, consider the order of the predicted classes in the model's output. For evaluating our model, we use both performance metrics suited to multi-class classification (where we treat the model's predictions as unordered sets) and metrics suited to ranking problems, which are generally used in information retrieval (where we see the list of model predictions as a sorted list, ranked according to the Softmax probabilities in the model's output). These performance metrics are as follows:

• Accuracy@k: the fraction of articles for which at least one true author appears among the top k predicted authors. We use k = 10, as this number was shown to perform well in other search and retrieval tasks (Spink and Jansen, 2004).
• Mean Average Precision (MAP): MAP = 1/|A| Σ_{a∈A} (1/r) Σ_{k=1..r} P@k, where A is the set of articles and precision at rank k, P@k, is the number of correct authors within the first k predicted authors divided by the number of predictions (k). Here too we use a maximum rank of r = 10.
• Mean Average Recall (MAR): defined analogously using R@k, where recall at rank k, R@k, is the number of correct authors within the first k predicted authors divided by the total number of true authors (r = 10 as before).
• Mean Reciprocal Rank (MRR): MRR = 1/|A| Σ_{a∈A} 1/r_a, where r_a is the rank at which the first correct author was predicted for article a ∈ A.
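The metrics above can be implemented in a few lines. Note that averaging P@k and R@k over all ranks k = 1..r is our reading of the (partly garbled in extraction) definitions; each function takes parallel lists of ranked predictions and true author sets.

```python
def precision_at_k(predicted, true, k):
    return len(set(predicted[:k]) & set(true)) / k

def recall_at_k(predicted, true, k):
    return len(set(predicted[:k]) & set(true)) / len(true)

def accuracy_at_k(predictions, truths, k=10):
    """Fraction of articles with at least one true author in the top k."""
    hits = sum(1 for p, t in zip(predictions, truths) if set(p[:k]) & set(t))
    return hits / len(truths)

def mean_average_precision(predictions, truths, r=10):
    return sum(sum(precision_at_k(p, t, k) for k in range(1, r + 1)) / r
               for p, t in zip(predictions, truths)) / len(truths)

def mean_average_recall(predictions, truths, r=10):
    return sum(sum(recall_at_k(p, t, k) for k in range(1, r + 1)) / r
               for p, t in zip(predictions, truths)) / len(truths)

def mean_reciprocal_rank(predictions, truths):
    total = 0.0
    for p, t in zip(predictions, truths):
        for rank, author in enumerate(p, start=1):
            if author in t:
                total += 1.0 / rank
                break
    return total / len(predictions)
```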
Since our datapoints are article segments rather than full articles, the measured performance on the model's output will be with respect to these segments: for example, an accuracy of 10% denotes that 10% of the segments were correctly classified.
We additionally adapt these metrics to also output performance at the article level, which allows a proper comparison with our baseline's performance (which is measured on full articles). We accomplish this by grouping the article segments in our test set according to the article they were extracted from; for each article, we order all probabilities in the network's output for that article's segments, and consider the top predicted classes across this global ranking as the model's predictions for the target article.
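One concrete reading of this global ranking is to keep, for each class, its best probability across the article's segments and rank classes by that score (a max-pooling interpretation; other poolings such as summing would also fit the description):

```python
import numpy as np

def article_predictions(segment_probs, k=10):
    """Aggregate per-segment softmax outputs into one article-level
    ranking: keep each class's best score across the article's segments
    and return the indices of the top-k classes.

    segment_probs : (n_segments, n_classes) probabilities for one article
    """
    pooled = np.max(np.asarray(segment_probs), axis=0)
    return [int(i) for i in np.argsort(pooled)[::-1][:k]]
```

These article-level predictions are then fed to the same metrics as the baseline's output.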

Experiment Settings
In order to understand the contribution of each feature for predicting the authors of a paper, we perform an ablation study -isolating and using in turn only certain features and combinations of features and deliberately not using others.
In a first experiment, we compare our model's performance on both the entire set of authors in each dataset and on only a subset of the authors, e.g., selecting only the most prolific ones (top-100, top-200, and so on) and ignoring the rest. In this experiment, we only consider articles authored by these selected authors, both for training and test. We expect performance to be higher on this subset where rare authors are discarded, since the pronounced skewness of the data means that for many authors in the tail of the distribution there are very few data points to train on.
In a second experiment, we group our features according to the level of the text that they represent: content level (word sequences), style level (stopwords and POS sequences) or citation level (cited authors). We run our model using the features in each of the groups in turn, and ignoring the rest. The measured performance in each separate setting should be an indicator of the importance of the specific feature (or feature combination) and of the aspect of the article that it captures.
Finally, we also experiment separately with our text features, in order to understand which part of the scientific text is more specific to its author, and how much of the text is really useful in predicting the author. We look separately first at the entire body of the article, secondly at only the title and abstract, and thirdly at only the context of the citations that occur in the text, assuming these might be useful by giving away something about the author's citation pattern that cannot be inferred from the references list alone. Citation contexts have been shown to capture useful information in previous studies focusing on text summarization (Qazvinian et al., 2010; Abu-Jbara and Radev, 2011; Qazvinian and Radev, 2008), document indexing (Ritchie et al., 2008), keyphrase extraction (Gollapalli and Caragea, 2014), and author influence in digital libraries (Kataria et al., 2011).
These last experiments, which analyze feature importance, are performed in the small-scale setting, where only the top 100 authors are considered. We report all results both at segment level and at article level, with the exception of the experiment where the only features used are references, which are article-level features. Table 5 shows the classification results measured on our data points consisting of article segments on ACL, whereas Table 6 contains the values of the metrics aggregated at article level on the same ACL dataset. Tables 7 and 8 show similar results on the EMNLP dataset. Underlined scores are best within each group and bold scores are best overall. Figure 5 shows the accuracy of our best deep learning model on ACL and EMNLP as we increase the number of authors (classes), compared with the baseline model that considers only references and with a random model. As can be seen from the tables and the figure, references are still the features that by far contribute the most to predicting the author(s) of an article. The importance of the references list also contributes to the strong performance of the baseline, which is able to correctly predict authors for as many as 54.86% of the articles on ACL and 57.80% on EMNLP, for the top 100 classes. Interestingly, the baseline's performance is comparable to the results reported by Hill and Provost (2003), who use a similar method, even on a different dataset. Feeding the extracted references into the deep network further boosts the predictive power of the references feature, which on its own reaches an accuracy of 86.67% on ACL and 78.49% on EMNLP in an experiment looking at the top 100 most prolific authors in each dataset. The more general text features (content level) are the second-best predictor, whereas the style-level features come last. Even if much less predictive than references, these text-based features are still far better than chance at predicting the true authors.
For example, on ACL, in the setting with 100 possible classes, the expected accuracy of a random predictor (according to our definition of accuracy), would be around 10%, whereas when using all 922 classes, the chance accuracy is 1%.

Results
With regard to the parts of the article text that seem to be most predictive, reference contexts seem to play a more important role than the title and abstract, even though using the full article content still gives the best results on both datasets.
Moreover, as we go from the top 100 most prolific authors to the last (rare) authors, performance keeps decreasing. For example, the accuracy on ACL at article level decreases from 87.88% (achieved for the top 100 most prolific authors) to 50% (achieved for the last 200 rare authors).

Figure 5: Accuracy with number of authors considered.

Error Analysis
In order to achieve a better understanding of the model's weaknesses, and more generally of the difficulties of predicting authors of scientific papers, we examine the set of misclassified articles in the ACL test set, and compute the misclassification rate for an author as the number of their articles that the model did not assign to them, divided by the total number of articles they authored. An interesting finding is that the correlation between the rank of an author (in decreasing order of their number of written articles) and the misclassification rate is 0.35, showing that more prolific authors tend to be more accurately classified. One of the most misclassified authors among the top 5 most prolific is Christopher Manning (40% of his articles are misclassified, among them Accurate Unlexicalized Parsing and Deep Learning for NLP (without Magic)). For the first paper, some of the predicted authors are Eugene Charniak, Mark Johnson, Lenhart K. Schubert, Dan Klein, and Daniel Jurafsky, with Dan Klein being indeed one of the authors. Another 45 articles not authored by Christopher Manning were predicted as being written by him, possibly due to a large number of citations of his work in the articles' references lists and/or keywords similar to those of his articles.

Conclusions and Discussion
We showed that the most prolific authors of anonymized scientific articles can be predicted with high accuracy, with characteristics of the authors being apparent at all levels of the text, from content to style. Still, the most direct indicator of who the author of a paper might be comes from the papers that are referenced, both from the references themselves and from the citation contexts they occur in within the content of an article. Our work contributes to the debate about the double-blind reviewing process and aims to contribute to the rapidly emerging field of Fairness in AI. Although we found that the most prolific authors can be inferred with accuracy as high as 87.88% on ACL and 78.49% on EMNLP, the authors with fewer papers are increasingly difficult to infer, which reinforces the benefits of the double-blind review in offering any author a fair chance of having their papers accepted in top venues.
The finding that authors of anonymized papers can be predicted with such high accuracy bears important consequences for the way scientific articles are reviewed and published. De-anonymizing articles means compromising the integrity of the review and selection process. The insights into how the authors of an article can be inferred are not only interesting, but could help guide a reconsidered approach to the way we write papers for submission to various venues. Still, the findings of this paper should not be taken to imply that the portions of an article which help a neural network identify authorship are the same as those which help a human reviewer identify authorship, and they are not necessarily expected to inform how humans perform peer review.
In future experiments, more attention to the contribution of each author of the article might lead to further improvements in the prediction performance. In this article, we construct our training datasets as if all parts of an article were written by all authors, which is not accurate, and could even put an upper bound on the network's performance, by providing it with contradictory information during training. Techniques for segmenting the text according to their probable authorship could help improve the method.