Non-Complementarity of Information in Word-Embedding and Brain Representations in Distinguishing between Concrete and Abstract Words

Word concreteness and imageability have proven crucial in understanding how humans process and represent language in the brain. While word-embeddings do not explicitly incorporate the concreteness of words into their computations, they have been shown to accurately predict human judgments of concreteness and imageability. Inspired by the recent interest in using neural activity patterns to analyze distributed meaning representations, we first show that brain responses acquired while human subjects passively comprehend natural stories can significantly distinguish the concreteness levels of the words encountered. We then examine for the same task whether the additional perceptual information in the brain representations can complement the contextual information in the word-embeddings. However, the results of our predictive models and residual analyses indicate the contrary. We find that the relevant information in the brain representations is a subset of the relevant information in the contextualized word-embeddings, providing new insight into the existing state of natural language processing models.


Introduction
Language comprises concrete and abstract words that are distinctively used in everyday conversations. Concrete words refer to entities that can be easily perceived with the senses (e.g., "house", "blink", "red"). Abstract words, on the other hand, refer to concepts that one cannot directly perceive with the senses (e.g., "luck", "justify", "risky") and must instead rely on language to understand.
This categorization of words based on their concreteness is rooted in theoretical accounts in cognitive science. One such account is the Dual Coding Theory (Paivio, 1971, 1991), according to which two separate but interconnected cognitive systems represent word meanings, i.e., a non-verbal system that encodes perceptual properties of words and a verbal system that encodes linguistic properties of words. Concrete concepts can be easily imagined and are represented in the brain with both verbal and non-verbal codes. Abstract concepts are less imaginable and are represented with only verbal codes. For example, one can readily picture as well as describe the word bicycle (e.g., "has a chain", "has wheels"), but relies more on a verbal description for the word bravery.
The concreteness of words has since been used as a differentiating property of word meaning representations. Previous studies in natural language processing (NLP) have examined the word-embedding spaces of concrete and abstract words and showed: (i) distinct vector representations of the two categories within and across languages (Ljubešić et al., 2018), and (ii) high predictability of concreteness scores from pre-trained word-embeddings (Charbonnier and Wartena, 2019).
Neurolinguistic studies have shown an extensive, distributed network of brain regions representing the conceptual meaning of words (Mitchell et al., 2008;Wehbe et al., 2014;Huth et al., 2016). Among these, regions more closely involved in sensory processing have been shown to respond favorably to concrete words (Binder et al., 2005) over abstract words.  argued that concrete and abstract concepts must be represented differently in the human brain by showing through a statistical analysis that concrete concepts have fewer but stronger associations in the mind with other concepts, while abstract concepts have weak associations with several other concepts. Wang et al. (2013) showed that functional Magnetic Resonance Imaging (fMRI) signals of brain activity recorded as subjects attempted to decide which two out of a triplet of words were most similar contained sufficient information to classify the concreteness level of the word triplet, providing further evidence of the dissimilar representations of the two categories in the brain. However, it remains an open question whether the brain responses within the semantic system can directly predict concreteness levels in the more challenging setting of naturalistic word stimuli (e.g., words encountered while reading a story). Moreover, given the human brain's expertise in generating and processing perceptual as well as linguistic information, one could expect the brain representations to provide information that complements the word-embeddings purely learned from linguistic contexts, improving their predictive capability. We address both these questions in this paper.
While several related works exist, the following limitations prompted a new study: (i) Anderson et al. (2017) indirectly decoded the brain representations for concrete and abstract nouns with the help of word-embeddings and convolutional neural network image representations. Instead of building a predictive model, the authors used a similarity metric to determine which signal in a pair of fMRI signals corresponds to which word in a pair of words. However, a direct, supervised decoding approach (as adopted here) would provide more substantial evidence about the strengths and weaknesses of the different information modalities. (ii)  found word concreteness scores to be highly correlated with both visual and haptic perceptual strength. However, multi-modal methods (Anderson et al., 2017;Bhaskar et al., 2017) have incorporated only visual features (as the second source of information) instead of general perceptual features into their predictions. By incorporating brain representations in our models, we do not miss out on such perceptual information (e.g., the adjectives "silky", "crispy", and "salty" are concrete but not as imagery-inducing as the adjective "blue"). (iii) In contrast to previous studies that have required participants to actively imagine a randomly presented word stimulus 1 (before being given a few seconds to "reset" their thoughts) during the brain data acquisition task (Anderson et al., 2012;Wang et al., 2013;Anderson et al., 2017), we adopt a task where participants would read highly engaging natural stories (without unnatural pauses), enabling them to process the word stimuli in a more realistic context.
To summarize, our objectives with this paper are twofold. First, we investigate how well human brain representations can predict the concreteness levels of words encountered in natural stories using simple, supervised learning algorithms. Second, we investigate whether brain representations encode information that may be missing from word-embeddings trained on a text corpus in making the concrete/abstract distinction. We believe that answering such questions will shed light on the current state of human and machine intelligence and on the ways to incorporate human language processing information into NLP models.
1 e.g., one word would be presented every 10s.

Related Work
A few studies have shown that the concreteness (and imageability) of words can be directly predicted with high accuracy from precomputed word-embeddings using supervised learning algorithms. Recently, Charbonnier and Wartena (2019) used a combination of word-embeddings and morphological features to predict the word concreteness and imageability values provided in seven publicly available datasets. Ljubešić et al. (2018) extended the idea to perform a cross-lingual transfer of concreteness and imageability scores by exploiting pre-trained bilingual aligned word-embeddings (Conneau et al., 2017).
Multi-modal models that use both linguistic and perceptual information have been shown to outperform language models at various NLP tasks, such as learning concrete or abstract word embeddings (Lazaridou et al., 2015), concept categorization (Silberer and Lapata, 2014), and compositionality prediction (Roller and Schulte im Walde, 2013). However, Bhaskar et al. (2017) found that the concreteness of nouns could be predicted equally well from the textual, visual, and combined modalities. This suggests that the textual and visual modalities independently provided reliable, non-complementary information to represent both concrete and abstract nouns.
Several studies have addressed the idea of decoding neural activity patterns recorded in subjects when presented with certain textual or visual stimuli. Anderson et al. (2017) applied linguistic and visually-grounded computational models to decode the fMRI representations of a set of concrete and abstract nouns. They, too, reported no decoding advantage for multi-modal combinations over the linguistic model. Anderson et al. (2012) demonstrated that fMRI signals contained sufficient information to perform a 7-way classification of a set of words into WordNet-based (Miller, 1995) taxonomic categories.
Lately, there has been increasing research interest at the intersection of neuroimaging and language models (Jain and Huth, 2018; Abnar et al., 2019; Gauthier and Levy, 2019; Hollenstein et al., 2019; Jain et al., 2020; Caucheteux and King, 2020; Schrimpf et al., 2020). In an interesting study, Schwartz et al. (2019) fine-tuned the BERT language model to predict the fMRI responses of text-reading participants to obtain representations that encode brain-activity-relevant semantic information. While the modified representations could better predict neural activity and even generalize to new participants, this inclusion of brain-relevant bias neither improved nor degraded the model's performance on downstream NLP tasks.

Stimulus and fMRI data
We briefly describe the functional Magnetic Resonance Imaging (fMRI) data-collection procedure here and refer the reader to Deniz et al. (2019) for specific details.
Nine participants were asked to read 11 autobiographical narrative stories taken from The Moth Radio Hour podcast. We used six participants' data in our experiments. The stories are each 10-15 minutes long and were chosen to cover a wide range of topics. Each story was first aligned to its transcript by applying the UPenn Forced Aligner (Yuan and Liberman, 2008) and Praat (Boersma and Weenink, 2001) on the narration audio. Timestamps for word occurrences were then obtained from Praat's TextGrid as a list of entries of the form $(w_i, t_i)$ representing the $i$th word and its onset time, respectively. Using this word-representation list for each story, each word in the story was displayed one-by-one at the center of a screen for a duration equal to its duration in the spoken version.
Each fMRI scan consists of a sequence of voxel-responses 2 acquired at a fixed repetition time (TR = 2.0045s) with a voxel size of 2.24 × 2.24 × 4.1mm. A separate scan was conducted for each subject and presented story (all analyses were done within subjects). The acquired volumetric fMRI responses for each subject were first preprocessed to correct for motion and then aligned to the first scan's temporal average, using the FMRIB Linear Image Registration Tool (FLIRT) from FSL v5.0 (Jenkinson et al., 2002; Jenkinson and Smith, 2001). A Savitzky-Golay filter (Schafer, 2011) with a 120s window was applied to remove low-frequency voxel-response drift from the signal. Finally, the voxel-responses for each story were z-scored separately so that they have zero mean and unit variance across all acquisitions for the story.
2 voxel = volumetric pixel.
We note that an equivalent analysis could be carried out through a listening task since the elicited brain representations have been shown to be largely invariant to the stimulus modality (Deniz et al., 2019).

Concreteness Ratings
We used the dataset collected by , consisting of concreteness ratings for 39,954 English words. Each word was rated by around 25 participants (recruited through Amazon Mechanical Turk) on a 1-5 scale so that the most concrete words are assigned the highest score of 5, and the most abstract words are assigned the lowest score of 1. For each word, the average rating (and standard deviation) across all raters was recorded.

Word-Embeddings
We extracted the 768-dimensional activations from the final hidden layer of the Generative Pre-trained Transformer (GPT-2) (Radford et al., 2019) to obtain contextualized representations for the words in the stories. We selected GPT-2 for two reasons, both based on the findings of Schrimpf et al. (2020). First, GPT-2 is constrained to use unidirectional attention, in the same way that humans process text in a left-to-right fashion. Second, the authors find that the models best matching human language processing are precisely those trained with a next-word prediction objective (such as the GPT family).

Data Preparation
Rating and Vectorizing. Using the word-representation list for each story and a list of the fMRI acquisition times (identical for all subjects), we partitioned the words into disjoint chunks so that all words in a chunk correspond to the same acquisition. That is, all words read by the subjects within one TR of the start of an acquisition pulse were included in the same chunk.
We used GPT-2 to vectorize each word in a story by supplying all words in the story leading up to it 3 as context and extracting the network's hidden layer representation corresponding to the last input position. To rate the words in the story, we first lowercased and lemmatized them and then used the concreteness dataset described above to assign a rating to each word in a chunk. Around 7% of all words in the stories were not covered by the dataset and were dropped before subsequent analysis.
We stored the $i$th preprocessed functional image of each subject as an $N_b$-dimensional voxel-response vector $\mathbf{b}_i$, where $N_b$ denotes the number of voxels in that subject's brain. Typical values for $N_b$ lay in the 70k-90k range (with a mean of 80976 and a standard deviation of 6173 across subjects).
Downsampling. Since the rate at which the text stimulus was presented to the subjects (the narration rate) is higher than the rate of fMRI data acquisition (one acquisition every 2.0045s), several words may occur within the TR corresponding to a single acquisition, and all of them fall in the same chunk. We therefore downsampled the stimulus to match the acquisition rate before further analysis by averaging the concreteness ratings ($r_w$) and word-embeddings ($\mathbf{e}_w$) within each TR. Thus, the chunk-rating and chunk-embedding for chunk $C_i$ are given by:
$$r_i = \frac{1}{|C_i|} \sum_{w \in C_i} r_w, \qquad \mathbf{e}_i = \frac{1}{|C_i|} \sum_{w \in C_i} \mathbf{e}_w$$
Stacking. We temporally stacked the voxel-response vectors, chunk-embeddings, and chunk-ratings, first within each story and then across all 11 stories, to obtain (i) a per-subject voxel-response matrix $B \in \mathbb{R}^{T \times N_b}$, (ii) an embedding matrix $E \in \mathbb{R}^{T \times D}$, and (iii) a rating vector $\mathbf{r} \in \mathbb{R}^{T}$, where $T$ denotes the total number of fMRI acquisitions across all stories per subject, and $D$ denotes the dimensionality of the word-embedding space. $D = 768$ for GPT-2, and 11 stories with an average duration close to 12.5 min per story give $T = 4028$.
3 or as many as allowed by the model's capacity.
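The chunking and downsampling steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `chunk_and_average`, its argument layout, and the choice to fill empty chunks with NaN are assumptions made here for clarity.

```python
import numpy as np

TR = 2.0045  # repetition time in seconds, as reported in the text


def chunk_and_average(onsets, ratings, embeddings, n_acquisitions, tr=TR):
    """Average word ratings and embeddings within each TR-long chunk.

    onsets: word onset times in seconds, relative to the first acquisition.
    ratings: per-word concreteness ratings, shape (n_words,).
    embeddings: per-word vectors, shape (n_words, D).
    Chunks that contain no word are filled with NaN (an assumption).
    """
    onsets = np.asarray(onsets, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # A word belongs to chunk i if its onset falls in [i * TR, (i + 1) * TR).
    chunk_ids = np.floor(onsets / tr).astype(int)

    chunk_ratings = np.full(n_acquisitions, np.nan)
    chunk_embeddings = np.full((n_acquisitions, embeddings.shape[1]), np.nan)
    counts = np.zeros(n_acquisitions, dtype=int)
    for i in range(n_acquisitions):
        members = chunk_ids == i
        counts[i] = members.sum()
        if counts[i] > 0:
            chunk_ratings[i] = ratings[members].mean()
            chunk_embeddings[i] = embeddings[members].mean(axis=0)
    return chunk_ratings, chunk_embeddings, counts
```

Stacking the per-story outputs with `np.concatenate` along the first axis would then give arrays playing the roles of the rating vector and embedding matrix.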

Word-Embedding based model
We consider the task of classifying words as concrete or abstract (based on their concreteness ratings) using the word-embeddings (chunk-embeddings, $\mathbf{e}_i$) as explanatory variables. For this, we first defined a concreteness threshold $\tau$ as follows: a word is labeled concrete if its assigned rating is strictly greater than $\tau$, and abstract otherwise. We take $\tau = 3$.
We then segregated the data into well-defined classes by discarding any chunks that consist of a mixture of concrete and abstract words (as defined above). This retains roughly 42% of all chunks ($T^s < T$), resulting in the following strict counterparts to the embedding matrix and rating vector obtained in Section 4: (i) $E^s \in \mathbb{R}^{T^s \times D}$, and (ii) $\mathbf{r}^s \in \mathbb{R}^{T^s}$, with the superscript $s$ denoting that only chunks satisfying the strictly concrete/abstract property are considered. We binary-encoded $\mathbf{r}^s$ into the boolean vector $\mathbf{y}^s \in \{0, 1\}^{T^s}$, so that $y^s_i = 1$ if the corresponding chunk is strictly concrete and $y^s_i = 0$ otherwise. Our choice of concreteness threshold ($\tau = 3$) produces a dataset that is approximately balanced between the two classes and is a natural choice for a 1-5 scale.
We learned the $E^s \to \mathbf{y}^s$ mapping for each subject through L2-regularized logistic regression. We trained on 75% of the available data and picked the best value for the regularization parameter $C$ through 5-fold cross-validation. We report the accuracy, recall, and F1 score of the classifier in our results.
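The thresholding and strict-chunk selection can be illustrated with a short sketch. The helper `strict_chunk_labels` and its list-of-lists input are hypothetical; only the threshold of 3 and the labeling rule come from the text.

```python
import numpy as np

TAU = 3.0  # concreteness threshold used in the text


def strict_chunk_labels(chunk_word_ratings, tau=TAU):
    """Keep only chunks whose words are all concrete or all abstract.

    chunk_word_ratings: list of lists; entry i holds the ratings of the
    words in chunk i. A word is concrete if its rating is > tau.
    Returns (kept_indices, labels), where label 1 = strictly concrete and
    label 0 = strictly abstract. Mixed and empty chunks are dropped.
    """
    kept, labels = [], []
    for i, word_ratings in enumerate(chunk_word_ratings):
        if len(word_ratings) == 0:
            continue
        is_concrete = [r > tau for r in word_ratings]
        if all(is_concrete):
            kept.append(i)
            labels.append(1)
        elif not any(is_concrete):
            kept.append(i)
            labels.append(0)
    return np.array(kept, dtype=int), np.array(labels, dtype=int)
```

Indexing the embedding matrix with `kept_indices` would give the strict matrix, which could then be passed together with `labels` to an L2-regularized logistic regression (for instance scikit-learn's `LogisticRegressionCV`, named here only as one possible choice).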
An important variable in cognitive processing is the frequency with which words are encountered in language. High-frequency words are often perceived and processed faster than low-frequency words (van Heuven et al., 2014). Thus, word frequency could be a confounding variable to our objective if its distribution over the concrete words significantly differs from its distribution over the abstract words encountered in the stories. To check if this is the case, we computed the distribution of SUBTLEX-US (Brysbaert and New, 2009) word frequencies separately over all concrete vs. abstract words encountered by the subjects. However, a Kolmogorov-Smirnov test showed that the computed distribution over the concrete words was not significantly different from the distribution over the abstract words (ks = 0.056, p = 0.063).
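For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the two empirical cumulative distribution functions. The sketch below computes only the statistic, on whatever samples are supplied; the function name is an assumption, and a library routine such as `scipy.stats.ks_2samp` would additionally return the p-value.

```python
import numpy as np


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The gap can only change at observed values, so evaluate both CDFs there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))
```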

Voxel-Response based model
Voxel Selection. With up to 90,000 voxel-responses recorded per fMRI acquisition, not all voxels may be relevant to our objective of predicting the concreteness of word stimuli (Binder et al., 2005).
A standard voxel selection method is to manually determine regions of interest (ROIs) in the brain by analyzing the fMRI responses recorded in an auxiliary functional localizer task (Fedorenko et al., 2010) and select voxels from only these regions. However, this comes at the risk of being too restrictive. For example, one might inadvertently exclude regions in the brain encoding relevant sensory processing information in favor of regions encoding linguistic information. Given our objective to investigate whether brain representations contain any such additional information over word-embeddings, we avoided ROI-based methods for voxel selection.
We instead selected voxels based on their fractions of potentially-explainable response variance across time steps. This fraction may be estimated separately for each voxel by recording different versions of its (time-varying) response corresponding to repeated presentations (Hsu et al., 2004) of the same stimulus sequence. Assume that one story is presented $N$ times to a given subject and $b$ denotes the voxel being analyzed. If $b^{(n)}_t$ denotes its response at time step $t$ in the $n$th repetition, then its mean response across repetitions is
$$\bar{b}_t = \frac{1}{N} \sum_{n=1}^{N} b^{(n)}_t$$
The fraction of potentially-explainable variance for $b$, denoted $ev(b)$, is then estimated from these repetitions under the assumption that the voxel-responses are z-scored across all time steps for the story. $ev(b)$ is analogous to the adjusted $R^2$ of a (perfect) model that always predicts the mean response ($\bar{b}_t$) across repetitions. A larger value indicates that the voxel responds consistently to repetitions of the same stimulus. Each subject was presented the last story $N = 2$ times, and the top-$V$ voxels with the highest $ev$ values were retained.
From this, we obtain the desired reduced form $\hat{B} \in \mathbb{R}^{T \times V}$. The optimal number of semantic voxels $V$ was chosen separately for each subject to maximize performance on the validation set (described next).
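One common way to estimate explainable variance from repeated presentations is sketched below. It is an assumption that this matches the estimator used here: the formula (one minus the residual variance around the across-repetition mean, with an adjusted-$R^2$-style correction) is a formulation found in voxelwise-modeling practice, and the function names are hypothetical.

```python
import numpy as np


def explainable_variance(responses, bias_correction=True):
    """Estimate per-voxel explainable variance from repeated presentations.

    responses: array of shape (N repeats, T time steps, V voxels), assumed
    z-scored over time within each repeat so that total variance is 1.
    """
    responses = np.asarray(responses, dtype=float)
    n_repeats = responses.shape[0]
    mean_response = responses.mean(axis=0)   # mean across repetitions, (T, V)
    residual = responses - mean_response     # deviation of each repeat
    # Variance over time of each repeat's residual, averaged over repeats.
    residual_var = residual.var(axis=1).mean(axis=0)
    ev = 1.0 - residual_var
    if bias_correction and n_repeats > 1:
        # Small-sample correction, analogous to an adjusted R^2.
        ev = ev - (1.0 - ev) / (n_repeats - 1)
    return ev


def top_voxels(ev, v):
    """Indices of the v voxels with the highest explainable variance."""
    return np.argsort(ev)[::-1][:v]
```

With only two repetitions the correction is large, so a voxel whose two responses are unrelated receives a value near or below zero, while a perfectly repeatable voxel receives 1.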
Prediction Task. Blood-oxygen-level-dependent (BOLD) signals in the brain typically persist for 8-10s after stimulus onset (Ashby, 2019). Since each chunk covers nearly 2s of stimulus presentation, we expect the response to each chunk to be jointly encoded by the first, second, third, and fourth (reduced) voxel-response vectors that follow the current acquisition. However, including the first or fourth acquisition significantly degraded predictive performance. We posit that this degradation occurs because the voxel-response vectors recorded one or four TRs after the current acquisition are more prone to be directly affected by words falling in chunks preceding or succeeding the chunk of interest.
With this observation, we modeled the brain's representation of the stimulus in chunk $C_i$ to be of the form $f(\hat{\mathbf{b}}_{i+2}, \hat{\mathbf{b}}_{i+3})$, where $\hat{\mathbf{b}}_i$ represents the reduced voxel-response vector from the $i$th acquisition. We therefore constructed the reduced+delayed voxel-response matrix $\hat{B}^{+} \in \mathbb{R}^{T \times 2V}$ by replacing each row of $\hat{B}$ with the concatenation of the second and third rows that succeed it. For classification, we first discarded chunks that are not strictly concrete/abstract and obtained $\hat{B}^{+s} \in \mathbb{R}^{T^s \times 2V}$. We then used regularized logistic regression to learn the per-subject $\hat{B}^{+s} \to \mathbf{y}^s$ mapping. The training procedure is identical to the one followed in Section 5.1.
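The construction of the delayed matrix can be sketched as below. The function name is hypothetical, and filling the final rows (which have no second or third successor) with NaN is an assumption; the text does not state how those rows are handled.

```python
import numpy as np


def delayed_responses(b_hat, delays=(2, 3)):
    """For each row i, concatenate the rows i + d of b_hat for d in delays.

    b_hat: reduced voxel-response matrix of shape (T, V).
    Returns an array of shape (T, V * len(delays)). Entries whose delayed
    row would fall past the end of the scan are left as NaN (an assumption).
    """
    b_hat = np.asarray(b_hat, dtype=float)
    t, v = b_hat.shape
    out = np.full((t, v * len(delays)), np.nan)
    for k, d in enumerate(delays):
        if d < t:
            # Row i of this block receives row i + d of the original matrix.
            out[: t - d, k * v:(k + 1) * v] = b_hat[d:]
    return out
```

In practice the delays would be applied within each story before stacking, so that responses from one story are never paired with chunks from another.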

Statistical Significance
We determined the statistical significance of our classification results using a label-permutation method (Ojala and Garriga, 2009) with cross-validated accuracy as the chosen test statistic. Here, the distribution of a test statistic under the null hypothesis (that data and labels are independent) is estimated by training and evaluating the classifier on several randomized versions of the original data (by permuting classification labels). The p-value is then calculated as the proportion of randomized samples where the classifier performs better than it does on the original sample. We ran 100 iterations per subject.
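A label-permutation test of this kind can be sketched as follows. The scorer `holdout_centroid_accuracy` is a deliberately simple stand-in (a nearest-class-mean classifier scored on a held-out 25%), not the cross-validated logistic regression used in the paper, and counting permutations that score at least as well as the original (rather than strictly better) is a conservative assumption.

```python
import numpy as np


def holdout_centroid_accuracy(X, y, train_fraction=0.75):
    """Stand-in scorer: nearest-class-mean accuracy on the last 25% of rows."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n_train = int(len(y) * train_fraction)
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dists.argmin(axis=1) == y_test).mean())


def label_permutation_p_value(X, y, score_fn, n_permutations=100, seed=0):
    """Fraction of label-permuted datasets scoring at least as well as the original."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = score_fn(X, y)
    null_scores = np.array(
        [score_fn(X, rng.permutation(y)) for _ in range(n_permutations)]
    )
    p_value = float(np.mean(null_scores >= observed))
    return observed, p_value
```

Any scorer with the same `(X, y) -> accuracy` signature could be substituted for the stand-in.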
Comparing the Representations

Combined model
First, we combined the word-embedding and voxel-response stimulus representations (obtained in Section 4 and Section 5.2) for each subject by stacking the word-embedding matrix ($E$) and the reduced+delayed voxel-response matrix ($\hat{B}^{+}$) along the feature dimension to obtain the combined stimulus matrix $C \in \mathbb{R}^{T \times (D+2V)}$. Limiting the data to strict chunks yields the matrix $C^s \in \mathbb{R}^{T^s \times (D+2V)}$, which was then used for the classification task.
The rationale behind combining representations is the following. If the information encoded by the word-embedding and voxel-response representations were indeed complementary, the combined model should fare better at the prediction task than the two individual models because it now has access to information that was missing in either representation.
The classification task (predicting y s ) and its training procedure are identical to those described in Section 5.1.

Residual Classification
Next, we attempted to remove the information present in each representation from the other and then train the classification model using the resulting representation. This procedure is described below.
1. Removing voxel-response information from word-embeddings: For each subject, we learned a linear mapping $L \in \mathbb{R}^{2V \times D}$ from $\hat{B}^{+s}$ to $E^s$ through multivariate ridge regression (Haitovsky, 1987). We then computed the residuals $E^s_r \in \mathbb{R}^{T^s \times D}$ in a cross-validated manner as follows, and used the residuals for the classification task:
$$E^s_r = E^s - \hat{B}^{+s} \cdot L$$
2. Removing word-embedding information from voxel-responses: For each subject, we learned the linear mapping $L \in \mathbb{R}^{D \times 2V}$ from $E^s$ to $\hat{B}^{+s}$ through multivariate ridge regression. We then computed the residuals $\hat{B}^{+s}_r \in \mathbb{R}^{T^s \times 2V}$ in a cross-validated manner as follows, and used the residuals for the classification task:
$$\hat{B}^{+s}_r = \hat{B}^{+s} - E^s \cdot L$$
Statistical Significance. To statistically validate that any observed decrease in a residual model's performance compared to the corresponding non-residual model is really due to shared information between the representations (and not due to overfitting/chance), we adopted a "residual-permutation" procedure similar to that in Section 5.2.
Here, an empirical null distribution is created by training and evaluating each residual model above with several randomized versions of whichever representation is to be regressed out. The randomization is performed by permuting this representation over all time steps. The p-value is then calculated as the fraction of such residual models with cross-validated accuracies lower than that of the true (non-randomized) residual model. We ran 100 iterations per subject.
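The residual computation can be sketched with a closed-form ridge solution. The fold layout (five contiguous folds), the fixed regularization strength `alpha`, and the absence of an intercept are assumptions made for brevity; the text specifies only that the residuals were computed in a cross-validated manner with multivariate ridge regression.

```python
import numpy as np


def ridge_fit(X, Y, alpha):
    """Closed-form multivariate ridge: L = (X^T X + alpha I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def cross_validated_residuals(source, target, alpha=1.0, n_folds=5):
    """Residuals target - source @ L, with L fit on the other folds each time.

    Regressing `source` out of `target` removes the part of `target` that is
    linearly predictable from `source` on held-out rows.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    residuals = np.empty_like(target)
    all_rows = np.arange(len(target))
    for held_out in np.array_split(all_rows, n_folds):
        train = np.setdiff1d(all_rows, held_out)
        L = ridge_fit(source[train], target[train], alpha)
        residuals[held_out] = target[held_out] - source[held_out] @ L
    return residuals
```

Calling the function with the voxel-response matrix as `source` and the embedding matrix as `target` corresponds to the first item above; swapping the two arguments corresponds to the second. For the residual-permutation null, `source` would be row-permuted before each call.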

Results
We use the abbreviations E for the word-embedding based model, B for the voxel-response based (brain) model, E+B for the combined-representation model, E-B for the word-embedding model with voxel-response information removed, and B-E for the voxel-response model with word-embedding information removed. Figure 1 shows the classification accuracies of all models across the six subjects. Table 1 shows the average accuracy, recall, and F1 score of E and B.

Individual models
B achieved an average classification accuracy of 69% and an F1 score of 71%, and performed significantly above chance under the label-permutation test ($p \le 9 \times 10^{-3}$) for each subject. This indicates that the fMRI signals elicited by words encountered by subjects in natural stories encode enough information to significantly distinguish their concreteness levels under the current predictive framework. Evidently, this information must be useful above and beyond the noise present in the fMRI data unique to the data acquisition process. To our knowledge, the ability to classify the concreteness of naturalistic word stimuli from their induced brain representations in a direct, supervised fashion has not been shown in the existing literature.
E achieved a comparatively higher classification accuracy of 87%, which is in agreement with existing research (in non-naturalistic settings) on the predictability of word concreteness and imageability using word-embeddings as explanatory variables (Charbonnier and Wartena, 2019; Ljubešić et al., 2018).

Table 1: Classification metrics across the six participants for the word-embedding based (E), voxel-response based (B) and combined (E+B) models.

Comparative models
Table 1 shows the average accuracy, recall, and F1 score of E, B, and E+B. As argued in Section 1, we expect the additional sensory processing information encoded in the voxel-responses to complement the linguistic/contextual information encoded in the word-embeddings. Consequently, the combined model should fare better at distinguishing the concreteness of words in the stories.
However, our results indicate otherwise. The performance of E+B (86 ± 1.9%) was not significantly different from E (87%) under a 1-sample t-test ($t = -2.33$, $p = 0.07$, $df = 5$, 2-tail), meaning the combined model is only as good as the word-embedding based model at the task considered. Therefore, the information present in the voxel-responses relevant to differentiating between concrete and abstract words is already well-encoded by the word-embeddings, and the former does not complement the latter. On the other hand, the performance of E+B (86 ± 1.9%) was significantly higher than B (69 ± 2.5%) under a paired t-test ($t = 17.77$, $p = 5 \times 10^{-6}$, $df = 5$, 1-tail). This indicates that the word-embeddings may even contain useful extra information above that in the fMRI signals (note that we already demonstrated the effectiveness of our predictive framework in significantly distinguishing word-concreteness purely from fMRI signals). We explore this idea further next. Table 2 shows the average accuracy, recall, and F1 score of the residual models E-B and B-E.
The results of the residual analyses are surprising. First, E-B achieved an average accuracy of 84%, which was significant under the residual-permutation test ($p \le 9 \times 10^{-3}$) for each subject. The performance of E-B (84 ± 1.7%) was also significantly lower than E (87%) across subjects under a 1-sample t-test ($t = -4.71$, $p = 2.6 \times 10^{-3}$, $df = 5$, 1-tail). This shows that removing the voxel-response information from the word-embeddings only marginally affects their ability to classify word concreteness. More strikingly, B-E achieved an average accuracy of 48%, which is lower than the theoretical chance accuracy of 50% (see Figure 1). This result was significant under the residual-permutation test ($p \le 9 \times 10^{-3}$) for each subject, ruling out the possibility that the huge performance decrease was merely caused by overfitting/chance. Across subjects too, the performance of B-E (48 ± 9.1%) was significantly lower than B (69 ± 2.5%) under a paired t-test ($t = -8.52$, $p = 1.8 \times 10^{-4}$, $df = 5$, 1-tail). Therefore, while removing the word-embedding information from the voxel-responses fully eliminates the latter's predictive capability (accuracy falls from 69% to 48%, a relative decrease of about 30%), going the other way around has only a marginal effect on predictive performance (from 87% to 84%, a relative decrease of about 3%). These results show not only that the fMRI signals do not provide complementary information to the word-embeddings in making the concrete/abstract distinction, but that the relevant information in the voxel-responses is really a subset of the relevant information in the word-embeddings. This is a surprising result, considering the task was to distinguish a property of words theorized to fundamentally affect how the human brain represents language. We summarize our findings and provide some additional observations about this work next.

Table 2
Residual Model    Performance (%, Mean ± S.D.)
                  Accuracy      Recall        F1 score
E-B               84 ± 1.7      85 ± 2.4      84 ± 1.4
B-E               48 ± 9.1      60 ± 5.8      55 ± 5.6

Table 3: Examples of chunks frequently misclassified by the voxel-response model. The exact phrase falling within the chunk is in dark color. We find that a majority of such misclassifications come from the abstract category.
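For reference, the t statistics reported in this section follow the standard one-sample and paired forms, sketched below. The per-subject accuracies are not listed in the text, so the functions are shown without the paper's data; the names are hypothetical, and routines such as `scipy.stats.ttest_1samp` and `scipy.stats.ttest_rel` would also return p-values.

```python
import math


def one_sample_t(values, reference):
    """t statistic and degrees of freedom for H0: mean(values) == reference."""
    n = len(values)
    mean = sum(values) / n
    # Unbiased sample variance (divides by n - 1).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    t = (mean - reference) / math.sqrt(var / n)
    return t, n - 1


def paired_t(a, b):
    """Paired t-test: a one-sample test on the per-subject differences."""
    return one_sample_t([x - y for x, y in zip(a, b)], 0.0)
```

With six subjects, both tests have five degrees of freedom, matching the df = 5 reported above.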

Conclusion
This paper has three key findings. First, we showed that words encountered in natural stories could be classified based on concreteness purely from the neural activity elicited as subjects passively comprehended the stories, using a direct, supervised approach.
Second, we showed that in making the concrete/abstract distinction, contextualized word-embeddings (i.e., GPT-2) do not benefit from the inclusion of information from the accompanying fMRI signals, despite evidence from several neurolinguistic studies of the human brain exhibiting fundamentally different representations over the two categories.
Finally, we found that while the residual information remaining in fMRI signals after regressing out word-embedding information can no longer distinguish concrete from abstract words, the residual information in word-embeddings beyond the fMRI signals performs significantly at this task. This shows that the information in the voxel-responses important to our prediction task is a subset of the corresponding information in the contextualized word-embeddings.
Our results should be interpreted in light of the following observations.
A limitation of our work is that while the voxel-responses and word-embeddings (from GPT-2) considered provide contextualized stimulus representations, the concreteness dataset provides non-contextualized ratings for each word. We partially addressed this discrepancy by formulating the prediction task as a classification problem, since binary labels are much more likely to match the ground truth: it is reasonable to assume that context will rarely change the broad binary concreteness class of a word, even when it would change the continuous score. Future work could overcome this limitation by developing the ideas from the recently introduced CONcreTEXT task of computing contextualized rating scores. We still report regression results in Table 4 for completeness and observe that they are consistent with our findings (e.g., B-E can no longer predict word concreteness, as suggested by its near-zero rank-correlation). Finally, we find that repeating our analyses with non-contextualized word2vec embeddings (Mikolov et al., 2013) yielded results qualitatively identical to those in Section 7.2, indicating that our three conclusions above hold for word-embeddings more generally.
Another observation is that while B (69 ± 2.5%) significantly distinguishes concrete from abstract words, it still does not perform as well as E (87%) at this task. There could be two reasons for this difference. First, B does not handle abstract stimuli as well as E does. Quantitatively, while B achieves a recall of 77 ± 2.6% on concrete chunks, its recall on abstract chunks is significantly lower at 63 ± 3.6%. On the other hand, E shows nearly identical performances over the categories. Table 3 shows some of B's misclassified examples common to as many as four out of six subjects. Out of the 29 such common misclassifications, 19 (65.5%) were found to be abstract. This could indicate that neural activity patterns are not as informative for abstract stimuli as concrete stimuli, which is in agreement with psycholinguistic studies demonstrating verbal processing advantages for concrete concepts over abstract concepts (Holmes and Langford, 1976; Kroll and Merves, 1986; Romani et al., 2008). Second, the temporal resolution of functional Magnetic Resonance Imaging may be too coarse (Gauthier and Levy, 2019; Schwartz et al., 2019) for optimal performance on our task (we had to downsample the stimulus in Section 4). Nevertheless, our findings are important. Applying the current predictive framework on the fMRI signals produced highly significant results, and it is under such a framework that the above conclusions were made. Future work could explore the differences in decoding neural activity from naturalistic stimuli with imaging methods of different temporal resolutions (e.g., EEG, MEG) to determine which method should be used for which kind of task.
To conclude, we believe that this paper will inspire future work to take up the following exciting directions: Which natural language processing tasks may benefit from incorporating human language processing information into the existing frameworks? Are there ways of including such information to expose avenues for improvement in these models?