Automatic Identification of Age-Appropriate Ratings of Song Lyrics

This paper presents a novel task, namely the automatic identification of age-appropriate ratings of a musical track, or album, based on its lyrics. Details are provided regarding the construction of a dataset of lyrics from 12,242 tracks across 1,798 albums with age-appropriate ratings obtained from various web resources, together with results from various text classification experiments. The best accuracy of 71.02% for classifying albums by age groups is achieved by combining vector space model and psycholinguistic features.


Introduction
Media age-appropriateness can be defined as the suitability of the consumption of a media item, e.g. a song, book, film, videogame, etc., by a child of a given age based on norms that are generally agreed upon within a society. Such norms may include behavioral, sociological, psychological, and other factors. Whilst we acknowledge that this is largely a subjective judgment, and that there may be wide variance even between very small circles that could be considered demographically homogeneous, parents, educators, and policymakers may nevertheless find such judgments valuable in the process of guiding and supervising the media consumption of children.
This topic is closely related to well-known content rating schemes such as the MPAA film rating system (http://www.mpaa.org/film-ratings), but whereas such schemes focus mainly on whether a film contains adult material, age-appropriateness can be thought of as more nuanced, taking into consideration further factors such as educational value. One popular resource for such ratings is Common Sense Media, a website that provides reviews for various media, with a focus on age appropriateness and learning potential for children.
Whilst we acknowledge that such ratings are of interest to many people, this research takes a neutral position on their efficacy and utility: we ask only whether it is possible to automate the identification of these age-appropriateness ratings.
This work focuses on song lyrics. There are many aspects that can contribute to the age-appropriateness of a song, but we believe that by far the most dominant factor is its lyrics. Thus, the approach taken to automating the identification of age-appropriateness ratings is to treat it as a supervised text classification task: first, a corpus of song lyrics along with age-appropriateness ratings is constructed, and subsequently this corpus is used to train a model based on various textual features.
To give the reader an idea of this task, Figures 1 to 3 show a sample of lyric snippets from songs along with their age-appropriate ratings according to Common Sense Media. Our goal is to be able to automatically predict the age-appropriate rating given the lyrics of a song in such cases.
Oh, I'm Sammy the snake
And I look like the letter "S"ssss. Oh, yes.
I'm all wiggly and curvy,
And I look like the letter "S"ssss. I confess.
(age-appropriate rating: 2)

You can take everything I have
You can break everything I am
Like I'm made of glass
Like I'm made of paper
Go on and try to tear me down
I will be rising from the ground
Like a skyscraper
Like a skyscraper
(age-appropriate rating: 9)

In Section 2 we discuss related work, before presenting our work on constructing the corpus (Section 3) and carrying out text classification experiments (Section 4). Finally, we present a tentative summary in Section 5.

Related Work
To our knowledge, no previous work has attempted what is described in this paper. There is some thematically related work, such as automatic filtering of pornographic content (Polpinij et al., 2006; Sood et al., 2012; Xiang et al., 2012; Su et al., 2004), but we believe the nature of the task is sufficiently different that a different approach is required.
However, text or document classification, the general technique employed in this paper, is a very common task (Manning et al., 2008). In text classification, given a document $d$, the task is to assign it a class, or label, $c$, from a fixed, human-defined set of possible classes $C = \{c_1, c_2, \ldots, c_n\}$. In order to achieve this, a training set of labelled documents $\langle d, c \rangle$ is given to a learning algorithm to learn a classifier that maps documents to classes.
Documents are typically represented as vectors in a high-dimensional space, such as the columns of a term-document matrix, the results of dimensionality reduction techniques such as Latent Semantic Analysis (Landauer et al., 1998), or, more recently, vector representations of words produced by neural networks (Pennington et al., 2014).
Text classification has many applications, among others spam filtering (Androutsopoulos et al., 2000) and sentiment analysis (Pang and Lee, 2008).
One particular application that could be deemed relevant to our work is readability assessment (Pitler and Nenkova, 2008; Feng et al., 2010), i.e. determining the ease with which a written text can be understood by a reader, since age is certainly a dimension along which readability varies. However, our literature review of this area suggested that the aspects being considered in readability assessment are sufficiently different from the dimensions that seem to be most relevant for media age-appropriateness ratings. Following Manurung et al. (2008), we hypothesize that utilizing resources such as the MRC Psycholinguistic Database (Coltheart, 1981) could be valuable in determining age appropriateness, in particular features such as familiarity, imageability, age-of-acquisition, and concreteness.

Corpus Construction
There are three steps in obtaining the data required for our corpus: obtaining album details and age-appropriateness ratings, searching for the track listing of each album, and obtaining the lyrics for each song. Each step is carried out by querying a different website. To achieve this, a Java application that utilizes the jsoup library was developed.

Obtaining album details and age-appropriateness ratings
The Common Sense Media website provides reviews for various music albums. Each review consists of a textual review, the age-appropriate rating for the album, which is either an integer in the interval [2,17] or the label 'Not For Kids', and metadata about the album such as title, artist, and genre. Aside from that, there are also other annotations such as a quality rating (1-5 stars), and specific aspectual ratings such as positive messages, role models, violence, sex, language, consumerism, and drinking, drugs & smoking. The website also allows visitors to contribute user ratings and reviews. In our experiments we only utilize the album metadata and the integer indicating the age-appropriate rating.

Tracklist searching
A tracklist is a list of all the songs, or tracks, contained within an album. Using the information previously obtained from Common Sense Media, the next step is to obtain the tracklist of each album. For this we query the MusicBrainz website, an open music encyclopedia that makes music metadata available to the public. To obtain the tracklists we employed the advanced query search mode that allows the use of boolean operators. We tried several combinations of queries involving album title, singer, and label information, and found that queries consisting of album title and singer produced the highest recall. When MusicBrainz returns multiple results for a given query, we simply select the first result. For special cases where the tracks on an album are performed by various artists, e.g. a compilation album or a soundtrack album, it is during this stage that we also extract information regarding the track-specific artist name. Finally, we assume that if the album title contains the string 'CD Single' then it only contains one track and we skip forward to the next step.
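As a rough illustration of the kind of boolean query described above, the following minimal sketch builds a query string from an album title and a singer. The field names `release` and `artist`, the Lucene-style syntax, and the helper names are assumptions made for illustration; the paper does not give the exact query format, and the original application was written in Java.

```python
def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping embedded backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_album_query(title: str, artist: str) -> str:
    """Combine album title and artist with a boolean AND (field names are assumed)."""
    return f"release:{_quote(title)} AND artist:{_quote(artist)}"


def is_cd_single(title: str) -> bool:
    """Heuristic from the text: a title containing 'CD Single' means a one-track album."""
    return "CD Single" in title
```

A caller would send the resulting string to the search service and, as described above, keep only the first result returned.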

Lyrics searching
For this step, we consulted two websites as the source reference for song lyrics, songlyrics.com and lyricsmode.com. The former is consulted first, and only if it fails to yield any results is the latter consulted. If a track is not found on either website, we discard it from our data set. As in the previous step, we perform a query to obtain results; during this step, however, the query consists of the song title and singer. Once again, given multiple results we simply choose the first result. In total, we were able to retrieve lyrics from 12,242 songs across 1,798 albums. Table 1 provides an overview of the number of tracks and albums obtained per age rating.

Once the dataset was complete, classifiers were trained and used to carry out experiment scenarios that vary along several factors. For the class labels, two scenarios are considered: one where each age rating from 2 to 17 and 'Not For Kids' is a separate class, and another where the data is clustered together based on some conventional developmental age groupings, i.e. toddlers (ages 2 & 3), preschoolers (ages 4 & 5), middle-childhood 1 (ages 6 to 8), middle-childhood 2 (ages 9 to 11), young teens (ages 12 to 14), and teenagers (ages 15 to 17), with an additional category for ages beyond 17 using the 'Not For Kids' labelled data.
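The age-group scenario amounts to a simple lookup from rating to group. The sketch below shows one way to write it; the function name and the label "beyond 17" for the 'Not For Kids' category are illustrative choices, not names taken from the original implementation.

```python
from typing import Union

NOT_FOR_KIDS = "Not For Kids"

# (lowest age, highest age, group name), following the groupings in the text.
AGE_GROUPS = [
    (2, 3, "toddlers"),
    (4, 5, "preschoolers"),
    (6, 8, "middle-childhood 1"),
    (9, 11, "middle-childhood 2"),
    (12, 14, "young teens"),
    (15, 17, "teenagers"),
]


def age_group(rating: Union[int, str]) -> str:
    """Map a rating (integer 2..17 or 'Not For Kids') to its age-group label."""
    if rating == NOT_FOR_KIDS:
        return "beyond 17"  # label name is an assumption
    if not isinstance(rating, int):
        raise ValueError(f"unrecognized rating: {rating!r}")
    for low, high, name in AGE_GROUPS:
        if low <= rating <= high:
            return name
    raise ValueError(f"rating out of range: {rating!r}")
```

Under this mapping the seventeen per-year classes collapse into seven classes.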
For the instance data, two scenarios are also considered: one where classification is done on a per-track basis, and one on a per-album basis (i.e. where lyrics from all its constituent tracks are concatenated).
As for the feature representation, three primary variations are considered:

Vector Space Model. This is a baseline method where each word appearing in the dataset becomes a feature, and a vector representing an instance consists of the tf.idf values of all words. Additionally, stemming is first performed on the words, and information gain-based attribute selection is applied.
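To make the tf.idf weighting concrete, here is a minimal sketch that computes one sparse vector per tokenized document. It assumes raw term counts and an unsmoothed logarithmic idf, which is only one of several common variants; the paper does not state which variant was used, and stemming and information-gain selection are omitted here.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse tf.idf vector (term -> weight) per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors
```

With this variant, a word that occurs in every document receives a weight of zero, so it cannot help to separate the classes.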
MRC Psycholinguistic data. For this feature representation, given each distinct word appearing in the lyrics of a track (or album), a lookup is performed on the MRC psycholinguistic database, and if appropriate values exist, they are added to the tally for the familiarity, imageability, age-of-acquisition, and concreteness scores. Thus, an instance is represented by a vector with four real values. The vectors are normalized with respect to the number of words contributing to the values.
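The tally-and-normalize step can be sketched as follows. The `lexicon` argument is a hypothetical in-memory stand-in for the MRC database (word to score dictionary), and normalizing each score by its own count of contributing words is our reading of the description above rather than a confirmed detail.

```python
from typing import Dict, Iterable, List, Optional

FEATURES = ("familiarity", "imageability", "age_of_acquisition", "concreteness")


def mrc_features(
    tokens: Iterable[str],
    lexicon: Dict[str, Dict[str, Optional[float]]],
) -> List[float]:
    """Average each score over the distinct words that have a value for it."""
    totals = {name: 0.0 for name in FEATURES}
    counts = {name: 0 for name in FEATURES}
    for word in {token.lower() for token in tokens}:  # each distinct word counts once
        entry = lexicon.get(word, {})
        for name in FEATURES:
            value = entry.get(name)
            if value is not None:
                totals[name] += value
                counts[name] += 1
    # Normalize by the number of contributing words; 0.0 if no word contributed.
    return [totals[n] / counts[n] if counts[n] else 0.0 for n in FEATURES]
```

Words missing from the lexicon are simply skipped, so lyrics with many unlisted words yield averages based on fewer entries.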
GloVe vectors. GloVe is a tool that produces vector representations of words trained on very large corpora (Pennington et al., 2014). It is similar to dimensionality reduction approaches such as latent semantic analysis. For this experiment, the 50-dimensional pre-trained vectors trained on Wikipedia and Gigaword corpora were used.
When combining feature representations, we simply concatenate their vectors.
Finally, for the classification itself, the Weka toolkit is used. Given the ordinal nature of the class labels, classification is carried out via regression (Frank et al., 1998), using the M5P-based classifier (Wang and Witten, 1997). The experiments were run using 4-fold cross validation.
For the initial experiment, only the baseline VSM feature representation was used, and the treatment of class labels and instance granularity was varied. The results can be seen in Table 2, which shows the average accuracy, i.e. the percentage of test instances that were correctly labelled, across 4 folds.
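The accuracy measure described above can be sketched as follows. This is a simplified illustration, not the Weka procedure: folds are assigned round-robin by index rather than by randomized splitting, the dataset is assumed to contain at least `k` instances, and `majority_trainer` is a toy stand-in for a real learner.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Classifier = Callable[[object], str]
Trainer = Callable[[Sequence[Tuple[object, str]]], Classifier]


def cross_validated_accuracy(
    data: Sequence[Tuple[object, str]], train: Trainer, k: int = 4
) -> float:
    """Average, over k folds, of the fraction of held-out instances labelled correctly."""
    fold_accuracies = []
    for fold in range(k):
        test = [pair for i, pair in enumerate(data) if i % k == fold]
        training = [pair for i, pair in enumerate(data) if i % k != fold]
        classify = train(training)
        correct = sum(1 for features, label in test if classify(features) == label)
        fold_accuracies.append(correct / len(test))
    return sum(fold_accuracies) / k


def majority_trainer(training: Sequence[Tuple[object, str]]) -> Classifier:
    """Toy learner: always predict the most frequent training label."""
    most_common = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: most_common
```

A majority-class baseline of this kind is also a useful reference point when interpreting accuracies on imbalanced class distributions.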

              Age group    Year
Per-track     69.77%       58.58%
Per-album     70.60%       57.15%

Table 2: Initial experiment varying class and instance granularity

For the follow-up experiment, we focus on the task of classifying at the per-album level of granularity, as ultimately this is the level at which the original annotations are obtained. For the class labels, both age groups and separate ages are used. The feature representation was varied across VSM, VSM + MRC, VSM + GloVe, and VSM + GloVe + MRC. The results can be seen in Table 3.

From the initial experiment, it appears that distinguishing tracks at the level of granularity of specific year/age (e.g. "is this song more appropriate for a 4 or 5 year old?") is very difficult, as indicated by an accuracy of only 57% to 58%. Bear in mind, however, that this is a seventeen-way classification task. Shifting the level of granularity to that of age groups transforms the task into a more feasible one, with an accuracy around the 70% mark. It is surprising to note that the per-track performance is better than the per-album performance when tracks are distinguished by specific age/year rather than age groups. We had initially hypothesized that classifying albums would be a more consistent task given the increased context and evidence available.

As for the various feature representations, we note that the addition of the MRC psycholinguistic features of familiarity, imageability, concreteness, and age-of-acquisition does provide a small accuracy increase in certain cases, as evidenced by the highest accuracy of 71.02% when classifying albums by age group using the VSM + MRC features. The use of the GloVe vectors gives a slight contribution in the case of classifying albums by specific age/year, where the highest accuracy of 57.85% is obtained when combining VSM with both the MRC and GloVe features.
There are many other features and contexts that can also be utilized. For instance, given the metadata of artist, album, and genre, additional information may be extracted from the web, e.g. the artist's biography, general-purpose album reviews, genre tendencies, etc., all of which may contribute to discerning age-appropriateness. Another set of features that can be utilized is readability metrics, as they are often correlated with the age of the reader.
To summarize, this paper has introduced a novel task with clear practical applications in the form of automatically identifying age-appropriate ratings of songs and albums based on lyrics. The work reported is still in its very early stages; nevertheless, we believe the findings are of interest to NLP researchers.
Another question that needs to be addressed is what sort of competence and agreement humans achieve on this task. To that end, we plan to conduct a manual annotation experiment involving several human subjects, themselves varied across different age groups, and to measure inter-annotator reliability (Passonneau et al., 2006).