Efficient construction of metadata-enhanced web corpora

Metadata extraction is known to be a problem in general-purpose Web corpora, and so is extensive crawling with little yield. The contributions of this paper are threefold: a method to find and download large numbers of WordPress pages; a targeted extraction of content featuring much-needed metadata; and an analysis of the documents in the corpus with insights into actual blog uses. The study focuses on a single publishing software (WordPress), which allows for reliable extraction of structural elements such as metadata, posts, and comments. The download of about 9 million documents in the course of two experiments leads, after processing, to 2.7 billion tokens with usable metadata. This comparatively high yield is a step towards more efficiency with respect to machine power and "Hi-Fi" web corpora. The resulting corpus complies with formal requirements on metadata-enhanced corpora and on weblogs considered as a series of dated entries. However, existing typologies of Web texts have to be revised in the light of this hybrid genre.


Context
This article introduces work on focused web corpus construction with linguistic research in mind. The purpose of focused web corpora is to complement existing collections, as they allow for better coverage of specific written text types and genres, user-generated content, and recent language developments. However, it is quite rare to find ready-made resources. Specific issues include, first, the discovery of relevant web documents and, second, the extraction of text and metadata, e.g. because of exotic markup and text genres. Nonetheless, proper extraction is necessary for the corpora to be established as scientific objects, as science needs an agreed scheme for identifying and registering research data (Sampson, 2000). Web corpus yield is another recurrent problem (Suchomel and Pomikálek, 2012; Schäfer et al., 2014). The shift from web as corpus to web for corpus, mostly due to an expanding Web universe and the need for better text quality (Versley and Panchenko, 2012), as well as the limited resources of research institutions, makes extensive downloads costly and calls for practical solutions (Barbaresi, 2015).
The DWDS lexicography project at the Berlin-Brandenburg Academy of Sciences already features good coverage of specific written text genres (Geyken, 2007). Further experiments including internet-based text genres are currently being conducted in joint work with the Austrian Academy of Sciences (Academy Corpora). The common absence of metadata familiar to the philological tradition, such as authorship and publication date, accounts for a certain distrust of Web resources, as linguistic evidence cannot be cited or identified properly in the sense of that tradition. Thus, missing or erroneous metadata in "one size fits all" web corpora may undermine the relevance of web texts for linguistic purposes and in the humanities in general. Additionally, nearly all existing text extraction and classification techniques have been developed in the field of information retrieval, that is, not with linguistic objectives in mind.
The contributions of this paper are threefold: (1) a method to find and download large numbers of WordPress pages; (2) a targeted extraction of content featuring much-needed metadata; (3) an analysis of the documents in the corpus with insights into actual uses of the blog genre. My study focuses on a single publishing software with two experiments, first on the official platform wordpress.com and second on the .at-domain. WordPress is used by about a quarter of the websites worldwide; the software has become so widely used that its current deployments can be expected to differ from the original ones. A total of 158,719 blogs in German have previously been found on wordpress.com (Barbaresi and Würzner, 2014). The .at-domain (Austria) is in quantitative terms the 32nd top-level domain, with about 3.7 million hosts reported.

Definitional and typological criteria
From the beginning of research on blogs/weblogs, the main definitional criterion has always been their form, "reverse chronological sequences of dated entries" (Kumar et al., 2003). Another formal criterion is the use of dedicated software to articulate and publish the entries, a "weblog publishing software tool" (Glance et al., 2004), "public-domain blog software" (Kumar et al., 2003), or Content Management System (CMS). These tools largely shape the way blogs are created and run. 1996 seems to be the acknowledged beginning of the blog/weblog genre, with an exponential increase in their use starting in 1999 with the emergence of several user-friendly publishing tools (Kumar et al., 2003; Herring et al., 2004).
Whether a blog is considered to be a web page as a whole (Glance et al., 2004) or a website containing a series of dated entries, or posts, each being a web page (Kehoe and Gee, 2012), there are invariant elements, such as "a persistent sidebar containing profile information" as well as links to other blogs (Kumar et al., 2003), or blogroll. For that matter, blogs are intricately intertwined in what has been called the blogosphere: "The crosslinking that takes place between blogs, through blogrolls, explicit linking, trackbacks, and referrals has helped create a strong sense of community in the weblogging world." (Glance et al., 2004). This means that a comprehensive crawl could lead to better yields.
Regarding the classification of blogs, Blood (2002) distinguishes three basic types: filters, personal journals, and notebooks, while Krishnamurthy (2002) builds a typology based on the function and intention of blogs: online diaries, support group, enhanced column, collaborative content creation. More comprehensive typologies established, on the one hand, several genres: online journal, self-declared expert, news filter, writer/artist, spam/advertisement; and, on the other hand, distinctive "variations": collaborative writing, comments from readers, means of publishing (Glance et al., 2004).

Related work

(Meta-)Data Extraction
Data extraction was first based on "wrappers" (nowadays: "scrapers"), which mostly relied on manual design and tended to be brittle and hard to maintain (Crescenzi et al., 2001). These extraction procedures were also used early on by blog search engines (Glance et al., 2004). Since the genre of "web diaries" was established before blogs in Japan, there have been attempts to target not only blog software but also regular pages (Nanno et al., 2004), in which the extraction of metadata also allows for a distinction based on heuristics.
Efforts were made to generate wrappers automatically, with emphasis on three different approaches (Guo et al., 2010): wrapper induction (e.g. by building a grammar to parse a web page), sequence labeling (e.g. labeled examples or a schema of the data in the page), and statistical analysis with series of resulting heuristics. This analysis, combined with the inspection of DOM tree characteristics (Wang et al., 2009; Guo et al., 2010), is common ground for the information retrieval and web corpus linguistics communities, with the categorization of HTML elements and linguistic features (Ziegler and Skubacz, 2007) for the former, and markup and boilerplate removal operations known to the latter community.
Regarding content-based wrappers for blogs in particular, targets include the title of the entry, the date, the author, the content, the number of comments, the archived link, and the trackback link (Glance et al., 2004); they can also aim at comments specifically (Mishne and Glance, 2006).

Blog corpus construction
The first and foremost issue in blog corpus construction still holds true today: "there is no comprehensive directory of weblogs, although several small directories exist" (Glance et al., 2004). Previous work established several modes of construction, from broad, opportunistic approaches to a focus on a particular method or platform chosen for the convenience of its retrieval processes. Corpus size and length of downloads are frequently mentioned as potential obstacles. Glance et al. (2004) performed URL harvesting through specialized directories and found a practical upper bound at about 100,000 active weblogs, which were used as a corpus in their study.
The first comprehensive studies used feeds to collect blog texts (Gruhl et al., 2004), since they are a convenient way to bypass extensive crawling and to harvest blog posts (and more rarely comments) without needing any boilerplate removal.
An approach based on RSS and Atom feeds is featured in the TREC-Blog collection (Macdonald and Ounis, 2006), a reference in Information Extraction which has been used in a number of evaluation tasks. 100,649 blogs were predetermined; they are top blogs in terms of popularity, but no further information is given. Spam blogs and hand-picked relevant blogs (no information on the criteria either) are used to complement and balance the corpus in order to make it more versatile. The corpus is built by fetching feeds describing recent postings, whose permalinks are used as a reference. From initial figures totaling 3,215,171 permalinks and 324,880 homepages, the most recent ones from 2008 mention 1,303,520 feeds and 28,488,766 permalink documents.
Another way to enhance the quality of data and the ease of retrieval is to focus on a particular platform. To study authorship attribution, Schler et al. (2006) gathered a total of 71,000 blogs on the Google-owned Blogger platform, which allowed for easier extraction of content, although no comments are included in the corpus.
The Birmingham Blog Corpus (Kehoe and Gee, 2012) is a more recent approach to comprehensive corpus construction. Two platforms are taken into consideration: Blogger and wordpress.com, with the "freshly pressed" page on WordPress as well as a series of trending blogs used as seeds for the crawls, leading to 222,245 blog posts and 2,253,855 comments from Blogger and WordPress combined, totaling about 95 million tokens (for the posts) and 86 million tokens (for the comments).
The YACIS Corpus (Ptaszynski et al., 2012) is a Japanese corpus consisting of blogs collected from a single blog platform, which features mostly users in the target language as well as a clear HTML structure. Its creators were able to gather about 13 million webpages from 60,000 bloggers for a total of 5.6 billion tokens.
Last, a focused crawl of the German version of the platform wordpress.com led to the construction of a corpus of 100 million tokens under Creative Commons licenses (Barbaresi and Würzner, 2014), albeit with a much lower proportion of comments (present on 12.7% of the posts). In fact, comments have been shown to be strongly related to the popularity of a blog (Mishne and Glance, 2006), so that the number of comments is much lower when blogs are taken at random.
The sharp decrease in publications documenting blog corpus construction after 2008 signals a shift of focus, not only because web corpus construction does not often get the attention it deserves, but also because of the growing popularity of short message services like Twitter, which allow for comprehensive studies on social networks and internet-based communication, with a larger number of users and messages as well as clear data on network range (e.g. followers).

Discovery
A detection phase is needed to be able to observe bloggers "in the wild" without resorting to large-scale crawling. In fact, guessing whether a website uses WordPress by analyzing its HTML code is straightforward if nothing has been done to hide it, which is almost always the case. However, downloading even a reasonable number of web pages may take a lot of time. That is why I chose to perform massive scans in order to find websites using WordPress, which to the best of my knowledge has not yet been tried in the literature. The detection process is twofold: the first filter is URL-based, whereas the final selection uses shallow HTTP requests.
The permalink settings in WordPress define five common URL structures: default (?p=, ?page_id= or ?paged=), date (/year/ and/or /month/ and/or /day/ and so on), post number (/keyword/number, where keyword is for example "archives"), tag or category (/tag/, /category/, or cross-language equivalents), and finally post name (long URLs containing a lot of hyphens). Patterns derived from those structures can serve as a first filter, although they are not always reliable: news websites tend to use dates very frequently in URLs, in which case the accuracy of the prediction is poor.
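As an illustration, the following minimal sketch shows how such a URL-based pre-filter could be implemented in Python; the regular expressions and keyword lists are assumptions derived from the structures listed above, not the exact filter used in the study.

    import re

    # Illustrative patterns mirroring the five permalink structures;
    # keyword lists and thresholds are assumptions.
    WORDPRESS_URL_PATTERNS = [
        re.compile(r"[?&](p|page_id|paged)=\d+"),              # default structure
        re.compile(r"/20\d{2}/(0?[1-9]|1[0-2])/"),             # date-based structure
        re.compile(r"/archives?/\d+"),                         # post number
        re.compile(r"/(tag|category|schlagwort|kategorie)/"),  # tag or category
        re.compile(r"/[a-z0-9]+(?:-[a-z0-9]+){4,}/?$"),        # long hyphenated post name
    ]

    def looks_like_wordpress_url(url: str) -> bool:
        """Return True if the URL matches one of the permalink heuristics."""
        return any(pattern.search(url) for pattern in WORDPRESS_URL_PATTERNS)

    # Example (hypothetical URL):
    # looks_like_wordpress_url("http://example.at/2015/07/ein-langer-beitrags-titel/")

As noted above, such a filter only yields a rough prediction and is followed by the HTTP-based checks described below.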
The most accurate method would be a scan of fully-rendered HTML documents with clear heuristics such as the "generator" meta tag in the header, which by default points to WordPress. In this study, HTTP HEAD requests are used to spare bandwidth and get cleaner, faster results. HEAD requests are part of the HTTP protocol. Like the most frequent request, GET, which fetches the content, they are supposed to be implemented by every web server. A HEAD request fetches the meta-information written in the response headers without downloading the actual content, which makes it much faster but also more resource-friendly, as with my method fewer than three requests per domain name are sufficient.
The following rules come from the official documentation and have been field-tested: (1) A request sent to the homepage is bound to yield pingback information to be used via the XML-RPC protocol in the X-Pingback header. Note that if there is a redirect, this header usually points to the "real" domain name and/or path, ending in xmlrpc.php. What is more, frequently used WordPress modules may leave a trace in the header as well, e.g. WP-Super-Cache, which identifies a WordPress-run website with certainty.
(2) A request sent to /login or /wp-login.php should yield a HTTP status corresponding to an existing page (2XX, 3XX, more rarely 401).
(3) A request sent to /feed or /wp-feed.php should yield a Location header.
The criteria can be used separately or in combination; I chose to use a simple decision tree. The information provided is rarely tampered with or misleading, since almost all WordPress installations stick to the defaults. Sending more than one request makes the guess more precise; it also acts as a redirection check which provides the domain name effectively used behind a URL. Thus, since the requests help deduplicate a URL list, they are doubly valuable.
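A minimal sketch of such a decision tree, assuming the Python requests library, is given below; the URL paths and the order of the checks follow the three rules above, but the exact logic used in the study may differ.

    import requests

    def _head(url: str) -> requests.Response:
        """Send a bandwidth-friendly HEAD request without following redirects."""
        return requests.head(url, timeout=10, allow_redirects=False)

    def is_wordpress_site(homepage: str) -> bool:
        """Sketch of a simple decision tree over the three header rules above."""
        try:
            response = _head(homepage)
            # Rule 1: an X-Pingback header pointing to xmlrpc.php is a strong signal
            if response.headers.get("X-Pingback", "").endswith("xmlrpc.php"):
                return True
            # Rule 2: the login page should exist (2xx, 3xx, more rarely 401)
            login = _head(homepage.rstrip("/") + "/wp-login.php")
            if login.status_code == 401 or 200 <= login.status_code < 400:
                # Rule 3: the feed URL should answer with a Location header
                feed = _head(homepage.rstrip("/") + "/feed")
                return "Location" in feed.headers
        except requests.RequestException:
            pass
        return False

In a production setting, the redirect targets returned by these requests can additionally be collected in order to deduplicate the URL list, as described above.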

Sources and crawls
This study falls doubly into the category of focused or scoped crawling (Olston and Najork, 2010): the emphasis lies on German or on the .at-domain, and a certain type of website is examined based on structural characteristics.
I have previously shown that the diversity of sources has a positive impact on yield and quality (Barbaresi, 2014). Aside from URL lists from this and other previous experiments (Barbaresi, 2013) and URLs extracted from each batch of downloaded web documents (proper crawls), several sources were queried, not in the orthodox BootCat way with randomized tuples (Baroni and Bernardini, 2004) but based on the formal URL characteristics described above: (1) URLs from the CommonCrawl, a repository already used in web corpus construction (Habernal et al., 2016; Schäfer, 2016); (2) the CDX index query frontend of the Internet Archive; (3) public instances of the metasearch engine Searx.
A further restriction resides in the download of sitemaps for document retrieval. A majority of websites are optimized in this respect, and experiments showed that crawls otherwise depend on unclear directory structures, such as posts classified by category or month, as well as on variables (e.g. page) in URL structures, which leads to numerous duplicates and an inefficient crawl. Another advantage is that websites offering sitemaps are almost systematically robot-friendly, which solves ethical robots.txt-related issues such as the crawl delay, frequently mentioned as an obstacle in the literature.
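By way of illustration, sitemap-based retrieval can be sketched as follows; this single-level version is a simplifying assumption, as real-world sitemaps are frequently nested sitemap indexes or gzip-compressed files, which is not handled here.

    import requests
    from xml.etree import ElementTree

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def sitemap_links(domain: str) -> list:
        """Fetch /sitemap.xml for a domain and return the listed <loc> entries."""
        response = requests.get("http://%s/sitemap.xml" % domain, timeout=10)
        tree = ElementTree.fromstring(response.content)
        return [element.text for element in tree.iter(SITEMAP_NS + "loc") if element.text]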

Extraction
I designed a text extraction procedure targeting specifically WordPress pages, which is transferable to a whole range of self-hosted websites using WordPress, making it possible to reach various blogger profiles thanks to a comparable if not identical content structure. The extractor acts like a state-of-the-art wrapper: after parsing the HTML page, XPath expressions select subtrees and operate on them through pruning and tag conversion in order to (1) write the data with the desired amount of markup and (2) convert the desired HTML tags into the output XML format in strict compliance with the guidelines of the Text Encoding Initiative (http://www.tei-c.org/), in order to allow for greater interoperability within the research community.
The extraction of metadata targets the following fields, if available: title of post, title of blog, date of publication, canonical URL, author, categories, and tags. The multitude of plugins causes strong divergences in the rendered HTML code; additionally, not all websites use all the fields at their disposal. Thus, titles and the canonical URL are the most frequently extracted fields, followed by date, categories, tags, and author.
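The following minimal sketch illustrates this kind of XPath-based metadata extraction with lxml; the expressions only cover markup found in common WordPress themes and are assumptions for illustration, not the exact ones used to build the corpus.

    from lxml import html

    # Illustrative XPath expressions for common WordPress themes;
    # field coverage varies from site to site, as noted above.
    METADATA_XPATHS = {
        "title":      '//h1[contains(@class, "entry-title")]//text()',
        "date":       '//time[contains(@class, "entry-date")]/@datetime',
        "author":     '//*[contains(@class, "author")]//a/text()',
        "categories": '//a[@rel="category tag"]/text()',
        "canonical":  '//link[@rel="canonical"]/@href',
    }

    def extract_metadata(page_source: str) -> dict:
        """Return extracted fields; categories keep all matches, other fields the first."""
        tree = html.fromstring(page_source)
        metadata = {}
        for field, expression in METADATA_XPATHS.items():
            matches = tree.xpath(expression)
            if matches:
                metadata[field] = matches if field == "categories" else str(matches[0]).strip()
        return metadata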
Content extraction allows for a distinction between the post and the comments, the latter being listed as a series of paragraphs with text formatting. The main difference with extractors used in information retrieval is that structural boundaries are kept (titles, paragraphs), whereas links are discarded for corpus use. Special attention is given to dates. Documents with non-existent or missing date or entry content are discarded during processing and are not part of the corpus, which through its dated entries is a corpus of "blogs" in a formal sense. Removal of duplicates is performed on a per-entry basis.
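Per-entry duplicate removal can be sketched as follows; the normalization and hashing strategy shown here is an assumption, as the study only states that deduplication operates at the level of entries.

    import hashlib

    def entry_fingerprint(post_text: str) -> str:
        """Hash of the normalized entry text, used to detect duplicate entries."""
        normalized = " ".join(post_text.lower().split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def is_duplicate(post_text: str, seen: set) -> bool:
        """Return True if an entry with the same fingerprint was already kept."""
        fingerprint = entry_fingerprint(post_text)
        if fingerprint in seen:
            return True
        seen.add(fingerprint)
        return False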

Content analysis
In the first experiment, language detection is performed with langid.py (Lui and Baldwin, 2012), and sources are evaluated using the Filtering and Language identification for URL Crawling Seeds toolchain (https://github.com/adbar/flux-toolchain) (Barbaresi, 2014), which includes filtering of obvious spam and non-text documents, redirection checks, collection of host- and markup-based data, HTML code stripping, document validity checks, and language identification. No language detection is undertaken in the second experiment, since no such filtering is intended. That being said, a large majority of webpages are expected to be in German, as has been shown for another German-speaking country in the .de-TLD.
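For illustration, a minimal language filter built on langid.py could look as follows; restricting the model to German and English is an assumption made here for clarity, not a setting reported in the study.

    import langid

    # Constraining the identifier to a small language set improves accuracy.
    langid.set_languages(["de", "en"])

    def is_german(text: str) -> bool:
        """Return True if langid.py identifies the document as German."""
        language, _score = langid.classify(text)
        return language == "de"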
The token counts below are produced by the WASTE tokenizer (Jurish and Würzner, 2013).

Experiment 1: Retrieving German blogs

General figures on harvested content
In a previous experiment, blogs under CC license hosted on wordpress.com, the largest platform for WordPress-hosted websites, were targeted (Barbaresi and Würzner, 2014). In the summer of 2015, sitemaps were retrieved for all known home pages, which led to the complete download of 145,507 different websites for a total number of 6,605,078 documents (390 GB), leaving 6,095,630 files after processing (36 GB). There are 6,024,187 "valid" files (with usable date and content) from 141,648 websites, whose text amounts to about 2.11 billion tokens. The distribution of harvested documents over the years is documented in table 6; there are 6,095,206 documents with at least a reliable indication of the publication year, i.e. 92.3% of all documents. Contrary to dates reported in the literature, these results do not stem from permalink dates reported in feeds, but directly from page metadata; nonetheless, there is also a fair share of implausible dates, comparable to the 3% of the TREC blog corpus (Macdonald and Ounis, 2006). This indicates that these dates are not an extraction problem but rather creative license on the side of the authors.

[Table: distribution of harvested documents per year (columns: Year, Docs.); values not recoverable from the text extraction]

There are 2,312,843 pages with tags, 15,856,481 uses in total, and 2,431,920 different tags; the top 15 results are displayed in table 4. They are as general as the top categories but slightly more informative.
All in all, the observed metadata are in line with expectations, even if the high proportion of photoblogs is not ideal for text collection. Comments were extracted for 1,454,752 files (24%); this proportion confirms the hypothesis that the wordpress.com platform leads primarily to the publication of blogs in a traditional fashion. By contrast, the typology has to be more detailed in the second experiment due to the absence of previous knowledge about the collection.

Experiment 2: The .at-domain

There are about 2 million "valid" files (with usable date and content), whose text amounts to about 550 million tokens. There are 5,664 different domain names before processing, and 7,275 after (due to the resolution of canonical URLs).

Typology
Of all canonical domain names, only 240 contain the word blog. Comments were extracted for 181,246 files (7%), which is explained mainly by the actual absence of comments and partly by the difficulty of extraction in the case of third-party comment systems.
The distribution of harvested documents over the years is documented in table 6. There are 2,083,535 documents with at least a reliable indication of the publication year, i.e. 80.5% of all documents. The relative amount of "creative" dates is slightly higher than in experiment 1, which hints at a larger diversity of form and content.
The increase in the number of documents by far exceeds the increase in domains registered in the .at-TLD, which seems to hint at the growing popularity of WordPress and perhaps also at the ephemeral character of blogs.
A classification of the most frequent websites (table 7) gives the following typology: informational for general news websites (9), promotional/commercial for websites which list ads, deals, jobs or products (12), specialized for focused news and community websites (16), entertainment (3), political (3), personal for websites dedicated to a person or an organization (3), adult (2), forum (1).
The tags reflect a number of different preoccupations, including family, holidays, sex, jobs, and labor legislation. "Homemade" and "amateur" can be used in German, albeit rarely; these words give more insight into the genre (most probably adult entertainment) than into content language.

[Table 9: Most frequent tags in the second experiment; values not recoverable from the text extraction]
All in all, the distribution of categories and tags indicates that the majority of texts target, as expected, German-speaking users.

Discussion
Although the definition of blogs as a hybrid genre that is neither fundamentally new nor unique (Herring et al., 2004) holds true, several assumptions about weblogs cannot be considered accurate anymore in the light of frequencies in the corpus. Blogs are not always "authored by a single individual" (Kumar et al., 2003), nor does the frequency criterion given by the Oxford English Dictionary (Kehoe and Gee, 2012), a "frequently updated web site", necessarily correspond to reality. Even if both experiments gathered blogs in a formal sense, there are differences between the websites on the platform wordpress.com and freely hosted websites. The former are cleaner in form and content and in line with a certain tradition. The "local community interactions between a small number of bloggers" (Kumar et al., 2003) of the beginnings have been displaced by websites corresponding to the original criteria of a blog but whose finality is to sell information, entertainment, or concrete products and services.
Consequently, the expectation that "blog software makes Web pages truly interactive, even if that interactive potential has yet to be fully exploited" (Herring et al., 2004) is either outdated or yet to come true. Besides these transformations and the emergence of other social networks, the whole range from top to barely known websites shows that the number of comments per post and per website is far lower than during the "bursting" phase of weblogging, when comments were "a substantial part of the blogosphere" (Mishne and Glance, 2006). The evolution of the Web as well as the scope of this study suggest the typical profile of a passive internet consumer, a "prosumer" at best, which should be taken into consideration in web corpus construction and computer-mediated communication studies. If blogs still bridge a technological gap between HTML-enhanced CMC and CMC-enhanced Web pages (Herring et al., 2004), a typological gap exists between original and current studies as well as between users of a platform and users of a content management system.

Conclusion
The trade-off of gaining metadata through focused downloads following strict rules seems to gain enough traction to build larger web corpora, since a total of 550 GB of actually downloaded material allows, after processing, for the construction of a corpus of about 2.7 billion tokens with rich metadata. This comparatively high yield is a step towards more efficiency with respect to machine power and "Hi-Fi" web corpora, which could help promote the cause of web sources and the modernization of research methodology.
The resulting corpus complies with formal requirements on metadata-enhanced corpora and on weblogs considered as a series of dated entries. The interlinking of blogs and their rising popularity certainly do not stand in the way. However, addressing the tricky question of web genres seems inevitable in order to properly qualify my findings and subsequent linguistic inquiries. More than ever, blogs are a hybrid genre, and their