Web Markup Augmentation – Inferring Missing Information

We provide two labeled datasets obtained from Web markup.

The datasets were extracted from the Web Data Commons October 2016 (WDC 2016) N-Quads dataset. The first dataset contains events with labeled subtypes. The second dataset contains movies with labeled genres. Each dataset covers the 7 classes with the greatest number of instances.

In the event dataset, the labels are event subtypes which are defined by schema.org and are annotated via the rdf:type property.
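As an illustration of how such labels can be read from N-Quads, the following minimal sketch pulls rdf:type statements out of single lines. The regular expression only handles quads whose object is an IRI, and the function name `extract_type` and the sample line are hypothetical; this is not the extraction code used to build the datasets.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified pattern (assumption): subject, <predicate>, <object IRI>, <graph IRI>, final dot.
# Quads with literal objects do not match and are skipped.
QUAD_RE = re.compile(r'^(\S+)\s+<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')

def extract_type(line):
    """Return (node_id, type_iri, source_url) for an rdf:type quad, else None."""
    match = QUAD_RE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj, graph = match.groups()
    if predicate != RDF_TYPE:
        return None
    return subject, obj, graph

# Hypothetical example line, not taken from the WDC corpus.
sample = ('_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
          '<http://schema.org/MusicEvent> <http://example.com/page> .')
print(extract_type(sample))
```

The graph position of each quad holds the page the statement came from, which is why it is returned alongside the node and its type.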

In the movie dataset, the labels are movie genres and are the objects of the schema.org/genre property. The labels were unified by string matching the genre literals against the movie genres defined by imdb.com. Since movie genre classification is a multi-label classification problem, we provide a separate dataset per genre for training binary classifiers.
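The unification and the per-genre binary labels can be sketched as follows. The exact matching rule is not specified here, so the sketch assumes a case-insensitive exact match after trimming whitespace; the genre set is limited to the seven genres listed in the table, and the helper names `normalize_genre` and `binary_label` are hypothetical.

```python
# The seven genres covered by the movie datasets (lower-cased for matching).
GENRES = {"drama", "comedy", "action", "thriller",
          "romance", "documentary", "adventure"}

def normalize_genre(literal):
    """Map a raw genre literal to a known genre, or None if it does not match.

    Assumption: case-insensitive exact match after trimming whitespace.
    """
    cleaned = literal.strip().lower()
    return cleaned if cleaned in GENRES else None

def binary_label(genre_literals, target):
    """Return 1 if any literal of a movie matches the target genre, else 0."""
    normalized = {normalize_genre(g) for g in genre_literals}
    return 1 if target in normalized else 0

# One movie with several genre literals yields one binary label per genre.
movie_genres = [" Drama", "COMEDY", "Sci-Fi"]
labels = {genre: binary_label(movie_genres, genre) for genre in sorted(GENRES)}
print(labels)
```

Because a movie can carry several genres at once, each genre gets its own positive/negative labeling instead of one shared multi-class label.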

The following datasets contain the sets of instances that were sampled from the WDC 2016 dataset. Each instance is represented by its node-id, its URL of provenance and its label.

Dataset	Size	Distinct plds	Avg. Instances/pld	Link
Events`_stratified`	67444	1482	45.71	Link
Events`_pld-aware`	67444	2064	32.82	Link
Drama`_stratified`	239030	360	663.97	Link
Drama`_pld-aware`	239030	476	502.16	Link
Comedy`_stratified`	239030	342	698.92	Link
Comedy`_pld-aware`	239030	476	502.16	Link
Action`_stratified`	239030	361	662.13	Link
Action`_pld-aware`	239030	476	502.16	Link
Thriller`_stratified`	239030	342	698.92	Link
Thriller`_pld-aware`	239030	476	502.16	Link
Romance`_stratified`	239030	347	688.85	Link
Romance`_pld-aware`	239030	476	502.16	Link
Documentary`_stratified`	239030	337	709.29	Link
Documentary`_pld-aware`	239030	476	502.16	Link
Adventure`_stratified`	239030	340	703.03	Link
Adventure`_pld-aware`	239030	476	502.16	Link
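The per-dataset statistics in the table can be recomputed from the instance triples along these lines. This is a rough sketch: it assumes the instances are already loaded as (node-id, URL, label) tuples, and it approximates the pay-level domain (pld) by the last two labels of the host name, which is wrong for suffixes such as co.uk; a public-suffix-aware library would be needed for exact counts. The names `naive_pld` and `pld_stats` and the sample instances are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def naive_pld(url):
    """Approximate the pay-level domain by the last two host labels (assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def pld_stats(instances):
    """Return (size, distinct plds, average instances per pld)."""
    counts = Counter(naive_pld(url) for _node_id, url, _label in instances)
    size = sum(counts.values())
    distinct = len(counts)
    average = size / distinct if distinct else 0.0
    return size, distinct, average

# Hypothetical instances, not taken from the published files.
instances = [
    ("_:n1", "http://www.example.com/a", "MusicEvent"),
    ("_:n2", "http://example.com/b", "Festival"),
    ("_:n3", "https://events.example.org/c", "MusicEvent"),
]
print(pld_stats(instances))
```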

Cite as:

Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conference 2018 (WWW2018), 27th edition of the former WWW conference, research track, ACM, Lyon, France, 23-27 April 2018.