Event and Movie Datasets Released

We are happy to provide two labeled datasets which were obtained from Web markup.

The datasets were extracted from the Web Data Commons October 2016 (WDC 2016) N-Quads dataset. The first dataset contains events with labeled subtypes; the second contains movies with labeled genres. In each case, we include the seven classes with the greatest number of instances.

In the event dataset, the labels are event subtypes, which are defined by schema.org and annotated via the rdf:type property.
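
As an illustration of how such labels can be read from N-Quads, the following minimal sketch collects rdf:type statements whose object lies in the schema.org namespace. It is not the extraction pipeline used to build the datasets: the regular expression only handles simple statements with IRI objects, and the sample lines and the function name event_types are hypothetical.

```python
import re
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_PREFIX = "http://schema.org/"

# Matches simple quads: subject (IRI or blank node), predicate IRI,
# object IRI, graph IRI. Statements with literal objects do not match.
QUAD = re.compile(
    r'^(<[^>]*>|_:\S+)\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$'
)

def event_types(lines):
    """Map (subject, source URL) to the set of schema.org type names."""
    types = defaultdict(set)
    for line in lines:
        match = QUAD.match(line.strip())
        if not match:
            continue  # skip literals, comments, and malformed lines
        subject, predicate, obj, graph = match.groups()
        if predicate == RDF_TYPE and obj.startswith(SCHEMA_PREFIX):
            types[(subject, graph)].add(obj[len(SCHEMA_PREFIX):])
    return dict(types)

# Hypothetical sample statements, not taken from WDC 2016.
sample = [
    '_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/MusicEvent> <http://example.org/page1> .',
    '_:node1 <http://schema.org/name> "Concert" <http://example.org/page1> .',
]
result = event_types(sample)
```

For real data, a full N-Quads parser is the safer choice, since literals, escapes, and language tags are not covered by this pattern.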

In the movie dataset, the labels are movie genres, which are the objects of the schema.org/genre property. The labels were unified by string matching the genre literals against the movie genres defined by imdb.com. Since movie genre classification is a multi-label classification problem, we provide one dataset per genre for training binary classifiers.
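
The idea of unifying genre literals and deriving per-genre binary labels can be sketched as follows. The matching rule used here (case-insensitive substring match) is an assumption for illustration, not necessarily the rule we applied; the function names and sample literals are hypothetical, and the genre list is restricted to the seven genres released here.

```python
# The seven genres covered by the released movie datasets.
GENRES = ["Drama", "Comedy", "Action", "Thriller",
          "Romance", "Documentary", "Adventure"]

def match_genres(literal, genres=GENRES):
    """Return the genres whose name occurs in the literal.

    Assumption: a case-insensitive substring test stands in for the
    string matching step.
    """
    text = literal.lower()
    return {genre for genre in genres if genre.lower() in text}

def binary_labels(literals, target):
    """One label per instance: 1 if the target genre matched, else 0."""
    return [int(target in match_genres(literal)) for literal in literals]

# Hypothetical genre literals as they might appear in Web markup.
literals = ["Action, Adventure", "romantic comedy", "DRAMA", "Sci-Fi"]
drama = binary_labels(literals, "Drama")
comedy = binary_labels(literals, "Comedy")
```

Because an instance can match several genres at once, each genre gets its own label vector, which is what makes one binary classifier per genre a natural fit.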

The following datasets contain the sets of instances that were sampled from the WDC 2016 dataset. Each instance is represented by its node ID, its URL of provenance, and its label.

Dataset                    Size     Distinct plds   Avg. instances/pld   Link
Events (stratified)        67444    1482            45.71                Link
Events (pld-aware)         67444    2064            32.82                Link
Drama (stratified)         239030   360             663.97               Link
Drama (pld-aware)          239030   476             502.16               Link
Comedy (stratified)        239030   342             698.92               Link
Comedy (pld-aware)         239030   476             502.16               Link
Action (stratified)        239030   361             662.13               Link
Action (pld-aware)         239030   476             502.16               Link
Thriller (stratified)      239030   342             698.92               Link
Thriller (pld-aware)       239030   476             502.16               Link
Romance (stratified)       239030   347             688.85               Link
Romance (pld-aware)        239030   476             502.16               Link
Documentary (stratified)   239030   337             709.29               Link
Documentary (pld-aware)    239030   476             502.16               Link
Adventure (stratified)     239030   340             703.03               Link
Adventure (pld-aware)      239030   476             502.16               Link
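
The statistics in the table can be recomputed from the instance records with a few lines of code. The sketch below is only an illustration under stated assumptions: it assumes a tab-separated file with the columns node ID, URL, and label (the actual file layout may differ), and it approximates the pay-level domain (pld) by the last two labels of the host name, which is wrong for suffixes such as co.uk. The helper names and sample records are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse

def naive_pld(url):
    """Approximate the pay-level domain by the last two host labels.

    Assumption: good enough for a rough count; a public suffix list
    is needed for a faithful result (e.g. for co.uk domains).
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def table_row(rows):
    """Return (size, distinct plds, average instances per pld)."""
    rows = list(rows)
    plds = {naive_pld(url) for _node_id, url, _label in rows}
    size = len(rows)
    average = round(size / len(plds), 2) if plds else 0.0
    return size, len(plds), average

# Hypothetical records in the assumed tab-separated layout.
sample = (
    "_:n1\thttp://www.example.org/a\tMusicEvent\n"
    "_:n2\thttp://events.example.org/b\tMusicEvent\n"
    "_:n3\thttp://example.com/c\tTheaterEvent\n"
)
rows = csv.reader(io.StringIO(sample), delimiter="\t")
size, distinct, average = table_row(rows)
```

The last column of the table is simply the size divided by the number of distinct plds; for the stratified Drama dataset, for example, 239030 / 360 ≈ 663.97.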

Cite as:

Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conference 2018 (WWW2018), 27th edition of the former WWW conference, research track, ACM, Lyon, France, 23-27 April 2018.