We are happy to provide two labeled datasets which were obtained from Web markup.
The datasets were extracted from the Web Data Commons October 2016 (WDC 2016) N-Quads dataset. The first dataset contains events with labeled subtypes. The second dataset contains movies with labeled genres. Each dataset covers the seven classes with the most instances.
In the event dataset, the labels are event subtypes which are defined by schema.org and are annotated via the rdf:type property.
In the movie dataset, the labels are movie genres and are objects of the schema.org/genre property. The labels were unified by string matching the genre literals against the movie genres defined by imdb.com. Since movie genre classification is a multi-label classification problem, we provide a separate dataset per genre for training binary classifiers.
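The label unification and the split into per-genre binary problems can be sketched roughly as follows. This is a minimal illustration, not the procedure used to build the datasets: the whole-word matching rule, the helper names (`match_genres`, `binary_label`, `to_binary_datasets`), and the restriction to the seven genres listed on this page are all assumptions.

```python
import re

# Assumption: only the seven genres that appear in the table on this page.
# The full imdb.com genre list is longer.
KNOWN_GENRES = {
    "drama", "comedy", "action", "thriller",
    "romance", "documentary", "adventure",
}


def match_genres(literal):
    """Return the known genres that occur as whole words in a raw genre literal."""
    tokens = set(re.findall(r"[a-z]+", literal.lower()))
    return KNOWN_GENRES & tokens


def binary_label(genres, target):
    """Return 1 if the target genre is among the matched genres, else 0."""
    return int(target in genres)


def to_binary_datasets(movies):
    """Map {node_id: [genre literals]} to {genre: {node_id: 0 or 1}}."""
    datasets = {genre: {} for genre in sorted(KNOWN_GENRES)}
    for node_id, literals in movies.items():
        matched = set()
        for literal in literals:
            matched |= match_genres(literal)
        for genre in datasets:
            datasets[genre][node_id] = binary_label(matched, genre)
    return datasets
```

For example, a movie annotated with the literal "Action & Adventure" becomes a positive instance in the action and adventure datasets and a negative instance in the other five. Whole-word matching is deliberately strict here, so a literal such as "Romantic Comedy" matches comedy but not romance.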
The following datasets contain the sets of instances which were sampled from the WDC 2016 dataset. Each instance is represented by its node ID, the URL of its provenance, and its label. In the table, "pld" stands for pay-level domain.
Dataset | Size | Distinct plds | Avg. Instances/pld | Link |
---|---|---|---|---|
Events (stratified) | 67444 | 1482 | 45.71 | Link |
Events (pld-aware) | 67444 | 2064 | 32.82 | Link |
Drama (stratified) | 239030 | 360 | 663.97 | Link |
Drama (pld-aware) | 239030 | 476 | 502.16 | Link |
Comedy (stratified) | 239030 | 342 | 698.92 | Link |
Comedy (pld-aware) | 239030 | 476 | 502.16 | Link |
Action (stratified) | 239030 | 361 | 662.13 | Link |
Action (pld-aware) | 239030 | 476 | 502.16 | Link |
Thriller (stratified) | 239030 | 342 | 698.92 | Link |
Thriller (pld-aware) | 239030 | 476 | 502.16 | Link |
Romance (stratified) | 239030 | 347 | 688.85 | Link |
Romance (pld-aware) | 239030 | 476 | 502.16 | Link |
Documentary (stratified) | 239030 | 337 | 709.29 | Link |
Documentary (pld-aware) | 239030 | 476 | 502.16 | Link |
Adventure (stratified) | 239030 | 340 | 703.03 | Link |
Adventure (pld-aware) | 239030 | 476 | 502.16 | Link |
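The per-dataset statistics in the table (size, distinct plds, average instances per pld) could be recomputed from the provenance URLs along the following lines. This is a rough sketch under stated assumptions: `naive_pld` approximates the pay-level domain by the last two labels of the host name, which is wrong for suffixes such as `co.uk` (a faithful implementation would consult the Public Suffix List), and both function names are hypothetical.

```python
from urllib.parse import urlparse


def naive_pld(url):
    """Approximate the pay-level domain by the last two host-name labels.

    Assumption: good enough for illustration only; multi-part public
    suffixes such as "co.uk" are not handled.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def pld_stats(urls):
    """Return (size, distinct plds, average instances per pld) for a URL list."""
    plds = {naive_pld(url) for url in urls}
    average = round(len(urls) / len(plds), 2) if plds else 0.0
    return len(urls), len(plds), average
```

The last column is simply the size divided by the number of distinct plds; for the pld-aware movie datasets, 239030 / 476 rounds to 502.16.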
Cite as:
Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conference 2018 (WWW2018), 27th edition of the former WWW conference, research track, ACM, Lyon, France, 23-27 April 2018.