KG-SaF-Data: Complete KGR Datasets
KG-SaF-Data provides complete, curated knowledge graph datasets designed for machine learning and reasoning tasks. Each dataset comes with a schema (TBox), instance data (ABox), and role definitions (RBox), which makes it ready for Knowledge Graph Refinement (KGR) research and embedding-based pipelines.
These datasets are compatible with Python, PyTorch, PyKEEN, and ontology editors such as Protégé.
Dataset Structure
KG-SaF datasets follow a Description Logic (DL) formalization, organizing the knowledge graph into three main components:
ABox (Assertional Box): Instance-level data (entities, class assertions, object property assertions)
TBox (Terminological Box): Schema-level knowledge, including class hierarchies and axioms
RBox (Role Box): Relationships and properties, including domain, range, and hierarchy
Note
📄 Files marked with this icon are new serializations or variations of the same data already available in OWL format (e.g., TSV or JSON representations), intended for easier use in ML pipelines.
Each dataset folder contains the following structure:
📁 abox ......................................... # Assertional Box (instance-level data)
│ ├── 📁 splits ................................. # Train/test/validation splits
│ │ ├── 📦 train.nt ............................. # Training triples (N-Triples)
│ │ ├── 📦 valid.nt ............................. # Validation triples (N-Triples)
│ │ ├── 📦 test.nt .............................. # Test triples (N-Triples)
│ │ ├── 📄 train.tsv ............................ # Training triples (TSV)
│ │ ├── 📄 valid.tsv ............................ # Validation triples (TSV)
│ │ └── 📄 test.tsv ............................. # Test triples (TSV)
│
│ ├── 📦 individuals.owl ........................ # Individuals definitions
│ ├── 📦 class_assertions.owl ................... # Individuals class assertions (OWL)
│ ├── 📄 class_assertions.json .................. # Individuals class assertions (JSON)
│
│ ├── 📦 obj_prop_assertions.nt ................. # Combined triples (N-Triples)
│ └── 📄 obj_prop_assertions.tsv ................ # Combined triples (TSV)
📁 rbox ......................................... # Role Box (relations and properties)
│ ├── 📦 roles.owl .............................. # Role definitions
│ ├── 📄 roles_domain_range.json ................ # Domain and range of roles (JSON)
│ └── 📄 roles_hierarchy.json ................... # Role hierarchy (JSON)
📁 tbox ......................................... # Terminological Box (schema-level info)
│ ├── 📦 classes.owl ............................ # Class non-taxonomical axioms
│ ├── 📦 taxonomy.owl ........................... # Hierarchical taxonomy
│ └── 📄 taxonomy.json .......................... # Hierarchical taxonomy (JSON)
📦 knowledge_graph.owl .......................... # Full merged TBox + RBox + ABox
📦 ontology.owl ................................. # Core modularized schema
📁 mappings ..................................... # Mappings to IDs
│ ├── 🧾 class_to_id.json ....................... # Map ontology classes to IDs
│ ├── 🧾 individual_to_id.json .................. # Map entities/instances to IDs
│ └── 🧾 object_property_to_id.json ............. # Map object properties to IDs
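The following is a minimal sketch of reading one split into integer-encoded triples using only the Python standard library. It rests on assumptions that this page does not confirm: each `.tsv` file is taken to hold three tab-separated columns (head, relation, tail) without a header row, and each file under `mappings` is taken to be a flat JSON object from names to integer IDs. The helper names (`load_mapping`, `load_tsv_triples`, `encode_triples`, `load_split`) are illustrative and not part of KG-SaF.

```python
import csv
import json
from pathlib import Path


def load_mapping(path):
    """Read a JSON object that maps names to integer IDs (assumed layout)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def load_tsv_triples(path):
    """Read (head, relation, tail) rows from a header-less, tab-separated file.

    Rows that do not have exactly three columns are skipped.
    """
    with open(path, encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [tuple(row) for row in reader if len(row) == 3]


def encode_triples(triples, entity_to_id, relation_to_id):
    """Translate string triples into ID triples, skipping unknown names."""
    encoded = []
    for head, relation, tail in triples:
        if head in entity_to_id and tail in entity_to_id and relation in relation_to_id:
            encoded.append(
                (entity_to_id[head], relation_to_id[relation], entity_to_id[tail])
            )
    return encoded


def load_split(dataset_dir, split="train"):
    """Load one split of a dataset folder as a list of (h_id, r_id, t_id)."""
    root = Path(dataset_dir)
    entity_to_id = load_mapping(root / "mappings" / "individual_to_id.json")
    relation_to_id = load_mapping(root / "mappings" / "object_property_to_id.json")
    triples = load_tsv_triples(root / "abox" / "splits" / f"{split}.tsv")
    return encode_triples(triples, entity_to_id, relation_to_id)
```

Under the same assumptions, `load_split("path/to/dataset", "valid")` would return the validation triples; the resulting ID triples can then be converted to a tensor for a PyTorch-based pipeline.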
Available Ontologies
KG-SaF includes datasets derived from well-known knowledge graphs and domain-specific ontologies. Each ontology is provided as part of the dataset with modularized TBox, RBox, and ABox components.
DBpedia: Large-scale multilingual knowledge graph extracted from Wikipedia. Contains general-purpose concepts such as Person, Place, and Organisation, and the relationships between them.
YAGO3: Knowledge graph integrating Wikipedia, WordNet, and GeoNames. Known for high coverage and rich semantic types (ALHIF+ DL fragment).
YAGO4: Updated version of YAGO with enhanced temporal and factual coverage, suitable for reasoning over large-scale datasets.
ArCo: Cultural heritage ontology and KG covering Italian assets. Supports advanced reasoning with the SROIQ DL fragment.
WHOW: Ontology for World Heritage and cultural objects, designed for semantic reasoning and ML experiments.
ApuliaTravel: Domain-specific KG for tourism in Apulia, Italy. Models attractions, accommodations, and travel routes.
Available Datasets
The table below lists the currently available ontologies and their corresponding datasets included in this resource.
Note
This table will be updated as new datasets and ontologies become available.
| Ontology | Dataset | DL Fragment |
|---|---|---|
| DBpedia | DBPEDIA25-50K-C | ALCHF |
| DBpedia | DBPEDIA25-100K-C | ALCHF |
| YAGO3 | YAGO3-39K-C | ALHIF+ |
| YAGO3 | YAGO3-10-C | ALHIF+ |
| YAGO4 | YAGO4-20-C | ALCHIF |
| ArCo | ARCO25-20 | SROIQ |
| ArCo | ARCO25-10 | SROIQ |
| ArCo | ARCO25-5 | SROIQ |
| WHOW | WHOW25-5 | SROIQ |
| ApuliaTravel | ATRAVEL | SRIQ |
Dataset Statistics
Table: Available datasets schema axiom coverage comparison
Legend:
✅ = axiom available in the dataset
❌ = axiom not available
⚠️ = axiom available in the KG but not in the dataset
| Axiom Type | DB100K-C | YAGO3-10-C | YAGO4-20-C | APULIA | WHOW-5 | ARCO-5 |
|---|---|---|---|---|---|---|
| ClassAssertion | ? | ? | ? | ? | ? | ? |
| SubClassOf | ? | ? | ? | ? | ? | ? |
| EquivalentClasses | ? | ? | ? | ? | ? | ? |
| DisjointClasses | ? | ? | ? | ? | ? | ? |
| UnionOf | ? | ? | ? | ? | ? | ? |
| IntersectionOf | ? | ? | ? | ? | ? | ? |
| ComplementOf | ? | ? | ? | ? | ? | ? |
| Existential Restrictions | ? | ? | ? | ? | ? | ? |
| Universal Restrictions | ? | ? | ? | ? | ? | ? |
| Cardinality Restrictions | ? | ? | ? | ? | ? | ? |
| ObjPropDomain | ? | ? | ? | ? | ? | ? |
| ObjPropRange | ? | ? | ? | ? | ? | ? |
| SubObjProp | ? | ? | ? | ? | ? | ? |
| InverseObjProp | ? | ? | ? | ? | ? | ? |
| EquivalentObjProp | ⚠️ | ? | ? | ? | ? | ? |
| ObjPropCharacteristic | ⚠️ | ? | ? | ? | ? | ? |
| ObjPropChain | ? | ? | ? | ? | ? | ? |
Table: Available datasets statistics
Dataset statistics covering ABox (instance-level) and schema (class- and property-level) information. For the object property structural columns, the highest value in each column is highlighted in bold (manually indicated with a marker).
| Dataset | Triples | Inds | Props | Classes | 1to1 | 1toN | Nto1 | NtoN | Avg Triples | Class Assert. | TBox Classes | Disjoints | Subclass | ∃R.C | ∀R.C | Schema Props | Domain | Range | Both | Functional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DB-50K-C | 28,525 | 22,268 | 275 | 169 | 0.21 | 0.08 | 0.33 | 0.38 | 103.73 | 12,419 | 232 | 14 | 226 | - | - | 280 | 194 | 217 | 152 | - |
| DB-100K-C | 577,249 | 96,375 | 406 | 229 | 0.09 | 0.06 | 0.15 | 0.71 | 1,421.80 | 82,750 | 297 | 14 | 290 | - | - | 411 | 295 | 317 | 233 | - |
| Y3-10-C | 1,080,398 | 123,038 | 34 | 92,539 | 0.00 | 0.00 | 0.12 | 0.88 | 31,776.41 | 1,309,964 | 94,726 | - | 94,809 | - | - | 35 | 34 | 29 | 28 | 12 |
| Y3-39K-C | 370,169 | 37,711 | 34 | 45,456 | 0.03 | 0.03 | 0.15 | 0.79 | 10,887.32 | - | 46,892 | - | 46,960 | - | - | 35 | 34 | 29 | 28 | 10 |
| Y4-20-C | 653,988 | 91,904 | 68 | 1,433 | 0.07 | 0.07 | 0.13 | 0.72 | 9,617.47 | 116,141 | 1,462 | 9 | 1,665 | - | - | 69 | 68 | 56 | 56 | 6 |
| ATRAVEL | 76,943 | 29,767 | 25 | 53 | 0.36 | 0.16 | 0.36 | 0.12 | 3,077.72 | 35,910 | 100 | 12 | 184 | 17 | 52 | 71 | 57 | 61 | 47 | 16 |
| WHOW-5 | 584,791 | 137,740 | 25 | 31 | 0.00 | 0.00 | 0.76 | 0.24 | 23,391.64 | 144,892 | 91 | 16 | 202 | 43 | 34 | 102 | 99 | 99 | 99 | - |
| ARCO-20 | 95,840 | 15,690 | 53 | 78 | 0.04 | 0.09 | 0.40 | 0.47 | 1,808.30 | 41,626 | 315 | 14 | 873 | 181 | 366 | 454 | 348 | 352 | 271 | 62 |
| ARCO-10 | 202,492 | 45,400 | 111 | 117 | 0.06 | 0.06 | 0.44 | 0.44 | 1,824.25 | 119,304 | 378 | 24 | 1,033 | 211 | 431 | 593 | 462 | 465 | 369 | 71 |
| ARCO-5 | 655,089 | 198,674 | 196 | 192 | 0.10 | 0.10 | 0.40 | 0.40 | 3,342.29 | 471,031 | 438 | 26 | 1,153 | 228 | 474 | 683 | 546 | 551 | 450 | 84 |
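This page does not define how the 1to1, 1toN, Nto1, and NtoN columns are computed. The sketch below shows one common way to derive such values. It assumes the usual convention from the link-prediction literature, in which a side of a relation counts as "N" when the average number of distinct entities on that side, per entity on the other side, is at least 1.5. The threshold and the function names are assumptions and are not taken from KG-SaF. The reported values appear consistent with shares of properties per category (each row sums to roughly 1), and Avg Triples matches Triples divided by Props (for ATRAVEL, 76,943 / 25 = 3,077.72).

```python
from collections import Counter, defaultdict


def relation_categories(triples, threshold=1.5):
    """Label each relation as 1to1, 1toN, Nto1, or NtoN.

    tph = average number of distinct tails per head,
    hpt = average number of distinct heads per tail.
    A side is "N" when its average reaches the threshold (assumed 1.5).
    """
    tails_of = defaultdict(lambda: defaultdict(set))  # relation -> head -> tails
    heads_of = defaultdict(lambda: defaultdict(set))  # relation -> tail -> heads
    for head, relation, tail in triples:
        tails_of[relation][head].add(tail)
        heads_of[relation][tail].add(head)

    categories = {}
    for relation in tails_of:
        tph = sum(len(t) for t in tails_of[relation].values()) / len(tails_of[relation])
        hpt = sum(len(h) for h in heads_of[relation].values()) / len(heads_of[relation])
        head_side = "1" if hpt < threshold else "N"
        tail_side = "1" if tph < threshold else "N"
        categories[relation] = f"{head_side}to{tail_side}"
    return categories


def category_shares(triples, threshold=1.5):
    """Return the share of relations that fall into each category."""
    categories = relation_categories(triples, threshold)
    counts = Counter(categories.values())
    total = len(categories)
    return {name: counts[name] / total for name in ("1to1", "1toN", "Nto1", "NtoN")}


def avg_triples_per_property(triples):
    """Average number of distinct triples per relation."""
    unique = set(triples)
    return len(unique) / len({relation for _, relation, _ in unique})
```

These functions accept any iterable of (head, relation, tail) tuples, so they can be applied to string triples or to integer-encoded ones.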