KG-SaF-Data: Complete KGR Datasets

KG-SaF-Data provides complete and curated knowledge graph datasets designed for machine learning and reasoning tasks. Each dataset comes with both a schema (TBox) and instance data (ABox), along with role definitions (RBox), making them ready for Knowledge Graph Refinement (KGR) research and embedding-based pipelines.

The datasets can be used from Python, PyTorch, and PyKEEN, and opened in ontology editors such as Protégé.

Dataset Structure

KG-SaF datasets follow a Description Logic (DL) formalization, organizing the knowledge graph into three main components:

  • ABox (Assertional Box): Instance-level data (entities, class assertions, object property assertions)

  • TBox (Terminological Box): Schema-level knowledge, including class hierarchies and axioms

  • RBox (Role Box): Relationships and properties, including domain, range, and hierarchy

Note

📄 Files marked with this icon are new serializations or variations of data already available in OWL format (e.g., TSV or JSON representations), intended for easier use in ML pipelines.

Each dataset folder contains the following structure:

📁 abox ......................................... # Assertional Box (instance-level data)
│ ├── 📁 splits ................................. # Train/test/validation splits
│ │ ├── 🦉 train.nt ............................. # Training triples (N-Triples)
│ │ ├── 🦉 valid.nt ............................. # Validation triples (N-Triples)
│ │ ├── 🦉 test.nt .............................. # Test triples (N-Triples)
│ │ ├── 📄 train.tsv ............................ # Training triples (TSV)
│ │ ├── 📄 valid.tsv ............................ # Validation triples (TSV)
│ │ └── 📄 test.tsv ............................. # Test triples (TSV)
│ │
│ ├── 🦉 individuals.owl ........................ # Individuals definitions
│ ├── 🦉 class_assertions.owl ................... # Individuals class assertions (OWL)
│ ├── 📄 class_assertions.json .................. # Individuals class assertions (JSON)
│ │
│ ├── 🦉 obj_prop_assertions.nt ................. # Combined triples (N-Triples)
│ └── 📄 obj_prop_assertions.tsv ................ # Combined triples (TSV)

📁 rbox ......................................... # Role Box (relations and properties)
│ ├── 🦉 roles.owl .............................. # Role definitions
│ ├── 📄 roles_domain_range.json ................ # Domain and range of roles (JSON)
│ └── 📄 roles_hierarchy.json ................... # Role hierarchy (JSON)

📁 tbox ......................................... # Terminological Box (schema-level info)
│ ├── 🦉 classes.owl ............................ # Class non-taxonomical axioms
│ ├── 🦉 taxonomy.owl ........................... # Hierarchical taxonomy
│ └── 📄 taxonomy.json .......................... # Hierarchical taxonomy (JSON)

🦉 knowledge_graph.owl .......................... # Full merged TBox + RBox + ABox
🦉 ontology.owl ................................. # Core modularized schema

📁 mappings ..................................... # Mappings to IDs
│ ├── 🧾 class_to_id.json ....................... # Map ontology classes to IDs
│ ├── 🧾 individual_to_id.json .................. # Map entities/instances to IDs
│ └── 🧾 object_property_to_id.json ............. # Map object properties to IDs
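As a rough illustration of how the TSV splits and the ID mappings in this layout might be combined, the sketch below reads one split and encodes it as integer triples. It rests on assumptions the listing does not confirm: that each TSV row holds exactly three tab-separated columns (head, relation, tail) with no header line, and that the mapping files are flat JSON objects from label to integer ID. The function names `load_triples`, `encode_triples`, and `load_split` are hypothetical helpers, not part of any KG-SaF tooling.

```python
import csv
import json
from pathlib import Path


def load_triples(tsv_path):
    """Read (head, relation, tail) rows from a tab-separated file.

    Assumption: three columns per row, no header line.
    Rows with a different number of columns are skipped.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [tuple(row) for row in reader if len(row) == 3]


def encode_triples(triples, entity_to_id, relation_to_id):
    """Replace labels with integer IDs, skipping triples with unknown labels."""
    encoded = []
    for head, relation, tail in triples:
        if head in entity_to_id and tail in entity_to_id and relation in relation_to_id:
            encoded.append((entity_to_id[head], relation_to_id[relation], entity_to_id[tail]))
    return encoded


def load_split(dataset_dir, split="train"):
    """Load one split of a dataset folder and encode it with the mapping files.

    Assumption: the mapping files are flat {label: integer_id} JSON objects.
    """
    root = Path(dataset_dir)
    triples = load_triples(root / "abox" / "splits" / f"{split}.tsv")
    entity_to_id = json.loads(
        (root / "mappings" / "individual_to_id.json").read_text(encoding="utf-8")
    )
    relation_to_id = json.loads(
        (root / "mappings" / "object_property_to_id.json").read_text(encoding="utf-8")
    )
    return encode_triples(triples, entity_to_id, relation_to_id)
```

The resulting list of integer triples can be turned into a tensor or handed to an embedding library. If the actual files carry a header row or a nested JSON layout, the two loading steps would need to be adapted.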

Available Ontologies

KG-SaF includes datasets derived from well-known knowledge graphs and domain-specific ontologies. Each ontology is provided as part of the dataset with modularized TBox, RBox, and ABox components.

  • DBpedia
    Large-scale multilingual knowledge graph extracted from Wikipedia. Contains general-purpose concepts like Person, Place, Organisation, and relationships between them.

  • YAGO3
    Knowledge graph integrating Wikipedia, WordNet, and GeoNames. Known for high coverage and rich semantic types (ALHIF+ DL fragment).

  • YAGO4
    Updated version of YAGO with enhanced temporal and factual coverage, suitable for reasoning over large-scale datasets.

  • ArCo
    Cultural heritage ontology and KG covering Italian assets. Supports advanced reasoning with the SROIQ DL fragment.

  • WHOW
    Ontology for the World Heritage and cultural objects, designed for semantic reasoning and ML experiments.

  • ApuliaTravel
    Domain-specific KG for tourism in Apulia, Italy. Models attractions, accommodations, and travel routes.

Available Datasets

The table below lists the currently available ontologies and their corresponding datasets included in this resource.

Note

This table will be updated as new datasets and ontologies become available.

Ontology      Dataset           DL Fragment
------------  ----------------  -----------
DBpedia       DBPEDIA25-50K-C   ALCHF
DBpedia       DBPEDIA25-100K-C  ALCHF
YAGO3         YAGO3-39K-C       ALHIF+
YAGO3         YAGO3-10-C        ALHIF+
YAGO4         YAGO4-20-C        ALCHIF
ArCo          ARCO25-20         SROIQ
ArCo          ARCO25-10         SROIQ
ArCo          ARCO25-5          SROIQ
WHOW          WHOW25-5          SROIQ
ApuliaTravel  ATRAVEL           SRIQ

Dataset Statistics

Table: Available datasets schema axiom coverage comparison

Legend:

  • ✅ = axiom available in the dataset

  • ❌ = axiom not available

  • ⚠️ = axiom available in the KG but not in the dataset

Axiom Type                DB100K-C  YAGO3-10-C  YAGO4-20-C  APULIA  WHOW-5  ARCO-5
------------------------  --------  ----------  ----------  ------  ------  ------
ClassAssertion            ✅        ✅          ✅          ✅      ✅      ✅
SubClassOf                ✅        ✅          ✅          ✅      ✅      ✅
EquivalentClasses         ❌        ❌          ❌          ✅      ❌      ✅
DisjointClasses           ✅        ❌          ✅          ✅      ✅      ✅
UnionOf                   ❌        ❌          ✅          ✅      ✅      ✅
IntersectionOf            ❌        ❌          ❌          ❌      ❌      ❌
ComplementOf              ❌        ❌          ❌          ✅      ❌      ✅
Existential Restrictions  ❌        ❌          ❌          ✅      ✅      ✅
Universal Restrictions    ❌        ❌          ❌          ✅      ✅      ✅
Cardinality Restrictions  ❌        ❌          ❌          ✅      ✅      ✅
ObjPropDomain             ✅        ✅          ✅          ✅      ✅      ✅
ObjPropRange              ✅        ✅          ✅          ✅      ✅      ✅
SubObjProp                ✅        ✅          ✅          ✅      ✅      ✅
InverseObjProp            ❌        ❌          ✅          ✅      ✅      ✅
EquivalentObjProp         ⚠️        ❌          ❌          ✅      ✅      ✅
ObjPropCharacteristic     ⚠️        ✅          ✅          ✅      ✅      ✅
ObjPropChain              ❌        ❌          ✅          ✅      ✅      ✅

Table: Available datasets statistics

Dataset statistics including ABox (instance-level) and Schema/TBox (class & property-level) information. For object property structural columns, the highest values in each column are highlighted in bold (manually indicated with …).

ABox (instance-level) columns:

Dataset    Triples    Inds     Props  Classes  1to1  1toN  Nto1  NtoN  Avg Triples  Class Assert.
---------  ---------  -------  -----  -------  ----  ----  ----  ----  -----------  -------------
DB-50K-C   28,525     22,268   275    169      0.21  0.08  0.33  0.38  103.73       12,419
DB-100K-C  577,249    96,375   406    229      0.09  0.06  0.15  0.71  1,421.80     82,750
Y3-10-C    1,080,398  123,038  34     92,539   0.00  0.00  0.12  0.88  31,776.41    1,309,964
Y3-39K-C   370,169    37,711   34     45,456   0.03  0.03  0.15  0.79  10,887.32    -
Y4-20-C    653,988    91,904   68     1,433    0.07  0.07  0.13  0.72  9,617.47     116,141
ATRAVEL    76,943     29,767   25     53       0.36  0.16  0.36  0.12  3,077.72     35,910
WHOW-5     584,791    137,740  25     31       0.00  0.00  0.76  0.24  23,391.64    144,892
ARCO-20    95,840     15,690   53     78       0.04  0.09  0.40  0.47  1,808.30     41,626
ARCO-10    202,492    45,400   111    117      0.06  0.06  0.44  0.44  1,824.25     119,304
ARCO-5     655,089    198,674  196    192      0.10  0.10  0.40  0.40  3,342.29     471,031

Schema/TBox (class and property level) columns:

Dataset    TBox Classes  Disjoints  Subclass  ∃R.C  ∀R.C  Props  Domain  Range  Both  Functional
---------  ------------  ---------  --------  ----  ----  -----  ------  -----  ----  ----------
DB-50K-C   232           14         226       -     -     280    194     217    152   -
DB-100K-C  297           14         290       -     -     411    295     317    233   -
Y3-10-C    94,726        -          94,809    -     -     35     34      29     28    12
Y3-39K-C   46,892        -          46,960    -     -     35     34      29     28    10
Y4-20-C    1,462         9          1,665     -     -     69     68      56     56    6
ATRAVEL    100           12         184       17    52    71     57      61     47    16
WHOW-5     91            16         202       43    34    102    99      99     99    -
ARCO-20    315           14         873       181   366   454    348     352    271   62
ARCO-10    378           24         1,033     211   431   593    462     465    369   71
ARCO-5     438           26         1,153     228   474   683    546     551    450   84
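The 1to1, 1toN, Nto1, and NtoN columns and the Avg Triples column can be approximated from a list of triples. The listed figures look consistent with fractions of object properties (WHOW-5 has 25 properties, and 0.76 and 0.24 correspond to 19 and 6 of them) and with Avg Triples being triples divided by properties (28,525 / 275 ≈ 103.73 for DB-50K-C). The classification rule itself is not stated here, so the sketch below assumes a common heuristic from the link-prediction literature: a side counts as "N" when the average number of distinct entities on that side, per entity on the other side, reaches a threshold of 1.5. The function names and the threshold are illustrative assumptions, not the procedure used to build the table.

```python
from collections import Counter, defaultdict


def relation_categories(triples, threshold=1.5):
    """Label each relation as 1to1, 1toN, Nto1 or NtoN.

    Assumption: a side is "N" when the average number of distinct entities
    on that side, per entity on the other side, is at least `threshold`.
    """
    tails_of = defaultdict(lambda: defaultdict(set))  # relation -> head -> tails
    heads_of = defaultdict(lambda: defaultdict(set))  # relation -> tail -> heads
    for head, relation, tail in triples:
        tails_of[relation][head].add(tail)
        heads_of[relation][tail].add(head)

    categories = {}
    for relation in tails_of:
        by_head = tails_of[relation]
        by_tail = heads_of[relation]
        tails_per_head = sum(len(t) for t in by_head.values()) / len(by_head)
        heads_per_tail = sum(len(h) for h in by_tail.values()) / len(by_tail)
        head_side = "N" if heads_per_tail >= threshold else "1"
        tail_side = "N" if tails_per_head >= threshold else "1"
        categories[relation] = f"{head_side}to{tail_side}"
    return categories


def category_fractions(categories):
    """Share of relations falling in each of the four categories."""
    counts = Counter(categories.values())
    total = len(categories) or 1  # avoid division by zero on empty input
    return {name: counts[name] / total for name in ("1to1", "1toN", "Nto1", "NtoN")}


def average_triples_per_property(num_triples, num_properties):
    """Average number of triples per object property."""
    return num_triples / num_properties
```

Running `category_fractions(relation_categories(triples))` on one dataset's triples gives four shares that sum to 1, which can be compared against the corresponding row. Differences would suggest that the table was produced with another threshold or rule.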