**KG-SaF-Data**: Complete KGR Datasets
=================

KG-SaF-Data provides **complete and curated knowledge graph datasets** designed for **machine learning** and **reasoning tasks**. Each dataset comes with both a **schema (TBox)** and **instance data (ABox)**, along with **role definitions (RBox)**, making them ready for Knowledge Graph Refinement (KGR) research and embedding-based pipelines.

These datasets are compatible with **Python**, **PyTorch**, **PyKEEN**, and ontology editors like **Protege**.

## Dataset Structure

KG-SaF datasets follow a **Description Logic (DL) formalization**, organizing the knowledge graph into three main components:

- **ABox (Assertional Box):** Instance-level data (entities, class assertions, object property assertions)
- **TBox (Terminological Box):** Schema-level knowledge, including class hierarchies and axioms
- **RBox (Role Box):** Relationships and properties, including domain, range, and hierarchy

```{note}
📄 Files marked with this icon are **new serializations or variations** of the same data already available in OWL format (e.g., TSV or JSON representations), intended for easier use in ML pipelines.
```

Each dataset folder contains the following structure:

```
📁 abox ......................................... # Assertional Box (instance-level data)
│ ├── 📁 splits ................................. # Train/test/validation splits
│ │ ├── 🦉 train.nt ............................. # Training triples (N-Triples)
│ │ ├── 🦉 valid.nt ............................. # Validation triples (N-Triples)
│ │ ├── 🦉 test.nt .............................. # Test triples (N-Triples)
│ │ ├── 📄 train.tsv ............................ # Training triples (TSV)
│ │ ├── 📄 valid.tsv ............................ # Validation triples (TSV)
│ │ └── 📄 test.tsv ............................. # Test triples (TSV)
│ │
│ ├── 🦉 individuals.owl ........................ # Individuals definitions
│ ├── 🦉 class_assertions.owl ................... # Individuals class assertions (OWL)
│ ├── 📄 class_assertions.json .................. # Individuals class assertions (JSON)
│ │
│ ├── 🦉 obj_prop_assertions.nt ................. # Combined triples (N-Triples)
│ └── 📄 obj_prop_assertions.tsv ................ # Combined triples (TSV)

📁 rbox ......................................... # Role Box (relations and properties)
│ ├── 🦉 roles.owl .............................. # Role definitions
│ ├── 📄 roles_domain_range.json ................ # Domain and range of roles (JSON)
│ └── 📄 roles_hierarchy.json ................... # Role hierarchy (JSON)

📁 tbox ......................................... # Terminological Box (schema-level info)
│ ├── 🦉 classes.owl ............................ # Class non-taxonomical axioms
│ ├── 🦉 taxonomy.owl ........................... # Hierarchical taxonomy
│ └── 📄 taxonomy.json .......................... # Hierarchical taxonomy (JSON)

🦉 knowledge_graph.owl .......................... # Full merged TBox + RBox + ABox
🦉 ontology.owl ................................. # Core modularized schema

📁 mappings ..................................... # Mappings to IDs
│ ├── 🧾 class_to_id.json ....................... # Map ontology classes to IDs
│ ├── 🧾 individual_to_id.json .................. # Map entities/instances to IDs
│ └── 🧾 object_property_to_id.json ............. # Map object properties to IDs
```

## Available Ontologies

KG-SaF includes datasets derived from **well-known knowledge graphs** and domain-specific ontologies. Each ontology is provided as part of the dataset with **modularized TBox, RBox, and ABox components**.
- **DBpedia** – [Link](https://www.dbpedia.org/resources/ontology/)
  Large-scale multilingual knowledge graph extracted from Wikipedia. Contains general-purpose concepts like `Person`, `Place`, `Organisation`, and relationships between them.
- **YAGO3** – [Link](https://yago-knowledge.org/downloads/yago-3)
  Knowledge graph integrating Wikipedia, WordNet, and GeoNames. Known for high coverage and rich semantic types (`ALHIF+` DL fragment).
- **YAGO4** – [Link](https://yago-knowledge.org/downloads/yago-4)
  Updated version of YAGO with enhanced temporal and factual coverage, suitable for reasoning over large-scale datasets.
- **ArCo** – [Link](http://wit.istc.cnr.it/arco)
  Cultural heritage ontology and KG covering Italian assets. Supports advanced reasoning with `SROIQ` DL fragment.
- **WHOW** – [Link](https://whowproject.eu/)
  Ontology and KG from the WHOW (Water Health Open knoWledge) project, linking water quality and health data; designed for semantic reasoning and ML experiments.
- **ApuliaTravel** – [Link](https://github.com/rbarile17/ApuliaTravelKG)
  Domain-specific KG for tourism in Apulia, Italy. Models attractions, accommodations, and travel routes.

## Available Datasets

The table below lists the currently available **ontologies** and their corresponding **datasets** included in this resource.

```{note}
This table will be **updated** as new datasets and ontologies become available.
```

| Ontology      | Dataset            | DL Fragment |
|---------------|--------------------|-------------|
| DBpedia       | DBPEDIA25-50K-C    | ALCHF       |
| DBpedia       | DBPEDIA25-100K-C   | ALCHF       |
| YAGO3         | YAGO3-39K-C        | ALHIF+      |
| YAGO3         | YAGO3-10-C         | ALHIF+      |
| YAGO4         | YAGO4-20-C         | ALCHIF      |
| ArCo          | ARCO25-20          | SROIQ       |
| ArCo          | ARCO25-10          | SROIQ       |
| ArCo          | ARCO25-5           | SROIQ       |
| WHOW          | WHOW25-5           | SROIQ       |
| ApuliaTravel  | ATRAVEL            | SRIQ        |

## Dataset Statistics

### Table: Available datasets schema axiom coverage comparison

**Legend:**

- ✅ = axiom available in the dataset
- ❌ = axiom not available
- ⚠️ = axiom available in the KG but not in the dataset

| Axiom Type                    | DB100K-C | YAGO3-10-C | YAGO4-20-C | APULIA | WHOW-5 | ARCO-5 |
|-------------------------------|----------|------------|------------|--------|--------|--------|
| ClassAssertion                | ✅       | ✅         | ✅         | ✅     | ✅     | ✅     |
| SubClassOf                    | ✅       | ✅         | ✅         | ✅     | ✅     | ✅     |
| EquivalentClasses             | ❌       | ❌         | ❌         | ✅     | ❌     | ✅     |
| DisjointClasses               | ✅       | ❌         | ✅         | ✅     | ✅     | ✅     |
| UnionOf                       | ❌       | ❌         | ✅         | ✅     | ✅     | ✅     |
| IntersectionOf                | ❌       | ❌         | ❌         | ❌     | ❌     | ❌     |
| ComplementOf                  | ❌       | ❌         | ❌         | ✅     | ❌     | ✅     |
| Existential Restrictions      | ❌       | ❌         | ❌         | ✅     | ✅     | ✅     |
| Universal Restrictions        | ❌       | ❌         | ❌         | ✅     | ✅     | ✅     |
| Cardinality Restrictions      | ❌       | ❌         | ❌         | ✅     | ✅     | ✅     |
| ObjPropDomain                 | ✅       | ✅         | ✅         | ✅     | ✅     | ✅     |
| ObjPropRange                  | ✅       | ✅         | ✅         | ✅     | ✅     | ✅     |
| SubObjProp                    | ✅       | ✅         | ✅         | ✅     | ✅     | ✅     |
| InverseObjProp                | ❌       | ❌         | ✅         | ✅     | ✅     | ✅     |
| EquivalentObjProp             | ⚠️       | ❌         | ❌         | ✅     | ✅     | ✅     |
| ObjPropCharacteristic         | ⚠️       | ✅         | ✅         | ✅     | ✅     | ✅     |
| ObjPropChain                  | ❌       | ❌         | ✅         | ✅     | ✅     | ✅     |

### Table: Available datasets statistics

Dataset statistics including **ABox** (instance-level) and **Schema/TBox** (class & property-level) information.
For the object property structural columns (1to1, 1toN, Nto1, NtoN), the highest value in each row is shown in bold; ties are both bolded.

| Dataset      |   Triples |     Inds | Props (ABox) | Classes | 1to1  | 1toN  | Nto1  | NtoN  | Avg Triples | Class Assert. | TBox Classes | Disjoints | Subclass | ∃R.C | ∀R.C | Props (schema) | Domain | Range | Both | Functional |
|--------------|----------:|---------:|------:|--------:|:-----:|:-----:|:-----:|:-----:|------------:|---------------:|-------------:|----------:|---------:|:----:|:----:|------:|:------:|:-----:|:----:|:-----------:|
| DB-50K-C     |    28,525 |   22,268 |   275 |     169 | 0.21  | 0.08  | 0.33  | **0.38** |      103.73 |         12,419 |          232 |        14 |      226 |  -   |  -   |   280 |  194   |  217  | 152  |      -      |
| DB-100K-C    |   577,249 |   96,375 |   406 |     229 | 0.09  | 0.06  | 0.15  | **0.71** |    1,421.80 |         82,750 |          297 |        14 |      290 |  -   |  -   |   411 |  295   |  317  | 233  |      -      |
| Y3-10-C      | 1,080,398 |  123,038 |    34 |  92,539 | 0.00  | 0.00  | 0.12  | **0.88** |   31,776.41 |      1,309,964 |       94,726 |         - |   94,809 |  -   |  -   |    35 |   34   |  29   |  28  |     12      |
| Y3-39K-C     |   370,169 |   37,711 |    34 |  45,456 | 0.03  | 0.03  | 0.15  | **0.79** |   10,887.32 |              - |       46,892 |         - |   46,960 |  -   |  -   |    35 |   34   |  29   |  28  |     10      |
| Y4-20-C      |   653,988 |   91,904 |    68 |   1,433 | 0.07  | 0.07  | 0.13  | **0.72** |    9,617.47 |        116,141 |        1,462 |         9 |    1,665 |  -   |  -   |    69 |   68   |  56   |  56  |      6      |
| ATRAVEL      |    76,943 |   29,767 |    25 |      53 | **0.36** | 0.16  | **0.36** | 0.12  |    3,077.72 |         35,910 |          100 |        12 |      184 |  17  |  52  |    71 |   57   |  61   |  47  |     16      |
| WHOW-5       |   584,791 |  137,740 |    25 |      31 | 0.00  | 0.00  | **0.76** | 0.24  |   23,391.64 |        144,892 |           91 |        16 |      202 |  43  |  34  |   102 |   99   |  99   |  99  |      -      |
| ARCO-20      |    95,840 |   15,690 |    53 |      78 | 0.04  | 0.09  | 0.40  | **0.47** |    1,808.30 |         41,626 |          315 |        14 |      873 | 181  | 366  |   454 |  348   |  352  | 271  |     62      |
| ARCO-10      |   202,492 |   45,400 |   111 |     117 | 0.06  | 0.06  | **0.44** | **0.44** |    1,824.25 |        119,304 |          378 |        24 |    1,033 | 211  | 431  |   593 |  462   |  465  | 369  |     71      |
| ARCO-5       |   655,089 |  198,674 |   196 |     192 | 0.10  | 0.10  | **0.40** | **0.40** |    3,342.29 |        471,031 |          438 |        26 |    1,153 | 228  | 474  |   683 |  546   |  551  | 450  |     84      |
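
The 1to1, 1toN, Nto1, and NtoN columns can be approximated from any of the TSV triple files. The sketch below is illustrative only and rests on several assumptions that this README does not confirm: that the TSV files hold three tab-separated columns (head, relation, tail) without a header row, that categories follow the common 1.5 cut-off on average tails per head and heads per tail, and that shares are counted over relations rather than over triples. The helper names `read_triples`, `relation_category`, and `category_shares` are hypothetical and not part of KG-SaF.

```python
import csv
from collections import defaultdict
from pathlib import Path


def read_triples(path):
    """Read (head, relation, tail) triples from a tab-separated file.

    Assumption: three columns per row and no header line.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [tuple(row) for row in reader if len(row) == 3]


def relation_category(pairs, threshold=1.5):
    """Classify one relation from its (head, tail) pairs.

    Assumption: the widely used 1.5 cut-off; the rule behind the
    table in this README is not stated.
    """
    tails_of = defaultdict(set)
    heads_of = defaultdict(set)
    for head, tail in pairs:
        tails_of[head].add(tail)
        heads_of[tail].add(head)
    tails_per_head = sum(len(t) for t in tails_of.values()) / len(tails_of)
    heads_per_tail = sum(len(h) for h in heads_of.values()) / len(heads_of)
    head_side = "1" if heads_per_tail < threshold else "N"
    tail_side = "1" if tails_per_head < threshold else "N"
    return f"{head_side}to{tail_side}"


def category_shares(triples, threshold=1.5):
    """Return the fraction of relations that fall in each category."""
    pairs_by_relation = defaultdict(list)
    for head, relation, tail in triples:
        pairs_by_relation[relation].append((head, tail))
    counts = {"1to1": 0, "1toN": 0, "Nto1": 0, "NtoN": 0}
    if not pairs_by_relation:
        return {name: 0.0 for name in counts}
    for pairs in pairs_by_relation.values():
        counts[relation_category(pairs, threshold)] += 1
    total = len(pairs_by_relation)
    return {name: count / total for name, count in counts.items()}


# Toy triples with one relation per category (made-up identifiers).
toy = [
    ("a", "spouse", "b"), ("c", "spouse", "d"),
    ("a", "hasChild", "c"), ("a", "hasChild", "d"),
    ("b", "hasChild", "e"), ("b", "hasChild", "f"),
    ("a", "bornIn", "x"), ("b", "bornIn", "x"), ("c", "bornIn", "x"),
    ("a", "likes", "x"), ("a", "likes", "y"),
    ("b", "likes", "x"), ("b", "likes", "y"),
]
shares = category_shares(toy)
print(shares)  # {'1to1': 0.25, '1toN': 0.25, 'Nto1': 0.25, 'NtoN': 0.25}

# With a downloaded dataset, the same call would take a real file, e.g.:
# shares = category_shares(read_triples("abox/splits/train.tsv"))
```

Results from this sketch may differ from the published table if the authors used another threshold, another file, or a per-triple weighting.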