Tutorials and Guides

Loading and Analyzing a KG-SaF Dataset in PyTorch

This tutorial demonstrates how to load a Knowledge Graph dataset using kgsaf_jdex and inspect its components, classes, and object property hierarchies. A fully executable notebook is available in tutorial/dataset_loader.ipynb. The APULIATRAVEL dataset is used throughout this example.

Setup and Imports

Before loading the dataset, make sure your Python environment can access the library and required modules:

import sys
import json
import random
from pathlib import Path

sys.path.append(str(Path.cwd().parent))  # Add the parent folder to the path

import kgsaf_jdex.utils.conventions.paths as pc
from kgsaf_jdex.loaders.pytorch.dataset import KnowledgeGraph

Tip

Adding the parent directory to sys.path allows Python to locate your kgsaf_jdex package if it is not installed system-wide.

Loading the Dataset

Use the KnowledgeGraph class to load a dataset from the folder created by the unpacking utility:

kg = KnowledgeGraph(
    path="<DATASET_PATH>"
)

Inspect Dataset Components

You can quickly inspect the shapes of the main dataset components:

print(f"{'Dataset Component':<35} | {'Shape'}")
print("-" * 50)
print(f"{'Training triples':<35} | {kg.train.shape}")
print(f"{'Test triples':<35} | {kg.test.shape}")
print(f"{'Validation triples':<35} | {kg.valid.shape}")
print(f"{'Class assertions':<35} | {kg.class_assertions.shape}")
print(f"{'Taxonomy (TBox)':<35} | {kg.taxonomy.shape}")
print(f"{'Object property hierarchy':<35} | {kg.obj_props_hierarchy.shape}")
print(f"{'Object property domains':<35} | {kg.obj_props_domain.shape}")
print(f"{'Object property ranges':<35} | {kg.obj_props_range.shape}")
Dataset Component                   | Shape
--------------------------------------------------
Training triples                    | torch.Size([65401, 3])
Test triples                        | torch.Size([7695, 3])
Validation triples                  | torch.Size([3847, 3])
Class assertions                    | torch.Size([35915, 2])
Taxonomy (TBox)                     | torch.Size([54, 2])
Object property hierarchy           | torch.Size([11, 2])
Object property domains             | torch.Size([71, 2])
Object property ranges              | torch.Size([69, 2])
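
To connect these tensors back to URIs, you can decode a raw triple. A minimal sketch, assuming the usual (head, relation, tail) column order and that an id_to_individual accessor exists alongside the id_to_obj_prop and id_to_class accessors used later in this tutorial:

h, r, t = kg.train[0].tolist()  # first training triple, assumed (head, relation, tail)
print(f"head:     [{h}] {kg.id_to_individual(h)}")  # id_to_individual is assumed here
print(f"relation: [{r}] {kg.id_to_obj_prop(r)}")
print(f"tail:     [{t}] {kg.id_to_individual(t)}")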

Exploring Individuals, Classes, and Object Properties

You can select random elements for testing:

individual_uri_test = random.choice(list(kg._individual_to_id.keys()))
class_uri_test = random.choice(list(kg._class_to_id.keys()))
op_uri_test = random.choice(list(kg._obj_prop_to_id.keys()))

print(f"Testing on individual: {individual_uri_test}")
print(f"Testing on class: {class_uri_test}")
print(f"Testing on object property: {op_uri_test}")

Testing on individual: https://w3id.org/italia/onto/CLV/StreetToponym/31148_Castello_street_toponym
Testing on class: https://apuliatravel.org/td/HistoricPalace
Testing on object property: https://w3id.org/italia/onto/CLV/hasDirectHigherRank

Inspecting Class Assertions

Retrieve all classes associated with an individual:

cls = kg.individual_classes(kg.individual_to_id(individual_uri_test)).tolist()
print(f"Class assertions for [{kg.individual_to_id(individual_uri_test)}] {individual_uri_test}:")
for c in cls:
    print(f"\t[{c}] {kg.id_to_class(c)}")
Class assertions for [7157] https://w3id.org/italia/onto/CLV/StreetToponym/31148_Castello_street_toponym:
	[58] https://w3id.org/italia/onto/CLV/StreetToponym

Exploring Class Hierarchies

Check superclasses and subclasses of a given class:

sup_cls = kg.sup_classes(kg.class_to_id(class_uri_test)).tolist()
sub_cls = kg.sub_classes(kg.class_to_id(class_uri_test)).tolist()

print("Superclasses:")
for c in sup_cls:
    print(f"\t[{c}] {kg.id_to_class(c)}")

print("Subclasses:")
for c in sub_cls:
    print(f"\t[{c}] {kg.id_to_class(c)}")
For the class [84] https://w3id.org/italia/onto/TI/Year (a leaf class, so it has no subclasses) this prints:

Superclasses:
	[81] https://w3id.org/italia/onto/TI/TemporalEntity
Subclasses:

Note

is_leaf(class_id) can be used to check if the class is a leaf in the hierarchy.
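
For example, applied to the class sampled earlier:

cid = kg.class_to_id(class_uri_test)
print(f"Is {class_uri_test} a leaf class? {kg.is_leaf(cid)}")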

Inspecting Object Property Hierarchies

Check super- and sub-properties, as well as domains and ranges:

sup_op = kg.sup_obj_prop(kg.obj_prop_to_id(op_uri_test)).tolist()
sub_op = kg.sub_obj_prop(kg.obj_prop_to_id(op_uri_test)).tolist()

print("Super Object Properties:")
for c in sup_op:
    print(f"\t[{c}] {kg.id_to_obj_prop(c)}")

print("Sub Object Properties:")
for c in sub_op:
    print(f"\t[{c}] {kg.id_to_obj_prop(c)}")

domain = kg.obj_prop_domain(kg.obj_prop_to_id(op_uri_test)).tolist()
op_range = kg.obj_prop_range(kg.obj_prop_to_id(op_uri_test)).tolist()  # avoid shadowing the built-in range

print("Domain:")
for c in domain:
    print(f"\t[{c}] {kg.id_to_class(c)}")

print("Range:")
for c in op_range:
    print(f"\t[{c}] {kg.id_to_class(c)}")

For the property [41] https://w3id.org/italia/onto/SM/hasEmailType, which has no super- or sub-properties, this prints:

Super Object Properties:
Sub Object Properties:
Domain:
	[66] https://w3id.org/italia/onto/SM/Email
Range:
	[67] https://w3id.org/italia/onto/SM/EmailType
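
These accessors compose naturally into small helpers. A minimal sketch, built only from methods shown above, that prints the full signature of an object property:

def print_signature(kg, op_uri):
    # Resolve the property URI to an ID, then map domain/range IDs back to class URIs
    op_id = kg.obj_prop_to_id(op_uri)
    domains = [kg.id_to_class(c) for c in kg.obj_prop_domain(op_id).tolist()]
    ranges = [kg.id_to_class(c) for c in kg.obj_prop_range(op_id).tolist()]
    print(f"{op_uri}")
    print(f"\tdomain: {domains}")
    print(f"\trange:  {ranges}")

print_signature(kg, op_uri_test)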

Running TransE on KG-SaF Datasets

This tutorial demonstrates how to train and evaluate a TransE knowledge graph embedding model on a dataset prepared with the kgsaf_jdex loaders. A fully executable notebook is available in tutorial/kge_pykeen.ipynb. The example uses the APULIATRAVEL dataset.

Note

Make sure the dataset is unpacked and the paths in pc (paths conventions) point to the correct files.

Imports

This code imports all necessary packages, including PyKEEN for knowledge graph embeddings and your kgsaf_jdex dataset utilities.

import sys
import json
from pathlib import Path

# Add the parent folder to the Python path if needed, before importing kgsaf_jdex
sys.path.append(str(Path.cwd().parent))

from pykeen.triples import TriplesFactory
from pykeen.evaluation import RankBasedEvaluator
from pykeen.pipeline import pipeline

import kgsaf_jdex.utils.conventions.paths as pc

Loading and Mapping Triples

Here we load entity and relation mappings from JSON files, and then build TriplesFactory objects for training, validation, and testing. PyKEEN uses these to manage KG data efficiently.

# Adjust this path to your local unpacked dataset folder
dataset_path = Path("/home/navis/dev/kg-saf/kgsaf_data/datasets/base/unpack/APULIATRAVEL-BASE")

# Load entity and relation mappings
with open(dataset_path / pc.INDIVIDUAL_MAPPINGS, "r") as f:
    entity_mapping = json.load(f)

with open(dataset_path / pc.OBJ_PROP_MAPPINGS, "r") as f:
    relation_mapping = json.load(f)

# Create PyKEEN TriplesFactory objects
train_tf = TriplesFactory.from_path(
    dataset_path / pc.TRAIN,
    entity_to_id=entity_mapping,
    relation_to_id=relation_mapping,
)

valid_tf = TriplesFactory.from_path(
    dataset_path / pc.VALID,
    entity_to_id=entity_mapping,
    relation_to_id=relation_mapping,
)

test_tf = TriplesFactory.from_path(
    dataset_path / pc.TEST,
    entity_to_id=entity_mapping,
    relation_to_id=relation_mapping,
)
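
Before training, it is worth confirming that the factories line up with the split sizes reported by the PyTorch loader. A quick check using PyKEEN's standard TriplesFactory attributes:

# Each factory should report the same number of triples as the corresponding split
for name, tf in [("train", train_tf), ("valid", valid_tf), ("test", test_tf)]:
    print(f"{name}: {tf.num_triples} triples, {tf.num_entities} entities, {tf.num_relations} relations")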

Train TransE Model

This code trains a TransE model using the dataset we just loaded. The pipeline function handles everything: model initialization, training, validation, and testing.

result = pipeline(
    model="TransE",
    training=train_tf,
    validation=valid_tf,
    testing=test_tf,
    model_kwargs=dict(embedding_dim=100),
    training_kwargs=dict(num_epochs=5, batch_size=128),
    device="cpu"
)
INFO:pykeen.pipeline.api:Using device: cpu
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
Training epochs on cpu: 100% 5/5 [00:30<00:00, 6.71s/epoch, loss=0.141, prev_loss=0.182]
Evaluating on cpu: 100% 7.70k/7.70k [01:09<00:00, 109 triple/s]
INFO:pykeen.evaluation.evaluator:Evaluation took 69.78s seconds

Note

You can replace "TransE" with any other PyKEEN model, e.g. "DistMult" or "ComplEx". The embedding_dim, num_epochs, and batch_size parameters can be adjusted depending on your dataset size.
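
For instance, swapping in DistMult only requires changing the model argument; the factories and remaining settings can stay as they are (a sketch reusing the objects built above):

result_dm = pipeline(
    model="DistMult",  # any model name from the PyKEEN registry
    training=train_tf,
    validation=valid_tf,
    testing=test_tf,
    model_kwargs=dict(embedding_dim=100),
    training_kwargs=dict(num_epochs=5, batch_size=128),
    device="cpu",
)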

Evaluation

This code evaluates the trained model using filtered ranking metrics such as MRR and Hits@K. Filtered evaluation removes triples already known from training/validation from the candidate rankings, so the model is not penalized for ranking other correct answers highly.

evaluator = RankBasedEvaluator(filtered=True)

results = evaluator.evaluate(
    model=result.model,
    mapped_triples=test_tf.mapped_triples,
    additional_filter_triples=[
        train_tf.mapped_triples,
        valid_tf.mapped_triples,
    ],
)
results.to_dict()
Evaluating on cpu: 100% 7.70k/7.70k [01:19<00:00, 86.9 triple/s]
INFO:pykeen.evaluation.evaluator:Evaluation took 79.32s seconds
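
Rather than dumping the full dictionary, individual metrics can be pulled out directly. A sketch assuming a recent PyKEEN version, in which get_metric resolves shorthand metric names:

# Shorthand names are resolved by PyKEEN's metric resolver
print("MRR:    ", results.get_metric("mrr"))
print("Hits@10:", results.get_metric("hits@10"))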

Using KG-SaF-JDeX on Custom Knowledge Graphs

Warning

Please follow the Quick Start Guide before going forward.

Note

This guide is also available as a notebook in tutorial/general.ipynb, as a Python script in tutorial/general.py, and as a shell script in tutorial/general.sh. For easier customization, we suggest following the notebook.

This guide shows how to use the provided KG-SaF-JDeX workflow functionalities to generate a new Schema and Data dataset from your custom Knowledge Graph. An example dataset is provided in the INPUT folder for testing the functionalities; all produced files will be created in the OUTPUT folder.

The notebook expects the following inputs:

  • Any Knowledge Graph with both schema and data in a single file (any format supported by the ROBOT utility; the file will be converted to an intermediate OWL file)

  • The KG must contain ABox assertions (object property assertions) so that the machine learning splitting and the necessary checks can run safely

  • This module is specifically designed for generating datasets from Knowledge Graphs that contain a rich schema and a large-scale ABox (object property assertions)

It then applies the following procedures, following the KG-SaF-JDeX workflow:

  • Conversion to OWL Format

  • Consistency Check

  • Removal of Unsatisfiable Classes

  • Materialization and Realization

  • Filtering of ABox Individuals and Object Properties

  • Object Property Assertion splitting into Training, Test, and Validation splits using the Coverage Criterion

  • Inversion Leakage check and filtering

  • Class Assertions subset computation

  • Schema Modularization based on ABox Signature

  • Schema Decomposition into TBox and RBox (with subsequent division into Schema and Taxonomy)

  • Full cleaned Ontology and Knowledge Graph Reconstruction

  • Conversion and serialization of object property assertions to TSV format

  • Conversion and serialization of schema axioms to JSON format

Executable Script

Usage example:

python3 general.py \
    --kg_file /path/to/input_kg.owl \
    --output_path /path/to/output/folder \
    --dataset_name MY_DATASET \
    --robot_jar /path/to/robot.jar \
    [--reasoner]

Argument        | Description
----------------|----------------------------------------------------------------
--kg_file       | Path to your input KG (OWL/RDF file).
--output_path   | Folder where the processed dataset and splits will be saved.
--dataset_name  | Base name for the dataset; _reasoned is appended when reasoning is enabled, _base otherwise.
--robot_jar     | Path to the ROBOT JAR file (used for merging and reasoning).
--reasoner      | Optional flag. If set, reasoning and unsatisfiable class removal will be performed.