Tutorials and Guides
Loading and Analyzing a KG-SaF Dataset in PyTorch
This tutorial demonstrates how to load a Knowledge Graph dataset using kgsaf_jdex and inspect its components, classes, and object property hierarchies. You can find a full executable Notebook in tutorial/dataset_loader.ipynb. This example uses the APULIATRAVEL dataset.
Setup and Imports
Before loading the dataset, make sure your Python environment can access the library and required modules:
import sys
from pathlib import Path

sys.path.append(str(Path.cwd().parent))  # Adds the parent folder to the path

import json
import random

import kgsaf_jdex.utils.conventions.paths as pc
from kgsaf_jdex.loaders.pytorch.dataset import KnowledgeGraph
Tip
Adding the parent directory to sys.path allows Python to locate your kgsaf_jdex package if it is not installed system-wide.
Loading the Dataset
Use the KnowledgeGraph class to load a dataset from the folder created by the unpacking utility:
kg = KnowledgeGraph(
path="<DATASET_PATH>"
)
Inspect Dataset Components
You can quickly inspect the shapes of the main dataset components:
print(f"{'Dataset Component':<35} | {'Shape'}")
print("-" * 50)
print(f"{'Training triples':<35} | {kg.train.shape}")
print(f"{'Test triples':<35} | {kg.test.shape}")
print(f"{'Validation triples':<35} | {kg.valid.shape}")
print(f"{'Class assertions':<35} | {kg.class_assertions.shape}")
print(f"{'Taxonomy (TBox)':<35} | {kg.taxonomy.shape}")
print(f"{'Object property hierarchy':<35} | {kg.obj_props_hierarchy.shape}")
print(f"{'Object property domains':<35} | {kg.obj_props_domain.shape}")
print(f"{'Object property ranges':<35} | {kg.obj_props_range.shape}")
Dataset Component | Shape
--------------------------------------------------
Training triples | torch.Size([65401, 3])
Test triples | torch.Size([7695, 3])
Validation triples | torch.Size([3847, 3])
Class assertions | torch.Size([35915, 2])
Taxonomy (TBox) | torch.Size([54, 2])
Object property hierarchy | torch.Size([11, 2])
Object property domains | torch.Size([71, 2])
Object property ranges | torch.Size([69, 2])
Exploring Individuals, Classes, and Object Properties
You can select random elements for testing:
individual_uri_test = random.choice(list(kg._individual_to_id.keys()))
class_uri_test = random.choice(list(kg._class_to_id.keys()))
op_uri_test = random.choice(list(kg._obj_prop_to_id.keys()))
print(f"Testing on individual: {individual_uri_test}")
print(f"Testing on class: {class_uri_test}")
print(f"Testing on object property: {op_uri_test}")
Testing on individual: https://w3id.org/italia/onto/CLV/StreetToponym/31148_Castello_street_toponym
Testing on class: https://apuliatravel.org/td/HistoricPalace
Testing on object property: https://w3id.org/italia/onto/CLV/hasDirectHigherRank
Inspecting Class Assertions
Retrieve all classes associated with an individual:
cls = kg.individual_classes(kg.individual_to_id(individual_uri_test)).tolist()
print(f"Class assertions for [{kg.individual_to_id(individual_uri_test)}] {individual_uri_test}:")
for c in cls:
    print(f"\t[{c}] {kg.id_to_class(c)}")
Class assertions for [7157] https://w3id.org/italia/onto/CLV/StreetToponym/31148_Castello_street_toponym:
	[58] https://w3id.org/italia/onto/CLV/StreetToponym
Exploring Class Hierarchies
Check superclasses and subclasses of a given class:
sup_cls = kg.sup_classes(kg.class_to_id(class_uri_test)).tolist()
sub_cls = kg.sub_classes(kg.class_to_id(class_uri_test)).tolist()
print("Superclasses:")
for c in sup_cls:
print(f"\t[{c}] {kg.id_to_class(c)}")
print("Subclasses:")
for c in sub_cls:
print(f"\t[{c}] {kg.id_to_class(c)}")
Testing the Hierarchy of [84] https://w3id.org/italia/onto/TI/Year. Leaf class? True
Superclasses:
	[81] https://w3id.org/italia/onto/TI/TemporalEntity
Subclasses:
	(none)
Note
is_leaf(class_id) can be used to check if the class is a leaf in the hierarchy.
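For example, reusing the class selected earlier (a minimal sketch; it simply calls is_leaf with the integer id returned by class_to_id, as described in the note above):
class_id = kg.class_to_id(class_uri_test)
print(f"Is [{class_id}] {class_uri_test} a leaf class? {kg.is_leaf(class_id)}")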
Inspecting Object Property Hierarchies
Check super and sub properties, as well as domains and ranges:
sup_op = kg.sup_obj_prop(kg.obj_prop_to_id(op_uri_test)).tolist()
sub_op = kg.sub_obj_prop(kg.obj_prop_to_id(op_uri_test)).tolist()
print("Super Object Properties:")
for c in sup_op:
print(f"\t[{c}] {kg.id_to_obj_prop(c)}")
print("Sub Object Properties:")
for c in sub_op:
print(f"\t[{c}] {kg.id_to_obj_prop(c)}")
op_domain = kg.obj_prop_domain(kg.obj_prop_to_id(op_uri_test)).tolist()
op_range = kg.obj_prop_range(kg.obj_prop_to_id(op_uri_test)).tolist()
print("Domain:")
for c in op_domain:
    print(f"\t[{c}] {kg.id_to_class(c)}")
print("Range:")
for c in op_range:
    print(f"\t[{c}] {kg.id_to_class(c)}")
Testing the Role Hierarchy of [41] https://w3id.org/italia/onto/SM/hasEmailType
Super Object Properties:
	(none)
Sub Object Properties:
	(none)
Domain:
	[66] https://w3id.org/italia/onto/SM/Email
Range:
	[67] https://w3id.org/italia/onto/SM/EmailType
Running TransE on KG-SaF Datasets
This tutorial demonstrates how to train and evaluate a TransE knowledge graph embedding model
on a dataset prepared with the kgsaf_jdex loaders. You can find a full executable Notebook in tutorial/kge_pykeen.ipynb. The example uses the APULIATRAVEL dataset.
Note
Make sure the dataset is unpacked and the paths in pc (paths conventions) point to the correct files.
Imports
This code imports all necessary packages, including PyKEEN for knowledge graph embeddings and your kgsaf_jdex dataset utilities.
import sys
from pathlib import Path

# Add parent folder to Python path if needed, so kgsaf_jdex can be imported
sys.path.append(str(Path.cwd().parent))

import json

from pykeen.triples import TriplesFactory
from pykeen.evaluation import RankBasedEvaluator
from pykeen.pipeline import pipeline

import kgsaf_jdex.utils.conventions.paths as pc
Loading and Mapping Triples
Here we load entity and relation mappings from JSON files, and then build TriplesFactory objects for training, validation, and testing. PyKEEN uses these to manage KG data efficiently.
dataset_path = Path("<DATASET_PATH>")  # e.g. .../kgsaf_data/datasets/base/unpack/APULIATRAVEL-BASE
# Load entity and relation mappings
with open(dataset_path / pc.INDIVIDUAL_MAPPINGS, "r") as f:
entity_mapping = json.load(f)
with open(dataset_path / pc.OBJ_PROP_MAPPINGS, "r") as f:
relation_mapping = json.load(f)
# Create PyKEEN TriplesFactory objects
train_tf = TriplesFactory.from_path(
dataset_path / pc.TRAIN,
entity_to_id=entity_mapping,
relation_to_id=relation_mapping,
)
valid_tf = TriplesFactory.from_path(
dataset_path / pc.VALID,
entity_to_id=entity_mapping,
relation_to_id=relation_mapping,
)
test_tf = TriplesFactory.from_path(
dataset_path / pc.TEST,
entity_to_id=entity_mapping,
relation_to_id=relation_mapping,
)
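As a quick sanity check, you can print the size of each factory. This is a minimal sketch; the num_entities, num_relations, and num_triples attributes come from PyKEEN's TriplesFactory API:
print(f"Entities:  {train_tf.num_entities}")
print(f"Relations: {train_tf.num_relations}")
print(f"Triples (train / valid / test): "
      f"{train_tf.num_triples} / {valid_tf.num_triples} / {test_tf.num_triples}")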
Train TransE Model
This code trains a TransE model using the dataset we just loaded. The pipeline function handles everything: model initialization, training, validation, and testing.
result = pipeline(
model="TransE",
training=train_tf,
validation=valid_tf,
testing=test_tf,
model_kwargs=dict(embedding_dim=100),
training_kwargs=dict(num_epochs=25, batch_size=128),
device="cpu"
)
INFO:pykeen.pipeline.api:Using device: cpu
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
INFO:pykeen.nn.representation:Inferred unique=False for Embedding()
Training epochs on cpu: 100% 5/5 [00:30<00:00, 6.71s/epoch, loss=0.141, prev_loss=0.182]
Evaluating on cpu: 100% 7.70k/7.70k [01:09<00:00, 109triple/s]
INFO:pykeen.evaluation.evaluator:Evaluation took 69.78s seconds
Note
You can replace "TransE" with any other PyKEEN model, e.g., "DistMult", "ComplEx", etc.
embedding_dim, num_epochs, and batch_size can be adjusted depending on your dataset size.
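As an illustration, here is a hedged sketch of swapping in DistMult and saving the run; the hyperparameter values and the output directory below are only examples, not tuned settings:
result_dm = pipeline(
    model="DistMult",
    training=train_tf,
    validation=valid_tf,
    testing=test_tf,
    model_kwargs=dict(embedding_dim=200),
    training_kwargs=dict(num_epochs=50, batch_size=256),
    device="cpu",
)
# Persist the trained model, metrics, and configuration for later reuse
result_dm.save_to_directory("results/apuliatravel_distmult")  # illustrative path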
Evaluation
This code evaluates the trained model using filtered ranking metrics, such as MRR and Hits@K. Filtered evaluation ignores triples already seen in training/validation to avoid penalizing correct predictions.
evaluator = RankBasedEvaluator(filtered=True)
results = evaluator.evaluate(
model=result.model,
mapped_triples=test_tf.mapped_triples,
additional_filter_triples=[
train_tf.mapped_triples,
valid_tf.mapped_triples,
],
)
results.to_dict()
Evaluating on cpu: 100% 7.70k/7.70k [01:19<00:00, 86.9triple/s]
INFO:pykeen.evaluation.evaluator:Evaluation took 79.32s seconds
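If you only need a few headline numbers instead of the full dictionary, the result object also exposes get_metric; the metric names below are the usual PyKEEN aliases, so adjust them if your version resolves names differently:
mrr = results.get_metric("mean_reciprocal_rank")
hits_at_10 = results.get_metric("hits@10")
print(f"MRR: {mrr:.4f} | Hits@10: {hits_at_10:.4f}")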
Using KG-SaF-JDeX on Custom Knowledge Graphs
Warning
Please follow the Quick Start Guide before going forward.
Note
This guide is also available as a Notebook in tutorial/general.ipynb, an executable Python script in tutorial/general.py, and a shell script in tutorial/general.sh. For easier customization, we suggest following the Notebook.
This guide shows how to use the provided KG-SaF-JDeX Workflow Functionalities to generate a new Schema and Data dataset from your custom Knowledge Graph. An example dataset is provided in the INPUT folder for testing the functionalities. All produced files will be created in the OUTPUT folder.
This notebook expects the following inputs:
Any Knowledge Graph with both Schema and Data in a single file (any format supported by the ROBOT Utility; the file will be converted to an intermediate OWL file)
The KG needs to contain ABox assertions (object property assertions) in order to safely run the machine learning splitting and the necessary checks
This module is specifically designed for generating datasets from Knowledge Graphs that contain a rich schema and a large-scale ABox (object property assertions)
The following procedures from the KG-SaF-JDeX workflow are then applied:
Conversion to OWL Format
Consistency Check
Removal of Unsatisfiable Classes
Materialization and Realization
Filtering of ABox Individuals and Object Properties
Object Property Assertion splitting using the Coverage Criterion into Training, Test, and Validation splits
Inversion Leakage check and filtering
Class Assertions subset computation
Schema Modularization based on ABox Signature
Schema Decomposition into TBox and RBox (with subsequent division into Schema and Taxonomy)
Full cleaned Ontology and Knowledge Graph Reconstruction
Conversion and serialization of object property assertions to TSV format
Conversion and serialization of schema axioms to JSON format
Executable Script
Usage example:
python3 general.py \
--kg_file /path/to/input_kg.owl \
--output_path /path/to/output/folder \
--dataset_name MY_DATASET \
--robot_jar /path/to/robot.jar \
[--reasoner]
| Argument | Description |
|---|---|
| --kg_file | Path to your input KG (OWL/RDF file). |
| --output_path | Folder where the processed dataset and splits will be saved. |
| --dataset_name | Base name for the dataset. Reasoned datasets will append a suffix to this name. |
| --robot_jar | Path to the ROBOT JAR file (used for merging and reasoning). |
| --reasoner | Optional flag. If set, reasoning and unsatisfiable class removal will be performed. |
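After the script finishes, you can quickly verify what was produced with a short listing (a minimal sketch; the folder layout depends on the --output_path and --dataset_name you passed, so adjust the path accordingly):
from pathlib import Path

output_dir = Path("/path/to/output/folder")  # same value passed as --output_path
for item in sorted(output_dir.rglob("*")):
    print(item.relative_to(output_dir))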