This repository implements the Bachelor's thesis project:
DECONSTRUCTING KNOWLEDGE GRAPH DIFFICULTY: A FRAMEWORK FOR COMPLEXITY-AWARE EVALUATION AND GENERATION
It trains and evaluates knowledge-graph embedding (KGE) models using custom training logic. We use PyKeen for core building blocks (models, datasets, evaluators, losses, regularizers) but do not use the PyKeen pipeline. Hydra powers configuration and multi-run sweeps.
Root/
├─ configs/
│  ├─ common/                  # run name, logging, save paths, resume/eval flags
│  │  └─ common.yaml
│  ├─ data/                    # dataset choices
│  │  ├─ data.yaml
│  │  ├─ fb15k.yaml
│  │  ├─ wn18.yaml
│  │  ├─ wn18rr.yaml
│  │  └─ yago3_10.yaml
│  ├─ model/                   # model choices & defaults
│  │  ├─ model.yaml
│  │  ├─ trans_e.yaml
│  │  ├─ trans_h.yaml
│  │  └─ trans_r.yaml
│  ├─ training/                # optimizer/lr/batch/steps + trainer class
│  │  ├─ training.yaml
│  │  └─ trans_e_trainer.yaml
│  ├─ model_trainers/          # (Hydra group) trainer implementations
│  │  ├─ model_trainer_base.py
│  │  └─ translation/{trans_e_trainer.py, trans_h_trainer.py, trans_r_trainer.py}
│  ├─ config.yaml              # Hydra defaults: common, data, model, training
│  ├─ trans_e_fb15k.yaml       # ready-made composed configs
│  ├─ trans_e_wn18.yaml
│  ├─ trans_e_wn18rr.yaml
│  └─ trans_e_yago3_10.yaml
├─ data/                       # dataset wrappers + TSV helper
│  ├─ kg_dataset.py            # KGDataset + create_from_tsv(...)
│  ├─ wn18.py                  # WN18Dataset, WN18RRDataset
│  └─ fb15k.py, yago3_10.py, openke_wiki.py, hationet.py, open_bio_link.py
├─ models/                     # minimal translation-based models
│  ├─ base_model.py
│  └─ translation/{trans_e.py, trans_h.py, trans_r.py}
├─ metrics/                    # complexity metrics and ranking metrics
│  └─ c_swklf.py, wlcrec.py, wlec.py, greedy_crec.py, crec_radius_sample.py, ranking.py
├─ training/                   # Trainer orchestrating data/model/loop
│  └─ trainer.py
├─ tools/                      # logging, TB, sampling, checkpoints, params
│  ├─ pretty_logger.py, tb_handler.py, sampling.py, checkpoint_manager.py
│  └─ params.py                # CommonParams, TrainingParams dataclasses
├─ main.py                     # **single entrypoint** (@hydra.main)
├─ build_crec_datasets.py      # helper to tune/sample CREC subsets
├─ eval_datasets.py            # example: compute WL(C)REC over datasets
└─ pyproject.toml              # formatting/lint settings
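As a sketch of how the Hydra composition works, configs/config.yaml declares one default per config group (assumed layout; consult the actual file):

```yaml
# Assumed shape of configs/config.yaml: pick one option from each config group.
defaults:
  - common: common
  - data: data
  - model: model
  - training: training
```

Any group member (e.g. data=wn18rr) can then be swapped in from the command line.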
The training loop is custom code (PyKeen's TrainingLoop is not used); Hydra handles configuration and multi-run sweeps (-m).

Python 3.10+ recommended. GPU optional but encouraged for larger graphs.
Install script
bash setup/install.sh
Conda + pip
conda create -n kge python=3.10 -y
conda activate kge
pip install -U pip wheel
pip install -e .
Virtualenv/venv
python -m venv .venv && source .venv/bin/activate
pip install -U pip wheel
pip install -e .
CUDA users: install a PyTorch build matching your CUDA (see pytorch.org) before project deps.
Core dependencies: torch, hydra-core, omegaconf, pykeen (core), pandas, numpy.
Train TransE on WN18RR:
python main.py model=trans_e data=wn18rr training.batch_size=1024 training.lr=5e-4 common.run_name=transe_wn18rr_bs1024_lr5e4
Predefined composition for TransE on FB15K:
python main.py -cn trans_e_fb15k
# or: python main.py --config-name trans_e_fb15k
You can still override fields:
python main.py -cn trans_e_fb15k training.batch_size=2048 training.lr=1e-3
Set common.evaluate_only=true and point common.load_path to a checkpoint. Two modes are supported by CheckpointManager:
- a full filesystem path to a .pt file, or
- a (model_id, iteration) tuple that resolves to checkpoints/<model_id>/checkpoints/<iteration>.pt
Examples:
# full path
python main.py common.evaluate_only=true common.load_path=/absolute/path/to/checkpoints/2/checkpoints/1800.pt
# by components (id, iter) as a YAML-style tuple
python main.py -cn trans_e_fb15k common.evaluate_only=true common.load_path="(2, 1800)"
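A hedged sketch of how such a (model_id, iteration) tuple could resolve to a file path (the actual logic lives in tools/checkpoint_manager.py and may differ):

```python
from pathlib import Path

def resolve_load_path(load_path, ckpt_root="checkpoints"):
    """Resolve a common.load_path value into a concrete checkpoint file.

    Accepts either a path string or a (model_id, iteration) pair,
    mirroring the two CheckpointManager modes described above.
    """
    if isinstance(load_path, (tuple, list)):
        model_id, iteration = load_path
        return Path(ckpt_root) / str(model_id) / "checkpoints" / f"{iteration}.pt"
    return Path(load_path)

# (2, 1800) -> checkpoints/2/checkpoints/1800.pt
```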
Hydra multi-run sweeps use the -m flag:
# 3 models × 2 seeds = 6 runs
python main.py -m model=trans_e,trans_h,trans_r seed=1,2
# grid over batch size and LR
python main.py -m training.batch_size=512,1024 training.lr=5e-4,1e-3
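A multi-run sweep expands to the Cartesian product of the listed values; a quick illustration in plain Python (not part of the repo):

```python
from itertools import product

models = ["trans_e", "trans_h", "trans_r"]
seeds = [1, 2]

# Hydra launches one run per combination: 3 models × 2 seeds = 6 runs.
runs = [{"model": m, "seed": s} for m, s in product(models, seeds)]
print(len(runs))  # 6
```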
Built-in wrappers (via PyKeen) are provided for WN18, WN18RR, FB15K, YAGO3-10. Select with data=<name>, where <name> is one of wn18, wn18rr, fb15k, yago3_10.
data/kg_dataset.py provides create_from_tsv(root), which expects train.txt, valid.txt, test.txt under root/ (tab-separated: head<TAB>relation<TAB>tail). To use this with Hydra, add a small config (e.g. configs/data/custom.yaml):
# configs/data/custom.yaml
train:
_target_: data.kg_dataset.create_from_tsv
root: /absolute/path/to/mykg
splits: [train]
valid:
_target_: data.kg_dataset.create_from_tsv
root: /absolute/path/to/mykg
splits: [valid]
Then run with data=custom:
python main.py model=trans_e data=custom common.run_name=my_custom_kg
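For reference, loading such a split boils down to parsing tab-separated triples; a minimal sketch (hypothetical helper, not the repo's create_from_tsv):

```python
from pathlib import Path

def read_triples(root, split="train"):
    """Read head<TAB>relation<TAB>tail lines from root/<split>.txt."""
    triples = []
    for line in Path(root, f"{split}.txt").read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        head, relation, tail = line.split("\t")
        triples.append((head, relation, tail))
    return triples
```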
- Ranking metrics from PyKeen's RankBasedEvaluator (e.g., Hits@K, I(H)MR) are wired into the training/trainer.py loop.
- metrics/ contains the complexity metrics: WLCREC, WLEC, CSWKLF, greedy CREC, radius sampling.
- eval_datasets.py prints WLEC-family metrics over datasets.
- Set common.evaluate_only=true to run evaluation on a loaded model as shown above.
- Hydra run outputs go to ./outputs (configured in configs/config.yaml).
- TensorBoard logs go to logs/<run_name> (see tools/tb_handler.py; open with tensorboard --logdir logs).
- Checkpoints are saved under common.save_dpath (see configs/common/common.yaml). The CheckpointManager supports both absolute file paths and (model_id, iteration) addressing via common.load_path.

# Model & dimensions
python main.py model=trans_e model.dim=200
python main.py model=trans_h model.dim=400
python main.py model=trans_r model.dim=500
# Training length & batches
python main.py training.num_train_steps=20000 training.batch_size=2048
# Learning rate & weight decay
python main.py training.lr=1e-3 training.weight_decay=1e-5
# Reproducibility
python main.py seed=1234
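For orientation when choosing model.dim: the TransE family scores a triple by translation distance, roughly score(h, r, t) = -||h + r - t||. A minimal plain-Python sketch (illustrative only; the repo's models in models/translation/ operate on torch tensors):

```python
import math

def trans_e_score(h, r, t):
    """Negative Euclidean distance between h + r and t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# A relation vector that translates the head exactly onto the tail
# yields the maximum score of 0.0:
trans_e_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
```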