Naser Kazemi 786ddee1f1 Updated README.md		4 months ago

🧭 Knowledge-Graph Embeddings — Training & Evaluation

This repository implements the Bachelor thesis project:

DECONSTRUCTING KNOWLEDGE GRAPH DIFFICULTY: A FRAMEWORK FOR COMPLEXITY‑AWARE EVALUATION AND GENERATION

It trains and evaluates knowledge‑graph embedding (KGE) models using custom training logic. We use PyKeen for core building blocks (models, datasets, evaluators, losses, regularizers) but do not use the PyKeen pipeline. Hydra powers configuration and multi‑run sweeps.

✅ What’s in this repo

Root/
├─ configs/
│  ├─ common/               # run name, logging, save paths, resume/eval flags
│  │  └─ common.yaml
│  ├─ data/                 # dataset choices
│  │  ├─ data.yaml
│  │  ├─ fb15k.yaml
│  │  ├─ wn18.yaml
│  │  ├─ wn18rr.yaml
│  │  └─ yago3_10.yaml
│  ├─ model/                # model choices & defaults
│  │  ├─ model.yaml
│  │  ├─ trans_e.yaml
│  │  ├─ trans_h.yaml
│  │  └─ trans_r.yaml
│  ├─ training/             # optimizer/lr/batch/steps + trainer class
│  │  ├─ training.yaml
│  │  └─ trans_e_trainer.yaml
│  ├─ model_trainers/       # (Hydra group) trainer implementations
│  │  ├─ model_trainer_base.py
│  │  └─ translation/{trans_e_trainer.py, trans_h_trainer.py, trans_r_trainer.py}
│  ├─ config.yaml           # Hydra defaults: common, data, model, training
│  ├─ trans_e_fb15k.yaml    # ready‑made composed config
│  ├─ trans_e_wn18.yaml
│  ├─ trans_e_wn18rr.yaml
│  └─ trans_e_yago3_10.yaml
├─ data/                    # dataset wrappers + TSV helper
│  ├─ kg_dataset.py         # KGDataset + create_from_tsv(...)
│  ├─ wn18.py               # WN18Dataset, WN18RRDataset
│  ├─ fb15k.py, yago3_10.py, openke_wiki.py, hationet.py, open_bio_link.py
├─ models/                  # minimal translation‑based models
│  ├─ base_model.py
│  └─ translation/{trans_e.py, trans_h.py, trans_r.py}
├─ metrics/                 # complexity metrics and ranking metrics
│  ├─ c_swklf.py, wlcrec.py, wlec.py, greedy_crec.py, crec_radius_sample.py, ranking.py
├─ training/                # Trainer orchestrating data/model/loop
│  └─ trainer.py
├─ tools/                   # logging, TB, sampling, checkpoints, params
│  ├─ pretty_logger.py, tb_handler.py, sampling.py, checkpoint_manager.py
│  └─ params.py             # CommonParams, TrainingParams dataclasses
├─ main.py                  # **single entrypoint** (@hydra.main)
├─ build_crec_datasets.py   # helper to tune/sample CREC subsets
├─ eval_datasets.py         # example: compute WL(C)REC over datasets
└─ pyproject.toml           # formatting/lint settings

✨ Highlights

No PyKeen pipeline. Models, samplers, and evaluators are instantiated directly; training runs through our wrappers around PyKeen’s TrainingLoop.
Hydra CLI. One-line overrides, organized config groups, and multi-run sweeps (-m).
Datasets. Built-ins or custom triples (TSV/CSV).
Reproducible outputs. Each run gets a timestamped directory with the resolved config, checkpoints, metrics, and artifacts.
Extendable. Add models/configs without touching the training loop.

🧰 Requirements

Python 3.10+
PyTorch
PyKeen
Hydra Core + OmegaConf
NumPy, einops
lovely‑tensors (optional pretty tensor prints)
TensorBoard

🛠 Installation

Python 3.10+ recommended. GPU optional but encouraged for larger graphs.

Install script

bash setup/install.sh

Conda + pip

conda create -n kge python=3.10 -y
conda activate kge
pip install -U pip wheel
pip install -e .

Virtualenv/venv

python -m venv .venv && source .venv/bin/activate
pip install -U pip wheel
pip install -e .

CUDA users: install a PyTorch build that matches your CUDA version (see pytorch.org) before installing the project dependencies.

Core dependencies: torch, hydra-core, omegaconf, pykeen (core), pandas, numpy.

🚀 Quick start

1) Single run (built‑in dataset)

Train TransE on WN18RR:

python main.py model=trans_e data=wn18rr training.batch_size=1024 training.lr=5e-4 common.run_name=transe_wn18rr_bs1024_lr5e4

2) Use a composed config

Predefined composition for TransE on FB15K:

python main.py -cn trans_e_fb15k
# or: python main.py --config-name trans_e_fb15k

You can still override fields:

python main.py -cn trans_e_fb15k training.batch_size=2048 training.lr=1e-3
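
To make the override syntax concrete, here is a minimal sketch of how dotted keys such as training.batch_size=2048 map onto a nested config. It is plain Python for illustration only: apply_overrides and the sample defaults are hypothetical, and the real merge is performed by Hydra/OmegaConf, which also handles typing, interpolation, and config groups.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied.

    Illustrative only: Hydra/OmegaConf performs the real merge.
    """
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        # Best-effort literal parsing: "2048" -> 2048, "1e-3" -> 0.001.
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # keep plain strings such as "trans_e"
        node = result
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Hypothetical defaults standing in for a composed config.
base = {"training": {"batch_size": 1024, "lr": 5e-4}}
merged = apply_overrides(base, ["training.batch_size=2048", "training.lr=1e-3"])
print(merged)
```

The sketch leaves the defaults untouched and returns a new dictionary, which mirrors how a composed config stays reusable across runs.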

3) Evaluate only (load checkpoint)

Set common.evaluate_only=true and point common.load_path to a checkpoint. CheckpointManager accepts either of two forms:

Full path to a checkpoint file
A pair (model_id, iteration) that resolves to checkpoints/<model_id>/checkpoints/<iteration>.pt

Examples:

# full path
python main.py common.evaluate_only=true common.load_path=/absolute/path/to/checkpoints/2/checkpoints/1800.pt

# by components (model_id, iteration), passed as a quoted tuple-style string
python main.py -cn trans_e_fb15k common.evaluate_only=true common.load_path="(2, 1800)"
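
The two addressing modes can be sketched as a small resolver. This is an assumption-laden illustration, not the repository's CheckpointManager: resolve_checkpoint is a hypothetical helper, and the only detail taken from this README is the layout checkpoints/<model_id>/checkpoints/<iteration>.pt.

```python
from pathlib import Path


def resolve_checkpoint(load_path, root="checkpoints"):
    """Map a load_path to a checkpoint file path.

    Accepts a path (str or Path) or a (model_id, iteration) pair.
    Hypothetical helper; the repository's CheckpointManager may differ.
    """
    if isinstance(load_path, (str, Path)):
        # Mode 1: a full path is used as given.
        return Path(load_path)
    # Mode 2: (model_id, iteration) is expanded into the documented layout.
    model_id, iteration = load_path
    return Path(root) / str(model_id) / "checkpoints" / f"{iteration}.pt"


print(resolve_checkpoint((2, 1800)).as_posix())
```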

4) Multi‑run sweeps (Hydra `-m`)

# 3 models × 2 seeds = 6 runs
python main.py -m model=trans_e,trans_h,trans_r seed=1,2

# grid over batch size and LR
python main.py -m training.batch_size=512,1024 training.lr=5e-4,1e-3
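
Hydra's default sweeper expands comma-separated values into their Cartesian product, one job per combination. The sketch below reproduces that expansion in plain Python to show where the "3 models × 2 seeds = 6 runs" count comes from; it does not call Hydra, and the printed command lines are only an illustration of the equivalent single runs.

```python
from itertools import product

# Sweep values copied from the first multi-run example above.
sweep = {
    "model": ["trans_e", "trans_h", "trans_r"],
    "seed": [1, 2],
}

keys = list(sweep)
# One dict of overrides per point in the grid.
jobs = [dict(zip(keys, combo)) for combo in product(*sweep.values())]

for i, job in enumerate(jobs):
    overrides = " ".join(f"{k}={v}" for k, v in job.items())
    print(f"job {i}: python main.py {overrides}")
```

Adding a third swept key multiplies the job count again, so large grids grow quickly.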

📦 Datasets

Built‑in wrappers (via PyKeen) are provided for WN18, WN18RR, FB15K, YAGO3‑10. Select with data=<name> where <name> is one of wn18, wn18rr, fb15k, yago3_10.

Custom triples from TSV

data/kg_dataset.py provides create_from_tsv(root) which expects train.txt, valid.txt, test.txt under root/ (tab‑separated: head<TAB>relation<TAB>tail). To use this with Hydra, add a small config (e.g. configs/data/custom.yaml):

# configs/data/custom.yaml
train:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [train]

valid:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [valid]

Then run with data=custom:

python main.py model=trans_e data=custom common.run_name=my_custom_kg
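
Before pointing a config at a custom directory, it can help to check that the files match the expected layout. The following sketch writes three tiny split files in the head<TAB>relation<TAB>tail format and reads them back with a strict column check. The triples are made up, read_triples is a hypothetical helper, and nothing here calls create_from_tsv itself.

```python
import csv
import tempfile
from pathlib import Path

# Made-up triples, purely for illustration.
splits = {
    "train": [("alice", "knows", "bob"), ("bob", "works_at", "acme")],
    "valid": [("alice", "works_at", "acme")],
    "test": [("bob", "knows", "alice")],
}

# Write train.txt, valid.txt, test.txt into a temporary root directory.
root = Path(tempfile.mkdtemp())
for name, triples in splits.items():
    with open(root / f"{name}.txt", "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(triples)


def read_triples(path):
    """Read (head, relation, tail) rows, failing on malformed lines."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = [tuple(row) for row in csv.reader(f, delimiter="\t") if row]
    for row in rows:
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
    return rows


for name in splits:
    print(name, read_triples(root / f"{name}.txt"))
```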

🧪 Metrics & Evaluation

Ranking metrics via PyKeen’s RankBasedEvaluator (e.g., Hits@K, I(H)MR) are wired into the training loop in training/trainer.py.
Complexity metrics in metrics/: WLCREC, WLEC, CSWKLF, greedy CREC, radius sampling.
For a quick dataset‑level report, see eval_datasets.py (prints WLEC‑family metrics).

Set common.evaluate_only=true to run evaluation on a loaded model as shown above.
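
For readers new to rank-based evaluation, the sketch below shows how mean rank, mean reciprocal rank (PyKeen's inverse harmonic mean rank), and Hits@K follow from a list of ranks. It is a simplified stand-in, not PyKeen's RankBasedEvaluator: ranking_metrics is a hypothetical helper, the ranks are made up, and filtering and tie handling are omitted.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hits@K from 1-based ranks of the true entities."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    n = len(ranks)
    metrics = {
        "mean_rank": sum(ranks) / n,
        "mean_reciprocal_rank": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        # Fraction of test triples whose true entity is ranked within the top k.
        metrics[f"hits_at_{k}"] = sum(r <= k for r in ranks) / n
    return metrics


# Made-up ranks for four test triples.
print(ranking_metrics([1, 2, 5, 20]))
```

Lower mean rank is better, while higher mean reciprocal rank and Hits@K are better; the single rank of 20 moves the mean rank far more than it moves the reciprocal-rank average.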

📝 Logging, outputs, checkpoints

Hydra outputs: ./outputs (configured in configs/config.yaml).
TensorBoard: logs/<run_name> (see tools/tb_handler.py; open with tensorboard --logdir logs).
Checkpoints: by default in common.save_dpath (see configs/common/common.yaml). The CheckpointManager supports both absolute file paths and (model_id, iteration) addressing via common.load_path.

🔧 Common knobs (cheat sheet)

# Model & dimensions
python main.py model=trans_e model.dim=200
python main.py model=trans_h model.dim=400
python main.py model=trans_r model.dim=500

# Training length & batches
python main.py training.num_train_steps=20000 training.batch_size=2048

# Learning rate & weight decay
python main.py training.lr=1e-3 training.weight_decay=1e-5

# Reproducibility
python main.py seed=1234
