# 🧭 Knowledge-Graph Embeddings: Training & Evaluation

This repository implements the Bachelor's thesis project:

> **DECONSTRUCTING KNOWLEDGE GRAPH DIFFICULTY: A FRAMEWORK FOR COMPLEXITY-AWARE EVALUATION AND GENERATION**

It trains and evaluates **knowledge-graph embedding (KGE)** models using **custom training logic**. We use **PyKEEN** for core building blocks (models, datasets, evaluators, losses, regularizers) but **do not use the PyKEEN pipeline**. **Hydra** powers configuration and multi-run sweeps.

---

## ✅ What's in this repo

```
Root/
├─ configs/
│  ├─ common/                 # run name, logging, save paths, resume/eval flags
│  │  └─ common.yaml
│  ├─ data/                   # dataset choices
│  │  ├─ data.yaml
│  │  ├─ fb15k.yaml
│  │  ├─ wn18.yaml
│  │  ├─ wn18rr.yaml
│  │  └─ yago3_10.yaml
│  ├─ model/                  # model choices & defaults
│  │  ├─ model.yaml
│  │  ├─ trans_e.yaml
│  │  ├─ trans_h.yaml
│  │  └─ trans_r.yaml
│  ├─ training/               # optimizer/lr/batch/steps + trainer class
│  │  ├─ training.yaml
│  │  └─ trans_e_trainer.yaml
│  ├─ model_trainers/         # (Hydra group) trainer implementations
│  │  ├─ model_trainer_base.py
│  │  └─ translation/{trans_e_trainer.py, trans_h_trainer.py, trans_r_trainer.py}
│  ├─ config.yaml             # Hydra defaults: common, data, model, training
│  ├─ trans_e_fb15k.yaml      # ready-made composed config
│  ├─ trans_e_wn18.yaml
│  ├─ trans_e_wn18rr.yaml
│  └─ trans_e_yago3_10.yaml
├─ data/                      # dataset wrappers + TSV helper
│  ├─ kg_dataset.py           # KGDataset + create_from_tsv(...)
│  ├─ wn18.py                 # WN18Dataset, WN18RRDataset
│  ├─ fb15k.py, yago3_10.py, openke_wiki.py, hationet.py, open_bio_link.py
├─ models/                    # minimal translation-based models
│  ├─ base_model.py
│  └─ translation/{trans_e.py, trans_h.py, trans_r.py}
├─ metrics/                   # complexity metrics and ranking metrics
│  ├─ c_swklf.py, wlcrec.py, wlec.py, greedy_crec.py, crec_radius_sample.py, ranking.py
├─ training/                  # Trainer orchestrating data/model/loop
│  └─ trainer.py
├─ tools/                     # logging, TB, sampling, checkpoints, params
│  ├─ pretty_logger.py, tb_handler.py, sampling.py, checkpoint_manager.py
│  └─ params.py               # CommonParams, TrainingParams dataclasses
├─ main.py                    # **single entrypoint** (@hydra.main)
├─ build_crec_datasets.py     # helper to tune/sample CREC subsets
├─ eval_datasets.py           # example: compute WL(C)REC over datasets
└─ pyproject.toml             # formatting/lint settings
```

## ✨ Highlights

- **No PyKEEN pipeline.** Models, samplers, and evaluators are instantiated directly; training runs through our wrappers around PyKEEN's `TrainingLoop`.
- **Hydra CLI.** One-line overrides, organized config groups, and multi-run sweeps (`-m`).
- **Datasets.** Built-ins or custom triples (TSV/CSV).
- **Reproducible outputs.** Each run gets a timestamped directory with the resolved config, checkpoints, metrics, and artifacts.
- **Extendable.** Add models/configs without touching the training loop.

## 🧰 Requirements

- Python 3.10+
- PyTorch
- PyKEEN
- Hydra Core + OmegaConf
- NumPy, einops
- lovely-tensors (optional pretty tensor prints)
- TensorBoard

## 🛠 Installation

> Python **3.10+** recommended. GPU optional but encouraged for larger graphs.

**Install script**

```bash
bash setup/install.sh
```

**Conda + pip**

```bash
conda create -n kge python=3.10 -y
conda activate kge
pip install -U pip wheel
pip install -e .
```
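Whichever install route you take (including the venv option below), a quick import check catches a broken environment before the first training run. This is a generic sanity snippet, not a script shipped with this repo:

```python
# Generic environment check (illustrative, not part of the repo).
import hydra
import omegaconf
import pykeen  # noqa: F401  (imported only to verify the install)
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("hydra", hydra.__version__, "| omegaconf", omegaconf.__version__)
```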
**Virtualenv/venv**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -U pip wheel
pip install -e .
```

**CUDA users:** install a PyTorch build matching your CUDA version (see pytorch.org) **before** the project dependencies.

Core dependencies: `torch`, `hydra-core`, `omegaconf`, `pykeen` (core), `pandas`, `numpy`.

---

## 🚀 Quick start

### 1) Single run (built-in dataset)

Train TransE on WN18RR:

```bash
python main.py model=trans_e data=wn18rr \
  training.batch_size=1024 training.lr=5e-4 \
  common.run_name=transe_wn18rr_bs1024_lr5e4
```

### 2) Use a composed config

Predefined composition for TransE on FB15K:

```bash
python main.py -cn trans_e_fb15k
# or: python main.py --config-name trans_e_fb15k
```

You can still override fields:

```bash
python main.py -cn trans_e_fb15k training.batch_size=2048 training.lr=1e-3
```

### 3) Evaluate only (load checkpoint)

Set `common.evaluate_only=true` and point `common.load_path` to a checkpoint. `CheckpointManager` supports two modes:

- a **full path** to a checkpoint file
- a **pair** `(model_id, iteration)` that resolves to `checkpoints/<model_id>/checkpoints/<iteration>.pt`

Examples:

```bash
# full path
python main.py common.evaluate_only=true \
  common.load_path=/absolute/path/to/checkpoints/2/checkpoints/1800.pt

# by components (model_id, iteration), passed as a YAML-style tuple
python main.py -cn trans_e_fb15k common.evaluate_only=true common.load_path="(2, 1800)"
```

### 4) Multi-run sweeps (Hydra `-m`)

```bash
# 3 models × 2 seeds = 6 runs
python main.py -m model=trans_e,trans_h,trans_r seed=1,2

# grid over batch size and LR
python main.py -m training.batch_size=512,1024 training.lr=5e-4,1e-3
```

---

## 📦 Datasets

Built-in wrappers (via PyKEEN) are provided for **WN18**, **WN18RR**, **FB15K**, and **YAGO3-10**. Select one with `data=<name>`, where `<name>` is one of `wn18`, `wn18rr`, `fb15k`, `yago3_10`.

### Custom triples from TSV

`data/kg_dataset.py` provides `create_from_tsv(root)`, which expects `train.txt`, `valid.txt`, and `test.txt` under `root/` (tab-separated: `head<TAB>relation<TAB>tail`).

To use this with Hydra, add a small config (e.g. `configs/data/custom.yaml`):

```yaml
# configs/data/custom.yaml
train:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [train]
valid:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [valid]
```

Then run with `data=custom`:

```bash
python main.py model=trans_e data=custom common.run_name=my_custom_kg
```

---

## 🧪 Metrics & Evaluation

- Ranking metrics via PyKEEN's `RankBasedEvaluator` (e.g., Hits@K and the inverse (harmonic) mean rank) are wired into the `training/trainer.py` loop.
- Complexity metrics live in `metrics/`: `WLCREC`, `WLEC`, `CSWKLF`, greedy CREC, and radius sampling.
- For a quick dataset-level report, see `eval_datasets.py` (prints WLEC-family metrics).

Set `common.evaluate_only=true` to run evaluation on a loaded model, as shown above.

---

## 📁 Logging, outputs, checkpoints

- **Hydra outputs**: `./outputs` (configured in `configs/config.yaml`).
- **TensorBoard**: `logs/` (see `tools/tb_handler.py`; open with `tensorboard --logdir logs`).
- **Checkpoints**: written to `common.save_dpath` by default (see `configs/common/common.yaml`). The `CheckpointManager` supports both absolute file paths and `(model_id, iteration)` addressing via `common.load_path`.
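To make the two `common.load_path` modes concrete, here is a minimal sketch of the resolution rule described above. The function name `resolve_load_path` and the `save_dpath` default are illustrative assumptions, not the actual `CheckpointManager` API:

```python
from pathlib import Path
from typing import Sequence, Union


def resolve_load_path(load_path: Union[str, Sequence], save_dpath: str = "checkpoints") -> Path:
    """Sketch of the two addressing modes described above (names are hypothetical).

    A (model_id, iteration) pair resolves to
    <save_dpath>/<model_id>/checkpoints/<iteration>.pt; anything else is
    treated as a full path to a checkpoint file.
    """
    if isinstance(load_path, (tuple, list)) and len(load_path) == 2:
        model_id, iteration = load_path
        return Path(save_dpath) / str(model_id) / "checkpoints" / f"{iteration}.pt"
    return Path(load_path)


# resolve_load_path((2, 1800))               -> checkpoints/2/checkpoints/1800.pt
# resolve_load_path("/abs/path/to/1800.pt")  -> /abs/path/to/1800.pt
```

Either form is what the evaluate-only examples above pass via `common.load_path`.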
---

## 🔧 Common knobs (cheat sheet)

```bash
# Model & dimensions
python main.py model=trans_e model.dim=200
python main.py model=trans_h model.dim=400
python main.py model=trans_r model.dim=500

# Training length & batches
python main.py training.num_train_steps=20000 training.batch_size=2048

# Learning rate & weight decay
python main.py training.lr=1e-3 training.weight_decay=1e-5

# Reproducibility
python main.py seed=1234
```

---

## 📚 Citation