# KGEvaluation

# 🧠 Knowledge-Graph Embeddings: Training & Evaluation

This repository implements the Bachelor thesis project:

> **DECONSTRUCTING KNOWLEDGE GRAPH DIFFICULTY: A FRAMEWORK FOR COMPLEXITY-AWARE EVALUATION AND GENERATION**

It trains and evaluates **knowledge-graph embedding (KGE)** models using **custom training logic**. We use **PyKEEN** for core building blocks (models, datasets, evaluators, losses, regularizers) but **do not use the PyKEEN pipeline**. **Hydra** powers configuration and multi-run sweeps.

---

## ✅ What's in this repo
```
Root/
├── configs/
│   ├── common/                 # run name, logging, save paths, resume/eval flags
│   │   └── common.yaml
│   ├── data/                   # dataset choices
│   │   ├── data.yaml
│   │   ├── fb15k.yaml
│   │   ├── wn18.yaml
│   │   ├── wn18rr.yaml
│   │   └── yago3_10.yaml
│   ├── model/                  # model choices & defaults
│   │   ├── model.yaml
│   │   ├── trans_e.yaml
│   │   ├── trans_h.yaml
│   │   └── trans_r.yaml
│   ├── training/               # optimizer/lr/batch/steps + trainer class
│   │   ├── training.yaml
│   │   └── trans_e_trainer.yaml
│   ├── model_trainers/         # (Hydra group) trainer implementations
│   │   ├── model_trainer_base.py
│   │   └── translation/{trans_e_trainer.py, trans_h_trainer.py, trans_r_trainer.py}
│   ├── config.yaml             # Hydra defaults: common, data, model, training
│   ├── trans_e_fb15k.yaml      # ready-made composed config
│   ├── trans_e_wn18.yaml
│   ├── trans_e_wn18rr.yaml
│   └── trans_e_yago3_10.yaml
├── data/                       # dataset wrappers + TSV helper
│   ├── kg_dataset.py           # KGDataset + create_from_tsv(...)
│   ├── wn18.py                 # WN18Dataset, WN18RRDataset
│   └── fb15k.py, yago3_10.py, openke_wiki.py, hationet.py, open_bio_link.py
├── models/                     # minimal translation-based models
│   ├── base_model.py
│   └── translation/{trans_e.py, trans_h.py, trans_r.py}
├── metrics/                    # complexity metrics and ranking metrics
│   └── c_swklf.py, wlcrec.py, wlec.py, greedy_crec.py, crec_radius_sample.py, ranking.py
├── training/                   # Trainer orchestrating data/model/loop
│   └── trainer.py
├── tools/                      # logging, TB, sampling, checkpoints, params
│   ├── pretty_logger.py, tb_handler.py, sampling.py, checkpoint_manager.py
│   └── params.py               # CommonParams, TrainingParams dataclasses
├── main.py                     # single entrypoint (@hydra.main)
├── build_crec_datasets.py      # helper to tune/sample CREC subsets
├── eval_datasets.py            # example: compute WL(C)REC over datasets
└── pyproject.toml              # formatting/lint settings
```
## ✨ Highlights

- **No PyKEEN pipeline.** Models, samplers, and evaluators are instantiated directly; training runs through our wrappers around PyKEEN's `TrainingLoop`.
- **Hydra CLI.** One-line overrides, organized config groups, and multi-run sweeps (`-m`).
- **Datasets.** Built-ins or custom triples (TSV/CSV).
- **Reproducible outputs.** Each run gets a timestamped directory with the resolved config, checkpoints, metrics, and artifacts.
- **Extensible.** Add models/configs without touching the training loop.
## 🧰 Requirements

- Python 3.10+
- PyTorch
- PyKEEN
- Hydra Core + OmegaConf
- NumPy, einops
- lovely-tensors (optional pretty tensor prints)
- TensorBoard
## 🛠 Installation

> Python **3.10+** recommended. GPU optional but encouraged for larger graphs.

**Install script**

```bash
bash setup/install.sh
```

**Conda + pip**

```bash
conda create -n kge python=3.10 -y
conda activate kge
pip install -U pip wheel
pip install -e .
```
**Virtualenv/venv**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -U pip wheel
pip install -e .
```
**CUDA users:** install a PyTorch build matching your CUDA version (see pytorch.org) **before** the project dependencies.

Core dependencies: `torch`, `hydra-core`, `omegaconf`, `pykeen` (core), `pandas`, `numpy`.
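As a quick sanity check after installing, a small stdlib-only sketch (not part of this repo; `dependency_report` is a hypothetical helper name) can report which core dependencies are actually importable and at what version:

```python
from importlib import metadata

def dependency_report(packages):
    """Map each distribution name to its installed version, or 'missing'."""
    report = {}
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "missing"
    return report

print(dependency_report(["torch", "pykeen", "hydra-core", "omegaconf", "numpy", "pandas"]))
```

Any entry reported as `missing` points at an incomplete environment before the first training run fails with an import error.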
---
## 🚀 Quick start

### 1) Single run (built-in dataset)

Train TransE on WN18RR:

```bash
python main.py model=trans_e data=wn18rr training.batch_size=1024 training.lr=5e-4 common.run_name=transe_wn18rr_bs1024_lr5e4
```
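Each `key=value` argument above is a Hydra override: the dotted key selects a node in the composed config tree. A toy stdlib illustration of that mapping (not Hydra itself; values are kept as plain strings here, whereas Hydra also resolves types):

```python
def parse_overrides(args):
    """Toy model of Hydra's dotted key=value overrides."""
    cfg = {}
    for arg in args:
        key, value = arg.split("=", 1)
        *parents, leaf = key.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating nested dicts
        node[leaf] = value
    return cfg

cfg = parse_overrides(["model=trans_e", "data=wn18rr",
                       "training.batch_size=1024", "training.lr=5e-4"])
```

So `training.lr=5e-4` ends up under `cfg["training"]["lr"]`, mirroring how the override lands in the `training` config group.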
### 2) Use a composed config

Predefined composition for TransE on FB15K:

```bash
python main.py -cn trans_e_fb15k
# or: python main.py --config-name trans_e_fb15k
```

You can still override fields:

```bash
python main.py -cn trans_e_fb15k training.batch_size=2048 training.lr=1e-3
```
### 3) Evaluate only (load checkpoint)

Set `common.evaluate_only=true` and point `common.load_path` to a checkpoint. Two modes are supported by `CheckpointManager`:

- **Full path** to a checkpoint file
- A **pair** `(model_id, iteration)` that resolves to `checkpoints/<model_id>/checkpoints/<iteration>.pt`

Examples:

```bash
# full path
python main.py common.evaluate_only=true common.load_path=/absolute/path/to/checkpoints/2/checkpoints/1800.pt

# by components (id, iter), passed as a YAML-style tuple
python main.py -cn trans_e_fb15k common.evaluate_only=true common.load_path="(2, 1800)"
```
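The two addressing modes can be pictured with a standalone sketch (an illustration of the resolution rule stated above, not the actual `CheckpointManager` code; `resolve_checkpoint` is a hypothetical name):

```python
from pathlib import Path

def resolve_checkpoint(load_path, root="checkpoints"):
    """A (model_id, iteration) pair resolves to
    <root>/<model_id>/checkpoints/<iteration>.pt; anything else
    is treated as a full path to a checkpoint file."""
    if isinstance(load_path, (tuple, list)):
        model_id, iteration = load_path
        return Path(root) / str(model_id) / "checkpoints" / f"{iteration}.pt"
    return Path(load_path)

resolve_checkpoint((2, 1800))               # checkpoints/2/checkpoints/1800.pt
resolve_checkpoint("/abs/path/to/1800.pt")  # used as-is
```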
### 4) Multi-run sweeps (Hydra `-m`)

```bash
# 3 models × 2 seeds = 6 runs
python main.py -m model=trans_e,trans_h,trans_r seed=1,2

# grid over batch size and LR
python main.py -m training.batch_size=512,1024 training.lr=5e-4,1e-3
```
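Hydra expands comma-separated override values into the Cartesian product of jobs, so the run count multiplies across swept keys. A quick stdlib sketch of the expansion for the first sweep above:

```python
from itertools import product

models = ["trans_e", "trans_h", "trans_r"]
seeds = [1, 2]

# one job per (model, seed) combination, as Hydra would launch them
jobs = [f"model={m} seed={s}" for m, s in product(models, seeds)]
len(jobs)  # 3 models × 2 seeds = 6 runs
```

Adding a third swept key multiplies the count again, so large grids grow fast.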
---
## 📦 Datasets

Built-in wrappers (via PyKEEN) are provided for **WN18**, **WN18RR**, **FB15K**, **YAGO3-10**. Select with `data=<name>` where `<name>` is one of `wn18`, `wn18rr`, `fb15k`, `yago3_10`.

### Custom triples from TSV

`data/kg_dataset.py` provides `create_from_tsv(root)`, which expects `train.txt`, `valid.txt`, `test.txt` under `root/` (tab-separated: `head<TAB>relation<TAB>tail`). To use this with Hydra, add a small config (e.g. `configs/data/custom.yaml`):
```yaml
# configs/data/custom.yaml
train:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [train]
valid:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [valid]
```

Then run with `data=custom`:

```bash
python main.py model=trans_e data=custom common.run_name=my_custom_kg
```
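The expected file format is one triple per line, tab-separated. A minimal stdlib sketch of parsing it (an independent illustration of the format, not the `create_from_tsv` implementation):

```python
import csv
import io

def read_triples(fh):
    """Parse head<TAB>relation<TAB>tail lines into (h, r, t) tuples."""
    return [tuple(row) for row in csv.reader(fh, delimiter="\t") if row]

# the same parsing would apply to open("train.txt") etc.
sample = "alice\tknows\tbob\nbob\tlives_in\tparis\n"
triples = read_triples(io.StringIO(sample))
```

Entity and relation labels are arbitrary strings; the dataset wrapper is responsible for mapping them to integer IDs.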
---
## 🧪 Metrics & Evaluation

- Ranking metrics via PyKEEN's `RankBasedEvaluator` (e.g., Hits@K, mean rank, and inverse harmonic mean rank, i.e. MRR) are wired into the `training/trainer.py` loop.
- Complexity metrics in `metrics/`: `WLCREC`, `WLEC`, `CSWKLF`, greedy CREC, radius sampling.
- For a quick dataset-level report, see `eval_datasets.py` (prints WLEC-family metrics).

Set `common.evaluate_only=true` to run evaluation on a loaded model as shown above.
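For intuition, the standard rank-based metrics reduce to simple formulas over the list of ranks assigned to the true entities (a generic sketch, not the PyKEEN implementation):

```python
def mrr(ranks):
    """Mean reciprocal rank (a.k.a. inverse harmonic mean rank)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of true entities ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 3, 10, 2]   # rank of the true entity in each query
mrr(ranks)              # ~0.483
hits_at_k(ranks, 3)     # 0.75
```

Higher is better for both; MRR rewards rank 1 much more than rank 10, while Hits@K only cares whether the true entity lands in the top K.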
---
## 📊 Logging, outputs, checkpoints

- **Hydra outputs**: `./outputs` (configured in `configs/config.yaml`).
- **TensorBoard**: `logs/<run_name>` (see `tools/tb_handler.py`; open with `tensorboard --logdir logs`).
- **Checkpoints**: by default in `common.save_dpath` (see `configs/common/common.yaml`). The `CheckpointManager` supports both absolute file paths and `(model_id, iteration)` addressing via `common.load_path`.
---
## 🔧 Common knobs (cheat sheet)

```bash
# Model & dimensions
python main.py model=trans_e model.dim=200
python main.py model=trans_h model.dim=400
python main.py model=trans_r model.dim=500

# Training length & batches
python main.py training.num_train_steps=20000 training.batch_size=2048

# Learning rate & weight decay
python main.py training.lr=1e-3 training.weight_decay=1e-5

# Reproducibility
python main.py seed=1234
```
---

## 📖 Citation