|
# KGEvaluation
|
|
|
# 🧠 Knowledge-Graph Embeddings – Training & Evaluation
|
|
|
|
|
|
|
This repository implements the Bachelor thesis project:
|
|
|
|
|
|
|
> **DECONSTRUCTING KNOWLEDGE GRAPH DIFFICULTY: A FRAMEWORK FOR COMPLEXITY-AWARE EVALUATION AND GENERATION**
|
|
|
|
|
|
|
It trains and evaluates **knowledge-graph embedding (KGE)** models using **custom training logic**. We use **PyKEEN** for core building blocks (models, datasets, evaluators, losses, regularizers) but **do not use the PyKEEN pipeline**. **Hydra** powers configuration and multi-run sweeps.
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
## 📂 What's in this repo
|
|
|
|
|
|
|
```
Root/
├── configs/
│   ├── common/              # run name, logging, save paths, resume/eval flags
│   │   └── common.yaml
│   ├── data/                # dataset choices
│   │   ├── data.yaml
│   │   ├── fb15k.yaml
│   │   ├── wn18.yaml
│   │   ├── wn18rr.yaml
│   │   └── yago3_10.yaml
│   ├── model/               # model choices & defaults
│   │   ├── model.yaml
│   │   ├── trans_e.yaml
│   │   ├── trans_h.yaml
│   │   └── trans_r.yaml
│   ├── training/            # optimizer/lr/batch/steps + trainer class
│   │   ├── training.yaml
│   │   └── trans_e_trainer.yaml
│   ├── model_trainers/      # (Hydra group) trainer implementations
│   │   ├── model_trainer_base.py
│   │   └── translation/{trans_e_trainer.py, trans_h_trainer.py, trans_r_trainer.py}
│   ├── config.yaml          # Hydra defaults: common, data, model, training
│   ├── trans_e_fb15k.yaml   # ready-made composed config
│   ├── trans_e_wn18.yaml
│   ├── trans_e_wn18rr.yaml
│   └── trans_e_yago3_10.yaml
├── data/                    # dataset wrappers + TSV helper
│   ├── kg_dataset.py        # KGDataset + create_from_tsv(...)
│   ├── wn18.py              # WN18Dataset, WN18RRDataset
│   └── fb15k.py, yago3_10.py, openke_wiki.py, hationet.py, open_bio_link.py
├── models/                  # minimal translation-based models
│   ├── base_model.py
│   └── translation/{trans_e.py, trans_h.py, trans_r.py}
├── metrics/                 # complexity metrics and ranking metrics
│   └── c_swklf.py, wlcrec.py, wlec.py, greedy_crec.py, crec_radius_sample.py, ranking.py
├── training/                # Trainer orchestrating data/model/loop
│   └── trainer.py
├── tools/                   # logging, TB, sampling, checkpoints, params
│   ├── pretty_logger.py, tb_handler.py, sampling.py, checkpoint_manager.py
│   └── params.py            # CommonParams, TrainingParams dataclasses
├── main.py                  # single entrypoint (@hydra.main)
├── build_crec_datasets.py   # helper to tune/sample CREC subsets
├── eval_datasets.py         # example: compute WL(C)REC over datasets
└── pyproject.toml           # formatting/lint settings
```
|
|
|
|
|
|
|
## ✨ Highlights
|
|
|
|
|
|
|
- **No PyKEEN pipeline.** Models, samplers, and evaluators are instantiated directly; training runs through our wrappers around PyKEEN's `TrainingLoop`.
- **Hydra CLI.** One-line overrides, organized config groups, and multi-run sweeps (`-m`).
- **Datasets.** Built-ins or custom triples (TSV/CSV).
- **Reproducible outputs.** Each run gets a timestamped directory with the resolved config, checkpoints, metrics, and artifacts.
- **Extendable.** Add models/configs without touching the training loop.
|
|
|
|
|
|
|
|
|
|
|
## 🧰 Requirements
|
|
|
|
|
|
|
- Python 3.10+
- PyTorch
- PyKEEN
- Hydra Core + OmegaConf
- NumPy, einops
- lovely-tensors (optional pretty tensor prints)
- TensorBoard
|
|
|
|
|
|
|
## 📥 Installation
|
|
|
|
|
|
|
> Python **3.10+** recommended. GPU optional but encouraged for larger graphs.
|
|
|
|
|
|
|
**Install script**

```bash
bash setup/install.sh
```
|
|
|
|
|
|
|
|
|
|
|
**Conda + pip**

```bash
conda create -n kge python=3.10 -y
conda activate kge
pip install -U pip wheel
pip install -e .
```
|
|
|
|
|
|
|
**Virtualenv/venv**

```bash
python -m venv .venv && source .venv/bin/activate
pip install -U pip wheel
pip install -e .
```
|
|
|
|
|
|
|
**CUDA users:** install a PyTorch build matching your CUDA version (see pytorch.org) **before** the project dependencies.
|
|
|
|
|
|
|
Core dependencies: `torch`, `hydra-core`, `omegaconf`, `pykeen` (core), `pandas`, `numpy`.
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
## 🚀 Quick start
|
|
|
|
|
|
|
### 1) Single run (built-in dataset)
|
|
|
|
|
|
|
Train TransE on WN18RR:
|
|
|
|
|
|
|
```bash
python main.py model=trans_e data=wn18rr training.batch_size=1024 training.lr=5e-4 common.run_name=transe_wn18rr_bs1024_lr5e4
```
|
|
|
|
|
|
|
### 2) Use a composed config
|
|
|
|
|
|
|
Predefined composition for TransE on FB15K:
|
|
|
|
|
|
|
```bash
python main.py -cn trans_e_fb15k
# or: python main.py --config-name trans_e_fb15k
```
|
|
|
|
|
|
|
You can still override fields:
|
|
|
|
|
|
|
```bash
python main.py -cn trans_e_fb15k training.batch_size=2048 training.lr=1e-3
```
|
|
|
|
|
|
|
### 3) Evaluate only (load checkpoint)
|
|
|
|
|
|
|
Set `common.evaluate_only=true` and point `common.load_path` to a checkpoint. Two modes are supported by `CheckpointManager`:
|
|
|
|
|
|
|
- **Full path** to a checkpoint file
- A **pair** `(model_id, iteration)` that resolves to `checkpoints/<model_id>/checkpoints/<iteration>.pt`
|
|
|
|
|
|
|
Examples:
|
|
|
|
|
|
|
```bash
# full path
python main.py common.evaluate_only=true common.load_path=/absolute/path/to/checkpoints/2/checkpoints/1800.pt

# by components (id, iter) - YAML-style tuple
python main.py -cn trans_e_fb15k common.evaluate_only=true common.load_path="(2, 1800)"
```
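The `(model_id, iteration)` addressing can be sketched as a small path resolver. This is a minimal illustration of the convention described above, not the actual `CheckpointManager` code; the `checkpoints/` root and function name are assumptions:

```python
from pathlib import Path

def resolve_load_path(load_path, root: Path = Path("checkpoints")) -> Path:
    """Resolve a load_path value into a concrete checkpoint file.

    Accepts either a full path string or a (model_id, iteration) pair,
    mirroring the two supported modes.
    """
    if isinstance(load_path, (tuple, list)):
        model_id, iteration = load_path
        # (id, iter) resolves to checkpoints/<model_id>/checkpoints/<iteration>.pt
        return root / str(model_id) / "checkpoints" / f"{iteration}.pt"
    return Path(load_path)

# Both invocations point at the same file:
print(resolve_load_path((2, 1800)).as_posix())  # checkpoints/2/checkpoints/1800.pt
print(resolve_load_path("/absolute/path/to/checkpoints/2/checkpoints/1800.pt").name)
```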
|
|
|
|
|
|
|
### 4) Multi-run sweeps (Hydra `-m`)
|
|
|
|
|
|
|
```bash
# 3 models × 2 seeds = 6 runs
python main.py -m model=trans_e,trans_h,trans_r seed=1,2

# grid over batch size and LR
python main.py -m training.batch_size=512,1024 training.lr=5e-4,1e-3
```
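Hydra's `-m` expands each comma-separated override into a value list and launches one run per combination in the cartesian product. A stdlib sketch of that expansion (an illustration of the counting, not Hydra's internals):

```python
from itertools import product

def expand_sweep(overrides: dict[str, list[str]]) -> list[str]:
    """Expand comma-list overrides into one override string per run."""
    keys = list(overrides)
    return [
        " ".join(f"{k}={v}" for k, v in zip(keys, combo))
        for combo in product(*overrides.values())
    ]

# Matches the first sweep above: 3 models × 2 seeds = 6 runs.
runs = expand_sweep({"model": ["trans_e", "trans_h", "trans_r"], "seed": ["1", "2"]})
print(len(runs))  # 6
print(runs[0])    # model=trans_e seed=1
```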
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
## 📦 Datasets
|
|
|
|
|
|
|
Built-in wrappers (via PyKEEN) are provided for **WN18**, **WN18RR**, **FB15K**, **YAGO3-10**. Select with `data=<name>` where `<name>` is one of `wn18`, `wn18rr`, `fb15k`, `yago3_10`.
|
|
|
|
|
|
|
### Custom triples from TSV
|
|
|
|
|
|
|
`data/kg_dataset.py` provides `create_from_tsv(root)`, which expects `train.txt`, `valid.txt`, `test.txt` under `root/` (tab-separated: `head<TAB>relation<TAB>tail`). To use this with Hydra, add a small config (e.g. `configs/data/custom.yaml`):
|
|
|
|
|
|
|
```yaml
# configs/data/custom.yaml
train:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [train]

valid:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [valid]
```
|
|
|
|
|
|
|
Then run with `data=custom`:
|
|
|
|
|
|
|
```bash
python main.py model=trans_e data=custom common.run_name=my_custom_kg
```
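For reference, the `head<TAB>relation<TAB>tail` layout can be read with a few lines of stdlib Python. This is a sketch of the expected file format, not the actual `create_from_tsv` implementation; the function name `read_triples` is hypothetical:

```python
import csv
from pathlib import Path

def read_triples(root: str, split: str) -> list[tuple[str, str, str]]:
    """Read (head, relation, tail) triples from root/<split>.txt."""
    with open(Path(root) / f"{split}.txt", newline="") as f:
        # Each line is one triple, tab-separated.
        return [(h, r, t) for h, r, t in csv.reader(f, delimiter="\t")]

# Expects train.txt / valid.txt / test.txt under the dataset root, e.g.:
# triples = read_triples("/absolute/path/to/mykg", "train")
```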
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
## 🧪 Metrics & Evaluation
|
|
|
|
|
|
|
- Ranking metrics via PyKEEN's `RankBasedEvaluator` (e.g., Hits@K, I(H)MR) are wired into the `training/trainer.py` loop.
- Complexity metrics in `metrics/`: `WLCREC`, `WLEC`, `CSWKLF`, greedy CREC, radius sampling.
- For a quick dataset-level report, see `eval_datasets.py` (prints WLEC-family metrics).
|
|
|
|
|
|
|
Set `common.evaluate_only=true` to run evaluation on a loaded model as shown above.
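As a reminder of what the rank-based metrics measure, Hits@K and the mean reciprocal rank (which PyKEEN also calls the inverse harmonic mean rank) follow directly from the list of ranks assigned to the true entities. A sketch using the standard definitions, not the code in `metrics/ranking.py`:

```python
def hits_at_k(ranks: list[int], k: int) -> float:
    """Fraction of test triples whose true entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank; higher is better, 1.0 is a perfect ranking."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 2, 10, 100]
print(hits_at_k(ranks, 10))  # 0.75
print(mrr(ranks))            # (1 + 0.5 + 0.1 + 0.01) / 4 = 0.4025
```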
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
## 📊 Logging, outputs, checkpoints
|
|
|
|
|
|
|
- **Hydra outputs**: `./outputs` (configured in `configs/config.yaml`).
- **TensorBoard**: `logs/<run_name>` (see `tools/tb_handler.py`; open with `tensorboard --logdir logs`).
- **Checkpoints**: by default in `common.save_dpath` (see `configs/common/common.yaml`). The `CheckpointManager` supports both absolute file paths and `(model_id, iteration)` addressing via `common.load_path`.
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
## 🔧 Common knobs (cheat sheet)
|
|
|
|
|
|
|
```bash
# Model & dimensions
python main.py model=trans_e model.dim=200
python main.py model=trans_h model.dim=400
python main.py model=trans_r model.dim=500

# Training length & batches
python main.py training.num_train_steps=20000 training.batch_size=2048

# Learning rate & weight decay
python main.py training.lr=1e-3 training.weight_decay=1e-5

# Reproducibility
python main.py seed=1234
```
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
## 📖 Citation