Naser Kazemi 786ddee1f1 Updated README.md 3 weeks ago
🧭 Knowledge-Graph Embeddings: Training & Evaluation

This repository implements the Bachelor thesis project:

DECONSTRUCTING KNOWLEDGE GRAPH DIFFICULTY: A FRAMEWORK FOR COMPLEXITY‑AWARE EVALUATION AND GENERATION

It trains and evaluates knowledge‑graph embedding (KGE) models using custom training logic. We use PyKeen for core building blocks (models, datasets, evaluators, losses, regularizers) but do not use the PyKeen pipeline. Hydra powers configuration and multi‑run sweeps.


✅ What's in this repo

Root/
├─ configs/
│  ├─ common/               # run name, logging, save paths, resume/eval flags
│  │  └─ common.yaml
│  ├─ data/                 # dataset choices
│  │  ├─ data.yaml
│  │  ├─ fb15k.yaml
│  │  ├─ wn18.yaml
│  │  ├─ wn18rr.yaml
│  │  └─ yago3_10.yaml
│  ├─ model/                # model choices & defaults
│  │  ├─ model.yaml
│  │  ├─ trans_e.yaml
│  │  ├─ trans_h.yaml
│  │  └─ trans_r.yaml
│  ├─ training/             # optimizer/lr/batch/steps + trainer class
│  │  ├─ training.yaml
│  │  └─ trans_e_trainer.yaml
│  ├─ model_trainers/       # (Hydra group) trainer implementations
│  │  ├─ model_trainer_base.py
│  │  └─ translation/{trans_e_trainer.py, trans_h_trainer.py, trans_r_trainer.py}
│  ├─ config.yaml           # Hydra defaults: common, data, model, training
│  ├─ trans_e_fb15k.yaml    # ready‑made composed config
│  ├─ trans_e_wn18.yaml
│  ├─ trans_e_wn18rr.yaml
│  └─ trans_e_yago3_10.yaml
├─ data/                    # dataset wrappers + TSV helper
│  ├─ kg_dataset.py         # KGDataset + create_from_tsv(...)
│  ├─ wn18.py               # WN18Dataset, WN18RRDataset
│  └─ fb15k.py, yago3_10.py, openke_wiki.py, hationet.py, open_bio_link.py
├─ models/                  # minimal translation‑based models
│  ├─ base_model.py
│  └─ translation/{trans_e.py, trans_h.py, trans_r.py}
├─ metrics/                 # complexity metrics and ranking metrics
│  └─ c_swklf.py, wlcrec.py, wlec.py, greedy_crec.py, crec_radius_sample.py, ranking.py
├─ training/                # Trainer orchestrating data/model/loop
│  └─ trainer.py
├─ tools/                   # logging, TB, sampling, checkpoints, params
│  ├─ pretty_logger.py, tb_handler.py, sampling.py, checkpoint_manager.py
│  └─ params.py             # CommonParams, TrainingParams dataclasses
├─ main.py                  # **single entrypoint** (@hydra.main)
├─ build_crec_datasets.py   # helper to tune/sample CREC subsets
├─ eval_datasets.py         # example: compute WL(C)REC over datasets
└─ pyproject.toml           # formatting/lint settings

✨ Highlights

  • No PyKeen pipeline. Models, samplers, and evaluators are instantiated directly; training runs through our wrappers around PyKeen's TrainingLoop.
  • Hydra CLI. One-line overrides, organized config groups, and multi-run sweeps (-m).
  • Datasets. Built-ins or custom triples (TSV/CSV).
  • Reproducible outputs. Each run gets a timestamped directory with the resolved config, checkpoints, metrics, and artifacts.
  • Extendable. Add models/configs without touching the training loop.

🧰 Requirements

  • Python 3.10+
  • PyTorch
  • PyKeen
  • Hydra Core + OmegaConf
  • NumPy, einops
  • lovely‑tensors (optional pretty tensor prints)
  • TensorBoard

🛠 Installation

Python 3.10+ recommended. GPU optional but encouraged for larger graphs.

Install script

bash setup/install.sh

Conda + pip

conda create -n kge python=3.10 -y
conda activate kge
pip install -U pip wheel
pip install -e .

Virtualenv/venv

python -m venv .venv && source .venv/bin/activate
pip install -U pip wheel
pip install -e .

CUDA users: install a PyTorch build that matches your CUDA version (see pytorch.org) before installing the project dependencies.

Core dependencies: torch, hydra-core, omegaconf, pykeen (core), pandas, numpy.


🚀 Quick start

1) Single run (built‑in dataset)

Train TransE on WN18RR:

python main.py   model=trans_e   data=wn18rr   training.batch_size=1024   training.lr=5e-4   common.run_name=transe_wn18rr_bs1024_lr5e4

2) Use a composed config

Predefined composition for TransE on FB15K:

python main.py -cn trans_e_fb15k
# or: python main.py --config-name trans_e_fb15k

You can still override fields:

python main.py -cn trans_e_fb15k training.batch_size=2048 training.lr=1e-3
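
The dotted overrides above address keys inside the nested config. As a rough illustration of that idea only, the plain-Python sketch below applies `a.b=value` strings to a nested dict. It is not Hydra's implementation: `apply_overrides` and `parse_value` are hypothetical helpers, the value parsing is deliberately simplified, and the `defaults` dict is a made-up stand-in for a composed config.

```python
from copy import deepcopy


def parse_value(text):
    """Best-effort scalar parsing: int first, then float, else the raw string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config, overrides):
    """Return a copy of `config` with `a.b=value` style overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down to the parent dict, creating levels that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


# Hypothetical defaults, standing in for a composed config such as trans_e_fb15k.
defaults = {"training": {"batch_size": 1024, "lr": 5e-4}}
merged = apply_overrides(defaults, ["training.batch_size=2048", "training.lr=1e-3"])
print(merged)
```

The defaults are copied rather than mutated, which mirrors the way a command-line override changes one run without editing the YAML file on disk.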

3) Evaluate only (load checkpoint)

Set common.evaluate_only=true and point common.load_path to a checkpoint. Two modes are supported by CheckpointManager:

  • Full path to a checkpoint file
  • A pair (model_id, iteration) that resolves to checkpoints/<model_id>/checkpoints/<iteration>.pt

Examples:

# full path
python main.py common.evaluate_only=true common.load_path=/absolute/path/to/checkpoints/2/checkpoints/1800.pt

# by components (id, iter), YAML-style tuple
python main.py -cn trans_e_fb15k common.evaluate_only=true common.load_path="(2, 1800)"
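
The two addressing modes can be sketched as a small resolver. This is a minimal illustration, not the actual `CheckpointManager` code: `resolve_checkpoint` is a hypothetical function, and the `root="checkpoints"` default is an assumption based on the path layout described above (the real root is whatever `common.save_dpath` points to).

```python
from pathlib import Path


def resolve_checkpoint(load_path, root="checkpoints"):
    """Map a load_path value to a checkpoint file path.

    Accepts either a path to a checkpoint file or a (model_id, iteration) pair.
    """
    if isinstance(load_path, (tuple, list)):
        model_id, iteration = load_path
        # Layout from the README: <root>/<model_id>/checkpoints/<iteration>.pt
        return Path(root) / str(model_id) / "checkpoints" / f"{iteration}.pt"
    return Path(load_path)


by_pair = resolve_checkpoint((2, 1800))
by_path = resolve_checkpoint("/absolute/path/to/checkpoints/2/checkpoints/1800.pt")
print(by_pair.as_posix())
print(by_path.as_posix())
```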

4) Multi‑run sweeps (Hydra -m)

# 3 models × 2 seeds = 6 runs
python main.py -m model=trans_e,trans_h,trans_r seed=1,2

# grid over batch size and LR
python main.py -m training.batch_size=512,1024 training.lr=5e-4,1e-3
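
A multi-run sweep launches one job per combination of the comma-separated values. The sketch below enumerates that grid with `itertools.product` for the first example; it only illustrates the counting and does not claim to reproduce the order in which Hydra schedules jobs.

```python
from itertools import product

# Swept keys and their values, as in: -m model=trans_e,trans_h,trans_r seed=1,2
sweep = {
    "model": ["trans_e", "trans_h", "trans_r"],
    "seed": [1, 2],
}

# Cartesian product of all swept values: one dict of overrides per job.
jobs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]

for index, job in enumerate(jobs):
    overrides = " ".join(f"{key}={value}" for key, value in job.items())
    print(f"job {index}: python main.py {overrides}")

print(len(jobs))
```

Adding a third swept key multiplies the job count again, so grids grow quickly.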

📦 Datasets

Built‑in wrappers (via PyKeen) are provided for WN18, WN18RR, FB15K, YAGO3‑10. Select with data=<name> where <name> is one of wn18, wn18rr, fb15k, yago3_10.

Custom triples from TSV

data/kg_dataset.py provides create_from_tsv(root) which expects train.txt, valid.txt, test.txt under root/ (tab‑separated: head<TAB>relation<TAB>tail). To use this with Hydra, add a small config (e.g. configs/data/custom.yaml):

# configs/data/custom.yaml
train:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [train]

valid:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [valid]

Then run with data=custom:

python main.py model=trans_e data=custom common.run_name=my_custom_kg
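
Before pointing `root` at a new dataset, it can help to check that the files match the expected layout. The following is a standalone sketch using only the standard library; `read_triples` and `check_root` are hypothetical helpers, not part of `data/kg_dataset.py`, and the demo triple is made up.

```python
import csv
import tempfile
from pathlib import Path


def read_triples(path):
    """Read (head, relation, tail) triples from a tab-separated file."""
    triples = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 3:
                raise ValueError(f"{path}:{line_no}: expected 3 columns, got {len(row)}")
            triples.append(tuple(row))
    return triples


def check_root(root):
    """Return the number of triples per split; raise if a split file is missing."""
    counts = {}
    for split in ("train", "valid", "test"):
        path = Path(root) / f"{split}.txt"
        if not path.is_file():
            raise FileNotFoundError(path)
        counts[split] = len(read_triples(path))
    return counts


# Demo on a throwaway directory with a made-up triple in each split.
with tempfile.TemporaryDirectory() as root:
    for split in ("train", "valid", "test"):
        (Path(root) / f"{split}.txt").write_text(
            "berlin\tcapital_of\tgermany\n", encoding="utf-8"
        )
    counts = check_root(root)
    print(counts)
```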

🧪 Metrics & Evaluation

  • Ranking metrics come from PyKeen's RankBasedEvaluator (e.g., Hits@K, I(H)MR) and are wired into the loop in training/trainer.py.
  • Complexity metrics in metrics/: WLCREC, WLEC, CSWKLF, greedy CREC, radius sampling.
  • For a quick dataset‑level report, see eval_datasets.py (prints WLEC‑family metrics).

Set common.evaluate_only=true to run evaluation on a loaded model as shown above.
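
For readers new to rank-based evaluation, the ranking metrics mentioned above reduce to simple statistics over the rank of the true entity for each test triple. The sketch below shows the textbook definitions in plain Python with made-up ranks; it is not the repository's `metrics/ranking.py`, and the evaluator used in the trainer handles further details that are omitted here.

```python
def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked in the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def mean_rank(ranks):
    """Arithmetic mean of the ranks (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank (higher is better, at most 1.0)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


ranks = [1, 2, 4, 10]  # made-up ranks, where 1 is the best possible rank
print("Hits@3:", hits_at_k(ranks, 3))
print("MR:", mean_rank(ranks))
print("MRR:", mean_reciprocal_rank(ranks))
```

Mean rank is sensitive to a few very bad predictions, while the reciprocal form emphasizes how often the correct entity lands near the top.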


πŸ“ Logging, outputs, checkpoints

  • Hydra outputs: ./outputs (configured in configs/config.yaml).
  • TensorBoard: logs/<run_name> (see tools/tb_handler.py; open with tensorboard --logdir logs).
  • Checkpoints: by default in common.save_dpath (see configs/common/common.yaml). The CheckpointManager supports both absolute file paths and (model_id, iteration) addressing via common.load_path.

🔧 Common knobs (cheat sheet)

# Model & dimensions
python main.py model=trans_e model.dim=200
python main.py model=trans_h model.dim=400
python main.py model=trans_r model.dim=500

# Training length & batches
python main.py training.num_train_steps=20000 training.batch_size=2048

# Learning rate & weight decay
python main.py training.lr=1e-3 training.weight_decay=1e-5

# Reproducibility
python main.py seed=1234

📚 Citation