# 🧭 Knowledge-Graph Embeddings – Training & Evaluation

This repository implements the Bachelor thesis project:

> **DECONSTRUCTING KNOWLEDGE GRAPH DIFFICULTY: A FRAMEWORK FOR COMPLEXITY‑AWARE EVALUATION AND GENERATION**

It trains and evaluates **knowledge‑graph embedding (KGE)** models using **custom training logic**. We use **PyKeen** for core building blocks (models, datasets, evaluators, losses, regularizers) but **do not use the PyKeen pipeline**. **Hydra** powers configuration and multi‑run sweeps.
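
Concretely, "building blocks without the pipeline" means constructing PyKeen objects yourself instead of calling `pykeen.pipeline.pipeline`. The sketch below is a standalone illustration of that idea using PyKeen's public API; it is **not** this repo's trainer (that lives in `training/trainer.py` and `configs/model_trainers/`), and the hyperparameters are placeholders:

```python
from pykeen.datasets import WN18RR
from pykeen.evaluation import RankBasedEvaluator
from pykeen.models import TransE
from pykeen.training import SLCWATrainingLoop

dataset = WN18RR()                                                    # dataset building block
model = TransE(triples_factory=dataset.training, embedding_dim=200)  # model building block

# Training loop used directly (no pykeen.pipeline call).
loop = SLCWATrainingLoop(model=model, triples_factory=dataset.training, optimizer="Adam")
loop.train(triples_factory=dataset.training, num_epochs=5, batch_size=1024)

# Filtered rank-based evaluation, also used directly.
evaluator = RankBasedEvaluator()
results = evaluator.evaluate(
    model=model,
    mapped_triples=dataset.testing.mapped_triples,
    additional_filter_triples=[
        dataset.training.mapped_triples,
        dataset.validation.mapped_triples,
    ],
)
print(results.get_metric("hits_at_10"))
```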

---

## ✅ What's in this repo

```
Root/
├─ configs/
│  ├─ common/                  # run name, logging, save paths, resume/eval flags
│  │  └─ common.yaml
│  ├─ data/                    # dataset choices
│  │  ├─ data.yaml
│  │  ├─ fb15k.yaml
│  │  ├─ wn18.yaml
│  │  ├─ wn18rr.yaml
│  │  └─ yago3_10.yaml
│  ├─ model/                   # model choices & defaults
│  │  ├─ model.yaml
│  │  ├─ trans_e.yaml
│  │  ├─ trans_h.yaml
│  │  └─ trans_r.yaml
│  ├─ training/                # optimizer/lr/batch/steps + trainer class
│  │  ├─ training.yaml
│  │  └─ trans_e_trainer.yaml
│  ├─ model_trainers/          # (Hydra group) trainer implementations
│  │  ├─ model_trainer_base.py
│  │  └─ translation/{trans_e_trainer.py, trans_h_trainer.py, trans_r_trainer.py}
│  ├─ config.yaml              # Hydra defaults: common, data, model, training
│  ├─ trans_e_fb15k.yaml       # ready-made composed config
│  ├─ trans_e_wn18.yaml
│  ├─ trans_e_wn18rr.yaml
│  └─ trans_e_yago3_10.yaml
├─ data/                       # dataset wrappers + TSV helper
│  ├─ kg_dataset.py            # KGDataset + create_from_tsv(...)
│  ├─ wn18.py                  # WN18Dataset, WN18RRDataset
│  └─ fb15k.py, yago3_10.py, openke_wiki.py, hationet.py, open_bio_link.py
├─ models/                     # minimal translation-based models
│  ├─ base_model.py
│  └─ translation/{trans_e.py, trans_h.py, trans_r.py}
├─ metrics/                    # complexity metrics and ranking metrics
│  └─ c_swklf.py, wlcrec.py, wlec.py, greedy_crec.py, crec_radius_sample.py, ranking.py
├─ training/                   # Trainer orchestrating data/model/loop
│  └─ trainer.py
├─ tools/                      # logging, TB, sampling, checkpoints, params
│  ├─ pretty_logger.py, tb_handler.py, sampling.py, checkpoint_manager.py
│  └─ params.py                # CommonParams, TrainingParams dataclasses
├─ main.py                     # single entrypoint (@hydra.main)
├─ build_crec_datasets.py      # helper to tune/sample CREC subsets
├─ eval_datasets.py            # example: compute WL(C)REC over datasets
└─ pyproject.toml              # formatting/lint settings
```
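
`main.py` is the single Hydra entrypoint. As a rough sketch of that pattern (the function body and config fields here are illustrative, not the repo's exact code):

```python
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Hydra composes configs/config.yaml with any CLI overrides before calling main().
    print(OmegaConf.to_yaml(cfg))
    # A run would then build the trainer from the composed config,
    # e.g. via hydra.utils.instantiate(...), and call its train()/evaluate() methods.


if __name__ == "__main__":
    main()
```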

## ✨ Highlights

- **No PyKeen pipeline.** Models, samplers, and evaluators are instantiated directly; training runs through our wrappers around PyKeen's `TrainingLoop`.
- **Hydra CLI.** One-line overrides, organized config groups, and multi-run sweeps (`-m`).
- **Datasets.** Built-ins or custom triples (TSV/CSV).
- **Reproducible outputs.** Each run gets a timestamped directory with the resolved config, checkpoints, metrics, and artifacts.
- **Extendable.** Add models/configs without touching the training loop.


## 🧰 Requirements

- Python 3.10+
- PyTorch
- PyKeen
- Hydra Core + OmegaConf
- NumPy, einops
- lovely‑tensors (optional pretty tensor prints)
- TensorBoard

## 🛠 Installation

> Python **3.10+** recommended. GPU optional but encouraged for larger graphs.

**Install script**
```bash
bash setup/install.sh
```


**Conda + pip**
```bash
conda create -n kge python=3.10 -y
conda activate kge
pip install -U pip wheel
pip install -e .
```

**Virtualenv/venv**
```bash
python -m venv .venv && source .venv/bin/activate
pip install -U pip wheel
pip install -e .
```

**CUDA users:** install a PyTorch build matching your CUDA (see pytorch.org) **before** project deps.

Core dependencies: `torch`, `hydra-core`, `omegaconf`, `pykeen` (core), `pandas`, `numpy`.


---

## 🚀 Quick start

### 1) Single run (built‑in dataset)

Train TransE on WN18RR:

```bash
python main.py model=trans_e data=wn18rr training.batch_size=1024 training.lr=5e-4 common.run_name=transe_wn18rr_bs1024_lr5e4
```

### 2) Use a composed config

Predefined composition for TransE on FB15K:

```bash
python main.py -cn trans_e_fb15k
# or: python main.py --config-name trans_e_fb15k
```

You can still override fields:

```bash
python main.py -cn trans_e_fb15k training.batch_size=2048 training.lr=1e-3
```

### 3) Evaluate only (load checkpoint)

Set `common.evaluate_only=true` and point `common.load_path` to a checkpoint. Two modes are supported by `CheckpointManager`:

- **Full path** to a checkpoint file
- A **pair** `(model_id, iteration)` that resolves to `checkpoints/<model_id>/checkpoints/<iteration>.pt`

Examples:

```bash
# full path
python main.py common.evaluate_only=true common.load_path=/absolute/path/to/checkpoints/2/checkpoints/1800.pt

# by components (id, iter) – YAML-style tuple
python main.py -cn trans_e_fb15k common.evaluate_only=true common.load_path="(2, 1800)"
```
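
Under the hood, the `(model_id, iteration)` form just resolves to a file path before loading. A hedged sketch of that resolution (the real `CheckpointManager` in `tools/checkpoint_manager.py` may store and restore more state than shown here):

```python
from pathlib import Path

import torch


def resolve_checkpoint(load_path, save_root="checkpoints"):
    """Accept either a full checkpoint path or a (model_id, iteration) pair."""
    if isinstance(load_path, (tuple, list)):
        model_id, iteration = load_path
        return Path(save_root) / str(model_id) / "checkpoints" / f"{iteration}.pt"
    return Path(load_path)


ckpt_path = resolve_checkpoint((2, 1800))          # -> checkpoints/2/checkpoints/1800.pt
state = torch.load(ckpt_path, map_location="cpu")  # contents depend on how the trainer saves
# model.load_state_dict(state["model"])            # key names are an assumption, not guaranteed
```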

### 4) Multi‑run sweeps (Hydra `-m`)

```bash
# 3 models Γ— 2 seeds = 6 runs
python main.py -m model=trans_e,trans_h,trans_r seed=1,2

# grid over batch size and LR
python main.py -m training.batch_size=512,1024 training.lr=5e-4,1e-3
```

---

## 📦 Datasets

Built‑in wrappers (via PyKeen) are provided for **WN18**, **WN18RR**, **FB15K**, **YAGO3‑10**. Select with `data=<name>` where `<name>` is one of `wn18`, `wn18rr`, `fb15k`, `yago3_10`.

### Custom triples from TSV

`data/kg_dataset.py` provides `create_from_tsv(root)` which expects `train.txt`, `valid.txt`, `test.txt` under `root/` (tab‑separated: `head<TAB>relation<TAB>tail`). To use this with Hydra, add a small config (e.g. `configs/data/custom.yaml`):

```yaml
# configs/data/custom.yaml
train:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [train]

valid:
  _target_: data.kg_dataset.create_from_tsv
  root: /absolute/path/to/mykg
  splits: [valid]
```

Then run with `data=custom`:

```bash
python main.py model=trans_e data=custom common.run_name=my_custom_kg
```
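
For reference, loading tab-separated triples by hand is straightforward with PyKeen's `TriplesFactory`; the snippet below is a minimal sketch of what a helper like `create_from_tsv` can build on (the actual implementation in `data/kg_dataset.py` may differ, e.g. in how `splits` is handled):

```python
from pykeen.triples import TriplesFactory

root = "/absolute/path/to/mykg"  # assumed layout: train.txt / valid.txt / test.txt
train = TriplesFactory.from_path(f"{root}/train.txt")
# Reuse the training vocabulary so entity/relation IDs stay consistent across splits.
valid = TriplesFactory.from_path(
    f"{root}/valid.txt",
    entity_to_id=train.entity_to_id,
    relation_to_id=train.relation_to_id,
)
test = TriplesFactory.from_path(
    f"{root}/test.txt",
    entity_to_id=train.entity_to_id,
    relation_to_id=train.relation_to_id,
)
print(train.num_entities, train.num_relations, train.num_triples)
```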

---

## 🧪 Metrics & Evaluation

- Ranking metrics via PyKeen's `RankBasedEvaluator` (e.g., Hits@K, mean rank and its inverse/harmonic variants) are wired into the `training/trainer.py` loop.
- Complexity metrics in `metrics/`: `WLCREC`, `WLEC`, `CSWKLF`, greedy CREC, radius sampling.
- For a quick dataset‑level report, see `eval_datasets.py` (prints WLEC‑family metrics).

Set `common.evaluate_only=true` to run evaluation on a loaded model as shown above.
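
The ranking metrics themselves are simple functions of the rank assigned to each true triple; a minimal standalone sketch (independent of `metrics/ranking.py`):

```python
import torch


def ranking_metrics(ranks: torch.Tensor, ks=(1, 3, 10)) -> dict:
    """Compute MR, MRR, and Hits@K from 1-based ranks of the true triples."""
    ranks = ranks.float()
    metrics = {
        "mean_rank": ranks.mean().item(),
        "mean_reciprocal_rank": (1.0 / ranks).mean().item(),
    }
    for k in ks:
        metrics[f"hits_at_{k}"] = (ranks <= k).float().mean().item()
    return metrics


# Example: ranks of the true tail entity for five test triples
print(ranking_metrics(torch.tensor([1, 4, 2, 120, 7])))
```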

---

## πŸ“ Logging, outputs, checkpoints

- **Hydra outputs**: `./outputs` (configured in `configs/config.yaml`).
- **TensorBoard**: `logs/<run_name>` (see `tools/tb_handler.py`; open with `tensorboard --logdir logs`).
- **Checkpoints**: by default in `common.save_dpath` (see `configs/common/common.yaml`). The `CheckpointManager` supports both absolute file paths and `(model_id, iteration)` addressing via `common.load_path`.
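
If you want to log additional scalars next to what `tools/tb_handler.py` writes, the underlying PyTorch API is `SummaryWriter`; the run name below is a placeholder matching `common.run_name`:

```python
from torch.utils.tensorboard import SummaryWriter

# "transe_wn18rr" is a placeholder; use the same name as logs/<run_name>.
writer = SummaryWriter(log_dir="logs/transe_wn18rr")
writer.add_scalar("train/loss", 0.42, global_step=100)       # scalar curve per step
writer.add_scalar("eval/hits_at_10", 0.51, global_step=100)
writer.close()
```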

---

## 🔧 Common knobs (cheat sheet)

```bash
# Model & dimensions
python main.py model=trans_e model.dim=200
python main.py model=trans_h model.dim=400
python main.py model=trans_r model.dim=500

# Training length & batches
python main.py training.num_train_steps=20000 training.batch_size=2048

# Learning rate & weight decay
python main.py training.lr=1e-3 training.weight_decay=1e-5

# Reproducibility
python main.py seed=1234
```
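
To inspect what a set of overrides resolves to without launching a run, Hydra's compose API can be used from the repository root (a small sketch; it assumes the `configs/` layout shown above):

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# Compose the same config main.py would see, without starting a training run.
with initialize(config_path="configs", version_base=None):
    cfg = compose(
        config_name="config",
        overrides=["model=trans_e", "data=wn18rr", "training.lr=1e-3"],
    )
print(OmegaConf.to_yaml(cfg))
```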

---

## 📚 Citation
