A Persian grapheme-to-phoneme (G2P) model designed for homograph disambiguation, fine-tuned using the HomoRich dataset to improve pronunciation accuracy.

Mahta Fetrat 849837bf9a Update README.md		1 week ago
assets	add assets	2 weeks ago
model-weights	add model weights	2 weeks ago
testing-scripts	Add files via upload	2 weeks ago
training-scripts	Add files via upload	2 weeks ago
LICENSE	Initial commit	2 weeks ago
README.md	Update README.md	1 week ago

Homo-GE2PE: Persian Grapheme-to-Phoneme Conversion with Homograph Disambiguation

Homo-GE2PE is a Persian grapheme-to-phoneme (G2P) model specialized in homograph disambiguation—words with identical spellings but context-dependent pronunciations (e.g., مرد pronounced as mard “man” or mord “died”). Introduced in Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models, the model extends GE2PE by fine-tuning it on the HomoRich dataset, explicitly designed for such pronunciation challenges.

Repository Structure

model-weights/
│   ├── homo-ge2pe.zip       # Homo-GE2PE model checkpoint
│   └── homo-t5.zip          # Homo-T5 model checkpoint (T5-based G2P model)

training-scripts/
│   ├── finetune-ge2pe.py    # Fine-tuning script for GE2PE
│   └── finetune-t5.py       # Fine-tuning script for T5

testing-scripts/
│   └── test.ipynb           # Benchmarking the models with SentenceBench Persian G2P Benchmark

assets/
│   └── (files required for inference, e.g., Parsivar, GE2PE.py)

Model Performance

Below are the performance metrics for each model variant on the SentenceBench dataset:

Model	PER (%)	Homograph Acc. (%)	Avg. Inf. Time (s)
GE2PE (Base)	4.81	47.17	0.4464
Homo-T5	4.12	76.32	0.4141
Homo-GE2PE	3.98	76.89	0.4473

Inference

For inference, use the provided inference.ipynb notebook or the Colab link. The notebook demonstrates how to load the checkpoints and perform grapheme-to-phoneme conversion using Homo-GE2PE and Homo-T5.

Dataset: HomoRich G2P Persian

The models in this repository were fine-tuned on HomoRich, the first large-scale public Persian homograph dataset for grapheme-to-phoneme (G2P) tasks, resolving pronunciation/meaning ambiguities in identically spelled words. Introduced in “Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models”, the dataset is available here (TODO: Update link).

Citation

If you use this project in your work, please cite the corresponding paper:

TODO

Contributions

Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.

Additional Links

Paper PDF (TODO: link to paper)
Base GE2PE Paper
Base GE2PE Model
HomoRich Dataset (TODO: To be updated)
SentenceBench Persian G2P Benchmark

README.md