# Homo-GE2PE: Persian Grapheme-to-Phoneme Conversion with Homograph Disambiguation

**Homo-GE2PE** is a Persian grapheme-to-phoneme (G2P) model specialized in homograph disambiguation: it resolves words with identical spellings but context-dependent pronunciations (e.g., *مرد* pronounced as *mard* "man" or *mord* "died"). Introduced in *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](https://arxiv.org/abs/2505.12973)*, the model extends **GE2PE** by fine-tuning it on the **HomoRich** dataset, which was designed explicitly for such pronunciation challenges.

---

## Usage

For inference, run the provided [`inference.ipynb`](https://github.com/MahtaFetrat/Homo-GE2PE-Persian/blob/main/Inference.ipynb) notebook either locally or via the [Colab link](https://colab.research.google.com/drive/1Osue8HOgTGMZXIhpvCuiRyfuxpte1v0p?usp=sharing) (recommended for easy setup). The notebook demonstrates how to load the checkpoints and perform grapheme-to-phoneme conversion using Homo-GE2PE and Homo-T5.

### Quick Setup

1. **Install dependencies**:
   ```bash
   pip install unidecode
   ```
2. **Download models**:
   ```bash
   # Clone the Hugging Face repository, then unpack Parsivar and both checkpoints
   git clone https://huggingface.co/MahtaFetrat/Homo-GE2PE-Persian/
   unzip -q Homo-GE2PE-Persian/assets/Parsivar.zip
   unzip -q Homo-GE2PE-Persian/model-weights/homo-ge2pe.zip -d homo-ge2pe
   unzip -q Homo-GE2PE-Persian/model-weights/homo-t5.zip -d homo-t5
   # Move the GE2PE module into the working directory so it can be imported
   mv Homo-GE2PE-Persian/assets/GE2PE.py ./
   ```
3. **Fix compatibility** (needed on Python 3.10+, where `collections.Iterable` was removed):
   ```bash
   sed -i 's/from collections import Iterable/from collections.abc import Iterable/g' Parsivar/token_merger.py
   ```
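
The `sed -i` invocation in step 3 behaves differently across platforms (BSD/macOS `sed` expects a backup suffix argument), so the same patch can be applied from Python instead. The following is a minimal sketch, not part of the repository: `patch_iterable_import` and `patch_file` are hypothetical helper names, and the target path is the one used in the `sed` command.

```python
from pathlib import Path

OLD_IMPORT = "from collections import Iterable"
NEW_IMPORT = "from collections.abc import Iterable"


def patch_iterable_import(source: str) -> str:
    """Return `source` with the pre-3.10 Iterable import rewritten."""
    return source.replace(OLD_IMPORT, NEW_IMPORT)


def patch_file(path: Path) -> bool:
    """Patch `path` in place; return True only if the file was changed."""
    original = path.read_text(encoding="utf-8")
    patched = patch_iterable_import(original)
    if patched == original:
        return False
    path.write_text(patched, encoding="utf-8")
    return True


if __name__ == "__main__":
    # Same file the sed command targets; skipped silently if it is not there.
    target = Path("Parsivar/token_merger.py")
    if target.exists():
        print("patched" if patch_file(target) else "already compatible")
```

Running the script a second time leaves the file untouched, because the rewritten import no longer matches the old pattern.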

### Example Usage

```python
from GE2PE import GE2PE

# '/content/homo-ge2pe' is the Colab location; use the directory where you
# unzipped the weights (homo-ge2pe or homo-t5) when running elsewhere.
g2p = GE2PE(model_path='/content/homo-ge2pe')
g2p.generate(['تست مدل تبدیل نویسه به واج', 'این کتابِ علی است'], use_rules=True)
# Output: ['teste model t/bdil nevise be vaj', '@in ketabe @ali @/st']
```
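
For longer inputs it may be convenient to feed sentences in fixed-size chunks. The sketch below assumes only what the example above shows: `generate` takes a list of sentences and returns one phoneme string per input. `batched` and `phonemize_corpus` are hypothetical helpers, not part of `GE2PE`, and whether `generate` already batches internally is not documented here.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def phonemize_corpus(
    generate: Callable[[List[str]], List[str]],
    sentences: List[str],
    batch_size: int = 16,
) -> List[str]:
    """Run `generate` batch by batch and return outputs in input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        outputs = generate(batch)
        # Guard the one-output-per-input assumption stated above.
        if len(outputs) != len(batch):
            raise RuntimeError("generate returned a different number of outputs")
        results.extend(outputs)
    return results


if __name__ == "__main__":
    # Stand-in for the real model so the sketch runs without the checkpoints.
    def fake_generate(batch: List[str]) -> List[str]:
        return [s.upper() for s in batch]

    print(phonemize_corpus(fake_generate, ["a", "b", "c"], batch_size=2))
```

With the real model, the callable could be `lambda batch: g2p.generate(batch, use_rules=True)`.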

---

## Dataset: HomoRich G2P Persian

The models in this repository were fine-tuned on HomoRich, the first large-scale public Persian homograph dataset for grapheme-to-phoneme (G2P) tasks, built to resolve pronunciation and meaning ambiguities in identically spelled words. The dataset was introduced in *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](https://arxiv.org/abs/2505.12973)* and is available on [Hugging Face](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian).

---

If you use this project in your work, please cite the corresponding paper:

```bibtex
@misc{qharabagh2025fastfancyrethinkingg2p,
      title={Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models},
      author={Mahta Fetrat Qharabagh and Zahra Dehghanian and Hamid R. Rabiee},
      year={2025},
      eprint={2505.12973},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.12973},
}
```

---

## Contributions

### Additional Links

* [Link to Paper](https://arxiv.org/abs/2505.12973)
* [Homo-GE2PE (Hugging Face)](https://huggingface.co/MahtaFetrat/Homo-GE2PE-Persian)
* [Base GE2PE Paper](https://aclanthology.org/2024.findings-emnlp.196/)
* [Base GE2PE Model](https://github.com/Sharif-SLPL/GE2PE)
* [HomoRich Dataset (Hugging Face)](https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian)
* [HomoRich Dataset (GitHub)](https://github.com/MahtaFetrat/HomoRich-G2P-Persian)
* [SentenceBench Persian G2P Benchmark](https://huggingface.co/datasets/MahtaFetrat/SentenceBench)