HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:
Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below:
[Figure: Distribution of data sources in the HomoRich dataset]

[Figure: The source of each part of the HomoRich dataset]
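The balance claim above (roughly 500 samples per pronunciation variant) can be checked with a short script. The following is a minimal sketch, not part of the repository: the field names `homograph` and `pronunciation` are hypothetical stand-ins for whatever columns the released files actually use, and the three toy rows (the Persian homograph مرد, romanized loosely as "mard" and "mord") are illustrative rather than taken from the dataset.

```python
from collections import Counter

# Toy rows standing in for dataset records. The keys "homograph" and
# "pronunciation" are hypothetical; adapt them to the real column names.
rows = [
    {"homograph": "مرد", "pronunciation": "mard"},
    {"homograph": "مرد", "pronunciation": "mard"},
    {"homograph": "مرد", "pronunciation": "mord"},
]


def variant_counts(rows, word_key="homograph", variant_key="pronunciation"):
    """Count samples per (homograph, pronunciation variant) pair."""
    return Counter((row[word_key], row[variant_key]) for row in rows)


def imbalance_ratio(counts):
    """Return max/min variant count per homograph; 1.0 means perfectly balanced."""
    per_word = {}
    for (word, _variant), n in counts.items():
        per_word.setdefault(word, []).append(n)
    return {word: max(ns) / min(ns) for word, ns in per_word.items()}


counts = variant_counts(rows)
ratios = imbalance_ratio(counts)
```

On the real data, a ratio close to 1.0 for every homograph would be consistent with the equal representation described above.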
Persian G2P systems use two common phoneme formats:
The HomoRich dataset includes both formats for broad compatibility. Below is a visual comparison:
[Figure: Repr. 1]

[Figure: Repr. 2]
The dataset is available both on Hugging Face and in this repository:
Option 1: From Hugging Face
    from datasets import load_dataset
    dataset = load_dataset("AnonymousOwner/HomoRich")  # To be updated
Option 2: From this repository
You can access the dataset files directly from the data folder of this repository.
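Since the file formats inside the data folder are not listed here, a reasonable first step after cloning is to inventory what it contains. The helper below is a small sketch using only the standard library; the function name `group_files_by_suffix` is introduced for illustration and is not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_suffix(data_dir):
    """Map each file extension under data_dir to a sorted list of relative paths."""
    root = Path(data_dir)
    groups = defaultdict(list)
    for path in root.rglob("*"):  # walk the folder recursively
        if path.is_file():
            # Store POSIX-style paths relative to the data folder.
            groups[path.suffix.lower()].append(path.relative_to(root).as_posix())
    return {suffix: sorted(paths) for suffix, paths in groups.items()}
```

Calling `group_files_by_suffix("data")` from the repository root returns a dictionary keyed by extension, which indicates which reader (CSV, Parquet, JSON, and so on) is appropriate for each file.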
TODO
The dataset was used to improve:
See paper Table 3 for full metrics.
The scripts folder contains two key notebooks used in the dataset creation and processing pipeline:

- Generate\_Homograph\_Sentences.ipynb: This notebook implements the prompt templates used to generate homograph-focused sentences, as described in the paper Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
- Phonemize\_Sentences.ipynb: This notebook applies the phonemization process based on the LLM-powered G2P method detailed in the paper LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study.
TODO: citation to paper arxiv
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.