HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:
“Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models”.
The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below:
Figure: Distribution of data sources in the HomoRich dataset.
Figure: The source for different parts of the HomoRich dataset.
Persian G2P systems commonly use two phoneme representation formats. The HomoRich dataset includes both for broad compatibility; a visual comparison of the two representations is shown below:
Figure: Repr. 1.
Figure: Repr. 2.
The dataset is available both on Hugging Face and in this repository:
Option 1: From Hugging Face
```python
from datasets import load_dataset

dataset = load_dataset("MahtaFetrat/HomoRich")
```
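Once loaded, you can inspect the splits and check the per-variant balance described above. The snippet below is a minimal sketch: the split name (`train`) and the column names (`homograph`, `pronunciation`) are assumptions, so print the dataset object and its `features` to confirm the actual schema.

```python
from collections import Counter

# Minimal sketch for inspecting HomoRich after loading it with load_dataset().
# The split name ("train") and column names ("homograph", "pronunciation")
# are assumptions; print dataset and dataset["train"].features to confirm.
print(dataset)                      # available splits and their sizes
split = dataset["train"]
print(split[0])                     # first annotated sentence

# Count samples per (homograph, pronunciation) pair to check the balance.
counts = Counter(zip(split["homograph"], split["pronunciation"]))
for (word, pron), n in list(counts.items())[:10]:
    print(word, pron, n)
```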
Option 2: From this repository
You can access the dataset files directly from the data folder of this repository.
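If you prefer working with the raw files, a sketch like the one below may help. The file name, path, and format (a CSV read with pandas) are assumptions; list the data folder first and adapt the reader to the actual files.

```python
import pandas as pd

# Hypothetical example of reading a dataset file from the data/ folder.
# The file name and CSV format are assumptions; adjust the path and the
# reader (e.g. read_parquet, read_json) to match the real files.
df = pd.read_csv("data/HomoRich.csv")
print(df.shape)
print(df.head())
```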
TODO
The dataset was used to improve the G2P systems evaluated in the accompanying paper; see Table 3 there for full metrics.
TODO: citation to paper arxiv
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.