HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:
Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below:
*Figures: distribution of data sources in the HomoRich dataset; the source for different parts of the HomoRich dataset.*
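To make the balance claim concrete, here is a minimal sketch of how per-variant counts could be checked. The records, the field names `homograph` and `pronunciation`, and the helper functions are all hypothetical placeholders for illustration; they are not taken from the dataset's actual schema.

```python
from collections import Counter

# Toy records standing in for dataset rows. The keys "homograph" and
# "pronunciation" are assumed names, not the dataset's real columns.
records = [
    {"homograph": "word_a", "pronunciation": "variant_1"},
    {"homograph": "word_a", "pronunciation": "variant_1"},
    {"homograph": "word_a", "pronunciation": "variant_2"},
    {"homograph": "word_a", "pronunciation": "variant_2"},
    {"homograph": "word_b", "pronunciation": "variant_1"},
    {"homograph": "word_b", "pronunciation": "variant_1"},
    {"homograph": "word_b", "pronunciation": "variant_1"},
    {"homograph": "word_b", "pronunciation": "variant_2"},
]


def variant_counts(rows, word_key="homograph", variant_key="pronunciation"):
    """Count how many rows exist for each (word, variant) pair."""
    return Counter((row[word_key], row[variant_key]) for row in rows)


def imbalance_ratio(counts):
    """For each word, divide its largest variant count by its smallest."""
    per_word = {}
    for (word, _variant), n in counts.items():
        lo, hi = per_word.get(word, (n, n))
        per_word[word] = (min(lo, n), max(hi, n))
    return {word: hi / lo for word, (lo, hi) in per_word.items()}


counts = variant_counts(records)
print(imbalance_ratio(counts))
```

A ratio of 1.0 means every variant of a word has the same number of samples; larger values indicate that one pronunciation dominates. In this toy data, `word_a` is balanced and `word_b` is not.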
Persian G2P systems use two common phoneme formats:
The HomoRich dataset includes both formats for broad compatibility. Below is a visual comparison:
*Figures: Repr. 1; Repr. 2.*
The dataset is available both on Hugging Face and in this repository:
Option 1: From Hugging Face
```python
from datasets import load_dataset

dataset = load_dataset("MahtaFetrat/HomoRich")
```
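After loading, it can help to check which splits and columns are present before writing any processing code. The sketch below assumes `load_dataset` returns a `DatasetDict` (its default when no `split` argument is given); the helper names `summarize_splits` and `load_and_summarize` are hypothetical, and no split or column names are assumed.

```python
def summarize_splits(dataset_dict):
    """Map each split name to its row count and column names."""
    return {
        name: {"rows": split.num_rows, "columns": list(split.column_names)}
        for name, split in dataset_dict.items()
    }


def load_and_summarize(repo_id="MahtaFetrat/HomoRich"):
    """Download the dataset from Hugging Face and summarize its splits."""
    # Imported lazily so summarize_splits can be used without the library.
    from datasets import load_dataset

    return summarize_splits(load_dataset(repo_id))
```

Calling `print(load_and_summarize())` downloads the dataset on first use, so it needs network access and the `datasets` package.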
Option 2: From this repository
You can access the dataset files directly from the data folder of this repository.
TODO  
The dataset was used to improve:
See paper Table 3 for full metrics.
The scripts folder contains two key notebooks used in the dataset creation and processing pipeline:
Generate\_Homograph\_Sentences.ipynb: This notebook implements the prompt templates used to generate homograph-focused sentences as described in the paper, Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
Phonemize\_Sentences.ipynb: This notebook applies the phonemization process based on the LLM-powered G2P method detailed in the LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study paper.
TODO: citation to paper arxiv
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.