| @@ -1,7 +1,7 @@ | |||
| # HomoRich: A Persian Homograph Dataset for G2P Conversion | |||
| HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper: | |||
| **"Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models"**. | |||
| *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](TODO)*. | |||
| ## Overview | |||
| The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below: | |||
| @@ -49,6 +49,7 @@ The HomoRich dataset includes both formats for broad compatibility. Below is a v | |||
| </div> | |||
| --- | |||
| ## Usage | |||
| @@ -80,6 +81,16 @@ See [paper Table 3](#) for full metrics. | |||
| --- | |||
| ### Dataset Creation and Processing | |||
| The `scripts` folder contains two key notebooks used in the dataset creation and processing pipeline: | |||
| 1. `Generate_Homograph_Sentences.ipynb`: This notebook implements the prompt templates used to generate homograph-focused sentences, as described in the paper *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](TODO)*. | |||
| 2. `Phonemize_Sentences.ipynb`: This notebook phonemizes the sentences using the LLM-powered G2P method detailed in the paper *[LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study](TODO)*. | |||
| --- | |||
| ## License | |||
| - **Dataset**: Released under **CC0-1.0** (public domain). | |||
| - **Code/Models**: **MIT License** (where applicable). | |||