|
|
@@ -1,7 +1,7 @@ |
|
|
|
# HomoRich: A Persian Homograph Dataset for G2P Conversion |
|
|
|
|
|
|
|
HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper: |
|
|
|
**"Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models"**. |
|
|
|
*[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](TODO)*. |
|
|
|
|
|
|
|
## Overview |
|
|
|
The dataset contains 528,891 annotated Persian sentences, 327,475 of them homograph-focused, covering 285 homograph words with 2 to 4 pronunciation variants each. Each variant is represented by roughly 500 samples so that no single pronunciation dominates. The composition blends multiple sources for diversity, as shown below: |
|
|
@@ -49,6 +49,7 @@ The HomoRich dataset includes both formats for broad compatibility. Below is a v |
|
|
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
## Usage |
|
|
@@ -80,6 +81,16 @@ See [paper Table 3](#) for full metrics. |
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
### Dataset Creation and Processing |
|
|
|
|
|
|
|
The `scripts` folder contains two key notebooks used in the dataset creation and processing pipeline: |
|
|
|
|
|
|
|
1. `Generate_Homograph_Sentences.ipynb`: This notebook implements the prompt templates used to generate homograph-focused sentences, as described in the paper *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](TODO)*. |
|
|
|
|
|
|
|
2. `Phonemize_Sentences.ipynb`: This notebook applies the phonemization process based on the LLM-powered G2P method detailed in the paper *[LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study](TODO)*. |
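
As a rough illustration of what the sentence-generation step might look like, the sketch below fills a prompt template once per pronunciation variant of a homograph. The template wording, the `build_prompt` helper, and the example transliterations are all hypothetical; they are not taken from the notebooks, and the transliterations are not the dataset's phoneme notation.

```python
from string import Template

# Hypothetical template: the actual prompt wording lives in
# Generate_Homograph_Sentences.ipynb and may differ substantially.
PROMPT_TEMPLATE = Template(
    "Write $n distinct Persian sentences that use the word '$word' "
    "with the pronunciation /$phonemes/ (meaning: $gloss). "
    "Return one sentence per line."
)


def build_prompt(word: str, phonemes: str, gloss: str, n: int = 5) -> str:
    """Fill the template for one pronunciation variant of a homograph."""
    return PROMPT_TEMPLATE.substitute(word=word, phonemes=phonemes, gloss=gloss, n=n)


# Illustrative variants of a single homograph. The transliterations are
# approximate and only serve to show one prompt per variant.
variants = [("kerm", "worm"), ("kerem", "cream"), ("karam", "generosity")]
prompts = [build_prompt("کرم", phonemes, gloss) for phonemes, gloss in variants]

for prompt in prompts:
    print(prompt)
```

Requesting the same number of sentences for every variant is one simple way to keep the variants of a homograph evenly represented; the actual pipeline may balance them differently.
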
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
## License |
|
|
|
- **Dataset**: Released under **CC0-1.0** (public domain). |
|
|
|
- **Code/Models**: **MIT License** (where applicable). |