Update README.md

main
Mahta Fetrat 2 weeks ago
parent
commit
eddb4e5513
1 changed files with 12 additions and 1 deletions
README.md

@@ -1,7 +1,7 @@
# HomoRich: A Persian Homograph Dataset for G2P Conversion

HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:
-**"Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models"**.
+*[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](TODO)*.

## Overview
The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below:
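
As a rough sanity check on the figures above, the short Python sketch below derives how many pronunciation variants the homograph-focused subset implies. It assumes every variant has exactly 500 samples, whereas the README only says approximately 500, so the result is an estimate and not an official statistic.

```python
# Figures quoted in the Overview paragraph.
homograph_sentences = 327_475
homograph_words = 285
samples_per_variant = 500  # assumption: treats "~500" as exactly 500

# Implied number of pronunciation variants across the whole dataset.
implied_variants = homograph_sentences / samples_per_variant

# Implied average number of variants per homograph word.
variants_per_word = implied_variants / homograph_words

print(round(implied_variants), round(variants_per_word, 2))
```

Under that assumption the average comes out a little above two variants per word, which is consistent with the stated range of 2-4 variants.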
@@ -49,6 +49,7 @@ The HomoRich dataset includes both formats for broad compatibility. Below is a v

</div>


---

## Usage
@@ -80,6 +81,16 @@ See [paper Table 3](#) for full metrics.

---

### Dataset Creation and Processing

The `scripts` folder contains two key notebooks used in the dataset creation and processing pipeline:

1. `Generate_Homograph_Sentences.ipynb`: Implements the prompt templates used to generate homograph-focused sentences, as described in the paper *[Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models](TODO)*.

2. `Phonemize_Sentences.ipynb`: Phonemizes the sentences using the LLM-powered G2P method detailed in the paper *[LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study](TODO)*.
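
To give a feel for what a prompt template for homograph-focused sentences might look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `build_prompt`, its parameters, the prompt wording, and the example word are hypothetical and are not taken from the notebook or the paper.

```python
# Hypothetical prompt builder; the wording is illustrative only and is
# NOT the template used in Generate_Homograph_Sentences.ipynb.
def build_prompt(homograph: str, pronunciation: str, meaning: str,
                 n_sentences: int = 5) -> str:
    """Return an LLM prompt asking for sentences that use one specific
    pronunciation variant of a homograph."""
    return (
        f"Write {n_sentences} distinct Persian sentences that use the word "
        f"'{homograph}' pronounced as /{pronunciation}/ (meaning: {meaning}). "
        "Return one sentence per line with no numbering."
    )


# Illustrative input: one written form, one chosen pronunciation variant.
prompt = build_prompt("کرم", "kerm", "worm", n_sentences=3)
print(prompt)
```

Building one prompt per pronunciation variant, rather than one per written word, is one way to keep the number of generated samples balanced across variants.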

---

## License
- **Dataset**: Released under **CC0-1.0** (public domain).
- **Code/Models**: **MIT License** (where applicable).
