HomoRich: The first large-scale Persian homograph dataset for G2P conversion, featuring 528K annotated sentences with balanced pronunciation variants and dual phoneme representations.
Mahta Fetrat 3fbe198366

HomoRich: A Persian Homograph Dataset for G2P Conversion

HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:
Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.

Overview

The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below:

[Figure: Distribution of data sources in the HomoRich dataset, showing the source of each part of the data]
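The balance of pronunciation variants described above can be checked once the data is loaded. The following is a minimal sketch, assuming each record exposes the homograph and its pronunciation under the hypothetical keys `homograph` and `pronunciation`; the actual column names may differ, so check the files in the data folder first. Toy records stand in for real rows.

```python
from collections import Counter, defaultdict


def variant_counts(records, word_key="homograph", variant_key="pronunciation"):
    """Count samples per pronunciation variant for each homograph.

    `word_key` and `variant_key` are hypothetical field names, not
    confirmed column names of HomoRich.
    """
    counts = defaultdict(Counter)
    for row in records:
        counts[row[word_key]][row[variant_key]] += 1
    return counts


def imbalance_ratio(variant_counter):
    """Most frequent variant count divided by least frequent (1.0 = balanced)."""
    values = variant_counter.values()
    return max(values) / min(values)


# Toy records standing in for dataset rows (placeholder strings, not real data).
toy_records = [
    {"homograph": "word_a", "pronunciation": "variant_1"},
    {"homograph": "word_a", "pronunciation": "variant_2"},
    {"homograph": "word_a", "pronunciation": "variant_1"},
    {"homograph": "word_b", "pronunciation": "variant_1"},
    {"homograph": "word_b", "pronunciation": "variant_2"},
]

counts = variant_counts(toy_records)
for word, counter in counts.items():
    print(word, dict(counter), imbalance_ratio(counter))
```

A ratio close to 1.0 for every homograph would be consistent with the roughly equal representation stated above.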

Phoneme Representations:

Persian G2P systems use two common phoneme formats:

  • Repr. 1: Used in KaamelDict and SentenceBench (compatible with prior studies)
  • Repr. 2: Adopted by GE2PE (state-of-the-art model enhanced in this work)

The HomoRich dataset includes both formats for broad compatibility. Below is a visual comparison:

[Figures: Repr. 1 and Repr. 2]


Usage

Loading the Dataset

The dataset is available both on Hugging Face and in this repository:

Option 1: From Hugging Face

from datasets import load_dataset

# Placeholder repository ID; to be updated.
dataset = load_dataset("AnonymousOwner/HomoRich")

Option 2: From this repository

You can access the dataset files directly from the data folder of this repository.
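This README does not state the file format used in the data folder, so it can help to inspect the folder before choosing a loader. The sketch below only lists file extensions under a directory; the helper name `summarize_data_dir` is illustrative and not part of this repository.

```python
from collections import Counter
from pathlib import Path


def summarize_data_dir(data_dir):
    """Return a Counter of file extensions found under `data_dir` (recursive)."""
    root = Path(data_dir)
    return Counter(
        p.suffix.lower() or "<none>" for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    # "data" is the folder name used in this repository; run from the repo root.
    data_dir = Path("data")
    if data_dir.is_dir():
        for ext, n in sorted(summarize_data_dir(data_dir).items()):
            print(f"{ext}: {n} file(s)")
    else:
        print("data/ not found; clone the repository and run this from its root")
```

Once the format is known, the files can be read with whichever loader matches it.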

Example Use Case: Homograph Disambiguation

TODO  

Benchmarks

The dataset was used to improve:

  1. Homo-GE2PE (Neural T5-based model): 76.89% homograph accuracy (29.72% improvement).
  2. HomoFast eSpeak (Rule-based): 74.53% accuracy with real-time performance (30.66% improvement).

See paper Table 3 for full metrics.
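For readers who want to score their own model, the sketch below shows one plausible reading of homograph accuracy: the fraction of test sentences in which the predicted pronunciation of the homograph equals the reference. This definition is an assumption made for illustration; the paper defines the metric actually reported. The phoneme strings are placeholders, not HomoRich annotations.

```python
def homograph_accuracy(predictions, references):
    """Fraction of examples whose predicted homograph pronunciation matches.

    Both arguments are equal-length sequences of phoneme strings for the
    homograph word only. Illustrative definition, not necessarily the exact
    metric reported in the paper.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Placeholder phoneme strings, not real annotations.
refs = ["variant_1", "variant_2", "variant_1", "variant_2"]
preds = ["variant_1", "variant_1", "variant_1", "variant_2"]
print(f"{homograph_accuracy(preds, refs):.2%}")  # prints 75.00%
```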


Dataset Creation and Processing

The scripts folder contains two key notebooks used in the dataset creation and processing pipeline:

  1. Generate_Homograph_Sentences.ipynb: This notebook implements the prompt templates used to generate homograph-focused sentences, as described in the paper Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.

  2. Phonemize_Sentences.ipynb: This notebook phonemizes the sentences using the LLM-powered G2P method detailed in the paper LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study.


License

  • Dataset: Released under CC0-1.0 (public domain).
  • Code/Models: MIT License (where applicable).

Citation

TODO: citation to paper arxiv

Contributions

Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.


  • Paper PDF (TODO: link to paper)
  • HomoFast eSpeak NG (TODO: link to repo)
  • Homo-GE2PE Model (TODO: link to repo)