A freely licensed Persian TTS dataset with more than 6 hours of audio-text pairs, each labeled with its subject.

# GPTInformal-Persian-Speech-Dataset

GPTInformal Persian is a freely licensed Persian dataset of audio-text pairs designed for speech synthesis and other speech-related tasks. The dataset was collected, processed, and annotated as part of the Mana-TTS project. For details on the data processing pipeline and statistics on this dataset, please refer to the paper in the Citation section.

## Data Source

The text for this dataset was generated using GPT-4o, with prompts covering a wide range of subjects such as politics and nature. The texts are intentionally written in informal Persian. The following prompt format was used to generate them:

> Please give me a very long text written in informal Persian. I want it to be mostly about [SUBJECT].

The generated texts were then recorded in a quiet environment. The audio and text files underwent forced alignment using [aeneas](https://github.com/readbeyond/aeneas), which split them into the smaller audio-text chunks presented in this dataset.

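
The chunking step can be illustrated with a minimal sketch that reads an aeneas JSON sync map and turns each fragment into a time range plus its text. This is not the authors' pipeline: the aeneas configuration they used is not stated here, and the helper names (`read_sync_map`, `to_sample_range`), the example sync map, and the 16000 Hz sample rate are all hypothetical.

```python
import json


def read_sync_map(sync_map_json):
    """Return (begin_sec, end_sec, text) tuples from an aeneas JSON sync map."""
    fragments = json.loads(sync_map_json)["fragments"]
    chunks = []
    for frag in fragments:
        begin = float(frag["begin"])  # aeneas writes times as decimal strings
        end = float(frag["end"])
        text = " ".join(frag["lines"]).strip()
        if end > begin and text:  # skip empty or zero-length fragments
            chunks.append((begin, end, text))
    return chunks


def to_sample_range(begin, end, samplerate):
    """Convert a time range in seconds to (start, stop) sample indices."""
    return int(round(begin * samplerate)), int(round(end * samplerate))


# Hypothetical sync map with placeholder text, for illustration only.
EXAMPLE_SYNC_MAP = """
{"fragments": [
  {"id": "f000001", "begin": "0.000", "end": "2.500", "lines": ["sentence one"], "children": []},
  {"id": "f000002", "begin": "2.500", "end": "5.000", "lines": ["sentence two"], "children": []},
  {"id": "f000003", "begin": "5.000", "end": "5.000", "lines": [""], "children": []}
]}
"""

chunks = read_sync_map(EXAMPLE_SYNC_MAP)
for begin, end, text in chunks:
    start, stop = to_sample_range(begin, end, 16000)  # assumed sample rate
    print(f"{text}: samples {start}..{stop}")
```

Each resulting sample range can then be used to slice the recorded waveform into one audio-text pair.
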
## Download

You can download the dataset from [this repository](https://huggingface.co/datasets/MahtaFetrat/GPTInformal-Persian).

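
As one possible way to fetch the data programmatically, the sketch below uses the Hugging Face `datasets` library. It assumes the library is installed and that the repository exposes a `train` split, which is not confirmed here; `load_gptinformal` and `dataset_url` are hypothetical helper names.

```python
REPO_ID = "MahtaFetrat/GPTInformal-Persian"


def dataset_url(repo_id=REPO_ID):
    """Return the Hugging Face page URL for the dataset repository."""
    return "https://huggingface.co/datasets/" + repo_id


def load_gptinformal(split="train"):
    """Download and load the dataset; the default split name is an assumption."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_gptinformal()` triggers the download, so it needs network access and enough disk space for the audio.
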
### Data Columns

Each Parquet file contains the following columns:

- **file name** (`string`): The unique identifier of the audio file.
- **transcript** (`string`): The ground-truth transcript of the audio.
- **duration** (`float64`): Duration of the audio file in seconds.
- **subject** (`string`): The subject used in the prompt to generate the original text file.
- **audio** (`sequence`): The actual audio data.
- **samplerate** (`float64`): The sample rate of the audio.

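
The columns above can be sanity-checked with a short sketch like the following. It assumes each row is available as a plain dictionary keyed by the column names listed, and that `audio` is a flat sequence of mono samples, which is not confirmed here. The rows, the 16000 Hz rate, the tolerance, and the helper names are hypothetical.

```python
from collections import defaultdict

# Expected Python types for the scalar columns listed above.
EXPECTED_COLUMNS = {
    "file name": str,
    "transcript": str,
    "duration": float,
    "subject": str,
    "samplerate": float,
}


def check_row(row, tolerance=0.05):
    """Check column types and that duration matches len(audio) / samplerate."""
    for name, expected_type in EXPECTED_COLUMNS.items():
        if not isinstance(row[name], expected_type):
            return False
    # Assumes `audio` holds one sample per element (mono).
    computed = len(row["audio"]) / row["samplerate"]
    return abs(computed - row["duration"]) <= tolerance


def seconds_by_subject(rows):
    """Sum the duration column per subject."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["subject"]] += row["duration"]
    return dict(totals)


# Hypothetical rows with silent audio, for illustration only.
rows = [
    {"file name": "a", "transcript": "t1", "duration": 1.0, "subject": "nature",
     "audio": [0.0] * 16000, "samplerate": 16000.0},
    {"file name": "b", "transcript": "t2", "duration": 0.5, "subject": "nature",
     "audio": [0.0] * 8000, "samplerate": 16000.0},
    {"file name": "c", "transcript": "t3", "duration": 2.0, "subject": "politics",
     "audio": [0.0] * 32000, "samplerate": 16000.0},
]

print(all(check_row(r) for r in rows))
print(seconds_by_subject(rows))
```
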
## Citation

If you use GPTInformal-Persian in your research or projects, please cite the following paper:

```bibtex
@article{fetrat2024manatts,
  title={ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages},
  author={Mahta Fetrat Qharabagh and Zahra Dehghanian and Hamid R. Rabiee},
  journal={arXiv preprint arXiv:2409.07259},
  year={2024},
}
```

## License

This dataset is released under the CC0-1.0 license. However, it should not be used to replicate or imitate the speaker’s voice for malicious purposes or unethical activities, including voice cloning with malicious intent.