ManaTTS is the largest open Persian speech dataset with 86+ hours of transcribed audio. Includes data collection pipeline and tools. Suitable for Persian text-to-speech models.

Mahta Fetrat 65b3279621 Add link to the crawling script on colab		1 year ago

ManaTTS-Persian-Speech-Dataset

ManaTTS is the largest publicly accessible single-speaker Persian corpus, comprising approximately 86 hours of audio at a sampling rate of 44.1 kHz. It is released under the open CC-0 license, which permits both educational and commercial use. The data was collected from the Nasl-e-Mana magazine and covers a wide range of topics and domains, making it suitable for training high-quality text-to-speech models. The dataset is accompanied by a fully transparent, open-source pipeline for data collection and processing, including tools for sentence tokenization, audio segmentation, and forced alignment.
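
As a quick sanity check after downloading, the sampling rate stated above can be verified per file. The sketch below is only an illustration and is not part of the released tooling. It assumes the audio is available as uncompressed PCM WAV files (an assumption; the distributed format may differ) and uses only Python's standard `wave` module. The names `describe_wav` and `has_expected_rate` are hypothetical.

```python
import wave

EXPECTED_RATE_HZ = 44_100  # sampling rate stated in the dataset description


def describe_wav(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        rate = wav_file.getframerate()
        frames = wav_file.getnframes()
    return rate, frames / rate


def has_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """Return True if the file's sampling rate matches the expected value."""
    rate, _ = describe_wav(path)
    return rate == expected
```

Summing the durations returned by `describe_wav` over all files gives a rough check against the stated total of approximately 86 hours.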

Dataset

The ManaTTS dataset can be downloaded from this link.

Raw Data Crawling

The raw data for this dataset was crawled from the Nasl-e-Mana magazine website. The crawling script is provided in this repository and is also available on Google Colab at this link.

Processing Pipeline

The following figure illustrates the overall processing pipeline used to create the ManaTTS dataset, including the preprocessing steps.

This pipeline is available as a Jupyter Notebook included in this repository. You can also run the notebook on Google Colab using this link.

To run the pipeline, follow these steps:

1. Set up the required environment (details in the notebook).
2. Place the raw audio and text files in a directory named raw.
3. Execute the cells in the notebook sequentially.
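
Before executing the notebook, it can help to confirm that every audio file in the raw directory has a matching text file. The following is a minimal sketch under stated assumptions: it pairs files by shared base name, and the extensions `.mp3` and `.txt` are hypothetical placeholders. The notebook itself defines the layout it actually expects, so adjust the arguments accordingly. The function name `find_unpaired` is also hypothetical.

```python
from pathlib import Path


def find_unpaired(raw_dir="raw", audio_ext=".mp3", text_ext=".txt"):
    """Return (audio_without_text, text_without_audio) as sorted lists of base names.

    Assumption: an audio file and its transcript share the same stem,
    e.g. raw/001.mp3 and raw/001.txt. The extensions are illustrative defaults.
    """
    raw = Path(raw_dir)
    audio_stems = {p.stem for p in raw.glob(f"*{audio_ext}")}
    text_stems = {p.stem for p in raw.glob(f"*{text_ext}")}
    # Set differences give the files that lack a counterpart.
    return sorted(audio_stems - text_stems), sorted(text_stems - audio_stems)
```

Calling `find_unpaired("raw")` returns two lists. If both are empty, every audio file has a transcript with the same base name and vice versa.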

Trained TTS Model

A text-to-speech (TTS) model has been trained on the ManaTTS dataset. The training code and some output samples are available in this repository.

Contributing

Contributions to this project are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request.

License

The ManaTTS dataset is released under the CC-0 1.0 license, while the processing pipeline is licensed under the MIT license.

Citation

If you use this dataset or the processing pipeline in your work, please cite the following paper:

(citation to be updated)
