Datasets
We’re building an open source, multi-language dataset of voices that anyone can use to train speech-enabled applications.
We believe that large, publicly available voice datasets will foster innovation and healthy commercial competition in machine-learning-based speech technology. Common Voice’s multi-language dataset is already the largest publicly available voice dataset of its kind, but it’s not the only one. Look to this page as a reference hub for other open source voice datasets and, as Common Voice continues to grow, a home for our release updates.
Download the Dataset
Select the desired language dataset and choose the version you wish to download.
What’s inside the Common Voice dataset?
Each entry in the dataset consists of a unique MP3 and a corresponding text file. Many of the 33,151 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 22,109 validated hours in 133 languages, but we’re always adding more voices and languages. Take a look at our Languages page to request a language or start contributing.
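As a rough illustration of how the clip-plus-text pairing and the optional demographic metadata might be used, here is a minimal Python sketch. It assumes the metadata arrives as a tab-separated file; the column names (`path`, `sentence`, `age`, `gender`, `accent`), the file names, and the sample rows are hypothetical and chosen for illustration only, so check the header of the file you actually download.

```python
import csv
import io

# Hypothetical column names; the real file's header may differ.
DEMOGRAPHIC_FIELDS = ("age", "gender", "accent")


def load_clips(tsv_text):
    """Parse tab-separated clip metadata into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def has_demographics(clip):
    """Return True if at least one demographic field is filled in."""
    return any((clip.get(field) or "").strip() for field in DEMOGRAPHIC_FIELDS)


# Made-up sample rows standing in for a downloaded metadata file.
SAMPLE = (
    "path\tsentence\tage\tgender\taccent\n"
    "clip_0001.mp3\tHello world.\ttwenties\tfemale\t\n"
    "clip_0002.mp3\tGood morning.\t\t\t\n"
)

clips = load_clips(SAMPLE)
tagged = [clip for clip in clips if has_demographics(clip)]
print(f"{len(tagged)} of {len(clips)} clips carry demographic metadata")
```

Filtering on these fields is one way to check how well a training subset covers different ages or accents before training on it.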
Download the Single Word Target Segment
This is a use-case-driven segment containing data to power spoken digit recognition and yes/no detection.
![](/dist/nvidia.db6d92ff2fc0c4b8.jpg)
NVIDIA NeMo™
NVIDIA NeMo™ is an open-source toolkit for researchers developing state-of-the-art conversational AI models.
![](/dist/deepspeech.fb7d3b38751dae90.png)
DeepSpeech
Mozilla’s open source voice recognition engine, DeepSpeech, can be used to build speech recognition applications. Read our GitHub overview or join the DeepSpeech Discourse to learn how to get started.
![](/dist/playbook.633d04116db46885.jpg)
Community Playbook
Find helpful guidance on the entire Common Voice journey, from localisation to dataset usage, as well as how to connect with our community.
![](/dist/librispeech.ef9e03462d9aa477.jpg)
LibriSpeech
LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech, derived from audiobooks from the LibriVox project.
![](/dist/ted.ad08f85d672f7049.jpg)
TED-LIUM Corpus
The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website.
![](/dist/voxforge.cd59bcf43f97f316.jpg)
VoxForge
VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines.
![](/dist/tatoeba.0f6f242259be43ce.jpg)
Tatoeba
Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.