Datasets

We’re building an open source, multi-language dataset of voices that anyone can use to train speech-enabled applications.

We believe that large, publicly available voice datasets will foster innovation and healthy commercial competition in machine-learning-based speech technology. Common Voice’s multi-language dataset is already the largest publicly available voice dataset of its kind, but it’s not the only one. Use this page as a reference hub for other open source voice datasets and, as Common Voice continues to grow, as a home for our release updates.

Download the Dataset

We’ve made some changes. Delta Segments contain only the clips added since the last release. Read more about this work.

Select the desired language dataset and choose the version you wish to download.

Validated Hours: 22,109
Recorded Hours: 33,151
Languages: 133

What’s inside the Common Voice dataset?

Each entry in the dataset consists of a unique MP3 and a corresponding text file. Many of the 33,151 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines. The dataset currently consists of 22,109 validated hours in 133 languages, but we’re always adding more voices and languages. Take a look at our Languages page to request a language or start contributing.
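As a rough illustration of how the demographic metadata might be used, the sketch below tallies clips by one metadata field. This page does not describe the file layout, so the tab-separated format with a header row and the column names `path`, `sentence`, and `age` are assumptions made for the example, as is the helper name `count_by_field`. Check the files in your download before relying on them.

```python
import csv
import io
from collections import Counter


def count_by_field(tsv_file, field):
    """Count rows per value of `field` in a tab-separated file with a header row.

    Rows where the field is missing or empty are counted as "unspecified".
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        value = (row.get(field) or "").strip()
        counts[value or "unspecified"] += 1
    return counts


# Hypothetical sample data; the column names are assumptions, not the real schema.
sample = io.StringIO(
    "path\tsentence\tage\n"
    "clip_0001.mp3\tHello world\ttwenties\n"
    "clip_0002.mp3\tGood morning\t\n"
    "clip_0003.mp3\tSee you soon\ttwenties\n"
)

age_counts = count_by_field(sample, "age")
for value, n in age_counts.most_common():
    print(f"{value}: {n}")
```

For a real download, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) in place of the in-memory sample.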

Download the Single Word Target Segment

This is a use-case-driven segment containing data to power spoken digit recognition and yes/no detection.

Validated Hours: 83
Recorded Hours: 142
Languages: 34
Enter Email to Download

Why an email? We may need to contact you in the future about changes to the dataset, and an email address gives us a point of contact.

sha256 checksum:

f96d00524f8859a8bd154eb98822ec06a9d451e5902a419deb5e1472f1d2c3ff
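One way to check a downloaded archive against the checksum above is to hash the file locally and compare the hex digests. The sketch below uses Python's standard `hashlib`; the function names and the chunk size are illustrative choices, and the path to the archive is whatever name your download was saved under.

```python
import hashlib

# Checksum published on this page for the Single Word Target Segment.
EXPECTED_SHA256 = "f96d00524f8859a8bd154eb98822ec06a9d451e5902a419deb5e1472f1d2c3ff"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.strip().lower()
```

Calling `matches_checksum` with the path to the downloaded archive returns `True` only if the file is byte-for-byte identical to the published one. A `False` result usually means the download was interrupted or the file belongs to a different release.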

NVIDIA NeMo

NVIDIA NeMo™ is an open-source toolkit for researchers developing state-of-the-art conversational AI models.

DeepSpeech

Mozilla’s open source speech recognition engine, DeepSpeech, can be used to build speech recognition applications. Read our GitHub overview or join the DeepSpeech Discourse to learn how to get started.

Coqui

Coqui is dedicated to open speech technology. Their projects include deep-learning-based STT and TTS engines.

Community Playbook

Find helpful guidance on the entire Common Voice journey, from localisation to dataset usage, as well as how to connect with our community.

LibriSpeech

LibriSpeech is a corpus of approximately 1,000 hours of read English speech, sampled at 16 kHz and derived from audiobooks from the LibriVox project.

TED-LIUM Corpus

The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website.

VoxForge

VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines.

Tatoeba

Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.