High-quality, multilingual audio corpora designed for ASR, TTS, speaker recognition, and voice AI research. Ready to download, production-tested.
Carefully curated, annotated, and verified speech datasets spanning 10 languages — ready for your next model.
Over 1,000 hours of read audiobook speech derived from LibriVox recordings. Clean and other-conditions splits included, with full text transcriptions aligned at sentence level.
A large-scale broadcast news and conversational corpus in Modern Standard Arabic. Includes dialectal varieties from Gulf, Levantine, and Egyptian regions with rich phonetic annotation.
Professional Japanese TTS corpus featuring 100 unique female speakers with phonetically balanced sentences. Recorded in studio conditions at 24 kHz sampling rate with pitch and energy labels.
Spontaneous and scripted Korean speech collected from 500 native speakers across age groups. Annotated with morpheme-level segmentation and emotion labels, ideal for dialog systems.
Broad coverage of Spanish spoken across Latin America and Iberia. Includes spontaneous dialogue, news narration, and read speech with accent metadata for 12 regional variants.
High-fidelity Mandarin corpus recorded with professional microphones in a soundproof studio. Features tonal annotations, character-level alignments, and speaker demographic metadata.
Extended French speech corpus combining validated Common Voice contributions with broadcast data. Covers metropolitan France, Quebec, and Belgian French accents with age group segmentation.
Dual-channel German corpus covering both clean studio and real telephone conditions. Optimized for ASR robustness training with noise-augmented variants at multiple SNR levels.
Comprehensive Hindi corpus with scripted news, casual conversations, and command-style utterances. Features Devanagari transcriptions and code-switching examples with English vocabulary.
Studio-quality Brazilian Portuguese corpus recorded by 80 professional voice artists. Includes expressive speech styles (neutral, happy, sad, surprised) with prosody annotations for TTS training.
Built by researchers, for researchers — every dataset ships with what you actually need to train production-grade models.
Every recording passes a three-stage quality pipeline: automatic SNR filtering, human transcription review, and alignment verification to ensure clean, accurate labels.
Go beyond English with deeply curated datasets in 10 languages, including tonal, agglutinative, and right-to-left scripts — each with native-speaker validation.
Download via direct HTTPS links, S3-compatible buckets, or our CLI tool. Incremental downloads and resumable transfers are supported out of the box.
Every dataset comes with a clearly stated license (CC-BY, CC0, or custom). No hidden restrictions — use in commercial products, cloud APIs, and academic publications.
Speaker demographics, recording conditions, noise levels, word-level timestamps, and emotion labels are included where relevant — so your models generalise better.
Datasets are versioned and updated quarterly with new speakers, corrected transcripts, and expanded vocabulary. Subscribers receive automatic change notifications.
Pre-formatted manifests for ESPnet, Hugging Face Datasets, Kaldi, NeMo, and wav2letter are bundled with every download — zero preprocessing required.
Direct access to our data-engineering team via Slack and email. We answer questions about corpus design, sampling strategy, and benchmark splits within 24 hours.
From download to your first trained model in six straightforward steps.
Browse the catalogue by language, size, voice type, or use-case tag (ASR, TTS, speaker ID). Filter by license if you have commercial requirements.
Download via the web UI, our CLI (voicedata pull <dataset-id>), or an S3 bucket. Each archive ships with an MD5 manifest for integrity checks.
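If you prefer to verify archives yourself rather than rely on the CLI, the MD5 manifest can be checked with a few lines of Python. This is a minimal sketch assuming a standard md5sum-style manifest format (one "hash  filename" entry per line); the exact manifest layout for a given dataset is documented in its README.

```python
import hashlib
from pathlib import Path

def verify_manifest(manifest_path, data_dir):
    """Check each 'hash  filename' line against the file on disk.

    Returns the list of filenames whose MD5 digest does not match
    (an empty list means the archive passed the integrity check).
    """
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        digest = hashlib.md5(Path(data_dir, name).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(name)
    return failures
```

Running this after an interrupted or resumed transfer is a quick way to confirm that every shard arrived intact before you start training.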
Use the included Python helper scripts or load directly with datasets.load_dataset(). Standard train / dev / test splits are pre-defined and reproducible.
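For pipelines that avoid third-party dependencies, a split manifest can also be read with the standard library alone. The sketch below assumes a hypothetical JSON-lines manifest layout (one record per line with at least "audio_path" and "text" fields, one file per split named train.jsonl / dev.jsonl / test.jsonl); check the dataset's README for the actual field names it ships with.

```python
import json
from pathlib import Path

def load_split(dataset_dir, split):
    """Read one JSON-lines split manifest into a list of record dicts.

    Assumes each non-empty line is a JSON object describing one
    utterance (e.g. audio_path, text, speaker_id).
    """
    records = []
    manifest = Path(dataset_dir) / f"{split}.jsonl"
    with open(manifest, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records
```

Because the splits are pre-defined, loading "train", "dev", and "test" this way yields the same partition on every machine, which keeps reported numbers reproducible.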
Plug the data into your favourite framework. Configuration examples for Whisper, Wav2Vec 2.0, Tacotron 2, and VITS are available in our documentation.
Measure WER, CER, or MOS using the evaluation scripts bundled with each dataset. Compare against published baselines listed in the dataset card.
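As a reference for what the bundled evaluation scripts compute, word error rate is the token-level Levenshtein distance between reference and hypothesis, normalised by reference length. A self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, via Levenshtein
    distance over whitespace-separated tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (r != h))    # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)
```

For example, wer("hello world", "hello") is 0.5 (one deletion against two reference words). CER is the same computation over characters instead of words.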
Export your model in ONNX, TorchScript, or TFLite. Our deployment guide covers cloud, edge, and on-device inference scenarios with latency benchmarks.
Join thousands of AI engineers and academics who rely on our speech corpora.
"The English LibriSpeech Pro dataset cut our ASR model's WER by 12% compared to what we had before. The clean transcripts and pre-built Kaldi manifests saved us at least two weeks of data-prep work."
"We built a multilingual virtual assistant covering Arabic and Hindi. Having both corpora from the same provider with consistent metadata schema made cross-lingual alignment trivially easy."
"The JVS-Pro Japanese TTS corpus is outstanding quality — studio-grade recordings with proper prosody labels. Our synthetic voice passed A/B testing against real speakers in under a month of training."
"HispanicVoice covers every major Spanish dialect we needed for our Latin American product launch. Support team responded within hours when I had questions about accent metadata. Highly recommend."
"ThorSpeech's noise-augmented German telephony data was exactly what we needed for a call-center ASR engine. The SNR metadata let us build curriculum learning pipelines with zero extra labelling."
"As a PhD student on a tight budget, the CC0 license and academic pricing made it possible to use AISHELL-Pro legally in my research. The dataset quality exceeds what I've seen in many commercial offerings."
Need a custom dataset, volume licensing, or just have a question about our corpora? Send us a message and we'll get back to you within one business day.
Answers to the most common questions about AI speech datasets, licensing, formats, and use cases.
An AI speech dataset is a curated collection of audio recordings paired with transcriptions, speaker metadata, and linguistic annotations. They are used to train and evaluate automatic speech recognition (ASR) systems, text-to-speech (TTS) synthesisers, speaker identification models, and voice assistants. The quality and diversity of the dataset directly determine how well a model generalises to real-world speech.
For English ASR, our LibriSpeech Pro corpus (1,000+ hours, WAV/FLAC) is the go-to choice. It provides clean and noisy conditions, word-level alignments, and pre-built manifests for Whisper, Wav2Vec 2.0, and Kaldi. For conversational or call-center use cases, we also offer telephony-optimised variants — reach out via the contact form for details.
Yes. Every dataset on this platform ships with a clearly stated open or commercial-friendly license (CC0, CC-BY 4.0, or our Custom Commercial License). No hidden clauses. You can use the data to build and sell commercial products, cloud APIs, or embedded device applications. License details are visible on each dataset card and in the accompanying README.
We currently offer verified datasets for 10 languages: English, Arabic, Japanese, Korean, Spanish, Mandarin Chinese, French, German, Hindi, and Portuguese. We are actively expanding into low-resource languages including Swahili, Urdu, Vietnamese, and Turkish. If you need a specific language not yet listed, use our custom data request form.
Datasets are available in WAV (PCM 16-bit), FLAC (lossless), MP3, OGG, and OPUS formats. Sample rates range from 8 kHz (telephony) to 48 kHz (studio-grade), with 16 kHz being the standard for most ASR and TTS use cases. Format preferences can be selected at download time via our CLI tool.
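If you need to confirm the sample rate of downloaded audio before training, PCM WAV headers can be inspected with Python's standard-library wave module — a minimal sketch (this covers the WAV format only; FLAC, OGG, and OPUS need a decoder such as soundfile or ffprobe):

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, bit_depth) from a PCM WAV file header."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth() * 8
```

A quick scan with this helper catches accidental mixes of 8 kHz telephony and 16 kHz studio audio before they silently degrade a training run.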
Use our voicedata pull <dataset-id> CLI for resumable, parallel downloads with automatic MD5 checksum verification. Datasets are also accessible via S3-compatible endpoints (useful for cloud training on AWS, GCP, or Azure) and through the Hugging Face Hub with datasets.load_dataset(). Shard-level downloads let you start training before the full dataset is fetched.
Absolutely. Each dataset ships with ready-made configuration files and training scripts for Whisper, Wav2Vec 2.0, MMS, Conformer, and Tacotron 2 / VITS. The Hugging Face-compatible manifests make it especially easy to plug into the transformers Trainer API with just a few lines of code. Baseline WER / CER numbers are published in the dataset card for reproducibility.
ASR (Automatic Speech Recognition) datasets prioritise speaker diversity, acoustic variety, and accurate transcriptions — the goal is to teach a model to understand speech in many conditions. TTS (Text-to-Speech) datasets prioritise recording quality, phonetic balance, and consistent speaker style — the goal is to teach a model to generate natural-sounding speech. Many of our datasets include metadata that makes them suitable for both tasks.
Yes. Verified academic institutions, PhD students, and non-profit research organisations qualify for a 60% discount on all paid tiers. Several datasets are also available completely free under CC0 for academic use. Apply via the contact form with your institutional email address and a brief description of your research project.
Datasets are versioned using semantic versioning (e.g., v2.1.0) and updated quarterly with new speaker recordings, corrected transcripts, and expanded vocabulary coverage. All purchasers of a dataset tier receive update notifications by email and can re-download the latest version at no extra cost for the lifetime of their license.