⚡ Premium AI Training Data

Speech Datasets for AI Training

High-quality, multilingual audio corpora designed for ASR, TTS, speaker recognition, and voice AI research. Ready to download, production-tested.

10+ Languages
50 TB+ Audio Data
1,200+ Hours Recorded
4,000+ Researchers

Multilingual Speech Corpora

Carefully curated, annotated, and verified speech datasets spanning 10 languages — ready for your next model.

🇺🇸 English Mixed
LibriSpeech Pro — English ASR Corpus

Over 1,000 hours of read audiobook speech derived from LibriVox recordings. Includes both the clean and the other-condition splits, with full text transcriptions aligned at the sentence level.

🗂 WAV, FLAC 💾 60 GB 🎙 1,000 hrs
🇸🇦 Arabic Mixed
ArabicVoice — Modern Standard Arabic

A large-scale broadcast news and conversational corpus in Modern Standard Arabic. It also includes Gulf, Levantine, and Egyptian dialectal varieties, all with rich phonetic annotation.

🗂 MP3, WAV 💾 42 GB 🎙 720 hrs
🇯🇵 Japanese Female
JVS-Pro — Japanese Voice Synthesis Pack

Professional TTS corpus featuring 100 unique female speakers with phonetically balanced sentences. Recorded in studio conditions at 24 kHz sampling rate with pitch and energy labels.

🗂 WAV 💾 18 GB 🎙 310 hrs
🇰🇷 Korean Mixed
KorSpeech — Korean Conversational ASR

Spontaneous and scripted Korean speech collected from 500 native speakers across age groups. Annotated with morpheme-level segmentation and emotion labels, ideal for dialog systems.

🗂 WAV, OPUS 💾 28 GB 🎙 480 hrs
🇪🇸 Spanish Mixed
HispanicVoice — Multilectal Spanish Corpus

Broad coverage of Spanish spoken across Latin America and Iberia. Includes spontaneous dialogue, news narration, and read speech with accent metadata for 12 regional variants.

🗂 FLAC, WAV 💾 55 GB 🎙 950 hrs
🇨🇳 Mandarin Male
AISHELL-Pro — Mandarin Read Speech

High-fidelity Mandarin corpus recorded with professional microphones in a soundproof studio. Features tonal annotations, character-level alignments, and speaker demographic metadata.

🗂 WAV 💾 74 GB 🎙 1,200 hrs
🇫🇷 French Mixed
FrenchVox — Common Voice Extended

Extended French speech corpus combining validated Common Voice contributions with broadcast data. Covers accents from metropolitan France, Quebec, and Belgium, segmented by age group.

🗂 MP3, WAV 💾 36 GB 🎙 630 hrs
🇩🇪 German Male
ThorSpeech — German Telephony & Studio

Dual-channel German corpus covering both clean studio and real telephone conditions. Optimized for ASR robustness training with noise-augmented variants at multiple SNR levels.

🗂 WAV, FLAC 💾 48 GB 🎙 820 hrs
🇮🇳 Hindi Mixed
IndoVoice — Hindi Broadcast & Casual Speech

Comprehensive Hindi corpus with scripted news, casual conversations, and command-style utterances. Features Devanagari transcriptions and examples of Hindi-English code-switching.

🗂 WAV, OGG 💾 31 GB 🎙 540 hrs
🇧🇷 Portuguese Female
PortaVoz — Brazilian Portuguese TTS Corpus

Studio-quality Brazilian Portuguese corpus recorded by 80 professional voice artists. Includes expressive speech styles (neutral, happy, sad, surprised) with prosody annotations for TTS training.

🗂 WAV 💾 22 GB 🎙 380 hrs

Benefits of Our Speech Datasets

Built by researchers, for researchers — every dataset ships with what you actually need to train production-grade models.

Verified & Quality-Checked

Every recording passes a three-stage quality pipeline: automatic SNR filtering, human transcription review, and alignment verification to ensure clean, accurate labels.
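
The first stage, automatic SNR filtering, can be approximated with a simple energy heuristic. The sketch below is not the production filter. It assumes mono audio as a sequence of floats, treats the quietest 10% of frames as noise and the loudest 10% as speech, and uses hypothetical names (`estimate_snr_db`, `keep_recording`) and a hypothetical 20 dB threshold.

```python
import math


def frame_energies(samples, frame_len=400):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Crude SNR estimate: loudest frames versus quietest frames, in dB."""
    energies = sorted(frame_energies(samples, frame_len))
    if not energies:
        raise ValueError("need at least one full frame of audio")
    n_edge = max(1, int(len(energies) * noise_fraction))
    noise = sum(energies[:n_edge]) / n_edge      # quietest frames
    signal = sum(energies[-n_edge:]) / n_edge    # loudest frames
    if noise == 0:
        return math.inf  # digital silence in the noise frames
    return 10 * math.log10(signal / noise)


def keep_recording(samples, min_snr_db=20.0):
    """Hypothetical filter rule: keep recordings at or above the threshold."""
    return estimate_snr_db(samples) >= min_snr_db
```

A heuristic like this only flags obviously noisy recordings; it is a complement to the human review and alignment checks, not a replacement for them.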

🌍

True Multilingual Coverage

Go beyond English with deeply curated datasets in 10+ languages, covering tonal and agglutinative languages as well as right-to-left scripts, each validated by native speakers.

Instant Download Access

Download via direct HTTPS links, S3-compatible buckets, or our CLI tool. Incremental downloads and resumable transfers are supported out of the box.

🔒

Commercial-Friendly Licenses

Every dataset comes with a clearly stated license (CC-BY, CC0, or custom). No hidden restrictions — use in commercial products, cloud APIs, and academic publications.

📊

Rich Metadata & Annotations

Speaker demographics, recording conditions, noise levels, word-level timestamps, and emotion labels are included where relevant — so your models generalise better.

🔄

Regular Dataset Updates

Datasets are versioned and updated quarterly with new speakers, corrected transcripts, and expanded vocabulary. Subscribers receive automatic change notifications.

🛠

Framework-Ready Format

Pre-formatted manifests for ESPnet, Hugging Face Datasets, Kaldi, NeMo, and wav2letter are bundled with every download — zero preprocessing required.

💬

Dedicated Research Support

Direct access to our data-engineering team via Slack and email. We answer questions about corpus design, sampling strategy, and benchmark splits within 24 hours.

How to Use AI Speech Datasets

From choosing a dataset to production deployment in six straightforward steps.

🔍

Choose Your Dataset

Browse the catalogue by language, size, voice type, or use-case tag (ASR, TTS, speaker ID). Filter by license if you have commercial requirements.

📥

Download & Verify

Download via the web UI, our CLI (voicedata pull <dataset-id>), or an S3 bucket. Each archive ships with an MD5 manifest for integrity checks.
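
As an illustration of the integrity check, the sketch below verifies files against an MD5 manifest. The manifest layout is an assumption: it is taken to follow the common `md5sum` format (`<hex digest>  <relative path>` per line), and the function names are hypothetical, so adjust the parsing to whatever the archive actually ships.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    Assumes md5sum-style lines: "<hex digest>  <relative path>".
    """
    root = manifest_path.parent
    failures = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.strip().partition(" ")
        rel_path = rel_path.strip().lstrip("*")  # md5sum marks binary mode with "*"
        target = root / rel_path
        if not target.is_file() or md5_of_file(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list means every listed file is present and matches its recorded digest.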

🔧

Preprocess & Split

Use the included Python helper scripts or load the data directly with the Hugging Face datasets.load_dataset() function. Standard train / dev / test splits are pre-defined and reproducible.
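
If you need a custom split in addition to the shipped ones, one way to keep it reproducible is to derive it from a stable hash of the speaker ID, so that no speaker appears in more than one split. This is a minimal sketch under that assumption; `assign_split` and the percentages are hypothetical, and it does not describe how the pre-defined splits were built.

```python
import hashlib


def assign_split(speaker_id: str, dev_pct: int = 5, test_pct: int = 5) -> str:
    """Map a speaker ID to "train", "dev", or "test" deterministically."""
    # A cryptographic hash is stable across runs and machines,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"
```

For comparisons against published baselines, stick to the standard splits; a custom split like this is only useful for internal experiments.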

🧠

Train Your Model

Plug the data into your favourite framework. Configuration examples for Whisper, Wav2Vec 2.0, Tacotron 2, and VITS are available in our documentation.

📈

Evaluate & Benchmark

Measure word error rate (WER), character error rate (CER), or mean opinion score (MOS) using the evaluation scripts bundled with each dataset. Compare against published baselines listed in the dataset card.
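
For reference, WER and CER are both edit distances normalised by the reference length. The sketch below is a plain-Python illustration of that definition, not the bundled evaluation script; it applies no text normalisation (casing, punctuation), which real evaluations usually do first.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

MOS, by contrast, comes from human listening tests and cannot be computed from transcripts alone.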

🚀

Deploy to Production

Export your model in ONNX, TorchScript, or TFLite. Our deployment guide covers cloud, edge, and on-device inference scenarios with latency benchmarks.

What Researchers Are Saying

Join thousands of AI engineers and academics who rely on our speech corpora.

★★★★★

"The English LibriSpeech Pro dataset cut our ASR model's WER by 12% compared to what we had before. The clean transcripts and pre-built Kaldi manifests saved us at least two weeks of data-prep work."

JM
James Mercer
Senior ML Engineer, SpeakTech Labs
★★★★★

"We built a multilingual virtual assistant covering Arabic and Hindi. Having both corpora from the same provider with consistent metadata schema made cross-lingual alignment trivially easy."

AK
Aisha Khalid
NLP Research Lead, Orion AI
★★★★★

"The JVS-Pro Japanese TTS corpus is outstanding quality — studio-grade recordings with proper prosody labels. Our synthetic voice passed A/B testing against real speakers in under a month of training."

KT
Kenji Tanaka
Voice AI Researcher, TokyoSound Inc.
★★★★☆

"HispanicVoice covers every major Spanish dialect we needed for our Latin American product launch. Support team responded within hours when I had questions about accent metadata. Highly recommend."

LC
Lucía Castillo
AI Product Manager, Voz Digital
★★★★★

"ThorSpeech's noise-augmented German telephony data was exactly what we needed for a call-center ASR engine. The SNR metadata let us build curriculum learning pipelines with zero extra labelling."

FW
Felix Wagner
Principal Data Scientist, VoiceCore GmbH
★★★★★

"As a PhD student on a tight budget, the CC0 license and academic pricing made it possible to use AISHELL-Pro legally in my research. The dataset quality exceeds what I've seen in many commercial offerings."

ZL
Zhao Lin
PhD Candidate, Tsinghua University

Let's Talk Data

Need a custom dataset, volume licensing, or just have a question about our corpora? Send us a message and we'll get back to you within one business day.

📞 +1 (415) 800-VOICE
📍 340 Pine Street, San Francisco, CA 94104
🕐 Mon – Fri, 9:00 AM – 6:00 PM PT

Frequently Asked Questions

Answers to the most common questions about AI speech datasets, licensing, formats, and use cases.

What is an AI speech dataset and how is it used in machine learning?

An AI speech dataset is a curated collection of audio recordings paired with transcriptions, speaker metadata, and linguistic annotations. Such datasets are used to train and evaluate automatic speech recognition (ASR) systems, text-to-speech (TTS) synthesisers, speaker identification models, and voice assistants. The quality and diversity of the dataset directly determine how well a model generalises to real-world speech.

Which speech dataset is best for training an English ASR model?

For English ASR, our LibriSpeech Pro corpus (1,000+ hours, WAV/FLAC) is the go-to choice. It provides the clean and other-condition splits, aligned transcriptions, and pre-built Kaldi manifests, along with configuration examples for Whisper and Wav2Vec 2.0. For conversational or call-center use cases, we also offer telephony-optimised variants. Reach out via the contact form for details.

Are these datasets licensed for commercial use?

Yes. Every dataset on this platform ships with a clearly stated open or commercial-friendly license (CC0, CC-BY 4.0, or our Custom Commercial License). No hidden clauses. You can use the data to build and sell commercial products, cloud APIs, or embedded device applications. License details are visible on each dataset card and in the accompanying README.

What languages are available, and do you cover low-resource languages?

We currently offer verified datasets for 10 languages: English, Arabic, Japanese, Korean, Spanish, Mandarin Chinese, French, German, Hindi, and Portuguese. We are actively expanding into low-resource languages including Swahili, Urdu, Vietnamese, and Turkish. If you need a specific language not yet listed, use our custom data request form.

What audio formats and sample rates are supported?

Datasets are available in WAV (PCM 16-bit), FLAC (lossless), MP3, OGG, and OPUS formats. Sample rates range from 8 kHz (telephony) to 48 kHz (studio-grade). 16 kHz is the standard for most ASR use cases, while TTS corpora are typically recorded at 22.05 kHz or higher (JVS-Pro, for example, uses 24 kHz). Format preferences can be selected at download time via our CLI tool.
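
Before training, it can be worth confirming that a file really has the sample rate and bit depth you expect. A minimal sketch using Python's standard `wave` module follows; it only handles uncompressed PCM WAV files, and `describe_wav` is a hypothetical helper name. FLAC, MP3, OGG, and OPUS need a third-party decoder.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format information from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }
```

A 16 kHz, 16-bit mono file, for instance, should report `sample_rate_hz` of 16000, `bit_depth` of 16, and `channels` of 1.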

How do I download a large dataset efficiently?

Use our voicedata pull <dataset-id> CLI for resumable, parallel downloads with automatic MD5 checksum verification. Datasets are also accessible via S3-compatible endpoints (useful for cloud training on AWS, GCP, or Azure) and through the Hugging Face Hub with datasets.load_dataset(). Shard-level downloads let you start training before the full dataset is fetched.
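
If you fetch shards over plain HTTPS rather than through the CLI, resuming generally relies on the HTTP `Range` header. The sketch below shows the idea with the standard library only. It is not how the `voicedata` tool is implemented; `resume_download` is a hypothetical name, and the `opener` parameter exists only so the function can be exercised without a network connection.

```python
import os
import urllib.request


def resume_download(url: str, dest: str, opener=urllib.request.urlopen,
                    chunk_size: int = 1 << 20) -> int:
    """Download url to dest, continuing from any bytes already on disk.

    Returns the final size of dest in bytes.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url)
    if offset:
        # Ask the server for the remaining bytes only.
        request.add_header("Range", f"bytes={offset}-")
    with opener(request) as response:
        # 206 means the range was honoured; any other success status
        # (typically 200) means the full file is being sent, so start over.
        mode = "ab" if offset and response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return os.path.getsize(dest)
```

Note that a file which is already complete usually makes the server answer 416, which `urlopen` raises as an `HTTPError`; a fuller client would compare against the expected size first and verify the checksum afterwards.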

Can I use these datasets to fine-tune OpenAI Whisper or Meta Wav2Vec?

Absolutely. Each dataset ships with ready-made configuration files and training scripts for Whisper, Wav2Vec 2.0, MMS, Conformer, and Tacotron 2 / VITS. The Hugging Face-compatible manifests make it especially easy to plug into the transformers Trainer API with just a few lines of code. Baseline WER / CER numbers are published in the dataset card for reproducibility.

What is the difference between ASR datasets and TTS datasets?

ASR (Automatic Speech Recognition) datasets prioritise speaker diversity, acoustic variety, and accurate transcriptions — the goal is to teach a model to understand speech in many conditions. TTS (Text-to-Speech) datasets prioritise recording quality, phonetic balance, and consistent speaker style — the goal is to teach a model to generate natural-sounding speech. Many of our datasets include metadata that makes them suitable for both tasks.

Do you offer academic or student pricing?

Yes. Verified academic institutions, PhD students, and non-profit research organisations qualify for a 60% discount on all paid tiers. Several datasets are also available completely free under CC0, which places no restrictions on use, academic or otherwise. Apply via the contact form with your institutional email address and a brief description of your research project.

How often are datasets updated, and will I get the new version?

Datasets are versioned using semantic versioning (e.g., v2.1.0) and updated quarterly with new speaker recordings, corrected transcripts, and expanded vocabulary coverage. All purchasers of a dataset tier receive update notifications by email and can re-download the latest version at no extra cost for the lifetime of their license.
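
When scripting update checks, version tags should be compared numerically rather than as strings, because "v2.10.0" sorts before "v2.9.3" alphabetically. A minimal sketch, assuming plain `vMAJOR.MINOR.PATCH` tags without pre-release or build suffixes (the helper names are hypothetical):

```python
def parse_version(tag: str) -> tuple[int, int, int]:
    """Turn a tag such as "v2.1.0" into a comparable (major, minor, patch) tuple."""
    major, minor, patch = tag.removeprefix("v").split(".")
    return int(major), int(minor), int(patch)


def is_newer(candidate: str, installed: str) -> bool:
    """True if candidate is a later release than installed."""
    return parse_version(candidate) > parse_version(installed)
```

Under semantic versioning, a change in the first number signals a breaking change, so it is worth re-checking manifests and splits before swapping in a new major version.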