MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

🍇 Open-Source Compliant Speech Dataset List

| Name | License | Hours | Languages | Label |
|---|---|---|---|---|
| CommonVoice | CC 0 | 6,732 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | |
| CoVoST2 | CC 0 | 687 | en, fr, it, es, pt, et, nl, sv, lv, sl | |
| CSS10 | Public Domain | 99 | nl, fi, fr, de, el, hu, es | |
| EMU | CC BY 3.0 | 56 | pl | |
| EU Parliament | CC BY 4.0 | 32 | pl | |
| FLEURS | CC BY 4.0 | 215 | bg, cs, da, nl, en, et, fi, fr, de, el, hu, ga, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | |
| Large Corpus of Czech Parliament Plenary Hearings | CC BY 4.0 | 444 | cs | |
| LibriLight | Public Domain | 57,706 | en | |
| LibriTTS | CC BY 4.0 | 585 | en | |
| LibriSpeech | CC BY 4.0 | 360 | en | |
| LibriVoxDeEn | Public Domain | 547 | de | |
| MC Speech | CC 0 | 22 | pl | |
| Multilingual LibriSpeech | CC BY 4.0 | 50,687 | nl, en, fr, de, it, pl, pt, es | |
| SIWIS | CC BY 4.0 | 11 | fr | |
| Speech Commands | CC BY 4.0 | 18 | en | |
| VCTK | CC BY 4.0 | 44 | en | |
| VoxPopuli | CC 0 | 383,500 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | |
| | | 1,791 | hr, cs, nl, en, et, fi, fr, de, hu, it, lt, pl, ro, sk, sl, es | |
| YouTube-Commons | CC BY 4.0 | 3,261 | bg, cs, nl, en, et, fr, de, el, hu, it, pl, pt, ro, es | |
| | | 443,396 | bg, cs, nl, en, et, fi, fr, de, el, hu, it, lv, lt, pl, pt, ro, es, sv | |
| MOSEL 🍇 | CC BY 4.0 | 441,206 | bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv | |

For the languages, two-letter ISO 639 codes are used.
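
As a quick sanity check on the table, the Hours column can be tallied with a few lines of Python. The sketch below copies the figures from the table by hand and leaves out the MOSEL row, on the assumption (not stated in this README) that the MOSEL row does not contribute additional audio beyond the listed source corpora. The names `HOURS` and `total_hours` are hypothetical and only serve this illustration.

```python
# Hours per table row, copied from the dataset list above.
# Assumption: the MOSEL row is excluded because it is taken here
# not to add new audio on top of the source corpora.
HOURS = {
    "CommonVoice": 6_732,
    "CoVoST2": 687,
    "CSS10": 99,
    "EMU": 56,
    "EU Parliament": 32,
    "FLEURS": 215,
    "Large Corpus of Czech Parliament Plenary Hearings": 444,
    "LibriLight": 57_706,
    "LibriTTS": 585,
    "LibriSpeech": 360,
    "LibriVoxDeEn": 547,
    "MC Speech": 22,
    "Multilingual LibriSpeech": 50_687,
    "SIWIS": 11,
    "Speech Commands": 18,
    "VCTK": 44,
    "VoxPopuli (first row)": 383_500,
    "VoxPopuli (second row)": 1_791,
    "YouTube-Commons (first row)": 3_261,
    "YouTube-Commons (second row)": 443_396,
}


def total_hours(hours: dict[str, int]) -> int:
    """Sum the hours over all rows in the mapping."""
    return sum(hours.values())


print(f"Total: {total_hours(HOURS):,} hours")
```

Under that assumption the total comes out at roughly 950,000 hours, which lines up with the figure in the title.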

⚠️ Contribute and Report Issues

If you want to add an open-source compliant dataset to the list, please open a Pull Request. If you want to report an issue with existing content, please use the issues section.

🏁 Citation

If you use the MOSEL dataset, please cite:

@inproceedings{mosel,
  title = {{MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages}},
  author = {Marco Gaido and Sara Papi and Luisa Bentivogli and Alessio Brutti and Mauro Cettolo and Roberto Gretter and Marco Matassoni and Mohamed Nabih and Matteo Negri},
  booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2024",
  address = "Miami, United States",
  publisher = "Association for Computational Linguistics",
}