Mozilla Common Voice Adds New Languages
Written by Kay Ewbank   
Monday, 09 August 2021

The Mozilla Common Voice initiative has released a new, expanded data set featuring 16 new languages including Basaa and Kazakh, along with 4,622 new hours of speech.

Mozilla's Common Voice project aims to provide a free database of recordings of people speaking sample sentences. Contributors donate speech data to an open-source public dataset, which anyone can then use to train voice-enabled technology.


In addition to the Common Voice dataset, Mozilla is also building an open source speech recognition engine called DeepSpeech. Mozilla says that both projects are aimed at bridging the digital speech divide. Voice recognition technologies bring a human dimension to our devices, but developers need an enormous amount of voice data to build them.


Currently, most of that data is expensive and proprietary. Mozilla wants to make voice data freely and publicly available, and make sure the data represents the diversity of real people.

When a voice clip is added to the data set, it must be validated by two separate users before it is included. If a user rejects a voice clip, it returns to the Queue. If it is rejected a second time, the voice clip is moved to the Clip Graveyard.
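The validation rule described above can be sketched in a few lines of Python. This is only an illustrative model, not Common Voice's actual code: the `Clip` class, the `Status` names and the `review` function are hypothetical, and the sketch assumes that each user gets a single vote per clip.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    QUEUED = "queued"          # waiting for more votes
    VALIDATED = "validated"    # accepted into the data set
    GRAVEYARD = "graveyard"    # rejected twice


@dataclass
class Clip:
    clip_id: str
    approvers: set = field(default_factory=set)
    rejecters: set = field(default_factory=set)
    status: Status = Status.QUEUED


def review(clip: Clip, user: str, approve: bool) -> Status:
    """Record one user's vote and return the clip's resulting status."""
    if clip.status is not Status.QUEUED:
        return clip.status  # already decided, ignore further votes
    if user in clip.approvers or user in clip.rejecters:
        return clip.status  # assumption: one vote per user per clip
    (clip.approvers if approve else clip.rejecters).add(user)
    if len(clip.approvers) >= 2:
        clip.status = Status.VALIDATED
    elif len(clip.rejecters) >= 2:
        clip.status = Status.GRAVEYARD
    return clip.status  # otherwise the clip stays in the queue


clip = Clip("clip-001")
review(clip, "alice", approve=True)
review(clip, "bob", approve=True)
print(clip.status)
```

Keeping the approving and rejecting users in sets is one simple way to enforce the "two separate users" requirement, since a repeated vote from the same person does not change the count.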

This latest release adds 16 new languages to the Common Voice data set. The new languages are Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, and Hausa.

The top five languages by total hours of available voice recordings are English (2,630 hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and Esperanto (840). Kinyarwanda is an official language of Rwanda.

In terms of the languages that are growing the most, Thai has grown roughly twentyfold, from 12 hours to 250 hours, while Luganda has grown tenfold, from 8 hours to 80 hours. Esperanto has increased more than eightfold, from 100 hours to 840 hours, and Tamil has grown more than ninefold, from 24 hours to 220 hours.
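For readers who want to check the multiples, the growth factor is simply the new total divided by the old one. The short Python sketch below uses only the hour counts quoted above; the variable and function names are illustrative.

```python
# Hours of recordings (before, after) as quoted in the article.
hours = {
    "Thai": (12, 250),
    "Luganda": (8, 80),
    "Esperanto": (100, 840),
    "Tamil": (24, 220),
}


def growth_factor(before: float, after: float) -> float:
    """Return how many times larger the new total is than the old one."""
    return after / before


# Growth factor per language, rounded to one decimal place.
factors = {
    lang: round(growth_factor(before, after), 1)
    for lang, (before, after) in hours.items()
}

# List the languages from fastest to slowest growth.
for lang, factor in sorted(factors.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {factor}x")
```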


More Information

Common Voice Website

DeepSpeech On GitHub


 
