Help us Collect Sentences

In order to gather voice data in your language(s), we first need thousands of sentences for people to read. So we are asking you to submit as many sentences as possible in your language(s) so future contributors can read them on the Common Voice website.

There are two ways to get sentences:

  1. Writing your own sentences.
  2. Finding existing sentences in the Public Domain.

Writing your own sentences

We ask that you first try to write at least 50 sentences (which you can think of off the top of your head, or take from your blog posts, social media history, or perhaps text messages). Make sure you only share sentences that you have permission to share. Once you have sentences, you can submit them on the upload form.

Here are some criteria to help you write your own sentences:

  • Ideally it should take 5 seconds to read each sentence, and no more than 10 seconds. So aim for sentences around 5 to 10 words.
  • 1 or 2 word sentences are also OK, but not for all sentences! Try to have a mix of short and medium-sized sentences, and keep them all under 10 seconds.
  • Try to use as many different words as possible. This will help the machine to enrich its vocabulary.
  • Sentences with punctuation (e.g. ! or ?) are also great to have, but do not stress if you can’t think of any.
  • If your language uses any special characters (e.g. â, ü, ß, š), that’s great. Using those actually helps the machine to distinguish different sounds.
  • Try to include proper nouns (first names, street names, places, etc.).
  • Numbers are fine, but please spell out the number rather than writing the digits (e.g. “five hundred twenty-seven” rather than “527”).
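
If you are preparing a large batch of sentences, a small script can help you check them against the criteria above before you submit. The following is only a rough sketch in Python, not an official Common Voice tool. The names `check_sentence` and `vocabulary_size` and the `MAX_WORDS` limit of 10 are assumptions made for illustration, and counting words is only a loose stand-in for reading time.

```python
import re

# Assumed limit, loosely based on the "5 to 10 words" guideline above.
MAX_WORDS = 10


def check_sentence(sentence):
    """Return a list of problems found; an empty list means the sentence looks fine."""
    problems = []
    words = sentence.split()
    if not words:
        problems.append("empty sentence")
    if len(words) > MAX_WORDS:
        problems.append(
            f"too long: {len(words)} words (aim for {MAX_WORDS} or fewer)"
        )
    # \d matches digits in any script, not only 0-9.
    if re.search(r"\d", sentence):
        problems.append("contains digits: spell numbers out")
    return problems


def vocabulary_size(sentences):
    """Count distinct words (case-insensitive, surrounding punctuation stripped)."""
    vocabulary = set()
    for sentence in sentences:
        for word in sentence.split():
            cleaned = word.strip(".,!?;:\"'").lower()
            if cleaned:
                vocabulary.add(cleaned)
    return len(vocabulary)


if __name__ == "__main__":
    samples = [
        "I have 527 apples.",
        "Five hundred twenty-seven people live on Müller Street.",
    ]
    for sample in samples:
        print(sample, "->", check_sentence(sample) or "OK")
    print("distinct words:", vocabulary_size(samples))
```

A script like this cannot judge whether a sentence sounds natural or whether you have permission to share it, so treat it as a first pass and still read through your sentences yourself.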

Finding existing sentences in the Public Domain

Another way to find sentences is to search for them on the internet. Remember that we need permission to publish those sentences, so always ensure that the text belongs to the public domain. If there is no indication of this, reach out to the person who owns the text and ask whether you can use it. If you have any questions about this, or would like help reaching out to a data holder, please email Michael Henretty.

Once you have found a collection of sentences, you can submit them using our upload form. If you have too many sentences to paste into that form, you can submit a link to where those sentences are located. Alternatively, you can email the file directly to me at mhenretty@mozilla.com.

Here are some tips for finding sentences:

  • The best sources to look for are podcasts, transcripts, movie scripts, and anything else that could contain everyday conversations.
  • Government proceedings, books, and articles are also great. However, since the text tends to be a little more formal, they are less of a priority.
  • Unfortunately, we can’t accept Wikimedia articles yet, so please do not copy and paste from there.
  • Two great resources to look into are Common Crawl and Open Subtitles. If you find a similar collection in your local language, that’s great! Please share it with us on our Slack channel so we can distribute it to the rest of the volunteers.