How Reverie solves the Indian-language voice data challenge uniquely

Janajit Bagchi

I am Janajit Bagchi. I lead the Data Engineering function for Products and R&D at Reverie Language Technologies.

My team and I are responsible for supplying the high-quality data that forms the foundation of Reverie’s products and technologies.

I have been working with Reverie for the past three years and have led Data Identification and Data Engineering for technologies like Automatic Speech Recognition and Text-to-Speech.

We have been doing many great things in the field of Indian language technologies, and I thought I would share with you an important solution we have been working on.
So, here goes: 

Data has been a buzzword for a long time now. High-quality data can improve the quality of existing products and technologies immensely if used in the right way. 

Reverie is no exception; our products also rely heavily on data. Obtaining high-quality data in Indian languages is challenging, so in this blog I describe how Reverie overcomes that barrier.

I fondly remember what Holmes said in A Scandal in Bohemia:

“It is a capital mistake to theorize before one has data.”

This is exactly the predicament with Indian data repositories: either very little data is available in Indian languages, or there is none at all.

Over the next three years, an estimated 400 million Indians are expected to get their hands on a smartphone for the very first time. These new users speak one of India’s 22 official languages, or one of the many languages beyond them, most of which are under-represented in the mainstream. Voice-based technologies have therefore become imperative to cater to these hundreds of millions of new internet and smartphone users in the years ahead.

Given the complexity of existing text-based interfaces, voice is the only way to include these people in the digital transformation. Voice interaction in our mother tongue comes naturally to each one of us. 

Voice-based solutions have proven, time and again, to be the most inclusive and democratic way of bringing the masses into the digital revolution that India will experience in the years ahead.

Building data-driven speech patterns for voice-based technologies

Reverie is on a mission to work as a catalyst for this transformation and be the platform of choice for both voice and text-based Indian language support.

Now, let’s look at what stands between us and this billion-dollar Indian dream, and how we address the challenge at Reverie.

Almost all Indian languages are considered low-resource languages, i.e., there is little or no speech data available in them. We have experimented with different methods of plugging this void.

For example, we contacted recording studios to record speech, and we conducted physical data collection drives, but we could collect very little data in the process. Unfortunately, collecting high-quality labelled and annotated speech data in any language is extremely expensive and resource-intensive. This makes speech data collection the biggest bottleneck in developing local-language voice technologies.

Bridging the data challenge and beyond

Reverie’s crowdsourcing app, Voice Collector, solves this problem in a unique way. Our goal is to create a win-win situation: we collect the much-needed data from a diverse set of people across all Indian languages and, at the same time, generate supplementary income for the participants.

We ran numerous rounds of our data collection campaign with the support of volunteers, and the participants responded with high levels of interest and excitement.

Fig. 1: Screenshots from our Voice Collector app

The app interface is intuitive, and anyone can get started in no time. We have been able to collect thousands of hours of diverse speech data with rich coverage of the accents, dialects and nuances of our native tongues. With the support of volunteers, the collections were carried out in a controlled environment. Thousands of volunteers have contributed to this drive.

Linguistic pride came through strongly. Participants repeatedly expressed pride in their native tongue and in their ability to share their knowledge, as well as in the fact that their native language can earn them income.

In closing,

It is important to know that, as of 2014, 71% of rural Indians and 86% of urban Indians are literate in at least one Indian language. Going forward, we also want to include underserved communities, who are traditionally underrepresented in such datasets.

This will diversify the data by covering accents, intonations, and dialects across geographies and cultures, while continuing to generate income for participants. This is how Reverie is bridging the gap between data and no data, bringing more Indians together to contribute and build their own future.

If you’d like to learn more about how we are solving interesting Indian-language localisation challenges, connect with us at marketing@reverieinc.com.

About Reverie Language Technologies

At Reverie Language Technologies, we bring years of experience in localising platforms with superior localisation services. Our platform provides services such as local-language translation, transliteration, device input, and search through a set of APIs.

Our solutions ensure that every step of a user’s journey is localised, both online and offline, with end-to-end Indian-language localisation products such as:

Prabandhak: A unique cloud-based, AI-powered translation management hub that ensures swift, easy, and accurate localisation.

Anuvadak: A platform that accelerates the process of creating, launching, and optimising your website in multiple languages.

Voice Suite: Reverie’s voice solutions understand and transcribe speech in 11 Indian languages to communicate effectively with customers.

Swalekh: A multilingual Indic keyboard that helps users type and interact in the language of their choice.

Text Display Suite: A robust font suite comprising two unique solutions to make digital content more local-language friendly.

Business Ready APIs: Reverie’s APIs, such as the Transliteration API, Neural Machine Translation API, Multilingual Search API, Speech to Text API, Natural Understanding API, and Text to Speech API, assist in automating the translation and localisation process.
