Last updated on: August 23, 2017

Encoding Standards Challenges for IS 16350

Efforts to establish a digital presence for Indian languages have been under way for more than two decades, yet Indic content still accounts for less than 0.01% of the internet. Encoding standards, input mechanisms and normalization, together with the more recently introduced Zero Width Joiner (ZWJ) and Zero Width Non-Joiner (ZWNJ), all affect how Indian languages look and behave online.

The IS 16350 document, titled Enhanced Inscript Keyboard Layouts and published by the Bureau of Indian Standards, refers to ISCII, which was first proposed in 1991 to identify a minimum character set for Brahmi-based languages, quoting verbatim, “with an eye on collation, sorting, storage economy and other ancillary measures which ideally should be a part of all the encoding standards.”

For typing, INSCRIPT was standardized by the Department of Electronics (DoE) in 1986 and modified in 1988 to accommodate “Nukta” (़) as a character. Very few characters were added to ISCII’s proposed character set, because some extra characters could be created by adding Nukta to an existing consonant; for example, ड followed by Nukta gives ड़.
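The composition described above can be sketched in Python with the standard `unicodedata` module. This is a minimal illustration, not part of IS 16350 or ISCII; the helper name `with_nukta` is hypothetical. It also shows that Unicode treats the precomposed letter U+095C as canonically equivalent to ड followed by Nukta.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA (़)

def with_nukta(consonant: str) -> str:
    """Return the consonant followed by the combining Nukta sign."""
    return consonant + NUKTA

# ड, ज, फ: the three base consonants behind the nukta letters cited in this article.
for base in ("\u0921", "\u091c", "\u092b"):
    sequence = with_nukta(base)
    print(sequence, [unicodedata.name(ch) for ch in sequence])

# The precomposed letter U+095C normalizes to the two-code-point sequence,
# so both spellings compare equal after normalization.
precomposed = "\u095c"
print(unicodedata.normalize("NFC", precomposed) == with_nukta("\u0921"))
```

Normalizing text before comparison is one way to keep the two spellings from being treated as different strings when searching or sorting.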

Unicode later came into the picture to ensure that every character has a unique value assigned to it, irrespective of the platform.

The picture looks complete, and yet there are gaps and inconsistencies in the presence of Indian languages on digital media. Let’s look at some of them:

Missing character sets for Brahmi based languages

The 1991 ISCII proposal for characters covered Brahmi-based languages. Devanagari, which is derived from Brahmi, is itself the mother script for more than one Indian language: Hindi, Marathi, Nepali and Dogri, to name a few. Not every language written in Devanagari uses every character of the script. For example, ळ is specific to Marathi and not used elsewhere. Similarly, ॻ is specific to Sindhi, a language which can be written or typed using both Devanagari and Urdu letters. So although the minimum character set for Brahmi-based languages is known, separate character sets for the individual languages that use Devanagari are missing.
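A per-language character set could be used to flag characters that do not belong to the language being typed. The sketch below is only an illustration: the table `LANGUAGE_SPECIFIC` and the function `unexpected_chars` are hypothetical, and the table holds just the two characters named in this section rather than a real character inventory.

```python
# Hypothetical, deliberately tiny table built from the two examples above.
# A real table would have to come from language experts or a revised standard.
LANGUAGE_SPECIFIC = {
    "\u0933": {"marathi"},  # ळ
    "\u097b": {"sindhi"},   # ॻ
}

def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari block U+0900..U+097F."""
    return "\u0900" <= ch <= "\u097f"

def unexpected_chars(text: str, language: str) -> list[str]:
    """Return Devanagari characters in text that are reserved for other languages."""
    found = []
    for ch in text:
        if not is_devanagari(ch):
            continue
        allowed_in = LANGUAGE_SPECIFIC.get(ch)
        if allowed_in is not None and language not in allowed_in:
            found.append(ch)
    return found

word = "\u092b\u0933"  # फळ
print(unexpected_chars(word, "hindi"))    # ळ is flagged
print(unexpected_chars(word, "marathi"))  # nothing is flagged
```

An input tool could run such a check as the user types and warn about characters outside the selected language.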

Inconsistency with character set recommendation and language presence on the internet

Unicode and IS 16350 include characters which are archaic or whose roots are unknown. These characters are extraneous because no one uses them. At Reverie we have scraped data from legacy websites and hardly ever come across such characters.
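A frequency count over scraped text is one way to check which code points actually occur. The following is a minimal sketch of that idea, not Reverie's actual pipeline; the function names, the sample strings and the zero-occurrence threshold are all assumptions made for illustration.

```python
from collections import Counter

DEVANAGARI = range(0x0900, 0x0980)  # the Devanagari block, U+0900..U+097F

def devanagari_counts(texts):
    """Count how often each Devanagari code point occurs across the texts."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ord(ch) in DEVANAGARI)
    return counts

def rare_code_points(counts, threshold=0):
    """List code points in the block seen at most `threshold` times."""
    return [f"U+{cp:04X}" for cp in DEVANAGARI if counts[chr(cp)] <= threshold]

# Hypothetical stand-in for scraped pages.
pages = ["\u0915\u0915 a", "\u0916"]
counts = devanagari_counts(pages)
print(counts.most_common(3))
print(len(rare_code_points(counts)), "code points never seen in this sample")
```

Run over a large corpus, the list of rarely seen code points gives a starting point for deciding which characters a keyboard layout could leave out.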

Nukta as a character can generate junk

As mentioned above, Nukta was added as a character in 1988 to avoid adding extra characters to the ISCII character list. However, the nukta-based characters also exist independently in Unicode and the IS 16350 document as ड़, ज़, फ़. Since Nukta behaves like a Maatra, it easily attaches itself to other consonants and generates junk, for example when it is typed after a consonant that has no nukta form.
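An input tool could guard against such junk by checking what precedes each Nukta. The sketch below is an assumption-laden illustration: `NUKTA_BASES` is taken from the eight consonants behind Unicode's precomposed Devanagari nukta letters (U+0958 to U+095F), which may not match the set a given language really needs, and `stray_nuktas` is a hypothetical helper.

```python
NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA (़)

# Assumed allow-list: क ख ग ज ड ढ फ य, the bases of U+0958..U+095F.
NUKTA_BASES = set("\u0915\u0916\u0917\u091c\u0921\u0922\u092b\u092f")

def stray_nuktas(text: str) -> list[int]:
    """Return the indices of Nukta signs that do not follow an allowed consonant."""
    positions = []
    for index, ch in enumerate(text):
        if ch != NUKTA:
            continue
        previous = text[index - 1] if index > 0 else ""
        if previous not in NUKTA_BASES:
            positions.append(index)
    return positions

print(stray_nuktas("\u0921\u093c"))  # ड + nukta: nothing flagged
print(stray_nuktas("\u092e\u093c"))  # म + nukta: flagged at index 1
```

The allow-list is the part that would need to be defined per language, in line with the character-set argument made earlier in this article.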

No concept of explicit Halanth

The ZWNJ exists in the technical space for two reasons. Firstly, people may have a specific style of denoting conjuncts: they can choose to represent the joined form or the constituent consonants. Secondly, it is needed to represent words like “Bathroom”, which cannot be represented correctly without ZWNJ.
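The effect of ZWNJ can be seen at the code-point level. In Unicode, a consonant followed by the halant (virama) and then ZWNJ requests the explicit halant form instead of a conjunct. The sketch below assumes the Devanagari spelling बाथरूम for “Bathroom”, which is not given in this article, and the helper names are hypothetical.

```python
HALANT = "\u094d"  # DEVANAGARI SIGN VIRAMA (्)
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER

def conjunct(first: str, second: str) -> str:
    """Join two consonants with a halant; renderers may draw a conjunct."""
    return first + HALANT + second

def explicit_halant(first: str, second: str) -> str:
    """Join two consonants with halant + ZWNJ to keep the halant visible."""
    return first + HALANT + ZWNJ + second

# Assumed spelling: ब ा थ [join] र ू म
prefix, suffix = "\u092c\u093e", "\u0942\u092e"
joined = prefix + conjunct("\u0925", "\u0930") + suffix
separated = prefix + explicit_halant("\u0925", "\u0930") + suffix

for label, word in (("conjunct", joined), ("explicit halant", separated)):
    print(label, word, [f"U+{ord(ch):04X}" for ch in word])
```

The two strings differ by a single invisible code point, which is why an input tool has to insert it deliberately.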

To conclude, any Indic content that is generated becomes a part of Indic computing, which means it can be collated, searched, sorted and subjected to other ancillary operations. It is therefore necessary to enable end users to generate correct content. This begins with defining language-based character sets and using them to create input tools that take care of each language's nuances. IS 16350 hence needs revision to tackle the existing challenges for the digital presence of Indian languages.

Written by

reverie
