Dec 27, 2022
Efforts to build a digital presence for Indian languages have been going on for more than two decades, yet Indic content still accounts for less than 0.01% of the internet. Encoding standards, input mechanisms and normalization, along with the more recently introduced Zero Width Joiner (ZWJ) and Zero Width Non-Joiner (ZWNJ), have further shaped how Indian languages look and behave online.
IS 16350, the Bureau of Indian Standards document titled Enhanced Inscript Keyboard Layouts, refers to ISCII, which was first proposed in 1991 to identify a minimum character set for Brahmi-based languages and, quoting verbatim, “with an eye on collation, sorting, storage economy and other ancillary measures which ideally should be a part of all the encoding standards.”
For typing, INSCRIPT was standardized by the Department of Electronics (DoE) in 1986 and modified in 1988 to accommodate the “Nukta” (़) as a character. Very few characters were then added to ISCII’s proposed character set, because some extra characters could be created by adding a Nukta to an existing one, e.g. ड followed by Nukta gives ड़.
Unicode later came into the picture to ensure that every character has a unique value assigned to it, irrespective of the platform.
The picture looks complete, and yet there are gaps and inconsistencies in the presence of Indian languages on digital media. Let’s look at some of them:
Missing character sets for Brahmi based languages
The 1991 ISCII proposal defined characters for Brahmi-based languages. Devanagari, which is derived from Brahmi, is itself the script for more than one Indian language: Hindi, Marathi, Nepali and Dogri, to name a few. Not every language written in Devanagari uses every character of the script. E.g. ळ is specific to Marathi and is not used elsewhere. Similarly, ॻ is specific to Sindhi, a language that can be written or typed using both Devanagari and Urdu letters. So although the minimum character set for Brahmi-based languages is known, separate character sets for the individual languages that use Devanagari are missing.
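A per-language character set could be expressed as a small lookup table. The sketch below is only an illustration: the `RESTRICTED_TO` table, the language codes and the function name are assumptions, and the table is seeded with just the two examples given above rather than a curated list.

```python
# Hypothetical table: characters mapped to the only languages that use them.
# Seeded with the two examples from the text; a real table would need to be
# curated by language experts.
RESTRICTED_TO = {
    "\u0933": {"mr"},  # ळ DEVANAGARI LETTER LLA, described above as Marathi-specific
    "\u097B": {"sd"},  # ॻ DEVANAGARI LETTER GGA, described above as Sindhi-specific
}


def out_of_set_chars(text: str, lang: str) -> list[str]:
    """Return characters in `text` that the table reserves for other languages."""
    return [
        ch for ch in text
        if ch in RESTRICTED_TO and lang not in RESTRICTED_TO[ch]
    ]


word = "\u092C\u093E\u0933"  # ब + ा + ळ
print(out_of_set_chars(word, "hi"))  # ळ is flagged when the language is Hindi
print(out_of_set_chars(word, "mr"))  # nothing is flagged for Marathi
```

An input tool could use a check like this to warn about, or simply not offer, characters that fall outside the selected language.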
Inconsistency with character set recommendation and language presence on the internet
Unicode and IS 16350 include characters that are archaic or whose roots are unknown. In practice these characters are extraneous: at Reverie we have scraped data from legacy websites and hardly ever come across them.
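One way to check this kind of claim against a corpus is to list the assigned Devanagari code points that never occur in it. This is a minimal sketch, not the method used at Reverie: the function name and the sample string are assumptions, and the scan is limited to the basic Devanagari block (U+0900 to U+097F).

```python
import unicodedata
from collections import Counter


def unused_devanagari(corpus: str) -> list[str]:
    """List assigned code points in U+0900..U+097F that never occur in `corpus`."""
    seen = Counter(ch for ch in corpus if "\u0900" <= ch <= "\u097F")
    unused = []
    for cp in range(0x0900, 0x0980):
        ch = chr(cp)
        # unicodedata.name returns the given default (None) for code points
        # that are unassigned in Python's bundled Unicode database.
        if unicodedata.name(ch, None) is not None and seen[ch] == 0:
            unused.append(ch)
    return unused


sample = "\u092D\u093E\u0930\u0924"  # भारत, a stand-in for scraped text
for ch in unused_devanagari(sample)[:5]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Run over a large scraped corpus, the returned list gives a concrete set of candidates to review, although absence from one corpus does not by itself prove a character is unused.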
Nukta as a character can generate junk
As mentioned above, Nukta was added as a character in 1988 to avoid adding extra characters to the ISCII proposed character list. However, Nukta-based characters such as ड़, ज़ and फ़ also exist independently in Unicode and in the IS 16350 document. Since Nukta behaves like a Maatra, it will readily attach itself to other consonants as well, generating junk.
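The point above can be illustrated with Python's standard `unicodedata` module. The sketch shows two things: the same letter can be stored either as one precomposed code point or as base plus Nukta, and a Nukta can be typed after a consonant that does not normally take one. The allow-list `NUKTA_BASES_HI` and the function name are assumptions made for illustration, not an authoritative list for Hindi.

```python
import unicodedata

NUKTA = "\u093C"  # ़ DEVANAGARI SIGN NUKTA

# Hypothetical allow-list of consonants that take a Nukta: क ख ग ज ड ढ फ.
# Illustrative only; a real list would be defined per language.
NUKTA_BASES_HI = set("\u0915\u0916\u0917\u091C\u0921\u0922\u092B")


def junk_nukta_sequences(text: str) -> list[str]:
    """Return base+Nukta pairs whose base is not in the allow-list."""
    # NFD splits precomposed letters such as U+095C (ड़) into base + Nukta,
    # so both encodings of the same letter are checked in the same way.
    decomposed = unicodedata.normalize("NFD", text)
    junk = []
    for prev, cur in zip(decomposed, decomposed[1:]):
        if cur == NUKTA and prev not in NUKTA_BASES_HI:
            junk.append(prev + cur)
    return junk


precomposed = "\u095C"       # ड़ as a single code point
sequence = "\u0921" + NUKTA  # ड followed by Nukta
print(precomposed == sequence)  # False: two different encodings
print(unicodedata.normalize("NFC", precomposed) == sequence)  # True after normalization
print(junk_nukta_sequences("\u092E" + NUKTA))  # म + Nukta is reported
```

Normalizing before comparison makes the two encodings of ड़ equal, but it does nothing about sequences like म + Nukta, which is why the check against a language-specific list is a separate step.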
No concept of explicit Halanth
ZWNJ exists in the technical space for two reasons. First, people may have a preferred style for writing conjuncts: they can choose the joined form or show the constituent consonants separately. Second, it is needed for words like “Bathroom”, which cannot be represented correctly without ZWNJ.
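At the code point level, the difference between a conjunct and an explicit Halanth is a single invisible character. The sketch below uses the consonant pair क and ष as an example; the helper names and the `search_key` policy of stripping ZWNJ before comparison are assumptions for illustration, not something IS 16350 or Unicode prescribes.

```python
VIRAMA = "\u094D"  # ् DEVANAGARI SIGN VIRAMA (Halanth)
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER


def conjunct(first: str, second: str) -> str:
    """Consonant + Halanth + consonant: a renderer may fuse this into a conjunct."""
    return first + VIRAMA + second


def explicit_halant(first: str, second: str) -> str:
    """Same pair with ZWNJ after the Halanth, requesting a visible Halanth."""
    return first + VIRAMA + ZWNJ + second


def search_key(text: str) -> str:
    """Strip ZWNJ so both spellings match in a naive search (assumed policy)."""
    return text.replace(ZWNJ, "")


joined = conjunct("\u0915", "\u0937")        # क + ् + ष
split = explicit_halant("\u0915", "\u0937")  # क + ् + ZWNJ + ष
print([f"U+{ord(c):04X}" for c in joined])
print([f"U+{ord(c):04X}" for c in split])
print(joined == split)                          # False: the stored strings differ
print(search_key(joined) == search_key(split))  # True once ZWNJ is ignored
```

Because the two spellings are different strings, any search, sort or collation step has to decide explicitly whether they count as the same word.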
To conclude, any Indic content generated becomes part of Indic computing, which means it can be collated, searched, sorted and subjected to other ancillary operations. It is therefore necessary to enable end users to generate correct content. This begins with defining language-based character sets and using them to build input tools that take care of each language’s nuances. IS 16350 hence needs revision to tackle the existing challenges for the digital presence of Indian languages.