Dec 27, 2022
4 min read
In part one of our series on the Government of India’s language support mandate, we discussed what the mandate means for India’s internet users and what greater implications this move has. In this piece, part two, we’ll talk about how languages are assigned characters and how this fits into this larger project of making Indic language display support the new default.
The Government of India’s language support mandate aims to make Indic languages available by default on all phones, bringing relief to the hundreds of millions of Indians who are otherwise forced to use their phones in English, a language they do not understand.
Apart from its firm directive to companies to provide support for these languages, the mandate also covers character standardization, spelling out which character sets companies should base their language support on.
The purpose of having uniform character sets is to ensure consistency across devices. Imagine not being able to read English text typed on another device, just because the characters used on that other device don’t entirely overlap with the ones your device supports.
That’s essentially the sort of confusion that many Indian language users face on a daily basis, something the mandate’s provisions for uniform character sets are supposed to prevent.
Representing text visually involves a number of processes that go on behind the scenes.
When a key is pressed, the character that key is assigned to is fed to the text editor, and you get your desired output.
How are these characters defined and selected?
Back in the early days of personal digital devices, it quickly became clear that a uniform standard was needed for encoding and representing the characters used in digital text across numerous languages. This standard evolved over the years, with new languages and characters added over time, developing into what we now call Unicode. Unicode is maintained and updated by the Unicode Consortium, an international body.
Unicode assigns a unique numeric value, or code point, to each character. It takes a script-based approach to encoding: languages that share a script use the same code points for the characters they have in common, while characters specific to a single language are included in that script’s character set as well.
For example, while n is common to English and Spanish, ñ is only used in Spanish. Similarly, while ल is used by both Hindi and Marathi, ळ is only used in Marathi.
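As a concrete illustration, here is a short Python sketch (using the standard library’s unicodedata module) that prints the code points behind these examples:

```python
import unicodedata

# Characters shared across languages get one code point each;
# language-specific characters get their own.
for ch in ["n", "ñ", "ल", "ळ"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# n -> U+006E LATIN SMALL LETTER N             (English and Spanish)
# ñ -> U+00F1 LATIN SMALL LETTER N WITH TILDE  (Spanish only)
# ल -> U+0932 DEVANAGARI LETTER LA             (Hindi and Marathi)
# ळ -> U+0933 DEVANAGARI LETTER LLA            (Marathi only)
```

Because Hindi and Marathi both encode ल as U+0932, text typed in one language renders identically for readers of the other.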
While most Indian scripts are only used by one major language (Kannada, Telugu, Odia, Tamil, Gujarati), some scripts are shared by multiple languages (Devanagari by Nepali, Konkani, Hindi, Bodo, and Marathi; Eastern Nagari by Assamese, Bengali, and Manipuri).
The different character sets assigned by Unicode come with their own flaws and failings. In the process of encoding and standardizing character sets for languages, Unicode overlooked and ignored the way native speakers perceive and understand their own scripts.
Unicode sets for Indic scripts include numerous extraneous characters: look-alike forms of characters that are normally produced through combination; archaic, obscure characters that native speakers, apart from specialists, would be unaware of; and junk characters – illegal combinations that are not permitted by the script’s own inherent rules.
Apart from this, Unicode sets often give these obscure characters names that are generally associated with better known, mainstream characters.
These issues aren’t without their implications.
Many of these extraneous characters are very similar in appearance to existing mainstream characters, which can cause confusion.
For example, the character set for Devanagari includes both ओ and ऒ.
While the former is the standard, mainstream character used for o, the second character is not used in Hindi, Marathi, or any of the languages that use Devanagari.
In addition, some characters are formed by combining existing characters with diacritics – diacritics that cannot be added to just any character. The result is combined characters that look identical to independent characters but are actually illegal combinations.
क़, for example, is formed by combining क with the ़ diacritic. Assigning ़ its own code means that it’s free to combine with characters it’s not used with, giving us combinations like छ़ or भ़ that do not exist in any language that uses Devanagari. ड plus ़ gives us ड़, but ड़ already exists as a separate character.
This can be fixed by assigning all existing characters that use the ़ diacritic (क़, ज़, ख़, ग़, etc ) their own codes, removing the possibility of illegal combinations.
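Unicode’s own normalization machinery makes this precomposed/combined relationship visible. A minimal Python sketch using the standard unicodedata module (the code points below are from the Unicode Devanagari block; Unicode itself treats the precomposed nukta letters as “composition exclusions”, so normalization decomposes them into base plus diacritic):

```python
import unicodedata

nukta = "\u093C"                     # DEVANAGARI SIGN NUKTA (the ़ diacritic)

# Legitimate pair: क + nukta renders as क़, and Unicode also defines a
# precomposed क़ (U+0958). NFC normalization maps the precomposed form
# to the combining sequence, so the two spellings compare equal.
combined = "\u0915" + nukta          # क + ़
precomposed = "\u0958"               # क़
print(unicodedata.normalize("NFC", precomposed) == combined)  # True

# Illegitimate pair: because the nukta has its own code point, nothing
# stops software from attaching it to characters it is never used with.
junk = "\u091B" + nukta              # छ + ़ -> छ़, not a real character
print(unicodedata.name(nukta))      # DEVANAGARI SIGN NUKTA
```

Normalization can reconcile the legitimate pairs, but it cannot tell that छ़ is junk – that knowledge lives in the script’s rules, not in the encoding.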
आ and अा, for example, look identical – but they aren’t. The former is an independent a, while the latter is अ with an a matra, or vowel diacritic, added. Indic scripts do not permit matras to be added to independent vowels, yet the effect can be replicated by mashing the two characters together.
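This pair behaves differently from the nukta characters: the two spellings are not canonically equivalent in Unicode, so not even normalization can unify them. A small Python check using the standard unicodedata module:

```python
import unicodedata

independent = "\u0906"        # आ  DEVANAGARI LETTER AA
mashed = "\u0905\u093E"       # अ + ा (independent vowel + aa matra)

print(independent == mashed)  # False: different code point sequences

# Unlike precomposed nukta letters, these are NOT canonically
# equivalent, so Unicode normalization leaves them distinct:
for form in ("NFC", "NFD"):
    same = (unicodedata.normalize(form, independent)
            == unicodedata.normalize(form, mashed))
    print(form, same)         # NFC False, NFD False
```

Two strings that render identically on screen thus remain permanently different to every piece of software that compares them byte by byte.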
Character confusion can wreak havoc with search. In the examples above, combined characters look identical to individually coded characters. Since they’re coded differently, however, the two variations of the same character are indexed differently and throw up completely different results in search, making Indic language content even tougher to discover.
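To see why this hurts discovery, here is a deliberately simplified Python sketch of a search index keyed on raw character sequences (the word आम, “mango”, is just an illustrative example):

```python
# A toy index keyed on raw character sequences: two visually identical
# spellings of the same word land under different keys, so a search
# for one spelling never surfaces documents that use the other.
index = {}
doc_a = "\u0906\u092E"            # आम spelled with independent आ
doc_b = "\u0905\u093E\u092E"      # अ + ा + म, visually identical

index.setdefault(doc_a, []).append("page-1")
index.setdefault(doc_b, []).append("page-2")

print(len(index))                 # 2 entries for one visible word
print(index.get(doc_a))           # ['page-1'] -- page-2 is invisible
```

Real search engines normalize their input, but as shown above, normalization cannot merge pairs like आ and अ + ा, so the split persists.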
The language mandate addresses these inconsistent and flawed character sets by including defined character sets for each language, containing only the characters that native speakers actually use in day-to-day writing. This move seeks to remove the points of friction and confusion that both font developers and users face when working with incorrectly defined character sets.
By including these precisely defined character sets in its provisions, this mandate helps build consistent Indic language display across devices, weeding out the inconsistencies that have plagued Indic language users in the past when it came to typing in their script.