What you need to know today about Indic website search
Internet penetration in India rose to 34.8% in 2016, up from 27% in 2015. Businesses with an online presence that initially catered to an English-only audience responded to this change by providing their content in many Indian languages.
E-commerce, banking, logistics, healthcare and other sectors adopted Indic languages by localizing mobile applications, sending multilingual SMSs, and serving multilingual static and dynamic content on their websites.
The needs of an Indic language speaker on the internet are being met, slowly but surely. So, what is normalization now and why is it needed?
Before we look at why normalization is needed, let’s dive into some hidden errors that are invisible to a reader but affect Indic computing, which includes basic operations like search and sort. This applies to anyone who wants to run an Indic website search, either on their own Indic website or on Google, Bing, etc.
What is normalization in Indic computing?
Unicode assigns a unique value to every character that is processed by computing devices and displayed on the screen. How these characters look on the screen, i.e. their style and their behavior when combined (relevant for Indic languages), is defined by the font. The process of displaying a character in a particular font is called rendering.
Indic languages do not follow the rules of English. A set of characters called ‘maatra’ (dependent vowel signs) can take positions around the consonants, which calls for careful rendering. If this is not handled correctly, Indic computing is hampered.
The Hindi transliteration of the word “Home” is “होम”.
Let’s find out the number of ways the sequence of characters in “होम” can be typed without altering its rendering on most devices today:
- ह + ो + म
- ह + ा + े + म
So even though the word “होम” looks the same, its composition differs between the two examples.
The former composition is correct while the latter is wrong. Let’s see the difference it makes to search results on the same webpage.
In Figure 1, “होम” is a combination of 3 characters while in Figure 2, it is a combination of 4.
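The difference between the two compositions can be inspected directly. The following is a minimal Python sketch using only the standard library; the `page_text` string is a hypothetical stand-in for webpage content, not taken from any real site. It also shows that standard Unicode normalization (NFC) does not merge the two spellings, so a dedicated Indic normalization step is assumed to be needed.

```python
import unicodedata

correct = "\u0939\u094B\u092E"      # ह + ो + म (3 characters)
wrong = "\u0939\u093E\u0947\u092E"  # ह + ा + े + म (4 characters)

def describe(word):
    """Return (code point, Unicode name) pairs for each character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

for label, word in (("correct", correct), ("wrong", wrong)):
    print(label, len(word), describe(word))

# Hypothetical page content that contains only the correct composition.
page_text = "\u092E\u0941\u0916\u094D\u092F " + correct

print(correct in page_text)  # True
print(wrong in page_text)    # False: looks identical, but does not match

# NFC leaves the 4-character sequence unchanged, so it still differs.
print(unicodedata.normalize("NFC", wrong) == correct)  # False
```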
Let’s look into a Bengali website and analyze the search results for the transliteration of the word “notice”.
Two different compositions of “notice” generate two different search results.
Let’s have a look at a completely different scenario.
“Hindi” can be written as “हिन्दी” or “हिंदी” both of which are correct but create inconsistency. Similarly, a person named “Nand” can prefer to type his name as “नन्द” or “नंद”.
Hindi alone has a lot of words that can have two such representations.
Normalization cleans up all of these through an API call.
In the first example, the API looks for compositions that produce similar-looking words and replaces the wrong compositions with the correct ones.
In the second example, one of the correct representations is chosen as the preferred form and used consistently throughout the content.
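Both kinds of cleanup can be sketched as plain string replacement. This is an illustration only, not the Reverie API: the function `normalize_indic` and the two rule tables are hypothetical, each holds a single rule taken from the examples above, and the choice of the anusvara spelling as the preferred form is an arbitrary assumption. A real normalizer would need far larger, language-specific rule sets.

```python
# Hypothetical rule table: wrong compositions -> correct composition.
COMPOSITION_FIXES = {
    "\u093E\u0947": "\u094B",  # ा + े  ->  ो
}

# Hypothetical rule table: valid variant -> preferred spelling.
# Preferring the anusvara form is an arbitrary choice for this sketch.
PREFERRED_SPELLINGS = {
    "\u0928\u094D\u0926": "\u0902\u0926",  # न्द  ->  ंद
}

def normalize_indic(text):
    """Apply composition fixes first, then spelling preferences."""
    for wrong, right in COMPOSITION_FIXES.items():
        text = text.replace(wrong, right)
    for variant, preferred in PREFERRED_SPELLINGS.items():
        text = text.replace(variant, preferred)
    return text

home_wrong = "\u0939\u093E\u0947\u092E"             # ह + ा + े + म
hindi_variant = "\u0939\u093F\u0928\u094D\u0926\u0940"  # हिन्दी
nand_variant = "\u0928\u0928\u094D\u0926"           # नन्द

print(normalize_indic(home_wrong) == "\u0939\u094B\u092E")                 # True
print(normalize_indic(hindi_variant) == "\u0939\u093F\u0902\u0926\u0940")  # True
print(normalize_indic(nand_variant) == "\u0928\u0902\u0926")               # True
```

Applying the same function to both the stored content and the incoming search query is what would make the two sides comparable.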
Search results after using the Reverie API on the first example, copied into an MS Word document:
Benefits of a normalized website
Websites with content in Indic languages gain the following benefits from normalization:
- Consistency: The content stays consistent throughout in terms of word preference and grammatical correctness.
- Searching and sorting: Non-normalized Indic content can contain words that look the same but are composed of different character sequences. Since algorithms like searching and sorting operate on characters, a difference in the number of characters for a same-looking word will generate different results. Normalization avoids this and yields proper search results.
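The effect on sorting and counting can be seen in a small Python sketch. The one-rule `normalize_indic` function here is a hypothetical stand-in for a real normalizer, and the sort is Python's default code point order rather than a proper linguistic collation.

```python
from collections import Counter

def normalize_indic(text):
    """Hypothetical one-rule normalizer: ा + े  ->  ो."""
    return text.replace("\u093E\u0947", "\u094B")

correct = "\u0939\u094B\u092E"      # होम typed as ह + ो + म
wrong = "\u0939\u093E\u0947\u092E"  # होम typed as ह + ा + े + म
hira = "\u0939\u0940\u0930\u093E"   # हीरा, an unrelated word

words = [correct, hira, wrong]

# Without normalization, the two spellings of होम sort on either side
# of हीरा and are counted as two distinct words.
print(sorted(words) == [wrong, hira, correct])  # True
print(len(Counter(words)))                      # 3

# With normalization, the spellings collapse into one entry.
normalized = [normalize_indic(w) for w in words]
print(sorted(normalized) == [hira, correct, correct])  # True
print(len(Counter(normalized)))                        # 2
```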
Normalization for Indic website search is free!
Normalization APIs will be free and will soon be made available for 11 Indian languages: Tamil, Telugu, Punjabi, Odia, Marathi, Malayalam, Kannada, Hindi, Gujarati, Bengali, and Assamese. Experience a consistent and error-free Indic website with us. Stay tuned!