Take a big cup of coffee and settle down comfortably. This is going to take time. I have been part of big projects that included internationalization and localization. But in most places we would have a set of strings which you would translate into the respective language. The app would pick up the required language file and show the UI in that language. In some cases we would change how dates, currencies and floating point numbers were displayed. Myswar was a little different for me. Here we wanted not just the UI but the whole content (we still have gaps) in Hindi. We also wanted search, sort, numeric literals and dates in Hindi. This blog post is about that. It’s meant to be a reference note to myself for later use, hence the detail.
1. UTF-8 or UTF-16? What to use, conversions…
This time, before I jumped into developing, I really wanted to understand what Unicode encodings mean. Especially what UTF-8 and UTF-16 are and how they differ. The video below explains UTF-8 in a very simple manner.
I would suggest you watch the following detailed videos on ASCII, UTF-16/32 and UTF-8 when you get time. It’s very important as a developer to know how characters are represented behind the scenes.
UTF-8 has won the war in terms of character encoding for the web, especially because of its backward compatibility with ASCII. So UTF-8 it was for myswar.
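To make the difference concrete, here is a small sketch that encodes the same text both ways. It assumes Python 3 (the snippets later in this post are Python 2), and `encoded_sizes` is just a helper name made up for this note.

```python
# Python 3 sketch: compare UTF-8 and UTF-16 encodings of the same text.
# encoded_sizes is a hypothetical helper, not part of any library.

def encoded_sizes(text):
    """Return the encoded bytes of text in UTF-8 and UTF-16 (little-endian)."""
    return {
        "utf-8": text.encode("utf-8"),
        # "utf-16-le" avoids the 2-byte byte-order mark that "utf-16" adds
        "utf-16": text.encode("utf-16-le"),
    }

for sample in ["m", "\u092e"]:  # Latin m and Devanagari MA
    sizes = encoded_sizes(sample)
    print(repr(sample),
          "UTF-8:", sizes["utf-8"].hex(),
          "UTF-16:", sizes["utf-16"].hex())

# ASCII text is byte-for-byte identical in UTF-8; that is the
# backward compatibility with ASCII.
print("m".encode("utf-8") == "m".encode("ascii"))
```

An ASCII letter takes one byte in UTF-8 and two in UTF-16, while a Devanagari letter takes three bytes in UTF-8 and two in UTF-16.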
2. Translation and Transliteration
We had labels which needed to be translated into Hindi, and we used Google Translation API to do that. But for the album and song titles etc. we had to do transliteration. I used my old script for that, but the quality of the transliteration wasn’t great. So we had to create our own dictionary of words to work alongside the script to resolve the issues.
3. Application Programming
Application programming is still very difficult when it comes to using local languages. Here are some of the issues we faced while coding for Hindi; there could be many more.
String length
len(s)
Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary). That’s not quite the correct definition. When it comes to strings (plain byte strings in Python 2), it’s the count of bytes by default.
>>> len('म')
3
>>> len('m')
1
>>> len('मt')
4
>>> len('मt'.decode('UTF8'))
2
So you can’t reliably use len to get the count of characters by default. You need to decode the string every time. That’s a pain.
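For comparison, here is a sketch of the same check under Python 3, where str is Unicode by default. This is an assumption on my part, since the session above is Python 2, and the two helper names are made up for this note. Even then, len counts code points, not what a reader sees as one character.

```python
# Python 3 sketch: bytes vs. code points.
# byte_length and code_point_length are hypothetical helper names.

def byte_length(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def code_point_length(text):
    """Number of Unicode code points; this is what len() gives for str."""
    return len(text)

samples = [
    "\u092e",         # Devanagari MA
    "m",
    "\u092et",        # MA followed by Latin t
    "\u0915\u093f",   # KA + vowel sign I, displayed as a single syllable
]
for s in samples:
    print(repr(s), byte_length(s), code_point_length(s))
```

The last sample shows the remaining gap: it looks like one character on screen but is two code points, so counting what users perceive as characters needs more than len.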
Even simple math is impossible
>>> १+९
  File "<stdin>", line 1
    १+९
    ^
SyntaxError: invalid syntax
>>> 1+9
10
So if you ever store numbers in Devanagari digits, then every time you want to do some kind of math on them you need to translate them into ASCII digits first. The same applies to calendar operations.
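A minimal sketch of that round trip, assuming Python 3. The literals in source code still have to be ASCII, but int() accepts strings made of Unicode decimal digits, and a translation table handles the way back. `parse_number` and `to_devanagari` are names I made up for this note.

```python
# Python 3 sketch: Devanagari digits in, arithmetic, Devanagari digits out.
# parse_number and to_devanagari are hypothetical helper names.

# Devanagari digits occupy U+0966 (zero) through U+096F (nine).
DEVANAGARI_DIGITS = "".join(chr(0x0966 + i) for i in range(10))
ASCII_TO_DEVANAGARI = str.maketrans("0123456789", DEVANAGARI_DIGITS)

def parse_number(text):
    """Parse a string of decimal digits (ASCII or Devanagari) into an int."""
    return int(text)  # int() understands any Unicode decimal digit

def to_devanagari(number):
    """Render an integer using Devanagari digits."""
    return str(number).translate(ASCII_TO_DEVANAGARI)

total = parse_number("\u0967") + parse_number("\u096f")  # 1 + 9
print(total, to_devanagari(total))
```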
Equals doesn’t always work, aka what you see is not what you get
>>> print 'क़' == 'क़'
False
>>> print u'\u0958' == u'\u0915\u093c'
False
>>> import unicodedata
>>> print unicodedata.normalize('NFC' ,u'\u0958') == unicodedata.normalize('NFC', u'\u0915\u093c')
True
>>> print unicodedata.normalize('NFC' ,u'क़') == unicodedata.normalize('NFC', u'क़')
True
>>>
Basically what you see is not what it is inside. Two characters that look the same might be stored as different code point sequences. I won’t go into details; you can read this detailed FAQ on Normalization.
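In practice this means wrapping every comparison in a normalization step. A minimal sketch, assuming Python 3; `same_text` is a hypothetical helper name and the choice of NFC is mine:

```python
# Python 3 sketch: compare strings only after normalizing both sides.
# same_text is a hypothetical helper name.
import unicodedata

def same_text(a, b, form="NFC"):
    """True if a and b are equal once both are in the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "\u0958"        # QA as a single code point
decomposed = "\u0915\u093c"   # KA followed by the nukta sign

print(precomposed == decomposed)           # False
print(same_text(precomposed, decomposed))  # True
```

Normalizing once when the data is stored, rather than on every comparison, is usually the less painful option.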
Sorting Unicode
A Python list containing Unicode strings won’t sort properly: by default it is ordered by code point value, not by the order of the language. If we want to sort it properly then we need to implement the Unicode Collation Algorithm (UCA). You can check the attempt made by James Tauber. The logic is simple and straightforward. I hope it will be built into the Python language very soon. Sorting is fixed in most of the newer databases; they can mostly sort Unicode columns by default. In Mongo it’s still an issue, as it doesn’t support sorting a column by collation. You need to implement it yourself if you want Unicode sorting.
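Here is a small sketch of the problem, assuming Python 3. The word list is made up for illustration. The NFD sort key is only a partial workaround that I am substituting for real collation, not the UCA: it just stops precomposed nukta letters from being thrown to the end of the list.

```python
# Python 3 sketch: code point order vs. a crude normalization-based key.
# This is NOT the Unicode Collation Algorithm, only a partial mitigation.
import unicodedata

def nfd_key(text):
    """Sort key that decomposes precomposed letters such as U+0958."""
    return unicodedata.normalize("NFD", text)

words = [
    "\u0916\u0917",        # starts with KHA
    "\u0958\u0932\u092e",  # starts with precomposed QA (U+0958)
    "\u0939\u092e",        # starts with HA
    "\u0915\u092e\u0932",  # starts with KA
]

by_code_point = sorted(words)        # the QA word lands after the HA word
by_nfd = sorted(words, key=nfd_key)  # the QA word lands next to the KA word

print(by_code_point)
print(by_nfd)
```

U+0958 has a higher code point than every basic consonant, so plain sorting places that word last, while the decomposed form sorts beside the other KA words. Proper language-aware ordering still needs a UCA implementation.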
Search, filter
Since equality doesn’t work the way you want and regular expression support is still very basic and messy, any search or filter functionality will take a lot of trial and error. This needs a separate post.
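As a starting point, a filter can at least normalize both the query and the data before matching. This is a rough sketch under the assumption of Python 3 and plain substring matching; `fold` and `filter_titles` are hypothetical names and the titles are made up.

```python
# Python 3 sketch: substring filter that normalizes both sides first.
# fold and filter_titles are hypothetical helper names.
import unicodedata

def fold(text):
    """Normalize to NFC and case-fold so Latin titles match regardless of case."""
    return unicodedata.normalize("NFC", text).casefold()

def filter_titles(titles, query):
    """Return the titles that contain the query after normalization."""
    needle = fold(query)
    return [title for title in titles if needle in fold(title)]

titles = [
    "\u0958\u0932\u092e",  # stored with precomposed QA
    "\u0915\u092e\u0932",
    "Kal Ho Naa Ho",
]

# The query is typed as KA + nukta + LA, i.e. the decomposed spelling.
print(filter_titles(titles, "\u0915\u093c\u0932"))
print(filter_titles(titles, "kal"))
```

This only handles different encodings of the same text. Spelling variants and matching across scripts are separate problems.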
Localization of date and time formats
It’s almost impossible. You will have to write your own routines to display dates and times in your own language.
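A hand-rolled routine can be quite small. The sketch below assumes Python 3, a day-month-year layout, and one common set of Hindi spellings for the Gregorian month names (other spellings exist). `format_date_hi` is a hypothetical name.

```python
# Python 3 sketch: format a date in Hindi without relying on system locales.
# The month spellings and the layout are assumptions made for this note.
from datetime import date

HINDI_MONTHS = [
    "जनवरी", "फ़रवरी", "मार्च", "अप्रैल", "मई", "जून",
    "जुलाई", "अगस्त", "सितंबर", "अक्टूबर", "नवंबर", "दिसंबर",
]

# Map ASCII digits to Devanagari digits (U+0966 through U+096F).
TO_DEVANAGARI = str.maketrans(
    "0123456789", "".join(chr(0x0966 + i) for i in range(10))
)

def format_date_hi(d):
    """Return the date as 'day month year' with Devanagari digits."""
    text = "{} {} {}".format(d.day, HINDI_MONTHS[d.month - 1], d.year)
    return text.translate(TO_DEVANAGARI)

print(format_date_hi(date(2014, 2, 14)))
```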
4. Localizing strings in JavaScript
As far as I know there is still no standard way to implement localization in JavaScript. For example, if you have a date object, how do you localize it to show the date in Hindi, and can we sort it in JavaScript? There are some libraries which help with direct string replacement, but I guess that’s not enough. As of now the better idea would be not to depend on JS.
5. Input method – JavaScript – JQuery IME
Input is still an issue. Many Indians don’t have any input software installed. As of now the best way is to provide it as part of the page using JavaScript. I found Wikipedia’s jquery.ime very simple to use. I am still experimenting with it.
Each of these issues could be a blog post by itself. I will write about them in detail in the coming days.