2007-05-29

Unicode and Sinhala Alphabet

There is a great deal of similarity between Sinhala Unicode (~ SLS 1134) and Sinhala Hodiya (alphabet).

Sidath Sangarawa, one of the oldest texts on Sinhala grammar, written in the 13th century, lists 10 vowels and 20 consonants (see footnote 1), but the book also uses two unlisted vowels, ඇ and ඈ (see footnote 2).

Sanskrit influence increased the number of characters to over 50.

The actual number of shapes needed to write Sinhala, known as "glyphs" in modern typographic terminology, is in the thousands, because of the derived and joint forms of the basic characters.

Listing all these thousands of glyphs was never a popular practice. Students learn basic characters and modifiers, and common sense takes care of generating the thousands of other shapes. For example, after learning "ispilla", you can add it to basic consonants and generate all the "i" forms such as "ki", "gi", "ji" etc.

The Hodiya doesn't have any of these extra characters such as "ki" or "du". It has neither rakaransaya nor yansaya. But nobody complained. Everybody knew, and still knows, that the Hodiya is only a basic guide for generating more complex glyphs.

However, this didn't work when Sinhala texts started to be printed on printing machines. These machines had no brains and couldn't learn how to "generate". Therefore every possible glyph had to be supplied.

Walk into an old press to see a large "matrix" of such glyphs.

Then came the age of computer-based typography. Computers can be taught to do things, and that is exactly how standards like Unicode and SLS 1134 generate shapes. We can teach computers to generate thousands of glyphs from fewer than a hundred basic shapes. For example, we can generate "du" by combining "da" and "papilla", so a separate "du" is not necessary.
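As a rough illustration of the "du" example, here is a minimal Python sketch. The code points are the ones Unicode assigns to "da" (U+0DAF) and the short "papilla" vowel sign (U+0DD4). The helper name `compose` is made up for this post and is not part of any standard or library.

```python
# Code points from the Unicode Sinhala block.
DA = "\u0daf"       # the consonant "da"
PAPILLA = "\u0dd4"  # the short "u" vowel sign (papilla)


def compose(consonant: str, vowel_sign: str) -> str:
    """Hypothetical helper: a derived form is just the consonant
    followed by the vowel sign. No separate "du" code point is needed."""
    return consonant + vowel_sign


du = compose(DA, PAPILLA)

# Show what is actually stored: two code points, not one.
print([f"U+{ord(ch):04X}" for ch in du])
```

The stored text is two characters long. Drawing them as the single "du" shape is left to the font and the shaper. The same helper gives "ki", "gi" or "ji" if you pass the matching consonant and the "ispilla" sign (U+0DD2).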

How about "yansaya" and "rakaransaya"? They are generated by sequences that include the zero-width joiner (ZWJ). For example, "pra" is represented as "pa", "hal kireema", ZWJ and "ra". ZWJ is also used to represent joint and touching letters.
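The "pra" sequence can be written out as a small Python sketch. The function names `rakaransaya` and `yansaya` are invented for illustration. The yansaya sequence is assumed to follow the same pattern as the "pra" example, with "ya" (U+0DBA) in place of "ra".

```python
ZWJ = "\u200d"        # ZERO WIDTH JOINER
AL_LAKUNA = "\u0dca"  # "hal kireema"
PA = "\u0db4"         # consonant "pa"
KA = "\u0d9a"         # consonant "ka"
RA = "\u0dbb"         # consonant "ra"
YA = "\u0dba"         # consonant "ya"


def rakaransaya(consonant: str) -> str:
    """Hypothetical helper: consonant + hal kireema + ZWJ + ra."""
    return consonant + AL_LAKUNA + ZWJ + RA


def yansaya(consonant: str) -> str:
    """Hypothetical helper, assuming the same pattern with ya."""
    return consonant + AL_LAKUNA + ZWJ + YA


pra = rakaransaya(PA)
kya = yansaya(KA)

# Each form is stored as four code points.
print([f"U+{ord(ch):04X}" for ch in pra])
print([f"U+{ord(ch):04X}" for ch in kya])
```

Nothing here draws a glyph. The sequence only records what the writer meant, and a shaper that knows the rule turns it into the joined shape.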

Gone are the days of brainless matrix-based printing machines.

We need two things to view Sinhala on a computer: a font containing Sinhala glyphs, and programs that know how to generate glyphs from sequences of basic characters. Let me explain using an example.

Step 0. Here is how a sample web page looks on a browser when it cannot find a Sinhala Unicode font. The "boxes" indicate unavailable character numbers:

Step 1. After installing a font, the browser will show some Sinhala, but if it hasn't "learned" how to generate glyphs, only basic characters and modifiers are shown independently:

Step 2. Now I have enabled the "shaper" in the browser, which is the part that knows how to generate Sinhala glyphs using basic characters:

All is well!
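In all three steps the text stored in the page is the same; only the rendering changes. A small Python sketch using the standard `unicodedata` module lists what a program actually receives for "pra" before any shaping happens. The function name `describe` is made up for this example.

```python
import unicodedata


def describe(text: str):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# "pra" as stored: pa, hal kireema, ZWJ, ra
sample = "\u0db4\u0dca\u200d\u0dbb"

for code, category, name in describe(sample):
    print(code, category, name)
```

Without a font, each of these entries has nothing to draw and becomes a box, as in Step 0. With a font but no shaper, the letters and the hal kireema are drawn one by one, as in Step 1. With a shaper, the four entries become one glyph, as in Step 2.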

So where exactly is the similarity?

Students learn fewer than 100 basic characters and modifiers in the alphabet, and use their brains, with some support from teachers, to generate the remaining 1000+ shapes.

Computers can be programmed - and some computers have already been programmed - to generate 1000+ Sinhala glyphs from fewer than 100 basic shapes in the Sinhala Unicode / SLS 1134 standard.
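To get a feel for how a small inventory multiplies, the following Python sketch reads the Sinhala block (U+0D80 to U+0DFF) from Python's built-in Unicode database and pairs every consonant with every vowel sign. It rests on some assumptions: consonants are taken to be the letters from U+0D9A onwards, the counts depend on the Unicode version shipped with your Python, and consonant plus vowel sign is only one family of derived forms, so the result is not the full glyph count. The function name `sinhala_inventory` is invented for this example.

```python
import unicodedata


def sinhala_inventory():
    """Collect consonants and vowel signs from the Sinhala block.

    Assumption: letters at U+0D9A and above are consonants; the
    letters before that are independent vowels.
    """
    consonants, vowel_signs = [], []
    for cp in range(0x0D80, 0x0E00):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        if name.startswith("SINHALA VOWEL SIGN"):
            vowel_signs.append(ch)
        elif name.startswith("SINHALA LETTER") and cp >= 0x0D9A:
            consonants.append(ch)
    return consonants, vowel_signs


consonants, vowel_signs = sinhala_inventory()

# Every consonant followed by every vowel sign, e.g. "du" = da + papilla.
combos = [c + v for c in consonants for v in vowel_signs]

print(len(consonants), "consonants x", len(vowel_signs), "vowel signs =", len(combos), "forms")
```

This is not how a shaper works internally. It only shows that the derived forms can be produced from the basic characters instead of being listed one by one.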

Because the standard is platform-independent, we use it to communicate with people on diverse platforms, in the Sinhala Unicode Group among other places.

Footnote 1. පණකුරු පසෙක් එද ලුහු ගුරු බෙයින් දසවේ, ගතකුරුද වේ විස්සෙක්, වහරට යුහු හෙළ බස

Footnote 2. Notice the use of ඇ and ඈ, both independently and in consonants: පසැස් ඈ සරලොප් නැතද සර ගතට පැමිණවූ බැවින් සර සඳ නම්.

6 comments:

Anuradha said...

On GNU/Linux, most software components already come with Sinhala shapers. You only have to find a Sinhala Unicode font, such as the LK-LUG font or the Kaputa Unicode font.

We have enabled Sinhala on most Free and Open Source Software through the Sinhala GNU/Linux project of the Lanka Linux User Group.

For Sinhala input, you'll need the SCIM based input method module.

On Windows XP, you can get the Sinhala enabling pack from www.fonts.lk. Windows Vista is supposed to come with a Sinhala shaper in Uniscribe.

For Mac OS X, there is a third-party Sinhala Unicode pack here.

Hakim said...

KICK ASS! machang. Great job explaining. But I'll tell you one thing: "Ignorance, jealousy and greed don't have cures". Maybe the duck (would like to replace "d" with an "F") does not realize that he is sick (maybe that's why he likes playing in mud). It's the OS and apps that need to learn the Unicode rules. Why use 1660 (brute force) when you can use 61 (59) and the brain of the computer?

Tyrell Perera said...

Timely post. Nicely done :)

Anuradha Ratnaweera said...

A couple of blog posts I wrote about the same topic are here and here.

Ruki said...

nice work.

Bo said...

Anuradha, thank you so much for this explanation. I was rather confused because of a harsh comment I received from a "donald" guy about this "sinhala awula". I am new to blogging and computing, so I just felt bad: if I am harming my mother tongue by writing in Unicode, I thought I should probably stop writing. But then a friend gave me this link, which helped me see the truth behind these comments. Thank you so much!