Is Sinhala Unicode Incomplete?

"The SLSI 1134 is incorrect & incomplete and it should be corrected immediately.", claims Mr Donald Gaminitillake, who is trying to ignite a campaign against Sinhala Unicode standard through www.akuru.org (history of the site), and frequent newspaper articles.

We of the Sinhala GNU/Linux project think otherwise. And we are not alone. The Language Technology Research Center of the University of Colombo School of Computing, research groups from the University of Moratuwa and the Arthur C Clarke Center for Modern Technology, Microsoft, Microimage and Science Land also think that the standard is correct. A full list would be quite long.

GNU/Linux was the first platform to implement Sinhala Unicode rendering. We didn't find any issues with encoding or displaying the characters Mr Donald claims are impossible - yansaya, rakaaransaya, repaya, joint letters and all that. Then Microsoft also released a "Sinhala Enabling Kit for Windows". Most vendors today support Sinhala Unicode. None of those who actually got their hands dirty by writing code to implement the standard sees any missing "letters" in it.

Implementation is proof for most people. But for some not-so-obvious "reason" Mr Donald continues to say that certain characters are missing!

Our first encounter with Mr Donald happened when I wrote an open letter to him, which became a lengthy debate (more, more and a separate archive) on our project mailing list. Harshula, our standards expert, tried to explain to Mr Donald how the basic Unicode code page and Cartesian products of various "sets" create the complete Sinhala character set. However, Mr Donald never tried to cooperate with us in "understanding" it, and the discussion led nowhere.

However, Mr Donald selectively quoted some parts of the discussion on his site... ;-)

Recently, Niranjan Meegammana, creator of kaputa.com, started a Google Group to communicate in Sinhala - using Unicode. This group has now grown into a very interesting community with a unique, intellectual and polite culture. Although the group members use diverse technologies to write and read Sinhala Unicode, we find the standard quite functional and interoperable. And we use yansaya, rakaaransaya and other "special characters" every day.

Most of us on this group have a great passion for language and literature, and therefore the discussions are very interesting and intellectually rich.

This Google group was meant to communicate in Sinhala Unicode to popularize it, and to act as a test bed for implementations. Mr Donald recently joined the group, not to communicate in Sinhala Unicode, but to start another debate. He continues to repeat the same old story and conveniently ignores some of our questions.

Here are some of Mr Donald's claims and what I think of them.

Donald G: SLS1134 doesn't contain all the Sinhala characters

Wrong. Here is why:

Most Western languages use simple alphabets. Even with the upper and lower case variants, and some "odd" characters with bubbles and hats, the number of characters doesn't exceed 50-100.

However, Asian languages are different. Most characters have different "forms", determined either phonetically (e.g.: Sinhala, Tamil and Hindi), or by their location in the word (e.g.: Arabic). Therefore, it's impractical to allocate a code point for each variant.

Think of atoms and molecules. There is a limited number of atoms, and molecule names can be formed by putting together the names of atoms. I have never heard of a "Chemists' Revolution" demanding a symbol for each molecule....;-)

Unicode is very similar to chemistry in that sense. Each language is assigned a "code page", typically containing 128 "code points". They form the basis to build more complex character variants, i.e., actual characters seen by the eye, sometimes referred to as "glyphs".

In Western languages, "characters", "glyphs" and "code points" are effectively the same thing, because they don't need variants. For example, the English character, or glyph, "A" maps to code point 65, one to one.

For complex languages, only basic characters are represented by code points. Variants are produced by sequences of code points. For example, the character "da" (as in "dambana") is directly mapped to code point 0DAF, whereas "du" (as in "dumriya"), which is a variant of "da", is produced by the sequence of two code points 0DAF ("da") and 0DD4 ("paapilla"). More complex characters (glyphs) are formed by longer sequences.
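To make the "da" versus "du" example concrete, here is a minimal Python sketch. It assumes the code points U+0DAF for "da" and U+0DD4 for paapilla (the one the Unicode chart names KETTI PAA-PILLA); the helper name describe is only for illustration.

```python
import unicodedata

DA = "\u0DAF"        # SINHALA LETTER ALPAPRAANA DAYANNA ("da")
PAAPILLA = "\u0DD4"  # SINHALA VOWEL SIGN KETTI PAA-PILLA

# "du" is stored as two code points, even though it is drawn as one glyph.
du = DA + PAAPILLA


def describe(text):
    """Return a (hex code point, Unicode name) pair for each code point."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    print(du, "is", len(du), "code points long")
    for code_point, name in describe(du):
        print(code_point, name)
```

Whether the two code points show up as a single "du" on screen depends on the font and the rendering engine, not on the encoding.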

Most modern operating systems have rendering engines that can display proper glyphs from these sequences of code points (e.g.: Pango, QT, ICU on GNU/Linux, Uniscribe on Windows). Therefore, it is not a problem that each glyph doesn't have an individual code point.

In Unicode, some characters are directly mapped to code points, while others are produced by sequences of two or more code points.

Deciding which characters should be basic code points, and which characters should be produced by combining code points, is a different question. It obviously depends on the language, and is likely to be subjective. Input from several Sinhala scholars and experts has been taken into account to decide that repaya, rakaaransaya and yansaya should not be basic code points, but should be produced by using sequences of code points, as they are linguistically alternative forms. In other words, they are there as sequences of code points, not as single code points. Nevertheless, they are there, so the claim is wrong.
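As a rough illustration of "sequences of code points", the sketch below builds the three forms using the sequences I understand the standard to specify: consonant + al-lakuna + ZWJ + ya for yansaya, consonant + al-lakuna + ZWJ + ra for rakaaransaya, and ra + al-lakuna + ZWJ + consonant for repaya. The function names are made up for this example.

```python
AL_LAKUNA = "\u0DCA"  # SINHALA SIGN AL-LAKUNA (removes the inherent vowel)
ZWJ = "\u200D"        # ZERO WIDTH JOINER
YA = "\u0DBA"         # SINHALA LETTER YAYANNA
RA = "\u0DBB"         # SINHALA LETTER RAYANNA


def yansaya(consonant):
    """Consonant followed by the yansaya form of 'ya'."""
    return consonant + AL_LAKUNA + ZWJ + YA


def rakaaransaya(consonant):
    """Consonant followed by the rakaaransaya form of 'ra'."""
    return consonant + AL_LAKUNA + ZWJ + RA


def repaya(consonant):
    """Repaya form of 'ra' placed over the given consonant."""
    return RA + AL_LAKUNA + ZWJ + consonant


def code_points(text):
    """Space-separated hex code points, handy for comparing sequences."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    KA = "\u0D9A"  # SINHALA LETTER ALPAPRAANA KAYANNA
    for build in (yansaya, rakaaransaya, repaya):
        print(build.__name__, code_points(build(KA)), build(KA))
```

For "ka", the yansaya sequence comes out as 0D9A 0DCA 200D 0DBA. How it is drawn still depends on the font and rendering engine in use.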

If Mr Donald's claim is "yansaya, rakaaransaya and repaya should be individual code points", that would be more valid. However, somebody has to eventually decide what's basic and what's not, and it has already been done. Technically, this is not an issue at all.

Donald G: Unicode can't produce a matrix of 1600+ characters needed by OCR etc

Wrong. Here is why:

I am not an expert on OCR, but if Mr Donald claims that OCR requires a matrix of 1600+ characters, that's exactly what Sinhala Unicode is. Only it doesn't list all the 1600+ characters, but defines the basic code points (not characters) and the way to generate all the other characters by using sequences of them.

Even a primary school kid can understand something like this: "ka and paapilla produce ku, and this rule applies to all the consonants." It would be ridiculous if the document describing the standard included a table of 1600+ entries listing each variant (ka + paapilla = ku, kha + paapilla = khu ... la + paapilla = lu and so on)... ;-)
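The "rule instead of table" idea can be sketched in a few lines of Python. This is only an illustration: it takes the consonants to be the assigned letters in U+0D9A..U+0DC6 and the vowel signs to be the combining marks in U+0DCF..U+0DDF plus U+0DF2 and U+0DF3, and it covers only plain consonant + vowel sign pairs, not the conjunct forms.

```python
import unicodedata
from itertools import product


def sinhala_consonants():
    """Assigned letters (category Lo) in the range U+0D9A..U+0DC6."""
    return [chr(cp) for cp in range(0x0D9A, 0x0DC7)
            if unicodedata.category(chr(cp)) == "Lo"]


def sinhala_vowel_signs():
    """Dependent vowel signs: marks in U+0DCF..U+0DDF, plus U+0DF2 and U+0DF3."""
    candidates = list(range(0x0DCF, 0x0DE0)) + [0x0DF2, 0x0DF3]
    return [chr(cp) for cp in candidates
            if unicodedata.category(chr(cp)) in ("Mn", "Mc")]


def syllable_table():
    """Every consonant combined with every vowel sign, as code point pairs."""
    return [c + v for c, v in product(sinhala_consonants(), sinhala_vowel_signs())]


if __name__ == "__main__":
    table = syllable_table()
    print(len(sinhala_consonants()), "consonants x",
          len(sinhala_vowel_signs()), "vowel signs =", len(table), "combinations")
    print(" ".join(table[:10]))
```

The table is generated from the rule on demand, which is why the standard itself does not need to list every combination.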

Showing the basic code points and claiming "not all the characters are here" for the first time is fine. Second time is still fine, IMHO. But 100+th time is definitely a joke... ;-)

Donald G: SLS 1134 doesn't consider Tamil

There is no need.

Character representation in SLS 1134 is almost identical (if not identical) to the Sinhala subset of Unicode. As Sri Lanka is the only country that has a major Sinhala speaking population, it's SLSI's responsibility to contribute to Sinhala in Unicode, and SLSI does this through SLS 1134. Developers eventually use Unicode. To my knowledge, none of the FOSS packages found in a typical GNU/Linux system refer to SLS 1134. In other words, SLS 1134 is more of an intermediate standard.

India has a much bigger Tamil speaking population, and the Unicode code page for Tamil has already been worked out. Therefore, there is absolutely no need to create a separate standard for Sri Lanka. Sinhala Unicode is not a Sri Lankan standard either.

Donald G: Sinhala Unicode doesn't have yansaya on the keyboard

Wrong. Here is why:

Unicode is about representing characters. How they are typed using the keyboard is completely up to the keyboard driver. There are different keyboard drivers: some are classic Wijesekara, some modified Wijesekara, and some are transliterated (somewhat "singlish"). Some driver authors include yansaya etc on the keyboard itself, whereas others provide ZWJ as an alternative to type them.

Whatever the keyboard is, yansaya, rakaaransaya and repaya can be typed, and eventually represented and displayed in the same code point sequences.

Other claims

There are so many other claims on akuru.org. For example, Mr Donald from time to time challenges that certain words can't be "written" in Sinhala Unicode (latest being the name of the President). When we send him screenshots to show that it's possible (with and without joint characters), he claims that they are fake!!!

Hidden agenda?

There is a saying that it's easy to wake up a sleeping person, but it's very difficult to wake up someone who pretends to sleep.

Mr Donald has applied for a patent for his "system". Although he doesn't seem to have implemented it, he has promised to deliver results if given an "opportunity" (as far as I know, nobody is holding him back). And as Sinhala Unicode is becoming mainstream, his "pending" patent is going to be worthless, unless... oh, well!

Update: 2006-03-21 08:00

There are "valid" articles on akuru.org. Some are about the history of characters, and some are good articles by other authors. For example, articles written by Mr Aelien Silva, one of my favourite writers and linguists who has created so many good Sinhala technical words (e.g.: "manu", "thekala"), bring out very good points about technology localization. In fact, I have often quoted Mr Aelien Silva on the Sinhala Unicode list and elsewhere (need to enable Sinhala Unicode to read it, instructions are here for GNU/Linux and here for Windows, not sure how to do it on Mac... :-( ). However, I believe that hosting such articles is just an attempt to make akuru.org look more authentic, as it would otherwise be totally useless.



Sinhala Unicode on GNU/Linux

Update: in almost all GNU/Linux distributions released in the last two years, most if not all of the following settings are already done. You have only to install the font and an input method. Please check this page for more details.

Here are the steps to get Sinhala working on GNU/Linux. If you are running Debian or Ubuntu, there is an easier way. Most of the steps will have to be skipped on modern distributions, as Sinhala is mostly `enabled' in them.

Also, this guide assumes reasonable experience in using the GNU/Linux environment. If you think you are a newbie, please get a Guru involved... ;-)

Sinhala/Sri Lanka Locale for Glibc

This is a file `si_LK' in /usr/share/i18n/locales/. If it's not there, download it here.

If there is a /usr/share/i18n/SUPPORTED file in your system, make sure that there is an entry `si_LK UTF-8' in an alphabetically suitable place.

If you are using a recent version of glibc locales (e.g.: locales package on Debian Etch / Sid), si_LK is included and there is no need to download it. Hopefully, other distros will begin to ship it, too.

Aliases for Glibc Locale (Optional)

Add these lines to /etc/locale.alias so that you can refer to si_LK.UTF-8 locale as si, si_LK or sinhala. If this file is not there, skipping this step is harmless.

sinhala  si_LK.UTF-8
si       si_LK.UTF-8
si_LK    si_LK.UTF-8

Generating the Glibc Locale (Debian based systems)

Non-Debian users should skip this step.

Run `dpkg-reconfigure locales'. Select si_LK.UTF-8 locale and other UTF-8 locales (e.g.: en_US.UTF-8, en_GB.UTF-8). Make sure to select a UTF-8 locale (not necessarily si_LK) as the default locale.

Generating the Glibc Locale (non-Debian systems)

Debian users should skip this step.

Generate the si_LK.UTF-8 locale by running:

localedef -i si_LK -f UTF-8 -A /etc/locale.alias si_LK
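Once the locale has been generated, a quick way to check that programs can actually select it is to ask the C library through Python's locale module. This is just a convenience sketch; the function name locale_available is made up for this example.

```python
import locale


def locale_available(name="si_LK.UTF-8"):
    """Return True if the C library accepts `name` as a locale."""
    saved = locale.setlocale(locale.LC_ALL)  # remember the current setting
    try:
        locale.setlocale(locale.LC_ALL, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # always restore it


if __name__ == "__main__":
    if locale_available():
        print("si_LK.UTF-8 is available")
    else:
        print("si_LK.UTF-8 is missing; generate the locale first")
```

Running `locale -a' and looking for si_LK in the output tells you much the same thing.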

X-window Locale

Most of the X window programs used on GNU/Linux (GNOME, GTK, QT and KDE apps) use the Glibc locale, and there is no need to add a full-fledged locale to X. However, if the X Window system doesn't know about si_LK, X programs will complain about it as an unknown locale. A common practice is to alias such locales to en_US.UTF-8 to avoid this.

If you are running xorg 6.9.0 (or later) or a recent version of XFree86, this is already done; please jump to the next step.

Otherwise, locate the files locale.dir and compose.dir in /usr/X11R6/lib/X11/locale/ and add suitable lines. Notice that you need to add two lines in each file, one without a colon:

en_US.UTF-8/XLC_LOCALE       si_LK.UTF-8

and one with a colon:

en_US.UTF-8/XLC_LOCALE:      si_LK.UTF-8

Lines in compose.dir are similar, except `XLC_LOCALE' is replaced with `Compose'.
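If it helps to see all four lines spelled out, here is a small Python sketch that generates them. It only builds the text; the function name x_alias_lines is hypothetical, and adding the output to the real files is left to you.

```python
def x_alias_lines(locale_name="si_LK.UTF-8", target="en_US.UTF-8",
                  kind="XLC_LOCALE"):
    """Return the two alias lines (without and with a colon) for one file.

    Use kind="XLC_LOCALE" for locale.dir and kind="Compose" for compose.dir.
    """
    entry = f"{target}/{kind}"
    return [f"{entry}\t{locale_name}", f"{entry}:\t{locale_name}"]


if __name__ == "__main__":
    for filename, kind in (("locale.dir", "XLC_LOCALE"),
                           ("compose.dir", "Compose")):
        print(f"# lines for {filename}")
        for line in x_alias_lines(kind=kind):
            print(line)
```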

Sinhala Unicode Fonts

It's good to see more and more new Unicode Sinhala fonts being released. Unfortunately, the FreeFont project includes Sinhala characters that don't have correct rendering tables, and sometimes this font takes precedence over other, correct Unicode fonts, causing wrong rendering of kombuwa and other specially handled glyphs. A quick workaround would be to remove the freefont package (sometimes called ttf-freefont) if it's installed.

Downloading the LK-LUG Unicode font and copying it to .fonts/ directory in your home directory is sufficient for most cases. Copy it to /usr/local/share/fonts/ to make it available globally.
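The copy step can also be scripted. The sketch below assumes you have already downloaded the font; the file name in the usage line is a placeholder, and install_font is a name invented for this example.

```python
import shutil
from pathlib import Path


def install_font(font_file, fonts_dir=None):
    """Copy a font file into fonts_dir (default: ~/.fonts) and return its new path."""
    fonts_dir = Path(fonts_dir) if fonts_dir else Path.home() / ".fonts"
    fonts_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    return Path(shutil.copy(font_file, fonts_dir))


# Example use (placeholder file name):
#   install_font("downloaded-sinhala-font.ttf")
#   install_font("downloaded-sinhala-font.ttf", "/usr/local/share/fonts")  # needs root
```

Running applications may need a restart (or an `fc-cache' run) before they notice the new font.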

I have written a more detailed description about fonts in X Windows here.

Sinhala Rendering in KDE/QT

If you are using a version of QT later than 3.3.4, Sinhala should be working fine. There was one bug in old versions of QT, which is now fixed in both the QT 3 and QT 4 series.

Sinhala Rendering in GNOME/GTK

If your Pango version is 1.8.1 or later, Sinhala should be working fine. 1.8.0 also supports Sinhala, but with a bug; Harshula's fix went into 1.8.1.

Touching letters are also now supported.


Firefox renders Sinhala properly only if it's compiled with Pango. 1.0.x needs a patch, but Pango comes standard in the 1.5.x series. If you are using Firefox on RedHat / Fedora, it comes with the Pango patch, and there is nothing extra to be done.

The easiest way is to upgrade Firefox to 1.5 (hoping that it's compiled with Pango support) and set the environment variable MOZ_ENABLE_PANGO to 1.
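If you start Firefox from a launcher script, the variable can be set for that one process only. A minimal sketch, assuming the binary is called firefox and is on your PATH; pango_env is a made-up helper name.

```python
import os


def pango_env(base=None):
    """Return a copy of the environment with MOZ_ENABLE_PANGO set to 1."""
    env = dict(os.environ if base is None else base)
    env["MOZ_ENABLE_PANGO"] = "1"
    return env


# Example use:
#   import subprocess
#   subprocess.Popen(["firefox"], env=pango_env())
```

Exporting the variable in your shell profile has the same effect for every program started from that session.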

Sinhala Input

Earlier, we used separate input method modules for GTK and QT, but now they are obsoleted by SCIM and M17N input methods. Here are the steps to install them.

  • Install SCIM
  • Install SKIM if you use KDE
  • Get SCIM transliterated input method for Sinhala and install it
  • If you like to use Sinhala input method modules from M17N project, install SCIM-M17N bridge, and M17N input method modules.

Running skim in KDE or scim in GNOME should create an icon in the system tray that can be used to select the language. After that, you can use ctrl+space to switch between normal ASCII (English) input and SCIM input.

SCIM 1.4.4 doesn't have a Sinhala category, so Sinhala input methods are listed under `Other'. It's fixed now and a separate menu for Sinhala should be available in the next version.



Running Debian on the Desktop

Warning: I might change this article depending on feedback. This warning will be removed when I finish with such edits.

After hearing some recent reports from users who tried to run Debian GNU/Linux on their desktops, I thought it would be useful to list some guidelines to avoid some common pitfalls. Debian is great to run on the desktop, and with the recent udev/sysfs stuff, hardware detection works great, and apt has always been an excellent, if not the best, package management tool... :-)

Which version?

Use Debian `testing' (`Etch' at the time of writing this post). Period. Debian `stable' (`Sarge') doesn't get new versions of software, but only bug and security fixes, which is great for servers; not for desktops. Debian `unstable' (`Sid') gets the latest versions of software before `testing', but it breaks dependencies from time to time, which is also not desirable for a general desktop.

How many CDs?

If you have a good Internet connection, getting the small `netinst' CD is good enough. Otherwise, the first 2-4 CDs are all you need. If you ever need more, then you won't be reading this guide... ;-) You can use jigdo to download the official weekly builds, or less preferably download the ISOs (CD/DVD) here. There were problems with some of the `unofficial' CD/DVD builds, so I recommend getting the official builds.

Linux kernel 2.6

If you want all the bells and whistles of the latest Linux kernel, instead of just hitting ENTER when booting the installation CD, type `linux26'.

Partitions

For a desktop, it's sufficient to have three partitions: one swap partition (about 512 MB - 1 GB), one root (/) partition for the installation (2-10 GB, depending on the amount of software you plan to install), and one home partition (/home) for personal files (the size of this partition depends on your needs; if you are going to have a lot of audio/video files, it will fill up soon).

By having a separate /home partition, it's possible to play around with the installation in the root (/) without losing the personal files. If you are planning to have databases and other servers, they will consume space in /var, which is in the root (/) in the above settings. Consider increasing the size of the root partition, or creating a separate /var partition if this is the case. Personally, I use a single root partition (not a separate /home), because once you install Debian, there is absolutely no need to install again, as you can keep on upgrading... :-) This installation I am using right now has survived three laptop migrations!!!

Filesystem

Use a journalling filesystem. I personally prefer reiserfs, as it has good indexing capabilities and is excellent when there are lots of small files. If you have much larger files instead, consider xfs or jfs. Ext3 is also not a bad choice.

What to install?

With Debian, it is best to finish the basic install as soon as possible. Therefore, don't select anything extra (x windows or gnome etc), and don't add any additional CDs. Just use the first CD and get the installation over with.

Now what?

Once installation is over, login as root, and add any additional CDROMs with the apt-cdrom tool.

# apt-cdrom add

Repeat this for all the CDs available. Consult /etc/apt/sources.list if you want to verify.

Now add X windows, and KDE and/or GNOME desktop(s). Again, if you prefer any other desktop (fvwm or windowmaker), you won't be reading this guide... ;-)

# apt-get install x-window-system-core
# apt-get install kde
# apt-get install gnome-desktop-environment

You can play around with /etc/X11/xorg.conf to tweak your X window settings (or alternatively, try the `dexconf' tool). Use the `startx' command to start X windows. Once you are happy with the X window settings (resolution, depth and refresh frequency etc), install a graphical login manager (e.g.: kdm).

# apt-get install kdm
# /etc/init.d/kdm start

One last bit of advice: always login as a normal user, not as root. Open a terminal and run `su -' to become root if necessary. This will reduce the chances of you harming the system, ensuring a long life for the installation.

I am hoping to write another howto on installing multimedia stuff and Java.

Disclaimer: I am sorry if this makes you forget what `installing' is, because it's very unlikely that you will do another installation ever again... ;-) By the way, happy Debianning!