2006-03-20

Is Sinhala Unicode Incomplete?

"The SLSI 1134 is incorrect & incomplete and it should be corrected immediately," claims Mr Donald Gaminitillake, who is trying to ignite a campaign against the Sinhala Unicode standard through www.akuru.org (history of the site) and frequent newspaper articles.

We of the Sinhala GNU/Linux project think otherwise. And we are not alone. The Language Technology Research Center of the University of Colombo School of Computing, research groups from the University of Moratuwa and the Arthur C Clarke Center for Modern Technology, Microsoft, Microimage and Science Land also think that the standard is correct. A full list would be quite long.

GNU/Linux was the first platform to implement Sinhala Unicode rendering. We didn't find any issues with encoding or displaying the characters Mr Donald claims are impossible: yansaya, rakaaransaya, reepaya, joint letters and all that. Then Microsoft also released a "Sinhala Enabling Kit for Windows". Most vendors today support Sinhala Unicode. None of those who actually got their hands dirty by writing code to implement the standard sees any missing "letters" in it.

Implementation is proof for most people. But for some not-so-obvious "reason" Mr Donald continues to say that certain characters are missing!

Our first encounter with Mr Donald happened when I wrote an open letter to him, which became a lengthy debate (more, more and a separate archive) on our project mailing list. Harshula, our standards expert, tried to explain to Mr Donald how the basic Unicode code page and Cartesian products of various "sets" create the complete Sinhala character set. However, Mr Donald never tried to cooperate with us in "understanding" it, and the discussion led nowhere.

However, Mr Donald selectively quoted some parts of the discussion on his site... ;-)

Recently, Niranjan Meegammana, creator of kaputa.com, started a Google Group to communicate in Sinhala - using Unicode. This group has now grown to a very interesting community of a unique, intellectual and polite culture. Although the group members use diverse technologies to write and read Sinhala Unicode, we find the standard quite functional and interoperable. And we use yansaya, rakaaransaya and other "special characters" every day.

Most of us on this group have a great passion for language and literature, and therefore the discussions are very interesting and intellectually rich.

This Google group was meant to communicate in Sinhala Unicode to popularize it, and to act as a test bed for implementations. Mr Donald recently joined the group, not to communicate in Sinhala Unicode, but to start another debate. He continues to repeat the same old story and conveniently ignores some of our questions.

Here are a couple of Mr Donald's claims and what I think of them.

Donald G: SLS1134 doesn't contain all the Sinhala characters

Wrong. Here is why:

Most Western languages have simple alphabets. Even with the upper and lower case variants, and some "odd" characters with bubbles and hats, the number of characters doesn't exceed 50-100.

However, Asian languages are different. Most characters have different "forms", determined either phonetically (e.g.: Sinhala, Tamil and Hindi) or by the position within the word (e.g.: Arabic). Therefore, it's impractical to allocate a code point for each variant.

Think of atoms and molecules. There is a limited number of atoms, and molecule names can be formed by putting together the names of atoms. I have never heard of a "Chemists' Revolution" demanding a symbol for each molecule....;-)

Unicode is very similar to chemistry in that sense. Each script is assigned a "code page" (a block, in Unicode terms); the Sinhala one contains 128 "code points". They form the basis to build more complex character variants, i.e., actual characters seen by the eye, sometimes referred to as "glyphs".

In Western languages, "characters", "glyphs" and "code points" are, for practical purposes, the same thing, because they don't need variants. For example, the English character, or glyph, "A" maps to code point 65 (U+0041): one to one.

For complex languages, only basic characters are represented by code points. Variants are produced by sequences of code points. For example, the character "da" (as in "dambana") is directly mapped to code point 0DAF, whereas "du" (as in "dumriya"), which is a variant of "da", is produced by the sequence of two code points 0DAF ("da") and 0DD4 ("paapilla"). More complex characters (glyphs) are formed by longer sequences.
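The "du" example can be checked with a few lines of Python. This is only a minimal sketch: the helper name `describe` is made up for illustration, and it uses U+0DD4, which the Unicode character database names as the short "u" vowel sign (ketti paa-pilla).

```python
import unicodedata

# "du" as a sequence of two code points: the base letter "da"
# followed by the vowel sign for short "u".
DA = "\u0DAF"
PAAPILLA = "\u0DD4"
du = DA + PAAPILLA


def describe(text):
    """Return (hex code point, Unicode name) for each code point in text."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in describe(du):
    print(code, name)
```

On a system with a Sinhala font and a shaping engine, printing `du` itself should show a single glyph, even though the string holds two code points.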

Most modern operating systems have rendering engines that can display proper glyphs from these sequences of code points (e.g.: Pango, Qt, ICU on GNU/Linux, Uniscribe on Windows). Therefore, it is not a problem that each glyph lacks an individual code point.

In Unicode, some characters are directly mapped to code points, while others are produced by sequences of two or more code points.

Deciding which characters should be basic code points, and which characters should be produced by combining code points, is a different question. The answer obviously depends on the language, and is likely to be subjective. Input from several Sinhala scholars and experts has been taken into account in deciding that repaya, rakaaransaya and yansaya should not be basic code points, but should be produced by sequences of code points, as they are linguistically alternative forms. In other words, they are there as sequences of code points, not as single code points. Nevertheless, they are there, so the claim is wrong.

If Mr Donald's claim is "yansaya, rakaaransaya and reepaya should be individual code points", that would be more valid. However, somebody has to eventually decide what's basic and what's not, and it has already been done. Technically, this is not an issue at all.

Donald G: Unicode can't produce a matrix of 1600+ characters needed by OCR etc

Wrong. Here is why:

I am not an expert on OCR, but if Mr Donald claims that OCR requires a matrix of 1600+ characters, that's exactly what Sinhala Unicode is. Only it doesn't list all the 1600+ characters, but defines the basic code points (not characters) and the way to generate all the other characters by using sequences of them.

Even a primary school kid can understand something like this: "ka and paapilla produce ku, and this rule applies to all the consonants." It would be ridiculous if the document describing the standard included a 1600+ row table listing each variant (ka + paapilla = ku, kha + paapilla = khu ... la + paapilla = lu and so on)... ;-)
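The "rule instead of table" idea can be sketched as a Cartesian product. The handful of consonants and vowel signs below are picked only for illustration, and `build_matrix` is a hypothetical helper, not part of any standard or library.

```python
from itertools import product

# A small sample of base consonants and vowel signs (illustrative only).
consonants = {"ka": "\u0D9A", "kha": "\u0D9B", "da": "\u0DAF", "la": "\u0DBD"}
vowel_signs = {"aa": "\u0DCF", "i": "\u0DD2", "u": "\u0DD4"}


def build_matrix(bases, signs):
    """Generate every consonant + vowel sign sequence from two small sets."""
    return {
        f"{base_name}+{sign_name}": base + sign
        for (base_name, base), (sign_name, sign) in product(bases.items(), signs.items())
    }


matrix = build_matrix(consonants, vowel_signs)
print(len(matrix))         # 4 consonants x 3 vowel signs = 12 sequences
print(matrix["ka+u"])      # "ku", stored as two code points
```

Feeding the full sets of consonants and vowel signs through the same function grows the matrix accordingly, without anyone having to list each entry by hand.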

Showing the basic code points and claiming "not all the characters are here" for the first time is fine. Second time is still fine, IMHO. But 100+th time is definitely a joke... ;-)

Donald G: SLS 1134 doesn't consider Tamil

There is no need.

Character representation in SLS 1134 is almost identical (if not identical) to the Sinhala subset of Unicode. As Sri Lanka is the only country that has a major Sinhala speaking population, it's SLSI's responsibility to contribute to Sinhala in Unicode, and SLSI does this through SLS 1134. Developers eventually use Unicode. To my knowledge, none of the FOSS packages found in a typical GNU/Linux system refer to SLS 1134. In other words, SLS 1134 is more of an intermediate standard.

India has a much bigger Tamil speaking population, and the Unicode code page for Tamil has already been worked out. Therefore, there is absolutely no need to create a separate standard for Sri Lanka. Sinhala Unicode is not a Sri Lankan standard either.

Donald G: Sinhala Unicode doesn't have yansaya on the keyboard

Wrong. Here is why:

Unicode is about representing characters. How they are typed using the keyboard is completely up to the keyboard driver. There are different keyboard drivers: some are classic Wijesekara, some are modified Wijesekara, and some are transliterated (somewhat "singlish"). Some driver authors include yansaya etc. on the keyboard itself, whereas others provide ZWJ as an alternative way to type them.

Whatever the keyboard is, yansaya, rakaaransaya and repaya can be typed, and eventually represented and displayed in the same code point sequences.
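As a rough sketch of what those code point sequences look like, the snippet below builds them the way I understand the Unicode description of Sinhala: al-lakuna (U+0DCA) and ZWJ (U+200D) placed between the consonant and "ya" or "ra", with "ra" leading for repaya. The function names are made up for illustration; treat the exact sequences as an assumption to verify against the standard.

```python
AL_LAKUNA = "\u0DCA"  # Sinhala virama (hal kirima)
ZWJ = "\u200D"        # zero width joiner
YA = "\u0DBA"
RA = "\u0DBB"
KA = "\u0D9A"


def yansaya(consonant):
    """consonant + al-lakuna + ZWJ + ya"""
    return consonant + AL_LAKUNA + ZWJ + YA


def rakaaransaya(consonant):
    """consonant + al-lakuna + ZWJ + ra"""
    return consonant + AL_LAKUNA + ZWJ + RA


def repaya(consonant):
    """ra + al-lakuna + ZWJ + consonant"""
    return RA + AL_LAKUNA + ZWJ + consonant


for label, seq in [("kya", yansaya(KA)), ("kra", rakaaransaya(KA)), ("rka", repaya(KA))]:
    print(label, " ".join(f"{ord(ch):04X}" for ch in seq))
```

Whichever keyboard layout is used, the text that ends up stored is the same sequence, which is why documents typed on different drivers stay interoperable.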

Other claims

There are so many other claims on akuru.org. For example, Mr Donald from time to time challenges that certain words can't be "written" in Sinhala Unicode (latest being the name of the President). When we send him screenshots to show that it's possible (with and without joint characters), he claims that they are fake!!!

Hidden agenda?

There is a saying that it's easy to wake up a sleeping person, but it's very difficult to wake up someone who pretends to sleep.

Mr Donald has applied for a patent for his "system". Although he doesn't seem to have implemented it, he has promised to deliver results if given an "opportunity" (as far as I know, nobody is holding him back). And as Sinhala Unicode is becoming mainstream, his "pending" patent is going to be worthless, unless... oh, well!

Update: 2006-03-21 08:00

There are "valid" articles on akuru.org. Some are about the history of characters, and some are good articles by other authors. For example, articles written by Mr Aelien Silva, one of my favourite writers and linguists, who has created so many good Sinhala technical words (e.g.: "manu", "thekala"), bring out very good points about technology localization. In fact, I have often quoted Mr Aelien Silva on the Sinhala Unicode list and elsewhere (you need to enable Sinhala Unicode to read it; instructions are here for GNU/Linux and here for Windows, not sure how to do it on Mac... :-( ). However, I believe that hosting such articles is just an attempt to make akuru.org, which would otherwise be totally useless, look more authentic.

10 comments:

Anuradha said...

Here are two more myths:

Unicode provides a unique number for every character

This is just playing with words by quoting one isolated sentence. Here is why:

First, this is supposed to be quoted from the official Unicode site. However, the word "character" in this sentence has an implicit "basic" prefix, which is understood by anyone who has studied the standard in detail.

To put it more precisely, Unicode provides a unique code point only for basic "characters". Other characters are then generated by sequences of them.

This is not something special for Sinhala. Most of the other Asian languages are also quite happy with basic code points.

Mr Donald: Answer the question - "DU" "Yansaya" "repaya" in four digits in unicode
I have asked three times: "why four digits? what's wrong with six, eight or hundred?" And now I am going to ask for the fourth time.

Anuradha said...

This FAQ page on the official Unicode site explains why Mr Donald's "each character should have a unique code point" claim is a myth. Note, in particular, the first two entries and the example of a Devanagari (the script used to write Hindi and Sanskrit) "ka" variation:

Quoting from that page:

Q: Does "text element" mean the same as "combining character sequence"?

A: No, this is a common misperception. A text element just means any sequence of characters that are treated as a unit by some process. A combining character sequence is a base character followed by any number of combining characters. It is one type of a text element, but words and sentences are also examples of text elements.

Q: So is a combining character sequence the same as a "character"?

A: That depends. For a programmer, a Unicode code value represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

For example, å (A + COMBINING RING or A-RING) is a grapheme in the Danish writing system, while KA + VIRAMA + TA + VOWEL SIGN U is one in the Devanagari writing system. Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes. Moreover, there are a number of other cases where a user would not count "characters" the same way as a programmer would: where there are invisible characters such as the RLM used in BIDI, compatibility composites such as "Dz", "ij", or Roman numerals, and so on.
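The two graphemes mentioned in the FAQ answer can be inspected in Python to see how "what the user counts" and "what the programmer counts" differ. A small sketch using only the standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

# A + COMBINING RING ABOVE: two code points, one grapheme for a Danish reader.
a_ring_decomposed = "A\u030A"
# NFC normalization maps it to the precomposed A-RING, a single code point.
a_ring_composed = unicodedata.normalize("NFC", a_ring_decomposed)

# Devanagari KA + VIRAMA + TA + VOWEL SIGN U: one grapheme, four code points,
# and there is no single precomposed code point for it.
ktu = "\u0915\u094D\u0924\u0941"
ktu_nfc = unicodedata.normalize("NFC", ktu)

print(len(a_ring_decomposed), len(a_ring_composed))  # code point counts
print(len(ktu), len(ktu_nfc))
```

Python's `len` counts code points, not graphemes, so the Devanagari example stays at four however it is normalized, yet a reader still sees one unit of writing.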

Anuradha said...

"DU" is registered in Unicode. I have already mentioned in the article that it is the sequence 0DAF ("da") followed by 0DD4 ("paapilla").

Explanation:

Strictly speaking, "DU" is a grapheme rather than a character, and that's why it doesn't need to have a single code point. See this FAQ from the official Unicode site. According to the answer to question 2 (quoted above, too):

The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

"DU" is a "minimally distinctive unit in writing" to an end user, therefore it's a grapheme.

0DAF ("da") followed by 0DD4 ("paapilla") is a "combining character sequence", which generates the grapheme "DU". See the answer to question 1 of the same FAQ:

A combining character sequence is a base character followed by any number of combining characters.

0DAF ("da") is the "base character" here, and 0DD4 ("paapilla") is a "combining character".

Anuradha said...

Answer to JC's question

I can't give a complete answer for the Sinhala Unicode enabler kit for Windows, as I use only GNU/Linux on my desktop. However, I have seen people install it, and rendering works after that. As far as I understand it, this kit installs a Unicode Sinhala font, adds Sinhala rendering to Uniscribe and also installs a keyboard driver.

To view Sinhala Unicode, you don't need the keyboard driver.

Prasad Gunaratne said...

Well said Anuradha. I really appreciate what you are doing.

I read further about Unicode on the Unicode.org site. It clearly indicates there is nothing wrong with the specification for Sinhala.

I found that the Unicode FAQ pages answer the questions raised by Donald.
Firstly, the 'Where is my character' FAQ explains that not all glyphs are encoded.
http://www.unicode.org/standard/where/
There are various good examples given. But the best example on that page is 'ch', which is considered a character in Slovak and Traditional Spanish. It is not allocated a code point of its own and instead uses 0063 and 0068, i.e. the code points for 'c' and 'h'. There are other examples for Indian scripts as well.

Secondly, there is Donald's claim that the current spec will break sorting. I also thought that there was some truth to this, but not any more. See the following page.
http://www.unicode.org/faq/collation.html
I quote:

--start quote

My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?

A: There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary.

--end quote
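The FAQ's point that code point order fails even for English is easy to demonstrate. The sketch below is a toy two-level key, not the Unicode Collation Algorithm; `toy_key` and its weights are invented purely for illustration.

```python
words = ["banana", "Zebra", "apple", "Apple"]

# Plain sorting compares code point values, so every uppercase letter
# sorts before every lowercase one.
by_code_point = sorted(words)


def toy_key(word):
    """Two-level weight: letters ignoring case first, then case as a tie-breaker."""
    return (word.casefold(), [ch.isupper() for ch in word])


by_weights = sorted(words, key=toy_key)

print(by_code_point)  # ['Apple', 'Zebra', 'apple', 'banana']
print(by_weights)     # ['apple', 'Apple', 'banana', 'Zebra']
```

A real implementation would take its weights from a collation table tailored for the language, but the principle is the same: the order of code points in the standard does not dictate the sort order.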

Donald clearly has another agenda. Glad that none of the developers have fallen into his trap.

Anuradha said...

I said:

Sinhala Unicode is not a Sri Lankan Standard either.

This is good, and there is absolutely no need to "correct it".

If we have a standard for Sri Lanka only, it is unlikely to be supported by international software. But since Sinhala Unicode is an international standard, any software written anywhere in the world that supports Unicode automatically supports Sinhala.

Being an international standard, anyone living anywhere in the world can communicate in Sinhala Unicode. It has already become a reality thanks to the implementations. We have practically proved it through the Sinhala Unicode Group and elsewhere.

That's why I said SLS 1134 is an intermediate local standard, whereas Unicode is going to be the eventual international standard, although both are identical.

Anandawardhana said...

Dear Anuradha,
Thanks to Unicode, I blog in Sinhala today.
Reading what was published on your blog also helped me find a lot of information for this. Many thanks for that!

තුසිත දසනායක said...

My understanding is that what Anuradha says is right. Thanks to Sinhala Unicode, I write my blog in Sinhala today.
I gained a lot of knowledge from reading what is published here. Best wishes!

කාලිංග said...

Anyone can talk, but who is there to actually do the work? Thanks to Unicode, even mobile phones are in Sinhala now.... Unicode is the way to go! Who says it's not cross-platform compatible?

When the country is moving forward, these devils throw a spanner in the works. How did Sinhala come to Google? Thanks to Unicode!

How do we search in Sinhala?
Thanks to Unicode!

Harshana said...

The content of this post is great.
I don't know, Mr Donald seems to have a wrong understanding of how technology has advanced. I think that if he understood 3rd Normal Form and database management, he wouldn't say such things.
Likewise, I think he sees Sinhala Unicode as incomplete only in order to gain an advantage for himself.

In my opinion, Sinhala Unicode as it stands now is perfectly fine.