Comments on Anuradha's Diary: Is Sinhala Unicode Incomplete?

The content of this post is great. I don't know, Mr Donald...

2008-05-06T13:28:00.000+05:30

The content of this post is great.
I don't know, Mr Donald seems to have a mistaken idea of how technology advances. I think that if he understood 3rd Normal Form and database management, he would not say such things.
Likewise, I think he calls Sinhala Unicode incomplete in order to gain some advantage for himself.

In my view, Sinhala Unicode is perfectly fine as it stands today.

Anyone can talk, but who is there to actually do the work? U...

2008-05-04T15:04:00.000+05:30

Anyone can talk, but who is there to actually do the work? Thanks to Unicode, even the mobile phone now works in Sinhala... Unicode is the way to go! Who says it's not cross-platform compatible?

Just as the country moves forward, these devils try to sabotage it. How did Sinhala come to Google? Thanks to Unicode!

How do we search in Sinhala?
Thanks to Unicode!

My understanding is that what Anuradha says is right. Sinhala Unicode...

2007-09-06T14:24:00.000+05:30

My understanding is that what Anuradha says is right. Thanks to Sinhala Unicode, I now write my blog in Sinhala.
I have learned a great deal from reading what is published here. Best wishes!

Dear Anuradha, thanks to Unicode, today I...

2007-04-10T23:37:00.000+05:30

Dear Anuradha,
Thanks to Unicode, I now write my blog in Sinhala.
By reading what was published on your blog, I was also able to find a lot of information for this. Many thanks for that!

I said: Sinhala Unicode is not a Sri Lankan Standar...

2006-03-22T11:15:00.000+06:00

I said:

Sinhala Unicode is not a Sri Lankan Standard either.

This is good, and there is absolutely no need to "correct it".

If we have a standard for Sri Lanka only, it is unlikely to be supported by international software. But because Sinhala Unicode is an international standard, all software written anywhere in the world that supports Unicode automatically supports Sinhala.

Being an international standard, anyone living anywhere in the world can communicate in Sinhala Unicode. It has already become a reality thanks to the implementations. We have practically proved it through the Sinhala Unicode Group and elsewhere.

That's why I said SLS 1134 is an interim local standard, whereas Unicode is going to be the eventual international standard, although both are identical.

Well said Anuradha. I really appreciate what you a...

2006-03-21T18:21:00.000+06:00

Well said Anuradha. I really appreciate what you are doing.

I read further about Unicode on the Unicode.org site. It clearly indicates there is nothing wrong with the specification for Sinhala.

I found that the Unicode FAQ pages answer the questions raised by Donald.
Firstly, the 'Where is my character' FAQ explains that not all glyphs are encoded.
http://www.unicode.org/standard/where/
There are various good examples given. The best example on that page is 'ch', which is considered a character in Slovak and Traditional Spanish. It is not allocated a code point of its own and instead uses 0063 and 0068, i.e. the code points for 'c' and 'h'. There are other examples for Indian scripts as well.

Secondly, there is Donald's claim that the current spec will break sorting. I also used to think there was some truth to this, but not any more. See the following page.
http://www.unicode.org/faq/collation.html
I quote:

--start quote

My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?

A: There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary.

--end quote
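To make the quoted answer concrete, here is a minimal Python sketch of sorting by weights instead of by code point. It is not the Unicode Collation Algorithm: the weight table (`PRIMARY`) and the helper `collation_key` are hypothetical, cover only lowercase and uppercase Latin letters, and exist only to show why raw code point order fails even for English.

```python
# Toy two-level collation. NOT the real UCA or its default weight table;
# PRIMARY and collation_key are made-up names for illustration only.

# Level 1 weights: one weight per base letter, shared by both cases.
PRIMARY = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def collation_key(word):
    # Compare base letters first, ignoring case.
    primary = [PRIMARY[ch.lower()] for ch in word]
    # Break ties on a second level: lowercase before uppercase.
    case_level = [0 if ch.islower() else 1 for ch in word]
    return (primary, case_level)

words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Raw code point order: every uppercase letter sorts before every lowercase one.
by_code_point = sorted(words)

# Weight-based order: words are grouped by letter, then by case.
by_weights = sorted(words, key=collation_key)

print(by_code_point)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']
print(by_weights)     # ['apple', 'Apple', 'banana', 'Banana', 'cherry']
```

The order in which letters were assigned code points never enters `collation_key`; only the weight table does, which is the point the FAQ answer makes.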

Donald clearly has another agenda. Glad that none of the developers have fallen into his trap.

Answer to JC's question I can't give a complete ans...

2006-03-21T17:27:00.000+06:00

Answer to JC's question

I can't give a complete answer for the Sinhala Unicode enabler kit for Windows, as I use only GNU/Linux on my desktop. However, I have seen people installing it and rendering works after that. As far as I understand it, this kit installs a Unicode Sinhala font, adds Sinhala rendering to Uniscribe and also installs a keyboard driver.

To view Sinhala Unicode, you don't need the keyboard driver.

"DU" is registered in Unicode. I have already ment...

2006-03-21T17:07:00.000+06:00

"DU" is registered in Unicode. I have already mentioned in the article that it is the sequence 0DAF ("da") followed by 0DD4 ("papilla").

Explanation:

Strictly speaking, "DU" is a grapheme rather than a character, and that's why it doesn't need to have a single code point. See this FAQ from the official Unicode site. According to the answer to question 2 (quoted in my earlier comment, too):

The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

"DU" is a "minimally distinctive unit in writing" to an end user, therefore it's a grapheme.

0DAF ("da") followed by 0DD4 ("papilla") is a "combining character sequence", which generates the grapheme "DU". See the answer to question 1 of the same FAQ:

A combining character sequence is a base character followed by any number of combining characters.

0DAF ("da") is the "base character" here, and 0DD4 ("papilla") is a "combining character".
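A small Python sketch of the same point, using only the standard `unicodedata` module. It assumes the short-u sign (papilla) is U+0DD4, the code point the Unicode character database names SINHALA VOWEL SIGN KETTI PAA-PILLA; the variable names are only for illustration.

```python
import unicodedata

# "DU" written as a base letter followed by a vowel sign.
du = "\u0DAF\u0DD4"

# Print each code point with its general category and official name.
for ch in du:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# One grapheme for the reader, two code points for the program.
code_points = len(du)

# The base character is a letter (category L*),
# the vowel sign is a combining mark (category M*).
base_category = unicodedata.category(du[0])
sign_category = unicodedata.category(du[1])
```

A font and rendering engine then draw the two code points as the single shape the reader knows as "DU".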

This FAQ page on official Unicode site explains wh...

2006-03-21T10:58:00.000+06:00

This FAQ page on the official Unicode site explains why Mr Donald's "each character should have a unique code point" claim is a myth. Note in particular the first two entries, and the example of a Devanagari (the script used to write Hindi and Sanskrit) "ka" variation:

Quoting from that page:

Q: Does "text element" mean the same as "combining character sequence"?

A: No, this is a common misperception. A text element just means any sequence of characters that are treated as a unit by some process. A combining character sequence is a base character followed by any number of combining characters. It is one type of a text element, but words and sentences are also examples of text elements.

Q: So is a combining character sequence the same as a "character"?

A: That depends. For a programmer, a Unicode code value represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

For example, å (A + COMBINING RING or A-RING) is a grapheme in the Danish writing system, while KA + VIRAMA + TA + VOWEL SIGN U is one in the Devanagari writing system. Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes. Moreover, there are a number of other cases where a user would not count "characters" the same way as a programmer would: where there are invisible characters such as the RLM used in BIDI, compatibility composites such as "Dz", "ij", or Roman numerals, and so on.
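The two examples in the quoted answer can be checked with a short Python sketch, again using only the standard `unicodedata` module. The code points chosen here are my reading of the FAQ's examples (lowercase a plus U+030A COMBINING RING ABOVE, and Devanagari U+0915 U+094D U+0924 U+0941), and the variable names are illustrative.

```python
import unicodedata

# "a" followed by COMBINING RING ABOVE: one grapheme, two code points.
a_ring_decomposed = "a\u030A"

# Unicode also happens to encode a precomposed form (U+00E5),
# which NFC normalization produces.
a_ring_composed = unicodedata.normalize("NFC", a_ring_decomposed)

# Devanagari KA + VIRAMA + TA + VOWEL SIGN U: one grapheme, four code points.
conjunct = "\u0915\u094D\u0924\u0941"

# No precomposed code point exists for it, so NFC leaves it unchanged.
conjunct_nfc = unicodedata.normalize("NFC", conjunct)

print(len(a_ring_decomposed), len(a_ring_composed))  # 2 1
print(len(conjunct), len(conjunct_nfc))              # 4 4
```

The Devanagari grapheme works without any single code point of its own, which is the same arrangement Sinhala uses.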

Here are two more myths: Unicode provides a unique ...

2006-03-21T10:38:00.000+06:00

Here are two more myths:

Unicode provides a unique number for every character

This is just playing with words by quoting one isolated sentence. Here is why:

First, this is supposed to be quoted from the official Unicode site. However, the word "character" in this sentence has an implicit "basic" prefix, which is understood by anyone who has studied the standard in detail.

To put it more precisely, Unicode provides a unique code point only for basic "characters". Other characters are then generated by sequences of them.

This is not something special for Sinhala. Most of the other Asian languages are also quite happy with basic code points.

Mr Donald: Answer the question - "DU" "Yansaya" "repaya" in four digits in unicode

I have asked three times: "why four digits? what's wrong with six, eight or a hundred?" And now I am going to ask for the fourth time.