2008-04-20

LaTeX and Sinhala Unicode

When we met at Excel World on the 17th, Bud, Srimal and I started talking about using Sinhala Unicode in TeX / LaTeX.

It didn't occur to me that Chamath, who also created one of the first Sinhala FOSS keyboard drivers, had already created a preprocessor for LaTeX called sintex, which reads Sinhala files in Unicode/UTF-8. In fact, not only had I replied to his announcement, I had also sent a patch to Debianize it! Life is too complex, and I am too human to keep track of all this.

But that forgetfulness turned out to be a lucky accident, as our pursuit led to something more useful!

So we started creating a preprocessor for Vasantha Saparamadu's Sinhala TeX package, which uses the Samanala transliteration scheme.

However, Bud pointed out that the generated PDF files would contain ASCII characters instead of Unicode, which is a problem for search engines that index them and convert them into "HTML view" pages.

After some research, we found XeTeX, a Unicode-enabled TeX engine that can also run LaTeX (as xelatex).

XeTeX uses ICU for text layout, and ICU versions after 3.6 support Sinhala out of the box. However, the latest stable version of XeTeX, 0.996, uses a statically linked ICU 3.4. I managed to patch the "tetex-xetex" package that comes with Debian to make it recognize Sinhala. The patches were also submitted to Debian.

Font changes in XeTeX are always manual, which made the source look ugly. After a bit of research, I found the zhspacing package, which among other things automatically sets fonts for Chinese characters. It is a complicated package, but I managed to get an idea of how it uses the character class feature in the latest XeTeX version, 0.997.
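
To illustrate what "manual" means here, a minimal sketch of a mixed document where every Sinhala run has to be wrapped in its own font switch. It assumes the LKLUG font is installed and uses \newfontfamily, the command name in recent fontspec releases (the style file later in this post uses the older \newfontinstance); the sample sentence is only an illustration.

```latex
\documentclass{article}
\usepackage{fontspec}
% Assumption: LKLUG is installed and visible to XeTeX.
% \newfontfamily is the current fontspec name for \newfontinstance.
\newfontfamily\sifont[Script=Sinhala]{LKLUG}
\begin{document}
% Each Sinhala run needs an explicit group and font switch.
This is {\sifont සිංහල} text, and here is more {\sifont සිංහල} text.
\end{document}
```

With more than a few Sinhala words per paragraph, these explicit switches quickly clutter the source, which is what the character class feature avoids.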

Downloading the latest version of XeTeX from the SVN repository and building it for Debian was not difficult, except that I had to edit the debian/control files to replace the tetex-base and tetex-bin dependencies with their texlive counterparts. I also had to get xdvipdfmx first. Here is a rough sketch of the work.

% mkdir xdvipdfmx
% cd xdvipdfmx
% svn co http://scripts.sil.org/svn-view/xdvipdfmx/TRUNK
% cd TRUNK
% chmod +x debian/rules
# dpkg-buildpackage -b
# cd ..
# dpkg --purge dvipdfmx
# dpkg -i xdvipdfmx...deb
% cd ..

% mkdir xetex
% cd xetex
% svn co http://scripts.sil.org/svn-view/xetex/TRUNK
% cd TRUNK
% vi debian/control
% chmod +x debian/rules
# dpkg-buildpackage -b
# cd ..
# dpkg --purge texlive-xetex
# dpkg -i xetex...deb

As the XeTeX web site had warned, the Debian build files provided by vanilla XeTeX were not up to date. After installing, I had to create /etc/texmf/fmt.d/10local.cnf with the following two lines:

xetex   xetex  -             *xetex.ini
xelatex xetex  language.dat  *xelatex.ini

and then run the following commands:

# update-fmtutil
# fmtutil-sys --enablefmt xetex
# fmtutil-sys --enablefmt xelatex

to make the "xelatex" command work properly.
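
One way to confirm that the rebuilt engine is the one being picked up is to typeset its version number. This is only a sketch: the file name version-check.tex is hypothetical, while \XeTeXversion and \XeTeXrevision are XeTeX primitives.

```latex
% version-check.tex (hypothetical file name); run with: xelatex version-check.tex
\documentclass{article}
\begin{document}
% \XeTeXversion is an integer register, \XeTeXrevision expands to a string such as ".997".
XeTeX version: \the\XeTeXversion\XeTeXrevision
\end{document}
```

If the resulting PDF still shows 0.996, the old format files are probably still in use.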

After getting the latest version of XeTeX working, the last remaining step was to create a small style file, which I called "sinhala.sty", to switch fonts automatically for Sinhala.

% sinhala.sty version 20080420
% Typesetting mixed Sinhala documents in XeTeX
%
% Copyright (C) 2008 by Anuradha Ratnaweera
%
\ifx\XeTeXrevision\@undefined
  \errmessage{XeTeX is required to use sinhala}
\fi
\ifx\XeTeXinterchartokenstate\@undefined
  \errmessage{XeTeX 0.997 or above required to use sinhala}
\fi
\ProvidesPackage{sinhala}[2008/04/20]
\RequirePackage{fontspec}
\newfontinstance{\sifont}[Script=Sinhala]{LKLUG}
\newcommand\latinfont{\fontfamily{lmr}\selectfont}
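% Enable token insertion between characters of different classes
\XeTeXinterchartokenstate = 1
% Put the whole Sinhala block (U+0D80 to U+0DFF) into character class 10
\newcount\cnt\cnt="0D80
\loop
  \XeTeXcharclass\cnt=10 \ifnum\cnt<"0DFF \advance\cnt1
\repeat
% ZWNJ and ZWJ occur inside Sinhala text, so they get the same class
\XeTeXcharclass "200C = 10
\XeTeXcharclass "200D = 10
% Entering Sinhala from ordinary text (class 0) or a boundary (class 255)
\XeTeXinterchartoks 0 10 = {\sifont}
\XeTeXinterchartoks 255 10 = {\sifont}
% Leaving Sinhala: switch back to the Latin font
\XeTeXinterchartoks 10 0 = {\latinfont}
\XeTeXinterchartoks 10 255 = {\latinfont}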

So, all you need is XeTeX 0.997 and sinhala.sty to write LaTeX files using Sinhala Unicode.
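
As a usage sketch, a mixed document then needs no explicit font commands at all. This assumes that sinhala.sty as listed above is on the TeX search path and that the LKLUG font is installed; the sample sentence is only an illustration.

```latex
\documentclass{article}
% Assumption: sinhala.sty (listed above) is on the TeX search path
% and the LKLUG font is installed.
\usepackage{sinhala}
\begin{document}
% No manual switching: the character classes pick the font.
LaTeX and සිංහල text can be mixed freely in the same paragraph.
\end{document}
```

Compile it with xelatex rather than latex or pdflatex, since the style file stops with an error on other engines.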
