Diff: HowToUnicodeHOWTO - Waikato Linux Users Group

Differences between current version and predecessor to the previous major change of HowToUnicodeHOWTO.

Newer page:	version 5	Last edited on Monday, October 25, 2004 5:06:27 am	by AristotlePagaltzis
Older page:	version 4	Last edited on Saturday, September 4, 2004 11:03:42 am	by GeorgeLinux

@@ -1,4109 +1 @@

-~~The Unicode HOWTO~~

-

-~~----~~

-

-~~!!!The Unicode HOWTO~~

-

-~~!!Bruno Haible,~~

-~~<haible@clisp.cons.org> v1.0, 23 January 2001~~

-

-~~----~~

-~~''This document describes how to change your Linux system so it uses UTF-8~~

-~~as text encoding.~~

-~~This is work in progress. Any tips, patches, pointers, URLs are very welcome.''~~

-~~----~~

-

-~~!!1. Introduction~~

-

-****1.1 Why Unicode?

-

-****1.2 Unicode encodings

-

-****1.3 Related resources

-

-~~!!2. Display setup~~

-

-****2.1 Linux console

-

-****2.2 X11 Foreign fonts

-

-****2.3 X11 Unicode fonts

-

-****2.4 Unicode xterm

-

-****2.5 !TrueType fonts

-

-****2.6 Miscellaneous

-

-~~!!3. Locale setup~~

-

-****3.1 Files & the kernel

-

-****3.2 Upgrading the C library

-

-****3.3 General data conversion

-

-****3.4 Locale environment variables

-

-****3.5 Creating the locale support files

-

-~~!!4. Specific applications~~

-

-****4.1 Shells

-

-****4.2 Networking

-

-****4.3 Browsers

-

-****4.4 Editors

-

-****4.5 Mailers

-

-****4.6 Text processing

-

-****4.7 Databases

-

-****4.8 Other text-mode applications

-

-****4.9 Other X11 applications

-

-~~!!5. Printing~~

-

-****5.1 Printing using !TrueType fonts

-

-****5.2 Printing using fixed-size fonts

-

-****5.3 The classical approach

-

-****5.4 No luck with...

-

-~~!!6. Making your programs Unicode aware~~

-

-****6.1 C/C++

-

-****6.2 Java

-

-****6.3 Lisp

-

-****6.4 Ada95

-

-****6.5 Python

-

-****6.6 !JavaScript/ECMAscript

-

-****6.7 Tcl

-

-****6.8 Perl

-

-****6.9 Related reading

-

-~~!!7. Other sources of information~~

-

-****7.1 Mailing lists

-

-~~----~~

-

-~~!!1. Introduction~~

-

-~~!!1.1 Why Unicode?~~

-

-~~People in different countries use different characters to represent the~~

-~~words of their native languages. Nowadays most applications, including~~

-~~email systems and web browsers, are 8-bit clean, i.e. they can operate on~~

-~~and display text correctly provided that it is represented in an 8-bit~~

-~~character set, like ISO-8859-1.~~

-

-~~There are far more than 256 characters in the world (think of Cyrillic,~~

-~~Hebrew, Arabic, Chinese, Japanese, Korean and Thai), and new characters~~

-~~are being invented now and then. The problems that come up for users are:~~

-

-****It is impossible to store text with characters from different character

-~~sets in the same document. For example, I can cite russian papers in~~

-~~a German or French publication if I use TeX, xdvi and !PostScript,~~

-~~but I cannot do it in plain text.~~

-~~****~~

-

-****As long as every document has its own character set, and recognition

-~~of the character set is not automatic, manual user intervention is~~

-~~inevitable. For example, in order to view the homepage of the~~

-~~XTeamLinux distribution~~

-~~http://www.xteamlinux.com.cn/~~

-~~I had to tell Netscape that the web page is coded in GB2312.~~

-~~****~~

-

-****New symbols like the Euro are being invented. ISO has issued a new

-~~standard ISO-8859-15, which is mostly like ISO-8859-1 except that it~~

-~~removes some rarely used characters (such as the old currency sign) and~~

-~~replaces them with others, including the Euro sign. If users adopt this standard, they~~

-~~have documents in different character sets on their disk, and they~~

-~~start having to think about it daily. But computers should make things~~

-~~simpler, not more complicated.~~

-~~****~~

-

-~~The solution to this problem is the adoption of a world-wide usable character~~

-~~set. This character set is Unicode~~

-~~http://www.unicode.org/.~~

-~~For more info about Unicode, do `man 7 unicode' (manpage contained~~

-~~in the man-pages-1.20 package).~~

-

-~~!!1.2 Unicode encodings~~

-

-~~This reduces the user's problem of dealing with character sets to a technical~~

-~~problem: How can Unicode characters be transported using 8-bit bytes?~~

-~~8-bit units are the smallest addressing units of most computers and also the~~

-~~unit used by TCP/IP network connections. The use of 1 byte to represent~~

-~~1 character is, however, an accident of history, caused by the fact that~~

-~~computer development started in Europe and the U.S. where 96 characters were~~

-~~found to be sufficient for a long time.~~

-

-~~There are basically four ways to encode Unicode characters in bytes:~~

-

-~~; __UTF-8__:~~

-

-~~128 characters are encoded using 1 byte (the ASCII characters).~~

-~~1920 characters are encoded using 2 bytes (Roman, Greek, Cyrillic,~~

-~~Coptic, Armenian, Hebrew, Arabic characters).~~

-~~63488 characters are encoded using 3 bytes (Chinese and Japanese among~~

-~~others).~~

-~~The other 2147418112 characters (not assigned yet) can be encoded~~

-~~using 4, 5 or 6 bytes.~~

-~~For more info about UTF-8, do `man 7 utf-8' (manpage contained~~

-~~in the man-pages-1.20 package).~~

-~~; __UCS-2__:~~

-

-~~Every character is represented as two bytes.~~

-~~This encoding can only represent the first 65536 Unicode characters.~~

-~~; __UTF-16__:~~

-

-~~This is an extension of UCS-2 which can represent 1112064 Unicode~~

-~~characters. The first 65536 Unicode characters are represented as two~~

-~~bytes, the other ones as four bytes.~~

-~~; __UCS-4__:~~

-

-~~Every character is represented as four bytes.~~
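The byte counts listed above can be checked with a short script. This is only a sketch using Python's built-in codecs; the sample characters and the function name `encoded_lengths` are chosen here for illustration and are not part of the HOWTO. Python has no separate UCS-2 codec, so only UTF-8, UTF-16 and UCS-4 (called `utf-32` in Python) are shown; for the first 65536 characters, UCS-2 and UTF-16 produce the same bytes.

```python
# Compare how many bytes each encoding needs for a few sample characters.
SAMPLES = {
    "A": "U+0041, ASCII letter",
    "\u00e9": "U+00E9, Latin small letter e with acute",
    "\u20ac": "U+20AC, Euro sign",
    "\U0001d11e": "U+1D11E, musical G clef (outside the first 65536 characters)",
}


def encoded_lengths(ch):
    """Return byte counts for one character, big endian, without byte-order mark."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UCS-4": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    for ch, description in SAMPLES.items():
        print(description, encoded_lengths(ch))
```

The G clef is the only sample that needs four bytes in UTF-16, because it lies outside the range that UCS-2 can represent.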

-

-~~The space requirements for encoding a text, compared to encodings currently~~

-~~in use (8 bits per character for European languages, more for~~

-~~Chinese/Japanese/Korean), are as follows. This has an influence on disk~~

-~~storage space and network download speed (when no form of compression is~~

-~~used).~~

-

-~~; __UTF-8__:~~

-

-~~No change for US ASCII, just a few percent more for ISO-8859-1,~~

-~~50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.~~

-~~; __UCS-2 and UTF-16__:~~

-

-~~No change for Chinese/Japanese/Korean. 100% more for~~

-~~US ASCII and ISO-8859-1, Greek and Cyrillic.~~

-~~; __UCS-4__:~~

-

-~~100% more for Chinese/Japanese/Korean. 300% more for US ASCII and~~

-~~ISO-8859-1, Greek and Cyrillic.~~
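The percentages above can be reproduced for short sample strings. The following is a rough sketch, assuming ISO-8859-7 as the legacy Greek encoding and Shift_JIS as the legacy Japanese one; the helper name `overhead_percent` and the sample strings are illustrative only, and real documents mix ASCII with other characters, so their overhead will differ.

```python
def overhead_percent(text, legacy_codec, unicode_codec):
    """Return how many percent larger `text` is in `unicode_codec`
    than in `legacy_codec` (0.0 means no change)."""
    legacy_size = len(text.encode(legacy_codec))
    unicode_size = len(text.encode(unicode_codec))
    return 100.0 * (unicode_size - legacy_size) / legacy_size


if __name__ == "__main__":
    greek = "\u03b1\u03b2\u03b3\u03b4\u03b5"   # five Greek letters
    japanese = "\u65e5\u672c\u8a9e"            # three kanji
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        print(codec,
              overhead_percent("hello", "ascii", codec),
              overhead_percent(greek, "iso8859_7", codec),
              overhead_percent(japanese, "shift_jis", codec))
```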

-

-~~Given the penalty for US and European documents caused by UCS-2, UTF-16, and~~

-~~UCS-4, it seems unlikely that these encodings have a potential for wide-scale~~

-~~use. The Microsoft Win32 API has supported the UCS-2 encoding since 1995 (at~~

-~~least), yet this encoding has not been widely adopted for documents - SJIS~~

-~~remains prevalent in Japan.~~

-

-~~UTF-8 on the other hand has the potential for wide-scale use, since it~~

-~~doesn't penalize US and European users, and since many text processing~~

-~~programs don't need to be changed for UTF-8 support.~~

-

-~~In the following, we will describe how to change your Linux system so~~

-~~it uses UTF-8 as text encoding.~~

-

-~~!Footnotes for C/C++ developers~~

-

-~~The Microsoft Win32 approach makes it easy for developers to produce~~

-~~Unicode versions of their programs: You "#define UNICODE" at the top~~

-~~of your program and then change many occurrences of `char' to~~

-~~`TCHAR', until your program compiles without warnings. The problem~~

-~~with it is that you end up with two versions of your program: one which~~

-~~understands UCS-2 text but no 8-bit encodings, and one which understands~~

-~~only old 8-bit encodings.~~

-

-~~Moreover, there is an endianness issue with UCS-2 and UCS-4. The IANA~~

-~~character set registry http://www.iana.org/assignments/character-sets~~

-~~says about ISO-10646-UCS-2: "this needs to specify network byte order: the~~

-~~standard does not specify". Network byte order is big endian. And RFC 2152~~

-~~is even clearer: "ISO/IEC 10646-1:1993(E) specifies that when characters the~~

-~~UCS-2 form are serialized as octets, that the most significant octet appear~~

-~~first."~~

-~~Microsoft, by contrast, recommends in its C/C++ development tools~~

-~~using machine-dependent endianness (i.e. little endian on ix86 processors)~~

-~~and either a byte-order mark at the beginning of the document, or some~~

-~~statistical heuristics(!).~~
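The endianness issue can be made concrete with a small sketch. It uses Python's `utf-16-be` and `utf-16-le` codecs as stand-ins for the two UCS-2 byte orders (the two agree for the first 65536 characters); the function names are hypothetical and not taken from any of the tools mentioned above.

```python
import codecs


def serialize_ucs2(ch, byteorder):
    """Serialize one character from the first 65536 as two octets."""
    if ord(ch) > 0xFFFF:
        raise ValueError("character is outside the UCS-2 range")
    if byteorder not in ("big", "little"):
        raise ValueError("byteorder must be 'big' or 'little'")
    codec = "utf-16-be" if byteorder == "big" else "utf-16-le"
    return ch.encode(codec)


def decode_with_bom(data):
    """Decode 16-bit text, using a leading byte-order mark to pick the byte order."""
    return data.decode("utf-16")


if __name__ == "__main__":
    print(serialize_ucs2("A", "big"))     # most significant octet first
    print(serialize_ucs2("A", "little"))  # least significant octet first
    print(decode_with_bom(codecs.BOM_UTF16_BE + serialize_ucs2("A", "big")))
```

Without a byte-order mark the two serializations of the same text cannot be told apart reliably, which is why a receiver has to agree on the byte order in advance or fall back to guessing.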

-

-~~The UTF-8 approach on the other hand keeps `char*' as the standard C~~

-~~string type. As a result, your program will handle US ASCII text,~~

-~~independently of any environment variables, and will handle both~~

-~~ISO-8859-1 and UTF-8 encoded text provided the LANG environment variable~~

-~~is set accordingly.~~

-

-~~!!1.3 Related resources~~

-

-~~Markus Kuhn's very up-to-date resource list:~~

-

-* http://www.cl.cam.ac.uk/~mgk25/unicode.html

-* http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html

-

-~~Roman Czyborra's overview of Unicode, UTF-8 and UTF-8 aware programs:~~

-* http://czyborra.com/utf/#UTF-8

-

-~~Some example UTF-8 files:~~

-

-****In Markus Kuhn's ucs-fonts package:

-~~quickbrown.txt,~~

-~~UTF-8-test.txt,~~

-~~UTF-8-demo.txt.~~

-~~****~~

-

-* http://www.columbia.edu/kermit/utf8.html

-* ftp://ftp.cs.su.oz.au/gary/x-utf8.html

-* ftp://ftp.nid.ru/pub/os/unix/misc/trans111.tar.gz (The file iso10646 in the Kosta Kostis' trans-1.1.1 package )

-* ftp://ftp.dante.de/pub/tex/info/lwc/apc/utf8.html

-* http://www.cogsci.ed.ac.uk/~richard/unicode-sample.html

-

-~~----~~

-

-~~!!2. Display setup~~

-

-~~We assume you have already adapted your Linux console and X11 configuration~~

-~~to your keyboard and locale. This is explained in the Danish/International~~

-~~HOWTO, and in the other national HOWTOs: Finnish, French, German, Italian,~~

-~~Polish, Slovenian, Spanish, Cyrillic, Hebrew, Chinese, Thai, Esperanto. But~~

-~~please do not follow the advice given in the Thai HOWTO, to pretend you~~

-~~were using ISO-8859-1 characters (U0000..U00FF) when what you are typing~~

-~~are actually Thai characters (U0E01..U0E5B). Doing so will only cause~~

-~~problems when you switch to Unicode.~~

-

-~~!!2.1 Linux console~~

-

-~~I'm not talking much about the Linux console here, because on those machines~~

-~~on which I don't have xdm running, I use it only to type my login name,~~

-~~my password, and "xinit".~~

-

-~~Anyway, the kbd-0.99 package~~

-~~ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz~~

-~~and a heavily extended version, the console-tools-0.2.3 package~~

-~~ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-tools-0.2.3.tar.gz~~

-~~contain, in the kbd-0.99/src/ (or console-tools-0.2.3/screenfonttools/)~~

-~~directory, two programs: `unicode_start' and `unicode_stop'. When you call~~

-~~`unicode_start', the console's screen output is interpreted as UTF-8. Also,~~

-~~the keyboard is put into Unicode mode (see "man kbd_mode"). In this mode,~~

-~~Unicode characters typed as Alt-x1 ... Alt-xn (where x1,...,xn are digits on~~

-~~the numeric keypad) will be emitted in UTF-8. If your keyboard or, more~~

-~~precisely, your normal keymap has non-ASCII letter keys (like the German~~

-~~Umlaute) which you would like to be !CapsLockable, you need to apply the kernel~~

-~~patch~~

-~~linux-2.2.9-keyboard.diff~~

-or

-~~linux-2.3.12-keyboard.diff.~~

-

-~~You will want to display characters from different scripts on the same~~

-~~screen. For this, you need a Unicode console font. The~~

-~~ftp://sunsite.unc.edu/pub/Linux/system/keyboards/kbd-0.99.tar.gz~~

-~~and~~

-~~ftp://sunsite.unc.edu/pub/Linux/system/keyboards/console-data-1999.08.29.tar.gz~~

-~~packages contain a font (!LatArCyrHeb-{08,14,16,19}.psf) which~~

-~~covers Latin, Cyrillic, Hebrew, Arabic scripts. It covers ISO 8859 parts~~

-~~1,2,3,4,5,6,8,9,10 all at once. To install it, copy it to~~

-~~/usr/lib/kbd/consolefonts/ and execute~~

-~~"/usr/bin/setfont /usr/lib/kbd/consolefonts/!LatArCyrHeb-14.psf".~~

-

-~~A more flexible approach is given by Dmitry Yu. Bolkhovityanov~~

-~~<D.Yu.Bolkhovityanov@inp.nsk.su>~~

-in

-~~http://www.inp.nsk.su/~bolkhov/files/fonts/univga/index.html~~

-~~and~~

-~~http://www.inp.nsk.su/~bolkhov/files/fonts/univga/uni-vga.tgz.~~

-~~To work around the constraint that a VGA font can only cover 512 characters simultaneously,~~

-~~he provides a rich Unicode font (2279 characters, covering Latin, Greek, Cyrillic, Hebrew,~~

-~~Armenian, IPA, math symbols, arrows, and more) in the typical 8x16 size and a script~~

-~~which lets you extract any 512 characters as a console font.~~

-

-~~If you want cut&paste to work with UTF-8 consoles, you need the patch~~

-~~linux-2.3.12-console.diff~~

-~~from Edmund Thomas Grimley Evans and Stanislav Voronyi.~~

-

-~~In April 2000, Edmund Thomas Grimley Evans~~

-~~<edmundo@rano.org>~~

-~~implemented a UTF-8 console terminal emulator. It uses Unicode fonts~~

-~~and relies on the Linux frame buffer device.~~

-

-~~!!2.2 X11 Foreign fonts~~

-

-~~Don't hesitate to install Cyrillic, Chinese, Japanese etc. fonts. Even~~

-~~if they are not Unicode fonts, they will help in displaying Unicode~~

-~~documents: at least Netscape Communicator 4 and Java will make use of~~

-~~foreign fonts when available.~~

-

-~~The following programs are useful when installing fonts:~~

-

-****"mkfontdir directory"

-~~prepares a font directory for use by the X server, needs to be executed~~

-~~after installing fonts in a directory.~~

-~~****~~

-

-****"xset -q | sed -e '1,/^Font Path:/d' | sed -e '2,$d' -e 's/^ //'"

-~~displays the X server's current font path.~~

-~~****~~

-

-****"xset fp+ directory"

-~~adds a directory to the X server's current font path.~~

-~~To add a directory permanently, add a "!FontPath" line to your~~

-~~/etc/XF86Config file, in section "Files".~~

-~~****~~

-

-****"xset fp rehash"

-~~needs to be executed after calling mkfontdir on a directory that is~~

-~~already contained in the X server's current font path.~~

-~~****~~

-

-****"xfontsel"

-~~allows you to browse the installed fonts by selecting various font~~

-~~properties.~~

-~~****~~

-

-****"xlsfonts -fn fontpattern"

-~~lists all fonts matching a font pattern. Also displays various font~~

-~~properties. In particular, "xlsfonts -ll -fn font" lists the font~~

-~~properties CHARSET_REGISTRY and CHARSET_ENCODING, which together~~

-~~determine the font's encoding.~~

-~~****~~

-

-****"xfd -fn font"

-~~displays a font page by page.~~

-~~****~~

-

-~~The following fonts are freely available (not a complete list):~~

-

-****The ones contained in XFree86, sometimes packaged in separate packages.

-~~For example, SuSE has only normal 75dpi fonts in the base `xf86' package.~~

-~~The other fonts are in the packages `xfnt100', `xfntbig', `xfntcyr',~~

-~~`xfntscl'.~~

-~~****~~

-

-****The Emacs international fonts,

-~~ftp://ftp.gnu.org/pub/gnu/intlfonts/intlfonts-1.2.tar.gz~~

-~~As already mentioned, they are useful even if you prefer XEmacs to~~

-~~GNU Emacs or don't use any Emacs at all.~~

-~~****~~

-

-~~!!2.3 X11 Unicode fonts~~

-

-~~Applications wishing to display text belonging to different scripts (like~~

-~~Cyrillic and Greek) at the same time, can do so by using different X fonts~~

-~~for the various pieces of text. This is what Netscape Communicator and Java~~

-~~do. However, this approach is more complicated, because instead of working~~

-~~with `Font' and `XFontStruct', the programmer has to deal with `XFontSet',~~

-~~and also because not all fonts in the font set need to have the same~~

-~~dimensions.~~

-

-****Markus Kuhn has assembled fixed-width 75dpi fonts with Unicode encoding

-~~covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew scripts and~~

-~~many symbols.~~

-~~They cover ISO 8859 parts 1,2,3,4,5,7,8,9,10,13,14,15,16 all at once.~~

-~~These fonts are required for running xterm in utf-8 mode. They are now~~

-~~contained in XFree86 4.0.1, therefore you need to install them manually~~

-~~only if you have an older XFree86 3.x version.~~

-~~http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz.~~

-~~****~~

-

-****Markus Kuhn has also assembled double-width fixed 75dpi fonts with Unicode

-~~encoding covering Chinese, Japanese and Korean. These fonts are contained~~

-~~in XFree86 4.0.1 as well.~~

-~~http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts-asian.tar.gz~~

-~~****~~

-

-****Roman Czyborra has assembled an 8x16 / 16x16 75dpi font with Unicode encoding

-~~covering a huge part of Unicode. Download unifont.hex.gz and hex2bdf from~~

-~~http://czyborra.com/unifont/.~~

-~~It is not fixed-width: 8 pixels wide for European characters, 16 pixels wide~~

-~~for Chinese characters. Installation instructions:~~

-

-~~$ gunzip unifont.hex.gz~~

-~~$ hex2bdf < unifont.hex > unifont.bdf~~

-~~$ bdftopcf -o unifont.pcf unifont.bdf~~

-~~$ gzip -9 unifont.pcf~~

-~~# cp unifont.pcf.gz /usr/X11R6/lib/X11/fonts/misc~~

-~~# cd /usr/X11R6/lib/X11/fonts/misc~~

-~~# mkfontdir~~

-~~# xset fp rehash~~

-

-~~****~~

-

-****Primoz Peterlin has assembled the ETL font family, covering Latin, Greek,

-~~Cyrillic, Armenian, Georgian, Hebrew scripts.~~

-~~ftp://ftp.x.org/contrib/fonts/etl-unicode.tar.gz~~

-~~Use the "bdftopcf" program in order to install it.~~

-~~****~~

-

-****Mark Leisher has assembled a proportional, 17 pixel high (12 point), font,

-~~called ClearlyU, covering Latin, Greek, Cyrillic, Armenian, Georgian, Hebrew,~~

-~~Thai, Laotian scripts.~~

-~~http://crl.nmsu.edu/~mleisher/cu.html.~~

-~~Installation instructions:~~

-

-~~$ bdftopcf -o cu12.pcf cu12.bdf~~

-~~$ gzip -9 cu12.pcf~~

-~~# cp cu12.pcf.gz /usr/X11R6/lib/X11/fonts/misc~~

-~~# cd /usr/X11R6/lib/X11/fonts/misc~~

-~~# mkfontdir~~

-~~# xset fp rehash~~

-

-~~****~~

-

-~~!!2.4 Unicode xterm~~

-

-~~xterm is part of X11R6 and XFree86, but is maintained separately by Tom~~

-~~Dickey.~~

-~~http://www.clark.net/pub/dickey/xterm/xterm.html~~

-~~Newer versions (patch level 146 and above) contain support for converting~~

-~~keystrokes to UTF-8 before sending them to the application running in the~~

-~~xterm, and for displaying Unicode characters that the application outputs~~

-~~as UTF-8 byte sequence. It also contains support for double-wide characters~~

-~~(mostly CJK ideographs) and combining characters, contributed by Robert Brady~~

-~~<robert@suse.co.uk>.~~

-

-~~To get a UTF-8 xterm running, you need to:~~

-

-****Fetch

-~~http://www.clark.net/pub/dickey/xterm/xterm.tar.gz,~~

-~~****~~

-

-****Configure it by calling "./configure --enable-wide-chars ...", then

-~~compile and install it.~~

-~~****~~

-

-****Have a Unicode fixed-width font installed. Markus Kuhn's ucs-fonts.tar.gz

-~~(see above) is made for this.~~

-~~****~~

-

-****Start "xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'".

-~~The option "-u8" turns on Unicode and UTF-8 handling. The font designated~~

-~~by the long "-fn" option is Markus Kuhn's Unicode font. Without this option,~~

-~~the default font called "fixed" would be used, an ISO-8859-1 6x13 font.~~

-~~****~~

-

-****Take a look at the sample files contained in Markus Kuhn's ucs-fonts

-~~package:~~

-

-~~$ cd .../ucs-fonts~~

-~~$ cat quickbrown.txt~~

-~~$ cat UTF-8-demo.txt~~



-~~You should be seeing (among others) Greek and Russian characters.~~

-~~****~~

-

-****To make xterm come up with UTF-8 handling each time it is started,

-~~add the lines~~

-

-~~xterm*utf8: 1~~

-~~xterm*VT100*font: -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1~~

-~~xterm*VT100*wideFont: -misc-fixed-medium-r-normal-ja-13-125-75-75-c-120-iso10646-1~~

-~~xterm*VT100*boldFont: -misc-fixed-bold-r-semicondensed--13-120-75-75-c-60-iso10646-1~~

-

-~~to your $HOME/.Xdefaults (for yourself only).~~

-~~For CJK text processing with double-width characters, the following~~

-~~settings are probably better:~~

-

-~~xterm*VT100*font: -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1~~

-~~xterm*VT100*wideFont: -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1~~

-

-~~I don't recommend changing~~

-~~the system-wide /usr/X11R6/lib/X11/app-defaults/XTerm, because then your~~

-~~changes will be erased next time you upgrade to a new XFree86 version.~~

-~~****~~

-

-~~!!2.5 !TrueType fonts~~

-

-~~The fonts mentioned above are fixed size and not scalable. For some~~

-~~applications, especially printing, high resolution fonts are necessary,~~

-~~though. The most important type of scalable, high resolution fonts are~~

-~~!TrueType fonts.~~

-~~They are currently supported by~~

-

-****XFree86 4.0.1; you need to add the line

-

-~~Load "freetype"~~

-

-or

-

-~~Load "xtt"~~

-

-~~to the "Module" section of your XF86Config file.~~

-~~****~~

-

-****The display engines of other operating systems.

-~~****~~

-

-****The yudit editor, see below, and its printing engine.

-~~****~~

-

-~~Some no-cost !TrueType fonts with large Unicode coverage are~~

-

-~~; __Bitstream Cyberbit__:~~

-

-~~Covers Roman, Cyrillic, Greek, Hebrew, Arabic, combining diacritical marks,~~

-~~Chinese, Korean, Japanese, and more.~~

-

-~~Downloadable from~~

-~~ftp://ftp.netscape.com/pub/communicator/extras/fonts/windows/Cyberbit.ZIP.~~

-~~It is free for non-commercial purposes.~~

-

-~~; __Microsoft Arial__:~~

-

-~~Covers Roman, Cyrillic, Greek, Hebrew, Arabic, some combining diacritical~~

-~~marks, Vietnamese.~~

-

-~~Downloadable; look on a search engine for ftp-able files called~~

-~~arial.ttf, ariali.ttf, arialbd.ttf,~~

-~~arialbi.ttf.~~

-

-~~; __Lucida Sans Unicode__:~~

-

-~~Covers Roman, Cyrillic, Greek, Hebrew, combining diacritical marks.~~

-

-~~Download: contained in IBM's JDK 1.3.0 for Linux, at~~

-~~http://www.ibm.com/java/jdk/linux130/,~~

-~~or directly downloadable as !LucidaSansRegular.ttf and~~

-~~!LucidaSansOblique.ttf from~~

-~~ftp://ftp.maths.tcd.ie/Linux/opt/IBMJava2-13/jre/lib/fonts/.~~

-

-~~; __Arphic__:~~

-

-~~Cover Chinese (both traditional and simplified).~~

-

-~~Download: at~~

-~~ftp://ftp.gnu.org/non-gnu/chinese-fonts-truetype/.~~

-~~These fonts are truly free.~~

-

-~~Download locations for these and other !TrueType fonts can be found at~~

-~~Christoph Singer's list of freely downloadable Unicode !TrueType fonts~~

-~~http://www.ccss.de/slovo/unifonts.htm.~~

-

-~~!TrueType fonts are installed similarly to fixed size fonts, except that~~

-~~they go in a separate directory, and that ttmkfdir must be~~

-~~called before mkfontdir:~~

-

-~~# mkdir -p /usr/X11R6/lib/X11/fonts/truetype~~

-~~# cp /somewhere/Cyberbit.ttf ... /usr/X11R6/lib/X11/fonts/truetype~~

-~~# cd /usr/X11R6/lib/X11/fonts/truetype~~

-~~# ttmkfdir > fonts.scale~~

-~~# mkfontdir~~

-~~# xset fp rehash~~

-

-~~!TrueType fonts can be converted to low resolution, non-scalable X11 fonts by~~

-~~use of Mark Leisher's ttf2bdf utility~~

-~~ftp://crl.nmsu.edu/CLR/multiling/General/ttf2bdf-2.8-LINUX.tar.gz.~~

-~~For example, to generate a proportional Unicode font for use with cooledit:~~

-

-~~# cd /usr/X11R6/lib/X11/fonts/local~~

-~~# ttf2bdf ../truetype/Cyberbit.ttf > cyberbit.bdf~~

-~~# bdftopcf -o cyberbit.pcf cyberbit.bdf~~

-~~# gzip -9 cyberbit.pcf~~

-~~# mkfontdir~~

-~~# xset fp rehash~~

-

-~~More information about !TrueType fonts can be found in the Linux !TrueType HOWTO~~

-~~http://www.moisty.org/~brion/linux/!TrueType-HOWTO.html.~~

-

-~~!!2.6 Miscellaneous~~

-

-~~A small program which tests whether a Linux console or xterm is in UTF-8 mode~~

-~~can be found in the~~

-~~ftp://sunsite.unc.edu/pub/Linux/system/keyboards/x-lt-1.24.tar.gz~~

-~~package by Ricardas Cepas, files testUTF-8.c and testUTF8.c. Most applications~~

-~~should not use this, however: they should look at the environment variables,~~

-~~see section "Locale environment variables".~~

-

-~~----~~

-

-~~!!3. Locale setup~~

-

-~~!!3.1 Files & the kernel~~

-

-~~You can now already use any Unicode characters in file names. No kernel~~

-~~or file utilities need modifications. This is because file names in the~~

-~~kernel can be anything not containing a null byte, and '/' is used to~~

-~~delimit subdirectories. When encoded using UTF-8, non-ASCII characters~~

-~~will never be encoded using null bytes or slashes. All that happens is~~

-~~that file and directory names occupy more bytes than they contain characters.~~

-~~For example, a filename consisting of five greek characters will appear~~

-~~to the kernel as a 10-byte filename. The kernel does not know (and does~~

-~~not need to know) that these bytes are displayed as greek.~~
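The claim about file names can be checked directly. The sketch below encodes a five-letter Greek name and confirms that the result is ten bytes long and contains neither a null byte nor a slash; the function names are made up for this illustration.

```python
def utf8_filename_bytes(name):
    """Return the bytes the kernel would see for a UTF-8 encoded file name."""
    return name.encode("utf-8")


def is_safe_component(raw):
    """A path component may contain any byte except NUL and '/'."""
    return b"\x00" not in raw and b"/" not in raw


if __name__ == "__main__":
    raw = utf8_filename_bytes("\u03b1\u03b2\u03b3\u03b4\u03b5")  # five Greek letters
    print(len(raw), is_safe_component(raw))
    # Every byte of a non-ASCII character in UTF-8 has its high bit set,
    # so it can never collide with NUL (0x00) or '/' (0x2F).
    print(all(byte >= 0x80 for byte in raw))
```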

-

-~~This is the general theory, as long as your files stay inside Linux. On~~

-~~filesystems which are used from other operating systems, you have mount~~

-~~options to control conversion of filenames to/from UTF-8:~~

-

-****The "vfat" filesystem has a mount option "utf8".

-~~See~~

-~~file:/usr/src/linux/Documentation/filesystems/vfat.txt.~~

-~~When you give an "iocharset" mount option different from the default~~

-~~(which is "iso8859-1"), the results with and without "utf8" are not~~

-~~consistent. Therefore I don't recommend the "iocharset" mount option.~~

-~~****~~

-

-****The "msdos" and "umsdos" filesystems have the same mount option, but it

-~~appears to have no effect.~~

-~~****~~

-

-****The "iso9660" filesystem has a mount option "utf8".

-~~See~~

-~~file:/usr/src/linux/Documentation/filesystems/isofs.txt.~~

-~~****~~

-

-****Since Linux 2.2.x kernels, the "ntfs" filesystem has a mount option

-~~"utf8". See~~

-~~file:/usr/src/linux/Documentation/filesystems/ntfs.txt.~~

-~~****~~

-

-~~The other filesystems (nfs, smbfs, ncpfs, hpfs, etc.) don't convert~~

-~~filenames; therefore they support Unicode file names in UTF-8 encoding only~~

-~~if the other operating system supports them.~~

-~~Recall that to enable a mount option for all future remounts, you add it to~~

-~~the fourth column of the corresponding /etc/fstab line.~~

-

-~~!!3.2 Upgrading the C library~~

-

-~~glibc-2.2 supports multibyte locales, in particular UTF-8 locales. But~~

-~~glibc-2.1.x and earlier C libraries do not support it. Therefore you need~~

-~~to upgrade to glibc-2.2. Upgrading from glibc-2.1.x is riskless, because~~

-~~glibc-2.2 is binary compatible with glibc-2.1.x (at least on i386 platforms,~~

-~~and except for IPv6). Nevertheless, I recommend having a bootable rescue~~

-~~disk handy in case something goes wrong.~~

-

-~~Prepare the kernel sources. You must have them unpacked and configured.~~

-~~/usr/src/linux/include/linux/autoconf.h must exist. Building the kernel~~

-~~is not needed.~~

-

-~~Retrieve the glibc sources~~

-~~ftp://ftp.gnu.org/pub/gnu/glibc/,~~

-~~su to root, then unpack, build and install it:~~

-

-~~# unset LD_PRELOAD~~

-~~# unset LD_LIBRARY_PATH~~

-~~# tar xvfz glibc-2.2.tar.gz~~

-~~# tar xvfz glibc-linuxthreads-2.2.tar.gz -C glibc-2.2~~

-~~# mkdir glibc-2.2-build~~

-~~# cd glibc-2.2-build~~

-~~# ../glibc-2.2/configure --prefix=/usr --with-headers=/usr/src/linux/include --enable-add-ons~~

-~~# make~~

-~~# make check~~

-~~# make info~~

-~~# LC_ALL=C make install~~

-~~# make localedata/install-locales~~

-

-~~Upgrading from glibc versions earlier than 2.1.x cannot be done this way;~~

-~~consider first installing a Linux distribution based on glibc-2.1.x, and~~

-~~then upgrading to glibc-2.2 as described above.~~

-

-~~Note that if -- for any reason -- you want to rebuild GCC after having~~

-~~installed glibc-2.2, you need to first apply this patch~~

-~~gcc-glibc-2.2-compat.diff~~

-~~to the GCC sources.~~

-

-~~!!3.3 General data conversion~~

-

-~~You will need a program to convert your locally (probably ISO-8859-1) encoded~~

-~~texts to UTF-8. (The alternative would be to keep using texts in different~~

-~~encodings on the same machine; this is not fun in the long run.)~~

-~~One such program is `iconv', which comes with glibc-2.2. Simply use~~

-

-~~$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file~~
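For readers who want to see what such a conversion does to the bytes, here is a minimal Python sketch of the same idea. It is not a replacement for iconv, and the function name `convert_encoding` is an assumption made for this example; it decodes the input with the source character set and re-encodes it with the target one.

```python
def convert_encoding(data, from_code="ISO-8859-1", to_code="UTF-8"):
    """Re-encode raw bytes from one character set to another.

    Raises UnicodeDecodeError or UnicodeEncodeError if the data does not
    fit the source or the target character set.
    """
    return data.decode(from_code).encode(to_code)


if __name__ == "__main__":
    latin1 = b"Gr\xfc\xdfe"                       # German greeting in ISO-8859-1
    utf8 = convert_encoding(latin1)
    print(utf8)                                   # two bytes per non-ASCII letter
    print(convert_encoding(utf8, "UTF-8", "ISO-8859-1") == latin1)
```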

-

-~~Here are two handy shell scripts, called "i2u"~~

-~~i2u.sh~~

-~~(for ISO to UTF conversion) and "u2i"~~

-~~u2i.sh~~

-~~(for UTF to ISO conversion).~~

-~~Adapt according to your current 8-bit character set.~~

-

-~~If you don't have glibc-2.2 and iconv installed, you can use GNU recode 3.6~~

-~~instead.~~

-~~"i2u"~~

-~~i2u_recode.sh is~~

-~~"recode ISO-8859-1..UTF-8", and~~

-~~"u2i"~~

-~~u2i_recode.sh is~~

-~~"recode UTF-8..ISO-8859-1".~~

-~~ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz~~

-

-~~Or you can also use CLISP instead. Here are~~

-~~"i2u"~~

-~~i2u.lisp and~~

-~~"u2i"~~

-~~u2i.lisp~~

-~~written in Lisp. Note: You need a CLISP version from July 1999 or newer.~~

-~~ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz.~~

-

-~~Other data conversion programs, less powerful than GNU recode, are~~

-~~`trans'~~

-~~ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/trans113.tar.gz,~~

-~~`tcs' from the Plan9 operating system~~

-~~ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/tcs.tar.gz,~~

-~~and~~

-~~`utrans'/`uhtrans'/`hutrans'~~

-~~ftp://ftp.cdrom.com/pub/FreeBSD/distfiles/i18ntools-1.0.tar.gz~~

-~~by G. Adam Stanislav~~

-~~<adam@whizkidtech.net>.~~

-

-~~For the repeated conversion of files to UTF-8 from different character sets,~~

-~~a semi-automatic tool can be used:~~

-~~to-utf8~~

-~~presents the non-ASCII parts of a file to the user, lets them decide about the~~

-~~file's original character set, and then converts the file to UTF-8.~~

-

-~~!!3.4 Locale environment variables~~

-

-~~You may have the following environment variables set, containing locale~~

-~~names:~~

-

-~~; __LANGUAGE__:~~

-

-~~override for LC_MESSAGES, used by GNU gettext only~~

-~~; __LC_ALL__:~~

-

-~~override for all other LC_* variables~~

-~~; __LC_CTYPE, LC_MESSAGES, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_TIME__:~~

-

-~~individual variables for:~~

-~~character types and encoding,~~

-~~natural language messages,~~

-~~sorting rules,~~

-~~number formatting,~~

-~~money amount formatting,~~

-~~date and time display~~

-~~; __LANG__:~~

-

-~~default value for all LC_* variables~~

-

-~~(See `man 7 locale' for a detailed description.)~~
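The override order described above can be sketched as a small lookup. This only illustrates the precedence for a single LC_* category (LC_ALL first, then the category's own variable, then LANG); the function name is hypothetical, the fallback to "C" when nothing is set is an assumption based on common C library behaviour, and LANGUAGE is left out because it only affects message lookup in GNU gettext.

```python
import os


def effective_locale(category, env=None):
    """Return the locale name that applies to one LC_* category."""
    if env is None:
        env = os.environ
    # Empty values are treated the same as unset variables.
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    sample = {"LANG": "de_DE", "LC_CTYPE": "de_DE.UTF-8"}
    print(effective_locale("LC_CTYPE", sample))
    print(effective_locale("LC_TIME", sample))
```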

-

-~~Each of the LC_* and LANG variables can contain a locale name of the~~

-~~following form:~~

-

-~~language[[_territory][[.codeset][[@modifier]~~

-

-~~where language is an~~

-~~ISO 639~~

-~~language code (lower case), territory is an~~

-~~ISO 3166~~

-~~country code (upper case), codeset denotes a character set, and~~

-~~modifier stands for other particular attributes (for example indicating~~

-~~a particular language dialect, or a nonstandard orthography).~~
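A locale name of this form (a language, optionally followed by an underscore and territory, a dot and codeset, and an at sign and modifier) can be split with a regular expression. The pattern below is a simplified sketch written for this illustration, not the parser used by glibc, and the function name is hypothetical.

```python
import re

# language, then optional _territory, .codeset and @modifier, in that order.
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale_name(name):
    """Split a locale name into its parts; missing parts are None."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError("not a valid locale name: %r" % (name,))
    return match.groupdict()


if __name__ == "__main__":
    print(parse_locale_name("de_DE.UTF-8"))
    print(parse_locale_name("de_DE@euro"))
```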

-

-~~LANGUAGE can contain several locale names, separated by colons.~~

-

-~~In order to tell your system and all applications that you are using UTF-8,~~

-~~you need to add a codeset suffix of UTF-8 to your locale names. For example,~~

-~~if you were using~~

-

-~~LC_CTYPE=de_DE~~

-

-~~you would change this to~~

-

-~~LC_CTYPE=de_DE.UTF-8~~

-

-~~You do ''not'' need to change your LANGUAGE environment variable.~~

-~~GNU gettext in glibc-2.2 has the ability to convert translations to the right~~

-~~encoding.~~

-

-~~!!3.5 Creating the locale support files~~

-

-~~Use localedef to create the support files for each UTF-8 locale~~

-~~you intend to use, for example:~~

-

-~~$ localedef -v -c -i de_DE -f UTF-8 de_DE.UTF-8~~

-

-~~You typically don't need to create locales named "de" or "fr" without~~

-~~country suffix, because these locales are normally only used by the~~

-~~LANGUAGE variable and not by the LC_* variables, and LANGUAGE is only~~

-~~used as an override for LC_MESSAGES.~~

-

-~~----~~

-

-~~!!4. Specific applications~~

-

-~~!!4.1 Shells~~

-

-~~!bash~~

-

-~~By default, GNU bash assumes that every character is one byte long and one~~

-~~column wide. A patch for bash 2.04, by Marcin 'Qrczak' Kowalczyk and~~

-~~Ricardas Cepas, teaches bash about multibyte characters in UTF-8 encoding.~~

-~~bash-2.04-diff~~

-

-~~Double-width characters, combining characters and bidi are not supported by~~

-~~this patch. It seems a complete redesign of the readline redisplay engine is~~

-~~needed.~~
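
To see why the "one byte, one column" assumption breaks down, the following Python sketch compares byte count, character count and an approximate column width for a string. It is a rough illustration, not how bash or readline work: `display_width` is a hypothetical helper that counts combining characters as zero columns and East Asian wide or fullwidth characters as two.

```python
import unicodedata


def display_width(text):
    """Approximate the number of terminal columns needed for text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # double-width (for example CJK) character
        else:
            width += 1
    return width


if __name__ == "__main__":
    for sample in ("abc", "日本語", "e\u0301"):
        print(repr(sample), len(sample.encode("utf-8")), len(sample),
              display_width(sample))
```

For each sample the three numbers printed are bytes, characters and columns; a line editor that equates them will misplace the cursor as soon as they differ.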

-

-~~!!4.2 Networking~~

-

-~~!telnet~~

-

-~~In some installations, telnet is not 8-bit clean by default.~~

-~~In order to be able to send Unicode keystrokes to the remote host, you need to~~

-~~set telnet into "outbinary" mode.~~

-~~There are two ways to do this:~~

-

-~~$ telnet -L <host>~~

-

-~~and~~

-

-~~$ telnet~~

-~~telnet> set outbinary~~

-~~telnet> open <host>~~

-

-~~!kermit~~

-

-~~The communications program C-Kermit~~

-~~http://www.columbia.edu/kermit/ckermit.html,~~

-~~(an interactive tool for connection setup, telnet, file transfer,~~

-~~with support for TCP/IP and serial lines),~~

-~~in versions 7.0 or newer, understands the file and transfer encodings~~

-~~UTF-8 and UCS-2, and understands the terminal encoding UTF-8, and converts~~

-~~between these encodings and many others. Documentation of these features~~

-~~can be found in~~

-~~http://www.columbia.edu/kermit/ckermit2.html#x6.6.~~

-

-~~!!4.3 Browsers~~

-

-~~!Netscape~~

-

-~~Netscape 4.05 or newer can display HTML documents in UTF-8 encoding. All a~~

-~~document needs is the following line between the~~

-~~<head> and </head> tags:~~

-

-~~<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">~~

-

-~~Netscape 4.05 or newer can also display HTML and text files in UCS-2~~

-~~encoding with byte-order mark.~~

-

-~~http://www.netscape.com/computing/download/~~

-

-~~!Mozilla~~

-

-~~Mozilla milestone M16 has much better internationalization than Netscape 4.~~

-~~It can display HTML documents in UTF-8 encoding with support for more~~

-~~languages. Alas, there is a cosmetic problem with CJK fonts: some glyphs~~

-~~can be bigger than the line's height, thus overlapping the previous or next~~

-~~line.~~

-

-~~http://www.mozilla.org/~~

-

-~~!Amaya~~

-

-~~Amaya 4.2.1~~

-(

-~~http://www.w3.org/Amaya/,~~

-~~http://www.w3.org/Amaya/User/!SourceDist)~~

-~~now has limited handling of UTF-8 encoded HTML pages. It~~

-~~recognizes the encoding, but it displays only ISO-8859-1 and symbol~~

-~~characters; it only ever accesses the fonts~~

-

-~~-adobe-times-*-iso8859-1~~

-~~-adobe-helvetica-*-iso8859-1~~

-~~-adobe-new century schoolbook-*-iso8859-1~~

-~~-adobe-courier-*-iso8859-1~~

-~~-adobe-symbol-*-adobe-fontspecific~~

-

-~~Amaya is in fact an HTML editor, not only a browser. Amaya's strengths among~~

-~~the browsers are its speed, given enough memory, and its rendering~~

-~~of mathematical formulas (MathML support).~~

-

-~~!lynx~~

-

-~~lynx-2.8 has an options screen (key 'O') which lets you set the display~~

-~~character set. When running in an xterm or Linux console in UTF-8 mode,~~

-~~set this to "UNICODE UTF-8". Note that for this setting to take effect~~

-~~in the current browser session, you have to confirm on the "Accept Changes"~~

-~~field, and for this setting to take effect in future browser sessions, you~~

-~~have to enable the "Save options to disk" field and then confirm it on~~

-~~the "Accept Changes" field.~~

-

-~~Now, again, all a document needs is the following line between the~~

-~~<head> and </head> tags:~~

-

-~~<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">~~

-

-~~When you are viewing text files in UTF-8 encoding, you also need to~~

-~~pass the command-line option "-assume_local_charset=UTF-8" (affects only~~

-~~file:/... URLs) or "-assume_charset=UTF-8" (affects all URLs).~~

-~~In lynx-2.8.2 you can alternatively, in the options screen (key 'O'),~~

-~~change the assumed document character set to "utf-8".~~

-

-~~There is also an option in the options screen, to set the "preferred document~~

-~~character set". But it has no effect, at least with file:/... URLs~~

-~~and with http://... URLs served by apache-1.3..~~

-

-~~There is a spacing and line-breaking problem, however. (Look at the~~

-~~Russian section of x-utf8.html, or at utf-8-demo.txt.)~~

-

-~~Also, in lynx-2.8.2, configured with --enable-prettysrc, the nice colour~~

-~~scheme does not work correctly any more when the display character set~~

-~~has been set to "UNICODE UTF-8". This is fixed by a simple patch~~

-~~lynx282.diff.~~

-

-~~The Lynx developers say: "For any serious use of UTF-8 screen output with~~

-~~lynx, compiling with slang lib and -DSLANG_MBCS_HACK is still recommended."~~

-

-~~Latest stable release:~~

-~~ftp://ftp.gnu.org/pub/gnu/lynx/lynx-2.8.2.tar.gz~~

-

-~~http://lynx.isc.org/~~

-

-~~General home page:~~

-~~http://lynx.browser.org/~~

-

-~~http://www.slcc.edu/lynx/~~

-

-~~Newer development snapshots:~~

-~~http://lynx.isc.org/current/,~~

-~~ftp://lynx.isc.org/current/~~

-

-~~!w3m~~

-

-~~w3m by Akinori Ito~~

-~~http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/~~

-~~is a text mode browser for HTML pages and plain-text files.~~

-~~Its layout of HTML tables, enumerations etc. is much prettier than that of lynx.~~

-~~w3m can also be used as a high quality HTML to plain text converter.~~

-

-~~w3m .1.10 has command line options for the three major Japanese encodings, but~~

-~~can also be used for UTF-8 encoded files. Without command line options,~~

-~~you often have to press Ctrl-L to refresh the display, and line breaking~~

-~~in Cyrillic and CJK paragraphs is not good.~~

-

-~~To fix this, Hironori Sakamoto has a patch~~

-~~http://www2u.biglobe.ne.jp/~hsaka/w3m/~~

-~~which adds UTF-8 as display encoding.~~

-

-~~!Test pages~~

-

-~~Some test pages for browsers can be found at the pages of Alan Wood~~

-~~http://www.hclrss.demon.co.uk/unicode/#links~~

-~~and James Kass~~

-~~http://home.att.net/~jameskass/.~~

-

-~~!!4.4 Editors~~

-

-~~!yudit~~

-

-~~yudit by Gáspár Sinai~~

-~~http://www.yudit.org/~~

-~~is a first-class Unicode text editor for the X Window System.~~

-~~It supports simultaneous processing of many languages, input methods,~~

-~~and conversions for local character standards.~~

-~~It has facilities for entering text in all languages with only~~

-~~an English keyboard, using keyboard configuration maps.~~

-

-~~!yudit-1.5~~

-

-~~It can be compiled in three versions: Xlib GUI, KDE GUI, or Motif GUI.~~

-

-~~Customization is very easy. Typically you will first customize your font.~~

-~~From the font menu I chose "Unicode". Then, since the command~~

-~~"xlsfonts '*-*-iso10646-1'" still showed some ambiguity, I chose a font~~

-~~size of 13 (to match Markus Kuhn's 13-pixel fixed font).~~

-

-~~Next, you will customize your input method. The input methods "Straight",~~

-~~"Unicode" and "SGML" are most remarkable. For details about the other~~

-~~built-in input methods, look in /usr/local/share/yudit/data/.~~

-

-~~To change the default for the next session, edit your $HOME/.yuditrc~~

-~~file.~~

-

-~~The general editor functionality is limited to editing, cut&paste~~

-~~and search&replace. No undo.~~

-

-~~!yudit-2.1~~

-

-~~This version is less easy to learn, because it comes with a homebrewed~~

-~~GUI and no easily accessible help. But it has an undo functionality and~~

-~~should therefore be more usable than version 1.5.~~

-

-~~!Fonts for yudit~~

-

-~~yudit can display text using a !TrueType font; see section "!TrueType fonts"~~

-~~above. The Bitstream Cyberbit gives good results. For yudit to find the~~

-~~font, symlink it to /usr/local/share/yudit/data/cyberbit.ttf.~~

-

-~~!vim~~

-

-~~vim (as of version 6.0r) has good support for UTF-8: when started in an~~

-~~UTF-8 locale, it assumes UTF-8 encoding for the console and the text files~~

-~~being edited. It also supports double-width (CJK) characters and~~

-~~combining characters, and therefore fits perfectly into a UTF-8 enabled~~

-~~xterm.~~

-

-~~Installation: Download from~~

-~~http://www.vim.org/.~~

-~~After unpacking the four parts, call ./configure with~~

-~~--with-features=big --enable-multibyte arguments~~

-~~(or edit src/Makefile to include the --with-features=big and~~

-~~--enable-multibyte options). This will turn on the feature~~

-~~FEAT_MBYTE. Then do "make" and "make install".~~

-

-~~vim can be used to edit files in other encodings. For example, to edit~~

-~~a BIG5 encoded file: :e ++cc=BIG5 filename. All encoding names~~

-~~supported by iconv are accepted. Plus: vim automatically distinguishes~~

-~~UTF-8 and ISO-8859-1 files without needing any command line option.~~
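
One simple way to tell UTF-8 from ISO-8859-1 is to check whether the bytes form valid UTF-8 and fall back to ISO-8859-1 otherwise. The Python sketch below shows that idea only; it is an assumption for illustration, not a description of vim's actual detection code, and `guess_encoding` is a hypothetical name.

```python
def guess_encoding(data):
    """Guess whether a byte string is UTF-8 or ISO-8859-1.

    Any byte sequence is valid ISO-8859-1, so that encoding serves as the
    fallback when the data is not well-formed UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"


if __name__ == "__main__":
    text = "Grüße"
    print(guess_encoding(text.encode("utf-8")))
    print(guess_encoding(text.encode("iso-8859-1")))
```

Pure ASCII data is valid in both encodings and is reported as UTF-8 here, which is harmless because the two encodings agree on ASCII.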

-

-~~!cooledit~~

-

-~~cooledit by Paul Sheer~~

-~~http://www.cooledit.org/~~

-~~is a good text editor for the X Window System. Since version 3.15, it has~~

-~~support for Unicode, including Bidi for Hebrew (but not Arabic).~~

-

-~~A build error message about a missing "vga_setpage" function is~~

-~~worked around by adding "-DDO_NOT_USE_VGALIB" to the CFLAGS.~~

-

-~~To view UTF-8 files in an UTF-8 locale you have to modify a setting in~~

-~~the "Options -> Switches" panel: Enable the checkbox "Display characters~~

-~~outside locale". I also found it necessary to disable "Spellcheck as you~~

-~~type".~~

-

-~~For viewing texts with both European and CJK characters, cooledit needs a~~

-~~font which contains both, for example the GNU unifont (see section~~

-~~"X11 Unicode fonts"): Start once~~

-

-~~$ cooledit -fn -gnu-unifont-medium-r-normal--16-160-75-75-c-80-iso10646-1~~

-

-~~cooledit will then use this font in all future invocations.~~

-

-~~Unfortunately, the only characters that can be entered through the keyboard~~

-~~are ISO-8859-1 characters and, through a cooledit specific compose mechanism,~~

-~~ISO-8859-2 characters. Inputting arbitrary Unicode characters in cooledit is~~

-~~possible, but a bit tedious.~~

-

-~~!emacs~~

-

-~~First of all, you should read the section "International Character Set Support"~~

-~~(node "International") in the Emacs manual. In particular, note that you need~~

-~~to start Emacs using the command~~

-

-~~$ emacs -fn fontset-standard~~

-

-~~so that it will use a font set comprising a lot of international characters.~~

-

-~~In the short term, there are two packages for using UTF-8 in Emacs. Neither~~

-~~of them requires recompiling Emacs.~~

-

-****The emacs-utf package

-~~http://www.cs.ust.hk/faculty/otfried/Mule/~~

-~~by Otfried Cheong provides a "unicode-utf8" encoding to Emacs.~~

-~~****~~

-

-****The oc-unicode package

-~~http://www.cs.ust.hk/faculty/otfried/Mule/,~~

-~~by Otfried Cheong, an extension of the Mule-UCS package~~

-~~ftp://etlport.etl.go.jp/pub/mule/Mule-UCS/Mule-UCS-.70.tar.gz~~

-~~(mirrored at~~

-~~http://riksun.riken.go.jp/archives/misc/mule/Mule-UCS/Mule-UCS-.70.tar.gz~~

-~~and~~

-~~ftp://ftp.m17n.org/pub/mule/Mule-UCS/Mule-UCS-.70.tar.gz)~~

-~~by Hisashi Miyashita, provides a "utf-8" encoding to Emacs.~~

-~~****~~

-

-~~You can use either of these packages, or both together. The advantages~~

-~~of the emacs-utf "unicode-utf8" encoding are: it loads faster, and it deals~~

-~~better with combining characters (important for Thai).~~

-~~The advantage of the Mule-UCS / oc-unicode "utf-8" encoding is: it can apply~~

-~~to a process buffer (such as M-x shell), not only to loading and saving of~~

-~~files; and it respects the widths of characters better (important for~~

-~~Ethiopian). However, it is less reliable: After heavy editing of a file, I~~

-~~have seen some Unicode characters replaced with U+FFFD after the file was~~

-~~saved. (But maybe those were bugs in Emacs 20.5 and 20.6 which are fixed in~~

-~~Emacs 20.7.)~~

-

-~~To install the emacs-utf package, compile the program "utf2mule" and install~~

-~~it somewhere in your $PATH. Also install unicode.el, muleuni-1.el,~~

-~~unicode-char.el somewhere. Then add the lines~~

-

-~~(setq load-path (cons "/home/user/somewhere/emacs" load-path))~~

-~~(if (not (string-match "XEmacs" emacs-version))~~

-~~(progn~~

-~~(require 'unicode)~~

-~~;(setq unicode-data-path "..../!UnicodeData-3...txt")~~

-~~(if (eq window-system 'x)~~

-~~(progn~~

-~~(setq fontset12~~

-~~(create-fontset-from-fontset-spec~~

-~~"-misc-fixed-medium-r-normal-*-12-*-*-*-*-*-fontset-standard"))~~

-~~(setq fontset13~~

-~~(create-fontset-from-fontset-spec~~

-~~"-misc-fixed-medium-r-normal-*-13-*-*-*-*-*-fontset-standard"))~~

-~~(setq fontset14~~

-~~(create-fontset-from-fontset-spec~~

-~~"-misc-fixed-medium-r-normal-*-14-*-*-*-*-*-fontset-standard"))~~

-~~(setq fontset15~~

-~~(create-fontset-from-fontset-spec~~

-~~"-misc-fixed-medium-r-normal-*-15-*-*-*-*-*-fontset-standard"))~~

-~~(setq fontset16~~

-~~(create-fontset-from-fontset-spec~~

-~~"-misc-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard"))~~

-~~(setq fontset18~~

-~~(create-fontset-from-fontset-spec~~

-~~"-misc-fixed-medium-r-normal-*-18-*-*-*-*-*-fontset-standard"))~~

-~~; (set-default-font fontset15)~~

-~~))))~~

-

-~~to your $HOME/.emacs file. To activate any of the font sets, use the Mule~~

-~~menu item "Set Font/!FontSet" or Shift-down-mouse-1. The Unicode coverage~~

-~~of the font sets at different sizes may depend on the installed fonts;~~

- here ~~are screen shots at various sizes of UTF-8-demo.txt (~~

-~~12,~~

-~~13,~~

-~~14,~~

-~~15,~~

-~~16,~~

-~~18)~~

-~~and of the Mule script examples (~~

-~~12,~~

-~~13,~~

-~~14,~~

-~~15,~~

-~~16,~~

-~~18).~~

-~~To designate a font set as the initial font set for the first frame at startup,~~

-~~uncomment the set-default-font line in the code snippet above.~~

-

-~~To install the oc-unicode package, execute the command~~

-

-~~$ emacs -batch -l oc-comp.el~~

-

-~~and install the resulting file un-define.elc, as well as~~

-~~oc-unicode.el, oc-charsets.el, oc-tools.el,~~

-~~somewhere. Then add the lines~~

-

-~~(setq load-path (cons "/home/user/somewhere/emacs" load-path))~~

-~~(if (not (string-match "XEmacs" emacs-version))~~

-~~(progn~~

-~~(require 'oc-unicode)~~

-~~;(setq unicode-data-path "..../!UnicodeData-3...txt")~~

-~~(if (eq window-system 'x)~~

-~~(progn~~

-~~(setq fontset12~~

-~~(oc-create-fontset~~

-~~"-misc-fixed-medium-r-normal-*-12-*-*-*-*-*-fontset-standard"~~

-~~"-misc-fixed-medium-r-normal-ja-12-*-iso10646-*"))~~

-~~(setq fontset13~~

-~~(oc-create-fontset~~

-~~"-misc-fixed-medium-r-normal-*-13-*-*-*-*-*-fontset-standard"~~

-~~"-misc-fixed-medium-r-normal-ja-13-*-iso10646-*"))~~

-~~(setq fontset14~~

-~~(oc-create-fontset~~

-~~"-misc-fixed-medium-r-normal-*-14-*-*-*-*-*-fontset-standard"~~

-~~"-misc-fixed-medium-r-normal-ja-14-*-iso10646-*"))~~

-~~(setq fontset15~~

-~~(oc-create-fontset~~

-~~"-misc-fixed-medium-r-normal-*-15-*-*-*-*-*-fontset-standard"~~

-~~"-misc-fixed-medium-r-normal-ja-15-*-iso10646-*"))~~

-~~(setq fontset16~~

-~~(oc-create-fontset~~

-~~"-misc-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard"~~

-~~"-misc-fixed-medium-r-normal-ja-16-*-iso10646-*"))~~

-~~(setq fontset18~~

-~~(oc-create-fontset~~

-~~"-misc-fixed-medium-r-normal-*-18-*-*-*-*-*-fontset-standard"~~

-~~"-misc-fixed-medium-r-normal-ja-18-*-iso10646-*"))~~

-~~; (set-default-font fontset15)~~

-~~))))~~

-

-~~to your $HOME/.emacs file. You can choose your appropriate font set as with~~

-~~the emacs-utf package.~~

-

-~~In order to open an UTF-8 encoded file, you will type~~

-

-~~M-x universal-coding-system-argument unicode-utf8 RET~~

-~~M-x find-file filename RET~~

-

-or

-

-~~C-x RET c unicode-utf8 RET~~

-~~C-x C-f filename RET~~

-

-~~(or utf-8 instead of unicode-utf8, if you prefer oc-unicode/Mule-UCS).~~

-

-~~In order to start a shell buffer with UTF-8 I/O, you will type~~

-

-~~M-x universal-coding-system-argument utf-8 RET~~

-~~M-x shell RET~~

-

-~~(This works with oc-unicode/Mule-UCS only.)~~

-

-~~There is a newer version Mule-UCS-.81. Unfortunately you need to rebuild emacs~~

-~~from source in order to use it.~~

-

-~~Note that all this works with Emacs 20 in windowing mode only, not in terminal~~

-~~mode. None of the mentioned packages works in Emacs 21, as of this writing.~~

-

-~~Richard Stallman plans to add integrated UTF-8 support to Emacs in the long~~

-~~term, and so do the XEmacs developers.~~

-

-~~!xemacs~~

-

-~~(This section is written by Gilbert Baumann.)~~

-

-~~Here is how to teach XEmacs (20.4 configured with MULE) the UTF-8 encoding.~~

-~~Unfortunately you need its sources to be able to patch it.~~

-

-~~First you need these files provided by Tomohiko Morioka:~~

-

-~~http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-21.-b55-emc-b55-ucs.diff~~

-~~and~~

-~~http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-ucs-conv-.1.tar.gz~~

-

-~~The .diff is a diff against the C sources. The tar ball is elisp code,~~

-~~which provides lots of code tables to map to and from Unicode. As the~~

-~~name of the diff file suggests it is against XEmacs-21; I needed to~~

-~~help `patch' a bit. The most notable difference to my XEmacs-20.4~~

-~~sources is that file-coding.[[ch] was called mule-coding.[[ch].~~

-

-~~For those unfamiliar with the XEmacs-MULE stuff (as I am) a quick~~

-~~guide:~~

-

-~~What we call an encoding is called by MULE a `coding-system'. The most~~

-~~important commands are:~~

-

-~~M-x set-file-coding-system~~

-~~M-x set-buffer-process-coding-system [[comint buffers]~~

-

-~~and the variable `file-coding-system-alist', which guides `find-file'~~

-~~to guess the encoding used. After stuff was running, the very first~~

-~~thing I did was~~

-~~this.~~

-

-~~This code looks into the special mode line introduced by -*- somewhere~~

-~~in the first 600 bytes of the file about to be opened; if there is a~~

-~~field "Encoding: xyz;" and the xyz encoding ("coding system" in Emacs speak)~~

-~~exists, choose that. So now you could do e.g.~~

-

-~~;;; -*- Mode: Lisp; Syntax: Common-Lisp; Package: CLEX; Encoding: utf-8; -*-~~

-

-~~and XEmacs goes into utf-8 mode here.~~

-

-~~After everything was running I defined \u03BB (greek lambda) as a~~

-~~macro like:~~

-

-~~(defmacro \u03BB (x) `(lambda .,x))~~

-

-~~!nedit~~

-

-~~!xedit~~

-

-~~With XFree86-4..1, xedit is able to edit UTF-8 files if you set the locale~~

-~~accordingly (see above), and add the line "Xedit*international: true" to~~

-~~your $HOME/.Xdefaults file.~~

-

-~~!axe~~

-

-~~As of version 6.1.2, aXe supports only 8-bit locales. If you add the line~~

-~~"Axe*international: true" to your $HOME/.Xdefaults file, it will simply dump~~

-~~core.~~

-

-~~!pico~~

-

-~~As of pine version 4.30, its editor pico cannot reasonably be used to view~~

-~~or edit UTF-8 files. In a UTF-8 enabled xterm, it has severe redraw problems.~~

-

-~~!mined98~~

-

-~~mined98 is a small text editor by Michiel Huisjes, Achim Müller and~~

-~~Thomas Wolff.~~

-~~http://www.inf.fu-berlin.de/~wolff/mined98.tar.gz~~

-~~It lets you edit UTF-8 or 8-bit encoded files, in an UTF-8 or 8-bit xterm.~~

-~~It also has powerful capabilities for entering Unicode characters.~~

-

-~~mined lets you edit both 8-bit encoded and UTF-8 encoded files. By default~~

-~~it uses an autodetection heuristic. If you don't want to rely on heuristics,~~

-~~pass the command-line option -u when editing an UTF-8 file, or~~

-~~+u when editing an 8-bit encoded file. You can change the~~

-~~interpretation at any time from within the editor: It displays the encoding~~

-~~("L:h" for 8-bit, "U:h" for UTF-8) in the menu line. Click on the first~~

-~~of these characters to change it.~~

-

-~~mined knows about double-width and combining characters and displays them~~

-~~correctly. It also has a special display mode for combining characters.~~

-

-~~mined also has a scrollbar and very nice pull-down menus. Alas, the "Home",~~

-~~"End", "Delete" keys do not work.~~

-

-~~!qemacs~~

-

-~~qemacs .2 is a small text editor by Fabrice Bellard~~

-~~http://www-stud.enst.fr/~bellard/qemacs/~~

-~~with Emacs keybindings. It runs in an UTF-8 console or xterm, and can edit~~

-~~both 8-bit encoded and UTF-8 encoded files. It still has a few rough edges,~~

-~~but further development is underway.~~

-

-~~!!4.5 Mailers~~

-

-~~MIME: RFC 2279 defines UTF-8 as a MIME charset, which can be transported~~

-~~under the 8bit, quoted-printable and base64 encodings. The older MIME~~

-~~UTF-7 proposal (RFC 2152) is considered to be deprecated and should not~~

-~~be used any further.~~

-

-~~Mail clients released after January 1, 1999, should be capable of sending and~~

-~~displaying UTF-8 encoded mails, otherwise they are considered deficient.~~

-~~But these mails have to carry the MIME labels~~

-

-~~Content-Type: text/plain; charset=UTF-8~~

-~~Content-Transfer-Encoding: 8bit~~

-

-~~Simply piping an UTF-8 file into "mail" without caring about the MIME labels~~

-~~will not work.~~
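
As an illustration of the required labels, the following Python sketch builds a message with the standard library `email` package and asks for the UTF-8 charset and the 8bit transfer encoding explicitly. It is a minimal example, not a complete mail client: `build_utf8_mail` is a hypothetical helper and the addresses are placeholders.

```python
from email.message import EmailMessage


def build_utf8_mail(body):
    """Build a plain-text message labelled as UTF-8 with 8bit transfer encoding."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"    # placeholder address
    msg["To"] = "recipient@example.org"   # placeholder address
    msg["Subject"] = "UTF-8 test"
    # Request the charset and content transfer encoding explicitly instead of
    # leaving the choice to the library.
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg


if __name__ == "__main__":
    mail = build_utf8_mail("Grüße aus Köln\n")
    print(mail["Content-Type"])
    print(mail["Content-Transfer-Encoding"])
```

Whether an 8bit body survives transport unchanged still depends on the mail servers involved; quoted-printable or base64 are the alternatives named above.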

-

-~~Mail client implementors should take a look at~~

-~~http://www.imc.org/imc-intl/~~

-~~and~~

-~~http://www.imc.org/mail-i18n.html.~~

-

-~~Now about the individual mail clients (or "mail user agents"):~~

-

-~~!pine~~

-

-~~The situation for an unpatched pine version 4.30 is as follows.~~

-

-~~Pine does not do character set conversions. But it allows you to view~~

-~~UTF-8 mails in an UTF-8 text window (Linux console or xterm).~~

-

-~~Normally, Pine will warn about different character sets each time you view~~

-~~an UTF-8 encoded mail. To get rid of this warning, choose S (setup), then~~

-~~C (config), then change the value of "character-set" to UTF-8. This option~~

-~~will not do anything, except to reduce the warnings, as Pine has no built-in~~

-~~knowledge of UTF-8.~~

-

-~~Also note that Pine's notion of Unicode characters is pretty limited: It~~

-~~will display Latin and Greek characters, but not other kinds of Unicode~~

-~~characters.~~

-

-~~A patch by Robert Brady~~

-~~<robert@suse.co.uk>~~

-~~http://www.ents.susu.soton.ac.uk/~robert/pine-utf8-.1.diff~~

-~~adds UTF-8 support to Pine. With this patch, it decodes and prints headers~~

-~~and bodies properly. The patch depends on the GNOME libunicode~~

-~~http://cvs.gnome.org/lxr/source/libunicode/.~~

-

-~~However, alignment remains broken in many places; replying to a mail does~~

-~~not cause the character set to be converted as appropriate; and the editor,~~

-~~pico, cannot deal with multibyte characters.~~

-

-~~!kmail~~

-

-~~kmail (as of KDE 1.) does not support UTF-8 mails at all.~~

-

-~~!Netscape Communicator~~

-

-~~Netscape Communicator's Messenger can send and display mails in UTF-8~~

-~~encoding, but it needs a little bit of manual user intervention.~~

-

-~~To send an UTF-8 encoded mail: After opening the "Compose" window, but before~~

-~~starting to compose the message, select from the menu~~

-~~"View -> Character Set -> Unicode (UTF-8)". Then compose the message and~~

-~~send it.~~

-

-~~When you receive an UTF-8 encoded mail, Netscape unfortunately does not~~

-~~display it in UTF-8 right away, and does not even give a visual clue that~~

-~~the mail was encoded in UTF-8. You have to manually select from the menu~~

-~~"View -> Character Set -> Unicode (UTF-8)".~~

-

-~~For displaying UTF-8 mails, Netscape uses different fonts. You can adjust~~

-~~your font settings in the "Edit -> Preferences -> Fonts" dialog; choose~~

-~~the "Unicode" font category.~~

-

-~~!emacs (rmail, vm)~~

-

-~~!mutt~~

-

-~~mutt-1.2.x, as available from~~

-~~http://www.mutt.org/,~~

-~~has only rudimentary support for UTF-8: it can convert~~

-~~from UTF-8 into an 8-bit display charset. The mutt-1.3.x~~

-~~development branch also supports UTF-8 as the display charset,~~

-~~so you can run Mutt in an UTF-8 xterm, and has thorough support~~

-~~for MIME and charset conversion (relying on iconv).~~

-

-~~!exmh~~

-

-~~exmh 2.1.2 with Tk 8.4a1 can recognize and correctly display UTF-8 mails~~

-~~(without CJK characters) if you add the following lines to your~~

-~~$HOME/.Xdefaults file.~~

-

-!

-~~! Exmh~~

-!

-~~exmh.mimeUCharsets: utf-8~~

-~~exmh.mime_utf-8_registry: iso10646~~

-~~exmh.mime_utf-8_encoding: 1~~

-~~exmh.mime_utf-8_plain_families: fixed~~

-~~exmh.mime_utf-8_fixed_families: fixed~~

-~~exmh.mime_utf-8_proportional_families: fixed~~

-~~exmh.mime_utf-8_title_families: fixed~~

-

-~~!!4.6 Text processing~~

-

-~~!groff~~

-

-~~groff 1.16.1, the GNU implementation of the traditional Unix text processing~~

-~~system troff/nroff, can output UTF-8 formatted text. Simply use~~

-~~`groff -Tutf8' instead of `groff -Tlatin1' or~~

-~~`groff -Tascii'.~~

-

-~~!TeX~~

-

-~~The teTeX .9 (and newer) distribution contains a Unicode adaptation of TeX,~~

-~~called Omega~~

-(

-~~http://www.gutenberg.eu.org/omega/,~~

-~~ftp://ftp.ens.fr/pub/tex/yannis/omega).~~

-~~Together with the unicode.tex file contained in~~

-~~utf8-tex-.1.tar.gz~~

-~~it enables you to use UTF-8 encoded sources as input for TeX. About a thousand~~

-~~Unicode characters are currently supported.~~

-

-~~All that changes is that you run `omega' (instead of `tex') or `lambda'~~

-~~(instead of `latex'), and insert the following lines at the head of~~

-~~your source input.~~

-

-~~\ocp\TexUTF=inutf8~~

-~~\!InputTranslation currentfile \TexUTF~~

-

-~~\input unicode~~

-

-~~Other possibly related links:~~

-~~http://www.dante.de/projekte/nts/NTS-FAQ.html,~~

-~~ftp://ftp.dante.de/pub/tex/language/chinese/CJK/.~~

-

-~~!!4.7 Databases~~

-

-~~!PostgreSQL~~

-

-~~PostgreSQL 6.4 or newer can be built with the configuration option~~

-~~--with-mb=UNICODE.~~

-

-~~!Interbase~~

-

-~~Borland/Inprise's Interbase 6.0 can store string fields in UTF-8 format~~

-~~if the option "CHARACTER SET UNICODE_FSS" is given.~~

-

-~~!!4.8 Other text-mode applications~~

-

-~~!less~~

-

-~~With~~

-~~http://www.flash.net/~marknu/less/less-358.tar.gz~~

-~~you can browse UTF-8 encoded text files in an UTF-8 xterm or console.~~

-~~Make sure that the environment variable LESSCHARSET is not set (or is set~~

-~~to utf-8). If you also have a LESSKEY environment variable set, also make~~

-~~sure that the file it points to does not define LESSCHARSET. If necessary,~~

-~~regenerate this file using the `lesskey' command, or unset the LESSKEY~~

-~~environment variable.~~

-

-~~!lv~~

-

-~~lv-4.49.3 by Tomio Narita~~

-~~http://www.ff.iij4u.or.jp/~nrt/lv/~~

-~~is a file viewer with builtin character set converters. To view UTF-8 files~~

-~~in an UTF-8 console, use "lv -Au8". But it can also be used to view~~

-~~files in other CJK encodings in an UTF-8 console.~~

-

-~~There is a small glitch: lv turns off xterm's cursor and doesn't turn it on~~

-~~again.~~

-

-~~!expand~~

-

-~~Get the GNU textutils-2.0 and apply the patch~~

-~~textutils-2..diff,~~

-~~then configure, add "#define HAVE_FGETWC 1", "#define HAVE_FPUTWC 1" to~~

-~~config.h. Then rebuild.~~

-

-~~!col, colcrt, colrm, column, rev, ul~~

-

-~~Get the util-linux-2.9y package, configure it, then define ENABLE_WIDECHAR in~~

-~~defines.h, change the "#if 0" to "#if 1" in lib/widechar.h. In~~

-~~text-utils/Makefile, modify CFLAGS and LDFLAGS so that they include the~~

-~~directories where libutf8 is installed. Then rebuild.~~

-

-~~!figlet~~

-

-~~figlet 2.2 has an option for UTF-8 input: "figlet -C utf8"~~

-

-~~!Base utilities~~

-

-~~The Li18nux list of commands and utilities that ought to be made interoperable~~

-~~with UTF-8 is as follows. Useful information still needs to be added here; I just~~

-~~haven't gotten around to it yet :-)~~

-

-~~As of glibc-2.2, regular expressions only work for 8-bit characters.~~

-~~In an UTF-8 locale, regular expressions that contain non-ASCII characters~~

-~~or that expect to match a single multibyte character with "." do not work.~~

-~~This affects all commands and utilities listed below.~~
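
The effect of byte-oriented matching can be reproduced in Python, whose `re` module works on single bytes when given byte strings and on whole characters when given text strings. The sketch is only an analogy for the behaviour described above, not a test of glibc itself; `dot_matches_whole` is a hypothetical helper.

```python
import re


def dot_matches_whole(value):
    """Return True if the pattern "." matches the entire value."""
    pattern = b"." if isinstance(value, bytes) else "."
    return re.fullmatch(pattern, value) is not None


if __name__ == "__main__":
    char = "é"                  # one character
    raw = char.encode("utf-8")  # two bytes in UTF-8
    print(len(raw), dot_matches_whole(raw))    # byte-wise matching
    print(len(char), dot_matches_whole(char))  # character-wise matching
```

In the byte-wise case "." consumes only the first of the two bytes, so it cannot match the whole character; this is the failure mode that regular expressions show in a UTF-8 locale when the matcher is not multibyte aware.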

-

-~~; __alias__:~~

-

-~~No info available yet.~~

-~~; __ar__:~~

-

-~~No info available yet.~~

-~~; __arch__:~~

-

-~~No info available yet.~~

-~~; __arp__:~~

-

-~~No info available yet.~~

-~~; __at__:~~

-

-~~As of at-3.1.8: The two uses of isalnum in at.c are invalid and should be~~

-~~replaced with a use of quotearg.c or an exclude list of the (fixed) list~~

-~~of shell metacharacters. The two uses of %8s in at.c and atd.c are invalid~~

-~~and should become arbitrary length.~~

-~~; __awk__:~~

-

-~~No info available yet.~~

-~~; __basename__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __batch__:~~

-

-~~No info available yet.~~

-~~; __bc__:~~

-

-~~No info available yet.~~

-~~; __bg__:~~

-

-~~No info available yet.~~

-~~; __bunzip2__:~~

-

-~~No info available yet.~~

-~~; __bzip2__:~~

-

-~~No info available yet.~~

-~~; __bzip2recover__:~~

-

-~~No info available yet.~~

-~~; __cal__:~~

-

-~~No info available yet.~~

-~~; __cat__:~~

-

-~~No info available yet.~~

-~~; __cd__:~~

-

-~~No info available yet.~~

-~~; __cflow__:~~

-

-~~No info available yet.~~

-~~; __chgrp__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __chmod__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __chown__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __chroot__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __cksum__:~~

-

-~~As of textutils-2.0e: OK.~~

-~~; __clear__:~~

-

-~~No info available yet.~~

-~~; __cmp__:~~

-

-~~No info available yet.~~

-~~; __col__:~~

-

-~~No info available yet.~~

-~~; __comm__:~~

-

-~~No info available yet.~~

-~~; __command__:~~

-

-~~No info available yet.~~

-~~; __compress__:~~

-

-~~No info available yet.~~

-~~; __cp__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __cpio__:~~

-

-~~No info available yet.~~

-~~; __crontab__:~~

-

-~~No info available yet.~~

-~~; __csplit__:~~

-

-~~No info available yet.~~

-~~; __ctags__:~~

-

-~~No info available yet.~~

-~~; __cut__:~~

-

-~~No info available yet.~~

-~~; __date__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __dd__:~~

-

-~~As of fileutils-4.0u: The conv=lcase, conv=ucase options don't work correctly.~~

-~~; __df__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __diff__:~~

-

-~~As of diffutils-2.7.2: the --side-by-side mode doesn't compute~~

-~~column width correctly.~~

-~~; __diff3__:~~

-

-~~No info available yet.~~

-~~; __dirname__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __domainname__:~~

-

-~~No info available yet.~~

-~~; __du__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __echo__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __ed__:~~

-

-~~No info available yet.~~

-~~; __egrep__:~~

-

-~~No info available yet.~~

-~~; __env__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __ex__:~~

-

-~~No info available yet.~~

-~~; __expand__:~~

-

-~~No info available yet.~~

-~~; __expr__:~~

-

-~~As of sh-utils-2.0i: The operators "match", "substr", "index", "length"~~

-~~don't work correctly.~~

-~~; __false__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __fc__:~~

-

-~~No info available yet.~~

-~~; __fg__:~~

-

-~~No info available yet.~~

-~~; __fgrep__:~~

-

-~~No info available yet.~~

-~~; __file__:~~

-

-~~No info available yet.~~

-~~; __find__:~~

-

-~~As of findutils-4.1.6: The "-iregex" does not work correctly; this needs a~~

-~~fix in function find/parser.c:insert_regex.~~

-~~; __fold__:~~

-

-~~No info available yet.~~

-~~; __ftp[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __fuser__:~~

-

-~~No info available yet.~~

-~~; __gencat__:~~

-

-~~No info available yet.~~

-~~; __getconf__:~~

-

-~~No info available yet.~~

-~~; __getopts__:~~

-

-~~No info available yet.~~

-~~; __gettext__:~~

-

-~~No info available yet.~~

-~~; __grep__:~~

-

-~~No info available yet.~~

-~~; __gunzip__:~~

-

-~~No info available yet.~~

-~~; __gzip__:~~

-

-~~gzip-1.3 is UTF-8 capable, but it uses only English messages in ASCII~~

-~~charset. Proper internationalization would require: Use gettext. Call~~

-~~setlocale. In function check_ofname (file gzip.c), use the function rpmatch~~

-~~from GNU text/sh/fileutils instead of asking for "y" or "n". The use~~

-~~of strlen in gzip.c:852 is wrong, needs to use the function mbswidth.~~

-~~; __hash__:~~

-

-~~No info available yet.~~

-~~; __head__:~~

-

-~~No info available yet.~~

-~~; __hostname__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __iconv__:~~

-

-~~No info available yet.~~

-~~; __id__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __ifconfig__:~~

-

-~~No info available yet.~~

-~~; __imake__:~~

-

-~~No info available yet.~~

-~~; __ipcrm__:~~

-

-~~No info available yet.~~

-~~; __ipcs__:~~

-

-~~No info available yet.~~

-~~; __jobs__:~~

-

-~~No info available yet.~~

-~~; __join__:~~

-

-~~No info available yet.~~

-~~; __kill__:~~

-

-~~No info available yet.~~

-~~; __killall__:~~

-

-~~No info available yet.~~

-~~; __ldd__:~~

-

-~~No info available yet.~~

-~~; __less__:~~

-

-~~No complete info available yet.~~

-~~; __lex__:~~

-

-~~No info available yet.~~

-~~; __ln__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __locale__:~~

-

-~~As of glibc-2.2: OK.~~

-~~; __localedef__:~~

-

-~~As of glibc-2.2: OK.~~

-~~; __logger__:~~

-

-~~No info available yet.~~

-~~; __logname__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __lp__:~~

-

-~~No info available yet.~~

-~~; __lpc[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __lpq[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __lpr[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __lprm[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __lpstat(LEGACY)__:~~

-

-~~No info available yet.~~

-~~; __ls__:~~

-

-~~As of fileutils-4.0y: OK.~~

-~~; __m4__:~~

-

-~~No info available yet.~~

-~~; __mailx__:~~

-

-~~No info available yet.~~

-~~; __make__:~~

-

-~~No info available yet.~~

-~~; __man__:~~

-

-~~No info available yet.~~

-~~; __mesg__:~~

-

-~~No info available yet.~~

-~~; __mkdir__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __mkfifo__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __mkfs__:~~

-

-~~No info available yet.~~

-~~; __mkswap__:~~

-

-~~No info available yet.~~

-~~; __more__:~~

-

-~~No info available yet.~~

-~~; __mount__:~~

-

-~~No info available yet.~~

-~~; __msgfmt__:~~

-

-~~No info available yet.~~

-~~; __msgmerge__:~~

-

-~~No info available yet.~~

-~~; __mv__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __netstat__:~~

-

-~~No info available yet.~~

-~~; __newgrp__:~~

-

-~~No info available yet.~~

-~~; __nice__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __nl__:~~

-

-~~No info available yet.~~

-~~; __nohup__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __nslookup__:~~

-

-~~No info available yet.~~

-~~; __nm__:~~

-

-~~No info available yet.~~

-~~; __od__:~~

-

-~~No info available yet.~~

-~~; __passwd[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __paste__:~~

-

-~~No info available yet.~~

-~~; __patch__:~~

-

-~~No info available yet.~~

-~~; __pathchk__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __ping__:~~

-

-~~No info available yet.~~

-~~; __pr__:~~

-

-~~No info available yet.~~

-~~; __printf__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __ps__:~~

-

-~~No info available yet.~~

-~~; __pwd__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __read__:~~

-

-~~No info available yet.~~

-~~; __reboot__:~~

-

-~~No info available yet.~~

-~~; __renice__:~~

-

-~~No info available yet.~~

-~~; __rm__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __rmdir__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __sed__:~~

-

-~~No info available yet.~~

-~~; __shar[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __shutdown__:~~

-

-~~No info available yet.~~

-~~; __sleep__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __sort__:~~

-

-~~No info available yet.~~

-~~; __split__:~~

-

-~~No info available yet.~~

-~~; __strings__:~~

-

-~~No info available yet.~~

-~~; __strip__:~~

-

-~~No info available yet.~~

-~~; __stty__:~~

-

-~~As of sh-utils-2.0.11: OK.~~

-~~; __su[[BSD]__:~~

-

-~~No info available yet.~~

-~~; __sum__:~~

-

-~~As of textutils-2.0e: OK.~~

-~~; __tail__:~~

-

-~~No info available yet.~~

-~~; __talk__:~~

-

-~~No info available yet.~~

-~~; __tar__:~~

-

-~~As of tar-1.13.17: OK, if user and group names are always ASCII.~~

-~~; __tclsh__:~~

-

-~~No info available yet.~~

-~~; __tee__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __telnet__:~~

-

-~~No info available yet.~~

-~~; __test__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __time__:~~

-

-~~No info available yet.~~

-~~; __touch__:~~

-

-~~As of fileutils-4.0u: OK.~~

-~~; __tput__:~~

-

-~~No info available yet.~~

-~~; __tr__:~~

-

-~~No info available yet.~~

-~~; __true__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __tsort__:~~

-

-~~No info available yet.~~

-~~; __tty__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __type__:~~

-

-~~No info available yet.~~

-~~; __ulimit__:~~

-

-~~No info available yet.~~

-~~; __umask__:~~

-

-~~No info available yet.~~

-~~; __umount__:~~

-

-~~No info available yet.~~

-~~; __unalias__:~~

-

-~~No info available yet.~~

-~~; __uname__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __uncompress__:~~

-

-~~No info available yet.~~

-~~; __unexpand__:~~

-

-~~No info available yet.~~

-~~; __uniq__:~~

-

-~~No info available yet.~~

-~~; __uudecode__:~~

-

-~~No info available yet.~~

-~~; __uuencode__:~~

-

-~~No info available yet.~~

-~~; __vi__:~~

-

-~~No info available yet.~~

-~~; __wait__:~~

-

-~~No info available yet.~~

-~~; __wc__:~~

-

-~~As of textutils-2.0.8: OK.~~

-~~; __who__:~~

-

-~~As of sh-utils-2.0i: OK.~~

-~~; __wish__:~~

-

-~~No info available yet.~~

-~~; __write__:~~

-

-~~No info available yet.~~

-~~; __xargs__:~~

-

-~~As of findutils-4.1.5: The program uses strstr; a patch has been submitted~~

-~~to the maintainer.~~

-~~; __xgettext__:~~

-

-~~No info available yet.~~

-~~; __yacc__:~~

-

-~~No info available yet.~~

-~~; __zcat__:~~

-

-~~No info available yet.~~

-
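Several of the problems listed above (dd's conv=ucase, the "length" operator of expr, the use of strlen in gzip) come from treating bytes as characters. A minimal sketch of the difference, written in Python purely for illustration; the helper display_width is a hypothetical, simplified stand-in for what a function like mbswidth computes, not its implementation:

```python
import unicodedata

def display_width(text):
    """Approximate terminal columns: 0 for combining marks, 2 for wide CJK."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

word = "na\u00efve"              # 5 characters, one of them non-ASCII
utf8 = word.encode("utf-8")      # what a byte-oriented tool actually sees

byte_count = len(utf8)           # what strlen would report
char_count = len(word)           # what a character-aware length reports
bytewise_upper = utf8.upper()    # only ASCII letters change
charwise_upper = word.upper()    # the non-ASCII letter is mapped as well
cjk_columns = display_width("\u65e5\u672c\u8a9e")
```

The byte count and the character count differ as soon as one non-ASCII character is present, and the column count differs again for wide characters.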

-~~!!4.9 Other X11 applications~~

-

-~~Owen Taylor is currently developing a library for rendering multilingual~~

-~~text, called pango.~~

-~~http://www.labs.redhat.com/~otaylor/pango/,~~

-~~http://www.pango.org/.~~

-

-~~----~~

-

-~~!!5. Printing~~

-

-~~Since Postscript itself does not support Unicode fonts, the burden of~~

-~~Unicode support in printing is on the program creating the Postscript~~

-~~document, not on the Postscript renderer.~~

-

-~~The existing Postscript fonts I've seen - .pfa/.pfb/.afm/.pfm/.gsf -~~

-~~support only a small range of glyphs and are not Unicode fonts.~~

-

-~~!!5.1 Printing using !TrueType fonts~~

-

-~~Both the uniprint and wprint programs produce good printed output~~

-~~for Unicode plain text. They require a !TrueType font; see section~~

-~~"!TrueType fonts" above. The Bitstream Cyberbit gives good results.~~

-

-~~!uniprint~~

-

-~~The "uniprint" program contained in the yudit package can convert a text~~

-~~file to Postscript. For uniprint to find the Cyberbit font, symlink it to~~

-~~/usr/local/share/yudit/data/cyberbit.ttf.~~

-

-~~!wprint~~

-

-~~The "wprint" (!WorldPrint) program by Eduardo Trapani~~

-~~http://ttt.esperanto.org.uy/programoj/angle/wprint.html~~

-~~postprocesses Postscript output produced by Netscape Communicator or Mozilla~~

-~~from HTML pages or plain text files.~~

-

-~~The output is nearly perfect; only in Cyrillic paragraphs the line breaking~~

-~~is incorrect: the lines are only about half as wide as they should be.~~

-

-~~!Comparison~~

-

-~~For plain text, uniprint has a better overall layout. On the other hand,~~

-~~only wprint gets Thai output correct.~~

-

-~~!!5.2 Printing using fixed-size fonts~~

-

-~~Generally, printing with fixed-size fonts does not give output as professional~~

-~~as printing with !TrueType fonts.~~

-

-~~!txtbdf2ps~~

-

-~~The txtbdf2ps 0.7 program by Serge Winitzki~~

-~~http://members.linuxstart.com/~winitzki/txtbdf2ps.html~~

-~~converts a plain text file to Postscript, by use of a BDF font.~~

-~~Installation:~~

-

-~~# install -m 755 txtbdf2ps-dev.txt /usr/local/bin/txtbdf2ps~~

-

-~~Example with a proportional font:~~

-

-~~$ txtbdf2ps -BDF=cyberbit.bdf -UTF-8 -nowrap < input.txt > output.ps~~

-

-~~Example with a fixed-width font:~~

-

-~~$ txtbdf2ps -BDF=unifont.bdf -UTF-8 -nowrap < input.txt > output.ps~~

-

-~~Note: txtbdf2ps does not support combining characters and bidi.~~

-

-~~!!5.3 The classical approach~~

-

-~~Another way to print with !TrueType fonts is to convert the !TrueType font to~~

-~~a Postscript font using the ttf2pt1 utility~~

-(

-~~http://www.netspace.net.au/~mheath/ttf2pt1/,~~

-~~http://quadrant.netspace.net.au/ttf2pt1/,~~

-~~http://ttf2pt1.sourceforge.net/). Details can be~~

-~~found in Julius Chroboczek's "Printing with !TrueType fonts in Unix" writeup,~~

-~~http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/printing.html.~~

-

-~~!TeX, Omega~~

-

-~~TODO: CJK, metafont, omega, dvips, odvips, utf8-tex-0.1~~

-

-~~!!DocBook~~

-

-~~TODO: db2ps, jadetex~~

-

-~~!groff -Tps~~

-

-~~"groff -Tps" produces Postscript output. Its Postscript output driver~~

-~~supports only a very limited number of Unicode characters (only what~~

-~~Postscript supports by itself).~~

-

-~~!!5.4 No luck with...~~

-

-~~!Netscape's "Print..."~~

-

-~~As of version 4.72, Netscape Communicator cannot correctly print HTML~~

-~~pages in UTF-8 encoding. You really have to use wprint.~~

-

-~~!Mozilla's "Print..."~~

-

-~~As of version M16, printing of HTML pages is apparently not implemented.~~

-

-~~!html2ps~~

-

-~~As of version 1.0b1, the html2ps HTML to Postscript converter does not support~~

-~~UTF-8 encoded HTML pages and has no special treatment of fonts: the generated~~

-~~Postscript uses the standard Postscript fonts.~~

-

-~~!a2ps~~

-

-~~As of version 4.12, a2ps doesn't support printing UTF-8 encoded text.~~

-

-~~!enscript~~

-

-~~As of version 1.6.1, enscript doesn't support printing UTF-8 encoded text.~~

-~~By default, it uses only the standard Postscript fonts, but it can also~~

-~~include a custom Postscript font in the output.~~

-

-~~----~~

-

-~~!!6. Making your programs Unicode aware~~

-

-~~!!6.1 C/C++~~

-

-~~The C `char' type is 8-bit and will stay 8-bit because it denotes~~

-~~the smallest addressable data unit. Various facilities are available:~~

-

-~~!For normal text handling~~

-

-~~The ISO/ANSI C standard contains, in an amendment which was added in 1995,~~

-~~a "wide character" type `wchar_t', a set of functions like those~~

-~~found in <string.h> and <ctype.h> (declared in~~

-~~<wchar.h> and <wctype.h>, respectively), and~~

-~~a set of conversion functions between `char *' and~~

-~~`wchar_t *' (declared in <stdlib.h>).~~

-

-~~Good references for this API are~~

-

-****the GNU libc-2.1 manual, chapters 4 "Character Handling" and

-~~6 "Character Set Handling",~~

-~~****~~

-

-****the manual pages

-~~man-mbswcs.tar.gz, now contained in~~

-~~ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz,~~

-~~****~~

-

-****the !OpenGroup's introduction

-~~http://www.unix-systems.org/version2/whatsnew/login_mse.html,~~

-~~****~~

-

-****the !OpenGroup's Single Unix specification

-~~http://www.UNIX-systems.org/online.html,~~

-~~****~~

-

-****the ISO/IEC 9899:1999 (ISO C 99) standard. The latest draft before it was

-~~adopted is called n2794. You find it at~~

-~~ftp://ftp.csn.net/DMK/sc22wg14/review/~~

-or

-~~http://java-tutor.com/docs/c/.~~

-~~****~~

-

-****Clive Feather's introduction

-~~http://www.lysator.liu.se/c/na1.html,~~

-~~****~~

-

-****the Dinkumware C library reference

-~~http://www.dinkumware.com/htm_cl/.~~

-~~****~~

-

-~~Advantages of using this API:~~

-

-****It's a vendor independent standard.

-~~****~~

-

-****The functions do the right thing, depending on the user's locale.

-~~All a program needs to call is setlocale(LC_ALL,"");.~~

-~~****~~

-

-~~Drawbacks of this API:~~

-

-****Some of the functions are not multithread-safe, because they keep a hidden

-~~internal state between function calls.~~

-~~****~~

-

-****There is no first-class locale datatype. Therefore this API cannot reasonably

-~~be used for anything that needs more than one locale or character set at the~~

-~~same time.~~

-~~****~~

-

-****The OS support for this API is not good on most OSes.

-~~****~~

-

-~~!Portability notes~~

-

-~~A `wchar_t' may or may not be encoded in Unicode; this is~~

-~~platform and sometimes also locale dependent. A multibyte sequence~~

-~~`char *' may or may not be encoded in UTF-8; this is platform~~

-~~and sometimes also locale dependent.~~

-

-~~In detail, here is what the~~

-~~Single Unix specification~~

-~~says about the `wchar_t' type:~~

-~~''All wide-character codes in a given process consist of an equal number~~

-~~of bits. This is in contrast to characters, which can consist of a~~

-~~variable number of bytes. The byte or byte sequence that represents a~~

-~~character can also be represented as a wide-character code.~~

-~~Wide-character codes thus provide a uniform size for manipulating text~~

-~~data. A wide-character code having all bits zero is the null~~

-~~wide-character code, and terminates wide-character strings. The~~

-~~wide-character value for each member of the Portable Character Set'' (i.e. ASCII) ''will equal its value when used as the lone character in an integer~~

-~~character constant. Wide-character codes for other characters are~~

-~~locale- and implementation-dependent. State shift bytes do not have a~~

-~~wide-character code representation.''~~

-

-~~One particular consequence is that in portable programs you shouldn't use~~

-~~non-ASCII characters in string literals. That means, even though you~~

-~~know the Unicode double quotation marks have the codes U+201C and U+201D,~~

-~~you shouldn't write a string literal L"\u201cHello\u201d, he said"~~

-~~or "\xe2\x80\x9cHello\xe2\x80\x9d, he said" in C programs. Instead,~~

-~~use GNU gettext, write it as gettext("'Hello', he said"), and create~~

-~~a message database en.po which translates "'Hello', he said" to~~

-~~"\u201cHello\u201d, he said".~~
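To see why the two literals above are not portable, here is a small sketch (in Python, for illustration only; the helper name encodable is hypothetical). It shows that the byte literal is simply the UTF-8 form of the wide literal, and that the text cannot be represented at all in an 8-bit charset such as ISO-8859-1, whereas the ASCII fallback passed to gettext can:

```python
wide = "\u201cHello\u201d, he said"
multibyte = b"\xe2\x80\x9cHello\xe2\x80\x9d, he said"
fallback = "'Hello', he said"

def encodable(text, charset):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

same_text = wide.encode("utf-8") == multibyte
ok_in_utf8 = encodable(wide, "utf-8")
ok_in_latin1 = encodable(wide, "iso-8859-1")
fallback_ok = encodable(fallback, "ascii")
```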

-

-~~Here is a survey of the portability of the ISO/ANSI C facilities on various~~

-~~Unix flavours.~~

-

-~~; __GNU glibc-2.2.x__:~~

-

-****<wchar.h> and <wctype.h> exist.

-~~****~~

-

-****Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.

-~~****~~

-

-****Has five UTF-8 locales.

-~~****~~

-

-****mbrtowc works.

-~~****~~

-

-~~; __GNU glibc-2.0.x, glibc-2.1.x__:~~

-

-****<wchar.h> and <wctype.h> exist.

-~~****~~

-

-****Has wcs/mbs functions, but no fgetwc/fputwc/wprintf.

-~~****~~

-

-****No UTF-8 locale.

-~~****~~

-

-****mbrtowc returns EILSEQ for bytes >= 0x80.

-~~****~~

-

-~~; __AIX 4.3__:~~

-

-****<wchar.h> and <wctype.h> exist.

-~~****~~

-

-****Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.

-~~****~~

-

-****Has many UTF-8 locales, one for every country.

-~~****~~

-

-****Needs -D_XOPEN_SOURCE=500 in order to define mbstate_t.

-~~****~~

-

-****mbrtowc works.

-~~****~~

-

-~~; __Solaris 2.7__:~~

-

-****<wchar.h> and <wctype.h> exist.

-~~****~~

-

-****Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.

-~~****~~

-

-****Has the following UTF-8 locales:

-~~en_US.UTF-8, de.UTF-8, es.UTF-8, fr.UTF-8, it.UTF-8, sv.UTF-8.~~

-~~****~~

-

-****mbrtowc returns -1/EILSEQ (instead of -2) for bytes >= 0x80.

-~~****~~

-

-~~; __OSF/1 4.0d__:~~

-

-****<wchar.h> and <wctype.h> exist.

-~~****~~

-

-****Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.

-~~****~~

-

-****Has an add-on universal.utf8@ucs4 locale, see "man 5 unicode".

-~~****~~

-

-****mbrtowc does not know about UTF-8.

-~~****~~

-

-~~; __Irix 6.5__:~~

-

-****<wchar.h> and <wctype.h> exist.

-~~****~~

-

-****Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.

-~~****~~

-

-****Has no multibyte locales.

-~~****~~

-

-****Has only a dummy definition for mbstate_t.

-~~****~~

-

-****Doesn't have mbrtowc.

-~~****~~

-

-~~; __HP-UX 11.00__:~~

-

-****<wchar.h> exists, <wctype.h> does not.

-~~****~~

-

-****Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.

-~~****~~

-

-****Has a C.utf8 locale.

-~~****~~

-

-****Doesn't have mbstate_t.

-~~****~~

-

-****Doesn't have mbrtowc.

-~~****~~

-

-~~As a consequence, I recommend using the restartable and multithread-safe~~

-~~wcsr/mbsr functions, forget about those systems which don't have them (Irix,~~

-~~HP-UX, AIX), and use the UTF-8 locale plug-in libutf8_plug.so (see below)~~

-~~on those systems which permit you to compile programs which use these~~

-~~wcsr/mbsr functions (Linux, Solaris, OSF/1).~~
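What "restartable" means can be sketched with Python's incremental decoders, used here only as an analogy to mbrtowc and its mbstate_t argument, not as the C API itself. The decoder object plays the role of the explicit conversion state: it remembers an incomplete multibyte sequence between calls instead of reporting an error. The function name decode_in_chunks is hypothetical.

```python
import codecs

def decode_in_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks that may split a multibyte sequence anywhere."""
    decoder = codecs.getincrementaldecoder(encoding)()  # explicit state object
    pieces = []
    for chunk in chunks:
        # An incomplete sequence yields "" and is kept in the decoder state.
        pieces.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if input ended mid-sequence.
    pieces.append(decoder.decode(b"", final=True))
    return pieces

# The three bytes of U+20AC arrive one at a time.
pieces = decode_in_chunks([b"\xe2", b"\x82", b"\xac"])
```

Because the state lives in an object owned by the caller rather than in a hidden static variable, two threads can each run their own conversion without interfering.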

-

-~~Similar advice, given by Sun in~~

-~~http://www.sun.com/software/white-papers/wp-unicode/,~~

-~~section "Internationalized Applications with Unicode", is:~~

-

-~~''To properly internationalize an application, use the following~~

-~~guidelines:''~~

-

-***#''Avoid direct access with Unicode. This is a task of the platform's

-~~internationalization framework.''~~

-***#

-

-***#''Use the POSIX model for multibyte and wide-character interfaces.''

-***#

-

-***#''Only call the APIs that the internationalization framework

-~~provides for language and cultural-specific operations.''~~

-***#

-

-***#''Remain code-set independent.''

-***#

-

-~~If, for some reason, in some piece of code, you really have to assume that~~

-~~`wchar_t' is Unicode (for example, if you want to do special treatment of~~

-~~some Unicode characters), you should make that piece of code conditional~~

-~~upon the result of is_locale_utf8(). Otherwise you will mess up~~

-~~your program's behaviour in different locales or other platforms. The~~

-~~function is_locale_utf8 is declared in~~

-~~utf8locale.h~~

-~~and defined in~~

-~~utf8locale.c.~~
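The idea behind such a check can be sketched as follows. This is a Python illustration under stated assumptions, not the contents of utf8locale.c; the names codeset_is_utf8 and locale_is_utf8 are hypothetical. It asks the C library for the codeset of the user's locale and compares it with UTF-8, tolerating spelling variants such as "utf8":

```python
import codecs
import locale

def codeset_is_utf8(codeset):
    """True if the codeset name is some spelling of UTF-8."""
    try:
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        return False  # unknown or empty codeset name

def locale_is_utf8():
    """True if the user's LC_CTYPE locale uses UTF-8 (Unix only)."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale settings: keep the current locale
    return codeset_is_utf8(locale.nl_langinfo(locale.CODESET))
```

Code that assumes Unicode semantics would then be guarded by a call to locale_is_utf8(), and a code-set independent path would be taken otherwise.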

-

-~~!The libutf8 library~~

-

-~~A portable implementation of the ISO/ANSI C API, which supports 8-bit locales~~

-~~and UTF-8 locales, can be found in~~

-~~libutf8-0.7.3.tar.gz.~~

-

-~~Advantages:~~

-

-****Unicode UTF-8 support now, portably, even on OSes whose multibyte character

-~~support does not work or which don't have multibyte/wide character support~~

-~~at all.~~

-~~****~~

-

-****The same binary works in all OS supported 8-bit locales and in UTF-8 locales.

-~~****~~

-

-****When an OS vendor adds proper multibyte character support, you can take

-~~advantage of it by simply recompiling without -DHAVE_LIBUTF8 compiler option.~~

-~~****~~

-

-~~!The Plan9 way~~

-

-~~The Plan9 operating system, a variant of Unix, uses UTF-8 as character~~

-~~encoding in all applications. Its wide character type is called~~

-~~`Rune', not `wchar_t'. Parts of its libraries, written by~~

-~~Rob Pike and Howard Trickey, are available at~~

-~~ftp://ftp.cdrom.com/pub/netlib/research/9libs/9libs-1..tar.gz.~~

-~~Another similar library, written by Alistair G. Crooks, is~~

-~~ftp://ftp.cdrom.com/pub/NetBSD/packages/distfiles/libutf-2.10.tar.gz.~~

-~~In particular, each of these libraries contains a UTF-8 aware regular~~

-~~expression matcher.~~

-

-~~Drawback of this API:~~

-

-****UTF-8 is compiled in, not optional. Programs compiled in this universe lose

-~~support for the 8-bit encodings which are still frequently used in Europe.~~

-~~****~~

-

-~~!For graphical user interface~~

-

-~~The Qt-2.0 library~~

-~~http://www.troll.no/~~

-~~contains a fully-Unicode QString class. You can use the member functions~~

-~~QString::utf8 and QString::fromUtf8 to convert to/from UTF-8 encoded text.~~

-~~The QString::ascii and QString::latin1 member functions should not be used~~

-~~any more.~~

-

-~~!For advanced text handling~~

-

-~~The previously mentioned libraries implement Unicode aware versions of~~

-~~the ASCII concepts. Here are libraries which deal with Unicode concepts,~~

-~~such as titlecase (a third letter case, different from uppercase and~~

-~~lowercase), distinction between punctuation and symbols, canonical~~

-~~decomposition, combining classes, canonical ordering and the like.~~

-

-~~; __ucdata-2.4__:~~

-

-~~The ucdata library by Mark Leisher~~

-~~http://crl.nmsu.edu/~mleisher/ucdata.html~~

-~~deals with character properties, case conversion, decomposition, combining~~

-~~classes. The companion package ure-0.5~~

-~~http://crl.nmsu.edu/~mleisher/ure-.5.tar.gz~~

-~~is a Unicode regular expression matcher.~~

-

-~~; __ustring__:~~

-

-~~The ustring C++ library by Rodrigo Reyes~~

-~~http://ustring.charabia.net/~~

-~~deals with character properties, case conversion, decomposition, combining~~

-~~classes, and includes a Unicode regular expression matcher.~~

-

-~~; __ICU__:~~

-

-~~International Components for Unicode~~

-~~http://oss.software.ibm.com/icu/.~~

-~~IBM's very comprehensive internationalization library featuring Unicode strings,~~

-~~resource bundles, number formatters, date/time formatters, message formatters,~~

-~~collation and more. Lots of supported locales. Portable to Unix and Win32,~~

-~~but compiles out of the box only on Linux libc6, not libc5.~~

-

-~~; __libunicode__:~~

-

-~~The GNOME libunicode library~~

-~~http://cvs.gnome.org/lxr/source/libunicode/~~

-~~by Tom Tromey and others. It covers character set conversion, character~~

-~~properties, decomposition.~~
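The Unicode concepts named in this section can be made concrete with a short sketch. It uses Python's standard unicodedata module rather than any of the C libraries listed above, so it illustrates the concepts only, not those libraries' APIs:

```python
import unicodedata

dz = "\u01c6"                      # LATIN SMALL LETTER DZ WITH CARON
upper_dz = dz.upper()              # uppercase form
title_dz = dz.title()              # titlecase form, distinct from uppercase

# General categories distinguish punctuation from symbols.
bang_category = unicodedata.category("!")
plus_category = unicodedata.category("+")

# Canonical decomposition: a precomposed letter becomes base + combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")

# Combining classes drive canonical ordering of stacked marks.
acute_class = unicodedata.combining("\u0301")
dot_below_class = unicodedata.combining("\u0323")
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")
```

Marks with a lower combining class are sorted before marks with a higher one, which is why the dot below ends up ahead of the acute accent after normalization.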

-

-~~!For conversion~~

-

-~~Two kinds of conversion libraries, which support UTF-8 and a large number~~

-~~of 8-bit character sets, are available:~~

-

-~~!iconv~~

-

-~~The iconv implementation by Ulrich Drepper, contained in the GNU glibc-2.2.~~

-~~ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz.~~

-~~The iconv manpages are now contained in~~

-~~ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz.~~

-

-~~The portable iconv implementation by Bruno Haible.~~

-~~ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz~~

-

-~~The portable iconv implementation by Konstantin Chuguev.~~

-~~<joy@urc.ac.ru>~~

-~~ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-.4.tar.gz~~

-

-~~Advantages:~~

-

-****iconv is POSIX standardized; programs using iconv to convert from/to UTF-8

-~~will also run under Solaris. However, the names for the character sets differ~~

-~~between platforms. For example, "EUC-JP" under glibc is "eucJP" under HP-UX.~~

-~~(The official IANA name for this character set is "EUC-JP", so it's clearly~~

-~~an HP-UX deficiency.)~~

-~~****~~

-

-****On glibc-2.1 systems, no additional library is needed. On other systems, one of

-~~the two other iconv implementations can be used.~~

-~~****~~
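The kind of conversion iconv performs can be sketched in a few lines. The example below uses Python's built-in codecs instead of the iconv API, so it only illustrates the operation (decode from one charset, encode into another); the function name convert is hypothetical:

```python
def convert(data, from_charset, to_charset):
    """Re-encode bytes from one charset to another, failing on bad input."""
    # Decoding raises UnicodeDecodeError on invalid input, and encoding
    # raises UnicodeEncodeError if the target charset lacks a character.
    return data.decode(from_charset).encode(to_charset)

latin1_bytes = b"caf\xe9"                      # ISO-8859-1 encoded text
utf8_bytes = convert(latin1_bytes, "iso-8859-1", "utf-8")
back_again = convert(utf8_bytes, "utf-8", "iso-8859-1")
```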

-

-~~!librecode~~

-

-~~librecode by François Pinard~~

-~~ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz.~~

-

-~~Advantages:~~

-

-****Support for transliteration, i.e. conversion of non-ASCII characters

-~~to sequences of ASCII characters in order to preserve readability by~~

-~~humans, even when a lossless transformation is impossible.~~

-~~****~~
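A crude form of transliteration can be sketched as follows. This is an assumption-laden Python illustration, far simpler than librecode's own tables: it decomposes each character, drops the combining marks, and replaces whatever still is not ASCII with "?". The function name to_ascii is hypothetical.

```python
import unicodedata

def to_ascii(text):
    """Lossy ASCII approximation: strip accents, replace the rest with '?'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "replace").decode("ascii")

name = to_ascii("Fran\u00e7ois")
unmapped = to_ascii("\u65e5\u672c")   # no ASCII approximation available
```

The result stays readable for accented Latin text, but the transformation cannot be reversed, which is exactly the trade-off described above.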

-

-~~Drawbacks:~~

-

-****Non-standard API.

-~~****~~

-

-****Slow initialization.

-~~****~~

-

-~~!ICU~~

-

-~~International Components for Unicode 1.7~~

-~~http://oss.software.ibm.com/icu/.~~

-~~IBM's internationalization library also has conversion facilities, declared~~

-~~in `ucnv.h'.~~

-

-~~Advantages:~~

-

-****Comprehensive set of supported encodings.

-~~****~~

-

-~~Drawbacks:~~

-

-****Non-standard API.

-~~****~~

-

-~~!Other approaches~~

-

-~~; __libutf-8__:~~

-

-~~libutf-8 by G. Adam Stanislav~~

-~~<adam@whizkidtech.net>~~

-~~contains a few functions for on-the-fly conversion from/to UTF-8 encoded~~

-~~`FILE*' streams.~~

-~~http://www.whizkidtech.net/i18n/libutf-8-1..tar.gz~~

-

-~~Advantages:~~

-

-****Very small.

-~~****~~

-

-~~Drawbacks:~~

-

-****Non-standard API.

-~~****~~

-

-****UTF-8 is compiled in, not optional. Programs compiled with this library

-~~lose support for the 8-bit encodings which are still frequently used in Europe.~~

-~~****~~

-

-****Installation is nontrivial: Makefile needs tweaking, not autoconfiguring.

-~~****~~

-

-~~!!6.2 Java~~

-

-~~Java has Unicode support built into the language. The type `char' denotes~~

-~~a Unicode character, and the `java.lang.String' class denotes a string~~

-~~built up from Unicode characters.~~

-

-~~Java can display any Unicode characters through its windowing system AWT,~~

-~~provided that~~

-~~1. you set the Java system property "user.language" appropriately,~~

-~~2. the /usr/lib/java/lib/font.properties.''language'' font set~~

-~~definitions are appropriate, and~~

-~~3. the fonts specified in that file are installed.~~

-~~For example, in order to display text containing Japanese characters,~~

-~~you would install Japanese fonts and run "java -Duser.language=ja ...".~~

-~~You can combine font sets: In order to display Western European, Greek~~

-~~and Japanese characters simultaneously, you would create a combination~~

-~~of the files "font.properties" (covers ISO-8859-1), "font.properties.el"~~

-~~(covers ISO-8859-7) and "font.properties.ja" into a single file.~~

-~~??This is untested??~~

-

-~~The interfaces java.io.!DataInput and java.io.!DataOutput have methods called~~

-~~`readUTF' and `writeUTF' respectively. But note that they don't use UTF-8;~~

-~~they use a modified UTF-8 encoding: the NUL character is encoded as the~~

-~~two-byte sequence 0xC0 0x80 instead of 0x00, and the encoded string is~~

-~~preceded by a two-byte length field. Because the encoded form never contains~~

-~~a 0x00 byte, it can also be stored as a NUL-terminated C string, and the C~~

-~~<string.h> functions like strlen() and strcpy() can be used to manipulate it.~~
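The character-level part of this modified encoding can be sketched as follows. This is a Python illustration, not Java's implementation, and it deliberately leaves out any framing around the encoded bytes; the function name modified_utf8 is hypothetical. It works on 16-bit code units, as Java strings do, so characters outside the 16-bit range come out as two three-byte sequences:

```python
def modified_utf8(text):
    """Encode text so that the result never contains a 0x00 byte."""
    out = bytearray()
    units = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(units), 2):
        unit = (units[i] << 8) | units[i + 1]    # one 16-bit code unit
        if unit == 0:
            out += b"\xc0\x80"                   # NUL gets the two-byte form
        else:
            # Each code unit, surrogates included, is encoded on its own.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

with_nul = modified_utf8("A\x00B")
astral = modified_utf8("\U0001f600")
```

Since no 0x00 byte can appear in the output, a terminating NUL is unambiguous when the result is handed to C code.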

-

-~~!!6.3 Lisp~~

-

-~~The Common Lisp standard specifies two character types: `base-char' and~~

-~~`character'. It's up to the implementation to support Unicode or not.~~

-~~The language also specifies a keyword argument `:external-format' to `open',~~

-~~as the natural place to specify a character set or encoding.~~

-

-~~Among the free Common Lisp implementations, only CLISP~~

-~~http://clisp.cons.org/~~

-~~supports Unicode. You need a CLISP version from March 2000 or newer.~~

-~~ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz.~~

-~~The types `base-char' and `character' are both equivalent to 16-bit Unicode.~~

-~~The functions char-width and string-width provide an~~

-~~API comparable to wcwidth() and wcswidth().~~

-~~The encoding used for file or socket/pipe I/O can be specified through the~~

-~~`:external-format' argument. The encodings used for tty I/O and the default~~

-~~encoding for file/socket/pipe I/O are locale dependent.~~

-

-~~Among the commercial Common Lisp implementations:~~

-

-~~!LispWorks~~

-~~http://www.xanalys.com/software_tools/products/~~

-~~supports Unicode.~~

-~~The type `base-char' is equivalent to ISO-8859-1, and the type `simple-char'~~

-~~(subtype of `character') contains all Unicode characters.~~

-~~The encoding used for file I/O can be specified through the~~

-~~`:external-format' argument, for example '(:UTF-8).~~

-~~Limitations: Encodings cannot be used for socket I/O. The editor cannot edit~~

-~~UTF-8 encoded files.~~

-

-~~Eclipse~~

-~~http://www.elwood.com/eclipse/eclipse.htm~~

-~~supports Unicode. See~~

-~~http://www.elwood.com/eclipse/char.htm.~~

-~~The type `base-char' is equivalent~~

-~~to ISO-8859-1, and the type `character' contains all Unicode characters.~~

-~~The encoding used for file I/O can be specified through a combination of~~

-~~the `:element-type' and `:external-format' arguments to `open'.~~

-~~Limitations: Character attribute functions are locale dependent. Source and~~

-~~compiled source files cannot contain Unicode string literals.~~

-

-~~The commercial Common Lisp implementation Allegro CL, in version 6.0, has~~

-~~Unicode support. The types `base-char' and `character' are both equivalent~~

-~~to 16-bit Unicode. The encoding used for file I/O can be specified through the~~

-~~`:external-format' argument, for example :external-format :utf8.~~

-~~The default encoding is locale dependent. More details are at~~

-~~http://www.franz.com/support/documentation/6./doc/iacl.htm.~~

-

-~~!!6.4 Ada95~~

-

-~~Ada95 was designed for Unicode support and the Ada95 standard library~~

-~~features special ISO 10646-1 data types Wide_Character and Wide_String,~~

-~~as well as numerous associated procedures and functions. The GNU Ada95~~

-~~compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of~~

-~~wide characters. This allows you to use UTF-8 in both source code and~~

-~~application I/O. To activate it in the application, use "WCEM=8" in the~~

-~~FORM string when opening a file, and use compiler option "-gnatW8" if~~

-~~the source code is in UTF-8. See the GNAT~~

-(

-~~ftp://cs.nyu.edu/pub/gnat/)~~

-~~and Ada95~~

-(

-~~ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm)~~

-~~reference manuals for details.~~

-

-~~!!6.5 Python~~

-

-~~Python 2.0~~

-(

-~~http://www.python.org/2./,~~

-~~http://www.python.org/pipermail/python-announce-list/2000-October/000889.html,~~

-~~http://starship.python.net/crew/amk/python/writing/new-python/new-python.html)~~

-~~contains Unicode support. It has a new fundamental data type~~

-~~`unicode', representing a Unicode string, a module `unicodedata' for the~~

-~~character properties, and a set of converters for the most important encodings.~~

-~~See~~

-~~http://starship.python.net/crew/lemburg/unicode-proposal.txt,~~

-~~or the file Misc/unicode.txt in the distribution, for details.~~
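A short sketch of these facilities follows. It is written for Python 3, where the built-in str type plays the role that the `unicode' type described above plays in Python 2; that substitution is an assumption of this example, not something the text above covers:

```python
import unicodedata

euro = "\u20ac"                          # a Unicode string of one character

name = unicodedata.name(euro)            # character property lookup
as_utf8 = euro.encode("utf-8")           # converter to UTF-8
as_latin9 = euro.encode("iso-8859-15")   # converter to an 8-bit charset
round_trip = as_utf8.decode("utf-8")     # and back to a Unicode string
```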

-

-~~!!6.6 !JavaScript/ECMAscript~~

-

-~~Since !JavaScript version 1.3, strings are always Unicode. There is no~~

-~~character type, but you can use the \uXXXX notation for Unicode characters~~

-~~inside strings. No normalization is done internally, so it expects to receive~~

-~~Unicode Normalization Form C, which the W3C recommends. See~~

-~~http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode~~

-~~for details and~~

-~~http://developer.netscape.com/docs/javascript/e262-pdf.pdf~~

-~~for the complete ECMAscript specification.~~
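Why the lack of internal normalization matters can be sketched as follows. The example is in Python rather than JavaScript, purely to illustrate the concept; the helper name same_text is hypothetical. Two strings that display identically compare unequal unless both are first brought into Normalization Form C:

```python
import unicodedata

precomposed = "\u00e9"        # e with acute as a single code point
combining = "e\u0301"         # e followed by a combining acute accent

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = precomposed == combining
normalized_equal = same_text(precomposed, combining)
```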

-

-~~!!6.7 Tcl~~

-

-~~Tcl/Tk started using Unicode as its base character set with version 8.1.~~

-~~Its internal representation for strings is UTF-8. It supports the \uXXXX~~

-~~notation for Unicode characters. See~~

-~~http://dev.scriptics.com/doc/howto/i18n.html.~~

-

-~~!!6.8 Perl~~

-

-~~Perl 5.6 stores strings internally in UTF-8 format, if you write~~

-

-~~use utf8;~~

-

-~~at the beginning of your script. length() returns the number of~~

-~~characters of a string. For details, see the Perl-i18n FAQ at~~

-~~http://rf.net/~james/perli18n.html.~~

-

-~~Support for other (non-8-bit) encodings is available through the iconv~~

-~~interface module~~

-~~http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz.~~

-

-~~!!6.9 Related reading~~

-

-~~Tomohiro Kubota has written an introduction to internationalization~~

-~~http://www.debian.org/doc/manuals/intro-i18n/.~~

-~~The emphasis of his document is on writing software that runs in any locale,~~

-~~using the locale's encoding.~~

-

-~~----~~

-

-~~!!7. Other sources of information~~

-

-~~!!7.1 Mailing lists~~

-

-~~Broader audiences can be reached at the following mailing lists.~~

-

-~~Note that where I write `at', you should write `@'. (Anti-spam device.)~~

-

-~~!linux-utf8~~

-

-~~Address: linux-utf8 at nl.linux.org~~

-

-~~This mailing list is about internationalization with Unicode, and covers~~

-~~a broad range of topics from the keyboard driver to the X11 fonts.~~

-

-~~Archives are at~~

-~~http://mail.nl.linux.org/linux-utf8/.~~

-

-~~To subscribe, send a message to majordomo at nl.linux.org~~

-~~with the line "subscribe linux-utf8" in the body.~~

-

-~~!li18nux~~

-

-~~Address: linux-i18n at sun.com~~

-

-~~This mailing list is focused on organizing internationalization work on~~

-~~Linux, and arranging meetings between people.~~

-

-~~To subscribe, fill in the form at http://www.li18nux.org/~~

-~~and send it to linux-i18n-request at sun.com.~~

-

-~~!unicode~~

-

-~~Address: unicode at unicode.org~~

-

-~~This mailing list is focused on the standardization and continuing development~~

-~~of the Unicode standard, and related technologies, such as Bidi and sorting~~

-~~algorithms.~~

-

-~~Archives are at~~

-~~ftp://ftp.unicode.org/Public/!MailArchive/,~~

-~~but they are not regularly updated.~~

-

-~~For subscription information, see~~

-~~http://www.unicode.org/unicode/consortium/distlist.html.~~

-

-~~!X11 internationalization~~

-

-~~Address: i18n at xfree86.org~~

-

-~~This mailing list addresses the people who work on better internationalization~~

-~~of the X11/XFree86 system.~~

-

-~~Archives are at~~

-~~http://devel.xfree86.org/archives/i18n/.~~

-

-~~To subscribe, send mail to the friendly person at i18n-request at~~

-~~xfree86.org explaining your motivation.~~

-

-~~!X11 fonts~~

-

-~~Address: fonts at xfree86.org~~

-

-~~This mailing list addresses the people who work on Unicode fonts and the~~

-~~font subsystem for the X11/XFree86 system.~~

-

-~~Archives are at~~

-~~http://devel.xfree86.org/archives/fonts/.~~

-

-~~To subscribe, send mail to the overworked person at fonts-request at~~

-~~xfree86.org explaining your motivation.~~

-

-~~----~~

+Describe [HowToUnicodeHOWTO ] here.