Setting up a UTF-8 Environment in Linux

Introduction

Traditionally, computers have used ASCII, a set of 128 characters, as a result of their English and American heritage. Every ASCII character fits in a single byte. Eventually, different countries came up with their own encodings, using the same byte values to represent different characters. For example, in a common "Western" encoding (ISO-8859-1), the byte 0xFD means a lowercase y with an acute accent, while in a common Turkish encoding (ISO-8859-9), the byte 0xFD means a dotless i. And this is just for Latin-style encodings, without the thousands of characters needed by Asian languages.

With computers taking over the world, Unicode was developed, which (attempts to) assign a unique number to every character in every language. So Latin small y with acute has the code 0x00FD, while Latin small dotless i has the code 0x0131. UTF-8 (see the utf-8(7) man page) is a method of encoding Unicode numbers that is backwards compatible with legacy systems that use ASCII. The first 256 Unicode characters are identical to Western Latin (ISO-8859-1), of which the first 128 are identical to ASCII. All ASCII characters are represented exactly the same way in UTF-8; every other character, including the accented Latin ones, takes two or more bytes. See the UTF-8 FAQ, which has the definitive version online at http://www.cl.cam.ac.uk/~mgk25/unicode.html.
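To make the encoding concrete, here is a minimal sketch in POSIX shell arithmetic of how a Unicode number is turned into UTF-8 bytes. The function name utf8_hex is hypothetical (it is not a standard tool), and the sketch assumes it is given a valid code point between 0 and 0x10FFFF.

```shell
# utf8_hex (hypothetical helper): print the UTF-8 bytes, in hex, for one
# Unicode code point. Assumes a valid code point in 0..0x10FFFF.
utf8_hex() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # ASCII range: a single byte, identical to the code point
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # two bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then
        # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}
```

With this definition, "utf8_hex 0x41" prints 41 (plain ASCII is unchanged), "utf8_hex 0xFD" prints c3 bd, and "utf8_hex 0x131" prints c4 b1, so the two characters that collided on byte 0xFD in the legacy encodings get distinct two-byte sequences.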

Terminals

To turn on UTF-8 support in xterm (version 145 or later, compiled with UTF-8 support), invoke xterm with the -u8 option:

$ xterm -u8

To turn on UTF-8 support in gnome-terminal, you print a certain escape sequence to the terminal:

$ /bin/echo -ne '\033%G'
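The same escape sequence can be sent with printf, which behaves more uniformly across shells than echo -ne. The sketch below wraps it in two functions; the names utf8_mode_on and utf8_mode_off are hypothetical, and whether a given terminal honours the sequences is an assumption you would need to check on your own setup.

```shell
# ESC % G asks the terminal to switch to UTF-8 (ISO 2022 notation).
# The doubled %% is how printf writes a literal percent sign.
utf8_mode_on() {
    printf '\033%%G'
}

# ESC % @ asks the terminal to return to its default ISO 2022 encoding.
utf8_mode_off() {
    printf '\033%%@'
}
```

Calling utf8_mode_on sends the same three bytes as the /bin/echo command above; utf8_mode_off is the counterpart for going back.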

You will also need an X11 font that has the Unicode characters you want to display. However, if your distribution comes with UTF-8-enabled terminals, then it will almost certainly come with a decent default font. Try

$ xlsfonts | grep iso10646

to see which Unicode fonts you have access to. You should see some listed for "misc-fixed", which is the default font used by terminals.

If you don't specify a font when you start xterm, it will default to "fixed". This font is an alias; for the specific font that it maps to, look in /usr/X11R6/lib/fonts/misc/fonts.alias:

... fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1 ...

You should change that to end with "-iso10646-1" instead, if you have a Unicode version of the font installed.

gnome-terminal from GNOME 2 seems to be different: it appears to let you choose a font by name only (i.e. you can't specify which encoding to use). I'll update this when I figure it out...

Locale

It's a good idea to set some environment variables to tell applications what language and encoding you prefer. In New Zealand, you should do something like

$ LC_ALL=en_NZ.UTF-8 ; export LC_ALL

(This requires your system to have the correct support for this locale; if it doesn't then the administrator can add "en_NZ.UTF-8 UTF-8" to /etc/locale.gen and run locale-gen(8), which is in the "locales" package.)
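To check what a shell session will actually tell applications, the following sketch reports which locale variable wins and whether it names a UTF-8 locale. The function names effective_ctype and is_utf8_locale are hypothetical. The check only inspects the variable's text (a ".UTF-8" or ".utf8" suffix); it assumes, and does not verify, that the locale has been generated on the system.

```shell
# effective_ctype (hypothetical helper): print the locale name that governs
# character handling. Precedence: LC_ALL, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    else
        # With nothing set, programs fall back to the "C" locale
        printf '%s\n' "${LANG:-C}"
    fi
}

# is_utf8_locale (hypothetical helper): succeed if that name carries a
# UTF-8 codeset suffix, optionally followed by an @modifier.
is_utf8_locale() {
    case "$(effective_ctype)" in
        *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}
```

After the export above, "is_utf8_locale && echo yes" should print yes. Note that LC_ALL overrides every other locale variable, so an LC_CTYPE or LANG setting has no effect while LC_ALL is set.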

The system administrator can make this the default by putting

LC_ALL=en_NZ.UTF-8

into /etc/environment (create it if it doesn't already exist).

The program uxterm is a shell script wrapper that sets up the locale properly then runs xterm with the right parameters.

The "less" program

"less" looks at an environment variable to decide which characters are printable. The following tells less to display UTF-8 characters:

$ LESSCHARSET=utf-8 ; export LESSCHARSET

This is not absolutely necessary: you can give less the "-r" option to display raw characters instead of octal codes. Or, once you are viewing a file in less, you can type "-" then "r" to toggle this display on and off. If you have the environment variable set, then you can't toggle it. (Sometimes it is useful to see the UTF-8 byte codes rather than the characters, for development purposes.)

Mail clients

Mozilla has great charset support, being so new. Netscape >= 4.05 has some support, but does have problems. Mutt can do UTF-8, but I haven't been able to get it to show the header summary correctly. I don't know about kmail, balsa, or evolution, but my guess is that they are new enough to have good support.

X Fonts

The easiest thing I've found to do is to get some of the excellent Microsoft TrueType fonts working under Linux (see HowToTTDebian or HowToTTXFree86), as Microsoft has put quite a bit of work into internationalisation and fonts. At the very least, "Courier New" and "Times New Roman" are good TrueType fonts to use. I personally also like "Verdana" as a sans-serif font.

Text

To convert text between Unicode encodings (e.g. UTF-8 or UTF-16) and other character sets, use the iconv command. The -t argument is the "to" encoding and -f is the "from" encoding. For example

$ iconv -t utf-8 -f iso-8859-1 < somefile.txt > somefile-utf8.txt

This is a front end to the iconv(3) library (libiconv) that many recent programs use for handling character encoding and conversion.
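iconv also makes it easy to see the introduction's example in action: the same input byte comes out as different UTF-8 sequences depending on the "from" encoding. This is a small sketch; to_utf8_hex is a hypothetical helper name, and it assumes your iconv knows the ISO-8859-1 and ISO-8859-9 encodings and that od is available.

```shell
# to_utf8_hex (hypothetical helper): read stdin in the encoding named by $1,
# convert it to UTF-8, and print the resulting bytes as hex.
to_utf8_hex() {
    iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# The single byte 0xFD (octal 375), interpreted two different ways:
printf '\375' | to_utf8_hex ISO-8859-1   # Western: y with acute, U+00FD
echo
printf '\375' | to_utf8_hex ISO-8859-9   # Turkish: dotless i, U+0131
echo
```

The first conversion should print c3 bd and the second c4 b1, which is why you must tell iconv the correct source encoding: it has no way to guess which character a legacy byte was meant to be.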

Perl

Perl 5.8 has significantly improved Unicode/UTF-8 handling over earlier versions. See the perllocale(1) and perlunicode(1) man pages. Once set to use Unicode, functions like lc/uc (lower/upper case) and regular expression character classes (space/printable/upper/lower etc.) will work as you'd expect.

Perhaps the most important tip is that by default, filehandles (including STDIN) are assumed to be Latin1/iso-8859-1 (most likely for backwards-compatibility?). Add

use encoding 'utf8';

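to your script to change the default string encoding. (Later Perl releases deprecate the encoding pragma; there, use open qw(:std :utf8); makes UTF-8 the default for filehandles, including STDIN.)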


See also the HowToUnicodeHOWTO for a decent introduction and list of support in various applications.