Diff: UnicodeNotes - Waikato Linux Users Group

Differences between version 26 and predecessor to the previous major change of UnicodeNotes.


Newer page:	version 26	Last edited on Wednesday, January 17, 2007 2:26:01 pm	by AristotlePagaltzis
Older page:	version 25	Last edited on Monday, January 15, 2007 11:46:33 pm	by JohnMcPherson

@@ -1,28 +1,35 @@

-~~!Setting up a UTF-8 Environment in linux~~

-

-~~!!Introduction~~

Traditionally, computers have used [ASCII], a set of 128 characters, as a result of English and American heritage. Every character can be represented with a single byte. Eventually, different countries came up with their own encodings, using the same bytes to represent different characters. For example, in a common "Western" encoding, the byte 0xFD means a y with an acute accent, while in a common Turkish encoding, the byte 0xFD means a dotless i. And this is just for Latin-style encodings, without the thousands of characters needed by Asian languages.

-With computers taking over the world, something called "~~unicode~~ " was developed that (attempts to) assign a unique number to ~~every character~~ in ~~every language~~ .

- So Latin Y with acute has the ~~code 0x00FD~~ , while a Latin dotless i has the ~~code 0x0131~~ . UTF-8 (see the utf-8(7) ~~man page~~ ) is a method of encoding ~~the unicode numbers~~ in a backwards compatible way with [Legacy] systems that use ~~ascii~~ or Latin characters.

- The first 256 ~~unicode~~ characters are identical to Western Latin, of which the first 127 are identical to [ASCII]. All [ASCII] characters are represented exactly the same way in UTF-8. See the UTF-8 FAQ~~, which has the definitive version online at [~~ http://www.cl.cam.ac.uk/~mgk25/unicode.html|http://www.cl.cam.ac.uk/~mgk25/unicode.html].

-~~----~~

-!!!Creating accented characters

+With computers taking over the world, something called "Unicode" was developed that (attempts to) assign a unique number to most separate characters in most languages. By convention the number is written in hexadecimal with “U+” prepended; this is called a codepoint. So Latin y with acute “ý” has the codepoint U+00FD, while a Latin dotless i “ı” has the codepoint U+0131. [UTF-8] (see the [utf-8(7)] ManPage) is a method of encoding codepoints in a way that is backwards-compatible with [Legacy] systems that use [ASCII] or Latin characters. The first 256 Unicode characters are identical to Western Latin, of which the first 128 are identical to [ASCII]. All [ASCII] characters are represented exactly the same way in [UTF-8]. See the [UTF-8 FAQ | http://www.cl.cam.ac.uk/~mgk25/unicode.html].
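The relationship between a legacy byte, a codepoint and its UTF-8 encoding can be checked with a few lines of Python. This is only an illustrative sketch: the helper name describe is made up for this example, and ISO-8859-1 and ISO-8859-9 are assumed to be the "Western" and "Turkish" encodings meant above.

```python
# Illustrative sketch: the byte 0xFD under two legacy encodings, and the
# resulting characters as Unicode codepoints encoded in UTF-8.

def describe(char):
    """Return (codepoint label, UTF-8 bytes) for a single character."""
    return "U+%04X" % ord(char), char.encode("utf-8")

# One byte, two meanings, depending on which legacy encoding is assumed.
western = bytes([0xFD]).decode("iso-8859-1")   # assumed "Western" encoding
turkish = bytes([0xFD]).decode("iso-8859-9")   # assumed "Turkish" encoding

for ch in (western, turkish, "A"):
    label, encoded = describe(ch)
    # Only ASCII is printed, so this also works on a non-UTF-8 terminal.
    print(label, encoded.hex())
# U+00FD c3bd
# U+0131 c4b1
# U+0041 41
```

The last line shows the backwards compatibility: an ASCII character is still a single, unchanged byte in UTF-8, while the other two characters take two bytes each.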

+

+!!! Creating accented characters

+

QWERTY keyboards for English speakers obviously don't have separate keys for accented characters like other languages do. However, there are still relatively easy ways to get characters into your applications:

# Use a ‘character-picker’ applet or similar in your desktop environment. For example, in GNOME you can add a panel applet called "character palette" that offers a customisable variety of common non-ascii characters that you can click on to insert into your clipboard.

# Use a "compose" key. For example, in GNOME's keyboard preferences settings you can assign a key to be the Compose key. If the Right Alt key is the compose key, then pressing right alt + ' will make the next character have an ' accent above it, if that is a valid combination. Eg "Compose+`, e" results in è, "Compose+~~, n" results in ñ (you have to press compose + shift + ` to get the ~~), and so on.

-----

-!!!Terminals

+

+!!! Converting Text

+

+To convert text to or from a Unicode encoding (eg utf-8 or utf-16), use the iconv command. The -t argument is the "to" encoding and -f is the "from" encoding. For example:

+ $ iconv -t utf-8 -f iso-8859-1 < somefile.txt > somefile-utf8.txt

+This is a front end to the iconv(3) library (libiconv) that many recent programs use for handling character encoding and conversion.
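For comparison, the same kind of conversion can be sketched in Python, which has its own codec machinery rather than using the iconv command. The function name convert and the sample bytes are made up for this example.

```python
def convert(data, from_encoding, to_encoding):
    """Re-encode bytes, roughly like: iconv -f from_encoding -t to_encoding."""
    return data.decode(from_encoding).encode(to_encoding)

latin1_bytes = b"caf\xe9"            # "café" as ISO-8859-1 bytes
utf8_bytes = convert(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes)                     # b'caf\xc3\xa9'
```

If the input is not valid in the "from" encoding, decode() raises UnicodeDecodeError, so a wrong guess about the source encoding is not always silently accepted.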

+

+!!! Setting up a [UTF-8] environment in Linux

+

+ !! Terminals

! Testing your terminal

+

To test if your terminal already supports UTF-8, try running the following command:

-<pre>

+

+ <pre>

$ perl -e 'print chr(195) . chr(137) . "\n"'

-</pre>

+ </pre>
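What the perl command sends can be worked out in Python: bytes 195 and 137 (0xC3 0x89) are the UTF-8 encoding of a single character, a capital E with acute accent. The variable names below are only for illustration.

```python
raw = bytes([195, 137])              # the two bytes the perl command prints

as_utf8 = raw.decode("utf-8")        # what a UTF-8 terminal makes of them
as_latin1 = raw.decode("iso-8859-1") # what a Latin-1 terminal makes of them

print("U+%04X" % ord(as_utf8))       # U+00C9
print(len(as_utf8), len(as_latin1))  # 1 2
```

So a terminal that understands UTF-8 should show one accented letter, while a terminal that treats the bytes as Latin-1 is likely to show two characters or garbage.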

Copy the following text from this page and paste it into your terminal:

+

echo Árvíztűrő tükörfúrógép

</verbatim>

@@ -30,52 +37,68 @@

Some shells (notably zsh(1)) can't cope with it (and get confused if you start moving the cursor over the text), although xterm will still print the output fine.

Bash copes with it just fine too.

+! Setting up xterm for UTF-8

-~~!Setting up xterm for UTF-8~~

- To turn on UTF-8 support in xterm (it must have been compiled with utf-8 support, xterm version 145 or later), you must invoke xterm with ~~a certain option:~~

- <~~pre~~ >

- ~~$ xterm~~ -u8

- </~~pre~~ >

+To turn on UTF-8 support in xterm (it must have been compiled with utf-8 support, xterm version 145 or later), you must invoke xterm with the “<tt>-u8</tt>” option.

-To turn on UTF-8 support in gnome-terminal, you print a certain escape sequence to the terminal:

- <~~pre~~ >

- $ /bin/echo -ne '\033%G'

- </~~pre~~ >

-You will also need an X11 font that has the unicode characters you want to display. However, if your distribution comes with utf-8 enabled terminals, then it will almost certainly come with a decent default font. Try

-~~<pre>~~

- ~~$ xlsfonts | grep iso10646~~

-~~</pre>~~

-~~to see unicode fonts you have access to. You should see some listed for "misc-fixed", which is the default font used by terminals.~~

+To turn on UTF-8 support in gnome-terminal, you print a certain escape sequence to the terminal: “<tt>/bin/echo -ne '\033%G'</tt>”

-If you don't specify a font when you start xterm, it will default to "fixed". This font is an "alias" - for the specific font that it maps to, look in /usr/share/fonts/misc/fonts.alias (or /usr/X11R6/lib/fonts/misc/fonts.alias for XFree86 users):~~%%%~~

- ~~...~~

+You will also need an X11 font that has the unicode characters you want to display. However, if your distribution comes with utf-8 enabled terminals, then it will almost certainly come with a decent default font. Try “<tt>xlsfonts | grep iso10646</tt>” to see unicode fonts you have access to. You should see some listed for "misc-fixed", which is the default font used by terminals.

+

+If you don't specify a font when you start xterm, it will default to "fixed". This font is an "alias" – for the specific font that it maps to, look in <tt>/usr/share/fonts/misc/fonts.alias</tt> (or <tt>/usr/X11R6/lib/fonts/misc/fonts.alias</tt> for [XFree86] users):

+

+ <verbatim>

fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1

- ~~...~~

+ </verbatim>

You should change that to end with "-iso10646-1" instead, if you have a unicode version of the font installed. If you don't have administrator rights, you can always make your own alias file, eg put

+

+ <verbatim>

fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1

-into a file such as ~~$HOME~~ /.fonts/fonts.alias and then put this directory as the first directory on your font path:

+ </verbatim>

+

+into a file such as <tt>~~/.fonts/fonts.alias</tt> and then put this directory as the first directory on your font path:

+

+ <verbatim>

xset +fp $HOME/.fonts

-~~and now any new xterms should be able to display more non-ascii characters.~~

+ </verbatim>

+Now any new xterms should be able to display more non-[ASCII] characters.

Recent versions of xterm (eg v187) create accented Latin characters if you press a letter while the Alt key is pressed (eg alt+x gives an "ø" character). This will screw up any text-mode apps that expect alt to do something different (like emacs in text-mode). If you want the old-style behaviour, add

XTerm.vt100.metaSendsEscape: true

to your $HOME/.Xdefaults file, or run

echo 'XTerm.vt100.metaSendsEscape: true' | xrdb -merge

from a command line. (It will take effect for new xterms).

-Also, you need to re-map the Alt key to be Meta.

- Add

- ~~keysym Alt_L = Meta_L~~

-~~to your ~~/.Xmodmap file (which should be sourced on login), or run~~

- ~~xmodmap -e 'keysym Alt_L = Meta_L'~~

+Also, you need to re-map the Alt key to be Meta. Add

+ <verbatim>

+ keysym Alt_L = Meta_L

+ </verbatim>

+to your <tt>~~/.Xmodmap</tt> file (which should be sourced on login), or run

+

+ <verbatim>

+ xmodmap -e 'keysym Alt_L = Meta_L'

+ </verbatim>

+

+! UXterm

+

+The program uxterm is a shell script wrapper that sets up the locale properly then runs xterm with the right parameters.

+

+! rxvt

+

+There is a unicode enabled version of rxvt, [urxvt | http://software.schmorp.de/pkg/rxvt-unicode.html].

+

+<verbatim>

+urxvt -fn "xft:Bitstream Vera Sans Mono:pixelsize=16"

+</verbatim>

+

+!! locale

-~~!locale~~

It's a good idea to set some environment variables to tell applications

what language and encoding you prefer. In NewZealand, you should do

something like

$ LC_ALL=en_NZ.UTF-8 ; export LC_ALL
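To see which variable actually decides the character encoding, the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG) can be sketched in Python. The function names are made up for this example, and the check only looks at the locale name; it does not verify that the locale is installed.

```python
import os

def effective_ctype_locale(env):
    """Return the locale name that decides character handling.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset; with nothing set the "C" locale applies.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"

def is_utf8_locale(locale_name):
    """True if the codeset part of e.g. 'en_NZ.UTF-8' names UTF-8."""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"

current = effective_ctype_locale(os.environ)
print(current, is_utf8_locale(current))
```

With the setting shown above, this would report en_NZ.UTF-8 as the effective locale regardless of what LANG or LC_CTYPE contain.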

@@ -105,22 +128,11 @@

echo 'export LOCPATH=~/pkg/locale' >> ~/.bashrc

echo 'export LC_ALL=en_NZ.UTF-8' >> ~/.bashrc

</verbatim>

-! ~~UXterm~~

- The program ~~uxterm is a shell script wrapper that sets up the locale properly then runs xterm with the right parameters.~~

+!! The [less(1)] program

-~~! rxvt~~

-

-~~There is a unicode enabled version of rxvt, uxrvt.~~

-

-~~<verbatim>~~

-~~urxvt -fn "xft:Bitstream Vera Sans Mono:pixelsize=16"~~

-~~</verbatim>~~

-

-~~!!The "~~ less~~" program~~

-

-~~"less"~~ looks for an environment variable to determine what is a printable character. The following tells less to display characters for utf-8:

+[less(1)] looks for an environment variable to determine what is a printable character. The following tells less to display characters for utf-8:

$ LESSCHARSET=utf-8

$ export LESSCHARSET

This is not absolutely necessary -- you can give less the "-r" option to display raw characters, instead of octal codes. Or once you are viewing a file in less, you can type "-" then "r" to toggle this display on and off. If you have the environment variable set, then you can't toggle it. (Sometimes it is useful to see the raw utf-8 codes, for development purposes).

@@ -142,26 +154,22 @@

(set-terminal-coding-system 'utf-8)

to your $HOME/.emacs file.

!! vim

+

See the VimNotes page.

+!! Mail clients

-~~!! Converting Text~~

-~~To convert between unicode (eg utf-8 or utf-16), use the iconv command. The -t argument is the "to" encoding and -f is the "from" encoding. For example~~

- ~~$ iconv -t utf-8 -f iso-8859-1 < somefile.txt > somefile-utf8.txt~~

-~~This is a front end to the iconv(3) library (libiconv) that many recent programs use for handling character encoding and conversion.~~

-

-~~----~~

-~~!!!Mail clients~~

Mozilla has great charsets support, being so new. Netscape >= 4.05 has some support, but does have troubles. Mutt can do utf-8, but I haven't been able to get it to show the headers summary correctly. I don't know about kmail, balsa, or evolution, but my guess is that they are new enough to have good support.

-! !!X Fonts

+!! X Fonts

+

The easiest thing I've found to do is to get some of the excellent Microsoft true type fonts working under linux as they have put quite a bit of work into internationalisation and fonts.

If they aren't installed system wide, you can install them into $HOME/.fonts and programs using fontconfig (most modern graphical programs) will automatically find them.

At the very least, "Courier New" and "Times New Roman" are good TTF fonts to use. I personally also like "Verdana" as a sans-serif font.

-! !!File systems (Samba)

+!! File systems (Samba)

I copied a bunch of files with "non-printable" UTF-8 characters (Árvíztűrő tükörfúrógép etc) from a Samba share, using cygwin's rsync under Windows, to a vfat drive. Somewhere along the way, the encoding got changed from UTF-8 and the end result was that my ő's changed to bad squiggles or question marks, depending on what program you looked at them with.

The [convmv|http://j3e.de/linux/convmv/] utility lets you do a bulk conversion of character sets in file names:

@@ -171,9 +179,12 @@

</verbatim>

fixed my problem. Thanks to [the Unicode/charsets section of the Samba HOWTO|http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/unicode.html].
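The kind of damage described above can be reproduced in Python by decoding UTF-8 bytes with the wrong code page. This is a sketch only: cp1252 is an assumed stand-in, since the page does not say which encoding was involved, and the helper names misread and repair are invented here.

```python
def misread(text, wrong_encoding="cp1252"):
    """Encode as UTF-8, then decode as if it were a legacy code page."""
    return text.encode("utf-8").decode(wrong_encoding)

def repair(garbled, wrong_encoding="cp1252"):
    """Undo misread(): recover the original UTF-8 bytes and decode them."""
    return garbled.encode(wrong_encoding).decode("utf-8")

original = "t\u0171r\u0151"           # part of the file name used above
garbled = misread(original)

print(ascii(garbled))                  # each accented letter became two characters
print(repair(garbled) == original)     # True
```

A bulk rename tool has to do the equivalent of repair() on every file name, which only works if the wrong encoding is guessed correctly and no bytes were lost along the way.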

-For the "opposite" problem --- that is, you have a windows machine with a share that is samba-mounted onto a linux client, and the non-ascii characters are getting munged --- you need to give samba some mount options: <tt>iocharset=utf8</tt> tells samba to use a utf-8 encoding when presenting filenames to linux applications, and <tt>codepage=<i>foo</i></tt> tells samba which encoding the windows machine is using. If your accents are getting screwed up, try <tt>codepage=850</tt>.

- eg:

- smbmount //servername/sharename /mnt/point -o codepage=cp850,iocharset=utf8,password=$p

+For the "opposite" problem --- that is, you have a windows machine with a share that is samba-mounted onto a linux client, and the non-ascii characters are getting munged --- you need to give samba some mount options: <tt>iocharset=utf8</tt> tells samba to use a utf-8 encoding when presenting filenames to linux applications, and <tt>codepage=<i>foo</i></tt> tells samba which encoding the windows machine is using. If your accents are getting screwed up, try <tt>codepage=850</tt>, eg:

+

+ <verbatim>

+ smbmount //servername/sharename /mnt/point -o codepage=cp850,iocharset=utf8,password=$pass

+ </verbatim>

+

----

CategoryNotes