Differences between version 19 and predecessor to the previous major change of UnicodeNotes.

Newer page: version 19 Last edited on Wednesday, April 13, 2005 2:30:03 pm by JohnMcPherson
Older page: version 10 Last edited on Tuesday, August 3, 2004 8:50:03 pm by JohnMcPherson
@@ -6,27 +6,70 @@
 With computers taking over the world, something called "unicode" was developed that (attempts to) assign a unique number to every character in every language. 
 So Latin Y with acute has the code 0x00FD, while a Latin dotless i has the code 0x0131. UTF-8 (see the utf-8(7) man page) is a method of encoding the unicode numbers in a backwards compatible way with [Legacy] systems that use ascii or Latin characters. 
 The first 256 unicode characters are identical to Western Latin (ISO-8859-1), of which the first 128 are identical to [ASCII]. All [ASCII] characters are represented exactly the same way in UTF-8. See the UTF-8 FAQ, which has the definitive version online at [http://www.cl.cam.ac.uk/~mgk25/unicode.html|http://www.cl.cam.ac.uk/~mgk25/unicode.html]. 
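
As a rough illustration of the encoding described above, the following shell sketch turns a code point into its UTF-8 bytes using plain integer arithmetic. The function name utf8_bytes is made up for this example, and the sketch only handles code points up to U+FFFF.

```shell
# Hypothetical helper: print the UTF-8 bytes (in hex) for one code point.
# Assumption: the code point is given as a number (e.g. 0x00FD) and is
# no larger than U+FFFF.
utf8_bytes() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # ASCII range: one byte, unchanged
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # two bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
    else
        # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_bytes 0x41     # ASCII "A" stays a single byte: 41
utf8_bytes 0x00FD   # y with acute: c3 bd
utf8_bytes 0x0131   # dotless i: c4 b1
```

Anything below 128 comes out as the same single byte, which is why plain ASCII files are already valid UTF-8.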
  
+----  
 !!Terminals 
  
-To turn on UTF-8 support in xterm (must have been compiled with utf-8 support, xterm version 145 or later), you must invoke xterm with a certain option:%%%  
+! Testing your terminal  
+To test if your terminal already supports UTF-8, try running the following command:  
+<pre>  
+ $ perl -e 'print chr(195) . chr(137) . "\n"'  
+ É  
+</pre>  
+  
+Copy the following text from this page and paste it into your terminal:  
+<verbatim>  
+echo Árvíztűrő tükörfúrógép  
+</verbatim>  
+  
+If everything is working, you should see it both on the shell's input line and in the xterm's output. If it doesn't work, then the problem might be with the terminal, with the locale, or the lack of a fixed font that has those characters.  
+  
+Some shells (notably zsh(1)) can't cope with it and get confused if you start moving the cursor over the text, although xterm will still print the output fine.  
+Bash copes with it just fine.  
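
If the test fails, the locale is usually the quickest of the three suspects to rule out. The sketch below only looks at the name of the locale in your environment; the helper name names_utf8_locale is invented for this example, and it assumes the usual "language_COUNTRY.codeset" naming.

```shell
# Hypothetical helper: succeed if a locale name such as "en_NZ.UTF-8"
# names a UTF-8 codeset. Only the name is inspected; nothing is queried
# from the system.
names_utf8_locale() {
    case "$1" in
        *.[Uu][Tt][Ff]-8*|*.[Uu][Tt][Ff]8*) return 0 ;;
        *) return 1 ;;
    esac
}

# LC_ALL overrides LC_CTYPE, which in turn overrides LANG.
current=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
if names_utf8_locale "$current"; then
    echo "locale looks like UTF-8: $current"
else
    echo "locale does not look like UTF-8: ${current:-unset}"
fi
```

Running "locale charmap" is another way to see which character set the current locale uses.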
+  
+  
+!Setting up xterm for UTF-8  
+ To turn on UTF-8 support in xterm (it must have been compiled with utf-8 support, xterm version 145 or later), you must invoke xterm with a certain option:  
+<pre>  
  $ xterm -u8 
+</pre>  
  
-To turn on UTF-8 support in gnome-terminal, you print a certain escape sequence to the terminal:%%%  
+To turn on UTF-8 support in gnome-terminal, you print a certain escape sequence to the terminal:  
+<pre>  
  $ /bin/echo -ne '\033%G' 
-  
-You will also need an X11 font that has the unicode characters you want to display. However, if your distribution comes with utf-8 enabled terminals, then it will almost certainly come with a decent default font. Try%%%  
- $ xlsfonts | grep iso10646%%%  
+</pre>  
+You will also need an X11 font that has the unicode characters you want to display. However, if your distribution comes with utf-8 enabled terminals, then it will almost certainly come with a decent default font. Try  
+<pre>  
+ $ xlsfonts | grep iso10646  
+</pre>  
 to see unicode fonts you have access to. You should see some listed for "misc-fixed", which is the default font used by terminals. 
  
-If you don't specify a font when you start xterm, it will default to "fixed". This font is an "alias" - for the specific font that it maps to, look in /usr/X11R6/lib/fonts/misc/fonts.alias: 
+If you don't specify a font when you start xterm, it will default to "fixed". This font is an "alias" - for the specific font that it maps to, look in /usr/X11R6/lib/fonts/misc/fonts.alias:%%%  
  ... 
  fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1 
  ... 
-You should change that to end with "-iso10646-1" instead, if you have a unicode version of the font installed.  
  
-gnome-terminal from GNOME 2 seems to be different - it appears to restrict your choices of font to name only (ie you can't specify which encoding to use). I'll update this when I figure it out ... 
+You should change that to end with "-iso10646-1" instead, if you have a unicode version of the font installed. If you don't have administrator rights, you can always make your own alias file, eg put  
+ fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1  
+into a file such as $HOME/.fonts/fonts.alias and then put that directory ($HOME/.fonts) first on your font path:  
+ xset +fp $HOME/.fonts  
+and now any new xterms should be able to display more non-ascii characters.  
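
Rather than typing the alias by hand, the existing alias line can be rewritten with sed. This is only a sketch: the function name to_unicode_alias is made up here, and it assumes an -iso10646-1 version of the same font really is installed.

```shell
# Hypothetical helper: read fonts.alias lines on stdin and switch any alias
# whose target ends in -iso8859-1 to the -iso10646-1 version of the font.
# Assumption: the unicode version of each font exists on this system.
to_unicode_alias() {
    sed 's/-iso8859-1[[:space:]]*$/-iso10646-1/'
}

echo 'fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1' | to_unicode_alias
```

Lines that don't end in -iso8859-1 pass through unchanged, so the output can be redirected straight into your own fonts.alias file.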
+  
+  
+Recent versions of xterm (eg v187) create accented Latin characters if you press a letter while the Alt key is pressed (eg alt+x gives an "ø" character). This will screw up any text-mode apps that expect alt to do something different (like emacs in text-mode). If you want the old-style behaviour, add  
+ XTerm.vt100.metaSendsEscape: true  
+to your $HOME/.Xdefaults file, or run  
+ echo 'XTerm.vt100.metaSendsEscape: true' | xrdb -merge  
+from a command line. (It will take effect for new xterms).  
+  
+Also, you need to re-map the Alt key to be Meta.  
+Add  
+ keysym Alt_L = Meta_L  
+to your ~~/.Xmodmap file (which should be sourced on login), or run  
+ xmodmap -e 'keysym Alt_L = Meta_L'  
+  
+  
  
 !locale 
 It's a good idea to set some environment variables to tell applications 
 what language and encoding you prefer. In NewZealand, you should do 
@@ -39,29 +82,39 @@
 package.) 
  
 The system administrator can make this the default by putting 
  LC_ALL=en_NZ.UTF-8 
-into /etc/environment (create it if it doesn't already exist). 
+into /etc/environment (create it if it doesn't already exist; note that this file may be Debian-specific). 
  
+As well as getting utf-8 support, this has the added advantage that locale-aware applications  
+will use the correct currency symbol, unit separator, date formatting etc for your locale.  
+(Eg, MozillaMail will show dates as dd/mm/yyyy instead of the default US mm/dd/yyyy)  
+  
+If you don't have a friendly administrator or can't otherwise get root permissions, you should still be  
+able to generate a locale yourself if it isn't already installed:  
+  
+1. Generate a locale, giving an encoding, a locale, and an output directory:  
+<verbatim>  
+ mkdir -p ~/pkg/locale/ && localedef -f UTF-8 -i en_NZ ~/pkg/locale/en_NZ.UTF-8  
+</verbatim>  
+2. Set your LOCPATH environment variable to point to the correct directory:  
+<verbatim>  
+ echo 'export LOCPATH=~/pkg/locale' >> ~/.bashrc  
+ echo 'export LC_ALL=en_NZ.UTF-8' >> ~/.bashrc  
+</verbatim>  
+  
+! UXterm  
 The program uxterm is a shell script wrapper that sets up the locale properly then runs xterm with the right parameters. 
  
+  
+----  
+!!!Terminal programs  
 !!The "less" program 
  
 "less" looks for an environment variable to determine what is a printable character. The following tells less to display characters for utf-8: 
  $ LESSCHARSET=utf-8 
  $ export LESSCHARSET 
 This is not absolutely necessary -- you can give less the "-r" option to display raw characters instead of octal codes. Or, once you are viewing a file in less, you can type "-" then "r" to toggle this display on and off. If you have the environment variable set, then you can't toggle it. (Sometimes it is useful to see the raw utf-8 codes, for development purposes). 
-  
-!!Mail clients  
-Mozilla has great charsets support, being so new. Netscape >= 4.05 has some support, but does have troubles. Mutt can do utf-8, but I haven't been able to get it to show the headers summary correctly. I don't know about kmail, balsa, or evolution, but my guess is that they are new enough to have good support.  
-  
-!!X Fonts  
-The easiest thing I've found to do is to get some of the excellent Microsoft true type fonts working under linux (see [HowToTTDebian] or [HowToTTXFree86]) as they have put quite a bit of work into internationalisation and fonts. At the very least, "Courier new" and "Times New Roman" are good TTF fonts to use. I personally also like "Verdana" as a sans-serif font.  
-  
-!! Text  
-To convert between unicode (eg utf-8 or utf-16), use the iconv command. The -t argument is the "to" encoding and -f is the "from" encoding. For example  
- $ iconv -t utf-8 -f iso-8859-1 < somefile.txt > somefile-utf8.txt  
-This is a front end to the iconv(3) library (libiconv) that many recent programs use for handling character encoding and conversion.  
  
 !! Perl 
 perl 5.8 has significantly improved unicode/utf-8 handling over earlier versions. See the perllocale(1) and perlunicode(1) man pages. 
 Once set to use unicode, commands like lc/uc (lower/upper case) and 
@@ -71,7 +124,34 @@
 Perhaps the most important tip is that by default, filehandles (including stdin(3)) are assumed to be Latin1/iso-8859-1 (most 
 likely for backwards-compatibility?). Add 
  use encoding 'utf8'; 
 to your script to change the default string encoding. 
+  
+!! emacs  
+  
+If you do want alt+letters to create accented characters, __don't__ use xmodmap to remap Alt to Meta (as described above), and add:  
+ (set-keyboard-coding-system 'utf-8)  
+ (set-terminal-coding-system 'utf-8)  
+to your $HOME/.emacs file.  
+  
+!! vim  
+See the VimNotes page.  
+  
+  
+!! Converting Text  
+To convert between unicode encodings (eg utf-8 or utf-16) and other character sets, use the iconv command. The -t argument is the "to" encoding and -f is the "from" encoding. For example  
+ $ iconv -t utf-8 -f iso-8859-1 < somefile.txt > somefile-utf8.txt  
+This is a front end to the iconv(3) library (libiconv) that many recent programs use for handling character encoding and conversion.  
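
Before converting a file it helps to know whether it is already utf-8. One common trick, sketched below, is to ask iconv to convert from utf-8 to utf-8 and look at the exit status. The function name is_valid_utf8 is invented for this example.

```shell
# Hypothetical helper: succeed if stdin is valid UTF-8.
# iconv exits with a non-zero status when it meets a byte sequence that is
# not legal in the "from" encoding, so a utf-8 to utf-8 pass acts as a check.
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 >/dev/null 2>&1
}

# "café" encoded as UTF-8 (c3 a9), then as Latin-1 (a lone e9 byte)
printf 'caf\303\251\n' | is_valid_utf8 && echo "valid UTF-8"
printf 'caf\351\n' | is_valid_utf8 || echo "not UTF-8 (probably Latin-1)"
```

The check cannot say which legacy encoding a non-UTF-8 file uses; iso-8859-1 in the message above is only a guess that suits Western European text.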
+  
+  
+!!!Mail clients  
+Mozilla has great charset support, being so new. Netscape >= 4.05 has some support, but does have troubles. Mutt can do utf-8, but I haven't been able to get it to show the headers summary correctly. I don't know about kmail, balsa, or evolution, but my guess is that they are new enough to have good support.  
+  
+!!!X Fonts  
+The easiest thing I've found to do is to get some of the excellent Microsoft true type fonts working under linux as they have put quite a bit of work into internationalisation and fonts.  
+If they aren't installed system wide, you can install them into $HOME/.fonts and programs using fontconfig (most modern graphical programs) will automatically find them.  
+At the very least, "Courier new" and "Times New Roman" are good TTF fonts to use. I personally also like "Verdana" as a sans-serif font.  
+  
+  
  
 ---- 
-See also the [HowToUnicodeHOWTO] for a decent introduction and list of support in various applications.  
+CategoryNotes