Blame: UnicodeNotes
Annotated edit history of UnicodeNotes version 28, including all changes.
Rev Author # Line
18 JohnMcPherson 1 Traditionally, computers have used [ASCII], a set of 128 characters, as a result of their English and American heritage. Every character can be represented with a single byte. Eventually, different countries came up with their own encodings, using the same bytes to represent different characters. For example, in a common "Western" encoding, the byte 0xFD means a Y with an acute accent, while in a common Turkish encoding, the byte 0xFD means a dotless i. And this is just for Latin-style encodings, without the thousands of characters needed by Asian languages.
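The ambiguity described above can be reproduced in a few lines of Python. This is only a sketch: it assumes that the "Western" encoding is ISO-8859-1 and the Turkish one is ISO-8859-9, which the text does not name explicitly.

```python
# One byte, two meanings, depending on which legacy encoding is assumed.
# ISO-8859-1 and ISO-8859-9 are assumptions; the text only says
# "a common Western encoding" and "a common Turkish encoding".
byte = bytes([0xFD])

western = byte.decode("iso-8859-1")  # y with acute accent
turkish = byte.decode("iso-8859-9")  # dotless i

print(repr(western), repr(turkish))
```

The byte itself carries no record of which encoding produced it, so the reader has to be told out of band.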
2
26 AristotlePagaltzis 3 With computers taking over the world, something called "Unicode" was developed that (attempts to) assign a unique number to most separate characters in most languages. By convention the number is written in hexadecimal with “U+” prepended; this is called a codepoint. So Latin y with acute “ý” has the codepoint U+00FD, while a Latin dotless i “ı” has the codepoint U+0131. [UTF-8] (see the [utf-8(7)] ManPage) is a method of encoding codepoints in a backwards-compatible way with [Legacy] systems that use [ASCII] or Latin characters. The first 256 Unicode characters are identical to Western Latin (ISO-8859-1), of which the first 128 are identical to [ASCII]. All [ASCII] characters are represented exactly the same way in [UTF-8]. See the [UTF-8 FAQ | http://www.cl.cam.ac.uk/~mgk25/unicode.html].
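As a rough illustration of codepoints versus their UTF-8 encoding, the following Python sketch prints both for the characters mentioned above. The helper name <tt>describe</tt> is made up for this example.

```python
def describe(ch):
    """Return the U+XXXX codepoint label and the UTF-8 bytes of one character.

    Hypothetical helper, written only for this illustration.
    """
    return "U+%04X" % ord(ch), ch.encode("utf-8")


for ch in ("y", "\u00fd", "\u0131"):
    label, utf8 = describe(ch)
    # Show each byte in hex so multi-byte sequences are visible.
    print(label, " ".join("%02x" % b for b in utf8))
```

The plain "y" stays a single byte identical to its ASCII value, while the two non-ASCII characters each take two bytes.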
4
5 !!! Creating accented characters
6
23 JohnMcPherson 7 QWERTY keyboards for English speakers don't have separate keys for accented characters the way keyboards for other languages do. However, there are still relatively easy ways to get these characters into your applications:
8 # Use a ‘character-picker’ applet or similar in your desktop environment. For example, in GNOME you can add a panel applet called "character palette" that offers a customisable variety of common non-ascii characters that you can click on to insert into your clipboard.
28 LawrenceDoliveiro 9 # Use a [“compose” key|ComposeKey]. For example, in GNOME's keyboard preferences settings you can assign a key to be the Compose key. If the Right Alt key is the compose key, then pressing right alt + ' will make the next character have an ' accent above it, if that is a valid combination. Eg "Compose+`, e" results in è, "Compose+~~, n" results in ñ (you have to press compose + shift + ` to get the ~~), and so on.
26 AristotlePagaltzis 10
11 !!! Converting Text
12
13 To convert text between character encodings (eg from iso-8859-1 to a Unicode encoding such as utf-8 or utf-16), use the iconv command. The -t argument is the "to" encoding and -f is the "from" encoding. For example
14 $ iconv -t utf-8 -f iso-8859-1 < somefile.txt > somefile-utf8.txt
15 This is a front end to the iconv(3) library (libiconv) that many recent programs use for handling character encoding and conversion.
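For comparison, the same kind of conversion can be sketched in Python, which has its own codec machinery and does not call iconv. The function name <tt>transcode</tt> is an assumption made for this example.

```python
def transcode(data, from_enc, to_enc):
    """Re-encode bytes: decode with the source encoding, encode with the target.

    Hypothetical helper mirroring "iconv -f from_enc -t to_enc".
    """
    return data.decode(from_enc).encode(to_enc)


latin1_bytes = b"caf\xe9"  # "café" stored as iso-8859-1
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes)
```

Both steps can fail: decoding raises an error if the input is not valid in the "from" encoding, and encoding raises one if the target encoding lacks a character.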
16
17 !!! Setting up a [UTF-8] environment in Linux
18
19 !! Terminals
18 JohnMcPherson 20
19 JohnMcPherson 21 ! Testing your terminal
26 AristotlePagaltzis 22
19 JohnMcPherson 23 To test if your terminal already supports UTF-8, try running the following command:
26 AristotlePagaltzis 24
25 <pre>
19 JohnMcPherson 26 $ perl -e 'print chr(195) . chr(137) . "\n"'
27 É
26 AristotlePagaltzis 28 </pre>
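Why those two numbers? Bytes 195 and 137 (0xC3 0x89) are the UTF-8 encoding of É (U+00C9). A small Python sketch of the same check, also showing what a terminal that assumes Latin-1 would make of the bytes:

```python
raw = bytes([195, 137])  # the two bytes printed by the perl one-liner

as_utf8 = raw.decode("utf-8")      # one character: capital E with acute
as_latin1 = raw.decode("latin-1")  # two characters: A with tilde, then a control code

print(as_utf8, "U+%04X" % ord(as_utf8))
print(len(as_latin1), repr(as_latin1))
```

So if you see something like "Ã" followed by junk instead of a single É, the terminal is most likely interpreting the output as a single-byte encoding.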
19 JohnMcPherson 29
30 Copy the following text from this page and paste it into your terminal:
26 AristotlePagaltzis 31
19 JohnMcPherson 32 <verbatim>
33 echo Árvíztűrő tükörfúrógép
34 </verbatim>
35
36 If everything is working, you should see it both on the shell's input line and in the xterm's output. If it doesn't work, then the problem might be with the terminal, with the locale, or the lack of a fixed font that has those characters.
37
38 Some shells (notably zsh(1)) can't cope with it (and get confused if you start moving the cursor over the text), although xterm will still print the output fine.
39 Bash copes with it just fine.
40
26 AristotlePagaltzis 41 ! Setting up xterm for UTF-8
19 JohnMcPherson 42
26 AristotlePagaltzis 43 To turn on UTF-8 support in xterm (it must have been compiled with utf-8 support, xterm version 145 or later), you must invoke xterm with the “<tt>-u8</tt>” option.
18 JohnMcPherson 44
26 AristotlePagaltzis 45 To turn on UTF-8 support in gnome-terminal, you print a certain escape sequence to the terminal: “<tt>/bin/echo -ne '\033%G'</tt>”
18 JohnMcPherson 46
26 AristotlePagaltzis 47 You will also need an X11 font that has the unicode characters you want to display. However, if your distribution comes with utf-8 enabled terminals, then it will almost certainly come with a decent default font. Try “<tt>xlsfonts | grep iso10646</tt>” to see unicode fonts you have access to. You should see some listed for "misc-fixed", which is the default font used by terminals.
48
49 If you don't specify a font when you start xterm, it will default to "fixed". This font is an "alias" – for the specific font that it maps to, look in <tt>/usr/share/fonts/misc/fonts.alias</tt> (or <tt>/usr/X11R6/lib/fonts/misc/fonts.alias</tt> for [XFree86] users):
50
51 <verbatim>
18 JohnMcPherson 52 fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso8859-1
26 AristotlePagaltzis 53 </verbatim>
19 JohnMcPherson 54
18 JohnMcPherson 55 You should change that to end with "-iso10646-1" instead, if you have a unicode version of the font installed. If you don't have administrator rights, you can always make your own alias file, eg put
26 AristotlePagaltzis 56
57 <verbatim>
18 JohnMcPherson 58 fixed -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
26 AristotlePagaltzis 59 </verbatim>
60
61 into a file such as <tt>~~/.fonts/fonts.alias</tt> and then put this directory as the first directory on your font path:
62
63 <verbatim>
18 JohnMcPherson 64 xset +fp $HOME/.fonts
26 AristotlePagaltzis 65 </verbatim>
18 JohnMcPherson 66
26 AristotlePagaltzis 67 Now any new xterms should be able to display more non-[ASCII] characters.
18 JohnMcPherson 68
69 Recent versions of xterm (eg v187) create accented Latin characters if you press a letter while the Alt key is pressed (eg alt+x gives an "ø" character). This will screw up any text-mode apps that expect alt to do something different (like emacs in text-mode). If you want the old-style behaviour, add
70 XTerm.vt100.metaSendsEscape: true
71 to your $HOME/.Xdefaults file, or run
72 echo 'XTerm.vt100.metaSendsEscape: true' | xrdb -merge
73 from a command line. (It will take effect for new xterms).
74
26 AristotlePagaltzis 75 Also, you need to re-map the Alt key to be Meta. Add
18 JohnMcPherson 76
26 AristotlePagaltzis 77 <verbatim>
78 keysym Alt_L = Meta_L
79 </verbatim>
18 JohnMcPherson 80
26 AristotlePagaltzis 81 to your <tt>~~/.Xmodmap</tt> file (which should be sourced on login), or run
82
83 <verbatim>
84 xmodmap -e 'keysym Alt_L = Meta_L'
85 </verbatim>
86
87 ! UXterm
88
89 The program uxterm is a shell script wrapper that sets up the locale properly then runs xterm with the right parameters.
90
91 ! rxvt
92
93 There is a unicode enabled version of rxvt, [urxvt | http://software.schmorp.de/pkg/rxvt-unicode.html].
94
95 <verbatim>
96 urxvt -fn "xft:Bitstream Vera Sans Mono:pixelsize=16"
97 </verbatim>
98
99 !! locale
18 JohnMcPherson 100
101 It's a good idea to set some environment variables to tell applications
27 LawrenceDoliveiro 102 what [language and encoding|LocaleName] you prefer. In NewZealand, you should do
18 JohnMcPherson 103 something like
104 $ LC_ALL=en_NZ.UTF-8 ; export LC_ALL
105
106 (This requires your system to have the correct support for this locale;
107 if it doesn't then the administrator can add "en_NZ.UTF-8 UTF-8" to
108 /etc/locale.gen and run locale-gen(8), which is in the "locales"
109 package.)
110
111 The system administrator can make this the default by putting
112 LC_ALL=en_NZ.UTF-8
19 JohnMcPherson 113 into /etc/environment (create it if it doesn't already exist - Note that this file might possibly be Debian-specific).
18 JohnMcPherson 114
115 As well as getting utf-8 support, this has the added advantage that locale-aware applications
116 will use the correct currency symbol, unit separator, date formatting etc for your locale.
117 (Eg, MozillaMail will show dates as dd/mm/yyyy instead of the default US mm/dd/yyyy)
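To see which locale name actually governs character handling, remember that LC_ALL overrides LC_CTYPE, which overrides LANG. The Python sketch below applies that precedence to a dictionary of environment variables; both function names are invented for this example, and it only inspects the variable values without checking whether the locale is installed.

```python
import os


def effective_ctype_locale(env):
    """Return the locale name that decides the character encoding.

    Precedence is LC_ALL, then LC_CTYPE, then LANG; empty values count
    as unset. Falls back to "C" when nothing is set.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def selects_utf8(env):
    """Guess from the locale name alone whether it asks for UTF-8."""
    name = effective_ctype_locale(env)
    # Locale names look like language_TERRITORY.CODESET@modifier
    codeset = name.partition("@")[0].partition(".")[2]
    return codeset.lower().replace("-", "") == "utf8"


print(effective_ctype_locale(os.environ), selects_utf8(os.environ))
```

This also shows why a stray LC_ALL=C somewhere in your startup files will defeat a UTF-8 setting in LANG.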
118
119 If you don't have a friendly administrator or can't otherwise get root permissions, you should still be
19 JohnMcPherson 120 able to generate a locale yourself if it isn't already installed:
18 JohnMcPherson 121
122 1. generate a locale giving an encoding, a locale, and an output directory:
19 JohnMcPherson 123 <verbatim>
18 JohnMcPherson 124 mkdir -p ~/pkg/locale/ && localedef -f UTF-8 -i en_NZ ~/pkg/locale/en_NZ.UTF-8
19 JohnMcPherson 125 </verbatim>
18 JohnMcPherson 126 2. Set your LOCPATH environment variable to point to the correct directory
19 JohnMcPherson 127 <verbatim>
18 JohnMcPherson 128 echo 'export LOCPATH=~/pkg/locale' >> ~/.bashrc
129 echo 'export LC_ALL=en_NZ.UTF-8' >> ~/.bashrc
19 JohnMcPherson 130 </verbatim>
18 JohnMcPherson 131
26 AristotlePagaltzis 132 !!The [less(1)] program
18 JohnMcPherson 133
26 AristotlePagaltzis 134 [less(1)] looks for an environment variable to determine what is a printable character. The following tells less to display characters for utf-8:
18 JohnMcPherson 135 $ LESSCHARSET=utf-8
136 $ export LESSCHARSET
137 This is not absolutely necessary -- you can give less the "-r" option to display raw characters, instead of octal codes. Or once you are viewing a file in less, you can type "-" then "r" to toggle this display on and off. If you have the environment variable set, then you can't toggle it. (Sometimes it is useful to see the raw utf-8 codes, for development purposes).
138
139 !! Perl
140 perl 5.8 has significantly improved unicode/utf-8 handling over earlier versions. See the perllocale(1) and perlunicode(1) man pages.
141 Once set to use unicode, commands like lc/uc (lower/upper case) and
142 RegularExpression character classes (space/printable/upper/lower etc)
143 will work as you'd expect.
144
145 Perhaps the most important tip is that by default, filehandles (including stdin(3)) are assumed to be Latin1/iso-8859-1 (most
146 likely for backwards-compatibility?). Add
147 use encoding 'utf8';
148 to your script to change the default string encoding.
149
150 !! emacs
151
152 If you do want alt+letters to create accented characters, __don't__ use xmodmap to remap Alt to Meta (as described above), and add:
153 (set-keyboard-coding-system 'utf-8)
154 (set-terminal-coding-system 'utf-8)
155 to your $HOME/.emacs file.
156
157 !! vim
26 AristotlePagaltzis 158
18 JohnMcPherson 159 See the VimNotes page.
160
26 AristotlePagaltzis 161 !! Mail clients
18 JohnMcPherson 162
163 Mozilla has great charsets support, being so new. Netscape >= 4.05 has some support, but does have troubles. Mutt can do utf-8, but I haven't been able to get it to show the headers summary correctly. I don't know about kmail, balsa, or evolution, but my guess is that they are new enough to have good support.
164
26 AristotlePagaltzis 165 !! X Fonts
166
18 JohnMcPherson 167 The easiest thing I've found to do is to get some of the excellent Microsoft true type fonts working under linux as they have put quite a bit of work into internationalisation and fonts.
168 If they aren't installed system wide, you can install them into $HOME/.fonts and programs using fontconfig (most modern graphical programs) will automatically find them.
169 At the very least, "Courier new" and "Times New Roman" are good TTF fonts to use. I personally also like "Verdana" as a sans-serif font.
170
26 AristotlePagaltzis 171 !! File systems (Samba)
18 JohnMcPherson 172
22 CraigBox 173 I copied a bunch of files with "non-printable" UTF-8 characters (Árvíztűrő tükörfúrógép etc) from a Samba share, using cygwin's rsync under Windows, to a vfat drive. Somewhere along the way, the encoding got changed from UTF-8 and the end result was that my ő's changed to bad squiggles or question marks, depending on what program you looked at them with.
174
175 The [convmv|http://j3e.de/linux/convmv/] utility lets you do a bulk conversion of character sets in file names:
176
177 <verbatim>
23 JohnMcPherson 178 ./convmv -r -f latin1 -t utf8 --notest /array/images/mp3/albums/*
22 CraigBox 179 </verbatim>
180
181 fixed my problem. Thanks to [the Unicode/charsets section of the Samba HOWTO|http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/unicode.html].
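What a filename conversion like this does to each name can be sketched in Python. This is not how convmv is implemented; the helpers <tt>convert_name</tt> and <tt>is_valid_utf8</tt> are hypothetical, and the sample name is chosen to be representable in latin1.

```python
def is_valid_utf8(raw_name):
    """Return True if the bytes already decode cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_name(raw_name, from_enc, to_enc):
    """Re-encode a filename given as raw bytes (hypothetical helper)."""
    return raw_name.decode(from_enc).encode(to_enc)


old = b"t\xfck\xf6r.mp3"  # "tükör.mp3" stored as latin1 bytes
if not is_valid_utf8(old):
    new = convert_name(old, "latin-1", "utf-8")
    print(old, "->", new)
```

Checking for valid UTF-8 first avoids converting a name twice, which would turn each accented letter into two garbage characters.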
18 JohnMcPherson 182
26 AristotlePagaltzis 183 For the "opposite" problem --- that is, you have a windows machine with a share that is samba-mounted onto a linux client, and the non-ascii characters are getting munged --- you need to give samba some mount options: <tt>iocharset=utf8</tt> tells samba to use a utf-8 encoding when presenting filenames to linux applications, and <tt>codepage=<i>foo</i></tt> tells samba which encoding the windows machine is using. If your accents are getting screwed up, try <tt>codepage=cp850</tt>, eg.:
184
185 <verbatim>
186 smbmount //servername/sharename /mnt/point -o codepage=cp850,iocharset=utf8,password=$pass
187 </verbatim>
188
18 JohnMcPherson 189 ----
190 CategoryNotes
