Blame: charsets(7) - Waikato Linux Users Group

Annotated edit history of charsets(7) version 4, including all changes.

Rev	Author	#	Line
1	perry	1	`CHARSETS`
		2	`!!!CHARSETS`
		3	`NAME`
		4	`DESCRIPTION`
		5	`ASCII`
		6	`ISO 8859`
		7	`KOI8-R`
		8	`JIS X 0208`
		9	`KS X 1001`
		10	`GB 2312`
		11	`Big5`
		12	`TIS 620`
		13	`UNICODE`
		14	`ISO 2022 AND ISO 4873`
		15	`SEE ALSO`
		16	`----`
		17	`!!NAME`
		18
		19
		20	`charsets - programmer's view of character sets and internationalization`
		21	`!!DESCRIPTION`
		22
		23
		24	`Linux is an international operating system. Various of its`
		25	`utilities and device drivers (including the console driver)`
		26	`support multilingual character sets including Latin-alphabet`
		27	`letters with diacritical marks, accents, ligatures, and`
		28	`entire non-Latin alphabets including Greek, Cyrillic,`
		29	`Arabic, and Hebrew.`
		30
		31
		32	`This manual page presents a programmer's-eye view of`
		33	`different character-set standards and how they fit together`
		34	`on Linux. Standards discussed include ASCII, ISO 8859,`
		35	`KOI8-R, Unicode, ISO 2022 and ISO 4873. The primary emphasis`
		36	`is on character sets actually used as locale character sets,`
		37	`not the myriad others that can be found in data from other`
		38	`systems.`
		39
		40
		41	`A complete list of charsets used in an officially supported`
		42	`locale in glibc 2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15},`
		43	`CP1251, UTF-8, EUC-{KR,JP,TW}, KOI8-{R,U}, GB2312, GB18030,`
		44	`GBK, BIG5, BIG5-HKSCS and TIS-620 (in no particular order.)`
		45	`(Romanian may be switching to ISO-8859-16.)`
		46	`!!ASCII`
		47
		48
		49	`ASCII (American Standard Code For Information Interchange)`
		50	`is the original 7-bit character set, originally designed for`
		51	`American English. It is currently described by the ECMA-6`
		52	`standard.`
		53
		54
		55	`Various ASCII variants replacing the dollar sign with other`
		56	`currency symbols and replacing punctuation with non-English`
		57	`alphabetic characters to cover German, French, Spanish and`
		58	`others in 7 bits exist. All are deprecated; GNU libc doesn't`
		59	`support locales whose character sets aren't true supersets`
		60	`of ASCII. (These sets are also known as ISO-646, a close`
		61	`relative of ASCII that permitted replacing these`
		62	`characters.)`
		63
		64
		65	`As Linux was written for hardware designed in the US, it`
		66	`natively supports ASCII.`
		67	`!!ISO 8859`
		68
		69
		70	`ISO 8859 is a series of 15 8-bit character sets all of which`
		71	`have US ASCII in their low (7-bit) half, invisible control`
		72	`characters in positions 128 to 159, and 96 fixed-width`
		73	`graphics in positions 160-255.`
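The layout described above can be checked with a short sketch. It assumes the codecs bundled with Python (named `iso8859_N` there), which are not part of this manual page; the helper names are made up for illustration.

```python
# Sketch: verify that the low half of each ISO 8859 part is plain ASCII.
# Assumption: Python's bundled "iso8859_N" codecs stand in for the standards.
PARTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16]


def low_half_is_ascii(part: int) -> bool:
    """True if bytes 0-127 decode the same under ASCII and ISO 8859-<part>."""
    low = bytes(range(128))
    return low.decode(f"iso8859_{part}") == low.decode("ascii")


def decode_high_byte(part: int, value: int) -> str:
    """Decode one byte from the upper half (160-255) of ISO 8859-<part>."""
    return bytes([value]).decode(f"iso8859_{part}")


if __name__ == "__main__":
    for part in PARTS:
        print(f"ISO 8859-{part}: low half is ASCII = {low_half_is_ascii(part)}")
    # The same byte means different things in different parts:
    # 0xA4 is the currency sign in Latin-1 and the Euro sign in Latin-9.
    print(repr(decode_high_byte(1, 0xA4)), repr(decode_high_byte(15, 0xA4)))
```

Only the upper half differs between parts, which is why a byte stream in one of these sets cannot be interpreted without knowing which part it uses.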
		74
		75
		76	`Of these, the most important is ISO 8859-1 (Latin-1). It is`
		77	`natively supported in the Linux console driver, fairly well`
		78	`supported in X11R6, and is the base character set of`
		79	`HTML.`
		80
		81
		82	`Console support for the other 8859 character sets is`
		83	`available under Linux through user-mode utilities (such as`
		84	`consolechars(8)) that modify keyboard bindings and`
		85	`the EGA graphics table and employ the`
		86	`__`
		87
		88
		89	`Here are brief descriptions of each set:`
		90
		91
		92	`8859-1 (Latin-1)`
		93
		94
		95	`Latin-1 covers most Western European languages such as`
		96	`Albanian, Catalan, Danish, Dutch, English, Faroese, Finnish,`
		97	`French, German, Galician, Irish, Icelandic, Italian,`
		98	`Norwegian, Portuguese, Spanish, and Swedish. The lack of the`
		99	`ligatures Dutch ij, French oe and old-style „German“`
		100	`quotation marks is considered tolerable.`
		101
		102
		103	`8859-2 (Latin-2)`
		104
		105
		106	`Latin-2 supports most Latin-written Slavic and Central`
		107	`European languages: Croatian, Czech, German, Hungarian,`
		108	`Polish, Romanian, Slovak, and Slovene.`
		109
		110
		111	`8859-3 (Latin-3)`
		112
		113
		114	`Latin-3 is popular with authors of Esperanto, Galician, and`
		115	`Maltese. (Turkish is now written with 8859-9`
		116	`instead.)`
		117
		118
		119	`8859-4 (Latin-4)`
		120
		121
		122	`Latin-4 introduced letters for Estonian, Latvian, and`
		123	`Lithuanian. It is essentially obsolete; see 8859-13`
		124	`(Latin-7).`
		125
		126
		127	`8859-5`
		128
		129
		130	`Cyrillic letters supporting Bulgarian, Byelorussian,`
		131	`Macedonian, Russian, Serbian and Ukrainian. Ukrainians read`
		132	`the letter 'ghe' with downstroke as 'heh' and would need a`
		133	`ghe with upstroke to write a correct ghe. See the discussion`
		134	`of KOI8-R below.`
		135
		136
		137	`8859-6`
		138
		139
		140	`Supports Arabic. The 8859-6 glyph table is a fixed font of`
		141	`separate letter forms, but a proper display engine should`
		142	`combine these using the proper initial, medial, and final`
		143	`forms.`
		144
		145
		146	`8859-7`
		147
		148
		149	`Supports Modern Greek.`
		150
		151
		152	`8859-8`
		153
		154
		155	`Supports modern Hebrew without niqud (punctuation signs).`
		156	`Niqud and full-fledged Biblical Hebrew are outside the scope`
		157	`of this character set; under Linux, UTF-8 is the preferred`
		158	`encoding for these.`
		159
		160
		161	`8859-9 (Latin-5)`
		162
		163
		164	`This is a variant of Latin-1 that replaces Icelandic letters`
		165	`with Turkish ones.`
		166
		167
		168	`8859-10 (Latin-6)`
		169
		170
		171	`Latin 6 adds the last Inuit (Greenlandic) and Sami (Lappish)`
		172	`letters that were missing in Latin 4 to cover the entire`
		173	`Nordic area. RFC 1345 listed a preliminary and different`
		174	`'latin6'. Skolt Sami still needs a few more accents than`
		175	`these.`
		176
		177
		178	`8859-11`
		179
		180
		181	`This only exists as a rejected draft standard. The draft`
		182	`standard was identical to TIS-620, which is used under Linux`
		183	`for Thai.`
		184
		185
		186	`8859-12`
		187
		188
		189	`This set does not exist. While Vietnamese has been suggested`
		190	`for this space, it does not fit within the 96`
		191	`(non-combining) characters ISO 8859 offers. UTF-8 is the`
		192	`preferred character set for Vietnamese use under`
		193	`Linux.`
		194
		195
		196	`8859-13 (Latin-7)`
		197
		198
		199	`Supports the Baltic Rim languages; in particular, it`
		200	`includes Latvian characters not found in`
		201	`Latin-4.`
		202
		203
		204	`8859-14 (Latin-8)`
		205
		206
		207	`This is the Celtic character set, covering Gaelic and`
		208	`Welsh.`
		209
		210
		211	`8859-15 (Latin-9)`
		212
		213
		214	`This adds the Euro sign and French and Finnish letters that`
		215	`were missing in Latin-1.`
		216
		217
		218	`8859-16 (Latin-10)`
		219
		220
		221	`This set covers many of the languages covered by 8859-2, and`
		222	`supports Romanian more completely than that set`
		223	`does.`
		224	`!!KOI8-R`
		225
		226
		227	`KOI8-R is a non-ISO character set popular in Russia. The`
		228	`lower half is US ASCII; the upper is a Cyrillic character`
		229	`set somewhat better designed than ISO 8859-5. KOI8-U is a`
		230	`common character set, based off KOI8-R, that has better`
		231	`support for Ukrainian. Neither of these sets is ISO-2022`
		232	`compatible, unlike the ISO-8859 series.`
		233
		234
		235	`Console support for KOI8-R is available under Linux through`
		236	`user-mode utilities that modify keyboard bindings and the`
		237	`EGA graphics table, and employ the`
		238	`!!JIS X 0208`
		239
		240
		241	`JIS X 0208 is a Japanese national standard character set.`
		242	`Though there are some more Japanese national standard`
		243	`character sets (like JIS X 0201, JIS X 0212, and JIS X`
		244	`0213), this is the most important one. Characters are mapped`
		245	`into a 94x94 two-byte matrix, in which each byte is in the`
		246	`range 0x21-0x7e. Note that JIS X 0208 is a character set,`
		247	`not an encoding. This means that JIS X 0208 itself is not`
		248	`used for expressing text data. JIS X 0208 is used as a`
		249	`component to construct encodings such as EUC-JP, Shift_JIS,`
		250	`and ISO-2022-JP. EUC-JP is the most important encoding for`
		251	`Linux and includes US ASCII and JIS X 0208. In EUC-JP, JIS X`
		252	`0208 characters are expressed in two bytes, each of which is`
		253	`the JIS X 0208 code plus 0x80.`
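The "JIS X 0208 code plus 0x80" rule can be sketched as follows. The function name and the row/cell parameters are illustrative assumptions; the `euc_jp` codec used for the cross-check is the one bundled with Python.

```python
# Sketch: build the EUC-JP bytes of a JIS X 0208 character from its
# position in the 94x94 matrix (row and cell, both counted from 1).
def jis_to_euc_jp(row: int, cell: int) -> bytes:
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # Each byte of the JIS X 0208 code lies in 0x21-0x7e.
    jis_code = (row + 0x20, cell + 0x20)
    # EUC-JP adds 0x80 to each byte, moving it into 0xa1-0xfe.
    return bytes(b + 0x80 for b in jis_code)


if __name__ == "__main__":
    encoded = jis_to_euc_jp(4, 2)  # JIS code 0x2422, HIRAGANA LETTER A
    print(encoded.hex(), encoded.decode("euc_jp"))
```

Because both bytes have the high bit set, JIS X 0208 characters can never be mistaken for the US ASCII bytes that share the same stream.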
		254	`!!KS X 1001`
		255
		256
		257	`KS X 1001 is a Korean national standard character set. Just`
		258	`as in JIS X 0208, characters are mapped into a 94x94 two-byte`
		259	`matrix. KS X 1001 is used like JIS X 0208, as a component to`
		260	`construct encodings such as EUC-KR, Johab, and ISO-2022-KR.`
		261	`EUC-KR is the most important encoding for Linux and includes`
		262	`US ASCII and KS X 1001. KS C 5601 is an older name for KS X`
		263	`1001.`
		264	`!!GB 2312`
		265
		266
		267	`GB 2312 is a mainland Chinese national standard character`
		268	`set used to express simplified Chinese. Just like JIS X`
		269	`0208, characters are mapped into a 94x94 two-byte matrix`
		270	`used to construct EUC-CN. EUC-CN is the most important`
		271	`encoding for Linux and includes US ASCII and GB 2312. Note`
		272	`that EUC-CN is often called GB, GB 2312, or`
		273	`CN-GB.`
		274	`!!Big5`
		275
		276
		277	`Big5 is a popular character set in Taiwan to express`
		278	`traditional Chinese. (Big5 is both a character set and an`
		279	`encoding.) It is a superset of US ASCII. Non-ASCII`
		280	`characters are expressed in two bytes. Bytes 0xa1-0xfe are`
		281	`used as leading bytes for two-byte characters. Big5 and its`
		282	`extensions are widely used in Taiwan and Hong Kong. It is not`
		283	`ISO 2022-compliant.`
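A minimal sketch of how a reader might split a Big5 byte stream into characters, using only the lead-byte range given above. The function name is hypothetical, and extensions of Big5 may use a wider lead-byte range than this sketch assumes.

```python
# Sketch: walk a Big5 byte string and report the byte length of each
# character, treating 0xa1-0xfe as lead bytes of two-byte characters.
def big5_char_lengths(data: bytes) -> list:
    lengths = []
    i = 0
    while i < len(data):
        step = 2 if 0xA1 <= data[i] <= 0xFE else 1
        lengths.append(step)
        i += step
    return lengths


if __name__ == "__main__":
    sample = "A\u4e2d".encode("big5")  # ASCII 'A' followed by U+4E2D
    print(sample.hex(), big5_char_lengths(sample))
```

Unlike EUC encodings, the second byte of a Big5 character may fall in the ASCII range, so a stream has to be scanned from a known character boundary.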
		284	`!!TIS 620`
		285
		286
		287	`TIS 620 is a Thai national standard character set and a`
		288	`superset of US ASCII. As in the ISO 8859 series, Thai characters`
		289	`are mapped into 0xa1-0xfe. TIS 620 is the only commonly used`
		290	`character set under Linux besides UTF-8 to have combining`
		291	`characters.`
		292	`!!UNICODE`
		293
		294
		295	`Unicode (ISO 10646) is a standard which aims to`
		296	`unambiguously represent every character in every human`
		297	`language. Unicode's structure permits 20.1 bits to encode`
		298	`every character. Since most computers don't include 20.1-bit`
		299	`integers, Unicode is usually encoded as 32-bit integers`
		300	`internally and either a series of 16-bit integers (UTF-16)`
		301	`(needing two 16-bit integers only when encoding certain rare`
		302	`characters) or a series of 8-bit bytes (UTF-8). Information`
		303	`on Unicode is available at`
		304
		305
		306	`Linux represents Unicode using the 8-bit Unicode`
		307	`Transformation Format (UTF-8). UTF-8 is a variable length`
		308	`encoding of Unicode. It uses 1 byte to code 7 bits, 2 bytes`
		309	`for 11 bits, 3 bytes for 16 bits, and 4 bytes for the`
		310	`remainder.`
		311
		312
		313	`Let 0,1,x stand for a zero, one, or arbitrary bit. A byte`
		314	`0xxxxxxx stands for the Unicode 00000000 0xxxxxxx which`
		315	`codes the same symbol as the ASCII 0xxxxxxx. Thus, ASCII`
		316	`goes unchanged into UTF-8, and people using only ASCII do`
		317	`not notice any change: not in code, and not in file`
		318	`size.`
		319
		320
		321	`A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx`
		322	`10yyyyyy is assembled into 00000000 00000000 00000xxx`
		323	`xxyyyyyy. A byte 1110xxxx is the start of a 3-byte code, and`
		324	`1110xxxx 10yyyyyy 10zzzzzz is assembled into 00000000`
		325	`00000000 xxxxyyyy yyzzzzzz. Lastly, 11110xxx starts a`
		326	`4-byte code, and 11110xxx 10xxyyyy 10zzzzzz 10aaaaaa`
		327	`becomes 00000000 000xxxxx yyyyzzzz zzaaaaaa.`
		328
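The bit layout can be sketched as a hand-written encoder with lead bytes 0xxxxxxx, 110xxxxx, 1110xxxx and 11110xxx, each followed by 10xxxxxx tail bytes. The function name is an illustrative assumption; real programs would normally use the library encoder, which appears here only as a cross-check.

```python
# Sketch: encode one Unicode code point as UTF-8 by hand.
def utf8_encode(cp: int) -> bytes:
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:        # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Remainder (up to 21 bits): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```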
		329
		330	`For most people who use ISO-8859 character sets, this means`
		331	`that the characters outside of ASCII are now coded with two`
		332	`bytes. This tends to expand ordinary text files by only one`
		333	`or two percent. For Russian or Greek users, this expands`
		334	`ordinary text files by 100%, since text in those languages`
		335	`is mostly outside of ASCII. For Japanese users this means`
		336	`that the 16-bit codes now in common use will take three`
		337	`bytes. While there are algorithmic conversions from some`
		338	`character sets (esp. ISO-8859-1) to Unicode, general`
		339	`conversion requires carrying around conversion tables, which`
		340	`can be quite large for 16-bit codes.`
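The size effects described above can be measured directly. This is a sketch using Python's bundled codecs; the function name and the sample strings are assumptions chosen for illustration.

```python
# Sketch: compare the encoded size of a text in a legacy character set
# with its size in UTF-8.
def utf8_growth(text: str, legacy_codec: str) -> tuple:
    """Return (size in the legacy encoding, size in UTF-8), in bytes."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


if __name__ == "__main__":
    samples = [
        ("caf\u00e9", "iso8859_1"),   # mostly ASCII, one accented letter
        ("\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac", "iso8859_7"),  # Greek
        ("\u65e5\u672c\u8a9e", "euc_jp"),  # Japanese
    ]
    for text, codec in samples:
        old, new = utf8_growth(text, codec)
        print(f"{codec}: {old} bytes -> UTF-8: {new} bytes")
```

The all-Greek sample doubles in size and each two-byte Japanese character grows to three bytes, while the mostly-ASCII sample gains a single byte.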
		341
		342
		343	`Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail,`
		344	`any other byte is the head of a code. Note that the only way`
		345	`ASCII bytes occur in a UTF-8 stream, is as themselves. In`
		346	`particular, there are no embedded NULs or '/'s that form`
		347	`part of some larger code.`
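Self-synchronization means a reader dropped at an arbitrary byte offset can find the start of the current character by skipping backwards over tail bytes. A minimal sketch, with a hypothetical function name:

```python
# Sketch: find the index of the head byte of the UTF-8 character that
# contains position pos. Tail bytes match the pattern 10xxxxxx.
def char_start(buf: bytes, pos: int) -> int:
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


if __name__ == "__main__":
    data = "a\u20acb".encode("utf-8")  # b'a\xe2\x82\xacb'
    for offset in range(len(data)):
        print(offset, "->", char_start(data, offset))
    # No byte of a multi-byte code is an ASCII byte such as NUL or '/'.
    print(all(b >= 0x80 for b in "\u20ac".encode("utf-8")))
```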
		348
		349
		350	`Since ASCII, and, in particular, NUL and '/', are unchanged,`
		351	`the kernel does not notice that UTF-8 is being used. It does`
		352	`not care at all what the bytes it is handling stand`
		353	`for.`
		354
		355
		356	`Rendering of Unicode data streams is typically handled`
		357	`through 'subfont' tables which map a subset of Unicode to`
		358	`glyphs. Internally the kernel uses Unicode to describe the`
		359	`subfont loaded in video RAM. This means that in UTF-8 mode`
		360	`one can use a character set with 512 different symbols. This`
		361	`is not enough for Japanese, Chinese and Korean, but it is`
		362	`enough for most other purposes.`
		363
		364
		365	`At the current time, the console driver does not handle`
		366	`combining characters. So Thai, Sioux and any other script`
		367	`needing combining characters can't be handled on the`
		368	`console.`
		369	`!!ISO 2022 AND ISO 4873`
		370
		371
		372	`The ISO 2022 and 4873 standards describe a font-control`
		373	`model based on VT100 practice. This model is (partially)`
		374	`supported by the Linux kernel and by xterm(1). It is`
		375	`popular in Japan and Korea.`
		376
		377
		378	`There are 4 graphic character sets, called G0, G1, G2 and`
		379	`G3, and one of them is the current character set for codes`
		380	`with high bit zero (initially G0), and one of them is the`
		381	`current character set for codes with high bit one (initially`
		382	`G1). Each graphic character set has 94 or 96 characters, and`
		383	`is essentially a 7-bit character set. It uses codes either`
		384	`040-0177 (041-0176) or 0240-0377 (0241-0376). G0 always has`
		385	`size 94 and uses codes 041-0176.`
		386
		387
		388	`Switching between character sets is done using the shift`
		389	`functions ^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o`
		390	`(LS3), ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R),`
		391	`ESC | (LS3R). The function LS''n'' makes character set`
		392	`G''n'' the current one for codes with high bit zero. The`
		393	`function LS''n''R makes character set G''n'' the`
		394	`current one for codes with high bit one. The function`
		395	`SS''n'' makes character set G''n'' (''n''=2 or 3)`
		396	`the current one for the next character only (regardless of`
		397	`the value of its high order bit).`
		398
		399
		400	`A 94-character set is designated as G''n'' character set`
		401	`by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),`
		402	`ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol`
		403	`or a pair of symbols found in the ISO 2375 International`
		404	`Register of Coded Character Sets. For example, ESC ( @`
		405	`selects the ISO 646 character set as G0, ESC ( A selects the`
		406	`UK standard character set (with pound instead of number`
		407	`sign), ESC ( B selects ASCII (with dollar instead of`
		408	`currency sign), ESC ( M selects a character set for African`
		409	`languages, ESC ( ! A selects the Cuban character set, and so`
		410	`on.`
		411
		412
		413	`A 96-character set is designated as G''n'' character set`
		414	`by an escape sequence ESC - xx (for G1), ESC . xx (for G2)`
		415	`or ESC / xx (for G3). For example, ESC - G selects the`
		416	`Hebrew alphabet as G1.`
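The designation sequences for 94-character and 96-character sets follow a regular pattern, which can be sketched as below. The function and table names are hypothetical; the final symbols ("B", "G", "! A") are the examples given in the text, not a complete register.

```python
# Sketch: build ISO 2022 designation escape sequences for single-byte sets.
ESC = b"\x1b"

# Intermediate byte selecting the target set G0..G3.
INTERMEDIATE_94 = {0: b"(", 1: b")", 2: b"*", 3: b"+"}
INTERMEDIATE_96 = {1: b"-", 2: b".", 3: b"/"}  # no 96-character set in G0


def designate_94(g: int, final: bytes) -> bytes:
    """Designate a 94-character set, identified by final, as G<g>."""
    if g not in INTERMEDIATE_94:
        raise ValueError("g must be 0, 1, 2 or 3")
    return ESC + INTERMEDIATE_94[g] + final


def designate_96(g: int, final: bytes) -> bytes:
    """Designate a 96-character set, identified by final, as G<g>."""
    if g not in INTERMEDIATE_96:
        raise ValueError("g must be 1, 2 or 3")
    return ESC + INTERMEDIATE_96[g] + final


if __name__ == "__main__":
    print(designate_94(0, b"B"))   # ASCII as G0
    print(designate_94(0, b"!A"))  # two-symbol identifier, as in ESC ( ! A
    print(designate_96(1, b"G"))   # Hebrew alphabet as G1
```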
		417
		418
		419	`A multibyte character set is designated as G''n''`
		420	`character set by an escape sequence ESC $ xx or ESC $ ( xx`
		421	`(for G0), ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ +`
		422	`xx (for G3). For example, ESC $ ( C selects the Korean`
		423	`character set for G0. The Japanese character set selected by`
		424	`ESC $ B has a more recent version selected by ESC`
		425	`''`
		426
		427
		428	`ISO 4873 stipulates a narrower use of character sets, where`
		429	`G0 is fixed (always ASCII), so that G1, G2 and G3 can only`
		430	`be invoked for codes with the high order bit set. In`
		431	`particular, ^N and ^O are not used anymore, ESC ( xx can be`
		432	`used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx are`
		433	`equivalent to ESC - xx, ESC . xx, ESC / xx,`
		434	`respectively.`
		435	`!!SEE ALSO`
		436
		437
4	perry	438	`console(4), console_ioctl(4),`
		439	`console_codes(4), ascii(7),`
		440	`iso_8859_1(7), unicode(7),`
		441	`utf-8(7)`
1	perry	442	`----`

This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.

Last edited on Tuesday, June 4, 2002 12:30:56 am by "perry"
