ISPELL
!!!ISPELL
NAME
DESCRIPTION
EXAMPLES
SEE ALSO
----
!!NAME

ispell - format of ispell dictionaries and affix files
!!DESCRIPTION

''Ispell''(1) requires two files to define the language
that it is spell-checking. The first file is a dictionary
containing words for the language, and the second is an
"affix" file that defines the meaning of special flags in
the dictionary. The two files are combined by
''buildhash'' (see ispell(1)) and written to a
hash file which is not described here.

A raw ''ispell'' dictionary (either the main dictionary
or your own personal dictionary) contains a list of words,
one per line. Each word may optionally be followed by a
slash ("/") and one or more flags, which modify the root
word as explained below. Depending on the options with which
''ispell'' was built, case may or may not be
significant in either the root word or the flags,
independently. Specifically, if the compile-time option
CAPITALIZATION is defined, case is significant in the root
word; if not, case is ignored in the root word. If the
compile-time option MASKBITS is set to a value of 32, case
is ignored in the flags; otherwise case is significant in
the flags. Contact your system administrator or
''ispell'' maintainer for more information (or use the
__-vv__ flag to find out). The dictionary should be
sorted with the __-f__ flag of sort(1) before the
hash file is built; this is done automatically by
munchlist(1), which is the normal way of building
dictionaries.
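
The one-word-per-line layout with an optional flag list can be illustrated with a small parser. This is only a sketch, not part of ''ispell'': the function name `parse_entry` is hypothetical, the separator is assumed to be the default slash, and every character after the separator is assumed to be one single-character flag.

```python
def parse_entry(line: str, flagmarker: str = "/") -> tuple[str, set[str]]:
    """Split one raw dictionary line into (root word, set of flag characters).

    Assumption: flags are single characters; repeated separators
    (e.g. "word/A/B") are simply ignored.
    """
    root, _, flags = line.strip().partition(flagmarker)
    return root, set(flags.replace(flagmarker, ""))


if __name__ == "__main__":
    print(parse_entry("bathe/DG"))
    print(parse_entry("UNIX"))
```

A line without a separator yields an empty flag set, which corresponds to a root word that takes no affixes.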

If the dictionary contains words that have string characters
(see the affix-file documentation below), they must be
written in the format given by the __defstringtype__
statement in the affix file. This will be the case for most
non-English languages. Be careful to use this format, rather
than that of your favorite formatter, when adding words to a
dictionary. (If you add words to your personal dictionary
during an ''ispell'' session, they will automatically be
converted to the correct format. This feature can be used to
convert an entire dictionary if necessary:)

echo qqqqq > dummy.dict
buildhash dummy.dict ''affix-file'' dummy.hash
awk '{print "*"$0}END{print "#"}' ''old-dict-file'' \
| ispell -a -T ''old-dict-string-type'' \
-d ./dummy.hash -p ./''new-dict-file'' \
> /dev/null

The case of the root word controls the case of words
accepted by ''ispell'', as follows:

(1) If the root word appears only in lower case (e.g.,
''bob''), it will be accepted in lower case, capitalized,
or all capitals.

(2) If the root word appears capitalized (e.g., ''Robert''),
it will not be accepted in all-lower case, but will be
accepted capitalized or all in capitals.

(3) If the root word appears all in capitals (e.g.,
''UNIX''), it will only be accepted all in
capitals.

(4) If the root word appears with a "funny" capitalization
(e.g., ''ITCorp''), a word will be
accepted only if it follows that capitalization, or if it
appears all in capitals.

(5) More than one capitalization of a root word may appear in
the dictionary. Flags from different capitalizations are
combined by OR-ing them together.

Redundant capitalizations (e.g., ''bob'' and ''Bob'')
will be combined by ''buildhash'' and by ''ispell''
(for personal dictionaries), and can be removed from a raw
dictionary by ''munchlist''.

For example, the dictionary:

bob
Robert
UNIX
ITcorp
ITCorp

will accept ''bob'', ''Bob'', ''BOB'',
''Robert'', ''ROBERT'', ''UNIX'', ''ITcorp'',
''ITCorp'', and ''ITCORP'', and will reject all
others. Some of the unacceptable forms are ''bOb'',
''robert'', ''Unix'', and ''!ItCorp''.
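
The capitalization rules above can be sketched in a few lines of Python. This is an illustration of the rules as stated here, not ''ispell'''s actual implementation; the function names `matches_root` and `accepted` are hypothetical, and affix flags are ignored.

```python
def matches_root(root: str, word: str) -> bool:
    """Return True if `word` is an acceptable capitalization of `root`."""
    if word.lower() != root.lower():
        return False              # different word altogether
    if word == word.upper():
        return True               # all capitals is always accepted
    if root == root.lower():
        # Rule (1): lower-case root also accepts the capitalized form.
        return word in (root, root.capitalize())
    # Rules (2)-(4): capitalized, all-capital, and mixed-case roots
    # accept only their exact spelling (besides all capitals).
    return word == root


def accepted(word: str, dictionary: list[str]) -> bool:
    """A word is accepted if any dictionary entry accepts it (rule 5)."""
    return any(matches_root(root, word) for root in dictionary)


if __name__ == "__main__":
    words = ["bob", "Robert", "UNIX", "ITcorp", "ITCorp"]
    for candidate in ["Bob", "ROBERT", "ITCORP", "bOb", "robert", "Unix", "ItCorp"]:
        print(candidate, accepted(candidate, words))
```

Run against the example dictionary, the sketch accepts and rejects the same forms listed in the text.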

As mentioned above, root words in any dictionary may be
extended by flags. Each flag is a single alphabetic
character, which represents a prefix or suffix that may be
added to the root to form a new word. For example, in an
English dictionary the __D__ flag can be added to
''bathe'' to make ''bathed''. Since flags are
represented as a single bit in the hashed dictionary, this
results in significant space savings. The ''munchlist''
script will reduce an existing raw dictionary by adding
flags when possible.

When a word is extended with an affix, the affix will be
accepted only if it appears in the same case as the initial
(prefix) or final (suffix) letter of the word. Thus, for
example, the entry ''UNIX/M'' in the main dictionary
(__M__ means add an apostrophe and an "s" to a noun)
would accept ''UNIX'S'' but would
reject ''UNIX's''. If ''UNIX's'' is legal, it must
appear as a separate dictionary entry, and it will not be
combined by ''munchlist''. (In general, you don't need to
worry about these things; ''munchlist'' guarantees that
its output dictionary will accept the same set of words as
its input, so all you have to do is add words to the
dictionary and occasionally run ''munchlist'' to reduce its
size.)

As mentioned, the affix definition file describes the
affixes associated with particular flags. It also describes
the character set used by the language.

Although the affix-definition grammar is designed for a
line-oriented layout, it is actually a free-format yacc
grammar and can be laid out weirdly if you want. Comments
are started by a pound (sharp) sign (#), and continue to the
end of the line. Backslashes are supported in the usual
fashion (__\__''nnn'', plus specials __\n__,
__\r__, __\t__, __\v__, __\f__, __\b__, and the
new hex format __\x__''nn''). Any character with
special meaning to the parser can be changed to an
uninterpreted token by backslashing it; for example, you can
declare a flag named 'asterisk' or 'colon' with ''flag
\*:'' or ''flag \::''.

The grammar will be presented in a top-down fashion, with
discussion of each element. An affix-definition file must
contain exactly one table:

''table'' : [[''headers''] [[''prefixes''] [[''suffixes'']

At least one of ''prefixes'' and ''suffixes'' is
required. They can appear in either order.

''headers'' : [[ ''options'' ] ''char-sets''

The headers describe options global to this dictionary and
language. These include the character sets to be used and
the formatter, and the defaults for certain ''ispell''
flags.

''options'' : { ''fmtr-stmt'' | ''opt-stmt'' | ''flag-stmt'' | ''num-stmt'' }

The options statements define the defaults for certain
''ispell'' flags and for the character sets used by the
formatters.

''fmtr-stmt'' : { ''nroff-stmt'' | ''tex-stmt'' }

A ''fmtr-stmt'' describes characters that have special
meaning to a formatter. Normally, this statement is not
necessary, but some languages may have preempted the usual
defaults for use as language-specific characters. In this
case, these statements may be used to redefine the special
characters expected by the formatter.

''nroff-stmt'' : { __nroffchars__ | __troffchars__ } ''string''

The __nroffchars__ statement allows redefinition of
certain ''nroff'' control characters. The string given
must be exactly five characters long, and must list
substitutions for the left and right parentheses
("()"), the period ("."), the backslash ("\"), and the
asterisk ("*"). For example, the statement:

__nroffchars__ {}.\\*

would replace the left and right parentheses with left and
right curly braces for purposes of parsing
''nroff''/''troff'' strings, with no effect on the
others (admittedly a contrived example). Note that the
backslash is escaped with a backslash.

''tex-stmt'' : { __!TeXchars__ | __texchars__ } ''string''

The __!TeXchars__ statement allows redefinition of certain
TeX/LaTeX control characters. The string given must be
exactly thirteen characters long, and must list
substitutions for the left and right parentheses
("()"), the left and right square brackets ("[[]"), the left
and right curly braces ("{}"), the left and right angle
brackets ("<>"), the backslash ("\"), the dollar sign ("$"),
the asterisk ("*"), the period or dot ("."), and the percent
sign ("%"). For example, the statement:

__texchars__ ()\[[]<\><>\\$*.%

would replace the functions of the left and right curly
braces with the left and right angle brackets for purposes
of parsing TeX/LaTeX constructs, while retaining their
functions for the ''tib'' bibliographic preprocessor.
Note that the backslash, the left square bracket, and the
right angle bracket must be escaped with a
backslash.

''opt-stmt'' : { ''cmpnd-stmt'' | ''aff-stmt'' }
''cmpnd-stmt'' : __compoundwords__ ''compound-opt''
''aff-stmt'' : __allaffixes__ ''on-or-off''
''on-or-off'' : { __on__ | __off__ }
''compound-opt'' : { ''on-or-off'' | __controlled__ ''character'' }

An ''opt-stmt'' controls certain ''ispell'' defaults that are
best made language-specific. The __allaffixes__ statement
controls the default for the __-P__ and __-m__ options
to ''ispell''. If __allaffixes__ is turned __off__
(the default), ''ispell'' will default to the behavior of
the __-P__ flag: root/affix suggestions will only be made
if there are no "near misses". If
__allaffixes__ is turned __on__, ''ispell'' will
default to the behavior of the __-m__ flag: root/affix
suggestions will always be made. The __compoundwords__
statement controls the default for the __-B__ and
__-C__ options to ''ispell''. If __compoundwords__
is turned __off__ (the default), ''ispell'' will
default to the behavior of the __-B__ flag: run-together
words will be reported as errors. If __compoundwords__ is
turned __on__, ''ispell'' will default to the behavior
of the __-C__ flag: run-together words will be considered
as compounds if both are in the dictionary. This is useful
for languages such as German and Norwegian, which form large
numbers of compound words. Finally, if __compoundwords__
is set to __controlled__, only words marked with the flag
indicated by ''character'' (which should not be otherwise
used) will be allowed to participate in compound formation.
Because this option requires the flags to be specified in
the dictionary, it is not available from the command
line.

''flag-stmt'' : __flagmarker__ ''character''

The __flagmarker__ statement describes the character
which is used to separate affix flags from the root word in
a raw dictionary file. This must be a character which is not
found in any word (including in string characters; see
below). The default is "/" because this character is not
normally used to represent special characters in any
language.

''num-stmt'' : __compoundmin__ ''digit''

The __compoundmin__ statement controls the length of the
two components of a compound word. This only has an effect
if __compoundwords__ is turned __on__ or if the
__-C__ flag is given to ''ispell''. In that case, only
words at least as long as the given minimum will be accepted
as components of a compound. The default is 3
characters.
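
The interaction between __compoundwords__ and __compoundmin__ can be pictured with a short sketch. It assumes the simplest reading of the text above: a run-together word is a compound if it splits into exactly two dictionary words, each at least `compoundmin` characters long. The function name `is_compound` is hypothetical, and affixes and the __controlled__ mode are ignored.

```python
def is_compound(word: str, dictionary: set[str], compoundmin: int = 3) -> bool:
    """Return True if `word` splits into two dictionary words of sufficient length."""
    # Try every split point that leaves at least `compoundmin`
    # characters on both sides.
    for i in range(compoundmin, len(word) - compoundmin + 1):
        if word[:i] in dictionary and word[i:] in dictionary:
            return True
    return False


if __name__ == "__main__":
    words = {"book", "case", "in"}
    print(is_compound("bookcase", words))               # both parts long enough
    print(is_compound("incase", words))                 # "in" is shorter than 3
    print(is_compound("incase", words, compoundmin=2))  # allowed with a lower minimum
```

Raising the minimum is a cheap way to keep short words such as "in" from turning many misspellings into accepted compounds.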

''char-sets'' : ''norm-sets'' [[ ''alt-sets'' ]

The character-set section describes the characters that can
be part of a word, and defines their collating order. There
must always be a definition of "normal" character sets; in
addition, there may be one or more partial definitions of
alternate sets which are used with various text formatters.

''norm-sets'' : [[ ''deftype'' ] ''charset-group''

A "normal" character set may optionally begin with a
''deftype'' declaration naming the file suffixes that will
make use of these characters. Following this are one or more
character-set declarations.

''deftype'' : __defstringtype__ ''name deformatter suffix''*

The __defstringtype__ declaration gives a list of file
suffixes which should make use of the default string
characters defined as part of the base character set; it is
only necessary if string characters are being defined. The
''name'' parameter is a string giving the unique name
associated with these suffixes; often it is a formatter
name. If the formatter is a member of the troff family,
"nroff" should be used for the name associated with the most
popular macro package; members of the TeX family should use
"tex". Other names may be chosen freely, but they should be
kept simple, as they are used in
''ispell'' 's __-T__ switch to specify a formatter
type. The ''deformatter'' parameter specifies the
deformatting style to use when processing files with the
given suffixes. Currently, this must be either __tex__ or
__nroff__. The ''suffix'' parameters are a
whitespace-separated list of strings which, if present at
the end of a filename, indicate that the associated set of
string characters should be used by default for this file.
For example, the suffix list for the troff family typically
includes suffixes such as ".ms", ".me", ".mm", etc.

''charset-group'' : { ''char-stmt'' | ''string-stmt'' | ''dup-stmt'' }*

A ''char-stmt'' describes single characters; a
''string-stmt'' describes characters that must appear
together as a string, and which usually represent a single
character in the target language. Either may also describe
conversion between upper and lower case. A ''dup-stmt''
is used to describe alternate forms of string characters, so
that a single dictionary may be used with several formatting
programs that use different conventions for representing
non-ASCII characters.

''char-stmt'' : __wordchars__ ''character-range''
| __wordchars__ ''lowercase-range uppercase-range''
| __boundarychars__ ''character-range''
| __boundarychars__ ''lowercase-range uppercase-range''
''string-stmt'' : __stringchar__ ''string''
| __stringchar__ ''lowercase-string uppercase-string''

Characters described with the __boundarychars__ statement
are considered part of a word only if they appear singly,
embedded between characters declared with the
__wordchars__ or __stringchar__ statements. For
example, if the hyphen is a boundary character (useful in
French), the string "foo-bar" would be a single word, but
"-foo" would be the same as "foo", and "foo--bar" would be
two words separated by non-word characters.

If two ranges or strings are given in a ''char-stmt'' or
''string-stmt'', the first describes characters that are
interpreted as lowercase and the second describes uppercase.
In the case of a __stringchar__ statement, the two
strings must be of the same length. Also, in a
__stringchar__ statement, the actual strings may contain
both uppercase and lowercase characters themselves without
difficulty; for instance, the statement

stringchar "\\*(sS" "\\*(Ss"

is legal and will not interfere with (or be interfered with
by) other declarations of "s" and "S" as lower and upper
case, respectively.

A final note on string characters: some languages collate
certain special characters as if they were strings. For
example, the German "a-umlaut" character is traditionally
sorted as if it were "ae". ''Ispell'' is not capable of
this; each character must be treated as an individual
entity. So in certain cases, ''ispell'' will sort a list of
words into a different order than the standard dictionary
order of the target language.

''alt-sets'' : ''alttype'' [[ ''alt-stmt''* ]

Because different formatters use different notations to
represent non-ASCII characters, ''ispell'' must be aware
of the representations used by these formatters. These are
declared as alternate sets of string
characters.

''alttype'' : __altstringtype__ ''name suffix''*

The __altstringtype__ statement introduces each set by
declaring the associated formatter name and filename suffix
list. This name and list are interpreted exactly as in the
__defstringtype__ statement above. Following this header
are one or more ''alt-stmt''s which declare the alternate
string characters used by this formatter.

''alt-stmt'' : __altstringchar__ ''alt-string std-string''

The __altstringchar__ statement describes alternate
representations for string characters. For example, the -mm
macro package of ''troff'' represents the German
"a-umlaut" character as ''a\*:'', while ''TeX'' uses
the sequence ''\"a''. If the ''troff'' versions
are declared as the standard versions using
__stringchar__, the ''TeX'' versions may be declared
as alternates by using the statement

altstringchar \\\"a a\\*:

When the __altstringchar__ statement is used to specify
alternate forms, all forms for a particular formatter must
be declared together as a group. Also, each formatter or
macro package must provide a complete set of characters,
both upper- and lower-case, and the character sequences used
for each formatter must be completely distinct. Character
sequences which describe upper- and lower-case versions of
the same printable character must also be the same length.
It may be necessary to define some new macros for a given
formatter to satisfy these restrictions. (The current
version of ''buildhash'' does not enforce these
restrictions, but failure to obey them may result in errors
being introduced into files that are processed with
''ispell''.)

An important minor point is that ''ispell'' assumes that
all characters declared as __wordchars__ or
__boundarychars__ will occupy exactly one position on the
terminal screen.

A single character-set statement can declare either a single
character or a contiguous range of characters. A range is
given as in egrep and the shell: [[a-z] means lowercase
alphabetics; [[^a-z] means all but lowercase, etc. All
character-set statements are combined (unioned) to produce
the final list of characters that may be part of a word. The
collating order of the characters is defined by the order of
their declaration; if a range is used, the characters are
considered to have been declared in ASCII order. Characters
that have case are collated next to each other, with the
uppercase character first.

The character-declaration statements have a rather strange
behavior caused by their need to match each lowercase
character with its uppercase equivalent. In any given
__wordchars__ or __boundarychars__ statement, the
characters in each range are first sorted into ASCII
collating sequence, then matched one-for-one with the other
range. (The two ranges must have the same number of
characters.) Thus, for example, the two
statements:

__wordchars__ [[aeiou] [[AEIOU]
__wordchars__ [[aeiou] [[UOIEA]

would produce exactly the same effect. To get the vowels to
match up "wrong", you would have to use separate statements:

__wordchars__ a U
__wordchars__ e O
__wordchars__ i I
__wordchars__ o E
__wordchars__ u A

which would cause uppercase 'e' to be 'O', and lowercase 'O'
to be 'e'. This should normally be a problem only with
languages which have been forced to use a strange ASCII
collating sequence. If your uppercase and lowercase letters
both collate in the same order, you shouldn't have to worry
about this.
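
The sort-then-pair behavior described above is easy to reproduce. The following sketch is not ''ispell'' code; the helper name `pair_cases` is hypothetical, and each range is assumed to be given as a plain string of its member characters.

```python
def pair_cases(lower_chars: str, upper_chars: str) -> dict[str, str]:
    """Map each lowercase character to its uppercase partner.

    Both ranges are sorted into ASCII (code point) order first and
    then matched one-for-one, as the text describes.
    """
    if len(lower_chars) != len(upper_chars):
        raise ValueError("the two ranges must have the same number of characters")
    return dict(zip(sorted(lower_chars), sorted(upper_chars)))


if __name__ == "__main__":
    # The order in which the uppercase range is written does not matter.
    print(pair_cases("aeiou", "AEIOU") == pair_cases("aeiou", "UOIEA"))

    # Separate single-character statements force the "wrong" pairing.
    mapping = {}
    for lower, upper in ["aU", "eO", "iI", "oE", "uA"]:
        mapping.update(pair_cases(lower, upper))
    print(mapping)
```

Because sorting happens inside each statement, only one-character statements give full control over which letters are paired.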

The prefixes and suffixes sections have exactly the same
syntax, except for the introductory keyword.

''prefixes'' : __prefixes__ ''flagdef''*
''suffixes'' : __suffixes__ ''flagdef''*
''flagdef'' : __flag__ [[__*__|__~__] ''char'' __:__ ''repl''*

A prefix or suffix table consists of an introductory keyword
and a list of flag definitions. Flags can be defined more
than once, in which case the definitions are combined. Each
flag controls one or more ''repl''s (replacements) which
are conditionally applied to the beginnings or endings of
various words.

Flags are named by a single character ''char''. Depending
on a configuration option, this character can be either any
uppercase letter (the default configuration) or any 7-bit
ASCII character. Most languages should be able to get along
with just 26 flags.

A flag character may be prefixed with one or more option
characters. (If you wish to use one of the option characters
as a flag character, simply enclose it in double
quotes.)

The asterisk (__*__) option means that this flag
participates in ''cross-product'' formation. This only
matters if the file contains both prefix and suffix tables.
If so, all prefixes and suffixes marked with an asterisk
will be applied in all cross-combinations to the root word.
For example, consider the root ''fix'' with prefixes
''pre'' and ''in'', and suffixes ''es'' and
''ed''. If all flags controlling these prefixes and
suffixes are marked with an asterisk, then the single root
''fix'' would also generate ''prefix'',
''prefixes'', ''prefixed'', ''infix'',
''infixes'', ''infixed'', ''fix'', ''fixes'',
and ''fixed''. Cross-product formation can produce a
large number of words quickly, some of which may be illegal,
so watch out. If cross-products produce illegal words,
''munchlist'' will not produce those flag combinations,
and the flag will not be useful.
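
Cross-product formation amounts to combining every optional prefix with every optional suffix. The sketch below assumes that all flags involved carry the asterisk option and that each affix is simply attached without conditions or stripping; the function name `cross_product` is hypothetical.

```python
from itertools import product


def cross_product(root: str, prefixes: list[str], suffixes: list[str]) -> set[str]:
    """Return every word formed by an optional prefix and an optional suffix."""
    # The empty string stands for "no prefix" or "no suffix".
    return {
        prefix + root + suffix
        for prefix, suffix in product([""] + prefixes, [""] + suffixes)
    }


if __name__ == "__main__":
    print(sorted(cross_product("fix", ["pre", "in"], ["es", "ed"])))
```

With two prefixes and two suffixes the root already yields nine words, which shows how quickly the generated set grows.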

''repl'' : ''condition''* __>__ [[ __-__ ''strip-string'' __,__ ] ''append-string''

The __~__ option specifies that the associated flag is
only active when a compound word is being formed. This is
useful in a language like German, where the form of a word
sometimes changes inside a compound.

A ''repl'' is a conditional rule for modifying a root
word. Up to 8 ''conditions'' may be specified. If the
''conditions'' are satisfied, the rules on the right-hand
side of the ''repl'' are applied, as
follows:

(1) If a strip-string is given, it is first stripped from the
beginning or ending (as appropriate) of the root
word.

(2) Then the append-string is added at that point.

For example, the ''condition'' __.__ means "any word",
and the ''condition'' __Y__ means "any word ending in Y".
The following (suffix) replacements:

. > MENT
Y > -Y,IES

would change ''induce'' to ''inducement'' and
''fly'' to ''flies''. (If they were controlled by the
same flag, they would also change ''fly'' to
''flyment'', which might not be what was wanted.
''Munchlist'' can be used to protect against this sort of
problem; see the command sequence given below.)
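
The two-step application of a suffix ''repl'' can be sketched as follows. This is an illustration under stated assumptions, not ''ispell'''s algorithm: the function name `apply_suffix_rule` is hypothetical, each condition is treated as a one-character regular expression, matching is done on the upper-cased root, and the result is returned in lower case.

```python
import re
from typing import Optional


def apply_suffix_rule(root: str, conditions: list[str],
                      strip: str, append: str) -> Optional[str]:
    """Apply one suffix rule to `root`; return None if it does not apply."""
    word = root.upper()
    if conditions != ["."]:
        # The conditions implicitly define a minimum length.
        if len(word) < len(conditions):
            return None
        tail = word[-len(conditions):]
        for char, condition in zip(tail, conditions):
            if not re.fullmatch(condition, char):
                return None
    # Step 1: strip the strip-string from the end, if one is given.
    if strip:
        if not word.endswith(strip):
            return None
        word = word[: -len(strip)]
    # Step 2: add the append-string at that point.
    return (word + append).lower()


if __name__ == "__main__":
    print(apply_suffix_rule("induce", ["."], "", "MENT"))
    print(apply_suffix_rule("fly", ["Y"], "Y", "IES"))
    print(apply_suffix_rule("fly", ["."], "", "MENT"))
```

The last call shows the "flyment" problem: a lone dot condition matches every root, so a flag that carries both rules produces both forms.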

No matter how much you might wish it, the strings on the
right must be strings of specific characters, not ranges.
The reasons are rooted deeply in the way ''ispell''
works, and it would be difficult or impossible to provide
for more flexibility. For example, you might wish to
write:

[[EY] > -[[EY],IES

This will not work. Instead, you must use two separate
rules:

E > -E,IES
Y > -Y,IES

The application of ''repl''s can be restricted to certain
words with ''conditions'':

''condition'' : { __.__ | ''character'' | ''range'' }

A ''condition'' is a restriction on the characters that
adjoin, and/or are replaced by, the right-hand side of the
''repl''. Up to 8 ''conditions'' may be given, which
should be enough context for anyone. The right-hand side
will be applied only if the ''conditions'' in the
''repl'' are satisfied. The ''conditions'' also
implicitly define a length; roots shorter than the number of
''conditions'' will not pass the test. (As a special
case, a ''condition'' of a single dot "." defines a length
of zero, so that the rule applies to all words
indiscriminately.)

''Conditions'' that are single characters should be
separated by white space. For example, to specify words
ending in "ED", write:

E D > -ED,ING

If you write:

ED > -ED,ING

the effect will be the same as:

[[ED] > -ED,ING

As a final minor, but important point, it is sometimes
useful to rebuild a dictionary file using an incompatible
suffix file. For example, suppose you expanded the "R" flag
to generate "er" and "ers" (thus making the Z flag somewhat
obsolete). To build a new dictionary ''newdict'' that, using
''newaffixes'', will accept exactly the same list of
words as the old list ''olddict'' did using
''oldaffixes'', the __-c__ switch of ''munchlist''
is useful, as in the following example:

$ munchlist -c oldaffixes -l newaffixes olddict > newdict

If you use this procedure, your new dictionary will always
accept the same list the original did, even if you badly
screwed up the affix file. This is because ''munchlist''
compares the words generated by a flag with the original
word list, and refuses to use any flags that generate
illegal words. (But don't forget that the ''munchlist''
step takes a long time and eats up temporary file
space.)
!!EXAMPLES

As an example of conditional suffixes, here is the
specification of the __S__ flag from the English affix
file:

flag *S:
[[^AEIOU]Y > -Y,IES
[[AEIOU]Y > S
[[SXZH] > ES
[[^SXZHY] > S

The first line applies to words ending in Y, but not in
vowel-Y. The second takes care of the vowel-Y words. The
third then handles those words that end in a sibilant or
near-sibilant, and the last picks up everything
else.

Note that the ''conditions'' are written very carefully
so that they apply to disjoint sets of words. In particular,
note that the fourth line excludes words ending in Y as well
as the obvious SXZH. Otherwise, it would convert
"imply" to "implys".
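
The disjointness of the four conditions can be checked mechanically. In the sketch below the patterns and replacement strings are reconstructed from the prose description above and should be read as assumptions rather than as a copy of the English affix file; the names `S_RULES` and `plural_forms` are hypothetical.

```python
import re

# (condition anchored at the end of the word, strip-string, append-string)
S_RULES = [
    (re.compile(r"[^AEIOU]Y$"), "Y", "IES"),   # consonant + Y
    (re.compile(r"[AEIOU]Y$"), "", "S"),       # vowel + Y
    (re.compile(r"[SXZH]$"), "", "ES"),        # sibilant or near-sibilant
    (re.compile(r"[^SXZHY]$"), "", "S"),       # everything else
]


def plural_forms(root: str) -> list[str]:
    """Return every form the rules generate for `root`."""
    word = root.upper()
    forms = []
    for pattern, strip, append in S_RULES:
        if pattern.search(word):
            stem = word[: len(word) - len(strip)]
            forms.append((stem + append).lower())
    return forms


if __name__ == "__main__":
    for root in ["imply", "convey", "fix", "bat"]:
        print(root, plural_forms(root))
```

Each sample root matches exactly one rule. Dropping the Y from the last character class would make "imply" match two rules and generate "implys" as well.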

Although the English affix file does not do so, you can also
have a flag generate more than one variation on a root word.
For example, we could extend the English "R" flag as
follows:

flag *R:
E > R
E > RS

This flag would generate both "skater" and "skaters" from
"skate".
!!SEE ALSO

ispell(1)
----