View Source: ispell(5) - Waikato Linux Users Group

!!!ISPELL
NAME
DESCRIPTION
EXAMPLES
SEE ALSO
----
!!NAME


ispell - format of ispell dictionaries and affix files
!!DESCRIPTION


''Ispell''(1) requires two files to define the language
that it is spell-checking. The first file is a dictionary
containing words for the language, and the second is an
affix file that defines the meaning of the special flags in
the dictionary. The two files are combined by
''buildhash'' (see ispell(1)) and written to a
hash file which is not described here.


A raw ''ispell'' dictionary (either the main dictionary
or your own personal dictionary) contains a list of words,
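one per line. Each word may optionally be followed by a
slash (__/__) and one or more flags, which modify the root
word as explained below. Depending on the options with which
''ispell'' was built, case may or may not be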
significant in either the root word or the flags,
independently. Specifically, if the compile-time option
CAPITALIZATION is defined, case is significant in the root
word; if not, case is ignored in the root word. If the
compile-time option MASKBITS is set to a value of 32, case
is ignored in the flags; otherwise case is significant in
the flags. Contact your system administrator or
''ispell'' maintainer for more information (or use the
__-vv__ flag to find out). The dictionary should be
sorted with the __-f__ flag of sort(1) before the
hash file is built; this is done automatically by
munchlist(1), which is the normal way of building
dictionaries.


If the dictionary contains words that have string characters
(see the affix-file documentation below), they must be
written in the format given by the __defstringtype__
statement in the affix file. This will be the case for most
non-English languages. Be careful to use this format, rather
than that of your favorite formatter, when adding words to a
dictionary. (If you add words to your personal dictionary
during an ''ispell'' session, they will automatically be
converted to the correct format. This feature can be used to
convert an entire dictionary if necessary:)


     echo qqqqq > dummy.dict
     buildhash dummy.dict ''affix-file'' dummy.hash
     awk '{print "*" $0} END {print "#"}' ''old-dict-file'' \
     | ispell -a -T ''old-dict-string-type'' \
       -d ./dummy.hash -p ./''new-dict-file'' \
       > /dev/null
     rm dummy.*


The case of the root word controls the case of words
accepted by ''ispell'', as follows:


(1)


If the root word appears only in lower case (e.g.,
''bob''), it will be accepted in lower case, capitalized,
or all capitals.


(2)


If the root word appears capitalized (e.g., ''Robert''),
it will not be accepted in all-lower case, but will be
accepted capitalized or all in capitals.


(3)


If the root word appears all in capitals (e.g.,
''UNIX''), it will only be accepted all in
capitals.


(4)


If the root word appears with a mixed capitalization
(e.g., ''ITCorp''), a word will be
accepted only if it follows that capitalization, or if it
appears all in capitals.


(5)


More than one capitalization of a root word may appear in
the dictionary. Flags from different capitalizations are
combined by OR-ing them together.


Redundant capitalizations (e.g., ''bob'' and ''Bob'')
will be combined by ''buildhash'' and by ''ispell''
(for personal dictionaries), and can be removed from a raw
dictionary by ''munchlist''.


For example, the dictionary:


bob
Robert
UNIX
ITcorp
ITCorp


will accept ''bob'', ''Bob'', ''BOB'',
''Robert'', ''ROBERT'', ''UNIX'', ''ITcorp'',
''ITCorp'', and ''ITCORP'', and will reject all
others. Some of the unacceptable forms are ''bOb'',
''robert'', ''Unix'', and ''!ItCorp''.
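

The capitalization rules above can be sketched in a few lines of Python. This is only an illustration of rules (1) through (4) for plain ASCII words, not the code ''ispell'' actually uses; the function name ''is_accepted'' is hypothetical.

```python
def is_accepted(word, roots):
    """Check `word` against raw dictionary roots using rules (1)-(4)."""
    for root in roots:
        if word.lower() != root.lower():
            continue                      # different word altogether
        if word.isupper():
            return True                   # all capitals is always accepted
        if root.islower():
            # Rule 1: lower-case root accepts lower case or capitalized.
            if word in (root, root.capitalize()):
                return True
        elif root.isupper():
            # Rule 3: all-capitals root accepts only all capitals (handled above).
            continue
        elif word == root:
            # Rules 2 and 4: capitalized or mixed-case root must match exactly.
            return True
    return False


roots = ["bob", "Robert", "UNIX", "ITcorp", "ITCorp"]
for candidate in ["Bob", "ROBERT", "ITCORP", "bOb", "robert", "Unix", "ItCorp"]:
    print(candidate, is_accepted(candidate, roots))
```

With the example dictionary, the first three candidates are accepted and the last four are rejected, matching the lists given above.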


As mentioned above, root words in any dictionary may be
extended by flags. Each flag is a single alphabetic
character, which represents a prefix or suffix that may be
added to the root to form a new word. For example, in an
English dictionary the __D__ flag can be added to
''bathe'' to make ''bathed''. Since flags are
represented as a single bit in the hashed dictionary, this
results in significant space savings. The ''munchlist''
script will reduce an existing raw dictionary by adding
flags when possible.
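

As a rough sketch of why flags are cheap to store, the following Python fragment parses a raw dictionary line into a root and a bit mask. The bit layout (bit 0 for A, bit 1 for B, and so on) and the name ''parse_entry'' are assumptions made for illustration; the real hash-file layout is not described here.

```python
def parse_entry(line, flagmarker="/"):
    """Split a raw dictionary line into (root, flag bit mask)."""
    root, *flag_fields = line.strip().split(flagmarker)
    mask = 0
    for field in flag_fields:
        for flag in field:
            # Assumed layout: one bit per uppercase letter flag.
            mask |= 1 << (ord(flag.upper()) - ord("A"))
    return root, mask


root, mask = parse_entry("bathe/D")
print(root, bin(mask))

# Flags from several entries for the same root combine by OR-ing the masks.
combined = parse_entry("bathe/D")[1] | parse_entry("bathe/S")[1]
print(bin(combined))
```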


When a word is extended with an affix, the affix will be
accepted only if it appears in the same case as the initial
(prefix) or final (suffix) letter of the word. Thus, for
example, the entry ''UNIX/M'' in the main dictionary
(__M__ means add an apostrophe and an ''s'' to make a
possessive) would accept ''UNIX'S'' but would
reject ''UNIX's''. If ''UNIX's'' is legal, it must
appear as a separate dictionary entry, and it will not be
combined by ''munchlist''. (In general, you don't need to
worry about these things; ''munchlist'' guarantees that
its output dictionary will accept the same set of words as
its input, so all you have to do is add words to the
dictionary and occasionally run ''munchlist'' to reduce its
size.)


As mentioned, the affix definition file describes the
affixes associated with particular flags. It also describes
the character set used by the language.


Although the affix-definition grammar is designed for a
line-oriented layout, it is actually a free-format yacc
grammar and can be laid out weirdly if you want. Comments
are started by a pound (sharp) sign (#), and continue to the
end of the line. Backslashes are supported in the usual
fashion (__\__''nnn'', plus specials __\n__,
__\r__, __\t__, __\v__, __\f__, __\b__, and the
new hex format __\x__''nn''). Any character with
special meaning to the parser can be changed to an
uninterpreted token by backslashing it; for example, you can
declare a flag named 'asterisk' or 'colon' with ''flag
\*:'' or ''flag \::''.


The grammar will be presented in a top-down fashion, with
discussion of each element. An affix-definition file must
contain exactly one table:


''table''     :    [[''headers''] [[''prefixes''] [[''suffixes'']


At least one of ''prefixes'' and ''suffixes'' is
required. They can appear in either order.


''headers''   :    [[ ''options'' ] ''char-sets''


The headers describe options global to this dictionary and
language. These include the character sets to be used and
the formatter, and the defaults for certain ''ispell''
flags.


''options'' : { ''fmtr-stmt'' | ''opt-stmt'' | ''flag-stmt'' | ''num-stmt'' }


The options statements define the defaults for certain
ispell flags and for the character sets used by the
formatters.


''fmtr-stmt'' :    { ''nroff-stmt'' | ''tex-stmt'' }


A ''fmtr-stmt'' describes characters that have special
meaning to a formatter. Normally, this statement is not
necessary, but some languages may have preempted the usual
defaults for use as language-specific characters. In this
case, these statements may be used to redefine the special
characters expected by the formatter.


''nroff-stmt''     :    { __nroffchars__ | __troffchars__ } ''string''


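The __nroffchars__ statement allows redefinition of
certain ''nroff'' control characters. The string given
must be exactly five characters long, and must list
substitutions for the left and right parentheses
(__()__), the period (__.__), the backslash (__\__), and
the asterisk (__*__), in that order. For example, the
statement:


__nroffchars__ {}.\\*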


would replace the left and right parentheses with left and
right curly braces for purposes of parsing
''nroff''/''troff'' strings, with no effect on the
others (admittedly a contrived example). Note that the
backslash is escaped with a backslash.


''tex-stmt''  :    { __!TeXchars__ | __texchars__ } ''string''


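The __!TeXchars__ statement allows redefinition of certain
TeX/LaTeX control characters. The string given must be
exactly thirteen characters long, and must list
substitutions for the left and right parentheses
(__()__), the left and right square brackets (__[[]__),
the left and right curly braces (__{}__), the left and
right angle brackets (__<>__), the backslash (__\__), the
dollar sign (__$__), the asterisk (__*__), the period
(__.__), and the percent sign (__%__). For example, the
statement:


__texchars__ ()\[[]<\><\>\\$*.%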


would replace the functions of the left and right curly
braces with the left and right angle brackets for purposes
of parsing TeX/LaTeX constructs, while retaining their
functions for the ''tib'' bibliographic preprocessor.
Note that the backslash, the left square bracket, and the
right angle bracket must be escaped with a
backslash.


''opt-stmt''  :    { ''cmpnd-stmt'' | ''aff-stmt'' }
''cmpnd-stmt''     : __   compoundwords__ ''compound-opt''
''aff-stmt''       : __   allaffixes__ ''on-or-off''
''on-or-off'' :    { __on__ | __off__ }
''compound-opt'' : { ''on-or-off'' | __controlled__ ''character'' }


An ''opt-stmt'' controls certain ''ispell'' defaults that are
best made language-specific. The __allaffixes__ statement
controls the default for the __-P__ and __-m__ options
to ''ispell''. If __allaffixes__ is turned __off__
(the default), ''ispell'' will default to the behavior of
the __-P__ flag: root/affix suggestions will only be made
if there are no near misses. If
__allaffixes__ is turned __on__, ''ispell'' will
default to the behavior of the __-m__ flag: root/affix
suggestions will always be made. The __compoundwords__
statement controls the default for the __-B__ and
__-C__ options to ''ispell''. If __compoundwords__
is turned __off__ (the default), ''ispell'' will
default to the behavior of the __-B__ flag: run-together
words will be reported as errors. If __compoundwords__ is
turned __on__, ''ispell'' will default to the behavior
of the __-C__ flag: run-together words will be considered
as compounds if both are in the dictionary. This is useful
for languages such as German and Norwegian, which form large
numbers of compound words. Finally, if __compoundwords__
is set to __controlled__, only words marked with the flag
indicated by ''character'' (which should not be otherwise
used) will be allowed to participate in compound formation.
Because this option requires the flags to be specified in
the dictionary, it is not available from the command
line.


''flag-stmt'' : __   flagmarker__ ''character''


The __flagmarker__ statement describes the character
which is used to separate affix flags from the root word in
a raw dictionary file. This must be a character which is not
found in any word (including in string characters; see
below). The default is __/__.


''num-stmt''  : __   compoundmin__ ''digit''


The __compoundmin__ statement controls the length of the
two components of a compound word. This only has an effect
if __compoundwords__ is turned __on__ or if the
__-C__ flag is given to ''ispell''. In that case, only
words at least as long as the given minimum will be accepted
as components of a compound. The default is 3
characters.
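

The interaction between compound formation and the minimum component length can be illustrated with a small Python sketch. It assumes a two-part compound and a dictionary held as a plain set; ''is_compound'' is a hypothetical helper, not part of ''ispell''.

```python
def is_compound(word, dictionary, compoundmin=3):
    """True if `word` splits into two dictionary words, each at least
    `compoundmin` characters long (3 is the documented default)."""
    for i in range(compoundmin, len(word) - compoundmin + 1):
        if word[:i] in dictionary and word[i:] in dictionary:
            return True
    return False


words = {"a", "dog", "house"}
print(is_compound("doghouse", words))   # both halves are long enough
print(is_compound("ahouse", words))     # "a" is shorter than the minimum
```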


''char-sets'' : ''   norm-sets'' [[ ''alt-sets'' ]


The character-set section describes the characters that can
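be part of a word, and defines their collating order. There
must always be a definition of the normal character sets
(''norm-sets''); in addition, there may be one or more
definitions of alternate sets (''alt-sets''), which are used
with various text formatters.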


''norm-sets'' :    [[ ''deftype'' ] charset-group


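A normal character set may optionally begin with a
''deftype'' declaration naming the file suffixes that make
use of this set. Following this are one or more
character-set declarations.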


''deftype'' : __defstringtype__ ''name deformatter suffix''*


The __defstringtype__ declaration gives a list of file
suffixes which should make use of the default string
characters defined as part of the base character set; it is
only necessary if string characters are being defined. The
''name'' parameter is a string giving the unique name
associated with these suffixes; often it is a formatter
name. If the formatter is a member of the troff family,
the name __nroff__ should be used; members of the TeX
family should use __tex__. Names should be kept simple,
because they are given to the __-T__ switch of ''ispell''
to specify a formatter
type. The ''deformatter'' parameter specifies the
deformatting style to use when processing files with the
given suffixes. Currently, this must be either __tex__ or
__nroff__. The ''suffix'' parameters are a
whitespace-separated list of strings which, if present at
the end of a filename, indicate that the associated set of
string characters should be used by default for this file.
For example, the suffix list for the troff family typically
includes suffixes such as ''.ms'', ''.me'', and ''.mm''.


''charset-group'' :     { ''char-stmt'' | ''string-stmt'' | ''dup-stmt''}*


A ''char-stmt'' describes single characters; a
''string-stmt'' describes characters that must appear
together as a string, and which usually represent a single
character in the target language. Either may also describe
conversion between upper and lower case. A ''dup-stmt''
is used to describe alternate forms of string characters, so
that a single dictionary may be used with several formatting
programs that use different conventions for representing
non-ASCII characters.


''char-stmt'' : __   wordchars__ ''character-range''
          | __   wordchars__ ''lowercase-range uppercase-range''
          | __   boundarychars__ ''character-range''
          | __   boundarychars__ ''lowercase-range uppercase-range''
''string-stmt''    : __   stringchar__ ''string''
          | __   stringchar__ ''lowercase-string uppercase-string''


Characters described with the __boundarychars__ statement
are considered part of a word only if they appear singly,
embedded between characters declared with the
__wordchars__ or __stringchar__ statements. For
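example, if the hyphen is a boundary character (useful in
French), the string ''foo-bar'' would be a single word, but
''-foo'' would be the same as ''foo'', and ''foo--bar''
would be two words separated by non-word characters.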
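

The rule that a boundary character counts only when it appears singly between word characters can be approximated with a regular expression. The Python sketch below assumes ASCII letters as the __wordchars__ set and the hyphen as the only boundary character; it is an illustration, not the tokenizer ''ispell'' uses.

```python
import re

# One or more letters, optionally followed by groups of
# "single hyphen + letters"; a hyphen is never first, last, or doubled.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def split_words(text):
    """Return the words found in `text` under the boundary-character rule."""
    return WORD.findall(text)


print(split_words("foo-bar"))    # hyphen embedded singly: one word
print(split_words("-foo"))       # leading hyphen is not part of the word
print(split_words("foo--bar"))   # doubled hyphen separates two words
```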


If two ranges or strings are given in a ''char-stmt'' or
''string-stmt'', the first describes characters that are
interpreted as lowercase and the second describes uppercase.
In the case of a __stringchar__ statement, the two
strings must be of the same length. Also, in a
__stringchar__ statement, the actual strings may contain
both uppercase and lowercase characters themselves without
difficulty; for instance, the statement


stringchar     "\\*(sS"     "\\*(Ss"


is legal and will not interfere with (or be interfered with
by) other declarations of ''s'' and ''S'' as lowercase and
uppercase, respectively.


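A final note on string characters: some languages collate
certain special characters as if they were strings. For
example, the German a-umlaut is traditionally sorted as if
it were the two-letter sequence ''ae''. ''Ispell'' is not
capable of this; each character must be treated as an
individual entity, so in certain cases ''ispell'' will sort
a list of words into a different order than the standard
dictionary order for the target language.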


''alt-sets''  : ''   alttype'' [[ ''alt-stmt''* ]


Because different formatters use different notations to
represent non-ASCII characters, ''ispell'' must be aware
of the representations used by these formatters. These are
declared as alternate sets of string
characters.


''alttype''   : __   altstringtype__ ''name suffix''*


The __altstringtype__ statement introduces each set by
declaring the associated formatter name and filename suffix
list. This name and list are interpreted exactly as in the
__defstringtype__ statement above. Following this header
are one or more ''alt-stmt''s which declare the alternate
string characters used by this formatter.


''alt-stmt''       : __   altstringchar__ ''alt-string std-string''


The __altstringchar__ statement describes alternate
representations for string characters. For example, the -mm
macro package of ''troff'' represents the German
a-umlaut by the sequence ''a\*:'', while ''TeX'' uses
the sequence ''\"a''. If the ''troff'' versions
are declared as the standard versions using
__stringchar__, the ''TeX'' versions may be declared
as alternates by using the statement


altstringchar  \\\"a  a\\*:


When the __altstringchar__ statement is used to specify
alternate forms, all forms for a particular formatter must
be declared together as a group. Also, each formatter or
macro package must provide a complete set of characters,
both upper- and lower-case, and the character sequences used
for each formatter must be completely distinct. Character
sequences which describe upper- and lower-case versions of
the same printable character must also be the same length.
It may be necessary to define some new macros for a given
formatter to satisfy these restrictions. (The current
version of ''buildhash'' does not enforce these
restrictions, but failure to obey them may result in errors
being introduced into files that are processed with
''ispell''.)


An important minor point is that ''ispell'' assumes that
all characters declared as __wordchars__ or
__boundarychars__ will occupy exactly one position on the
terminal screen.


A single character-set statement can declare either a single
character or a contiguous range of characters. A range is
given as in egrep and the shell: [[a-z] means lowercase
alphabetics; [[^a-z] means all but lowercase, etc. All
character-set statements are combined (unioned) to produce
the final list of characters that may be part of a word. The
collating order of the characters is defined by the order of
their declaration; if a range is used, the characters are
considered to have been declared in ASCII order. Characters
that have case are collated next to each other, with the
uppercase character first.


The character-declaration statements have a rather strange
behavior caused by the need to match each lowercase
character with its uppercase equivalent. In any given
__wordchars__ or __boundarychars__ statement, the
characters in each range are first sorted into ASCII
collating sequence, then matched one-for-one with the other
range. (The two ranges must have the same number of
characters.) Thus, for example, the two
statements:


__wordchars__ [[aeiou] [[AEIOU]
__wordchars__ [[aeiou] [[UOIEA]


would produce exactly the same effect. To get the vowels to
match up "wrong", you would have to use separate statements:


__wordchars__ a U
__wordchars__ e O
__wordchars__ i I
__wordchars__ o E
__wordchars__ u A


which would cause uppercase 'e' to be 'O', and lowercase 'O'
to be 'e'. This should normally be a problem only with
languages which have been forced to use a strange ASCII
collating sequence. If your uppercase and lowercase letters
both collate in the same order, you shouldn't have to worry
about this.
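

The sort-then-match behavior described above is easy to reproduce. The following Python sketch pairs the characters of two ranges the way the text describes; ''pair_cases'' is a hypothetical name, and the sketch ignores everything else a __wordchars__ statement does.

```python
def pair_cases(lower_range, upper_range):
    """Sort both ranges into ASCII order, then match them one-for-one."""
    if len(lower_range) != len(upper_range):
        raise ValueError("the two ranges must have the same number of characters")
    return dict(zip(sorted(lower_range), sorted(upper_range)))


# Both statements from the example yield the same pairing.
print(pair_cases("aeiou", "AEIOU"))
print(pair_cases("aeiou", "UOIEA"))

# Separate one-character statements are needed to force a different pairing.
forced = {}
for lower, upper in [("a", "U"), ("e", "O"), ("i", "I"), ("o", "E"), ("u", "A")]:
    forced.update(pair_cases(lower, upper))
print(forced)
```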


The prefixes and suffixes sections have exactly the same
syntax, except for the introductory keyword.


''prefixes''  : __   prefixes__ ''flagdef''*
''suffixes''  : __   suffixes__ ''flagdef''*
''flagdef''   : __   flag__ [[__*__|__~__] ''char'' __:__ ''repl''*


A prefix or suffix table consists of an introductory keyword
and a list of flag definitions. Flags can be defined more
than once, in which case the definitions are combined. Each
flag controls one or more ''repl''s (replacements) which
are conditionally applied to the beginnings or endings of
various words.


Flags are named by a single character ''char''. Depending
on a configuration option, this character can be either any
uppercase letter (the default configuration) or any 7-bit
ASCII character. Most languages should be able to get along
with just 26 flags.


A flag character may be prefixed with one or more option
characters. (If you wish to use one of the option characters
as a flag character, simply enclose it in double
quotes.)


The asterisk (__*__) option means that this flag
participates in ''cross-product'' formation. This only
matters if the file contains both prefix and suffix tables.
If so, all prefixes and suffixes marked with an asterisk
will be applied in all cross-combinations to the root word.
For example, consider the root ''fix'' with prefixes
''pre'' and ''in'', and suffixes ''es'' and
''ed''. If all flags controlling these prefixes and
suffixes are marked with an asterisk, then the single root
''fix'' would also generate ''prefix'',
''prefixes'', ''prefixed'', ''infix'',
''infixes'', ''infixed'', ''fix'', ''fixes'',
and ''fixed''. Cross-product formation can produce a
large number of words quickly, some of which may be illegal,
so watch out. If cross-products produce illegal words,
''munchlist'' will not produce those flag combinations,
and the flag will not be useful.
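

Cross-product formation amounts to combining every optional prefix with every optional suffix. A minimal Python sketch of the ''fix'' example follows; it simply concatenates strings and ignores conditions and strip-strings.

```python
from itertools import product


def cross_product(root, prefixes, suffixes):
    """All words formed from `root` with an optional prefix and an optional suffix."""
    prefix_choices = [""] + list(prefixes)
    suffix_choices = [""] + list(suffixes)
    return sorted(p + root + s for p, s in product(prefix_choices, suffix_choices))


forms = cross_product("fix", ["pre", "in"], ["es", "ed"])
print(len(forms), forms)
```

Two prefixes and two suffixes already yield nine forms from one root, which is why the number of generated words grows so quickly.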


''repl'' : ''   condition''* __>__ [[ __-__ ''strip-string'' __,__ ] ''append-string''


The __~__ option specifies that the associated flag is
only active when a compound word is being formed. This is
useful in a language like German, where the form of a word
sometimes changes inside a compound.


A ''repl'' is a conditional rule for modifying a root
word. Up to 8 ''conditions'' may be specified. If the
''conditions'' are satisfied, the rules on the right-hand
side of the ''repl'' are applied, as
follows:


(1)


If a strip-string is given, it is first stripped from the
beginning or ending (as appropriate) of the root
word.


(2)


Then the append-string is added at that point.


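For example, the ''condition'' __.__ means ''any word'',
and the ''condition'' __Y__ means ''any word ending in
Y''. The following (suffix) replacements:


.     >     MENT
Y     >     -Y,IES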


would change ''induce'' to ''inducement'' and
''fly'' to ''flies''. (If they were controlled by the
same flag, they would also change ''fly'' to
''flyment'', which might not be what was wanted.
''Munchlist'' can be used to protect against this sort of
problem; see the command sequence given below.)
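

The two steps of a suffix ''repl'' (check the conditions, strip, then append) can be sketched in Python as follows. The representation of a rule as a list of condition strings plus a strip-string and an append-string is an assumption made for this example, and ''apply_suffix'' is a hypothetical name.

```python
def condition_matches(cond, ch):
    """Match one condition (".", a character, or a [..]/[^..] range) against `ch`."""
    ch = ch.upper()
    if cond == ".":
        return True
    if cond.startswith("[^"):
        return ch not in cond[2:-1]
    if cond.startswith("["):
        return ch in cond[1:-1]
    return ch == cond


def apply_suffix(word, conditions, strip, append):
    """Return the modified word, or None if the rule does not apply."""
    if conditions != ["."]:                 # a lone dot applies to any word
        if len(word) < len(conditions):
            return None                     # root shorter than the conditions
        tail = word[len(word) - len(conditions):]
        if not all(condition_matches(c, ch) for c, ch in zip(conditions, tail)):
            return None
    if strip:
        if not word.upper().endswith(strip):
            return None
        word = word[:-len(strip)]           # step 1: strip from the ending
    return word + append.lower()            # step 2: append at that point


print(apply_suffix("induce", ["."], "", "MENT"))
print(apply_suffix("fly", ["Y"], "Y", "IES"))
print(apply_suffix("fly", ["."], "", "MENT"))
```

The last call shows the unwanted ''flyment'' that results when both replacements are controlled by the same flag.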


No matter how much you might wish it, the strings on the
right must be strings of specific characters, not ranges.
The reasons are rooted deeply in the way ''ispell''
works, and it would be difficult or impossible to provide
for more flexibility. For example, you might wish to
write:


[[EY]     >     -[[EY],IES


This will not work. Instead, you must use two separate
rules:


E         >     -E,IES
Y         >     -Y,IES


The application of ''repl''s can be restricted to certain
words with ''conditions'':


''condition'' :    { __.__ | ''character'' | ''range'' }


A ''condition'' is a restriction on the characters that
adjoin, and/or are replaced by, the right-hand side of the
''repl''. Up to 8 ''conditions'' may be given, which
should be enough context for anyone. The right-hand side
will be applied only if the ''conditions'' in the
''repl'' are satisfied. The ''conditions'' also
implicitly define a length; roots shorter than the number of
''conditions'' will not pass the test. (As a special
case, a ''condition'' of a single dot (__.__) defines a
length of zero, so that the rule applies to all words
indiscriminately.)


''Conditions'' that are single characters should be
separated by white space. For example, to specify words
ending in ''ED'', write:


E D


If you write:


ED


the effect will be the same as:


[[ED]


As a final minor, but important point, it is sometimes
useful to rebuild a dictionary file using an incompatible
suffix file. For example, suppose you expanded the
__R__ flag to generate ''er'' and ''ers''. To build a new
dictionary ''newdict'' that, using
''newaffixes'', will accept exactly the same list of
words as the old list ''olddict'' did using
''oldaffixes'', the __-c__ switch of ''munchlist''
is useful, as in the following example:


$ munchlist -c oldaffixes -l newaffixes olddict > newdict


If you use this procedure, your new dictionary will always
accept the same list the original did, even if you badly
screwed up the affix file. This is because ''munchlist''
compares the words generated by a flag with the original
word list, and refuses to use any flags that generate
illegal words. (But don't forget that the ''munchlist''
step takes a long time and eats up temporary file
space.)
!!EXAMPLES


As an example of conditional suffixes, here is the
specification of the __S__ flag from the English affix
file:


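flag *S:
[[^AEIOU]Y     >     -Y,IES     # As in imply > implies
[[AEIOU]Y      >     S          # As in convey > conveys
[[SXZH]        >     ES         # As in fix > fixes
[[^SXZHY]      >     S          # As in bat > bats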

The first line applies to words ending in Y, but not in
vowel-Y. The second takes care of the vowel-Y words. The
third then handles those words that end in a sibilant or
near-sibilant, and the last picks up everything
else.


Note that the ''conditions'' are written very carefully
so that they apply to disjoint sets of words. In particular,
note that the fourth line excludes words ending in Y as well
as the obvious SXZH. Otherwise, it would convert
''imply'' into ''implys''.
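

The disjointness of the four __S__ rules can be checked mechanically. The Python sketch below writes the rules out as they are described in the text (Y after a non-vowel, Y after a vowel, a sibilant ending, everything else) and collects every form each word would produce; the tuple representation and the name ''s_forms'' are assumptions made for this example.

```python
import re

# (condition as a regex on the word ending, strip-string, append-string)
S_RULES = [
    (r"[^AEIOU]Y$", "Y", "IES"),   # imply  -> implies
    (r"[AEIOU]Y$",  "",  "S"),     # convey -> conveys
    (r"[SXZH]$",    "",  "ES"),    # fix    -> fixes
    (r"[^SXZHY]$",  "",  "S"),     # bat    -> bats
]


def s_forms(word):
    """Return every form the S flag would generate for `word`."""
    upper = word.upper()
    forms = []
    for pattern, strip, append in S_RULES:
        if re.search(pattern, upper):
            stem = upper[:len(upper) - len(strip)]
            forms.append((stem + append).lower())
    return forms


for word in ["imply", "convey", "fix", "bat"]:
    print(word, s_forms(word))
```

Each sample word matches exactly one rule. Dropping the Y from the fourth condition would make ''imply'' match two rules and produce ''implys'' as well.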


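Although the English affix file does not do so, you can also
have a flag generate more than one variation on a root word.
For example, we could extend the English __R__ flag as
follows:


flag *R:
E     >     R      # As in skate > skater
E     >     RS     # As in skate > skaters


This flag would generate both ''skater'' and ''skaters''
from ''skate''.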
!!SEE ALSO


ispell(1)
----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.
Last edited on Tuesday, June 4, 2002 12:30:37 am by "perry"