Annotated edit history of dictd(8) version 2, including all changes.
Rev Author # Line
1 perry 1 DICTD
2 !!!DICTD
3 NAME
4 SYNOPSIS
5 DESCRIPTION
6 BACKGROUND
7 OPTIONS
8 CONFIGURATION FILE
9 DETERMINATION OF ACCESS LEVEL
10 SEARCH ALGORITHMS
11 DATABASE FORMAT
12 ACKNOWLEDGEMENTS
13 COPYING
14 BUGS
15 FILES
16 SEE ALSO
17 ----
18 !!NAME
19
20
21 dictd - a dictionary database server
22 !!SYNOPSIS
23
24
25 __dictd__ ''[[options]
26 ''
27 !!DESCRIPTION
28
29
30 __dictd__ is a server for the Dictionary Server Protocol
31 (DICT), a TCP transaction based query/response protocol that
32 allows a client to access dictionary definitions from a set
33 of natural language dictionary databases.
34
35
36 For security reasons, dictd drops root permissions after
37 startup. If user __dictd__ exists on the system, the
38 daemon will run as that user, group __dictd__; otherwise
39 it will run as user __nobody__, group
40 __nogroup__.
41
42
43 Since startup time is significant, the server is designed to
44 run continuously, and should ''not'' be run from
45 inetd(8). (However, with a fast processor, it is
46 feasible to do so.)
47
48
49 Databases are distributed separately from the
50 server.
51 !!BACKGROUND
52
53
54 For many years, the Internet community has relied on the "webster" protocol for access to natural language definitions.
55
56
57 Fortunately, several freely-distributable dictionaries and
58 lexicons have recently become available on the Internet.
59 However, these freely-distributable databases are not
60 accessible via a uniform interface, and are not accessible
61 from a single site. They are often small and incomplete
62 individually, but would collectively provide an interesting
63 and useful database of English words. Examples include the
2 perry 64 Jargon file, the !WordNet database, MICRA's version of the
1 perry 65 1913 Webster's Revised Unabridged Dictionary, and the Free
66 Online Dictionary of Computing. (See the DICT protocol
67 specification (RFC) for references.) Translating and
68 non-English dictionaries are also becoming available (for
69 example, the FOLDOC dictionary is being translated into
70 Spanish).
71
72
73 The webster protocol is not suitable for providing access to
74 a large number of separate dictionary databases, and
75 extensions to the current webster protocol were not felt to
76 be a clean solution to the dictionary database
77 problem.
78
79
80 The DICT protocol is designed to provide access to multiple
81 databases. Word definitions can be requested, the word index
82 can be searched (using an easily extended set of
83 algorithms), information about the server can be provided
84 (e.g., which index search strategies are supported, or which
85 databases are available), and information about a database
86 can be provided (e.g., copyright, citation, or distribution
87 information). Further, the DICT protocol has hooks that can
88 be used to restrict access to some or all of the
89 databases.
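As a rough illustration of the query/response exchange described above, the following Python sketch sends a DEFINE command and collects the definitions returned. The status codes (150, 151, 250, 552), the dot-terminated text blocks, and the doubling of a leading period come from the DICT protocol specification (RFC 2229). The function names, the default host localhost, and the minimal error handling are assumptions made for this example and are not part of dictd.

```python
import io
import socket


def read_text_block(stream):
    """Read one dot-terminated text block, undoing the period doubling."""
    lines = []
    while True:
        raw = stream.readline()
        if not raw:
            break  # connection closed early
        line = raw.decode("utf-8").rstrip("\r\n")
        if line == ".":
            break  # a lone period ends the block
        if line.startswith(".."):
            line = line[1:]  # a doubled leading period stands for one
        lines.append(line)
    return lines


def parse_define_response(stream):
    """Return a list of definitions, each a list of text lines."""
    definitions = []
    status = stream.readline().decode("utf-8")
    if not status.startswith("150"):
        return definitions  # e.g. 552 (no match)
    while True:
        status = stream.readline().decode("utf-8")
        if status.startswith("151"):
            definitions.append(read_text_block(stream))
        else:
            break  # normally 250 (ok)
    return definitions


def lookup(word, host="localhost", port=2628, database="*"):
    """Hypothetical helper: ask a DICT server to define one word."""
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rwb")
        stream.readline()  # discard the 220 banner
        stream.write(f'DEFINE {database} "{word}"\r\n'.encode("utf-8"))
        stream.flush()
        result = parse_define_response(stream)
        stream.write(b"QUIT\r\n")
        stream.flush()
        return result


# A canned reply, so the parser can be tried without a running server.
canned = io.BytesIO(
    b'150 1 definitions retrieved\r\n'
    b'151 "penguin" wn "WordNet"\r\n'
    b'line one\r\n'
    b'..dotted\r\n'
    b'.\r\n'
    b'250 ok\r\n'
)
print(parse_define_response(canned))
```

Keeping the parser separate from the socket code means the response handling can be exercised against canned data, as in the last lines.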
90
91
92 dictd(8) is a server that implements the DICT
93 protocol. Bret Martin implemented another server, and
94 several people (including Bret and myself) have implemented
95 clients in a variety of languages.
96 !!OPTIONS
97
98
99 __-V__ or __--version__
100
101
102 Display version information.
103
104
105 __--license__
106
107
108 Display copyright and license information.
109
110
111 __-h__ or __--help__
112
113
114 Display help information.
115
116
117 __-v__ or __--verbose__ or __-d
118 verbose__
119
120
121 Be verbose.
122
123
124 __-c__ ''file'' or __--config__
125 ''file''
126
127
128 Specify configuration file. The default is
129 ''/etc/dictd.conf'', but may be changed in the
130 ''dictd.h'' file at compile time
131 (DICT_CONFIG_FILE).
132
133
134 __-p__ ''service'' or __--port__
135 ''service''
136
137
138 Specifies the port (e.g., 2628) or service (e.g., dict) for
139 connections. The default is 2628, as specified in the DICT
140 Protocol RFC, but may be changed in the ''dictd.h'' file
141 at compile time (DICT_DEFAULT_SERVICE).
142
143
144 __-i__ or __--inetd__
145
146
147 Communicate on standard input/output, suitable for use from
148 inetd. Because of its rather long startup time, this
149 daemon was not intended to run from inetd, although with a fast
150 processor it is feasible to do so.
151
152
153 __--depth__ ''length''
154
155
156 Specify the queue length for listen(2), that is, the
157 number of pending socket connections which are queued by the
158 operating system. Some operating systems may silently limit
159 this value to 5 (older BSD systems) or 128 (Linux). The
160 default is 10 but may be changed in the ''dictd.h'' file
161 at compile time (DICT_QUEUE_DEPTH).
162
163
164 __--delay__ ''seconds''
165
166
167 Specifies the number of seconds a client may be idle before
168 the server will close the connection. Idle time is defined
169 to be the time the server is waiting for input and does not
170 include the time the server spends searching the database.
171 Connections are closed without warning since no provision
172 for premature connection termination is specified in the
173 DICT protocol RFC. The default is 600 seconds (10 minutes),
174 but may be changed in the ''dictd.h'' file at compile
175 time (DICT_DEFAULT_DELAY).
176
177
178 __--facility__ ''facility''
179
180
181 Specifies the syslog facility to use. The use of this option
182 sets the -s option. The available facilities are those
183 listed in ''syslog.conf(5)''. (Note that keywords such as
184 __local1__ are used, not the variables such as
185 __LOG_LOCAL1__ described in ''syslog(3)''.) The
186 default facility is __user__.
187
188
189 The default syslog configuration adds all logs to
190 /var/log/syslog. Refer to ''syslog.conf(5)'' if you wish
191 to assign a log file name for a previously unused facility,
192 or if you desire to avoid cluttering ''/var/log/syslog''
193 with dictd logging messages.
194
195
196 __-f__ or __--force__
197
198
199 Force the daemon to start even if an instance of the daemon
200 is already running. (This is of little value unless a
201 non-default port is specified with -p, since, if one
202 instance is bound to a port, the second one fails when it
203 cannot bind to the port.)
204
205
206 __--limit__ ''children''
207
208
209 Specifies the number of daemons that may be running
210 simultaneously. Each daemon services a single connection. If
211 the limit is exceeded, a (serialized) connection will be
212 made by the server process, and a response code 420 (server
213 temporarily unavailable) will be sent to the client. This
214 parameter should be adjusted to prevent the server machine
215 from being overloaded by dict clients, but should not be set
216 so low that many clients are denied useful connections. The
217 default is 100, but may be changed in the ''dictd.h''
218 file at compile time (DICT_DAEMON_LIMIT).
219
220
221 __-l__ ''option'' or __--log__
222 ''option''
223
224
225 Specify a logging option. (This is effective only if logging
226 has been enabled with the -s or -L option.) Only one option
227 may be set with each invocation of this option; however,
228 multiple invocations of this option may be made in one dictd
229 command line. For instance:
230
231
232 __dictd -s --log__ ''stats'' __--log__ ''found''
233 __--log__ ''notfound''
234
235
236 is a valid command line, and sets three logging
237 options.
238
239
240 Some of the more verbose options are used primarily for
241 debugging the server code, and are not practical for normal
242 use.
243
244
245 __server__ Log server diagnostics. This is extremely
246 verbose.
247
248
249 __connect__
250
251
252 Log all connections.
253
254
255 __stats__
256
257
258 Log all child terminations.
259
260
261 __command__
262
263
264 Log all commands. This is extremely verbose.
265
266
267 __client__
268
269
270 Log results of CLIENT command.
271
272
273 __found__
274
275
276 Log all words found in the databases.
277
278
279 __notfound__
280
281
282 Log all words not found in the databases.
283
284
285 __timestamp__
286
287
288 When logging to a file, use a full timestamp like that which
289 syslog would produce. Otherwise, no timestamp is made,
290 making the files shorter.
291
292
293 __host__
294
295
296 Log name of foreign host.
297
298
299 __min__
300
301
302 Set the following options: found, notfound, stats, and
303 client. If logging is activated (to a file, or via syslog),
304 and no options are set, then this minimal set of options
305 will be used.
306
307
308 __all__
309
310
311 Set all of the options.
312
313
314 __none__
315
316
317 Clear all of the options.
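The way repeated --log options accumulate can be pictured with a short Python sketch. The option names and the contents of the minimal set are taken from the list above. Processing the options strictly left to right, and falling back to the minimal set whenever the result is empty, are assumptions of this sketch and not a description of the dictd source.

```python
ALL_OPTIONS = {
    "server", "connect", "stats", "command", "client",
    "found", "notfound", "timestamp", "host",
}
MIN_OPTIONS = {"found", "notfound", "stats", "client"}


def resolve_log_options(args):
    """Fold a sequence of --log arguments into one set of options."""
    selected = set()
    for opt in args:
        if opt == "min":
            selected |= MIN_OPTIONS
        elif opt == "all":
            selected |= ALL_OPTIONS
        elif opt == "none":
            selected.clear()
        elif opt in ALL_OPTIONS:
            selected.add(opt)
        else:
            raise ValueError(f"unknown logging option: {opt}")
    # If nothing is set, the minimal set is used (per the "min" entry).
    return selected or set(MIN_OPTIONS)


# Mirrors: dictd -s --log stats --log found --log notfound
print(sorted(resolve_log_options(["stats", "found", "notfound"])))
```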
318
319
320 To facilitate location of interesting information in the log
321 file, entries are marked with initial letters indicating the
322 class of the line being logged:
323
324
325 __I__
326
327
328 Information about the server, connections, or termination
329 statistics. These lines are generally not designed to be
330 parsed automatically.
331
332
333 __E__
334
335
336 Error messages.
337
338
339 __C__
340
341
342 CLIENT command information.
343
344
345 __D__
346
347
348 Definitions found in the databases searched.
349
350
351 __M__
352
353
354 Matches found in the database searched.
355
356
357 __N__
358
359
360 Matches which were not found in the databases
361 searched.
362
363
364 __T__
365
366
367 Trace of exact line sent by client.
368
369
370 To preserve anonymity of the client, do ''not'' use the
371 __connect__ or __host__ options. Clients may or may
372 not send host information using the CLIENT command, but this
373 should be an option that is selectable on the client
374 side.
375
376
377 __-s__
378
379
380 Log using the syslog(3) facility.
381
382
383 __-L__ ''file'' or __--logfile__
384 ''file''
385
386
387 Specify the file for logging.
388
389
390 __NOTE:__ If dictd does not have write permission for
391 this file, it will silently fail.
392
393
394 __-m__ ''minutes'' or __--mark__
395 ''minutes''
396
397
398 How often a timestamp should be logged. (This is effective
399 only if logging has been enabled with the -s or -L
400 option.)
401
402
403 __-d__ ''option''
404
405
406 Activate a debugging option. There are several, all of which
407 are only useful to developers. They are documented here for
408 completeness. A list can be obtained interactively by using
409 __-d__ with an illegal option.
410
411
412 __verbose__
413
414
415 The same as __-v__ or __--verbose__. Adds verbosity to
416 other options.
417
418
419 __scan__
420
421
422 Debug the scanner for the configuration file.
423
424
425 __parse__
426
427
428 Debug the parser for the configuration file.
429
430
431 __search__
432
433
434 Debug the character folding and binary search
435 routines.
436
437
438 __init__
439
440
441 Report database initialization.
442
443
444 __port__
445
446
447 Log client-side port number to the log file.
448
449
450 __lev__
451
452
453 Debug Levenshtein search algorithm.
454
455
456 __auth__
457
458
459 Debug the authorization routines.
460
461
462 __nodetach__
463
464
465 Do not detach as a background process. Implies that a copy
466 of the log file will appear on the standard
467 output.
468
469
470 __nofork__
471
472
473 Do not fork daemons to service requests. Be a
474 single-threaded server. This option implies __nodetach__,
475 and is most useful for using a debugger to find the point at
476 which daemon processes are dumping core.
477
478
479 __alt__
480
481
482 Debugs __altcompare__ in ''index.c''.
483 !!CONFIGURATION FILE
484
485
486 The configuration file defaults to ''/etc/dictd.conf'',
487 but can be specified on the command line with the __-c__
488 option (see above). The configuration file has four distinct
489 sections. At this time, each section must appear in the
490 specified order, although only the Database section is
491 required.
492
493
494 __Syntax__
495
496
497 The following keywords are valid in a configuration file:
498 access, allow, deny, group, database, data, index, filter,
499 prefilter, postfilter, name, include, user, authonly, site.
500 Keywords are case sensitive. String arguments that contain
501 spaces should be surrounded by double quotes. Without
502 quoting, strings may contain alphanumeric characters and _,
503 -, ., and *, but not spaces. Strings must be on a single
504 line and cannot be continued between lines. Comments start
505 with # and extend to the end of the line.
506
507
508 __Access Specification__
509
510
511 Access specifications may occur in the Access Section or in
512 the Database Section. The access specification will be
513 described here.
514
515
516 For allow, deny, and authonly, a star (*) may be used as a
517 wildcard that matches any number of characters. A question
518 mark (?) may be used as a wildcard that matches a single
519 character. For example, 10.0.0.* and *.edu are valid
520 strings.
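The wildcard rules just described can be sketched in Python by translating a pattern into a regular expression. The helper names are hypothetical, and matching case-insensitively is an assumption of this sketch (domain names are normally compared that way); dictd's own matching code may differ.

```python
import re


def wildcard_to_regex(pattern):
    """Translate * (any run of characters) and ? (one character)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, text):
    """True if the whole of text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None


print(matches("10.0.0.*", "10.0.0.17"))
print(matches("*.edu", "cs.unc.edu"))
print(matches("10.0.0.?", "10.0.0.17"))
```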
521
522
523 The syntax is as follows:
524
525
526 __allow__ ''string''
527
528
529 The string specifies a domain name or IP address which is
530 allowed access to the server (in the Access Section) or to a
531 database (in the Database Section).
532
533
534 __deny__ ''string''
535
536
537 The string specifies a domain name or IP address which is
538 denied access to the server (in the Access Section) or to a
539 database (in the Database Section). Note that if reverse DNS
540 is not working, then only the IP number will be checked.
541 Therefore, it is essential to deny networks based on IP
542 number, since a denial based on domain name may not always
543 be checked.
544
545
546 __authonly__ ''string''
547
548
549 This form is only useful in the Access Section. The string
550 specifies a domain name or IP address which is allowed
551 access to the server but not to any of the databases. All
552 commands are valid except DEFINE, MATCH, and SHOW DB. More
553 specifically, AUTH is a valid command, but commands which
554 access the databases are not allowed.
555
556
557 __user__ ''string''
558
559
560 This form is only useful in the Database Section. The string
561 specifies a username that is allowed to access this database
562 after a successful AUTH command is executed.
563
564
565 __site__ ''string''
566
567
568 Used to specify the filename for the site information file,
569 a flat text file which will be displayed in response to the
570 SHOW SERVER command. This section, if present, must be
571 first.
572
573
574 __access {__ ''access specification''
575 __}__
576
577
578 This section, the second if the Site Section is present,
579 contains access restrictions for the server and all of the
580 databases collectively. Per-database control is specified in
581 the Database Section.
582
583
584 __database__ ''string'' __{__ ''database
585 specification'' __}__
586
587
588 This section is required. The string specifies the name of
589 the database (e.g., wn or web1913). The database
590 specification describes the database:
591
592
593 __NOTE__: If the files specified in the database
594 specification do not exist on the system, dictd will
595 silently fail.
596
597
598 __data__ ''string''
599
600
601 Specifies the filename for the flat text
602 database.
603
604
605 __index__ ''string''
606
607
608 Specifies the filename for the index file.
609
610
611 __prefilter__ ''string''
612
613
614 Specifies the prefilter command. When a chunk of the
615 compressed database is read, it will be filtered with this
616 filter before being decompressed. This may be used to
617 provide some additional compression that knows about the
618 data and can provide better compression than the LZ77
619 algorithm used by zlib.
620
621
622 __postfilter__ ''string''
623
624
625 Specifies the postfilter command. When a chunk of the
626 compressed database is read, it will be filtered with this
627 filter before the offset and length for the entry are used
628 to access data. This is provided for symmetry with the
629 prefilter command, and may also be useful for providing
630 additional database compression.
631
632
633 __filter__ ''string''
634
635
636 Specifies the filter command. After the entry is extracted
637 from the database, it will be filtered with this filter.
638 This may be used to provide formatting for the entry (e.g.,
639 for html). __Warning:__ This is not currently
640 implemented.
641
642
643 __name__ ''string''
644
645
646 Specifies the short name of the database (e.g.,
647 dictd.h'' file at compile time
648 (DICT_SHORT_ENTRY_NAME).
649
650
651 __access {__ ''access specification''
652 __}__
653
654
655 Used to restrict access to this particular
656 database.
657
658
659 __include__ ''filename''
660
661
662 The text of the file ''filename'' (usually a database
663 specification) will be read as if it appeared at this
664 location in the configuration file.
665
666
667 __Note for Debian Systems:__
668 On Debian Systems, a configuration script that creates a
669 database specification in /var/lib/dictd/db.list is run
670 whenever any dictionary database is installed or removed.
671 This makes it unnecessary for the user to edit the Database
672 section of the configuration file.
673
674
675 __user__ ''string'' ''string''
676
677
678 The first string specifies the username, and the second
679 string specifies the shared secret for this username. When
680 the AUTH command is used, the client will provide the
681 username and a hashed version of the shared secret. If the
682 shared secret matches, the user is said to have
683 authenticated, and will have access to databases whose
684 access specifications allow that user (by name, or by
685 wildcard). If present, this section must appear last in the
686 configuration file. There may be many user entries. The
687 shared secret should be kept secret, as anyone who has
688 access to it can access the shared databases (assuming
689 access is not denied by domain name).
690 !!DETERMINATION OF ACCESS LEVEL
691
692
693 When a client connects, the global access specification is
694 scanned, in order, until a specification matches. If no
695 access specification exists, all access is allowed (e.g.,
696 the action is the same as if
697
698
699 allow 10.42.* authonly *.edu deny *
700
701
702 With this specification, all clients in the 10.42 network
703 will be allowed access to unrestricted databases; all
704 clients from *.edu sites will be allowed to authenticate,
705 but will be denied access to all databases, even those which
706 are otherwise unrestricted; and all other clients will have
707 their connection terminated immediately. The 10.42 network
708 clients can send an AUTH command and gain access to
709 restricted databases. The *.edu clients must send an AUTH
710 command to gain access to any databases, restricted or
711 unrestricted.
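The first-match scan of the global access specification can be sketched in Python as follows. The rule list is the example above, written as (keyword, pattern) pairs. The function names are hypothetical, and returning None when a non-empty list has no matching entry is an assumption, since the text does not say what happens in that case.

```python
import re


def matches(pattern, text):
    """Wildcard match: * is any run of characters, ? is one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, re.IGNORECASE) is not None


def access_level(spec, hostname, ip):
    """Return the keyword of the first entry matching the client."""
    if not spec:
        return "allow"  # no access specification: all access allowed
    for keyword, pattern in spec:
        if matches(pattern, ip):
            return keyword
        if hostname is not None and matches(pattern, hostname):
            return keyword
    return None  # assumption: behaviour here is not described in the text


SPEC = [("allow", "10.42.*"), ("authonly", "*.edu"), ("deny", "*")]

print(access_level(SPEC, None, "10.42.3.9"))
print(access_level(SPEC, "cs.unc.edu", "152.2.1.1"))
print(access_level(SPEC, "example.com", "192.0.2.1"))
```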
712
713
714 When the AUTH command is sent, the access list for each
715 database is scanned, in order, just as the global access
716 list is scanned. However, after authentication, the client
717 has an associated username. For example, consider the
718 following access specification:
719
720
721 user u1 deny *.com user u2 allow *
722
723
724 If the client authenticated as u1, then the client will have
725 access to this database, even if the client comes from a
726 *.com site. In contrast, if the client authenticated as u2,
727 the client will only have access if it does not come from a
728 *.com site. In this case, the
729
730
731 __Warning:__ Checks are performed for domain names and
732 for IP addresses. However, if reverse DNS for a specific
733 site is not working, it is possible that a domain name may
734 not be available for checking. Make sure that all denials
735 use IP addresses. (And consider a future enhancement: if a
736 domain name is not available, should denials that depend on
737 a domain name match anything? This is the more conservative
738 viewpoint, but it is not currently
739 implemented.)
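The per-database scan, including the reverse DNS caveat in the warning above, can be sketched in Python. The rule list is the u1/u2 example from this section. The function names are hypothetical, and denying access when no entry matches is an assumption of this sketch.

```python
import re


def matches(pattern, text):
    """Wildcard match: * is any run of characters, ? is one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, re.IGNORECASE) is not None


def database_access(spec, username, hostname, ip):
    """Scan a database access list in order; the first match decides."""
    for keyword, pattern in spec:
        if keyword == "user":
            if username is not None and matches(pattern, username):
                return True
        elif matches(pattern, ip) or (
            hostname is not None and matches(pattern, hostname)
        ):
            return keyword == "allow"
    return False  # assumption: nothing matched, so no access


SPEC = [("user", "u1"), ("deny", "*.com"), ("user", "u2"), ("allow", "*")]

print(database_access(SPEC, "u1", "foo.com", "192.0.2.5"))
print(database_access(SPEC, "u2", "foo.com", "192.0.2.5"))
# Reverse DNS failed: the name-based denial cannot match, so u2 gets in.
print(database_access(SPEC, "u2", None, "192.0.2.5"))
```

The last call shows why denials should be written with IP addresses: when no domain name is available, a deny entry based on a name is simply skipped.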
740 !!SEARCH ALGORITHMS
741
742
743 The DICT standard specifies a few search algorithms that
744 must be implemented, and permits others to be supported on a
745 server-dependent basis. The following search strategies are
746 supported by this server. Note that ''all'' strategies
747 are case insensitive. Most ignore non-alphanumeric,
748 non-whitespace characters.
749
750
751 __exact__
752
753
754 An exact match. This algorithm uses a binary search and is
755 one of the fastest search algorithms available.
756
757
758 __prefix__
759
760
761 Prefix match. This algorithm also uses a binary search and
762 is very fast.
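The idea behind the exact and prefix strategies can be sketched in Python with a binary search over a sorted list of headwords. The in-memory list and the function names are assumptions for illustration; dictd itself searches its index file, and its character folding is more involved than the simple lowercasing used here.

```python
import bisect


def exact_match(headwords, word):
    """Binary search for word in a sorted, lowercased headword list."""
    key = word.lower()
    i = bisect.bisect_left(headwords, key)
    return i < len(headwords) and headwords[i] == key


def prefix_matches(headwords, prefix):
    """Binary search to the first candidate, then walk forward."""
    key = prefix.lower()
    i = bisect.bisect_left(headwords, key)
    found = []
    while i < len(headwords) and headwords[i].startswith(key):
        found.append(headwords[i])
        i += 1
    return found


HEADWORDS = sorted(["apple", "application", "apply", "banana", "band"])

print(exact_match(HEADWORDS, "Apple"))
print(prefix_matches(HEADWORDS, "app"))
```

Because all headwords sharing a prefix are adjacent in sorted order, one binary search finds the start of the run and the rest is a linear walk over the matches only.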
763
764
765 __substring__
766
767
768 Match a substring anywhere in the headword. This search
769 strategy uses a modified Boyer-Moore-Horspool algorithm.
770 Since it must search the whole index file, it is not as fast
771 as the exact and prefix matches.
772
773
774 __suffix__
775
776
777 Suffix match. This search strategy also uses a modified
778 Boyer-Moore-Horspool algorithm, and is as fast as the
779 substring search.
780
781
782 __re__
783
784
785 POSIX 1003.2 (modern) regular expression search. Modern
786 regular expressions are the ones used by egrep(1).
787 These regular expressions allow predefined character classes
788 (e.g., [[[[:alnum:]], [[[[:alpha:]], [[[[:digit:]], and
789 [[[[:xdigit:]] are useful for this application); uses * to
790 match a sequence of 0 or more matches of the previous atom;
791 uses + to match a sequence of 1 or more matches of the
792 previous atom; uses ? to match a sequence of 0 or 1 matches
793 of the previous atom; uses ^ to match the beginning of a
794 word, uses $ to match the end of a word, and allows nested
795 subexpression and alternation with () and |. For example,
796 __Warning:__
797 Regular expression matches can take 10 to 300 times longer
798 than substring matches. On a busy server, with many
799 databases, this can require more than 5 minutes of waiting
800 time, depending on the complexity of the regular
801 expression.
802
803
804 __regexp__
805
806
807 Old (basic) regular expressions. These regular expressions
808 don't support |, +, or ?. Groups use escaped parentheses.
809 While modern regular expressions are generally easier to
810 use, basic regular expressions have a back reference
811 feature. This can be used to match a second occurrence of
812 something that was already matched. For example, the
813 following expression finds all words that begin and end with
814 the same three letters:
815
816
817 ^\\(...\\).*\\1$
818 Note the use of the double backslashes to escape the special characters. This is required by the DICT protocol string specification (a single backslash quotes the next character -- we use two to get a single backslash through to the regular expression engine). __Warning:__ Note that the use of backtracking is even slower than the use of general regular expressions.
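The back reference idea can be tried out in Python. Note that Python's re module is not a POSIX basic regular expression engine: groups are written with plain parentheses, so the pattern below is an equivalent expression in Python syntax, not the string a DICT client would send. The last lines show the backslash doubling described above, applied to the basic form; the variable names are hypothetical.

```python
import re

# Same idea as the basic expression, in Python syntax: capture three
# characters at the start and require them again at the end.
SAME_THREE = re.compile(r"^(...).*\1$")


def begins_and_ends_alike(word):
    """True if word begins and ends with the same three characters."""
    return SAME_THREE.match(word) is not None


words = ["entertainment", "underground", "hotshot", "penguin"]
print([w for w in words if begins_and_ends_alike(w)])

# The basic (POSIX) form, and the same string with each backslash
# doubled so that it survives DICT string quoting.
BASIC = r"^\(...\).*\1$"
wire = BASIC.replace("\\", "\\\\")
print(wire)
```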
819
820
821 __soundex__
822
823
824 The Soundex algorithm, a classic algorithm for finding words
825 that sound similar to each other. The algorithm encodes each
826 word using the first letter of the word and up to three
827 digits. Since the first letter is known, this search is
828 relatively fast, and it is sometimes good for correcting
829 spelling errors when the Levenshtein algorithm doesn't
830 help.
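The encoding step can be sketched in Python using the commonly published American Soundex rules. Whether dictd follows exactly these rules (for example, the treatment of h and w, or padding short codes with zeros) is not stated here, so those details are assumptions of this sketch.

```python
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """First letter plus up to three digits, padded with zeros."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code
    return result.ljust(4, "0")


print(soundex("Robert"), soundex("Rupert"))
```

Words that sound alike, such as the two names in the last line, receive the same code, so a search only has to compare codes among headwords sharing a first letter.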
831
832
833 __lev__
834
835
836 The Levenshtein algorithm (string edit distance of one).
837 This algorithm searches for all words which are within an
838 edit distance of one from the target word. An
839 !!DATABASE FORMAT
840
841
842 Databases for __dictd__ are distributed separately. A
843 database consists of two files. One is a flat text file, the
844 other is the index.
845
846
847 The flat text file contains dictionary entries (or any other
848 suitable data), and the index contains tab-delimited tuples
849 consisting of the headword, the byte offset at which this
850 entry begins in the flat text file, and the length of the
851 entry in bytes. The offset and length are encoded using base
852 64 encoding using the 64-character subset of International
853 Alphabet IA5 discussed in RFC 1421 (printable encoding) and
854 RFC 1522 (base64 MIME). Encoding the offsets in base 64
855 saves considerable space when compared with the usual base
856 10 encoding, while still permitting tab characters (ASCII 9)
857 to be used for delimiting fields in a record. Each record
858 ends with a newline (ASCII 10), so the index file is human
859 readable.
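Reading one index record can be sketched in Python. The sketch assumes that each character of an encoded field is a single base-64 digit, most significant digit first, drawn from the RFC 1421 alphabet (that is, a positional number and not the byte-oriented MIME transfer encoding). That reading, the function names, and the sample record are assumptions made for illustration.

```python
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}


def decode_number(text):
    """Interpret text as a base-64 number, most significant digit first."""
    n = 0
    for ch in text:
        n = n * 64 + VALUE[ch]
    return n


def parse_index_line(line):
    """Split a tab-delimited record into (headword, offset, length)."""
    headword, offset, length = line.rstrip("\n").split("\t")[:3]
    return headword, decode_number(offset), decode_number(length)


# Hypothetical record: headword "apple" at offset BA with length K.
print(parse_index_line("apple\tBA\tK\n"))
```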
860
861
862 The flat text file may be compressed using gzip(1)
863 (not recommended) or dictzip(1) (highly recommended).
864 Optimal speed will be obtained using an uncompressed file.
865 However, the __gzip__ compression algorithm works very
866 well on plain text, and can result in space savings
867 typically between 60 and 80%. Using a file compressed with
868 gzip(1) is not recommended, however, because random
869 access on the file can only be accomplished by serially
870 decompressing the whole file, a process which is
871 prohibitively slow. dictzip(1) uses the same
872 compression algorithm and file format as does
873 gzip(1), but provides a table that can be used to
874 randomly access compressed blocks in the file. The use of
875 50-64kB blocks for compression typically degrades
876 compression by less than 10%, while maintaining acceptable
877 random access capabilities for all data in the file. As an
878 added benefit, files compressed with dictzip(1) can
879 be decompressed with gzip(1) or zcat(1).
880 (Note: recompressing a __dictzip__'d file using, for
881 example, znew(1) will destroy the random access
882 characteristics of the file. Always compress data files
883 using dictzip(1).)
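The arithmetic behind chunked random access can be sketched in Python. Given an entry's offset and length in the uncompressed text, the uncompressed chunk size, and a table of compressed chunk sizes, the sketch works out which chunks must be read and where each one starts within the compressed data. The function name and the numbers are hypothetical, and the way dictzip actually stores its table inside the gzip file is not modelled here.

```python
from itertools import accumulate


def locate(offset, length, chunk_size, compressed_sizes):
    """Return (chunk index, compressed start, compressed size) triples
    for every chunk that overlaps the requested entry (length >= 1)."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    # Start of chunk i is the sum of the compressed sizes before it.
    starts = [0] + list(accumulate(compressed_sizes))
    return [
        (i, starts[i], compressed_sizes[i])
        for i in range(first, last + 1)
    ]


# Hypothetical file: 100-byte chunks compressed to 40, 50 and 60 bytes.
TABLE = [40, 50, 60]

print(locate(95, 10, 100, TABLE))   # entry straddles two chunks
print(locate(250, 5, 100, TABLE))   # entry lies inside the last chunk
```

Only the listed chunks need to be decompressed, which is why a chunked file stays usable for random access while a plain gzip file has to be decompressed from the beginning.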
884 !!ACKNOWLEDGEMENTS
885
886
887 Special thanks to Jean-loup Gailly and Mark Adler for
888 writing the zlib general purpose data compression library.
889 The version contained with __dictd__ is not necessarily
890 an original version and __may have been modified__,
891 although any modifications are probably trivial. The key
892 features of the __dictzip__ random-access compression
893 algorithm utilize a documented extension of the gzip format,
894 and do not require any modifications to zlib. For more
895 information on zlib, please see the zlib home page at
896 ''http://quest.jpl.nasa.gov/zlib/''
897
898
899 Special thanks to Henry Spencer for his regex package. The
900 package contained with __dictd__ is not necessarily an
901 original version and __may have been modified.__ For more
902 information on regex, please see
903 ''ftp://zoo.toronto.edu/pub/regex.shar''
904 !!COPYING
905
906
907 The main source files for the __dictd__ server and the
908 __dictzip__ compression program were written by Rik Faith
909 (faith@dict.org) and are distributed under the terms of the
910 GNU General Public License. If you need to distribute under
911 other terms, write to the author.
912
913
914 The main libraries used by these programs (zlib, regex,
915 libmaa) are distributed under different terms, so you may be
916 able to use the libraries for applications which are
917 incompatible with the GPL -- please see the copyright
918 notices and license information that come with the libraries
919 for more information, and consult with your attorney to
920 resolve these issues.
921 !!BUGS
922
923
924 The regular expression searches do not ignore
925 non-whitespace, non-alphanumeric characters as do the other
926 searches. In practice, this isn't much of a
927 problem.
928
929
930 The databases are memory mapped and cannot be updated while
931 the server is running.
932
933
934 There is no way to get a running server to re-read the
935 configuration file, so databases cannot be added or deleted
936 on the fly.
937 !!FILES
938
939
940 ''/etc/dictd.conf
941 /usr/sbin/dictd''
942 !!SEE ALSO
943
944
945 dict(1), dictzip(1), gunzip(1),
946 zcat(1), webster(1), __RFC
947 2229__
948 ----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.