version 1, including all changes.
.
Rev |
Author |
# |
Line |
1 |
perry |
1 |
LOCATEDB |
|
|
2 |
!!!LOCATEDB |
|
|
3 |
NAME |
|
|
4 |
DESCRIPTION |
|
|
5 |
EXAMPLE |
|
|
6 |
SEE ALSO |
|
|
7 |
---- |
|
|
8 |
!!NAME |
|
|
9 |
|
|
|
10 |
|
|
|
11 |
locatedb - front-compressed file name database |
|
|
12 |
!!DESCRIPTION |
|
|
13 |
|
|
|
14 |
|
|
|
15 |
This manual page documents the format of file name databases |
|
|
16 |
for the GNU version of __locate__. The file name |
|
|
17 |
databases contain lists of files that were in particular |
|
|
18 |
directory trees when the databases were last |
|
|
19 |
updated. |
|
|
20 |
|
|
|
21 |
|
|
|
22 |
There can be multiple databases. Users can select which |
|
|
23 |
databases __locate__ searches using an environment |
|
|
24 |
variable or command line option; see __locate__(1L). The |
|
|
25 |
system administrator can choose the file name of the default |
|
|
26 |
database, the frequency with which the databases are |
|
|
27 |
updated, and the directories for which they contain entries. |
|
|
28 |
Normally, file name databases are updated by running the |
|
|
29 |
__updatedb__ program periodically, typically nightly; see |
|
|
30 |
__updatedb__(1L). |
|
|
31 |
|
|
|
32 |
|
|
|
33 |
__updatedb__ runs a program called __frcode__ to |
|
|
34 |
compress the list of file names using front-compression, |
|
|
35 |
which reduces the database size by a factor of 4 to 5. |
|
|
36 |
Front-compression (also known as incremental encoding) works |
|
|
37 |
as follows. |
|
|
38 |
|
|
|
39 |
|
|
|
40 |
The database entries are a sorted list (case-insensitively, |
|
|
41 |
for users' convenience). Since the list is sorted, each |
|
|
42 |
entry is likely to share a prefix (initial string) with the |
|
|
43 |
previous entry. Each database entry begins with an |
|
|
44 |
offset-differential count byte, which is the additional |
|
|
45 |
number of characters of prefix of the preceding entry to use |
|
|
46 |
beyond the number that the preceding entry is using of its |
|
|
47 |
predecessor. (The counts can be negative.) Following the |
|
|
48 |
count is a null-terminated ASCII remainder -- the part of |
|
|
49 |
the name that follows the shared prefix. |
|
|
50 |
|
|
|
51 |
|
|
|
52 |
If the offset-differential count is larger than can be |
|
|
53 |
stored in a byte (+/-127), the byte has the value 0x80 and |
|
|
54 |
the count follows in a 2-byte word, with the high byte first |
|
|
55 |
(network byte order). |
|
|
56 |
|
|
|
57 |
|
|
|
58 |
Every database begins with a dummy entry for a file called |
|
|
59 |
`LOCATE02', which __locate__ checks for to ensure that |
|
|
60 |
the database file has the correct format; it ignores the |
|
|
61 |
entry in doing the search. |
|
|
62 |
|
|
|
63 |
|
|
|
64 |
Databases can not be concatenated together, even if the |
|
|
65 |
first (dummy) entry is trimmed from all but the first |
|
|
66 |
database. This is because the offset-differential count in |
|
|
67 |
the first entry of the second and following databases will |
|
|
68 |
be wrong. |
|
|
69 |
|
|
|
70 |
|
|
|
71 |
There is also an old database format, used by Unix |
|
|
72 |
__locate__ and __find__ programs and earlier releases |
|
|
73 |
of the GNU ones. __updatedb__ runs programs called |
|
|
74 |
__bigram__ and __code__ to produce old-format |
|
|
75 |
databases. The old format differs from the above description |
|
|
76 |
in the following ways. Instead of each entry starting with |
|
|
77 |
an offset-differential count byte and ending with a null, |
|
|
78 |
byte values from 0 through 28 indicate offset-differential |
|
|
79 |
counts from -14 through 14. The byte value indicating that a |
|
|
80 |
long offset-differential count follows is 0x1e (30), not |
|
|
81 |
0x80. The long counts are stored in host byte order, which |
|
|
82 |
is not necessarily network byte order, and host integer word |
|
|
83 |
size, which is usually 4 bytes. They also represent a count |
|
|
84 |
14 less than their value. The database lines have no |
|
|
85 |
termination byte; the start of the next line is indicated by |
|
|
86 |
its first byte having a value __ |
|
|
87 |
|
|
|
88 |
|
|
|
89 |
In addition, instead of starting with a dummy entry, the old |
|
|
90 |
database format starts with a 256 byte table containing the |
|
|
91 |
128 most common bigrams in the file list. A bigram is a pair |
|
|
92 |
of adjacent bytes. Bytes in the database that have the high |
|
|
93 |
bit set are indexes (with the high bit cleared) into the |
|
|
94 |
bigram table. The bigram and offset-differential count |
|
|
95 |
coding makes these databases 20-25% smaller than the new |
|
|
96 |
format, but makes them not 8-bit clean. Any byte in a file |
|
|
97 |
name that is in the ranges used for the special codes is |
|
|
98 |
replaced in the database by a question mark, which not |
|
|
99 |
coincidentally is the shell wildcard to match a single |
|
|
100 |
character. |
|
|
101 |
!!EXAMPLE |
|
|
102 |
|
|
|
103 |
|
|
|
104 |
Input to __frcode__: |
|
|
105 |
/usr/src |
|
|
106 |
/usr/src/cmd/aardvark.c |
|
|
107 |
/usr/src/cmd/armadillo.c |
|
|
108 |
/usr/tmp/zoo |
|
|
109 |
Length of the longest prefix of the preceding entry to share: |
|
|
110 |
0 /usr/src |
|
|
111 |
8 /cmd/aardvark.c |
|
|
112 |
14 rmadillo.c |
|
|
113 |
5 tmp/zoo |
|
|
114 |
Output from __frcode__, with trailing nulls changed to newlines and count bytes made printable: |
|
|
115 |
|
|
|
116 |
|
|
|
117 |
0 LOCATE02 |
|
|
118 |
0 /usr/src |
|
|
119 |
8 /cmd/aardvark.c |
|
|
120 |
6 rmadillo.c |
|
|
121 |
-9 tmp/zoo |
|
|
122 |
(6 = 14 - 8, and -9 = 5 - 14) |
|
|
123 |
!!SEE ALSO |
|
|
124 |
|
|
|
125 |
|
|
|
126 |
__find__(1L), __locate__(1L), __locatedb__(5L), |
|
|
127 |
__xargs__(1L) __Finding Files__ (on-line in Info, or |
|
|
128 |
printed) |
|
|
129 |
---- |