bzip2
!!!bzip2
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
MEMORY MANAGEMENT
RECOVERING DATA FROM DAMAGED FILES
PERFORMANCE NOTES
CAVEATS
AUTHOR
----
!!NAME

bzip2, bunzip2 - a block-sorting file compressor, v1.0.2
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files

!!SYNOPSIS

__bzip2__ [[ __-cdfkqstvzVL123456789__ ] [[ ''filenames ...'' ]
__bunzip2__ [[ __-fkvsVL__ ] [[ ''filenames ...'' ]
__bzcat__ [[ __-s__ ] [[ ''filenames ...'' ]
__bzip2recover__ ''filename''

!!DESCRIPTION

''bzip2'' compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors.

The command-line options are deliberately very similar to those of ''GNU gzip,'' but they are not identical.

''bzip2'' expects a list of file names to accompany the command-line flags. Each file is replaced by a compressed version of itself, with the name ''originalname.bz2''.

''bzip2'' and ''bunzip2'' will by default not overwrite existing files. If you want this to happen, specify the -f flag.

If no file names are specified, ''bzip2'' compresses from standard input to standard output. In this case, ''bzip2'' will decline to write compressed output to a terminal, as this would be entirely incomprehensible and therefore pointless.

''bunzip2'' (or ''bzip2 -d'') decompresses all specified files. Files which were not created by ''bzip2'' will be detected and ignored, and a warning issued. ''bzip2'' attempts to guess the filename for the decompressed file from that of the compressed file as follows:

filename.bz2 becomes filename
filename.bz becomes filename
filename.tbz2 becomes filename.tar
filename.tbz becomes filename.tar
anyothername becomes anyothername.out

If the file does not end in one of the recognised endings, ''.bz2'', ''.bz'', ''.tbz2'' or ''.tbz'', ''bzip2'' complains that it cannot guess the name of the original file, and uses the original name with ''.out'' appended.
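
To make the renaming rule concrete, here is a small illustrative sketch in Python (not part of ''bzip2'' itself); the function name is invented for the example.

 def guess_output_name(name):
     """Mimic the output-name guessing described above."""
     for suffix, replacement in ((".bz2", ""), (".bz", ""),
                                 (".tbz2", ".tar"), (".tbz", ".tar")):
         if name.endswith(suffix):
             return name[:-len(suffix)] + replacement
     return name + ".out"   # unrecognised ending: keep the name, append .out

 assert guess_output_name("archive.tbz2") == "archive.tar"
 assert guess_output_name("notes.txt") == "notes.txt.out"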

As with compression, supplying no filenames causes decompression from standard input to standard output.

''bunzip2'' will correctly decompress a file which is the concatenation of two or more compressed files. The result is the concatenation of the corresponding uncompressed files. Integrity testing (-t) of concatenated compressed files is also supported.
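
The same multi-stream behaviour can be seen from Python's ''bz2'' module, a binding to the same library (Python 3.3 or later handles concatenated streams). This sketch compresses two buffers separately, concatenates them, and decompresses the result in one go.

 import bz2

 part1 = bz2.compress(b"first file\n")
 part2 = bz2.compress(b"second file\n")

 # Decompressing the concatenation yields the concatenation of the inputs,
 # just as bunzip2 does for concatenated .bz2 files.
 assert bz2.decompress(part1 + part2) == b"first file\nsecond file\n"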

You can also compress or decompress files to the standard output by giving the -c flag. Multiple files may be compressed and decompressed like this. The resulting outputs are fed sequentially to stdout. Compression of multiple files in this manner generates a stream containing multiple compressed file representations. Such a stream can be decompressed correctly only by ''bzip2'' version 0.9.0 or later. Earlier versions of ''bzip2'' will stop after decompressing the first file in the stream.

''bzcat'' (or ''bzip2 -dc'') decompresses all specified files to the standard output.

''bzip2'' will read arguments from the environment variables ''BZIP2'' and ''BZIP,'' in that order, and will process them before any arguments read from the command line. This gives a convenient way to supply default arguments.
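
The precedence is: contents of ''BZIP2'', then contents of ''BZIP'', then the real command line. A rough model of that ordering, written as a Python illustration rather than the actual parser (which is C), looks like this:

 import os, shlex

 def effective_args(command_line_args):
     """Combine default arguments from the environment with the command line,
     in the order described above: BZIP2 first, then BZIP, then argv."""
     args = []
     for var in ("BZIP2", "BZIP"):
         args.extend(shlex.split(os.environ.get(var, "")))
     args.extend(command_line_args)
     return args

 # With BZIP2="-v -s" set in the environment:
 #   effective_args(["-9", "data.txt"]) -> ["-v", "-s", "-9", "data.txt"]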

Compression is always performed, even if the compressed file is slightly larger than the original. Files of less than about one hundred bytes tend to get larger, since the compression mechanism has a constant overhead in the region of 50 bytes. Random data (including the output of most file compressors) is coded at about 8.05 bits per byte, giving an expansion of around 0.5%.
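
You can observe both effects with a few lines of Python (again via the ''bz2'' binding): a tiny input grows because of the constant header overhead, and incompressible random data grows by roughly half a percent.

 import bz2, os

 tiny = b"hello"                      # well under the ~100 byte threshold
 noise = os.urandom(65536)            # effectively incompressible input

 print(len(tiny), "->", len(bz2.compress(tiny)))     # output is larger
 print(len(noise), "->", len(bz2.compress(noise)))   # roughly 0.5% larger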

As a self-check for your protection, ''bzip2'' uses 32-bit CRCs to make sure that the decompressed version of a file is identical to the original. This guards against corruption of the compressed data, and against undetected bugs in ''bzip2'' (hopefully very unlikely). The chances of data corruption going undetected are microscopic, about one chance in four billion for each file processed. Be aware, though, that the check occurs upon decompression, so it can only tell you that something is wrong. It can't help you recover the original uncompressed data. You can use ''bzip2recover'' to try to recover data from damaged files.
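
A quick way to see the check fire, using the ''bz2'' binding: flip some bits in the middle of a compressed buffer, and the library reports the damage at decompression time rather than returning wrong data.

 import bz2

 payload = bz2.compress(b"important data " * 1000)
 damaged = bytearray(payload)
 damaged[len(damaged) // 2] ^= 0xFF        # corrupt one byte inside a block

 try:
     bz2.decompress(bytes(damaged))
 except (OSError, ValueError) as exc:      # invalid stream / CRC mismatch
     print("corruption detected:", exc)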

Return values: 0 for a normal exit, 1 for environmental problems (file not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt compressed file, 3 for an internal consistency error (e.g., a bug) which caused ''bzip2'' to panic.
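
If you drive ''bzip2'' from a script, these are the values to test. A minimal sketch using Python's subprocess module; it assumes a bzip2 binary on the PATH, and archive.bz2 is a hypothetical file name.

 import subprocess

 result = subprocess.run(["bzip2", "-tq", "archive.bz2"])
 meaning = {0: "ok", 1: "environmental problem",
            2: "corrupt compressed file", 3: "internal error (panic)"}
 print(meaning.get(result.returncode, "unexpected exit status"))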
!!OPTIONS

__-c --stdout__

Compress or decompress to standard output.

__-d --decompress__

Force decompression. ''bzip2, bunzip2'' and ''bzcat'' are really the same program, and the decision about what actions to take is done on the basis of which name is used. This flag overrides that mechanism, and forces ''bzip2'' to decompress.

__-z --compress__

The complement to -d: forces compression, regardless of the invocation name.

__-t --test__

Check integrity of the specified file(s), but don't decompress them. This really performs a trial decompression and throws away the result.
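
The same idea can be expressed through the ''bz2'' binding: decompress and discard. This sketch reads the whole file into memory, which the real -t does not need to do, so treat it purely as an illustration.

 import bz2

 def looks_intact(path):
     """Trial-decompress a .bz2 file and throw the result away."""
     try:
         with open(path, "rb") as f:
             bz2.decompress(f.read())       # output discarded
         return True
     except (OSError, EOFError, ValueError):
         return False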

__-f --force__

Force overwrite of output files. Normally, ''bzip2'' will not overwrite existing output files. Also forces ''bzip2'' to break hard links to files, which it otherwise wouldn't do.

''bzip2'' normally declines to decompress files which don't have the correct magic header bytes. If forced (-f), however, it will pass such files through unmodified. This is how GNU gzip behaves.

__-k --keep__

Keep (don't delete) input files during compression or decompression.

__-s --small__

Reduce memory usage, for compression, decompression and testing. Files are decompressed and tested using a modified algorithm which only requires 2.5 bytes per block byte. This means any file can be decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, -s selects a block size of 200k, which limits memory use to around the same figure, at the expense of your compression ratio. In short, if your machine is low on memory (8 megabytes or less), use -s for everything. See MEMORY MANAGEMENT below.

__-q --quiet__

Suppress non-essential warning messages. Messages pertaining to I/O errors and other critical events will not be suppressed.

__-v --verbose__

Verbose mode -- show the compression ratio for each file processed. Further -v's increase the verbosity level, spewing out lots of information which is primarily of interest for diagnostic purposes.

__-L --license -V --version__

Display the software version, license terms and conditions.

__-1 (or --fast) to -9 (or --best)__

Set the block size to 100 k, 200 k .. 900 k when compressing. Has no effect when decompressing. See MEMORY MANAGEMENT below. The --fast and --best aliases are primarily for GNU gzip compatibility. In particular, --fast doesn't make things significantly faster. And --best merely selects the default behaviour.
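
A contrived demonstration of the block-size effect, via the ''bz2'' binding, where compresslevel plays the role of the -1 .. -9 flag: a 300 kB random chunk repeated twelve times only looks repetitive to the compressor once a block is big enough to hold several copies.

 import bz2, os

 chunk = os.urandom(300 * 1024)        # 300 kB of random bytes
 data = chunk * 12                     # ~3.6 MB with long-range repetition

 for level in (1, 9):                  # block size = level x 100 kB
     size = len(bz2.compress(data, compresslevel=level))
     print("level %d -> %d bytes" % (level, size))
 # Expect level 1 to achieve almost no compression, while level 9
 # removes most of the redundancy.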

__--__

Treats all subsequent arguments as file names, even if they start with a dash. This is so you can handle files with names beginning with a dash, for example: bzip2 -- -myfilename.

__--repetitive-fast --repetitive-best__

These flags are redundant in versions 0.9.5 and above. They provided some coarse control over the behaviour of the sorting algorithm in earlier versions, which was sometimes useful. 0.9.5 and above have an improved algorithm which renders these flags irrelevant.
!!MEMORY MANAGEMENT

''bzip2'' compresses large files in blocks. The block size affects both the compression ratio achieved, and the amount of memory needed for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default) respectively. At decompression time, the block size used for compression is read from the header of the compressed file, and ''bunzip2'' then allocates itself just enough memory to decompress the file. Since block sizes are stored in compressed files, it follows that the flags -1 to -9 are irrelevant to and so ignored during decompression.

Compression and decompression requirements, in bytes, can be estimated as:

Compression: 400k + ( 8 x block size )

Decompression: 100k + ( 4 x block size ), or 100k + ( 2.5 x block size )
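
Worked out for the default 900k block size, these formulas give 400k + 8 x 900k = 7600k to compress and 100k + 4 x 900k = 3700k to decompress (2350k with -s), matching the figures quoted below. A small Python helper that applies the same rules of thumb:

 def bzip2_memory_estimate(block_size_100k=9):
     """Apply the rules of thumb above; results are in kbytes."""
     block_k = 100 * block_size_100k
     return {"compress":      400 + 8 * block_k,
             "decompress":    100 + 4 * block_k,
             "decompress -s": 100 + int(2.5 * block_k)}

 # Default 900k blocks:
 #   {'compress': 7600, 'decompress': 3700, 'decompress -s': 2350}
 print(bzip2_memory_estimate(9))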

Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using ''bzip2'' on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size, ''bunzip2'' will require about 3700 kbytes to decompress. To support decompression of any file on a 4 megabyte machine, ''bunzip2'' has an option to decompress using approximately half this amount of memory, about 2300 kbytes. Decompression speed is also halved, so you should use this option only where necessary. The relevant flag is -s.

In general, try and use the largest block size memory constraints allow, since that maximises the compression achieved. Compression and decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block -- that means most files you'd encounter using a large block size. The amount of real memory touched is proportional to the size of the file, since the file is smaller than a block. For example, compressing a file 20,000 bytes long with the flag -9 will cause the compressor to allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the decompressor will allocate 3700k but only touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different block sizes. Also recorded is the total compressed size for 14 files of the Calgary Text Compression Corpus totalling 3,141,622 bytes. This column gives some feel for how compression varies with block size. These figures tend to understate the advantage of larger block sizes for larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
   Flag      usage       usage      -s usage     Size

    -1       1200k        500k        350k      914704
    -2       2000k        900k        600k      877703
    -3       2800k       1300k        850k      860338
    -4       3600k       1700k       1100k      846899
    -5       4400k       2100k       1350k      845160
    -6       5200k       2500k       1600k      838626
    -7       6100k       2900k       1850k      834096
    -8       6800k       3300k       2100k      828642
    -9       7600k       3700k       2350k      828642
!!RECOVERING DATA FROM DAMAGED FILES

''bzip2'' compresses files in blocks, usually 900 kbytes long. Each block is handled independently. If a media or transmission error causes a multi-block .bz2 file to become damaged, it may be possible to recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.

''bzip2recover'' is a simple program whose purpose is to search for blocks in .bz2 files, and write each block out into its own .bz2 file. You can then use ''bzip2'' -t to test the integrity of the resulting files, and decompress those which are undamaged.
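
For the curious, here is a Python sketch of the kind of search ''bzip2recover'' performs. It slides a 48-bit window over the file one bit at a time, since blocks need not start on a byte boundary, and reports candidate block-header positions. The magic value used here is the commonly documented block-header pattern and should be treated as an assumption; this is an illustration, not a substitute for ''bzip2recover''.

 BLOCK_MAGIC = 0x314159265359            # 48-bit block-header pattern (assumed)
 MASK = (1 << 48) - 1

 def block_bit_offsets(path):
     """Return bit offsets at which the block-header pattern appears."""
     with open(path, "rb") as f:
         data = f.read()
     offsets, window = [], 0
     for i, byte in enumerate(data):
         for shift in range(7, -1, -1):
             window = ((window << 1) | ((byte >> shift) & 1)) & MASK
             bit_index = i * 8 + (7 - shift)
             if bit_index >= 47 and window == BLOCK_MAGIC:
                 offsets.append(bit_index - 47)
     return offsets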

''bzip2recover'' takes a single argument, the name of the damaged file, and writes out a number of files, one for each block it finds, each itself a .bz2 file.

''bzip2recover'' should be of most use dealing with large .bz2 files, as these will contain many blocks. It is clearly futile to use it on damaged single-block files, since a damaged block cannot be recovered. If you wish to minimise any potential data loss through media or transmission errors, you might consider compressing with a smaller block size.
!!PERFORMANCE NOTES

The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated many hundreds of times), may compress more slowly than normal. Versions 0.9.5 and above fare much better than previous versions in this respect: the ratio between worst-case and average-case compression time is in the region of 10:1, where for previous versions it was more like 100:1. You can use the -vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

''bzip2'' usually allocates several megabytes of memory to operate in, and then charges all over it in a fairly random fashion. This means that performance, both for compressing and decompressing, is largely determined by the speed at which your machine can service cache misses. Because of this, small changes to the code to reduce the miss rate have been observed to give disproportionately large performance improvements. I imagine ''bzip2'' will perform best on machines with very large caches.
!!CAVEATS

I/O error messages are not as helpful as they could be. ''bzip2'' tries hard to detect I/O errors and exit cleanly, but the details of what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.2 of ''bzip2.'' Compressed data created by this version is entirely forwards and backwards compatible with the previous public releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0 and 1.0.1, but with the following exception: 0.9.0 and above can correctly decompress multiple concatenated compressed files. 0.1pl2 cannot do this; it will stop after decompressing just the first file in the stream.

''bzip2recover'' versions prior to this one, 1.0.2, used 32-bit integers to represent bit positions in compressed files, so they could not handle compressed files more than 512 megabytes long. Version 1.0.2 and above uses 64-bit ints on some platforms which support them (GNU supported targets, and Windows). To establish whether or not bzip2recover was built with such a limitation, run it without arguments. In any event you can build yourself an unlimited version if you can recompile it with MaybeUInt64 set to be an unsigned 64-bit integer.
!!AUTHOR

Julian Seward, jseward@acm.org.

http://sources.redhat.com/bzip2

The ideas embodied in ''bzip2'' are due to (at least) the following people: Michael Burrows and David Wheeler (for the block sorting transformation), David Wheeler (again, for the Huffman coder), Peter Fenwick (for the structured coding model in the original ''bzip,'' and many refinements), and Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic coder in the original ''bzip''). I am much indebted for their help, support and advice. See the manual in the source distribution for pointers to sources of documentation. Christian von Roques encouraged me to look for faster sorting algorithms, so as to speed up compression. Bela Lubkin encouraged me to improve the worst-case compression performance. The bz* scripts are derived from those of GNU gzip. Many people sent patches, helped with portability problems, lent machines, gave advice and were generally helpful.
----