!!!bzip2

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
MEMORY MANAGEMENT
RECOVERING DATA FROM DAMAGED FILES
PERFORMANCE NOTES
CAVEATS
AUTHOR

----
!!NAME


bzip2, bunzip2 - a block-sorting file compressor, v1.0.2

bzcat - decompresses files to stdout

bzip2recover - recovers data from damaged bzip2 files
!!SYNOPSIS


__bzip2__ [[ __-cdfkqstvzVL123456789__ ] [[ ''filenames ...'' ]

__bunzip2__ [[ __-fkvsVL__ ] [[ ''filenames ...'' ]

__bzcat__ [[ __-s__ ] [[ ''filenames ...'' ]

__bzip2recover__ ''filename''
!!DESCRIPTION


''bzip2'' compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more
conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.

The command-line options are deliberately very similar to those of
''GNU gzip'', but they are not identical.

''bzip2'' expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed version
of itself, with the name ''originalname.bz2''.

''bzip2'' and ''bunzip2'' will by default not overwrite existing
files. If you want this to happen, specify the -f flag.

If no file names are specified, ''bzip2'' compresses from standard
input to standard output. In this case, ''bzip2'' will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.
''bunzip2'' (or ''bzip2 -d'') decompresses all specified files.
Files which were not created by ''bzip2'' will be detected and
ignored, and a warning issued. ''bzip2'' attempts to guess the
filename for the decompressed file from that of the compressed
file as follows:

 filename.bz2    becomes  filename
 filename.bz     becomes  filename
 filename.tbz2   becomes  filename.tar
 filename.tbz    becomes  filename.tar
 anyothername    becomes  anyothername.out

If the file does not end in one of the recognised endings,
''.bz2'', ''.bz'', ''.tbz2'' or ''.tbz'', ''bzip2'' complains that
it cannot guess the name of the original file, and uses the
original name with ''.out'' appended.
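The renaming rules above can be tried directly. A minimal sketch,
assuming ''bzip2'' and ''bunzip2'' are installed and using a
hypothetical file name:

```shell
# Default compress/decompress renaming, assuming bzip2 and bunzip2
# are on the PATH. The file name is only an illustration.
cd "$(mktemp -d)"
printf 'hello\n' > notes.txt
bzip2 notes.txt          # replaces notes.txt with notes.txt.bz2
bunzip2 notes.txt.bz2    # replaces notes.txt.bz2 with notes.txt
cat notes.txt
```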
As with compression, supplying no filenames causes decompression
from standard input to standard output.

''bunzip2'' will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (-t) of concatenated compressed files is also supported.
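A sketch of the concatenation behaviour, assuming ''bzip2'' and
''bunzip2'' are installed (file names are hypothetical):

```shell
# Concatenated .bz2 streams decompress to the concatenated originals.
cd "$(mktemp -d)"
printf 'part one\n' > a.txt
printf 'part two\n' > b.txt
bzip2 a.txt b.txt                   # makes a.txt.bz2 and b.txt.bz2
cat a.txt.bz2 b.txt.bz2 > both.bz2  # concatenate the compressed files
bzip2 -t both.bz2                   # -t works on the whole stream
bunzip2 -c both.bz2 > both.txt      # both parts, in order
```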
You can also compress or decompress files to the standard output
by giving the -c flag. Multiple files may be compressed and
decompressed like this. The resulting outputs are fed sequentially
to stdout. Compression of multiple files in this manner generates
a stream containing multiple compressed file representations. Such
a stream can be decompressed correctly only by ''bzip2'' version
0.9.0 or later. Earlier versions of ''bzip2'' will stop after
decompressing the first file in the stream.
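For example, a multi-stream file built with -c (a sketch, assuming
''bzip2'' is installed and version 0.9.0 or later is used to
decompress):

```shell
# Compressing several files to stdout yields one file holding
# multiple compressed streams, back to back.
cd "$(mktemp -d)"
printf 'first\n'  > f1
printf 'second\n' > f2
bzip2 -c f1 f2 > all.bz2    # two compressed representations
bunzip2 -c all.bz2          # 0.9.0+ decompresses both in sequence
```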
''bzcat'' (or ''bzip2 -dc'') decompresses all specified files to
the standard output.
''bzip2'' will read arguments from the environment variables
''BZIP2'' and ''BZIP'', in that order, and will process them
before any arguments read from the command line. This gives a
convenient way to supply default arguments.
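A sketch of supplying a default argument this way, assuming
''bzip2'' is installed:

```shell
# -9 is picked up from the BZIP2 environment variable and processed
# before the command-line flags; -v comes from the command line.
cd "$(mktemp -d)"
printf 'some data\n' > file.txt
BZIP2=-9 bzip2 -v file.txt
```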
Compression is always performed, even if the compressed file is
slightly larger than the original. Files of less than about one
hundred bytes tend to get larger, since the compression mechanism
has a constant overhead in the region of 50 bytes. Random data
(including the output of most file compressors) is coded at about
8.05 bits per byte, giving an expansion of around 0.5%.
As a self-check for your protection, ''bzip2'' uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to
the original. This guards against corruption of the compressed
data, and against undetected bugs in ''bzip2'' (hopefully very
unlikely). The chances of data corruption going undetected are
microscopic, about one chance in four billion for each file
processed. Be aware, though, that the check occurs upon
decompression, so it can only tell you that something is wrong. It
can't help you recover the original uncompressed data. You can use
''bzip2recover'' to try to recover data from damaged files.
Return values: 0 for a normal exit, 1 for environmental problems
(file not found, invalid flags, I/O errors, etc.), 2 to indicate a
corrupt compressed file, 3 for an internal consistency error (eg,
bug) which caused ''bzip2'' to panic.
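The return value is what a calling script should inspect; a sketch
of the environmental-problem case, assuming ''bzip2'' is installed:

```shell
# A missing input file is an environmental problem, so bzip2
# returns 1 (the error message itself goes to stderr).
bzip2 definitely-missing-input.txt 2>/dev/null
status=$?
echo "$status"
```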
!!OPTIONS


__-c --stdout__

Compress or decompress to standard output.

__-d --decompress__

Force decompression. ''bzip2'', ''bunzip2'' and ''bzcat'' are
really the same program, and the decision about what actions to
take is done on the basis of which name is used. This flag
overrides that mechanism, and forces ''bzip2'' to decompress.

__-z --compress__

The complement to -d: forces compression, regardless of the
invocation name.

__-t --test__

Check integrity of the specified file(s), but don't decompress
them. This really performs a trial decompression and throws away
the result.
__-f --force__

Force overwrite of output files. Normally, ''bzip2'' will not
overwrite existing output files. Also forces ''bzip2'' to break
hard links to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes. If forced (-f), however, it will pass
such files through unmodified. This is how GNU gzip behaves.

__-k --keep__

Keep (don't delete) input files during compression or
decompression.

__-s --small__

Reduce memory usage, for compression, decompression and testing.
Files are decompressed and tested using a modified algorithm which
only requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit at about half the normal
speed.

During compression, -s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your
compression ratio. In short, if your machine is low on memory (8
megabytes or less), use -s for everything. See MEMORY MANAGEMENT
below.
__-q --quiet__

Suppress non-essential warning messages. Messages pertaining to
I/O errors and other critical events will not be suppressed.

__-v --verbose__

Verbose mode -- show the compression ratio for each file
processed. Further -v's increase the verbosity level, spewing out
lots of information which is primarily of interest for diagnostic
purposes.

__-L --license -V --version__

Display the software version, license terms and conditions.

__-1 (or --fast) to -9 (or --best)__

Set the block size to 100 k, 200 k .. 900 k when compressing. Has
no effect when decompressing. See MEMORY MANAGEMENT below. The
--fast and --best aliases are primarily for GNU gzip
compatibility. In particular, --fast doesn't make things
significantly faster. And --best merely selects the default
behaviour.

__--__

Treats all subsequent arguments as file names, even if they start
with a dash. This is so you can handle files with names beginning
with a dash, for example: bzip2 -- -myfilename.

__--repetitive-fast --repetitive-best__

These flags are redundant in versions 0.9.5 and above. They
provided some coarse control over the behaviour of the sorting
algorithm in earlier versions, which was sometimes useful. 0.9.5
and above have an improved algorithm which renders these flags
irrelevant.
!!MEMORY MANAGEMENT


''bzip2'' compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory
needed for compression and decompression. The flags -1 through -9
specify the block size to be 100,000 bytes through 900,000 bytes
(the default) respectively. At decompression time, the block size
used for compression is read from the header of the compressed
file, and ''bunzip2'' then allocates itself just enough memory to
decompress the file. Since block sizes are stored in compressed
files, it follows that the flags -1 to -9 are irrelevant to and so
ignored during decompression.
Compression and decompression requirements, in bytes, can be
estimated as:

 Compression: 400k + ( 8 x block size )

 Decompression: 100k + ( 4 x block size ), or
                100k + ( 2.5 x block size )
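Plugging the default 900k block size into these formulas
reproduces the -9 row of the table below:

```shell
# Estimated peak memory (in kbytes) for the default 900k block size,
# using the formulas above. 2.5 x block is computed as 5 x block / 2
# to stay in integer arithmetic.
block=900   # kbytes
echo "compression:      $(( 400 + 8 * block ))k"
echo "decompression:    $(( 100 + 4 * block ))k"
echo "decompression -s: $(( 100 + 5 * block / 2 ))k"
```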
Larger block sizes give rapidly diminishing marginal returns. Most
of the compression comes from the first two or three hundred k of
block size, a fact worth bearing in mind when using ''bzip2'' on
small machines. It is also important to appreciate that the
decompression memory requirement is set at compression time by the
choice of block size.

For files compressed with the default 900k block size,
''bunzip2'' will require about 3700 kbytes to decompress. To
support decompression of any file on a 4 megabyte machine,
''bunzip2'' has an option to decompress using approximately half
this amount of memory, about 2300 kbytes. Decompression speed is
also halved, so you should use this option only where necessary.
The relevant flag is -s.
In general, try and use the largest block size memory constraints
allow, since that maximises the compression achieved. Compression
and decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single
block -- that means most files you'd encounter using a large block
size. The amount of real memory touched is proportional to the
size of the file, since the file is smaller than a block. For
example, compressing a file 20,000 bytes long with the flag -9
will cause the compressor to allocate around 7600k of memory, but
only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the
decompressor will allocate 3700k but only touch 100k + 20000 * 4 =
180 kbytes.
Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed size
for 14 files of the Calgary Text Compression Corpus totalling
3,141,622 bytes. This column gives some feel for how compression
varies with block size. These figures tend to understate the
advantage of larger block sizes for larger files, since the Corpus
is dominated by smaller files.

        Compress   Decompress   Decompress   Corpus
 Flag     usage      usage       -s usage     Size

  -1      1200k       500k         350k      914704
  -2      2000k       900k         600k      877703
  -3      2800k      1300k         850k      860338
  -4      3600k      1700k        1100k      846899
  -5      4400k      2100k        1350k      845160
  -6      5200k      2500k        1600k      838626
  -7      6100k      2900k        1850k      834096
  -8      6800k      3300k        2100k      828642
  -9      7600k      3700k        2350k      828642
!!RECOVERING DATA FROM DAMAGED FILES


''bzip2'' compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error
causes a multi-block .bz2 file to become damaged, it may be
possible to recover data from the undamaged blocks in the file.
The compressed representation of each block is delimited by a
48-bit pattern, which makes it possible to find the block
boundaries with reasonable certainty. Each block also carries its
own 32-bit CRC, so damaged blocks can be distinguished from
undamaged ones.
''bzip2recover'' is a simple program whose purpose is to search
for blocks in .bz2 files, and write each block out into its own
.bz2 file. You can then use ''bzip2'' -t to test the integrity of
the resulting files, and decompress those which are undamaged.

''bzip2recover'' takes a single argument, the name of the damaged
file, and writes a number of files ''rec00001file.bz2'',
''rec00002file.bz2'', etc, containing the extracted blocks. The
output filenames are designed so that the use of wildcards in
subsequent processing -- for example, bzip2 -dc rec*file.bz2 >
recovered_data -- processes the files in the correct order.
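A sketch of the recovery workflow, assuming ''bzip2'' and
''bzip2recover'' are installed (the input file here is small and
undamaged, so every extracted block passes the integrity test):

```shell
# Extract each block into its own rec*...bz2 file, then test and
# decompress the undamaged ones.
cd "$(mktemp -d)"
printf 'recoverable data\n' > data.txt
bzip2 -k data.txt               # -k keeps the original around
bzip2recover data.txt.bz2       # writes one rec*data.txt.bz2 per block
for f in rec*data.txt.bz2; do
  bzip2 -t "$f" && bunzip2 -c "$f"   # print blocks that pass -t
done
```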
''bzip2recover'' should be of most use dealing with large .bz2
files, as these will contain many blocks. It is clearly futile to
use it on damaged single-block files, since a damaged block cannot
be recovered. If you wish to minimise any potential data loss
through media or transmission errors, you might consider
compressing with a smaller block size.
!!PERFORMANCE NOTES


The sorting phase of compression gathers together similar strings
in the file. Because of this, files containing very long runs of
repeated symbols, like "aabaabaabaab ..." (repeated several
hundred times) may compress more slowly than normal. Versions
0.9.5 and above fare much better than previous versions in this
respect. The ratio between worst-case and average-case compression
time is in the region of 10:1. For previous versions, this figure
was more like 100:1. You can use the -vvvv option to monitor
progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.
''bzip2'' usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This
means that performance, both for compressing and decompressing, is
largely determined by the speed at which your machine can service
cache misses. Because of this, small changes to the code to reduce
the miss rate have been observed to give disproportionately large
performance improvements. I imagine ''bzip2'' will perform best on
machines with very large caches.
!!CAVEATS


I/O error messages are not as helpful as they could be. ''bzip2''
tries hard to detect I/O errors and exit cleanly, but the details
of what the problem is sometimes seem rather misleading.
This manual page pertains to version 1.0.2 of ''bzip2''.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0 and 1.0.1, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files. 0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.
''bzip2recover'' versions prior to this one, 1.0.2, used 32-bit
integers to represent bit positions in compressed files, so they
could not handle compressed files more than 512 megabytes long.
Version 1.0.2 and above uses 64-bit ints on some platforms which
support them (GNU supported targets, and Windows). To establish
whether or not bzip2recover was built with such a limitation, run
it without arguments. In any event you can build yourself an
unlimited version if you can recompile it with MaybeUInt64 set to
be an unsigned 64-bit integer.
!!AUTHOR


Julian Seward, jseward@acm.org.

http://sources.redhat.com/bzip2

The ideas embodied in ''bzip2'' are due to (at least) the
following people: Michael Burrows and David Wheeler (for the block
sorting transformation), David Wheeler (again, for the Huffman
coder), Peter Fenwick (for the structured coding model in the
original ''bzip'', and many refinements), and Alistair Moffat,
Radford Neal and Ian Witten (for the arithmetic coder in the
original ''bzip''). I am much indebted for their help, support and
advice. See the manual in the source distribution for pointers to
sources of documentation. Christian von Roques encouraged me to
look for faster sorting algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case compression
performance. The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped with portability problems, lent
machines, gave advice and were generally helpful.
----