Annotated edit history of bzcat(1) version 1, including all changes.
Rev Author # Line
1 perry 1 bzip2
2 !!!bzip2
3 NAME
4 SYNOPSIS
5 DESCRIPTION
6 OPTIONS
7 MEMORY MANAGEMENT
8 RECOVERING DATA FROM DAMAGED FILES
9 PERFORMANCE NOTES
10 CAVEATS
11 AUTHOR
12 ----
13 !!NAME
14
15
16 bzip2, bunzip2 - a block-sorting file compressor, v1.0.2
17 bzcat - decompresses files to stdout
18 bzip2recover - recovers data from damaged bzip2 files
19 !!SYNOPSIS
20
21
22 __bzip2__ [[ __-cdfkqstvzVL123456789__ ] [[ ''filenames
23 ...'' ]__
24 bunzip2__ [[ __-fkvsVL__ ] [[ ''filenames ...''
25 ]__
26 bzcat__ [[ __-s__ ] [[ ''filenames ...'' ]__
27 bzip2recover__ ''filename''
28 !!DESCRIPTION
29
30
31 ''bzip2'' compresses files using the Burrows-Wheeler
32 block sorting text compression algorithm, and Huffman
33 coding. Compression is generally considerably better than
34 that achieved by more conventional LZ77/LZ78-based
35 compressors, and approaches the performance of the PPM
36 family of statistical compressors.
37
38
39 The command-line options are deliberately very similar to
40 those of ''GNU gzip,'' but they are not
41 identical.
42
43
44 ''bzip2'' expects a list of file names to accompany the
45 command-line flags. Each file is replaced by a compressed
46 version of itself, with the name
47 ''original_name.bz2''.
48
49
50 ''bzip2'' and ''bunzip2'' will by default not
51 overwrite existing files. If you want this to happen,
52 specify the -f flag.
53
54
55 If no file names are specified, ''bzip2'' compresses from
56 standard input to standard output. In this case,
57 ''bzip2'' will decline to write compressed output to a
58 terminal, as this would be entirely incomprehensible and
59 therefore pointless.
60
61
62 ''bunzip2'' (or ''bzip2 -d'') decompresses all
63 specified files. Files which were not created by
64 ''bzip2'' will be detected and ignored, and a warning
65 issued. ''bzip2'' attempts to guess the filename for the
66 decompressed file from that of the compressed file as
67 follows:
68
69 filename.bz2 becomes filename
70 filename.bz becomes filename
71 filename.tbz2 becomes filename.tar
72 filename.tbz becomes filename.tar
73 anyothername becomes anyothername.out
74
75
76 If the file does not end in one of the recognised endings,
77 ''.bz2'', ''.bz'', ''.tbz2'' or ''.tbz'', ''bzip2'' complains that
78 it cannot guess the name of the original file, and uses the
79 original name with ''.out'' appended.
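The naming rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the code ''bzip2'' itself uses; the function name guess_output_name and the table SUFFIX_RULES are made up for the example.

```python
# Suffix rules as listed in the text: (recognised ending, replacement).
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name):
    """Return the decompressed file name implied by the rules above."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            # Strip the recognised ending and add its replacement, if any.
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the original name and append ".out".
    return compressed_name + ".out"
```

For instance, guess_output_name("backup.tbz2") gives "backup.tar", while a name with an unrecognised ending simply gains ".out".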
80
81
82 As with compression, supplying no filenames causes
83 decompression from standard input to standard
84 output.
85
86
87 ''bunzip2'' will correctly decompress a file which is the
88 concatenation of two or more compressed files. The result is
89 the concatenation of the corresponding uncompressed files.
90 Integrity testing (-t) of concatenated compressed files is
91 also supported.
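The concatenation property can be tried out with the bz2 module from the Python standard library, which reads the same stream format. This is a small sketch assuming Python 3.3 or later, where bz2.decompress accepts multi-stream input; the sample strings are arbitrary.

```python
import bz2

# Two independently compressed streams.
first = bz2.compress(b"hello, ")
second = bz2.compress(b"world\n")

# Joining the compressed streams byte for byte gives a valid input
# whose decompressed form is the concatenation of the originals.
combined = first + second
result = bz2.decompress(combined)
```

Here result holds the two original strings joined together, which mirrors what ''bunzip2'' does with a concatenated file.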
92
93
94 You can also compress or decompress files to the standard
95 output by giving the -c flag. Multiple files may be
96 compressed and decompressed like this. The resulting outputs
97 are fed sequentially to stdout. Compression of multiple
98 files in this manner generates a stream containing multiple
99 compressed file representations. Such a stream can be
100 decompressed correctly only by ''bzip2'' version 0.9.0 or
101 later. Earlier versions of ''bzip2'' will stop after
102 decompressing the first file in the stream.
103
104
105 ''bzcat'' (or ''bzip2 -dc'') decompresses all
106 specified files to the standard output.
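As a rough illustration of what ''bzcat'' does, the following Python sketch decompresses each named file in turn and writes the result to one output stream. It uses the standard bz2 module rather than ''bzip2'' itself, and the function name bzcat and its out parameter are inventions for this example.

```python
import bz2
import shutil
import sys


def bzcat(paths, out=None):
    """Decompress each file in paths, in order, onto a single stream."""
    # Default to the binary standard output, as a command-line tool would.
    out = sys.stdout.buffer if out is None else out
    for path in paths:
        with bz2.open(path, "rb") as src:
            # Stream the decompressed bytes without loading the whole file.
            shutil.copyfileobj(src, out)
```

Passing an in-memory buffer as out is convenient for testing; leaving it unset writes to the standard output.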
107
108
109 ''bzip2'' will read arguments from the environment
110 variables ''BZIP2'' and ''BZIP,'' in that order, and
111 will process them before any arguments read from the command
112 line. This gives a convenient way to supply default
113 arguments.
114
115
116 Compression is always performed, even if the compressed file
117 is slightly larger than the original. Files of less than
118 about one hundred bytes tend to get larger, since the
119 compression mechanism has a constant overhead in the region
120 of 50 bytes. Random data (including the output of most file
121 compressors) is coded at about 8.05 bits per byte, giving an
122 expansion of around 0.5%.
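The overhead described above is easy to observe with the Python bz2 module. The sketch below only measures sizes; the helper name expansion_ratio is made up for the example, and the exact figures will vary from run to run because the large input is random.

```python
import bz2
import os


def expansion_ratio(data):
    """Compressed size divided by original size (above 1.0 means growth)."""
    return len(bz2.compress(data)) / len(data)


# A very small input: the fixed overhead dominates, so it gets larger.
tiny = b"hi"

# Random bytes stand in for already-compressed data; they cannot be
# squeezed further, so the output is slightly larger than the input.
random_block = os.urandom(100_000)
```

Calling expansion_ratio on either input should give a value above 1.0, much larger for the tiny input than for the random block.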
123
124
125 As a self-check for your protection, ''bzip2'' uses
126 32-bit CRCs to make sure that the decompressed version of a
127 file is identical to the original. This guards against
128 corruption of the compressed data, and against undetected
129 bugs in ''bzip2'' (hopefully very unlikely). The chance
130 of data corruption going undetected is microscopic, about
131 one chance in four billion for each file processed. Be
132 aware, though, that the check occurs upon decompression, so
133 it can only tell you that something is wrong. It can't help
134 you recover the original uncompressed data. You can use
135 ''bzip2recover'' to try to recover data from damaged
136 files.
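The effect of the CRC check can be seen by damaging a compressed stream on purpose. The sketch below uses the Python bz2 module and flips one byte of the stored checksum. It assumes the usual stream layout, in which a 4-byte stream header and a 6-byte block marker come before the first block's 32-bit CRC, so that byte 10 is part of that CRC. The helper name is_intact is made up.

```python
import bz2

original = b"some text worth protecting " * 100
compressed = bytearray(bz2.compress(original))

# Assumption: bytes 10..13 hold the first block's stored CRC
# (4-byte stream header, then a 6-byte block marker, then the CRC).
compressed[10] ^= 0xFF


def is_intact(blob):
    """Return True if blob decompresses without a reported error."""
    try:
        bz2.decompress(bytes(blob))
    except (OSError, ValueError):
        # The library reports corrupt or truncated input this way.
        return False
    return True
```

The damaged copy is rejected while a fresh compression of the same data is accepted. As noted above, the check only says that something is wrong; it does not repair anything.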
137
138
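139 Return values: 0 for a normal exit, 1 for environmental
140 problems (file not found, invalid flags, I/O errors, etc.),
141 2 to indicate a corrupt compressed file, 3 for an internal consistency error (e.g. a bug) which caused ''bzip2'' to panic.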
142 !!OPTIONS
143
144
145 __-c --stdout__
146
147
148 Compress or decompress to standard output.
149
150
151 __-d --decompress__
152
153
154 Force decompression. ''bzip2, bunzip2'' and ''bzcat''
155 are really the same program, and the decision about what
156 actions to take is done on the basis of which name is used.
157 This flag overrides that mechanism, and forces ''bzip2''
158 to decompress.
159
160
161 __-z --compress__
162
163
164 The complement to -d: forces compression, regardless of the
165 invocation name.
166
167
168 __-t --test__
169
170
171 Check integrity of the specified file(s), but don't
172 decompress them. This really performs a trial decompression
173 and throws away the result.
174
175
176 __-f --force__
177
178
179 Force overwrite of output files. Normally, ''bzip2'' will
180 not overwrite existing output files. Also forces
181 ''bzip2'' to break hard links to files, which it
182 otherwise wouldn't do.
183
184
185 bzip2 normally declines to decompress files which don't have
186 the correct magic header bytes. If forced (-f), however, it
187 will pass such files through unmodified. This is how GNU
188 gzip behaves.
189
190
191 __-k --keep__
192
193
194 Keep (don't delete) input files during compression or
195 decompression.
196
197
198 __-s --small__
199
200
201 Reduce memory usage, for compression, decompression and
202 testing. Files are decompressed and tested using a modified
203 algorithm which only requires 2.5 bytes per block byte. This
204 means any file can be decompressed in 2300k of memory,
205 albeit at about half the normal speed.
206
207
208 During compression, -s selects a block size of 200k, which
209 limits memory use to around the same figure, at the expense
210 of your compression ratio. In short, if your machine is low
211 on memory (8 megabytes or less), use -s for everything. See
212 MEMORY MANAGEMENT below.
213
214
215 __-q --quiet__
216
217
218 Suppress non-essential warning messages. Messages pertaining
219 to I/O errors and other critical events will not be
220 suppressed.
221
222
223 __-v --verbose__
224
225
226 Verbose mode -- show the compression ratio for each file
227 processed. Further -v's increase the verbosity level,
228 spewing out lots of information which is primarily of
229 interest for diagnostic purposes.
230
231
232 __-L --license -V --version__
233
234
235 Display the software version, license terms and
236 conditions.
237
238
239 __-1 (or --fast) to -9 (or --best)__
240
241
242 Set the block size to 100 k, 200 k .. 900 k when
243 compressing. Has no effect when decompressing. See MEMORY
244 MANAGEMENT below. The --fast and --best aliases are
245 primarily for GNU gzip compatibility. In particular, --fast
246 doesn't make things significantly faster. And --best merely
247 selects the default behaviour.
248
249
250 __--__
251
252
253 Treats all subsequent arguments as file names, even if they
254 start with a dash. This is so you can handle files with
255 names beginning with a dash, for example: bzip2 --
256 -myfilename.
257
258
259 __--repetitive-fast --repetitive-best__
260
261
262 These flags are redundant in versions 0.9.5 and above. They
263 provided some coarse control over the behaviour of the
264 sorting algorithm in earlier versions, which was sometimes
265 useful. 0.9.5 and above have an improved algorithm which
266 renders these flags irrelevant.
267 !!MEMORY MANAGEMENT
268
269
270 ''bzip2'' compresses large files in blocks. The block
271 size affects both the compression ratio achieved, and the
272 amount of memory needed for compression and decompression.
273 The flags -1 through -9 specify the block size to be 100,000
274 bytes through 900,000 bytes (the default) respectively. At
275 decompression time, the block size used for compression is
276 read from the header of the compressed file, and
277 ''bunzip2'' then allocates itself just enough memory to
278 decompress the file. Since block sizes are stored in
279 compressed files, it follows that the flags -1 to -9 are
280 irrelevant to and so ignored during
281 decompression.
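The block size recorded in the header can be inspected directly. The sketch below assumes the usual stream layout, in which a file starts with the three bytes "BZh" followed by one ASCII digit from 1 to 9 giving the block size in units of 100,000 bytes. The function name block_size_digit is made up for the example.

```python
import bz2


def block_size_digit(compressed):
    """Return the block-size digit (1..9) stored in a stream header."""
    # Assumption: the header is b"BZh" followed by an ASCII digit.
    if compressed[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    return int(chr(compressed[3]))


# compresslevel plays the role of the -1 .. -9 flags.
header_for_level_1 = bz2.compress(b"example", compresslevel=1)[:4]
```

Since the decompressor reads this digit itself, nothing equivalent to -1 to -9 needs to be supplied when decompressing.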
282
283
284 Compression and decompression requirements, in bytes, can be
285 estimated as:
286
287
288 Compression: 400k + ( 8 x block size )
289
290
291 Decompression: 100k + ( 4 x block size ), or 100k + ( 2.5 x
292 block size )
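The estimates above can be written out as small functions. This sketch works in units of "k" exactly as the formulas do, and takes the 2.5 multiplier to be the reduced-memory (-s) case. The function names are made up for the example.

```python
def block_size_k(level):
    """Block size in k for flags -1 .. -9 (100k per step)."""
    return 100 * level


def compress_memory_k(level):
    """Estimated compression memory: 400k + 8 x block size."""
    return 400 + 8 * block_size_k(level)


def decompress_memory_k(level, small=False):
    """Estimated decompression memory: 100k + 4 x block size,
    or 100k + 2.5 x block size in the reduced-memory case."""
    factor = 2.5 if small else 4
    return 100 + factor * block_size_k(level)
```

For the default block size (level 9) these give 7600k for compression, 3700k for decompression and 2350k for reduced-memory decompression.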
293
294
295 Larger block sizes give rapidly diminishing marginal
296 returns. Most of the compression comes from the first two or
297 three hundred k of block size, a fact worth bearing in mind
298 when using ''bzip2'' on small machines. It is also
299 important to appreciate that the decompression memory
300 requirement is set at compression time by the choice of
301 block size.
302
303
304 For files compressed with the default 900k block size,
305 ''bunzip2'' will require about 3700 kbytes to decompress.
306 To support decompression of any file on a 4 megabyte
307 machine, ''bunzip2'' has an option to decompress using
308 approximately half this amount of memory, about 2300 kbytes.
309 Decompression speed is also halved, so you should use this
310 option only where necessary. The relevant flag is
311 -s.
312
313
314 In general, try to use the largest block size that memory
315 constraints allow, since that maximises the compression
316 achieved. Compression and decompression speed are virtually
317 unaffected by block size.
318
319
320 Another significant point applies to files which fit in a
321 single block -- that means most files you'd encounter using
322 a large block size. The amount of real memory touched is
323 proportional to the size of the file, since the file is
324 smaller than a block. For example, compressing a file 20,000
325 bytes long with the flag -9 will cause the compressor to
326 allocate around 7600k of memory, but only touch 400k + 20000
327 * 8 = 560 kbytes of it. Similarly, the decompressor will
328 allocate 3700k but only touch 100k + 20000 * 4 = 180
329 kbytes.
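The arithmetic in this example can be checked with a short sketch. It assumes that "k" means 1000 bytes and that a file no larger than one block only touches memory in proportion to its own size. The names K, compress_touched and decompress_touched are made up.

```python
K = 1000  # assumption: "k" in these estimates means 1000 bytes


def compress_touched(file_size, level=9):
    """Bytes of memory actually touched while compressing."""
    block = 100 * K * level
    # A file smaller than the block only uses its own share of it.
    return 400 * K + 8 * min(file_size, block)


def decompress_touched(file_size, level=9):
    """Bytes of memory actually touched while decompressing."""
    block = 100 * K * level
    return 100 * K + 4 * min(file_size, block)
```

With a 20,000 byte file these give 560,000 and 180,000 bytes, matching the 560 kbytes and 180 kbytes quoted above.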
330
331
332 Here is a table which summarises the maximum memory usage
333 for different block sizes. Also recorded is the total
334 compressed size for 14 files of the Calgary Text Compression
335 Corpus totalling 3,141,622 bytes. This column gives some
336 feel for how compression varies with block size. These
337 figures tend to understate the advantage of larger block
338 sizes for larger files, since the Corpus is dominated by
339 smaller files.
340
341 Flag  Compress usage  Decompress usage  Decompress -s usage  Corpus size
342 -1    1200k           500k              350k                 914704
343 -2    2000k           900k              600k                 877703
344 -3    2800k           1300k             850k                 860338
345 -4    3600k           1700k             1100k                846899
346 -5    4400k           2100k             1350k                845160
347 -6    5200k           2500k             1600k                838626
348 -7    6100k           2900k             1850k                834096
349 -8    6800k           3300k             2100k                828642
350 -9    7600k           3700k             2350k                828642
351 !!RECOVERING DATA FROM DAMAGED FILES
352
353
354 ''bzip2'' compresses files in blocks, usually 900kbytes
355 long. Each block is handled independently. If a media or
356 transmission error causes a multi-block .bz2 file to become
357 damaged, it may be possible to recover data from the
358 undamaged blocks in the file.
359
360
361 The compressed representation of each block is delimited by
362 a 48-bit pattern, which makes it possible to find the block
363 boundaries with reasonable certainty. Each block also
364 carries its own 32-bit CRC, so damaged blocks can be
365 distinguished from undamaged ones.
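The search for block boundaries can be sketched as a bit-by-bit scan for the 48-bit delimiter. The value used below, 0x314159265359, is the block marker of the bzip2 stream format as I understand it; it is not given in this page, so treat it as an assumption. Blocks are not byte-aligned, which is why the scan moves one bit at a time. This is not the code ''bzip2recover'' uses, and the names are made up.

```python
import bz2

# Assumption: 48-bit pattern that starts every compressed block.
BLOCK_MAGIC = 0x314159265359
MASK_48 = (1 << 48) - 1


def find_block_starts(data):
    """Return the bit offsets at which the block marker appears."""
    window = 0      # the most recent 48 bits seen
    bitpos = 0      # number of bits consumed so far
    starts = []
    for byte in data:
        # Bits are taken most significant first within each byte.
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bitpos += 1
            if bitpos >= 48 and window == BLOCK_MAGIC:
                starts.append(bitpos - 48)
    return starts
```

For a stream produced by bz2.compress the first offset found is 32, straight after the 4-byte stream header. Input larger than one block yields further offsets at arbitrary bit positions.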
366
367
368 ''bzip2recover'' is a simple program whose purpose is to
369 search for blocks in .bz2 files, and write each block out
370 into its own .bz2 file. You can then use ''bzip2'' -t to
371 test the integrity of the resulting files, and decompress
372 those which are undamaged.
373
374
375 ''bzip2recover'' takes a single argument, the name of the
376 damaged file, and writes each block it finds to a separate
377 file.
378
379
380 ''bzip2recover'' should be of most use dealing with large
381 .bz2 files, as these will contain many blocks. It is clearly
382 futile to use it on damaged single-block files, since a
383 damaged block cannot be recovered. If you wish to minimise
384 any potential data loss through media or transmission
385 errors, you might consider compressing with a smaller block
386 size.
387 !!PERFORMANCE NOTES
388
389
390 The sorting phase of compression gathers together similar
391 strings in the file. Because of this, files containing very
392 long runs of repeated symbols may compress more slowly than normal.
393
394
395 Decompression speed is unaffected by these
396 phenomena.
397
398
399 ''bzip2'' usually allocates several megabytes of memory
400 to operate in, and then charges all over it in a fairly
401 random fashion. This means that performance, both for
402 compressing and decompressing, is largely determined by the
403 speed at which your machine can service cache misses.
404 Because of this, small changes to the code to reduce the
405 miss rate have been observed to give disproportionately
406 large performance improvements. I imagine ''bzip2'' will
407 perform best on machines with very large
408 caches.
409 !!CAVEATS
410
411
412 I/O error messages are not as helpful as they could be.
413 ''bzip2'' tries hard to detect I/O errors and exit
414 cleanly, but the details of what the problem is sometimes
415 seem rather misleading.
416
417
418 This manual page pertains to version 1.0.2 of ''bzip2.''
419 Compressed data created by this version is entirely forwards
420 and backwards compatible with the previous public releases,
421 versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0 and 1.0.1, but with the
422 following exception: 0.9.0 and above can correctly
423 decompress multiple concatenated compressed files. 0.1pl2
424 cannot do this; it will stop after decompressing just the
425 first file in the stream.
426
427
428 Versions of ''bzip2recover'' prior to this one, 1.0.2, used
429 32-bit integers to represent bit positions in compressed
430 files, so they could not handle compressed files more than 512
431 megabytes long. Versions 1.0.2 and above use 64-bit ints on
432 some platforms which support them (GNU supported targets,
433 and Windows). To establish whether or not bzip2recover was
434 built with such a limitation, run it without arguments. In
435 any event you can build yourself an unlimited version if you
436 can recompile it with MaybeUInt64 set to be an unsigned
437 64-bit integer.
438 !!AUTHOR
439
440
441 Julian Seward, jseward@acm.org.
442
443
444 http://sources.redhat.com/bzip2
445
446
447 The ideas embodied in ''bzip2'' are due to (at least) the
448 following people: Michael Burrows and David Wheeler (for the
449 block sorting transformation), David Wheeler (again, for the
450 Huffman coder), Peter Fenwick (for the structured coding
451 model in the original ''bzip,'' and many refinements),
452 and Alistair Moffat, Radford Neal and Ian Witten (for the
453 arithmetic coder in the original ''bzip).'' I am much
454 indebted for their help, support and advice. See the manual
455 in the source distribution for pointers to sources of
456 documentation. Christian von Roques encouraged me to look
457 for faster sorting algorithms, so as to speed up
458 compression. Bela Lubkin encouraged me to improve the
459 worst-case compression performance. The bz* scripts are
460 derived from those of GNU gzip. Many people sent patches,
461 helped with portability problems, lent machines, gave advice
462 and were generally helpful.
463 ----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.