DICTZIP


NAME

dictzip, dictunzip, dictzcat - compress (or expand) files, allowing random access

SYNOPSIS

dictzip [options] name

dictunzip [options] name

dictzcat name

DESCRIPTION

dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952) allows extra data to be stored in the header of a compressed file. Programs like gzip and zcat will ignore this extra data. However, dictd(8), the DICT protocol dictionary server, will make use of this data to perform pseudo-random access on the file. Files in the dictzip format should end in ".dz" so that they may be distinguished from common gzip files that do not contain the special header information.

From RFC 1952, the extra field is specified as follows:

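If the FLG.FEXTRA bit is set, an "extra field" is present in the header, with total length XLEN bytes. It consists of a series of subfields, each of the form:

+---+---+---+---+==================================+
|SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
+---+---+---+---+==================================+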

SI1 and SI2 provide a subfield ID, typically two ASCII letters with some mnemonic value. Jean-Loup Gailly maintains a registry of subfield IDs.

LEN gives the length of the subfield data, excluding the 4 initial bytes.
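The subfield layout quoted above can be walked with a few lines of Python. This is only a sketch of the RFC 1952 framing: the function name iter_subfields and the sample bytes are made up for illustration, and the code assumes the extra field has already been extracted from the gzip header.

```python
import struct

def iter_subfields(extra: bytes):
    """Yield (subfield_id, data) pairs from a gzip extra field.

    Each subfield is SI1, SI2 (one byte each), then LEN as a 16-bit
    little-endian integer, then LEN bytes of data.
    """
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        pos += 4
        data = extra[pos:pos + length]
        if len(data) != length:
            raise ValueError("subfield runs past the end of the extra field")
        yield bytes([si1, si2]), data
        pos += length

# Two made-up subfields: "AB" carrying 3 bytes and "CD" carrying none.
sample = b"AB" + struct.pack("<H", 3) + b"xyz" + b"CD" + struct.pack("<H", 0)
subfields = list(iter_subfields(sample))
print(subfields)
```

A reader looking for one particular subfield would compare the yielded ID against the two letters it expects and skip the rest.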

The dictzip program uses 'R' for SI1, and 'A' for SI2 (i.e., "Random Access"). After the LEN field, the data is arranged as follows:

+---+---+---+---+---+---+===============================+
|  VER  | CHLEN | CHCNT |  ... CHCNT words of data ...  |
+---+---+---+---+---+---+===============================+

As per RFC 1952, all data is stored least-significant byte first. For VER 1 of the data, all values are 16 bits long (2 bytes), and are unsigned integers.
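As an illustration of the layout above, the subfield data could be decoded as follows. This is a minimal sketch: parse_ra_payload is a hypothetical helper name, the sample values are invented, and the trailing CHCNT words are returned as plain integers without further interpretation.

```python
import struct

def parse_ra_payload(data: bytes):
    """Decode VER, CHLEN, CHCNT and the CHCNT trailing 16-bit words.

    All fields are unsigned 16-bit little-endian integers, which is the
    layout described for VER 1.
    """
    ver, chlen, chcnt = struct.unpack_from("<HHH", data, 0)
    if ver != 1:
        raise ValueError(f"unsupported version {ver}")
    words = struct.unpack_from(f"<{chcnt}H", data, 6)
    return ver, chlen, list(words)

# Invented example: version 1, CHLEN 50000, three data words.
payload = struct.pack("<HHH3H", 1, 50000, 3, 1200, 1100, 640)
print(parse_ra_payload(payload))
```

If the payload is shorter than CHCNT claims, struct.unpack_from raises struct.error, which is a reasonable signal that the header is damaged.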

XLEN (which is specified earlier in the header) is a two byte integer, so the extra field can be 0xffff bytes long, 2 bytes of which are used for the subfield ID (SI1 and SI2), and 2 bytes of which are used for the subfield length (LEN). This leaves 0xfffb bytes (0x7ffd 2-byte entries or 0x3ffe 4-byte entries). Given that the zip output buffer must be 10% + 12 bytes larger than the input buffer, we can store 58969 bytes per entry, or about 1.8GB if the 2-byte entries are used. If this becomes a limiting factor, another format version can be selected and defined for 4-byte entries.
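The arithmetic in the paragraph above can be checked directly. The sketch below follows the text's own accounting (it subtracts only the 4 bytes of SI1, SI2 and LEN) and takes the figure of 58969 bytes per entry from the text as given; the variable names are illustrative.

```python
XLEN_MAX = 0xFFFF        # largest value of the two-byte XLEN field
SUBFIELD_HEADER = 4      # SI1 + SI2 (2 bytes) and LEN (2 bytes)
BYTES_PER_ENTRY = 58969  # uncompressed bytes per entry, as stated in the text

available = XLEN_MAX - SUBFIELD_HEADER  # bytes left for subfield data
entries_2byte = available // 2          # number of 2-byte entries
entries_4byte = available // 4          # number of 4-byte entries
total_bytes = entries_2byte * BYTES_PER_ENTRY

print(hex(available), hex(entries_2byte), hex(entries_4byte))
print(total_bytes, round(total_bytes / 2**30, 2))
```

The total comes to 1932119285 bytes, which is about 1.8 when expressed in units of 2^30 bytes.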

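For compression, the file is divided up into "chunks" of data, each less than 64kB, and each chunk is compressed separately. CHLEN gives the length of a chunk before compression, CHCNT gives the number of chunks, and the CHCNT data words give the length of each chunk after compression.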

To perform random access on the data, the offset and length of the data are provided to library routines. These routines determine the chunk in which the desired data begins, and decompress that chunk. Consecutive chunks are decompressed as necessary.
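The lookup described above can be sketched with Python's zlib module. This is not the dictd library code: the chunks are kept in an in-memory list instead of a file, the chunk length is deliberately tiny, and flushing with Z_FULL_FLUSH is an assumption made here as one way to keep each chunk independently decompressible. The names compress_chunks and read_range are hypothetical.

```python
import zlib

CHUNK_LEN = 1000  # tiny chunk length for the demo only

def compress_chunks(data: bytes, chunk_len: int = CHUNK_LEN):
    """Compress data as raw deflate, one list item per chunk."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # -15 selects raw deflate
    chunks = []
    for start in range(0, len(data), chunk_len):
        piece = comp.compress(data[start:start + chunk_len])
        # A full flush ends the chunk on a byte boundary and resets the
        # compressor's history, so the next chunk does not refer back.
        piece += comp.flush(zlib.Z_FULL_FLUSH)
        chunks.append(piece)
    tail = comp.flush()  # terminates the deflate stream
    if chunks:
        chunks[-1] += tail
    return chunks

def read_range(chunks, chunk_len: int, offset: int, length: int) -> bytes:
    """Return length bytes starting at offset, inflating only needed chunks."""
    if length <= 0:
        return b""
    first = offset // chunk_len                 # chunk where the data begins
    last = (offset + length - 1) // chunk_len   # chunk where the data ends
    out = bytearray()
    for index in range(first, min(last, len(chunks) - 1) + 1):
        out += zlib.decompressobj(wbits=-15).decompress(chunks[index])
    skip = offset - first * chunk_len
    return bytes(out[skip:skip + length])

text = b"".join(b"entry %d: definition\n" % i for i in range(400))
chunks = compress_chunks(text)
print(read_range(chunks, CHUNK_LEN, 1990, 30))
```

In a real file the position of each compressed chunk would be found by summing the compressed lengths of the chunks before it, which is why those lengths need to be available from the header.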

TRADEOFFS

Speed

True random file access is not realized, since any access, even for a single byte, requires that a 64kB chunk be read and decompressed. This is slower than accessing a flat text file, but is much, much faster than performing serial access on a fully compressed file.

Space

For the textual dictionary databases we are working with, the use of 64kB chunks and maximal LZ77 compression realizes a file which is only about 4% larger than the same file compressed all at once.

OPTIONS

-d or --decompress

Decompress. This is the default if the executable is called dictunzip.

-c or --stdout

Write output on standard output; keep original files unchanged. This is only available when decompressing (because parts of the header must be updated after a write when compressing).

-f or --force

Force compression or decompression even if the output file already exists.

-h or --help

Display help.

-k or --keep

Do not delete the original file.

-l or --list

For each compressed file, list the following fields:

type: dzip, gzip, or text (includes files in unknown formats)

crc: CRC checksum

date and time: from header

chunks: number of chunks in file

size: size of each uncompressed chunk

compr.: compressed size

uncompr.: uncompressed size

ratio: compression ratio (0.0% if unknown)

name: name of uncompressed file

Unlike gzip, the compression method is not detected.

-L or --license

Display the dictzip license and quit.

-t or --test

Check the compressed file integrity. This option is not implemented. Instead, it will list the header information.

-v or --verbose

Verbose. Display extra information during compression.

-V or --version

Version. Display the version number and compilation options then quit.

-s start or --start start

Specify the offset to start decompression, using decimal numbers. The default is at the beginning of the file.

-e size or --size size

Specify the size of the portion of the file to decompress, using decimal numbers. The default is the whole file.

-S start or --Start start

Specify the offset to start decompression, using base64 numbers. The default is at the beginning of the file.

-E size or --Size size

Specify the size of the portion of the file to decompress, using base64 numbers. The default is the whole file.
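This page does not define what a "base64 number" is. The sketch below assumes the convention used in dictd index files, where each character is one base-64 digit drawn from the standard base64 alphabet and the most significant digit comes first. Both the convention and the helper names b64_to_int and int_to_b64 are assumptions made for illustration, not something stated here.

```python
# Assumed digit alphabet: the standard base64 character set, value 0 to 63.
B64_DIGITS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def b64_to_int(text: str) -> int:
    """Convert a base64 number (most significant digit first) to an int."""
    value = 0
    for ch in text:
        value = value * 64 + B64_DIGITS.index(ch)  # ValueError on bad digit
    return value

def int_to_b64(value: int) -> str:
    """Convert a non-negative int to its base64 number representation."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = ""
    while True:
        digits = B64_DIGITS[value % 64] + digits
        value //= 64
        if value == 0:
            return digits

print(b64_to_int("BA"), int_to_b64(64))
```

Under this assumption a start offset of 64 would be written as BA for -S, where -s would take the decimal 64.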

-p prefilter or --pre prefilter

Specify a shell command to execute as a filter before compression or decompression of a chunk. The pre- and post-compression filters can be used to provide additional compression or output formatting. The filters may not increase the buffer size significantly. The pre- and post-compression filters were designed to provide the most general interface possible.

-P postfilter or --post postfilter

Specify a shell command to execute as a filter after compression or decompression.

CREDITS

dictzip was written by Rik Faith (faith@cs.unc.edu) and is distributed under the terms of the GNU General Public License. If you need to distribute under other terms, write to the author.

The main libraries used by this program (zlib, regex, libmaa) are distributed under different terms, so you may be able to use the libraries for applications which are incompatible with the GPL -- please see the copyright notices and license information that come with the libraries for more information, and consult with your attorney to resolve these issues.

SEE ALSO

dict(1), dictd(8), gzip(1), gunzip(1), zcat(1)

