View Source: wget(1) - Waikato Linux Users Group

!!!WGET
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
FILES
BUGS
SEE ALSO
AUTHOR
COPYRIGHT
----
!!NAME


wget - GNU Wget Manual
!!SYNOPSIS


wget [[''option'']... [[''URL'']...
!!DESCRIPTION


GNU Wget is a free utility for
non-interactive download of files from the Web. It supports
HTTP, HTTPS, and FTP protocols, as well as retrieval through
HTTP proxies.


Wget is non-interactive, meaning that it can work in the
background, while the user is not logged on. This allows you
to start a retrieval and disconnect from the system, letting
Wget finish the work. By contrast, most Web browsers
require the user's constant presence, which can be a great
hindrance when transferring a lot of data.


Wget can follow links in HTML pages and
create local versions of remote web sites, fully recreating
the directory structure of the original site. This is
sometimes referred to as ``recursive downloading.'' While
doing that, Wget respects the Robot Exclusion Standard
(''/robots.txt''). Wget can be instructed to convert the
links in downloaded HTML files to the local
files for offline viewing.


Wget has been designed for robustness over slow or unstable
network connections; if a download fails due to a network
problem, it will keep retrying until the whole file has been
retrieved. If the server supports regetting, it will
instruct the server to continue the download from where it
left off.
!!OPTIONS


__Basic Startup Options__


__-V__


__--version__


Display the version of Wget.


__-h__


__--help__


Print a help message describing all of Wget's command-line
options.


__-b__


__--background__


Go to background immediately after startup. If no output
file is specified with __-o__, output is redirected to
''wget-log''.


__-e__ ''command''


__--execute__ ''command''


Execute ''command'' as if it were a part of
''.wgetrc''. A command thus invoked will be executed
''after'' the commands in ''.wgetrc'', thus taking
precedence over them.


__Logging and Input File Options__


__-o__ ''logfile''


__--output-file=__''logfile''


Log all messages to ''logfile''. The messages are
normally reported to standard error.


__-a__ ''logfile''


__--append-output=__''logfile''


Append to ''logfile''. This is the same as __-o__,
only it appends to ''logfile'' instead of overwriting the
old log file. If ''logfile'' does not exist, a new file
is created.


__-d__


__--debug__


Turn on debug output, meaning various information important
to the developers of Wget if it does not work properly. Your
system administrator may have chosen to compile Wget without
debug support, in which case __-d__ will not work. Please
note that compiling with debug support is always safe---Wget
compiled with the debug support will ''not'' print any
debug info unless requested with __-d__.


__-q__


__--quiet__


Turn off Wget's output.


__-v__


__--verbose__


Turn on verbose output, with all the available data. The
default output is verbose.


__-nv__


__--non-verbose__


Non-verbose output---turn off verbose without being
completely quiet (use __-q__ for that), which means that
error messages and basic information still get
printed.


__-i__ ''file''


__--input-file=__''file''


Read URLs from ''file'', in which case no URLs need to be
on the command line. If there are URLs both on the command
line and in an input file, those on the command line will
be the first ones to be retrieved. The ''file'' need not
be an HTML document (but no harm if it
is)---it is enough if the URLs are just listed
sequentially.


However, if you specify __--force-html__, the document
will be regarded as __html__. In that case you may have
problems with relative links, which you can solve either by
adding <base href="url"> to the
documents or by specifying __--base=__''url'' on the
command line.


__-F__


__--force-html__


When input is read from a file, force it to be treated as an
HTML file. This enables you to retrieve
relative links from existing HTML files on
your local disk, by adding
<base href="url"> to the HTML, or by
using the __--base__ command-line option.


__-B__ ''URL''


__--base=__''URL''


When used in conjunction with __-F__, prepends
''URL'' to relative links in the file
specified by __-i__.


__Download Options__


__--bind-address=__''ADDRESS''


When making client TCP/IP connections,
bind() to ''ADDRESS'' on the
local machine. ''ADDRESS'' may be
specified as a hostname or IP address. This
option can be useful if your machine is bound to multiple
IPs.


__-t__ ''number''


__--tries=__''number''


Set number of retries to ''number''. Specify 0 or
__inf__ for infinite retrying.


__-O__ ''file''


__--output-document=__''file''


The documents will not be written to the appropriate files,
but all will be concatenated together and written to
''file''. If ''file'' already exists, it will be
overwritten. If the ''file'' is __-__, the documents
will be written to standard output. Including this option
automatically sets the number of tries to 1.


__-nc__


__--no-clobber__


If a file is downloaded more than once in the same
directory, Wget's behavior depends on a few options,
including __-nc__. In certain cases, the local file will
be ''clobbered'', or overwritten, upon repeated download.
In other cases it will be preserved.


When running Wget without __-N__, __-nc__, or
__-r__, downloading the same file in the same directory
will result in the original copy of ''file'' being
preserved and the second copy being named
''file''__.1__. If that file is downloaded yet again,
the third copy will be named ''file''__.2__, and so
on. When __-nc__ is specified, this behavior is
suppressed, and Wget will refuse to download newer copies of
''file''. Therefore, ``no-clobber'' is actually
a misnomer in this mode---it's not clobbering that's
prevented (as the numeric suffixes were already preventing
clobbering), but rather the multiple version saving that's
prevented.


When running Wget with __-r__, but without __-N__ or
__-nc__, re-downloading a file will result in the new
copy simply overwriting the old. Adding __-nc__ will
prevent this behavior, instead causing the original version
to be preserved and any newer copies on the server to be
ignored.


When running Wget with __-N__, with or without __-r__,
the decision as to whether or not to download a newer copy
of a file depends on the local and remote timestamp and size
of the file. __-nc__ may not be specified at the same
time as __-N__.


Note that when __-nc__ is specified, files with the
suffixes __.html__ or (yuck) __.htm__ will be loaded
from the local disk and parsed as if they had been retrieved
from the Web.
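
The naming rules above can be sketched as a small simulation. This is only an illustrative model, not Wget's own code: the function names and the set that stands in for the directory contents are assumptions, and it covers just the default mode (no __-N__, __-nc__, or __-r__) plus the effect of adding __-nc__.

```python
def next_local_name(name, existing):
    """Return the name a repeated download would be saved under.

    Models the default mode: the original copy is kept and later
    copies receive numeric suffixes .1, .2, and so on.
    """
    if name not in existing:
        return name
    n = 1
    while f"{name}.{n}" in existing:
        n += 1
    return f"{name}.{n}"


def download(name, existing, no_clobber=False):
    """Simulate one download; return the saved name, or None if skipped."""
    if no_clobber and name in existing:
        return None  # -nc: keep the original and fetch nothing
    saved = next_local_name(name, existing)
    existing.add(saved)
    return saved
```

Calling download three times for the same name yields the name, then the name with __.1__, then with __.2__; a fourth call with no_clobber=True returns None and leaves the set unchanged.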


__-c__


__--continue__


Continue getting a partially-downloaded file. This is useful
when you want to finish up a download started by a previous
instance of Wget, or by another program. For
instance:


        wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
If there is a file named ''ls-lR.Z'' in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.


Note that you don't need to specify this option if you just
want the current invocation of Wget to retry downloading a
file should the connection be lost midway through. This is
the default behavior. __-c__ only affects resumption of
downloads started ''prior'' to this invocation of Wget,
and whose local files are still sitting around.


Without __-c__, the previous example would just download
the remote file to ''ls-lR.Z.1'', leaving the truncated
''ls-lR.Z'' file alone.


Beginning with Wget 1.7, if you use __-c__ on a non-empty
file, and it turns out that the server does not support
continued downloading, Wget will refuse to start the
download from scratch, which would effectively ruin existing
contents. If you really want the download to start from
scratch, remove the file.


Also beginning with Wget 1.7, if you use __-c__ on a file
which is of equal size as the one on the server, Wget will
refuse to download the file and print an explanatory
message. The same happens when the file is smaller on the
server than locally (presumably because it was changed on
the server since your last download attempt)---because
``continuing'' is not meaningful, no download
occurs.


On the other side of the coin, while using __-c__, any
file that's bigger on the server than locally will be
considered an incomplete download and only
(length(remote) - length(local)) bytes will be
downloaded and tacked onto the end of the local file. This
behavior can be desirable in certain cases---for instance,
you can use __wget -c__ to download just the new portion
that's been appended to a data collection or log
file.


However, if the file is bigger on the server because it's
been ''changed'', as opposed to just ''appended'' to,
you'll end up with a garbled file. Wget has no way of
verifying that the local file is really a valid prefix of
the remote file. You need to be especially careful of this
when using __-c__ in conjunction with __-r__, since
every file will be considered as an ``incomplete download''
candidate.


Another instance where you'll get a garbled file if you try
to use __-c__ is if you have a lame HTTP
proxy that inserts a ``transfer interrupted'' string into
the local file. In the future a ``rollback'' option may be
added to deal with this case.


Note that __-c__ only works with FTP
servers and with HTTP servers that support
the Range header.
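
The size comparison that __-c__ relies on, as described above, can be summarised in a few lines. This is a hedged sketch rather than Wget's implementation: the function name and the returned labels are made up for illustration, and it assumes the server supports continued downloading and that the local file is non-empty.

```python
def continue_action(local_size, remote_size):
    """Return (action, bytes_to_fetch) for a -c style resume decision.

    Only the file sizes are compared; nothing checks that the local
    file really is a prefix of the remote one.
    """
    if remote_size > local_size:
        # Treated as an incomplete download: fetch only the missing tail.
        return ("append", remote_size - local_size)
    # Equal size, or smaller on the server: continuing is not meaningful.
    return ("skip", 0)
```

Because only sizes are compared, a file that was changed on the server (rather than appended to) would still be reported as "append", which is exactly the garbled-file case the text warns about.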


__--progress=__''type''


Select the type of the progress indicator you wish to use.
Legal indicators are ``dot'' and ``bar''.


The ``dot'' indicator is used by default. It traces the
retrieval by printing dots on the screen, each dot
representing a fixed amount of downloaded data.


When using the dotted retrieval, you may also set the
''style'' by specifying the type as
__dot:__''style''. Different styles assign different
meaning to one dot. With the default style each dot
represents 1K, there are ten dots in a cluster and 50 dots
in a line. The binary style has a more
``computer''-like orientation---8K dots, 16-dots clusters
and 48 dots per line (which makes for 384K lines). The
mega style is suitable for downloading very large
files---each dot represents 64K retrieved, there are eight
dots in a cluster, and 48 dots on each line (so each line
contains 3M).
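
The per-line figures quoted for each dot style follow from simple multiplication. The table below is a sketch built only from the numbers in the paragraph above; the dictionary layout and function name are illustrative assumptions.

```python
# Dot size (in K), dots per cluster, and dots per line for each style,
# taken from the description of --progress=dot:style.
DOT_STYLES = {
    "default": {"dot_kb": 1, "cluster": 10, "line": 50},
    "binary": {"dot_kb": 8, "cluster": 16, "line": 48},
    "mega": {"dot_kb": 64, "cluster": 8, "line": 48},
}


def kb_per_line(style):
    """Amount of data (in K) represented by one full line of dots."""
    s = DOT_STYLES[style]
    return s["dot_kb"] * s["line"]
```

For the binary style this gives 8 * 48 = 384K per line, and for the mega style 64 * 48 = 3072K, i.e. 3M.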


Specifying __--progress=bar__ will draw an
ASCII progress bar (a.k.a.
``thermometer'' display) to indicate retrieval. If the
output is not a TTY, this option will be
ignored, and Wget will revert to the dot indicator. If you
want to force the bar indicator, use
__--progress=bar:force__.


__-N__


__--timestamping__


Turn on time-stamping.


__-S__


__--server-response__


Print the headers sent by HTTP servers and
responses sent by FTP servers.


__--spider__


When invoked with this option, Wget will behave as a Web
''spider'', which means that it will not download the
pages, just check that they are there. You can use it to
check your bookmarks, e.g. with:


        wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real WWW spiders.


__-T__ ''seconds''


__--timeout=__''seconds''


Set the read timeout to ''seconds'' seconds. Whenever a
network read is issued, the file descriptor is checked for a
timeout; without this check, a stalled read could leave the
connection pending indefinitely. The default timeout is 900
seconds (fifteen minutes). Setting timeout to 0 will disable
checking for timeouts.


Please do not lower the default timeout value with this
option unless you know what you are doing.


__-w__ ''seconds''


__--wait=__''seconds''


Wait the specified number of seconds between the retrievals.
Use of this option is recommended, as it lightens the server
load by making the requests less frequent. Instead of in
seconds, the time can be specified in minutes using the
m suffix, in hours using the h suffix, or in
days using the d suffix.


Specifying a large value for this option is useful if the
network or the destination host is down, so that Wget can
wait long enough to reasonably expect the network error to
be fixed before the retry.
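
As a rough illustration of the suffixes, the following sketch converts a __--wait__ style value to seconds. The parser is hypothetical and only mirrors the m, h, and d suffixes named above; Wget's own argument handling may accept or reject inputs differently.

```python
# Seconds per unit for the suffixes described for --wait.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def wait_seconds(value):
    """Convert a value such as '30', '2m', '1h' or '1d' to seconds."""
    value = value.strip()
    if value and value[-1] in UNIT_SECONDS:
        return float(value[:-1]) * UNIT_SECONDS[value[-1]]
    return float(value)  # no suffix: the value is already in seconds
```

So a value of 2m corresponds to 120 seconds between retrievals, and 1d to 86400 seconds.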


__--waitretry=__''seconds''


If you don't want Wget to wait between ''every''
retrieval, but only between retries of failed downloads, you
can use this option. Wget will use ''linear backoff'',
waiting 1 second after the first failure on a given file,
then waiting 2 seconds after the second failure on that
file, up to the maximum number of ''seconds'' you
specify. Therefore, a value of 10 will actually make Wget
wait up to (1 + 2 + ... + 10) = 55 seconds per
file.


Note that this option is turned on by default in the global
''wgetrc'' file.
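
The linear backoff arithmetic can be checked with a short sketch. The function name is illustrative, and holding the wait at the maximum once it is reached is an assumption drawn from the phrase "up to the maximum number of seconds"; the text does not spell out what happens on later failures.

```python
def retry_waits(max_seconds, failures):
    """Waits (in seconds) after each consecutive failure on one file.

    The wait grows by one second per failure and is capped at
    max_seconds (the cap behaviour is an assumption of this sketch).
    """
    return [min(n, max_seconds) for n in range(1, failures + 1)]
```

With a value of 10 and ten failures, the waits are 1, 2, ..., 10 seconds, which sum to the 55 seconds mentioned above.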


__--random-wait__


Some web sites may perform log analysis to identify
retrieval programs such as Wget by looking for statistically
significant similarities in the time between requests. This
option causes the time between requests to vary between 0
and 2 * ''wait'' seconds, where ''wait'' was specified
using the __-w__ or __--wait__ options, in order to
mask Wget's presence from such analysis.


A recent article in a publication devoted to development on
a popular consumer platform provided code to perform this
analysis on the fly. Its author suggested blocking at the
class C address level to ensure automated retrieval programs
were blocked despite changing DHCP-supplied
addresses.


The __--random-wait__ option was inspired by this
ill-advised recommendation to block many unrelated users
from a web site due to the actions of one.
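
To make the range concrete, here is a minimal sketch of picking a delay between 0 and 2 * ''wait'' seconds. The function name and the use of a uniform distribution are assumptions; the description above only states the bounds, not how Wget chooses a value within them.

```python
import random


def random_wait(wait, rng=random):
    """Pick a delay in the range [0, 2 * wait] seconds.

    A uniform draw is assumed here purely for illustration.
    """
    return rng.uniform(0, 2 * wait)
```

With __--wait=5__, every delay produced this way lies between 0 and 10 seconds, and the average over many requests stays close to the 5 seconds that was asked for.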


__-Y on/off__


__--proxy=on/off__


Turn proxy support on or off. The proxy is on by default if
the appropriate environment variable is
defined.


__-Q__ ''quota''


__--quota=__''quota''


Specify download quota for automatic retrievals. The value
can be specified in bytes (default), kilobytes (with
__k__ suffix), or megabytes (with __m__
suffix).


Note that quota will never affect downloading a single file.
So if you specify __wget -Q10k
ftp://wuarchive.wustl.edu/ls-lR.gz__, all of the
''ls-lR.gz'' will be downloaded. The same goes even when
several URLs are specified on the command-line. However,
quota is respected when retrieving either recursively, or
from an input file. Thus you may safely type __wget -Q2m -i
sites__---download will be aborted when the quota is
exceeded.


Setting quota to 0 or to __inf__ unlimits the download
quota.
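
The statement that a quota never cuts a single file short can be modelled as a check made between files. This is a simplified sketch of the recursive or input-file case only: the function name is hypothetical, the quota is taken in plain bytes (suffix parsing is left out), and treating 0 as unlimited follows the paragraph above.

```python
def files_fetched(sizes, quota):
    """Return the sizes of the files fetched before the quota stops the run.

    The quota is consulted only before starting a file, so a file
    that has been started is always downloaded in full.
    """
    fetched, total = [], 0
    for size in sizes:
        if quota and total > quota:
            break  # quota already exceeded: abort the remaining downloads
        fetched.append(size)
        total += size
    return fetched
```

A single 50000-byte file is fetched in full even under a 10240-byte quota, whereas a list of four 4000-byte files under a 10000-byte quota stops after the third file, once the total has passed the limit.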


__Directory Options__


__-nd__


__--no-directories__


Do not create a hierarchy of directories when retrieving
recursively. With this option turned on, all files will get
saved to the current directory, without clobbering (if a
name shows up more than once, the filenames will get
extensions __.n__).


__-x__


__--force-directories__


The opposite of __-nd__---create a hierarchy of
directories, even if one would not have been created
otherwise. E.g. __wget -x
http://fly.srk.fer.hr/robots.txt__ will save the
downloaded file to
''fly.srk.fer.hr/robots.txt''.


__-nH__


__--no-host-directories__


Disable generation of host-prefixed directories. By default,
invoking Wget with __-r http://fly.srk.fer.hr/__ will
create a structure of directories beginning with
''fly.srk.fer.hr/''. This option disables such
behavior.


__--cut-dirs=__''number''


Ignore ''number'' directory components. This is useful
for getting a fine-grained control over the directory where
recursive retrieval will be saved.


Take, for example, the directory at
__ftp://ftp.xemacs.org/pub/xemacs/__. If you retrieve it
with __-r__, it will be saved locally under
''ftp.xemacs.org/pub/xemacs/''. While the __-nH__
option can remove the ''ftp.xemacs.org/'' part, you are
still stuck with ''pub/xemacs''. This is where
__--cut-dirs__ comes in handy; it makes Wget not ``see''
''number'' remote directory components. Here are several
examples of how the __--cut-dirs__ option works.


        No options        -> ftp.xemacs.org/pub/xemacs/
        --cut-dirs=1      -> ftp.xemacs.org/xemacs/
If you just want to get rid of the directory structure, this option is similar to a combination of __-nd__ and __-P__. However, unlike __-nd__, __--cut-dirs__ does not lose subdirectories---for instance, with __-nH --cut-dirs=1__, a ''beta/'' subdirectory will be placed in ''xemacs/beta'', as one would expect.
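
The combined effect of __-nH__ and __--cut-dirs__ on the local directory can be sketched as follows. This is an illustrative model, not Wget's code: the function and parameter names are assumptions, and every path component of the URL is treated as a directory.

```python
from urllib.parse import urlsplit


def local_dir(url, no_host=False, cut_dirs=0):
    """Return the local directory a recursive retrieval of url is saved under."""
    parts = urlsplit(url)
    dirs = [d for d in parts.path.split("/") if d]
    dirs = dirs[cut_dirs:]  # --cut-dirs: drop leading remote components
    if not no_host:
        dirs.insert(0, parts.hostname)  # host-prefixed directory (no -nH)
    return "/".join(dirs) + "/" if dirs else "."
```

For __ftp://ftp.xemacs.org/pub/xemacs/__, the sketch gives ''pub/xemacs/'' with no_host=True alone, ''xemacs/'' when cut_dirs=1 is added, and the current directory when cut_dirs=2.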


__-P__ ''prefix''


__--directory-prefix=__''prefix''


Set directory prefix to ''prefix''. The ''directory
prefix'' is the directory where all other files and
subdirectories will be saved to, i.e. the top of the
retrieval tree. The default is __.__ (the current
directory).


__HTTP Options__


__-E__


__--html-extension__


If a file of type __text/html__ is downloaded and the
URL does not end with the regexp
__.[[Hh][[Tt][[Mm][[Ll]?__, this option will cause the suffix
__.html__ to be appended to the local filename. This is
useful, for instance, when you're mirroring a remote site
that uses __.asp__ pages, but you want the mirrored pages
to be viewable on your stock Apache server. Another good use
for this is when you're downloading the output of CGIs. A
URL like
__http://site.com/article.cgi?25__ will be saved as
''article.cgi?25.html''.


Note that filenames changed in this way will be
re-downloaded every time you re-mirror a site, because Wget
can't tell that the local ''X.html'' file corresponds to
remote URL ''X'' (since it doesn't yet
know that the URL produces output of type
__text/html__). To prevent this re-downloading, you must
use __-k__ and __-K__ so that the original version of
the file will be saved as ''X.orig''.
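
The suffix rule for __-E__ can be expressed with the regular expression quoted above. The sketch below is an approximation: the function name is made up, and the escaped dot and the end-of-string anchor are assumptions about how "does not end with the regexp" is meant.

```python
import re

# Matches names that already end in .htm or .html, in any letter case.
HTML_SUFFIX = re.compile(r"\.[Hh][Tt][Mm][Ll]?$")


def local_name(remote_name, content_type):
    """Return the local filename, appending .html where -E would."""
    if content_type == "text/html" and not HTML_SUFFIX.search(remote_name):
        return remote_name + ".html"
    return remote_name
```

With this rule ''article.cgi?25'' served as __text/html__ becomes ''article.cgi?25.html'', while ''index.htm'' and non-HTML files keep their names.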


__--http-user=__''user''


__--http-passwd=__''password''


Specify the username ''user'' and password
''password'' on an HTTP server. According
to the type of the challenge, Wget will encode them using
either the basic (insecure) or the digest
authentication scheme.


Another way to specify username and password is in the
URL itself. Either method reveals your
password to anyone who bothers to run ps. To
prevent the passwords from being seen, store them in
''.wgetrc'' or ''.netrc'', and make sure to protect
those files from other users with chmod. If the
passwords are really important, do not leave them lying in
those files either---edit the files and delete them after
Wget has started the download.


For more information about security issues with
Wget,


__-C on/off__


__--cache=on/off__


When set to off, disable server-side cache. In this case,
Wget will send the remote server an appropriate directive
(__Pragma: no-cache__) to get the file from the remote
service, rather than returning the cached version. This is
especially useful for retrieving and flushing out-of-date
documents on proxy servers.


Caching is allowed by default.


__--cookies=on/off__


When set to off, disable the use of cookies. Cookies are a
mechanism for maintaining server-side state. The server
sends the client a cookie using the Set-Cookie
header, and the client responds with the same cookie upon
further requests. Since cookies allow the server owners to
keep track of visitors and for sites to exchange this
information, some consider them a breach of privacy. The
default is to use cookies; however, ''storing'' cookies
is not on by default.


__--load-cookies__ ''file''


Load cookies from ''file'' before the first
HTTP retrieval. ''file'' is a textual file
in the format originally used by Netscape's
''cookies.txt'' file.


You will typically use this option when mirroring sites that
require that you be logged in to access some or all of their
content. The login process typically works by the web server
issuing an HTTP cookie upon receiving and
verifying your credentials. The cookie is then resent by the
browser when accessing that part of the site, and so proves
your identity.


Mirroring such a site requires Wget to send the same cookies
your browser sends when communicating with the site. This is
achieved by __--load-cookies__---simply point Wget to the
location of the ''cookies.txt'' file, and it will send
the same cookies your browser would send in the same
situation. Different browsers keep textual cookie files in
different locations:


Netscape 4.x.


The cookies are in
''~/.netscape/cookies.txt''.


Mozilla and Netscape 6.x.


Mozilla's cookie file is also named ''cookies.txt'',
located somewhere under ''~/.mozilla'', in the directory
of your profile. The full path usually ends up looking
somewhat like
''~/.mozilla/default/some-weird-string/cookies.txt''.


Internet Explorer.


You can produce a cookie file Wget can use by using the File
menu, Import and Export, Export Cookies. This has been
tested with Internet Explorer 5; it is not guaranteed to
work with earlier versions.


Other browsers.


If you are using a different browser to create your cookies,
__--load-cookies__ will only work if you can locate or
produce a cookie file in the Netscape format that Wget
expects.


If you cannot use __--load-cookies__, there might still
be an alternative. If your browser supports a ``cookie
manager'', you can use it to view the cookies used when
accessing the site you're mirroring. Write down the name and
value of the cookie, and manually instruct Wget to send
those cookies, bypassing the ``official'' cookie
support:


        wget --cookies=off --header


__--save-cookies__ ''file''


Save cookies to ''file'' at the end of the session. Cookies
whose expiry time is not specified, or those that have
already expired, are not saved.


__--ignore-length__


Unfortunately, some HTTP servers (CGI
programs, to be more precise) send out
bogus Content-Length headers, which makes Wget go
wild, as it thinks not all the document was retrieved. You
can spot this syndrome if Wget retries getting the same
document again and again, each time claiming that the
(otherwise normal) connection has closed on the very same
byte.


With this option, Wget will ignore the
Content-Length header---as if it never
existed.


__--header=__''additional-header''


Define an ''additional-header'' to be passed to the
HTTP servers. Headers must contain a __:__
preceded by one or more non-blank characters, and must not
contain newlines.


You may define more than one additional header by specifying
__--header__ more than once.


        wget --header='Accept-Charset: iso-8859-2' \
             --header='Accept-Language: hr'        \
               http://fly.srk.fer.hr/
Specification of an empty string as the header value will clear all previous user-defined headers.
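
The two formatting rules for __--header__ values can be written as a small check. This is a hypothetical helper for illustration only, reading "preceded by one or more non-blank characters" as "some non-blank text before the colon"; it does not cover the empty-string case, which has the special clearing meaning described above.

```python
def is_valid_header(header):
    """Check a header value against the two stated rules."""
    if "\n" in header:
        return False  # headers must not contain newlines
    name, sep, _ = header.partition(":")
    # A colon must be present, with non-blank text in front of it.
    return bool(sep) and name.strip() != ""
```

A value such as Accept-Language: hr passes, while a value with no colon, with nothing before the colon, or with an embedded newline is rejected.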


__--proxy-user=__''user''


__--proxy-passwd=__''password''


Specify the username ''user'' and password
''password'' for authentication on a proxy server. Wget
will encode them using the basic authentication
scheme.


Security considerations similar to those with
__--http-passwd__ pertain here as well.


__--referer=__''url''


Include `Referer: ''url''' header in HTTP
request. Useful for retrieving documents with server-side
processing that assume they are always being retrieved by
interactive web browsers and only come out properly when
Referer is set to one of the pages that point to
them.


__-s__


__--save-headers__


Save the headers sent by the HTTP server to
the file, preceding the actual contents, with an empty line
as the separator.


__-U__ ''agent-string''


__--user-agent=__''agent-string''


Identify as ''agent-string'' to the HTTP
server.


The HTTP protocol allows the clients to
identify themselves using a User-Agent header
field. This enables distinguishing the WWW
software, usually for statistical purposes or for tracing of
protocol violations. Wget normally identifies as
__Wget/__''version'', ''version'' being the current
version number of Wget.


However, some sites have been known to impose the policy of
tailoring the output according to the
User-Agent-supplied information. While conceptually
this is not such a bad idea, it has been abused by servers
denying information to clients other than Mozilla
or Microsoft Internet Explorer. This option allows
you to change the User-Agent line issued by Wget.
Use of this option is discouraged, unless you really know
what you are doing.


__FTP Options__


__-nr__


__--dont-remove-listing__


Don't remove the temporary ''.listing'' files generated
by FTP retrievals. Normally, these files
contain the raw directory listings received from
FTP servers. Not removing them can be useful
for debugging purposes, or when you want to be able to
easily check on the contents of remote server directories
(e.g. to verify that a mirror you're running is
complete).


Note that even though Wget writes to a known filename for
this file, this is not a security hole in the scenario of a
user making ''.listing'' a symbolic link to
''/etc/passwd'' or something and asking root to
run Wget in his or her directory. Depending on the options
used, either Wget will refuse to write to ''.listing'',
making the globbing/recursion/time-stamping operation fail,
or the symbolic link will be deleted and replaced with the
actual ''.listing'' file, or the listing will be written
to a ''.listing.number'' file.


Even though this situation isn't a problem,
root should never run Wget in a non-trusted user's
directory. A user could do something as simple as linking
''index.html'' to ''/etc/passwd'' and asking
root to run Wget with __-N__ or __-r__ so the
file will be overwritten.


__-g on/off__


__--glob=on/off__


Turn FTP globbing on or off. Globbing means
you may use the shell-like special characters
(''wildcards''), like __*__, __?__, __[[__ and
__]__ to retrieve more than one file from the same
directory at once, like:


        wget ftp://gnjilux.srk.fer.hr/*.msg
By default, globbing will be turned on if the URL contains a globbing character. This option may be used to turn globbing on or off permanently.


You may have to quote the URL to protect it
from being expanded by your shell. Globbing makes Wget look
for a directory listing, which is system-specific. This is
why it currently works only with Unix FTP
servers (and the ones emulating Unix ls
output).


__--passive-ftp__


Use the ''passive'' FTP retrieval scheme,
in which the client initiates the data connection. This is
sometimes required for FTP to work behind
firewalls.


__--retr-symlinks__


Usually, when retrieving FTP directories
recursively and a symbolic link is encountered, the
linked-to file is not downloaded. Instead, a matching
symbolic link is created on the local filesystem. The
pointed-to file will not be downloaded unless this recursive
retrieval would have encountered it separately and
downloaded it anyway.


When __--retr-symlinks__ is specified, however, symbolic
links are traversed and the pointed-to files are retrieved.
At this time, this option does not cause Wget to traverse
symlinks to directories and recurse through them, but in the
future it should be enhanced to do this.


Note that when retrieving a file (not a directory) because
it was specified on the commandline, rather than because it
was recursed to, this option has no effect. Symbolic links
are always traversed in this case.


__Recursive Retrieval Options__


__-r__


__--recursive__


Turn on recursive retrieving.


__-l__ ''depth''


__--level=__''depth''


Specify the maximum recursion depth ''depth''. The
default maximum depth is 5.


__--delete-after__


This option tells Wget to delete every single file it
downloads, ''after'' having done so. It is useful for
pre-fetching popular pages through a proxy,
e.g.:


        wget -r -nd --delete-after http://whatever.com/~popular/page/
The __-r__ option is to retrieve recursively, and __-nd__ to not create directories.


Note that __--delete-after__ deletes files on the local
machine. It does not issue the __DELE__
command to remote FTP sites, for instance.
Also note that when __--delete-after__ is specified,
__--convert-links__ is ignored, so __.orig__ files are
simply not created in the first place.


__-k__


__--convert-links__


After the download is complete, convert the links in the
document to make them suitable for local viewing. This
affects not only the visible hyperlinks, but any part of the
document that links to external content, such as embedded
images, links to style sheets, hyperlinks to non-HTML
content, etc.


Each link will be changed in one of the two
ways:


The links to files that have been downloaded by Wget will be
changed to refer to the file they point to as a relative
link.


Example: if the downloaded file ''/foo/doc.html'' links
to ''/bar/img.gif'', also downloaded, then the link in
''doc.html'' will be modified to point to
__../bar/img.gif__. This kind of transformation works
reliably for arbitrary combinations of
directories.


The links to files that have not been downloaded by Wget
will be changed to include host name and absolute path of
the location they point to.


Example: if the downloaded file ''/foo/doc.html'' links
to ''/bar/img.gif'' (or to ''../bar/img.gif''), then
the link in ''doc.html'' will be modified to point to
''http://hostname/bar/img.gif''.


Because of this, local browsing works reliably: if a linked
file was downloaded, the link will refer to its local name;
if it was not downloaded, the link will refer to its full
Internet address rather than presenting a broken link. The
fact that the former links are converted to relative links
ensures that you can move the downloaded hierarchy to
another directory.
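
The two conversion cases can be sketched with the standard path routines. This is an illustrative model rather than Wget's implementation: the function name, the set of downloaded paths, and the placeholder host name are all assumptions.

```python
import posixpath


def convert_link(doc_path, target_path, downloaded, host="hostname"):
    """Return the rewritten link to target_path as seen from doc_path."""
    if target_path in downloaded:
        # Downloaded: refer to the local copy with a relative link.
        return posixpath.relpath(target_path, start=posixpath.dirname(doc_path))
    # Not downloaded: fall back to the full Internet address.
    return f"http://{host}{target_path}"
```

For ''/foo/doc.html'' linking to ''/bar/img.gif'', the result is __../bar/img.gif__ when the image was downloaded and ''http://hostname/bar/img.gif'' when it was not.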


Note that only at the end of the download can Wget know
which links have been downloaded. Because of that, the work
done by __-k__ will be performed at the end of all the
downloads.


__-K__


__--backup-converted__


When converting a file, back up the original version with a
__.orig__ suffix. Affects the behavior of
__-N__.


__-m__


__--mirror__


Turn on options suitable for mirroring. This option turns on
recursion and time-stamping, sets infinite recursion depth
and keeps FTP directory listings. It is
currently equivalent to __-r -N -l inf -nr__.


__-p__


__--page-requisites__


This option causes Wget to download all the files that are
necessary to properly display a given HTML
page. This includes such things as inlined images, sounds,
and referenced stylesheets.


Ordinarily, when downloading a single HTML
page, any requisite documents that may be needed to display
it properly are not downloaded. Using __-r__ together
with __-l__ can help, but since Wget does not ordinarily
distinguish between external and inlined documents, one is
generally left with ``leaf documents'' that are missing
their requisites.


For instance, say document ''1.html'' contains an
<IMG> tag referencing ''1.gif'' and an
<A> tag pointing to external document
''2.html''. Say that ''2.html'' is similar but that
its image is ''2.gif'' and it links to ''3.html''. Say
this continues up to some arbitrarily high
number.


If one executes the command:


        wget -r -l 2 http://I
then ''1.html'', ''1.gif'', ''2.html'', ''2.gif'', and ''3.html'' will be downloaded. As you can see, ''3.html'' is without its requisite ''3.gif'' because Wget is simply counting the number of hops (up to 2) away from ''1.html'' in order to determine where to stop the recursion. However, with this command:


        wget -r -l 2 -p http://I
all the above files ''and 3.html'''s requisite ''3.gif'' will be downloaded. Similarly,


        wget -r -l 1 -p http://I
will cause ''1.html'', ''1.gif'', ''2.html'', and ''2.gif'' to be downloaded. One might think that:


        wget -r -l 0 -p http://I
would download just ''1.html'' and ''1.gif'', but unfortunately this is not the case, because __-l 0__ is equivalent to __-l inf__---that is, infinite recursion. To download a single HTML page (or a handful of them, all specified on the commandline or in a __-i__ URL input file) and its (or their) requisites, simply leave off __-r__ and __-l__:


        wget -p http://I
Note that Wget will behave as if __-r__ had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to __-p__:


        wget -E -H -k -K -p http://I
To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL specified in an <A> tag, an <AREA> tag, or a <LINK> tag other than <LINK REL="stylesheet">.

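The hop counting in the ''1.html'' example can be modelled with a small shell function. This is only a sketch of the rule as described above, not Wget's actual logic: the function name fetched_files and its two arguments are made up for illustration, it assumes the simple chain of pages from the example, and it does not handle __-l 0__.

```shell
# Hypothetical model of the chain 1.html -> 2.html -> 3.html -> ...
# It does not run Wget; it only lists what "wget -r -l DEPTH [-p]"
# would fetch according to the hop-counting rule described in the text.
fetched_files() {
    depth=$1        # value given to -l (must be 1 or more in this sketch)
    requisites=$2   # "yes" if -p is given, anything else otherwise
    page=1
    out=""
    # Pages 1 .. DEPTH+1 are at most DEPTH hops away from 1.html.
    while [ "$page" -le $((depth + 1)) ]; do
        out="$out $page.html"
        # An image is one hop further than its page, so the image of the
        # last page is out of range unless -p asks for requisites.
        if [ "$page" -le "$depth" ] || [ "$requisites" = "yes" ]; then
            out="$out $page.gif"
        fi
        page=$((page + 1))
    done
    # Unquoted on purpose: drops the leading space.
    echo $out
}
```

For example, fetched_files 2 no prints the five files listed for the first command above, and fetched_files 2 yes adds ''3.gif''.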

__Recursive Accept/Reject Options__


__-A__ ''acclist''


__--accept__ ''acclist''


__-R__ ''rejlist''


__--reject__ ''rejlist''


Specify comma-separated lists of file name suffixes or
patterns to accept or reject.

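The difference between a suffix and a pattern in such a list can be sketched in shell. This is an approximation for illustration, not Wget's own code: the function name matches_acclist is hypothetical, and it assumes that a list element containing a wildcard character is treated as a pattern while any other element is treated as a file name suffix.

```shell
# Return 0 if NAME matches any element of the comma-separated LIST.
# Elements with a wildcard character (* ? [ ]) are used as patterns;
# all other elements are treated as suffixes.
matches_acclist() {
    name=$1
    list=$2
    result=1
    old_ifs=$IFS
    IFS=,
    set -f                      # keep "*.gif" from being expanded by the shell
    for item in $list; do
        case $item in
            *\**|*\?*|*\[*|*\]*) pattern=$item ;;    # pattern as given
            *)                   pattern="*$item" ;; # suffix: anything before it
        esac
        case $name in
            $pattern) result=0 ;;
        esac
    done
    set +f
    IFS=$old_ifs
    return $result
}
```

With this sketch, both matches_acclist foo.gif .gif and matches_acclist foo.gif "*.jpg,*.gif" succeed, while matches_acclist foo.html ".gif,.jpg" fails.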

__-D__ ''domain-list''


__--domains=__''domain-list''


Set domains to be followed. ''domain-list'' is a
comma-separated list of domains. Note that it does
''not'' turn on __-H__.


__--exclude-domains__ ''domain-list''


Specify the domains that are ''not'' to be
followed.


__--follow-ftp__


Follow FTP links from HTML
documents. Without this option, Wget will ignore all the
FTP links.


__--follow-tags=__''list''


Wget has an internal table of HTML tag /
attribute pairs that it considers when looking for linked
documents during a recursive retrieval. If a user wants only
a subset of those tags to be considered, however, he or she
should specify such tags in a comma-separated ''list''
with this option.


__-G__ ''list''


__--ignore-tags=__''list''


This is the opposite of the __--follow-tags__ option. To
skip certain HTML tags when recursively
looking for documents to download, specify them in a
comma-separated ''list''.


In the past, the __-G__ option was the best bet for
downloading a single page and its requisites, using a
command line like:


        wget -Ga,area -H -k -K -r http://I
However, the author of this option came across a page with tags like <LINK REL="home" HREF="/"> and came to the realization that __-G__ was not enough. One can't just tell Wget to ignore <LINK>, because then stylesheets will not be downloaded. Now the best bet for downloading a single page and its requisites is the dedicated __--page-requisites__ option.


__-H__


__--span-hosts__


Enable spanning across hosts when doing recursive
retrieving.


__-L__


__--relative__


Follow relative links only. Useful for retrieving a specific
home page without any distractions, not even those from the
same hosts.


__-I__ ''list''


__--include-directories=__''list''


Specify a comma-separated list of directories you wish to
follow when downloading. Elements of ''list'' may contain
wildcards.


__-X__ ''list''


__--exclude-directories=__''list''


Specify a comma-separated list of directories you wish to
exclude from download. Elements of ''list'' may contain
wildcards.


__-np__


__--no-parent__


Do not ever ascend to the parent directory when retrieving
recursively. This is a useful option, since it guarantees
that only the files ''below'' a certain hierarchy will be
downloaded.
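

The rule behind __--no-parent__ can be illustrated with a prefix test on the path. This is a simplified sketch under stated assumptions, not Wget's implementation: the function name below_start_dir is hypothetical, the starting directory must end in a slash, and paths containing ''..'' are not normalized.

```shell
# Print "follow" if CANDIDATE lies below START_DIR, "skip" otherwise.
below_start_dir() {
    start_dir=$1    # directory of the starting URL, e.g. /dir/
    candidate=$2    # path of a link found during recursion
    case $candidate in
        "$start_dir"*) echo follow ;;   # same directory or deeper
        *)             echo skip ;;     # parent or sibling hierarchy
    esac
}
```

Starting from ''/dir/'', a link to ''/dir/sub/a.gif'' would be followed, while a link to ''/index.html'' would be skipped.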
!!EXAMPLES


The examples are divided into three sections loosely based
on their complexity.


__Simple Usage__


Say you want to download a URL. Just
type:


        wget http://fly.srk.fer.hr/


But what will happen if the connection is slow, and the file
is lengthy? The connection will probably fail before the
whole file is retrieved, more than once. In this case, Wget
will try getting the file until it either gets the whole of
it, or exceeds the default number of retries (this being
20). It is easy to change the number of tries to 45, to
ensure that the whole file will arrive safely:


        wget --tries=45 http://fly.srk.fer.hr/jpg/flyweb.jpg


Now let's leave Wget to work in the background, and write
its progress to log file ''log''. It is tiring to type
__--tries__, so we shall use __-t__.


        wget -t 45 -o log http://fly.srk.fer.hr/jpg/flyweb.jpg &
The ampersand at the end of the line makes sure that Wget works in the background. To unlimit the number of retries, use __-t inf__.


The usage of FTP is as simple. Wget will take
care of login and password.


        wget ftp://gnjilux.srk.fer.hr/welcome.msg


If you specify a directory, Wget will retrieve the directory
listing, parse it and convert it to HTML.
Try:


        wget ftp://prep.ai.mit.edu/pub/gnu/
        links index.html


__Advanced Usage__


You have a file that contains the URLs you want to download?
Use the __-i__ switch:


        wget -i I
If you specify __-__ as file name, the URLs will be read from standard input.

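Reading URLs from standard input makes it easy to generate the list on the fly. The following is a minimal sketch: the function name url_list and the ''imgN.jpg'' naming scheme are assumptions made up for illustration, not part of Wget.

```shell
# Print one URL per line for a numbered series of files under BASE.
# The file naming scheme (img1.jpg, img2.jpg, ...) is hypothetical.
url_list() {
    base=$1
    first=$2
    last=$3
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s/img%d.jpg\n' "$base" "$i"
        i=$((i + 1))
    done
}

# Usage (not run here): feed the generated list to Wget on standard input.
#   url_list http://www.server.com/dir 1 3 | wget -i -
```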

Create a five levels deep mirror image of the
GNU web site, with the same directory
structure the original has, with only one try per document,
saving the log of the activities to
''gnulog'':


        wget -r -t 1 http://www.gnu.org/ -o gnulog


The same as the above, but convert the links in the
HTML files to point to local files, so you
can view the documents off-line:


        wget --convert-links -r -t 1 http://www.gnu.org/ -o gnulog


Retrieve only one HTML page, but make sure
that all the elements needed for the page to be displayed,
such as inline images and external style sheets, are also
downloaded. Also make sure the downloaded page references
the downloaded links.


        wget -p --convert-links http://www.server.com/dir/page.html
The HTML page will be saved to ''www.server.com/dir/page.html'', and the images, stylesheets, etc., somewhere under ''www.server.com/'', depending on where they were on the remote server.


The same as the above, but without the
''www.server.com/'' directory. In fact, I don't want to
have all those random server directories anyway; just save
''all'' those files under a ''download/'' subdirectory
of the current directory.


        wget -p --convert-links -nH -nd -Pdownload \
             http://www.server.com/dir/page.html

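The effect of __-nH__, __-nd__ and __-P__ on where a file ends up can be sketched as follows. This models only the behaviour described in the two examples above; the function name local_path and its yes/no arguments are hypothetical, and URLs ending in a slash are not handled.

```shell
# Print the local path a URL would be saved under.
#   $1 = URL, $2 = "yes" for -nH, $3 = "yes" for -nd, $4 = -P prefix or ""
local_path() {
    url=$1
    no_host=$2
    no_dirs=$3
    prefix=$4
    rest=${url#*://}     # www.server.com/dir/page.html
    host=${rest%%/*}     # www.server.com
    path=${rest#*/}      # dir/page.html
    file=${path##*/}     # page.html
    if [ "$no_dirs" = "yes" ]; then
        result=$file             # -nd: no directories at all
    elif [ "$no_host" = "yes" ]; then
        result=$path             # -nH: drop only the host directory
    else
        result=$host/$path       # default: host directory plus remote path
    fi
    if [ -n "$prefix" ]; then
        result=$prefix/$result   # -P: put everything under the prefix
    fi
    echo "$result"
}
```

With no options the page from the example maps to ''www.server.com/dir/page.html''; with __-nH -nd -Pdownload__ it maps to ''download/page.html''.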

Retrieve the index.html of __www.lycos.com__, showing the
original server headers:


        wget -S http://www.lycos.com/


Save the server headers with the file, perhaps for
post-processing.


        wget -s http://www.lycos.com/
        more index.html


Retrieve the first two levels of __wuarchive.wustl.edu__,
saving them to ''/tmp''.


        wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/


You want to download all the GIFs from a directory on an
HTTP server. You tried __wget
http://www.server.com/dir/*.gif__, but that didn't work
because HTTP retrieval does not support
globbing. In that case, use:


        wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
More verbose, but the effect is the same. __-r -l1__ means to retrieve recursively, with maximum depth of 1. __--no-parent__ means that references to the parent directory are ignored, and __-A.gif__ means to download only the GIF files. __-A "*.gif"__ would have worked too.


Suppose you were in the middle of downloading, when Wget was
interrupted. Now you do not want to clobber the files
already present. The command would be:


        wget -nc -r http://www.gnu.org/


If you want to encode your own username and password to
HTTP or FTP, use the
appropriate URL syntax.


        wget ftp://hniksic:mypassword@unix.server.com/.emacs
Note, however, that this usage is not advisable on multi-user systems because it reveals your password to anyone who looks at the output of __ps__.


You would like the output documents to go to standard output
instead of to files?


        wget -O - http://jagor.srce.hr/ http://www.srce.hr/
You can also combine the two options and make pipelines to retrieve the documents from remote hotlists:


        wget -O - http://cool.list.com/ | wget --force-html -i -


__Very Advanced Usage__


If you wish Wget to keep a mirror of a page (or
FTP subdirectories), use __--mirror__
(__-m__), which is the shorthand for __-r -l inf -N__.
You can put Wget in the crontab file asking it to recheck a
site each Sunday:


        crontab
        0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog


In addition to the above, you want the links to be converted
for local viewing. But, after having read this manual, you
know that link conversion doesn't play well with
timestamping, so you also want Wget to back up the original
HTML files before the conversion. The Wget
invocation would look like this:


        wget --mirror --convert-links --backup-converted  \
             http://www.gnu.org/ -o /home/me/weeklog


But you've also noticed that local viewing doesn't work all
that well when HTML files are saved under
extensions other than __.html__, perhaps because they
were served as ''index.cgi''. So you'd like Wget to
rename all the files served with content-type
__text/html__ to ''name.html''.


        wget --mirror --convert-links --backup-converted \
             --html-extension -o /home/me/weeklog        \
             http://www.gnu.org/
Or, with less typing:


        wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog
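

The correspondence between the short and long forms used in the last two commands can be written down as a small lookup. The function name expand_opt is made up for illustration, and only the four options from this example are covered.

```shell
# Map the short options used above to their long equivalents.
# Any other argument is printed unchanged.
expand_opt() {
    case $1 in
        -m) echo --mirror ;;
        -k) echo --convert-links ;;
        -K) echo --backup-converted ;;
        -E) echo --html-extension ;;
        *)  echo "$1" ;;
    esac
}

# Expand a whole option list, one long option per line.
expand_all() {
    for opt in "$@"; do
        expand_opt "$opt"
    done
}
```

Running expand_all -m -k -K -E lists the four long options of the longer command, one per line.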
!!FILES


__/etc/wgetrc__


Default location of the ''global'' startup
file.


__.wgetrc__


User startup file.
!!BUGS


You are welcome to send bug reports about GNU
Wget to bug-wget@gnu.org.


Before actually submitting a bug report, please try to
follow a few simple guidelines.


1.


Please try to ascertain that the behaviour you see really is
a bug. If Wget crashes, it's a bug. If Wget does not behave
as documented, it's a bug. If things work strangely, but you
are not sure about the way they are supposed to work, it
might well be a bug.


2.


Try to repeat the bug in as simple circumstances as
possible. E.g. if Wget crashes while downloading __wget
-rl0 -kKE -t5 -Y0 http://yoyodyne.com -o /tmp/log__, you
should try to see if the crash is repeatable, and if it will
occur with a simpler set of options. You might even try to
start the download at the page where the crash occurred to
see if that page somehow triggered the crash.


Also, while I will probably be interested to know the
contents of your ''.wgetrc'' file, just dumping it into
the debug message is probably a bad idea. Instead, you
should first try to see if the bug repeats with
''.wgetrc'' moved out of the way. Only if it turns out
that ''.wgetrc'' settings affect the bug, mail me the
relevant parts of the file.


3.


Please start Wget with the __-d__ option and send the log (or
the relevant parts of it). If Wget was compiled without
debug support, recompile it. It is ''much'' easier to
trace bugs with debug support on.


4.


If Wget has crashed, try to run it in a debugger, e.g.
__gdb `which wget` core__, and type __where__ to
get the backtrace.
!!SEE ALSO


GNU Info entry for ''wget''.
!!AUTHOR


Originally written by Hrvoje Niksic
!!COPYRIGHT


Copyright (c) 1996, 1997, 1998, 2000, 2001 Free Software
Foundation, Inc.


Permission is granted to make and distribute verbatim copies
of this manual provided the copyright notice and this
permission notice are preserved on all copies.


Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free
Documentation License, Version 1.1 or any later version
published by the Free Software Foundation; with the
Invariant Sections being ``GNU General
Public License'' and ``GNU Free
Documentation License'', with no Front-Cover Texts, and with
no Back-Cover Texts. A copy of the license is included in
the section entitled ``GNU Free
Documentation License''.
----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.
Last edited on Monday, June 3, 2002 6:51:19 pm by "perry"