1 WGET
2 !!!WGET
3 NAME
4 SYNOPSIS
5 DESCRIPTION
6 OPTIONS
7 EXAMPLES
8 FILES
9 BUGS
10 SEE ALSO
11 AUTHOR
12 COPYRIGHT
13 ----
14 !!NAME
15
16
17 wget - GNU Wget Manual
18 !!SYNOPSIS
19
20
21 wget [[''option'']... [[''URL'']...
23 !!DESCRIPTION
24
25
26 GNU Wget is a free utility for
27 non-interactive download of files from the Web. It supports
28 HTTP , HTTPS , and
29 FTP protocols, as well as retrieval through
30 HTTP proxies.
31
32
33 Wget is non-interactive, meaning that it can work in the
34 background, while the user is not logged on. This allows you
35 to start a retrieval and disconnect from the system, letting
36 Wget finish the work. By contrast, most Web browsers
37 require the user's constant presence, which can be a great
38 hindrance when transferring a lot of data.
39
40
41 Wget can follow links in HTML pages and
42 create local versions of remote web sites, fully recreating
43 the directory structure of the original site. This is
44 sometimes referred to as ``recursive downloading.'' While
45 doing that, Wget respects the Robot Exclusion Standard
46 (''/robots.txt''). Wget can be instructed to convert the
47 links in downloaded HTML files to the local
48 files for offline viewing.
49
50
51 Wget has been designed for robustness over slow or unstable
52 network connections; if a download fails due to a network
53 problem, it will keep retrying until the whole file has been
54 retrieved. If the server supports regetting, it will
55 instruct the server to continue the download from where it
56 left off.
57 !!OPTIONS
58
59
60 __Basic Startup Options__
61
62
63 __-V__
64
65
66 __--version__
67
68
69 Display the version of Wget.
70
71
72 __-h__
73
74
75 __--help__
76
77
78 Print a help message describing all of Wget's command-line
79 options.
80
81
82 __-b__
83
84
85 __--background__
86
87
88 Go to background immediately after startup. If no output
89 file is specified via __-o__, output is redirected to
90 ''wget-log''.
91
92
93 __-e__ ''command''
94
95
96 __--execute__ ''command''
97
98
99 Execute ''command'' as if it were a part of
100 ''.wgetrc''. A command thus invoked will be executed
101 ''after'' the commands in ''.wgetrc'', thus taking
102 precedence over them.
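
For example, this sketch uses a one-off run to override the
''robots'' setting from ''.wgetrc'' (assuming it is set there):

wget -e robots=off http://fly.srk.fer.hr/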
103
104
105 __Logging and Input File Options__
106
107
108 __-o__ ''logfile''
109
110
111 __--output-file=__''logfile''
112
113
114 Log all messages to ''logfile''. The messages are
115 normally reported to standard error.
116
117
118 __-a__ ''logfile''
119
120
121 __--append-output=__''logfile''
122
123
124 Append to ''logfile''. This is the same as __-o__,
125 only it appends to ''logfile'' instead of overwriting the
126 old log file. If ''logfile'' does not exist, a new file
127 is created.
128
129
130 __-d__
131
132
133 __--debug__
134
135
136 Turn on debug output, meaning various information important
137 to the developers of Wget if it does not work properly. Your
138 system administrator may have chosen to compile Wget without
139 debug support, in which case __-d__ will not work. Please
140 note that compiling with debug support is always safe---Wget
141 compiled with debug support will ''not'' print any
142 debug info unless requested with __-d__.
143
144
145 __-q__
146
147
148 __--quiet__
149
150
151 Turn off Wget's output.
152
153
154 __-v__
155
156
157 __--verbose__
158
159
160 Turn on verbose output, with all the available data. The
161 default output is verbose.
162
163
164 __-nv__
165
166
167 __--non-verbose__
168
169
170 Non-verbose output---turn off verbose without being
171 completely quiet (use __-q__ for that), which means that
172 error messages and basic information still get
173 printed.
174
175
176 __-i__ ''file''
177
178
179 __--input-file=__''file''
180
181
182 Read URLs from ''file'', in which case no URLs need to be
183 on the command line. If there are URLs both on the command
184 line and in an input file, those on the command line will
185 be the first ones to be retrieved. The ''file'' need not
186 be an HTML document (but no harm if it
187 is)---it is enough if the URLs are just listed
188 sequentially.
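
A minimal sketch, assuming ''urls.txt'' is a hypothetical file
with one URL per line:

wget -i urls.txt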
189
190
191 However, if you specify __--force-html__, the document
192 will be regarded as __html__. In that case you may have
193 problems with relative links, which you can solve either by
194 adding <base href="url"> to the
195 documents or by specifying __--base=__''url'' on the
196 command line.
197
198
199 __-F__
200
201
202 __--force-html__
203
204
205 When input is read from a file, force it to be treated as an
206 HTML file. This enables you to retrieve
207 relative links from existing HTML files on
208 your local disk, by adding <base href="url"> to
209 HTML, or
210 using the __--base__ command-line option.
211
212
213 __-B__ ''URL''
214
215
216 __--base=__ ''URL''
217
218
219 When used in conjunction with __-F__, prepends
220 ''URL'' to relative links in the file
221 specified by __-i__.
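
For instance, a sketch (the file name and base URL are
hypothetical):

wget -F -B http://www.server.com/dir/ -i file.html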
222
223
224 __Download Options__
225
226
227 __--bind-address=__
228 ''ADDRESS''
229
230
231 When making client TCP/IP connections,
232 bind() to ''ADDRESS'' on the
233 local machine. ''ADDRESS'' may be
234 specified as a hostname or IP address. This
235 option can be useful if your machine is bound to multiple
236 IPs.
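
For example (the address shown is hypothetical):

wget --bind-address=192.168.0.2 http://fly.srk.fer.hr/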
237
238
239 __-t__ ''number''
240
241
242 __--tries=__''number''
243
244
245 Set number of retries to ''number''. Specify 0 or
246 __inf__ for infinite retrying.
247
248
249 __-O__ ''file''
250
251
252 __--output-document=__''file''
253
254
255 The documents will not be written to the appropriate files,
256 but all will be concatenated together and written to
257 ''file''. If ''file'' already exists, it will be
258 overwritten. If the ''file'' is __-__, the documents
259 will be written to standard output. Including this option
260 automatically sets the number of tries to 1.
261
262
263 __-nc__
264
265
266 __--no-clobber__
267
268
269 If a file is downloaded more than once in the same
270 directory, Wget's behavior depends on a few options,
271 including __-nc__. In certain cases, the local file will
272 be ''clobbered'', or overwritten, upon repeated download.
273 In other cases it will be preserved.
274
275
276 When running Wget without __-N__, __-nc__, or
277 __-r__, downloading the same file in the same directory
278 will result in the original copy of ''file'' being
279 preserved and the second copy being named
280 ''file''__.1__. If that file is downloaded yet again,
281 the third copy will be named ''file''__.2__, and so
282 on. When __-nc__ is specified, this behavior is
283 suppressed, and Wget will refuse to download newer copies of
284 ''file''. Therefore, ``no-clobber'' is actually
285 a misnomer in this mode---it's not clobbering that's
286 prevented (as the numeric suffixes were already preventing
287 clobbering), but rather the multiple version saving that's
288 prevented.
289
290
291 When running Wget with __-r__, but without __-N__ or
292 __-nc__, re-downloading a file will result in the new
293 copy simply overwriting the old. Adding __-nc__ will
294 prevent this behavior, instead causing the original version
295 to be preserved and any newer copies on the server to be
296 ignored.
297
298
299 When running Wget with __-N__, with or without __-r__,
300 the decision as to whether or not to download a newer copy
301 of a file depends on the local and remote timestamp and size
302 of the file. __-nc__ may not be specified at the same
303 time as __-N__.
304
305
306 Note that when __-nc__ is specified, files with the
307 suffixes __.html__ or (yuck) __.htm__ will be loaded
308 from the local disk and parsed as if they had been retrieved
309 from the Web.
310
311
312 __-c__
313
314
315 __--continue__
316
317
318 Continue getting a partially-downloaded file. This is useful
319 when you want to finish up a download started by a previous
320 instance of Wget, or by another program. For
321 instance:
322
323
324 wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
325 If there is a file named ''ls-lR.Z'' in the current directory, Wget will assume that it is the first portion of the remote file, and will ask the server to continue the retrieval from an offset equal to the length of the local file.
326
327
328 Note that you don't need to specify this option if you just
329 want the current invocation of Wget to retry downloading a
330 file should the connection be lost midway through. This is
331 the default behavior. __-c__ only affects resumption of
332 downloads started ''prior'' to this invocation of Wget,
333 and whose local files are still sitting around.
334
335
336 Without __-c__, the previous example would just download
337 the remote file to ''ls-lR.Z.1'', leaving the truncated
338 ''ls-lR.Z'' file alone.
339
340
341 Beginning with Wget 1.7, if you use __-c__ on a non-empty
342 file, and it turns out that the server does not support
343 continued downloading, Wget will refuse to start the
344 download from scratch, which would effectively ruin existing
345 contents. If you really want the download to start from
346 scratch, remove the file.
347
348
349 Also beginning with Wget 1.7, if you use __-c__ on a file
350 that is the same size as the one on the server, Wget will
351 refuse to download the file and print an explanatory
352 message. The same happens when the file is smaller on the
353 server than locally (presumably because it was changed on
354 the server since your last download attempt)---because
355 ``continuing'' is not meaningful, no download
356 occurs.
357
358
359 On the other side of the coin, while using __-c__, any
360 file that's bigger on the server than locally will be
361 considered an incomplete download and only
362 (length(remote) - length(local)) bytes will be
363 downloaded and tacked onto the end of the local file. This
364 behavior can be desirable in certain cases---for instance,
365 you can use __wget -c__ to download just the new portion
366 that's been appended to a data collection or log
367 file.
368
369
370 However, if the file is bigger on the server because it's
371 been ''changed'', as opposed to just ''appended'' to,
372 you'll end up with a garbled file. Wget has no way of
373 verifying that the local file is really a valid prefix of
374 the remote file. You need to be especially careful of this
375 when using __-c__ in conjunction with __-r__, since
376 every file will be considered as an ``incomplete download''
377 candidate.
378
379
380 Another instance where you'll get a garbled file if you try
381 to use __-c__ is if you have a lame HTTP
382 proxy that inserts a ``transfer interrupted'' string into
383 the local file. In the future a ``rollback'' option may be
384 added to deal with this case.
385
386
387 Note that __-c__ only works with FTP
388 servers and with HTTP servers that support
389 the Range header.
390
391
392 __--progress=__''type''
393
394
395 Select the type of the progress indicator you wish to use.
396 Legal indicators are ``dot'' and ``bar''.
397
398
399 The ``dot'' indicator is used by default. It traces the
400 retrieval by printing dots on the screen, each dot
401 representing a fixed amount of downloaded data.
402
403
404 When using the dotted retrieval, you may also set the
405 ''style'' by specifying the type as
406 __dot:__''style''. Different styles assign different
407 meaning to one dot. With the default style each dot
408 represents 1K, there are ten dots in a cluster and 50 dots
409 in a line. The binary style has a more
410 ``computer''-like orientation---8K dots, 16-dots clusters
411 and 48 dots per line (so each line contains 384K). The
412 mega style is suitable for downloading very large
413 files---each dot represents 64K retrieved, there are eight
414 dots in a cluster, and 48 dots on each line (so each line
415 contains 3M).
416
417
418 Specifying __--progress=bar__ will draw a nice
419 ASCII progress bar graphics (a.k.a
420 ``thermometer'' display) to indicate retrieval. If the
421 output is not a TTY , this option will be
422 ignored, and Wget will revert to the dot indicator. If you
423 want to force the bar indicator, use
424 __--progress=bar:force__.
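
For example, to use the mega dot style for a very large
download:

wget --progress=dot:mega ftp://wuarchive.wustl.edu/ls-lR.gz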
425
426
427 __-N__
428
429
430 __--timestamping__
431
432
433 Turn on time-stamping.
434
435
436 __-S__
437
438
439 __--server-response__
440
441
442 Print the headers sent by HTTP servers and
443 responses sent by FTP servers.
444
445
446 __--spider__
447
448
449 When invoked with this option, Wget will behave as a Web
450 ''spider'', which means that it will not download the
451 pages, just check that they are there. You can use it to
452 check your bookmarks, e.g. with:
453
454
455 wget --spider --force-html -i bookmarks.html
456 This feature needs much more work for Wget to get close to the functionality of real WWW spiders.
457
458
459 __-T__ ''seconds''
460
461
462 __--timeout=__''seconds''
463
464
465 Set the read timeout to ''seconds'' seconds. Whenever a
466 network read is issued, the file descriptor is checked for a
467 timeout, which could otherwise leave a pending connection
468 (uninterrupted read). The default timeout is 900 seconds
469 (fifteen minutes). Setting timeout to 0 will disable
470 checking for timeouts.
471
472
473 Please do not lower the default timeout value with this
474 option unless you know what you are doing.
475
476
477 __-w__ ''seconds''
478
479
480 __--wait=__''seconds''
481
482
483 Wait the specified number of seconds between the retrievals.
484 Use of this option is recommended, as it lightens the server
485 load by making the requests less frequent. Instead of in
486 seconds, the time can be specified in minutes using the
487 m suffix, in hours using h suffix, or in
488 days using d suffix.
489
490
491 Specifying a large value for this option is useful if the
492 network or the destination host is down, so that Wget can
493 wait long enough to reasonably expect the network error to
494 be fixed before the retry.
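
For example, to wait 30 seconds between retrievals of URLs read
from a hypothetical input file:

wget -w 30 -i urls.txt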
495
496
497 __--waitretry=__''seconds''
498
499
500 If you don't want Wget to wait between ''every''
501 retrieval, but only between retries of failed downloads, you
502 can use this option. Wget will use ''linear backoff'',
503 waiting 1 second after the first failure on a given file,
504 then waiting 2 seconds after the second failure on that
505 file, up to the maximum number of ''seconds'' you
506 specify. Therefore, a value of 10 will actually make Wget
507 wait up to (1 + 2 + ... + 10) = 55 seconds per
508 file.
509
510
511 Note that this option is turned on by default in the global
512 ''wgetrc'' file.
513
514
515 __--random-wait__
516
517
518 Some web sites may perform log analysis to identify
519 retrieval programs such as Wget by looking for statistically
520 significant similarities in the time between requests. This
521 option causes the time between requests to vary between 0
522 and 2 * ''wait'' seconds, where ''wait'' was specified
523 using the __-w__ or __--wait__ options, in order to
524 mask Wget's presence from such analysis.
525
526
527 A recent article in a publication devoted to development on
528 a popular consumer platform provided code to perform this
529 analysis on the fly. Its author suggested blocking at the
530 class C address level to ensure automated retrieval programs
531 were blocked despite changing DHCP-supplied
532 addresses.
533
534
535 The __--random-wait__ option was inspired by this
536 ill-advised recommendation to block many unrelated users
537 from a web site due to the actions of one.
538
539
540 __-Y on/off__
541
542
543 __--proxy=on/off__
544
545
546 Turn proxy support on or off. The proxy is on by default if
547 the appropriate environment variable is
548 defined.
549
550
551 __-Q__ ''quota''
552
553
554 __--quota=__''quota''
555
556
557 Specify download quota for automatic retrievals. The value
558 can be specified in bytes (default), kilobytes (with
559 __k__ suffix), or megabytes (with __m__
560 suffix).
561
562
563 Note that quota will never affect downloading a single file.
564 So if you specify __wget -Q10k
565 ftp://wuarchive.wustl.edu/ls-lR.gz__, all of the
566 ''ls-lR.gz'' will be downloaded. The same goes even when
567 several URLs are specified on the command-line. However,
568 quota is respected when retrieving either recursively, or
569 from an input file. Thus you may safely type __wget -Q2m -i
570 sites__---download will be aborted when the quota is
571 exceeded.
572
573
574 Setting quota to 0 or to __inf__ unlimits the download
575 quota.
576
577
578 __Directory Options__
579
580
581 __-nd__
582
583
584 __--no-directories__
585
586
587 Do not create a hierarchy of directories when retrieving
588 recursively. With this option turned on, all files will get
589 saved to the current directory, without clobbering (if a
590 name shows up more than once, the filenames will get
591 extensions __.n__).
592
593
594 __-x__
595
596
597 __--force-directories__
598
599
600 The opposite of __-nd__---create a hierarchy of
601 directories, even if one would not have been created
602 otherwise. E.g. __wget -x
603 http://fly.srk.fer.hr/robots.txt__ will save the
604 downloaded file to
605 ''fly.srk.fer.hr/robots.txt''.
606
607
608 __-nH__
609
610
611 __--no-host-directories__
612
613
614 Disable generation of host-prefixed directories. By default,
615 invoking Wget with __-r http://fly.srk.fer.hr/__ will
616 create a structure of directories beginning with
617 ''fly.srk.fer.hr/''. This option disables such
618 behavior.
619
620
621 __--cut-dirs=__''number''
622
623
624 Ignore ''number'' directory components. This is useful
625 for getting a fine-grained control over the directory where
626 recursive retrieval will be saved.
627
628
629 Take, for example, the directory at
630 __ftp://ftp.xemacs.org/pub/xemacs/__. If you retrieve it
631 with __-r__, it will be saved locally under
632 ''ftp.xemacs.org/pub/xemacs/''. While the __-nH__
633 option can remove the ''ftp.xemacs.org/'' part, you are
634 still stuck with ''pub/xemacs''. This is where
635 __--cut-dirs__ comes in handy; it makes Wget not ``see''
636 ''number'' remote directory components. Here are several
637 examples of how __--cut-dirs__ option works.
638
639
640 No options        -> ftp.xemacs.org/pub/xemacs/
-nH               -> pub/xemacs/
-nH --cut-dirs=1  -> xemacs/
-nH --cut-dirs=2  -> .
641 --cut-dirs=1      -> ftp.xemacs.org/xemacs/
...
642 If you just want to get rid of the directory structure, this option is similar to a combination of __-nd__ and __-P__. However, unlike __-nd__, __--cut-dirs__ does not lose subdirectories---for instance, with __-nH --cut-dirs=1__, a ''beta/'' subdirectory will be placed in ''xemacs/beta'', as one would expect.
643
644
645 __-P__ ''prefix''
646
647
648 __--directory-prefix=__''prefix''
649
650
651 Set directory prefix to ''prefix''. The ''directory
652 prefix'' is the directory where all other files and
653 subdirectories will be saved to, i.e. the top of the
654 retrieval tree. The default is __.__ (the current
655 directory).
656
657
658 __HTTP Options__
659
660
661 __-E__
662
663
664 __--html-extension__
665
666
667 If a file of type __text/html__ is downloaded and the
668 URL does not end with the regexp
669 __.[[Hh][[Tt][[Mm][[Ll]?__, this option will cause the suffix
670 __.html__ to be appended to the local filename. This is
671 useful, for instance, when you're mirroring a remote site
672 that uses __.asp__ pages, but you want the mirrored pages
673 to be viewable on your stock Apache server. Another good use
674 for this is when you're downloading the output of CGIs. A
675 URL like
676 __http://site.com/article.cgi?25__ will be saved as
677 ''article.cgi?25.html''.
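
A sketch of such an invocation, quoting the URL so the shell
does not interpret the __?__:

wget -E 'http://site.com/article.cgi?25'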
678
679
680 Note that filenames changed in this way will be
681 re-downloaded every time you re-mirror a site, because Wget
682 can't tell that the local ''X.html'' file corresponds to
683 remote URL ''X'' (since it doesn't yet
684 know that the URL produces output of type
685 __text/html__). To prevent this re-downloading, you must
686 use __-k__ and __-K__ so that the original version of
687 the file will be saved as ''X.orig''.
688
689
690 __--http-user=__''user''
691
692
693 __--http-passwd=__''password''
694
695
696 Specify the username ''user'' and password
697 ''password'' on an HTTP server. According
698 to the type of the challenge, Wget will encode them using
699 either the basic (insecure) or the digest
700 authentication scheme.
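
For example (a sketch; the username, password, and URL are
hypothetical):

wget --http-user=jan --http-passwd=secret http://www.server.com/private/index.html

But see the security warning below before using this form.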
701
702
703 Another way to specify username and password is in the
704 URL itself. Either method reveals your
705 password to anyone who bothers to run ps. To
706 prevent the passwords from being seen, store them in
707 ''.wgetrc'' or ''.netrc'', and make sure to protect
708 those files from other users with chmod. If the
709 passwords are really important, do not leave them lying in
710 those files either---edit the files and delete them after
711 Wget has started the download.
712
713
714 For more information about security issues with
715 Wget, see the GNU Info entry for ''wget''.
716
717
718 __-C on/off__
719
720
721 __--cache=on/off__
722
723
724 When set to off, disable server-side cache. In this case,
725 Wget will send the remote server an appropriate directive
726 (__Pragma: no-cache__) to get the file from the remote
727 service, rather than returning the cached version. This is
728 especially useful for retrieving and flushing out-of-date
729 documents on proxy servers.
730
731
732 Caching is allowed by default.
733
734
735 __--cookies=on/off__
736
737
738 When set to off, disable the use of cookies. Cookies are a
739 mechanism for maintaining server-side state. The server
740 sends the client a cookie using the Set-Cookie
741 header, and the client responds with the same cookie upon
742 further requests. Since cookies allow the server owners to
743 keep track of visitors and for sites to exchange this
744 information, some consider them a breach of privacy. The
745 default is to use cookies; however, ''storing'' cookies
746 is not on by default.
747
748
749 __--load-cookies__ ''file''
750
751
752 Load cookies from ''file'' before the first
753 HTTP retrieval. ''file'' is a textual file
754 in the format originally used by Netscape's
755 ''cookies.txt'' file.
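
For example, assuming a Netscape-format cookie file in its
standard location:

wget --load-cookies ~/.netscape/cookies.txt http://www.server.com/members/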
756
757
758 You will typically use this option when mirroring sites that
759 require that you be logged in to access some or all of their
760 content. The login process typically works by the web server
761 issuing an HTTP cookie upon receiving and
762 verifying your credentials. The cookie is then resent by the
763 browser when accessing that part of the site, and so proves
764 your identity.
765
766
767 Mirroring such a site requires Wget to send the same cookies
768 your browser sends when communicating with the site. This is
769 achieved by __--load-cookies__---simply point Wget to the
770 location of the ''cookies.txt'' file, and it will send
771 the same cookies your browser would send in the same
772 situation. Different browsers keep textual cookie files in
773 different locations:
774
775
776 Netscape 4.x.
777
778
779 The cookies are in
780 ''~/.netscape/cookies.txt''.
781
782
783 Mozilla and Netscape 6.x.
784
785
786 Mozilla's cookie file is also named ''cookies.txt'',
787 located somewhere under ''~/.mozilla'', in the directory
788 of your profile. The full path usually ends up looking
789 somewhat like
790 ''~/.mozilla/default/some-weird-string/cookies.txt''.
791
792
793 Internet Explorer.
794
795
796 You can produce a cookie file Wget can use by using the File
797 menu, Import and Export, Export Cookies. This has been
798 tested with Internet Explorer 5; it is not guaranteed to
799 work with earlier versions.
800
801
802 Other browsers.
803
804
805 If you are using a different browser to create your cookies,
806 __--load-cookies__ will only work if you can locate or
807 produce a cookie file in the Netscape format that Wget
808 expects.
809
810
811 If you cannot use __--load-cookies__, there might still
812 be an alternative. If your browser supports a ``cookie
813 manager'', you can use it to view the cookies used when
814 accessing the site you're mirroring. Write down the name and
815 value of the cookie, and manually instruct Wget to send
816 those cookies, bypassing the ``official'' cookie
817 support:
818
819
820 wget --cookies=off --header 'Cookie: name=value' http://www.server.com/
821
822
823 __--save-cookies__ ''file''
824
825
826 Save cookies to ''file'' at the end of session. Cookies
827 whose expiry time is not specified, or those that have
828 already expired, are not saved.
829
830
831 __--ignore-length__
832
833
834 Unfortunately, some HTTP servers
835 (CGI programs, to be more precise) send out
836 bogus Content-Length headers, which makes Wget go
837 wild, as it thinks not all the document was retrieved. You
838 can spot this syndrome if Wget retries getting the same
839 document again and again, each time claiming that the
840 (otherwise normal) connection has closed on the very same
841 byte.
842
843
844 With this option, Wget will ignore the
845 Content-Length header---as if it never
846 existed.
847
848
849 __--header=__''additional-header''
850
851
852 Define an ''additional-header'' to be passed to the
853 HTTP servers. Headers must contain a __:__
854 preceded by one or more non-blank characters, and must not
855 contain newlines.
856
857
858 You may define more than one additional header by specifying
859 __--header__ more than once.
860
861
862 wget --header='Accept-Charset: iso-8859-2' \
863 --header='Accept-Language: hr' \
864 http://fly.srk.fer.hr/
865 Specification of an empty string as the header value will clear all previous user-defined headers.
866
867
868 __--proxy-user=__''user''
869
870
871 __--proxy-passwd=__''password''
872
873
874 Specify the username ''user'' and password
875 ''password'' for authentication on a proxy server. Wget
876 will encode them using the basic authentication
877 scheme.
878
879
880 Security considerations similar to those with
881 __--http-passwd__ pertain here as well.
882
883
884 __--referer=__''url''
885
886
887 Include `Referer: ''url''' header in HTTP
888 request. Useful for retrieving documents with server-side
889 processing that assume they are always being retrieved by
890 interactive web browsers and only come out properly when
891 Referer is set to one of the pages that point to
892 them.
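
A sketch (both URLs are hypothetical):

wget --referer=http://www.server.com/gallery/ http://www.server.com/gallery/photo.jpg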
893
894
895 __-s__
896
897
898 __--save-headers__
899
900
901 Save the headers sent by the HTTP server to
902 the file, preceding the actual contents, with an empty line
903 as the separator.
904
905
906 __-U__ ''agent-string''
907
908
909 __--user-agent=__''agent-string''
910
911
912 Identify as ''agent-string'' to the HTTP
913 server.
914
915
916 The HTTP protocol allows the clients to
917 identify themselves using a User-Agent header
918 field. This enables distinguishing the WWW
919 software, usually for statistical purposes or for tracing of
920 protocol violations. Wget normally identifies as
921 __Wget/__''version'', ''version'' being the current
922 version number of Wget.
923
924
925 However, some sites have been known to impose the policy of
926 tailoring the output according to the
927 User-Agent-supplied information. While conceptually
928 this is not such a bad idea, it has been abused by servers
929 denying information to clients other than Mozilla
930 or Microsoft Internet Explorer. This option allows
931 you to change the User-Agent line issued by Wget.
932 Use of this option is discouraged, unless you really know
933 what you are doing.
934
935
936 __FTP Options__
937
938
939 __-nr__
940
941
942 __--dont-remove-listing__
943
944
945 Don't remove the temporary ''.listing'' files generated
946 by FTP retrievals. Normally, these files
947 contain the raw directory listings received from
948 FTP servers. Not removing them can be useful
949 for debugging purposes, or when you want to be able to
950 easily check on the contents of remote server directories
951 (e.g. to verify that a mirror you're running is
952 complete).
953
954
955 Note that even though Wget writes to a known filename for
956 this file, this is not a security hole in the scenario of a
957 user making ''.listing'' a symbolic link to
958 ''/etc/passwd'' or something and asking root to
959 run Wget in his or her directory. Depending on the options
960 used, either Wget will refuse to write to ''.listing'',
961 making the globbing/recursion/time-stamping operation fail,
962 or the symbolic link will be deleted and replaced with the
963 actual ''.listing'' file, or the listing will be written
964 to a ''.listing.number'' file.
965
966
967 Even though this situation isn't a problem, though,
968 root should never run Wget in a non-trusted user's
969 directory. A user could do something as simple as linking
970 ''index.html'' to ''/etc/passwd'' and asking
971 root to run Wget with __-N__ or __-r__ so the
972 file will be overwritten.
973
974
975 __-g on/off__
976
977
978 __--glob=on/off__
979
980
981 Turn FTP globbing on or off. Globbing means
982 you may use the shell-like special characters
983 (''wildcards''), like __*__, __?__, __[[__ and
984 __]__ to retrieve more than one file from the same
985 directory at once, like:
986
987
988 wget ftp://gnjilux.srk.fer.hr/*.msg
989 By default, globbing will be turned on if the URL contains a globbing character. This option may be used to turn globbing on or off permanently.
990
991
992 You may have to quote the URL to protect it
993 from being expanded by your shell. Globbing makes Wget look
994 for a directory listing, which is system-specific. This is
995 why it currently works only with Unix FTP
996 servers (and the ones emulating Unix ls
997 output).
998
999
1000 __--passive-ftp__
1001
1002
1003 Use the ''passive'' FTP retrieval scheme,
1004 in which the client initiates the data connection. This is
1005 sometimes required for FTP to work behind
1006 firewalls.
1007
1008
1009 __--retr-symlinks__
1010
1011
1012 Usually, when retrieving FTP directories
1013 recursively and a symbolic link is encountered, the
1014 linked-to file is not downloaded. Instead, a matching
1015 symbolic link is created on the local filesystem. The
1016 pointed-to file will not be downloaded unless this recursive
1017 retrieval would have encountered it separately and
1018 downloaded it anyway.
1019
1020
1021 When __--retr-symlinks__ is specified, however, symbolic
1022 links are traversed and the pointed-to files are retrieved.
1023 At this time, this option does not cause Wget to traverse
1024 symlinks to directories and recurse through them, but in the
1025 future it should be enhanced to do this.
1026
1027
1028 Note that when retrieving a file (not a directory) because
1029 it was specified on the commandline, rather than because it
1030 was recursed to, this option has no effect. Symbolic links
1031 are always traversed in this case.
1032
1033
1034 __Recursive Retrieval Options__
1035
1036
1037 __-r__
1038
1039
1040 __--recursive__
1041
1042
1043 Turn on recursive retrieving.
1044
1045
1046 __-l__ ''depth''
1047
1048
1049 __--level=__''depth''
1050
1051
1052 Specify recursion maximum depth level ''depth''. The
1053 default maximum depth is 5.
1054
1055
1056 __--delete-after__
1057
1058
1059 This option tells Wget to delete every single file it
1060 downloads, ''after'' having done so. It is useful for
1061 pre-fetching popular pages through a proxy,
1062 e.g.:
1063
1064
1065 wget -r -nd --delete-after http://whatever.com/~popular/page/
1066 The __-r__ option is to retrieve recursively, and __-nd__ to not create directories.
1067
1068
1069 Note that __--delete-after__ deletes files on the local
1070 machine. It does not issue the __DELE__
1071 command to remote FTP sites, for instance.
1072 Also note that when __--delete-after__ is specified,
1073 __--convert-links__ is ignored, so __.orig__ files are
1074 simply not created in the first place.
1075
1076
1077 __-k__
1078
1079
1080 __--convert-links__
1081
1082
1083 After the download is complete, convert the links in the
1084 document to make them suitable for local viewing. This
1085 affects not only the visible hyperlinks, but any part of the
1086 document that links to external content, such as embedded
1087 images, links to style sheets, hyperlinks to non-HTML
1088 content, etc.
1089
1090
1091 Each link will be changed in one of the two
1092 ways:
1093
1094
1095 The links to files that have been downloaded by Wget will be
1096 changed to refer to the file they point to as a relative
1097 link.
1098
1099
1100 Example: if the downloaded file ''/foo/doc.html'' links
1101 to ''/bar/img.gif'', also downloaded, then the link in
1102 ''doc.html'' will be modified to point to
1103 __../bar/img.gif__. This kind of transformation works
1104 reliably for arbitrary combinations of
1105 directories.
1106
1107
1108 The links to files that have not been downloaded by Wget
1109 will be changed to include host name and absolute path of
1110 the location they point to.
1111
1112
1113 Example: if the downloaded file ''/foo/doc.html'' links
1114 to ''/bar/img.gif'' (or to ''../bar/img.gif''), then
1115 the link in ''doc.html'' will be modified to point to
1116 ''http://hostname/bar/img.gif''.
1117
1118
1119 Because of this, local browsing works reliably: if a linked
1120 file was downloaded, the link will refer to its local name;
1121 if it was not downloaded, the link will refer to its full
1122 Internet address rather than presenting a broken link. The
1123 fact that the former links are converted to relative links
1124 ensures that you can move the downloaded hierarchy to
1125 another directory.
1126
1127
1128 Note that only at the end of the download can Wget know
1129 which links have been downloaded. Because of that, the work
1130 done by __-k__ will be performed at the end of all the
1131 downloads.
1132
1133
1134 __-K__
1135
1136
1137 __--backup-converted__
1138
1139
1140 When converting a file, back up the original version with a
1141 __.orig__ suffix. Affects the behavior of
1142 __-N__.
1143
1144
1145 __-m__
1146
1147
1148 __--mirror__
1149
1150
1151 Turn on options suitable for mirroring. This option turns on
1152 recursion and time-stamping, sets infinite recursion depth
1153 and keeps FTP directory listings. It is
1154 currently equivalent to __-r -N -l inf
1155 -nr__.
1156
1157
1158 __-p__
1159
1160
1161 __--page-requisites__
1162
1163
1164 This option causes Wget to download all the files that are
1165 necessary to properly display a given HTML
1166 page. This includes such things as inlined images, sounds,
1167 and referenced stylesheets.
1168
1169
1170 Ordinarily, when downloading a single HTML
1171 page, any requisite documents that may be needed to display
1172 it properly are not downloaded. Using __-r__ together
1173 with __-l__ can help, but since Wget does not ordinarily
1174 distinguish between external and inlined documents, one is
1175 generally left with ``leaf documents'' that are missing
1176 their requisites.
1177
1178
1179 For instance, say document ''1.html'' contains an
1180 <IMG> tag referencing ''1.gif'' and an <A>
1181 tag pointing to external document
1182 ''2.html''. Say that ''2.html'' is similar but that
1183 its image is ''2.gif'' and it links to ''3.html''. Say
1184 this continues up to some arbitrarily high
1185 number.
1186
1187
1188 If one executes the command:
1189
1190
1191 wget -r -l 2 http://''site''/1.html
1192 then ''1.html'', ''1.gif'', ''2.html'', ''2.gif'', and ''3.html'' will be downloaded. As you can see, ''3.html'' is without its requisite ''3.gif'' because Wget is simply counting the number of hops (up to 2) away from ''1.html'' in order to determine where to stop the recursion. However, with this command:
1193
1194
1195 wget -r -l 2 -p http://''site''/1.html
1196 all the above files ''and 3.html'''s requisite ''3.gif'' will be downloaded. Similarly,
1197
1198
1199 wget -r -l 1 -p http://''site''/1.html
1200 will cause ''1.html'', ''1.gif'', ''2.html'', and ''2.gif'' to be downloaded. One might think that:
1201
1202
1203 wget -r -l 0 -p http://''site''/1.html
1204 would download just ''1.html'' and ''1.gif'', but unfortunately this is not the case, because __-l 0__ is equivalent to __-l inf__---that is, infinite recursion. To download a single HTML page (or a handful of them, all specified on the commandline or in a __-i__ URL input file) and its (or their) requisites, simply leave off __-r__ and __-l__:
1205
1206
1207 wget -p http://''site''/1.html
1208 Note that Wget will behave as if __-r__ had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to __-p__:
1209
1210
1211 wget -E -H -k -K -p http://''site''/''document''
1212 To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL specified in an <A> tag, an <AREA> tag, or a <LINK> tag other than <LINK REL="stylesheet">.
1213
1214
1215 __Recursive Accept/Reject Options__
1216
1217
1218 __-A__ ''acclist'' __--accept__
1219 ''acclist''
1220
1221
1222 __-R__ ''rejlist'' __--reject__
1223 ''rejlist''
1224
1225
1226 Specify comma-separated lists of file name suffixes or
1227 patterns to accept or reject.
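
For example, to accept only JPEG and GIF files during a
recursive retrieval (the URL is hypothetical):

wget -r -A jpg,jpeg,gif http://www.server.com/dir/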
1228
1229
1230 __-D__ ''domain-list''
1231
1232
1233 __--domains=__''domain-list''
1234
1235
1236 Set domains to be followed. ''domain-list'' is a
1237 comma-separated list of domains. Note that it does
1238 ''not'' turn on __-H__.
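
For example, a sketch that spans hosts but stays within one
domain:

wget -r -H -D gnu.org http://www.gnu.org/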
1239
1240
1241 __--exclude-domains__ ''domain-list''
1242
1243
1244 Specify the domains that are ''not'' to be
1245 followed.
1246
1247
1248 __--follow-ftp__
1249
1250
1251 Follow FTP links from HTML
1252 documents. Without this option, Wget will ignore all the
1253 FTP links.
1254
1255
1256 __--follow-tags=__''list''
1257
1258
1259 Wget has an internal table of HTML tag /
1260 attribute pairs that it considers when looking for linked
1261 documents during a recursive retrieval. If a user wants only
1262 a subset of those tags to be considered, however, he or she
1263 should specify such tags in a comma-separated ''list''
1264 with this option.
1265
1266
1267 __-G__ ''list''
1268
1269
1270 __--ignore-tags=__''list''
1271
1272
1273 This is the opposite of the __--follow-tags__ option. To
1274 skip certain HTML tags when recursively
1275 looking for documents to download, specify them in a
1276 comma-separated ''list''.
1277
1278
1279 In the past, the __-G__ option was the best bet for
1280 downloading a single page and its requisites, using a
1281 commandline like:
1282
1283
1284 wget -Ga,area -H -k -K -r http://''site''/''document''
1285 However, the author of this option came across a page with tags like <LINK REL="home" HREF="/"> and came to the realization that __-G__ was not enough. One can't just tell Wget to ignore <LINK>, because then stylesheets will not be downloaded. Now the best bet for downloading a single page and its requisites is the dedicated __--page-requisites__ option.
1286
1287
1288 __-H__
1289
1290
1291 __--span-hosts__
1292
1293
1294 Enable spanning across hosts when doing recursive
1295 retrieving.
1296
1297
1298 __-L__
1299
1300
1301 __--relative__
1302
1303
1304 Follow relative links only. Useful for retrieving a specific
1305 home page without any distractions, not even those from the
1306 same hosts.
1307
1308
1309 __-I__ ''list''
1310
1311
1312 __--include-directories=__''list''
1313
1314
1315 Specify a comma-separated list of directories you wish to
1316 follow when downloading. Elements of ''list'' may contain
1317 wildcards.
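
For instance, a sketch restricting a recursive retrieval to two
hypothetical directories:

wget -r -I /pub,/src ftp://ftp.xemacs.org/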
1318
1319
1320 __-X__ ''list''
1321
1322
1323 __--exclude-directories=__''list''
1324
1325
1326 Specify a comma-separated list of directories you wish to
1327 exclude from download. Elements of ''list'' may contain
1328 wildcards.
1329
1330
1331 __-np__
1332
1333
1334 __--no-parent__
1335
1336
1337 Do not ever ascend to the parent directory when retrieving
1338 recursively. This is a useful option, since it guarantees
1339 that only the files ''below'' a certain hierarchy will be
1340 downloaded.
1341 !!EXAMPLES
1342
1343
1344 The examples are divided into three sections loosely based
1345 on their complexity.
1346
1347
1348 __Simple Usage__
1349
1350
1351 Say you want to download a URL . Just
1352 type:
1353
1354
1355 wget http://fly.srk.fer.hr/
1356
1357
1358 But what will happen if the connection is slow, and the file
1359 is lengthy? The connection will probably fail before the
1360 whole file is retrieved, more than once. In this case, Wget
1361 will try getting the file until it either gets the whole of
1362 it, or exceeds the default number of retries (this being
1363 20). It is easy to change the number of tries to 45, to
1364 ensure that the whole file will arrive safely:
1365
1366
1367 wget --tries=45 http://fly.srk.fer.hr/jpg/flyweb.jpg
1368
1369
1370 Now let's leave Wget to work in the background, and write
1371 its progress to log file ''log''. It is tiring to type
1372 __--tries__, so we shall use __-t__.
1373
1374
1375 wget -t 45 -o log http://fly.srk.fer.hr/jpg/flyweb.jpg &
1376 The ampersand at the end of the line makes sure that Wget works in the background. To unlimit the number of retries, use __-t inf__.
1377
1378
1379 Using FTP is just as simple. Wget will take
1380 care of login and password.
1381
1382
1383 wget ftp://gnjilux.srk.fer.hr/welcome.msg
1384
1385
1386 If you specify a directory, Wget will retrieve the directory
1387 listing, parse it and convert it to HTML .
1388 Try:
1389
1390
1391 wget ftp://prep.ai.mit.edu/pub/gnu/
1392 links index.html
1393
1394
1395 __Advanced Usage__
1396
1397
1398 You have a file that contains the URLs you want to download?
1399 Use the __-i__ switch:
1400
1401
1402 wget -i ''file''
1403 If you specify __-__ as file name, the URLs will be read from standard input.
1404
1405
1406 Create a five levels deep mirror image of the
1407 GNU web site, with the same directory
1408 structure the original has, with only one try per document,
1409 saving the log of the activities to
1410 ''gnulog'':
1411
1412
1413 wget -r http://www.gnu.org/ -o gnulog
1414
1415
1416 The same as the above, but convert the links in the
1417 HTML files to point to local files, so you
1418 can view the documents off-line:
1419
1420
1421 wget --convert-links -r http://www.gnu.org/ -o gnulog
1422
1423
1424 Retrieve only one HTML page, but make sure
1425 that all the elements needed for the page to be displayed,
1426 such as inline images and external style sheets, are also
1427 downloaded. Also make sure the downloaded page references
1428 the downloaded links.
1429
1430
1431 wget -p --convert-links http://www.server.com/dir/page.html
1432 The HTML page will be saved to ''www.server.com/dir/page.html'', and the images, stylesheets, etc., somewhere under ''www.server.com/'', depending on where they were on the remote server.
1433
1434
1435 The same as the above, but without the
1436 ''www.server.com/'' directory. In fact, I don't want to
1437 have all those random server directories anyway---just save
1438 ''all'' those files under a ''download/'' subdirectory
1439 of the current directory.
1440
1441
1442 wget -p --convert-links -nH -nd -Pdownload \
1443 http://www.server.com/dir/page.html
1444
1445
1446 Retrieve the index.html of __www.lycos.com__, showing the
1447 original server headers:
1448
1449
1450 wget -S http://www.lycos.com/
1451
1452
1453 Save the server headers with the file, perhaps for
1454 post-processing.
1455
1456
1457 wget -s http://www.lycos.com/
1458 more index.html
1459
1460
1461 Retrieve the first two levels of __wuarchive.wustl.edu__,
1462 saving them to ''/tmp''.
1463
1464
1465 wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/
1466
1467
1468 You want to download all the GIFs from a directory on an
1469 HTTP server. You tried __wget
1470 http://www.server.com/dir/*.gif__, but that didn't work
1471 because HTTP retrieval does not support
1472 globbing. In that case, use:
1473
1474
1475 wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
1476 More verbose, but the effect is the same. __-r -l1__ means to retrieve recursively, with maximum depth of 1. __--no-parent__ means that references to the parent directory are ignored, and __-A.gif__ means to download only the GIF files. __-A ``*.gif''__ would have worked too.
1477
1478
1479 Suppose you were in the middle of downloading, when Wget was
1480 interrupted. Now you do not want to clobber the files
1481 already present. It would be:
1482
1483
1484 wget -nc -r http://www.gnu.org/
1485
1486
1487 If you want to encode your own username and password to
1488 HTTP or FTP , use the
1489 appropriate URL syntax.
1490
1491
1492 wget ftp://hniksic:mypassword@unix.server.com/.emacs
1493 Note, however, that this usage is not advisable on multi-user systems because it reveals your password to anyone who looks at the output of ps.
1494
1495
1496 You would like the output documents to go to standard output
1497 instead of to files?
1498
1499
1500 wget -O - http://jagor.srce.hr/ http://www.srce.hr/
1501 You can also combine the two options and make pipelines to retrieve the documents from remote hotlists:
1502
1503
1504 wget -O - http://cool.list.com/ | wget --force-html -i -
1505
1506
1507 __Very Advanced Usage__
1508
1509
1510 If you wish Wget to keep a mirror of a page (or
1511 FTP subdirectories), use __--mirror__
1512 (__-m__), which is the shorthand for __-r -l inf -N__.
1513 You can put Wget in the crontab file asking it to recheck a
1514 site each Sunday:
1515
1516
1517 crontab
1518 0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog
1519
1520
1521 In addition to the above, you want the links to be converted
1522 for local viewing. But, after having read this manual, you
1523 know that link conversion doesn't play well with
1524 timestamping, so you also want Wget to back up the original
1525 HTML files before the conversion. Wget
1526 invocation would look like this:
1527
1528
1529 wget --mirror --convert-links --backup-converted \
1530 http://www.gnu.org/ -o /home/me/weeklog
1531
1532
1533 But you've also noticed that local viewing doesn't work all
1534 that well when HTML files are saved under
1535 extensions other than __.html__, perhaps because they
1536 were served as ''index.cgi''. So you'd like Wget to
1537 rename all the files served with content-type
1538 __text/html__ to ''name.html''.
1539
1540
1541 wget --mirror --convert-links --backup-converted \
1542 --html-extension -o /home/me/weeklog \
1543 http://www.gnu.org/
1544 Or, with less typing:
1545
1546
1547 wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog
1548 !!FILES
1549
1550
1551 __/etc/wgetrc__
1552
1553
1554 Default location of the ''global'' startup
1555 file.
1556
1557
1558 __.wgetrc__
1559
1560
1561 User startup file.
1562 !!BUGS
1563
1564
1565 You are welcome to send bug reports about GNU
1566 Wget to bug-wget@gnu.org.
1567
1568
1569 Before actually submitting a bug report, please try to
1570 follow a few simple guidelines.
1571
1572
1573 1.
1574
1575
1576 Please try to ascertain that the behaviour you see really is
1577 a bug. If Wget crashes, it's a bug. If Wget does not behave
1578 as documented, it's a bug. If things work strange, but you
1579 are not sure about the way they are supposed to work, it
1580 might well be a bug.
1581
1582
1583 2.
1584
1585
1586 Try to repeat the bug in as simple circumstances as
1587 possible. E.g. if Wget crashes while downloading __wget
1588 -rl0 -kKE -t5 -Y0 http://yoyodyne.com -o /tmp/log__, you
1589 should try to see if the crash is repeatable, and if it will
1590 occur with a simpler set of options. You might even try to
1591 start the download at the page where the crash occurred to
1592 see if that page somehow triggered the crash.
1593
1594
1595 Also, while I will probably be interested to know the
1596 contents of your ''.wgetrc'' file, just dumping it into
1597 the debug message is probably a bad idea. Instead, you
1598 should first try to see if the bug repeats with
1599 ''.wgetrc'' moved out of the way. Only if it turns out
1600 that ''.wgetrc'' settings affect the bug, mail me the
1601 relevant parts of the file.
1602
1603
1604 3.
1605
1606
1607 Please start Wget with __-d__ option and send the log (or
1608 the relevant parts of it). If Wget was compiled without
1609 debug support, recompile it. It is ''much'' easier to
1610 trace bugs with debug support on.
1611
1612
1613 4.
1614
1615
1616 If Wget has crashed, try to run it in a debugger, e.g.
1617 gdb `which wget` core and type where to
1618 get the backtrace.
1619 !!SEE ALSO
1620
1621
1622 GNU Info entry for ''wget''.
1623 !!AUTHOR
1624
1625
1626 Originally written by Hrvoje Niksic
1627 !!COPYRIGHT
1628
1629
1630 Copyright (c) 1996, 1997, 1998, 2000, 2001 Free Software
1631 Foundation, Inc.
1632
1633
1634 Permission is granted to make and distribute verbatim copies
1635 of this manual provided the copyright notice and this
1636 permission notice are preserved on all copies.
1637
1638
1639 Permission is granted to copy, distribute and/or modify this
1640 document under the terms of the GNU Free
1641 Documentation License, Version 1.1 or any later version
1642 published by the Free Software Foundation; with the
1643 Invariant Sections being `` GNU General
1644 Public License'' and `` GNU Free
1645 Documentation License'', with no Front-Cover Texts, and with
1646 no Back-Cover Texts. A copy of the license is included in
1647 the section entitled `` GNU Free
1648 Documentation License''.