WGET DOCUMENT (2)  

2010-11-23 16:10:14 | Category: Programming

2.11 Recursive Accept/Reject Options

‘-A acclist’
‘--accept acclist’

‘-R rejlist’
‘--reject rejlist’

Specify comma-separated lists of file name suffixes or patterns to accept or reject (see Types of Files). Note that if any of the wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.
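
For example (the host name here is purely illustrative), the following would crawl one level of a directory and keep only GIF and JPEG files while rejecting anything ending in ‘.tmp’:

          wget -r -l1 -A '*.gif,*.jpg' -R '*.tmp' http://www.example.com/pics/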

‘-D domain-list’

‘--domains=domain-list’

Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on ‘-H’.
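
For instance (the domain names are only an example), combining ‘-H’ with ‘-D’ confines a host-spanning crawl to the listed domains:

          wget -r -H --domains=gnu.org,fsf.org http://www.gnu.org/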

‘--exclude-domains domain-list’

Specify the domains that are not to be followed (see Spanning Hosts).
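
As a sketch (the host names are made up), this follows links within foo.example but skips its mirror host:

          wget -r -H -Dfoo.example --exclude-domains ftp.foo.example http://www.foo.example/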

 

‘--follow-ftp’

Follow ftp links from html documents. Without this option, Wget will ignore all the ftp links.
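
A minimal sketch (the URL is hypothetical): crawl a site and also fetch any ftp links its pages contain:

          wget -r --follow-ftp http://www.example.com/downloads/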

 

‘--follow-tags=list’

Wget has an internal table of html tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval. If a user wants only a subset of those tags to be considered, however, he or she should specify such tags in a comma-separated list with this option.
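
For example (the URL is illustrative), to consider only ‘<a>’ and ‘<img>’ links during a recursive crawl:

          wget -r --follow-tags=a,img http://www.example.com/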

‘--ignore-tags=list’

This is the opposite of the ‘--follow-tags’ option. To skip certain html tags when recursively looking for documents to download, specify them in a comma-separated list.

In the past, this option was the best bet for downloading a single page and its requisites, using a command-line like:

          wget --ignore-tags=a,area -H -k -K -r http://site/document

However, the author of this option came across a page with tags like <LINK REL="home" HREF="/"> and came to the realization that specifying tags to ignore was not enough. One can't just tell Wget to ignore <LINK>, because then stylesheets will not be downloaded. Now the best bet for downloading a single page and its requisites is the dedicated ‘--page-requisites’ option.

 

‘--ignore-case’

Ignore case when matching files and directories. This influences the behavior of -R, -A, -I, and -X options, as well as globbing implemented when downloading from FTP sites. For example, with this option, ‘-A *.txt’ will match ‘file1.txt’, but also ‘file2.TXT’, ‘file3.TxT’, and so on.
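
For instance (the server name is illustrative), this accepts .txt files regardless of how the extension is capitalized:

          wget -r -l1 --ignore-case -A '*.txt' ftp://ftp.example.com/pub/docs/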

‘-H’

‘--span-hosts’

Enable spanning across hosts when doing recursive retrieving (see Spanning Hosts).

‘-L’

‘--relative’

Follow relative links only. Useful for retrieving a specific home page without any distractions, not even those from the same hosts (see Relative Links).
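
A quick sketch (the URL is illustrative): mirror a personal page while following only relative links:

          wget -r -L http://www.example.com/~someone/index.html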

‘-I list’

‘--include-directories=list’

Specify a comma-separated list of directories you wish to follow when downloading (see Directory-Based Limits). Elements of list may contain wildcards.

‘-X list’

‘--exclude-directories=list’

Specify a comma-separated list of directories you wish to exclude from download (see Directory-Based Limits). Elements of list may contain wildcards.
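
As a sketch (host and paths are illustrative), the first command limits a crawl to two directory trees, while the second excludes one tree instead:

          wget -r -I /pub/gnu,/pub/docs ftp://ftp.example.com/
          wget -r -X /pub/tmp ftp://ftp.example.com/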

‘-np’

‘--no-parent’

Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details.
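
For example (the URL is illustrative), this retrieves everything at or below /docs/manual/ and never wanders up into /docs/ itself:

          wget -r --no-parent http://www.example.com/docs/manual/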

2.12 Exit Status

Wget may return one of several error codes if it encounters problems.

  0   No problems occurred.
  1   Generic error code.
  2   Parse error—for instance, when parsing command-line options, the ‘.wgetrc’ or ‘.netrc’...
  3   File I/O error.
  4   Network failure.
  5   SSL verification failure.
  6   Username/password authentication failure.
  7   Protocol errors.
  8   Server issued an error response.

With the exceptions of 0 and 1, the lower-numbered exit codes take precedence over higher-numbered ones, when multiple types of errors are encountered.

In versions of Wget prior to 1.12, Wget's exit status tended to be unhelpful and inconsistent. Recursive downloads would virtually always return 0 (success), regardless of any issues encountered, and non-recursive fetches only returned the status corresponding to the most recently-attempted download.
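
As a small shell sketch (the URL is hypothetical), a script can branch on the exit status to distinguish, say, a network failure from bad credentials:

          wget -q http://www.example.com/data.tgz
          status=$?
          case $status in
            0) echo "download succeeded" ;;
            4) echo "network failure" ;;
            6) echo "wrong username or password" ;;
            *) echo "wget exited with status $status" ;;
          esac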

7.1 Simple Usage

  • Say you want to download a url. Just type:

          wget http://fly.srk.fer.hr/

  • But what will happen if the connection is slow, and the file is lengthy? The connection will probably fail before the whole file is retrieved, quite possibly more than once. In this case, Wget will try getting the file until it either gets the whole of it, or exceeds the default number of retries (this being 20). It is easy to change the number of tries to 45, to ensure that the whole file will arrive safely:

          wget --tries=45 http://fly.srk.fer.hr/jpg/flyweb.jpg

  • Now let's leave Wget to work in the background, and write its progress to log file log. It is tiring to type ‘--tries’, so we shall use ‘-t’.

          wget -t 45 -o log http://fly.srk.fer.hr/jpg/flyweb.jpg &

The ampersand at the end of the line makes sure that Wget works in the background. To unlimit the number of retries, use ‘-t inf’.

  • The usage of ftp is just as simple. Wget will take care of the login and password.

          wget ftp://gnjilux.srk.fer.hr/welcome.msg

  • If you specify a directory, Wget will retrieve the directory listing, parse it and convert it to html. Try:

          wget ftp://ftp.gnu.org/pub/gnu/
          links index.html

7.2 Advanced Usage

  • You have a file that contains the URLs you want to download? Use the ‘-i’ switch:

          wget -i file

If you specify ‘-’ as file name, the urls will be read from standard input.

  • Create a five levels deep mirror image of the GNU web site, with the same directory structure the original has, with only one try per document, saving the log of the activities to gnulog:

          wget -r http://www.gnu.org/ -o gnulog

  • The same as the above, but convert the links in the downloaded files to point to local files, so you can view the documents off-line:

          wget --convert-links -r http://www.gnu.org/ -o gnulog

  • Retrieve only one html page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.

          wget -p --convert-links http://www.server.com/dir/page.html

The html page will be saved to www.server.com/dir/page.html, and the images, stylesheets, etc., somewhere under www.server.com/, depending on where they were on the remote server.

  • The same as the above, but without the www.server.com/ directory. In fact, I don't want to have all those random server directories anyway—just save all those files under a download/ subdirectory of the current directory.

          wget -p --convert-links -nH -nd -Pdownload \
               http://www.server.com/dir/page.html

  • Retrieve the index.html of ‘www.lycos.com’, showing the original server headers:

          wget -S http://www.lycos.com/

  • Save the server headers with the file, perhaps for post-processing.

          wget --save-headers http://www.lycos.com/
          more index.html

  • Retrieve the first two levels of ‘wuarchive.wustl.edu’, saving them to /tmp.

          wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/

  • You want to download all the gifs from a directory on an http server. You tried ‘wget http://www.server.com/dir/*.gif’, but that didn't work because http retrieval does not support globbing. In that case, use:

          wget -r -l1 --no-parent -A.gif http://www.server.com/dir/

More verbose, but the effect is the same. ‘-r -l1’ means to retrieve recursively (see Recursive Download), with maximum depth of 1. ‘--no-parent’ means that references to the parent directory are ignored (see Directory-Based Limits), and ‘-A.gif’ means to download only the gif files. ‘-A "*.gif"’ would have worked too.

  • Suppose you were in the middle of downloading, when Wget was interrupted. Now you do not want to clobber the files already present. It would be:

          wget -nc -r http://www.gnu.org/

  • If you want to encode your own username and password to http or ftp, use the appropriate url syntax (see URL Format).

          wget ftp://hniksic:mypassword@unix.server.com/.emacs

Note, however, that this usage is not advisable on multi-user systems because it reveals your password to anyone who looks at the output of ps.
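
A somewhat safer alternative, assuming a Wget release new enough to provide these options (they are absent in very old versions), is to keep the password off the command line and let Wget prompt for it:

          wget --user=hniksic --ask-password ftp://unix.server.com/.emacs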

  • You would like the output documents to go to standard output instead of to files?

          wget -O - http://jagor.srce.hr/ http://www.srce.hr/

You can also combine the two options and make pipelines to retrieve the documents from remote hotlists:

          wget -O - http://cool.list.com/ | wget --force-html -i -

7.3 Very Advanced Usage

  • If you wish Wget to keep a mirror of a page (or ftp subdirectories), use ‘--mirror’ (‘-m’), which is the shorthand for ‘-r -l inf -N’. You can put Wget in the crontab file asking it to recheck a site each Sunday:

          crontab
          0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog

  • In addition to the above, you want the links to be converted for local viewing. But, after having read this manual, you know that link conversion doesn't play well with timestamping, so you also want Wget to back up the original html files before the conversion. Wget invocation would look like this:

          wget --mirror --convert-links --backup-converted  \
               http://www.gnu.org/ -o /home/me/weeklog

  • But you've also noticed that local viewing doesn't work all that well when html files are saved under extensions other than ‘.html’, perhaps because they were served as index.cgi. So you'd like Wget to rename all the files served with content-type ‘text/html’ or ‘application/xhtml+xml’ to name.html.

          wget --mirror --convert-links --backup-converted \
               --html-extension -o /home/me/weeklog        \
               http://www.gnu.org/

Or, with less typing:

          wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog

 
