|
|
# Sitemap crawler
import re
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download Error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # pass every argument explicitly so the retry count
                # is not mistaken for user_agent
                return download(url, user_agent, num_retries - 1, charset)
    return html


def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # the download failed, so there is nothing to parse
        return
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)


crawl_sitemap('http://example.python-scraping.com/sitemap.xml')
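I typed this code out from a book, and I have a few questions:
1. What is the variable cs?
2. In resp.headers.get_content_charset(), where do headers and get_content_charset come from? I searched Baidu and couldn't find them.
3. In if hasattr(e, 'code') and 500 <= e.code < 600:, what is code, and what is e.code?
4. In links = re.findall('<loc>(.*?)</loc>', sitemap), <loc>(.*?)</loc> should be a regular expression, right? What should it look like? |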
|
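For anyone with the same four questions, here is a small offline sketch (not from the book) that pokes at each piece in isolation. It assumes only the standard library; the header value, the example.invalid URLs, the sample sitemap string, and the variable names are all made up for illustration.

```python
import io
import re
from email.message import Message
from urllib.error import HTTPError, URLError

# Questions 1 and 2: the response's .headers object is an
# http.client.HTTPMessage, which subclasses email.message.Message, and
# get_content_charset() is a method of Message. It reads the charset
# parameter of the Content-Type header; cs simply holds that result.
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
cs = headers.get_content_charset()          # 'iso-8859-1' (lower-cased)

no_charset = Message()
no_charset['Content-Type'] = 'text/html'
fallback = no_charset.get_content_charset() # None, so the default is used

# Question 3: HTTPError carries the HTTP status in .code, while a plain
# URLError (e.g. a connection failure) has no such attribute. hasattr()
# guards the range check so it only runs when .code exists.
http_err = HTTPError('http://example.invalid/', 503,
                     'Service Unavailable', Message(), io.BytesIO())
url_err = URLError('connection refused')
retryable = [hasattr(e, 'code') and 500 <= e.code < 600
             for e in (http_err, url_err)]  # [True, False]

# Question 4: '<loc>(.*?)</loc>' matches a literal <loc> tag, captures
# as few characters as possible (.*? is non-greedy), then a literal
# </loc>. findall() returns only the captured group for each match.
sample = ('<urlset><url><loc>http://example.invalid/a</loc></url>'
          '<url><loc>http://example.invalid/b</loc></url></urlset>')
links = re.findall('<loc>(.*?)</loc>', sample)

print(cs, fallback, retryable, links)
```

Without the `?` in `.*?`, the pattern would be greedy and capture everything from the first `<loc>` to the last `</loc>` as a single match.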