A few questions about a Python crawler

一品带刀护卫 · Posted on 2020-2-3 19:24:11

# Sitemap crawler
import re
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        # charset declared in the Content-Type response header, if any
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download Error:', e.reason)
        html = None
        if num_retries > 0:
            # retry only on 5xx server errors, keeping the same user agent and charset
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1, charset)
    return html


def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # nothing to parse if the download failed
        return
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)


crawl_sitemap('http://example.python-scraping.com/sitemap.xml')

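This code is typed out from a book, and I have a few questions:
1. What is the variable cs?
2. In resp.headers.get_content_charset(), where do headers and get_content_charset come from? I searched Baidu and couldn't find them.
3. In if hasattr(e, 'code') and 500 <= e.code < 600:, what is code, and what is e.code?
4. In links = re.findall('<loc>(.*?)</loc>', sitemap), <loc>(.*?)</loc> is presumably a regular expression. What does it actually match?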
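For anyone with the same four questions, here is a small offline sketch that pokes at each piece without touching the network. It relies on resp.headers being an http.client.HTTPMessage, which subclasses email.message.Message (where get_content_charset is defined), and on HTTPError carrying the HTTP status in its code attribute while a plain URLError does not. The helper names (pick_charset, is_server_error, extract_locs), the header values, and the example.com URLs are made up for illustration and are not from the book.

```python
import io
import re
from email.message import Message
from urllib.error import HTTPError, URLError


def pick_charset(headers, default='utf-8'):
    """Mirror the `cs` logic: use the declared charset, else a default."""
    cs = headers.get_content_charset()  # None when no charset is declared
    return cs or default


def is_server_error(e):
    """True when the exception carries an HTTP status code in the 5xx range."""
    return hasattr(e, 'code') and 500 <= e.code < 600


def extract_locs(sitemap_xml):
    """Return the text captured between each <loc> and </loc> pair."""
    # (.*?) is a non-greedy group: it stops at the first </loc> it meets
    return re.findall(r'<loc>(.*?)</loc>', sitemap_xml)


# Questions 1 and 2: cs is the charset named in the Content-Type header.
# These Message objects stand in for resp.headers (hypothetical values).
with_charset = Message()
with_charset['Content-Type'] = 'text/html; charset=GBK'
without_charset = Message()
without_charset['Content-Type'] = 'text/html'
print(pick_charset(with_charset))     # gbk
print(pick_charset(without_charset))  # utf-8

# Question 3: HTTPError has a .code attribute, a plain URLError does not,
# which is why the book's code checks hasattr(e, 'code') first.
http_err = HTTPError('http://example.com/', 503, 'Service Unavailable',
                     Message(), io.BytesIO())
url_err = URLError('connection refused')
print(is_server_error(http_err))  # True
print(is_server_error(url_err))   # False

# Question 4: the pattern pulls out every URL wrapped in <loc> tags.
sample = ('<url><loc>http://example.com/a</loc></url>'
          '<url><loc>http://example.com/b</loc></url>')
print(extract_locs(sample))  # ['http://example.com/a', 'http://example.com/b']
```

Without the ? in (.*?), the group would be greedy and would swallow everything from the first <loc> to the last </loc> on the same line as a single match.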

ll104567 · Posted on 2020-2-4 10:34:36

It's 2020 and you're still using urllib?
Switch to a different HTTP request library.

Search Zhihu for "python requests"
and then read up on it properly.
