wangyinghan

https://fishc.com.cn/?518563

720

268 views, 2018-7-20 21:48

Handling exceptions:

1 URLError comes from the error module of the urllib library. Any exception raised by the request module can be handled by catching this class.

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    # reason describes why the request failed
    print(e.reason)

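The code above opens a page that does not exist. Instead of crashing with a traceback, the program prints the reason for the failure, so catching the exception keeps the program from terminating abnormally.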
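The same try/except pattern can be wrapped in a small helper so that callers receive None instead of an exception. This is only a sketch: the name fetch and its opener parameter are assumptions introduced here, as are the two stand-in openers, which let the example run without network access (the URL is never contacted).

```python
import io
from urllib import request, error


def fetch(url, opener=request.urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with opener(url) as response:
            return response.read()
    except error.URLError as e:
        # HTTPError is a subclass of URLError, so it is caught here too
        print('request failed:', e.reason)
        return None


def broken_opener(url):
    # Hypothetical stand-in for urlopen that always fails
    raise error.URLError('simulated failure')


def fake_opener(url):
    # Hypothetical stand-in for urlopen that returns a fixed body
    return io.BytesIO(b'hello')


print(fetch('http://cuiqingcai.com/index.htm', opener=broken_opener))  # None
print(fetch('http://cuiqingcai.com/index.htm', opener=fake_opener))    # b'hello'
```

Calling fetch(url) without the opener argument uses the real request.urlopen.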

2 HTTPError (specifically for handling HTTP request errors)

It has three attributes:

1: code, the HTTP status code

2: reason, the reason for the error

3: headers, the response headers

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # Print each attribute on its own line
    print(e.reason, e.code, e.headers, sep='\n')

For the same URL, this catches the HTTPError and prints its reason, code, and headers attributes.
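To look at the three attributes without making a real request, an HTTPError can be built by hand. This is an illustrative sketch only: the URL, the 404 status, and the header value are made up for the example, and in real code the exception is raised by urlopen rather than constructed directly.

```python
import io
from email.message import Message
from urllib import error

# Hypothetical response headers for the example
headers = Message()
headers['Content-Type'] = 'text/html'

# HTTPError(url, code, msg, hdrs, fp)
e = error.HTTPError('http://cuiqingcai.com/index.htm', 404, 'Not Found',
                    headers, io.BytesIO(b''))

print(e.code)                     # 404
print(e.reason)                   # Not Found
print(e.headers['Content-Type'])  # text/html
```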

Because URLError is the parent class of HTTPError, you can catch the subclass error first and then the parent class error:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # Subclass first: HTTP-level errors
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    # Parent class second: everything else raised by request
    print(e.reason)
else:
    print('Request successful')

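This is the better pattern: HTTP errors are handled first, any other URLError next, and the else branch runs only when the request succeeds.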
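Why the order of the except clauses matters can be checked offline. In the sketch below, classify is a hypothetical helper introduced for illustration, and the exceptions are constructed by hand instead of coming from a real request.

```python
import io
from email.message import Message
from urllib import error


def classify(exc):
    """Raise exc and report which except clause handled it."""
    try:
        raise exc
    except error.HTTPError as e:
        # Must come first: an HTTPError is also a URLError
        return 'HTTPError %d' % e.code
    except error.URLError as e:
        return 'URLError %s' % e.reason


http_exc = error.HTTPError('http://cuiqingcai.com/index.htm', 500,
                           'Server Error', Message(), io.BytesIO(b''))
url_exc = error.URLError('unreachable')

print(issubclass(error.HTTPError, error.URLError))  # True
print(classify(http_exc))                           # HTTPError 500
print(classify(url_exc))                            # URLError unreachable
```

If the URLError clause came first, it would also catch every HTTPError, and the HTTPError clause would never run.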

Sometimes the reason attribute is not a string but an exception object:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.001)
except urllib.error.URLError as e:
    print(type(e.reason))
    # reason is an exception object here, so test its type
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

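Here we set a very short timeout to force a timeout exception. The reason attribute turns out to be an instance of socket.timeout, so isinstance() can be used to detect it.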
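The isinstance() check can be pulled into a small predicate and exercised without a network connection. The helper name is_timeout is an assumption made for this sketch. It also accepts a bare socket.timeout, on the assumption that a timeout occurring outside the connection phase may not be wrapped in a URLError.

```python
import socket
from urllib import error


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    if isinstance(exc, socket.timeout):
        return True
    return (isinstance(exc, error.URLError)
            and isinstance(exc.reason, socket.timeout))


# Hand-built exceptions standing in for what urlopen might raise
print(is_timeout(error.URLError(socket.timeout('timed out'))))  # True
print(is_timeout(error.URLError('connection refused')))         # False
```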

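Parsing links:

The urllib library also provides the parse module, which defines a standard interface for handling URLs.

1: urlparse() identifies a URL and splits it into its components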

from urllib.parse import urlparse

result=urlparse('http://www.baidu.com/index.html;user?id=5#comment')

print(type(result),result)

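Here urlparse() parses the URL and returns a ParseResult object, which contains:

scheme, the protocol

netloc, the domain name

path, the access path

params, the parameters

query, the query string, typically used in GET URLs

fragment, the anchor, used to jump directly to a position within the page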
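The six fields listed above can be read off the example URL one by one. A short sketch using the same URL as the code above:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

print(result.scheme)    # http
print(result.netloc)    # www.baidu.com
print(result.path)      # /index.html
print(result.params)    # user
print(result.query)     # id=5
print(result.fragment)  # comment
```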

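The urlparse() API: urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: required, the URL to parse

scheme: the default protocol, used only when the URL itself carries no scheme

allow_fragments: whether to recognize the fragment; when set to False, the fragment is not split out and is parsed as part of the path, params, or query instead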

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
# Fields can be read by attribute name or by index
print(result.scheme, result[0], result.netloc, result[1], sep='\n')

Both access styles, by attribute name and by index, return the same values.
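The effect of the two optional arguments is easier to see side by side. This sketch reuses the URLs from this post; the scheme-less variant and the 'https' default are chosen here only for illustration.

```python
from urllib.parse import urlparse

# allow_fragments=False: '#comment' stays inside the path
no_frag = urlparse('http://www.baidu.com/index.html#comment',
                   allow_fragments=False)
print(no_frag.path)      # /index.html#comment
print(no_frag.fragment)  # (empty string)

# scheme= only applies when the URL has no scheme of its own
defaulted = urlparse('www.baidu.com/index.html;user?id=5#comment',
                     scheme='https')
print(defaulted.scheme)  # https
print(defaulted.netloc)  # (empty string: without '//' the host lands in path)

explicit = urlparse('http://www.baidu.com/index.html', scheme='https')
print(explicit.scheme)   # http

# Attribute access and index access agree
print(no_frag.scheme == no_frag[0], no_frag.netloc == no_frag[1])  # True True
```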

2 urlunparse() (accepts an iterable of exactly 6 components and builds a URL from them)

from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
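urlunparse() is the inverse of urlparse(), which a round trip makes visible. A minimal sketch using the data from the example above:

```python
from urllib.parse import urlparse, urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
url = urlunparse(data)
print(url)  # http://www.baidu.com/index.html;user?a=6#comment

# Parsing the result and unparsing it again gives back the same URL
round_trip = urlunparse(urlparse(url))
print(round_trip == url)  # True
```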

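3 urljoin() is another way to build a link. It takes a base_url (the base link) as the first argument and the new link as the second.

The method analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever the new link is missing.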

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.HTML'))
print(urljoin('www.baidu.com', '?category=2#comment'))

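If any of these three parts is missing from the new link, it is filled in from base_url; if the new link already has it, the new link's own part is used.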
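A few concrete joins illustrate the rule. The base path /a/b.html and the relative link c.html below are made up for the example; the other URLs appear elsewhere in this post.

```python
from urllib.parse import urljoin

# New link has no scheme or netloc: both come from the base
print(urljoin('http://www.baidu.com', 'FAQ.HTML'))
# http://www.baidu.com/FAQ.HTML

# A relative path replaces the last segment of the base path
print(urljoin('http://www.baidu.com/a/b.html', 'c.html'))
# http://www.baidu.com/a/c.html

# New link is already complete: the base is ignored
print(urljoin('http://www.baidu.com/a/b.html', 'http://cuiqingcai.com/index.htm'))
# http://cuiqingcai.com/index.htm

# Only query and fragment given: scheme and netloc come from the base
print(urljoin('http://www.baidu.com', '?category=2#comment'))
# http://www.baidu.com?category=2#comment
```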

4 urlencode() (converts parameters from a dict into GET request parameters)

from urllib.parse import urlencode

params = {'name': 'germy', 'age': 22}
base_url = 'http://www.baidu.com?'
# Serialize the dict and append it to the base URL
url = base_url + urlencode(params)
print(url)
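What urlencode() produces for a few inputs is sketched below. The 'q' and 'tag' keys are hypothetical additions for illustration; doseq is a real urlencode() parameter that expands list values into repeated keys.

```python
from urllib.parse import urlencode

query = urlencode({'name': 'germy', 'age': 22})
print(query)  # name=germy&age=22

# Spaces are encoded as '+'
spaced = urlencode({'q': 'a b'})
print(spaced)  # q=a+b

# With doseq=True each list item becomes its own key=value pair
multi = urlencode({'tag': ['x', 'y']}, doseq=True)
print(multi)  # tag=x&tag=y
```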

5 quote() (converts content into URL-encoded form; Chinese characters in a URL may otherwise come out garbled)

from urllib.parse import quote

keyword = '王英涵'
# Percent-encode the keyword before placing it in the query string
url = 'http://www.baidu.com/s?wd=' + quote(keyword)
print(url)

6 unquote() decodes a URL-encoded string

from urllib.parse import unquote

url=''

print(unquote(url))
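quote() and unquote() are inverses, which a round trip shows. A minimal sketch reusing the keyword from the quote() example; the 'a b/c' string is made up to show the effect of the safe parameter, which by default leaves '/' unencoded.

```python
from urllib.parse import quote, unquote

keyword = '王英涵'
encoded = quote(keyword)
print(encoded)                      # percent-encoded UTF-8 bytes, ASCII only
print(unquote(encoded) == keyword)  # True

print(quote('a b/c'))           # a%20b/c
print(quote('a b/c', safe=''))  # a%20b%2Fc
```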

7 parse_qsl() converts a query string into a list of tuples.

from urllib.parse import parse_qsl

query='name=germey&age=22'

print(parse_qsl(query))
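The list of tuples converts easily into a dict. The sketch below also shows parse_qs(), a related function in the same module that this post does not cover; it groups values into lists per key.

```python
from urllib.parse import parse_qs, parse_qsl

query = 'name=germey&age=22'

pairs = parse_qsl(query)
print(pairs)        # [('name', 'germey'), ('age', '22')]

# All values come back as strings, including the age
as_dict = dict(pairs)
print(as_dict)      # {'name': 'germey', 'age': '22'}

grouped = parse_qs(query)
print(grouped)      # {'name': ['germey'], 'age': ['22']}
```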
