[Solved] I wrote a crawler but don't know how to handle exceptions. Any help appreciated!

silence181 · posted 2018-2-9 11:42:16

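I'm crawling a novel site, but after running for a while it fails with an error along the lines of "the remote host did not respond", and then I have to start crawling again from the very beginning. How should I write the exception handling for this? Any pointers appreciated!
I'm a beginner and the code is rather messy, so please bear with me.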

import requests
import re
import pymysql

# Connect to the local MySQL database that stores books and chapters
conn = pymysql.connect(
    host='127.0.0.1',
    port=3306,
    user='root',
    passwd='root',
    db='xiaoshu',
    charset='utf8'
)
cursor = conn.cursor()


def get_next(url):
    # Download the first page of a category listing
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text


def re_next(next_html):
    # Extract the total number of listing pages from the pager
    r = re.compile('class="tspage">.*?1/(.*?) 每页.*?<a', re.S)
    item = re.findall(r, next_html)
    page_num = item[0]
    return page_num


def re_page_html(page_html):
    # Extract (book link, book name) pairs from a listing page
    r = re.compile('<li>.*?class="s">.*?href="(.*?)"><img.*?">(.*?)</a>.*?class="u">.*?</li>', re.S)
    item = re.findall(r, page_html)
    return item


def get_l(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text


def get_arc_list(url):
    # Download the chapter list page of a book
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text


def re_arc_list(arc_list_url):
    # Extract chapter links, skipping the first 24 matches
    r = re.compile('<li><a href="(.*?)">.*?</a></li>', re.S)
    item = re.findall(r, arc_list_url)
    items = item[24:]
    return items


def get_arc_html(uu):
    # Download a single chapter page
    response = requests.get(uu)
    response.encoding = 'utf-8'
    return response.text


def re_arc(arc_html):
    # Extract (chapter title, book name, chapter text) from a chapter page
    r = re.compile('class="txt_cont">.*?<h1>(.*?)</h1>.*?html">(.*?)TXT.*?id="content1">(.*?)</div>', re.S)
    item = re.findall(r, arc_html)
    return item


def main():
    for i in range(1, 2):
        url = 'http://www.sjtxt.la/soft/{}/Soft_00{}_1.html'.format(i, i)
        next_html = get_next(url)
        page_nums = re_next(next_html)
        for j in range(1, int(page_nums) + 1):
            urls = 'http://www.sjtxt.la/soft/{}/Soft_00{}_{}.html'.format(i, i, j)
            page_html = requests.get(urls)
            for data in re_page_html(page_html.text):
                page_url = 'http://www.sjtxt.la/book/' + data[0][-10:-5]
                txtname = data[1]
                print(txtname)
                # Parameterized query, so quotes in the text cannot break the SQL
                cursor.execute("insert into book(txtname) values (%s)", (txtname,))
                idtxtname = cursor.lastrowid
                conn.commit()
                arc_list_url = get_arc_list(page_url)
                for u in re_arc_list(arc_list_url):
                    uu = page_url + '/' + u
                    arc_html = get_arc_html(uu)
                    for arc_data in re_arc(arc_html):
                        title = arc_data[0]
                        # Strip all whitespace from the chapter text
                        con = arc_data[2].split()
                        content = ''.join(con)
                        print(title, content)
                        cursor.execute(
                            "insert into con(idtxtname,title,content) values (%s,%s,%s)",
                            (idtxtname, title, content))
                        conn.commit()


if __name__ == '__main__':
    main()
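
On the "start again from the very beginning" part of the question, one possible approach is to record each URL once it has been processed and skip it on the next run. The following is only a minimal sketch under stated assumptions: `load_done`, `mark_done`, `crawl`, the `handle` callback, and the `done_urls.txt` file name are all hypothetical and not part of the original code, where the per-chapter work would be the download plus the database insert.

```python
import os


def load_done(path):
    """Return the set of URLs recorded as finished in earlier runs."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(path, url):
    """Append one finished URL to the progress file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(url + '\n')


def crawl(urls, handle, progress_path='done_urls.txt'):
    """Call handle(url) for every URL that is not yet recorded as finished."""
    done = load_done(progress_path)
    for url in urls:
        if url in done:
            continue  # already processed in an earlier run
        try:
            handle(url)
        except Exception as e:
            # Leave the URL unrecorded so the next run tries it again
            print('failed, will retry on next run:', url, e)
            continue
        mark_done(progress_path, url)
        done.add(url)
```

With something like this, a dropped connection costs only the page that failed: rerunning the script picks up from the first unrecorded URL.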

Best answer

ba21

2018-2-9 12:41:08

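Note: if HTTPError and URLError are both caught, HTTPError must come first. HTTPError is a subclass of URLError, so an earlier URLError clause would catch it too.

import urllib.request
from urllib.error import HTTPError, URLError

req = urllib.request.Request("http://www.fishc.com/ooxx.html")

Style 1 (recommended):
try:
    response = urllib.request.urlopen(req)
except URLError as e:
    # In Python 3 an HTTPError has both code and reason, so test for code first
    if hasattr(e, 'code'):
        print(e.code)
    elif hasattr(e, 'reason'):
        print(e.reason)

Style 2:
try:
    urllib.request.urlopen(req)
except HTTPError as e:
    # The server answered, but with an error status
    print(e.code)
    print(e.reason)
    print(e.read())
except URLError as e:
    # The server could not be reached at all
    print(e.reason)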
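
The answer above uses urllib, while the crawler in the question uses requests. As a rough equivalent for requests, here is a minimal sketch, not the answerer's method: `fetch`, its `retries` and `backoff` parameters, and the injectable `get` argument (there only so the function can be exercised without a network) are assumptions made for illustration. The same ordering rule applies, because `requests.exceptions.HTTPError` is a subclass of `requests.exceptions.RequestException`.

```python
import time

import requests


def fetch(url, retries=3, timeout=10, backoff=2.0, get=requests.get):
    """Return the page text, retrying on network errors; None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # raises HTTPError on a 4xx/5xx status
            response.encoding = 'utf-8'
            return response.text
        except requests.exceptions.HTTPError as e:
            # The server answered with an error status; this sketch does not retry it
            print('HTTP error for', url, ':', e)
            return None
        except requests.exceptions.RequestException as e:
            # Timeouts, refused or dropped connections, and similar failures
            print('attempt', attempt, 'failed for', url, ':', e)
            time.sleep(backoff * attempt)  # wait a little longer after each failure
    return None
```

A caller would check for `None` and skip or log that page instead of letting the whole crawl crash.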

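°蓝鲤歌蓝 · posted 2018-2-9 11:51:26

Your code makes my head spin: no comments, no error message, and why are there several completely identical functions?!
My guess is that the server noticed you are a crawler and stopped accepting your connections, since your code has none of the basic measures for getting past anti-crawler checks.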
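
For reference, the "basic measures" mentioned here usually mean sending a browser-like User-Agent header and pausing between requests. A minimal sketch, assuming requests is used as in the question; `make_session`, `polite_get`, the header value, and the delay range are hypothetical choices, not something the thread specifies:

```python
import random
import time

import requests

# Assumed example value; any realistic browser User-Agent string would do
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}


def make_session(headers=HEADERS):
    """Create a session that sends the same headers with every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def polite_get(session, url, min_delay=1.0, max_delay=3.0, timeout=10):
    """Sleep for a random interval, then fetch the URL through the session."""
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, timeout=timeout)
```

Reusing one session also keeps the underlying connection open between requests, which tends to be gentler on the server than opening a new connection for every page.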

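silence181 · posted 2018-2-9 19:23:43

°蓝鲤歌蓝 posted 2018-2-9 11:51
Your code makes my head spin: no comments, no error message, and why are there several completely identical functions?!
My guess ...

The site belongs to a friend of mine, and it has no anti-crawler measures at all.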

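silence181 · posted 2018-2-9 19:25:02

°蓝鲤歌蓝 posted 2018-2-9 11:51
Your code makes my head spin: no comments, no error message, and why are there several completely identical functions?!
My guess ...

There are no identical functions, the function names just look a bit alike, haha.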

°蓝鲤歌蓝 · posted 2018-2-9 19:25:30

silence181 posted 2018-2-9 19:23
The site belongs to a friend of mine, and it has no anti-crawler measures at all.

Then the problem is probably on his side.

silence181 · posted 2018-2-9 19:25:35

ba21 posted 2018-2-9 12:41
Note: if HTTPError and URLError are both caught, HTTPError must come first.

Thanks, I'll look into it some more...

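°蓝鲤歌蓝 · posted 2018-2-9 19:27:30

silence181 posted 2018-2-9 19:25
There are no identical functions, the function names just look a bit alike, haha.

The names differ, but the function bodies are all the same, so there should be no need for them.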
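
To illustrate the point, the four download functions in the question (`get_next`, `get_l`, `get_arc_list`, `get_arc_html`) could be collapsed into one. A minimal sketch: `get_html` and the `timeout` value are assumptions added here, and the aliases exist only so that existing call sites would keep working.

```python
import requests


def get_html(url, timeout=10):
    """Download a page and return its text decoded as UTF-8."""
    response = requests.get(url, timeout=timeout)
    response.encoding = 'utf-8'
    return response.text


# Optional aliases so the old names still resolve to the single implementation
get_next = get_l = get_arc_list = get_arc_html = get_html
```

Having a single function also means exception handling only has to be added in one place.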
