Question about crawler code

老笨啊 · Posted on 2019-3-27 17:15:04


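I have been struggling with a weather crawler all day.
With everyone's help it is finally mostly working, but there is one strange problem. I wrapped it in a function so that I can enter a year and crawl all the weather data from 2011 up to that year in one go. It raised an error when it reached month 05.
However, if I enter the years one at a time, there is no error.
The full code (crawling everything in one go) is below: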

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import requests as res
import re
import time

def Wea(year):
    # range(2011, year) covers 2011 through year - 1
    for i in range(2011, year):
        for j in range(1, 13):
            # zero-pad the month so the URL reads e.g. 201105
            if len(str(j)) < 2:
                j = '0' + str(j)
            else:
                j = j
            y = str(i) + str(j)
            url = 'http://www.tianqihoubao.com/lishi/fujianfuzhou/month/%s.html' % y
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
            res1 = res.get(url, headers=header)  # request the page
            soup = bs(res1.text, 'lxml')  # parse the HTML
            cont1 = soup.find_all('td')
            list1 = []
            print('Reached month %s!' % y)
            time.sleep(np.random.randint(1))
            # collect the text fragments of every table cell
            # NOTE: the loops below reuse the name i, which is also the year in the outer loop
            for i in cont1:
                weather = []
                for n in i.strings:
                    a = n.replace(' ', '').replace('\r', '').replace('\n', '')
                    weather.append(a)
                total = (weather)
                list1.append(total)
            # flatten and drop empty strings
            list2 = []
            for i in list1:
                for n in i:
                    if n != '':
                        list2.append(n)
            # regroup into (date, weather, temperature, wind) rows
            list3 = []
            for i in range(0, len(list2), 4):
                date = list2[i]
                weather = list2[i+1]
                temp = list2[i+2]
                wind = list2[i+3]
                total = (date, weather, temp, wind)
                list3.append(total)
            data = pd.DataFrame(list3)
            # append this month's rows to the CSV file
            data.to_csv(r"F:\数据资源\天气资料%s.csv" % year, encoding='gbk', header=None, mode='a')

Wea(2019)


老笨啊 · Posted on 2019-3-27 17:16:09

The version that crawls only a single year is below:

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import requests as res
import re
import time

def Wea(year):
    for j in range(1, 13):
        # zero-pad the month so the URL reads e.g. 201505
        if len(str(j)) < 2:
            j = '0' + str(j)
        else:
            j = j
        url = 'http://www.tianqihoubao.com/lishi/fujianfuzhou/month/%s%s.html' % (year, j)
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
        res1 = res.get(url, headers=header)  # request the page
        soup = bs(res1.text, 'lxml')  # parse the HTML
        cont1 = soup.find_all('td')
        list1 = []
        print('Reached month %s!' % j)
        time.sleep(np.random.randint(1))
        # collect the text fragments of every table cell
        for i in cont1:
            weather = []
            for n in i.strings:
                a = n.replace(' ', '').replace('\r', '').replace('\n', '')
                weather.append(a)
            total = (weather)
            list1.append(total)
        # flatten and drop empty strings
        list2 = []
        for i in list1:
            for n in i:
                if n != '':
                    list2.append(n)
        # regroup into (date, weather, temperature, wind) rows
        list3 = []
        for i in range(0, len(list2), 4):
            date = list2[i]
            weather = list2[i+1]
            temp = list2[i+2]
            wind = list2[i+3]
            total = (date, weather, temp, wind)
            list3.append(total)
        data = pd.DataFrame(list3)
        # append this month's rows to the CSV file
        data.to_csv(r"F:\数据资源\天气资料%s.csv" % year, encoding='gbk', header=None, mode='a')

Wea(2015)


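伏惜寒 · Posted on 2019-3-27 17:38:00

Your crawler is running too fast and the server cannot keep up. Setting a sleep time should be enough to fix it:
import time
Then, at the end of each crawl iteration, add
time.sleep(2)
so that the program sleeps for 2 seconds.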
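
A minimal sketch of that advice, assuming a hypothetical helper `crawl_all` that is not part of the original script. It calls a fetch function for each URL and pauses for a fixed delay after every request. The `fetch` and `sleep` parameters are assumptions, added only so the pause logic can be tried without touching the network.

```python
import time


def crawl_all(urls, fetch, delay=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds after every request.

    `fetch` is any callable that takes a URL and returns the page text,
    for example a small wrapper around requests.get (hypothetical helper).
    """
    pages = []
    for url in urls:
        pages.append(fetch(url))
        sleep(delay)  # fixed pause so requests are not sent back-to-back
    return pages
```

With the script above, `fetch` could be something like `lambda u: res.get(u, headers=header).text`, and `urls` the list of monthly page addresses.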

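老笨啊 · Posted on 2019-3-27 18:08:11

伏惜寒 posted on 2019-3-27 17:38
Your crawler is running too fast and the server cannot keep up. Setting a sleep time should be enough to fix it
import time
Then, at ...

But I did use the sleep command. Look at lines 21/23 of the code (the time.sleep call).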

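老笨啊 · Posted on 2019-3-27 18:11:47

伏惜寒 posted on 2019-3-27 17:38
Your crawler is running too fast and the server cannot keep up. Setting a sleep time should be enough to fix it
import time
Then, at ...

And the error is: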

伏惜寒 · Posted on 2019-3-27 21:41:19

老笨啊 posted on 2019-3-27 18:11
And the error is:

np.random.randint(1, size=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

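Every value returned is 0, so the call is effectively time.sleep(0). Where does the program actually sleep?
Just write time.sleep(2) directly; there is no need to make it that complicated.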
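
The reason is that `np.random.randint(1)` draws integers from the half-open range [0, 1), so 0 is the only possible result. If a randomized pause is still wanted, one option is the standard library's `random.uniform`. This is a sketch under that assumption, not something from the original script, and the helper name `jittered_delay` is hypothetical.

```python
import random
import time


def jittered_delay(low=1.0, high=3.0):
    """Return a random pause length in seconds between low and high."""
    return random.uniform(low, high)


def polite_pause(low=1.0, high=3.0, sleep=time.sleep):
    """Sleep for a random duration and return how long the pause was."""
    seconds = jittered_delay(low, high)
    sleep(seconds)
    return seconds
```

Calling `polite_pause()` once per request gives a pause that is never zero, which the original `time.sleep(np.random.randint(1))` could not do.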

老笨啊 · Posted on 2019-3-28 08:22:20

伏惜寒 posted on 2019-3-27 21:41
np.random.randint(1, size=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

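Thanks. Later last night it suddenly dawned on me: I had reused the variable i several times across the for loops, and that was the problem.
As for the sleep issue you pointed out, I really had not set it up properly. It just was not the root cause.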
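
The name clash can be seen without any network access. In the multi-year listing, the outer loop uses `i` for the year and the parsing loops inside the month loop reuse `i`, so from the second month onward `str(i) + str(j)` is no longer built from the year. A minimal sketch of one way to avoid it, assuming hypothetical helper names (`month_keys`, `rows_of_four`) that are not in the original script:

```python
def month_keys(start_year, end_year):
    """Return 'YYYYMM' strings for every month from start_year up to,
    but not including, end_year (the same bounds as range(2011, year))."""
    keys = []
    for year in range(start_year, end_year):
        for month in range(1, 13):
            keys.append('%d%02d' % (year, month))  # %02d zero-pads the month
    return keys


def rows_of_four(cells):
    """Group a flat list of cell strings into (date, weather, temp, wind)
    tuples, skipping an incomplete trailing group instead of raising."""
    usable = len(cells) - len(cells) % 4
    rows = []
    for start in range(0, usable, 4):
        rows.append(tuple(cells[start:start + 4]))
    return rows
```

Because each loop variable has its own name (`year`, `month`, `start`), the year survives the parsing step. Skipping an incomplete trailing group is a design choice for this sketch; raising an error instead would make a malformed page easier to notice.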
