Last edited by potatoking006 on 2021-04-12 21:02.
Before running, you must change the save path in savaphoto().
import urllib.request
from bs4 import BeautifulSoup
import re
import time

def main():
    baseurl = r"https://www.zhihu.com/question/325648700/answer/1366875354"
    html = gethtml(baseurl)
    ImgUrlList = getImgUrlList(html)
    for i in range(len(ImgUrlList)):
        print(ImgUrlList[i])
        time.sleep(12)  # pause between requests to avoid anti-crawler blocking
        savaphoto(ImgUrlList[i], "第一次壁纸" + str(i))

def gethtml(baseUrl):
    # Send a browser-like User-Agent so the request is not rejected
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"}
    request = urllib.request.Request(baseUrl, headers=head)
    response = urllib.request.urlopen(request)
    html = response.read().decode("utf-8")
    return html

def getImgUrlList(html):
    ImgUrlList = []
    soup = BeautifulSoup(html, "html.parser")
    # Extract the .jpg URL from each <figure> tag's src attribute
    findImgUrl = re.compile(r'src="(.*?\.jpg)"')
    for item in soup.find_all("figure"):
        item = str(item)
        matches = re.findall(findImgUrl, item)
        if matches:
            ImgUrlList.append(matches[0])
    return ImgUrlList

def savaphoto(photoUrl, imgName):
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"}
    request = urllib.request.Request(photoUrl, headers=head)
    try:
        response = urllib.request.urlopen(request)
        my_img = response.read()
        response.close()
        # Change this path to your own save directory
        with open(r"E:/spiderphoto/new/" + imgName + ".jpg", "wb") as f:
            f.write(my_img)
    except Exception:
        print("Request failed")
    else:
        print("Fetched one image")

if __name__ == "__main__":
    main()
    print('mission success!!!')
It is best to set time.sleep() to a fairly long interval; otherwise the crawler is likely to be blocked by anti-scraping measures.
With sleep(10) I managed to fetch 99 images. If the connection is dropped partway through, you can change the starting value of range in the for loop and rerun the program to continue fetching wallpapers from where you left off.
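The resume-from-an-index trick above can be sketched as follows. This is a minimal illustration, not part of the original script: `crawl` and `save_func` are hypothetical names, and a stand-in save function is used so the loop logic can be shown without any network access.

```python
import time

def crawl(img_urls, save_func, start=0, delay=12):
    # Resume from index `start` if a previous run was interrupted;
    # sleep between requests to avoid anti-crawler blocking.
    for i in range(start, len(img_urls)):
        save_func(img_urls[i], "第一次壁纸" + str(i))
        time.sleep(delay)

# Pretend indices 0 and 1 were already downloaded before a disconnect:
saved = []
crawl(["u0", "u1", "u2", "u3"], lambda url, name: saved.append(name),
      start=2, delay=0)
print(saved)  # ['第一次壁纸2', '第一次壁纸3']
```

Only the items from `start` onward are fetched, so already-saved wallpapers are not downloaded again.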