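Last edited by qiuyouzhi on 2020-3-27 10:50
Scraping the forum's medals with Python
I took a look yesterday. The other emoticons are all easy to scrape: the Ali (阿狸) ones
are simply named ali1, ali2, and so on, so a loop that walks through them and downloads each one is enough. I won't go into detail here.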
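As a rough illustration of that loop, here is a minimal sketch that only builds the numbered file names. The base URL, the `.gif` extension and the count are placeholders I made up, not the forum's real values, so swap in whatever the page source actually shows.

```python
def numbered_urls(base, prefix, count, ext=".gif"):
    """Build URLs like <base>ali1.gif, <base>ali2.gif, ... (hypothetical layout)."""
    return [f"{base}{prefix}{i}{ext}" for i in range(1, count + 1)]


# Placeholder base URL and count: assumptions for illustration only.
urls = numbered_urls("https://example.com/smilies/", "ali", 3)
for url in urls:
    print(url)
```

Each URL in the list can then be fetched and written to disk the same way the medal images are saved further down.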
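Modules used:
pypinyin: the medal names are all in Chinese, so each file is named after the medal's pinyin.
requests: needs no introduction, the go-to tool for fetching web pages.
bs4: parses the page and extracts the data.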
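To make the naming step concrete, here is a small sketch of turning a medal name into a file stem. `medal_slug` and its `to_syllables` parameter are names I introduced for illustration. By default it falls back to pypinyin's `lazy_pinyin`, which returns a list of toneless syllables. The demo passes a stand-in converter so it runs even without pypinyin installed.

```python
def medal_slug(name, to_syllables=None):
    """Join the pinyin syllables of a medal name into a lowercase file stem."""
    if to_syllables is None:
        # Imported lazily so the function can be tried with a stand-in converter.
        from pypinyin import lazy_pinyin
        to_syllables = lazy_pinyin
    return "".join(to_syllables(name)).lower()


# Stand-in converter (hypothetical): returns fixed syllables for the demo.
def fake_syllables(_name):
    return ["Huo", "Yue", "Xiao", "Yu"]


print(medal_slug("活跃小鱼", to_syllables=fake_syllables))  # huoyuexiaoyu
```

Calling `medal_slug("活跃小鱼")` with no second argument uses the real pypinyin conversion instead.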
Straight to the approach:
First import the modules we need and write a function that downloads a page:
- from pypinyin import lazy_pinyin as l  # lazy_pinyin drops the tone marks
- from requests import get
- from bs4 import BeautifulSoup as BS
- def open_url(url):
-     res = get(url)  # our beloved FishC doesn't require a User-Agent
-     return res
Analyzing the page:
(My computer is too slow, so no screenshots here.)
Look through the page source and you'll find:
- <p>活跃小鱼</p>
- <p class="mtn">
- 自主申请
- </p>
- </div>
- </div>
- <div id="medal_34" class="mg_img" onmouseover="showMenu({'ctrlid':this.id, 'menuid':'medal_34_menu', 'pos':'12!'});"><img src="static/image/common/huoyuexiaoyu.gif" alt="活跃小鱼" style="margin-top: 20px;width:auto; height: auto;" /></div>
- <p class="xw1">活跃小鱼</p>
- <p>
- 已拥有
- </p>
- </li>
- <li>
Whoa, the medal names are all sitting in <p> tags (the ones with class "xw1")!
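As an aside, the same snippet shows each medal's <img> carrying the file path in `src` and the medal name in `alt`. Here is a minimal sketch that reads those two attributes directly, using only the standard library's `html.parser` rather than bs4. `MedalImageParser` is a name I made up, the sample is trimmed from the source above, and it is only my assumption that every medal on the live page follows this markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE = "https://fishc.com.cn/home.php?mod=medal"

# Trimmed from the page source quoted above.
SAMPLE = (
    '<div id="medal_34" class="mg_img">'
    '<img src="static/image/common/huoyuexiaoyu.gif" alt="活跃小鱼" /></div>'
    '<p class="xw1">活跃小鱼</p>'
)


class MedalImageParser(HTMLParser):
    """Collect (alt, src) pairs from <img> tags directly inside <div class="mg_img">."""

    def __init__(self):
        super().__init__()
        self.in_medal = False   # True while inside a mg_img div
        self.medals = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            self.in_medal = "mg_img" in classes
        elif tag == "img" and self.in_medal:
            self.medals.append((attrs.get("alt"), attrs.get("src")))

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_medal = False


parser = MedalImageParser()
parser.feed(SAMPLE)
for alt, src in parser.medals:
    # urljoin resolves the relative src against the medal page's URL.
    print(alt, urljoin(PAGE, src))
```

Reading `src` this way would avoid guessing the file name from the pinyin, at the cost of a slightly longer parser.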
So let's just write the code:
- def get_pinyin(name):  # convert each medal name to a list of pinyin syllables
-     pinyin = [l(each) for each in name]
-     return pinyin
- def zhizun():
-     url = 'https://fishc.com.cn/static/image/common/vip.gif'
-     res = open_url(url)
-     with open("zhizunvip.gif", "wb") as f:
-         f.write(res.content)
- def find_name(res):  # collect the medal names
-     name = []
-     soup = BS(res.text, "html.parser")
-     target = soup.find_all('p', class_='xw1')
-     for each in target:
-         name.append(each.text)
-     return name
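The zhizun function may look puzzling. The reason is that
the medal is named 至尊VIP, while its image URL
is just vip (without the 至尊, "zhizun"), so it needs a function of its own
(slicing would also work).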
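The slicing alternative mentioned in passing could look roughly like this. `image_stem` is a helper name I made up, and the only special case it knows about is the 至尊VIP one described above. The remaining medals are assumed to map straight from pinyin to file name.

```python
def image_stem(slug):
    """Map a pinyin slug to the image file stem used in the URL."""
    if slug == "zhizunvip":
        # Slice off the leading "zhizun" so the URL ends in vip.gif.
        return slug[len("zhizun"):]
    return slug


print(image_stem("zhizunvip"))      # vip
print(image_stem("huoyuexiaoyu"))   # huoyuexiaoyu
```

With a helper like this the download loop could build every URL the same way, instead of calling a separate function for one medal.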
Now for the last step: saving the images!
- def get_Img():
-     res = open_url("https://fishc.com.cn/home.php?mod=medal")
-     name = find_name(res)
-     pinyin = get_pinyin(name)
-     for each in pinyin:
-         each = "".join(each).lower()  # each is a list, not a string, so join it
-         url = 'https://fishc.com.cn/static/image/common/%s.gif' % each
-         print(url)
-         res = open_url(url)
-         with open(f'{each}.gif', 'wb') as f:
-             f.write(res.content)
-     zhizun()  # fetch the 至尊VIP image from its real URL
- if __name__ == "__main__":
-     get_Img()
-     print("DONE!")
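One thing to watch: the loop writes whatever the server sends back, so a request for an image that doesn't exist (zhizunvip.gif, going by the naming quirk described earlier) would still leave a .gif file on disk. Below is a minimal sketch of a guard, assuming the server answers a missing image with a non-200 status. `save_image` is a hypothetical helper, and the fake responses only mimic the `status_code` and `content` attributes of a requests response.

```python
import os
import tempfile
from types import SimpleNamespace  # used only to fake responses in the demo


def save_image(res, path):
    """Write res.content to path only if the request succeeded; return True if saved."""
    if res.status_code != 200:
        print(f"skipped {path}: HTTP {res.status_code}")
        return False
    with open(path, "wb") as f:
        f.write(res.content)
    return True


# Fake responses standing in for what requests.get() would return (demo only).
missing = SimpleNamespace(status_code=404, content=b"<html>not found</html>")
found = SimpleNamespace(status_code=200, content=b"GIF89a")

with tempfile.TemporaryDirectory() as tmp:
    print(save_image(missing, os.path.join(tmp, "zhizunvip.gif")))  # False
    print(save_image(found, os.path.join(tmp, "vip.gif")))          # True
```

In the real script you would pass the result of open_url(url) instead of the fakes.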
All done!
Acknowledgements
Thanks to zltzlt. Without him I would still be stuck in the get_Img() function.
(A bit embarrassing.)
Complete code:
- from pypinyin import lazy_pinyin as l
- from requests import get
- from bs4 import BeautifulSoup as BS
- def open_url(url):
-     headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
-     res = get(url, headers=headers)
-     return res
- def get_pinyin(name):  # convert each medal name to a list of pinyin syllables
-     pinyin = [l(each) for each in name]
-     return pinyin
- def zhizun():
-     url = 'https://fishc.com.cn/static/image/common/vip.gif'
-     res = open_url(url)
-     with open("zhizunvip.gif", "wb") as f:
-         f.write(res.content)
- def find_name(res):  # collect the medal names
-     name = []
-     soup = BS(res.text, "html.parser")
-     target = soup.find_all('p', class_='xw1')
-     for each in target:
-         name.append(each.text)
-     return name
- def get_Img():
-     res = open_url("https://fishc.com.cn/home.php?mod=medal")
-     name = find_name(res)
-     pinyin = get_pinyin(name)
-     for each in pinyin:
-         each = "".join(each).lower()  # each is a list, not a string, so join it
-         url = 'https://fishc.com.cn/static/image/common/%s.gif' % each
-         print(url)
-         res = open_url(url)
-         with open(f'{each}.gif', 'wb') as f:
-             f.write(res.content)
-     zhizun()  # fetch the 至尊VIP image from its real URL
- if __name__ == "__main__":
-     get_Img()
-     print("DONE!")