Scraping the Douban Movie Top 250 with the requests module and regular expressions
This post was last edited by wcq15759797758 on 2022-5-3 15:18
Scraper recap (part 1)
import re
import requests


def main(url):
    # Send a browser-like User-Agent with the request.
    headers = {
        'User-Agent': ('Mozilla/5.0 (compatible; MSIE 9.0; '
                       'Windows NT 6.1; Win64; x64; Trident/5.0)'),
    }
    respomse = requests.get(url=url, headers=headers)
    respomse.encoding = 'utf-8'
    html = respomse.text
    # Named groups capture the title, year, and rating; re.S lets '.' match newlines.
    # Assumes the year is followed by &nbsp; in the page source.
    obj = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
                     r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp;.*?<span class="rating_num" property="v:average">(?P<PF>.*?)</span>', re.S)
    resulf = obj.finditer(html)
    for i in resulf:
        # print(i.group('name'))
        # print(i.group('year').strip())
        # print(i.group('PF'))
        item = {}
        item['name'] = i.group('name')
        item['year'] = i.group('year').strip()
        item['rating'] = i.group('PF')
        print(item)


if __name__ == '__main__':
    # 250 entries at 25 per page: start = 0, 25, ..., 225
    for page in range(0, 250, 25):
        url = f'https://movie.douban.com/top250?start={page}'
        # main() builds its own headers, so only the URL is passed in.
        main(url=url)

666666666666666666

Was `response` written as `respomse` on purpose to annoy those of us with OCD?

copy and paste... but failed....

The last line had an extra `headers=headers` argument, didn't it? It wouldn't run until I deleted it.

Learned something!
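For anyone whose copy and paste failed, it may help to check the regular expression offline before hitting the site. Below is a minimal sketch, not the original poster's code: `SAMPLE_HTML` is a made-up fragment that only imitates the structure the pattern expects (the movie names, years, and ratings in it are placeholders), and `parse_movies` is a hypothetical helper name. It also assumes the year is followed by `&nbsp;` in the real page source.

```python
import re

# Made-up fragment imitating the list markup the pattern expects.
# The titles, years, and ratings are placeholders, not real data.
SAMPLE_HTML = """
<li>
  <div class="item">
    <span class="title">Movie A</span>
    <p class="">
      Director: Someone<br>
      1994&nbsp;/&nbsp;Country&nbsp;/&nbsp;Drama
    </p>
    <span class="rating_num" property="v:average">9.0</span>
  </div>
</li>
<li>
  <div class="item">
    <span class="title">Movie B</span>
    <p class="">
      Director: Someone Else<br>
      2001&nbsp;/&nbsp;Country&nbsp;/&nbsp;Comedy
    </p>
    <span class="rating_num" property="v:average">8.5</span>
  </div>
</li>
"""

# Same named groups as in the post; re.S lets '.' match across newlines.
PATTERN = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
    r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp;.*?'
    r'<span class="rating_num" property="v:average">(?P<PF>.*?)</span>',
    re.S,
)


def parse_movies(html):
    """Return one dict per list item matched by PATTERN."""
    movies = []
    for match in PATTERN.finditer(html):
        movies.append({
            'name': match.group('name'),
            'year': match.group('year').strip(),  # drop the newline and indent
            'rating': match.group('PF'),
        })
    return movies


if __name__ == '__main__':
    for movie in parse_movies(SAMPLE_HTML):
        print(movie)
```

If this prints one dict per sample item but the live run prints nothing, the problem is more likely the request (status code, headers) than the pattern.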