I want to scrape all 250 movies. How should I improve this code?
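As the screenshot shows, the code currently only scrapes the 25 movies on the first page. I know that in each page's ?start=0&filter= the "0" parameter goes up by 25 from one page to the next, so I could write:

num = 0
while num <= 250:
    print(num)
    num += 25

What I'm not clear on is how to substitute this number into the URL, and how to fetch the source of every page so that I get all 250 movies.

歌者文明清理员 posted on 2023-6-2 17:01
Post the code; typing it out by hand is tiring.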
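On the question of getting the number into the URL, here is one possible sketch. It assumes that the start value is the only thing that changes between pages, as described in the post. BASE_URL and page_urls are names made up for this illustration.

```python
# Hypothetical helper names; only the URL pattern comes from the post.
BASE_URL = "https://movie.douban.com/top250?start={start}&filter="


def page_urls(total=250, page_size=25):
    """Return one URL per page: start = 0, 25, ..., total - page_size."""
    return [BASE_URL.format(start=start) for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Because range(0, 250, 25) stops before 250, this gives ten pages and never asks for a start=250 page, which would be past the last of the 250 entries.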
import requests
import re
import csv

# Fetch the page content
url = "https://movie.douban.com/top250?start=25&filter="
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=headers)
page_content = resp.text

# Parse the page content
# ("人评价" is literal page text that follows the number of ratings)
obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>'
                 r'.*?<br>(?P<year>.*?) .*?<span class="rating_num" property="v:average">'
                 r'(?P<score>.*?)</span>.*?<span>(?P<num>.*?)人评价</span>', re.S)
result = obj.finditer(page_content)

f = open("date.csv", mode="w")
csvwriter = csv.writer(f)
for it in result:
    # print(it.group("name"))
    # print(it.group("score"))
    # print(it.group("num"))
    # print(it.group("year").strip())
    dic = it.groupdict()
    dic['year'] = dic['year'].strip()
    csvwriter.writerow(dic.values())
f.close()
print("over!")

小龟龙 posted on 2023-6-2 17:10
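For anyone unsure what the named groups in that pattern produce, here is a small offline sketch. The HTML below is made up for illustration and is much simpler than the real page, so treat it only as a demonstration of finditer and groupdict, not as a description of the site's markup.

```python
import re

# Same pattern as in the post above.
pattern = re.compile(
    r'<li>.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<br>(?P<year>.*?) .*?<span class="rating_num" property="v:average">'
    r'(?P<score>.*?)</span>.*?<span>(?P<num>.*?)人评价</span>', re.S)

# Made-up markup, NOT copied from the real site.
sample = (
    '<li><span class="title">Example Movie A</span>'
    '<p>Director X<br>1994 / Country / Genre</p>'
    '<span class="rating_num" property="v:average">9.7</span>'
    '<span>12345人评价</span></li>'
    '<li><span class="title">Example Movie B</span>'
    '<p>Director Y<br>1993 / Country / Genre</p>'
    '<span class="rating_num" property="v:average">9.6</span>'
    '<span>6789人评价</span></li>'
)

# Each match becomes a dict keyed by the group names.
rows = [m.groupdict() for m in pattern.finditer(sample)]
for row in rows:
    print(row)
```

Each dict has the keys name, year, score and num in that order, which is why writing dic.values() to the CSV gives the columns in a stable order.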
import requests
import re
import csv

num = 0
# start runs 0, 25, ..., 225 (10 pages of 25)
while num < 250:
    print(num)
    # Fetch the page content
    url = f"https://movie.douban.com/top250?start={num}&filter="
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    }
    resp = requests.get(url, headers=headers)
    page_content = resp.text
    # Parse the page content
    obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>'
                     r'.*?<br>(?P<year>.*?) .*?<span class="rating_num" property="v:average">'
                     r'(?P<score>.*?)</span>.*?<span>(?P<num>.*?)人评价</span>', re.S)
    result = obj.finditer(page_content)
    # Opened inside the loop: mode="w" truncates the file on every page
    f = open("date.csv", mode="w")
    csvwriter = csv.writer(f)
    for it in result:
        # print(it.group("name"))
        # print(it.group("score"))
        # print(it.group("num"))
        # print(it.group("year").strip())
        dic = it.groupdict()
        dic['year'] = dic['year'].strip()
        csvwriter.writerow(dic.values())
    num += 25
f.close()
print("over!")
歌者文明清理员 posted on 2023-6-2 17:13
OK, I'll give it a try.

Hey, that f ended up indented inside the loop, and the f.close at the very end reports that f is undefined. What should I do? Without the close I'm worried my IP will get banned before long.
小龟龙 posted on 2023-6-3 08:44
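One effect of opening the file inside the loop with mode="w" is that every page overwrites the one before it. The sketch below shows this with made-up rows and a temporary file, with no network involved. The row values and the demo.csv name are placeholders.

```python
import csv
import os
import tempfile

# Made-up rows standing in for two scraped pages.
pages = [
    [["Movie A", "1994", "9.7", "100"]],
    [["Movie B", "1993", "9.6", "200"]],
]

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

for rows in pages:
    # Reopening with mode="w" empties the file before writing.
    f = open(path, mode="w", newline="", encoding="utf-8")
    csv.writer(f).writerows(rows)
    f.close()

# Read the file back to see what survived.
with open(path, newline="", encoding="utf-8") as f:
    survived = list(csv.reader(f))

print(survived)
```

Only the rows from the last page remain, so the open call needs to happen once, before the loop starts.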
import requests
import re
import csv

# Open the output file once, before the loop
f = open("date.csv", mode="w")
csvwriter = csv.writer(f)
num = 0
# start runs 0, 25, ..., 225 (10 pages of 25)
while num < 250:
    print(num)
    # Fetch the page content
    url = f"https://movie.douban.com/top250?start={num}&filter="
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    }
    resp = requests.get(url, headers=headers)
    page_content = resp.text
    # Parse the page content
    obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>'
                     r'.*?<br>(?P<year>.*?) .*?<span class="rating_num" property="v:average">'
                     r'(?P<score>.*?)</span>.*?<span>(?P<num>.*?)人评价</span>', re.S)
    result = obj.finditer(page_content)
    for it in result:
        # print(it.group("name"))
        # print(it.group("score"))
        # print(it.group("num"))
        # print(it.group("year").strip())
        dic = it.groupdict()
        dic['year'] = dic['year'].strip()
        csvwriter.writerow(dic.values())
    num += 25
f.close()
print("over!")

Sorry, I didn't read the code carefully.
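As a variation on the same fix, a with block can take care of closing the file. This is only a sketch: save_all, fetch_rows, fake_fetch and delay are hypothetical names, and fake_fetch stands in for the real request and regex step so the example runs offline. Closing the file only affects the local CSV. How often the site is requested is a separate matter, which the optional delay parameter is meant to illustrate.

```python
import csv
import os
import tempfile
import time


def save_all(fetch_rows, path, total=250, page_size=25, delay=0.0):
    """Open the CSV once, then write each page's rows to it.

    fetch_rows is a hypothetical callable: start -> list of row lists.
    """
    count = 0
    # The with block closes the file even if a page raises an error.
    with open(path, mode="w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for start in range(0, total, page_size):
            rows = fetch_rows(start)
            writer.writerows(rows)
            count += len(rows)
            time.sleep(delay)  # optional pause between pages
    return count


# Stand-in for the real request + regex step: one fake row per page.
def fake_fetch(start):
    return [[f"movie-{start}", "2000", "9.0", "1"]]


out_path = os.path.join(tempfile.mkdtemp(), "date.csv")
written = save_all(fake_fetch, out_path)
print(written)
```

Passing newline="" follows the csv module's recommendation for files handed to csv.writer. The utf-8 encoding is an assumption made here so that movie titles are written the same way on every platform.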