Python web scraper

Dop.lop · posted on 2019-4-10 18:32:42

import re
import requests

# Fetch the page source and decode it as UTF-8 so the Chinese text displays correctly
response=requests.get('http://www.jianlaixiaoshuo.com/')
response.encoding='utf-8'
html=response.text
# Extract the block that holds every chapter's URL and title
dl=re.findall(r'<dl class="chapterlist">.*?</dl>',html,re.S)[0]

# Extract each chapter's URL and title
chapter_list=re.findall(r'<dd><a href="(.*?)"target="_blank">(.*?)</a></dd>',dl)
print(chapter_list)

Why does print(chapter_list) print an empty list []?

But if I change it to:

import re
import requests

# Fetch the page source and decode it as UTF-8 so the Chinese text displays correctly
response=requests.get('http://www.jianlaixiaoshuo.com/')
response.encoding='utf-8'
html=response.text
# Extract the block that holds every chapter's URL and title
dl=re.findall(r'<dl class="chapterlist">.*?</dl>',html,re.S)[0]

# Extract each chapter's URL (this pattern captures only the href)
href=re.findall(r'href="(.*?)" ',dl)
print(href)

then it prints the URLs.

minjun · posted on 2019-4-10 20:17:35

from selenium import webdriver
import re
import requests
import urllib
name=0
# Raw string so the backslashes in the Windows path are not treated as escapes
file_path=r'H:\爬取\妹子图'
hea={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.5006.400 QQBrowser/9.7.13114.400'}
def start_Firefox():
    driver =webdriver.Chrome()
    driver.start_client()
    return driver
#for i in range(1,99):
url='http://www.jianlaixiaoshuo.com/'
# driver=start_Firefox()
page=requests.get(url,headers=hea)
page.encoding='utf-8'
print(page.text)
# Note the space between the closing quote of href and target
titl=re.findall(r'href="(.*?)" target="_blank">(.*?)<',page.text)
# titl=re.findall(r'https://i.meizitu.net/thumbs/(.*?).jpg',page.text)
print(titl)
for each in titl:
    print(each)
This prints every link together with its chapter title; prepend the site's base URL to each link to get the full URL.
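
A minimal sketch of that last step, assuming the captured links are site-relative paths. The (href, title) pairs below are hypothetical placeholders in the shape re.findall returns, not values taken from the real page; urllib.parse.urljoin from the standard library does the joining.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.jianlaixiaoshuo.com/'

# Hypothetical (href, title) pairs, shaped like the tuples re.findall returns
chapters = [('/book/1.html', 'Chapter 1'), ('/book/2.html', 'Chapter 2')]

# urljoin resolves each relative href against the base URL
full_urls = [(urljoin(BASE_URL, href), title) for href, title in chapters]

for url, title in full_urls:
    print(url, title)
```

urljoin also leaves an href that is already absolute unchanged, so it is a little safer than plain string concatenation.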

Dop.lop · posted on 2019-4-10 23:01:38

minjun posted on 2019-4-10 20:17
from selenium import webdriver
import re
import requests

But I don't understand why. Each pattern extracts data on its own, so why does it stop working once I combine them?
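
One likely cause, judging only from the patterns posted in this thread: the combined pattern has no space between the closing quote of href and target, while both patterns that return data do have a space after that quote. The sketch below reproduces the difference on a hypothetical one-line sample; the real page markup may differ in other ways too (for example, whether each link is wrapped directly in dd tags).

```python
import re

# Hypothetical stand-in for one entry of the chapter list, not the real page source
sample = ('<dl class="chapterlist"><dd><a href="/book/1.html" '
          'target="_blank">Chapter 1</a></dd></dl>')

# Pattern from the first post: no space between the closing quote and target
no_space = re.findall(r'<dd><a href="(.*?)"target="_blank">(.*?)</a></dd>', sample)

# Same pattern with \s+ so one or more whitespace characters are accepted
with_space = re.findall(r'<dd><a href="(.*?)"\s+target="_blank">(.*?)</a></dd>', sample)

print(no_space)    # []
print(with_space)  # [('/book/1.html', 'Chapter 1')]
```

A regex has to match the markup character for character, so a single missing space is enough to turn every match into a miss.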

lassiter · posted on 2019-4-10 23:09:37

minjun posted on 2019-4-10 20:17
from selenium import webdriver
import re
import requests

Has the 妹子图 (meizitu) site added anti-scraping measures? Its images can no longer be scraped.
