FishC Forum (鱼C论坛)

Views: 1455 | Replies: 4

[Solved] select in BeautifulSoup

Posted on 2020-9-24 20:39:50

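I used the browser's "Copy selector" option and pasted the result into select(), but it returns an empty list.
I then trimmed the selector down as far as I could,
but I only get data once it is cut down to a single element (and even then it is not the data I want).
How can I fix this?

Example: scraping listing information from Xiaozhu short-term rentals (小猪短租).
I want to grab the text "天安门国贸双井.... 北京" and "北京市丰台区" (Tiananmen / Guomao / Shuangjing ... Beijing, and Fengtai District, Beijing).

(Copy selector output: body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em)
Only 'div em' and 'em' return any content.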

import re

import requests
from bs4 import BeautifulSoup

url = "https://bj.xiaozhu.com/"
headers = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36",
    "cookie":"abtest_ABTest4SearchDate=b; xzuuid=60edda4c; sajssdk_2015_cross_new_user=1; distinctId=174bb0e4812102-034d2bbd6334d-6373664-1327104-174bb0e481374; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22174bb0e4812102-034d2bbd6334d-6373664-1327104-174bb0e481374%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%22174bb0e47ee1d9-07ed0f67d639cd-6373664-1327104-174bb0e47f019e%22%7D; Hm_lvt_92e8bc890f374994dd570aa15afc99e1=1600866110; _uab_collina=160086611042232800553623; wttXMuWwbC=3bb7e13a2cada458f866cd09b3158a88e4a443f4; ATNgmRNkrw=1600866157; Hm_lpvt_92e8bc890f374994dd570aa15afc99e1=1600866160"
}
res = requests.get(url,headers=headers)
html = res.text

soup = BeautifulSoup(html,"html.parser")
titles = soup.select('div em')

print(titles)
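
One possible cause of the empty list: every `>` in a copied selector demands a direct parent/child relationship, so the whole chain fails as soon as the HTML that requests downloaded differs at any level from what the browser showed (the copied chain mentions div.pho_info, which may not even exist on the page being fetched). The sketch below runs on a made-up snippet, not the real xiaozhu.com markup; the extra `<section>` wrapper is an assumption added purely to show a strict chain failing while a looser descendant selector still matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for whatever requests actually downloaded.
SAMPLE_HTML = """
<html><body>
  <div class="wrap clearfix con_bg">
    <div class="con_l">
      <section>
        <div class="pho_info">
          <h4><em>Tiananmen Guomao Shuangjing</em></h4>
        </div>
      </section>
    </div>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Strict chain as copied from the browser: each ">" requires a direct child,
# so the unexpected <section> between div.con_l and div.pho_info breaks it.
strict = soup.select(
    "body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em"
)

# Looser selector: a space means "any descendant", so extra wrappers are tolerated.
loose = soup.select("div.pho_info em")

strict_count = len(strict)
loose_texts = [em.get_text(strip=True) for em in loose]

print(strict_count)
print(loose_texts)
```

Before adjusting the selector it is worth checking whether the target is in the response at all, for example with `'pho_info' in html` on the downloaded text. If that is False, no selector will find it in that response.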
Best answer
2020-9-25 10:39:39
Last edited by suchocolate on 2020-9-25 10:43

I haven't looked into soup in depth, but I know XPath well, so I wrote one for you.
import requests
from lxml import etree


def main():
    url = 'https://bj.xiaozhu.com/'
    headers = {'user-agent': 'firefox', 'cookie': 'abtest_ABTest4SearchDate=b; xzuuid=60edda4c; sajssdk_2015_cross_new_user=1; distinctId=174bb0e4812102-034d2bbd6334d-6373664-1327104-174bb0e481374; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22174bb0e4812102-034d2bbd6334d-6373664-1327104-174bb0e481374%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%22174bb0e47ee1d9-07ed0f67d639cd-6373664-1327104-174bb0e47f019e%22%7D; Hm_lvt_92e8bc890f374994dd570aa15afc99e1=1600866110; _uab_collina=160086611042232800553623; wttXMuWwbC=3bb7e13a2cada458f866cd09b3158a88e4a443f4; ATNgmRNkrw=1600866157; Hm_lpvt_92e8bc890f374994dd570aa15afc99e1=1600866160'}
    r = requests.get(url, headers=headers)
    # with open('r.txt', 'w', encoding='utf-8') as f:
    #     f.write(r.text)
    html = etree.HTML(r.text)
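    # Each listing is an <li> that carries a lodgeunitid attribute.
    lis = html.xpath('//li[@lodgeunitid]')
    for item in lis:
        # Paths starting with "./" are relative to the current <li>.
        title = item.xpath('./a/img/@title')[0]
        # normalize-space() trims and collapses whitespace; given a node-set,
        # it uses only the first text node.
        txt = item.xpath('normalize-space(./div[2]/div[2]/em//text())')
        price = item.xpath('./div[2]/div[1]/span/i/text()')[0]
        print(title, txt, price + ' per night')
        print('=' * 100)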


if __name__ == '__main__':
    main()
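
For anyone who would rather stay with BeautifulSoup, roughly the same extraction can be written with select(). This is only a sketch: the sample markup below is hypothetical, inferred from the XPath expressions above rather than taken from the real page, and `parse_listings` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup, inferred only from the XPath expressions above.
SAMPLE_HTML = """
<ul>
  <li lodgeunitid="lodgeunit_1">
    <a href="#"><img title="Cozy studio near Guomao" src="a.jpg"></a>
    <div>ignored</div>
    <div>
      <div><span><i>328</i></span></div>
      <div><em> Whole flat
         - 2 reviews </em></div>
    </div>
  </li>
</ul>
"""


def parse_listings(html):
    """Return one dict per <li lodgeunitid=...> element found in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # [lodgeunitid] is the CSS counterpart of the XPath test [@lodgeunitid].
    for item in soup.select("li[lodgeunitid]"):
        img = item.select_one("a > img")
        em = item.select_one("em")
        price = item.select_one("span > i")
        rows.append({
            "title": img.get("title") if img is not None else None,
            # split() + join collapses whitespace, like normalize-space().
            "info": " ".join(em.get_text().split()) if em is not None else None,
            "price": price.get_text(strip=True) if price is not None else None,
        })
    return rows


listings = parse_listings(SAMPLE_HTML)
print(listings)
```

The selectors are deliberately short (`em`, `span > i`) rather than a full copied chain, so a small layout change is less likely to empty the result.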
Want to know what 小甲鱼 has been up to lately? Visit -> ilovefishc.com

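Posted on 2020-9-25 10:07:25
I learned Beautiful Soup but never really got the hang of it.

Aren't XPath and re easy enough to use? If one approach doesn't work, just switch to another.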
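# Assumes `re` is imported and `html` holds the page source from the script in the first post.
pattern = r'<span class="result_title hiddenTxt">(.*?)</span>'
a = re.findall(pattern, html)
print(a)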
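
A small self-contained sketch of the same regex idea, run on a made-up snippet rather than the live page (only the class name is taken from the pattern above). It also shows one thing that can trip this approach up: without `re.S`, `.` does not match a newline, so a title broken across lines is silently skipped.

```python
import re

# Hypothetical snippet; the second title contains a line break on purpose.
SAMPLE_HTML = (
    '<span class="result_title hiddenTxt">Studio near Guomao</span>'
    '<span class="result_title hiddenTxt">Loft in\nShuangjing</span>'
)

PATTERN = r'<span class="result_title hiddenTxt">(.*?)</span>'

# Default flags: "." stops at newlines, so the second span is missed.
titles_default = re.findall(PATTERN, SAMPLE_HTML)

# re.S (DOTALL) lets "." cross newlines; whitespace is collapsed afterwards.
titles = [" ".join(t.split()) for t in re.findall(PATTERN, SAMPLE_HTML, re.S)]

print(titles_default)
print(titles)
```

The non-greedy `(.*?)` matters too: a greedy `(.*)` would run from the first opening tag to the last `</span>` in the text.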

Posted on 2020-9-25 11:14:01
If you really want to stick with bs4, you would probably look the data up with find or find_all anyway.
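
A minimal sketch of what that could look like, next to the equivalent select() calls for comparison. The markup is made up for illustration, and the class name `pr5` in particular is an assumption, not something confirmed from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "pr5" is an invented class name for the address span.
SAMPLE_HTML = """
<div class="pho_info">
  <h4><em>Tiananmen Guomao Shuangjing</em></h4>
  <p><span class="pr5">Fengtai District, Beijing</span></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match or None; class_ avoids the keyword "class".
info = soup.find("div", class_="pho_info")
first_em = info.find("em") if info is not None else None

# find_all() returns every match as a list (empty if nothing matches).
spans = soup.find_all("span", class_="pr5")

# The same lookups written as CSS selectors.
em_via_select = soup.select_one("div.pho_info em")
spans_via_select = soup.select("span.pr5")

title_text = first_em.get_text(strip=True) if first_em is not None else None
span_texts = [s.get_text(strip=True) for s in spans]

print(title_text)
print(span_texts)
```

Both styles walk the same parsed tree, so the choice is mostly a matter of taste. If find() returns None where select() returned an empty list, the element is most likely missing from the downloaded HTML.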

Original poster | Posted on 2020-9-25 19:52:03
suchocolate posted on 2020-9-25 10:39
I haven't looked into soup in depth, but I know XPath well, so I wrote one for you.
