[Solved] How to precisely extract names from already-scraped text with XPath or bs4

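景暄 · posted on 2020-10-11 22:32:56

Last edited by 景暄 on 2020-10-11 23:03

I used a crawler to pull the text containing the accounts I follow from Weibo's following page, but I can't extract the names precisely. If anyone knows how, could you show me how to do this with XPath, bs4, or a similar tool?
The text I want to extract is underlined in red in the image below.

This is what I wrote, but it returns an empty list: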

# soup is the BeautifulSoup object built from the fetched page
names = []  # collected screen names
targets = soup.find_all('div', class_="title W_fb W_autocut ")
for each in targets:
    names.append(each.a.text)
print(names)


This is the content the crawler extracted:

<script>FM.view({"ns":"pl.relation.myFollow.index","domid":"Pl_Official_RelationMyfollow__93","css":["style/css/module/pagecard/PCD_connectlist.css?version=825e5991ea0d00a7"],"js":"page/js/pl/relation/myFollow/index.js?version=84f2f62c1a6e1201","html":"<div class="WB_cardwrap S_bg2">\r\n <div class="PCD_connectlist PCD_connectlist_spe">\r\n <div class="WB_innerwrap">\r\n <div class="WB_tab_b" node-type="relationnav">\r\n <div class="opt_choose">\r\n <div class="inner S_line2 clearfix">\r\n <ul class="tab_ul tab_ul_s W_fl">\r\n <li class="tab_li"><span class="tab_item tab_cur S_line1 textcut"><span class="W_f14 S_txt1">全部关注<\/span><em class="attach S_txt1">2<\/em><em class="attach S_txt2" title=""><\/em><\/span><\/li>\r\n <\/ul>\r\n <\/div>\r\n <\/div>\r\n <div fixed-item="true">\r\n <div class="opt_bar clearfix S_bg2" node-type="navTools">\r\n <div class="W_fl">\r\n <a href="javascript:void(0);" class="btn_link S_txt1" action-type="batselect">批量管理<\/a>\r\n <a href="javascript:void(0);" class="btn_link S_txt1" node-type="sort_target">排序<em class="W_ficon ficon_arrow_down_lite S_ficon">g<\/em><\/a>\r\n <\/div>\r\n <div class="W_fr">\r\n <div class="search_box">\r\n <span class="WB_search_s"><input node-type="searchInput" type="text" value="输入昵称或备注" notice="输入昵称或备注" class="W_input"><span class="pos"><a href="javascript:void(0);" node-type="searchBtn" title="搜索" class="W_ficon ficon_search S_ficon">f<\/a><\/span><\/span>\r\n <\/div>\r\n <\/div>\r\n <\/div>\r\n <div class="opt_bar clearfix S_bg2" node-type="batnavTools" style="display:none">\r\n <div class="W_fl">\r\n <a href="javascript:void(0);" class="W_btn_b W_btn_b_disable" node-type="addToOtherGroupBtn" action-type="add_to_other_group">添加到<em class="W_ficon ficon_arrow_down_lite S_ficon">g<\/em><\/a>\r\n <a href="javascript:void(0);" class="W_btn_b W_btn_b_disable" node-type="cancelFollowBtn" action-type="cancel_follow_all">取消关注<\/a>\r\n <a href="javascript:void(0);" class="W_btn_b W_btn_b_disable" 
node-type="addSpecialBtn" action-type="add_special_all" suda-uatrack="key=weibo_pc_PostFollow_FollowList&value=FollowListHost_UserCard_SpeFol">添加特别关注<\/a>\r\n <a href="javascript:void(0);" class="W_btn_b" action-type="unbatselect">退出批量管理<\/a>\r\n <span style="display: none" node-type="select_text">\r\n <span class="text">已选择<em class="num" node-type="count_Num">0<\/em>人<\/span>\r\n <a href="javascript:void(0);" action-type="cancel_select">取消选择<\/a>\r\n <\/span>\r\n <\/div>\r\n <\/div>\r\n <\/div>\r\n <\/div>\r\n <div class="layer_menu_list" style="display:none;" node-type="sort_layer">\r\n <ul>\r\n <li class="cur"><a bpfilter="page" href="\/p\/1005057317518635\/myfollow?t=1#_0">全部关注<\/a><\/li>\r\n <li ><a bpfilter="page" href="\/p\/1005057317518635\/myfollow?ftype=1&t=1#_0">互相关注<\/a><\/li>\r\n <li class="line"><\/li>\r\n <\/ul>\r\n <ul >\r\n <li class="cur"><a bpfilter="page" href="\/p\/1005057317518635\/myfollow?t=1#_0">按关注时间排序<\/a><\/li>\r\n <li ><a bpfilter="page" href="\/p\/1005057317518635\/myfollow?t=2#_0">按昵称首字母排序<\/a><\/li>\r\n <li ><a bpfilter="page" href="\/p\/1005057317518635\/myfollow?t=3#_0">按最近更新排序<\/a><\/li>\r\n <li ><a bpfilter="page" href="\/p\/1005057317518635\/myfollow?t=4#_0">按最近联系排序<\/a><\/li>\r\n <li ><a bpfilter="page" href="\/p\/1005057317518635\/myfollow?t=5#_0">按粉丝数排序<\/a><\/li>\r\n <\/ul>\r\n <\/div>\r\n <div class="member_box" node-type="groupContainer">\r\n <ul class="member_ul clearfix" node-type="relation_user_list">\r\n <li class="member_li S_bg1" node-type="user_item" action-type="user_item" action-data="uid=5715470927&profile_image_url=https:\/\/tvax3.sinaimg.cn\/crop.0.0.828.828.50\/006eNx4Xly8gcdc4ek0t9j30n00n0gn3.jpg?KID=imgbed,tva&Expires=1602436541&ssig=i%2Fpst5IybB&gid=0&gname=未分组&screen_name=一大罐柚子酱&sex=f">\r\n <div class="member_wrap clearfix">\r\n <div class="mod_pic S_line1">\r\n <p class="pic_box"><a action-type="ignore_list" target="_blank" href="\/u\/5715470927?from=myfollow_all" class=""><img 
src="https:\/\/tvax3.sinaimg.cn\/crop.0.0.828.828.50\/006eNx4Xly8gcdc4ek0t9j30n00n0gn3.jpg?KID=imgbed,tva&Expires=1602436541&ssig=i%2Fpst5IybB" title="一大罐柚子酱" usercard="id=5715470927" width="50" height="50" alt="一大罐柚子酱" class="W_face_radius"><\/a><\/p>\r\n <\/div>\r\n <div class="mod_info">\r\n <div class="title W_fb W_autocut ">\r\n <a target="_blank" action-type="ignore_list" node-type="screen_name" href="\/u\/5715470927?from=myfollow_all" class="S_txt1" title="一大罐柚子酱" usercard="id=5715470927" >一大罐柚子酱<\/a>\r\n \t\t\t\t \t\t\t\t\t \t\t\t\t\t \t\t\t\t \t\t\t\t\r\n <\/div>\r\n <div class="statu">\r\n <em class="W_ficon ficon_addtwo S_ficon">Z<\/em><span class="S_txt1">互相关注<\/span>\r\n <\/div>\r\n <div class="text W_autocut S_txt2">\r\n 简介：事如春梦了无痕。 <\/div>\r\n <div class="info_from S_txt2">\r\n \t\t\t\t\t\t通过<a href="http:\/\/app.weibo.com\/t\/feed\/6vtZb0" class="S_link2" >微博 weibo.com<\/a>关注\t\t\t\t\t<\/div>\r\n <div class="opt">\r\n <p class="btn_bed">\r\n <a class="W_btn_b" action-data="gid=0&nick=一大罐柚子酱&uid=5715470927&sex=f" diss-data="refer_sort=relationManage&location=myfollow&refer_flag=add" action-type="relation_setGroup" node-type="setGroupBtn" href="javascript:void(0);" title="未分组">\r\n <span node-type="groupName" class="txt W_autocut">未分组<\/span>\r\n <em class="W_ficon ficon_arrow_down_lite S_ficon">g<\/em>\r\n <\/a>\r\n <a class="W_btn_b btn_spe" action-type="special_follow" href="javascript:void(0);" action-data="uids=5715470927">\r\n <em class="W_ficon S_ficon ficon_add">+<\/em>特别关注\r\n <\/a>\r\n <a class="W_btn_b btn_set" action-type="relation_hover"><em node-type="setGroupIcon" class="W_ficon ficon_setup S_ficon">J<\/em><\/a>\r\n <\/p>\r\n <div class="layer_menu_list layer_spe" style="display:none;position:absolute;z-index:99;" node-type="special_unFollow_list" action-type="special_unFollow_hover">\r\n <ul>\r\n <li><a href="javascript:void(0);" action-type="special_unFollow" action-data="remove=0">移出特别关注<\/a><\/li>\r\n <\/ul>\r\n <\/div>\r\n <div 
class="layer_menu_list" style="display:none;" node-type="layer_hover_list" action-type="relation_hover_more">\r\n <ul>\r\n \t <li><a href="javascript:void(0);" action-type="webim.conversation" action-data="uid=5715470927&nick=一大罐柚子酱">私信<\/a><\/li>\r\n <li><a href="javascript:void(0);" action-type="relation_setRemark" action-data="uid=5715470927">设置备注<\/a><\/li>\r\n <li><a href="javascript:void(0);" action-type="cancel_follow_single">取消关注<\/a><\/li>\r\n <\/ul>\r\n <\/div>\r\n <\/div>\r\n \r\n <\/div>\r\n <\/div>\r\n <div class="markup_choose"><\/div>\r\n <\/li>\r\n <li class="member_li S_bg1" node-type="user_item" action-type="user_item" action-data="uid=3069466401&profile_image_url=https:\/\/tvax3.sinaimg.cn\/crop.0.0.1080.1080.50\/b6f45721ly8ggzux431r6j20u00u00v4.jpg?KID=imgbed,tva&Expires=1602436541&ssig=znzklLNX0M&gid=0&gname=未分组&screen_name=DavidDWayne&sex=m">\r\n <div class="member_wrap clearfix">\r\n <div class="mod_pic S_line1">\r\n <p class="pic_box"><a action-type="ignore_list" target="_blank" href="\/u\/3069466401?from=myfollow_all" class=""><img src="https:\/\/tvax3.sinaimg.cn\/crop.0.0.1080.1080.50\/b6f45721ly8ggzux431r6j20u00u00v4.jpg?KID=imgbed,tva&Expires=1602436541&ssig=znzklLNX0M" title="DavidDWayne" usercard="id=3069466401" width="50" height="50" alt="DavidDWayne" class="W_face_radius"><\/a><\/p>\r\n <\/div>\r\n <div class="mod_info">\r\n <div class="title W_fb W_autocut ">\r\n <a target="_blank" action-type="ignore_list" node-type="screen_name" href="\/u\/3069466401?from=myfollow_all" class="S_txt1" title="DavidDWayne" usercard="id=3069466401" >DavidDWayne<\/a>\r\n \t\t\t\t <a target="_blank" href="\/\/verified.weibo.com\/verify"><i title= "微博个人认证 " class="W_icon icon_approve"><\/i><\/a> \t\t\t\t\t<a title="微博会员" target="_blank" href="https:\/\/vip.weibo.com\/personal?from=main" action-type="ignore_list"suda-uatrack="key=profile_head&value=member_guest"><em class="W_icon icon_member6"><\/em><\/a> \t\t\t\t\t \t\t\t\t \t\t\t\t\r\n <\/div>\r\n <div 
class="statu">\r\n <em class="W_ficon ficon_right S_ficon">Y<\/em><span class="S_txt1">已关注<\/span>\r\n <\/div>\r\n <div class="text W_autocut S_txt2">\r\n 设计美学博主 <\/div>\r\n <div class="info_from S_txt2">\r\n \t\t\t\t\t\t通过<a href="http:\/\/app.weibo.com\/t\/feed\/6c3EMN" class="S_link2" >头条文章<\/a>关注\t\t\t\t\t<\/div>\r\n <div class="opt">\r\n <p class="btn_bed">\r\n <a class="W_btn_b" action-data="gid=0&nick=DavidDWayne&uid=3069466401&sex=m" diss-data="refer_sort=relationManage&location=myfollow&refer_flag=add" action-type="relation_setGroup" node-type="setGroupBtn" href="javascript:void(0);" title="未分组">\r\n <span node-type="groupName" class="txt W_autocut">未分组<\/span>\r\n <em class="W_ficon ficon_arrow_down_lite S_ficon">g<\/em>\r\n <\/a>\r\n <a class="W_btn_b btn_spe" action-type="special_follow" href="javascript:void(0);" action-data="uids=3069466401">\r\n <em class="W_ficon S_ficon ficon_add">+<\/em>特别关注\r\n <\/a>\r\n <a class="W_btn_b btn_set" action-type="relation_hover"><em node-type="setGroupIcon" class="W_ficon ficon_setup S_ficon">J<\/em><\/a>\r\n <\/p>\r\n <div class="layer_menu_list layer_spe" style="display:none;position:absolute;z-index:99;" node-type="special_unFollow_list" action-type="special_unFollow_hover">\r\n <ul>\r\n <li><a href="javascript:void(0);" action-type="special_unFollow" action-data="remove=0">移出特别关注<\/a><\/li>\r\n <\/ul>\r\n <\/div>\r\n <div class="layer_menu_list" style="display:none;" node-type="layer_hover_list" action-type="relation_hover_more">\r\n <ul>\r\n \t <li><a href="javascript:void(0);" action-type="webim.conversation" action-data="uid=3069466401&nick=DavidDWayne">私信<\/a><\/li>\r\n <li><a href="javascript:void(0);" action-type="relation_setRemark" action-data="uid=3069466401">设置备注<\/a><\/li>\r\n <li><a href="javascript:void(0);" action-type="cancel_follow_single">取消关注<\/a><\/li>\r\n <\/ul>\r\n <\/div>\r\n <\/div>\r\n \r\n <\/div>\r\n <\/div>\r\n <div class="markup_choose"><\/div>\r\n <\/li>\r\n <\/ul>\r\n <\/div>\r\n 
<\/div>\r\n <\/div>\r\n <input type="hidden" node-type="hidden" action-data="is_special=0" value="allFollow" gname="0"\/>\r\n<\/div>\r\n"})</script>

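A possible reason for the empty list: in the dump above, the markup for the user list is not part of the page's own HTML. It sits inside a `<script>FM.view({...})</script>` call as the string value of an `"html"` key, so a parser run over the whole page sees only script text and no `<div>` elements. The sketch below decodes that string first and then collects the names. It assumes the raw response keeps the quotes inside `"html"` backslash-escaped (so the argument is valid JSON), and it uses the standard library's `html.parser` instead of bs4 only so that it runs without extra packages. The `page` sample, `ScreenNameParser`, and `extract_names` are illustrative names, not anything from Weibo.

```python
import json
import re
from html.parser import HTMLParser

# Tiny stand-in for the fetched page. Assumption: in the raw response the
# quotes inside "html" are backslash-escaped, making the argument valid JSON.
page = (
    r'<script>FM.view({"ns":"pl.relation.myFollow.index","html":"'
    r'<div class=\"title W_fb W_autocut \">\r\n'
    r' <a node-type=\"screen_name\" href=\"\/u\/5715470927\">一大罐柚子酱<\/a>\r\n'
    r'<\/div>\r\n'
    r'<div class=\"title W_fb W_autocut \">\r\n'
    r' <a node-type=\"screen_name\" href=\"\/u\/3069466401\">DavidDWayne<\/a>\r\n'
    r'<\/div>"})</script>'
)


class ScreenNameParser(HTMLParser):
    """Collects the text of every <a node-type="screen_name"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and dict(attrs).get('node-type') == 'screen_name':
            self._inside = True
            self.names.append('')

    def handle_data(self, data):
        if self._inside:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == 'a':
            self._inside = False


def extract_names(page_text):
    parser = ScreenNameParser()
    # Each FM.view(...) call carries one JSON object; decode it and feed
    # the embedded HTML fragment (if any) to the parser.
    for match in re.finditer(r'FM\.view\((\{.*?\})\)</script>', page_text, re.S):
        payload = json.loads(match.group(1))
        parser.feed(payload.get('html', ''))
    parser.close()
    return [name.strip() for name in parser.names]


print(extract_names(page))
```

Once the fragment has been decoded this way, it should also be possible to hand it to `BeautifulSoup(fragment, 'html.parser')` and run a `find_all` loop like the one in the question on that object instead of on the whole page.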

Best answer

疾风怪盗 · posted on 2020-10-11 22:32:57

If you want the list of followed accounts, try this URL; what it returns is JSON data.

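景暄 · posted on 2020-10-12 00:12:57

疾风怪盗 posted on 2020-10-11 23:34
If you want the list of followed accounts, try this URL; what it returns is JSON data.

With that URL I found text that is much easier to extract, but where is this page on Weibo? I couldn't find it.

疾风怪盗 · posted on 2020-10-12 00:47:38

景暄 posted on 2020-10-12 00:12
With that URL I found text that is much easier to extract, but where is this page on Weibo? I couldn't find it.

Scraping the web page is certainly not as convenient as this. It is just the followed-accounts list, so go and look for it.

This way you get JSON data, which is more convenient.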
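To show why JSON is easier to work with than the page markup, here is a minimal sketch that reads names straight out of a decoded response. The sample body is made up: only the field names (`users`, `screen_name`, `idstr`, `description`) appear in the code posted in this thread, and the real response may be shaped differently. `list_followed` is a hypothetical helper.

```python
import json

# Made-up response body; only the field names come from this thread.
sample_body = '''
{"users": [
  {"idstr": "5715470927", "screen_name": "一大罐柚子酱", "description": "事如春梦了无痕。"},
  {"idstr": "3069466401", "screen_name": "DavidDWayne", "description": "设计美学博主"}
]}
'''


def list_followed(body):
    """Return (uid, screen_name) pairs from a JSON response body."""
    data = json.loads(body)
    # .get() keeps the sketch from raising KeyError if a field is missing.
    return [(u.get('idstr', ''), u.get('screen_name', ''))
            for u in data.get('users', [])]


for uid, name in list_followed(sample_body):
    print(uid, name)
```

No class names or tag nesting are involved, so a change in the page layout would not break this kind of lookup.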

疾风怪盗 · posted on 2020-10-12 01:00:38

Last edited by 疾风怪盗 on 2020-10-12 01:06

Just swap in your own cookie:

import requests
import json

# The query string is passed through `params` instead of being hard-coded in the URL
url = 'https://weibo.com/ajax/friendships/friends'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38',
    'cookie': 'cookie',  # replace with your own cookie
}
params = {'page': '1', 'uid': '2641891937'}

response = requests.get(url=url, headers=headers, params=params)
html_str = response.content.decode()
print(html_str)

# Parse the JSON body and print the first followed account's screen name
data = json.loads(html_str)
print(data)
print(data['users'][0]['screen_name'])


Only the SUB= field of the cookie is needed; the data still comes back with the other fields removed.
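If you copy the whole cookie from the browser, trimming it down to that one field can be done with plain string handling. A rough sketch follows; `keep_cookie_fields` is a hypothetical helper and the sample values are made up.

```python
def keep_cookie_fields(cookie_header, wanted=('SUB',)):
    """Return a Cookie header string containing only the wanted fields."""
    kept = []
    for part in cookie_header.split(';'):
        # Split each "name=value" pair on the first "=" only,
        # since cookie values may themselves contain "=".
        name, sep, value = part.strip().partition('=')
        if sep and name in wanted:
            kept.append(f'{name}={value}')
    return '; '.join(kept)


# Made-up cookie string in the same "name=value; name=value" layout
sample = 'SINAGLOBAL=111.222; SUB=_2A25example; SUBP=0033example; wvr=6'
print(keep_cookie_fields(sample))
```

The names are compared exactly, so `SUBP` is not mistaken for `SUB`.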

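景暄 · posted on 2020-10-12 02:34:18

疾风怪盗 posted on 2020-10-11 23:34
If you want the list of followed accounts, try this URL; what it returns is JSON data.

Thanks! With your URL I got it working. Along the way Sina held me up for three hours because I hadn't set a cookie.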

from fake_useragent import UserAgent
import requests
from urllib.parse import urlencode
import json


def get_res(url):
    # url = 'https://www.weibo.com/ajax/friendships/friends?page=1&uid=5715470927'
    headers = {
        'User-Agent': UserAgent().chrome,
        'Cookie': 'SINAGLOBAL=2809735293383.7993.1584504483612; _s_tentry=-; Apache=6590299036100.631.1602408891862; ULV=1602408892284:5:1:1:6590299036100.631.1602408891862:1598306540800; un=18593193392; UOR=,,login.sina.com.cn; login_sid_t=3debbfe8ff2ffe4cb10d7dde5d215edb; cross_origin_proto=SSL; ALF=1633959683; SSOLoginState=1602423684; SCF=Av_D3h0SAn4M4dDvE11TJXoQFM3irgDs2FVu2udrE-s8uhlNeVAf_IkPHz2X_Id6dBjAwv5Fhyav-OWt4DcWdmQ.; SUB=_2A25yh3vVDeRhGeFN6lUU8SbKyDmIHXVR9eodrDV8PUNbmtAKLRHFkW9NQHRX7Hg8sFQ4-U_aiRRwtDREDTH0chCt; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWbEK6NwBhzUaJ9YLxuoqUz5JpX5KzhUgL.FoM0eKMfeKnce0-2dJLoIEBLxKnL1h5L1h2LxKBLBonL1-eLxKnLBK-LB.qLxK-L1KeL1Kzt; SUHB=09P9wfR2GiQh3v; wvr=6; XSRF-TOKEN=hoHjMief6hcEGVgqALnncHJf; WBPSESS=Z6ejeHjAsRNOBei-TAIcJwO6JUmrCOPlIXNnyDYgWSpdLL2GYqyBLhZHLnGgWxjTGdgKi0O7E2XWyERbot0iMyppvEqNwFXWAXoIcy43Lf_rLzz8OYiN6nr5DddwMF-9; webim_unReadCount=%7B%22time%22%3A1602436950706%2C%22dm_pub_total%22%3A2%2C%22chat_group_client%22%3A0%2C%22chat_group_notice%22%3A0%2C%22allcountNum%22%3A39%2C%22msgbox%22%3A0%7D'
    }
    response = requests.get(url, headers=headers)
    return response


def get_users(response, page):
    # Decode the JSON body and return its list of followed users
    info = response.content.decode('utf-8', 'ignore')
    follow_json = json.loads(info)
    print('ok')
    users = follow_json['users']
    return users


def main():
    content = input('Enter the uid of the user: ')
    num = input('Enter the number of pages to download: ')
    base_url = 'https://www.weibo.com/ajax/friendships/friends?{}'
    with open('张艺兴微博关注列表试做型初号机.txt', 'w', encoding='utf-8') as f:
        for i in range(int(num)):
            page = i + 1
            args = {
                'page': page,
                'uid': content
            }
            args = urlencode(args)
            url = base_url.format(args)
            print(url)
            response = get_res(url)
            print('Fetching page %d' % page)
            users = get_users(response, page)
            f.write('Page %d\n' % page)
            # One record per followed user, followed by a separator line
            for each in users:
                f.write('Name:' + each['name'] + '\n')
                f.write('Uid:' + each['idstr'] + '\n')
                f.write('Description:' + each['description'] + '\n')
                f.write('--------------------------------------\n\n')


if __name__ == '__main__':
    main()

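The script above asks for a fixed number of pages. One possible variation is to keep requesting pages until one comes back empty. The sketch below assumes an empty `users` list means there is nothing left, which is a guess about the endpoint and not something confirmed in this thread. `build_url`, `iter_users`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real `requests` call so the sketch runs offline.

```python
import time
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://www.weibo.com/ajax/friendships/friends?{}'


def build_url(uid, page):
    """Build the request URL for one page, as the script above does."""
    return BASE_URL.format(urlencode({'page': page, 'uid': uid}))


def iter_users(uid, fetch_json, max_pages=50, delay=1.0):
    """Yield user dicts page by page until an empty page or max_pages."""
    for page in range(1, max_pages + 1):
        data = fetch_json(build_url(uid, page))
        users = data.get('users') or []
        if not users:  # assumption: an empty page means nothing is left
            break
        yield from users
        time.sleep(delay)  # pause between pages as a courtesy to the server


# Offline stand-in for the network call; the data is made up.
FAKE_PAGES = {
    '1': {'users': [{'name': '一大罐柚子酱', 'idstr': '5715470927'}]},
    '2': {'users': [{'name': 'DavidDWayne', 'idstr': '3069466401'}]},
}


def fake_fetch(url):
    page = parse_qs(urlsplit(url).query)['page'][0]
    return FAKE_PAGES.get(page, {'users': []})


names = [u['name'] for u in iter_users('5715470927', fake_fetch, delay=0)]
print(names)
```

To use it for real, `fake_fetch` could be replaced by a function that calls `get_res(url)` from the script above and returns the decoded JSON.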

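疾风怪盗 · posted on 2020-10-12 10:43:34

景暄 posted on 2020-10-12 02:34
Thanks! With your URL I got it working. Along the way Sina held me up for three hours because I hadn't set a cookie.

The cookie doesn't need to be that long. The SUB= field alone is enough; give it a try, the data still comes back with the other fields removed.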
