[Solved] Web scraper

哈岁NB · posted on 2023-3-5 13:14:36

Why does this time come back empty?

import requests
from lxml import etree

# Browser-like User-Agent so the site serves the normal page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}
url = 'http://guba.eastmoney.com/news,cjpl,918454004.html'

# Download the raw HTML source (no JavaScript is executed here)
response_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(response_text)

# XPath for the time element
data = tree.xpath('//*[@id="line2"]/div[1]/span[2]/text()')
print(data)  # this is the value that comes back empty

liuhongrun2022 · posted on 2023-3-5 13:15:30

He'll show up even without being called.

@isdkz

isdkz · posted on 2023-3-5 13:34:06

liuhongrun2022 wrote on 2023-3-5 13:15:
He'll show up even without being called.
@isdkz

I saw it, but I was just about to go eat. I'll take a look when I get back.

哈岁NB · posted on 2023-3-5 13:39:19

liuhongrun2022 wrote on 2023-3-5 13:15:
He'll show up even without being called.
@isdkz

哈岁NB · posted on 2023-3-5 13:39:33

isdkz wrote on 2023-3-5 13:34:
I saw it, but I was just about to go eat. I'll take a look when I get back.

OK.

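歌者文明清理员 · posted on 2023-3-5 13:43:49

哈岁NB wrote on 2023-3-5 13:39:
OK.

I have the same problem, except that in my case it's find that doesn't locate the element. @isdkz, you're the expert here...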

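isdkz · posted on 2023-3-5 14:05:25 (best answer)

That element is rendered dynamically by JS. requests does not fetch a page's static resources (JS, CSS) or run them;

it only requests the page's HTML source, so the time element is simply not in what you get back.

Here is a way to tell whether something is rendered by JS: disable JS in the browser (for example from the developer tools or the site settings) and refresh the page. Any element that disappears was rendered by JS.

There is no really convenient fix for this. You have to let a browser do the rendering for you, using an automation library such as selenium, pyppeteer, or playwright.

You can also use requests-html, which is essentially requests combined with pyppeteer and was written by the original author of requests.

Another option is to fetch the JS yourself, execute it, and render the result into the HTML, but that is even harder.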
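
A minimal sketch of the browser-rendering approach, assuming Playwright is the library chosen (selenium or pyppeteer would follow the same pattern). The helper names `extract_time` and `fetch_rendered_html` and the two HTML strings are made up for illustration; only the XPath comes from the question. The Playwright import is kept inside the function so the rest of the snippet runs without it installed.

```python
from lxml import etree

# XPath taken from the question at the top of the thread
TIME_XPATH = '//*[@id="line2"]/div[1]/span[2]/text()'


def extract_time(page_html):
    """Run the XPath from the question against an HTML string."""
    return etree.HTML(page_html).xpath(TIME_XPATH)


def fetch_rendered_html(url):
    """Load the page in headless Chromium and return the DOM after JS has run.

    Assumes Playwright is installed:
        pip install playwright
        playwright install chromium
    """
    from playwright.sync_api import sync_playwright  # imported lazily on purpose

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        rendered = page.content()  # serialized DOM, including JS-inserted nodes
        browser.close()
    return rendered


# Made-up HTML for illustration: what requests sees vs. what the browser ends up with
static_html = '<div id="line2"><div><span>posted</span><span></span></div></div>'
rendered_html = '<div id="line2"><div><span>posted</span><span>2023-03-05</span></div></div>'

print(extract_time(static_html))    # empty: the span has no text in the raw source
print(extract_time(rendered_html))  # the text is present once JS has filled it in
```

In real use one would call `extract_time(fetch_rendered_html(url))`. A page that fills in the element late may also need an explicit wait, such as `page.wait_for_selector("#line2")`, before reading `page.content()`.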

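哈岁NB · posted on 2023-3-5 14:11:11

isdkz wrote on 2023-3-5 14:05:
That element is rendered dynamically by JS. requests does not fetch a page's static resources (JS, CSS) or run them;

it only requests ...

Could you please take a look at this code for me? There are a few places that raise errors, and I don't know what to fill in.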

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 15 15:02:26 2023
@author: Neal
shareholder information of a stock are listed in :
https://q.stock.sohu.com/cn/000001/ltgd.shtml
https://q.stock.sohu.com/cn/000002/ltgd.shtml
https://q.stock.sohu.com/cn/000003/ltgd.shtml
...
1. 'stock' - stock code
2. 'rank' - rank
3. 'org_name' - shareholder name
4. 'shares' - shares held (10,000 shares)
5. 'percentage' - shareholding percentage
6. 'changes' - change in holdings (10,000 shares)
7. 'nature' - nature of shares
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

fake_header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-CN;q=0.2"
}
data_file = './data/stock_shareholders.csv'
select_stocks = ('601186', '601169', '601166', '601088', '601006', '600523',
                 '600999', '601988', '600919', '600887', '600837', '600606', '600547',
                 '600519', '600518', '600485', '600340', '601881', '600104', '600100')
print('There are', len(select_stocks), 'stocks in select_stocks')
base_url = 'https://q.stock.sohu.com/cn/{}/ltgd.shtml'
row_count = 0
results = []
for stock in select_stocks:
    url = base_url.format(stock)
    print("Now we are crawling stock", stock)
    response = requests.get(url, headers=fake_header)
    if response.status_code == 200:
        response.encoding = 'gbk'
        root = BeautifulSoup(response.text, "html.parser")
        # search the table storing the shareholder information
        table = root.select('body > div.str2Column.clearfix > div.str2ColumnR > div.BIZ_innerMain > div.BIZ_innerBoard > div > div:nth-child(2) > table > tbody > tr:nth-child(2) > td:nth-child(2) > a')  # ++insert your code here++
        print(table)
        rows =  # left blank in the exercise
        for row in rows:
            record = [stock, ]
            columns =  # left blank in the exercise
            for col in columns:  # iterate columns
                record.append(col.get_text().strip())
            if len(record) == 7:
                row_count += 1
        time.sleep(1)
print('Crawled and saved {} records of shareholder information of select_stocks to {}'.format(row_count, data_file))
sharehold_records_df = pd.DataFrame(columns=['stock', 'rank', 'org_name', 'shares', 'percentage', 'changes', 'nature'], data=results)
sharehold_records_df.to_excel("./data/sharehold_records.xlsx")
print("List of shareholders are \n", sharehold_records_df['org_name'])

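哈岁NB · posted on 2023-3-5 14:14:33

isdkz wrote on 2023-3-5 14:05:
That element is rendered dynamically by JS. requests does not fetch a page's static resources (JS, CSS) or run them;

it only requests ...

The table on line 52 is supposed to hold the shareholder information, but what I wrote still returns empty. I don't know what to fill in on lines 55 and 58; I don't understand them.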

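isdkz · posted on 2023-3-5 14:25:40

哈岁NB wrote on 2023-3-5 14:14:
The table on line 52 is supposed to hold the shareholder information, but what I wrote still returns empty. I don't know what to fill in on lines 55 and 58; I don't understand them.

One pitfall here: when you copy a selector path from the browser and it passes through a table, it usually includes a tbody. The browser adds that automatically when it renders the page.

Page authors often leave the tbody out of their source to save effort (the tag is optional in HTML), so it is frequently absent from what requests downloads.

If you remove the tbody from the path, you can usually get the element, unless some dynamically rendered elements are also involved.

As for lines 55 and 58, I don't understand what they're meant to do either. Is this a fill-in-the-blanks programming exercise?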
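
The tbody pitfall can be reproduced offline. This is a small sketch on a made-up table (the `raw_html` string and the holder name are invented for illustration), parsed with the same `html.parser` backend the thread's code uses, which does not insert a tbody the way a browser does.

```python
from bs4 import BeautifulSoup

# Made-up table source, written without <tbody> as many pages are
raw_html = """
<table id="holders">
  <tr><th>rank</th><th>org_name</th></tr>
  <tr><td>1</td><td><a href="#">Example Holdings</a></td></tr>
</table>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Path as copied from the browser's rendered DOM: includes tbody
with_tbody = soup.select("table > tbody > tr:nth-child(2) > td:nth-child(2) > a")

# Same path with tbody removed: matches the source as downloaded
without_tbody = soup.select("table > tr:nth-child(2) > td:nth-child(2) > a")

print(len(with_tbody))     # 0
print(len(without_tbody))  # 1
```

Using a descendant combinator (`table tr:nth-child(2)`) instead of `table > tr` is another option, since it matches whether or not a tbody sits in between.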

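哈岁NB · posted on 2023-3-5 14:28:40

isdkz wrote on 2023-3-5 14:25:
One pitfall here: when you copy a selector path from the browser and it passes through a table, it usually includes a tbody. The browser adds that when it renders ...

Yes, it is. Take a look at this version; it has the English comments.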

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 15 15:02:26 2023
@author: Neal
shareholder information of a stock are listed in :
https://q.stock.sohu.com/cn/000001/ltgd.shtml
https://q.stock.sohu.com/cn/000002/ltgd.shtml
https://q.stock.sohu.com/cn/000003/ltgd.shtml
...
And you are required to collect the tables of shareholder information for stocks in "select_stocks"
with following 7 columns, and then perform the analysis to answer the questions.
1. 'stock' - stock code
2. 'rank' - rank
3. 'org_name' - shareholder name
4. 'shares' - shares held (10,000 shares)
5. 'percentage' - shareholding percentage
6. 'changes' - change in holdings (10,000 shares)
7. 'nature' - nature of shares
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

fake_header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-CN;q=0.2"
}
data_file = './data/stock_shareholders.csv'
select_stocks = ('601186', '601169', '601166', '601088', '601006', '600523',
                 '600999', '601988', '600919', '600887', '600837', '600606', '600547',
                 '600519', '600518', '600485', '600340', '601881', '600104', '600100')
print('There are', len(select_stocks), 'stocks in select_stocks')
base_url = 'https://q.stock.sohu.com/cn/{}/ltgd.shtml'
row_count = 0
# create a list to store the crawled shareholding records
results = []
for stock in select_stocks:  # process stock one by one
    # prepare the request webpage with desired parameters
    url = base_url.format(stock)
    print("Now we are crawling stock", stock)
    # send http request with fake http header
    response = requests.get(url, headers=fake_header)
    if response.status_code == 200:
        response.encoding = 'gbk'  # ++insert your code here++ look for charset in html
        root = BeautifulSoup(response.text, "html.parser")
        # search the table storing the shareholder information
        table = root.select('body > div.str2Column.clearfix > div.str2ColumnR > div.BIZ_innerMain > div.BIZ_innerBoard > div > div:nth-child(2) > table tr:nth-child(2) > td:nth-child(2) > a')  # ++insert your code here++
        print(table)
        # list all rows the table, i.e., tr tags
        rows =  # ++insert your code here++
        for row in rows:  # iterate rows
            record = [stock, ]  # define a record with stock pre-filled and then store columns of the row/record
            # list all columns of the row, i.e., td tags
            columns =  # ++insert your code here++
            for col in columns:  # iterate columns
                record.append(col.get_text().strip())
            if len(record) == 7:  # if has valid columns, save the record to list results
                # ++insert your code here++ to add single "record" to list of "records"
                row_count += 1
        time.sleep(1)
print('Crawled and saved {} records of shareholder information of select_stocks to {}'.format(row_count, data_file))
sharehold_records_df = pd.DataFrame(columns=['stock', 'rank', 'org_name', 'shares', 'percentage', 'changes', 'nature'], data=results)
sharehold_records_df.to_excel("./data/sharehold_records.xlsx")
print("List of shareholders are \n", sharehold_records_df['org_name'])
# ++insert your code here++ to answer Q3-1, Q3-2 and Q3-3

isdkz · posted on 2023-3-5 14:37:52

Last edited by isdkz on 2023-3-5 14:49

哈岁NB wrote on 2023-3-5 14:28:
Yes, it is. Take a look at this version; it has the English comments.

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 15 15:02:26 2023
@author: Neal
shareholder information of a stock are listed in :
https://q.stock.sohu.com/cn/000001/ltgd.shtml
https://q.stock.sohu.com/cn/000002/ltgd.shtml
https://q.stock.sohu.com/cn/000003/ltgd.shtml
...
And you are required to collect the tables of shareholder information for stocks in "select_stocks"
with following 7 columns, and then perform the analysis to answer the questions.
1. 'stock' - stock code
2. 'rank' - rank
3. 'org_name' - shareholder name
4. 'shares' - shares held (10,000 shares)
5. 'percentage' - shareholding percentage
6. 'changes' - change in holdings (10,000 shares)
7. 'nature' - nature of shares
"""
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

fake_header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-CN;q=0.2"
}
data_file = './data/stock_shareholders.csv'
select_stocks = ('601186', '601169', '601166', '601088', '601006', '600523',
                 '600999', '601988', '600919', '600887', '600837', '600606', '600547',
                 '600519', '600518', '600485', '600340', '601881', '600104', '600100')
print('There are', len(select_stocks), 'stocks in select_stocks')
base_url = 'https://q.stock.sohu.com/cn/{}/ltgd.shtml'
row_count = 0
# create a list to store the crawled shareholding records
results = []
for stock in select_stocks:  # process stock one by one
    # prepare the request webpage with desired parameters
    url = base_url.format(stock)
    print("Now we are crawling stock", stock)
    # send http request with fake http header
    response = requests.get(url, headers=fake_header)
    if response.status_code == 200:
        response.encoding = 'gbk'  # ++insert your code here++ look for charset in html
        root = BeautifulSoup(response.text, "html.parser")
        # search the table storing the shareholder information
        table = root.select_one('body > div.str2Column.clearfix > div.str2ColumnR > div.BIZ_innerMain > div.BIZ_innerBoard > div > div:nth-child(2) > table')  # ++insert your code here++
        print(table)
        # list all rows the table, i.e., tr tags (empty list if the table was not found)
        rows = table.select('tr') if table is not None else []  # ++insert your code here++
        for row in rows:  # iterate rows
            record = [stock, ]  # define a record with stock pre-filled and then store columns of the row/record
            # list all columns of the row, i.e., td tags
            columns = row.select('td')  # ++insert your code here++
            for col in columns:  # iterate columns
                record.append(col.get_text().strip())
            if len(record) == 7:  # if has valid columns, save the record to list results
                # ++insert your code here++ to add single "record" to list of "records"
                results.append(record)
                row_count += 1
        time.sleep(1)
print('Crawled and saved {} records of shareholder information of select_stocks to {}'.format(row_count, data_file))

# If there is no data folder in your working directory, add this code to avoid an error
import os
if not os.path.exists('data'):
    os.mkdir('data')

sharehold_records_df = pd.DataFrame(columns=['stock', 'rank', 'org_name', 'shares', 'percentage', 'changes', 'nature'], data=results)
sharehold_records_df.to_excel("./data/sharehold_records.xlsx")
print("List of shareholders are \n", sharehold_records_df['org_name'])
# ++insert your code here++ to answer Q3-1, Q3-2 and Q3-3

哈岁NB · posted on 2023-3-5 14:48:02

isdkz wrote on 2023-3-5 14:37:

What's going on here?

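isdkz · posted on 2023-3-5 14:51:49

哈岁NB wrote on 2023-3-5 14:48:
What's going on here?

It's because there is no data folder in your current working directory,

and the line sharehold_records_df.to_excel("./data/sharehold_records.xlsx") saves into that data folder.

I've also added the corresponding code to the code in my reply.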
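
As a variation on the exists-then-mkdir check in the reply's code, the standard library's `os.makedirs` with `exist_ok=True` does both steps in one call. The helper name `ensure_output_dir` is made up for this sketch, and a temporary directory stands in for the real working directory.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrated inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as workdir:
    data_dir = ensure_output_dir(os.path.join(workdir, "data"))
    ensure_output_dir(data_dir)  # calling it a second time is harmless
    created = os.path.isdir(data_dir)

print(created)
```

In the script above, this would be called with `'data'` before the `to_excel` line, wherever in the file that line ends up.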

哈岁NB · posted on 2023-3-5 14:53:03

isdkz wrote on 2023-3-5 14:37:

Solved, solved. I had put the code in the wrong place.

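哈岁NB · posted on 2023-3-5 15:10:16

isdkz wrote on 2023-3-5 14:51:
It's because there is no data folder in your current working directory,

sharehold_records_df.to_excel("./data/sharehold_reco ...

Does bs4 only work by drilling down one level at a time like this? Can't it get there in one step the way XPath does?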

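isdkz · posted on 2023-3-5 15:20:42

哈岁NB wrote on 2023-3-5 15:10:
Does bs4 only work by drilling down one level at a time like this? Can't it get there in one step the way XPath does?

No, it can also locate a specific element directly.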

哈岁NB · posted on 2023-3-5 15:26:30

isdkz wrote on 2023-3-5 15:20:
No, it can also locate a specific element directly.

Then how should I write it to extract the shareholder information directly?

isdkz · posted on 2023-3-5 15:27:47

哈岁NB wrote on 2023-3-5 15:26:
Then how should I write it to extract the shareholder information directly?

.select('body > div.str2Column.clearfix > div.str2ColumnR > div.BIZ_innerMain > div.BIZ_innerBoard > div > div:nth-child(2) > table > tr:nth-child(2) > td:nth-child(2) > a')

哈岁NB · posted on 2023-3-5 15:33:00

isdkz wrote on 2023-3-5 15:27:
.select('body > div.str2Column.clearfix > div.str2ColumnR > div.BIZ_innerMain > div.BIZ_innerBoard ...

Doesn't that output the whole tag with everything in it? What if I only want the tag's text?
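
One way to get only the text, sketched on a made-up fragment (the `sample_html` string and holder name are invented for illustration): `select()` returns a list of Tag objects, and calling `get_text()` on each one, as the thread's own code already does for table cells, yields just the text inside it.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like one row of a shareholder table
sample_html = """
<table>
  <tr><th>rank</th><th>org_name</th></tr>
  <tr><td>1</td><td><a href="/holder/1"> Example Holdings </a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
links = soup.select("table > tr:nth-child(2) > td:nth-child(2) > a")

# select() returns Tag objects; get_text() keeps only the text inside each one
names = [link.get_text(strip=True) for link in links]
print(names)
```

When only one element is expected, `select_one(...)` returns a single Tag (or `None` if nothing matches), so the result should be checked before calling `get_text()` on it.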
