python羊 posted on 2021-5-21 15:42:38

How do I extract this content?

Last edited by python羊 on 2021-5-21 15:44

URL: https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754




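The content I want to extract is everything on line 4432 of the page source, as shown in the image below:


All I want is this one piece of data, so why is the page source so long?
There may be a faster way to do this; any pointers would be appreciated. Thanks.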


My code:
——————————————
import requests

import re


s = requests.Session()

headers={
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36',
}

url_end = 'https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754'

r_end =s.get(url_end,headers=headers)

r_end_str = r_end.text

content_list=re.findall('<span style="display:none;" name="xmlcontent" id="xmlcontent">"(.*?)"</span>',r_end_str)

print(content_list)

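Twilight6 posted on 2021-5-21 15:42:39

python羊 posted on 2021-5-21 16:42
Page source excerpt:




I think bs4 is quicker here. I'm not great with re, and I'm not sure how to handle the many nodes inside the span.

Reference code using re (tags not stripped):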
import requests

import re


s = requests.Session()

headers={
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36',
}

url_end = 'https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754'

r_end =s.get(url_end,headers=headers)

r_end_str = r_end.text

content_list = re.findall('<REPORT_CHECK_RESPONSE>(.+)</REPORT_CHECK_RESPONSE>',r_end_str)

print(content_list)
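
One thing worth checking with the regex approach: by default `.` does not match a newline, so if the payload is wrapped across several lines (as it is in the source pasted later in this thread), a pattern like `(.+)` finds nothing. The following is a minimal offline sketch, not a test against the live site; `sample` is a made-up, shortened stand-in for the real page source.

```python
import re

# Hypothetical, shortened stand-in for the real page source (not fetched from the site).
sample = (
    '<span style="display:none;" name="xmlcontent" id="xmlcontent">\n'
    '  <REPORT_CHECK_RESPONSE><STATUS>SUCCESS</STATUS><GIRDLE_CODE>THN to \n'
    'VTK</GIRDLE_CODE></REPORT_CHECK_RESPONSE>\n'
    '</span>'
)

pattern = r'<REPORT_CHECK_RESPONSE>(.+?)</REPORT_CHECK_RESPONSE>'

# Without DOTALL, "." stops at the newline inside the payload, so nothing matches.
single_line = re.findall(pattern, sample)

# With DOTALL, "." also matches "\n", so the whole wrapped payload is captured.
multi_line = re.findall(pattern, sample, re.DOTALL)

print(single_line)
print(multi_line)
```

If the real page keeps the whole payload on one line, both calls return the same result, so passing `re.DOTALL` (or its short form `re.S`) costs nothing and guards against the wrapped case.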

Reference code using bs4:
import requests
from bs4 import BeautifulSoup


s = requests.Session()

headers={
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36',
}

url_end = 'https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754'

r_end =s.get(url_end,headers=headers)

r_end_str = r_end.text
soup = BeautifulSoup(r_end_str,'lxml')

# find() returns a single Tag; find_all() returns a ResultSet, which has no .text
content_list = soup.find("span", id="xmlcontent").text

print(content_list)

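笨鸟学飞 posted on 2021-5-21 16:02:24

I can't open the site at all.
You can right-click and choose Inspect to find the tag node that holds the content you need, then extract it with XPath. Regex isn't always the best tool for this.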

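python羊 posted on 2021-5-21 16:25:33

笨鸟学飞 posted on 2021-5-21 16:02
I can't open the site at all.
You can right-click and choose Inspect to find the tag node that holds the content you need, then extract it with XPath. Regex isn't always ...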

https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754

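python羊 posted on 2021-5-21 16:42:26

Last edited by python羊 on 2021-5-21 16:44

笨鸟学飞 posted on 2021-5-21 16:02
I can't open the site at all.
You can right-click and choose Inspect to find the tag node that holds the content you need, then extract it with XPath. Regex isn't always ...

Page source excerpt: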

<a id="main-content"></a>


    <div class="report-check-survey"><p>How can we improve Report Check?<a href="https://www.giasurveys.com/se/705E3F7C429C7CA8" target="_blank">Take this quick survey</a>.</p></div>

<!-- No Match -->
<section id='no-match'>
   
      <div class='content'>
      <form action='/sites/Satellite' class='search-box report-lookup-form' method='GET'>
      <span style="display:none;" name="xmlcontent" id="xmlcontent">
   
      
      

      
      <REPORT_CHECK_RESPONSE><STATUS>SUCCESS</STATUS><ERROR_DTLS><ERROR_CODE></ERROR_CODE><ERROR_MSG></ERROR_MSG></ERROR_DTLS><REPORT_DTLS><REPORT_DTL><MESSAGE></MESSAGE><LENGTH>4.46 x 4.38 x 3.19 mm</LENGTH><WIDTH>4.46 x 4.38 x 3.19 mm</WIDTH><DEPTH>4.46 x 4.38 x 3.19 mm</DEPTH><WEIGHT>0.55</WEIGHT><REPORT_NO>6342219172</REPORT_NO><COLOR>E</COLOR><COLOR_DESCRIPTIONS></COLOR_DESCRIPTIONS><CLARITY>SI1</CLARITY><FINAL_CUT></FINAL_CUT><DEPTH_PCT>72.8</DEPTH_PCT><TABLE_PCT>72</TABLE_PCT><CRN_AG></CRN_AG><CRN_HT></CRN_HT><PAV_AG></PAV_AG><PAV_DP></PAV_DP><STR_LN></STR_LN><LR_HALF></LR_HALF><GIRDLE>Thin to Very Thick</GIRDLE><GIRDLE_CONDITION></GIRDLE_CONDITION><GIRDLE_PCT></GIRDLE_PCT><CULET_SIZE>None</CULET_SIZE><POLISH>Excellent</POLISH><SYMMETRY>Very Good</SYMMETRY><FLUORESCENCE_INTENSITY>None</FLUORESCENCE_INTENSITY><FLUORESCENCE_COLOR></FLUORESCENCE_COLOR><KEY_TO_SYMBOLS>Crystal</KEY_TO_SYMBOLS><REPORT_TYPE>DD~Diamond Dossier</REPORT_TYPE><REPORT_DT>12/03/2019</REPORT_DT><INSCRIPTION>GIA 6342219172</INSCRIPTION><SHAPE>SMB~Square Modified Brilliant</SHAPE><REPORT_COMMENTS></REPORT_COMMENTS><CONTROL_NUMBER>0312F1B207335FEE548803C4910B278B</CONTROL_NUMBER><COUNTRY_OF_ORIGIN></COUNTRY_OF_ORIGIN><INCLUSION_DTLS/><CLARITY_STATUS_CODE></CLARITY_STATUS_CODE><CLARITY_STATUS_ABBR></CLARITY_STATUS_ABBR><CUT_CODE></CUT_CODE><POLISH_CODE>EX</POLISH_CODE><SYMMETRY_CODE>VG</SYMMETRY_CODE><FLUO_INTENSITY_CODE>NON</FLUO_INTENSITY_CODE><GIRDLE_CODE>THN to 
VTK</GIRDLE_CODE><CULET_CODE>NON</CULET_CODE><LENGTH_CODE>4.46</LENGTH_CODE><WIDTH_CODE>4.38</WIDTH_CODE><DEPTH_CODE>3.19</DEPTH_CODE><CSS_COLOR_CODE>E</CSS_COLOR_CODE><CSS_COLOR_DESC></CSS_COLOR_DESC><PAINTING></PAINTING><PAINTING_COMMENT></PAINTING_COMMENT><PROPORTION></PROPORTION><SYNTHETIC_INDICATOR>NO</SYNTHETIC_INDICATOR><IDENT_TBL_REPORT_DT></IDENT_TBL_REPORT_DT><IDENT_TBL_WEIGHT></IDENT_TBL_WEIGHT><IDENT_TBL_MEASUREMENTS></IDENT_TBL_MEASUREMENTS><IDENT_TBL_SHAPE></IDENT_TBL_SHAPE><IDENT_TBL_CUTTINGSTYLE></IDENT_TBL_CUTTINGSTYLE><IDENT_TBL_CUTTINGSTYLE_PAV></IDENT_TBL_CUTTINGSTYLE_PAV><IDENT_TBL_CUTTINGSTYLE_CRN></IDENT_TBL_CUTTINGSTYLE_CRN><IDENT_TBL_TRANSPARENCY></IDENT_TBL_TRANSPARENCY><IDENT_TBL_COLOR></IDENT_TBL_COLOR><IDENT_TBL_PHENOMENON></IDENT_TBL_PHENOMENON><IDENT_TBL_DESCRIPTION></IDENT_TBL_DESCRIPTION><IDENT_TBL_GROUP></IDENT_TBL_GROUP><IDENT_TBL_TRADENAME></IDENT_TBL_TRADENAME><IDENT_TBL_SPECIES></IDENT_TBL_SPECIES><IDENT_TBL_VARIETY></IDENT_TBL_VARIETY><IDENT_TBL_SOURCETYPE></IDENT_TBL_SOURCETYPE><IDENT_TBL_GEOGRAPHICORIGIN></IDENT_TBL_GEOGRAPHICORIGIN><IDENT_TBL_TREATEMENT></IDENT_TBL_TREATEMENT><IDENT_TBL_COMMENTS></IDENT_TBL_COMMENTS><IDENT_TABULAR_INDICATOR></IDENT_TABULAR_INDICATOR><IDENT_NAR_DESC></IDENT_NAR_DESC><IDENT_NAR_CONCLUSION></IDENT_NAR_CONCLUSION><IDENT_NAR_COMMENTS></IDENT_NAR_COMMENTS><QUANTITY></QUANTITY><MEASUREMENTS></MEASUREMENTS><PEARLS></PEARLS><ENVIRONMENT></ENVIRONMENT><MOLLUSK></MOLLUSK><TREATMENTS></TREATMENTS><BODYCOLOR></BODYCOLOR><OVERTONE></OVERTONE><LUSTER></LUSTER><SURFACE></SURFACE><NACRETHICKNESS></NACRETHICKNESS><MATCHING></MATCHING><DRILLING></DRILLING><REPORT_DESCRIPTION></REPORT_DESCRIPTION><GENERAL_DESC></GENERAL_DESC><IS_PDF_AVAILABLE>TRUE</IS_PDF_AVAILABLE><EREPORT_URL></EREPORT_URL><TREATMENT_URLS></TREATMENT_URLS><MATERIAL></MATERIAL><SEALING_CODE></SEALING_CODE><REPORT_SLEEVE_MSG></REPORT_SLEEVE_MSG><INFO_MSG></INFO_MSG><KTS_IMG></KTS_IMG><DIAMETER_RANGE></DIAMETER_RANGE><MELEE_COUNT></MELEE_COUNT><
TEST_RESULT_TYPE></TEST_RESULT_TYPE><MELEE_MSG></MELEE_MSG><DIGITAL_RPT_FLG>N</DIGITAL_RPT_FLG></REPORT_DTL></REPORT_DTLS></REPORT_CHECK_RESPONSE>
               
      
   </span>
   

      <input class='icon-search form-control' name='reportno' placeholder='Enter Report No.' type='tel' />
      <input type='submit' name='' id='' class='button-submit ' value='Go' />
      <input type="hidden" name="c" value="Page"/>
      <input type="hidden" name="childpagename" value="GIA/Page/ReportCheck"/>
      <input type="hidden" name="pagename" value="GIA/Wrapper"/>
      <input type="hidden" name="cid" value="1495275503754"/>
       <input type="hidden" name="encryptedString" id="encryptedString" value="ADB8EFD5E8B146D156516E7BE68FBD8D"/>
       <input type="hidden" name="qr" id="qr" value="null"/>

      </form>


I want to extract the text between the span tags,


that is, the content between the opening and closing <REPORT_CHECK_RESPONSE> tags.

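python羊 posted on 2021-5-21 17:53:47

Twilight6 posted on 2021-5-21 17:11
I think bs4 is quicker here. I'm not great with re, and I'm not sure how to handle the many nodes inside the span.

Reference code using re (tags not stripped):


Thanks a lot. BS4 really is more convenient.
I don't know why such a small amount of data comes with more than 6,000 lines of page source; it loads really slowly.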

笨鸟学飞 posted on 2021-5-21 18:15:32

Test code:
import requests
from lxml import etree

a = '''
<a id="main-content"></a>


    <div class="report-check-survey"><p>How can we improve Report Check?<a href="https://www.giasurveys.com/se/705E3F7C429C7CA8" target="_blank">Take this quick survey</a>.</p></div>

<!-- No Match -->
<section id='no-match'>
   
      <div class='content'>
      <form action='/sites/Satellite' class='search-box report-lookup-form' method='GET'>
      <span style="display:none;" name="xmlcontent" id="xmlcontent">
   
      
      

      
      <REPORT_CHECK_RESPONSE><STATUS>SUCCESS</STATUS><ERROR_DTLS><ERROR_CODE></ERROR_CODE><ERROR_MSG></ERROR_MSG></ERROR_DTLS><REPORT_DTLS><REPORT_DTL><MESSAGE></MESSAGE><LENGTH>4.46 x 4.38 x 3.19 mm</LENGTH><WIDTH>4.46 x 4.38 x 3.19 mm</WIDTH><DEPTH>4.46 x 4.38 x 3.19 mm</DEPTH><WEIGHT>0.55</WEIGHT><REPORT_NO>6342219172</REPORT_NO><COLOR>E</COLOR><COLOR_DESCRIPTIONS></COLOR_DESCRIPTIONS><CLARITY>SI1</CLARITY><FINAL_CUT></FINAL_CUT><DEPTH_PCT>72.8</DEPTH_PCT><TABLE_PCT>72</TABLE_PCT><CRN_AG></CRN_AG><CRN_HT></CRN_HT><PAV_AG></PAV_AG><PAV_DP></PAV_DP><STR_LN></STR_LN><LR_HALF></LR_HALF><GIRDLE>Thin to Very Thick</GIRDLE><GIRDLE_CONDITION></GIRDLE_CONDITION><GIRDLE_PCT></GIRDLE_PCT><CULET_SIZE>None</CULET_SIZE><POLISH>Excellent</POLISH><SYMMETRY>Very Good</SYMMETRY><FLUORESCENCE_INTENSITY>None</FLUORESCENCE_INTENSITY><FLUORESCENCE_COLOR></FLUORESCENCE_COLOR><KEY_TO_SYMBOLS>Crystal</KEY_TO_SYMBOLS><REPORT_TYPE>DD~Diamond Dossier</REPORT_TYPE><REPORT_DT>12/03/2019</REPORT_DT><INSCRIPTION>GIA 6342219172</INSCRIPTION><SHAPE>SMB~Square Modified Brilliant</SHAPE><REPORT_COMMENTS></REPORT_COMMENTS><CONTROL_NUMBER>0312F1B207335FEE548803C4910B278B</CONTROL_NUMBER><COUNTRY_OF_ORIGIN></COUNTRY_OF_ORIGIN><INCLUSION_DTLS/><CLARITY_STATUS_CODE></CLARITY_STATUS_CODE><CLARITY_STATUS_ABBR></CLARITY_STATUS_ABBR><CUT_CODE></CUT_CODE><POLISH_CODE>EX</POLISH_CODE><SYMMETRY_CODE>VG</SYMMETRY_CODE><FLUO_INTENSITY_CODE>NON</FLUO_INTENSITY_CODE><GIRDLE_CODE>THN to 
VTK</GIRDLE_CODE><CULET_CODE>NON</CULET_CODE><LENGTH_CODE>4.46</LENGTH_CODE><WIDTH_CODE>4.38</WIDTH_CODE><DEPTH_CODE>3.19</DEPTH_CODE><CSS_COLOR_CODE>E</CSS_COLOR_CODE><CSS_COLOR_DESC></CSS_COLOR_DESC><PAINTING></PAINTING><PAINTING_COMMENT></PAINTING_COMMENT><PROPORTION></PROPORTION><SYNTHETIC_INDICATOR>NO</SYNTHETIC_INDICATOR><IDENT_TBL_REPORT_DT></IDENT_TBL_REPORT_DT><IDENT_TBL_WEIGHT></IDENT_TBL_WEIGHT><IDENT_TBL_MEASUREMENTS></IDENT_TBL_MEASUREMENTS><IDENT_TBL_SHAPE></IDENT_TBL_SHAPE><IDENT_TBL_CUTTINGSTYLE></IDENT_TBL_CUTTINGSTYLE><IDENT_TBL_CUTTINGSTYLE_PAV></IDENT_TBL_CUTTINGSTYLE_PAV><IDENT_TBL_CUTTINGSTYLE_CRN></IDENT_TBL_CUTTINGSTYLE_CRN><IDENT_TBL_TRANSPARENCY></IDENT_TBL_TRANSPARENCY><IDENT_TBL_COLOR></IDENT_TBL_COLOR><IDENT_TBL_PHENOMENON></IDENT_TBL_PHENOMENON><IDENT_TBL_DESCRIPTION></IDENT_TBL_DESCRIPTION><IDENT_TBL_GROUP></IDENT_TBL_GROUP><IDENT_TBL_TRADENAME></IDENT_TBL_TRADENAME><IDENT_TBL_SPECIES></IDENT_TBL_SPECIES><IDENT_TBL_VARIETY></IDENT_TBL_VARIETY><IDENT_TBL_SOURCETYPE></IDENT_TBL_SOURCETYPE><IDENT_TBL_GEOGRAPHICORIGIN></IDENT_TBL_GEOGRAPHICORIGIN><IDENT_TBL_TREATEMENT></IDENT_TBL_TREATEMENT><IDENT_TBL_COMMENTS></IDENT_TBL_COMMENTS><IDENT_TABULAR_INDICATOR></IDENT_TABULAR_INDICATOR><IDENT_NAR_DESC></IDENT_NAR_DESC><IDENT_NAR_CONCLUSION></IDENT_NAR_CONCLUSION><IDENT_NAR_COMMENTS></IDENT_NAR_COMMENTS><QUANTITY></QUANTITY><MEASUREMENTS></MEASUREMENTS><PEARLS></PEARLS><ENVIRONMENT></ENVIRONMENT><MOLLUSK></MOLLUSK><TREATMENTS></TREATMENTS><BODYCOLOR></BODYCOLOR><OVERTONE></OVERTONE><LUSTER></LUSTER><SURFACE></SURFACE><NACRETHICKNESS></NACRETHICKNESS><MATCHING></MATCHING><DRILLING></DRILLING><REPORT_DESCRIPTION></REPORT_DESCRIPTION><GENERAL_DESC></GENERAL_DESC><IS_PDF_AVAILABLE>TRUE</IS_PDF_AVAILABLE><EREPORT_URL></EREPORT_URL><TREATMENT_URLS></TREATMENT_URLS><MATERIAL></MATERIAL><SEALING_CODE></SEALING_CODE><REPORT_SLEEVE_MSG></REPORT_SLEEVE_MSG><INFO_MSG></INFO_MSG><KTS_IMG></KTS_IMG><DIAMETER_RANGE></DIAMETER_RANGE><MELEE_COUNT></MELEE_COUNT><
TEST_RESULT_TYPE></TEST_RESULT_TYPE><MELEE_MSG></MELEE_MSG><DIGITAL_RPT_FLG>N</DIGITAL_RPT_FLG></REPORT_DTL></REPORT_DTLS></REPORT_CHECK_RESPONSE>
               
      
   </span>
   

      <input class='icon-search form-control' name='reportno' placeholder='Enter Report No.' type='tel' />
      <input type='submit' name='' id='' class='button-submit ' value='Go' />
      <input type="hidden" name="c" value="Page"/>
      <input type="hidden" name="childpagename" value="GIA/Page/ReportCheck"/>
      <input type="hidden" name="pagename" value="GIA/Wrapper"/>
      <input type="hidden" name="cid" value="1495275503754"/>
       <input type="hidden" name="encryptedString" id="encryptedString" value="ADB8EFD5E8B146D156516E7BE68FBD8D"/>
       <input type="hidden" name="qr" id="qr" value="null"/>

      </form>
'''
tree = etree.HTML(a)
# xpath() returns a list of matching elements, so take the first one
content = tree.xpath('//span[@style="display:none;"]')[0]
result = etree.tostring(content, pretty_print=True, method='html').decode('utf-8')
print(result)

Code that should run successfully against the live page:
import requests
from lxml import etree

url = 'https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754'
headers = {
'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36'
}
res = requests.get(url, headers=headers)
tree = etree.HTML(res.text)
# xpath() returns a list of matching elements, so take the first one
content = tree.xpath('//span[@style="display:none;"]')[0]
result = etree.tostring(content, pretty_print=True, method='html').decode('utf-8')  # serialize back to HTML source
print(result)

笨鸟学飞 posted on 2021-5-21 18:18:10

If you only want all the text content inside that node, rather than the markup, do it like this:
import requests
from lxml import etree

a = '''
<a id="main-content"></a>


    <div class="report-check-survey"><p>How can we improve Report Check?<a href="https://www.giasurveys.com/se/705E3F7C429C7CA8" target="_blank">Take this quick survey</a>.</p></div>

<!-- No Match -->
<section id='no-match'>

      <div class='content'>
      <form action='/sites/Satellite' class='search-box report-lookup-form' method='GET'>
      <span style="display:none;" name="xmlcontent" id="xmlcontent">





      <REPORT_CHECK_RESPONSE><STATUS>SUCCESS</STATUS><ERROR_DTLS><ERROR_CODE></ERROR_CODE><ERROR_MSG></ERROR_MSG></ERROR_DTLS><REPORT_DTLS><REPORT_DTL><MESSAGE></MESSAGE><LENGTH>4.46 x 4.38 x 3.19 mm</LENGTH><WIDTH>4.46 x 4.38 x 3.19 mm</WIDTH><DEPTH>4.46 x 4.38 x 3.19 mm</DEPTH><WEIGHT>0.55</WEIGHT><REPORT_NO>6342219172</REPORT_NO><COLOR>E</COLOR><COLOR_DESCRIPTIONS></COLOR_DESCRIPTIONS><CLARITY>SI1</CLARITY><FINAL_CUT></FINAL_CUT><DEPTH_PCT>72.8</DEPTH_PCT><TABLE_PCT>72</TABLE_PCT><CRN_AG></CRN_AG><CRN_HT></CRN_HT><PAV_AG></PAV_AG><PAV_DP></PAV_DP><STR_LN></STR_LN><LR_HALF></LR_HALF><GIRDLE>Thin to Very Thick</GIRDLE><GIRDLE_CONDITION></GIRDLE_CONDITION><GIRDLE_PCT></GIRDLE_PCT><CULET_SIZE>None</CULET_SIZE><POLISH>Excellent</POLISH><SYMMETRY>Very Good</SYMMETRY><FLUORESCENCE_INTENSITY>None</FLUORESCENCE_INTENSITY><FLUORESCENCE_COLOR></FLUORESCENCE_COLOR><KEY_TO_SYMBOLS>Crystal</KEY_TO_SYMBOLS><REPORT_TYPE>DD~Diamond Dossier</REPORT_TYPE><REPORT_DT>12/03/2019</REPORT_DT><INSCRIPTION>GIA 6342219172</INSCRIPTION><SHAPE>SMB~Square Modified Brilliant</SHAPE><REPORT_COMMENTS></REPORT_COMMENTS><CONTROL_NUMBER>0312F1B207335FEE548803C4910B278B</CONTROL_NUMBER><COUNTRY_OF_ORIGIN></COUNTRY_OF_ORIGIN><INCLUSION_DTLS/><CLARITY_STATUS_CODE></CLARITY_STATUS_CODE><CLARITY_STATUS_ABBR></CLARITY_STATUS_ABBR><CUT_CODE></CUT_CODE><POLISH_CODE>EX</POLISH_CODE><SYMMETRY_CODE>VG</SYMMETRY_CODE><FLUO_INTENSITY_CODE>NON</FLUO_INTENSITY_CODE><GIRDLE_CODE>THN to 
VTK</GIRDLE_CODE><CULET_CODE>NON</CULET_CODE><LENGTH_CODE>4.46</LENGTH_CODE><WIDTH_CODE>4.38</WIDTH_CODE><DEPTH_CODE>3.19</DEPTH_CODE><CSS_COLOR_CODE>E</CSS_COLOR_CODE><CSS_COLOR_DESC></CSS_COLOR_DESC><PAINTING></PAINTING><PAINTING_COMMENT></PAINTING_COMMENT><PROPORTION></PROPORTION><SYNTHETIC_INDICATOR>NO</SYNTHETIC_INDICATOR><IDENT_TBL_REPORT_DT></IDENT_TBL_REPORT_DT><IDENT_TBL_WEIGHT></IDENT_TBL_WEIGHT><IDENT_TBL_MEASUREMENTS></IDENT_TBL_MEASUREMENTS><IDENT_TBL_SHAPE></IDENT_TBL_SHAPE><IDENT_TBL_CUTTINGSTYLE></IDENT_TBL_CUTTINGSTYLE><IDENT_TBL_CUTTINGSTYLE_PAV></IDENT_TBL_CUTTINGSTYLE_PAV><IDENT_TBL_CUTTINGSTYLE_CRN></IDENT_TBL_CUTTINGSTYLE_CRN><IDENT_TBL_TRANSPARENCY></IDENT_TBL_TRANSPARENCY><IDENT_TBL_COLOR></IDENT_TBL_COLOR><IDENT_TBL_PHENOMENON></IDENT_TBL_PHENOMENON><IDENT_TBL_DESCRIPTION></IDENT_TBL_DESCRIPTION><IDENT_TBL_GROUP></IDENT_TBL_GROUP><IDENT_TBL_TRADENAME></IDENT_TBL_TRADENAME><IDENT_TBL_SPECIES></IDENT_TBL_SPECIES><IDENT_TBL_VARIETY></IDENT_TBL_VARIETY><IDENT_TBL_SOURCETYPE></IDENT_TBL_SOURCETYPE><IDENT_TBL_GEOGRAPHICORIGIN></IDENT_TBL_GEOGRAPHICORIGIN><IDENT_TBL_TREATEMENT></IDENT_TBL_TREATEMENT><IDENT_TBL_COMMENTS></IDENT_TBL_COMMENTS><IDENT_TABULAR_INDICATOR></IDENT_TABULAR_INDICATOR><IDENT_NAR_DESC></IDENT_NAR_DESC><IDENT_NAR_CONCLUSION></IDENT_NAR_CONCLUSION><IDENT_NAR_COMMENTS></IDENT_NAR_COMMENTS><QUANTITY></QUANTITY><MEASUREMENTS></MEASUREMENTS><PEARLS></PEARLS><ENVIRONMENT></ENVIRONMENT><MOLLUSK></MOLLUSK><TREATMENTS></TREATMENTS><BODYCOLOR></BODYCOLOR><OVERTONE></OVERTONE><LUSTER></LUSTER><SURFACE></SURFACE><NACRETHICKNESS></NACRETHICKNESS><MATCHING></MATCHING><DRILLING></DRILLING><REPORT_DESCRIPTION></REPORT_DESCRIPTION><GENERAL_DESC></GENERAL_DESC><IS_PDF_AVAILABLE>TRUE</IS_PDF_AVAILABLE><EREPORT_URL></EREPORT_URL><TREATMENT_URLS></TREATMENT_URLS><MATERIAL></MATERIAL><SEALING_CODE></SEALING_CODE><REPORT_SLEEVE_MSG></REPORT_SLEEVE_MSG><INFO_MSG></INFO_MSG><KTS_IMG></KTS_IMG><DIAMETER_RANGE></DIAMETER_RANGE><MELEE_COUNT></MELEE_COUNT><
TEST_RESULT_TYPE></TEST_RESULT_TYPE><MELEE_MSG></MELEE_MSG><DIGITAL_RPT_FLG>N</DIGITAL_RPT_FLG></REPORT_DTL></REPORT_DTLS></REPORT_CHECK_RESPONSE>


   </span>


      <input class='icon-search form-control' name='reportno' placeholder='Enter Report No.' type='tel' />
      <input type='submit' name='' id='' class='button-submit ' value='Go' />
      <input type="hidden" name="c" value="Page"/>
      <input type="hidden" name="childpagename" value="GIA/Page/ReportCheck"/>
      <input type="hidden" name="pagename" value="GIA/Wrapper"/>
      <input type="hidden" name="cid" value="1495275503754"/>
       <input type="hidden" name="encryptedString" id="encryptedString" value="ADB8EFD5E8B146D156516E7BE68FBD8D"/>
       <input type="hidden" name="qr" id="qr" value="null"/>

      </form>
'''
tree = etree.HTML(a)
# xpath() returns a list; call string() on the first matching element
content = tree.xpath('//span[@style="display:none;"]')[0]
print(content.xpath('string()'))
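
Once the XML payload itself has been pulled out of the page, the individual fields can be read by name instead of as one long string. Here is a minimal sketch using the standard library's `xml.etree.ElementTree`, assuming the payload is well-formed XML. The `payload` string is a shortened copy of a few fields from the source quoted in this thread, and `report_fields` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Shortened sample built from a few fields of the payload quoted in this thread.
payload = (
    '<REPORT_CHECK_RESPONSE><STATUS>SUCCESS</STATUS><REPORT_DTLS><REPORT_DTL>'
    '<WEIGHT>0.55</WEIGHT><REPORT_NO>6342219172</REPORT_NO><COLOR>E</COLOR>'
    '<CLARITY>SI1</CLARITY><FINAL_CUT></FINAL_CUT>'
    '</REPORT_DTL></REPORT_DTLS></REPORT_CHECK_RESPONSE>'
)


def report_fields(xml_text):
    """Return {tag: text} for each direct child of the first REPORT_DTL that has text."""
    root = ET.fromstring(xml_text)
    detail = root.find('./REPORT_DTLS/REPORT_DTL')
    if detail is None:
        return {}
    return {
        child.tag: child.text.strip()
        for child in detail
        if child.text and child.text.strip()
    }


fields = report_fields(payload)
print(fields)
```

Empty elements such as `<FINAL_CUT></FINAL_CUT>` are skipped, so the resulting dict only holds fields that actually carry a value.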

python羊 posted on 2021-5-22 16:15:11

笨鸟学飞 posted on 2021-5-21 18:18
If you only want all the text content inside that node, rather than the markup, do it like this:

Thanks a lot.