import requests
from lxml import etree
currenturl = "https://www.htfc.com/main/a/20221216/80146505.shtml"
# Web scraper
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language':'en-US,en;q=0.5',
    'Accept-Encoding':'gzip',
    'DNT':'1',
    'Connection':'close'
}
r = requests.get(currenturl, headers=headers)
r.encoding = 'gbk'
html = etree.HTML(r.text)  # etree.HTML() builds a parse tree that supports XPath and automatically repairs malformed HTML.
#print(html)
title = html.xpath('//div[id="details"]')
print(title)
That is my code above. I want to scrape the title and the body text of the page; how should I write it?
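A note on why the query in the question prints an empty list: in XPath, `[id="details"]` tests for a child element named `id`, while `[@id="details"]` tests the `id` attribute. The sketch below shows the difference offline. The HTML string is a made-up stand-in, not the real page, so the actual markup may differ.

```python
from lxml import etree

# Hypothetical stand-in for the page; the real markup may differ.
sample = '<html><body><div id="details"><h3>Sample title</h3></div></body></html>'
doc = etree.HTML(sample)

# Without "@", `id` is read as a child element <id>, so no div matches.
no_at = doc.xpath('//div[id="details"]')

# With "@", `id` is the attribute, so the div is found.
with_at = doc.xpath('//div[@id="details"]')

# Adding /h3/text() returns the heading text instead of the element object.
title = doc.xpath('//div[@id="details"]/h3/text()')

print(no_at)
print(len(with_at))
print(title)
```

Note that even with `@`, printing the result of `//div[@id="details"]` shows element objects rather than text, which is why the query needs a `text()` step at the end.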
import requests
from lxml import etree
currenturl = "https://www.htfc.com/main/a/20221216/80146505.shtml"
# Web scraper
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language':'en-US,en;q=0.5',
    'Accept-Encoding':'gzip',
    'DNT':'1',
    'Connection':'close'
}
r = requests.get(currenturl, headers=headers)
r.encoding = 'utf-8'  # changed from 'gbk': decode the response as UTF-8
html = etree.HTML(r.text)  # etree.HTML() builds a parse tree that supports XPath and automatically repairs malformed HTML.
#print(html)
# "@id" selects the attribute; "/h3/text()" returns the heading text
title = html.xpath('//div[@id="details"]/h3/text()')
print("Title:")
print(title[0])
print()
print("Body:")
# Collect the text of every <span> inside the article container
content = html.xpath('//div[@class="wz_content"]//span/text()')
print('\n'.join(content))
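If the page layout changes, `title[0]` in the code above raises an `IndexError` when the XPath matches nothing. Below is a minimal sketch of the same extraction wrapped in a helper that tolerates missing nodes. The function name `extract_article` and the sample HTML are illustrative assumptions, not part of the original answer. The two XPath expressions are the ones used above.

```python
from lxml import etree


def extract_article(html_text):
    """Return (title, body); title is None and body is "" when nothing matches."""
    if not html_text.strip():
        # Nothing to parse; skip the parser entirely.
        return None, ""
    doc = etree.HTML(html_text)
    if doc is None:
        return None, ""
    titles = doc.xpath('//div[@id="details"]/h3/text()')
    title = titles[0].strip() if titles else None
    parts = doc.xpath('//div[@class="wz_content"]//span/text()')
    # Drop whitespace-only fragments before joining.
    body = "\n".join(p.strip() for p in parts if p.strip())
    return title, body


# Hypothetical markup that mimics the structure the XPath expects.
sample = (
    '<html><body><div id="details"><h3> Sample title </h3>'
    '<div class="wz_content"><p><span>First paragraph.</span></p>'
    '<p><span>Second paragraph.</span></p></div></div></body></html>'
)
title, body = extract_article(sample)
print(title)
print(body)
```

With the real page, the call would be `extract_article(r.text)` after setting `r.encoding`, assuming the page still uses the `details` and `wz_content` containers.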