Sina News homepage headlines and links, Python Discussion, Programming Languages, FishC Forum

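wcq15759797758 posted on 2021-8-27 17:09:45

Sina News homepage headlines and links

This post was last edited by wcq15759797758 on 2021-8-27 17:12

A very simple crawler that prints the headline text and link of each news item on the Sina News homepage.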
import traceback

import cchardet
import requests
from lxml import etree


def downloader(url, timeout=10, headers=None, debug=False, binary=False):
    # Use a desktop Chrome User-Agent unless the caller supplies its own headers.
    _headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    if headers:
        _headers = headers
    try:
        r = requests.get(url, headers=_headers, timeout=timeout)
        if binary:
            html = r.content
        else:
            # Detect the encoding from the raw bytes; fall back to UTF-8
            # when detection returns None.
            encoding = cchardet.detect(r.content)['encoding'] or 'utf-8'
            html = r.content.decode(encoding, errors='replace')
    except Exception:
        if debug:
            traceback.print_exc()
        print('failed download: {}'.format(url))
        html = b'' if binary else ''
    return title(html)


def title(html):
    # Nothing to parse if the download failed or the page was empty.
    if not html:
        return []
    title_html = etree.HTML(html)
    if title_html is None:
        return []
    items = []
    for link in title_html.xpath('//a[@target="_blank"]'):
        item = {
            'title': processing(link.xpath('./text()')),
            'url': processing(link.xpath('./@href')),
        }
        # Keep only entries whose title is longer than 4 characters.
        if len(item['title']) > 4:
            print(item)
            items.append(item)
    return items


def processing(strs):
    # Remove all whitespace from each fragment, then join the fragments.
    return ''.join(''.join(n.split()) for n in strs)


if __name__ == '__main__':
    url = 'https://news.sina.com.cn/'
    downloader(url)
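
For anyone who wants to try the extraction step without network access or lxml, here is a rough standard-library sketch of the same idea: collect anchors with target="_blank", strip whitespace from the text, and keep titles longer than 4 characters. It is not the code from the post above. The names BlankTargetLinkParser, extract_links, and SAMPLE_HTML and the sample URLs are made up for illustration. It also differs in three ways: it gathers text from nested tags (the XPath './text()' only takes direct text children), it skips anchors with no href, and it resolves relative links against a base URL with urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BlankTargetLinkParser(HTMLParser):
    """Collect title/url pairs from <a target="_blank"> elements (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('target') == '_blank' and attrs.get('href'):
            # Resolve relative links against the page URL.
            self._href = urljoin(self.base_url, attrs['href'])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Same cleanup as processing(): drop all whitespace.
            title = ''.join(''.join(self._text).split())
            if len(title) > 4:
                self.items.append({'title': title, 'url': self._href})
            self._href = None


def extract_links(html, base_url):
    parser = BlankTargetLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample markup; the paths below are not real Sina pages.
SAMPLE_HTML = '''
<a target="_blank" href="/china/story-1.shtml"> Sample headline number one </a>
<a target="_blank" href="https://news.sina.com.cn/world/">More</a>
<a href="/ignored.shtml">No target attribute here</a>
'''

if __name__ == '__main__':
    for item in extract_links(SAMPLE_HTML, 'https://news.sina.com.cn/'):
        print(item)
```

Only the first sample anchor survives the filter: the second has a 4-character title, and the third has no target attribute. Because all whitespace is removed, spaces between English words disappear too, which is harmless for Chinese headlines but worth knowing about.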

wxm1324 posted on 2021-8-27 17:39:20

Learned something!

yobdc posted on 2021-8-27 17:57:50

Learning.

Rebecca2021 posted on 2021-8-28 10:30:10

Learning.

qq1151985918 posted on 2021-8-28 10:43:07

{:9_227:}

Rosy7673 posted on 2021-8-28 11:35:53

Showing some support.

瑜子肌昂 posted on 2021-8-28 11:51:23

Just here for the fish coins.

hornwong posted on 2021-8-28 14:34:21

{:5_95:}

937135952 posted on 2021-8-29 09:27:52

Here to learn.

llc2009 posted on 2021-8-29 09:32:04

Thanks.

让记忆定格 posted on 2021-8-29 12:13:26

Impressive.
