FishC Forum Python Elite Challenge (Season 4, Round 04)

jerryxjr1220 · posted 2017-12-11 13:14:57

Last edited by jerryxjr1220 on 2017-12-18 09:45

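The fourth FishC Forum Elite Challenge is under way! To keep things fun, this season continues the "new modes": the betting mode, the guessing mode, and the champion mode.

Past feedback from members says the challenge problems have generally been too hard, so relatively few people took part. To raise participation, this season's problems will be considerably easier so that most members can enter. There are also prizes for the top three: 50 fish coins for first place, 30 for second, and 20 for third.

Rules for the new modes:

1. Betting mode: suspended for now because very few people took part. An improved version will be announced later.

2. Guessing mode: vote in the poll below the contest post. Everyone whose guess is correct is awarded 5 fish coins. There is no entry threshold, so anyone can take part. After voting, please leave a reply in this thread so the reward can be delivered.

3. Champion mode: members who solved the previous round become the champions of the challenge, and champions get priority in proposing the next round's problem. A season has 5 rounds, and the member who holds the champion title longest within a season receives an extra reward.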

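This round's problem: crawl Zhihu Daily (知乎日报) https://daily.zhihu.com/

You may or may not have read Zhihu Daily. It publishes three times a day, packs in a great deal of information and knowledge, and its roughly 7-minute reading time suits today's fast-paced life.

Today's task is to write a crawler that fetches each issue of Zhihu Daily and automatically files the fetched page content into a folder named zhihu_daily, organized by date and title. Converting the page content to PDF earns extra credit.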
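The filing step can be sketched as follows. This is only an illustration under assumptions: the layout zhihu_daily/<date>/<title>.html is one reading of "by date and title", and the helper names `safe_name`, `story_path`, and `save_story` are made up for this example.

```python
import re
from datetime import date
from pathlib import Path


def safe_name(title: str) -> str:
    """Replace whitespace and characters that are illegal in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return cleaned or "untitled"


def story_path(root: Path, day: date, title: str, ext: str = "html") -> Path:
    """Return <root>/zhihu_daily/<YYYY-MM-DD>/<title>.<ext>, creating the folders."""
    folder = root / "zhihu_daily" / day.isoformat()
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{safe_name(title)}.{ext}"


def save_story(root: Path, day: date, title: str, html: str) -> Path:
    """Write one story's HTML to its dated folder and return the file path."""
    path = story_path(root, day, title)
    path.write_text(html, encoding="utf-8")
    return path
```

For example, `save_story(Path.cwd(), date.today(), title, html)` would write today's story under the current directory. The same `story_path` call with `ext="pdf"` gives a target path for a PDF converter.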

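Bonus: automatically detect whether Zhihu Daily has been updated and, if so, fetch only the new content without re-fetching what is already saved. The check interval can be set to once per hour.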
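One possible way to approach the bonus is to remember which story links have already been handled and skip them on later runs. The sketch below assumes each story is a dict with an "href" key and that the caller supplies two functions of its own, `list_stories` (returns the current stories) and `handle` (saves one story). Those names, and the JSON file used to persist state, are illustrative choices rather than part of the task.

```python
import json
import time
from pathlib import Path


def load_seen(path: Path) -> set:
    """Load the set of already-handled story links, or an empty set on first run."""
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_seen(path: Path, seen: set) -> None:
    """Persist the handled links so a restart does not re-fetch them."""
    path.write_text(json.dumps(sorted(seen)), encoding="utf-8")


def crawl_once(list_stories, handle, seen_file: Path) -> int:
    """Handle every story not seen before; return how many were new."""
    seen = load_seen(seen_file)
    fresh = 0
    for story in list_stories():
        if story["href"] in seen:
            continue  # already saved on an earlier run
        handle(story)
        seen.add(story["href"])  # mark as seen only after handling succeeded
        fresh += 1
    save_seen(seen_file, seen)
    return fresh


def poll(list_stories, handle, seen_file: Path, interval=3600, rounds=None):
    """Run crawl_once every `interval` seconds; rounds=None means run forever."""
    done = 0
    while rounds is None or done < rounds:
        crawl_once(list_stories, handle, seen_file)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

With `interval=3600` the check runs once per hour, matching the suggested frequency. The `rounds` parameter exists only so the loop can be stopped in a test.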

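Hint: Zhihu Daily is available as a website, an Android app, an iOS app, a WeChat mini program, an API, and so on. Pick whichever source you think suits your crawling strategy best; any form and method is allowed.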
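For the website route, here is a minimal sketch that uses only the standard library to pull story links and titles out of an index page. It assumes each story is an `a` element with class `link-button` wrapping a `span` with class `title`. Those class names are an assumption about the page's markup and may not match the live site.

```python
from html.parser import HTMLParser


class StoryLinkParser(HTMLParser):
    """Collect {'href', 'title'} for every <a class="link-button"> on a page."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._href = None       # href of the story link we are currently inside
        self._in_title = False  # True while inside that link's title <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "link-button" in classes:
            self._href = attrs.get("href")
        elif tag == "span" and "title" in classes and self._href is not None:
            self._in_title = True
            self.stories.append({"href": self._href, "title": ""})

    def handle_data(self, data):
        if self._in_title:
            self.stories[-1]["title"] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False
        elif tag == "a":
            self._href = None


def parse_index(html: str) -> list:
    """Return the stories found in the given index-page HTML."""
    parser = StoryLinkParser()
    parser.feed(html)
    parser.close()
    return parser.stories
```

The `html` string would come from whatever HTTP client you prefer. If you choose the API or an app endpoint instead, only this parsing step needs to change.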

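Requirements: the crawler runs correctly, the program is efficient, and the code is concise.

Deadline: 24:00 on December 17

This round's champions: cngoodboy, gunjang, 万事屋

@小甲鱼 @冬雪雪冬 @～风介～ @SixPy

Poll: how many entrants will submit a correct answer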

shigure_takimi · posted 2017-12-11 13:56:03

I've never learned web scraping, so I can't do this one. 🐶️🐶️

qwc3000 · posted 2017-12-12 09:26:45

Last edited by jerryxjr1220 on 2017-12-18 09:41

import os
import re
import threading

import pdfkit
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

SAVE_DIR = os.path.join(os.getcwd(), 'zhihu_daily')


def get_index_page():
    # Browser-like headers, added to work around an HTTP 304 error
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'daily.zhihu.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3269.3 Safari/537.36'
    }
    url = "http://daily.zhihu.com"
    response = requests.get(url, headers=headers)  # send the headers with the request
    return response.text


def get_detail_content():
    html = get_index_page()  # raw HTML of the index page
    # Parse with BeautifulSoup (the lxml parser must be installed)
    soup = BeautifulSoup(html, 'lxml')
    # CSS selectors for each story's title, thumbnail image, and detail-page link
    base = 'body > div.main-content > div > div.main-content-wrap > div > div > div > div > a'
    titles = soup.select(base + ' > span')
    images = soup.select(base + ' > img')
    hrefs = soup.select(base)
    for title, image, href in zip(titles, images, hrefs):
        yield {
            # Strip non-word characters such as "?" that would make the file name invalid
            'title': re.sub(r'\W', "", title.get_text()),
            'image': image.get('src'),
            'href': href.get('href'),
        }
    print("================================")


def timer_work():
    try:
        save_all(get_detail_content())
    except RequestException as e:
        # A failed request should not stop the next scheduled run
        print("Request failed:", e)
    # Run again in one hour (3600 s)
    t = threading.Timer(3600, timer_work)
    t.start()


def save_all(stories):
    # Create the output folder if it does not exist yet
    os.makedirs(SAVE_DIR, exist_ok=True)
    for tmp in stories:
        images_save(tmp['title'], tmp['image'])  # save the thumbnail
        PDF_save(tmp['href'], tmp['title'])      # save the story page as PDF
    print("Done")


def PDF_save(href, title):
    # pdfkit.from_url / pdfkit.from_string fail with the error below when pdfkit cannot
    # find the wkhtmltopdf executable, even if wkhtmltopdf was installed with pip:
    #   If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
    # Fix: pass the executable's location explicitly through pdfkit.configuration.
    path_wk = r'D:\Users\Administrator\AppData\Local\Programs\Python\Python36\Lib\site-packages\wkhtmltopdf\bin\wkhtmltopdf.exe'  # install location
    config = pdfkit.configuration(wkhtmltopdf=path_wk)
    url = 'https://daily.zhihu.com' + href
    file_path = os.path.join(SAVE_DIR, title + '.pdf')
    if os.path.exists(file_path):
        print("TP2: file already exists")
        return
    try:
        pdfkit.from_url(url, file_path, configuration=config)
    except OSError:
        # wkhtmltopdf can fail with
        # "Exit with code 1 due to network error: ContentNotFoundError"
        # when the page's CSS references external resources (fonts, images, iframes).
        # This error is deliberately ignored.
        pass


def images_save(title, image):
    file_path = os.path.join(SAVE_DIR, title + '.jpg')
    if os.path.exists(file_path):
        return  # already saved, so skip the download
    img = requests.get(image)  # fetch the image
    with open(file_path, 'wb') as f:
        f.write(img.content)   # write the image bytes


def main():
    timer_work()  # first run now, then once every hour


if __name__ == '__main__':
    main()

春衫少年薄 · posted 2017-12-12 10:53:07

Last edited by jerryxjr1220 on 2017-12-15 09:44

import random
import re
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
]
# Pick a random User-Agent for this run
user_agent = random.choice(user_agents)
headers = {
    'user-agent': user_agent,
    'accept': '*/*',
    # 'br' is left out: requests only decodes Brotli when an extra package is installed
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'zh-CN,zh;q=0.9',
}

url = "https://daily.zhihu.com"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
# Each story on the index page sits in a <div class="box">
boxes = soup.select('div[class="box"]')
print(time.strftime('%Y-%m-%d') + ': Zhihu Daily headlines at a glance')
for box in boxes:
    html = str(box)
    link = ''.join(re.findall(r'<a class="link-button" href="(.*?)">', html))
    title = urllib.parse.unquote(''.join(re.findall(r'<span class="title">(.*?)</span>', html)))
    img = ''.join(re.findall(r'src="(.*?)"', html))
    print(url + link, title, 'img= %s' % img)
    # Fetch the story page and print the text inside its question blocks
    res = requests.get(url + link, headers=headers)
    story_soup = BeautifulSoup(res.content, 'lxml')
    questions = story_soup.select('div[class="question"]')
    contents = re.findall(r'>(.*?)<', str(questions))
    print(''.join(contents))

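qwc3000 · posted 2017-12-12 20:10:45

Last edited by qwc3000 on 2017-12-16 23:32

timeislife · posted 2017-12-15 06:44:30

Did my vote win?

c1_wangyf · posted 2017-12-15 08:46:33

Let me see what this is about!!

hostmi · posted 2017-12-15 23:47:40

This one is interesting!

SylarPu · posted 2017-12-18 09:40:16

A really great problem. Too bad I don't know how to write crawlers.

jerryxjr1220 · posted 2017-12-18 10:40:49

SylarPu posted on 2017-12-18 09:40
A really great problem. Too bad I don't know how to write crawlers.

You can study the winner's program. It is very well written, and the PDF output is elegant too!

SylarPu · posted 2017-12-18 14:29:02

春衫少年薄 posted on 2017-12-12 10:53

A question: after calling requests.get, printing r.text directly fails with a gbk error. Is that a problem with my computer?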
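One likely cause, assuming a Windows console whose output encoding is GBK: the page text contains characters that GBK cannot represent, so print raises UnicodeEncodeError when it tries to encode them. That would make it a console-encoding issue rather than a fault in the computer or in requests. A small sketch of a workaround that replaces unencodable characters before printing (the helper name `printable` is made up for this example):

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Default to the console's own encoding; fall back to UTF-8 if it is unknown
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)
```

With this helper, `print(printable(r.text))` should no longer raise, at the cost of showing "?" where a character cannot be displayed.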
