鱼C论坛 (FishC Forum)

Views: 10682 | Replies: 8

Help with a web scraper

Posted on 2020-11-6 16:13:54

I've been learning web scraping recently. A few days ago I wrote a simple scraper that fetched data without problems, but when I ran it today the returned data was empty — the HTML response just contains this message: "<h1><strong>请开启JavaScript并刷新该页.</strong></h1>" ("Please enable JavaScript and refresh this page."). Is this the site's anti-scraping kicking in?
Want to know what 小甲鱼 has been up to lately? Visit -> ilovefishc.com

Posted on 2020-11-6 17:44:35
Not necessarily — post your code and we'll take a look.

OP | Posted on 2020-11-6 17:53:15
import requests
import csv
import time
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3'}
pre_url = 'https://shenzhen.qfang.com/sale/f'

def download(url):
    # Fetch a page and parse it into an lxml element tree.
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'
    print(html.status_code)
    print(html.text)       # debug: inspect the raw response
    time.sleep(2)          # throttle requests
    return etree.HTML(html.text)

def data_writer(item):
    # Append one row to the CSV; newline='' prevents blank rows on Windows.
    with open('qfang1.csv', 'a', encoding='utf-8', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(item)

def spider(list_url):
    selector = download(list_url)
    house_list = selector.xpath('/html/body/div[5]/div/div[1]/div[4]/ul/li')
    for house in house_list:
        apartment = house.xpath("div[2]/div[1]/a/text()")[0]
        house_layout = house.xpath("div[2]/div[2]/p[1]/text()")[0]
        area = house.xpath("div[2]/div[2]/p[2]/text()")[0]
        region = house.xpath("div[2]/div[2]/p[6]/text()")[0]
        total_price = house.xpath("div[3]/p[1]/span[1]/text()")[0]
        # Follow the listing's link to its detail page.
        house_url = ('https://shenzhen.qfang.com'
                     + house.xpath('div[2]/div[1]/a/@href')[0])
        sel = download(house_url)
        time.sleep(1)
        house_years = sel.xpath('//*[@id="scrollto-1"]/div[3]/ul/li[3]/div[2]/text()')
        mortgage_info = sel.xpath('//*[@id="scrollto-1"]/div[3]/ul/li[5]/div[2]/text()')
        item = [apartment, house_layout, area, region,
                total_price, house_years, mortgage_info]
        print("Scraping", apartment)
        data_writer(item)

if __name__ == '__main__':
    for i in range(1, 2):
        spider(pre_url + str(i))



Posted on 2020-11-6 18:05:16
If you don't post your code
or the URL you're trying to scrape,
it's hard to tell what's going on.



Posted on 2020-11-10 14:58:14
I took a quick look — your code should be fine. After analyzing the site's requests, I confirmed the site uses cookie-based anti-scraping.
The browser's request flow is: first request https://shenzhen.qfang.com/sale/f1 => the server checks for a cookie named wzws_cid; if it's missing or expired, the response body is that "请开启JavaScript并刷新该页." page => the browser then automatically requests https://shenzhen.qfang.com/WZWSREL3NhbGUvZjE=? (a URL carrying an encrypted request parameter), which sets the cookie => finally it redirects back to the original URL, which now renders normally.
You can copy the cookie straight out of your browser. It has a fairly short lifetime, so a short scraping run should be fine; for long-running scraping you'd need to reverse the site's JavaScript to work out how the encrypted parameter in the https://shenzhen.qfang.com/WZWSREL3NhbGUvZjE=? request is generated.
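The short-term approach described above — reusing the cookie copied from the browser — might be sketched like this. This is only a sketch under the analysis above: the wzws_cid value is a placeholder you'd replace with the real one from your browser's devtools, and the helper name download_with_cookie is mine, not from the original code:

```python
import requests

# Placeholder value: paste the real one from your browser
# (DevTools -> Application -> Cookies -> shenzhen.qfang.com).
COOKIES = {'wzws_cid': 'PASTE_VALUE_FROM_BROWSER'}

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def download_with_cookie(url, cookies=COOKIES):
    """Fetch a page while sending the anti-bot cookie copied from a browser."""
    resp = requests.get(url, headers=HEADERS, cookies=cookies, timeout=10)
    resp.encoding = 'utf-8'
    # If the cookie has expired, the anti-bot page comes back again.
    if '请开启JavaScript并刷新该页' in resp.text:
        raise RuntimeError('wzws_cid expired; copy a fresh value from the browser')
    return resp.text
```

Since the cookie is short-lived, this only covers short runs; anything long-running still needs the JS-reversing step described above.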

鱼C工作室 ( 粤ICP备18085999号-1 | 粤公网安备 44051102000585号)

Powered by Discuz! X3.4 © 2001-2023 Discuz! Team.