FishC Forum (鱼C论坛)


[Solved] Scraping Taobaoke (淘宝客) never succeeds

Posted 2017-4-16 16:18:42

import urllib.request,http.cookies,http.cookiejar
import os
import re
import gzip

def ungzip(data):
    try:
        print('Decompressing.....')
        data = gzip.decompress(data)
        print('Decompression finished!')
    except:
        print('Not compressed, nothing to decompress')
    return data

def url_open(url):
    req = urllib.request.Request(url,header)
    response = urllib.request.urlopen(url)
    html = response.read()


header={
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Cache-Control":"max-age=0",
    "Connection":"keep-alive",
    "Cookie":"t=27d236359f46ea30e60db59c54ca2646; UM_distinctid=15b7195dc3ee99-0595dcceecedbe-31437652-1fa400-15b7195dc486fbe; account-path-guide-s1=true; pub-message-center=1; cookie2=68d900c3e873043561fb40c8b75c0bfe; v=0; _tb_token_=OMIt2M8EtsXq; cookie32=e9905dcb39a4085fdd37062f8f590fa8; cookie31=NDUyNjM3MDEsJUU2JUIxJTlGJUU1JTlDJUEzJUU1JUJFJUI3LDU2NDc0MjQzOUBxcS5jb20sVEI%3D; alimamapwag=TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzUzLjAuMjc4NS4xMDQgU2FmYXJpLzUzNy4zNiBDb3JlLzEuNTMuMjU5NS40MDAgUVFCcm93c2VyLzkuNi4xMDg3Mi40MDA%3D; login=VFC%2FuZ9ayeYq2g%3D%3D; alimamapw=QHQhQyQgFXZyFnENEnEHQyIGbAZQAlFSUQIEA1INAAoABQcEBgVSVVYGAQICVQQJU1ZQ; cna=VlPIEAQQlAMCAbZWgx6zO5uo; l=Anh4kjqD1bYkHF5trROs0AtryCwK4dxr; isg=AmBg35RPLFCL65BVmwqrIFS4JG58kEQzWH9X_dpxLHsO1QD_gnkUwzbnG_ow",
    "Host":"pub.alimama.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.2595.400 QQBrowser/9.6.10872.400",
    }
'''
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Cookie:t=27d236359f46ea30e60db59c54ca2646; UM_distinctid=15b7195dc3ee99-0595dcceecedbe-31437652-1fa400-15b7195dc486fbe; account-path-guide-s1=true; pub-message-center=1; cookie2=68d900c3e873043561fb40c8b75c0bfe; v=0; _tb_token_=OMIt2M8EtsXq; cookie32=e9905dcb39a4085fdd37062f8f590fa8; cookie31=NDUyNjM3MDEsJUU2JUIxJTlGJUU1JTlDJUEzJUU1JUJFJUI3LDU2NDc0MjQzOUBxcS5jb20sVEI%3D; alimamapwag=TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgNi4xOyBXT1c2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzUzLjAuMjc4NS4xMDQgU2FmYXJpLzUzNy4zNiBDb3JlLzEuNTMuMjU5NS40MDAgUVFCcm93c2VyLzkuNi4xMDg3Mi40MDA%3D; login=VFC%2FuZ9ayeYq2g%3D%3D; alimamapw=QHQhQyQgFXZyFnENEnEHQyIGbAZQAlFSUQIEA1INAAoABQcEBgVSVVYGAQICVQQJU1ZQ; cna=VlPIEAQQlAMCAbZWgx6zO5uo; l=Anh4kjqD1bYkHF5trROs0AtryCwK4dxr; isg=AmBg35RPLFCL65BVmwqrIFS4JG58kEQzWH9X_dpxLHsO1QD_gnkUwzbnG_ow
Host:pub.alimama.com
Referer:http://pub.alimama.com/promo/item/channel/index.htm?spm=a219t.7664554.1998457203.146.cnd11H&channel=9k9
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.2595.400 QQBrowser/9.6.10872.400
'''

url='http://pub.alimama.com/promo/search/index.htm?'
html=url_open(url)
html=ungzip(html)
print('1')
html=html.decode('utf-8','ignore')
print('1')
print(html)


Is this a cookie problem, or something else?
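On the decompression step in the posted code: the bare except hides every error, so a failed decompression and an uncompressed body look the same. A minimal sketch of one alternative, assuming the body is either gzip-compressed or plain, is to check the two-byte gzip signature before decompressing. The constant name GZIP_MAGIC is an illustrative choice. A body sent with deflate encoding is not handled here and is returned unchanged.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def ungzip(data: bytes) -> bytes:
    """Return the decompressed bytes if data is gzip-compressed, else data unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```

With this version a corrupt gzip body raises an exception instead of being silently treated as uncompressed.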
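Best answer
2017-4-16 17:49:25
Last edited by ooxx7788 on 2017-4-16 17:55
    return html

Add the line above at line 19, at the end of url_open. The function has no return value, so the caller gets None.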
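Besides the missing return, two other details in the posted url_open may matter. The header dict is passed as the second positional argument of Request, which is data (the request body), and urlopen is called with the bare URL, so req is never used and none of the headers, including the Cookie, are sent. A minimal sketch of a version that sends them follows. The helper build_request and the headers parameter are illustrative names, not part of the original post.

```python
import urllib.request


def build_request(url, headers):
    # headers= must be passed by keyword: the second positional
    # parameter of Request is `data`, i.e. the request body.
    return urllib.request.Request(url, headers=headers)


def url_open(url, headers):
    req = build_request(url, headers)
    # Pass the Request object rather than the bare URL,
    # otherwise the headers are never sent.
    with urllib.request.urlopen(req) as response:
        return response.read()  # return the raw bytes to the caller
```

It would be called as url_open(url, header) with the header dict from the question. Because that dict advertises gzip in Accept-Encoding, the returned bytes may still need decompressing before decoding.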

Posted 2017-4-16 17:11:17
Taobao has probably put restrictions in place.


OP | Posted 2017-4-16 20:26:34
ooxx7788 posted on 2017-4-16 17:49
Add the line above at line 19! The function has no return value.

It still doesn't work, though.
<!DOCTYPE html>
<html>
<head>
  <meta name="data-spm" content="a219t" />
  <meta name="aplus-ajax" content="46807174">
  <title>淘宝联盟</title>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width,initial-scale=1.0, user-scalable=no"/>
  <meta name="renderer" content="webkit">
  <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>
  <link rel="stylesheet" href="//g.alicdn.com/thx/cube/1.2.1/neat.css">
  <link rel="stylesheet" href="//g.alicdn.com/thx/minecraft-animation/20151106.161602.306/css/animate-min.css">
  <link rel="stylesheet" href="//g.alicdn.com/mm/pubplus/0.3.10/style/main.css">
  <script src="//g.alicdn.com/thx/brix-release/1.0.0-beta.9/require-config-debug.js"></script>
  <script src="//g.alicdn.com/mm/pubplus/0.3.10/app/aliww.js"></script>
  <script src="//g.alicdn.com/crm/anywhere/1.0.88/lib/include.js"></script>
  <!-- 安全监控 -->
  <script src="//g.alicdn.com/secdev/pointman/js/index.js" app="union-pub"></script>
  <!--[if lte IE 7]>
  <script src="//g.alicdn.com/mm/pubplus/0.3.10/app/exts/ieupdate/ieupdate.js"></script>
  <![endif]-->
</head>
<body data-spm="7900221"><script>
with(document)with(body)with(insertBefore(createElement("script"),firstChild))setAttribute("exparams","category=&userid=&aplus&yunid=&&trid=&asid=AQAAAACeXPNYVYe5cQAAAACI0O+FcR2nUw==",id="tb-beacon-aplus",src=(location>"https"?"//g":"//g")+".alicdn.com/alilog/mlog/aplus_v2.js")
</script>

  <script src="//g.alicdn.com/mm/pubplus/0.3.10/app/boot.js"></script>
</body>
</html>

This is the HTML I get back; I can't get in.
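The response above looks like a page shell rather than a results page: its body contains only script tags (ending with boot.js) and no visible text, so the item data is presumably loaded by JavaScript after the page loads, which urllib does not execute. A minimal sketch for detecting that situation with the standard library follows, assuming this reading of the output is right. ShellPageProbe and looks_like_js_shell are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ShellPageProbe(HTMLParser):
    """Collects external script URLs and visible text found inside <body>."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.body_text = []
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
            src = dict(attrs).get("src")
            if tag == "script" and src:
                self.script_srcs.append(src)

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text a reader would see: inside <body>, outside scripts.
        if self._in_body and not self._skip_depth and data.strip():
            self.body_text.append(data.strip())


def looks_like_js_shell(html):
    """True when the page loads scripts but has no visible body text."""
    probe = ShellPageProbe()
    probe.feed(html)
    return bool(probe.script_srcs) and not probe.body_text
```

If the check returns True, the data is not in this HTML at all, so changing cookies or headers on this one request is unlikely to make it appear.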