鱼C论坛 (FishC Forum)

Views: 988 | Replies: 3

[Solved] Scraping cookies

Posted on 2020-9-16 22:31:26

I want to scrape a video site, and its response headers include something called Set-Cookie. What is it for, and how do I scrape what's inside this Set-Cookie so that I can set my own cookies?
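A minimal sketch of how the Set-Cookie contents can be inspected with requests, assuming the page can be fetched with a plain GET; the URL below is a placeholder, not the actual site:

import requests

# Placeholder URL; substitute the actual video page.
r = requests.get('https://example.com/')

# Raw Set-Cookie header(s) as the server sent them
# (requests folds multiple Set-Cookie headers into one string).
print(r.headers.get('Set-Cookie'))

# The same cookies, already parsed into a cookie jar.
for name, value in r.cookies.items():
    print(name, '=', value)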

Posted on 2020-9-17 09:04:54
The Set-Cookie response header is how the server plants a cookie in the browser. Once the cookie has been set, the browser automatically attaches it whenever it visits a URL that the cookie applies to.
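Outside a browser, a requests.Session reproduces that behavior: whatever Set-Cookie plants ends up in the session's cookie jar and is re-sent on later requests to the same site. A minimal sketch with a placeholder URL:

import requests

s = requests.Session()

# First request: the server's Set-Cookie headers are stored in the session's cookie jar.
s.get('https://example.com/')
print(dict(s.cookies))

# A later request through the same session carries those cookies automatically,
# just like a browser revisiting the site.
s.get('https://example.com/some/page')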

Thread starter | Posted on 2020-9-17 12:05:33
弱弱的佳佳 posted on 2020-9-17 09:04:
The Set-Cookie response header is how the server plants a cookie in the browser. Once the cookie has been set, whenever the browser visits a URL that ...

Then how do I know which URLs the cookie applies to?
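A cookie's own Domain and Path attributes are what decide which URLs it is sent back to; a minimal sketch that prints them, again against a placeholder URL:

import requests

r = requests.get('https://example.com/')

# Each parsed cookie records the domain and path it applies to; it is only
# sent with requests to URLs under that domain and path.
for cookie in r.cookies:
    print(cookie.name, cookie.domain, cookie.path)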

Posted on 2020-9-17 12:20:51 | This post is the best answer
Set-Cookie: the server tells the browser to store the cookie and send it back on the next visit. Examples below:
# 1) urllib: save cookies with MozillaCookieJar
import http.cookiejar
import urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
r = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# --------------------------------------------------------------------------------
# 2) urllib: save and load with LWPCookieJar
import http.cookiejar
import urllib.request

# Save
filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
r = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Load
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
r = opener.open('http://www.baidu.com')
print(r.read().decode('utf-8'))

# --------------------------------------------------------------------------------
# 3) requests: save and load
# Save
import requests

r = requests.get('https://www.baidu.com')
with open('cookie.txt', 'w') as f:
    for k, v in r.cookies.items():
        print(k, '=', v)
        f.write(k + '#' + v + '\n')   # one "name#value" pair per line

# Load from the text file
import requests
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
with open('cookie.txt', 'r') as f:
    for item in f.readlines():
        k, v = item.strip().split('#', 1)
        jar.set(k, v)
r = requests.get('https://www.baidu.com', cookies=jar)
print(r.status_code)

# --------------------------------------------------------------------------------
# 4) selenium: save and load (assumes an existing `driver` and a target `url`)
import pickle
import time

# Save as pickle
driver.get(url)
time.sleep(10)
pickle.dump(driver.get_cookies(), open("cookies.pkl", "wb"))

# Save as text
with open('cookie.txt', 'w') as f:
    for item in driver.get_cookies():
        data = item['name'] + '#' + item['value'] + '\n'
        f.write(data)

# --------------------------------------------------------------------------------
# Load from pickle
cookies = pickle.load(open("cookies.pkl", "rb"))
for cookie in cookies:
    driver.add_cookie(cookie)
driver.get(url)

# Load from text: see the requests example above (a selenium version is sketched below)
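A minimal sketch of that text-file load for selenium, assuming the same name#value lines written by the save-as-text example above and an already-created driver and url:

# Assumes the `name#value` lines written by the save-as-text example above,
# plus an existing selenium `driver` and target `url`.
driver.get(url)                  # must be on the cookie's domain before add_cookie
with open('cookie.txt', 'r') as f:
    for line in f:
        name, value = line.strip().split('#', 1)
        driver.add_cookie({'name': name, 'value': value})
driver.get(url)                  # reload so the added cookies take effect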