
wangyinghan

https://fishc.com.cn/?518563

723

Viewed 374 times, posted 2018-7-23 21:49

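Regular expression: a string made up of ordinary characters and special symbols that describes a pattern, so that a single expression can match a whole family of strings sharing similar features.

Python supports regular expressions through the re module.

1: The simplest case is a literal: foo matches foo, and abc123 matches abc123.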

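2: Special symbols:

foo|bar: matches foo or bar

.: matches any single character except \n; each dot stands for exactly one character

^: matches the start of the string

$: matches the end of the string

*: matches the preceding expression zero or more times

+: matches the preceding expression one or more times

?: matches the preceding expression zero or one time

{n}: matches the preceding expression exactly n times

{m,n}: matches the preceding expression from m to n times

[...]: matches any single character from the set, e.g. [aeiou]

[x-y]: matches any single character in the range x to y, e.g. [5-6]

[^...]: matches any single character not in the set or range, e.g. [^aeiou], [^8-9]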
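As a quick check of the symbols listed above, here is a minimal sketch using only the standard re module. The sample strings and variable names are made up for illustration.

```python
import re

# Alternation: either literal may match.
alternation = re.findall(r"foo|bar", "foo bar baz")

# "." matches one character, but not a newline.
any_char = re.findall(r"b.t", "bat b\nt bit")

# Anchors tie the match to the start or end of the string.
starts_with_foo = re.search(r"^foo", "foobar") is not None
ends_with_foo = re.search(r"foo$", "foobar") is not None

# Quantifiers apply to the expression right before them.
star_allows_zero = re.fullmatch(r"ab*", "a") is not None
plus_needs_one = re.fullmatch(r"ab+", "a") is not None
optional_u = re.findall(r"colou?r", "color colour")
exactly_three = re.findall(r"[0-9]{3}", "12 345 6789")
two_to_three = re.findall(r"a{2,3}", "a aa aaa aaaa")

# Character sets, ranges and negated sets.
vowels = re.findall(r"[aeiou]", "regex")
in_range = re.findall(r"[5-6]", "4567")
not_vowels = re.findall(r"[^aeiou]", "regex")
```

Note that {2,3} is greedy: on "aaaa" it takes three characters and leaves the last one unmatched.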

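3: Special characters:

\d: matches any decimal digit; for ASCII text this is the same as [0-9]

\w: matches any alphanumeric character or underscore; for ASCII text this is the same as [A-Za-z0-9_]

\s: matches any whitespace character

\b: matches a word boundary, e.g. \bThe\b

To match a literal dot, escape it with a backslash: \.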
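The special characters can be tried out the same way. This is a small sketch with an invented sample sentence, not something from the original notes.

```python
import re

text = "The cat, Then 42 dogs. The end"

digits = re.findall(r"\d+", text)          # runs of decimal digits
words = re.findall(r"\w+", text)           # runs of word characters
space_count = len(re.findall(r"\s", text)) # single whitespace characters

# \b marks a word boundary, so \bThe\b skips "Then" but \bThe does not.
whole_word = re.findall(r"\bThe\b", text)
word_prefix = re.findall(r"\bThe", text)

# An unescaped dot matches any character; an escaped dot only a literal ".".
any_dot = re.findall(r".", "a.b")
literal_dot = re.findall(r"\.", "a.b")
```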

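\bThe: any word starting with The

b[aeiou]t: bat, bet, bit, bot, but

[45][56]: 45, 46, 55, 56

[dn]ot?: the letter d or n, followed by an o, followed by at most one t (do, no, dot, not)

[0-9]{15,16}: 15 or 16 digits

[KQRBNP][a-h][1-8]-[a-h][1-8]: a chess move (plain moves only) from one square to another, with squares running from a1 to h8

\d{3}-\d{3}-\d{4}: a US phone number written with hyphens, e.g. 800-555-1212

\w+@\w+\.com: an email address of the form xxx@yyy.com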
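The example patterns above can be checked against a few sample strings. This is an illustrative sketch; the test strings (such as the move Ke1-e2) are invented here.

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")
email = re.compile(r"\w+@\w+\.com")
chess_move = re.compile(r"[KQRBNP][a-h][1-8]-[a-h][1-8]")

# fullmatch requires the whole string to fit the pattern.
phone_with_hyphens = phone.fullmatch("800-555-1212") is not None
phone_without_hyphens = phone.fullmatch("8005551212") is not None  # hyphens are required
email_ok = email.fullmatch("xxx@yyy.com") is not None
email_wrong_tld = email.fullmatch("xxx@yyy.org") is not None
move_ok = chess_move.fullmatch("Ke1-e2") is not None
move_off_board = chess_move.fullmatch("Ke1-i9") is not None

bvt_words = re.findall(r"b[aeiou]t", "bat bet bit bot but bxt")
dn_words = re.findall(r"[dn]ot?", "do no dot not")
```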

Using parentheses to specify groups:

\d+(\.\d*)?: one or more decimal digits, optionally followed by a decimal point and zero or more further digits
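To see what the group captures, here is a short sketch. group(0) is the whole match and group(1) is the parenthesized part, which is None when the optional group did not take part in the match.

```python
import re

number = re.compile(r"\d+(\.\d*)?")

with_fraction = number.fullmatch("3.14")
integer_only = number.fullmatch("42")
trailing_point = number.fullmatch("7.")

whole = with_fraction.group(0)            # the entire match
fraction = with_fraction.group(1)         # the optional ".digits" part
no_fraction = integer_only.group(1)       # group did not participate
bare_point = trailing_point.group(1)      # \d* may match zero digits
```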

The complete code for scraping the Maoyan movie leaderboard is as follows:

import json
import re
import time

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one page and return its HTML, or None on any failure."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/53.0.2785.116 Safari/537.36',
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per movie found in the page HTML."""
    # Each movie sits in its own <dd> block spanning several lines,
    # so re.S is needed to let "." match newlines.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0], 'image': item[1], 'title': item[2],
            # The slices skip a fixed-length label at the start of each field.
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6],
        }


def write_to_file(content):
    """Append one record to result.txt as a line of JSON."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed, nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # pause between pages

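Notes

1: Scraping analysis

2: Fetch the first page. We implement a get_one_page method that takes a url parameter and returns the fetched page content. Calling it from main() gives us the source of the first page, which we then parse to extract the information.

3: Regex extraction. Typically you inspect the source in the browser's Network > Preview panel and derive the regular expression from it. A first pass such as

def parse_one_page(html):
    # pattern is the compiled expression from the full listing
    items = re.findall(pattern, html)
    print(items)

is enough to pull out the required information.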
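The effect of re.findall together with re.S can be seen on a small piece of markup. The HTML below is a hypothetical, heavily simplified stand-in for the real page, and the pattern is shortened to two capture groups to match it.

```python
import re

# Hypothetical sample markup; the real page carries more fields per entry.
sample_html = """
<dd>
  <i class="board-index">1</i>
  <p class="name"><a href="/films/1">Example Film</a></p>
</dd>
<dd>
  <i class="board-index">2</i>
  <p class="name"><a href="/films/2">Another Film</a></p>
</dd>
"""

expression = r'<dd>.*?board-index.*?>(\d+)</i>.*?name"><a.*?>(.*?)</a>.*?</dd>'

# With re.S, "." also matches newlines, so one match can span a whole <dd> block.
items = re.findall(re.compile(expression, re.S), sample_html)

# Without re.S, "." stops at the first newline and nothing matches.
items_without_dotall = re.findall(expression, sample_html)
```

With several capture groups, findall returns a list of tuples, one tuple per matched entry, which is why the full code indexes item[0], item[1] and so on.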

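4: Write to file. The extracted results are written to a file, using the dumps() method of the json library to serialize each dict. Passing ensure_ascii=False keeps the output as readable Chinese text instead of Unicode escape sequences.

def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        print(type(json.dumps(content)))  # dumps() returns a str
        f.write(json.dumps(content, ensure_ascii=False) + '\n')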
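The difference that ensure_ascii makes is easy to see on a single record. This is a small sketch; the record is example data made up for illustration.

```python
import json

# Example record with a Chinese title (illustrative data only).
record = {"index": "1", "title": "霸王别姬"}

escaped = json.dumps(record)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # characters are kept as-is

# Both strings decode back to the same dict.
round_trip_equal = json.loads(escaped) == json.loads(readable) == record
```

Because the readable form contains non-ASCII characters, the file should be opened with encoding='utf-8', as write_to_file does.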

5: Putting the code together

Finally, a main() method calls the functions above and writes one page of movies to the file.

def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    for item in parse_one_page(html):
        write_to_file(item)

6: Paginated scraping. We want the top 100, so we pass an offset parameter in the link to fetch the remaining 90 movies.

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)

main() also has to be changed to accept an offset and build the URL from it:

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)
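To make the pagination concrete, the ten URLs can be generated without touching the network. build_url is a hypothetical helper introduced here for illustration; the original code builds the string inline in main().

```python
BASE_URL = 'http://maoyan.com/board/4'


def build_url(offset):
    """Return the leaderboard URL for a given offset (hypothetical helper)."""
    return BASE_URL + '?offset=' + str(offset)


# Ten pages of ten movies each: offsets 0, 10, ..., 90.
urls = [build_url(i * 10) for i in range(10)]
```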
