A small problem with a web scraper

ianv · posted on 2015-11-29 23:07:04

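While scraping the "优美图" site, I ran into a problem with regular expressions.
The image markup on the site looks like this:

<img id="item_d_56716098" class="img ui-draggable" alt="1M" width="245px" height="200px" src="http://fc.topitme.com/c/fb/fc/114846745695efcfbcm.jpg" style="position: relative;">

I want to extract the image URL with a regular expression. When I write the pattern as

p=r'width="245px" height="200px" src="(.*?)"'

the URL is extracted, but when I use

p=r'width="245px" height="200px" src="(.*?)" style="position: relative;">'

nothing is matched. What is the reason? Thanks. The complete code of the scraper is: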

import urllib.request as u
import re
import os

# Create the download directory and work inside it
os.mkdir("youmei")
os.chdir("youmei")

def open_url(url):
    # Fetch the page with a browser-like User-Agent and return it as text
    req=u.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 UBrowser/5.5.7386.17 Safari/537.36')
    r=u.urlopen(req)
    html=r.read().decode('utf-8')
    return html

def get_img(html):
    # This is the pattern that finds nothing
    p=r'width="245px" height="200px" src="(.*?)" style="position: relative;">'
    imglist=re.findall(p,html,re.S)
    for each in imglist:
        # Use the last path segment of the URL as the file name
        filename=each.split('/')[-1]
        u.urlretrieve(each,filename,None)

if __name__=='__main__':
    url='http://www.topit.me/pop'
    get_img(open_url(url))


hldh214 · posted on 2015-11-29 23:07:05

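1. OP, the image tags in the page at your URL do not look like the one you posted. They look more like this:

<img id="item_d_57248120" class="img" alt="1M" width="245px" height="200px" src="http://f3.topitme.com/3/34/65/114942719489a65343m.jpg" />

There is no style attribute after src, so of course your second pattern cannot match.
2. I recommend precompiling the regular expression with re.compile(), so the pattern does not have to be compiled on every search (the re module caches recently used patterns, so the gain is usually small).
3. I recommend lookaround syntax (lookahead and lookbehind), which can make the pattern more readable. See the documentation for the exact syntax, or Google it.
4. Pay some attention to the difference between greedy and lazy matching in regular expressions.
My modified code is attached below. Let's learn and improve together~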

import urllib.request as u
import re
import os
import time

#os.mkdir("youmei")
#os.chdir("youmei")
# Only create the directory if it does not exist yet
if not os.path.exists('youmei'):
    os.mkdir('youmei')
os.chdir('youmei')

# Precompile the regular expression
re_get_img = re.compile(r'(?<=width="245px" height="200px" src=").*?(?=" /></a></div>)')

def open_url(url):
    req=u.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 UBrowser/5.5.7386.17 Safari/537.36')
    r=u.urlopen(req)
    html=r.read().decode('utf-8')
    return html

def get_img(html):
    #p=r'width="245px" height="200px" src="(.*?)" style="position: relative;">'
    #imglist=re.findall(p,html,re.S)
    #for each in imglist:
    #    filename=each.split('/')[-1]
    #    u.urlretrieve(each,filename,None)
    img_list = re_get_img.findall(html)
    for each in range(len(img_list)):
        print('Downloading image %d...' % (each + 1))
        try:
            img_file = u.urlopen(img_list[each])
            # I write the data with plain open() (a habit picked up from 小甲鱼's course~).
            # urllib.request.urlretrieve() is a great method and, in my experience,
            # far more efficient than open(); lesson learned~
            with open(str(each) + '.jpg', 'wb') as f:
                f.write(img_file.read())
        except Exception:
            print('Image %d failed to download!' % (each + 1))
        # Pause between requests
        time.sleep(3)

if __name__=='__main__':
    url='http://www.topit.me/pop'
    get_img(open_url(url))
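
The mismatch described in point 1 can be reproduced offline. The sketch below runs the two patterns from the question against the two tags quoted in this thread; the variable names are illustrative and do not come from the original posts. One plausible explanation for the difference, which the thread does not confirm, is that the browser's inspector shows the tag after page scripts have added the "ui-draggable" class and the style attribute, while urllib receives the HTML before any script runs.

```python
import re

# Tag as shown in the question (likely copied from the browser's inspector).
tag_in_question = ('<img id="item_d_56716098" class="img ui-draggable" alt="1M" '
                   'width="245px" height="200px" '
                   'src="http://fc.topitme.com/c/fb/fc/114846745695efcfbcm.jpg" '
                   'style="position: relative;">')

# Tag as it appears in the downloaded HTML, according to the reply.
tag_in_response = ('<img id="item_d_57248120" class="img" alt="1M" '
                   'width="245px" height="200px" '
                   'src="http://f3.topitme.com/3/34/65/114942719489a65343m.jpg" />')

# The two patterns from the question.
short_pattern = r'width="245px" height="200px" src="(.*?)"'
long_pattern = r'width="245px" height="200px" src="(.*?)" style="position: relative;">'

results = {}
for name, tag in (("question", tag_in_question), ("response", tag_in_response)):
    results[name] = {
        "short": re.findall(short_pattern, tag, re.S),
        "long": re.findall(long_pattern, tag, re.S),
    }
    print(name, results[name])
```

The long pattern only matches the tag that carries the style attribute, so it finds nothing in the HTML that urllib downloads. When a pattern unexpectedly returns an empty list, printing a slice of the downloaded html around 'src=' is a quick way to see what the server actually sent.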

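Points 2 to 4 can be illustrated with a small self-contained comparison. This is only a sketch: the sample HTML and its example.com URLs are made-up placeholders, and the lookahead here stops at the closing quote instead of the longer '" /></a></div>' suffix used in the code above.

```python
import re

# Hypothetical sample: two image tags on a single line.
html = ('<img width="245px" height="200px" src="http://example.com/a.jpg" />'
        '<img width="245px" height="200px" src="http://example.com/b.jpg" />')

prefix = r'width="245px" height="200px" src="'

# Lazy quantifier with a capturing group: stops at the first closing quote.
lazy = re.compile(prefix + r'(.*?)"')

# Greedy quantifier: runs on to the last closing quote on the line.
greedy = re.compile(prefix + r'(.*)"')

# Lookbehind and lookahead: the match itself is the URL, no group needed.
# Python requires the lookbehind to have a fixed width, which a literal prefix has.
lookaround = re.compile(r'(?<=' + prefix + r').*?(?=")')

lazy_urls = lazy.findall(html)
greedy_urls = greedy.findall(html)
lookaround_urls = lookaround.findall(html)

print(lazy_urls)
print(greedy_urls)
print(lookaround_urls)
```

The lazy and lookaround versions each return the two URLs separately, while the greedy version returns a single string that swallows everything between the first src=" and the last quote. Compiling each pattern once at module level also keeps the pattern definition in one place.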

ianv · posted on 2015-11-29 23:08:27

@～风介～

～风介～ · posted on 2015-11-29 23:52:56

ianv posted on 2015-11-29 23:08
@～风介～

I'm embarrassed to admit it, but I'm not that good with regular expressions either~

@小甲鱼

yjp369 · posted on 2015-11-30 11:40:05

what

xiao_xiao · posted on 2015-12-10 09:05:50

@鱼神 could you take a look at this? The rest of us can learn from it too (*^__^*) hehe...

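ianv · posted on 2015-12-14 23:12:03

hldh214 posted on 2015-11-29 23:07
1. OP, the image tags in the page at your URL do not look like the one you posted; they look more like this, so of course your pattern cannot match
2. I recommend that OP use r ...

Thank you so much!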
