How do I match hyperlinks in a web page with a regular expression?

aptdo · posted 2016-2-1 13:38:06

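Hi fellow forum members, there is a web page, http://www.researchmfg.com/plastic-injection/, and I want to crawl all of its articles. I'd like to use a regular expression to match every hyperlink under the <h4> headings, but I don't know how to write it. Could someone write one so I can learn from it? Thanks~

hldh214 · posted 2016-2-1 13:58:05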

Match result:

C:\Python34\python.exe E:/python/tmp.py
['http://www.researchmfg.com/2010/07/thermo-plastics/', 'http://www.researchmfg.com/2010/07/plastic-rheological-property/', 'http://www.researchmfg.com/2010/07/3-elements-plastic-injection/', 'http://www.researchmfg.com/2010/07/plastic-forming-step/', 'http://www.researchmfg.com/2010/07/plastic-injection-temperature-pressure/', 'http://www.researchmfg.com/2010/07/plastic-injection-screw-speed/', 'http://www.researchmfg.com/2010/07/plastic-injection-time/', 'http://www.researchmfg.com/2010/07/plastic-injection-defect-cause-action/', 'http://www.researchmfg.com/2010/08/plastic-mechanical-properties/', 'http://www.researchmfg.com/2010/08/plastic-mfi/', 'http://www.researchmfg.com/2010/08/7-methods-to-detect-re-grind-resin-plastic/', 'http://www.researchmfg.com/2010/08/plastic-mfi-re-grinding-resin/', 'http://www.researchmfg.com/2010/08/apply-re-grinding-resin-degrade-the-strength/', 'http://www.researchmfg.com/2010/08/plastic-parts-cracking-defect-and-solution/', 'http://www.researchmfg.com/2010/08/the-possibility-reason-for-brittle-plastic-parts/']
Process finished with exit code 0
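The code that produced this output is not preserved in the thread. As a rough sketch only, a pattern of the following shape would pick up links that carry target="_self", which is what the follow-up reply says the first attempt matched. The HTML fragment and the attribute order (href first, then target) are assumptions for illustration, not the real page source.

```python
import re

# Hypothetical fragment standing in for the real page source.
html = '''
<h4>Group A</h4>
<a href="http://www.researchmfg.com/2010/07/thermo-plastics/" target="_self">Thermoplastics</a>
<a href="http://www.researchmfg.com/2010/07/plastic-forming-step/">Forming steps</a>
'''

# Capture the href of every <a> tag whose href is immediately followed by target="_self".
# Assumes href comes first and both attributes use double quotes.
pattern = re.compile(r'<a\s+href="([^"]+)"\s+target="_self"')
links = pattern.findall(html)
print(links)
```

A pattern tied to one attribute like this silently skips any link written without target="_self", so it can undercount.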

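aptdo · posted 2016-2-1 14:06:14

Last edited by aptdo on 2016-2-1 14:08

Thanks, but this doesn't match everything. Only a dozen or so links carry target="_self", while the page has 8 groups and more than 40 articles in total. What I had in mind was to first narrow the scope using the <h4> tags, and then extract all the URLs under each <h4>.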

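aptdo · posted 2016-2-1 14:07:27

hldh214 posted on 2016-2-1 13:58
Match result:

Thanks, but this doesn't match everything. Only a dozen or so links carry target="_self", while the page has 8 groups and more than 40 articles in total. What I had in mind was to first narrow the scope using the <h4> tags, and then extract all the URLs under each <h4>.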
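The idea of scoping by <h4> first can be sketched roughly as follows: split the page into sections that each start at an <h4> heading, then collect the href values inside each section. The sample HTML, the example.com URLs, and the names section_re, href_re, and groups are made up for illustration; the real page markup may differ (for instance, single-quoted attributes would need a different pattern).

```python
import re

# Hypothetical markup; the real page structure is an assumption.
html = '''
<h4>Basics</h4>
<ul>
<li><a href="http://example.com/a/" target="_self">A</a></li>
<li><a href="http://example.com/b/">B</a></li>
</ul>
<h4>Defects</h4>
<p><a class="x" href="http://example.com/c/">C</a></p>
'''

# A section is an <h4> heading plus everything up to the next <h4> or the end of the input.
section_re = re.compile(r'<h4[^>]*>(.*?)</h4>(.*?)(?=<h4|\Z)', re.S | re.I)
# Match href on any <a> tag, regardless of which other attributes are present.
href_re = re.compile(r'<a\b[^>]*?\bhref="([^"]+)"', re.I)

groups = {}
for title, body in section_re.findall(html):
    groups[title.strip()] = href_re.findall(body)

for title, urls in groups.items():
    print(title, urls)
```

Because the href pattern does not depend on target="_self", links with and without that attribute are both collected.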

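hldh214 · posted 2016-2-1 16:31:54

aptdo posted on 2016-2-1 14:07
Thanks, but this doesn't match everything. Only a dozen or so links carry target="_self", while the page has 8 groups and more than 40 articles in total. ...

I didn't read carefully, hehe~

We can take a different route here: first match the content between the two <hr /> tags, then match the URLs inside it.
I'll paste the full code; it uses the requests library.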
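The pasted code is not preserved here. A minimal sketch of the two-step approach described above might look like this, assuming the article list is the only thing between the page's first and last <hr /> tags. The function name links_between_hr and the sample HTML are hypothetical. The post fetched the page with requests; fetching is left out so the sketch runs offline.

```python
import re

def links_between_hr(html):
    """Return the hrefs found between the first and last <hr> tags in html."""
    # Step 1: narrow the search to the region bounded by <hr /> tags.
    region = re.search(r'<hr\s*/?>(.*)<hr\s*/?>', html, re.S | re.I)
    if region is None:
        return []
    # Step 2: pull every href out of that region only.
    return re.findall(r'<a\b[^>]*?\bhref="([^"]+)"', region.group(1), re.I)

# Hypothetical page: navigation and footer links sit outside the <hr /> pair.
sample = '''
<a href="http://example.com/nav/">nav</a>
<hr />
<h4>Group</h4>
<a href="http://example.com/1/" target="_self">one</a>
<a href="http://example.com/2/">two</a>
<hr />
<a href="http://example.com/footer/">footer</a>
'''

found = links_between_hr(sample)
print(found)
```

With requests installed, the input could come from something like requests.get(url).text instead of the sample string.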

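aptdo · posted 2016-2-1 18:00:02

hldh214 posted on 2016-2-1 16:31
I didn't read carefully, hehe~
We can take a different route here: first match the content between the two <hr /> tags, then match the URLs
I'll paste the full code; it uses re ...

Thanks, I'll try it when I get back. The office network requires a proxy, so I can't install requests.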
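If requests cannot be installed, the standard library's urllib.request can fetch the page and can also be pointed at a proxy. The following is only a sketch: the helper names make_opener and fetch_html, the proxy address, and the utf-8 encoding are assumptions, not details from this thread.

```python
import urllib.request

def make_opener(proxy_url=None):
    """Build a urllib opener, optionally routing HTTP(S) traffic through a proxy."""
    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    else:
        # With no argument, ProxyHandler reads the system/environment proxy settings.
        handler = urllib.request.ProxyHandler()
    return urllib.request.build_opener(handler)

def fetch_html(url, proxy_url=None, encoding="utf-8"):
    """Download url and decode it; the encoding is an assumption about the page."""
    opener = make_opener(proxy_url)
    with opener.open(url, timeout=10) as resp:
        return resp.read().decode(encoding, errors="replace")

# Building the opener makes no network request; 127.0.0.1:8080 is a placeholder address.
opener = make_opener("http://127.0.0.1:8080")
```

A call such as fetch_html("http://www.researchmfg.com/plastic-injection/", proxy_url="http://127.0.0.1:8080") would then return the page text to feed into the regular expressions, provided the placeholder is replaced with the office's actual proxy.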

worry921 · posted 2016-2-11 12:24:28

There really are a lot of experts around here.