我都还没出力 posted on 2020-4-29 23:23:54

How do I fix dashes showing up as &#8211; when scraping titles with a regular expression?

import re
import requests

link="http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}

r=requests.get(link,headers=headers)
html=r.text

comment=re.compile(r'<h1 class="post-title"><a href=.*?>(.*?)</a>',flags=re.DOTALL)
title_list=comment.findall(html)
for each in title_list:
    print(each)

永恒的蓝色梦想 posted on 2020-4-29 23:24:55

import re
import requests

link="http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}

r=requests.get(link,headers=headers)
html=r.text

comment=re.compile(r'<h1 class="post-title"><a href=.*?>(.*?)</a>',flags=re.DOTALL)
title_list=comment.findall(html)
for each in title_list:
    print(each.replace('&#8211;',' '))

This will do it.

zltzlt posted on 2020-4-30 13:11:21

import re
import requests

link = "http://www.santostang.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}

r = requests.get(link, headers=headers)
html = r.text

comment = re.compile(r'<h1 class="post-title"><a href=.*?>(.*?)</a>', flags=re.DOTALL)
title_list = comment.findall(html)
for each in title_list:
    print(each.replace('&#8211;', chr(8211)))  # chr(8211) is the en dash character
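
The replacement above handles one specific reference, decimal 8211 (the en dash). As a rough sketch of how the same idea could be generalized, the snippet below decodes any decimal numeric character reference of the form `&#NNNN;`. The helper name `decode_numeric_refs` and the sample title are made up for illustration and do not come from the thread; named references such as `&amp;` and hexadecimal ones such as `&#x2013;` are deliberately left untouched.

```python
import re

# Hypothetical helper (not from the thread): turn every decimal numeric
# character reference like "&#8211;" into the character it stands for.
NUMERIC_REF = re.compile(r'&#(\d+);')


def decode_numeric_refs(text):
    # chr() maps the code point number to its character; this sketch does not
    # guard against numbers outside the valid Unicode range.
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


# Made-up sample title standing in for one scraped entry.
sample = 'Chapter 1 &#8211; Getting started'
print(decode_numeric_refs(sample))
```

This only illustrates what the numeric reference means; it is not a full HTML entity decoder.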

kaohsing posted on 2020-4-30 14:49:21

我都还没出力 posted on 2020-4-30 17:16:08

kaohsing posted on 2020-4-30 14:49

So it needs to be decoded with unescape? Thanks!
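
The body of kaohsing's post is not preserved here, so the exact code is unknown. Assuming the unescape being referred to is `html.unescape` from the Python standard library, a minimal sketch might look like the following. The `sample_html` string is a made-up stand-in for `r.text` so the example runs without a network request.

```python
import html
import re

# Made-up stand-in for r.text; the real page content is not reproduced here.
sample_html = (
    '<h1 class="post-title">'
    '<a href="http://example.com/1">Chapter 1 &#8211; Intro</a>'
    '</h1>'
)

# Same pattern as in the question.
pattern = re.compile(r'<h1 class="post-title"><a href=.*?>(.*?)</a>', flags=re.DOTALL)

# html.unescape converts named and numeric character references
# (for example &amp; or &#8211;) into the corresponding characters.
titles = [html.unescape(title) for title in pattern.findall(sample_html)]
for title in titles:
    print(title)
```

Compared with replacing one reference by hand, this covers any other references that may appear in the titles as well.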

我都还没出力 posted on 2020-4-30 17:18:03

永恒的蓝色梦想 posted on 2020-4-29 23:24
This will do it.

Haha

我都还没出力 posted on 2020-4-30 17:18:35

zltzlt posted on 2020-4-30 13:11

Thanks!