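Question 1:
Lead-in: I think this has something to do with (?: but I've forgotten the details. When I was matching IP addresses in a scraper, besides the IPs I also got extra bits like 123, 1, 2. I really can't remember how that was solved.
Problem: when scraping the source of Baidu's epidemic page, I extract the per-region data, but the result contains other things besides that data.
So that you don't have to scrape Baidu yourselves, here is the format of the content: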
[{"confirmed":"1","died":"","crued":"1","relativeTime":"1585756800","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","curConfirm":"0","curConfirmRelative":"0","icuDisable":"1","area":"\\u897f\\u85cf","subList":[{"city":"\\u62c9\\u8428","confirmed":"1","died":"","crued":"1","confirmedRelative":"0","curConfirm":"0","cityCode":"100"}]},{"confirmed":"41","died":"","crued":"10","relativeTime":"1585670400","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","curConfirm":"31","curConfirmRelative":"0","cityCode":"2911","icuDisable":"1","area":"\\u6fb3\\u95e8","subList":[]}
My code is:
import requests
import json
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
url = 'https://voice.baidu.com/act/newpneumonia/newpneumonia'
html = requests.get(url,headers=headers)
content = html.text
value = r'(\{("[a-z,A-Z]+":"\d*",){10,11}"area":"(\\(\w){5})+","subList":\[\])'
c=re.compile(value,re.DOTALL)
t = c.findall(content)
print(t)
Result:
[('{"confirmed":"42","died":"","crued":"10","relativeTime":"1585756800","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","curConfirm":"32","curConfirmRelative":"0","cityCode":"2911","icuDisable":"1","area":"\\u6fb3\\u95e8","subList":[]', '"icuDisable":"1",', '\\u95e8', '8'), ('{"confirmed":"348","died":"5","crued":"50","relativeTime":"1585756800","confirmedRelative":"10","diedRelative":"0","curedRelative":"5","curConfirm":"293","curConfirmRelative":"5","icuDisable":"1","area":"\\u53f0\\u6e7e","subList":[]', '"icuDisable":"1",', '\\u6e7e', 'e'), ('{"confirmed":"845","died":"4","crued":"173","relativeTime":"1585756800","confirmedRelative":"37","diedRelative":"0","curedRelative":"7","curConfirm":"668","curConfirmRelative":"30","cityCode":"2912","icuDisable":"1","area":"\\u9999\\u6e2f","subList":[]', '"icuDisable":"1",', '\\u6e2f', 'f')]
Every match comes with parts I don't need. I've forgotten how this was handled in the IP case. Could anyone help?
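One possible explanation, as a sketch: when a pattern contains capturing groups, `re.findall` returns a tuple of the groups for each match, so every `(...)` in the pattern contributes one element. Writing the inner groups as `(?:...)` makes them non-capturing. The `content` string below is a shortened, made-up stand-in for the page content (assumed to contain a single backslash before each `u`), and the `{10,11}` count is relaxed to `+` only because the sample has fewer fields.

```python
import re

# Shortened stand-in for the page content (assumed shape; the raw string keeps
# each \uXXXX as a literal backslash followed by five characters).
content = (
    r'[{"confirmed":"41","died":"","crued":"10","cityCode":"2911",'
    r'"icuDisable":"1","area":"\u6fb3\u95e8","subList":[]},'
    r'{"confirmed":"348","died":"5","crued":"50",'
    r'"icuDisable":"1","area":"\u53f0\u6e7e","subList":[]}]'
)

# No capturing groups at all: findall returns each whole match as a string.
whole = re.compile(
    r'\{(?:"[a-zA-Z]+":"\d*",)+"area":"(?:\\\w{5})+","subList":\[\]'
)
records = whole.findall(content)

# Exactly one capturing group: findall returns only that group's text.
area_only = re.compile(r'"area":"((?:\\\w{5})+)"')
areas = area_only.findall(content)

print(records)
print(areas)
```

With no capturing groups the list holds plain strings instead of tuples. With a single capturing group it holds just that group, which is handy when only the region name is wanted.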
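Question 2:
"area":"\\u53f0\\u6e7e": the page source contains only a single \, but once it is read in it shows up escaped as \\, and then the place name can't be read.
h="\\u53f0\\u6e7e"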
print(h)
h="\u53f0\u6e7e"
print(h)
Result:
\u53f0\u6e7e
台湾
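A note and a minimal sketch for this part: the doubled `\\` seen when printing a list is how Python's `repr` displays one backslash, so the string itself likely still holds a single backslash followed by `u53f0`. That text is a literal escape sequence rather than the character, and it has to be decoded. The variable names below are illustrative, and `record_text` is a made-up record in the same shape as the data above.

```python
import json

# Literal backslash-u text, as in the first print above (12 characters).
h = "\\u53f0\\u6e7e"

# Option 1: interpret the escapes with the unicode_escape codec
# (this round trip is only safe when the text is pure ASCII).
decoded = h.encode("ascii").decode("unicode_escape")

# Option 2: wrap the text in quotes and let the JSON parser decode it.
decoded_json = json.loads('"' + h + '"')

# Option 3: parse a whole record as JSON instead of using a regex.
record_text = r'{"confirmed":"348","area":"\u53f0\u6e7e","subList":[]}'
record = json.loads(record_text)

print(decoded, decoded_json, record["area"])
```

If the full list can be cut out of the page as valid JSON, option 3 is probably the simplest route, since `json.loads` decodes every `\uXXXX` escape and returns ordinary dicts and lists.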