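Chysial posted on 2020-4-3 17:59:15

The re module

Question 1:
Background: I think this has something to do with (?: but I've forgotten the details. When matching IP addresses while scraping, the results contained extra items such as 123, 1, 2 besides the IP itself, and I really can't remember how that was solved.
Question: when scraping the source of Baidu's epidemic page, I extract the per-region data, but the results contain other things besides that data.
For anyone who doesn't want to fetch the Baidu page themselves, here is what the content looks like: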
[{"confirmed":"1","died":"","crued":"1","relativeTime":"1585756800","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","curConfirm":"0","curConfirmRelative":"0","icuDisable":"1","area":"\\u897f\\u85cf","subList":[{"city":"\\u62c9\\u8428","confirmed":"1","died":"","crued":"1","confirmedRelative":"0","curConfirm":"0","cityCode":"100"}]},{"confirmed":"41","died":"","crued":"10","relativeTime":"1585670400","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","curConfirm":"31","curConfirmRelative":"0","cityCode":"2911","icuDisable":"1","area":"\\u6fb3\\u95e8","subList":[]}
My code:
import requests
import json
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
url = 'https://voice.baidu.com/act/newpneumonia/newpneumonia'
html = requests.get(url,headers=headers)
content = html.text
value = r'(\{("+":"\d*",){10,11}"area":"(\\(\w){5})+","subList":\[\])'
c=re.compile(value,re.DOTALL)
t = c.findall(content)
print(t)
Result:
[('{"confirmed":"42","died":"","crued":"10","relativeTime":"1585756800","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","curConfirm":"32","curConfirmRelative":"0","cityCode":"2911","icuDisable":"1","area":"\\u6fb3\\u95e8","subList":[]', '"icuDisable":"1",', '\\u95e8', '8'), ('{"confirmed":"348","died":"5","crued":"50","relativeTime":"1585756800","confirmedRelative":"10","diedRelative":"0","curedRelative":"5","curConfirm":"293","curConfirmRelative":"5","icuDisable":"1","area":"\\u53f0\\u6e7e","subList":[]', '"icuDisable":"1",', '\\u6e7e', 'e'), ('{"confirmed":"845","died":"4","crued":"173","relativeTime":"1585756800","confirmedRelative":"37","diedRelative":"0","curedRelative":"7","curConfirm":"668","curConfirmRelative":"30","cityCode":"2912","icuDisable":"1","area":"\\u9999\\u6e2f","subList":[]', '"icuDisable":"1",', '\\u6e2f', 'f')]
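Every tuple contains items I don't need. I've forgotten how this was handled in the IP case. Could someone help?
Question 2:
"area":"\\u53f0\\u6e7e": the page source has only a single \, but when I read it back it shows up escaped as \\, and then the place name can't be read.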
h="\\u53f0\\u6e7e"
print(h)
h="\u53f0\u6e7e"
print(h)
Result:
\u53f0\u6e7e
台湾

Chysial posted on 2020-4-3 18:02:56

@塔利班 @claws0n

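永恒的蓝色梦想 posted on 2020-4-3 18:05:30

Chysial posted on 2020-4-3 18:02
@塔利班 @claws0n

As I recall, the forum rules say: 1. When asking a question, try not to call on specific people, e.g. "小甲鱼, come answer this...", "XX, come answer this..." (it discourages others from replying).

wp231957 posted on 2020-4-3 18:09:17

Replying to your second question: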

>>> h="\\u53f0\\u6e7e"
>>> print(h.encode(encoding="utf-8").decode(encoding="unicode_escape"))
台湾
>>>
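
The same idea can be wrapped in a small helper. This is only a sketch, assuming the text contains literal \uXXXX sequences (a backslash, a u, four hex digits); the name decode_u_escapes is hypothetical. It uses a regex substitution instead of the unicode_escape codec, so characters that are already non-ASCII are left untouched, whereas the encode/decode round trip would garble them. It handles four-digit escapes only, not surrogate pairs.

```python
import re

def decode_u_escapes(text):
    """Replace every literal \\uXXXX sequence with the character it names."""
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',              # backslash, 'u', four hex digits
        lambda m: chr(int(m.group(1), 16)),  # hex digits -> code point -> character
        text,
    )

h = "\\u53f0\\u6e7e"        # what the scraped text actually contains
print(decode_u_escapes(h))  # 台湾
```

Note that the doubled backslash only appears in the repr of the string (for example inside a printed list); the string itself holds a single backslash.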

Chysial posted on 2020-4-3 18:25:49

永恒的蓝色梦想 posted on 2020-4-3 18:05
As I recall, the forum rules say

Then could you please help answer it and fix my regex?

Chysial posted on 2020-4-3 18:26:22

wp231957 posted on 2020-4-3 18:09
Replying to your second question

Can you answer the first question?

Chysial posted on 2020-4-3 18:26:54

永恒的蓝色梦想 posted on 2020-4-3 18:05
As I recall, the forum rules say

OK, understood. It won't happen again.

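wp231957 posted on 2020-4-3 18:28:44

Chysial posted on 2020-4-3 18:26
Can you answer the first question?

I don't know what you want to extract.
If it's standard JSON, it's better not to use regex.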
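
To illustrate the JSON route, here is a minimal sketch, assuming the list has already been cut out of the page as one complete, valid JSON string. The sample below is a shortened version of the data in the first post with the closing bracket added; that the page really yields such a clean string is an assumption.

```python
import json

# Shortened sample (assumption): the original post's data is truncated
# and lacks the closing bracket, which is added here.
sample = r'''[{"confirmed":"1","died":"","crued":"1","area":"\u897f\u85cf",
"subList":[{"city":"\u62c9\u8428","confirmed":"1","died":"","crued":"1"}]},
{"confirmed":"41","died":"","crued":"10","area":"\u6fb3\u95e8","subList":[]}]'''

provinces = json.loads(sample)  # \uXXXX escapes are decoded by the parser

for p in provinces:
    cities = [c["city"] for c in p["subList"]]
    print(p["area"], p["confirmed"], cities)
```

With json.loads the \uXXXX escapes are decoded automatically and subList comes back as a real list, so no extra groups or manual unescaping are involved.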

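Chysial posted on 2020-4-3 19:57:49

wp231957 posted on 2020-4-3 18:28
I don't know what you want to extract.
If it's standard JSON, it's better not to use regex.

The source of Baidu's epidemic page is in JSON format (just view the page source, no need for F12), but I can't get the json module to work on it. Could you take a look? I'm scraping the epidemic data for each province!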

Chysial posted on 2020-4-3 20:29:37

wp231957 posted on 2020-4-3 18:28
I don't know what you want to extract.
If it's standard JSON, it's better not to use regex.

Could you help take a look?

wp231957 posted on 2020-4-3 21:08:06

Chysial posted on 2020-4-3 20:29
Could you help take a look?

One is the source and the other is the run result. I still don't know what data you want.

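Chysial posted on 2020-4-3 21:22:42

wp231957 posted on 2020-4-3 21:08
One is the source and the other is the run result. I still don't know what data you want.

I don't understand. Could you explain? Where did your s come from? That s is exactly what I want to scrape, from Baidu's real-time epidemic page.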

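wp231957 posted on 2020-4-3 21:26:01

Chysial posted on 2020-4-3 21:22
I don't understand. Could you explain? Where did your s come from? That s is exactly what I want to scrape, from Baidu's real-time epidemic page.

I copied it by hand, and it doesn't seem to be standard JSON.
I then patched it by hand. Slicing it out automatically used to work too, but not today.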

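Chysial posted on 2020-4-3 21:33:34

wp231957 posted on 2020-4-3 21:26
I copied it by hand, and it doesn't seem to be standard JSON.
I then patched it by hand. Slicing it out automatically used to work too, but not today.

Right, it's not standard JSON, so I want to extract it with regex. But that involves non-capturing groups: my extraction works, but the result isn't clean. One more question: how did you copy it with such nice structure? Is there dedicated software or a tool for that?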

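wp231957 posted on 2020-4-3 21:41:31

Chysial posted on 2020-4-3 21:33
Right, it's not standard JSON, so I want to extract it with regex. But that involves non-capturing groups: my extraction works, but the result isn't clean. One more ...

Search Baidu for an online JSON formatter.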

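Chysial posted on 2020-4-3 21:48:57

wp231957 posted on 2020-4-3 21:41
Search Baidu for an online JSON formatter.

OK, thanks for your trouble. Do you remember 小甲鱼's IP regex? I couldn't find it in the videos. If you do, could you tell me what the non-capturing groups in r'(?:(?:?\d?\d|2\d|25)\.){3}(?:?\d?\d|2\d|25)' were for, and why they were added? I searched Baidu and found no answer.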

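wp231957 posted on 2020-4-3 22:03:46

Chysial posted on 2020-4-3 21:48
OK, thanks for your trouble. Do you remember 小甲鱼's IP regex? I couldn't find it in the videos. If you do, could you tell me what r'(?: ...

?: means the group takes part in the match but is not included in the results.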
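
A small demonstration of the difference, as a sketch: re.findall returns one tuple per match as soon as the pattern contains capturing groups, and the plain matched string when it contains none. The sample text and the simplified \d{1,3} octet (which does not check the 0-255 range) are assumptions for illustration.

```python
import re

text = "proxy 123.12.12.12 port 8080"

# Three capturing groups: findall returns a tuple with one item per group.
capturing = r'((\d{1,3}\.){3}(\d{1,3}))'
print(re.findall(capturing, text))      # [('123.12.12.12', '12.', '12')]

# Same pattern with (?:...) for grouping only: findall returns whole matches.
non_capturing = r'(?:\d{1,3}\.){3}\d{1,3}'
print(re.findall(non_capturing, text))  # ['123.12.12.12']
```

A repeated capturing group such as (\d{1,3}\.){3} keeps only its last repetition, which is why the extra items look like fragments of the address.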

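Chysial posted on 2020-4-3 22:14:32

wp231957 posted on 2020-4-3 22:03
?: means the group takes part in the match but is not included in the results.

Right. When I scraped data, the IP case gave the same kind of result, something like ["123.12.12.12","123","1"] for one IP, and I really couldn't work out how to add ?: to get rid of the last two groups.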

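Chysial posted on 2020-4-3 22:28:45

wp231957 posted on 2020-4-3 22:03
?: means the group takes part in the match but is not included in the results.

Got it working. The { was probably what caused trouble with the non-capturing groups; I removed the { and modified the pattern, and now I get the result: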
value =r'(?:(?:"+":"\d*",){10,11}"area":"(?:\\(?:\w){5})+","subList":\[\])'
['"confirmed":"42","died":"","crued":"10","relativeTime":"1585756800","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","curConfirm":"32","curConfirmRelative":"0","cityCode":"2911","icuDisable":"1","area":"\\u6fb3\\u95e8","subList":[]', '"confirmed":"348","died":"5","crued":"50","relativeTime":"1585756800","confirmedRelative":"10","diedRelative":"0","curedRelative":"5","curConfirm":"293","curConfirmRelative":"5","icuDisable":"1","area":"\\u53f0\\u6e7e","subList":[]', '"confirmed":"845","died":"4","crued":"173","relativeTime":"1585756800","confirmedRelative":"37","diedRelative":"0","curedRelative":"7","curConfirm":"668","curConfirmRelative":"30","cityCode":"2912","icuDisable":"1","area":"\\u9999\\u6e2f","subList":[]']
With one more tweak to subList I can get the province data.
Thanks.
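
An alternative that avoids rewriting every group, sketched under assumptions: keep the capturing groups and read the whole match with group(0) from re.finditer. The pattern below is a simplified stand-in for the one above (the key is assumed to be "\w+" and the escape is written as \\u[0-9a-f]{4}), and content is a shortened sample record, not the real page.

```python
import re

# Shortened sample record (assumption); raw string, so \u stays literal.
content = (
    r'{"confirmed":"41","died":"","crued":"10","cityCode":"2911",'
    r'"icuDisable":"1","area":"\u6fb3\u95e8","subList":[]}'
)

# Capturing groups are left in place on purpose.
pattern = re.compile(
    r'\{("\w+":"\d*",)+"area":"(\\u[0-9a-f]{4})+","subList":\[\]'
)

# findall reports the groups (last repetition of each), not the whole match.
print(pattern.findall(content))  # [('"icuDisable":"1",', '\\u95e8')]

# group(0) is always the whole match, whatever groups the pattern has.
whole = [m.group(0) for m in pattern.finditer(content)]
print(whole[0])
```

Here whole[0] is the record up to and including "subList":[], without the closing brace.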

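Chysial posted on 2020-4-3 22:43:44

Chysial posted on 2020-4-3 22:28
Got it working. The { was probably what caused trouble with the non-capturing groups; I removed the { and modified the pattern, and now I get the result

['"conf ...

Got the JSON out.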
{'confirmed': '1',
'died': '',
'crued': '1',
'relativeTime': '1585756800',
'confirmedRelative': '0',
'diedRelative': '0',
'curedRelative': '0',
'curConfirm': '0',
'curConfirmRelative': '0',
'icuDisable': '1',
'area': '西藏',
'subList': '[]'}
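
One way to get from a matched fragment to a dictionary like this, as a sketch only: put the braces back and hand the string to json.loads. The fragment below is a shortened, hypothetical match, and this is not necessarily how the result above was produced (there subList is the string '[]', while json.loads returns a real list).

```python
import json

# Hypothetical matched fragment: the regex match without its outer braces.
fragment = r'"confirmed":"1","died":"","crued":"1","area":"\u897f\u85cf","subList":[]'

# Restore the braces so the fragment is a complete JSON object.
record = json.loads("{" + fragment + "}")

print(record["area"])     # 西藏
print(record["subList"])  # []
```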