FishC Forum

Original poster: blackantt

[Solved] Is non-greedy .*? buggy? findall fails to match the records that meet the conditions, why? -- Solved: regex is a poor fit for HTML

OP | Posted on 2022-10-9 16:26:34
wp231957 wrote on 2022-10-9 16:26:
It's not html, it's lxml: pip install lxml

Thanks.
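
For readers following the "use a parser instead of a regex" suggestion, here is a minimal sketch of the idea. It uses the standard library's html.parser rather than lxml, purely so that it runs without installing anything; that swap is an assumption of this example, not what wp231957 proposed. SAMPLE is a shortened, hypothetical fragment modelled on the profile markup quoted further down this thread, and ProfileParser is an illustrative name.

```python
from html.parser import HTMLParser

# Shortened, hypothetical markup: two list items, one per person.
SAMPLE = (
    '<div role="listitem">'
    '<a title="Wang Zixin’s profile" href="/profile/9648420">Wang Zixin</a>'
    '<span aria-hidden="true" title="China" class="c1"></span>'
    '<p>1 shared interest: Languages &amp; Cultures</p></div>'
    '<div role="listitem">'
    '<a title="Tafadzwa Sylvester Mashayamombe’s profile" href="/profile/9509916">'
    'Tafadzwa Sylvester Mashayamombe</a>'
    '<span aria-hidden="true" title="Zimbabwe" class="c2"></span>'
    '<span aria-hidden="true" title="China" class="c1"></span>'
    '<p>1 shared interest: Languages &amp; Cultures</p></div>'
)


class ProfileParser(HTMLParser):
    """Collect one dict per <div role="listitem"> element (simplified)."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_profile_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("role") == "listitem":
            # A new person starts here; later tags are attributed to this record.
            self.records.append({"href": None, "name": "", "countries": [], "note": ""})
        elif not self.records:
            return
        elif tag == "a" and (attrs.get("href") or "").startswith("/profile/"):
            self.records[-1]["href"] = attrs["href"]
            self._in_profile_link = True
        elif tag == "span" and attrs.get("aria-hidden") == "true" and "title" in attrs:
            self.records[-1]["countries"].append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_profile_link = False

    def handle_data(self, data):
        if not self.records:
            return
        if self._in_profile_link:
            self.records[-1]["name"] += data
        elif "shared interest" in data:
            self.records[-1]["note"] = data


parser = ProfileParser()
parser.feed(SAMPLE)
parser.close()

# Keep people whose second listed country is China and who share one interest.
wanted = [
    (r["href"], r["name"], r["countries"][0])
    for r in parser.records
    if len(r["countries"]) == 2
    and r["countries"][1] == "China"
    and r["note"].startswith("1 shared interest")
]
print(wanted)
```

Because each record is built from its own tags, a condition on one person can never be satisfied by markup belonging to the next person. The parser also decodes `&amp;` to `&` in the text it hands to handle_data.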

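阿奇_o | Posted on 2022-10-9 19:49:23
Last edited by 阿奇_o on 2022-10-9 19:53
blackantt wrote on 2022-10-9 14:58:
That tail can't be dropped: each record has two title attributes, plus the "1 shared interest" condition.


Then add the conditions one at a time and test as you go.
Below is a regex with both conditions added, revised and tidied up (shown here in its final form, although in practice you usually get there step by step).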
import re

with open('out8.txt', encoding='utf8') as f:
    # Second nationality is China, and the record contains "1 shared interest"
    results = re.findall(r'<div class="css-em857x"><a title=".*? href="(/profile/\d{6,8})">(.*?)</a>.*?title="(.*?)" class.*? title="China" class.*?>1 shared interest', f.read())
    print(results)
    print(len(results))   # 17 matches


And now a bit of fun with my favourite, pandas, which shows everything at a glance (run this in a Notebook):
import re
import pandas as pd

with open('out8.txt', encoding='utf8') as f:
    # Also capture the second nationality (and optionally "1 shared interest") as groups
    results = re.findall(r'<div class="css-em857x"><a title=".*? profile" href="(/profile/\d{6,8})">(.*?)</a></div></h4>.*?title="(.*?)" class.*?<span aria-hidden="true" title="(.*?)" ', f.read())
    # results = re.findall(r'<div class="css-em857x"><a title=".*? profile" href="(/profile/\d{6,8})">(.*?)</a></div></h4>.*?title="(.*?)" class.*?<span aria-hidden="true" title="(.*?)" class.*?>(1 shared interest)', f.read())
    print(results)
    print(len(results))

df = pd.DataFrame(results)
df   # a bare expression on the last line displays the table in a Notebook



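OP | Posted on 2022-10-9 21:55:46
Last edited by blackantt on 2022-10-9 22:13
阿奇_o wrote on 2022-10-9 19:49:
Then add the conditions one at a time and test as you go.
Below is a regex with both conditions added, revised and tidied up (shown here in its final form ...


The result of 17 is wrong. For example, it includes 'Wang Zixin', 'China' but not 'Tafadzwa Sylvester Mashayamombe', 'Zimbabwe'.

When I copy out the data for just these two people and test each one separately, the problem becomes visible: the full-file result should have been the other way round (Tafadzwa included, Wang Zixin excluded).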
import re

txt1 = """<path d="M15 12c2.21 0 4-1.79 4-4s-1.79-4-4-4-4 1.79-4 4 1.79 4 4 4zm-9-2V8c0-.55-.45-1-1-1s-1 .45-1 1v2H2c-.55 0-1 .45-1 1s.45 1 1 1h2v2c0 .55.45 1 1 1s1-.45 1-1v-2h2c.55 0 1-.45 1-1s-.45-1-1-1H6zm9 4c-2.67 0-8 1.34-8 4v1c0 .55.45 1 1 1h14c.55 0 1-.45 1-1v-1c0-2.66-5.33-4-8-4z"></path></svg></span>Add contact<span class="MuiTouchRipple-root css-w0pj6f"></span></button></div></div><div class="MuiBox-root css-5vb4lz"><hr class="MuiDivider-root MuiDivider-fullWidth css-39bbo6"></div></div><div role="listitem"><div class="MuiGrid-root MuiGrid-container MuiGrid-spacing-xs-3 css-1wli5vu"><div class="MuiGrid-root MuiGrid-item MuiGrid-grid-xs-3 css-1gj0t9x"><div class="MuiBox-root css-u4p24i"><a title="Tafadzwa Sylvester Mashayamombe’s profile" href="/profile/9509916"><div class="MuiAvatar-root MuiAvatar-circular css-11ml7ev"><img alt="Tafadzwa Sylvester Mashayamombe" src="https://inassets1-outlookgmbh.netdna-ssl.com/image/120_120/2022/08/25/00fb44cd5dd41dbec51b48cb5c006b8f423cacb5e8e1b8a34a3be642cade775b.jpeg" class="MuiAvatar-img css-1hy9t21"></div></a><div class="MuiBox-root css-d0uhtl"><h4 class="MuiTypography-root MuiTypography-h4 css-vvcu8p"><div class="css-em857x"><a title="Tafadzwa Sylvester Mashayamombe’s profile" href="/profile/9509916">Tafadzwa Sylvester Mashayamombe</a></div></h4><div class="MuiBox-root css-k008qs"><div class="MuiBox-root css-15ro776"><span aria-hidden="true" title="Zimbabwe" class="css-198cwre"></span></div><div class="MuiBox-root css-15ro776"><span aria-hidden="true" title="China" class="css-17fnpm9"></span></div></div></div></div></div><div class="MuiGrid-root MuiGrid-item MuiGrid-grid-xs-true css-1v9ecy9"><p class="MuiTypography-root MuiTypography-body2 line-clamp-1 css-1o49opv">1 shared interest: Languages &amp; Cultures</p></div><div class="MuiGrid-root MuiGrid-item css-1wxaqej" style="display: flex; justify-content: flex-end;">"""
txt2 = """<path d="M15 12c2.21 0 4-1.79 4-4s-1.79-4-4-4-4 1.79-4 4 1.79 4 4 4zm-9-2V8c0-.55-.45-1-1-1s-1 .45-1 1v2H2c-.55 0-1 .45-1 1s.45 1 1 1h2v2c0 .55.45 1 1 1s1-.45 1-1v-2h2c.55 0 1-.45 1-1s-.45-1-1-1H6zm9 4c-2.67 0-8 1.34-8 4v1c0 .55.45 1 1 1h14c.55 0 1-.45 1-1v-1c0-2.66-5.33-4-8-4z"></path></svg></span>Add contact<span class="MuiTouchRipple-root css-w0pj6f"></span></button></div></div><div class="MuiBox-root css-5vb4lz"><hr class="MuiDivider-root MuiDivider-fullWidth css-39bbo6"></div></div><div role="listitem"><div class="MuiGrid-root MuiGrid-container MuiGrid-spacing-xs-3 css-1wli5vu"><div class="MuiGrid-root MuiGrid-item MuiGrid-grid-xs-3 css-1gj0t9x"><div class="MuiBox-root css-u4p24i"><a title="Wang Zixin’s profile" href="/profile/9648420"><div class="MuiAvatar-root MuiAvatar-circular css-11ml7ev"><img alt="Wang Zixin" src="https://inassets1-outlookgmbh.netdna-ssl.com/image/120_120/2022/09/27/adcb1f1f82ec1a247ac730fd02c556c4b382dbbccd8532cd4b35381125b7e319.jpeg" class="MuiAvatar-img css-1hy9t21"></div></a><div class="MuiBox-root css-d0uhtl"><h4 class="MuiTypography-root MuiTypography-h4 css-vvcu8p"><div class="css-em857x"><a title="Wang Zixin’s profile" href="/profile/9648420">Wang Zixin</a></div></h4><div class="MuiBox-root css-k008qs"><div class="MuiBox-root css-15ro776"><span aria-hidden="true" title="China" class="css-17fnpm9"></span></div></div></div></div></div><div class="MuiGrid-root MuiGrid-item MuiGrid-grid-xs-true css-1v9ecy9"><p class="MuiTypography-root MuiTypography-body2 line-clamp-1 css-1o49opv">1 shared interest: Languages &amp; Cultures</p></div><div class="MuiGrid-root MuiGrid-item css-1wxaqej" style="display: flex; justify-content: flex-end;">"""

# The same pattern as in the reply above, applied to each snippet on its own
pattern = r'<div class="css-em857x"><a title=".*? href="(/profile/\d{6,8})">(.*?)</a>.*?title="(.*?)" class.*? title="China" class.*?>1 shared interest'
results1 = re.findall(pattern, txt1)
results2 = re.findall(pattern, txt2)

print(results1)
print(results2)


python.exe c:/Users/dengz/Downloads/gggggg.py
[('/profile/9509916', 'Tafadzwa Sylvester Mashayamombe', 'Zimbabwe')]
[]

I suspect that .*? somehow conflicts with having two title attributes. Also, what is &amp;? That looks a bit odd too.
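
A possible explanation for the difference between the isolated snippets and the full file, sketched under assumptions: non-greedy `.*?` only means "as short as possible while the rest of the pattern still matches", so it is free to run past the end of one person's record to find ` title="China" class` in the next one. Since findall returns non-overlapping matches, the record that got swallowed is then skipped. The example below reproduces this with two shortened, hypothetical records (RECORD_A, RECORD_B and the class names c1, c2, p1 are made up for illustration), then applies the same pattern to one `<div role="listitem">` chunk at a time. It also shows that `&amp;` is simply the HTML entity for `&`, which html.unescape decodes.

```python
import html
import re

# Shortened, hypothetical records modelled on the markup quoted above.
RECORD_A = (
    '<div role="listitem"><div class="css-em857x">'
    '<a title="Wang Zixin’s profile" href="/profile/9648420">Wang Zixin</a></div>'
    '<span aria-hidden="true" title="China" class="c1"></span>'
    '<p class="p1">1 shared interest: Languages &amp; Cultures</p></div>'
)
RECORD_B = (
    '<div role="listitem"><div class="css-em857x">'
    '<a title="Tafadzwa Sylvester Mashayamombe’s profile" href="/profile/9509916">'
    'Tafadzwa Sylvester Mashayamombe</a></div>'
    '<span aria-hidden="true" title="Zimbabwe" class="c2"></span>'
    '<span aria-hidden="true" title="China" class="c1"></span>'
    '<p class="p1">1 shared interest: Languages &amp; Cultures</p></div>'
)
PAGE = RECORD_A + RECORD_B  # the two people next to each other, as in the full file

# The pattern from the posts above, unchanged.
PATTERN = re.compile(
    r'<div class="css-em857x"><a title=".*? href="(/profile/\d{6,8})">(.*?)</a>'
    r'.*?title="(.*?)" class.*? title="China" class.*?>1 shared interest'
)


def match_per_record(page):
    """Apply PATTERN to one list item at a time so a match cannot span records."""
    rows = []
    for chunk in page.split('<div role="listitem">')[1:]:
        m = PATTERN.search(chunk)
        if m:
            rows.append(m.groups())
    return rows


# Whole page: the match starts in Wang Zixin's record and borrows the second
# title="China" from the next record, which is then skipped by findall.
print(PATTERN.findall(PAGE))
# [('/profile/9648420', 'Wang Zixin', 'China')]

# Per record: only the person whose own record has a second title="China" matches.
print(match_per_record(PAGE))
# [('/profile/9509916', 'Tafadzwa Sylvester Mashayamombe', 'Zimbabwe')]

# &amp; is the HTML escape for a literal ampersand.
print(html.unescape('Languages &amp; Cultures'))
# Languages & Cultures
```

Splitting on the list-item boundary is only one way to stop matches leaking across records; using an HTML parser, as suggested earlier in the thread, avoids the problem altogether.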

OP | Posted on 2022-10-9 22:02:04
Last edited by blackantt on 2022-10-9 22:03

Duplicate post.