Why doesn't this regex match any data? The pattern is "string" + variable + "string"
Last edited by fdfanmo on 2022-9-20 15:27
import urllib.request
import re
import sys
url="https://pornchil.com/after-hours-exposed-siterip/#more-98114"
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
# Build the Request object
request = urllib.request.Request(url=url,headers=headers)
# Open the URL
response = urllib.request.urlopen(request)
# Read the response and decode the page source
source_code = response.read().decode("utf-8")
print(source_code)
f=open("G:\\after-hours-exposed-siterip.txt","r")
movie_name = f.readlines()
# Contents read from after-hours-exposed-siterip.txt
after-hours-exposed-siterip=[
20171011_public_rooftop_blowjob_in_old_town_riga_latvia,
20200923_double_teen_blowjob_doing_makeup_then_cumblast_croatia_vacation_1,
20200429_pov_dildoing_and_pussy_eating_vanessa_klein_pov_misspussycat_1,
20200318_19yo_pretty_blonde_mia_back_for_a_nice_afternoon_blowjob,
20200115_19yo_mia_sucking_me_off_and_2_private_sex_tapes_from_her_phone,
20190619_teen_jete_first_forest_blowjob_and_mouth_cum_drool,
20190227_barely_18_alina_huge_tits_and_sucking_me_off_titty_fuck_until_mouth_cum,
20190206_nervous_18yo_alina_blowjob_handjob_combo_huge_tits_cumed_and_glaz,
20190213_dream_night_with_my_18yo_blonde_latvian_dream_girl_one_night_on_a_cruis,
20171011_public_rooftop_blowjob_in_old_town_riga_latvia,
20210825_real_highschool_cheerleader_nervously_gives_perfect_nice.blowjob_pov_di,
20210818_super_double_blowjob_miss_pussycat_and_spinner_blake,
20210428_pov_lesbian_miss_pussycat_ice_and_poprocks_pussy_licking,
20210210_big_boobed_paula_giving_sexy_pov_blowjob,
20210127_new_girl_18yo_kelly_anne_pov_pussy_licking_striptease_with_miss_pussyca,
]
for value in movie_name:
    re_str = value.replace("_",".")
    print(re_str)
url_link1 = re.search(rf"https.+{20171011.public.rooftop.blowjob.in.old.town.riga.latvia}+.mp4.html", source_code)
print(url_link1)
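A note on the pattern above: inside an f-string the braces must hold a Python expression, normally a variable, so a literal file name with dots placed between them is not a valid expression. A minimal sketch of interpolating a variable instead, assuming a made-up page snippet and a placeholder name (`sample_html`, `name` and the host are hypothetical, not taken from the thread):

```python
import re

# Hypothetical stand-in for the downloaded page source.
sample_html = '<a href="https://host.example/file/0123abcd/clip_one.mp4.html">clip</a>'

# Hypothetical name, as it would come from one line of the .txt file.
name = "clip_one"

# re.escape() makes the name match literally; \. matches a real dot.
# [^"'\s]* stops at quotes and whitespace, so the match cannot run past the URL.
pattern = rf"""https?://[^"'\s]*{re.escape(name)}\.mp4\.html"""
match = re.search(pattern, sample_html)
print(match.group(0) if match else None)
```

Building the pattern from a variable keeps the "string" + variable + "string" shape asked about in the title, while `re.escape` guards against names that contain regex metacharacters.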
I want to get data like this:
use each entry in after-hours-exposed-siterip to match the full URL.
"https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20171011_public_rooftop_blowjob_in_old_town_riga_latvia.mp4.html"
The pattern has the form "string" + variable + "string".
If you read the names straight from the after-hours-exposed-siterip.txt in your .rar file, change the final for loop like this:
for value in movie_name:
    url_link1 = rf'https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/{value.strip()}.mp4.html'
    print(url_link1)
Last edited by fdfanmo on 2022-9-19 13:45
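月下孤井 posted on 2022-9-18 19:40
If you read the names straight from the after-hours-exposed-siterip.txt in your .rar file, change the final for loop like this:
for value i ...
Thanks for the reply.
But it doesn't seem to actually match the URLs in the page source.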
for value in movie_name:
    #print(value)
    url_link1 = (rf'https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/{value.strip()}.mp4.html',value)
    print(url_link1)
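That is because the https:// part of each URL changes,
so it can't be hard-coded like this.
The only option is url_link1 = (rf'https.+{value.strip()}.mp4.html',value)
because this part varies: https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/
The actual URLs look like the following; these are the correct ones.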
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20171011_public_rooftop_blowjob_in_old_town_riga_latvia.mp4.html
https://rapidgator.net/file/d341631ca5bdcd6a2a2da0250e67bba6/20200923_double_teen_blowjob_doing_makeup_then_cumblast_croatia_vacation_1.mp4.html
https://rapidgator.net/file/2228be05d86c849886481330dfb5ded7/20200923_double_teen_blowjob_doing_makeup_then_cumblast_croatia_vacation_1.mp4.html
https://rapidgator.net/file/35d4b3afd82d25b1bc070d9b62f188b7/20200429_pov_dildoing_and_pussy_eating_vanessa_klein_pov_misspussycat_1.mp4.html
https://rapidgator.net/file/599ac1013ca0f02f266b49ea843a6332/20200429_pov_dildoing_and_pussy_eating_vanessa_klein_pov_misspussycat_1.mp4.html
https://rapidgator.net/file/ee15c7c7096530d3e8d4bba67a1edc49/20200318_19yo_pretty_blonde_mia_back_for_a_nice_afternoon_blowjob.mp4.html
https://rapidgator.net/file/d2c0b201eff3c4023c75dcfcb3f30d84/20200318_19yo_pretty_blonde_mia_back_for_a_nice_afternoon_blowjob.mp4.html
https://rapidgator.net/file/882c39edab324fb5fe7b7cb649332623/20200115_19yo_mia_sucking_me_off_and_2_private_sex_tapes_from_her_phone.mp4.html
https://rapidgator.net/file/d5be9c3e704eb4be0279f3f7bf0207ed/20200115_19yo_mia_sucking_me_off_and_2_private_sex_tapes_from_her_phone.mp4.html
https://rapidgator.net/file/9b52ee6c6919b3dde75fa3d634f96594/20190619_teen_jete_first_forest_blowjob_and_mouth_cum_drool.mp4.html
https://rapidgator.net/file/f4241073b85c95fcf319cdff5f5f5df3/20190619_teen_jete_first_forest_blowjob_and_mouth_cum_drool.mp4.html
https://rapidgator.net/file/9696ea95a04fc3a3f529d538d9f02a2f/20190227_barely_18_alina_huge_tits_and_sucking_me_off_titty_fuck_until_mouth_cum.mp4.html
https://rapidgator.net/file/8758eaee65b050c1e0f4df947f189e89/20190213_dream_night_with_my_18yo_blonde_latvian_dream_girl_one_night_on_a_cruise_ship.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20171011_public_rooftop_blowjob_in_old_town_riga_latvia.mp4.html
https://rapidgator.net/file/159cfe51e119ce49eaf0bda4ee7f4872/20210818_super_double_blowjob_miss_pussycat_and_spinner_blake.mp4.html
https://rapidgator.net/file/79e4d923a5457801e9e770a4eb4d330d/20210428_pov_lesbian_miss_pussycat_ice_and_poprocks_pussy_licking.mp4.html
https://rapidgator.net/file/0bbc6358a61f2d9aa96387ad64a01104/20210210_big_boobed_paula_giving_sexy_pov_blowjob.mp4.html
https://rapidgator.net/file/d0af4ec4da4063c347d0ac5160fabebf/20210127_new_girl_18yo_kelly_anne_pov_pussy_licking_striptease_with_miss_pussycat.mp4.html
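But the current code produces the output below, which does not lead to working download links because the id is hard-coded. It does not use a regex to pull the correct links out of the page.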
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20171011_public_rooftop_blowjob_in_old_town_riga_latvia.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20200923_double_teen_blowjob_doing_makeup_then_cumblast_croatia_vacation_1.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20200429_pov_dildoing_and_pussy_eating_vanessa_klein_pov_misspussycat_1.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20200318_19yo_pretty_blonde_mia_back_for_a_nice_afternoon_blowjob.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20200115_19yo_mia_sucking_me_off_and_2_private_sex_tapes_from_her_phone.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20190619_teen_jete_first_forest_blowjob_and_mouth_cum_drool.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20190227_barely_18_alina_huge_tits_and_sucking_me_off_titty_fuck_until_mouth_cum.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20190206_nervous_18yo_alina_blowjob_handjob_combo_huge_tits_cumed_and_glaz.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20190213_dream_night_with_my_18yo_blonde_latvian_dream_girl_one_night_on_a_cruis.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20171011_public_rooftop_blowjob_in_old_town_riga_latvia.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20210825_real_highschool_cheerleader_nervously_gives_perfect_nice.blowjob_pov_di.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20210818_super_double_blowjob_miss_pussycat_and_spinner_blake.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20210428_pov_lesbian_miss_pussycat_ice_and_poprocks_pussy_licking.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20210210_big_boobed_paula_giving_sexy_pov_blowjob.mp4.html
https://rapidgator.net/file/d563e2e78556baa8f282bf97e3eb493d/20210127_new_girl_18yo_kelly_anne_pov_pussy_licking_striptease_with_miss_pussyca.mp4.html
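The two lists above differ only in the id segment after /file/, which changes per link. A minimal sketch of a pattern that leaves that segment variable, assuming the id is always 32 lowercase hex characters as in the URLs listed in this thread (the page snippet and the names below are placeholders, not real data):

```python
import re

# Hypothetical page source: each link has its own 32-character hex id.
sample_html = """
<a href="https://rapidgator.net/file/0123456789abcdef0123456789abcdef/clip_one.mp4.html">1</a>
<a href="https://rapidgator.net/file/fedcba9876543210fedcba9876543210/clip_two.mp4.html">2</a>
"""

names = ["clip_one", "clip_two", "clip_missing"]  # placeholder names

found = {}
for name in names:
    # {{32}} is how a literal {32} quantifier is written inside an f-string.
    pattern = rf"https://rapidgator\.net/file/[0-9a-f]{{32}}/{re.escape(name)}\.mp4\.html"
    found[name] = re.findall(pattern, sample_html)

for name, urls in found.items():
    print(name, urls)
```

A name that does not appear in the page simply yields an empty list, so nothing is invented for it the way the hard-coded version does.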
fdfanmo posted on 2022-9-19 12:05
Thanks for the reply.
But it doesn't seem to actually match the URLs in the page source.
for value in movie_name:
import urllib.request
from lxml import etree
url = "https://pornchil.com/after-hours-exposed-siterip/#more-98114"
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
# Build the Request object
request = urllib.request.Request(url=url, headers=headers)
# Open the URL
response = urllib.request.urlopen(request)
# Read the response and decode the page source
source_code = response.read().decode("utf-8")
url_link = etree.HTML(source_code).xpath(r'//div[@class="entry-content"]/h6/a/@href')
print(url_link)
Last edited by fdfanmo on 2022-9-20 15:18
月下孤井 posted on 2022-9-19 22:31
import urllib.request
from lxml import etree
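Thank you for taking the time to reply.
Your code collects every link on the page,
but the difficulty is that not every link needs to be downloaded.
I only need the full URL for the names read from after-hours-exposed-siterip.txt.
That is why I have these lines: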
f=open("G:\\after-hours-exposed-siterip.txt","r")
movie_name = f.readlines()
It reads the following contents:
# Contents read from after-hours-exposed-siterip.txt
after-hours-exposed-siterip=[
20171011_public_rooftop_blowjob_in_old_town_riga_latvia,
20200923_double_teen_blowjob_doing_makeup_then_cumblast_croatia_vacation_1,
20200429_pov_dildoing_and_pussy_eating_vanessa_klein_pov_misspussycat_1,
20200318_19yo_pretty_blonde_mia_back_for_a_nice_afternoon_blowjob,
20200115_19yo_mia_sucking_me_off_and_2_private_sex_tapes_from_her_phone,
20190619_teen_jete_first_forest_blowjob_and_mouth_cum_drool,
20190227_barely_18_alina_huge_tits_and_sucking_me_off_titty_fuck_until_mouth_cum,
20190206_nervous_18yo_alina_blowjob_handjob_combo_huge_tits_cumed_and_glaz,
20190213_dream_night_with_my_18yo_blonde_latvian_dream_girl_one_night_on_a_cruis,
20171011_public_rooftop_blowjob_in_old_town_riga_latvia,
20210825_real_highschool_cheerleader_nervously_gives_perfect_nice.blowjob_pov_di,
20210818_super_double_blowjob_miss_pussycat_and_spinner_blake,
20210428_pov_lesbian_miss_pussycat_ice_and_poprocks_pussy_licking,
20210210_big_boobed_paula_giving_sexy_pov_blowjob,
20210127_new_girl_18yo_kelly_anne_pov_pussy_licking_striptease_with_miss_pussyca,
]
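The requirement described in this post (keep only the links whose file name appears in the .txt) could also be met without a regex, by filtering the href list that the xpath query returns. A rough sketch under that assumption; `hrefs` and `lines` below are made-up stand-ins for the xpath result and for the output of `readlines()`:

```python
from urllib.parse import urlsplit

# Hypothetical hrefs, standing in for the list returned by the xpath query.
hrefs = [
    "https://host.example/file/aaa/clip_one.mp4.html",
    "https://host.example/file/bbb/clip_two.mp4.html",
    "https://host.example/file/ccc/other_clip.mp4.html",
]

# Hypothetical lines as readlines() would return them (trailing newline kept).
lines = ["clip_one\n", "other_clip\n"]
wanted = {line.strip() for line in lines if line.strip()}

suffix = ".mp4.html"
selected = []
for href in hrefs:
    # Last path component of the URL, e.g. "clip_one.mp4.html".
    filename = urlsplit(href).path.rsplit("/", 1)[-1]
    if filename.endswith(suffix) and filename[: -len(suffix)] in wanted:
        selected.append(href)

print(selected)
```

This only works when the names in the file match the file names in the links exactly; names that differ (for instance by being cut short) would still need a looser comparison such as a regex.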
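Because of this requirement I need a regex to do the matching,
and it has to be a regex built from a variable,
because the pattern must match each title read from the file,
so the regex can't be hard-coded.
I still haven't figured out how to get around this problem.
fdfanmo posted on 2022-9-20 15:06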
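Thank you for taking the time to reply.
Your code collects every link on the page,
but the difficulty is that not every link needs to be downloaded.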
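I can't run or debug this here because my program can't reach sites outside the country. What method do you use to crawl foreign sites? Could you show me, so that I can then debug your program step by step?
月下孤井 posted on 2022-9-20 17:21
I can't run or debug this here because my program can't reach sites outside the country. What method do you use to crawl foreign sites? Could you show me, so that I ...
Foreign IPs are probably blocked on your side.
Also, this server isn't very stable;
I sometimes get status code 500 too.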
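For the occasional status code 500, one possible workaround is to retry the request a few times before giving up. A small sketch, not something from the thread: `fetch_with_retry` and its `attempts`, `delay` and `opener` parameters are hypothetical names introduced here for illustration.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(request, attempts=3, delay=2.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, retrying on HTTP 5xx errors."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Retry only server-side errors; re-raise anything else,
            # and re-raise the last failure once the attempts run out.
            if err.code < 500 or attempt == attempts:
                raise
            time.sleep(delay)
```

With the script in this thread it would be called as `source_code = fetch_with_retry(request).decode("utf-8")`. The `opener` argument exists only so the function can be exercised without network access.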
A friend has already written a solution to this problem for me.
I'm posting the source code here as a note.
import urllib.request
import re
from bs4 import BeautifulSoup
url="https://pornchil.com/after-hours-exposed-siterip/#more-98114"
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
# Build the Request object
request = urllib.request.Request(url=url,headers=headers)
# Open the URL
response = urllib.request.urlopen(request)
# Read the response and decode the page source
source_code = response.read().decode("utf-8")
#print(source_code)
f=open("G:\\after-hours-exposed-siterip.txt","r")
movie_name = f.readlines()
def newStrRe(kw):
    # Replace "-" and "_" with ".", which matches any single character in a regex
    return re.sub('(-|_)','.',kw)
for item in movie_name:
    # print("item"+item)
    # print("item.strip()"+item.strip())
    sult = re.findall(rf'http.+{newStrRe(item.strip())}.mp4.html',source_code)
    if sult:
        print(sult)
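The code above works when each link sits on its own line, but `http.+` is greedy and could span several links on one line, and a few names in the .txt appear to be cut short compared with the file names in the real URLs (one ends in `cruis` where the URL continues), so those would not match. A possible tightening, offered as a sketch only: `find_links` is a hypothetical helper, and the sample page and names are made up.

```python
import re

def find_links(names, html):
    """Map each non-empty name to the list of matching .mp4.html URLs in html."""
    results = {}
    for raw in names:
        name = raw.strip()  # readlines() keeps the trailing newline
        if not name:
            continue
        # Escape each piece literally; let "-", "_" and "." stand in for one another.
        core = "[-_.]".join(re.escape(part) for part in re.split(r"[-_.]", name))
        # [^"'\s<>]* cannot cross quotes or whitespace, so a match stays inside one URL.
        # The second class lets a truncated name be completed by the rest of the file name.
        pattern = rf"""https?://[^"'\s<>]*/{core}[^"'\s<>/]*\.mp4\.html"""
        results[name] = re.findall(pattern, html)
    return results

# Hypothetical page source and names, for illustration only.
sample_html = (
    '<a href="https://host.example/file/aaa/clip_one_full_title.mp4.html">a</a>'
    '<a href="https://host.example/file/bbb/clip-two.mp4.html">b</a>'
)
sample_names = ["clip_one_full\n", "clip_two\n", "clip_three\n"]

for name, urls in find_links(sample_names, sample_html).items():
    print(name, urls)
```

Allowing extra characters after the name also means a short name can match a longer, unrelated file name that starts the same way, so the results are worth checking by eye.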