[Solved] Can you help me find the mistake?

ssqchina · posted 2023-7-20 20:01:03

It works fine without the part marked in red. Once that part is added, the search only runs once.

import os
from bs4 import BeautifulSoup

def search_keywords_in_html(directory, keyword):
    files = os.listdir(directory)
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(directory, file)
            with open(file_path, 'r', encoding='UTF-8') as f:
                content = f.read()

            soup = BeautifulSoup(content, "html.parser")
            for script in soup(["script", "style"]):
                script.extract()
            text = soup.text
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            keyword_index = text.find(keyword)
            while keyword_index != -1:
                start_index = max(0, keyword_index - 30)
                end_index = min(len(text), keyword_index + len(keyword) + 30)
                text = text[start_index:end_index]
                print(f"Found '{keyword}' in '{file}'")
                print(text)
                keyword_index = text.find(keyword, keyword_index + 1)

# Directory and keyword to search for
directory = r'e:\html资料'
keyword = input('Enter the text to search for: ')

search_keywords_in_html(directory, keyword)

Best answer

isdkz · posted 2023-7-20 20:01:32

Last edited by isdkz on 2023-7-20 20:03

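The problem is in the while keyword_index != -1: loop. Each time a keyword is found, you slice text from start_index to end_index and assign the result back to text. After the first iteration, text holds only that snippet (at most len(keyword) + 60 characters), while keyword_index still refers to a position in the full text, so the next text.find(keyword, keyword_index + 1) usually returns -1.

That is why the keyword is only found once.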
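To see the shrinking-text effect concretely, here is a minimal sketch that reproduces the loop on a made-up string. The names buggy_search and sample are illustrative and do not come from the original post.

```python
def buggy_search(text, keyword, context=30):
    """Reproduce the original loop: the snippet overwrites text."""
    hits = 0
    index = text.find(keyword)
    while index != -1:
        hits += 1
        start = max(0, index - context)
        end = min(len(text), index + len(keyword) + context)
        text = text[start:end]                 # text shrinks to a single snippet
        index = text.find(keyword, index + 1)  # old offset applied to the short string
    return hits


# Two occurrences of the keyword, 100 characters apart.
sample = "x" * 100 + "needle" + "y" * 100 + "needle" + "z" * 100

print(sample.count("needle"))          # 2
print(buggy_search(sample, "needle"))  # 1
```

The second occurrence is still in the file, but it is no longer in text by the time the loop looks for it.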

The fix is to store the snippet in a new string instead of overwriting text.

The code changes as follows:

# unrelated code omitted
while keyword_index != -1:
    start_index = max(0, keyword_index - 30)
    end_index = min(len(text), keyword_index + len(keyword) + 30)
    clipped_text = text[start_index:end_index]  # store the snippet in a new variable
    print(f"Found '{keyword}' in '{file}'")
    print(clipped_text)
    keyword_index = text.find(keyword, keyword_index + 1)  # keep searching the original text
# unrelated code omitted

In the modified code, clipped_text holds the snippet that gets printed, while the original text is still used for the keyword search.
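The same idea can be sketched as a small standalone function that needs no HTML files to try out. The name find_snippets, the context parameter, and the empty-keyword guard are assumptions added for illustration. They are not part of the answer above.

```python
def find_snippets(text, keyword, context=30):
    """Return the surrounding context for every occurrence of keyword in text."""
    if not keyword:
        # Assumption: an empty keyword is treated as "no matches".
        return []
    snippets = []
    index = text.find(keyword)
    while index != -1:
        start = max(0, index - context)
        end = min(len(text), index + len(keyword) + context)
        snippets.append(text[start:end])       # text itself is never modified
        index = text.find(keyword, index + 1)  # continue in the full text
    return snippets


sample = "alpha needle beta " + "-" * 80 + " gamma needle delta"
for snippet in find_snippets(sample, "needle", context=6):
    print(repr(snippet))
# 'alpha needle beta '
# 'gamma needle delta'
```

Returning the snippets instead of printing them inside the loop also makes the search easy to check separately from the file handling.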

Mike_python小 · posted 2023-7-20 20:01:12

Last edited by Mike_python小 on 2023-7-20 20:04

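One thing to check in your code: when you parse HTML with BeautifulSoup, the script and style tags have to be removed from the soup object before the plain text is extracted. Otherwise their contents end up in the text you search.

Your loop already does this, because calling soup([...]) is shorthand for soup.find_all([...]). This code:

for script in soup(["script", "style"]):
    script.extract()

behaves the same as the more explicit form:

for script in soup.find_all(["script", "style"]):
    script.extract()

Either way, the script and style contents are excluded from the text before the keyword search runs.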
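To make "drop script and style, then take the text" concrete without installing bs4, here is a rough standard-library sketch. It swaps in html.parser.HTMLParser for BeautifulSoup, so it is not the code used in this thread. VisibleTextParser and visible_text are illustrative names.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>Hello</p><script>var needle = 1;</script><p>needle here</p>"
    "</body></html>"
)
print(visible_text(page))
```

Only the two paragraph texts survive, so a search for needle would match the visible sentence and not the variable name inside the script.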

The modified code is as follows:

import os
from bs4 import BeautifulSoup

def search_keywords_in_html(directory, keyword):
    files = os.listdir(directory)
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(directory, file)
            with open(file_path, 'r', encoding='UTF-8') as f:
                content = f.read()
            soup = BeautifulSoup(content, "html.parser")
            for script in soup.find_all(["script", "style"]):
                script.extract()
            text = soup.text
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)
            keyword_index = text.find(keyword)
            while keyword_index != -1:
                start_index = max(0, keyword_index - 30)
                end_index = min(len(text), keyword_index + len(keyword) + 30)
                # Keep the snippet separate so text stays intact for the next find()
                snippet = text[start_index:end_index]
                print(f"Found '{keyword}' in '{file}'")
                print(snippet)
                keyword_index = text.find(keyword, keyword_index + 1)

# Directory and keyword to search for
directory = r'e:\html资料'
keyword = input('Enter the text to search for: ')
search_keywords_in_html(directory, keyword)

Hope this helps! If you have any other questions, feel free to ask.

Please give me a best answer, thanks! It really matters to me!

陶远航 · posted 2023-7-20 20:01:23

Last edited by 陶远航 on 2023-7-20 20:17

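Hello! Looking at the code you posted, here are a few points:

1. Import statements belong at the top of the file, one per line, which your code already does:

import os
from bs4 import BeautifulSoup

2. The intermediate lines generator is used only once, so it can be folded into the chunks expression. The plain text itself still comes from:

text = soup.text

3. The final search_keywords_in_html(directory, keyword) call must stay at module level, with no indentation, so that it runs when the script is executed instead of becoming part of the function body.

The revised code is as follows: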

import os
from bs4 import BeautifulSoup

def search_keywords_in_html(directory, keyword):
    files = os.listdir(directory)
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(directory, file)
            with open(file_path, 'r', encoding='UTF-8') as f:
                content = f.read()
            soup = BeautifulSoup(content, "html.parser")
            for script in soup(["script", "style"]):
                script.extract()
            text = soup.text
            chunks = (phrase.strip() for line in text.splitlines() for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)
            keyword_index = text.find(keyword)
            while keyword_index != -1:
                start_index = max(0, keyword_index - 30)
                end_index = min(len(text), keyword_index + len(keyword) + 30)
                # Keep the snippet separate so text stays intact for the next find()
                snippet = text[start_index:end_index]
                print(f"Found '{keyword}' in '{file}'")
                print(snippet)
                keyword_index = text.find(keyword, keyword_index + 1)

# Directory and keyword to search for
directory = r'e:\html资料'
keyword = input('Enter the text to search for: ')
search_keywords_in_html(directory, keyword)

Hope this helps! If you have any other questions, feel free to ask.

sfqxx · posted 2023-7-20 20:05:50

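The problem is that each time the keyword is found, you slice the original text and assign the result back, so the whole text becomes shorter. You then look for the next occurrence in that shortened text, so occurrences further on are missed. One fix is to leave the original text unchanged: record where the keyword was found, then extract the context around that position from the original text. Here is the modified code: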

import os
from bs4 import BeautifulSoup

def search_keywords_in_html(directory, keyword):
    files = os.listdir(directory)
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(directory, file)
            with open(file_path, 'r', encoding='UTF-8') as f:
                content = f.read()
            soup = BeautifulSoup(content, "html.parser")
            for script in soup(["script", "style"]):
                script.extract()
            text = soup.text
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)
            keyword_index = text.find(keyword)
            while keyword_index != -1:
                start_index = max(0, keyword_index - 30)
                end_index = min(len(text), keyword_index + len(keyword) + 30)
                snippet = text[start_index:end_index]
                print(f"Found '{keyword}' in '{file}'")
                print(snippet)
                keyword_index = text.find(keyword, keyword_index + len(keyword))

directory = r'e:\html资料'
keyword = input('Enter the text to search for: ')
search_keywords_in_html(directory, keyword)

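In the modified code, the next search starts at `keyword_index + len(keyword)` rather than `keyword_index + 1`. Each iteration therefore resumes right after the previous match. Both offsets find every separate occurrence once text is left intact; starting after the whole keyword simply skips matches that overlap the previous one.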
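The difference between the two offsets only shows up when a keyword can overlap itself. A small sketch, with find_positions and its step parameter as hypothetical names chosen for illustration:

```python
def find_positions(text, keyword, step):
    """Return every index where keyword is found, moving step characters past each hit."""
    if not keyword or step < 1:
        # With an empty keyword, len(keyword) is 0 and the search would never advance.
        raise ValueError("keyword must be non-empty and step must be at least 1")
    positions = []
    index = text.find(keyword)
    while index != -1:
        positions.append(index)
        index = text.find(keyword, index + step)
    return positions


text = "aaaa"
print(find_positions(text, "aa", step=1))          # [0, 1, 2]  overlapping matches
print(find_positions(text, "aa", step=len("aa")))  # [0, 2]     non-overlapping matches
```

The guard matters for the len(keyword) variant: if the user just presses Enter at the prompt, the keyword is empty and the loop would repeat at the same index forever.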

Please pick this as the best answer, thanks!

Mike_python小 · posted 2023-7-20 20:41:19

isdkz posted on 2023-7-20 20:01
The problem is in the while keyword_index != -1: loop. Each time a keyword is found, you slice ...

GPT-4 really is something.

isdkz · posted 2023-7-20 20:44:20

Mike_python小 posted on 2023-7-20 20:41
GPT-4 really is something.

GPT-4 does perform noticeably better than GPT-3.5 on problems that need logical reasoning.

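sfqxx · posted 2023-7-20 20:50:39

isdkz posted on 2023-7-20 20:44
GPT-4 does perform noticeably better than GPT-3.5 on problems that need logical reasoning.

I'm using GPT-4 too.

Doing it without a script is just tiring...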

Mike_python小 · posted 2023-7-20 21:15:57

sfqxx posted on 2023-7-20 20:50
I'm using GPT-4 too.

Doing it without a script is just tiring...

You've got money. Respect.

sfqxx · posted 2023-7-20 21:17:44

Mike_python小 posted on 2023-7-20 21:15
You've got money. Respect.

歌者文明清理员 · posted 2023-7-20 22:24:35

sfqxx posted on 2023-7-20 21:17

Want me to give you the script?

sfqxx · posted 2023-7-20 22:42:33

歌者文明清理员 posted on 2023-7-20 22:24
Want me to give you the script?

Sure, bring it on.
