This post was last edited by 937135952 on 2020-7-25 20:37
Code:
import requests
from bs4 import BeautifulSoup

def JinRuYeMian(wangzi):
    html = wangzi
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language':'en-US,en;q=0.5',
        'Accept-Encoding':'gzip',
        'DNT':'1',
        'Connection':'close'
    }
    page = requests.get(html,headers=headers)
    soup_obj=BeautifulSoup(page.content,'html.parser')
    txt_content = soup_obj.find(id = 'mainNewsContent')
    jiema = txt_content.encode('GB2312')
    jiema2 = jiema.decode('GB2312')
    print(jiema2)
    c=int(1)
    file_name = str(c) +'攻略' +'.txt'
    with open(file_name,'w') as temp:
        temp.write(str(jiema.decode('GB2312')))
    print('攻略'+file_name+'已经完成下载')
    c = c+1

if __name__ =='__main__':
    JinRuYeMian('https://3gmfw.cn//article/html2/2020/04/26/523599.html')
Output:
<li><span class="list-icon2">11</span><a href="/article/html2/2020/04/26/523590.html" title="°Ù±ä´óÕì̽ÀÇÈË֮ѪÐ×ÊÖÊÇË­£¿ °Ù±ä´óÕì̽ÀÇÈË֮Ѫ¹¥ÂÔ">°Ù±ä´óÕì̽ÀÇÈË֮ѪÐ×ÊÖÊÇË­£¿ °Ù±ä´óÕì̽ÀÇÈË֮Ѫ¹¥ÂÔ</a></li>
<li><span class="list-icon2">12</span><a href="/article/html2/2020/04/26/523589.html" title="°Ù±ä´óÕì̽ԡ»ðÐ×ÊÖÊÇË­£¿ °Ù±ä´óÕì̽ԡ»ð¹¥ÂÔ">°Ù±ä´óÕì̽ԡ»ðÐ×ÊÖÊÇË­£¿ °Ù±ä´óÕì̽ԡ»ð¹¥ÂÔ</a></li>
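The page content is in Chinese and I want to scrape it, but it comes out as strings like the ones above and I don't know how to fix that. Any advice would be appreciated.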
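For context, output like this usually appears when bytes in a Chinese encoding are decoded with a single-byte Western codec. The sketch below reproduces that effect on a short sample string and then reverses it. It assumes the page is served as GBK and that the wrong codec behaves like Latin-1; neither is confirmed by the post, and the variable names are only illustrative.

```python
# Minimal sketch of how mojibake arises, under the assumptions stated above.
original = "攻略"

# What the server would send if the page is GBK-encoded (assumption).
raw = original.encode("gbk")

# Decoding those bytes with a single-byte codec yields accented Latin characters.
garbled = raw.decode("latin-1")
print(garbled)

# Latin-1 maps every byte to one character, so the mistake is reversible:
# re-encode to get the original bytes back, then decode with the right codec.
recovered = garbled.encode("latin-1").decode("gbk")
print(recovered)
```

Because each GBK character here occupies two bytes, every Chinese character turns into two Latin characters, which matches the general shape of the garbled titles in the output.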
This post was last edited by suchocolate on 2020-7-26 20:15
I'm not very familiar with soup, so try xpath instead.
import requests
from lxml import etree


def main(url):
    headers = {'User-Agent': 'firefox'}
    r = requests.get(url, headers=headers)
    r.encoding = 'gbk'
    html = etree.HTML(r.text)
    result = html.xpath('//div[@id="mainNewsContent"]/p/text()')
    print(result)


if __name__ == '__main__':
    main('https://3gmfw.cn/article/html2/2020/04/26/523599.html')
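The key step in the code above is `r.encoding = 'gbk'`, which tells requests how to decode the response before `r.text` is read. The same idea can be sketched with the standard library alone: decode the raw bytes with an explicitly chosen encoding, then write the file with an explicit encoding as well. The helper names `decode_page` and `save_text` are hypothetical, and `gb18030` is an assumption chosen because it is a superset of GBK and GB2312.

```python
from pathlib import Path


def decode_page(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode raw response bytes with an explicitly chosen encoding."""
    # errors="replace" substitutes a placeholder for any invalid byte
    # instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")


def save_text(path: str, text: str) -> None:
    """Write text as UTF-8 so the result does not depend on the platform default."""
    Path(path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Stand-in for page.content or r.content; a real run would fetch the page.
    sample = "<p>攻略</p>".encode("gbk")
    text = decode_page(sample)
    print(text)
    save_text("1攻略.txt", text)
```

In the BeautifulSoup version from the question, the decoded string could be passed straight to the parser, for example `BeautifulSoup(decode_page(page.content), 'html.parser')`. BeautifulSoup also accepts a `from_encoding` argument when it is given bytes.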