Hi everyone. I've been practicing lately by scraping a city-code website into an Excel sheet. The site is:
http://hotel.alitrip.com/area.ht ... ;enName=&page=1
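I ran into a garbled-text problem I can't solve: letters that carry diacritics come out garbled in the scraped result, and the garbling takes more than one form. I've tried for a long time without success, which is a real headache.
A concrete example: open the site above and search for Ro? ; you will find K?mpóng Ro?
The result scraped into Excel is: K?mpóng Ro?
I won't list other examples for now. What I want is for Excel to contain exactly the data the site shows. What method should I use? None of the site's data may be lost; everything has to be scraped. The source code is as follows: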
import re
import time

import openpyxl
import requests


def get_resps(page, country_ENname=''):
    """Fetch result pages 1..page and return the raw responses."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36'
    }
    urls = []
    for i in range(1, page + 1):
        url = "http://hotel.alitrip.com/area.htm?domestic=1&enName=" + country_ENname + "&page=" + str(i)
        urls.append(url)
    print(f'Total URLs: {len(urls)}')
    resps = []
    a = 1  # progress counter
    print('Scraping...')
    for url in urls:
        print(a)
        resp = requests.get(url, headers=headers)
        resps.append(resp)
        time.sleep(2)  # pause between requests
        a += 1
    print(f'Responses: {len(resps)}')
    return resps


def write(all_list):
    """Append every scraped row to Sheet1 of the existing workbook."""
    wb = openpyxl.load_workbook('编码.xlsx')
    sheet = wb['Sheet1']  # replaces the deprecated get_sheet_by_name
    for each in all_list:
        sheet.append(each)
    wb.save('编码.xlsx')


def parse(resps):
    """Extract the six <td> cells of every tr-city row."""
    targets = []
    for each in resps:
        # each.text is decoded with whatever encoding requests selected
        target = re.findall('<tr class="tr-city"><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>', each.text)
        targets.extend(target)
    print(len(targets))
    return targets


def main():
    resps = get_resps(1)
    targets = parse(resps)
    # print(targets[435:445])
    write(targets)


if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)
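One possible cause, not verified against this site: when a `text/*` response carries no charset in its `Content-Type` header, `requests` falls back to ISO-8859-1 for `resp.text`, so UTF-8 bytes for accented letters turn into two-character junk. The sketch below decodes the raw bytes explicitly instead. `decode_page`, the candidate encoding list, and the sample strings are hypothetical illustrations, and the assumption that the page is UTF-8 or a GBK-family encoding is mine, not something the post confirms.

```python
def decode_page(raw, candidates=("utf-8", "gb18030")):
    """Decode raw response bytes with the first candidate that decodes cleanly.

    Hypothetical helper: the candidate list is an assumption about the site.
    """
    for enc in candidates:
        try:
            return raw.decode(enc)  # strict decoding: bad bytes raise
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep the text, marking the undecodable bytes.
    return raw.decode(candidates[0], errors="replace")


# Hypothetical sample (not taken from the site) showing how the garbling arises.
sample = "Córdoba"
raw = sample.encode("utf-8")          # bytes as a UTF-8 server would send them
garbled = raw.decode("iso-8859-1")    # what a wrong ISO-8859-1 guess yields
fixed = decode_page(raw)              # explicit decoding recovers the text

print(garbled)
print(fixed)
```

If this is indeed the cause, `each.text` in `parse` could be replaced by `decode_page(each.content)`. An alternative is to set `resp.encoding = resp.apparent_encoding` right after `requests.get`, which lets `requests` detect the encoding from the bytes. If the characters are already wrong in the bytes the server sends, neither approach can restore them.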