import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_url(url):
    res = requests.get(url)
    # re-decode the page from gbk (requests mis-detects it as iso-8859-1)
    soup = BeautifulSoup(res.text.encode('iso-8859-1').decode('gbk'), 'lxml')
    tables = soup.select('table')
    table = tables[3]  # the 4th table on the page holds the train schedule
    df_list = []
    # read_html returns a list of DataFrames for the given table
    df_list.append(pd.concat(pd.read_html(table.prettify())))
    df_n = pd.concat(df_list)
    return df_n

ssss = get_url('http://qq.ip138.com/train/guangdong/guangzhounan.htm')
The page has 940 rows in total, but each run only scrapes out 110 of them. Why is that?
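One way to narrow down where the rows disappear is to count the <tr> elements at each stage of the pipeline. The following is only a diagnostic sketch (not from the original posts), assuming the same URL and the same gbk re-decoding as the code above:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://qq.ip138.com/train/guangdong/guangzhounan.htm'
html = requests.get(url).text.encode('iso-8859-1').decode('gbk')

# rows present in the raw HTML (rough count, includes header rows)
print('tr tags in raw HTML:', html.count('<tr'))

# rows that survive BeautifulSoup with the lxml parser
soup = BeautifulSoup(html, 'lxml')
table = soup.select('table')[3]  # same table index as in the question
print('tr tags after parsing:', len(table.select('tr')))

# rows that survive pandas.read_html
df = pd.concat(pd.read_html(table.prettify()))
print('rows in the DataFrame:', len(df))

If the raw-HTML count is already near 110, the response itself is incomplete; if the count only drops after parsing, the parser or read_html is discarding rows from malformed markup.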
from lxml import etree
import requests

def openurl(url):
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'}
    res = requests.get(url, headers=head)
    res.encoding = 'gb2312'
    text = res.text
    return text

def parseurl(text):
    train_info_dict = {}
    html = etree.HTML(text)
    # every schedule row carries this onmouseover highlight attribute
    tr = html.xpath('//tr[@onmouseover="this.bgColor=\'#E6F2E7\';"]')
    for each in tr:
        checi = each.xpath('./td[1]/a/b/text()')[0]        # train number
        xinghao = each.xpath('./td[2]/text()')[0]           # train type
        shifazhan = each.xpath('./td[3]/text()')[0]         # origin station
        shifashijian = each.xpath('./td[4]/text()')[0]      # departure time
        zhongdianzhan = each.xpath('./td[8]/text()')[0]     # terminal station
        daodashijian = each.xpath('./td[9]/text()')[0]      # arrival time
        train_info_dict[checi] = [xinghao, shifazhan, shifashijian, zhongdianzhan, daodashijian]
    print(train_info_dict)
    print(len(train_info_dict))

def main():
    url = 'http://qq.ip138.com/train/guangdong/guangzhounan.htm'
    text = openurl(url)
    parseurl(text)

if __name__ == '__main__':
    main()
I just printed the result directly. You could use pickle to save it in binary form, or save it as JSON, or use openpyxl to write it to Excel, but since the original site already presents the data as a table, that doesn't seem necessary. All 940 rows are scraped.
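For reference, a minimal sketch of the JSON and pickle options mentioned above. It assumes parseurl() from the reply is modified to return train_info_dict instead of only printing it:

import json
import pickle

# assumes parseurl() now ends with `return train_info_dict`
train_info_dict = parseurl(openurl('http://qq.ip138.com/train/guangdong/guangzhounan.htm'))

# JSON: human-readable, keeps the Chinese station names unescaped
with open('guangzhounan.json', 'w', encoding='utf-8') as f:
    json.dump(train_info_dict, f, ensure_ascii=False, indent=2)

# pickle: binary dump of the dict as-is
with open('guangzhounan.pkl', 'wb') as f:
    pickle.dump(train_info_dict, f)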