Original poster |
Posted on 2017-7-1 22:13:51
What I'm actually trying to do is scrape the horse records from the Hong Kong Jockey Club site. I changed the code to this:
from bs4 import BeautifulSoup as BS
import requests
import pandas as pd

url_list = [my_url]  # my_url: placeholder for the actual page URL
res = []  # placing res outside of loop
for link in url_list:
    r = requests.get(link)
    r.encoding = 'utf-8'
    html_content = r.text
    soup = BS(html_content, 'lxml')
    table = soup.find('table', class_='bigborder')
    if not table:
        continue
    trs = table.find_all('tr')
    if not trs:
        continue  # if trs are not found, then starting next iteration with other link
    # the first row holds the column headers
    headers = trs[0]
    headers_list = []
    for td in headers.find_all('td'):
        headers_list.append(td.text)
    headers_list += ['Season']
    # placeholder columns so the header count matches the data rows
    headers_list.insert(19, 'pseudocol1')
    headers_list.insert(20, 'pseudocol2')
    headers_list.insert(21, 'pseudocol3')
    row = []
    season = ''
    for tr in trs[1:]:
        if '馬季' in tr.text:  # '馬季' ("racing season") marks a season header row
            season = tr.text
        else:
            tds = tr.find_all('td')
            for td in tds:
                row.append(td.text.strip('\n').strip('\r').strip('\t').strip('"').strip())
            row.append(season.strip())
            res.append(row)
            row = []

res = [i for i in res if i[0] != '']  # outside of loop
df = pd.DataFrame(res, columns=headers_list)  # outside of loop
del df['pseudocol1'], df['pseudocol2'], df['pseudocol3']
del df['賽事重播']  # drop the '賽事重播' ("race replay") column
However, when I replace my URL with the list of horse URLs that I scraped earlier with another script, it hangs. What is going on?
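
One way to narrow this down, offered only as a rough sketch since the URL list itself isn't shown: `requests.get` does not time out unless a `timeout` is passed, so a single stalled request can block the whole loop. The helper below prints progress for each link, sets a timeout, and skips links that fail. The names `fetch_pages`, `fetch`, `timeout`, and `delay` are hypothetical and not part of the code above; `fetch` is just a hook so the loop can be tried without network access.

```python
import time


def fetch_pages(urls, fetch=None, timeout=10, delay=0.0):
    """Fetch each URL in turn, printing progress and skipping failures.

    `fetch` is a hypothetical hook (url -> HTML text). When it is not
    given, requests.get is used with an explicit timeout.
    """
    if fetch is None:
        import requests  # imported lazily so the loop can be tried offline

        def fetch(url):
            # Without timeout=..., a stalled server can block indefinitely.
            r = requests.get(url, timeout=timeout)
            r.raise_for_status()
            r.encoding = 'utf-8'
            return r.text

    pages, failed = {}, []
    for i, url in enumerate(urls, 1):
        print(f'[{i}/{len(urls)}] {url}')  # shows which link is in progress
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # timeouts, connection errors, HTTP errors
            print(f'  skipped: {exc!r}')
            failed.append(url)
        time.sleep(delay)  # optional pause between requests
    return pages, failed
```

Calling `pages, failed = fetch_pages(url_list)` and then parsing `pages[link]` in place of `r.text` would show the last link printed before a stall, and `failed` would list the links that timed out or errored.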